---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
license: mit
---

# Comic Panel Encoder v1 (Stage 3)

This model is a multimodal encoder specifically designed to generate rich, dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the Comic Analysis Framework v2.0.

By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it generates a single 512-dimensional vector per panel. These embeddings are highly optimized for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.

## Model Architecture

The `comic-panel-encoder-v1` uses an Adaptive Multi-Modal Fusion architecture:

1. **Visual Branch (dual backbone):**
   - **SigLIP** (`google/siglip-base-patch16-224`): captures high-level semantic and stylistic features.
   - **ResNet50**: captures fine-grained, low-level texture and structural details.
   - **Fusion**: an attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
   - **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
   - A multi-layer perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding-box coordinates, relative area).
4. **Adaptive Fusion Gate:**
   - A learned gating mechanism combines the vision, text, and composition features, dynamically weighting them by the presence and quality of each modality (e.g., it handles panels with no text gracefully).
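The gating idea in step 4 can be sketched roughly as follows. This is an illustrative assumption, not the released implementation: the class name `AdaptiveFusionGate`, the single-linear gate, and the softmax-over-masked-logits scheme are all hypothetical, chosen only to show how a modality mask can zero out a missing modality.

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Sketch of a learned gating fusion over three modality features."""
    def __init__(self, dim=512):
        super().__init__()
        # One gate logit per modality, conditioned on all three features
        self.gate = nn.Linear(dim * 3, 3)

    def forward(self, vis, txt, comp, modality_mask):
        # modality_mask: (B, 3) with 1.0 where a modality is present
        logits = self.gate(torch.cat([vis, txt, comp], dim=-1))  # (B, 3)
        # Absent modalities get -inf, so softmax assigns them zero weight
        logits = logits.masked_fill(modality_mask == 0, float('-inf'))
        weights = torch.softmax(logits, dim=-1)                  # (B, 3)
        stacked = torch.stack([vis, txt, comp], dim=1)           # (B, 3, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)      # (B, dim)

gate = AdaptiveFusionGate(dim=512)
vis, txt, comp = torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512)
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]])  # second panel has no text
fused = gate(vis, txt, comp, mask)
print(fused.shape)  # torch.Size([2, 512])
```

With the second panel's text slot masked out, its fused vector is a weighted sum of the vision and composition features only.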

## Training Data & Methodology

The model was trained on a dataset of approximately 1 million comic pages, filtered specifically for narrative/story content using CoSMo (Comic Stream Modeling).

### Objectives

The encoder's adapter and fusion layers were trained from scratch (base backbones frozen) using three simultaneous objectives:

1. **InfoNCE Contrastive Loss (global context):** maximizes similarity between panels on the same page while minimizing similarity to panels on different pages. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (local detail):** predicts the embedding of a masked panel from the surrounding panels on the same page. This prevents mode collapse and ensures individual panels retain their unique sequential features.
3. **Modality Alignment:** aligns the visual embedding space with the text embedding space for a given panel using contrastive cross-entropy.
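The first objective can be sketched as a batch-wise InfoNCE over panel embeddings tagged with page IDs; same-page panels are positives, everything else in the batch is a negative. The function name `page_infonce` and the temperature value are illustrative assumptions, not the training code.

```python
import torch
import torch.nn.functional as F

def page_infonce(embeddings, page_ids, temperature=0.07):
    """Sketch: panels sharing a page_id are positives; all other
    panels in the batch serve as negatives."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                            # (P, P) similarities
    P = z.size(0)
    self_mask = torch.eye(P, dtype=torch.bool)
    pos = (page_ids.unsqueeze(0) == page_ids.unsqueeze(1)) & ~self_mask
    # log-softmax over all other panels, averaged over same-page positives
    log_prob = F.log_softmax(sim.masked_fill(self_mask, float('-inf')), dim=1)
    loss = -(log_prob.masked_fill(~pos, 0.0)).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

emb = torch.randn(6, 512)
pages = torch.tensor([0, 0, 0, 1, 1, 1])  # two pages, three panels each
loss = page_infonce(emb, pages)
print(loss)
```

Minimizing this pulls each page's panels together in embedding space while pushing apart panels from different pages.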

## Usage

You can use this model to extract 512-d embeddings from comic panels. The code required to run this model is available in the Comic Analysis GitHub repository under `src/version2/stage3_panel_features_framework.py`.

### Example: Extracting Features

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer
# Requires cloning the GitHub repo for the framework class
from stage3_panel_features_framework import PanelFeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize the model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device)  # (B=1, N=1, C, H, W)

# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Composition: 7-d geometry vector (e.g. aspect ratio, area, center x, center y);
# zeros shown here as a placeholder
comp_feats = torch.zeros(1, 1, 7).to(device)

# Modality mask [vision, text, composition]; 1.0 = modality present
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)

batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}

# 3. Generate the embedding
with torch.no_grad():
    panel_embedding = model(batch)

print(f"Embedding shape: {panel_embedding.shape}")  # Output: torch.Size([1, 512])
```

## Intended Use & Limitations

- **Sequence modeling:** these embeddings are intended to be fed into a temporal sequence model (such as a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** they can be used to find visually or semantically similar panels across a large database using cosine similarity.
- **Limitation:** the visual backbones were frozen during training, so the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.
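A cosine-similarity retrieval over a panel database can be sketched as below; the random tensors stand in for real embeddings produced by the extractor, and the database size is arbitrary.

```python
import torch
import torch.nn.functional as F

# Placeholder database of 10,000 panel embeddings (use real extractor outputs here)
db = F.normalize(torch.randn(10_000, 512), dim=-1)
query = F.normalize(torch.randn(1, 512), dim=-1)

# On L2-normalized vectors, cosine similarity reduces to a dot product
scores = query @ db.t()            # (1, 10000)
topk = scores.topk(k=5, dim=-1)
print(topk.indices)                # indices of the 5 most similar panels
```

For large databases, the same dot-product search can be delegated to an approximate nearest-neighbour index instead of a dense matrix multiply.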

## Citation

If you use this model or the associated framework, please link back to the Comic Analysis GitHub Repository.
