Assignment 1 / Multimodal / Model Backbone

Shared CLIP Encoder: Zero-Shot, CoOp, and Linear Probe

One representation space, three ways to make a decision.

The entire report is anchored on `openai/clip-vit-base-patch32`. Images are encoded once into a shared embedding space, then evaluated through three heads: averaged prompt prototypes for zero-shot, learnable context prompts for CoOp, and a linear classifier for the few-shot probe baseline.

Shared image encoder

CLIP ViT-B/32 acts as the fixed representation engine for every route in the report. That design removes backbone retraining from the comparison, which makes the downstream question cleaner: how much adaptation is really needed once a strong image-text encoder is already available?

  • Backbone (ViT-B/32): OpenAI CLIP image-text model with a shared projection space.
  • Prompt ensemble (5): each class is represented by the average of five prompt embeddings in zero-shot mode.
  • CoOp context (16): sixteen learnable context tokens are inserted before the class label by default.
  • Frozen weights (CLIP): CLIP stays frozen for both CoOp and the linear probe; only the head or prompt context changes.

Overview diagram of the CLIP model showing image and text encoders in a shared embedding space.
CLIP aligns image and text representations into a shared embedding space. The multimodal track reuses that joint space directly for zero-shot prompting and as the base layer for CoOp and linear-probe adaptation.
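Because both towers project into the same unit-norm space, every decision rule in this report reduces to cosine similarity between L2-normalized vectors. A minimal sketch with synthetic embeddings (the dimensionality and values are illustrative stand-ins, not actual CLIP outputs):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit norm so the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim, num_classes = 512, 3

# Stand-ins for CLIP outputs: one image embedding, one text prototype per class.
image_embed = l2_normalize(rng.normal(size=dim))
class_embeds = l2_normalize(rng.normal(size=(num_classes, dim)))

# Cosine similarity is a plain dot product once both sides are unit-norm.
similarities = class_embeds @ image_embed
prediction = int(np.argmax(similarities))
```

Zero-shot, CoOp, and the linear probe differ only in how the right-hand side of that dot product (or the classifier weights) is obtained.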

Decision routes

| Route | Text representation | Trainable part | Inference rule |
| --- | --- | --- | --- |
| Zero-shot | Five hand-written prompt templates per class, averaged into one class prototype | None | Image embedding multiplied with the averaged prompt embeddings |
| CoOp | Class label plus learnable context tokens, encoded by the frozen CLIP text tower | Context tokens only | Image embedding compared with the learned text embeddings |
| Linear probe | No text at inference | Linear classifier on frozen image features | Image embedding passed through the trained linear layer |
image -> CLIP image encoder -> normalized image embedding

zero-shot:
  5 prompt templates per class
  -> CLIP text encoder
  -> average 5 prompt embeddings
  -> similarity(image_embed, class_prompt_embed)

CoOp:
  class label + learnable context tokens
  -> frozen CLIP text encoder
  -> learned text embeddings
  -> similarity(image_embed, learned_text_embed)

linear probe:
  normalized image embedding
  -> trainable linear classifier

Zero-shot confusion matrix.
Zero-shot matches image embeddings against the averaged prompt embeddings.

CoOp confusion matrix at 128 shots.
CoOp keeps the same image encoder but replaces the fixed prompts with learned context tokens.
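CoOp's trainable piece can be sketched as a small module holding sixteen learnable context vectors that are prepended to each class's token embeddings before the frozen text tower runs. The module and names below are an illustrative sketch, not the report's exact implementation in `train.py`:

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """CoOp-style learnable context (illustrative sketch)."""

    def __init__(self, n_ctx=16, embed_dim=512):
        super().__init__()
        # The only trainable parameters: context tokens shared across classes.
        self.context = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_token_embeds):
        # class_token_embeds: (n_classes, n_class_tokens, embed_dim), frozen.
        n_classes = class_token_embeds.shape[0]
        ctx = self.context.unsqueeze(0).expand(n_classes, -1, -1)
        # Prepend the learnable context to each class's label tokens.
        return torch.cat([ctx, class_token_embeds], dim=1)

learner = PromptLearner()
class_embeds = torch.randn(10, 4, 512)  # e.g. embedded class-name tokens
prompts = learner(class_embeds)         # (10, 16 + 4, 512)
```

Gradients flow through the frozen text encoder back into `self.context` only, which is why CoOp's parameter count is tiny compared with fine-tuning.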

Relevant implementation in src/

The actual backbone logic lives in the training and embedding scripts. Zero-shot depends on prompt averaging from `extract_embedding.py`, while CoOp and the linear probe are implemented in `train.py`.

# assignments/assignment1/multimodal/src/extract_embedding.py
def build_text_embeddings(model, processor, class_names, prompt_templates, device):
    class_embeddings = []
    for label_id, class_name in enumerate(class_names):
        prompts = [template.format(humanize_class_name(class_name)) for template in prompt_templates]
        text_inputs = processor(text=prompts, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            text_outputs = model.get_text_features(**text_inputs)
        # Normalize each prompt embedding, average, then renormalize the prototype.
        prompt_embeddings = text_outputs / text_outputs.norm(dim=-1, keepdim=True)
        pooled_embedding = prompt_embeddings.mean(dim=0)
        pooled_embedding = pooled_embedding / pooled_embedding.norm()
        class_embeddings.append(pooled_embedding.cpu())
    return torch.stack(class_embeddings)
# assignments/assignment1/multimodal/src/train.py
def run_zero_shot(test_payload, text_payload, batch_size):
    logit_scale = float(text_payload["logit_scale"])
    logits = compute_similarity_logits(
        image_embeddings=test_payload["embeddings"],
        text_embeddings=text_payload["class_embeddings"],
        logit_scale=logit_scale,
        batch_size=batch_size,
    )
    predictions = logits.argmax(dim=-1)
    metrics = compute_metrics(
        true_labels=test_payload["label_ids"].tolist(),
        predicted_labels=predictions.tolist(),
    )
    return metrics, logits
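The helper `compute_similarity_logits` is not shown above; under the standard CLIP convention it amounts to scaled cosine similarities, computed over image batches to bound memory. A hedged sketch (the real helper in `train.py` may differ in details):

```python
import torch

def compute_similarity_logits(image_embeddings, text_embeddings, logit_scale, batch_size=256):
    """Scaled cosine-similarity logits, computed in image batches (sketch)."""
    image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
    text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
    chunks = []
    for start in range(0, image_embeddings.shape[0], batch_size):
        batch = image_embeddings[start:start + batch_size]
        # (batch, dim) @ (dim, n_classes) -> (batch, n_classes)
        chunks.append(logit_scale * batch @ text_embeddings.T)
    return torch.cat(chunks, dim=0)

logits = compute_similarity_logits(torch.randn(10, 512), torch.randn(5, 512), logit_scale=100.0)
```

Batching changes only memory usage, not the result, since each image row is independent.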
# assignments/assignment1/multimodal/src/train.py
def train_linear_probe(train_embeddings, train_labels, val_embeddings, val_labels, ...):
    # Only the linear head is trained; the CLIP image features stay frozen.
    classifier = nn.Linear(train_embeddings.shape[1], num_classes).to(device)
    optimizer = torch.optim.AdamW(classifier.parameters(), lr=learning_rate, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    ...
    logits = classifier(batch_embeddings)
    loss = criterion(logits, batch_labels)
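Filling in the elided loop, a minimal self-contained version of the probe might look like this, on synthetic features (the hyperparameters and full-batch update are illustrative assumptions, not the report's exact settings):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes, feat_dim = 5, 512

# Synthetic stand-ins for precomputed, frozen CLIP image features.
train_embeddings = torch.randn(200, feat_dim)
train_labels = torch.randint(0, num_classes, (200,))

classifier = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    optimizer.zero_grad()
    logits = classifier(train_embeddings)  # full-batch for brevity
    loss = criterion(logits, train_labels)
    loss.backward()
    optimizer.step()

accuracy = (classifier(train_embeddings).argmax(dim=-1) == train_labels).float().mean()
```

Because the backbone is frozen, the features are computed once and the whole training loop touches only the `(feat_dim, num_classes)` head.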

Why the fixed backbone matters

  • It keeps the comparison honest by isolating the head rather than conflating head choice with backbone fine-tuning.
  • It makes zero-shot, CoOp, and linear-probe results directly comparable because they consume the same image feature space.
  • It reduces training cost and lets the report focus on prompt engineering versus lightweight adaptation.
  • It supports practical deployment because embeddings can be precomputed and reused across methods.
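The precompute-once point in the last bullet is simple to realize in practice: embeddings are written to disk a single time and every head loads the same payload. A sketch using `torch.save` (the file layout and names here are illustrative, not the report's exact format):

```python
import os
import tempfile
import torch

def cache_embeddings(path, embeddings, label_ids):
    # Save once; zero-shot, CoOp, and the linear probe all reuse this payload.
    torch.save({"embeddings": embeddings.cpu(), "label_ids": label_ids.cpu()}, path)

def load_embeddings(path):
    return torch.load(path)

# Illustrative round trip with synthetic features.
path = os.path.join(tempfile.mkdtemp(), "test_embeddings.pt")
cache_embeddings(path, torch.randn(8, 512), torch.arange(8))
payload = load_embeddings(path)
```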

The key finding from this run is counterintuitive but important: the frozen backbone is already strong enough that zero-shot with averaged prompts beats every few-shot variant on final test accuracy.