The experiment is a reproducible sequence, not a notebook-only artifact.

The multimodal workflow is defined through a single YAML config, split into explicit stages for download, preprocessing, embedding extraction, training, inference, and evaluation. This keeps the pipeline auditable and lets the final HTML report mirror the exact artifacts produced by the training run.

Run Order

Process graph

1. download.py
   -> export Food101 images + labels to disk

2. preprocess.py
   -> filter to 10 selected classes
   -> write train_full.csv and test.csv
   -> create fewshot_{8,16,32,64,128}/{train,val}.csv

3. extract_embedding.py
   -> extract CLIP image embeddings for train_full and test
   -> extract 5 prompt embeddings per class
   -> average 5 prompt embeddings into one class prototype

4. train.py
   -> zero-shot evaluation
   -> CoOp training per shot setting
   -> linear probe training per shot setting

5. infer.py
   -> load the zero-shot, linear-probe, or CoOp head and run prediction

6. evaluate.py
   -> build summary.csv, confusion matrices, failure analysis, saliency maps
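
The few-shot split step in stage 2 can be sketched as a seeded per-class sampler. This is a minimal illustration, not the project's `preprocess.py`: the function name `make_fewshot_split`, the toy file names, and the three class names are hypothetical, and it assumes the validation rows are drawn from the same training pool and kept disjoint from the support rows.

```python
import random
from collections import defaultdict


def make_fewshot_split(rows, shots, val_per_class, seed):
    """Sample `shots` support rows and `val_per_class` validation rows per class.

    `rows` is a list of (image_path, label) pairs from the full training pool.
    The held-out test set is never touched here.
    """
    by_class = defaultdict(list)
    for path, label in rows:
        by_class[label].append(path)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, val = [], []
    for label in sorted(by_class):  # sorted for a stable iteration order
        paths = sorted(by_class[label])
        if len(paths) < shots + val_per_class:
            raise ValueError(f"class {label!r} has too few images")
        rng.shuffle(paths)
        train += [(p, label) for p in paths[:shots]]
        val += [(p, label) for p in paths[shots:shots + val_per_class]]
    return train, val


# Toy pool: 3 classes x 20 images each (hypothetical names, for illustration only).
pool = [(f"{c}/{i:03d}.jpg", c) for c in ("pizza", "ramen", "sushi") for i in range(20)]
train_rows, val_rows = make_fewshot_split(pool, shots=8, val_per_class=4, seed=42)
```

Calling the function again with the same seed returns the same rows, which is what makes the `fewshot_*` directories reproducible across runs.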

Core Config

What the YAML controls

Config block	Purpose
`model`	CLIP model id, currently `openai/clip-vit-base-patch32`.
`dataset`	Hugging Face dataset id and the 10 selected Food101 classes.
`prompts`	Five zero-shot templates, averaged per class.
`few_shot`	Support budgets: 8, 16, 32, 64, 128 shots per class.
`splits`	Validation-per-class budget and deterministic seed.
`training`	Linear-probe optimizer settings.
`coop`	Number of context tokens, class-token placement, class-specific context toggle, and prompt-optimizer settings.
`wandb`	Tracking mode, project, entity, and tags.
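
Because every stage reads the same config, it can help to fail fast when a block is missing. The sketch below checks only for the top-level block names listed in the table; the helper name `missing_blocks` is hypothetical, and nothing is assumed about the keys inside each block.

```python
# Top-level block names taken from the table above.
REQUIRED_BLOCKS = (
    "model", "dataset", "prompts", "few_shot",
    "splits", "training", "coop", "wandb",
)


def missing_blocks(config):
    """Return the required top-level blocks absent from a parsed config dict."""
    return [name for name in REQUIRED_BLOCKS if name not in config]


# A partially filled config, as it might look after parsing the YAML file.
partial = {"model": {}, "dataset": {}, "prompts": {}}
print(missing_blocks(partial))  # ['few_shot', 'splits', 'training', 'coop', 'wandb']
```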

Method Choices

Training strategy by head

Zero-shot uses no task-specific weight updates and relies only on CLIP similarity between image embeddings and averaged class prompt embeddings.
CoOp freezes CLIP and learns only the prompt context vectors, which keeps adaptation parameter-efficient.
The linear probe uses the same frozen image features but learns a linear classification head from support images only.
Each few-shot setting shares the same held-out test set, so improvements are directly comparable across support budgets.
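
The zero-shot rule described above (normalize each prompt embedding, average per class, renormalize, then pick the class with the highest cosine similarity) can be sketched without any deep-learning dependency. The function names are hypothetical, and the toy 3-d vectors only stand in for the 512-d CLIP embeddings.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_prototype(prompt_embeddings):
    """Normalize each prompt embedding, average them, then renormalize."""
    normed = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def zero_shot_predict(image_embedding, prototypes):
    """Return the index of the prototype with the highest cosine similarity."""
    image = l2_normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(image, proto)) for proto in prototypes]
    return max(range(len(scores)), key=scores.__getitem__)


# Two classes, two prompt embeddings each (toy values).
prototypes = [
    class_prototype([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]),  # class 0
    class_prototype([[0.0, 1.0, 0.1], [0.1, 0.8, 0.0]]),  # class 1
]
prediction = zero_shot_predict([0.2, 0.9, 0.0], prototypes)  # closest to class 1
```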

Figure: Comparison bar chart across methods and shot settings.

Methodology is only useful if it reveals a tradeoff. Here the tradeoff is clear: prompt learning helps at low-shot settings, but the backbone itself is already so strong that zero-shot remains hard to beat.

Architecture Size

Linear probe and CoOp are both lightweight adapters

Neither learned method fine-tunes CLIP itself. The image encoder, text encoder, and projection layers stay frozen. The only trainable difference is whether adaptation happens in a small classifier on the image side or in a bank of learnable prompt vectors on the text side.

Method	Trainable component	Shape	Trainable params	Notes
Zero-shot	None	n/a	0	Uses averaged prompt embeddings and CLIP similarity only.
Linear probe	Classifier weight + bias	512 x 10 + 10	5,130	A single linear layer on top of the 512-d normalized CLIP image embedding.
CoOp	Shared context tokens	16 x 512	8,192	Sixteen learnable prompt vectors shared across all classes, because `class_specific_context: false` in the current config.

CoOp has slightly more trainable parameters than the linear probe in this setup, but it is still a very small adapter. If `class_specific_context` were enabled, CoOp would scale to `10 x 16 x 512 = 81,920` prompt parameters for the 10 classes.
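
The parameter counts above follow directly from the shapes. As a quick check, using the sizes stated in the table (variable names are illustrative):

```python
embed_dim = 512          # width of the CLIP embedding and of each context vector
num_classes = 10
num_context_tokens = 16

# Linear probe: one weight per (feature, class) pair plus one bias per class.
linear_probe_params = embed_dim * num_classes + num_classes            # 5,130

# CoOp with a single context shared by all classes.
coop_shared_params = num_context_tokens * embed_dim                    # 8,192

# CoOp with a separate context per class.
coop_class_specific_params = num_classes * num_context_tokens * embed_dim  # 81,920

print(linear_probe_params, coop_shared_params, coop_class_specific_params)
```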

CoOp

Why prompt learning changes the few-shot regime

CoOp replaces the fixed hand-written prompt bank with learnable context tokens that are optimized against the few-shot support set while the CLIP backbone stays frozen. This keeps adaptation lightweight but lets the text side bend toward the current task.

Figure: Illustration of CoOp with learnable context tokens around the class label.

CoOp inserts learnable context tokens around the class label, feeds the full prompt through the frozen CLIP text encoder, and updates only the context vectors to improve image-text alignment for the target classes.
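
To make the class-token placement concrete, the sketch below assembles a prompt from symbolic tokens. It is an illustration only: the position names `"end"`, `"front"`, and `"middle"` are an assumption about what the `coop` config accepts, the function `assemble_prompt` is hypothetical, and in the real model each `ctx_i` is a learnable vector rather than a string.

```python
def assemble_prompt(context, class_tokens, position="end"):
    """Place the class tokens relative to the learnable context slots."""
    if position == "end":
        body = context + class_tokens
    elif position == "front":
        body = class_tokens + context
    elif position == "middle":
        half = len(context) // 2
        body = context[:half] + class_tokens + context[half:]
    else:
        raise ValueError(f"unknown position: {position!r}")
    # Start/end markers wrap the prompt, as in CLIP's tokenizer.
    return ["<|startoftext|>"] + body + ["<|endoftext|>"]


context = [f"ctx_{i}" for i in range(4)]   # 4 slots here; the report uses 16
class_tokens = ["apple", "pie"]            # toy class label
print(assemble_prompt(context, class_tokens, "end"))
print(assemble_prompt(context, class_tokens, "middle"))
```

Only the `ctx_i` slots would receive gradients; the class tokens and the start/end markers keep their frozen embeddings.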

Source Code

Zero-shot, linear probe, and CoOp implementations in `src/`

# assignments/assignment1/multimodal/src/extract_embedding.py
def build_text_embeddings(model, processor, class_names, prompt_templates, device):
    ...
    prompt_embeddings = text_features / text_features.norm(dim=-1, keepdim=True)
    pooled_embedding = prompt_embeddings.mean(dim=0)
    pooled_embedding = pooled_embedding / pooled_embedding.norm()
    class_embeddings.append(pooled_embedding.cpu())

# assignments/assignment1/multimodal/src/train.py
def train_linear_probe(train_embeddings, train_labels, val_embeddings, val_labels, ...):
    classifier = nn.Linear(train_embeddings.shape[1], num_classes).to(device)
    optimizer = torch.optim.AdamW(classifier.parameters(), lr=learning_rate, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    ...
    logits = classifier(batch_embeddings)
    loss = criterion(logits, batch_labels)

Architecture	Forward rule	Effective size in this pipeline
Zero-shot	`image_embed @ avg_prompt_embed.T`	No task-specific trainable parameters.
Linear probe	`Linear(512 -> 10)`	5,130 trainable parameters.
CoOp	`16 learnable prompt vectors -> frozen CLIP text encoder`	8,192 trainable parameters with shared context.

# assignments/assignment1/multimodal/src/train.py
class CoOpPromptLearner(nn.Module):
    def __init__(..., num_context_tokens, class_token_position, class_specific_context, ...):
        self.token_embedding = clip_model.text_model.embeddings.token_embedding
        context_vectors = self._initialize_context(...)
        if class_specific_context:
            context_vectors = context_vectors.unsqueeze(0).repeat(self.num_classes, 1, 1)
        self.context = nn.Parameter(context_vectors)

def train_coop(train_embeddings, train_labels, val_embeddings, val_labels, class_names, model_id, coop_config, device):
    clip_model = CLIPModel.from_pretrained(model_id).to(device)
    for parameter in clip_model.parameters():
        parameter.requires_grad = False
    prompt_learner = CoOpPromptLearner(...)
    ...
    text_features = build_coop_text_features(prompt_learner, clip_model, device=device)
    logits = clip_model.logit_scale.exp() * (batch_embeddings @ text_features.T)

The important methodological distinction is that zero-shot changes no task-specific weights, the linear probe changes only the classifier, and CoOp changes only the prompt context vectors.
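
The CoOp logits above are cosine similarities multiplied by `logit_scale.exp()`. The sketch below shows why that scale matters for the loss. It assumes a cross-entropy objective for CoOp (the excerpt does not show the loss) and uses a scale of 100.0 purely for illustration; the function names are hypothetical.

```python
import math


def scaled_logits(image_embedding, text_features, logit_scale):
    """Scale * cosine similarity, assuming all vectors are already unit-norm."""
    return [
        logit_scale * sum(a * b for a, b in zip(image_embedding, t))
        for t in text_features
    ]


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


text_features = [[1.0, 0.0], [0.0, 1.0]]   # two unit-norm class embeddings
image = [0.8, 0.6]                          # unit-norm; cosines are 0.8 and 0.6

loss_unscaled = cross_entropy(scaled_logits(image, text_features, 1.0), target=0)
loss_scaled = cross_entropy(scaled_logits(image, text_features, 100.0), target=0)
```

With raw cosines the 0.2 gap between classes barely moves the softmax, so the loss stays close to that of a coin flip. With the scale applied, the same gap becomes 20 logit units and the loss is nearly zero.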

Observed Optimization

What training histories show

The 128-shot histories expose two different learning profiles. CoOp starts from a low training loss (0.334 at epoch 1), drops quickly, and reaches near-saturated validation macro-F1 early. The linear probe starts near ln 10 ≈ 2.30, the cross-entropy of a uniform guess over 10 classes, improves more gradually, and peaks slightly lower on validation despite steady training progress.

Method	Epoch 1 loss	Final epoch loss	Peak val macro-F1
CoOp (128-shot)	0.334	0.028	0.980
Linear probe (128-shot)	2.266	1.010	0.970

CoOp's learned context reaches a slightly higher peak validation macro-F1 (0.980 vs. 0.970) with a comparably small parameter budget (8,192 vs. 5,130 trainable parameters), which is why prompt tuning is worth evaluating in a small-data setting.
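
The validation metric reported above is macro-F1, the unweighted mean of per-class F1 scores. A minimal sketch of the computation, assuming integer class labels and an F1 of 0 for classes with no true or predicted examples (the function name and the toy labels are illustrative, not taken from `evaluate.py`):

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred, num_classes=3)  # per-class F1: 1.0, 2/3, 0.8
```

Because every class contributes equally, one weak class pulls the score down even when overall accuracy stays high.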