Assignment 1 / Multimodal Track

Food101 x CLIP Latest Run 2026-04-06

One encoder, three decision routes, ten food classes.

This report documents the full multimodal pipeline for `ethz/food101`: preprocessing the 10-class subset, extracting CLIP embeddings, evaluating zero-shot prompting, adapting with CoOp learnable context, and benchmarking against a few-shot linear probe. The layout has been condensed into a single continuous reading surface so every plot, table, and failure case sits in one consistent visual system.

Current system state

The active experiment filters Food101 to 10 balanced classes, preserves the official validation split as the held-out test set, and evaluates three classification strategies over shared CLIP image embeddings. Zero-shot is still the strongest overall route in this run, but CoOp narrows the gap and outperforms the linear probe at low-shot settings.
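All three routes score an image the same way once per-class weight vectors exist: unit-normalize both sides and take cosine similarity against the shared image embedding. A minimal numpy sketch of that shared decision rule, with synthetic arrays standing in for CLIP outputs (the shapes and the `classify` helper are illustrative, not part of the pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 512, 10                      # CLIP embedding dim, number of classes

# Synthetic stand-ins: in the real pipeline these come from CLIP's image
# encoder and from each route's class weights (averaged prompt embeddings
# for zero-shot, tuned prompts for CoOp, learned weights for the probe).
image_emb = rng.normal(size=(3, D))          # 3 test images
class_weights = rng.normal(size=(C, D))      # one weight vector per class

def classify(img, weights):
    """Cosine-similarity classification over shared CLIP embeddings."""
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    weights = weights / np.linalg.norm(weights, axis=-1, keepdims=True)
    scores = img @ weights.T                 # (n_images, n_classes)
    return scores.argmax(axis=-1), scores

preds, scores = classify(image_emb, class_weights)
print(preds.shape, scores.shape)   # (3,) (3, 10)
```

Because the image embeddings are shared, the three routes differ only in how `class_weights` is produced, which keeps the comparison controlled.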

Classes: 10
Apple pie, bibimbap, chicken wings, donuts, eggs benedict, french fries, grilled cheese, hamburger, ice cream, pizza.

Train / Test Pool: 7.5k / 2.5k
Filtered from the official Food101 train and validation splits.

Best Accuracy: 98.08%
Zero-shot CLIP remains the strongest final method in the latest run.

Best Few-Shot: 97.40%
CoOp at 128 shots edges the linear probe at the highest support budget.
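The subset selection above boils down to filtering on label names and remapping to contiguous ids. A sketch of that logic over a toy in-memory dataset; the real run applies the same filter to the HuggingFace `ethz/food101` splits, the `filter_subset` helper is illustrative, and the exact Food101 label strings may differ slightly from the report's class names:

```python
from collections import Counter

# Class list from the report; exact dataset label strings may differ.
KEEP = ["apple_pie", "bibimbap", "chicken_wings", "donuts", "eggs_benedict",
        "french_fries", "grilled_cheese", "hamburger", "ice_cream", "pizza"]

def filter_subset(examples, keep=KEEP):
    """Keep only examples whose label is in the 10-class subset,
    remapping labels to contiguous ids 0..9."""
    label_to_id = {name: i for i, name in enumerate(keep)}
    return [{"image": ex["image"], "label": label_to_id[ex["label"]]}
            for ex in examples if ex["label"] in label_to_id]

# Toy dataset: two kept classes plus one ("sushi") that should be dropped.
toy = [{"image": f"img{i}", "label": lab}
       for i, lab in enumerate(["pizza", "sushi", "hamburger", "pizza"])]
subset = filter_subset(toy)
print(Counter(ex["label"] for ex in subset))   # Counter({9: 2, 7: 1})
```

Applying the same filter to both official splits yields the balanced 7.5k / 2.5k pools while keeping the held-out validation images untouched.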

What changed in the latest report

  • Zero-shot now represents each class with the average of five prompt embeddings before computing image-text similarity.
  • Few-shot comparison now includes both a linear probe and CoOp prompt tuning with learnable context tokens.
  • Evaluation exports confusion matrices, top-failure tables, CoOp token decoding, and saliency maps into the docs asset bundle.
  • All current plots, metrics, CSV tables, and failure visualizations have been copied into `docs/assignment-1/multimodal/assets/results` for publication.
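The prompt-ensemble step in the first bullet can be sketched directly: embed several templates per class, average, re-normalize, then score images by cosine similarity. Synthetic vectors stand in for CLIP text and image features here, and the template strings are illustrative, not necessarily the five used in the run:

```python
import numpy as np

rng = np.random.default_rng(1)
D, C, P = 512, 10, 5          # embed dim, classes, prompts per class

TEMPLATES = ["a photo of {}.", "a photo of {}, a type of food.",
             "a close-up photo of {}.", "food photography of {}.",
             "a plate of {}."]        # illustrative templates only

# Stand-in for CLIP text embeddings of each (class, template) prompt.
prompt_emb = rng.normal(size=(C, P, D))

# Average the P prompt embeddings per class, then re-normalize: this is
# the single class embedding used for image-text similarity.
class_emb = prompt_emb.mean(axis=1)
class_emb /= np.linalg.norm(class_emb, axis=-1, keepdims=True)

image_emb = rng.normal(size=(4, D))
image_emb /= np.linalg.norm(image_emb, axis=-1, keepdims=True)

logits = image_emb @ class_emb.T      # (4, 10) cosine similarities
preds = logits.argmax(axis=-1)
```

Averaging before normalizing lets the templates vote on a single direction per class, which is what smooths out wording-specific noise in the zero-shot scores.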
Figure: bar chart comparing zero-shot, CoOp, and linear probe performance across few-shot settings.
The headline pattern is stable: zero-shot sits at 98.08% regardless of shot budget, CoOp is consistently stronger than the linear probe at 8 to 32 shots, and both few-shot methods approach parity near 128 shots.
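CoOp's edge at low shot counts comes from training only a handful of context vectors while both CLIP encoders stay frozen. A heavily simplified PyTorch sketch of that training loop: a frozen linear layer stands in for CLIP's text transformer, mean pooling stands in for its token aggregation, and all sizes and data are toy values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D, M, C, N = 32, 4, 10, 64     # embed dim, context tokens, classes, support size

name_emb = torch.randn(C, D)                   # stand-in class-name token embeddings
ctx = nn.Parameter(0.02 * torch.randn(M, D))   # learnable shared context: the ONLY trained params

text_enc = nn.Linear(D, D)                     # stand-in for CLIP's frozen text encoder
for p in text_enc.parameters():
    p.requires_grad_(False)

img_feat = F.normalize(torch.randn(N, D), dim=-1)  # frozen image features of the support set
labels = torch.randint(0, C, (N,))

opt = torch.optim.Adam([ctx], lr=0.1)
losses = []
for _ in range(50):
    # prepend the shared context to every class-name embedding, pool, encode
    seq = torch.cat([ctx.unsqueeze(0).expand(C, M, D), name_emb.unsqueeze(1)], dim=1)
    cls_feat = F.normalize(text_enc(seq.mean(dim=1)), dim=-1)
    logits = 30.0 * img_feat @ cls_feat.t()    # temperature-scaled cosine logits
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())
```

With only `M x D` trainable values, the method has far less capacity to overfit a small support set than a full linear probe, which is consistent with the low-shot ordering in the chart.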

Published files

The docs now ship with the latest run bundle, including the evaluation summary, metrics CSVs, confusion matrices for zero-shot, CoOp, and linear probe, decoded CoOp context vectors, and saliency maps for the top failed predictions.
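The decoded CoOp context vectors in that bundle are typically produced by nearest-neighbor lookup: compare each learned vector against the frozen token-embedding table and report the closest vocabulary entry. A numpy sketch with a toy vocabulary (the real run would use CLIP's tokenizer embedding matrix; the `decode` helper is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V, D, M = 100, 32, 4            # toy vocab size, embed dim, context tokens

vocab = ["tok_%03d" % i for i in range(V)]     # toy vocabulary strings
token_table = rng.normal(size=(V, D))          # stand-in token-embedding table

# Make two context vectors exact copies of known rows so decoding is verifiable.
ctx = rng.normal(size=(M, D))
ctx[0], ctx[1] = token_table[7], token_table[42]

def decode(ctx_vecs, table, names):
    """Nearest vocabulary token (by cosine similarity) for each context vector."""
    a = ctx_vecs / np.linalg.norm(ctx_vecs, axis=-1, keepdims=True)
    b = table / np.linalg.norm(table, axis=-1, keepdims=True)
    idx = (a @ b.T).argmax(axis=-1)
    return [names[i] for i in idx]

print(decode(ctx, token_table, vocab)[:2])   # ['tok_007', 'tok_042']
```

The decoded tokens are only an interpretability aid: the learned vectors live in embedding space and rarely land exactly on a real word.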

Interactive experiment report

The full training log, method comparison, and tracked artifacts are also published to WandB as an interactive version of this report.

Embedded WandB report for the latest multimodal experiment run.