Assignment 1 / Multimodal Track

Food101 x CLIP Latest Run 2026-04-06

One encoder, three decision routes, ten food classes.

This report documents the full multimodal pipeline for `ethz/food101`: preprocessing the 10-class subset, extracting CLIP embeddings, evaluating zero-shot prompting, adapting with CoOp learnable context, and benchmarking against a few-shot linear probe. The layout has been condensed into a single continuous reading surface so every plot, table, and failure case sits in one consistent visual system.

Current system state

The active experiment filters Food101 to 10 balanced classes, preserves the official validation split as the held-out test set, and evaluates three classification strategies over shared CLIP image embeddings. Zero-shot is still the strongest overall route in this run, but CoOp narrows the gap and outperforms the linear probe at low-shot settings.
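All three routes score an image the same way once per-class weight vectors exist: unit-normalize both sides and take cosine similarity against the shared image embedding. A minimal numpy sketch of that shared decision rule, with synthetic arrays standing in for CLIP outputs (the shapes and the `classify` helper are illustrative, not part of the pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 512, 10                      # CLIP embedding dim, number of classes

# Synthetic stand-ins: in the real pipeline these come from CLIP's image
# encoder and from each route's class weights (averaged prompt embeddings
# for zero-shot, tuned prompts for CoOp, learned weights for the probe).
image_emb = rng.normal(size=(3, D))          # 3 test images
class_weights = rng.normal(size=(C, D))      # one weight vector per class

def classify(img, weights):
    """Cosine-similarity classification over shared CLIP embeddings."""
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    weights = weights / np.linalg.norm(weights, axis=-1, keepdims=True)
    scores = img @ weights.T                 # (n_images, n_classes)
    return scores.argmax(axis=-1), scores

preds, scores = classify(image_emb, class_weights)
print(preds.shape, scores.shape)   # (3,) (3, 10)
```

Because the image embeddings are shared, the three routes differ only in how `class_weights` is produced, which keeps the comparison controlled.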

Classes: 10
Apple pie, bibimbap, chicken wings, donuts, eggs benedict, french fries, grilled cheese, hamburger, ice cream, pizza.

Train / Test Pool: 7.5k / 2.5k
Filtered from the official Food101 train and validation splits.

Best Accuracy: 98.08%
Zero-shot CLIP remains the strongest final method in the latest run.

Best Few-Shot: 97.40%
CoOp at 128 shots edges the linear probe at the highest support budget.
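The subset selection above boils down to filtering on label names and remapping to contiguous ids. A sketch of that logic over a toy in-memory dataset; the real run applies the same filter to the HuggingFace `ethz/food101` splits, the `filter_subset` helper is illustrative, and the exact Food101 label strings may differ slightly from the report's class names:

```python
from collections import Counter

# Class list from the report; exact dataset label strings may differ.
KEEP = ["apple_pie", "bibimbap", "chicken_wings", "donuts", "eggs_benedict",
        "french_fries", "grilled_cheese", "hamburger", "ice_cream", "pizza"]

def filter_subset(examples, keep=KEEP):
    """Keep only examples whose label is in the 10-class subset,
    remapping labels to contiguous ids 0..9."""
    label_to_id = {name: i for i, name in enumerate(keep)}
    return [{"image": ex["image"], "label": label_to_id[ex["label"]]}
            for ex in examples if ex["label"] in label_to_id]

# Toy dataset: two kept classes plus one ("sushi") that should be dropped.
toy = [{"image": f"img{i}", "label": lab}
       for i, lab in enumerate(["pizza", "sushi", "hamburger", "pizza"])]
subset = filter_subset(toy)
print(Counter(ex["label"] for ex in subset))   # Counter({9: 2, 7: 1})
```

Applying the same filter to both official splits yields the balanced 7.5k / 2.5k pools while keeping the held-out validation images untouched.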

What changed in the latest report

  • Zero-shot now represents each class with the average of five prompt embeddings before computing image-text similarity.
  • Few-shot comparison now includes both a linear probe and CoOp prompt tuning with learnable context tokens.
  • Evaluation exports confusion matrices, top-failure tables, CoOp token decoding, and saliency maps into the docs asset bundle.
  • All current plots, metrics, CSV tables, and failure visualizations have been copied into `docs/assignment-1/multimodal/assets/results` for publication.
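The prompt-ensemble step in the first bullet can be sketched directly: embed several templates per class, average, re-normalize, then score images by cosine similarity. Synthetic vectors stand in for CLIP text and image features here, and the template strings are illustrative, not necessarily the five used in the run:

```python
import numpy as np

rng = np.random.default_rng(1)
D, C, P = 512, 10, 5          # embed dim, classes, prompts per class

TEMPLATES = ["a photo of {}.", "a photo of {}, a type of food.",
             "a close-up photo of {}.", "food photography of {}.",
             "a plate of {}."]        # illustrative templates only

# Stand-in for CLIP text embeddings of each (class, template) prompt.
prompt_emb = rng.normal(size=(C, P, D))

# Average the P prompt embeddings per class, then re-normalize: this is
# the single class embedding used for image-text similarity.
class_emb = prompt_emb.mean(axis=1)
class_emb /= np.linalg.norm(class_emb, axis=-1, keepdims=True)

image_emb = rng.normal(size=(4, D))
image_emb /= np.linalg.norm(image_emb, axis=-1, keepdims=True)

logits = image_emb @ class_emb.T      # (4, 10) cosine similarities
preds = logits.argmax(axis=-1)
```

Averaging before normalizing lets the templates vote on a single direction per class, which is what smooths out wording-specific noise in the zero-shot scores.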
Figure: bar chart comparing zero-shot, CoOp, and linear probe performance across few-shot settings.
The headline pattern is stable: zero-shot sits at 98.08% regardless of shot budget, CoOp is consistently stronger than the linear probe at 8 to 32 shots, and both few-shot methods approach parity near 128 shots.
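CoOp's edge at low shot counts comes from training only a handful of context vectors while both CLIP encoders stay frozen. A heavily simplified PyTorch sketch of that training loop: a frozen linear layer stands in for CLIP's text transformer, mean pooling stands in for its token aggregation, and all sizes and data are toy values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D, M, C, N = 32, 4, 10, 64     # embed dim, context tokens, classes, support size

name_emb = torch.randn(C, D)                   # stand-in class-name token embeddings
ctx = nn.Parameter(0.02 * torch.randn(M, D))   # learnable shared context: the ONLY trained params

text_enc = nn.Linear(D, D)                     # stand-in for CLIP's frozen text encoder
for p in text_enc.parameters():
    p.requires_grad_(False)

img_feat = F.normalize(torch.randn(N, D), dim=-1)  # frozen image features of the support set
labels = torch.randint(0, C, (N,))

opt = torch.optim.Adam([ctx], lr=0.1)
losses = []
for _ in range(50):
    # prepend the shared context to every class-name embedding, pool, encode
    seq = torch.cat([ctx.unsqueeze(0).expand(C, M, D), name_emb.unsqueeze(1)], dim=1)
    cls_feat = F.normalize(text_enc(seq.mean(dim=1)), dim=-1)
    logits = 30.0 * img_feat @ cls_feat.t()    # temperature-scaled cosine logits
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())
```

With only `M x D` trainable values, the method has far less capacity to overfit a small support set than a full linear probe, which is consistent with the low-shot ordering in the chart.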

Published files

The docs now ship with the latest run bundle, including the evaluation summary, metrics CSVs, confusion matrices for zero-shot, CoOp, and linear probe, decoded CoOp context vectors, and saliency maps for the top failed predictions.
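The decoded CoOp context vectors in that bundle are typically produced by nearest-neighbor lookup: compare each learned vector against the frozen token-embedding table and report the closest vocabulary entry. A numpy sketch with a toy vocabulary (the real run would use CLIP's tokenizer embedding matrix; the `decode` helper is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V, D, M = 100, 32, 4            # toy vocab size, embed dim, context tokens

vocab = ["tok_%03d" % i for i in range(V)]     # toy vocabulary strings
token_table = rng.normal(size=(V, D))          # stand-in token-embedding table

# Make two context vectors exact copies of known rows so decoding is verifiable.
ctx = rng.normal(size=(M, D))
ctx[0], ctx[1] = token_table[7], token_table[42]

def decode(ctx_vecs, table, names):
    """Nearest vocabulary token (by cosine similarity) for each context vector."""
    a = ctx_vecs / np.linalg.norm(ctx_vecs, axis=-1, keepdims=True)
    b = table / np.linalg.norm(table, axis=-1, keepdims=True)
    idx = (a @ b.T).argmax(axis=-1)
    return [names[i] for i in idx]

print(decode(ctx, token_table, vocab)[:2])   # ['tok_007', 'tok_042']
```

The decoded tokens are only an interpretability aid: the learned vectors live in embedding space and rarely land exactly on a real word.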

Interactive experiment report

The full training log, method comparison, and tracked artifacts are also published to WandB as an interactive version of this report.

Embedded WandB report for the latest multimodal experiment run.