Assignment 1 / Multimodal / Evaluation Results

Run 20260406_094056 (11 reported rows)

Zero-shot wins the run; CoOp wins the low-shot adaptation story.

The latest run reports three consistent patterns. First, averaged five-prompt zero-shot CLIP is the overall best method at 98.08% accuracy. Second, CoOp beats the linear probe at 8, 16, and 32 shots and stays competitive through 128 shots. Third, the hardest failures are concentrated in visually plausible dessert and sandwich confusions, especially apple pie versus donuts and donuts versus hamburger.

Best scores in the current report

Zero-Shot Accuracy
98.08%

Best overall result, using averaged prompt prototypes and no task-specific training.

Best CoOp
97.40%

CoOp at 128 shots is the strongest learned prompt run.

Best Linear Probe
97.32%

Linear probe at 128 shots nearly matches CoOp but remains below zero-shot.

Best Overall Method
Zero-shot

Averaged five-prompt zero-shot CLIP is the strongest result in the full benchmark.
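The averaged-prompt zero-shot recipe can be sketched in a few lines of numpy. This is a toy illustration, not the run's actual pipeline: the embeddings below are random stand-ins for real CLIP text and image features, and the function names are hypothetical.

```python
import numpy as np

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prompt_averaged_prototypes(text_emb):
    """text_emb: (classes, prompts, dim) embeddings, one row per template
    such as 'a photo of {}, a type of food.'. Averaging the normalized
    per-prompt embeddings yields one prototype per class."""
    return normalize(normalize(text_emb).mean(axis=1))

def zero_shot_predict(image_emb, prototypes):
    """Nearest prototype by cosine similarity; no training involved."""
    return (normalize(image_emb) @ prototypes.T).argmax(axis=1)

# Toy demo: 3 classes, 5 prompt templates, 32-dim features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 5, 32))
protos = prompt_averaged_prototypes(text_emb)
images = protos + 0.01 * rng.normal(size=(3, 32))  # images near each class
print(zero_shot_predict(images, protos))
```

Averaging normalized per-template embeddings before re-normalizing is what makes a five-prompt ensemble behave as a single prototype per class at inference time.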

Best methods by regime

Regime        Best method  Accuracy  Why it matters
Overall       Zero-shot    0.9808    Best final result without any task-specific training.
Few-shot 8    CoOp         0.9708    Large gain over the linear probe at the smallest support budget.
Few-shot 16   CoOp         0.9724    Prompt learning remains clearly more sample-efficient.
Few-shot 32   CoOp         0.9724    CoOp still leads before the probe starts catching up.
Few-shot 64   CoOp         0.9692    The gap narrows, but CoOp is still marginally ahead.
Few-shot 128  CoOp         0.9740    CoOp remains the strongest learned adaptation route in this run.
The key story is not just that zero-shot wins. It is that CoOp is the best learned method at every few-shot budget, which suggests prompt tuning is a more effective adaptation strategy than a linear probe for this Food101 subset.
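For contrast with CoOp's learned prompts, the linear-probe baseline is nothing more than a logistic-regression head trained on frozen image features. A minimal sketch, assuming scikit-learn is available, with synthetic features standing in for CLIP embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic frozen features for an 8-shot support set over 3 classes;
# a real run would use CLIP image-encoder outputs instead.
rng = np.random.default_rng(1)
num_classes, shots, dim = 3, 8, 32
centers = rng.normal(size=(num_classes, dim))
X = np.concatenate([centers[c] + 0.3 * rng.normal(size=(shots, dim))
                    for c in range(num_classes)])
y = np.repeat(np.arange(num_classes), shots)

# The probe itself: a single linear layer on top of frozen features.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))
```

Because the probe starts from random class weights while CoOp starts from CLIP's pretrained text space, the probe needs more shots to catch up, which matches the 8- to 64-shot gap in the table above.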

All reported settings

Method        Shots  Accuracy  Balanced Accuracy  Macro Precision  Macro F1
Zero-shot     0      0.9808    0.9808             0.9808           0.9808
CoOp          8      0.9708    0.9708             0.9709           0.9707
Linear probe  8      0.9340    0.9340             0.9409           0.9345
CoOp          16     0.9724    0.9724             0.9727           0.9724
Linear probe  16     0.9544    0.9544             0.9565           0.9543
CoOp          32     0.9724    0.9724             0.9726           0.9724
Linear probe  32     0.9612    0.9612             0.9625           0.9613
CoOp          64     0.9692    0.9692             0.9697           0.9691
Linear probe  64     0.9672    0.9672             0.9680           0.9673
CoOp          128    0.9740    0.9740             0.9743           0.9740
Linear probe  128    0.9732    0.9732             0.9732           0.9732
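The four columns are standard scikit-learn metrics. A toy example (labels here are illustrative, not from the run) shows how they diverge once errors concentrate in one class:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, f1_score)

# Toy labels: class 1 loses one sample to class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

print(accuracy_score(y_true, y_pred))                       # overall hit rate
print(balanced_accuracy_score(y_true, y_pred))              # mean per-class recall
print(precision_score(y_true, y_pred, average="macro"))     # mean per-class precision
print(f1_score(y_true, y_pred, average="macro"))            # mean per-class F1
```

In the run above the four numbers nearly coincide for every setting, which indicates the class distribution is balanced and errors are spread evenly rather than dumped on a few classes.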

Comparison across methods and support budgets

Comparison bar chart for zero-shot, CoOp, and linear probe metrics.
The plot makes two things obvious: zero-shot starts from an unusually strong baseline, and CoOp is the more sample-efficient few-shot method through the low- and mid-shot regime.

Method-level confusion matrices

The main residual confusions are not random. They cluster around visually adjacent foods with similar textures or silhouettes: fried rings and pastry surfaces for donuts, layered crust and browned surfaces for apple pie, and bread-heavy compositions for hamburger-like mistakes.
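A confusion matrix is just a count table over (true, predicted) pairs. A minimal numpy sketch, with toy labels chosen to mirror the pie-to-donut and donut-to-hamburger confusions described above:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels: 0 = apple_pie, 1 = donuts, 2 = hamburger.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(confusion_matrix(y_true, y_pred, 3))
```

Clustered off-diagonal mass, as in this toy matrix, is the signature of the visually adjacent confusions in the real run; random errors would scatter across all cells instead.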

Decoded prompt context is not human-readable language

Nearest-token decoding of the learned CoOp context shows why prompt tuning should not be interpreted as conventional text editing. The learned vectors do not settle into fluent phrases; instead they occupy regions of CLIP token space that are useful for classification even when the nearest vocabulary items look fragmented or semantically unrelated.

Class      Context token  Nearest decoded tokens
apple_pie  0              slowdown, replacement-char, dumping, ringing, accounting
apple_pie  1              rou, muck, ora, poz, crock
apple_pie  2              !, flower-emoji, !!, lor, cr
apple_pie  3              llen, death, sous, poured, grading
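The decoding step behind this table can be sketched as a cosine nearest-neighbor lookup against the frozen token embedding matrix. The vocabulary and embeddings below are toy stand-ins for CLIP's tokenizer table, not the actual model weights:

```python
import numpy as np

def nearest_tokens(ctx_vec, token_emb, vocab, k=5):
    """Decode a learned context vector to its k nearest vocabulary tokens
    by cosine similarity against the (frozen) token embedding matrix."""
    emb = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    v = ctx_vec / np.linalg.norm(ctx_vec)
    idx = np.argsort(-(emb @ v))[:k]
    return [vocab[i] for i in idx]

# Toy vocabulary and random embeddings standing in for the tokenizer table.
rng = np.random.default_rng(0)
vocab = ["slowdown", "muck", "poured", "crock", "ora", "lor"]
token_emb = rng.normal(size=(len(vocab), 8))
ctx_vec = token_emb[0] + 0.1 * rng.normal(size=8)  # vector near "slowdown"
print(nearest_tokens(ctx_vec, token_emb, vocab, k=3))
```

Because the learned context vectors are optimized continuously, nothing forces them to land near any vocabulary item, which is why the decoded neighbors in the table look fragmented.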

Top failed tests and saliency

The highest-confidence failures are especially useful because they reveal what the system believes very strongly but incorrectly. In this run, the most repeated failure cases are `apple_pie -> donuts` and `donuts -> hamburger`, and both persist across zero-shot and CoOp.

Method     Shots  True       Predicted  Confidence
CoOp       128    apple_pie  donuts     0.9996
CoOp       32     donuts     hamburger  0.9991
Zero-shot  0      apple_pie  donuts     0.9981
Zero-shot  0      donuts     hamburger  0.9970
CoOp       32     apple_pie  donuts     0.9962
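Mining this kind of table amounts to ranking the misclassified samples by predicted confidence. A minimal sketch with hypothetical labels (the helper name and toy data are illustrative, not the run's tooling):

```python
import numpy as np

def top_confident_failures(y_true, y_pred, conf, k=3):
    """Return indices of misclassified samples, highest confidence first."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wrong = np.flatnonzero(y_true != y_pred)
    return wrong[np.argsort(-np.asarray(conf)[wrong])][:k]

y_true = ["apple_pie", "donuts", "apple_pie", "sushi"]
y_pred = ["donuts", "hamburger", "apple_pie", "ramen"]
conf   = [0.9996, 0.9991, 0.88, 0.61]
print(top_confident_failures(y_true, y_pred, conf))
```

Sorting failures by confidence rather than by count is what surfaces the near-certain mistakes above, which are the most informative targets for saliency inspection.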