Assignment 1 / Multimodal / Evaluation Results

Run 20260406_094056 (11 reported rows)

Zero-shot wins the run; CoOp wins the low-shot adaptation story.

The latest run reports three consistent patterns. First, averaged five-prompt zero-shot CLIP is the overall best method at 98.08% accuracy. Second, CoOp beats the linear probe at 8, 16, and 32 shots and stays competitive through 128 shots. Third, the hardest failures are concentrated in visually plausible dessert and sandwich confusions, especially apple pie versus donuts and donuts versus hamburger.

Best scores in the current report

Zero-Shot Accuracy
98.08%

Best overall result, using averaged prompt prototypes and no task-specific training.

Best CoOp
97.40%

CoOp at 128 shots is the strongest learned prompt run.

Best Linear Probe
97.32%

Linear probe at 128 shots nearly matches CoOp but remains below zero-shot.

Best Overall Method
Zero-shot

Averaged five-prompt zero-shot CLIP is the strongest result in the full benchmark.
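The averaged-prompt zero-shot recipe can be sketched in a few lines of numpy. This is a toy illustration, not the run's actual pipeline: the embeddings below are random stand-ins for real CLIP text and image features, and the function names are hypothetical.

```python
import numpy as np

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prompt_averaged_prototypes(text_emb):
    """text_emb: (classes, prompts, dim) embeddings, one row per template
    such as 'a photo of {}, a type of food.'. Averaging the normalized
    per-prompt embeddings yields one prototype per class."""
    return normalize(normalize(text_emb).mean(axis=1))

def zero_shot_predict(image_emb, prototypes):
    """Nearest prototype by cosine similarity; no training involved."""
    return (normalize(image_emb) @ prototypes.T).argmax(axis=1)

# Toy demo: 3 classes, 5 prompt templates, 32-dim features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 5, 32))
protos = prompt_averaged_prototypes(text_emb)
images = protos + 0.01 * rng.normal(size=(3, 32))  # images near each class
print(zero_shot_predict(images, protos))
```

Averaging normalized per-template embeddings before re-normalizing is what makes a five-prompt ensemble behave as a single prototype per class at inference time.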

Best methods by regime

Regime        Best method  Accuracy  Why it matters
Overall       Zero-shot    0.9808    Best final result without any task-specific training.
Few-shot 8    CoOp         0.9708    Large gain over the linear probe at the smallest support budget.
Few-shot 16   CoOp         0.9724    Prompt learning remains clearly more sample-efficient.
Few-shot 32   CoOp         0.9724    CoOp still leads before the probe starts catching up.
Few-shot 64   CoOp         0.9692    The gap narrows, but CoOp is still marginally ahead.
Few-shot 128  CoOp         0.9740    CoOp remains the strongest learned adaptation route in this run.
The key story is not just that zero-shot wins. It is that CoOp is the best learned method at every few-shot budget, which suggests prompt tuning is a more effective adaptation strategy than a linear probe for this Food101 subset.
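For contrast with CoOp's learned prompts, the linear-probe baseline is nothing more than a logistic-regression head trained on frozen image features. A minimal sketch, assuming scikit-learn is available, with synthetic features standing in for CLIP embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic frozen features for an 8-shot support set over 3 classes;
# a real run would use CLIP image-encoder outputs instead.
rng = np.random.default_rng(1)
num_classes, shots, dim = 3, 8, 32
centers = rng.normal(size=(num_classes, dim))
X = np.concatenate([centers[c] + 0.3 * rng.normal(size=(shots, dim))
                    for c in range(num_classes)])
y = np.repeat(np.arange(num_classes), shots)

# The probe itself: a single linear layer on top of frozen features.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))
```

Because the probe starts from random class weights while CoOp starts from CLIP's pretrained text space, the probe needs more shots to catch up, which matches the 8- to 64-shot gap in the table above.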

All reported settings

Method        Shots  Accuracy  Balanced Accuracy  Macro Precision  Macro F1
Zero-shot     0      0.9808    0.9808             0.9808           0.9808
CoOp          8      0.9708    0.9708             0.9709           0.9707
Linear probe  8      0.9340    0.9340             0.9409           0.9345
CoOp          16     0.9724    0.9724             0.9727           0.9724
Linear probe  16     0.9544    0.9544             0.9565           0.9543
CoOp          32     0.9724    0.9724             0.9726           0.9724
Linear probe  32     0.9612    0.9612             0.9625           0.9613
CoOp          64     0.9692    0.9692             0.9697           0.9691
Linear probe  64     0.9672    0.9672             0.9680           0.9673
CoOp          128    0.9740    0.9740             0.9743           0.9740
Linear probe  128    0.9732    0.9732             0.9732           0.9732
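The four columns are standard scikit-learn metrics. A toy example (labels here are illustrative, not from the run) shows how they diverge once errors concentrate in one class:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, f1_score)

# Toy labels: class 1 loses one sample to class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

print(accuracy_score(y_true, y_pred))                       # overall hit rate
print(balanced_accuracy_score(y_true, y_pred))              # mean per-class recall
print(precision_score(y_true, y_pred, average="macro"))     # mean per-class precision
print(f1_score(y_true, y_pred, average="macro"))            # mean per-class F1
```

In the run above the four numbers nearly coincide for every setting, which indicates the class distribution is balanced and errors are spread evenly rather than dumped on a few classes.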

Comparison across methods and support budgets

Comparison bar chart for zero-shot, CoOp, and linear probe metrics.
The plot makes two things obvious: zero-shot starts from an unusually strong baseline, and CoOp is the more sample-efficient few-shot method through the low- and mid-shot regime.

Method-level confusion matrices

The main residual confusions are not random. They cluster around visually adjacent foods with similar textures or silhouettes: fried rings and pastry surfaces for donuts, layered crust and browned surfaces for apple pie, and bread-heavy compositions for hamburger-like mistakes.
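A confusion matrix is just a count table over (true, predicted) pairs. A minimal numpy sketch, with toy labels chosen to mirror the pie-to-donut and donut-to-hamburger confusions described above:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels: 0 = apple_pie, 1 = donuts, 2 = hamburger.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(confusion_matrix(y_true, y_pred, 3))
```

Clustered off-diagonal mass, as in this toy matrix, is the signature of the visually adjacent confusions in the real run; random errors would scatter across all cells instead.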

Decoded prompt context is not human-readable language

Nearest-token decoding of the learned CoOp context shows why prompt tuning should not be interpreted as conventional text editing. The learned vectors do not settle into fluent phrases; instead they occupy regions of CLIP token space that are useful for classification even when the nearest vocabulary items look fragmented or semantically unrelated.

Class      Context token  Nearest decoded tokens
apple_pie  0              slowdown, replacement-char, dumping, ringing, accounting
apple_pie  1              rou, muck, ora, poz, crock
apple_pie  2              !, flower-emoji, !!, lor, cr
apple_pie  3              llen, death, sous, poured, grading
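The decoding step behind this table can be sketched as a cosine nearest-neighbor lookup against the frozen token embedding matrix. The vocabulary and embeddings below are toy stand-ins for CLIP's tokenizer table, not the actual model weights:

```python
import numpy as np

def nearest_tokens(ctx_vec, token_emb, vocab, k=5):
    """Decode a learned context vector to its k nearest vocabulary tokens
    by cosine similarity against the (frozen) token embedding matrix."""
    emb = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    v = ctx_vec / np.linalg.norm(ctx_vec)
    idx = np.argsort(-(emb @ v))[:k]
    return [vocab[i] for i in idx]

# Toy vocabulary and random embeddings standing in for the tokenizer table.
rng = np.random.default_rng(0)
vocab = ["slowdown", "muck", "poured", "crock", "ora", "lor"]
token_emb = rng.normal(size=(len(vocab), 8))
ctx_vec = token_emb[0] + 0.1 * rng.normal(size=8)  # vector near "slowdown"
print(nearest_tokens(ctx_vec, token_emb, vocab, k=3))
```

Because the learned context vectors are optimized continuously, nothing forces them to land near any vocabulary item, which is why the decoded neighbors in the table look fragmented.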

Top failed tests and saliency

The highest-confidence failures are especially useful because they reveal what the system believes very strongly but incorrectly. In this run, the most repeated failure cases are `apple_pie -> donuts` and `donuts -> hamburger`, and both persist across zero-shot and CoOp.

Method     Shots  True       Predicted  Confidence
CoOp       128    apple_pie  donuts     0.9996
CoOp       32     donuts     hamburger  0.9991
Zero-shot  0      apple_pie  donuts     0.9981
Zero-shot  0      donuts     hamburger  0.9970
CoOp       32     apple_pie  donuts     0.9962
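Mining this kind of table amounts to ranking the misclassified samples by predicted confidence. A minimal sketch with hypothetical labels (the helper name and toy data are illustrative, not the run's tooling):

```python
import numpy as np

def top_confident_failures(y_true, y_pred, conf, k=3):
    """Return indices of misclassified samples, highest confidence first."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wrong = np.flatnonzero(y_true != y_pred)
    return wrong[np.argsort(-np.asarray(conf)[wrong])][:k]

y_true = ["apple_pie", "donuts", "apple_pie", "sushi"]
y_pred = ["donuts", "hamburger", "apple_pie", "ramen"]
conf   = [0.9996, 0.9991, 0.88, 0.61]
print(top_confident_failures(y_true, y_pred, conf))
```

Sorting failures by confidence rather than by count is what surfaces the near-certain mistakes above, which are the most informative targets for saliency inspection.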