Assignment 1 / Multimodal / Dataset EDA

10-Class Filter Balanced Few-Shot Splits

Food101 becomes a controlled ten-class benchmark.

The multimodal experiment narrows Food101 to ten semantically distinct yet visually overlapping food categories, then constructs balanced few-shot train and validation subsets from the official training split while preserving the full official validation split as the test set. This stabilizes the evaluation and makes method differences easier to attribute to representation quality rather than data skew.

Dataset framing

The filtered subset contains 7,500 training images and 2,500 test images: exactly 1,000 images per selected class overall (750 train + 250 test). Each few-shot configuration draws balanced support sets from the filtered training pool, while the test split stays fixed at 250 images per class.

- **Filtered Total: 10k.** Ten classes, each with exactly 1,000 images before few-shot slicing.
- **Train Pool: 7.5k.** Official Food101 train split after class filtering.
- **Held-Out Test: 2.5k.** Official validation split, preserved in full for stable reporting.
- **Balance Ratio: 0.00.** The working subset is perfectly balanced by design across all active classes.
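The balance ratio reported above is not defined in the report; a common choice is the relative spread of per-class counts, which is 0.00 exactly when every class has the same number of images. A minimal sketch under that assumption (the metric name and formula are mine, not the report's):

```python
from collections import Counter

def balance_ratio(labels):
    """Relative spread of per-class counts: (max - min) / max.
    Returns 0.00 when every class has the same number of examples.
    NOTE: the exact metric behind the report's 0.00 is an assumption."""
    counts = Counter(labels)
    hi, lo = max(counts.values()), min(counts.values())
    return (hi - lo) / hi

# A perfectly balanced toy subset: 3 classes x 4 examples each.
labels = ["pizza"] * 4 + ["hamburger"] * 4 + ["donuts"] * 4
print(f"{balance_ratio(labels):.2f}")  # prints 0.00 for a balanced split
```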

How the few-shot subset is built

| Component | Definition |
| --- | --- |
| Original source | `ethz/food101` with 101 classes and official train / validation splits. |
| Selected classes | apple pie, bibimbap, chicken wings, donuts, eggs benedict, french fries, grilled cheese sandwich, hamburger, ice cream, pizza. |
| Filtered train pool | 7,500 images total, 750 images per class. |
| Filtered test pool | 2,500 images total, 250 images per class. |
| Validation budget | 20 images per class for every few-shot setting. |
| Shot settings | 8, 16, 32, 64, 128 images per class for few-shot training. |
| Random seed | 42 for deterministic support and validation sampling. |

This split recipe makes the experiment conservative: the test set is large enough to expose confusion structure, while the few-shot train and validation subsets stay class-balanced so metric changes reflect adaptation quality rather than label frequency.
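The sampling step in the recipe above can be sketched in pure Python. The actual pipeline presumably works through the Hugging Face `datasets` library; the helper below only illustrates the deterministic, per-class draw of support and validation ids from the filtered pool (function and variable names are illustrative, not the report's code):

```python
import random

SELECTED = ["apple_pie", "bibimbap", "chicken_wings", "donuts",
            "eggs_benedict", "french_fries", "grilled_cheese_sandwich",
            "hamburger", "ice_cream", "pizza"]
VAL_PER_CLASS = 20
SEED = 42

def sample_split(pool_by_class, shots, val_per_class=VAL_PER_CLASS, seed=SEED):
    """Draw a balanced support set and validation set per class.

    pool_by_class maps class name -> list of example ids from the
    filtered train pool (750 ids per class). The draw is deterministic
    given the seed, and support / validation ids never overlap."""
    rng = random.Random(seed)
    train, val = {}, {}
    for cls in sorted(pool_by_class):
        picked = rng.sample(pool_by_class[cls], shots + val_per_class)
        train[cls] = picked[:shots]
        val[cls] = picked[shots:]
    return train, val

# Toy pool: 750 fake ids per class, mirroring the filtered train pool.
pool = {c: list(range(i * 750, (i + 1) * 750)) for i, c in enumerate(SELECTED)}
train, val = sample_split(pool, shots=128)
assert all(len(ids) == 128 for ids in train.values())
assert all(len(ids) == 20 for ids in val.values())
```

Sorting the class keys before sampling keeps the draw reproducible regardless of dictionary insertion order.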

Every active split is deliberately balanced

Grouped bar chart of Food101 class counts across few-shot train, few-shot dev, and test splits.
For the highest-shot run, each class contributes 128 training images, 20 validation images, and 250 test images. The same symmetry holds at every lower support setting.
Split composition plot for the active experiment subset.
The active experiment surface is still test-heavy. That is intentional: most of the reporting weight sits on held-out evaluation rather than support examples.
Heatmap of balanced class counts across the few-shot and test splits.
Because each split is class-balanced, the downstream metrics are less likely to be inflated by class frequency effects.
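The balance claim behind these plots is easy to verify mechanically. A minimal sketch of that sanity check, assuming a flat manifest of `(split, class)` records (the helper and its names are mine, not the report's):

```python
from collections import Counter

def check_balance(manifest, expected):
    """manifest: list of (split, class) records; expected: dict mapping
    split name -> images per class. Raises ValueError if any
    (split, class) cell deviates from the uniform target."""
    counts = Counter(manifest)
    splits = sorted({s for s, _ in manifest})
    classes = sorted({c for _, c in manifest})
    for s in splits:
        for c in classes:
            if counts[(s, c)] != expected[s]:
                raise ValueError(f"{s}/{c}: got {counts[(s, c)]}, want {expected[s]}")

# Highest-shot configuration: 128 train / 20 val / 250 test per class.
expected = {"train": 128, "val": 20, "test": 250}
classes = ["pizza", "donuts"]  # two toy classes stand in for all ten
manifest = [(s, c) for s, n in expected.items() for c in classes for _ in range(n)]
check_balance(manifest, expected)  # passes silently: every cell is uniform
```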

Food101 is close to square, but not uniform

Most Food101 images live near a 512-pixel square frame, but aspect ratio still ranges widely enough to matter for crop policy. Elongated plates, tall plated burgers, and tightly framed desserts all stress a naive resize-plus-center-crop pipeline differently.
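The cost of a naive crop policy can be quantified: resizing the shorter side and taking a square center crop keeps a fraction `min(w, h) / max(w, h)` of the original image. A worked example using the extremes reported below (the pairing of 512 with 287 is illustrative, not a specific dataset image):

```python
def center_crop_retention(width, height):
    """Fraction of image content kept by a shorter-side resize followed
    by a square center crop: min(w, h) / max(w, h). A square image
    keeps everything; an elongated one loses the overflow on the
    long axis. Illustrative sketch, not the report's pipeline code."""
    return min(width, height) / max(width, height)

print(center_crop_retention(512, 512))  # 1.0: nothing discarded
print(round(center_crop_retention(512, 287), 2))  # 0.56: nearly half discarded
```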

Image width, height, aspect ratio, and area distributions.
Width spans roughly 287 to 512 pixels and height spans 239 to 512 pixels, so geometric variation is real even in a curated food dataset.
Scatter plot of image width and height colored by aspect ratio.
The width-height scatter confirms that many images are near-square, but enough examples sit above and below the diagonal to influence downstream framing.
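The statistics behind these panels reduce to simple per-image derivations from `(width, height)` pairs, which would in practice be collected by opening each file (e.g. with PIL). A minimal sketch with illustrative sizes drawn from the reported ranges:

```python
def geometry_stats(sizes):
    """Summarize width, height, aspect ratio (w / h), and pixel area
    from a list of (width, height) pairs. Returns (min, max) per
    quantity. Sketch of the EDA computation, not the report's code."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    aspects = [w / h for w, h in sizes]
    areas = [w * h for w, h in sizes]
    span = lambda xs: (min(xs), max(xs))
    return {"width": span(widths), "height": span(heights),
            "aspect": span(aspects), "area": span(areas)}

# Illustrative sizes spanning the reported extremes (287-512 x 239-512).
sizes = [(512, 512), (512, 384), (287, 512), (512, 239)]
stats = geometry_stats(sizes)
print(stats["width"], stats["height"])
```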

Visual variation remains high inside each class

Single example from each selected Food101 class.
Even one reference image per class shows broad variation in plating, background clutter, viewpoint, and crop tightness.
Gallery of multiple examples per selected Food101 class.
The full gallery makes the modeling challenge clearer: apple pie and donuts share dessert textures, while hamburgers and grilled cheese often overlap in bread-heavy compositions.