Assignment 1 / Image / Evaluation Results

Evaluation Results

Summarize quantitative outcomes, analyze training dynamics via W&B, and investigate confusion patterns.


1. Performance Overview (Test Set)

A direct quantitative comparison between the baseline CNN and the advanced Vision Transformer on the unseen test split.

| Metric | ResNet50 (Baseline) | ViT-Base/16 (Advanced) |
| --- | --- | --- |
| Peak Validation Accuracy | ~88.5% | ~86.4% |
| Peak Validation F1-Score | ~0.87 | ~0.84 |
| Lowest Validation Loss | ~0.43 | ~0.58 |
| Training Epochs | 10 | 10 |
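A minimal sketch of how the accuracy and macro F1 figures above could be computed from per-image predictions. The function names and toy labels are illustrative, not taken from the assignment code:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy check: 3 of 4 predictions correct
print(accuracy([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
```

In practice a library routine such as scikit-learn's `f1_score(..., average="macro")` computes the same quantity; the hand-rolled version just makes the per-class averaging explicit.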

2. Training Dynamics (Weights & Biases)

Analyzing the learning curves reveals a distinct advantage for the Convolutional architecture over the Transformer on this specific dataset size.

Figures: W&B curves for train loss, validation loss, validation accuracy, and validation F1.
  • Convergence Speed & Stability: Across all four metrics, the ResNet50 curve (orange) consistently outperforms the ViT curve (blue). ResNet50 converges faster and settles at a lower loss and a higher accuracy optimum.
  • The "Data Hunger" Phenomenon: These curves support the hypothesis stated in the Methodology section. Despite having more than 3x the parameters (86M vs. 25M), ViT generalizes less effectively than ResNet50 on this dataset. Lacking the built-in inductive biases of CNNs (locality and translation equivariance), ViT needs substantially more data than the ~30,000 images in Caltech-256 to learn robust global representations.

3. Error Analysis & Confusion Patterns

A deep dive into the specific classes that caused the most misclassifications, extracted directly from our test predictions.

ResNet50 Confusion Matrix

(Figure: confusion matrix for ResNet50 test predictions)

ViT-B/16 Confusion Matrix

(Figure: confusion matrix for ViT-B/16 test predictions)
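For reference, the underlying matrices can be built from raw predictions with a few lines of code. A minimal sketch (toy labels are illustrative; the assignment's actual plotting pipeline may differ):

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Toy example with 2 classes: diagonal = correct, off-diagonal = errors
cm = confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], num_classes=2)
print(cm)  # [[1, 1], [1, 2]]
```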

Top 5 Misclassification Pairs

ResNet50 (CNN) Errors
  • sneaker → tennis-shoes (11 times)
  • fighter-jet → airplanes-101 (6 times)
  • baseball-bat → tweezer (4 times)
  • t-shirt → people (4 times)
  • theodolite → microscope (3 times)
ViT-Base/16 (Transformer) Errors
  • sneaker → tennis-shoes (7 times)
  • microscope → lathe (5 times)
  • tennis-shoes → sneaker (5 times)
  • tuning-fork → tweezer (5 times)
  • teapot → ewer-101 (4 times)
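These rankings can be extracted from the test predictions with a frequency count over mismatched (true, predicted) pairs. A minimal sketch, assuming integer labels and a class-name list (the helper name and toy data are illustrative):

```python
from collections import Counter

def top_misclassified_pairs(y_true, y_pred, class_names, k=5):
    """Return the k most frequent (true -> predicted) error pairs."""
    errors = Counter(
        (class_names[t], class_names[p])
        for t, p in zip(y_true, y_pred)
        if t != p
    )
    return errors.most_common(k)

# Toy example (0: sneaker, 1: tennis-shoes, 2: people)
names = ["sneaker", "tennis-shoes", "people"]
y_true = [0, 0, 0, 1, 2, 2]
y_pred = [1, 1, 0, 1, 0, 2]
print(top_misclassified_pairs(y_true, y_pred, names, k=2))
```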

Key Insights from Error Data

Both models' single most frequent confusion is sneaker → tennis-shoes, two classes that are visually near-identical. ResNet50's remaining errors also cluster around semantically related pairs (fighter-jet vs. airplanes-101, theodolite vs. microscope), whereas ViT additionally confuses fine-grained instrument classes such as microscope vs. lathe and tuning-fork vs. tweezer.

Final Verdict: The Triumph of Inductive Bias

Contrary to the general trend of Transformers dominating deep learning benchmarks, our empirical tracking shows that ResNet50 is the stronger architecture for the Caltech-256 dataset at this scale. It achieved higher accuracy (~88.5% vs. ~86.4%), lower loss, and more predictable, semantically related errors than ViT. This is a useful reminder that on medium-sized datasets, the built-in spatial assumptions (inductive biases) of Convolutional Neural Networks remain remarkably robust and sample-efficient.