Assignment 1 / Image / Evaluation Results

Evaluation Results

Summarize quantitative outcomes, analyze training dynamics via W&B, and investigate confusion patterns.


1. Performance Overview (Test Set)

A direct quantitative comparison between the baseline CNN and the advanced Vision Transformer on the unseen test split.

| Metric | ResNet50 (Baseline) | ViT-Base/16 (Advanced) |
| --- | --- | --- |
| Peak Validation Accuracy | ~88.5% | ~86.4% |
| Peak Validation F1-Score | ~0.87 | ~0.84 |
| Lowest Validation Loss | ~0.43 | ~0.58 |
| Training Epochs | 10 | 10 |
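A minimal sketch of how the accuracy and macro F1 figures above could be computed from per-image predictions. The function names and toy labels are illustrative, not taken from the assignment code:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy check: 3 of 4 predictions correct
print(accuracy([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
```

In practice a library routine such as scikit-learn's `f1_score(..., average="macro")` computes the same quantity; the hand-rolled version just makes the per-class averaging explicit.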

2. Training Dynamics (Weights & Biases)

Analyzing the learning curves reveals a distinct advantage for the Convolutional architecture over the Transformer on this specific dataset size.

Figures: W&B curves for train loss, validation loss, validation accuracy, and validation F1.
  • Convergence Speed & Stability: Across all four metrics, the ResNet50 curve (orange) consistently outperforms the ViT curve (blue). ResNet50 converges faster and settles at a lower loss and a higher accuracy optimum.
  • The "Data Hunger" Phenomenon: These curves support the hypothesis stated in the Methodology section. Despite having more than 3x the parameters (86M vs. 25M), ViT generalizes less effectively than ResNet50 on this dataset. Lacking the built-in inductive biases of CNNs (locality and translation equivariance), ViT needs substantially more data than the ~30,000 images in Caltech-256 to learn robust global representations.

3. Error Analysis & Confusion Patterns

A deep dive into the specific classes that caused the most misclassifications, extracted directly from our test predictions.

ResNet50 Confusion Matrix

(Figure: confusion matrix for ResNet50 test predictions)

ViT-B/16 Confusion Matrix

(Figure: confusion matrix for ViT-B/16 test predictions)
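For reference, the underlying matrices can be built from raw predictions with a few lines of code. A minimal sketch (toy labels are illustrative; the assignment's actual plotting pipeline may differ):

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Toy example with 2 classes: diagonal = correct, off-diagonal = errors
cm = confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], num_classes=2)
print(cm)  # [[1, 1], [1, 2]]
```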

Top 5 Misclassification Pairs

ResNet50 (CNN) Errors
  • sneaker → tennis-shoes (11 times)
  • fighter-jet → airplanes-101 (6 times)
  • baseball-bat → tweezer (4 times)
  • t-shirt → people (4 times)
  • theodolite → microscope (3 times)
ViT-Base/16 (Transformer) Errors
  • sneaker → tennis-shoes (7 times)
  • microscope → lathe (5 times)
  • tennis-shoes → sneaker (5 times)
  • tuning-fork → tweezer (5 times)
  • teapot → ewer-101 (4 times)
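These rankings can be extracted from the test predictions with a frequency count over mismatched (true, predicted) pairs. A minimal sketch, assuming integer labels and a class-name list (the helper name and toy data are illustrative):

```python
from collections import Counter

def top_misclassified_pairs(y_true, y_pred, class_names, k=5):
    """Return the k most frequent (true -> predicted) error pairs."""
    errors = Counter(
        (class_names[t], class_names[p])
        for t, p in zip(y_true, y_pred)
        if t != p
    )
    return errors.most_common(k)

# Toy example (0: sneaker, 1: tennis-shoes, 2: people)
names = ["sneaker", "tennis-shoes", "people"]
y_true = [0, 0, 0, 1, 2, 2]
y_pred = [1, 1, 0, 1, 0, 2]
print(top_misclassified_pairs(y_true, y_pred, names, k=2))
```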

Key Insights from Error Data

Both models' single most frequent confusion is sneaker → tennis-shoes, two classes that are visually near-identical. ResNet50's remaining errors also cluster around semantically related pairs (fighter-jet vs. airplanes-101, theodolite vs. microscope), whereas ViT additionally confuses fine-grained instrument classes such as microscope vs. lathe and tuning-fork vs. tweezer.

Final Verdict: The Triumph of Inductive Bias

Contrary to the general trend of Transformers dominating deep learning benchmarks, our empirical tracking shows that ResNet50 is the stronger architecture for the Caltech-256 dataset at this scale. It achieved higher accuracy (~88.5% vs. ~86.4%), lower loss, and more predictable, semantically related errors than ViT. This is a useful reminder that on medium-sized datasets, the built-in spatial assumptions (inductive biases) of Convolutional Neural Networks remain remarkably robust and sample-efficient.