Assignment 1 / Image / Model Backbone

Model Architecture

Interactive exploration of the end-to-end training pipeline and backbone architectures.

1. End-to-End Pipeline

Raw Data (Images)
  → Preprocessing (Resize 256 → Crop 224)
  → Model Backbone (ResNet50)
  → Classifier Head (Linear Layer)
  → Output (257 Class Logits)
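The preprocessing stage above (resize the shorter side to 256, then take a centered 224x224 crop) is just geometry. A minimal pure-Python sketch of that arithmetic — the function name is my own, not from the assignment:

```python
def resize_then_center_crop(h, w, resize=256, crop=224):
    """Compute output geometry for the standard resize-256 -> crop-224
    pipeline: the shorter image side is scaled to `resize` (preserving
    aspect ratio), then a `crop` x `crop` window is taken from the center.
    Returns the resized (height, width) and the crop box (top, left, h, w)."""
    scale = resize / min(h, w)
    rh, rw = round(h * scale), round(w * scale)
    top = (rh - crop) // 2
    left = (rw - crop) // 2
    return (rh, rw), (top, left, crop, crop)
```

For a 480x640 input this yields a 256x341 resized image and a crop offset of (16, 58), so every image reaches the backbone at a fixed 224x224 resolution.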
ResNet50 Internal Flow:
  • Conv1 & MaxPool: Initial spatial downsampling via a stride-2 7x7 convolution followed by 3x3 max pooling.
  • Layers 1 to 4: Stacked Bottleneck blocks with Residual Connections (Skip Connections) that mitigate the vanishing gradient problem in deep networks.
  • Global Average Pooling (GAP): Averages each channel of the final 7x7 spatial feature map, collapsing it into a 2048-dimensional feature vector.
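The GAP step above is a simple per-channel mean. A minimal pure-Python sketch (the helper name is my own) of how a C x H x W feature map collapses into a C-dimensional vector:

```python
def global_average_pool(fmap):
    """Collapse a C x H x W feature map (nested lists) into a C-dim
    vector by averaging each channel's spatial grid -- what ResNet50
    does to turn its 2048 x 7 x 7 output into a 2048-d feature vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in fmap]
```

Because GAP discards spatial position entirely, the classifier head only ever sees one 2048-d summary per image, regardless of where objects appeared in the original frame.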

2. Technical Specifications


3. Learning Mechanism: Local Features vs. Global Context

A fundamental difference exists in how these two architectures process spatial information and build representations from raw pixels.

Convolutional Neural Network (CNN)

  • Local Receptive Fields: Employs sliding convolutional filters (e.g., 3x3) to capture local spatial correlations such as edges, corners, and textures.
  • Hierarchical Abstraction: Progressively builds representations. Early layers "see" local patterns such as edges, while deeper layers combine them into parts and full semantic objects through stacked convolutions and pooling.
  • High Inductive Bias: Inherently assumes that pixels close to each other are related (spatial locality) and that objects can appear anywhere in the image (translation invariance). This makes CNNs highly sample-efficient on smaller datasets.
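The "local receptive field" idea above can be made concrete with a bare-bones 2-D convolution — a pure-Python sketch with a hypothetical helper name, no framework assumed:

```python
def conv2d_valid(img, kernel):
    """Slide a k x k kernel over a 2-D image (no padding, stride 1).
    Each output value depends only on a local k x k neighbourhood --
    the 'local receptive field' a CNN layer sees."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            # Weighted sum over the k x k window anchored at (i, j)
            row.append(sum(img[i + di][j + dj] * kernel[di][dj]
                           for di in range(k) for dj in range(k)))
        out.append(row)
    return out
```

With a vertical-edge kernel like [[-1, 0, 1]] repeated over three rows, flat image regions produce zero responses while a left-dark/right-bright step edge produces strong ones — the same filter (shared weights) fires wherever the edge appears, which is exactly the translation property the bullet describes.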

Vision Transformer (ViT)

  • Global Self-Attention: Treats the image as a sequence of flattened 16x16 patches. Every patch calculates its relationship (attention score) with every other patch from the very first layer.
  • Contextual Understanding: Capable of capturing long-range dependencies and global context immediately, rather than waiting for deep layers to pool features together.
  • Low Inductive Bias: Has no built-in assumption about translation invariance or spatial locality. It must learn these spatial properties entirely from data, making it inherently "data-hungry" and heavily reliant on large-scale pre-training.
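The patch-sequence and all-pairs attention ideas above can be sketched in a few lines of pure Python. This is a toy single-head version with identity Q/K/V projections (a stand-in for learned weight matrices, which are my simplifying assumption here), not a faithful ViT implementation:

```python
import math

def patchify(img, p):
    """Split an H x W image into non-overlapping p x p patches, each
    flattened to a vector -- the 'sequence of patches' a ViT consumes."""
    h, w = len(img), len(img[0])
    return [[img[i + di][j + dj] for di in range(p) for dj in range(p)]
            for i in range(0, h, p) for j in range(0, w, p)]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections:
    every token scores against every other token (scaled dot product),
    the scores are softmax-normalized, and each output is the resulting
    weighted mix of all token vectors -- global context from layer one."""
    d = len(tokens[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d)
               for kj in tokens] for qi in tokens]
    out = []
    for row in scores:
        m = max(row)                       # subtract max for stability
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        weights = [x / z for x in e]
        out.append([sum(wi * v[j] for wi, v in zip(weights, tokens))
                    for j in range(d)])
    return out
```

Note the contrast with the convolution sketch earlier: there, an output pixel could only see a k x k window; here, every patch's output mixes information from every patch in the image, which is why ViTs get global context immediately but must learn locality from data.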