Assignment 1 / Image / Model Backbone

Model Architecture

Interactive exploration of the end-to-end training pipeline and backbone architectures.

1. End-to-End Pipeline

Raw Data (Images)
  → Preprocessing (Resize 256 → Crop 224)
  → Model Backbone (ResNet50)
  → Classifier Head (Linear Layer)
  → Output (257 Class Logits)
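The preprocessing stage above (resize the shorter side to 256, then take a centered 224x224 crop) is just geometry. A minimal pure-Python sketch of that arithmetic — the function name is my own, not from the assignment:

```python
def resize_then_center_crop(h, w, resize=256, crop=224):
    """Compute output geometry for the standard resize-256 -> crop-224
    pipeline: the shorter image side is scaled to `resize` (preserving
    aspect ratio), then a `crop` x `crop` window is taken from the center.
    Returns the resized (height, width) and the crop box (top, left, h, w)."""
    scale = resize / min(h, w)
    rh, rw = round(h * scale), round(w * scale)
    top = (rh - crop) // 2
    left = (rw - crop) // 2
    return (rh, rw), (top, left, crop, crop)
```

For a 480x640 input this yields a 256x341 resized image and a crop offset of (16, 58), so every image reaches the backbone at a fixed 224x224 resolution.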
ResNet50 Internal Flow:
  • Conv1 & MaxPool: Initial spatial downsampling via a stride-2 7x7 convolution followed by 3x3 max pooling.
  • Layers 1 to 4: Stacked Bottleneck blocks with Residual Connections (Skip Connections) that mitigate the vanishing gradient problem in deep networks.
  • Global Average Pooling (GAP): Averages each channel of the final 7x7 spatial feature map, collapsing it into a 2048-dimensional feature vector.
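The GAP step above is a simple per-channel mean. A minimal pure-Python sketch (the helper name is my own) of how a C x H x W feature map collapses into a C-dimensional vector:

```python
def global_average_pool(fmap):
    """Collapse a C x H x W feature map (nested lists) into a C-dim
    vector by averaging each channel's spatial grid -- what ResNet50
    does to turn its 2048 x 7 x 7 output into a 2048-d feature vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in fmap]
```

Because GAP discards spatial position entirely, the classifier head only ever sees one 2048-d summary per image, regardless of where objects appeared in the original frame.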

2. Technical Specifications


3. Learning Mechanism: Local Features vs. Global Context

A fundamental difference exists in how these two architectures process spatial information and build representations from raw pixels.

Convolutional Neural Network (CNN)

  • Local Receptive Fields: Employs sliding convolutional filters (e.g., 3x3) to capture local spatial correlations such as edges, corners, and textures.
  • Hierarchical Abstraction: Progressively builds representations. Early layers "see" local patterns such as edges, while deeper layers combine them into parts and full semantic objects through stacked convolutions and pooling.
  • High Inductive Bias: Inherently assumes that pixels close to each other are related (spatial locality) and that objects can appear anywhere in the image (translation invariance). This makes CNNs highly sample-efficient on smaller datasets.
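The "local receptive field" idea above can be made concrete with a bare-bones 2-D convolution — a pure-Python sketch with a hypothetical helper name, no framework assumed:

```python
def conv2d_valid(img, kernel):
    """Slide a k x k kernel over a 2-D image (no padding, stride 1).
    Each output value depends only on a local k x k neighbourhood --
    the 'local receptive field' a CNN layer sees."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            # Weighted sum over the k x k window anchored at (i, j)
            row.append(sum(img[i + di][j + dj] * kernel[di][dj]
                           for di in range(k) for dj in range(k)))
        out.append(row)
    return out
```

With a vertical-edge kernel like [[-1, 0, 1]] repeated over three rows, flat image regions produce zero responses while a left-dark/right-bright step edge produces strong ones — the same filter (shared weights) fires wherever the edge appears, which is exactly the translation property the bullet describes.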

Vision Transformer (ViT)

  • Global Self-Attention: Treats the image as a sequence of flattened 16x16 patches. Every patch calculates its relationship (attention score) with every other patch from the very first layer.
  • Contextual Understanding: Capable of capturing long-range dependencies and global context immediately, rather than waiting for deep layers to pool features together.
  • Low Inductive Bias: Has no built-in assumption about translation invariance or spatial locality. It must learn these spatial properties entirely from data, making it inherently "data-hungry" and heavily reliant on large-scale pre-training.
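The patch-sequence and all-pairs attention ideas above can be sketched in a few lines of pure Python. This is a toy single-head version with identity Q/K/V projections (a stand-in for learned weight matrices, which are my simplifying assumption here), not a faithful ViT implementation:

```python
import math

def patchify(img, p):
    """Split an H x W image into non-overlapping p x p patches, each
    flattened to a vector -- the 'sequence of patches' a ViT consumes."""
    h, w = len(img), len(img[0])
    return [[img[i + di][j + dj] for di in range(p) for dj in range(p)]
            for i in range(0, h, p) for j in range(0, w, p)]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections:
    every token scores against every other token (scaled dot product),
    the scores are softmax-normalized, and each output is the resulting
    weighted mix of all token vectors -- global context from layer one."""
    d = len(tokens[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d)
               for kj in tokens] for qi in tokens]
    out = []
    for row in scores:
        m = max(row)                       # subtract max for stability
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        weights = [x / z for x in e]
        out.append([sum(wi * v[j] for wi, v in zip(weights, tokens))
                    for j in range(d)])
    return out
```

Note the contrast with the convolution sketch earlier: there, an output pixel could only see a k x k window; here, every patch's output mixes information from every patch in the image, which is why ViTs get global context immediately but must learn locality from data.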