Assignment 1 / Image / Dataset EDA

Dataset EDA

Explore dataset distribution, class balance, and sample quality before training.


1. Dataset Overview

This section outlines the fundamental attributes of the image dataset used for model training.

  • Dataset Name: Caltech-256 (Kaggle Reference)
  • Total Images: 30,607
  • Number of Classes: 257 (256 object categories + 1 clutter class)
  • Original Image Size: Variable (various resolutions)
  • Train/Val/Test Split: 70% / 15% / 15%

2. Comprehensive Data Analysis

An in-depth analysis of the Caltech-256 dataset reveals significant challenges related to class imbalance, dimensional variance, and intra-class diversity. Understanding these characteristics is crucial for designing a robust training pipeline.

Class Distribution (Imbalance)

Top 5 Largest and Smallest Classes
  • Observation: The dataset exhibits severe class imbalance. The largest class (`257.clutter`) has over 800 images, while the smallest classes (e.g., `top-hat`, `sunflower`) have fewer than 100 images.
  • Implication: This long-tail distribution necessitates strategies like class weighting, oversampling, or specific loss functions (e.g., Focal Loss) to prevent the model from ignoring minority classes.
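As a sketch, inverse-frequency class weights for a weighted loss or sampler can be derived directly from the label list (the `class_weights` helper below is illustrative, not part of the actual pipeline):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare classes receive larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Toy example: class 0 is four times more frequent than class 1,
# so class 1 receives a proportionally larger weight.
w = class_weights([0, 0, 0, 0, 1])
```

The resulting dictionary can be passed (as a tensor) to `torch.nn.CrossEntropyLoss(weight=...)` or used to build a `WeightedRandomSampler`.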

Dimensional Variance

Image Size Distribution
  • Observation: Image dimensions vary drastically, with widths and heights ranging from under 100 pixels to nearly 8000 pixels.
  • Implication: Standardizing the input size (e.g., resizing to 224x224) is mandatory, but aggressive resizing might distort features or cause information loss, highlighting the need for careful augmentation strategies (e.g., RandomResizedCrop).

Aspect Ratio & Color Mode

Aspect Ratio Distribution
Color Mode Distribution
  • Observation: Most images are landscape or portrait; perfectly square images (aspect ratio = 1.0) are rare. While 98.6% of images are RGB, the remaining 1.4% are grayscale.
  • Implication: Grayscale images must be converted to 3 channels before being fed into standard architectures like ResNet or ViT to ensure tensor compatibility.

Intra-class Diversity (Average Images)

Average Image - Motorbikes
Average Image - Frog
  • Observation: The "Motorbikes" average image retains a distinct shape, indicating consistent object placement. However, the "Frog" average image is a blurry, amorphous blob, suggesting high variance in pose, background, and scale.
  • Implication: This underscores the dataset's difficulty. Models must learn complex spatial hierarchies rather than rely on simple shape templates, which makes advanced architectures like Vision Transformers highly relevant.

3. Preprocessing & Augmentation

Following the EDA findings, a robust preprocessing and augmentation pipeline was implemented using PyTorch's torchvision.transforms to handle dimensional variance, color mode inconsistencies, and class imbalance.

Base Preprocessing (All Splits)

  • RGB Conversion: Explicitly applied x.convert('RGB') to safely handle the 1.4% grayscale images, preventing tensor shape mismatch errors during training.
  • Base Resize: All images are initially resized to 256x256 pixels to establish a uniform baseline before cropping.
  • Normalization: Standardized using ImageNet statistics (Mean: [0.485, 0.456, 0.406], Std: [0.229, 0.224, 0.225]) to strictly align with the pre-trained weights of ResNet50 and Vision Transformer.

Training Augmentation

Applied heavily to combat overfitting on minority classes.

  • Random Resized Crop: Cropped to the final 224x224 resolution (with scale 0.8 - 1.0) to introduce scale and translation invariance.
  • Horizontal Flip: Applied randomly to double the effective number of spatial orientations.
  • RandAugment: Implemented an automated augmentation strategy (num_ops=2, magnitude=9) to inject complex visual variance and improve model robustness.

Validation / Test Pipeline

  • Center Crop: Deterministically cropped to 224x224 from the 256x256 base image to evaluate the core object without introducing random noise.

Dataset Splits & Loading

  • Split Ratio: 70% Train, 15% Validation, 15% Test.
  • Reproducibility: Controlled via PyTorch Generator with a fixed manual_seed(42).
  • DataLoader: Configured with batch_size=64, num_workers=2, and pin_memory=True to optimize GPU data transfer latency.

4. Raw Data Samples

An interactive viewer loads a random batch of original (untransformed) training samples for visual inspection.