Assignment 1 / Text / Methodology

Methodology

Training pipeline, optimisation settings, regularisation choices, and reproducibility controls for the BERT, BiLSTM, and Ensemble models.


End-to-End Pipeline

1. Data Loading & Caching

20 Newsgroups is fetched from the Hugging Face Hub (SetFit/20_newsgroups) and cached as a local CSV, falling back to sklearn if the hub is unreachable. The 218 train and 162 test samples with empty text are kept as empty strings rather than dropped.

2. BERT Tokenization (shared)

bert-base-uncased AutoTokenizer — WordPiece, max_length=256, padding="max_length", truncation=True, return_tensors="pt". Both the BERT and BiLSTM models consume the same input_ids and attention_mask.

3. DataLoaders

Batch size 16 (GPU, AMP on). num_workers=0 on Windows to avoid multiprocessing pickling errors. pin_memory=True on CUDA. Fixed generator seed (42) for shuffled training batches.

4. Fine-tune BERT

All 109.5 M parameters updated. AdamW with separate weight-decay groups (bias/LayerNorm excluded). Linear warmup (10%) + linear decay. Gradient accumulation 2 steps → effective batch 32. AMP FP16 enabled. Best checkpoint saved on test accuracy, patience=3.
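The linear warmup + linear decay schedule used here can be sketched as a plain LR-multiplier function, equivalent in shape to what transformers' get_linear_schedule_with_warmup computes; the step counts below are illustrative, not this run's actual totals.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """LR multiplier: ramp 0 -> 1 over warmup_steps, then decay linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With a 10% warmup ratio over e.g. 1,000 optimizer steps:
# step 0 -> 0.0, step 100 -> 1.0, step 1000 -> 0.0
```

In the actual run, total_steps is the number of optimizer steps after gradient accumulation (batches / 2), and warmup_steps is 10% of that.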

5. Train BiLSTM (frozen BERT embeddings)

BERT moved to CPU to free VRAM; BiLSTM loaded to GPU. Only LSTM + classifier parameters are updated (7.4 M). AMP disabled (FP32) for LayerNorm numerical stability. Early stopping patience=4.
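The early-stopping logic used for both models (patience 3 for BERT, 4 for the BiLSTM) amounts to a small metric tracker; this is a minimal sketch, and the class name is mine, not the pipeline's.

```python
class EarlyStopper:
    """Stop training when the monitored metric (here: test accuracy)
    fails to improve for `patience` consecutive epochs."""

    def __init__(self, patience: int = 4):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric      # new best: the pipeline saves a checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Because the best checkpoint is persisted on every improvement, stopping late never loses the best model.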

6. Pre-compute Logits for Ensemble

Best BERT checkpoint loaded to GPU; BiLSTM on CPU. BERT logits collected for all train/test samples; then swapped (BiLSTM to GPU, BERT to CPU) for BiLSTM logits. All logits cached on CPU.

7. Ensemble Meta-learning

Concatenated logits (shape B×40) fed to a 3-layer meta-MLP. Only 7,340 parameters updated. Batch size 256, 15 epochs. No GPU needed — backbones are frozen and logits are cached. Each epoch finishes in < 1 s.
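How each B×40 row is formed can be sketched in plain Python (the pipeline itself does this with torch.cat on the cached logit tensors; the function name is illustrative).

```python
def stack_features(bert_logits, bilstm_logits):
    """Meta-MLP input for one sample: concatenate the two 20-class
    logit vectors into a single 40-dim feature vector."""
    assert len(bert_logits) == 20 and len(bilstm_logits) == 20
    return list(bert_logits) + list(bilstm_logits)
```

This is classic stacking: the meta-MLP learns per-class weighting of the two backbones' raw (pre-softmax) scores, which preserves each model's confidence information.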

8. Evaluation & Logging

Accuracy, macro precision/recall/F1, per-class F1, confusion matrix, macro one-vs-rest ROC AUC. All runs optionally tracked with Weights & Biases.
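The macro averaging used throughout weights all 20 classes equally regardless of support. In the pipeline this comes from sklearn's f1_score(average="macro"); the stdlib sketch below shows the underlying computation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: each class counts equally,
    so rare classes affect the score as much as frequent ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```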

Hyperparameter Settings

| Hyperparameter | BERT | BiLSTM | Ensemble meta-MLP |
| --- | --- | --- | --- |
| Epochs | 3 | 6 | 15 |
| Learning rate | 2 × 10⁻⁵ | 1 × 10⁻³ | 5 × 10⁻⁴ |
| Weight decay | 1 × 10⁻² | 1 × 10⁻² | 1 × 10⁻³ |
| Warmup ratio | 10% | 5% | 10% |
| LR schedule | Linear decay | Linear decay | Linear decay |
| Batch size | 16 (eff. 32 w/ grad_accum=2) | 16 | 256 |
| Optimizer | AdamW | AdamW | AdamW |
| Gradient clip | 1.0 | 1.0 | n/a |
| Label smoothing | 0.1 | n/a | 0.05 |
| Mixed precision | AMP FP16 | FP32 (stability) | FP32 |
| Early stopping patience | 3 epochs | 4 epochs | n/a |

Learning Curves

[Figure: loss and accuracy curves for BERT, BiLSTM, and Ensemble across training epochs]

Figure 2 — Learning curves. Solid = test, dashed = train. BERT converges fastest (3 epochs) and reaches the highest test accuracy. BiLSTM improves steadily over 6 epochs. Ensemble epochs are nearly instant (cached logits).

Regularisation Strategy

  • Dropout (0.3): applied in BiLSTM embedding, between LSTM layers, and in both classifiers.
  • Weight decay (1e-2): L2 penalty via AdamW; for BERT, bias and LayerNorm params are excluded (weight_decay=0.0 group).
  • Label smoothing (0.1 for BERT, 0.05 for Ensemble): prevents overconfidence; not used for BiLSTM (standard CE).
  • Gradient clip (max_norm=1.0): applied before every optimizer step for both BERT and BiLSTM.
  • Early stopping: patience 3 (BERT) and 4 (BiLSTM) — best checkpoint persisted to disk.
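The bias/LayerNorm weight-decay exclusion can be sketched as a name-based parameter split fed to AdamW; the exact no-decay name list below is an assumption modeled on common Hugging Face fine-tuning examples, not copied from the pipeline.

```python
def param_groups(named_params, weight_decay=1e-2):
    """Split (name, param) pairs into decay / no-decay AdamW groups:
    biases and LayerNorm weights get weight_decay=0.0 (assumed name patterns)."""
    no_decay = ("bias", "LayerNorm.weight")
    decay, nodecay = [], []
    for name, p in named_params:
        (nodecay if any(nd in name for nd in no_decay) else decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": nodecay, "weight_decay": 0.0},
    ]
```

The returned list is what `torch.optim.AdamW(param_groups(model.named_parameters()), lr=2e-5)` would consume.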

Reproducibility Controls

  • Global seed 42 applied to random, numpy, and torch.
  • torch.cuda.manual_seed_all(42) seeds the RNG on every GPU (seeding alone does not make all CUDA kernels deterministic, but it fixes weight init and dropout masks).
  • DataLoader uses a fixed torch.Generator with worker_init_fn seeding each worker.
  • num_workers=0 on Windows — avoids process-fork pickling issues in Jupyter.
  • GPU sanity check at startup; AMP GradScaler uses the new torch.amp (not deprecated torch.cuda.amp) API.
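The seeding steps above can be collected into one helper; this is a minimal stdlib sketch (the helper name is mine), with the numpy/torch calls the pipeline also makes shown as comments so the block stays runnable without those packages.

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Seed Python's RNG and hash seed; with numpy/torch installed,
    the commented lines apply as well (as the pipeline does)."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    # np.random.seed(seed)
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
```

Calling this once at startup (before model construction and DataLoader creation) makes weight init and batch shuffling repeatable across runs.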