Assignment 1 / Text / Methodology

Methodology

Training pipeline, optimisation settings, regularisation choices, and reproducibility controls for the BERT, BiLSTM, and Ensemble models.


End-to-End Pipeline

1. Data Loading & Caching

20 Newsgroups is fetched from the Hugging Face Hub (SetFit/20_newsgroups) and cached as a local CSV, falling back to sklearn if the hub is unreachable. The 218 train and 162 test samples with empty text are kept as empty strings rather than dropped.

2. BERT Tokenization (shared)

bert-base-uncased AutoTokenizer — WordPiece, max_length=256, padding="max_length", truncation=True, return_tensors="pt". Both the BERT and BiLSTM models consume the same input_ids and attention_mask.

3. DataLoaders

Batch size 16 (GPU, AMP on). num_workers=0 on Windows to avoid multiprocessing pickling errors. pin_memory=True on CUDA. Fixed generator seed (42) for shuffled training batches.

4. Fine-tune BERT

All 109.5 M parameters updated. AdamW with separate weight-decay groups (bias/LayerNorm excluded). Linear warmup (10%) + linear decay. Gradient accumulation 2 steps → effective batch 32. AMP FP16 enabled. Best checkpoint saved on test accuracy, patience=3.
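The linear warmup + linear decay schedule used here can be sketched as a plain LR-multiplier function, equivalent in shape to what transformers' get_linear_schedule_with_warmup computes; the step counts below are illustrative, not this run's actual totals.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """LR multiplier: ramp 0 -> 1 over warmup_steps, then decay linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With a 10% warmup ratio over e.g. 1,000 optimizer steps:
# step 0 -> 0.0, step 100 -> 1.0, step 1000 -> 0.0
```

In the actual run, total_steps is the number of optimizer steps after gradient accumulation (batches / 2), and warmup_steps is 10% of that.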

5. Train BiLSTM (frozen BERT embeddings)

BERT moved to CPU to free VRAM; BiLSTM loaded to GPU. Only LSTM + classifier parameters are updated (7.4 M). AMP disabled (FP32) for LayerNorm numerical stability. Early stopping patience=4.
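The early-stopping logic used for both models (patience 3 for BERT, 4 for the BiLSTM) amounts to a small metric tracker; this is a minimal sketch, and the class name is mine, not the pipeline's.

```python
class EarlyStopper:
    """Stop training when the monitored metric (here: test accuracy)
    fails to improve for `patience` consecutive epochs."""

    def __init__(self, patience: int = 4):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric      # new best: the pipeline saves a checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Because the best checkpoint is persisted on every improvement, stopping late never loses the best model.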

6. Pre-compute Logits for Ensemble

Best BERT checkpoint loaded to GPU; BiLSTM on CPU. BERT logits collected for all train/test samples; then swapped (BiLSTM to GPU, BERT to CPU) for BiLSTM logits. All logits cached on CPU.

7. Ensemble Meta-learning

Concatenated logits (shape B×40) fed to a 3-layer meta-MLP. Only 7,340 parameters updated. Batch size 256, 15 epochs. No GPU needed — backbones are frozen and logits are cached. Each epoch finishes in < 1 s.
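How each B×40 row is formed can be sketched in plain Python (the pipeline itself does this with torch.cat on the cached logit tensors; the function name is illustrative).

```python
def stack_features(bert_logits, bilstm_logits):
    """Meta-MLP input for one sample: concatenate the two 20-class
    logit vectors into a single 40-dim feature vector."""
    assert len(bert_logits) == 20 and len(bilstm_logits) == 20
    return list(bert_logits) + list(bilstm_logits)
```

This is classic stacking: the meta-MLP learns per-class weighting of the two backbones' raw (pre-softmax) scores, which preserves each model's confidence information.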

8. Evaluation & Logging

Accuracy, macro precision/recall/F1, per-class F1, confusion matrix, macro one-vs-rest ROC AUC. All runs optionally tracked with Weights & Biases.
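The macro averaging used throughout weights all 20 classes equally regardless of support. In the pipeline this comes from sklearn's f1_score(average="macro"); the stdlib sketch below shows the underlying computation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: each class counts equally,
    so rare classes affect the score as much as frequent ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```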

Hyperparameter Settings

| Hyperparameter | BERT | BiLSTM | Ensemble meta-MLP |
| --- | --- | --- | --- |
| Epochs | 3 | 6 | 15 |
| Learning rate | 2 × 10⁻⁵ | 1 × 10⁻³ | 5 × 10⁻⁴ |
| Weight decay | 1 × 10⁻² | 1 × 10⁻² | 1 × 10⁻³ |
| Warmup ratio | 10% | 5% | 10% |
| LR schedule | Linear decay | Linear decay | Linear decay |
| Batch size | 16 (eff. 32 w/ grad_accum=2) | 16 | 256 |
| Optimizer | AdamW | AdamW | AdamW |
| Gradient clip | 1.0 | 1.0 | n/a |
| Label smoothing | 0.1 | n/a | 0.05 |
| Mixed precision | AMP FP16 | FP32 (stability) | FP32 |
| Early stopping patience | 3 epochs | 4 epochs | n/a |

Learning Curves

[Figure: loss and accuracy curves for BERT, BiLSTM, and Ensemble across training epochs]

Figure 2 — Learning curves. Solid = test, dashed = train. BERT converges fastest (3 epochs) and reaches the highest test accuracy. BiLSTM improves steadily over 6 epochs. Ensemble epochs are nearly instant (cached logits).

Regularisation Strategy

  • Dropout (0.3): applied in BiLSTM embedding, between LSTM layers, and in both classifiers.
  • Weight decay (1e-2): L2 penalty via AdamW; for BERT, bias and LayerNorm params are excluded (weight_decay=0.0 group).
  • Label smoothing (0.1 for BERT, 0.05 for Ensemble): prevents overconfidence; not used for BiLSTM (standard CE).
  • Gradient clip (max_norm=1.0): applied before every optimizer step for both BERT and BiLSTM.
  • Early stopping: patience 3 (BERT) and 4 (BiLSTM) — best checkpoint persisted to disk.
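The bias/LayerNorm weight-decay exclusion can be sketched as a name-based parameter split fed to AdamW; the exact no-decay name list below is an assumption modeled on common Hugging Face fine-tuning examples, not copied from the pipeline.

```python
def param_groups(named_params, weight_decay=1e-2):
    """Split (name, param) pairs into decay / no-decay AdamW groups:
    biases and LayerNorm weights get weight_decay=0.0 (assumed name patterns)."""
    no_decay = ("bias", "LayerNorm.weight")
    decay, nodecay = [], []
    for name, p in named_params:
        (nodecay if any(nd in name for nd in no_decay) else decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": nodecay, "weight_decay": 0.0},
    ]
```

The returned list is what `torch.optim.AdamW(param_groups(model.named_parameters()), lr=2e-5)` would consume.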

Reproducibility Controls

  • Global seed 42 applied to random, numpy, and torch.
  • torch.cuda.manual_seed_all(42) seeds the RNG on every GPU (seeding alone does not make all CUDA kernels deterministic, but it fixes weight init and dropout masks).
  • DataLoader uses a fixed torch.Generator with worker_init_fn seeding each worker.
  • num_workers=0 on Windows — avoids process-fork pickling issues in Jupyter.
  • GPU sanity check at startup; AMP GradScaler uses the new torch.amp (not deprecated torch.cuda.amp) API.
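The seeding steps above can be collected into one helper; this is a minimal stdlib sketch (the helper name is mine), with the numpy/torch calls the pipeline also makes shown as comments so the block stays runnable without those packages.

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Seed Python's RNG and hash seed; with numpy/torch installed,
    the commented lines apply as well (as the pipeline does)."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    # np.random.seed(seed)
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
```

Calling this once at startup (before model construction and DataLoader creation) makes weight init and batch shuffling repeatable across runs.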