Assignment 1 / Text Track

20 Newsgroups — fine-tuned BERT vs. BiLSTM with frozen BERT embeddings, topped with a learned ensemble meta-learner.
| Component | Selection |
|---|---|
| Dataset | 20 Newsgroups — 11,314 train / 7,532 test / 20 classes |
| Tokenizer | bert-base-uncased — WordPiece, vocab 30,522, max_length=256 |
| Baseline | BiLSTM with frozen BERT embeddings + attention pooling — 7.4 M trainable params |
| Advanced | Fine-tuned bert-base-uncased (AutoModelForSequenceClassification) — 109.5 M params, AMP FP16 |
| Ensemble | Learned meta-MLP over pre-computed logits from both models — 7,340 trainable params |
| Best accuracy | 70.23% (BERT) on 7,532 test samples — BiLSTM 68.00%, Ensemble 69.48% |
| Experiment tracking | Weights & Biases — 3 runs (BERT, BiLSTM, Ensemble) |
| GPU | NVIDIA RTX 3060 Laptop (6 GB VRAM) — CUDA 12.1, PyTorch 2.5.1 |
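The baseline row above pairs a BiLSTM over frozen BERT embeddings with attention pooling. A minimal sketch of such a pooling head follows; the exact variant used in the report is an assumption, and the dimensions here are illustrative:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Collapse a sequence of hidden states into one vector via a
    learned attention weighting (a common pooling head; details of
    the report's implementation are assumed, not confirmed)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)  # one scalar score per token

    def forward(self, h, mask):
        # h: (batch, seq_len, hidden); mask: (batch, seq_len), 1 = real token
        scores = self.score(h).squeeze(-1)                    # (batch, seq_len)
        scores = scores.masked_fill(mask == 0, float("-inf")) # ignore padding
        weights = torch.softmax(scores, dim=-1)               # attention weights
        return torch.einsum("bs,bsh->bh", weights, h)         # (batch, hidden)

h = torch.randn(2, 256, 512)   # e.g. BiLSTM outputs (2 directions x 256 units)
mask = torch.ones(2, 256)
pooled = AttentionPooling(512)(h, mask)
print(pooled.shape)  # torch.Size([2, 512])
```

The mask keeps padded positions from receiving attention weight, which matters once sequences are padded to max_length=256.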
Open each section to see the full report for the Text track.
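The ensemble row in the table reports 7,340 trainable parameters for the meta-MLP over the two models' pre-computed logits. With the 20-class logits of BERT and the BiLSTM concatenated into 40 inputs, a single hidden layer of width 120 reproduces that count exactly; the layer sizes are an inference from the parameter count, not taken from the report:

```python
import torch.nn as nn

# Hypothetical reconstruction of the ensemble meta-MLP: 40 concatenated
# logits (20 from BERT + 20 from the BiLSTM) mapped back to 20 classes.
meta_mlp = nn.Sequential(
    nn.Linear(40, 120),   # 40*120 + 120 = 4,920 params
    nn.ReLU(),
    nn.Linear(120, 20),   # 120*20 + 20 = 2,420 params
)

n_params = sum(p.numel() for p in meta_mlp.parameters() if p.requires_grad)
print(n_params)  # 7340
```

Because the meta-learner only sees cached logits, it trains in seconds and never touches the 109.5 M BERT parameters.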
- **Dataset EDA** — 20-class distribution, word-count profiles, BERT tokenization decisions, and empty-text handling for 20 Newsgroups.
- **Model Backbone** — full specs for the fine-tuned BERT classifier, the BiLSTM with frozen BERT embeddings, and the ensemble meta-MLP, with parameter counts and design rationale.
- **Methodology** — end-to-end pipeline, hyperparameter tables, AMP + gradient-accumulation strategy, regularisation, and learning curves for all three models.
- **Evaluation Results** — test accuracy (BERT 70.23%), macro F1, per-class F1 bars, confusion matrices, and ROC AUC, with error analysis and next-step recommendations.