Assignment 1 / Text / Evaluation Results

Evaluation Results

Quantitative outcomes for all three models on the 20 Newsgroups test set — accuracy, macro F1, ROC AUC, confusion patterns, and per-class analysis.

Summary Metrics

BERT (best accuracy)

Accuracy: 70.23%

Macro Precision: 68.86%
Macro Recall: 68.56%
Macro F1: 67.93%
Macro ROC AUC: 0.9640
Parameters: 109.5 M

BiLSTM

Accuracy: 68.00%

Macro Precision: 67.52%
Macro Recall: 66.72%
Macro F1: 66.86%
Macro ROC AUC: 0.9571
Trainable params: 7.4 M

Ensemble

Accuracy: 69.48%

Macro Precision: 68.19%
Macro Recall: 67.76%
Macro F1: 67.35%
Macro ROC AUC: 0.9589
Trainable params: 7,340
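
The macro-averaged scores above weight every class equally, regardless of support. A minimal plain-Python sketch of macro-F1 (illustrative only; the actual runs presumably used a library such as scikit-learn):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class counts equally, a single collapsed class (like talk.religion.misc at F1 0.07) drags the macro average well below the accuracy figure.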

Per-class F1 Comparison

Grouped bar chart — per-class F1-score for BERT, BiLSTM, and Ensemble

Figure 3 — Per-class F1. Easiest: rec.sport.hockey (F1≈0.89), misc.forsale (0.83). Hardest: talk.religion.misc (BERT F1=0.07) and alt.atheism. BERT leads in most classes.

Confusion Matrices

Three side-by-side confusion matrices for BERT, BiLSTM, and Ensemble

Figure 4 — Confusion matrices. BERT shows the sharpest diagonal. Top confusion pairs: alt.atheism ↔ talk.religion.misc and rec.autos ↔ rec.motorcycles across all three models.
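
The matrices in Figure 4 are tallies of (true, predicted) pairs; off-diagonal mass marks confusion pairs like the ones named above. A minimal sketch:

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m
```

The heatmaps in Figure 4 are just these counts rendered with a color scale, so a "sharp diagonal" means most mass sits at m[c][c].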

Macro ROC AUC

Bar chart of macro one-vs-rest ROC AUC for BERT, BiLSTM, and Ensemble

Figure 5 — Macro ROC AUC. All models exceed 0.957 AUC. BERT leads (0.9640); Ensemble sits between (0.9589); BiLSTM trails slightly (0.9571) — all three rank the correct class highly even where hard-label accuracy is low.
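
Macro one-vs-rest ROC AUC treats each of the 20 classes as a binary positive-vs-rest problem, scores it from that class's predicted-probability column, and averages the 20 binary AUCs. A pure-Python sketch (assumes probs is a list of per-class probability rows; ties get 0.5 credit):

```python
def binary_auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(probs, y_true, n_classes):
    """Average the one-vs-rest binary AUCs over all classes."""
    aucs = []
    for c in range(n_classes):
        scores = [row[c] for row in probs]
        labels = [1 if t == c else 0 for t in y_true]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / n_classes
```

This is why AUC can stay near 0.96 while accuracy sits at 70%: AUC rewards ranking the true class above the rest, not predicting it outright.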

Detailed Per-class F1 (Test Set)

Class                      BERT F1   BiLSTM F1   Ensemble F1   Support
alt.atheism                   0.47        0.43          0.41       319
comp.graphics                 0.70        0.69          0.56       389
comp.os.ms-windows.misc       0.68        0.64          0.66       394
comp.sys.ibm.pc.hardware      0.62        0.65          0.60       392
comp.sys.mac.hardware         0.69        0.71          0.67       385
comp.windows.x                0.81        0.79          0.81       395
misc.forsale                  0.83        0.78          0.82       390
rec.autos                     0.64        0.59          0.76       396
rec.motorcycles               0.74        0.70          0.75       398
rec.sport.baseball            0.86        0.83          0.88       397
rec.sport.hockey              0.89        0.86          0.88       399
sci.crypt                     0.73        0.73          0.73       396
sci.electronics               0.63        0.60          0.64       393
sci.med                       0.83        0.78          0.84       396
sci.space                     0.78        0.77          0.78       394
soc.religion.christian        0.72        0.70          0.70       398
talk.politics.guns            0.60        0.58          0.59       364
talk.politics.mideast         0.82        0.79          0.82       376
talk.politics.misc            0.48        0.47          0.50       310
talk.religion.misc            0.07        0.27          0.08       251
Macro avg                     0.68        0.67          0.67     7,532

Key Findings

  • BERT wins convincingly at 70.23% — a 2.23-percentage-point gap over BiLSTM, demonstrating the power of fine-tuning the full pre-trained Transformer versus using frozen embeddings alone.
  • Ensemble falls between the two (69.48%): the meta-MLP improves over BiLSTM (+1.48 points) but cannot fully match BERT, capped by the BiLSTM's lower individual ceiling.
  • Hardest classes: talk.religion.misc (BERT F1 = 0.07 — severe overlap with alt.atheism and soc.religion.christian) and alt.atheism across all models.
  • All models exceed ROC AUC 0.957 — class ranking stays strong even on the hardest classes, though AUC measures discrimination rather than calibration.
  • BiLSTM's AMP-off (FP32) training avoids LayerNorm NaN issues but increases per-epoch time relative to BERT's AMP + gradient-accumulation setup.
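
The report does not state the Ensemble meta-MLP's architecture. One configuration consistent with the reported 7,340 trainable parameters is a single hidden layer of width 120 over the two base models' concatenated 20-way outputs (40 inputs); the hidden width is an assumption, not given in the source:

```python
def mlp_param_count(layer_sizes):
    """Total weights + biases for a fully connected MLP."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 40-dim input (BERT + BiLSTM outputs, 20 classes each), hypothetical
# hidden width 120, 20-way output: 40*120+120 + 120*20+20 = 7,340.
print(mlp_param_count([40, 120, 20]))
```

Whatever the exact shape, the point stands: the meta-learner is roughly 1,000x smaller than the BiLSTM and 15,000x smaller than BERT, so it can only reweight its inputs, not recover signal the base models never produced.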

Next Experiment Steps

  • Train BERT for more epochs (5–6) with a lower peak LR (1e-5) and a cosine schedule — the current run stopped at 3 epochs with clear room left to improve.
  • Experiment with bert-large-uncased or a stronger pre-trained variant (e.g., RoBERTa) to raise the accuracy ceiling.
  • Apply focal loss or class-weighted CE for talk.religion.misc — the model collapses this class to near-zero F1 under standard CE.
  • Use BERT's hidden states (all layers) in the BiLSTM rather than only the embedding module — this may close the 2% gap without full fine-tuning cost.
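
The focal-loss suggestion above down-weights well-classified examples so that hard minority classes like talk.religion.misc keep contributing gradient. A minimal per-example sketch (plain Python; gamma=2 is the common default from the focal-loss literature):

```python
import math

def focal_loss(p, gamma=2.0, alpha=1.0):
    """Focal loss for one example, where p is the predicted probability
    of the true class: -alpha * (1 - p)**gamma * log(p).
    At gamma=0 this reduces to standard cross-entropy."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)
```

A well-classified example (p = 0.9) is down-weighted by (0.1)^2, i.e. 100x relative to cross-entropy, while a badly misclassified one (p = 0.1) keeps 81% of its loss — exactly the asymmetry needed when standard CE lets the model collapse a class to near-zero F1.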