Assignment 1 / Text / Evaluation Results

Evaluation Results

Quantitative outcomes for all three models on the 20 Newsgroups test set — accuracy, macro F1, ROC AUC, confusion patterns, and per-class analysis.

Summary Metrics

BERT (best accuracy)

Accuracy: 70.23%

Macro Precision: 68.86%
Macro Recall: 68.56%
Macro F1: 67.93%
Macro ROC AUC: 0.9640
Parameters: 109.5 M

BiLSTM

Accuracy: 68.00%

Macro Precision: 67.52%
Macro Recall: 66.72%
Macro F1: 66.86%
Macro ROC AUC: 0.9571
Trainable params: 7.4 M

Ensemble

Accuracy: 69.48%

Macro Precision: 68.19%
Macro Recall: 67.76%
Macro F1: 67.35%
Macro ROC AUC: 0.9589
Trainable params: 7,340
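
The macro-averaged scores above weight every class equally, regardless of support. A minimal plain-Python sketch of macro-F1 (illustrative only; the actual runs presumably used a library such as scikit-learn):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class counts equally, a single collapsed class (like talk.religion.misc at F1 0.07) drags the macro average well below the accuracy figure.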

Per-class F1 Comparison

Grouped bar chart — per-class F1-score for BERT, BiLSTM, and Ensemble

Figure 3 — Per-class F1. Easiest: rec.sport.hockey (F1≈0.89), misc.forsale (0.83). Hardest: talk.religion.misc (BERT F1=0.07) and alt.atheism. BERT leads in most classes.

Confusion Matrices

Three side-by-side confusion matrices for BERT, BiLSTM, and Ensemble

Figure 4 — Confusion matrices. BERT shows the sharpest diagonal. Top confusion pairs: alt.atheism ↔ talk.religion.misc and rec.autos ↔ rec.motorcycles across all three models.
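
The matrices in Figure 4 are tallies of (true, predicted) pairs; off-diagonal mass marks confusion pairs like the ones named above. A minimal sketch:

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m
```

The heatmaps in Figure 4 are just these counts rendered with a color scale, so a "sharp diagonal" means most mass sits at m[c][c].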

Macro ROC AUC

Bar chart of macro one-vs-rest ROC AUC for BERT, BiLSTM, and Ensemble

Figure 5 — Macro ROC AUC. All models exceed 0.957 AUC. BERT leads (0.9640); Ensemble sits between (0.9589); BiLSTM trails slightly (0.9571) — all three rank the correct class highly even where hard-label accuracy is low.
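
Macro one-vs-rest ROC AUC treats each of the 20 classes as a binary positive-vs-rest problem, scores it from that class's predicted-probability column, and averages the 20 binary AUCs. A pure-Python sketch (assumes probs is a list of per-class probability rows; ties get 0.5 credit):

```python
def binary_auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(probs, y_true, n_classes):
    """Average the one-vs-rest binary AUCs over all classes."""
    aucs = []
    for c in range(n_classes):
        scores = [row[c] for row in probs]
        labels = [1 if t == c else 0 for t in y_true]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / n_classes
```

This is why AUC can stay near 0.96 while accuracy sits at 70%: AUC rewards ranking the true class above the rest, not predicting it outright.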

Detailed Per-class F1 (Test Set)

Class                      BERT F1   BiLSTM F1   Ensemble F1   Support
alt.atheism                   0.47        0.43          0.41       319
comp.graphics                 0.70        0.69          0.56       389
comp.os.ms-windows.misc       0.68        0.64          0.66       394
comp.sys.ibm.pc.hardware      0.62        0.65          0.60       392
comp.sys.mac.hardware         0.69        0.71          0.67       385
comp.windows.x                0.81        0.79          0.81       395
misc.forsale                  0.83        0.78          0.82       390
rec.autos                     0.64        0.59          0.76       396
rec.motorcycles               0.74        0.70          0.75       398
rec.sport.baseball            0.86        0.83          0.88       397
rec.sport.hockey              0.89        0.86          0.88       399
sci.crypt                     0.73        0.73          0.73       396
sci.electronics               0.63        0.60          0.64       393
sci.med                       0.83        0.78          0.84       396
sci.space                     0.78        0.77          0.78       394
soc.religion.christian        0.72        0.70          0.70       398
talk.politics.guns            0.60        0.58          0.59       364
talk.politics.mideast         0.82        0.79          0.82       376
talk.politics.misc            0.48        0.47          0.50       310
talk.religion.misc            0.07        0.27          0.08       251
Macro avg                     0.68        0.67          0.67     7,532

Key Findings

  • BERT wins convincingly at 70.23% — a 2.23-percentage-point gap over BiLSTM, demonstrating the power of fine-tuning the full pre-trained Transformer versus using frozen embeddings alone.
  • Ensemble falls between the two (69.48%): the meta-MLP improves over BiLSTM (+1.48 points) but cannot fully match BERT, capped by the BiLSTM's lower individual ceiling.
  • Hardest classes: talk.religion.misc (BERT F1 = 0.07 — severe overlap with alt.atheism and soc.religion.christian) and alt.atheism across all models.
  • All models exceed ROC AUC 0.957 — class ranking stays strong even on the hardest classes, though AUC measures discrimination rather than calibration.
  • BiLSTM's AMP-off (FP32) training avoids LayerNorm NaN issues but increases per-epoch time relative to BERT's AMP + gradient-accumulation setup.
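
The report does not state the Ensemble meta-MLP's architecture. One configuration consistent with the reported 7,340 trainable parameters is a single hidden layer of width 120 over the two base models' concatenated 20-way outputs (40 inputs); the hidden width is an assumption, not given in the source:

```python
def mlp_param_count(layer_sizes):
    """Total weights + biases for a fully connected MLP."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 40-dim input (BERT + BiLSTM outputs, 20 classes each), hypothetical
# hidden width 120, 20-way output: 40*120+120 + 120*20+20 = 7,340.
print(mlp_param_count([40, 120, 20]))
```

Whatever the exact shape, the point stands: the meta-learner is roughly 1,000x smaller than the BiLSTM and 15,000x smaller than BERT, so it can only reweight its inputs, not recover signal the base models never produced.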

Next Experiment Steps

  • Train BERT for more epochs (5–6) with a lower peak LR (1e-5) and a cosine schedule — the current run stopped at 3 epochs with clear room left to improve.
  • Experiment with bert-large-uncased or a stronger pre-trained variant (e.g., RoBERTa) to raise the accuracy ceiling.
  • Apply focal loss or class-weighted CE for talk.religion.misc — the model collapses this class to near-zero F1 under standard CE.
  • Use BERT's hidden states (all layers) in the BiLSTM rather than only the embedding module — this may close the 2% gap without full fine-tuning cost.
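
The focal-loss suggestion above down-weights well-classified examples so that hard minority classes like talk.religion.misc keep contributing gradient. A minimal per-example sketch (plain Python; gamma=2 is the common default from the focal-loss literature):

```python
import math

def focal_loss(p, gamma=2.0, alpha=1.0):
    """Focal loss for one example, where p is the predicted probability
    of the true class: -alpha * (1 - p)**gamma * log(p).
    At gamma=0 this reduces to standard cross-entropy."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)
```

A well-classified example (p = 0.9) is down-weighted by (0.1)^2, i.e. 100x relative to cross-entropy, while a badly misclassified one (p = 0.1) keeps 81% of its loss — exactly the asymmetry needed when standard CE lets the model collapse a class to near-zero F1.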