Assignment 1 / Text / Dataset EDA

Dataset EDA

Exploring 20 Newsgroups — class balance, word-count profiles, and the input representation fed to the shared BERT tokenizer.


Dataset at a Glance

  • Training samples: 11,314 (218 empty after cleaning)
  • Test samples: 7,532 (162 empty after cleaning)
  • Number of classes: 20 (newsgroup topics)
  • Max sequence length: 256 (BERT subword tokens)

Data Source & Split Profile

Property                   Value
Dataset                    20 Newsgroups (SetFit/20_newsgroups on Hugging Face)
Train split                11,314 documents (218 empty texts, treated as empty string)
Test split                 7,532 documents (162 empty texts)
Classes                    20 newsgroup topics (alt.atheism → talk.religion.misc)
Tokenizer                  bert-base-uncased (WordPiece, vocab size 30,522)
Max sequence length        256 BERT subword tokens (covers >95% of posts)
Special tokens             [CLS], [SEP], [PAD] (standard BERT format)
Word-count median (train)  83 words per post (mean 185.8, max 11,765)
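The word-count figures in the last row come from a simple profiling pass over the raw `text` column. A minimal sketch, run here on a toy corpus standing in for the train split (the real run iterates all 11,314 documents):

```python
from statistics import median

def word_count_profile(texts):
    """Per-document whitespace word counts plus the summary stats reported above."""
    counts = [len(t.split()) for t in texts]
    return {
        "median": median(counts),
        "mean": sum(counts) / len(counts),
        "max": max(counts),
        "empty": sum(1 for c in counts if c == 0),  # fully empty posts after cleaning
    }

# Toy corpus; the real input is the dataset's "text" column.
posts = ["a b c", "", "one two three four five", "hello world"]
stats = word_count_profile(posts)
print(stats)  # median 2.5, mean 2.5, max 5, empty 1
```

Counting whitespace-split words (rather than subword tokens) matches how the median/mean/max in the table are expressed.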

Exploratory Analysis

[Figure: six-panel EDA visualization for the 20 Newsgroups dataset]

Figure 1: EDA overview. Classes are near-balanced (≈480–600 samples each). Word-count median ≈83 with a heavy right tail (max 11,765 words); the 256-token BERT cap still covers the bulk of posts. talk.* groups tend shorter; sci.* and comp.* run longer.

Primary Findings

  • Near-balanced classes: each topic holds ≈480–600 training samples — no oversampling needed.
  • Heavy length tail: max post is 11,765 words; median is 83. A 256-token BERT cap safely covers the majority while staying within 6 GB VRAM.
  • Empty texts: 218 train / 162 test samples are entirely empty after loading — passed as empty strings to the tokenizer (→ [CLS][SEP][PAD…]).
  • Raw text requires no custom preprocessing beyond what BERT's WordPiece tokenizer handles — lowercasing is implicit in bert-base-uncased.
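The empty-text bullet can be made concrete with a small sketch of BERT-style framing, where an empty post degenerates to [CLS][SEP] followed by padding. This is a simplification that skips real WordPiece tokenization (each whitespace word stands in for one subword):

```python
MAX_LEN = 256

def encode_fixed_length(words, max_len=MAX_LEN):
    """Frame a token list as [CLS] ... [SEP], then pad to exactly max_len.
    Simplified stand-in for the bert-base-uncased tokenizer's behavior."""
    body = words[: max_len - 2]                  # truncation leaves room for specials
    tokens = ["[CLS]"] + body + ["[SEP]"]
    tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens

# An empty post: "".split() == [], so only the special tokens remain.
empty = encode_fixed_length("".split())
print(empty[:3], len(empty))  # ['[CLS]', '[SEP]', '[PAD]'] 256
```

Because the sequence still has the standard [CLS]/[SEP] frame, empty posts flow through the model without any special-casing.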

Tokenization Decisions

  • Shared tokenizer: bert-base-uncased AutoTokenizer used by both the BERT fine-tuned model and the BiLSTM's frozen embedding layer.
  • padding="max_length", truncation=True — every sequence is padded or truncated to exactly 256 tokens; attention mask marks valid positions.
  • No custom vocab: the 30,522-word BERT WordPiece vocabulary handles OOV via subword splitting — no min-frequency filtering needed.
  • Pre-tokenized in one batch and kept as PyTorch (return_tensors="pt") tensors on CPU until the DataLoader moves them to the GPU.
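The padding="max_length", truncation=True combination above can be sketched without the real tokenizer. A minimal illustration of the semantics, again using whitespace words as stand-ins for WordPiece subwords:

```python
MAX_LEN = 256

def encode_with_mask(words, max_len=MAX_LEN):
    """Every sequence comes out exactly max_len long; the attention mask
    holds 1s over real tokens and 0s over padding (the semantics of
    padding="max_length", truncation=True in a BERT-style pipeline)."""
    body = words[: max_len - 2]                   # truncate, keeping room for [CLS]/[SEP]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    return tokens + ["[PAD]"] * pad, mask + [0] * pad

short_toks, short_mask = encode_with_mask("just a short post".split())
long_toks, long_mask = encode_with_mask(["w"] * 1000)   # heavy-tail post

print(len(short_toks), sum(short_mask))   # -> 256 6
print(len(long_toks), sum(long_mask))     # -> 256 256
```

The fixed 256-length output is what lets all sequences be stacked into one tensor up front, as the last bullet describes.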

The 20 Newsgroup Categories

#   Category                  Domain
0   alt.atheism               Religion / Debate
1   comp.graphics             Computing
2   comp.os.ms-windows.misc   Computing
3   comp.sys.ibm.pc.hardware  Computing
4   comp.sys.mac.hardware     Computing
5   comp.windows.x            Computing
6   misc.forsale              Miscellaneous
7   rec.autos                 Recreation
8   rec.motorcycles           Recreation
9   rec.sport.baseball        Recreation
10  rec.sport.hockey          Recreation
11  sci.crypt                 Science
12  sci.electronics           Science
13  sci.med                   Science
14  sci.space                 Science
15  soc.religion.christian    Society / Religion
16  talk.politics.guns        Politics
17  talk.politics.mideast     Politics
18  talk.politics.misc        Politics
19  talk.religion.misc        Religion / Debate