Assignment 1 / Text / Dataset EDA

Dataset EDA

Exploring 20 Newsgroups — class balance, word-count profiles, and the input representation fed to the shared BERT tokenizer.


Dataset at a Glance

  • Training samples: 11,314 (218 empty after cleaning)
  • Test samples: 7,532 (162 empty after cleaning)
  • Number of classes: 20 (newsgroup topics)
  • Max sequence length: 256 (BERT subword tokens)

Data Source & Split Profile

Property                   Value
Dataset                    20 Newsgroups (SetFit/20_newsgroups on Hugging Face)
Train split                11,314 documents (218 empty texts, treated as empty string)
Test split                 7,532 documents (162 empty texts)
Classes                    20 newsgroup topics (alt.atheism → talk.religion.misc)
Tokenizer                  bert-base-uncased (WordPiece, vocab size 30,522)
Max sequence length        256 BERT subword tokens (covers >95% of posts)
Special tokens             [CLS], [SEP], [PAD] (standard BERT format)
Word-count median (train)  83 words per post (mean 185.8, max 11,765)
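The word-count figures in the last row come from a simple profiling pass over the raw `text` column. A minimal sketch, run here on a toy corpus standing in for the train split (the real run iterates all 11,314 documents):

```python
from statistics import median

def word_count_profile(texts):
    """Per-document whitespace word counts plus the summary stats reported above."""
    counts = [len(t.split()) for t in texts]
    return {
        "median": median(counts),
        "mean": sum(counts) / len(counts),
        "max": max(counts),
        "empty": sum(1 for c in counts if c == 0),  # fully empty posts after cleaning
    }

# Toy corpus; the real input is the dataset's "text" column.
posts = ["a b c", "", "one two three four five", "hello world"]
stats = word_count_profile(posts)
print(stats)  # median 2.5, mean 2.5, max 5, empty 1
```

Counting whitespace-split words (rather than subword tokens) matches how the median/mean/max in the table are expressed.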

Exploratory Analysis

[Figure: six-panel EDA visualization for the 20 Newsgroups dataset]

Figure 1: EDA overview. Classes are near-balanced (≈480–600 samples each). Word-count median ≈83 with a heavy right tail (max 11,765 words); the 256-token BERT cap still covers the bulk of posts. talk.* groups tend shorter; sci.* and comp.* run longer.

Primary Findings

  • Near-balanced classes: each topic holds ≈480–600 training samples — no oversampling needed.
  • Heavy length tail: max post is 11,765 words; median is 83. A 256-token BERT cap safely covers the majority while staying within 6 GB VRAM.
  • Empty texts: 218 train / 162 test samples are entirely empty after loading — passed as empty strings to the tokenizer (→ [CLS][SEP][PAD…]).
  • Raw text requires no custom preprocessing beyond what BERT's WordPiece tokenizer handles — lowercasing is implicit in bert-base-uncased.
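The empty-text bullet can be made concrete with a small sketch of BERT-style framing, where an empty post degenerates to [CLS][SEP] followed by padding. This is a simplification that skips real WordPiece tokenization (each whitespace word stands in for one subword):

```python
MAX_LEN = 256

def encode_fixed_length(words, max_len=MAX_LEN):
    """Frame a token list as [CLS] ... [SEP], then pad to exactly max_len.
    Simplified stand-in for the bert-base-uncased tokenizer's behavior."""
    body = words[: max_len - 2]                  # truncation leaves room for specials
    tokens = ["[CLS]"] + body + ["[SEP]"]
    tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens

# An empty post: "".split() == [], so only the special tokens remain.
empty = encode_fixed_length("".split())
print(empty[:3], len(empty))  # ['[CLS]', '[SEP]', '[PAD]'] 256
```

Because the sequence still has the standard [CLS]/[SEP] frame, empty posts flow through the model without any special-casing.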

Tokenization Decisions

  • Shared tokenizer: bert-base-uncased AutoTokenizer used by both the BERT fine-tuned model and the BiLSTM's frozen embedding layer.
  • padding="max_length", truncation=True — every sequence is padded or truncated to exactly 256 tokens; attention mask marks valid positions.
  • No custom vocab: the 30,522-word BERT WordPiece vocabulary handles OOV via subword splitting — no min-frequency filtering needed.
  • Pre-tokenized in one batch and kept as PyTorch (return_tensors="pt") tensors on CPU until the DataLoader moves them to the GPU.
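The padding="max_length", truncation=True combination above can be sketched without the real tokenizer. A minimal illustration of the semantics, again using whitespace words as stand-ins for WordPiece subwords:

```python
MAX_LEN = 256

def encode_with_mask(words, max_len=MAX_LEN):
    """Every sequence comes out exactly max_len long; the attention mask
    holds 1s over real tokens and 0s over padding (the semantics of
    padding="max_length", truncation=True in a BERT-style pipeline)."""
    body = words[: max_len - 2]                   # truncate, keeping room for [CLS]/[SEP]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    return tokens + ["[PAD]"] * pad, mask + [0] * pad

short_toks, short_mask = encode_with_mask("just a short post".split())
long_toks, long_mask = encode_with_mask(["w"] * 1000)   # heavy-tail post

print(len(short_toks), sum(short_mask))   # -> 256 6
print(len(long_toks), sum(long_mask))     # -> 256 256
```

The fixed 256-length output is what lets all sequences be stacked into one tensor up front, as the last bullet describes.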

The 20 Newsgroup Categories

#   Category                  Domain
0   alt.atheism               Religion / Debate
1   comp.graphics             Computing
2   comp.os.ms-windows.misc   Computing
3   comp.sys.ibm.pc.hardware  Computing
4   comp.sys.mac.hardware     Computing
5   comp.windows.x            Computing
6   misc.forsale              Miscellaneous
7   rec.autos                 Recreation
8   rec.motorcycles           Recreation
9   rec.sport.baseball        Recreation
10  rec.sport.hockey          Recreation
11  sci.crypt                 Science
12  sci.electronics           Science
13  sci.med                   Science
14  sci.space                 Science
15  soc.religion.christian    Society / Religion
16  talk.politics.guns        Politics
17  talk.politics.mideast     Politics
18  talk.politics.misc        Politics
19  talk.religion.misc        Religion / Debate