Week 1 — Tokenization, Subwords (BPE), and N-Gram Language Models

Overview

Language is messy long before it becomes a tensor, and the first thing any NLP system does is decide what a “token” is. This week covers tokenization — from naive whitespace splitting through Unicode normalization to subword algorithms like Byte-Pair Encoding (BPE) that every modern LLM uses — and then the original statistical language model: the n-gram, which models language as conditional counts plus smoothing. Pairing these is natural: tokenization defines the vocabulary, and the n-gram is the simplest model over that vocabulary. Building an n-gram LM also forces you to confront evaluation (perplexity) early, which frames everything that follows.

Course 5’s information theory is the backbone here: perplexity is just the exponentiated cross-entropy between the model and the data, so the “how good is this language model” question is answered with tools you already have. We apply them rather than re-derive them.

Readings

J&M Ch. 2: words, tokens, Unicode, edit distance, and subword tokenization. Extract: why subwords beat both words and characters, and how BPE is trained.
J&M Ch. 3: n-gram language models, the Markov assumption, MLE estimation, smoothing (add-k, Kneser-Ney), and perplexity. Extract: the smoothing problem and the perplexity definition.
CS224N / 6.S191 (Lecture 1): intro, tokenization, and language-modeling motivation.
(Entropy, cross-entropy, and KL divergence: assumed from Course 5 information theory.)

Key Concepts

Tokenization and BPE

A tokenizer maps raw text to a sequence of integer IDs. Word-level tokenization has an unbounded vocabulary and can’t handle unseen words; character-level handles anything but makes sequences long and meaning diffuse. Subword tokenization (BPE) is the compromise: start from characters and greedily merge the most frequent adjacent pair, repeatedly, until a target vocabulary size. Common words become single tokens; rare words decompose into known subwords; nothing is ever out-of-vocabulary. Unicode normalization (NFC/NFD) matters — "é" can be one code point or two, and inconsistent normalization corrupts the vocabulary.

The n-gram model and the Markov assumption

An n-gram LM approximates \(P(w_t\mid w_{1:t-1})\approx P(w_t\mid w_{t-n+1:t-1})\) — the next word depends only on the previous \(n-1\). Parameters are conditional counts: \(P(w_t\mid w_{t-1})=\frac{\text{count}(w_{t-1},w_t)}{\text{count}(w_{t-1})}\). It is crude (no long-range dependency, no generalization across similar words) but it is the conceptual ancestor of every LM and a fast, strong baseline.

Smoothing

MLE assigns zero probability to any unseen n-gram, which makes perplexity infinite and the model brittle. Smoothing redistributes mass to unseen events: add-k (Laplace) is simplest; Kneser-Ney (backoff/interpolation weighted by how many contexts a word appears in) is the strong classical method. Smoothing is the n-gram answer to the data-sparsity problem that neural models later solve with shared dense representations.

Perplexity

\[ \text{PPL} = 2^{H(p,q)} = 2^{-\frac1N\sum_i \log_2 q(w_i\mid \text{context})}, \]

the exponentiated cross-entropy: the model’s average “branching factor,” or how many equally-likely words it is effectively choosing among. Lower is better. This is the standard intrinsic LM metric and is exactly Course 5’s cross-entropy made operational.

Theory Exercises

Train BPE by hand on a tiny corpus for a few merges; show how a rare word decomposes into learned subwords.
Explain why NFC vs NFD normalization changes token counts; give a concrete grapheme example.
Derive the MLE estimate for a bigram model and show why unseen bigrams get zero probability.
Derive add-k smoothing; compute its effect on a held-out perplexity for a small example.
Derive perplexity as exponentiated cross-entropy and relate it to Course 5’s entropy; explain what PPL = 100 means operationally.

Implementation

Implement a BPE tokenizer (train merges on a corpus; encode/decode) and an n-gram LM (counts + add-k and interpolated smoothing) in tokenization_ngram/. Train on a real corpus (e.g. WikiText) and compute held-out perplexity. Sample text from the model to inspect its (limited) coherence.

Experiments

Vocabulary-size sweep for BPE (compression ratio, OOV rate). N-gram order sweep (\(n=1..5\)) with each smoothing method: held-out perplexity vs \(n\). Show the bias/variance tradeoff — higher \(n\) fits more but sparsifies, and smoothing controls the resulting variance.

Expected baselines: perplexity drops then plateaus/worsens as \(n\) grows (sparsity); Kneser-Ney beats add-k; BPE with a few-thousand vocab eliminates OOV while keeping sequences short. Sampled text is locally fluent but globally incoherent — the n-gram’s defining limitation.

Connections

Tokenization feeds every later week (the embeddings of Week 2, the transformer of Week 6, the LLM of Week 7 all operate on these token IDs). Perplexity is the LM metric reused for neural LMs. The data-sparsity problem motivates Week 2’s dense embeddings, which generalize across similar words in a way counts cannot. Course 5’s information theory is the evaluation foundation.