Week 3 — RNNs, LSTMs, and Sequence Modeling

Overview

Embeddings give you good word vectors, but language is a sequence — order and context matter, and a model must carry information across time. This week covers recurrent neural networks and their gated variant, the LSTM: the dominant sequence architecture before transformers, and still the clearest place to understand the core challenges of sequence modeling — vanishing/exploding gradients, long-range dependencies, and the tradeoff between a fixed-size hidden state and unbounded context. You will build an RNN/LSTM language model and confront these issues directly.

Course 5’s optimization material and the backprop foundations (also exercised in Course 1) do most of the work here; the new ideas are recurrence, backpropagation through time, and gating. Understanding why RNNs struggle with long-range dependencies is exactly the motivation for attention (Week 5) and transformers (Week 6).

Readings

J&M Ch. 13: RNNs and LSTMs — recurrence, backpropagation through time, the vanishing-gradient problem, and the LSTM/GRU gating mechanisms. Extract: why gates fix gradient flow.
6.S191 Lecture 2: deep sequence modeling. Extract: the recurrence intuition and applications.
CS224N: RNN language models, vanishing gradients, LSTMs. Extract: BPTT mechanics and gradient clipping.

Key Concepts

Recurrence and BPTT

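An RNN maintains a hidden state \(h_t = f(W_h h_{t-1} + W_x x_t + b)\), where \(f\) is an elementwise nonlinearity such as tanh, updated at each step with the same weights at every time step. Training uses backpropagation through time (BPTT): unroll the recurrence and apply backprop (Course 5/Course 1) over the unrolled graph. Because each state depends on the previous one, the gradient reaching an early step is a product of one Jacobian per intervening step, and because the weights are shared, every factor contains the same \(W_h\). That repeated product is exactly where the trouble starts.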
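To make the product explicit, here is a sketch of one BPTT step written out. The pre-activation symbol \(a_t\) and the shorthand \(J_t\) are notation introduced here for convenience (they are not taken from the readings), and \(f\) is assumed to act elementwise.

```latex
\[
a_t = W_h h_{t-1} + W_x x_t + b, \qquad
h_t = f(a_t), \qquad
J_t \equiv \frac{\partial h_t}{\partial h_{t-1}}
    = \operatorname{diag}\!\big(f'(a_t)\big)\, W_h
\]
\[
\frac{\partial h_t}{\partial h_k} = J_t \, J_{t-1} \cdots J_{k+1}
\qquad (k < t)
\]
```

A loss at step \(t\) therefore reaches step \(k\) through \(t - k\) factors, each of which contains the same \(W_h\).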

Vanishing and exploding gradients

Because BPTT multiplies one Jacobian per time step, gradient magnitude changes roughly exponentially with the distance travelled: if the Jacobian norms stay below 1 the product shrinks toward zero (vanishing), and if they stay above 1 it can blow up (exploding). Vanishing gradients prevent learning long-range dependencies, which is the central limitation of vanilla RNNs. Exploding gradients are patched with gradient clipping; vanishing gradients need an architectural fix.
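A toy illustration may help here. The sketch below assumes a deliberately simplified scalar, linear recurrence \(h_t = w\,h_{t-1}\), so each per-step Jacobian is just \(w\), and adds a norm-based clipping helper. Both function names are hypothetical and the code uses plain Python rather than PyTorch so it can be run anywhere.

```python
import math
from typing import List


def scalar_jacobian_product(w: float, steps: int) -> float:
    """Gradient factor d h_T / d h_0 for the toy recurrence h_t = w * h_{t-1}.

    Each step contributes a Jacobian equal to w, so the product is w ** steps.
    """
    product = 1.0
    for _ in range(steps):
        product *= w
    return product


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale grads so their L2 norm is at most max_norm.

    The direction is preserved; gradients already within the limit are
    returned unchanged, so this can only shrink a gradient, never grow it.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

With 100 steps, `scalar_jacobian_product(0.9, 100)` is on the order of 1e-5 while `scalar_jacobian_product(1.1, 100)` is on the order of 1e4. Since the clipping helper only ever scales down, it caps the second case and leaves the first untouched, which is why vanishing gradients call for a different remedy.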

LSTMs and gating

The LSTM adds a cell state with additive updates and three gates (forget, input, output) that control what is kept, written, and read. Along the cell-state path the gradient is scaled at each step by the forget gate rather than pushed through the recurrent weight matrix and a squashing nonlinearity, so when the network learns to hold the forget gate near 1, information and gradients survive across many steps. The GRU is a simpler two-gate variant with similar benefits.
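The gating described above can be sketched for a single hidden unit. This is a minimal scalar version of the standard LSTM step, not the course's reference implementation; the function name and the `params` layout are assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar (single-unit) LSTM.

    params maps each of "f", "i", "o", "g" to a (w_x, w_h, b) triple
    (a hypothetical layout chosen for this sketch).
    Returns (h, c, f); f is returned because it is the factor by which
    c_prev is scaled on the direct cell-state path.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: how much of c_prev is kept
    i = sigmoid(pre("i"))    # input gate: how much of the candidate is written
    o = sigmoid(pre("o"))    # output gate: how much of the cell is read out
    g = math.tanh(pre("g"))  # candidate value
    c = f * c_prev + i * g   # additive cell-state update
    h = o * math.tanh(c)     # hidden state exposed to the next step
    return h, c, f
```

Setting the forget-gate bias to a large positive value and the input-gate bias to a large negative value makes `c` stay close to its starting value over many steps, which is the behaviour the gates are meant to be able to learn.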

The fundamental bottleneck

Even an LSTM compresses all past context into a fixed-size hidden state, and access to distant tokens is indirect: information about a token must survive every intermediate update to be usable later. This is the bottleneck attention removes by letting the model look directly at any past token, which is the motivation for Weeks 5–6.

Theory Exercises

Write the RNN recurrence and derive one step of BPTT; show how the gradient becomes a product of Jacobians.
Explain vanishing/exploding gradients from that product; show how the spectral radius of the recurrent Jacobian governs which occurs.
Write the LSTM cell equations and explain how the additive cell-state path mitigates vanishing gradients.
Explain gradient clipping and why it addresses exploding but not vanishing gradients.
Argue why a fixed-size hidden state limits long-range modeling, motivating attention.

Implementation

Implement an RNN and an LSTM language model in rnn_lm/ (PyTorch), trained on the Week 1 corpus with the Week 2 embeddings as input. Include gradient clipping and BPTT truncation. Generate text and compute perplexity. Optionally implement the LSTM cell by hand and check against the library.
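For the BPTT truncation mentioned above, one possible way to cut a token stream into fixed-length training windows is sketched below. The function name `bptt_windows` and the parameter `bptt_len` are hypothetical, and the sketch works on plain lists of token ids rather than tensors.

```python
from typing import Iterator, List, Tuple


def bptt_windows(
    token_ids: List[int], bptt_len: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (inputs, targets) windows of at most bptt_len tokens.

    targets are the inputs shifted by one position (next-token prediction).
    """
    if bptt_len < 1:
        raise ValueError("bptt_len must be positive")
    last_start = len(token_ids) - 1  # the final token has no target after it
    for start in range(0, last_start, bptt_len):
        end = min(start + bptt_len, last_start)
        yield token_ids[start:end], token_ids[start + 1:end + 1]
```

In a PyTorch training loop, a common pattern is to carry the hidden state from one window to the next and call `.detach()` on it at each boundary, so the forward context is kept while gradients stop at the window edge.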

Experiments

Perplexity: RNN vs LSTM vs the Week 1 n-gram baseline.
Long-range probe: a task requiring information from many steps back (e.g. subject-verb agreement across distance); show that the LSTM beats the RNN and that both struggle at long range.
Gradient-norm logging: visualize vanishing/exploding gradients and the effect of clipping.
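For the perplexity comparison above, the metric can be computed from the log-probability each model assigns to every gold token. The helper below is a minimal sketch that assumes natural-log probabilities; its name and signature are illustrative.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity from natural-log probabilities assigned to each gold token.

    Computed as exp of the average negative log-likelihood per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

For the three-way comparison to be meaningful, all models should be scored on the same test tokens with the same vocabulary and the same treatment of unknown words.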

Expected baselines: on perplexity the LSTM should beat the RNN, which should beat the n-gram, with both neural models gaining over the n-gram by generalizing via embeddings. Long-range accuracy should degrade with distance, more sharply for the vanilla RNN; that is the quantitative case for gating and, ultimately, attention.
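To check the distance effect, probe results can be grouped by how far back the needed information lies. The sketch below assumes each probe example has been reduced to a `(distance, correct)` pair; the function name and the default `bucket_size` are hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_distance(
    results: Iterable[Tuple[int, bool]], bucket_size: int = 5
) -> Dict[int, float]:
    """Map each distance bucket's lower bound to the accuracy within it.

    results holds (distance, correct) pairs, where distance is the number
    of tokens between the cue (e.g. the subject) and the prediction site
    (e.g. the verb).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for distance, correct in results:
        bucket = (distance // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Running this once per model and plotting accuracy against the bucket's lower bound gives one curve for the RNN and one for the LSTM that can be compared directly.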

Connections

The sequence-modeling bottleneck identified here is precisely what attention (Week 5) and transformers (Week 6) solve. LSTMs reappear as BiLSTM taggers in Week 4 (sequence labeling) and as the historical baseline for seq2seq (Week 5). BPTT is backprop (Course 5/Course 1) applied to recurrence; the perplexity metric is Week 1’s.