Week 5 — Attention and Seq2Seq Machine Translation

Overview

This week introduces the single most important idea in modern NLP: attention. The setting is sequence-to-sequence machine translation — map a source sentence to a target sentence — where the limitation of RNNs becomes acute: an encoder must compress the entire source into one fixed vector, and a long sentence overflows it. Attention removes that bottleneck by letting the decoder look back at all encoder states and form a weighted combination relevant to the current output word. You will build an encoder-decoder translator, add attention, and watch translation quality on long sentences jump.

This is the conceptual hinge of the course. The attention mechanism you build here, generalized and stacked, is the transformer of Week 6. Course 5’s linear algebra (dot products, softmax-weighted combinations) is the math; the new idea is using it to create dynamic, content-based connections between positions.

Readings

J&M Ch. 12: machine translation, the encoder-decoder architecture, and attention. Extract: the seq2seq bottleneck and how attention scores/weights are computed.
CS224N: seq2seq and attention lectures. Extract: additive (Bahdanau) vs multiplicative (Luong) attention and the alignment interpretation.
6.S191: sequence-modeling material. Extract: the bridge from recurrence to attention.

Key Concepts

The seq2seq bottleneck

An encoder RNN/LSTM (Week 3) reads the source into a final hidden state; a decoder RNN generates the target conditioned on it. The problem: one fixed-size vector must hold the whole source, so quality degrades sharply with length — the same fixed-state bottleneck Week 3 identified, now visibly limiting translation.
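
The bottleneck can be seen directly in the shapes involved. The sketch below is a toy tanh RNN with random, untrained weights (the function name `encode`, the hidden size, and the inputs are illustrative assumptions, not the course's encoder). It only shows that the final state has the same size for a 3-token and a 30-token source, while the list of per-position states grows with length.

```python
import math
import random


def encode(source, hidden_size=4, seed=0):
    """Run a toy tanh RNN over `source`, a list of input vectors.

    Returns (all_states, final_state). Weights are random, not trained.
    """
    rng = random.Random(seed)
    input_size = len(source[0])
    w_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
           for _ in range(hidden_size)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
           for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    states = []
    for x in source:
        # h_t = tanh(W_x x_t + W_h h_{t-1})
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_x[r], x))
                + sum(w * hi for w, hi in zip(w_h[r], h))
            )
            for r in range(hidden_size)
        ]
        states.append(h)
    return states, h


short_source = [[1.0, 0.0]] * 3
long_source = [[1.0, 0.0]] * 30
short_states, short_final = encode(short_source)
long_states, long_final = encode(long_source)

# The final state is 4 numbers either way; only the state list scales with length.
print(len(short_final), len(long_final), len(short_states), len(long_states))
```

A decoder conditioned only on the final state sees the same four numbers for both sources; attention instead gives it access to the full list of states.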

Attention as content-based lookup

Attention lets the decoder, at each step, compute a relevance score between its current state (a query) and every encoder state (the keys), softmax them into weights, and take the weighted sum of encoder states (the values) as a context vector. The decoder thus reads the most relevant source words for the word it is currently producing:

\[ \alpha_{ij} = \text{softmax}_j(\text{score}(q_i, k_j)), \qquad c_i = \sum_j \alpha_{ij} v_j. \]

The score can be a dot product, \(q_i^\top k_j\) (multiplicative), or a small MLP, \(w^\top \tanh(W_q q_i + W_k k_j)\) with learned \(w, W_q, W_k\) (additive). The weights \(\alpha\) are interpretable as a soft alignment between source and target words.
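
As a minimal sketch of the formula above for a single decoder step, assuming dot-product scoring and plain Python lists in place of tensors (the helper names `softmax`, `dot`, and `attend` and the toy vectors are illustrative assumptions):

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, keys, values):
    """Return (weights, context) for one query.

    weights[j] = softmax_j(query . keys[j])
    context    = sum_j weights[j] * values[j]
    """
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, context


# Toy example: three encoder states act as both keys and values.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states, encoder_states)
print(weights)  # roughly [0.468, 0.063, 0.468]
print(context)
```

The scores here are 2, 0, and 2, so the first and third encoder states share most of the weight and the context vector is dominated by them.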

Query, key, value — the abstraction that becomes the transformer

Framing attention as (query, key, value) is deliberate: it generalizes beyond translation. Self-attention draws queries, keys, and values from the same sequence, letting every token attend to every other token; this is the core operation of the transformer block (Week 6). Recognizing seq2seq attention and self-attention as the same operation is the key insight of this week.
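
The point that both are one operation can be sketched with a single function called two ways. This is an illustrative simplification: it uses raw dot-product scores on plain lists, whereas a transformer also applies learned projections to form queries, keys, and values and scales the scores by \(1/\sqrt{d}\). The function name `attention` and the toy vectors are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Dot-product attention; returns one context vector per query."""
    dim = len(values[0])
    contexts = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        contexts.append([sum(w * v[d] for w, v in zip(weights, values))
                         for d in range(dim)])
    return contexts


# Seq2seq (cross) attention: queries come from the decoder,
# keys and values from the encoder.
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
decoder_states = [[0.5, 0.5]]
cross_out = attention(decoder_states, encoder_states, encoder_states)

# Self-attention: one sequence plays all three roles.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
self_out = attention(tokens, tokens, tokens)

print(cross_out)      # one context vector, for the single decoder state
print(len(self_out))  # one context vector per token
```

Only the arguments differ between the two calls; the computation inside `attention` is unchanged.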

Decoding

Generation is sequential: greedy decoding takes the argmax at each step; beam search keeps the top-\(k\) partial hypotheses at each step, which often finds higher-probability sequences than greedy. Evaluation uses BLEU (n-gram overlap with reference translations), which has known limitations.
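
The difference between the two decoders can be sketched on a hand-built toy model. Everything here is hypothetical: `TOY_MODEL` is a made-up table of next-token probabilities chosen so that the locally best first token leads to a worse full sequence, and the sketch omits end-of-sequence handling and length normalization, which a real decoder needs.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_log_probs(prefix):
    return {tok: math.log(p) for tok, p in TOY_MODEL[tuple(prefix)].items()}


def greedy_decode(steps):
    """Pick the single most probable token at every step."""
    prefix, score = [], 0.0
    for _ in range(steps):
        log_probs = next_log_probs(prefix)
        tok = max(log_probs, key=log_probs.get)
        prefix.append(tok)
        score += log_probs[tok]
    return prefix, score


def beam_search(steps, beam_width):
    """Keep the `beam_width` best partial hypotheses at every step."""
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + [tok], score + lp)
            for prefix, score in beams
            for tok, lp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_seq, greedy_score = greedy_decode(steps=2)
beam_seq, beam_score = beam_search(steps=2, beam_width=2)
print(greedy_seq, math.exp(greedy_score))  # ['a', 'x'], probability about 0.33
print(beam_seq, math.exp(beam_score))      # ['b', 'x'], probability about 0.36
```

Greedy commits to "a" because it is the best first token, while a beam of width 2 keeps "b" alive long enough to find the more probable sequence. With `beam_width=1` the search reduces to greedy decoding.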

Theory Exercises

Explain the fixed-vector bottleneck quantitatively and how attention removes it.
Derive the attention weights and context vector; contrast additive and multiplicative scoring and their cost.
Cast attention in query/key/value terms and show how self-attention is the special case in which queries, keys, and values all come from the same sequence.
Compare greedy vs beam search; explain when beam helps and the length-bias problem.
Define BLEU and discuss why it correlates imperfectly with translation quality.

Implementation

Build an encoder-decoder translator in seq2seq_attention/ (LSTM encoder/decoder from Week 3) on a small MT dataset. First without attention, then add (multiplicative) attention with the query/key/value formulation. Implement greedy and beam decoding. Visualize the attention alignment matrix.
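
For the alignment visualization, one possible starting point is a text rendering that needs no plotting library. The sketch below assumes the per-step attention weights have been collected into a matrix with one row per target token and one column per source token. The function name `render_alignment`, the tokens, and the weight values are hand-written for illustration and are not model output.

```python
def render_alignment(weights, source_tokens, target_tokens):
    """Return text lines: one row per target token, one column per source token."""
    shades = " .:*#"  # low weight -> high weight
    lines = []
    for tgt, row in zip(target_tokens, weights):
        cells = "".join(
            shades[min(int(w * len(shades)), len(shades) - 1)] for w in row
        )
        # Source token with the largest weight for this target token.
        best = source_tokens[max(range(len(row)), key=row.__getitem__)]
        lines.append(f"{tgt:>8} |{cells}| -> {best}")
    return lines


# Hand-written illustrative weights (each row sums to 1).
source_tokens = ["le", "chat", "noir"]
target_tokens = ["the", "black", "cat"]
weights = [
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.7, 0.2],
]
lines = render_alignment(weights, source_tokens, target_tokens)
for line in lines:
    print(line)
```

The same matrix can be passed to a plotting library for a proper heatmap; the text version is enough to check that rows are peaked and that the peaks fall on plausible source words.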

Experiments

BLEU with vs without attention, bucketed by source-sentence length — show attention’s gain concentrates on long sentences. Beam-width sweep on BLEU. Inspect alignment heatmaps for sensible source-target correspondence.
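
A minimal sketch of the length bucketing, assuming each example is a dict with a tokenized `source`, `hypothesis`, and `reference` (these field names, the bucket edges, and the helper names are assumptions). The metric is passed in as a function; the exact-match rate used below is only a stand-in to keep the sketch self-contained, and in the experiment it would be replaced by corpus-level BLEU computed per bucket.

```python
from collections import defaultdict


def bucket_by_length(examples, edges=(10, 20, 30)):
    """Group examples by source length.

    `edges` are inclusive upper bounds; longer sources go in a final bucket.
    """
    buckets = defaultdict(list)
    for ex in examples:
        n = len(ex["source"])
        label = next((f"<={e}" for e in edges if n <= e), f">{edges[-1]}")
        buckets[label].append(ex)
    return dict(buckets)


def score_buckets(examples, metric, edges=(10, 20, 30)):
    """Apply a corpus-level `metric` to each length bucket separately."""
    return {label: metric(group)
            for label, group in bucket_by_length(examples, edges).items()}


def exact_match(group):
    """Stand-in metric: fraction of hypotheses identical to the reference."""
    return sum(ex["hypothesis"] == ex["reference"] for ex in group) / len(group)


# Synthetic examples, only to exercise the bucketing logic.
examples = [
    {"source": ["w"] * 5, "hypothesis": ["a"], "reference": ["a"]},
    {"source": ["w"] * 15, "hypothesis": ["a"], "reference": ["b"]},
    {"source": ["w"] * 40, "hypothesis": ["a"], "reference": ["b"]},
]
scores = score_buckets(examples, exact_match)
print(scores)
```

Running this once for the no-attention model and once for the attention model gives the two per-bucket score tables to compare.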

Expected baselines: attention improves BLEU, especially on long sentences where the no-attention model collapses; beam search adds a smaller gain over greedy; alignment heatmaps show interpretable, roughly monotonic correspondences. These results motivate dropping recurrence entirely next week.

Connections

Attention is the core of the transformer (Week 6) — the QKV abstraction built here is reused verbatim, just applied as self-attention and stacked. The encoder-decoder structure underlies later generation and RAG (Week 9). Course 5’s linear algebra and softmax are the math; this week supplies the idea that reorganizes the entire field.