Week 7 — Large Language Models, Scaling, and Masked Models (BERT)
Overview
This week scales the transformer up and out. First, what changes when a language model becomes large: pretraining at scale, scaling laws, emergent in-context learning, and prompting, the phenomena that make LLMs qualitatively different from the small transformer you built in Week 6. Second, the other branch of the family: masked language models like BERT, which are encoder-only transformers trained to fill in masked tokens, producing the contextual embeddings that power classification, retrieval, and reranking. This week consolidates the LLM and MLM weeks because they are two training objectives over the same architecture, best understood in contrast.
Course 5’s information theory (cross-entropy, the scaling-law view of loss) and the transformer of Week 6 are the foundations; the new content is the objectives, the scale phenomena, and when to reach for a decoder vs an encoder.
Readings
- J&M Ch. 7: large language models, pretraining, decoding, and prompting. Extract: the pretraining objective at scale and in-context learning.
- J&M Ch. 10: masked language models (BERT). Extract: the masked-LM objective and contextual embeddings.
- CS224N: LLMs, scaling, BERT/contextual embeddings.
- 6.S191: New Frontiers.
- Cross-entropy/scaling: C&T review from Course 5.
Key Concepts
Pretraining and scaling laws
A decoder-only transformer (Week 6) trained on a huge corpus with next-token prediction is a base LLM. Scaling laws describe how loss falls predictably, approximately as a power law in model size, data, and compute, and they motivate compute-optimal (Chinchilla) allocation of a fixed budget between parameters and training tokens. The loss being optimized is Week 1’s cross-entropy; scaling laws are an empirical regularity over it (Course 5’s information view).
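The allocation idea can be made concrete with a small sketch. It rests on two commonly quoted rules of thumb rather than on anything fixed by this course: training compute C is roughly 6 * N * D FLOPs for N parameters and D tokens, and a compute-optimal run uses on the order of 20 tokens per parameter. The function names and the `ILLUSTRATIVE` constants are hypothetical placeholders, not fitted values from any paper.

```python
import math


def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget between parameters N and training tokens D.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N. Both are rules
    of thumb, so treat the result as an order-of-magnitude estimate.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


def parametric_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Parametric form L(N, D) = E + A / N**alpha + B / D**beta.

    E is the irreducible loss; the other two terms shrink as a power law
    in model size and in data size respectively.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


# Placeholder constants for illustration only (NOT fitted values).
ILLUSTRATIVE = dict(E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28)

if __name__ == "__main__":
    n, d = chinchilla_allocation(1.2e20)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
    print(f"loss under placeholder constants: {parametric_loss(n, d, **ILLUSTRATIVE):.3f}")
```

Doubling both N and D lowers the two reducible terms but never takes the loss below E, which is the sense in which the curve is predictable but not unbounded.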
In-context learning and prompting
Large enough models exhibit in-context learning: they perform a task from examples in the prompt, with no weight updates. Zero-shot, few-shot, and chain-of-thought prompting are the practical interface. This emergent behavior is why “prompting” is a skill and why the base model is useful before any fine-tuning.
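The three prompting styles differ only in what text is placed before the question. A minimal sketch of a prompt builder follows; the `Q:`/`A:` layout, the `build_prompt` name, and the dictionary keys are assumptions made for illustration, and the "Let's think step by step" cue is one common chain-of-thought trigger rather than a required phrase.

```python
def build_prompt(question, examples=(), chain_of_thought=False):
    """Assemble a zero-shot, few-shot, or chain-of-thought prompt.

    examples: iterable of dicts with 'question', 'answer', and optionally
    'reasoning'. An empty iterable gives a zero-shot prompt.
    """
    parts = []
    for ex in examples:
        parts.append(f"Q: {ex['question']}")
        if chain_of_thought and ex.get("reasoning"):
            # Show the worked reasoning so the model imitates it.
            parts.append(f"A: {ex['reasoning']} The answer is {ex['answer']}.")
        else:
            parts.append(f"A: {ex['answer']}")
        parts.append("")  # blank line between demonstrations
    parts.append(f"Q: {question}")
    parts.append("A: Let's think step by step." if chain_of_thought else "A:")
    return "\n".join(parts)


if __name__ == "__main__":
    demos = [{"question": "What is 3 + 4?",
              "reasoning": "3 plus 4 equals 7.",
              "answer": "7"}]
    print(build_prompt("What is 5 + 6?", demos, chain_of_thought=True))
```

No weights change between the three variants; only the conditioning text does, which is what makes the behavior in-context learning rather than fine-tuning.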
Decoding strategies
Generation repeatedly picks the next token from the model's next-token distribution. Greedy decoding takes the most probable token; temperature, top-k, and nucleus (top-p) sampling reshape or truncate the distribution to trade coherence against diversity. Understanding these is essential for controlling LLM outputs and for the RAG generator (Week 9).
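The decoding strategies above can be sketched over a plain list of logits. This is a minimal standard-library illustration, not the interface of any particular LLM library; the function names are hypothetical, and treating `temperature == 0` as greedy decoding is a convention assumed here.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token index; temperature == 0 is treated as greedy decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Top-k fixes the number of candidates regardless of how peaked the distribution is, while top-p adapts the candidate set to the distribution's shape; that difference is a useful starting point for the failure-mode exercise.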
Masked language models (BERT)
A masked LM is an encoder-only transformer trained to predict randomly masked tokens using bidirectional context: it sees both sides of each position, unlike the causal decoder. It is not designed for left-to-right generation; instead it produces strong contextual embeddings for each token, well suited to classification, NER (revisiting Week 4 with a stronger model), and, crucially, retrieval and reranking. The design choice is decoder (generate) vs encoder (understand/embed): use BERT-style bi-encoders for dense retrieval and cross-encoders for reranking (Week 9).
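The contrast between the two objectives shows up in two places: which positions each token may attend to, and which positions contribute to the loss. The sketch below illustrates both on word-level tokens. It follows the masking recipe described in the BERT paper (select about 15% of positions; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged), but it is a simplification: real implementations work on subword tokens and skip special tokens. The function names are hypothetical.

```python
import random

MASK = "[MASK]"


def causal_mask(n):
    """Decoder: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked-LM training.

    labels[i] is the original token at selected positions and None
    elsewhere (None marks positions ignored by the loss).
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: mask symbol
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: left unchanged
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


if __name__ == "__main__":
    sentence = "the bank raised interest rates again".split()
    print(mask_tokens(sentence, vocab=sentence, rng=random.Random(0)))
```

A causal LM gets a training signal at every position but sees only the left context; the masked LM sees both sides but learns from only the selected positions.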
Theory Exercises
- State the scaling-law form (loss vs parameters/data/compute) and explain compute-optimal allocation; relate loss to cross-entropy (Course 5).
- Explain in-context learning and why it is “emergent”; contrast zero-/few-shot/chain-of-thought.
- Derive the effect of temperature and top-p on the sampling distribution; describe a failure mode of each.
- Contrast the causal-LM and masked-LM objectives; explain why BERT’s bidirectionality suits embeddings but not generation.
- Explain bi-encoder vs cross-encoder for retrieval/reranking and the latency/quality tradeoff (sets up Week 9).
Implementation
In llm_mlm/: (1) fine-tune or prompt a small pretrained decoder LLM and implement the decoding strategies (greedy, temperature, top-k, top-p); run few-shot and chain-of-thought prompts. (2) Use a pretrained BERT to produce contextual embeddings; build a sentence bi-encoder and a cross-encoder reranker, and evaluate both on a small semantic-similarity/retrieval task.
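The shape of the retrieve-then-rerank pipeline in part (2) can be sketched before any model is loaded. In the sketch below, `embed` and `cross_score` are toy placeholders (bag-of-words counts and bigram overlap) standing in for a BERT sentence encoder and a BERT cross-encoder; they are assumptions made so that the example runs with the standard library alone, and only the control flow is meant to carry over.

```python
import math
from collections import Counter


def embed(text):
    """Placeholder for a bi-encoder: encodes one text independently of any query."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cross_score(query, doc):
    """Placeholder for a cross-encoder: scores the (query, doc) pair jointly.

    Shared ordered bigrams dominate the score, a signal the independent
    bag-of-words embedding cannot represent.
    """
    q, d = query.lower().split(), doc.lower().split()
    shared_bigrams = set(zip(q, q[1:])) & set(zip(d, d[1:]))
    return len(shared_bigrams) + 0.1 * len(set(q) & set(d))


def retrieve_then_rerank(query, docs, k=3, score_pair=cross_score):
    """Stage 1: cheap bi-encoder retrieval. Stage 2: joint reranking of the top k."""
    doc_vecs = [embed(d) for d in docs]  # can be precomputed and indexed offline
    q_vec = embed(query)
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
    candidates = ranked[:k]
    # One joint scoring call per candidate: costlier, so only k documents get it.
    reranked = sorted(candidates,
                      key=lambda i: score_pair(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked]


if __name__ == "__main__":
    corpus = ["dog bites man", "man bites dog", "cat sleeps"]
    print(retrieve_then_rerank("man bites dog", corpus, k=2))
```

The latency/quality tradeoff is visible in the structure: document embeddings are computed once and reused across queries, whereas the pairwise scorer must run once per query-document pair, so it is applied only to the shortlist.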
Experiments
Prompting: zero-shot vs few-shot vs chain-of-thought accuracy on a reasoning task. Decoding: diversity and quality as functions of temperature and top-p. BERT: bi-encoder retrieval quality and cross-encoder reranking gain; contextual vs static (Week 2) embeddings on a disambiguation task.
Expected baselines: few-shot and chain-of-thought beat zero-shot on reasoning; higher temperature increases diversity and incoherence; contextual embeddings beat static ones on word-sense tasks; the cross-encoder reranker improves retrieval precision at higher latency — the exact tradeoff Week 9 exploits.
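Two simple metrics are enough to start quantifying these comparisons. The sketch below assumes distinct-n (the fraction of unique n-grams across samples) as the diversity measure and precision@k as the retrieval measure; the course may prescribe different metrics, so both choices and the function names are illustrative assumptions.

```python
def distinct_n(samples, n=2):
    """Fraction of n-grams across all samples that are unique (higher = more diverse)."""
    ngrams = []
    for text in samples:
        toks = text.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


if __name__ == "__main__":
    generations = ["the cat sat on the mat", "the cat sat on the rug"]
    print(distinct_n(generations, n=2))
    print(precision_at_k(["d3", "d1", "d7"], {"d1", "d3"}, k=2))
```

Computing distinct-n over samples drawn at several temperatures, and precision@k before and after reranking, gives the two curves the experiments ask for.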
Connections
The decoder LLM is the generator for RAG (Week 9) and the target of post-training (Week 8). BERT bi-/cross-encoders are the retrieval and reranking components of the Week 9/Week 10 RAG system. Scaling and decoding connect to Course 1’s ML-systems concerns (inference cost, quantization). The architecture is Week 6’s; the objectives and scale phenomena are this week’s.
Further Reading
- J&M Ch. 7, 10.
- Kaplan et al. and Hoffmann et al. (Chinchilla) — scaling laws; Devlin et al. (BERT); Brown et al. (GPT-3, in-context learning).
- CS224N pretraining/LLM lectures.