Week 9 — Information Retrieval, RAG, Agents, and Extraction

Overview

LLMs hallucinate, cannot reliably cite sources, and have a frozen knowledge cutoff. Retrieval-augmented generation (RAG) mitigates all three by giving the model an external, searchable memory: retrieve relevant documents, then generate an answer grounded in (and citing) them. This week builds the full retrieval stack (sparse BM25, dense bi-encoder embeddings, hybrid fusion, and cross-encoder reranking) and the RAG pipeline on top, then extends to the structured layer: information extraction, coreference, and tool-calling agents. It merges what would otherwise be separate retrieval/RAG and agents/extraction weeks, because a production assistant needs both grounded answers and the ability to extract structure and call tools.

Course 5’s algorithms (BM25 scoring, nearest-neighbor search, entity linking) and Week 7’s bi-/cross-encoders are the building blocks. This week turns the course’s components into a system and prepares directly for the Week 10 capstone.

Readings

J&M Ch. 11: information retrieval and retrieval-augmented generation. Extract: BM25, dense retrieval, and the RAG pipeline.
J&M Ch. 20, 23, 25: information extraction, coreference, and conversation/agents. Extract: relation/entity extraction, coreference, and tool-calling loops.
CS224N: LLM application and evaluation. (BM25/ANN/entity-linking as algorithms: DPV review from Course 5; bi-/cross-encoders from Week 7.)

Key Concepts

Sparse, dense, and hybrid retrieval

BM25 scores documents by weighted term overlap: a TF-IDF-style score with term-frequency saturation and document-length normalization. It is fast, needs no training, and is a strong baseline, but it is blind to synonyms because it only matches exact terms. Dense retrieval embeds the query and documents with a bi-encoder (Week 7) and finds nearest neighbors by cosine similarity, typically with approximate nearest-neighbor search (e.g. FAISS); it captures semantic similarity but can miss exact terms such as rare names or identifiers. Hybrid retrieval fuses both rankings (e.g. with reciprocal-rank fusion) to get lexical precision and semantic recall. This is Course 5’s similarity/search machinery applied to text.
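As a rough illustration of the two scoring rules above, the sketch below implements Okapi BM25 over pre-tokenized documents and reciprocal-rank fusion over ranked lists of document ids. The function names, the toy corpus, the hard-coded dense ranking, and the parameter defaults (k1 = 1.5, b = 0.75, fusion constant k = 60) are assumptions chosen for illustration, and the IDF is one common smoothed variant among several.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue  # no lexical match, so no credit (synonyms are invisible)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls term-frequency saturation; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rrf(rankings, k=60):
    """Reciprocal-rank fusion: each list contributes 1 / (k + rank) per doc."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return [d for d, _ in sorted(fused.items(), key=lambda p: (-p[1], p[0]))]


docs = [
    ["cat", "sat", "mat"],
    ["dog", "barked"],
    ["cat", "cat", "cat", "chased", "dog"],
]
scores = bm25_scores(["cat"], docs)
sparse_ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
dense_ranking = [0, 1, 2]  # hypothetical output of a dense retriever
fused = rrf([sparse_ranking, dense_ranking])
```

In this toy corpus the third document mentions "cat" three times but scores less than three times the first document, which is the saturation effect. Fusion uses only ranks, so the sparse and dense scores never need to be put on a common scale.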

Reranking

First-stage retrieval favors recall (get the right doc in the top-k cheaply); a cross-encoder reranker (Week 7) then scores each query-document pair jointly for precision, reordering the shortlist. The two-stage retrieve-then-rerank pattern is the standard production design and the latency/quality tradeoff from Week 7 made concrete.
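The two-stage pattern can be sketched independently of any particular model. In the sketch below both scorers are plain callables: overlap_score and phrase_score are hypothetical stand-ins (word overlap for the first stage, overlap plus an exact-phrase bonus for the reranker), not real bi- or cross-encoders, and the corpus and cutoffs are invented for illustration.

```python
from typing import Callable, List, Sequence


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    k_retrieve: int = 100,
    k_final: int = 5,
) -> List[int]:
    """Return indices of the top k_final docs after a two-stage search."""
    # Stage 1 (recall): score every doc cheaply and keep a shortlist.
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: cheap_score(query, corpus[i]),
        reverse=True,
    )[:k_retrieve]
    # Stage 2 (precision): run the expensive scorer on the shortlist only.
    reranked = sorted(
        shortlist,
        key=lambda i: expensive_score(query, corpus[i]),
        reverse=True,
    )
    return reranked[:k_final]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in first stage: number of word types shared by query and doc."""
    return float(len(set(query.split()) & set(doc.split())))


def phrase_score(query: str, doc: str) -> float:
    """Stand-in reranker: reads query and doc jointly, so it can reward an
    exact phrase match that bag-of-words overlap cannot see."""
    return overlap_score(query, doc) + (1.0 if query in doc else 0.0)


corpus = ["sat down near the cat", "the cat sat on the mat", "dogs bark loudly"]
top = retrieve_then_rerank(
    "cat sat", corpus, overlap_score, phrase_score, k_retrieve=2, k_final=1
)
```

The first stage ties the first two documents; the reranker breaks the tie in favor of the one containing the phrase. The expensive scorer runs on at most k_retrieve documents, which is where the latency saving comes from.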

The RAG pipeline and citations

Chunk documents, embed and index them, retrieve and rerank chunks for a query, build a prompt from the retrieved context, and have the (post-trained, Week 8) LLM generate an answer that cites the chunks it used. Grounding plus citation is the core value proposition: it makes the answer checkable and therefore easier to trust. Failure modes: retrieval misses, context overflow, and the model ignoring the context (an ungrounded answer).
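A minimal sketch of the non-model parts of this pipeline follows: overlapping word-window chunking, a prompt that numbers the retrieved chunks, and a check that the answer only cites chunk numbers that exist. The helper names, the prompt wording, the bracketed citation format, and the window sizes are all assumptions for illustration; the LLM call itself is left out.

```python
import re


def chunk(text, size=40, overlap=10):
    """Split text into word windows of `size` that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def build_prompt(question, chunks):
    """Number the chunks so the model can cite them as [1], [2], ..."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered context. "
        "Cite the chunks you used, like [1].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def cited_ids(answer):
    """Extract the distinct chunk numbers cited in an answer, sorted."""
    return sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})


def invalid_citations(answer, n_chunks):
    """Citations that point at chunks that were never in the prompt."""
    return [i for i in cited_ids(answer) if not 1 <= i <= n_chunks]


document = " ".join(f"w{i}" for i in range(100))  # toy 100-word document
chunks = chunk(document)
prompt = build_prompt("What is w42?", chunks)
```

Checking that cited numbers exist is only a first filter. Whether the cited chunk actually supports the claim is the faithfulness question, and it needs a separate evaluation.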

Extraction, coreference, and agents

Information extraction pulls structured facts (entities, relations, events) from text — the structured-prediction skills of Week 4 plus LLM prompting, often schema-constrained. Coreference links mentions of the same entity. Agents wrap the LLM in a loop that can call tools (search, a calculator, code execution), enabling multi-step tasks the model can’t do in one shot. Schema-constrained/structured output is the bridge from free text to reliable downstream use.
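One way to picture schema-constrained output is as a validator that accepts a model reply only if it parses as JSON and matches the expected fields and types exactly. The schema, the field names, and the sample replies below are hypothetical and exist only to illustrate the idea.

```python
import json

# Hypothetical schema for an employment relation: field name -> required type.
SCHEMA = {"person": str, "organization": str, "start_year": int}


def parse_extraction(raw, schema=SCHEMA):
    """Return the record if it matches the schema exactly, else None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free text or malformed JSON is rejected outright
    if not isinstance(record, dict) or set(record) != set(schema):
        return None  # missing or unexpected fields
    for field, expected in schema.items():
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    return record


good = parse_extraction(
    '{"person": "Alice Example", "organization": "Acme Corp", "start_year": 2020}'
)
bad_type = parse_extraction(
    '{"person": "Alice Example", "organization": "Acme Corp", "start_year": "2020"}'
)
free_text = parse_extraction("Alice Example joined Acme Corp in 2020.")
```

Rejected replies can be retried or logged. Either way, downstream code only ever sees records with a known shape.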

Theory Exercises

Derive the BM25 score; explain term-frequency saturation and length normalization, and why BM25 misses synonyms.
Contrast sparse, dense, and hybrid retrieval; design a reciprocal-rank-fusion scheme and argue when each component helps.
Explain the retrieve-then-rerank two-stage design in terms of the bi-/cross-encoder tradeoff (Week 7).
Define retrieval metrics (recall@k, MRR, nDCG) and RAG answer metrics (faithfulness/groundedness, answer relevance).
Describe a tool-calling agent loop and the failure modes (wrong tool, infinite loop); explain schema-constrained extraction.
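For the agent-loop exercise, a small skeleton may help make the failure modes concrete. In the sketch below the model is any callable that maps the transcript to a JSON string; scripted_model is a hypothetical stand-in for an LLM, and the JSON action format, the tool registry, and the step limit are assumptions for illustration. An unknown tool name becomes an error message in the transcript, and the step limit guards against an infinite loop.

```python
import json
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(a, op, b):
    """A deliberately tiny tool: one binary arithmetic operation."""
    return OPS[op](a, b)


TOOLS = {"calculator": calculator}  # registry of callable tools


def run_agent(model, task, max_steps=5):
    """Loop: ask the model for an action, run the tool, feed back the result."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):  # hard cap prevents an infinite loop
        try:
            action = json.loads(model(transcript))
        except json.JSONDecodeError:
            transcript.append("error: reply was not valid JSON")
            continue
        if not isinstance(action, dict):
            transcript.append("error: reply was not a JSON object")
            continue
        if "final" in action:
            return action["final"]
        tool = TOOLS.get(action.get("tool"))
        if tool is None:  # wrong-tool failure mode
            transcript.append(f"error: unknown tool {action.get('tool')!r}")
            continue
        try:
            result = tool(**action.get("args", {}))
        except (TypeError, KeyError, ZeroDivisionError) as exc:
            transcript.append(f"error: {exc!r}")
            continue
        transcript.append(f"observation: {result}")
    return None  # gave up after max_steps


def scripted_model(transcript):
    """Hypothetical stand-in for an LLM: call the calculator, then answer."""
    if transcript[-1].startswith("observation:"):
        return json.dumps({"final": transcript[-1].split(": ", 1)[1]})
    return json.dumps({"tool": "calculator", "args": {"a": 17, "op": "*", "b": 23}})


answer = run_agent(scripted_model, "What is 17 * 23?")
stuck = run_agent(lambda t: '{"tool": "search", "args": {}}', "anything")
```

The second call shows a model that keeps requesting a tool that is not registered: the loop records the error each time and returns None once the step budget is spent.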

Implementation

In rag_system/ and agents_ie/: build BM25, a dense bi-encoder retriever over a FAISS index, hybrid fusion, and a cross-encoder reranker; assemble the RAG pipeline with citation-aware generation using the Week 8 model. Add a schema-constrained information extractor and a small tool-calling agent (e.g. retrieval + calculator). Build an evaluation harness (retrieval + RAG faithfulness metrics).
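When building the dense retriever, an exact brute-force search is a useful reference against which to check an approximate index. The sketch below is that reference only: it ranks documents by cosine similarity in pure Python, with hypothetical two-dimensional vectors standing in for bi-encoder embeddings. It does not use FAISS; the assumption is that a correctly configured index should return roughly the same neighbors on a small corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Exact nearest neighbors: (doc index, similarity) pairs, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]


# Hypothetical embeddings; real ones would come from the Week 7 bi-encoder.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], doc_vecs, k=2)
```

The exact scan costs one similarity computation per document, which is fine for tests but is the cost that approximate nearest-neighbor indexes are meant to avoid at scale.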

Experiments

Retrieval: recall@k / MRR / nDCG for BM25 vs dense vs hybrid vs hybrid+rerank on a QA set. RAG: answer faithfulness and citation accuracy, with vs without retrieval (show hallucination reduction). Extraction: precision/recall vs a schema. Agent: task success with vs without tools.
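The retrieval metrics named above can be sketched in a few lines. The function names, the binary relevance labels, and the toy ranking are assumptions for illustration; nDCG here uses the common log2 discount with linear gains, which is one of several conventions.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over queries; runs is a list of (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, gains, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(d, 0) / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7", "d2"]   # system output for one query
relevant = {"d1", "d2"}             # gold labels for that query
gains = {"d1": 1, "d2": 1}          # graded relevance (binary here)

r2 = recall_at_k(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, gains, k=4)
```

Recall@k ignores order within the top k, reciprocal rank looks only at the first hit, and nDCG rewards placing every relevant document early, so the three can disagree about which retriever is better.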

Expected baselines: hybrid+rerank beats every single-method retriever; RAG substantially reduces hallucination and adds checkable citations compared with the bare LLM; schema-constrained extraction is more reliable than free-form prompting; the tool-using agent solves multi-step tasks that the LLM alone fails at. The key system lesson is that retrieval quality dominates end-to-end RAG quality.

Connections

This is the system spine of the capstone (Week 10), which packages the RAG assistant into a polished, evaluated product. It composes Week 7’s encoders and Week 8’s aligned generator with Course 5’s retrieval algorithms. The retrieve-then-rerank and grounding patterns are the production-NLP skills the course targets.