Week 10 — Capstone: Citation-Grounded RAG Assistant

Overview

The capstone integrates the entire course into one portfolio-grade system: a citation-grounded RAG assistant over your own corpus (course notes, papers, books, technical documents). It composes tokenization and retrieval (Weeks 1 and 9), embeddings and encoders (Weeks 2 and 7), the post-trained generator (Week 8), and the RAG pipeline (Week 9) into a deployed, evaluated application. The emphasis shifts from any single model to system design and rigorous evaluation: a staff-level NLP system is judged by its eval harness as much as by its features.

A reviewer should be able to ask the assistant a question, get a grounded answer with citations they can verify, and read a benchmark report quantifying retrieval quality, answer faithfulness, latency, and cost. This is the course’s headline artifact and mirrors how production RAG systems are actually built and judged.

Readings

Review all prior docs/ — the components you are integrating.
CS224N: final-project and LLM-evaluation material. Extract: evaluation methodology for generation systems.
Production NLP/LLM engineering references: serving, caching, and evaluation context. Extract: the operational concerns of a deployed assistant.

Key Concepts

System architecture

Ingestion (load, chunk, embed, index — Weeks 1–2, 9) → retrieval (hybrid + rerank — Week 9) → generation (post-trained LLM with citation-aware prompting — Week 8) → an application layer (API/UI) with conversation state. Each component is swappable behind a clean interface — the design discipline that lets you A/B retrievers or generators without rewrites, mirroring the contract-based design in Courses 1–2.
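The "swappable behind a clean interface" idea can be sketched with structural typing. This is a minimal illustration, not the course's reference design: the names Chunk, Retriever, Generator, KeywordRetriever, EchoGenerator, and Assistant are all hypothetical, and the toy retriever and generator stand in for the real Week 8 and Week 9 components.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    score: float = 0.0


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[Chunk]: ...


class Generator(Protocol):
    def generate(self, query: str, context: list[Chunk]) -> str: ...


class KeywordRetriever:
    """Toy retriever: scores each chunk by the number of shared lowercase tokens."""

    def __init__(self, chunks: list[Chunk]) -> None:
        self._chunks = chunks

    def retrieve(self, query: str, k: int) -> list[Chunk]:
        q = set(query.lower().split())
        scored = [
            Chunk(c.chunk_id, c.text, float(len(q & set(c.text.lower().split()))))
            for c in self._chunks
        ]
        scored.sort(key=lambda c: c.score, reverse=True)
        return scored[:k]


class EchoGenerator:
    """Stand-in generator: quotes the top chunk and cites its id."""

    def generate(self, query: str, context: list[Chunk]) -> str:
        if not context:
            return "I don't know."
        top = context[0]
        return f"{top.text} [{top.chunk_id}]"


class Assistant:
    """Depends only on the two protocols, so either side can be replaced."""

    def __init__(self, retriever: Retriever, generator: Generator) -> None:
        self.retriever = retriever
        self.generator = generator

    def answer(self, query: str, k: int = 3) -> str:
        return self.generator.generate(query, self.retriever.retrieve(query, k))
```

Because Assistant only sees the two protocols, an A/B run amounts to constructing it twice with different retrievers and feeding both the same held-out questions.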

Evaluation harness — the real deliverable

Evaluate at two levels: retrieval (recall@k, MRR, nDCG against labeled relevant chunks) and end-to-end answer quality (faithfulness/groundedness, answer relevance, and citation correctness, scored by an LLM-judge and a small human-rated set). Build a held-out question set over your corpus, labeling the relevant chunks for each question so the retrieval metrics can be computed. The eval harness is what makes claims about the system credible; treat it as the primary artifact.
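The retrieval-level metrics are short enough to write by hand. The sketch below assumes binary relevance labels (a chunk is either relevant or not) and chunk ids as strings; the function names and the evaluate helper are illustrative, not part of any particular library.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(ranked[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(
    runs: dict[str, list[str]], labels: dict[str, set[str]], k: int
) -> dict[str, float]:
    """Average each metric over all labeled questions (MRR is the mean reciprocal rank)."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must contain at least one question")
    return {
        f"recall@{k}": sum(recall_at_k(runs[q], labels[q], k) for q in labels) / n,
        "mrr": sum(reciprocal_rank(runs[q], labels[q]) for q in labels) / n,
        f"ndcg@{k}": sum(ndcg_at_k(runs[q], labels[q], k) for q in labels) / n,
    }
```

Here runs maps each question id to the retriever's ranked chunk ids and labels maps it to the set of chunks marked relevant. Graded relevance would need a gain term in ndcg_at_k in place of the constant 1.0.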

Grounding, citations, and failure handling

Every answer cites the chunks it used, and the system should abstain (“I don’t know”) when retrieval returns nothing relevant — abstention is a feature, not a bug, and a key trust property. Handle context-window limits (rank and truncate), retrieval misses, and ungrounded generation (detect and flag).
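One simple way to combine abstention with rank-and-truncate is a score threshold plus a token budget. The sketch below rests on assumptions: retrieval scores where higher means more relevant, a whitespace token count as a crude proxy for the real tokenizer, and hypothetical names (Hit, select_context, answer_or_abstain). The threshold itself should be tuned on the held-out set, not taken from here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    text: str
    score: float  # assumption: higher means more relevant


ABSTAIN_MESSAGE = "I don't know."


def select_context(hits: list[Hit], min_score: float, token_budget: int) -> list[Hit]:
    """Rank by score, drop hits below min_score, and skip chunks that would exceed the budget."""
    kept: list[Hit] = []
    used = 0
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.score < min_score:
            break  # everything after this is even less relevant
        cost = len(hit.text.split())  # crude whitespace token count (assumption)
        if used + cost > token_budget:
            continue  # too large to fit; a smaller, lower-ranked chunk may still fit
        kept.append(hit)
        used += cost
    return kept


def answer_or_abstain(
    query: str,
    hits: list[Hit],
    generate: Callable[[str, list[Hit]], str],
    min_score: float,
    token_budget: int,
) -> tuple[str, list[str]]:
    """Return (answer, cited chunk ids); abstain without calling the generator if nothing qualifies."""
    context = select_context(hits, min_score, token_budget)
    if not context:
        return ABSTAIN_MESSAGE, []
    return generate(query, context), [h.chunk_id for h in context]
```

Deciding abstention before generation keeps the refusal cheap and deterministic. Detecting ungrounded generation after the fact is a separate check that this sketch does not cover.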

Operational concerns

Set a latency budget across retrieval, reranking, and generation; track cost per query (tokens, embedding calls); and cache what repeats (embeddings, frequent queries). These concerns connect to Course 1’s systems/efficiency thinking: a RAG system is also an inference-serving problem.
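The budget arithmetic and an embedding cache can be sketched in a few lines. Everything here is illustrative: QueryBudget, cost_per_query, and cached_embed are hypothetical names, prices are caller-supplied per 1,000 tokens and not taken from any provider, and the hash-based embedding is a deterministic stand-in for a real encoder call.

```python
import hashlib
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class QueryBudget:
    """Per-stage latency for one query, in milliseconds."""

    retrieval_ms: float
    rerank_ms: float
    generation_ms: float

    @property
    def total_ms(self) -> float:
        return self.retrieval_ms + self.rerank_ms + self.generation_ms

    def dominant_stage(self) -> str:
        stages = {
            "retrieval": self.retrieval_ms,
            "rerank": self.rerank_ms,
            "generation": self.generation_ms,
        }
        return max(stages, key=lambda name: stages[name])


def cost_per_query(
    prompt_tokens: int,
    completion_tokens: int,
    embed_tokens: int,
    price_in: float,
    price_out: float,
    price_embed: float,
) -> float:
    """Cost of one query; all prices are per 1,000 tokens and supplied by the caller."""
    total = (
        prompt_tokens * price_in
        + completion_tokens * price_out
        + embed_tokens * price_embed
    )
    return total / 1000.0


@lru_cache(maxsize=4096)
def cached_embed(text: str) -> tuple[float, ...]:
    """Memoized embedding lookup; the hash-based vector is a stand-in for a real encoder."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(byte / 255.0 for byte in digest[:4])
```

cached_embed.cache_info() reports hits and misses, which is one easy source for a cache-hit-rate panel on the latency/cost dashboard. Returning a tuple keeps the cached value immutable, so callers cannot corrupt a shared entry.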

Theory Exercises

Draw the end-to-end architecture with component interfaces; justify which components are swappable and why.
Define the full evaluation suite (retrieval + answer metrics) and the procedure for building a labeled held-out set over your corpus.
Specify the abstention policy: when should the assistant refuse to answer, and how is that decided from retrieval scores?
Build the latency and cost budget per query; identify the dominant term and a caching strategy that reduces it.
Describe an A/B test comparing two retrievers end-to-end; state the metric and what would constitute a real improvement.
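For the A/B exercise, one way to make "a real improvement" concrete is a paired bootstrap over per-question scores: both retrievers answer the same held-out questions, and the confidence interval is computed on the per-question differences. This is one common choice and not necessarily the procedure the course expects. The function name, the default of 2,000 resamples, and the 95% interval are assumptions.

```python
import random


def paired_bootstrap_ci(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean difference B - A, lower bound, upper bound) for paired scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and the same length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    cut = round(alpha / 2 * n_resamples)  # number of resamples trimmed from each tail
    return sum(diffs) / n, means[cut], means[n_resamples - 1 - cut]
```

If the interval excludes zero and the mean difference is large enough to matter for users, the change is plausibly real. An interval that straddles zero means the held-out set cannot distinguish the two retrievers.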

Implementation

In final_project/: build the full RAG assistant over your corpus — ingestion/indexing, hybrid retrieval + reranking (Week 9), citation-aware generation (Week 8), and an API/UI with conversation state and abstention. Build the evaluation harness (retrieval + answer metrics, LLM-judge + small human set) and a benchmark report. Add caching and a latency/cost dashboard.
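For the hybrid step, one widely used way to merge a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). It is shown here as an option, not as the method Week 9 prescribes. The function name is hypothetical, and k = 60 is the conventional default for RRF, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids; each list adds 1 / (k + rank) per chunk."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```

RRF uses only ranks, so it needs no calibration between BM25 scores and embedding similarities. The fused list is then what the reranker sees.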

Experiments / Benchmark

Retrieval metrics and end-to-end faithfulness/citation-accuracy on the held-out set; latency (p50/p90/p99) and cost per query; an ablation across retriever variants (BM25-only vs hybrid vs hybrid+rerank) showing end-to-end impact; abstention behavior on out-of-corpus questions. Assemble the full benchmark report.
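The latency percentiles and the out-of-corpus abstention rate are simple to compute from logged runs. The sketch below uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values) and assumes abstentions are detected by exact match on the refusal string. All names are illustrative.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def latency_report(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize per-query latencies as p50/p90/p99."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}


def abstention_rate(answers: list[str], abstain_message: str = "I don't know.") -> float:
    """Fraction of answers that are exactly the abstention message."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return sum(1 for a in answers if a == abstain_message) / len(answers)
```

Running abstention_rate separately on in-corpus and out-of-corpus questions shows both sides of the policy: a high rate on out-of-corpus questions is the goal, while a high rate on in-corpus questions points to a threshold that is too strict.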

Expected baselines: hybrid+rerank gives the best grounded-answer quality; the assistant cites verifiable sources and abstains appropriately on out-of-corpus questions; latency/cost are dominated by generation, mitigated by caching; the ablation shows retrieval quality driving end-to-end quality — the system’s central lesson.

Connections

This capstone composes every prior week into one evaluated system — the portfolio deliverable for an NLP/LLM engineering role. It draws on Course 1’s efficiency/serving mindset and Course 5’s retrieval and information-theory foundations. The optional Weeks 11+ directions extend it toward speech/multimodal (building on Course 6 DSP), production serving, graph RAG, or richer agents.