Week 7 — Limit Theorems, Markov Chains, and Statistical Inference

Overview

This week takes probability from “one random variable” to “what happens at scale and over time,” then to “what can I conclude from data.” Limit theorems explain why averages concentrate and why Gaussians appear everywhere. Markov chains model systems that evolve with memoryless transitions, with steady-state behavior computed from the linear algebra of Weeks 1–2. Inference — both Bayesian (treat the parameter as random) and classical (treat it as fixed and unknown) — is the formal answer to “estimate the thing you cannot observe.”

Readings

BT Ch 5 — Limit theorems. Markov and Chebyshev inequalities, the weak law of large numbers, the central limit theorem.
BT Ch 6 — Bernoulli and Poisson processes (for event streams and timing).
BT Ch 7 — Markov chains. Discrete-time chains, classification of states, steady-state behavior, absorption.
BT Ch 8 — Bayesian inference. MAP and LMS/LLMS estimation.
BT Ch 9 — Classical inference. Estimators, maximum likelihood, confidence intervals, hypothesis testing.

Key Concepts

Inequalities and the laws of large numbers

Markov: \(P(X \ge a) \le \mathbb{E}[X]/a\) for \(X\ge 0\). Chebyshev: \(P(|X-\mu| \ge k\sigma) \le 1/k^2\). These give the weak law of large numbers: the sample mean \(M_n = \frac1n\sum X_i\) converges in probability to \(\mu\). The central limit theorem is sharper — the standardized sum converges in distribution to a standard normal:

\[\frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} \mathcal{N}(0,1).\]

This is why Gaussian noise models are so often justified and why error bars shrink like \(1/\sqrt{n}\).

Bernoulli and Poisson processes

The Bernoulli process is a stream of i.i.d. coin flips; interarrival times are geometric. Its continuous-time limit is the Poisson process with rate \(\lambda\): the number of arrivals in an interval of length \(t\) is Poisson(\(\lambda t\)), interarrival times are exponential, and the process is memoryless. Merging and splitting Poisson processes stay Poisson — the right model for event streams, queues, and sensor-arrival timing.

Markov chains

A discrete-time Markov chain has the memoryless property \(P(X_{n+1}=j \mid X_n=i, \dots) = P_{ij}\), encoded in a transition matrix \(P\). States are classified as recurrent/transient and periodic/aperiodic. For an irreducible aperiodic chain there is a unique stationary distribution \(\pi\) solving

\[\pi = \pi P, \qquad \sum_i \pi_i = 1,\]

i.e. \(\pi\) is the left eigenvector of \(P\) for eigenvalue \(1\) — the eigen-machinery of Week 1 applied directly. Absorbing chains let you compute hitting/absorption probabilities and expected times.

Bayesian inference

Treat the unknown \(\Theta\) as a random variable with prior \(p_\Theta\). After observing data, form the posterior by Bayes’ rule. Point estimates: the MAP estimate maximizes the posterior, while the LMS estimate is the posterior mean \(\mathbb{E}[\Theta\mid X]\) (minimizes mean-squared error, from Week 6). The linear LMS estimator is the best estimator restricted to linear functions of the data and depends only on means/variances/covariances.

Classical inference

Treat \(\theta\) as a fixed unknown constant. Maximum likelihood chooses \(\hat\theta = \arg\max_\theta f(x;\theta)\); it is the optimization problem of Week 9 in disguise and, for Gaussian noise, coincides with least squares. Confidence intervals quantify estimator uncertainty; hypothesis testing (likelihood-ratio tests, significance levels) formalizes deciding between models — directly relevant to the detection/estimation framing in Week 10.

Connections

Backward: the convergence notions here (in probability, in distribution) rest on the sequence/limit machinery of Week 5; Markov-chain steady states use the eigen-theory of Week 1.
Forward: ML estimation is the objective minimized in Week 9; the Gaussian-as-limit and entropy connect to Week 10; Markov chains foreshadow the entropy rate of a process (Week 10).
Across courses: Bayes filters and state estimation (Course 1), language-model likelihood and evaluation (Course 3).