Week 6 — Probability Foundations and Random Variables
Overview
Probability is the language of every system that reasons under uncertainty: prediction, estimation, perception, and language modeling. This week builds it from the ground up — sample spaces and the axioms, conditioning and Bayes’ rule, then the full machinery of discrete and continuous random variables, expectation, variance, joint and conditional distributions, covariance, conditional expectation, and transforms. The goal is fluency with the manipulations, since Weeks 7–10 lean on all of them.
Readings
- BT Ch 1 — Sample space and probability. Axioms, conditional probability, total probability, Bayes’ rule, independence, counting.
- BT Ch 2 — Discrete random variables. PMFs, functions of random variables, expectation/mean/variance, joint PMFs, conditioning.
- BT Ch 3 — General random variables. PDFs, CDFs, continuous random variables, conditioning, multiple random variables.
- BT Ch 4 — Further topics. Derived distributions, covariance and correlation, conditional expectation, transforms (MGFs).
Key Concepts
Axioms, conditioning, Bayes
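A probability measure assigns \(P(A) \ge 0\) to every event, satisfies \(P(\Omega)=1\), and is countably additive on disjoint events. Conditional probability \(P(A\mid B) = P(A\cap B)/P(B)\), defined when \(P(B) > 0\), leads to the total probability theorem and, for events \(A_1, A_2, \dots\) that partition \(\Omega\), to Bayes’ rule: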
\[P(A_i \mid B) = \frac{P(B\mid A_i)\,P(A_i)}{\sum_j P(B\mid A_j)\,P(A_j)}.\]
Events \(A\) and \(B\) are independent when \(P(A\cap B) = P(A)P(B)\). Bayes’ rule underlies the inference in Week 7 and the estimation in Week 9.
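As a minimal sketch of Bayes’ rule in action, the snippet below computes a posterior over two hypotheses. The function name `posterior` and the diagnostic-test numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are illustrative assumptions, not taken from the readings.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(A_i | B) for each hypothesis A_i via Bayes' rule.

    priors[i]      = P(A_i), where the A_i partition the sample space
    likelihoods[i] = P(B | A_i)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_i) P(A_i)
    evidence = sum(joint)                                 # total probability: P(B)
    if evidence == 0:
        raise ValueError("P(B) = 0, so conditioning on B is undefined")
    return [j / evidence for j in joint]

# Hypothetical numbers, chosen only for illustration.
priors = [Fraction(1, 100), Fraction(99, 100)]        # P(disease), P(no disease)
likelihoods = [Fraction(95, 100), Fraction(5, 100)]   # P(positive | each hypothesis)

post = posterior(priors, likelihoods)
print(post[0], float(post[0]))  # 19/118, about 0.161
```

Even with a fairly accurate test, the posterior stays low because the prior is small; the denominator is exactly the total probability theorem applied to \(B\).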
Random variables, expectation, variance
A random variable maps outcomes to numbers. Discrete RVs have a PMF \(p_X\); continuous RVs a PDF \(f_X\) with \(P(a\le X\le b) = \int_a^b f_X(x)\,dx\). The expectation and variance are
\[\mathbb{E}[X] = \sum_x x\,p_X(x) \ \text{or}\ \int x f_X(x)\,dx, \qquad \operatorname{Var}(X) = \mathbb{E}[(X-\mathbb{E}X)^2] = \mathbb{E}[X^2] - (\mathbb{E}X)^2.\]
Expectation is linear regardless of dependence: \(\mathbb{E}[aX+bY] = a\mathbb{E}X + b\mathbb{E}Y\). The law of the unconscious statistician gives \(\mathbb{E}[g(X)] = \sum_x g(x)\,p_X(x)\) (or \(\int g(x) f_X(x)\,dx\) in the continuous case) without finding the distribution of \(g(X)\).
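The formulas above can be checked exactly on a small example. The sketch below uses a fair six-sided die (an assumed example) and a hypothetical helper `expect` that applies the law of the unconscious statistician directly to the PMF.

```python
from fractions import Fraction

# PMF of a fair six-sided die: p_X(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(pmf, g=lambda x: x):
    """E[g(X)] = sum_x g(x) p_X(x), computed without the distribution of g(X)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(pmf)                                       # E[X]
second_moment = expect(pmf, lambda x: x * x)             # E[X^2]
var_shortcut = second_moment - mean ** 2                 # E[X^2] - (E X)^2
var_definition = expect(pmf, lambda x: (x - mean) ** 2)  # E[(X - E X)^2]

# Linearity holds even though X and X^2 are dependent.
lhs = expect(pmf, lambda x: 2 * x + 3 * x * x)           # E[2X + 3X^2]
rhs = 2 * mean + 3 * second_moment                       # 2 E[X] + 3 E[X^2]

print(mean, var_shortcut)  # 7/2 35/12
```

Using `Fraction` keeps the arithmetic exact, so the two variance formulas agree as an identity and not just up to rounding.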
Joint, marginal, conditional, independence
For multiple RVs, the joint PMF/PDF determines marginals (sum/integrate out) and conditionals \(f_{X\mid Y}(x\mid y) = f_{X,Y}(x,y)/f_Y(y)\). Independence factorizes the joint into the product of marginals. Standard distributions to know cold: Bernoulli, binomial, geometric, Poisson (discrete); uniform, exponential, Gaussian (continuous).
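For the discrete case, marginalization, conditioning, and the independence check can be sketched on a joint PMF stored as a dictionary. The table values and the helper names (`marginal`, `conditional_x_given_y`, `independent`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF p_{X,Y}(x, y) on {0, 1} x {0, 1}.
joint = {
    (0, 0): Fraction(1, 2), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 8),
}

def marginal(joint, axis):
    """Sum out the other variable: axis=0 gives p_X, axis=1 gives p_Y."""
    out = defaultdict(Fraction)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_x_given_y(joint, y):
    """p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y), assuming p_Y(y) > 0."""
    p_y = marginal(joint, 1)[y]
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

def independent(joint):
    """True when the joint equals the product of its marginals everywhere."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x in p_x for y in p_y)

p_x = marginal(joint, 0)                   # {0: 3/4, 1: 1/4}
p_y = marginal(joint, 1)                   # {0: 5/8, 1: 3/8}
x_given_1 = conditional_x_given_y(joint, 1)  # {0: 2/3, 1: 1/3}
is_indep = independent(joint)              # False for this table

# Building a joint from the marginals produces an independent pair by construction.
product_joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
```

Here \(p_{X,Y}(0,0) = 1/2\) while \(p_X(0)\,p_Y(0) = 15/32\), so the factorization fails and the pair is dependent.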
Covariance, correlation, conditional expectation
\[\operatorname{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}X)(Y-\mathbb{E}Y)], \qquad \rho = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}X\,\operatorname{Var}Y}} \in [-1,1].\]
The conditional expectation \(\mathbb{E}[X\mid Y]\) is itself a random variable (a function of \(Y\)) and satisfies the tower property \(\mathbb{E}[\mathbb{E}[X\mid Y]] = \mathbb{E}[X]\) and the law of total variance \(\operatorname{Var}(X) = \mathbb{E}[\operatorname{Var}(X\mid Y)] + \operatorname{Var}(\mathbb{E}[X\mid Y])\). \(\mathbb{E}[X\mid Y]\) is the minimum-mean-squared-error estimator of \(X\) from \(Y\) — the bridge to Week 9’s estimation.
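The covariance, the tower property, and the law of total variance can all be verified exactly on a small joint PMF. The four-point distribution below is a hypothetical example, and the variable names are illustrative.

```python
import math
from fractions import Fraction

# Hypothetical joint PMF: four equally likely (x, y) pairs.
quarter = Fraction(1, 4)
joint = {(0, 0): quarter, (1, 0): quarter, (1, 1): quarter, (2, 1): quarter}

def expect(g):
    """E[g(X, Y)] under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
cov = expect(lambda x, y: (x - ex) * (y - ey))
rho = float(cov) / math.sqrt(float(var_x * var_y))

# Conditional mean and variance of X for each value y0 of Y.
p_y, cond_mean, cond_var = {}, {}, {}
for y0 in {y for _, y in joint}:
    p_y[y0] = sum(p for (x, y), p in joint.items() if y == y0)
    cond_mean[y0] = sum(x * p for (x, y), p in joint.items() if y == y0) / p_y[y0]
    cond_var[y0] = sum((x - cond_mean[y0]) ** 2 * p
                       for (x, y), p in joint.items() if y == y0) / p_y[y0]

tower = sum(cond_mean[y] * p_y[y] for y in p_y)                       # E[E[X|Y]]
mean_of_var = sum(cond_var[y] * p_y[y] for y in p_y)                  # E[Var(X|Y)]
var_of_mean = sum((cond_mean[y] - tower) ** 2 * p_y[y] for y in p_y)  # Var(E[X|Y])

print(tower == ex, mean_of_var + var_of_mean == var_x)  # True True
```

In this example \(\operatorname{Var}(X) = 1/2\) splits evenly: \(1/4\) is the average spread of \(X\) within each value of \(Y\), and \(1/4\) is the spread of the conditional means.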
Derived distributions and transforms
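For \(Y = g(X)\) with \(g\) strictly monotone and differentiable, the change-of-variables formula is \(f_Y(y) = f_X(g^{-1}(y))\,\big|\frac{d}{dy}g^{-1}(y)\big|\); the derivative factor is the 1-D Jacobian, which becomes the Jacobian determinant from Week 2 in higher dimensions. The moment generating function (transform) \(M_X(s) = \mathbb{E}[e^{sX}]\), when finite in a neighborhood of \(s = 0\), encodes all moments through its derivatives at zero and turns sums of independent RVs into products, \(M_{X+Y}(s) = M_X(s)\,M_Y(s)\). This makes it the analytic tool behind the limit theorems in Week 7.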
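Both facts can be sanity-checked numerically. The sketch below assumes \(X \sim \text{Uniform}(0,1)\) and \(g(x) = -\ln(x)/\lambda\), for which the change-of-variables formula should reproduce the exponential density \(\lambda e^{-\lambda y}\); it then checks the product rule for the transform on two independent Bernoulli variables. The parameter values and helper names are illustrative assumptions.

```python
import math

def density_of_transform(f_x, g_inv, y, h=1e-6):
    """f_Y(y) = f_X(g^{-1}(y)) |d/dy g^{-1}(y)|, with a central-difference derivative."""
    dg_inv = (g_inv(y + h) - g_inv(y - h)) / (2 * h)
    return f_x(g_inv(y)) * abs(dg_inv)

lam = 2.0                                       # hypothetical rate parameter
f_x = lambda x: 1.0 if 0.0 < x < 1.0 else 0.0   # Uniform(0, 1) density
g_inv = lambda y: math.exp(-lam * y)            # inverse of g(x) = -ln(x) / lam

y = 0.7
f_y = density_of_transform(f_x, g_inv, y)
f_y_exact = lam * math.exp(-lam * y)            # exponential density at y

def mgf(pmf, s):
    """M_X(s) = E[exp(s X)] for a discrete PMF given as {value: probability}."""
    return sum(math.exp(s * x) * p for x, p in pmf.items())

p = 0.3                                         # hypothetical success probability
bern = {0: 1 - p, 1: p}

# PMF of X1 + X2 for independent X1, X2 ~ Bernoulli(p), built by convolution.
total = {}
for a, pa in bern.items():
    for b, pb in bern.items():
        total[a + b] = total.get(a + b, 0.0) + pa * pb

s = 0.5
mgf_of_sum = mgf(total, s)
product_of_mgfs = mgf(bern, s) ** 2
```

The two densities agree up to the finite-difference error, and `mgf_of_sum` matches `product_of_mgfs` up to floating-point rounding, which is the identity \(M_{X_1+X_2}(s) = M_{X_1}(s)\,M_{X_2}(s)\) for independent summands.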
Connections
- Backward: Week 5’s real-analysis foundations (limits, convergence) make the limiting statements here rigorous.
- Forward: Week 7 adds limit theorems, Markov chains, and inference; Week 10 reinterprets PMFs/PDFs through entropy and recognizes the Gaussian as maximum-entropy.
- Across courses: Uncertainty and multimodal prediction (Course 1), probabilistic language models and softmax (Course 3), randomized algorithms (Week 14).
Further Reading
- Bertsekas & Tsitsiklis, Introduction to Probability, 2nd ed., Chapters 1–4.