Week 3 — Numerical Linear Algebra I: QR, Least Squares, Conditioning, and Stability

Overview

Week 2’s theorems assume exact arithmetic. Real computers use floating point, and the gap between “mathematically correct” and “numerically computed” is where most production numerical bugs live. This week develops the working vocabulary of numerical linear algebra: matrix norms, orthogonal projectors, the QR factorization (Gram–Schmidt vs. Householder), least squares solved three ways, the condition number as a measure of problem sensitivity, the distinction between a stable algorithm and a well-conditioned problem, and the floating-point model that makes the whole analysis precise. LU with pivoting closes the week.

Readings

T&B Lectures 1–5 — matrix-vector multiplication, orthogonal vectors and matrices, vector and matrix norms, the SVD.
T&B Lectures 6–10 — projectors, QR factorization, Gram–Schmidt, Householder triangularization.
T&B Lectures 11–15 — least squares, conditioning and condition numbers, floating-point arithmetic, stability, backward stability.
T&B Lectures 17–23 — Gaussian elimination, pivoting, stability of LU, Cholesky.

Key Concepts

Norms and orthogonal matrices

Vector \(p\)-norms \(\|x\|_p = (\sum_i |x_i|^p)^{1/p}\); the induced matrix norm \(\|A\| = \max_{\|x\|=1}\|Ax\|\). The 2-norm is \(\|A\|_2 = \sigma_1\) and the Frobenius norm is \(\|A\|_F = (\sum_{ij} A_{ij}^2)^{1/2} = (\sum_i \sigma_i^2)^{1/2}\). An orthogonal matrix \(Q\) (\(Q^\top Q = I\)) preserves the 2-norm, \(\|Qx\|_2 = \|x\|_2\), which is why orthogonal transformations are the workhorses of stable algorithms — they never amplify errors.

Projectors and QR

A projector satisfies \(P^2 = P\); it is orthogonal when \(P = P^\top\). The (reduced) QR factorization \(A = \hat Q \hat R\) writes the columns of \(A\) in an orthonormal basis \(\hat Q\) with upper-triangular coefficients \(\hat R\). Classical Gram–Schmidt is numerically unstable (loss of orthogonality); modified Gram–Schmidt and especially Householder reflectors — orthogonal matrices that zero out a column below the diagonal — are backward stable. Givens rotations are the sparse-friendly alternative.

Least squares, three ways

To minimize \(\|Ax - b\|_2\) for tall \(A\):

Normal equations \(A^\top A x = A^\top b\) — fast but squares the condition number; avoid when \(A\) is ill-conditioned.
QR: solve \(\hat R x = \hat Q^\top b\) — the standard, stable choice.
SVD: \(x = V \Sigma^{+} U^\top b\) — most robust, reveals rank deficiency, and defines the minimum-norm solution.

Conditioning vs. stability

The condition number of a nonsingular matrix is \(\kappa(A) = \|A\|\,\|A^{-1}\| = \sigma_1/\sigma_n\) in the 2-norm. It bounds how much relative input error is amplified into output error when solving \(Ax = b\):

\[\frac{\|\delta x\|}{\|x\|} \lesssim \kappa(A)\,\frac{\|\delta b\|}{\|b\|}.\]

Conditioning is a property of the problem; stability is a property of the algorithm. A backward-stable algorithm returns the exact answer to a slightly perturbed problem — it gives a good answer to a well-conditioned problem but cannot rescue an ill-conditioned one. Keep the two ideas separate.

Floating point

IEEE double precision represents reals with unit roundoff \(\epsilon_{\text{mach}} \approx 1.1\times 10^{-16}\): every basic operation satisfies \(\mathrm{fl}(x \circ y) = (x\circ y)(1+\delta)\) with \(|\delta| \le \epsilon_{\text{mach}}\). Catastrophic cancellation (subtracting nearly equal numbers) and non-associativity of summation are the recurring hazards; this model is what makes “backward stable” a provable statement rather than folklore.

LU and pivoting

Gaussian elimination factors \(A = LU\) (\(L\) unit lower-triangular, \(U\) upper-triangular). Without pivoting it can be unstable or fail on a zero pivot; partial pivoting (\(PA = LU\), swapping rows to use the largest available pivot) makes it reliable in practice. For symmetric positive-definite \(A\), Cholesky \(A = R^\top R\) is twice as fast and unconditionally stable.

Connections

Forward: Week 4 handles eigenvalues and large sparse systems where direct factorization is too expensive. Conditioning reappears in Week 9 as the curvature/scaling that governs gradient-descent convergence.
Across courses: Mixed-precision and numerical-stability reasoning for ML kernels (Course 1), float layout and IEEE 754 internals (Course 4).