Week 3 — Mixed Precision, Numerical Stability, and Quantization for ML

Overview

Accelerators deliver far more throughput at low precision than at high precision, so every serious ML systems engineer must reason about which bits actually matter. This week connects Course 5’s conditioning and floating-point theory to the concrete formats that ML hardware accelerates: fp32, fp16, bf16, and int8. The central tension is that lower precision means more throughput and less memory traffic (which directly improves the Week 1 roofline) but risks overflow, underflow, and accumulated rounding error that can silently degrade a model. You will learn where mixed precision is safe, where it bites, and how quantization trades accuracy for speed in a controlled way.

By the end you should be able to take a numerical workload, choose a precision strategy, and demonstrate by measurement that the result is still correct to a stated tolerance. This is the discipline that separates “it ran faster” from “it ran faster and is still right.”

Readings

T&B (review/apply): condition number, perturbation, backward stability, and floating-point behavior. Extract: how input conditioning bounds the achievable accuracy regardless of algorithm.
CUDA Book: mixed-precision and tensor-core sections. Extract: which operations the hardware accelerates at fp16/bf16 and how accumulation precision is handled.
C&T: entropy and KL divergence. Extract: an information view of how much precision a distribution actually needs (motivates quantization).
CS231n: training-instability and optimization review. Extract: where low precision destabilizes training, and why loss scaling counters it.
(Conditioning, SVD, and the floating-point model: assumed from Course 5.)

Key Concepts

The formats and what they trade

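fp32 (1-8-23), fp16 (1-5-10), bf16 (1-8-7), and int8, where each triple gives the sign, exponent, and mantissa bits. bf16 keeps fp32’s 8-bit exponent, and therefore its dynamic range, but cuts the mantissa to 7 bits, so it rarely overflows and is the modern training default. fp16 has a longer mantissa (10 bits) but a narrow exponent range (largest finite value 65504, smallest subnormal about 6e-8), so it needs loss scaling to keep small gradients from underflowing to zero. int8 is integer quantization: an 8-bit integer plus a scale and zero-point that map it back to a real value. The choice is governed by the dynamic range and the precision that the specific tensor needs.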
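The range and precision figures above follow directly from the bit layouts. The sketch below derives them for any IEEE-754-style binary format; the helper name `ieee_like_format` and the `FORMATS` table are illustrative, not part of any library, and the code assumes the usual conventions (implicit leading bit, the all-ones exponent reserved for inf/NaN, subnormals supported).

```python
def ieee_like_format(exp_bits: int, man_bits: int) -> dict:
    """Derive range and precision of an IEEE-754-style binary format.

    Assumes an implicit leading 1, a bias of 2**(exp_bits-1) - 1, and that
    the all-ones exponent is reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    emax = bias        # largest unbiased exponent of a finite value
    emin = 1 - bias    # smallest unbiased exponent of a normal value
    return {
        "max": (2.0 - 2.0 ** -man_bits) * 2.0 ** emax,
        "min_normal": 2.0 ** emin,
        "min_subnormal": 2.0 ** (emin - man_bits),
        "epsilon": 2.0 ** -man_bits,  # gap between 1.0 and the next value
    }


# (exponent bits, mantissa bits) for the three floating-point formats.
FORMATS = {"fp32": (8, 23), "fp16": (5, 10), "bf16": (8, 7)}

for name, (exp_bits, man_bits) in FORMATS.items():
    props = ieee_like_format(exp_bits, man_bits)
    print(f"{name}: max={props['max']:.4e}  "
          f"min_subnormal={props['min_subnormal']:.4e}  "
          f"epsilon={props['epsilon']:.4e}")
```

The output makes the trade visible: bf16 and fp32 share nearly the same maximum (about 3.4e38) while fp16 stops at 65504, but bf16's epsilon is eight times coarser than fp16's.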

Mixed precision done correctly

The standard recipe: store tensors and run most ops in low precision, but accumulate matmuls and reductions in fp32 and keep a master copy of the weights in fp32. Tensor cores support exactly this pattern: fp16/bf16 multiply, fp32 accumulate. The hazard is summation. Once the running sum is large, adding a small term in fp16 changes nothing, because the term is smaller than half the spacing between representable values at the sum’s magnitude (absorption, also called swamping). This is why accumulation precision matters more than storage precision.
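The absorption hazard can be reproduced in a few lines. This is a minimal NumPy sketch, not a model of how any particular kernel orders its additions: the helper `accumulate` is hypothetical and simply forces the running sum to be rounded to the chosen accumulator type after every add.

```python
import numpy as np


def accumulate(values: np.ndarray, acc_dtype) -> float:
    """Sum sequentially, rounding the running sum to acc_dtype after each add."""
    acc = acc_dtype(0.0)
    for v in values:
        acc = acc_dtype(acc + acc_dtype(v))
    return float(acc)


# 4096 copies of 1.0 stored in fp16; the exact sum is 4096.
values = np.ones(4096, dtype=np.float16)

sum_fp16 = accumulate(values, np.float16)  # stalls at 2048
sum_fp32 = accumulate(values, np.float32)  # reaches 4096

print("fp16 accumulator:", sum_fp16)
print("fp32 accumulator:", sum_fp32)
```

With an 11-bit significand, fp16 represents every integer up to 2048, but above that the spacing is 2. Adding 1.0 to 2048 lands exactly halfway and rounds back to 2048, so the second half of the inputs is lost entirely even though each input is stored exactly.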

Conditioning sets the ceiling

A problem with condition number \(\kappa\) can lose roughly \(\log_{10}\kappa\) decimal digits no matter the algorithm. fp16 carries only about 3 decimal digits (machine epsilon \(2^{-10}\approx 10^{-3}\)) and bf16 about 2, so if \(\kappa\) is large, the format may not have enough digits left to represent the answer at all. In that case the failure is the problem’s, not the format’s. Course 5’s conditioning analysis is the tool for predicting which workloads tolerate low precision.
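A rough way to apply this rule of thumb is to subtract \(\log_{10}\kappa\) from the digits the format carries. The sketch below does that and then shows a concrete case where fp16 storage destroys the problem. The function `digits_remaining` and the 2x2 matrix are illustrative choices, and the estimate uses machine epsilon \(2^{-m}\) for an \(m\)-bit mantissa, which is an order-of-magnitude guide rather than a bound.

```python
import math

import numpy as np


def digits_remaining(kappa: float, man_bits: int) -> float:
    """Rule-of-thumb decimal digits left: -log10(eps) - log10(kappa)."""
    eps = 2.0 ** -man_bits
    return -math.log10(eps) - math.log10(kappa)


for name, man_bits in {"fp32": 23, "fp16": 10, "bf16": 7}.items():
    left = digits_remaining(1e4, man_bits)
    print(f"{name}: about {left:.1f} digits left at kappa = 1e4")

# A nearly singular matrix: its rows differ only in the fourth decimal place.
delta = 1e-4
A = np.array([[1.0, 1.0], [1.0, 1.0 + delta]])
kappa = np.linalg.cond(A)  # roughly 4 / delta

# Storing A in fp16 rounds 1.0001 to 1.0, so the stored matrix is singular.
A_fp16 = A.astype(np.float16)
rank_after = np.linalg.matrix_rank(A_fp16.astype(np.float64))

print(f"kappa = {kappa:.3g}, rank after fp16 storage = {rank_after}")
```

A negative result from `digits_remaining` means no correct digits should be expected. The matrix example shows the same thing structurally: the information that distinguishes the two rows sits below fp16's resolution, so no algorithm run on the stored matrix can recover it.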

Quantization

Map reals to integers: \(q=\text{round}(x/s)+z\) with scale \(s\) and zero-point \(z\), and recover an approximation with \(\hat{x}=s\,(q-z)\). Post-training quantization calibrates \(s,z\) from activation statistics gathered on a trained model; quantization-aware training simulates the rounding during training so the model learns to tolerate it. Away from clipping, the quantization error behaves like uniform noise on \([-s/2,\,s/2]\) with variance \(s^2/12\), so the signal-to-quantization-noise ratio improves by about 6 dB per bit. This is the same result as for Course 6’s ADCs: quantizing weights and quantizing a sampled signal are the same mathematics.
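The mapping and the 6 dB-per-bit claim can both be checked numerically. The sketch below assumes simple min/max calibration and a signed integer range; the function names are illustrative, and the test signal is a uniform ramp on \([-1, 1]\), which is the case where the uniform-noise model fits best.

```python
import numpy as np


def calibrate_minmax(x: np.ndarray, bits: int = 8):
    """Choose scale and zero-point so [x.min(), x.max()] spans the integer range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point, qmin, qmax


def quantize(x, scale, zero_point, qmin, qmax):
    """q = round(x / s) + z, clipped to the representable integer range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int32)


def dequantize(q, scale, zero_point):
    """x_hat = s * (q - z)."""
    return (q.astype(np.float64) - zero_point) * scale


def sqnr_db(x: np.ndarray, bits: int) -> float:
    scale, zp, qmin, qmax = calibrate_minmax(x, bits)
    x_hat = dequantize(quantize(x, scale, zp, qmin, qmax), scale, zp)
    noise = x - x_hat
    return float(10.0 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2)))


x = np.linspace(-1.0, 1.0, 100_001)
sqnr = {bits: sqnr_db(x, bits) for bits in (6, 7, 8)}
for bits, value in sqnr.items():
    print(f"{bits} bits: SQNR = {value:.2f} dB")

# Worst-case int8 reconstruction error should not exceed half a step.
scale8, zp8, qmin8, qmax8 = calibrate_minmax(x, 8)
max_err8 = float(np.max(np.abs(
    x - dequantize(quantize(x, scale8, zp8, qmin8, qmax8), scale8, zp8))))
print(f"int8 step = {scale8:.5f}, max error = {max_err8:.5f}")
```

Real activation tensors are rarely uniform, so measured SQNR on a model will generally be lower than this idealized figure, especially when outliers stretch the calibrated range.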

Theory Exercises

Give the exact representable range and machine epsilon for fp16 and bf16; explain why bf16 rarely overflows but is coarser.
Construct a sum of values where fp16 accumulation loses a term that fp32 accumulation keeps; quantify the error.
For a matrix with condition number \(\kappa=10^4\), estimate the digits of accuracy lost and whether fp16 storage is viable.
Derive the int8 quantization error as uniform noise and the ~6 dB/bit SQNR; relate it to Course 6’s ADC result.
Explain loss scaling: why multiplying the loss by a constant rescues fp16 gradients, and how to choose the factor.
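As a starting point for the loss-scaling exercise, the sketch below shows the mechanism on three made-up gradient values. The values and the scale \(2^{16}\) are assumptions chosen for illustration. Because backpropagation is linear in the loss, multiplying the loss by a constant multiplies every gradient by the same constant, which is what the code imitates by scaling the gradients directly.

```python
import numpy as np

FP16_MAX = 65504.0

# Hypothetical "true" gradients spanning a wide dynamic range.
grads = np.array([1e-8, 3e-6, 0.5], dtype=np.float64)

# Direct cast: 1e-8 is below half of fp16's smallest subnormal and becomes 0.
direct = grads.astype(np.float16)

# Scale first, cast to fp16, then unscale in fp32.
loss_scale = 2.0 ** 16
assert np.max(np.abs(grads)) * loss_scale < FP16_MAX  # the scale must not overflow
scaled_fp16 = (grads * loss_scale).astype(np.float16)
recovered = scaled_fp16.astype(np.float32) / np.float32(loss_scale)

rel_err = np.abs(recovered - grads) / grads
print("direct fp16 cast :", direct)
print("scaled, unscaled :", recovered)
print("relative error   :", rel_err)

# Too large a scale overflows the biggest gradient instead.
with np.errstate(over="ignore"):
    overflowed = (grads * 2.0 ** 18).astype(np.float16)
print("scale 2**18      :", overflowed)
```

The last line shows the constraint on the factor: it should be as large as possible while the largest scaled gradient stays below 65504. A common approach is to adjust it dynamically, shrinking it when an inf or NaN appears and growing it again after a run of clean steps.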

Implementation

Take a representative workload (the Week 1 reduction/matmul, or a small layer from the Week 5 predictor). Implement fp32, fp16-with-fp32-accumulate, bf16, and an int8-quantized variant. Add loss scaling where needed. Verify each against the fp32 reference and report the error.
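One possible shape for the verification step is sketched below in NumPy. NumPy has no native bf16 dtype, so `to_bf16` is a hypothetical helper that emulates bf16 storage by rounding float32 bit patterns to their top 16 bits (round to nearest, ties to even); it ignores NaN and inf. The matrix sizes, the input distribution, and the `error_report` helper are likewise assumptions. Inputs are kept positive so the matmul is well-conditioned and relative error is meaningful.

```python
import numpy as np


def to_bf16(x: np.ndarray) -> np.ndarray:
    """Emulate bf16 storage: keep the top 16 bits of each float32,
    rounding to nearest with ties to even. NaN/inf are not handled."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)  # last bit that is kept
    rounded = bits + np.uint32(0x7FFF) + lsb
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)


def to_fp16(x: np.ndarray) -> np.ndarray:
    """Round to fp16 storage, then widen so the matmul accumulates in fp32."""
    return x.astype(np.float16).astype(np.float32)


def error_report(result: np.ndarray, reference: np.ndarray) -> dict:
    abs_err = np.abs(result - reference)
    return {
        "max_abs": float(abs_err.max()),
        "max_rel": float((abs_err / np.abs(reference)).max()),
    }


rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.5, size=(64, 256)).astype(np.float32)
b = rng.uniform(0.5, 1.5, size=(256, 32)).astype(np.float32)

reference = a @ b  # fp32 storage, fp32 accumulate
reports = {
    "fp16 storage, fp32 accumulate": error_report(to_fp16(a) @ to_fp16(b), reference),
    "bf16 storage, fp32 accumulate": error_report(to_bf16(a) @ to_bf16(b), reference),
}
for name, report in reports.items():
    print(f"{name}: max_abs={report['max_abs']:.3e}  max_rel={report['max_rel']:.3e}")
```

On real hardware the low-precision variants would use native types and tensor-core instructions. Emulating the rounding on the host is still useful as an independent check on what error to expect from storage precision alone.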

Benchmark

For each precision, report throughput (GFLOP/s, GB/s), memory footprint, and accuracy (maximum absolute and relative error vs fp32). Plot the speed/accuracy frontier. Re-examine the Week 1 roofline: halving bytes-per-element doubles arithmetic intensity (FLOP/byte), so a bandwidth-bound kernel’s attainable throughput should roughly double. Confirm that it does.
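Before measuring, it helps to write down what the roofline model predicts. The sketch below uses the standard roofline formula with a made-up machine (10,000 GFLOP/s peak, 500 GB/s bandwidth) and a reduction that does one FLOP per element read; all of these numbers are placeholders to be replaced with your own hardware's figures.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_element: float, bytes_per_element: float) -> float:
    """Roofline model: min(compute peak, bandwidth * arithmetic intensity)."""
    intensity = flops_per_element / bytes_per_element  # FLOP per byte
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine; substitute measured values from Week 1.
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GBS = 500.0

predicted = {}
for name, nbytes in {"fp32": 4, "fp16/bf16": 2, "int8": 1}.items():
    predicted[name] = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 1.0, nbytes)
    print(f"{name}: {predicted[name]:.0f} GFLOP/s predicted for a 1 FLOP/element reduction")
```

The model treats the compute peak as fixed. In practice the peak also changes with precision, so a kernel that becomes compute-bound after the switch needs the low-precision peak in place of `PEAK_GFLOPS`.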

Expected baselines: bf16 with fp32 accumulation stays close to fp32 on well-conditioned work (relative error on the order of its \(2^{-8}\) rounding unit) with ~2× bandwidth benefit; fp16 needs loss scaling to avoid degradation in training; int8 gives the largest speedup with a measurable, bounded accuracy hit. Ill-conditioned cases degrade regardless, exactly as conditioning predicts.

Connections

This directly improves the Week 1 roofline and is the precision policy for Week 7’s custom kernels and the Week 10 Jetson deployment (where memory and power are tight). The quantization mathematics is shared with Course 6’s ADC quantization. Course 5’s conditioning theory is what lets you predict, rather than discover, where low precision is safe.