Week 7 — ML Efficiency and Custom GPU Kernels

Overview

This is the ML-systems-engineering heart of the course. You take the trained predictor from Week 5, treat it as a production workload, and make it fast: first by profiling to find where time actually goes, then by writing a custom kernel below the framework level that beats the framework’s own implementation. The discipline is measure-first: you do not optimize what you have not profiled. You will profile in PyTorch/JAX, identify a bottleneck primitive (softmax, LayerNorm, an attention-like reduction), implement it in Triton/CUDA, and compare the same primitive across CUDA (Jetson) and Metal (Mac).

This week ties Week 1’s execution model, Week 3’s precision policy, and Week 5’s model into one optimization story. The target role is ML frameworks and efficiency engineering, and this week is exactly that job: understand the framework, find the inefficiency, drop below it when warranted, and prove the win with numbers.

Readings

CUDA Book: occupancy, coalescing, shared memory, reductions, warp-level primitives, and tiled matmul. Extract: the optimization levers and how to apply them to a real kernel.
Triton docs: program IDs, masks, block loads/stores, and reductions. Extract: the block-programming model that makes custom kernels tractable.
MBT: threadgroup memory, compute pipelines, and performance tuning. Extract: the Metal path for the same primitive.
CA: parallel architectures.
CS231n: softmax/LayerNorm as the kernels under common models.

Key Concepts

Profile before optimizing

Use the framework profiler (PyTorch profiler / Nsight / Metal capture) to get a per-op time breakdown, kernel launch counts, and memory transfer time. The bottleneck is rarely where intuition says: often it’s launch overhead from many tiny ops, or a memory-bound elementwise chain that should be fused. Amdahl’s law bounds the achievable end-to-end speedup by the fraction of total time the target op consumes: if that fraction is \(p\), no kernel, however fast, can deliver more than \(1/(1-p)\) overall.
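The Amdahl bound can be checked with a few lines of arithmetic before any kernel is written. The sketch below assumes the profile gives you the fraction of runtime spent in the target op; the function names `amdahl_speedup` and `amdahl_ceiling` and the example numbers are illustrative, not measurements from the Week 5 model.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime is made `kernel_speedup`x faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # The untouched part keeps its time; the optimized part shrinks by kernel_speedup.
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def amdahl_ceiling(fraction: float) -> float:
    """Upper bound on end-to-end speedup as the kernel speedup grows without limit."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return float("inf") if fraction == 1.0 else 1.0 / (1.0 - fraction)


# Illustrative numbers only: the op takes 40% of runtime, the kernel is 4x faster.
overall = amdahl_speedup(0.40, 4.0)   # 1 / (0.6 + 0.1)
ceiling = amdahl_ceiling(0.40)        # 1 / 0.6
print(f"overall {overall:.3f}x, ceiling {ceiling:.3f}x")
```

If the ceiling is already too low to matter for the end-to-end latency target, that is a signal to pick a different op or to reduce launch count instead.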

Kernel fusion and why custom kernels win

A chain of small ops launched separately (e.g. softmax as max, subtract, exp, sum, divide, which mixes row reductions with elementwise steps) re-reads the tensor from DRAM on every step and writes intermediates back, so it is both bandwidth-bound and launch-heavy. A fused kernel does the whole sequence with one launch, reading the input once and writing the output once, provided the working set (here, one row) fits on-chip. This is the single biggest lever for memory-bound ML ops and the main reason hand-written kernels beat naive framework graphs.
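A rough traffic model makes the fusion argument concrete. The sketch below counts DRAM elements read and written for row-wise softmax under simplifying assumptions: each unfused op is its own kernel, nothing survives in cache between kernels, and a full row fits on-chip in the fused version. The function name `softmax_dram_bytes` and the tensor shape are hypothetical.

```python
def softmax_dram_bytes(n_rows: int, n_cols: int, bytes_per_elem: int = 2) -> dict:
    """Estimate DRAM traffic for row-wise softmax over an n_rows x n_cols tensor."""
    full = n_rows * n_cols   # elements in the tensor
    vec = n_rows             # elements in a per-row statistic (max or sum)

    # (elements read, elements written) for each separately launched kernel
    unfused_passes = {
        "max":      (full,       vec),
        "subtract": (full + vec, full),
        "exp":      (full,       full),
        "sum":      (full,       vec),
        "divide":   (full + vec, full),
    }
    unfused_elems = sum(r + w for r, w in unfused_passes.values())
    fused_elems = full + full   # read the input once, write the output once

    return {
        "unfused_bytes": unfused_elems * bytes_per_elem,
        "fused_bytes": fused_elems * bytes_per_elem,
        "traffic_ratio": unfused_elems / fused_elems,
    }


# Illustrative shape: 4096 rows of 1024 fp16 values.
est = softmax_dram_bytes(4096, 1024, bytes_per_elem=2)
print(est)
```

Under this model the ratio sits just above 4x. For a purely bandwidth-bound op that ratio is an optimistic estimate of the fusion speedup; real framework paths may already fuse some steps, so the measured gain can be smaller.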

Writing the kernel

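In Triton, you program at the block level: each program instance handles a tile, loads it with a mask that guards out-of-bounds elements, computes on-chip, and writes the result once. Triton’s compiler decides how the tile maps onto registers and shared memory; in CUDA you manage that placement yourself. Apply Week 1’s lessons (coalesce loads, use shared memory or warp-level primitives for the reduction, keep occupancy high) and Week 3’s precision policy (compute in fp16/bf16, accumulate in fp32).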
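The block-level structure can be rehearsed on the CPU before touching a GPU. The sketch below is plain Python, not Triton code: it emulates one program instance per row, a power-of-two block with a mask, and the max-subtraction from Week 3. The names `softmax_program` and `softmax_rows` are hypothetical, and Python’s 64-bit floats merely stand in for the higher-precision accumulator; the sketch does not model fp16/bf16 rounding.

```python
import math


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n; block widths are padded up to this size."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def softmax_program(x_flat, row, n_cols, block):
    """Emulate one program instance: it owns one row, loads once, stores once."""
    offsets = range(block)
    mask = [c < n_cols for c in offsets]
    # Masked load: out-of-bounds lanes become -inf, so they cannot win the max
    # and they contribute exp(-inf) = 0 to the sum.
    tile = [x_flat[row * n_cols + c] if m else -math.inf
            for c, m in zip(offsets, mask)]
    row_max = max(tile)                              # reduction 1
    numer = [math.exp(v - row_max) for v in tile]    # exponent is always <= 0
    denom = math.fsum(numer)                         # reduction 2
    # Masked store: only in-bounds lanes are written back.
    return [numer[c] / denom for c in offsets if mask[c]]


def softmax_rows(x_flat, n_rows, n_cols):
    """Run every row; on a GPU these iterations would be parallel programs."""
    block = next_power_of_2(n_cols)
    out = []
    for row in range(n_rows):
        out.extend(softmax_program(x_flat, row, n_cols, block))
    return out


# Two rows of three values; the large inputs show that subtracting the row max
# keeps exp() in range. n_cols = 3 pads to a block of 4, so one lane is masked.
y = softmax_rows([1000.0, 1001.0, 1002.0, 0.0, 0.0, 0.0], n_rows=2, n_cols=3)
print(y)
```

A reference like this is also a convenient oracle when checking the real kernel at awkward sizes, such as column counts that are not powers of two.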

CUDA vs Metal

The same primitive on both backends exposes architectural differences: warp vs SIMD-group width (an NVIDIA warp is 32 threads; query the Metal pipeline’s execution width rather than assuming it differs), shared vs threadgroup memory sizing, and dispatch overhead. The comparison builds intuition for portable performance and for what the Jetson can sustain in the capstone.

Theory Exercises

From a profile, apply Amdahl’s law to bound the end-to-end speedup from optimizing the target op; decide if it’s worth it.
Compute the DRAM traffic of unfused vs fused softmax over an \(N\times D\) tensor; predict the speedup from fusion.
Derive the occupancy of your kernel given its register and shared-memory usage; identify the limiting resource.
Explain why a naive softmax is numerically unsafe and how the max-subtraction (Week 3 stability) fixes it within the kernel.
Predict where CUDA and Metal will differ for this primitive based on warp/SIMD-group size and memory model.
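For the occupancy exercise, a small calculator helps check a hand derivation. The sketch below is a simplified model that ignores register and shared-memory allocation granularity. Every hardware limit is passed in explicitly, and the numbers in the example are placeholders rather than the Jetson’s real limits, which you should read from the device properties. The function name `occupancy` is hypothetical.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads_per_sm, regs_per_sm, smem_per_sm, max_blocks_per_sm):
    """Return (occupancy, limiting resource) for one SM under a simplified model."""
    limits = {
        "threads": max_threads_per_sm // threads_per_block,
        "registers": regs_per_sm // (regs_per_thread * threads_per_block),
        # A kernel that uses no shared memory is not limited by it.
        "shared_memory": (smem_per_sm // smem_per_block
                          if smem_per_block else max_blocks_per_sm),
        "block_slots": max_blocks_per_sm,
    }
    limiter = min(limits, key=limits.get)   # resource allowing the fewest blocks
    resident_blocks = limits[limiter]
    return resident_blocks * threads_per_block / max_threads_per_sm, limiter


# Placeholder limits and kernel footprint, for illustration only.
occ, limiter = occupancy(
    threads_per_block=256, regs_per_thread=64, smem_per_block=8192,
    max_threads_per_sm=2048, regs_per_sm=65536, smem_per_sm=49152,
    max_blocks_per_sm=32,
)
print(occ, limiter)   # registers allow 4 blocks here, so 4 * 256 / 2048
```

With these placeholder values, registers are the limiter; reducing register pressure would help more than shrinking shared-memory use. Treat the result as a first estimate and confirm it with the vendor’s occupancy tooling.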

Implementation

Profile the Week 5 predictor; pick the dominant primitive. Implement a fused Triton/CUDA kernel for it on the Jetson and a Metal version on the Mac. Verify against the framework reference within tolerance. Integrate the kernel back into the model and re-profile end to end.
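The verification step needs an explicit pass/fail rule. The sketch below uses the combined absolute and relative tolerance test, |actual - expected| <= atol + rtol * |expected|, which is the same rule `numpy.allclose` documents. The function name `matches_reference` is hypothetical, and the tolerance values in the example are placeholders to be replaced by your Week 3 error analysis.

```python
import math


def matches_reference(actual, expected, rtol: float, atol: float) -> bool:
    """True if every element of `actual` is within tolerance of `expected`."""
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        if math.isnan(a) or math.isnan(e):
            return False            # treat any NaN as a failure
        if math.isinf(a) or math.isinf(e):
            if a != e:              # infinities must match exactly, sign included
                return False
            continue
        if abs(a - e) > atol + rtol * abs(e):
            return False
    return True


# Placeholder tolerances for a reduced-precision kernel; tune per precision.
kernel_out = [0.0900, 0.2448, 0.6652]
reference = [0.09003057, 0.24472847, 0.66524096]
print(matches_reference(kernel_out, reference, rtol=1e-3, atol=1e-3))
```

Failing on NaN rather than ignoring it is a deliberate choice here: a kernel that silently produces NaN on masked lanes should not pass verification.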

Benchmark

Per kernel: GB/s, GFLOP/s, and speedup vs the framework op, across precisions (Week 3) and problem sizes (Week 1 roofline). End-to-end: model latency before/after integration. CUDA vs Metal: same primitive, normalized comparison.
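The per-kernel numbers reduce to a few divisions once you have timings. The sketch below assumes you supply the bytes moved, the FLOP count, and repeated wall-clock samples for both the custom kernel and the framework op; it takes the median to damp outliers. The function name `kernel_metrics` and all values in the example are hypothetical, not measured results.

```python
import statistics


def kernel_metrics(bytes_moved, flops, timings_s, baseline_timings_s,
                   peak_gb_per_s=None):
    """Summarize one kernel: effective bandwidth, throughput, and speedup."""
    t = statistics.median(timings_s)                 # custom kernel, seconds
    t_base = statistics.median(baseline_timings_s)   # framework op, seconds
    metrics = {
        "gb_per_s": bytes_moved / t / 1e9,
        "gflop_per_s": flops / t / 1e9,
        "speedup": t_base / t,
    }
    if peak_gb_per_s is not None:
        # Fraction of the roofline's bandwidth ceiling actually achieved.
        metrics["fraction_of_peak_bw"] = metrics["gb_per_s"] / peak_gb_per_s
    return metrics


# Hypothetical inputs: a 4096 x 1024 fp16 softmax (read once, write once),
# a rough count of 5 FLOPs per element, and made-up timing samples.
m = kernel_metrics(
    bytes_moved=2 * 4096 * 1024 * 2,
    flops=5 * 4096 * 1024,
    timings_s=[2.1e-4, 2.0e-4, 2.3e-4],
    baseline_timings_s=[8.4e-4, 8.2e-4, 9.0e-4],
    peak_gb_per_s=100.0,
)
print(m)
```

Reporting the fraction of peak bandwidth alongside the speedup keeps the CUDA vs Metal comparison normalized: each backend is judged against its own ceiling rather than against the other’s raw numbers.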

Expected baselines: the fused kernel substantially beats the unfused framework path on the memory-bound primitive; end-to-end speedup tracks Amdahl’s prediction; CUDA and Metal differ in line with their warp/SIMD-group and memory characteristics.

Connections

This is the capstone of the ML-systems phase and the clearest portfolio artifact for an efficiency-engineering role. It uses Week 1 (execution model), Week 3 (precision), and Week 5 (the model). The profiling-and-budgeting mindset carries directly into deciding what runs on the Jetson in Week 10.