Week 1 — GPU Compute First Principles: CUDA and Metal
Overview
Every ML system that matters runs on a parallel accelerator, and the difference between a fast pipeline and a slow one is almost always whether the engineer understands the execution model underneath the framework. This week establishes that model from first principles on both platforms you own: CUDA on the Jetson Orin Nano and Metal on the Mac. The goal is not to memorize APIs but to internalize the mental model — thousands of threads grouped into blocks/threadgroups, a memory hierarchy where bandwidth is the scarce resource, and a launch cost that punishes tiny kernels. By the end you should be able to write vector_add, saxpy, dot_product, and a reduction, and explain every performance number you measure.
This is the platform the rest of the course’s systems work stands on. Week 3’s mixed-precision tradeoffs, Week 7’s custom kernels, and the inference budget of the Jetson capstone all assume you can reason about occupancy, coalescing, and memory traffic. The linear-algebra and floating-point foundations are assumed from Course 5 — here we apply them to silicon.
Readings
- CUDA Book (Motta): threads, blocks, grids, global vs shared memory, synchronization, and reductions. Extract: the thread/block/grid hierarchy and why shared memory exists.
- MBT (Metal by Tutorials): the compute pipeline, buffers, threadgroups, and dispatching. Extract: the Metal analogues of CUDA concepts (threadgroup ≈ block, threadgroup memory ≈ shared memory).
- CA (Computer Architecture): memory hierarchy and parallel architectures. Extract: why DRAM bandwidth and latency dominate, and what a cache line buys you.
- Optional CS231n skim: CNN workloads as matrix/conv workloads — motivation for why these primitives matter.
Key Concepts
The execution hierarchy
A kernel launches a grid of blocks (CUDA) or threadgroups (Metal), each containing many threads. Threads within a block execute in lockstep groups (warps of 32 on NVIDIA, SIMD-groups of 32 on Apple) and can share fast on-chip memory and synchronize cheaply; threads in different blocks cannot. This hierarchy is the entire reason GPU code is structured the way it is: data that must be shared and synchronized has to live within a block.
The memory hierarchy and bandwidth wall
Registers → shared/threadgroup memory → L2 → global (DRAM), in increasing size and decreasing speed. Most ML kernels are memory-bound: limited by how fast bytes move, not by arithmetic. The key metric is arithmetic intensity (FLOPs per byte). The roofline model says performance is min(peak_FLOPs, intensity × bandwidth) — below a threshold intensity you are bandwidth-limited no matter how fast the ALUs are. vector_add (one add per 12 bytes) is hopelessly bandwidth-bound; tiled matmul can be compute-bound.
Coalescing and occupancy
Coalesced access — consecutive threads reading consecutive addresses — lets the hardware service a warp with a few wide memory transactions; strided or random access wastes most of each transaction. Occupancy is the ratio of active warps to the maximum; enough warps in flight lets the scheduler hide memory latency by switching to ready warps. Both are levers you will measure directly.
Reductions and the launch-overhead floor
A reduction (sum, max) is the canonical non-trivial parallel pattern: tree-reduce within a block using shared memory, then combine block partials. It exposes synchronization, shared memory, and the fact that a kernel launch has a fixed cost (microseconds) that makes tiny launches inefficient — the reason small-batch inference often loses to CPU.
Theory Exercises
- Compute the arithmetic intensity (FLOPs/byte) of
vector_add,saxpy,dot_product, and naivematmul. Predict which are memory-bound. - Using a roofline with the Jetson Orin Nano’s peak bandwidth and FLOPs, predict the achievable throughput of each kernel.
- Explain why a warp of 32 threads reading a column of a row-major matrix is uncoalesced, and how to fix it.
- Derive the step count and shared-memory traffic of a tree reduction over
nelements in a block ofbthreads. - Estimate the break-even problem size where GPU beats CPU given a fixed launch overhead.
Implementation
Write the four kernels in three backends: CPU C++ (baseline), CUDA (Jetson), and Metal (Mac). Keep a single host harness that fills inputs, runs each backend, and verifies results against the CPU reference within a floating-point tolerance. For the reduction, implement the shared-memory tree version and a naive global-atomic version to contrast them.
Benchmark
Sweep five problem sizes spanning L2-resident to DRAM-bound. Record wall time, derived GB/s, and GFLOP/s for each kernel × backend, plus CPU↔︎GPU transfer time separately (it often dominates for small sizes). Plot achieved GB/s against the roofline. Note where launch overhead, transfer cost, or bandwidth is the binding constraint.
Expected baselines: vector_add/saxpy saturate bandwidth well below peak FLOPs; the shared-memory reduction beats the atomic version by a wide margin; small sizes are dominated by launch + transfer, confirming the CPU break-even prediction.
Connections
This execution model is the substrate for Week 3 (mixed precision changes bytes-per-element, hence bandwidth) and Week 7 (writing a fused kernel to beat the framework). The roofline and break-even reasoning recur whenever you decide what runs on the Jetson in the Week 10 capstone. Course 5 supplied the linear algebra and floating-point model; this week is where those abstractions meet measured bandwidth.