Week 16 — Coding Projects
Core
Optimize matmul for locality and throughput.
- NumPy: Compare naive triple loops, blocked loops, and NumPy's BLAS-backed matmul. Estimate arithmetic intensity.
- Metal: Tiled matmul tuned with careful use of threadgroup memory; sweep tile sizes. · Reading: MBT — threadgroup memory, compute performance tuning, buffer layout.
- Vulkan: Tiled compute matmul tuned via workgroup-shape experiments. · Reading: Vulkan Book — compute tuning, workgroup shape experiments, storage buffer layout.
- CUDA: Shared-memory tiled matmul exploring occupancy vs. bandwidth tradeoffs. · Reading: CUDA Book — optimized GEMM foundations, occupancy vs bandwidth tradeoffs.
- Stretch: Add a transposed-operand variant. Explore alignment/padding effects. Run a tile-size sweep.
- Verify: Correctness versus NumPy · Performance changes with tile size · A conceptual explanation of how tiling reduces global memory traffic.
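The NumPy track above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function names (`matmul_naive`, `matmul_blocked`, `arithmetic_intensity`) and the block size default are arbitrary choices, and the blocked version delegates the inner tile product to NumPy so the sketch runs in reasonable time while still showing the tiling structure.

```python
import numpy as np

def matmul_naive(A, B):
    """Triple-loop matmul: one scalar multiply-add per iteration, poor locality."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * B[p, j]
            C[i, j] = s
    return C

def matmul_blocked(A, B, bs=32):
    """Loop over square tiles so each loaded tile of A and B is reused
    many times before eviction (the same idea as GPU shared-memory tiling)."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                # Inner tile product done by NumPy/BLAS; a pure-Python inner
                # loop would show the same blocking but run very slowly.
                C[i0:i0 + bs, j0:j0 + bs] += (
                    A[i0:i0 + bs, p0:p0 + bs] @ B[p0:p0 + bs, j0:j0 + bs]
                )
    return C

def arithmetic_intensity(n, dtype=np.float32):
    """FLOPs per byte for an n×n×n matmul, assuming each of A, B, and C
    crosses the memory bus exactly once (the best case)."""
    flops = 2 * n ** 3  # one multiply + one add per inner-product term
    bytes_moved = 3 * n * n * np.dtype(dtype).itemsize
    return flops / bytes_moved
```

For the tile-size sweep, time `matmul_blocked` over a range of `bs` values on a fixed problem size and compare against `A @ B`; comparing `arithmetic_intensity` for small vs. large `n` also shows why bigger matmuls are compute-bound while small ones stay bandwidth-bound.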