Week 16 — Coding Projects
Core
Optimize matmul for locality and throughput.
- NumPy: Compare naive triple loops, blocked loops, and NumPy's BLAS-backed matmul. Estimate arithmetic intensity.
- Metal: Tiled matmul tuned with careful use of threadgroup memory; sweep tile sizes. · Reading: MBT — threadgroup memory, compute performance tuning, buffer layout.
- Vulkan: Tiled compute matmul tuned via workgroup-shape experiments. · Reading: Vulkan Book — compute tuning, workgroup shape experiments, storage buffer layout.
- CUDA: Shared-memory tiled matmul exploring occupancy vs. bandwidth tradeoffs. · Reading: CUDA Book — optimized GEMM foundations, occupancy vs bandwidth tradeoffs.
- Stretch: Add a transposed-operand variant. Explore alignment/padding effects. Run a tile-size sweep.
- Verify: Correctness versus NumPy · Performance changes with tile size · A conceptual explanation of how tiling reduces global memory traffic.
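The NumPy track above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function names (`matmul_naive`, `matmul_blocked`, `arithmetic_intensity`) and the block size default are arbitrary choices, and the blocked version delegates the inner tile product to NumPy so the sketch runs in reasonable time while still showing the tiling structure.

```python
import numpy as np

def matmul_naive(A, B):
    """Triple-loop matmul: one scalar multiply-add per iteration, poor locality."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * B[p, j]
            C[i, j] = s
    return C

def matmul_blocked(A, B, bs=32):
    """Loop over square tiles so each loaded tile of A and B is reused
    many times before eviction (the same idea as GPU shared-memory tiling)."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                # Inner tile product done by NumPy/BLAS; a pure-Python inner
                # loop would show the same blocking but run very slowly.
                C[i0:i0 + bs, j0:j0 + bs] += (
                    A[i0:i0 + bs, p0:p0 + bs] @ B[p0:p0 + bs, j0:j0 + bs]
                )
    return C

def arithmetic_intensity(n, dtype=np.float32):
    """FLOPs per byte for an n×n×n matmul, assuming each of A, B, and C
    crosses the memory bus exactly once (the best case)."""
    flops = 2 * n ** 3  # one multiply + one add per inner-product term
    bytes_moved = 3 * n * n * np.dtype(dtype).itemsize
    return flops / bytes_moved
```

For the tile-size sweep, time `matmul_blocked` over a range of `bs` values on a fixed problem size and compare against `A @ B`; comparing `arithmetic_intensity` for small vs. large `n` also shows why bigger matmuls are compute-bound while small ones stay bandwidth-bound.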