Week 3 — Coding Projects
Core
Implement dense matmul at multiple optimization levels.
- NumPy: Write a naive triple-loop matmul and a blocked CPU matmul. Compare both against numpy.matmul. Implement Gaussian elimination or LU decomposition for small dense matrices.
- Metal: Write a naive compute matmul, then a tiled/threadgroup-memory version. Benchmark both. · Reading: MBT — compute pipelines, threadgroups, threadgroup memory, performance-oriented compute examples.
- Vulkan: Write a compute shader matmul, then a tiled/shared-memory-style version. · Reading: Vulkan Book — compute pipelines, storage buffers, synchronization basics, compute dispatch structure.
- CUDA: Write a naive matmul kernel, then a tiled shared-memory version. · Reading: CUDA Book — shared memory, thread blocks, memory coalescing, tiled matrix multiplication.
- Stretch: Benchmark square and rectangular cases. Add simple performance plots.
- Verify: Max absolute error versus numpy.matmul is within a small tolerance (e.g. ≤ 1e-6 for float32, far smaller for float64) · Tiled version outperforms naive for large enough matrices · Non-square and non-tile-divisible cases handled correctly.
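For the NumPy item, a minimal sketch of the naive and blocked versions. The block size `bs=32` is an arbitrary starting point, not a tuned value; the blocked loop structure mirrors the shared-memory tiling the GPU versions will use:

```python
import numpy as np

def matmul_naive(A, B):
    """Naive triple-loop matmul: C[i, j] = sum_p A[i, p] * B[p, j]."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_blocked(A, B, bs=32):
    """Blocked matmul: accumulate bs x bs tile products for cache reuse.

    The per-tile product uses @ here; a C/CUDA version would write out
    the inner tile loops explicitly. NumPy slicing clamps at the edges,
    so non-tile-divisible shapes work without special cases.
    """
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i0 in range(0, m, bs):
        for j0 in range(0, n, bs):
            for p0 in range(0, k, bs):
                C[i0:i0+bs, j0:j0+bs] += A[i0:i0+bs, p0:p0+bs] @ B[p0:p0+bs, j0:j0+bs]
    return C
```

Testing on rectangular shapes from the start (e.g. 17×23 times 23×11) catches the edge-handling bugs the verify step asks about.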
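For the Gaussian elimination / LU item, one reasonable shape for the exercise is LU with partial pivoting, returning P, L, U such that P @ A = L @ U (the pivoting strategy is a choice, not the only valid one):

```python
import numpy as np

def lu_decompose(A):
    """LU factorization with partial pivoting: P @ A = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Pivot: bring the largest-magnitude entry of column k into row k.
        piv = k + np.argmax(np.abs(U[k:, k]))
        if piv != k:
            U[[k, piv], :] = U[[piv, k], :]
            P[[k, piv], :] = P[[piv, k], :]
            L[[k, piv], :k] = L[[piv, k], :k]  # swap only the filled-in part
        # Eliminate below the pivot, recording multipliers in L.
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return P, L, U
```

Verifying `np.allclose(P @ A, L @ U)` on random small matrices is a quick correctness check before benchmarking anything.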
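For the stretch benchmarking, a small timing harness is enough to feed the plots; best-of-N with `time.perf_counter` is a common convention (the `reps=3` default here is an assumption, tune it per machine):

```python
import time
import numpy as np

def bench(fn, A, B, reps=3):
    """Return the best wall-clock time (seconds) of `reps` calls to fn(A, B)."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(A, B)
        best = min(best, time.perf_counter() - t0)
    return best
```

Run it over a sweep of square and rectangular shapes for each implementation, then plot time (or GFLOP/s, using 2*m*n*k flops) against size.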