Week 3 — Coding Projects
Core
Implement dense matmul at multiple optimization levels.
- NumPy: Write a naive triple-loop matmul and a blocked CPU matmul. Compare both against numpy.matmul. Implement Gaussian elimination or LU decomposition for small dense matrices.
- Metal: Write a naive compute matmul, then a tiled/threadgroup-memory version. Benchmark both. · Reading: MBT — compute pipelines, threadgroups, threadgroup memory, performance-oriented compute examples.
- Vulkan: Write a compute shader matmul, then a tiled/shared-memory-style version. · Reading: Vulkan Book — compute pipelines, storage buffers, synchronization basics, compute dispatch structure.
- CUDA: Write a naive matmul kernel, then a tiled shared-memory version. · Reading: CUDA Book — shared memory, thread blocks, memory coalescing, tiled matrix multiplication.
- Stretch: Benchmark square and rectangular cases. Add simple performance plots.
- Verify: Max absolute error versus numpy.matmul is within a small tolerance (e.g. ≤ 1e-6 for float32, far smaller for float64) · Tiled version outperforms naive for large enough matrices · Non-square and non-tile-divisible cases handled correctly.
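For the NumPy item, a minimal sketch of the naive and blocked versions. The block size `bs=32` is an arbitrary starting point, not a tuned value; the blocked loop structure mirrors the shared-memory tiling the GPU versions will use:

```python
import numpy as np

def matmul_naive(A, B):
    """Naive triple-loop matmul: C[i, j] = sum_p A[i, p] * B[p, j]."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_blocked(A, B, bs=32):
    """Blocked matmul: accumulate bs x bs tile products for cache reuse.

    The per-tile product uses @ here; a C/CUDA version would write out
    the inner tile loops explicitly. NumPy slicing clamps at the edges,
    so non-tile-divisible shapes work without special cases.
    """
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i0 in range(0, m, bs):
        for j0 in range(0, n, bs):
            for p0 in range(0, k, bs):
                C[i0:i0+bs, j0:j0+bs] += A[i0:i0+bs, p0:p0+bs] @ B[p0:p0+bs, j0:j0+bs]
    return C
```

Testing on rectangular shapes from the start (e.g. 17×23 times 23×11) catches the edge-handling bugs the verify step asks about.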
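For the Gaussian elimination / LU item, one reasonable shape for the exercise is LU with partial pivoting, returning P, L, U such that P @ A = L @ U (the pivoting strategy is a choice, not the only valid one):

```python
import numpy as np

def lu_decompose(A):
    """LU factorization with partial pivoting: P @ A = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Pivot: bring the largest-magnitude entry of column k into row k.
        piv = k + np.argmax(np.abs(U[k:, k]))
        if piv != k:
            U[[k, piv], :] = U[[piv, k], :]
            P[[k, piv], :] = P[[piv, k], :]
            L[[k, piv], :k] = L[[piv, k], :k]  # swap only the filled-in part
        # Eliminate below the pivot, recording multipliers in L.
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return P, L, U
```

Verifying `np.allclose(P @ A, L @ U)` on random small matrices is a quick correctness check before benchmarking anything.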
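For the stretch benchmarking, a small timing harness is enough to feed the plots; best-of-N with `time.perf_counter` is a common convention (the `reps=3` default here is an assumption, tune it per machine):

```python
import time
import numpy as np

def bench(fn, A, B, reps=3):
    """Return the best wall-clock time (seconds) of `reps` calls to fn(A, B)."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(A, B)
        best = min(best, time.perf_counter() - t0)
    return best
```

Run it over a sweep of square and rectangular shapes for each implementation, then plot time (or GFLOP/s, using 2*m*n*k flops) against size.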