Week 10 — Capstone: Performance Engineering and `syslens`

Overview

The capstone unites the course in two moves. First, performance engineering: the discipline of measuring before optimizing, understanding compiler optimizations, profiling with perf, and quantifying the cache and branch effects that earlier weeks exposed. Second, syslens: a Linux systems-observability tool in C/C++ that reads /proc and /sys to report processes, memory maps, open file descriptors, network sockets, disk usage, and devices, exercising every layer of the course in one portfolio artifact. The performance week and the capstone are combined because the tool is also the thing you profile and optimize.

A reviewer should see a working observability tool plus a benchmark report that demonstrates measurement discipline: a profile that found the real bottleneck, an optimization justified by data, and an honest before/after. This is the staff-level systems-engineering story the course builds toward.

Readings

HLW Ch. 15–16: development tools and compiling from C source. Extract: the profiling/build toolchain.
CA advanced-CPU-design and memory-hierarchy sections: pipelining, branch prediction, superscalar execution. Extract: the microarchitectural effects performance work targets.
ARM assembly chapters (review) and prior course notes. Extract: reading optimized assembly to understand what the compiler did.

Key Concepts

Measure before optimizing

The cardinal rule: profile to find where time actually goes; never optimize on intuition. Amdahl’s law bounds the payoff by the fraction of time the target consumes: a function that accounts for 10% of run time can never yield more than about a 1.11x end-to-end speedup, however fast it becomes. perf stat reports hardware-counter totals (cycles, cache misses, branch mispredictions), perf record with perf report gives a function-level profile, and hyperfine gives a robust wall-clock comparison. The workflow (profile, hypothesize, change, re-measure) is the entire point.
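The Amdahl bound can be written down directly. The sketch below is illustrative only; the function names amdahl_speedup and amdahl_limit are hypothetical and not part of any course library.

```c
#include <assert.h>

/* Amdahl's law: if a fraction p of total run time is made s times faster,
 * the end-to-end speedup is 1 / ((1 - p) + p / s).
 * p must be in [0, 1] and s must be positive. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound as s grows without limit: the untouched (1 - p) part
 * still has to run, so the speedup can never exceed 1 / (1 - p). */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

As a made-up example, if a profile attributed 40% of run time to parsing and a rewrite made parsing twice as fast, amdahl_speedup(0.4, 2.0) gives 1.25x overall, and amdahl_limit(0.4) shows that even an infinitely fast parser would cap the gain at roughly 1.67x.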

Compiler optimizations

-O2/-O3 apply inlining, loop unrolling, vectorization, constant folding, strength reduction, and register allocation. Reading the -O0 vs -O2 disassembly (Weeks 4–5) shows exactly what changed. Most of this optimization happens on LLVM IR, before the backend selects instructions for either ISA: opt -O2 transforms the IR (compare the IR before and after to see a pass’s effect), and only then does the backend lower it to ARM64 or x86-64. Note that clang marks functions optnone at -O0, so IR meant to be fed to opt should be emitted with -Xclang -disable-O0-optnone; otherwise opt leaves it untouched. So inspect three layers: the -O0 vs -O2 IR (clang -emit-llvm/opt) for the largely target-independent optimizations, then the ARM64 assembly and the x86-64 assembly (llc) for target-specific codegen and vectorization (NEON on ARM64 vs SSE/AVX on x86-64). Comparing codegen quality across the two targets for the same hot loop is a sharp way to see what the compiler does well on each. Understanding these lets you write code the compiler can optimize, and recognize when it can’t (e.g. possible pointer aliasing blocking vectorization, or the virtual calls from Week 6 blocking inlining).
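A small hot loop makes the aliasing point concrete. The two functions below are a hypothetical example (the names scale_add and scale_add_restrict are invented here); they compute the same result, but only the second tells the compiler that the arrays do not overlap. How a given compiler version treats each one is best checked in the IR and disassembly rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Without restrict the compiler must assume dst and src may overlap, so it
 * typically either adds a runtime overlap check before a vectorized loop
 * or falls back to scalar code. */
void scale_add(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}

/* restrict is the caller's promise that dst and src never overlap,
 * which removes the aliasing obstacle to vectorization. */
void scale_add_restrict(float *restrict dst, const float *restrict src,
                        float k, size_t n)
{
    /* Debug-only sanity check; it catches exact aliasing, not partial overlap. */
    assert(n == 0 || dst != src);
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Compiling this file with clang -O2 -S -emit-llvm shows the optimized IR, and adding -Rpass=loop-vectorize or -Rpass-missed=loop-vectorize asks clang to report which loops were and were not vectorized.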

Microarchitectural effects

The cache hierarchy (Week 8), branch prediction, and pipelining (previewed in Week 3) determine real performance. The levers are cache misses (stride and layout, Weeks 6 and 8), branch mispredictions (unpredictable branches vs branchless conditional select: csel on ARM64, cmov on x86-64), and instruction-level parallelism. perf’s hardware counters expose these on both ISAs through the same generic events (cycles, cache-misses, branch-misses). The perf workflow is the same on an ARM64 Jetson and an x86-64 box, although which events are available and exactly what they count depends on each CPU’s PMU. Either way, the effects become quantifiable rather than theoretical.
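The branchy vs branchless contrast can be sketched with a threshold sum. This is an illustrative pair of functions with hypothetical names; the branchless version uses an arithmetic mask rather than relying on the compiler to pick csel or cmov. An optimizing compiler may already turn the first version into a conditional select or vector code, so the disassembly and perf stat branch-misses counts are the real evidence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy: on unpredictable data the `if` is hard for the branch
 * predictor, and each misprediction flushes part of the pipeline. */
int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t threshold)
{
    assert(v != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold)
            sum += v[i];
    }
    return sum;
}

/* Branchless: turn the comparison result (0 or 1) into a mask of all
 * zeros or all ones and AND it with the value, so the loop body has
 * no data-dependent branch. */
int64_t sum_above_branchless(const int32_t *v, size_t n, int32_t threshold)
{
    assert(v != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(v[i] > threshold); /* 0 or -1 (all ones) */
        sum += (int64_t)v[i] & mask;
    }
    return sum;
}
```

The branchless form does the add on every element, so it tends to win only when the branch is genuinely unpredictable; on sorted or heavily skewed data the branchy loop may be just as fast.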

syslens: the integrating artifact

syslens reads /proc/[pid]/{stat,status,maps,fd} (Weeks 7–8), /proc/net and socket info (Week 9), /sys device attributes (Week 9), and filesystem/disk usage (Week 8), presenting a coherent system view — a small htop/lsof/ss in one tool. It is written in performance-conscious C/C++ (Week 6 layout, Week 8 I/O) and is itself profiled and optimized as the performance exercise.
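One parsing detail worth sketching is /proc/[pid]/stat, where the command name sits in parentheses and may itself contain spaces or parentheses, so splitting on whitespace is unreliable. The sketch below is one possible approach, not the required syslens design: the struct stat_fields and the function parse_stat_line are hypothetical names, and it extracts only a handful of fields, using the field order documented in proc(5).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of /proc/[pid]/stat fields. */
struct stat_fields {
    int pid;
    char comm[64];
    char state;
    int ppid;
    unsigned long utime; /* user-mode time in clock ticks (field 14) */
    unsigned long stime; /* kernel-mode time in clock ticks (field 15) */
};

/* Parses one stat line. Returns 0 on success, -1 if the line is malformed. */
int parse_stat_line(const char *line, struct stat_fields *out)
{
    assert(line != NULL && out != NULL);

    /* comm runs from the first '(' to the LAST ')' on the line. */
    const char *lparen = strchr(line, '(');
    const char *rparen = strrchr(line, ')');
    if (!lparen || !rparen || rparen < lparen)
        return -1;
    if (sscanf(line, "%d", &out->pid) != 1)
        return -1;

    size_t len = (size_t)(rparen - lparen - 1);
    if (len >= sizeof out->comm)
        len = sizeof out->comm - 1;
    memcpy(out->comm, lparen + 1, len);
    out->comm[len] = '\0';

    /* After ')': state, ppid, then nine skipped fields
     * (pgrp, session, tty_nr, tpgid, flags, minflt, cminflt, majflt,
     * cmajflt), then utime and stime. */
    int n = sscanf(rparen + 1,
                   " %c %d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
                   &out->state, &out->ppid, &out->utime, &out->stime);
    return n == 4 ? 0 : -1;
}
```

Dividing utime and stime by sysconf(_SC_CLK_TCK) converts the tick counts to seconds.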

Theory Exercises

Apply Amdahl’s law: given a profile, bound the end-to-end speedup from optimizing the hottest function and decide whether it’s worth it.
Read -O0 vs -O2 disassembly of a hot loop; identify three specific optimizations the compiler applied (Weeks 4–5).
Interpret a perf stat output: relate cache-miss and branch-misprediction rates to the Week 6/8 layout and Week 5 control-flow choices.
Explain how an unpredictable branch hurts the pipeline and when a branchless csel rewrite helps.
Design the data model and the /proc and /sys parsing plan for syslens; identify which course week supplies each data source.

Implementation

Build syslens (labs/week10-performance-capstone): parse /proc and /sys to report processes (CPU/mem from stat/status), memory maps (maps/smaps), open FDs, network sockets, disk usage, and devices, with a clean CLI. Then profile syslens with perf, find a bottleneck (parsing, allocation, I/O), apply a data-justified optimization (e.g. buffered I/O from Week 8, layout from Week 6, fewer allocations), and re-measure.
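If the profile points at I/O or allocation, one candidate optimization is to read each small /proc file into a reusable caller-owned buffer with plain read calls instead of allocating per line. The helper below is a sketch under that assumption; read_small_file is a hypothetical name, and whether it actually helps has to be shown by re-measuring.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads at most cap - 1 bytes of the file at path into buf and
 * NUL-terminates it. Returns the number of bytes stored, or -1 on error.
 * Longer files are silently truncated to fit the buffer. */
ssize_t read_small_file(const char *path, char *buf, size_t cap)
{
    assert(path != NULL && buf != NULL && cap > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t used = 0;
    while (used < cap - 1) {
        ssize_t n = read(fd, buf + used, cap - 1 - used);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: retry */
            close(fd);
            return -1;
        }
        if (n == 0)
            break; /* end of file */
        used += (size_t)n;
    }

    close(fd);
    buf[used] = '\0';
    return (ssize_t)used;
}
```

A caller can keep one stack or per-thread buffer and reuse it for every process, which removes heap allocation from the per-PID path. The loop matters because a single read is not guaranteed to return the whole file.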

Measurement / Inspection

Produce a benchmark report: perf stat and perf record profiles of syslens before and after optimization, with the bottleneck identified and the speedup quantified. Cross-check syslens output against ps, lsof, ss, and df for correctness. Include a microbenchmark or two (e.g. parsing throughput) and an -O0 vs -O2 comparison of a hot path.
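For the microbenchmark, a minimal timing harness around clock_gettime is often enough. The sketch below assumes a POSIX system; bench_ns_per_call, struct text, and count_newlines are hypothetical names, and the newline counter is only a stand-in for a real parsing hot path.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* exposes clock_gettime under strict -std=c11 */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock in nanoseconds; unaffected by wall-clock adjustments. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Calls fn(arg) once as a warm-up, then iters more times under the timer.
 * Returns the average nanoseconds per timed call. */
double bench_ns_per_call(void (*fn)(void *), void *arg, unsigned iters)
{
    assert(fn != NULL && iters > 0);
    fn(arg); /* warm-up: fault in pages and warm the caches */
    uint64_t start = now_ns();
    for (unsigned i = 0; i < iters; i++)
        fn(arg);
    uint64_t end = now_ns();
    return (double)(end - start) / (double)iters;
}

/* Stand-in workload: count newlines in a buffer. The result is stored
 * in the struct so the work stays observable to the caller. */
struct text {
    const char *data;
    size_t len;
    size_t newlines;
};

void count_newlines(void *p)
{
    struct text *t = p;
    size_t n = 0;
    for (size_t i = 0; i < t->len; i++)
        n += (t->data[i] == '\n');
    t->newlines = n;
}
```

Dividing the buffer length by the returned time gives bytes per nanosecond, a usable parsing-throughput figure for the report. Repeating the measurement several times and reporting the spread, or cross-checking with hyperfine on the whole binary, guards against reading noise as a speedup.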

Expected baselines: syslens output matches the standard tools; profiling reveals a concrete bottleneck (often I/O or allocation); the data-justified optimization yields a measurable, Amdahl-consistent speedup; the report honestly shows before/after with the reasoning.

Connections

This capstone composes every week: bits/arithmetic (Week 1), the CPU/assembly model (Weeks 3–5), C/C++ layout (Week 6), processes and /proc (Week 7), virtual memory/caches/I/O (Week 8), and networking/devices (Week 9). The measurement discipline transfers directly to Course 1’s ML-systems profiling and Course 2’s engine performance work. The optional Weeks 11+ directions extend toward kernel internals, compilers, embedded/real-time, or security.