Week 4 — Assembly Basics on Two ISAs: ARM64, x86-64, and the LLVM IR Bridge

Overview

Now you move from your toy CPU (Week 3) to real ones. ARM64 (AArch64) is the primary ISA you work through. It is the architecture in your MacBook (Apple Silicon), Jetson, and Raspberry Pi, and it is a clean RISC load/store design with fixed 32-bit instructions and 31 general-purpose registers. That makes it an excellent teaching ISA: regular, orthogonal, and ubiquitous. The goal is to read compiler output and write small functions in ARM64 assembly so the rest of the course can show you exactly how high-level code becomes machine instructions.

Then you meet x86-64, the CISC counterpart to ARM64 and the architecture in most servers and desktops, and, crucially, LLVM IR, the largely architecture-neutral intermediate representation that sits between C and both ISAs. The unifying idea for the whole course is this: one C function compiles to one LLVM IR module, which the backend then lowers to either ARM64 or x86-64. Seeing the same program in IR and in both assemblies is the clearest way to understand what is fundamental (the computation) versus what is ISA convention (the specific instructions and registers). This is the same fetch/decode/execute model from Week 3, now instantiated on hardware you can run and debug.

Readings

ARM (Plantz) Ch. 10: programming in assembly language. Extract: the assembler workflow, register conventions, and basic instructions.
ARM Ch. 12: instruction details. Extract: the load/store model, addressing modes, and the common arithmetic/move instructions.
CS:APP (Bryant & O’Hallaron) Ch. 3: machine-level representation of programs (x86-64). Extract: the x86-64 register set, register-memory operands, mov/lea, and AT&T vs Intel syntax.
LLVM Language Reference (skim): types, instructions, and SSA form. Extract: what IR looks like and why it is target-independent.
CA embedded-architecture sections (skim): the RISC philosophy. Extract: why load/store + fixed-width instructions simplify the pipeline.

Key Concepts

Registers and the load/store model

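AArch64 has 31 general registers, usable as 64-bit (x0–x30) or 32-bit (w0–w30) views, plus the zero register (xzr/wzr), the stack pointer (sp), and a program counter that cannot be named as a general register. Writing a w register zeroes the upper 32 bits of the matching x register. It is a load/store architecture: arithmetic operates only on registers, and memory is touched only by explicit loads and stores (ldr/str and their variants). This separation of compute from memory access is the defining RISC trait and makes performance reasoning (Week 8/10) cleaner, because you can see every memory access.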
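The x/w relationship can be modeled in a few lines of C. This is only an illustrative sketch: the reg64 type and the read_w/write_w/write_x helpers are hypothetical names invented here, not part of any toolchain. It shows that a w read sees the low 32 bits and that a w write clears the upper half.

```c
#include <stdint.h>

/* Hypothetical model of one AArch64 general register: 64 bits of storage. */
typedef struct { uint64_t bits; } reg64;

/* Writing the x view replaces all 64 bits. */
static reg64 write_x(uint64_t value) {
    reg64 r = { value };
    return r;
}

/* Reading the w view returns the low 32 bits of the register. */
static uint32_t read_w(reg64 r) {
    return (uint32_t)r.bits;
}

/* Writing the w view zero-extends: whatever was in the upper 32 bits
 * of the old register contents is discarded. */
static reg64 write_w(reg64 old, uint32_t value) {
    (void)old;                       /* old upper bits do not survive */
    reg64 r = { (uint64_t)value };
    return r;
}
```

When single-stepping in a debugger, this is the behavior to watch for: after an instruction that writes w0, the displayed x0 should have zeros in its top half.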

Core instructions and addressing modes

Moves (mov, movz/movk to build a 64-bit constant in pieces), arithmetic (add, sub, mul; the adds/subs variants set flags, whereas plain add and sub do not), logical/shift ops, and ldr/str with addressing modes (base, base+offset, pre-/post-indexed). Each movz/movk carries only a 16-bit immediate, so building an arbitrary 64-bit immediate takes up to four of them: a direct consequence of fixed 32-bit instruction width (Week 1/3 encoding tradeoff).
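To make the movz/movk sequence concrete, here is a small C model of what the two instructions do to a register. The function names mirror the mnemonics but are otherwise hypothetical, and the constant is an arbitrary example; treat it as a sketch of the semantics rather than assembler output.

```c
#include <stdint.h>

/* movz: zero the whole register, then place a 16-bit immediate at
 * bit position `shift` (0, 16, 32, or 48). */
static uint64_t movz(uint16_t imm16, unsigned shift) {
    return (uint64_t)imm16 << shift;
}

/* movk: keep every other bit, replace only the 16-bit field at `shift`. */
static uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift) {
    uint64_t mask = (uint64_t)0xFFFF << shift;
    return (reg & ~mask) | ((uint64_t)imm16 << shift);
}

/* Builds 0x0123456789ABCDEF in four steps, one per 16-bit field. */
static uint64_t build_constant(void) {
    uint64_t x0 = movz(0xCDEF, 0);   /* movz x0, #0xCDEF          */
    x0 = movk(x0, 0x89AB, 16);       /* movk x0, #0x89AB, lsl #16 */
    x0 = movk(x0, 0x4567, 32);       /* movk x0, #0x4567, lsl #32 */
    x0 = movk(x0, 0x0123, 48);       /* movk x0, #0x0123, lsl #48 */
    return x0;
}
```

Constants with some all-zero 16-bit fields need fewer steps, which is why the count is "up to" four.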

Flags and the bridge to control flow

adds, subs, cmp (an alias for subs with the zero register as destination), cmn, and tst set the PSTATE condition flags N, Z, C, V, the same flags your Week 3 emulator had. Branches will read them (Week 5). Mixing up which instructions set flags and which don’t is a common source of bugs.
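A hand trace of the flags is easier with the rules written down. The sketch below computes N, Z, C, V for a 32-bit subs (and therefore cmp) in C; the nzcv struct and flags_subs32 are hypothetical helper names for illustration only.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical container for the four condition flags. */
typedef struct { bool n, z, c, v; } nzcv;

/* Flags produced by a 32-bit `subs wd, wn, wm` (or `cmp wn, wm`),
 * where a is wn and b is wm. */
static nzcv flags_subs32(uint32_t a, uint32_t b) {
    uint32_t r = a - b;                        /* wraps modulo 2^32 */
    nzcv f;
    f.n = (r >> 31) & 1;                       /* result is negative as signed */
    f.z = (r == 0);                            /* result is zero */
    f.c = (a >= b);                            /* AArch64: C set means no borrow */
    f.v = (((a ^ b) & (a ^ r)) >> 31) & 1;     /* signed overflow */
    return f;
}
```

For example, comparing 3 with 5 gives N set and C clear, while subtracting 1 from 0x80000000 (the most negative 32-bit value) sets V.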

Reading compiler output

The single most useful skill: compile a small C function with -O0 and -O2, disassemble it, and map each line of C to instructions. This demystifies the compiler and is how you debug, optimize, and reverse-engineer behavior for the rest of the course.
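A good first candidate for this exercise is a function the optimizer can visibly simplify. The one below is an assumed example (the file name fold.c and the function are invented here); the commands in the comment use standard clang flags, and the description of the output is what a typical compiler tends to do, not a guarantee.

```c
/* fold.c (hypothetical file name)
 *
 * Compile to assembly at two optimization levels and compare:
 *   clang -O0 -S fold.c -o fold_O0.s
 *   clang -O2 -S fold.c -o fold_O2.s
 *
 * At -O0 you will usually see each local stored to and reloaded from
 * the stack, followed by two multiplies. At -O2 a compiler will
 * typically fold the whole body into a single constant load.
 */
int seconds_per_day(void) {
    int hours = 24;
    int minutes = 60;
    int seconds = 60;
    return hours * minutes * seconds;
}
```

Annotating the -O0 listing line by line against the four C statements, then explaining where those statements went in the -O2 listing, covers most of what this skill involves.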

x86-64 equivalent: registers and CISC

x86-64 is ARM64’s CISC counterpart, and contrasting the two sharpens your understanding of both. The register set is 16 general-purpose 64-bit registers (rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, and r8–r15), each with 32-/16-/8-bit sub-views (for rax: eax, ax, al), versus ARM64’s 31 registers with x/w views. The deep differences:

CISC, variable-length instructions. x86-64 instructions are 1–15 bytes (vs ARM64’s fixed 4), so building a constant is often a single mov $0x..., %rax rather than ARM64’s movz/movk sequence — at the cost of a more complex decoder.
Register-memory operands. Unlike ARM64’s strict load/store, many x86-64 instructions operate directly on memory: add (%rdi), %rax reads memory and adds in one instruction. Memory access is not segregated to ldr/str, which makes “count the memory touches” harder than on ARM64 — a real reason RISC is cleaner to reason about.
lea and two syntaxes. lea computes an address without touching memory (and doubles as cheap arithmetic). Two assembly syntaxes exist: AT&T (mov %rsp, %rbp, source-first, used by GNU as/gcc) and Intel (mov rbp, rsp, dest-first). Know both; Godbolt toggles between them.
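Flags. x86-64 keeps its condition flags (ZF, SF, CF, OF) in the EFLAGS register (the low half of the 64-bit RFLAGS), set by most arithmetic and by cmp/test. They play the same role as ARM64’s N/Z/C/V from Week 3, with different names and one difference worth remembering: after a subtraction, x86-64’s CF means a borrow occurred, while ARM64’s C means no borrow occurred.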
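Two tiny C functions are enough to see the register-memory and lea points in compiler output. The function names are hypothetical, and the instruction sequences in the comments are what a compiler may plausibly emit at -O2, shown in AT&T syntax for x86-64; actual output depends on compiler and version.

```c
/* Read-modify-write of one memory cell.
 *
 * x86-64 may use a single register-memory instruction:
 *     addq %rsi, (%rdi)
 * ARM64 must split it into load, compute, store:
 *     ldr x8, [x0]
 *     add x8, x8, x1
 *     str x8, [x0]
 */
void accumulate(long *cell, long value) {
    *cell += value;
}

/* base + index*4 + 8 matches the x86-64 addressing-mode shape, so a
 * compiler may compute it with one lea and no memory access:
 *     leaq 8(%rdi,%rsi,4), %rax
 */
long scaled_offset(long base, long index) {
    return base + index * 4 + 8;
}
```

Counting memory touches in accumulate is trivial on ARM64 (one ldr, one str) and requires reading operand forms on x86-64, which is the reasoning difference described above.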

LLVM IR: the architecture-neutral layer

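LLVM IR is the intermediate representation the compiler produces before instruction selection for a specific ISA. It is typed, in SSA form (every value is assigned exactly once), and largely target-independent: the file clang emits still records a target triple and data layout, but the instructions themselves name no machine registers or opcodes. The workflow that unifies this whole course: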

clang -O1 -S -emit-llvm sum.c -o sum.ll     # C  -> LLVM IR (one file)
llc -march=aarch64 sum.ll -o sum.arm64.s    # IR -> ARM64 assembly
llc -march=x86-64  sum.ll -o sum.x86.s      # IR -> x86-64 assembly

The same sum.ll lowers to both ISAs: the IR captures the computation (a loop, an add, a return), while the two .s files differ mainly in instruction selection, register names and calling convention, the load/store-vs-register-memory style, and instruction encoding. This is the precise sense in which ARM64 and x86-64 are “the same program, lowered differently,” and opt lets you watch optimization passes transform the IR itself (revisited in Week 10). Godbolt shows C, IR, and both assemblies side by side.
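The commands above assume a file named sum.c, which is not shown. A minimal version, following the int sum(int n) signature used in the Theory Exercises, might look like this; it is one plausible input, not a prescribed one.

```c
/* sum.c: assumed input for the clang / llc workflow above.
 * Returns 1 + 2 + ... + n, or 0 when n < 1. */
int sum(int n) {
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}
```

Depending on the optimization level, the optimizer may restructure or even eliminate the loop in the IR, so it is worth also emitting IR at -O0 to see the loop in its most literal form before comparing.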

Theory Exercises

List the AArch64 register set and explain the x vs w (64- vs 32-bit) views and the zero register; give the x86-64 equivalents (rax/eax/ax/al, r8–r15) and the count difference (31 vs 16).
Explain why building a 64-bit constant needs up to four movz/movk instructions on ARM64; contrast with x86-64’s single variable-length mov, and relate both to the fixed- vs variable-width encoding tradeoff (Week 1/3).
Name the ARM64 condition flags (N, Z, C, V) and the x86-64 EFLAGS (ZF, SF, CF, OF); give an instruction and scenario that sets each.
Write an AArch64 function for int sum(int n) (sum 1..n) using a loop and hand-assemble the loop body; then emit LLVM IR with clang -emit-llvm -S and compile it to both ISAs with llc, comparing the two outputs.
For a short C function, show one statement as an ARM64 ldr/str pair and as a single x86-64 register-memory instruction; explain why the load/store form is easier to reason about for performance.

Implementation

Write several small AArch64 functions by hand (arithmetic, a loop, an array sum), assemble and run them, and single-step in lldb (Mac) or gdb (Linux), watching registers and flags. Then compile small C functions at -O0 and -O2, disassemble, and annotate the mapping from C to instructions. For each one, also emit LLVM IR (clang -emit-llvm -S) and lower it to both ISAs (llc -march=aarch64 / -march=x86-64); run the x86-64 version under qemu-user or compare on Godbolt. Annotate which differences are fundamental (the computation, visible in the IR) versus ISA convention (register names, load/store vs register-memory, encoding).
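For the array-sum function, it helps to have a C reference to check the hand-written assembly against. The sketch below is one such reference; array_sum is a hypothetical name, and the instruction in the comment is one way the load could be written by hand using post-indexed addressing, not the only one.

```c
#include <stddef.h>

/* Reference implementation to compare a hand-written version against.
 * Walks the array with a moving pointer, the shape that maps most
 * directly onto a post-indexed load. */
long array_sum(const int *a, size_t n) {
    long total = 0;
    const int *p = a;
    const int *end = a + n;
    while (p != end) {
        /* One possible hand-written equivalent of `*p++`:
         *     ldrsw x2, [x0], #4    ; load int, sign-extend, then x0 += 4 */
        total += *p++;
    }
    return total;
}
```

Calling both the C reference and the assembly version on the same inputs, including an empty array, gives a quick correctness check before single-stepping.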

Measurement / Inspection

Single-step a hand-written loop and verify register/flag evolution against a hand trace (continuity with Week 3). Compare -O0 vs -O2 disassembly of the same C function: count instructions and note optimizations (constant folding, register allocation, loop transformations) — a preview of Week 10.
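A hand trace can be generated mechanically and then compared with what the debugger shows. The sketch below emulates a three-instruction countdown loop in C and records the registers and Z flag after each iteration. The loop itself, the trace_row type, and trace_sum_loop are assumptions made for illustration, not a loop taken from the course materials.

```c
#include <stdint.h>
#include <stddef.h>

/* One row of the trace: register values and Z flag after an iteration. */
typedef struct { uint32_t w0, w1; int z; } trace_row;

/* Emulates this hypothetical loop, with n initially in w0:
 *           mov  w1, #0
 *     loop: add  w1, w1, w0
 *           subs w0, w0, #1
 *           b.ne loop
 * Writes one row per iteration and returns the number of rows written
 * (never more than max_rows). */
static size_t trace_sum_loop(uint32_t n, trace_row *rows, size_t max_rows) {
    uint32_t w0 = n, w1 = 0;
    size_t count = 0;
    while (count < max_rows) {
        w1 += w0;                    /* add:  flags unchanged */
        w0 -= 1;                     /* subs: sets flags      */
        int z = (w0 == 0);
        rows[count].w0 = w0;
        rows[count].w1 = w1;
        rows[count].z = z;
        count++;
        if (z) break;                /* b.ne falls through once Z is set */
    }
    return count;
}
```

For n = 3 the rows are (w0=2, w1=3, Z=0), (w0=1, w1=5, Z=0), (w0=0, w1=6, Z=1), which is what the register view should show after each pass through the loop.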

Expected baselines: hand-written functions produce correct results and step as predicted; -O2 output is dramatically shorter and harder to map line-by-line than -O0, demonstrating the compiler’s work. In the ARM64 listings, loads and stores (ldr/str and their pair forms ldp/stp) are clearly identifiable as the only memory touches.

Connections

This ISA fluency is required by Week 5 (control flow and the stack in assembly), Week 6 (seeing C/C++ structures in memory and the ABI), and Week 10 (reading optimized output to understand performance). It is the real-hardware version of Week 3’s emulator and the foundation for the Jetson/embedded work in Course 1.