Inference Engineering · Lesson 21 · GPU Architecture: SMs & HBM

GPU Architecture: SMs & HBM

The hardware floor — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict a GPU's internals — SMs, Tensor Cores, the memory hierarchy — and connect each to a lesson you've already done.

The setup

We've leaned on "the pantry" (HBM) and "the cook" (compute) all course. Time to open the box: what's actually inside the GPU?

Step 1 — the compute units

Step 2 — the LLM workhorse

Recall — cover the screen: the GPU memory hierarchy, fast→big.
Registers / SRAM (tiny, on each SM, ~19 TB/s) → L2 (~50 MB, shared) → HBM (~94 GB, ~3.9 TB/s). Always capacity vs bandwidth. Decode waits on HBM; fusion/FlashAttention keep data in SRAM.

Step 3 — the memory tiers

Step 4 — tie it back

Step 5 — your hardware

In Kubernetes terms (infra bridge)

A GPU is a node: SMs = cores, HBM = RAM, SRAM/L2 = CPU caches. Knowing it is like knowing your instance type's cores/bandwidth/cache — it's how you reason about which resource bounds a workload instead of treating the node as a black box.

On YOUR cluster (your hardware)

Each GPU is an H100 NVL (Hopper): ~130 SMs with FP8 Tensor Cores + Transformer Engine, ~50 MB L2, ~94 GB HBM3 @ ~3.9 TB/s. Those exact numbers set your roofline ridge (~214 FLOP/byte) and KV capacity, and explain why FP8 + FlashAttention win here.
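As a sketch of how HBM size bounds the KV cache: every cached token stores one key vector and one value vector per layer. The model shape used below (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries, 16 GB of weights) is a hypothetical example, not a model from this course, and the function names are illustrative. The result is an upper bound that ignores activations and runtime overhead.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: keys + values, for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(hbm_bytes: float, weight_bytes: float,
                      per_token_bytes: int) -> int:
    """Upper bound on tokens that fit in HBM once the weights are loaded."""
    free = hbm_bytes - weight_bytes
    if free <= 0:
        return 0
    return int(free // per_token_bytes)


if __name__ == "__main__":
    HBM_BYTES = 94e9      # ~94 GB, from the lesson
    WEIGHT_BYTES = 16e9   # ASSUMPTION: ~8B parameters x 2 bytes each
    per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    print("KV bytes per token:", per_token)
    print("max cached tokens :", max_cached_tokens(HBM_BYTES, WEIGHT_BYTES, per_token))
```

Under these assumptions each token costs 128 KiB of cache, so roughly 595,000 tokens fit across all concurrent sequences on one GPU.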

Read this next (primary source): H100 architecture. Runnable: day19 notebook.

Final check — teach it back

Explain to a colleague: "The whole course in one chip picture…"
…a GPU has ~130 SMs (compute, incl. FP8 Tensor Cores) fed by a memory hierarchy: tiny fast SRAM on each SM, then L2, then big slow HBM. Decode is memory-bound = waiting on HBM; fusion/FlashAttention keep data in SRAM; batching keeps the SMs busy. The hardware explains every optimization.
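One rough way to see why batching keeps the SMs busy: during decode each weight is read from HBM once per step and reused for every sequence in the batch, so arithmetic intensity grows roughly linearly with batch size. The function names and the 2 bytes per weight are assumptions for illustration; the ridge of ~214 FLOP/byte is the figure this lesson quotes for the H100 NVL. KV-cache and activation traffic are ignored.

```python
import math

RIDGE_FLOP_PER_BYTE = 214.0  # roofline ridge quoted in this lesson for the H100 NVL


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Approximate FLOP per byte of weight traffic for one decode step.

    Each weight contributes one multiply and one add (2 FLOPs) per sequence
    in the batch, but is read from HBM only once per step.
    ASSUMPTION: weight reads dominate traffic; KV cache and activations ignored.
    """
    return 2.0 * batch_size / bytes_per_weight


def min_batch_to_reach_ridge(ridge: float = RIDGE_FLOP_PER_BYTE,
                             bytes_per_weight: float = 2.0) -> int:
    """Smallest batch size whose intensity meets or exceeds the ridge."""
    return math.ceil(ridge * bytes_per_weight / 2.0)


if __name__ == "__main__":
    for b in (1, 8, 64, 256):
        ai = decode_intensity(b)
        bound = "compute-bound" if ai >= RIDGE_FLOP_PER_BYTE else "memory-bound"
        print(f"batch={b:>3}  intensity={ai:6.1f} FLOP/byte  -> {bound}")
    print("batch needed to reach the ridge:", min_batch_to_reach_ridge())
```

In this simplified model a single-sequence decode sits at about 1 FLOP/byte, far below the ridge, which is the "waiting on HBM" regime described above.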
← Lesson 20 · Next: Lesson 22 →

GPU Architecture: SMs & HBM

The hardware floor — where every optimization in this course physically lives.

Today's win: you'll explain a GPU's internals — the SMs (compute units), Tensor Cores, and the memory hierarchy (registers → SRAM → L2 → HBM) — so memory-bound decode, fusion, and FlashAttention all have a concrete physical home.

The picture: the kitchen building itself

We've talked about the pantry and the van for the whole course; now look at the building. A GPU is a kitchen with ~130 cook stations (SMs), each holding general tools (CUDA cores) and specialized appliances (Tensor Cores for matrix multiplies; Hopper has four per SM). The pantry is HBM; each station's cutting board is tiny, instant SRAM.

a cook station = SM (Streaming Multiprocessor): runs your kernels
the matrix-multiply appliance = Tensor Core: the LLM workhorse
the pantry across town = HBM (~94 GB, ~3.9 TB/s on your H100 NVL)
the cutting board at the station = SRAM / registers (tiny, ~19 TB/s)

1 · The SM — where kernels run

A GPU is an array of Streaming Multiprocessors (an H100 has ~130). Each SM runs threads in groups of 32 called warps, in lockstep. Your kernels are scheduled across these SMs, and the GPU hides memory latency by swapping in another warp whenever one is waiting on data [1].
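To make the warp grouping concrete, here is a small sketch of how one thread block divides into 32-thread warps. The helper name `warps_for_block` is hypothetical; the only fact it relies on is the warp size of 32 stated above.

```python
WARP_SIZE = 32  # threads per warp, as stated in the lesson


def warps_for_block(threads_per_block: int) -> tuple[int, int]:
    """Return (warps needed, idle lanes in the last warp) for one thread block."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    warps = -(-threads_per_block // WARP_SIZE)  # ceiling division
    idle_lanes = warps * WARP_SIZE - threads_per_block
    return warps, idle_lanes


if __name__ == "__main__":
    for threads in (32, 100, 256):
        warps, idle = warps_for_block(threads)
        print(f"{threads:>3} threads -> {warps} warps, {idle} idle lanes")
```

A block of 100 threads occupies 4 warps with 28 lanes idle in the last one, which is one reason block sizes are usually chosen as multiples of 32.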

[Figure: one GPU ≈ 130 SMs (a few shown). Each SM has CUDA cores, a Tensor Core, and SRAM / registers; all SMs share HBM, the pantry (~94 GB, ~3.9 TB/s).]
Many SMs, each with its own cores + Tensor Core + scratchpad, all pulling from one big shared HBM. Keeping those SMs fed is the whole performance game.

2 · Tensor Cores — why GPUs are fast at LLMs

The Tensor Core is a dedicated unit that does small matrix multiplies in one shot. LLMs are almost entirely matrix multiplies, so Tensor Cores are where the FLOPs come from. On Hopper they also run FP8 (via the Transformer Engine), which is why FP8 is so fast on your hardware [1].
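As a rough illustration of "small matrix multiplies in one shot", the sketch below breaks a larger matrix multiply into fixed-size tile products, each standing in for one Tensor Core operation. This is plain Python for intuition only, not how the hardware is programmed; the 4×4 tile size and all function names are assumptions chosen for readability.

```python
TILE = 4  # ASSUMPTION: illustrative tile edge; real tile shapes depend on dtype and architecture


def tile_mma(a, b, acc, i0, j0, k0):
    """Multiply-accumulate one TILE x TILE block: acc_tile += a_tile @ b_tile."""
    for i in range(i0, i0 + TILE):
        for j in range(j0, j0 + TILE):
            s = acc[i][j]
            for k in range(k0, k0 + TILE):
                s += a[i][k] * b[k][j]
            acc[i][j] = s


def tiled_matmul(a, b):
    """Compute a @ b tile by tile; all dimensions must be multiples of TILE."""
    m, k_dim, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k_dim, "inner dimensions must match"
    assert m % TILE == 0 and n % TILE == 0 and k_dim % TILE == 0
    acc = [[0] * n for _ in range(m)]
    for i0 in range(0, m, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, k_dim, TILE):
                tile_mma(a, b, acc, i0, j0, k0)  # stands in for one Tensor Core operation
    return acc


def naive_matmul(a, b):
    """Reference triple loop used to check the tiled version."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]
```

The tiled loop produces the same result as the naive one; the difference on real hardware is that the innermost tile product is a single hardware operation instead of dozens of scalar multiply-adds.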

3 · The memory hierarchy — capacity vs bandwidth

Memory comes in tiers: tiny, instant registers and SRAM (~19 TB/s) on each SM, a shared L2 (~50 MB), and big but slower HBM (~94 GB, ~3.9 TB/s). The trade is always capacity vs bandwidth [2]. This single picture explains the whole course:

[Figure: memory pyramid. Top: registers / SRAM (~19 TB/s; tiny, instant). Middle: L2 (~50 MB). Bottom: HBM (~94 GB, ~3.9 TB/s; huge, slower). Faster toward the top, bigger toward the bottom.]
The pyramid behind everything: data is fast when it's small and on-chip, slow when it's big and in HBM. Every optimization is a move up this pyramid.
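A back-of-the-envelope sketch of what the bandwidth gap in the pyramid means in time. The bandwidth figures are the ones quoted above; the 16 GB payload is a hypothetical example (roughly an 8B-parameter model with 16-bit weights). SRAM is far too small to hold that much, so the SRAM line only illustrates the bandwidth ratio.

```python
HBM_BW_BYTES_PER_S = 3.9e12   # ~3.9 TB/s, from the lesson
SRAM_BW_BYTES_PER_S = 19e12   # ~19 TB/s, from the lesson


def transfer_time_ms(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to move num_bytes at a given bandwidth, in ms."""
    return num_bytes / bandwidth_bytes_per_s * 1e3


if __name__ == "__main__":
    weights_bytes = 16e9  # ASSUMPTION: ~8B parameters x 2 bytes each
    hbm_ms = transfer_time_ms(weights_bytes, HBM_BW_BYTES_PER_S)
    sram_ms = transfer_time_ms(weights_bytes, SRAM_BW_BYTES_PER_S)
    print(f"HBM                 : {hbm_ms:.2f} ms per full pass over the weights")
    print(f"SRAM-rate equivalent: {sram_ms:.2f} ms")
    print(f"bandwidth ratio     : {hbm_ms / sram_ms:.1f}x")
```

Under these assumptions, a decode step that reads every weight once takes at least about 4.1 ms no matter how fast the SMs compute, which is the sense in which decode waits on HBM.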

In Kubernetes terms (infra bridge)

A GPU is a node: SMs are its cores, HBM is its RAM, and the SRAM/L2 tiers are the node's CPU caches. Knowing this is like knowing your instance type's core count, memory bandwidth, and cache — it's what lets you reason about why a workload is bound by one resource and not another, instead of treating the node as a black box.

On YOUR cluster: the H100 (Hopper), concretely (your hardware)

Each of your 4 GPUs is an H100 NVL (Hopper): ~130 SMs with FP8 Tensor Cores + Transformer Engine, ~50 MB L2, and ~94 GB HBM3 at ~3.9 TB/s. Those exact numbers are what set your roofline ridge (~214 FLOP/byte), your KV-cache capacity, and why FP8 + FlashAttention are such wins here. The hardware is the constraint every lesson has been dancing around.
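The ridge figure can be reproduced with one division. In the sketch below the peak of 835 TFLOP/s is an assumption: it is the value that, divided by ~3.9 TB/s, gives the quoted ~214 FLOP/byte, and it is not stated in this lesson. Check your GPU's datasheet for the precision you actually run. The function names are illustrative.

```python
def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute time equals memory time."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def attainable_flops(intensity: float, peak_flops_per_s: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Roofline model: performance is capped by bandwidth or by peak compute."""
    return min(peak_flops_per_s, intensity * bandwidth_bytes_per_s)


PEAK = 835e12      # ASSUMPTION: dense 16-bit Tensor Core peak, not from the lesson
HBM_BW = 3.9e12    # ~3.9 TB/s, from the lesson

if __name__ == "__main__":
    ridge = ridge_point(PEAK, HBM_BW)
    print(f"ridge ~ {ridge:.0f} FLOP/byte")
    for ai in (1, 50, 214, 1000):
        frac = attainable_flops(ai, PEAK, HBM_BW) / PEAK
        print(f"intensity {ai:>4} FLOP/byte -> {frac:6.1%} of peak")
```

Kernels whose intensity falls below the ridge are limited by HBM bandwidth; above it they are limited by the Tensor Cores. A higher assumed peak (for example an FP8 peak) moves the ridge to the right.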

Read this next (primary source): NVIDIA H100 architecture. Runnable companion: day19 notebook (SMs, the memory hierarchy, warps).

Check yourself (recall, don't peek)

← Lesson 20 — disaggregated serving Next: Lesson 22 — GPU generations →
References
  1. GPU architecture: SMs, Tensor Cores, warps — day19 (gpu-architecture-sms-hbm.ipynb).
  2. Memory hierarchy & bandwidth — NVIDIA H100 (architecture).