The roofline — and the one number that decides memory- vs compute-bound.
Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict the H100's ridge point, and see why decode stays memory-bound no matter how much compute you add: you run out of KV memory before batching can lift it to the compute ceiling.
The setup
Every GPU has two ceilings: how fast it can do math (compute) and how fast it can move bytes (memory bandwidth). Which one limits you depends on your arithmetic intensity = FLOPs done per byte moved.
Step 1 — which side of the line?
Step 2 — where's the line?
Recall (cover the screen): what is the ridge point, in words? The arithmetic intensity (FLOP/byte) where the two ceilings meet: peak FLOPs ÷ memory bandwidth. Below it you're memory-bound; above it, compute-bound.
Step 3 — does batching help?
Decode's big weight-multiplies (the FC layers) read the weights once and reuse them across the batch, so their intensity ≈ 2 × batch ÷ bytes_per_weight: each weight does one multiply-add (2 FLOPs) per sequence in the batch.
Step 4 — what about attention?
Step 5 — so where does your model actually sit?
Recall (say it): two ways to speed up decode, and which one is futile. Speed it up by moving fewer bytes: quantize, use GQA, shrink the KV cache. Adding FLOPs (compute) is futile: decode is memory-bound, so the extra compute sits idle.
On YOUR cluster
H100 NVL: 3.9 TB/s, ~836 TFLOPS BF16 dense → ridge ≈ 214 FLOP/byte. qwen36 at max-num-seqs 8 → FC intensity ≈ 2 × 8 ÷ 2 bytes per BF16 weight ≈ 8 FLOP/byte ≈ 4% of the ridge: deeply memory-bound. You'd need ~batch 200+ to reach compute, but KV memory (Lesson 10) caps you at 8. Memory runs out before compute ever bites, so attack bytes, not FLOPs.
Explain to a colleague: "More GPU compute won't speed up our decode because…" …decode sits far below the roofline's ridge point: it's bottlenecked on memory bandwidth (re-reading weights + KV per token), not FLOPs, so spare compute goes unused. The fix is fewer bytes (quantization, GQA), not more compute.
I'm your teacher — ask me anything. Want the roofline drawn for FP8 vs BF16, or the crossover batch derived?
Roofline — Williams et al., CACM 2009 (dl.acm.org); JAX scaling book (inference).
The Roofline
When — and whether — decode ever turns compute-bound.
Today's win: you'll read a roofline plot, compute your H100's
ridge point, and explain why decode throughput climbs with batch size but
attention never escapes memory — and why you usually run out of
KV memory before you ever hit the
compute ceiling.
Pantry recap: two fixed limits — the van (memory bandwidth) and the
stoves (compute). Which one bottlenecks you depends on how much cooking you
do per pound hauled — the arithmetic intensity.
1 · Two ceilings, one plot
Every GPU has two hard limits: a memory-bandwidth ceiling and a
compute ceiling. Plot achievable throughput against arithmetic
intensity and you get the roofline: a rising slope (bandwidth-limited)
that flattens into a roof (compute-limited). They meet at the ridge
point = peak FLOPs ÷ bandwidth.¹
ridge point = peak FLOPs ÷ memory bandwidth (FLOP/byte)
intensity < ridge → memory-bound (on the slope) ·
intensity > ridge → compute-bound (on the roof)
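The two rules above can be sketched in a few lines of Python. This is a minimal illustration, not a profiler: the function names are made up for this example, and the hardware figures are the H100 NVL numbers quoted elsewhere in this lesson (3.9 TB/s, ~836 TFLOPS BF16 dense).

```python
def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two ceilings meet."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def attainable_flops_per_s(intensity: float, peak_flops_per_s: float,
                           bandwidth_bytes_per_s: float) -> float:
    """Roofline: the lower of the bandwidth slope and the compute roof."""
    return min(peak_flops_per_s, bandwidth_bytes_per_s * intensity)


def regime(intensity: float, peak_flops_per_s: float,
           bandwidth_bytes_per_s: float) -> str:
    """Which ceiling limits a kernel with this arithmetic intensity."""
    ridge = ridge_point(peak_flops_per_s, bandwidth_bytes_per_s)
    return "memory-bound" if intensity < ridge else "compute-bound"


# Figures quoted in this lesson for the H100 NVL.
PEAK_FLOPS = 836e12  # ~836 TFLOPS, BF16 dense
BANDWIDTH = 3.9e12   # 3.9 TB/s

ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
print(f"ridge ≈ {ridge:.0f} FLOP/byte")  # ridge ≈ 214 FLOP/byte

# Illustrative intensities: low ones sit on the slope, high ones on the roof.
for intensity in (1, 8, 64, 512):
    tflops = attainable_flops_per_s(intensity, PEAK_FLOPS, BANDWIDTH) / 1e12
    label = regime(intensity, PEAK_FLOPS, BANDWIDTH)
    print(f"intensity {intensity:>3} FLOP/byte: {label}, at most {tflops:.1f} TFLOPS")
```

On the slope, attainable throughput is just bandwidth × intensity, which is why doubling intensity doubles throughput until the roof is reached.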
Prefill's huge matmuls land on the compute roof. Decode at batch 1 sits far
down the bandwidth slope. Same GPU, opposite ceilings — exactly Lesson 9, now plotted.
2 · Batching slides the FC layers up — attention stays stuck
Here's the subtle part. The decode FC/GEMM layers read the weights
once and reuse them across the batch, so their intensity ≈
2 × batch ÷ bytes_per_weight — it rises with batch and climbs the
slope toward the ridge. But attention has no such reuse (each sequence
has its own KV), so its intensity does not move with batch — it stays
memory-bound no matter how big the batch.²
Pantry: batching more customers per van trip means more
cooking per pound hauled — the shared prep climbs toward stove-limited. But each
customer's own dish (attention) can't be shared, so that part stays van-limited.
Bigger batch → the FC point climbs the slope (throughput rises). Attention
can't climb — so decode as a whole approaches, but never cleanly reaches, the roof.
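A rough sketch of why one intensity moves with batch and the other does not. The FC formula is the one given above. The attention model is a simplifying assumption made for this example: every cached K or V element is read once per sequence and used in one multiply-add per query head that shares it. Function and parameter names are hypothetical.

```python
def fc_intensity(batch: int, bytes_per_weight: float) -> float:
    """FC layers: each weight is read once and reused by every sequence.

    One multiply-add (2 FLOPs) per weight per sequence.
    """
    flops = 2 * batch
    bytes_moved = bytes_per_weight  # read once, regardless of batch
    return flops / bytes_moved


def attention_intensity(batch: int, bytes_per_kv_element: float,
                        q_heads_per_kv_head: int = 1) -> float:
    """Decode attention under the simplified model described above.

    Each sequence reads its own KV cache, so FLOPs and bytes both scale
    with batch and the ratio stays constant.
    """
    flops = batch * 2 * q_heads_per_kv_head
    bytes_moved = batch * bytes_per_kv_element
    return flops / bytes_moved


BF16_BYTES = 2  # assumption: weights and KV cache both stored in BF16

for batch in (1, 8, 64):
    fc = fc_intensity(batch, BF16_BYTES)
    attn = attention_intensity(batch, BF16_BYTES)
    print(f"batch {batch:>2}: FC ≈ {fc:.0f} FLOP/byte, attention ≈ {attn:.0f} FLOP/byte")
```

In this model the batch term cancels out of the attention ratio, which is the "no reuse" point in code form. Sharing each KV head across several query heads (GQA) raises the constant, but it still does not grow with batch.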
3 · So where's the crossover?
Set FC intensity equal to the ridge: 2 × batch ÷ bytes_per_weight = ridge. For an
H100-class GPU that lands in the tens to low hundreds of concurrent
sequences (a commonly cited rule of thumb is ~batch 32 once you account for real,
sub-peak kernels).² Below that you're on the
slope, so adding batch raises throughput. Above it, the FC layers flatten out and
more batch stops helping them.
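Solving that equation for batch gives the theoretical crossover. The sketch below uses the peak H100 NVL figures quoted in this lesson and does not model sub-peak kernels, so treat the results as upper-end estimates. The one-byte case assumes weights are stored in 8 bits while the math still runs at the BF16 rate, which is an assumption made for illustration only.

```python
def crossover_batch(ridge_flop_per_byte: float, bytes_per_weight: float) -> float:
    """Batch at which FC intensity (2 * batch / bytes_per_weight) equals the ridge."""
    return ridge_flop_per_byte * bytes_per_weight / 2


# Ridge from the lesson's H100 NVL figures: ~836 TFLOPS BF16 dense over 3.9 TB/s.
RIDGE = 836e12 / 3.9e12

for label, bytes_per_weight in (("2-byte (BF16) weights", 2), ("1-byte weights", 1)):
    batch = crossover_batch(RIDGE, bytes_per_weight)
    print(f"{label}: crossover at about batch {batch:.0f}")
```

Fewer bytes per weight raises intensity at a given batch, so the FC layers reach the ridge at a smaller batch.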
In Kubernetes terms
A workload is memory-bound or compute-bound the way a pod is memory-bound or CPU-bound. The roofline is the node's resource ceiling; the ridge point is the ratio where the bottleneck flips. Decode sits deep in the memory-bound region — so you right-size for bandwidth, not FLOPs, just as you'd give a memory-bound pod more RAM rather than more CPU.
On YOUR cluster: decode is nowhere near the roof
At --max-num-seqs 8, qwen36's decode is
deeply memory-bound — ~4% of the way to the compute ceiling. That's why
you measured ~108 generation tok/s while prefill ripped at 3,288: decode is van-limited,
exactly as the roofline predicts.
You'd need ~batch 214 to reach the ridge — but Lesson 10 showed
KV memory caps you at 8. The punchline: you run out of memory long before compute
becomes the limit. The lever that actually matters for decode is bandwidth + KV
memory, not FLOPs.
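The "~4%" figure can be reproduced from the numbers in this lesson. This is a back-of-the-envelope sketch that assumes 2-byte (BF16) weights, and it covers the FC layers only.

```python
PEAK_FLOPS = 836e12    # ~836 TFLOPS BF16 dense (H100 NVL, as quoted in this lesson)
BANDWIDTH = 3.9e12     # 3.9 TB/s
MAX_NUM_SEQS = 8       # the --max-num-seqs cap from the lesson
BYTES_PER_WEIGHT = 2   # assumption: BF16 weights

ridge = PEAK_FLOPS / BANDWIDTH
fc_intensity = 2 * MAX_NUM_SEQS / BYTES_PER_WEIGHT  # FLOP per byte of weights read
fraction_of_ridge = fc_intensity / ridge
headroom = ridge / fc_intensity  # how many times more intensity the roof would need

print(f"FC intensity at batch {MAX_NUM_SEQS}: {fc_intensity:.0f} FLOP/byte")
print(f"fraction of ridge: {fraction_of_ridge:.1%}")
print(f"intensity would need to grow about {headroom:.0f}x to reach the roof")
```

Since the KV cache caps the batch at 8, that gap cannot be closed by batching on this setup.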
This closes the loop on Lessons 9–12: decode is memory-bound (L9); the KV cache eats
your memory and caps the batch (L10); paging + continuous batching let you actually reach
that cap (L12); and the roofline (L11) shows the cap arrives well before compute
ever becomes the bottleneck. To go faster at decode, you attack bytes moved
(quantization, GQA, smaller KV) — not FLOPs.
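To see why bytes are the lever, here is a hedged lower-bound sketch: in the memory-bound regime a decode step cannot finish faster than the time needed to stream the weights and the KV cache once. The parameter count and KV size below are hypothetical placeholders, not measurements of qwen36, and the function name is made up for this example.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step when memory bandwidth is the limit."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s


BANDWIDTH = 3.9e12             # 3.9 TB/s, the H100 NVL figure from this lesson
HYPOTHETICAL_PARAMS = 30e9     # placeholder parameter count, for illustration only
HYPOTHETICAL_KV_BYTES = 4e9    # placeholder KV-cache bytes read per step

bf16_step = decode_step_seconds(HYPOTHETICAL_PARAMS * 2, HYPOTHETICAL_KV_BYTES, BANDWIDTH)
int8_step = decode_step_seconds(HYPOTHETICAL_PARAMS * 1, HYPOTHETICAL_KV_BYTES, BANDWIDTH)

print(f"2-byte weights: at least {bf16_step * 1e3:.1f} ms per step")
print(f"1-byte weights: at least {int8_step * 1e3:.1f} ms per step")
print(f"step-time ratio: {bf16_step / int8_step:.2f}x")
```

Nothing in this bound depends on peak FLOPs, so adding compute leaves it unchanged, while halving the bytes per weight shrinks it directly.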
Picture the slope, the roof, and the ridge, then answer from memory.
I'm your teacher — ask me anything. Want the roofline drawn for FP8
vs BF16, the attention-vs-FC intensity derived step by step, or to see how speculative
decoding "fakes" higher intensity to beat the memory wall? Just ask.