When Does Decode Get Faster?

The roofline — and the one number that decides memory- vs compute-bound.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict the H100's ridge point, and explain why decode stays memory-bound no matter how much compute you add: you run out of KV memory before batching can lift you to the compute ceiling.

The setup

Every GPU has two ceilings: how fast it can do math (compute) and how fast it can move bytes (memory bandwidth). Which one limits you depends on your arithmetic intensity = FLOPs done per byte moved.

Step 1 — which side of the line?

Step 2 — where's the line?

Recall — cover the screen: what is the ridge point, in words?
The arithmetic intensity (FLOP/byte) where the two ceilings meet: peak FLOPs ÷ memory bandwidth. Below it you're memory-bound; above it, compute-bound.
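The definition above can be sketched in a few lines of Python. The hardware numbers are the ones this lesson quotes for the H100 NVL; the function names `ridge_point` and `bound_by` are made up for illustration.

```python
# Ridge point = peak compute / memory bandwidth, in FLOP per byte.
# Hardware figures are the ones quoted in this lesson for an H100 NVL.
PEAK_FLOPS_BF16 = 836e12  # ~836 TFLOPS, BF16 dense
MEM_BANDWIDTH = 3.9e12    # 3.9 TB/s


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two ceilings meet."""
    return peak_flops / bandwidth_bytes_per_s


def bound_by(intensity: float, ridge: float) -> str:
    """Classify a kernel by comparing its intensity with the ridge point."""
    return "memory-bound" if intensity < ridge else "compute-bound"


ridge = ridge_point(PEAK_FLOPS_BF16, MEM_BANDWIDTH)
print(f"ridge = {ridge:.0f} FLOP/byte")  # ridge = 214 FLOP/byte
print(bound_by(16, ridge))               # memory-bound
```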

Step 3 — does batching help?

Decode's big weight-multiplies (the FC layers) read the weights once and reuse them across the batch. Each weight contributes 2 FLOPs (a multiply and an add) per sequence, so their intensity ≈ 2 × batch ÷ bytes per weight.
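A minimal sketch of where that formula comes from, assuming activation traffic is negligible next to the weights (the usual simplification for decode). The helper `fc_intensity` and the parameter count are hypothetical; the parameter count cancels out.

```python
def fc_intensity(batch: int, n_params: int, bytes_per_weight: float) -> float:
    """FLOPs per byte for one decode step through an FC layer.

    Each weight does one multiply and one add per sequence in the batch,
    while the weights themselves are read from memory only once.
    """
    flops = 2 * batch * n_params               # 2 FLOPs per weight per sequence
    bytes_moved = n_params * bytes_per_weight  # weights read once, shared by the batch
    return flops / bytes_moved


# The layer size cancels, so any n_params gives the same answer.
for batch in (1, 8, 64):
    print(batch, fc_intensity(batch, n_params=1_000_000, bytes_per_weight=1))
```

With 1-byte (fp8) weights this gives 2 × batch, and with 2-byte (BF16) weights it gives exactly batch.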

Step 4 — what about attention?

Step 5 — so where does your model actually sit?

Recall — say it: two ways to speed up decode, and which one is futile.
Speed it up by moving fewer bytes: quantize, use GQA, shrink the KV cache. Adding FLOPs (compute) is futile: decode is memory-bound, so the extra compute sits idle.

On YOUR cluster

H100 NVL: 3.9 TB/s, ~836 TFLOPS BF16 dense → ridge ≈ 214 FLOP/byte (FP8 ~1,670 TFLOPS → ≈ 428). qwen36 at max-num-seqs 8 with fp8 weights (1 byte each) → FC intensity ≈ 16 FLOP/byte, which is ≈ 7% of the BF16 ridge and ≈ 4% of the FP8 ridge: deeply memory-bound either way. You'd need roughly batch 107 to reach the BF16 ridge (about 214 for the FP8 ridge), but KV memory (Lesson 10) caps you at 8. Memory runs out before compute ever bites, so attack bytes, not FLOPs.

Read this next (primary source): How to Scale Your Model: Inference (JAX ML scaling book) · Roofline (Williams et al.).

Final check — teach it back

Explain to a colleague: "More GPU compute won't speed up our decode because…"
…decode sits far below the roofline's ridge point: it's bottlenecked on memory bandwidth (re-reading weights + KV per token), not FLOPs, so spare compute goes unused. The fix is fewer bytes (quantization, GQA), not more compute.


References

Roofline — Williams et al., CACM 2009 (dl.acm.org); JAX scaling book (inference).

The Roofline

When — and whether — decode ever turns compute-bound.

Today's win: you'll read a roofline plot, compute your H100's ridge point, and explain why decode throughput climbs with batch size but attention never escapes memory — and why you usually run out of KV memory before you ever hit the compute ceiling.

Pantry recap: two fixed limits, the van (memory bandwidth) and the stoves (compute). Which one bottlenecks you depends on how much cooking you do per pound hauled: the arithmetic intensity.

1 · Two ceilings, one plot

Every GPU has two hard limits: a memory-bandwidth ceiling and a compute ceiling. Plot achievable throughput against arithmetic intensity and you get the roofline: a rising slope (bandwidth-limited) that flattens into a roof (compute-limited). They meet at the ridge point = peak FLOPs ÷ bandwidth. [1]

ridge point = peak FLOPs ÷ memory bandwidth (FLOP/byte)
intensity < ridge → memory-bound (on the slope) · intensity > ridge → compute-bound (on the roof)
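The plot itself is just the lower of two lines. Here is a small sketch of the standard roofline formula, using the H100 NVL figures quoted in this lesson; `attainable_flops` is an illustrative name, not an API from any library.

```python
PEAK_FLOPS = 836e12  # ~836 TFLOPS BF16 dense (lesson's H100 NVL figure)
BANDWIDTH = 3.9e12   # 3.9 TB/s


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     bandwidth: float = BANDWIDTH) -> float:
    """Roofline: throughput is capped by whichever ceiling is lower.

    On the slope the cap is bandwidth * intensity; on the roof it is peak_flops.
    """
    return min(peak_flops, bandwidth * intensity)


# Walk up the slope and onto the roof.
for intensity in (1, 16, 128, 214, 500, 1000):
    tflops = attainable_flops(intensity) / 1e12
    print(f"{intensity:>5} FLOP/byte -> {tflops:7.1f} TFLOPS attainable")
```

Below the ridge the attainable number grows linearly with intensity; past it, every row prints the same peak.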

Prefill's huge matmuls land on the compute roof. Decode at batch 1 sits far down the bandwidth slope. Same GPU, opposite ceilings — exactly Lesson 9, now plotted.

2 · Batching slides the FC layers up — attention stays stuck

Here's the subtle part. The decode FC/GEMM layers read the weights once and reuse them across the batch, so their intensity ≈ 2 × batch ÷ bytes_per_weight: it rises with batch and climbs the slope toward the ridge. But attention has no such reuse (each sequence has its own KV), so its intensity does not move with batch. It stays memory-bound no matter how big the batch. [2]

Pantry: batching more customers per van trip means more cooking per pound hauled — the shared prep climbs toward stove-limited. But each customer's own dish (attention) can't be shared, so that part stays van-limited.

Bigger batch → the FC point climbs the slope (throughput rises). Attention can't climb — so decode as a whole approaches, but never cleanly reaches, the roof.
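To see why attention cannot climb, here is a rough count of FLOPs and KV bytes for one decode step. It is a sketch under stated assumptions: only KV-cache reads are counted (queries, outputs, and softmax are ignored), and the head counts and context length are made-up illustrative values, not qwen36's real configuration.

```python
def attention_intensity(batch: int, context_len: int, n_q_heads: int,
                        n_kv_heads: int, head_dim: int, kv_bytes: float) -> float:
    """Approximate FLOPs per byte for decode attention over a KV cache.

    Per sequence and query head: 2*S*d FLOPs for q.K^T plus 2*S*d for scores.V.
    Per sequence and KV head: S*d elements of K and S*d of V are read.
    Every sequence reads its own cache, so batch scales both terms equally.
    """
    flops = batch * n_q_heads * 4 * context_len * head_dim
    bytes_moved = batch * n_kv_heads * 2 * context_len * head_dim * kv_bytes
    return flops / bytes_moved


# Hypothetical GQA shape: 32 query heads sharing 8 KV heads, 2-byte KV cache.
for batch in (1, 8, 64, 512):
    print(batch, attention_intensity(batch, 4096, 32, 8, 128, 2))
```

Batch and context length both cancel, leaving 2 × (query heads ÷ KV heads) ÷ bytes per KV element. Sharing KV heads (GQA) or shrinking the KV element size raises the ratio by moving fewer bytes, which is the lever this lesson recommends, but in this example it is still only a few FLOP/byte.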

3 · So where's the crossover?

Set FC intensity equal to the ridge: 2 × batch ÷ bytes_per_weight = ridge, so batch = ridge × bytes_per_weight ÷ 2. For an H100-class GPU that lands in the tens to low hundreds of concurrent sequences: on paper, a BF16 ridge of ≈ 214 FLOP/byte gives batch ≈ 107 for 1-byte weights and ≈ 214 for 2-byte weights, while a commonly cited rule of thumb is ~batch 32 once you account for real, sub-peak kernels. [2] Below the crossover you're on the slope, and adding batch raises throughput. Above it, the FC layers flatten out and more batch stops helping them.
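Solving that equation for batch is one line. The sketch below uses the ridge values this lesson quotes for the H100 NVL and assumes ideal, at-peak kernels, so it gives the on-paper crossover rather than the lower rule-of-thumb figure; `crossover_batch` is a hypothetical helper name.

```python
def crossover_batch(ridge: float, bytes_per_weight: float) -> float:
    """Batch size at which FC intensity (2 * batch / bytes_per_weight) hits the ridge."""
    return ridge * bytes_per_weight / 2


RIDGE_BF16 = 214  # FLOP/byte, ~836 TFLOPS / 3.9 TB/s
RIDGE_FP8 = 428   # FLOP/byte, ~1,670 TFLOPS / 3.9 TB/s

print(crossover_batch(RIDGE_BF16, bytes_per_weight=1))  # fp8 weights, BF16 math
print(crossover_batch(RIDGE_BF16, bytes_per_weight=2))  # BF16 weights, BF16 math
print(crossover_batch(RIDGE_FP8, bytes_per_weight=1))   # fp8 weights, FP8 math
```

The answer depends on which compute ceiling and which weight width you pair up, so it is worth stating both whenever you quote a crossover batch.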

In Kubernetes terms

A workload is memory-bound or compute-bound the way a pod is memory-bound or CPU-bound. The roofline is the node's resource ceiling; the ridge point is the ratio where the bottleneck flips. Decode sits deep in the memory-bound region — so you right-size for bandwidth, not FLOPs, just as you'd give a memory-bound pod more RAM rather than more CPU.

On YOUR cluster: decode is nowhere near the roof

H100 NVL: 3.9 TB/s · BF16 ~836 TFLOPS dense → ridge ≈ 214 FLOP/byte (FP8 ~1,670 → 428)
qwen36 decode FC intensity = 2 × batch ÷ 1 (fp8 weights):
batch 8 → 16 FLOP/byte ≈ 7% of the BF16 ridge (≈ 4% of the FP8 ridge) · batch 64 → 128 ≈ 60% of the BF16 ridge (≈ 30% of the FP8 ridge)
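The percentages depend on which ridge you divide by. A quick check of the arithmetic, using only the figures quoted above (fp8 weights at 1 byte each, 3.9 TB/s, ~836 and ~1,670 dense TFLOPS); `fraction_of_ridge` is an illustrative helper.

```python
BANDWIDTH = 3.9e12                       # bytes/s
PEAK = {"BF16": 836e12, "FP8": 1670e12}  # dense FLOP/s, as quoted in the lesson
BYTES_PER_WEIGHT = 1                     # fp8 weights


def fraction_of_ridge(batch: int, peak_flops: float) -> float:
    """How far FC intensity at this batch sits along the way to the ridge."""
    intensity = 2 * batch / BYTES_PER_WEIGHT
    ridge = peak_flops / BANDWIDTH
    return intensity / ridge


for batch in (8, 64):
    for name, peak in PEAK.items():
        pct = 100 * fraction_of_ridge(batch, peak)
        print(f"batch {batch:>2} vs {name} ridge: {pct:.1f}%")
```

Against the FP8 ridge (≈ 428) batch 8 lands near 4% and batch 64 near 30%; against the BF16 ridge (≈ 214) the same points are about twice as far along.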

At --max-num-seqs 8, qwen36's decode is deeply memory-bound: only about 4% of the way to the FP8 ridge (about 7% against BF16). That's why you measured ~108 generation tok/s while prefill ripped at 3,288: decode is van-limited, exactly as the roofline predicts.

You'd need roughly batch 107 to reach the BF16 ridge (about 214 for the FP8 ridge), but Lesson 10 showed KV memory caps you at 8. The punchline: you run out of memory long before compute becomes the limit. The lever that actually matters for decode is bandwidth + KV memory, not FLOPs.

This closes the loop on Lessons 9–12: decode is memory-bound (L9); the KV cache eats your memory and caps the batch (L10); paging + continuous batching let you actually reach that cap (L12); and the roofline (L11) shows the cap arrives well before compute ever becomes the bottleneck. To go faster at decode, you attack bytes moved (quantization, GQA, smaller KV) — not FLOPs.

Read this next (primary source): How to Scale Your Model: Inference, from the JAX ML scaling book. The clearest LLM-specific roofline treatment. For the original model, see Williams et al., "Roofline" (CACM 2009).

Check yourself (recall, don't peek)

Picture the slope, the roof, and the ridge, then answer from memory.


References

[1] Roofline: An Insightful Visual Performance Model. Williams et al., CACM 2009. dl.acm.org · Modal GPU Glossary
[2] Memory- vs compute-bound & the batch-32 FC crossover; H100 datasheet (1,671 BF16 / 3,341 FP8 TFLOPS with sparsity, about half that dense; 3.9 TB/s). modal.com · nvidia.com
[3] How to Scale Your Model: Inference. JAX ML scaling book. jax-ml.github.io