The KV cache, worked out one prediction at a time.
Your H100 NVL has 94 GB. Qwen's 27B weights take ~27 GB in FP8, so ~67 GB looks free.
In decode, attention needs the keys and values of every earlier token. Recomputing them each step would be wasteful, so the model keeps them.
Llama 2 7B: 32 layers, kv_heads × head_dim = 4096, FP16 (2 bytes).
Per token, the cache entry takes 2 × layers × (kv_heads × head_dim) × bytes.
That cap is your max batch — the number behind autoscaling and capacity planning. The weights are a fixed tax; the KV cache is what actually decides concurrency.
Steps 1–4 skipped one thing. Attention is a lookup: each token makes a
Query (what it wants), a Key (how it labels itself), and a
Value (what it hands over) — and the model runs ~32 of these in parallel, the
heads. The KV cache stores the K and V of every head; that's the
kv_heads in the formula. Now:
Look back at 2 × layers × (kv_heads × head_dim) × bytes. Which factors can you change
without it being a different model?
Qwen3.6-27B-FP8 combines GQA (4 KV heads, 6× smaller cache) with an fp8 KV cache, giving 128 KiB/token (vs 256 KiB at FP16).
On 94 GiB at 0.75 utilization (~70 GiB), the ~27 GiB of weights leave a ~43 GiB KV pool ≈ 2.7 full-128K requests, which is
exactly why --max-num-seqs 8 and --max-model-len 131072 can't both be maxed.
The growing pile that quietly decides how many requests you can serve.
In decode, attention needs the keys and values of every earlier token. Rather than recompute them each step, the model stores them — that's the KV cache. Every token you generate appends one more K,V entry that all future steps must read.1
The size of one token's entry is just the product of the model's own dimensions. Memorize this; it's the whole lesson:23

bytes per token = 2 × layers × (kv_heads × head_dim) × bytes
total cache = bytes per token × tokens per request × concurrent requests

The first line is fixed by the model. The second line is what you control at serving time, and both factors multiply, which is why memory explodes fast.
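A minimal sketch of that split, assuming the serving-time factors are tokens per request and concurrent requests. The function names and the tiny model dimensions are illustrative only, not taken from any library or real model:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes one token adds to the cache: a K and a V, in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_bytes_total(per_token: int, tokens_per_request: int, concurrent_requests: int) -> int:
    """Total cache: the per-token size times the two serving-time factors."""
    return per_token * tokens_per_request * concurrent_requests


# Hypothetical toy model, chosen only to keep the arithmetic easy to follow.
per_token = kv_bytes_per_token(layers=2, kv_heads=4, head_dim=8, bytes_per_value=2)
print(per_token)                           # 256
print(kv_bytes_total(per_token, 100, 1))   # 25600
print(kv_bytes_total(per_token, 100, 4))   # 102400
print(kv_bytes_total(per_token, 200, 4))   # 204800
```

Doubling the context and quadrupling the batch multiplies the total by eight, which is the "both factors multiply" effect.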
Attention is a lookup. For each token the model builds three little vectors: a Query ("what am I looking for?"), a Key ("how do I label myself?"), and a Value ("the info I hand over if picked"). A token's Query is matched against every earlier token's Key; the best matches' Values get pulled in. That is attention.
The model runs many of these lookups in parallel — the heads (one tracks the subject, another the tense, …). So every token produces a K and a V per head, and the KV cache stores the K and V of every head, in every layer.
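To make the lookup and the growing cache concrete, here is a toy sketch of one head in one layer, written in plain Python. The query, key, and value numbers are made up; a real model computes them from the token's hidden state, and the scaled-softmax scoring shown is the standard formulation rather than something this lesson specifies:

```python
import math


def attend(query, keys, values):
    """Single-head attention: score the query against every cached key,
    softmax the scores, and return the weighted sum of the cached values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy cache: one K and one V per token seen so far.
k_cache, v_cache = [], []

# Made-up (query, key, value) triples standing in for three decode steps.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [10.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0], [0.0, 20.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 5.0]),
]

for q, k, v in steps:
    k_cache.append(k)  # the new token's K and V are stored once...
    v_cache.append(v)
    out = attend(q, k_cache, v_cache)  # ...and reread by this and every later step
    print(len(k_cache), [round(x, 2) for x in out])
```

The cache grows by one entry per step, and each step reads all of it; nothing earlier is recomputed.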
- kv_heads = how many heads store a K/V per token (32 for Llama 2 7B).
- head_dim = the length of one head's K (or V) vector (128 numbers).
- 2 × is the two things stored per head: the Key and the Value.

So kv_heads × head_dim × 2 is the width of one token's cache entry per layer. And GQA, coming up, is simply: keep all the Query heads, store fewer K/V heads.
kv_heads heads each store a Key + Value of head_dim numbers. That's the whole formula, drawn.

Llama 2 7B uses plain multi-head attention (one key/value head per attention head), so kv_heads = 32, and each head's key (and value) is a vector of head_dim = 128 numbers. Multiply them: every token stores a 32 × 128 = 4096-number key and a 4096-number value, per layer. That kv_heads × head_dim product is just the width of each half (K or V) of the cache entry.
Plug it in with the model's 32 layers and FP16 (2 bytes):4
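The arithmetic, written out as a short sketch. The model dimensions are the ones given above; the 4096-token context used in the last line is an illustrative choice (Llama 2's context window), not a number from this lesson:

```python
layers = 32          # Llama 2 7B
kv_heads = 32        # multi-head attention: one KV head per attention head
head_dim = 128
bytes_per_value = 2  # FP16

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(per_token)         # 524288 bytes
print(per_token / 1024)  # 512.0 KiB per token

# Illustrative context length: one request filling a 4096-token window.
tokens = 4096
print(per_token * tokens / 1024**3)  # 2.0 GiB for that single request
```

Half a mebibyte per token sounds small until it is multiplied by thousands of tokens and then by every concurrent request.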
The calculator makes the ceiling concrete: change nothing and you simply run out of cache before you run out of GPU. To serve more requests, you have to make each token's entry smaller — and the formula tells you exactly which knobs exist.
In 2 × layers × (kv_heads × head_dim) × bytes, the 2 is fixed (you
always store both K and V), and layers and head_dim are baked
into the model: change them and it's a different model. That leaves just two real levers:

- bytes: store each number in fewer bytes (fp8 = 1 byte instead of 2). Your cluster does this.
- kv_heads: store fewer K/V heads, which is what grouped-query attention (GQA) does.

Normally every attention head carries its own query, key, and value; that's
multi-head attention (MHA), so the number of KV heads equals the model's head
count. GQA keeps all the query heads (so the
model stays just as expressive) but lets a group of them share one key/value
head. Fewer KV heads stored → a smaller kv_heads in the formula → a smaller
cache, at almost no quality cost.5
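A small sketch of what "a group of them share one key/value head" means in index terms. It assumes consecutive query heads are grouped together, which is a common convention but an assumption here; `kv_head_for` is a hypothetical helper, and the 32-query / 8-KV split is just an example configuration:

```python
def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Index of the shared K/V head that a given query head reads from,
    assuming consecutive query heads form a group."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


# Example: 32 query heads sharing 8 KV heads, i.e. groups of 4.
mapping = [kv_head_for(q, 32, 8) for q in range(32)]
print(mapping[:8])        # [0, 0, 0, 0, 1, 1, 1, 1]
print(len(set(mapping)))  # 8 distinct K/V heads are stored
```

All 32 queries still run; only the number of K/V vectors written to the cache drops.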
Here's the spectrum, spelled out. MHA (multi-head attention) gives every query head its own K/V — Llama 2 7B has 32 query and 32 KV heads, 1:1. MQA is the opposite extreme: all query heads share a single K/V (tiny cache, a little more quality loss). GQA is the middle — keep all the query heads, but let groups share one K/V head. Llama 3 8B has 32 query but only 8 KV heads, so 4 queries share each → 4× smaller cache. Many askers, fewer answer-files.5
| | Query heads | KV heads stored | Cache |
|---|---|---|---|
| MHA (Llama 2 7B) | 32 | 32 — one each | full |
| GQA (Llama 3 8B) | 32 | 8 — 4 share each | quarter (4× smaller) |
| MQA | 32 | 1 — all share | tiny (more quality loss) |
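To put numbers on the three rows, the sketch below holds everything except kv_heads constant at 32 layers, head_dim 128, and FP16. Those match Llama 2 7B and Llama 3 8B; the MQA row is a hypothetical single-KV-head variant with the same shape, not a specific model:

```python
def per_token_kib(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> float:
    """Per-token KV cache size in KiB."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value / 1024


# Only kv_heads varies between the rows.
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    print(name, per_token_kib(32, kv_heads, 128, 2), "KiB/token")
# MHA 512.0 KiB/token
# GQA 128.0 KiB/token
# MQA 16.0 KiB/token
```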
The KV cache is per-request ephemeral state that grows over the request's life — like a pod's emptyDir / ephemeral volume. And it's the binding resource: how many requests (pods) fit on one GPU (node) is capped by KV memory, not compute — the inference version of a memory-bound node where you hit the RAM limit long before the CPU limit.
Qwen3.6-27B-FP8 on a 94 GiB H100 NVL uses GQA and an fp8 KV cache at once — both levers from this lesson:
At --gpu-memory-utilization 0.75 → ~70 GiB for
vLLM; weights ~27 GiB → ~43 GiB KV pool ≈ 2.7 full-context requests. That's the
real reason --max-num-seqs 8 and --max-model-len 131072
can't both be maxed — the cache, not the weights, is the limit. (Observed KV usage is
only 0.2–0.5% because your RAG prompts are far shorter than 128K.)
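The same arithmetic as a sketch, using only the figures quoted in this lesson (128 KiB/token, a ~43 GiB pool, 131072-token context, 8 sequences). The last line is an extra illustration of how long each of 8 concurrent requests could get inside that pool:

```python
KIB, GIB = 1024, 1024**3

per_token = 128 * KIB    # from the lesson: GQA + fp8 KV cache
max_model_len = 131072   # --max-model-len
kv_pool = 43 * GIB       # ~70 GiB for vLLM minus ~27 GiB of weights

per_request = per_token * max_model_len
print(per_request / GIB)      # 16.0 GiB for one full-context request
print(kv_pool / per_request)  # 2.6875 full-context requests fit
print(8 * per_request / GIB)  # 128.0 GiB needed if all 8 seqs ran at full context

# Longest context each of 8 concurrent requests could hold in the same pool.
print(kv_pool // (8 * per_token))  # 44032 tokens
```

Eight full-length requests would need about three times the pool, so one of the two flags has to give.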
Hit the "Qwen3.6-27B-FP8 · YOUR H100" preset in the calculator above and watch it overflow at batch 8.
Ever notice a Claude Code session is cheap on repeated turns but pricey to resume the next day? That's this exact KV cache: persisted across requests, given a TTL, and put on a price tag. It's called prompt caching, and Lesson 12 breaks down the read/write multipliers and why resuming tomorrow re-pays to rebuild it.
Picture the binders and the GPU bar, then answer from memory.