Inference Engineering · Lesson 10 · KV Cache & Memory

How Many Users Fit?

The KV cache, worked out one prediction at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: from a model's config alone you'll predict its KV-cache memory and why the cache, not the weights, caps how many requests fit.

The setup

Your H100 NVL has 94 GB. Qwen's 27B weights take ~27 GB in FP8, so ~67 GB looks free.

Step 1 — what gets stored?

In decode, attention needs the keys and values of every earlier token. Recomputing them each step would be wasteful, so the model keeps them.

Step 2 — how fast does it grow?

Step 3 — how big, concretely?

Llama 2 7B: 32 layers, kv_heads × head_dim = 4096, FP16 (2 bytes per element). The formula is bytes per token = 2 × layers × (kv_heads × head_dim) × bytes per element.

Recall — cover the screen: why does the cache grow linearly, and what's it ~per token for a 7B?
One K,V entry is added per token per layer, so it's linear in sequence length × batch; ≈ 0.5 MiB/token for Llama 2 7B in FP16 (2 GiB for a 4K sequence).

Step 4 — so what's the real limit?

The number of sequences whose caches fit in the memory left after the weights is your max batch, the number behind autoscaling and capacity planning. The weights are a fixed tax; the KV cache is what actually decides concurrency.

What's a head? (you'll need it next)

Steps 1–4 skipped one thing. Attention is a lookup: each token makes a Query (what it wants), a Key (how it labels itself), and a Value (what it hands over) — and the model runs ~32 of these in parallel, the heads. The KV cache stores the K and V of every head; that's the kv_heads in the formula. Now:

Step 5 — how do you fit more?

Look back at 2 × layers × (kv_heads × head_dim) × bytes. Which factors can you change without it being a different model?

Recall — say it: the two levers that shrink the cache, and what each does.
GQA: share KV heads across query heads (fewer kv_heads; Llama 3 8B = 4× smaller). fp8 KV: 1 byte/element instead of 2 (halves it). Your cluster runs both.

On YOUR cluster (real config)

Qwen3.6-27B-FP8 = GQA (4 KV heads, 6×) + fp8 KV → 128 KiB/token (vs 256 KiB at FP16). On 94 GiB at 0.75 utilization (~70 GiB), the weights take ~27 GiB, leaving a ~43 GiB KV pool ≈ 2.7 full-128K requests, which is exactly why --max-num-seqs 8 and --max-model-len 131072 can't both be maxed.

Read this next (primary source): Transformer Inference Arithmetic (kipply). It derives exactly the formula you just predicted.

Final check — teach it back

Explain to a colleague: "We can only batch ~N requests because…"
…each concurrent sequence needs its own KV cache, which grows with context length; once the caches fill the GPU memory left after the weights, no more requests fit: the cache, not the weights, sets the batch cap.
References
  1. Transformer Inference Arithmetic — kipply. kipp.ly · KV-cache formula & GQA.

The KV Cache & Its Memory Math

The growing pile that quietly decides how many requests you can serve.

Today's win: from a model's config alone, you'll compute its KV-cache memory — and explain why the cache, not the weights, usually caps how many requests you can batch.
Pantry recap: last lesson the van hauled the pantry for every bite. The KV cache is your stack of recipe binders: one new page per bite you cook this session. It only grows, and every van trip has to haul the whole stack.

1 · What's in the cache, and why it grows

In decode, attention needs the keys and values of every earlier token. Rather than recompute them each step, the model stores them: that's the KV cache. Every token you generate appends one more K,V entry that all future steps must read. [1]

[Figure: tokens generated (tok 1 to tok 4) above the KV cache row (K₁,V₁ to K₄,V₄). Every decode step re-reads the whole row, and it only gets longer.]
+1 K,V cell per token (the new binder page). The cache grows linearly with sequence length — and the whole thing is re-read every step.
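
To see the "one new page per token" idea in motion, here is a minimal sketch of a single attention head in one layer, in pure Python. The `project` function is a made-up stand-in for the model's learned projections (an assumption for illustration, not how a real model computes K and V), and the token IDs are arbitrary. The point is only that the cached loop appends one K,V entry per step and gets the same answer as rebuilding everything from scratch.

```python
import math

def project(token_id, salt, dim=4):
    """Toy stand-in for a learned projection (hypothetical, not a real model)."""
    return [math.sin(token_id * (i + 1) + salt) for i in range(dim)]

def attend(query, keys, values):
    """Softmax-weighted average of the values, scored by query-key dot product."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def decode_with_cache(token_ids):
    k_cache, v_cache, outputs = [], [], []
    for tok in token_ids:
        # Append exactly one K,V entry for the new token...
        k_cache.append(project(tok, salt=1.0))
        v_cache.append(project(tok, salt=2.0))
        # ...then read the whole cache to attend over every token so far.
        outputs.append(attend(project(tok, salt=0.0), k_cache, v_cache))
    return outputs, len(k_cache)

def decode_without_cache(token_ids):
    outputs = []
    for step in range(1, len(token_ids) + 1):
        prefix = token_ids[:step]
        # Rebuild every earlier K and V from scratch at each step.
        keys = [project(t, salt=1.0) for t in prefix]
        values = [project(t, salt=2.0) for t in prefix]
        outputs.append(attend(project(prefix[-1], salt=0.0), keys, values))
    return outputs

tokens = [3, 1, 4, 1, 5]
cached_out, cache_len = decode_with_cache(tokens)
print(cache_len)  # 5: one entry per token
```

The uncached version recomputes 1 + 2 + 3 + 4 + 5 projections for the same five tokens; the cached version trades that compute for memory that grows by one entry per token.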

2 · The formula

The size of one token's entry is just the product of the things that exist in the model. Memorize this, it's the whole lesson: [2][3]

bytes per token = 2 (K&V) × layers × (kv_heads × head_dim) × bytes/elem

total cache = bytes per token × seq_len × batch

The first line is fixed by the model. The second line is what you control at serving time, and its two factors multiply each other, which is why memory climbs so quickly when context and batch grow together.

[Figure: memory plotted against sequence length × batch. The relationship is linear: no free lunch.]
Double the context or double the batch, double the cache. Linear, but it adds up to gigabytes quickly.
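
The two formula lines translate almost word for word into code. This is a small sketch, with function names of our own choosing and a hypothetical model config (24 layers, 16 KV heads, head_dim 64) picked only to show the scaling, not taken from any real model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Line 1 of the formula: K and V for one token, across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_bytes(per_token, seq_len, batch):
    """Line 2 of the formula: scale by what you control at serving time."""
    return per_token * seq_len * batch

# Hypothetical config, for illustration only.
per_token = kv_bytes_per_token(layers=24, kv_heads=16, head_dim=64, bytes_per_elem=2)

base = kv_cache_bytes(per_token, seq_len=2048, batch=4)
double_context = kv_cache_bytes(per_token, seq_len=4096, batch=4)
double_both = kv_cache_bytes(per_token, seq_len=4096, batch=8)

print(double_context // base)  # 2
print(double_both // base)     # 4
```

Doubling one factor doubles the cache; doubling both quadruples it. Nothing here is exponential, but the product gets large quickly.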

First — what's a "head"? (this is the piece GQA needs)

Attention is a lookup. For each token the model builds three little vectors: a Query ("what am I looking for?"), a Key ("how do I label myself?"), and a Value ("the info I hand over if picked"). A token's Query is matched against every earlier token's Key; the best matches' Values get pulled in. That is attention.

The model runs many of these lookups in parallel, and each one is a head (as a loose intuition, one might track the subject, another the tense, …). So every token produces a K and a V per head, and the KV cache stores the K and V of every head, in every layer.

So kv_heads × head_dim × 2 is the width of one token's cache entry per layer. And GQA, coming up, is simply: keep all the Query heads, store fewer K/V heads.

[Figure: one token of text fans out into kv_heads = 32 heads (head 1, head 2, …, 32 in total). Each head holds a K card and a V card, each head_dim = 128 numbers long. K = how to find this token; V = what it carries. Repeated across 32 layers, this is the token's full cache entry.]
Per token, per layer: kv_heads heads each store a Key + Value of head_dim numbers. That's the whole formula, drawn.

3 · Worked example — Llama 2 7B on an 80 GB GPU

Llama 2 7B uses plain multi-head attention (one key/value head per attention head), so kv_heads = 32, and each head's key (and value) is a vector of head_dim = 128 numbers. Multiply them: every token stores a 32 × 128 = 4096-number key and a 4096-number value, per layer. That kv_heads × head_dim product is the width of each K (or V) vector in the cache entry. Plug it in with the model's 32 layers and FP16 (2 bytes): [4]

per token = 2 × 32 × 4096 × 2 = 524,288 B = 0.5 MiB
@ 4,096-token context → 0.5 MiB × 4096 = 2 GiB per sequence
batch of 8 → 16 GiB of cache  ·  weights ≈ 14 GiB
Pantry: the shelves (HBM) are a fixed size. The weights are a permanent display you can't move. Every concurrent customer's binders eat the shelf space that's left — run out, and you can't seat another customer.
[Figure: an 80 GiB GPU memory bar. The weights (≈14 GiB) are fixed and the KV cache fills the rest: ~66 GiB left ÷ 2 GiB/seq ≈ 33 concurrent 4K sequences, then you're full.]
The weights are a fixed tax; the KV cache is what actually limits how many requests fit. That cap is your max batch — the lever behind autoscaling and capacity planning.
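
The same arithmetic as a sketch you can rerun with other numbers. It uses the lesson's Llama 2 7B figures and makes one simplifying assumption, stated in the comments: everything left after the weights goes to KV cache, with nothing reserved for activations or allocator overhead. The function name is ours.

```python
GIB = 1024 ** 3

def max_concurrent_sequences(gpu_bytes, weight_bytes, per_token_bytes, seq_len):
    """How many full-length sequences fit in the memory left after the weights.

    Simplification: ignores activations, fragmentation, and runtime overhead.
    """
    free_bytes = gpu_bytes - weight_bytes
    per_sequence = per_token_bytes * seq_len
    return free_bytes // per_sequence

# Llama 2 7B in FP16: 2 (K and V) x 32 layers x (32 heads x 128 dims) x 2 bytes.
per_token = 2 * 32 * (32 * 128) * 2
per_sequence = per_token * 4096

fits = max_concurrent_sequences(
    gpu_bytes=80 * GIB,
    weight_bytes=14 * GIB,
    per_token_bytes=per_token,
    seq_len=4096,
)
print(per_token)  # 524288 bytes = 0.5 MiB
print(fits)       # 33
```

A real server will fit somewhat fewer than this, because the memory the sketch ignores is not free in practice.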

4 · You're capped by the cache — which knob can you turn?

The worked example makes the ceiling concrete: change nothing and you run out of cache memory before you run out of GPU compute. To serve more requests, you have to make each token's entry smaller, and the formula tells you exactly which knobs exist.

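In 2 × layers × (kv_heads × head_dim) × bytes, the 2 is fixed (you always store both K and V), and layers and head_dim are baked into the model; change them and it's a different model. That leaves just two real levers: kv_heads, which Grouped-Query Attention shrinks, and bytes per element, which an fp8 KV cache halves (1 byte instead of 2).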

Grouped-Query Attention (GQA)

Normally every attention head carries its own query, key, and value. That's multi-head attention (MHA), so the number of KV heads equals the model's head count. GQA keeps all the query heads (so the model stays just as expressive) but lets a group of them share one key/value head. Fewer KV heads stored → a smaller kv_heads in the formula → a smaller cache, at almost no quality cost. [5]

Here's the spectrum, spelled out. MHA (multi-head attention) gives every query head its own K/V: Llama 2 7B has 32 query and 32 KV heads, 1:1. MQA is the opposite extreme: all query heads share a single K/V (tiny cache, a little more quality loss). GQA is the middle: keep all the query heads, but let groups share one K/V head. Llama 3 8B has 32 query but only 8 KV heads, so 4 queries share each → 4× smaller cache. Many askers, fewer answer-files. [5]

[Figure: three panels. Top row = query heads (the questions); bottom row = stored K/V heads (the cache). MHA, 32 KV heads (Llama 2 7B): 1 KV per query, full cache. GQA, 8 KV heads (Llama 3 8B): 4 queries share 1 KV, a quarter of the cache. MQA, 1 KV head: all queries share 1, tiny cache.]
All three keep the same query heads — only the number of stored K/V heads changes. Fewer KV heads → smaller cache. GQA is the practical middle.
| | Query heads | KV heads stored | Cache |
| --- | --- | --- | --- |
| MHA (Llama 2 7B) | 32 | 32 (one each) | full |
| GQA (Llama 3 8B) | 32 | 8 (4 share each) | quarter (4× smaller) |
| MQA | 32 | 1 (all share) | tiny (more quality loss) |
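
To put numbers on the three rows, this sketch changes only kv_heads and holds everything else at the Llama 2 7B values from the worked example (32 layers, head_dim 128, FP16). Holding those fixed is our assumption, made to isolate the one factor GQA touches. The MQA row is a hypothetical 1-KV-head variant, not a specific released model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """K and V for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Only kv_heads differs between the three schemes.
kv_heads_by_scheme = {"MHA": 32, "GQA": 8, "MQA": 1}

kib_per_token = {
    scheme: kv_bytes_per_token(layers=32, kv_heads=kv, head_dim=128, bytes_per_elem=2) // 1024
    for scheme, kv in kv_heads_by_scheme.items()
}

for scheme, kib in kib_per_token.items():
    shrink = kib_per_token["MHA"] // kib
    print(f"{scheme}: {kib} KiB/token ({shrink}x smaller than MHA)")
```

Because kv_heads enters the formula as a plain multiplier, the cache shrinks by exactly the sharing ratio: 4 queries per KV head gives a 4× smaller cache.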

In Kubernetes terms (infra bridge)

The KV cache is per-request ephemeral state that grows over the request's life — like a pod's emptyDir / ephemeral volume. And it's the binding resource: how many requests (pods) fit on one GPU (node) is capped by KV memory, not compute — the inference version of a memory-bound node where you hit the RAM limit long before the CPU limit.

On YOUR cluster: both fixes, stacked (real config)

Qwen3.6-27B-FP8 on a 94 GiB H100 NVL uses GQA and an fp8 KV cache at once — both levers from this lesson:

config: 64 layers · 4 KV heads (GQA 24→4 = 6×) · head_dim 256 · kv fp8 = 1 B
per token = 2 × 64 × (4×256) × 1 = 128 KiB  (256 KiB at FP16)
128K-token request → 16 GiB of KV for ONE sequence

At --gpu-memory-utilization 0.75 → ~70 GiB for vLLM; weights ~27 GiB → ~43 GiB KV pool ≈ 2.7 full-context requests. That's the real reason --max-num-seqs 8 and --max-model-len 131072 can't both be maxed — the cache, not the weights, is the limit. (Observed KV usage is only 0.2–0.5% because your RAG prompts are far shorter than 128K.)
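
Here is that budget as a short sketch, using only the numbers quoted above. Treating the KV pool as "vLLM's budget minus the weights" is a simplification on our part: it ignores activation memory and other runtime overhead, so the real pool would be a little smaller.

```python
GIB = 1024 ** 3

# Config from the lesson: 64 layers, 4 KV heads, head_dim 256, fp8 KV (1 byte).
per_token = 2 * 64 * (4 * 256) * 1
max_model_len = 131072                       # --max-model-len
per_request = per_token * max_model_len      # KV for one full-context sequence

budget = 0.75 * 94 * GIB                     # --gpu-memory-utilization 0.75
kv_pool = budget - 27 * GIB                  # simplification: budget minus weights
full_requests = kv_pool / per_request

needed_for_8 = 8 * per_request               # --max-num-seqs 8, all at full context

print(per_token // 1024)         # 128 (KiB per token)
print(per_request // GIB)        # 16 (GiB per full-context request)
print(round(full_requests, 1))   # 2.7
print(needed_for_8 > kv_pool)    # True
```

Eight full-context sequences would need 128 GiB of cache against a pool of roughly 43 GiB, which is the conflict between the two flags in numbers.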

At batch 8 with full 128K contexts, the cache would need 8 × 16 GiB = 128 GiB, about three times the ~43 GiB pool, so it overflows.

Bonus: you pay for this cache every day (bridge to the API)

Ever notice a Claude Code session is cheap on repeated turns but pricey to resume the next day? That's this exact KV cache: persisted across requests, given a TTL, and given a price tag. It's called prompt caching, and Lesson 12 breaks down the read/write multipliers and why resuming tomorrow re-pays to rebuild it.

Read this next (primary source): Transformer Inference Arithmetic (kipply). The rigorous reference for this math (KV cache, weights, arithmetic intensity). Read the "kv cache" section; it derives exactly the formula above.

Check yourself (recall, don't peek)

Picture the binders and the GPU bar, then answer from memory.

References
  1. Mastering LLM Techniques: Inference Optimization — NVIDIA Technical Blog. developer.nvidia.com
  2. Transformer Inference Arithmetic — kipply. kipp.ly
  3. How to Scale Your Model: Inference — JAX ML scaling book. jax-ml.github.io
  4. Llama 2 config (32 layers, 4096 hidden, 32 heads) — Hugging Face. huggingface.co
  5. Grouped-Query Attention: shrinking the KV cache. zeroentropy.dev