Inference Engineering · Lesson 14 · Prefix Caching & the KV Hierarchy

Prefix Caching & the KV Hierarchy

Reuse the shared opening — one guess at a time.

Today's win: you'll predict how prefix caching cuts TTFT (time to first token), where cached KV lives, and why routing must be cache-aware to get the win.

The setup

Your requests mostly start the same way — same system prompt, same RAG context, same chat history. Re-prefilling that shared opening every time feels wasteful. Can you avoid it?

Recall — cover the screen: what is prefix caching?
Hash an incoming prefix; if its KV is already cached (a hit), reuse those blocks and skip re-prefilling them. Only the new suffix is processed, so TTFT drops. vLLM: --enable-prefix-caching; SGLang: RadixAttention (a tree of shared prefixes).

In Kubernetes terms (infra bridge)

Prefix caching = the image-layer cache (shared base layers built once). Cache-aware routing = session affinity / consistent-hash to the replica with the warm cache — round-robin scatters and misses, exactly like sending a stateful session to the wrong pod.

On YOUR cluster (live flag)

--enable-prefix-caching is ON, and your traffic is 19–47:1 prefill-heavy (big shared RAG prompts), which is the ideal case. Same mechanism as the prompt caching bridge in Lesson 12, minus the TTL/price. Capture more with cache-aware routing (Lesson 25).

Read this next (primary source): SGLang & RadixAttention. Runnable: day16 notebook.

Final check — teach it back

Explain to a colleague: "Prefix caching helps our RAG app because…"
…every request shares a big system prompt + retrieved context, whose KV is identical, so we prefill it once and reuse it. The user only waits for the new question to be processed, slashing TTFT. We just have to route requests with the same prefix to the same replica (cache-aware routing), or we miss the cache.

Prefix Caching & the KV Hierarchy

Reuse the KV of a shared prefix — and route so the cache actually hits.

Today's win: you'll explain how reusing a shared prefix's KV skips re-prefilling it (cutting TTFT), the memory hierarchy that stores those cached blocks, and why routing has to be cache-aware — the production form of Lesson 12's prompt-caching bridge.

The picture: make the common base sauce once

Tons of your requests start the same way — the same system prompt, the same RAG context, the same chat history so far. Re-prefilling that shared opening every time is like re-chopping the same onions for every order. Prefix caching preps the shared base once and reuses its KV for everyone who starts the same way.

the shared base sauce, made once = cached prefix KV (reused across requests)
fridge → pantry → cold storage = KV hierarchy: VRAM → host RAM → SSD
send the order to the station with the sauce = cache-aware routing

1 · The opportunity: shared prefixes everywhere

System prompts, retrieved RAG documents, multi-turn history: these repeat across requests. Since the KV cache entry for a token depends only on that token and the tokens before it (causal attention never looks ahead), identical prefixes produce identical KV. So you can compute it once and reuse it [1].

[Figure: req 1 (shared system prompt + RAG context, then question A) and req 2 (same shared prefix, then question B) have an identical opening. The prefix KV is prefilled once and reused by both; only the suffix is new.]
A token's KV depends only on that token and what precedes it, so a shared prefix has shared KV. Prefill it once; every request that starts the same way reuses it.
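
To see the claim concretely, here is a toy sketch in plain Python. It is not a real transformer: `embed` is a made-up stand-in for a learned embedding table and `causal_layer` is a single simplified attention layer, both assumptions for illustration only. What it does share with the real thing is the causal mask, and that is all the argument needs.

```python
import math

def embed(token_id, dim=4):
    # Hypothetical deterministic "embedding"; stands in for a learned table.
    return [math.sin(token_id * (k + 1)) for k in range(dim)]

def causal_layer(xs):
    """One toy causal self-attention layer (query = key = value = input)."""
    out = []
    for i, q in enumerate(xs):
        # Causal mask: position i may only look at positions 0..i.
        scores = [sum(a * b for a, b in zip(q, xs[j])) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * xs[j][k] for j, w in enumerate(weights)) / total
            for k in range(len(q))
        ])
    return out

def toy_kv(tokens):
    """Per-position state that the next layer's K/V would be computed from."""
    return causal_layer([embed(t) for t in tokens])

shared = [11, 42, 7, 99]        # e.g. system prompt + RAG context
req_a = shared + [5, 6]         # ...question A
req_b = shared + [8, 9, 10]     # ...question B

kv_a, kv_b = toy_kv(req_a), toy_kv(req_b)
prefix_matches = kv_a[:len(shared)] == kv_b[:len(shared)]
suffix_matches = kv_a[len(shared)] == kv_b[len(shared)]
print(prefix_matches, suffix_matches)
```

In this toy, every position inside the shared prefix comes out identical across the two requests, while the first suffix position differs. That is the property a prefix cache relies on.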

2 · Prefix caching cuts TTFT

The engine hashes each incoming prefix; on a hit, it reuses the cached KV blocks and skips re-prefilling them, so the user's time to first token (TTFT) drops to roughly the cost of processing the new suffix. vLLM does this with --enable-prefix-caching; SGLang generalizes it with RadixAttention (a radix tree of all live prefixes, so even partial overlaps share) [2].
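
The hash-and-reuse idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not vLLM's actual code: `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are hypothetical names, the cache never evicts, and "KV" is represented only by the hash of the block it would belong to.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (assumed; chosen small for the demo)

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = set()  # hashes of blocks whose KV we pretend is resident

    def prefill(self, tokens):
        """Return (tokens_reused, tokens_computed) for one request."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self.blocks:
                break  # first miss ends the reusable prefix
            hit_blocks += 1
        self.blocks.update(hashes)  # freshly computed blocks become reusable
        reused = hit_blocks * BLOCK_SIZE
        return reused, len(tokens) - reused

shared = list(range(100, 112))        # 12 shared tokens = 3 full blocks
req_a = shared + [1, 2, 3]            # ...question A
req_b = shared + [4, 5, 6, 7, 8]      # ...question B

cache = PrefixCache()
first = cache.prefill(req_a)    # cold: nothing to reuse
second = cache.prefill(req_b)   # warm: the shared blocks hit
print(first, second)
```

The second request reuses the 12 shared tokens and only computes its own 5. Chaining each block's hash to its parent matters: a block only counts as a hit if everything before it matched too.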

Pantry: the base sauce is already simmering — a new order only needs its finishing touches, so the first bite arrives much sooner.
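
The "even partial overlaps share" idea behind a prefix tree can be sketched with a plain token trie. This is a simplified stand-in and an assumption for illustration, not SGLang's implementation; `PrefixTrie` and its methods are hypothetical names, and there is no eviction or compression of edges.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a sequence whose KV we pretend is now cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, depth = self.root, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                break
            depth += 1
        return depth

system = [1, 2, 3, 4]          # shared system prompt
doc_a, doc_b = [10, 11], [20, 21]  # two different retrieved contexts

trie = PrefixTrie()
trie.insert(system + doc_a + [100])

same_doc = trie.match_len(system + doc_a + [101])   # shares prompt and context
other_doc = trie.match_len(system + doc_b + [102])  # shares only the prompt
unrelated = trie.match_len([9, 9])
print(same_doc, other_doc, unrelated)
```

A request with a different document still reuses the system-prompt portion, which is the partial-overlap case.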

In Kubernetes terms (infra bridge)

Prefix caching is the image-layer cache: identical base layers are pulled and built once, and every image that shares them starts fast — only the top (changed) layer is new work. A cache miss is a cold pull. Which is exactly why the next piece matters…

3 · The KV memory hierarchy

Cached prefixes compete for scarce VRAM. So engines tier them: hottest in GPU VRAM, spilled to host RAM, then local SSD, then networked storage, each bigger but slower. There's a race: if fetching a cached block from a lower tier is slower than just re-prefilling it, you re-prefill instead [1].

[Figure: the KV tiers from left to right: VRAM (hottest) → host RAM → local SSD → networked storage. Left is faster and smaller; right is bigger and slower. The race: if a lower tier is slower to fetch than to re-prefill, just re-prefill.]
Cached KV spills down the hierarchy as VRAM fills. The engine weighs fetch-vs-recompute for each block — bandwidth, not capacity, decides.
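
The fetch-vs-recompute race is simple arithmetic, sketched below. Every number here (tier bandwidths, latencies, KV bytes per token, prefill throughput) is invented for illustration, and `Tier`, `fetch_ms`, `recompute_ms`, and `should_fetch` are hypothetical names; plug in measurements from your own hardware. The sketch also assumes prefill time grows linearly with token count, which is a simplification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    bandwidth_gb_s: float  # assumed sustained read bandwidth toward the GPU
    latency_ms: float      # assumed fixed cost per fetch

def fetch_ms(kv_bytes, tier):
    """Time to pull cached KV up from a lower tier."""
    return tier.latency_ms + kv_bytes / (tier.bandwidth_gb_s * 1e9) * 1e3

def recompute_ms(num_tokens, prefill_tokens_per_s):
    """Time to simply re-prefill the same tokens on the GPU."""
    return num_tokens / prefill_tokens_per_s * 1e3

def should_fetch(num_tokens, kv_bytes_per_token, tier, prefill_tokens_per_s):
    """True if fetching wins the race against re-prefilling."""
    kv_bytes = num_tokens * kv_bytes_per_token
    return fetch_ms(kv_bytes, tier) < recompute_ms(num_tokens, prefill_tokens_per_s)

# All values below are made up for the demo.
HOST_RAM = Tier("host RAM", bandwidth_gb_s=25.0, latency_ms=0.1)
LOCAL_SSD = Tier("local SSD", bandwidth_gb_s=3.0, latency_ms=0.2)
NETWORK = Tier("networked storage", bandwidth_gb_s=1.0, latency_ms=5.0)

PREFIX_TOKENS = 4096
KV_BYTES_PER_TOKEN = 128 * 1024  # depends on model size and dtype
PREFILL_TOK_S = 10_000.0

decisions = {
    tier.name: should_fetch(PREFIX_TOKENS, KV_BYTES_PER_TOKEN, tier, PREFILL_TOK_S)
    for tier in (HOST_RAM, LOCAL_SSD, NETWORK)
}
print(decisions)
```

With these invented numbers, host RAM and local SSD beat recomputing, while networked storage loses and the block is re-prefilled. Both sides of the comparison scale with token count, so the outcome is driven mostly by tier bandwidth versus prefill throughput.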

4 · Cache-aware routing — or you miss

Here's the operational catch you'll feel: prefix caches are per replica. If your load balancer sends a request to a replica that doesn't hold its prefix, it's a miss: full re-prefill, no savings. So routing must be cache-aware: hash the prefix and send matching requests to the same replica (covered in Lesson 25) [2].
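
A small simulation makes the difference visible. This is a sketch under loud assumptions: `PREFIX_LEN`, the router helpers, and `hit_rate` are hypothetical names, each replica's cache is unbounded (no eviction), and a "hit" just means the replica has seen that prefix before.

```python
import hashlib
from itertools import cycle

PREFIX_LEN = 8  # assumed: how many leading tokens identify the shared opening

def prefix_key(tokens):
    """Hash only the leading tokens, where the shared prompt lives."""
    return hashlib.sha256(repr(tokens[:PREFIX_LEN]).encode()).hexdigest()

def make_prefix_router(num_replicas):
    # Same prefix -> same replica, every time.
    return lambda tokens: int(prefix_key(tokens), 16) % num_replicas

def make_round_robin_router(num_replicas):
    order = cycle(range(num_replicas))
    return lambda tokens: next(order)

def hit_rate(requests, num_replicas, router):
    """Fraction of requests landing on a replica that already holds their prefix."""
    warm = [set() for _ in range(num_replicas)]  # prefix keys cached per replica
    hits = 0
    for tokens in requests:
        replica, key = router(tokens), prefix_key(tokens)
        hits += key in warm[replica]
        warm[replica].add(key)
    return hits / len(requests)

# 5 distinct shared prefixes, 8 requests each, interleaved, across 4 replicas.
requests = [[n % 5] * PREFIX_LEN + [1000 + n] for n in range(40)]
rr_rate = hit_rate(requests, 4, make_round_robin_router(4))
affinity_rate = hit_rate(requests, 4, make_prefix_router(4))
print(rr_rate, affinity_rate)
```

In this toy run, round-robin has to warm every prefix on every replica before it starts hitting, while prefix-hash routing pays one miss per prefix. Round-robin also ends up storing each prefix on all replicas, which costs VRAM that the simulation does not model. Plain modulo hashing reshuffles assignments when the replica count changes; consistent hashing is the usual way to limit that.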

In Kubernetes terms (infra bridge)

This is session affinity / consistent-hash routing. A round-robin Service scatters requests and tanks your hit rate. Service-level sessionAffinity pins by client IP, which only approximates what you want; a router that hashes the prefix sends related requests to the replica that already has the warm cache. It is the same reason you route a user's session to the pod holding their state.

On YOUR cluster: already on, and it matters most for you (live flag)

Your vLLM runs --enable-prefix-caching (confirmed in the server args). And your traffic is 19–47:1 prefill-heavy (classic RAG with big shared system prompts), exactly the workload where prefix caching is a huge TTFT win. This is the same mechanism as the prompt caching bridge in Lesson 12, just without the TTL/billing wrapper. Next step to capture more of it: cache-aware routing (Lesson 25).

Read this next (primary source): SGLang & RadixAttention, LMSYS. Runnable companion: day16 notebook, prefix caching as a prompt-design problem.

References
  1. Prefix caching & the KV memory hierarchy — day16 (kv-cache-prefix-caching.ipynb).
  2. SGLang: RadixAttention for automatic KV reuse — LMSYS (lmsys.org).