Speculative Decoding

Several tokens per pass — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict how a cheap draft + one verification pass yields several tokens at once, why it's lossless, and when it actually helps.

The setup

Decode is memory-bound: per token the GPU streams the whole model from memory but does little math — the compute sits mostly idle. That idle compute is an opportunity.

Step 1 — the mechanism

Step 2 — does it change the output?

Recall — cover the screen: speculative decoding in one line.
A small draft model proposes K next tokens; the target model verifies all K in one forward pass and accepts the longest correct prefix. Rejection sampling keeps the output distribution identical to plain decoding from the target (the same tokens under greedy decoding). It trades decode's idle compute for fewer passes.

Step 3 — what determines the speedup?

Step 4 — your model (real)

Step 5 — when does it NOT help much?

In Kubernetes terms (infra bridge)

Optimistic / speculative execution: the draft is a cheap guess (like an optimistic lock or admission webhook), the target validates then commits and rolls back the wrong tail. Bet on the cheap path, verify before you trust it.

On YOUR cluster (config + flags)

Qwen3.6's config has mtp_num_hidden_layers: 1, a built-in MTP (multi-token-prediction) self-speculation head. But the server has no --speculative-* flag, so it is not enabled. Turning it on could speed up single-stream decode; at high batch the gain shrinks because batching already uses the compute. Measure it.

Read this next (primary source): Fast Inference from Transformers via Speculative Decoding, Leviathan et al. Runnable companion: day15 notebook.

Final check — teach it back

Explain to a colleague: "Speculative decoding speeds up decode by…"
…using a cheap draft model to guess the next K tokens, then verifying all K in one pass of the real model and accepting the correct prefix. Because decode is memory-bound, that verification uses otherwise-idle compute, so we get several tokens per pass, and rejection sampling keeps the output distribution exactly the same as normal decoding.

I'm your teacher — that's Part II done. Want to enable MTP speculation on your cluster and measure it?


References

day15 — speculative decoding (notebook); Leviathan et al.; Medusa/EAGLE self-speculation.

Speculative Decoding

Spend the idle compute: guess several tokens, verify them in one pass.

Today's win: you'll explain how a cheap draft proposes K tokens that the big model verifies in a single forward pass — turning decode's wasted compute into fewer passes, with identical output. It's the fix teased back in Lesson 7.

The picture: a junior cook guesses, the chef checks all at once

Decode is memory-bound — the GPU's compute mostly sits idle while it streams weights for one token. Speculative decoding fills that idle time: a fast junior cook (a small draft model) guesses the next few steps; the head chef (your real model) checks all the guesses in one glance and keeps the correct run. Several tokens for the price of one pass.

the fast junior cook's guesses	draft model proposes K tokens
the chef checking them all at once	target model verifies in one forward pass
keep the correct run, redo the rest	accept the longest correct prefix

1 · Why it's even possible: decode wastes compute

At small batch, decode's arithmetic intensity is tiny: the tensor cores are nearly idle while memory bandwidth is the bottleneck (Lesson 11). Checking several candidate tokens in one pass costs almost the same as checking one, because the pass is limited by streaming the weights, not by the math. That spare compute is the opportunity. [1]

Decode leaves compute on the table. Speculative decoding spends it — verifying a batch of candidate tokens in the pass you were going to make anyway.
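To see why extra candidate tokens are nearly free, here is a rough roofline-style sketch. It assumes a forward pass costs the larger of weight-streaming time and matmul time, counts about 2 FLOPs per parameter per token, and ignores the KV cache and attention. The hardware figures, the model size, and the name pass_time_s are illustrative assumptions, not measurements from this lesson.

```python
def pass_time_s(tokens_in_pass, n_params, bytes_per_param, peak_flops, mem_bw):
    """Crude roofline estimate for one forward pass (weights only)."""
    # Every pass streams all weights from memory once, however many tokens it scores.
    t_memory = n_params * bytes_per_param / mem_bw
    # Matmul work grows with the number of tokens scored in the pass.
    t_compute = 2 * n_params * tokens_in_pass / peak_flops
    return max(t_memory, t_compute)

# Illustrative round numbers (assumptions): a 27B-parameter model at 1 byte per weight,
# on an accelerator with 1e15 FLOP/s peak and 3e12 bytes/s memory bandwidth.
N_PARAMS = 27e9
BYTES_PER_PARAM = 1
PEAK_FLOPS = 1.0e15
MEM_BW = 3.0e12

for tokens in (1, 4, 8, 1000):
    t = pass_time_s(tokens, N_PARAMS, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BW)
    print(f"{tokens:5d} tokens in the pass -> {t * 1e3:.2f} ms")
```

Under these assumptions the estimate stays flat for a handful of tokens and only rises once the pass scores enough tokens to become compute-bound, which is the headroom speculation spends.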

2 · Draft → verify → accept

The loop: a small, cheap draft model quickly proposes the next K tokens. Your real target model then runs one forward pass over all K and checks each. You accept the longest correct prefix and discard the rest; at the first mismatch the target supplies its own token, so every pass yields at least one. Crucially, a rejection-sampling rule makes the output distribution mathematically identical to normal decoding from the target: it's lossless, not an approximation. [1]
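The rejection-sampling rule can be sketched in a few lines. This is a minimal toy version of the rule described in Leviathan et al. [1], written over plain probability lists rather than real model outputs; the function names and the example distributions p and q are hypothetical. Each drafted token is kept with probability min(1, p/q); on the first rejection a replacement is drawn from the normalized positive part of p - q.

```python
import random

def sample(dist, rng):
    """Draw one token id from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def residual(p, q):
    """Normalized max(0, p - q): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total is 0 only when p equals q, where a rejection cannot occur; fall back to p.
    return [d / total for d in diff] if total > 0 else list(p)

def speculative_verify(draft_tokens, q_dists, p_dists, rng):
    """Return the tokens emitted by one verify step (between 1 and K+1 tokens).

    draft_tokens: K tokens sampled from the draft distributions q_dists.
    p_dists: K+1 target distributions, as produced by one target forward pass.
    """
    emitted = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)                       # accept the drafted token
        else:
            emitted.append(sample(residual(p, q), rng))  # corrected token
            return emitted                            # discard the rest of the draft
    emitted.append(sample(p_dists[len(draft_tokens)], rng))  # all accepted: bonus token
    return emitted

def first_token_frequencies(p, q, trials=20000, seed=0):
    """Empirical distribution of the first emitted token when drafting from q."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(trials):
        drafted = sample(q, rng)
        counts[speculative_verify([drafted], [q], [p, p], rng)[0]] += 1
    return [c / trials for c in counts]

p = [0.6, 0.3, 0.1]  # hypothetical target distribution
q = [0.2, 0.5, 0.3]  # hypothetical (badly aligned) draft distribution
print(first_token_frequencies(p, q))  # should land close to p, not q
```

Even with a poorly aligned draft, the emitted tokens follow the target distribution; a bad draft costs speed, not correctness.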

One target pass yielded 3 accepted tokens plus a corrected 4th. Plain decoding would need 4 forward passes for those 4 tokens; here they cost one target pass plus the cheap drafting.
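For greedy decoding the accept step reduces to a prefix comparison. The sketch below is a toy illustration with made-up tokens; greedy_verify and the example strings are hypothetical, and a real system would compare token ids taken from the target's logits.

```python
def greedy_verify(draft_tokens, target_choices):
    """Keep the longest matching prefix, then append the target's own next token.

    draft_tokens: the K tokens proposed by the draft.
    target_choices: the K+1 tokens the target would pick at each position,
    all obtained from a single forward pass over the drafted tokens.
    """
    emitted = []
    for proposed, wanted in zip(draft_tokens, target_choices):
        if proposed != wanted:
            break
        emitted.append(proposed)
    # At the first mismatch this is the correction; if everything matched it is a
    # bonus token. Later target choices were conditioned on a wrong token, so they
    # are discarded.
    emitted.append(target_choices[len(emitted)])
    return emitted

draft = ["the", "cat", "sat", "in"]
target = ["the", "cat", "sat", "on", "the"]
print(greedy_verify(draft, target))  # ['the', 'cat', 'sat', 'on']
```

Three drafted tokens are accepted and the fourth is replaced by the target's choice, so a single target pass produces four tokens.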

In Kubernetes terms (infra bridge)

This is speculative / optimistic execution. The draft is a fast, cheap guess (like an admission webhook or an optimistic lock); the target is the authoritative check that validates then commits, rolling back the wrong tail. You bet that the cheap path is usually right, and verify before you trust it — same pattern as optimistic concurrency control.

3 · It lives or dies on the acceptance rate

If the draft agrees with the target often (high acceptance rate), you accept many tokens per pass, on the order of a 2–3× decode speedup. If it's a poor match, you accept few and the drafting work is mostly wasted. So drafts must be cheap and aligned. The flavors: [2]

Separate draft model — a small sibling of the target.
Self-speculation (Medusa, EAGLE) — extra lightweight heads on the target itself predict several tokens ahead — no second model.
N-gram / prompt lookup — for repetitive text, just propose from what's already there.
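The acceptance-rate argument can be made quantitative with the estimate from Leviathan et al. [1]. The sketch assumes each drafted token is accepted independently with probability alpha, that a verify pass costs the same as a normal target pass, and that one draft step costs cost_ratio times a target pass. Those are simplifying assumptions, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target pass with K drafted tokens."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, cost_ratio):
    """Expected wall-clock speedup over plain decoding.

    One round costs K draft steps plus one target pass, i.e. (k * cost_ratio + 1)
    target-pass equivalents, and emits expected_tokens_per_pass tokens.
    """
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1)

# Illustrative values only: a draft that costs 5% of a target pass, K = 4.
for alpha in (0.3, 0.6, 0.8, 0.9):
    tokens = expected_tokens_per_pass(alpha, 4)
    gain = expected_speedup(alpha, 4, 0.05)
    print(f"alpha={alpha:.1f}: {tokens:.2f} tokens/pass, {gain:.2f}x")
```

With alpha near zero the speedup drops below 1, since the drafting cost is paid for nothing; that is why a cheap but poorly aligned draft can make decoding slower.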

On YOUR cluster (config + flags): your model ships with a speculative head

From Qwen3.6's config.json: mtp_num_hidden_layers: 1 — a built-in Multi-Token Prediction (MTP) head, i.e. a self-speculation head that proposes the next token cheaply. But your server args have no --speculative-* flag — so it isn't switched on yet.

The lever: enabling MTP speculative decoding could speed up single-stream decode (low batch), where the compute is idle. The catch (from Lesson 11): at high batch, batching already uses that compute, so speculation helps less. Measure before committing; it's a great experiment for your lab.
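The two observations above (the head exists in the config, no speculative flag is set) are easy to check mechanically. This is a small sketch of that check; the server argument list is a made-up stand-in for your real launch command, and speculation_status is a hypothetical helper, not part of any serving framework.

```python
import json

def speculation_status(config, server_args):
    """Report whether an MTP head exists and whether any --speculative-* flag is set."""
    has_mtp_head = config.get("mtp_num_hidden_layers", 0) > 0
    flag_present = any(arg.startswith("--speculative") for arg in server_args)
    if not has_mtp_head:
        return "no MTP head in config"
    if flag_present:
        return "MTP head present, speculation flag set"
    return "MTP head present, speculation not enabled"

# The config key is the one quoted in this lesson; the args are hypothetical.
config = json.loads('{"mtp_num_hidden_layers": 1}')
server_args = ["--model", "Qwen3.6-27B-FP8", "--port", "8000"]
print(speculation_status(config, server_args))  # MTP head present, speculation not enabled
```

The exact flag that enables MTP speculation depends on the serving framework and its version, so check its documentation rather than relying on the prefix match used here.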

Measured: DFlash on our cluster (live experiment · 2026-06-21)

We actually ran it: vLLM + z-lab/Qwen3.6-27B-DFlash (a block-diffusion draft) serving Qwen3.6-27B-FP8 on one H100, isolated on a free GPU so the production models were never touched. Greedy, 256 tokens, vs the plain-vLLM baseline:

workload	baseline	+ DFlash	speedup
single-stream, code	83 tok/s	357 tok/s (2.8 ms/tok)	~4.3×
single-stream, prose	83 tok/s	124 tok/s (8.1 ms/tok)	~1.5×
batch 8, code	588 tok/s	583 tok/s	~break-even
batch 8, prose	588 tok/s	393–421 tok/s	0.7× (slower)

The whole lesson, validated on our own GPU: speculative decoding is a single-stream win, huge for predictable text like code (82% draft-token acceptance → ~4 bonus tokens per verify), and break-even to a loss once batched, because draft + verify eats the compute batching wanted. The draft block size is the dial: larger (15) for single-stream latency, smaller (8) for concurrency. Net: enable it for low-concurrency / latency-bound traffic; skip it for hot batches.
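The speedup column and the per-token latencies follow directly from the throughput numbers in the table. A short sketch that recomputes them, using only the measured values above (the variable and function names are ours):

```python
# (workload, baseline tok/s, DFlash tok/s low, DFlash tok/s high), copied from the table.
MEASURED = [
    ("single-stream, code", 83, 357, 357),
    ("single-stream, prose", 83, 124, 124),
    ("batch 8, code", 588, 583, 583),
    ("batch 8, prose", 588, 393, 421),
]

def speedup(baseline_tok_s, spec_tok_s):
    """Throughput ratio: above 1 means speculation helped."""
    return spec_tok_s / baseline_tok_s

def ms_per_token(tok_per_s):
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tok_per_s

for name, base, low, high in MEASURED:
    lo_x, hi_x = speedup(base, low), speedup(base, high)
    label = f"{lo_x:.2f}x" if low == high else f"{lo_x:.2f}x to {hi_x:.2f}x"
    print(f"{name:22s} {label}")
```

The ratios cross 1.0 between the single-stream and batch-8 rows, which is the point where speculation stops paying for itself on this setup.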

Read this next (primary source): Fast Inference from Transformers via Speculative Decoding, Leviathan et al. Runnable companion: day15 notebook, covering draft/verify and rejection sampling.

Check yourself (recall, don't peek)

I'm your teacher — that's Part II done. Want to try enabling MTP speculation on your cluster and measure the speedup? Just ask.

← Lesson 14 — prefix caching Next: Lesson 16 — quantization →

References

1. Fast Inference from Transformers via Speculative Decoding, Leviathan et al. (arXiv:2211.17192); day15 (speculative-decoding.ipynb).
2. Medusa (arXiv:2401.10774) · EAGLE (arXiv:2401.15077): self-speculation heads.