Several tokens per pass — one guess at a time.
Decode is memory-bound: per token the GPU streams the whole model from memory but does little math — the compute sits mostly idle. That idle compute is an opportunity.
Optimistic / speculative execution: the draft is a cheap guess (like an optimistic lock or admission webhook); the target validates it, commits the correct part, and rolls back the wrong tail. Bet on the cheap path, verify before you trust it.
Qwen3.6's config has mtp_num_hidden_layers: 1, a built-in MTP (multi-token-prediction) self-speculation head. But the server has no --speculative-* flag, so it is not enabled. Turning it on could speed up single-stream decode; at high batch the gain shrinks (batching already uses the compute). Measure it.
Spend the idle compute: guess several tokens, verify them in one pass.
Decode is memory-bound — the GPU's compute mostly sits idle while it streams weights for one token. Speculative decoding fills that idle time: a fast junior cook (a small draft model) guesses the next few steps; the head chef (your real model) checks all the guesses in one glance and keeps the correct run. Several tokens for the price of one pass.
| Kitchen analogy | Speculative decoding |
|---|---|
| the fast junior cook's guesses | draft model proposes K tokens |
| the chef checking them all at once | target model verifies in one forward pass |
| keep the correct run, redo the rest | accept the longest correct prefix |
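The propose / verify / accept cycle can be sketched for the greedy case as follows. This is a minimal toy, not any engine's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and the toy calls the target once per position where a real server would score all K positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int
                       ) -> Tuple[List[Token], int]:
    """Return (tokens, number_of_verify_passes) for greedy speculative decoding."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < max_new:
        # 1. The draft proposes k tokens autoregressively (the cheap guesses).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target says what it would emit at each of the k+1 positions.
        #    (One batched pass in a real engine; k+1 toy calls here.)
        passes += 1
        expected = [target(seq + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == expected[n_ok]:
            n_ok += 1
        # 4. Always keep the target's own token at the first mismatch (or the
        #    extra token after a full match), so each pass makes progress.
        seq.extend(proposal[:n_ok] + [expected[n_ok]])
    return seq[:len(prompt) + max_new], passes


def plain_greedy(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq


# Hypothetical toy models: the target counts up modulo 10; the draft agrees
# everywhere except right after a 6, where it guesses wrong.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


out, passes = speculative_greedy(toy_target, toy_draft, [0], max_new=20, k=4)
baseline = plain_greedy(toy_target, [0], max_new=20)
print("same output as plain decoding:", out == baseline)
print("verify passes:", passes, "vs plain passes:", 20)
```

The output matches plain greedy decoding token for token; the only thing that changes is how many target passes it takes to get there.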
At small batch, decode's arithmetic intensity is tiny: the tensor cores are nearly idle while memory bandwidth is the bottleneck (Lesson 11). Checking several candidate tokens in one pass costs almost the same as checking one, because the pass is bandwidth-bound and the compute was sitting idle anyway. That spare compute is the opportunity [1].
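A back-of-envelope roofline makes the "spare compute" claim concrete. The sketch below assumes a 27B-parameter model stored at 1 byte per parameter (FP8) and uses nominal H100 SXM figures (about 3350 GB/s of memory bandwidth, about 1979 TFLOP/s dense FP8). Those hardware numbers are assumptions, not measurements, and the model ignores KV-cache and activation traffic.

```python
def decode_ceiling_tok_s(weight_gb: float, mem_bw_gb_s: float) -> float:
    """Upper bound on single-stream tokens/s if each step streams all weights once."""
    return mem_bw_gb_s / weight_gb


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs per weight byte read: ~2 FLOPs (multiply + add) per parameter per token."""
    return 2.0 * tokens_per_pass / bytes_per_param


# Assumed inputs (nominal, not measured here):
WEIGHT_GB = 27.0        # 27B parameters x 1 byte each (FP8)
MEM_BW_GB_S = 3350.0    # assumed H100 SXM memory bandwidth
PEAK_TFLOPS = 1979.0    # assumed H100 SXM dense FP8 peak

ceiling = decode_ceiling_tok_s(WEIGHT_GB, MEM_BW_GB_S)
# Intensity (FLOP/byte) at which compute, not bandwidth, becomes the limit.
ridge = PEAK_TFLOPS * 1e12 / (MEM_BW_GB_S * 1e9)

print(f"bandwidth ceiling: {ceiling:.0f} tok/s")
for n in (1, 4, 16):
    ai = arithmetic_intensity(n, bytes_per_param=1.0)
    print(f"{n:>2} tokens per pass: {ai:.0f} FLOP/byte (ridge ~{ridge:.0f})")
```

Under these assumptions, even 16 tokens per pass sits far below the ridge point, which is the sense in which the extra verification work is close to free.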
The loop: a small, cheap draft model quickly proposes the next K tokens. Your real target model then runs one forward pass over all K positions and checks each one. You accept the longest correct prefix, discard the rest, and take the target's own token at the first mismatch, so every pass yields at least one token. Crucially, a rejection-sampling rule makes the output distribution identical to normal decoding from the target: it's lossless, not an approximation [1].
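The lossless claim can be checked exactly for a single token. The sketch below assumes the standard rule from the speculative-sampling literature (taken here to be the rule the text refers to): the draft samples x from q, the target accepts it with probability min(1, p(x)/q(x)), and on rejection it resamples from the normalized residual max(0, p - q). The distributions `p` and `q` are made-up examples.

```python
from fractions import Fraction
from typing import List


def output_distribution(p: List[Fraction], q: List[Fraction]) -> List[Fraction]:
    """Exact distribution of the emitted token under accept-or-resample.

    p: target distribution, q: draft distribution (same vocabulary).
    """
    vocab = range(len(p))
    # P(draft proposes x AND it is accepted) = q[x] * min(1, p[x]/q[x]) = min(p[x], q[x]).
    accept_mass = [min(p[x], q[x]) for x in vocab]
    reject_prob = 1 - sum(accept_mass)
    # On rejection, resample from the part of p that q under-covered.
    residual = [max(Fraction(0), p[x] - q[x]) for x in vocab]
    z = sum(residual)
    out = []
    for x in vocab:
        resampled = reject_prob * residual[x] / z if z else Fraction(0)
        out.append(accept_mass[x] + resampled)
    return out


# Made-up example distributions over a 3-token vocabulary.
p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]   # target
q = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]   # draft (a poor match)

result = output_distribution(p, q)
acceptance = sum(min(a, b) for a, b in zip(p, q))

print("emitted distribution equals target:", result == p)
print("acceptance probability:", acceptance)
```

Even with a badly matched draft, the emitted distribution equals `p` exactly; a poor draft only lowers the acceptance probability, which costs speed rather than quality.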
This is speculative / optimistic execution. The draft is a fast, cheap guess (like an admission webhook or an optimistic lock); the target is the authoritative check that validates then commits, rolling back the wrong tail. You bet that the cheap path is usually right, and verify before you trust it — same pattern as optimistic concurrency control.
If the draft agrees with the target often (high acceptance rate), you accept many tokens per pass, for a 2–3× decode speedup. If it's a poor match, you accept few and the verification pass is nearly wasted. So drafts must be cheap and aligned. The flavors [2]:
From Qwen3.6's config.json: mtp_num_hidden_layers: 1. That is a built-in Multi-Token Prediction (MTP) head, i.e. a self-speculation head that proposes the next token cheaply. But your server args have no --speculative-* flag, so it isn't switched on yet.
The lever: enabling MTP speculative decoding could speed up single-stream decode (low batch), where the compute is idle. The catch (from Lesson 11): at high batch, batching already uses that compute, so speculation helps less. Measure before committing. A great experiment for your lab.
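The same check can be scripted. The sketch below is a hypothetical helper, not part of any serving stack: it looks for `mtp_num_hidden_layers` anywhere in a parsed config (the key may be nested in some configs, which is an assumption) and reports whether any server argument starts with `--speculative`. The example config and argument list are made-up stand-ins.

```python
import json
from typing import Any, Dict, List, Optional


def find_key(obj: Any, key: str) -> Optional[Any]:
    """Depth-first search for `key` in nested dicts/lists; first hit or None."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def mtp_status(config: Dict[str, Any], server_args: List[str]) -> Dict[str, Any]:
    """Report whether the model ships an MTP head and whether speculation is requested."""
    layers = find_key(config, "mtp_num_hidden_layers") or 0
    flag_present = any(arg.startswith("--speculative") for arg in server_args)
    return {"mtp_layers": layers, "speculation_flag_present": flag_present}


# Hypothetical stand-ins for a real config.json and a real launch command.
example_config = json.loads('{"model_type": "example", "mtp_num_hidden_layers": 1}')
example_args = ["--model", "example-model", "--port", "8000"]

status = mtp_status(example_config, example_args)
print(status)
```

A result with `mtp_layers` above zero and no speculation flag is the situation described above: the head exists in the weights but the server is not using it.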
We actually ran it: vLLM + z-lab/Qwen3.6-27B-DFlash (a block-diffusion draft) serving Qwen3.6-27B-FP8 on one H100, isolated on a free GPU so the production models were never touched. Greedy, 256 tokens, vs the plain-vLLM baseline:
| workload | baseline | + DFlash | speedup |
|---|---|---|---|
| single-stream, code | 83 tok/s | 357 tok/s (2.8 ms/tok) | ~4.3× |
| single-stream, prose | 83 tok/s | 124 tok/s (8.1 ms/tok) | ~1.5× |
| batch 8, code | 588 tok/s | 583 tok/s | ~break-even |
| batch 8, prose | 588 tok/s | 393–421 tok/s | 0.7× (slower) |
The whole lesson, validated on our own GPU: speculative decoding is a single-stream win, huge for predictable text like code (82% draft-token acceptance → ~4 bonus tokens per verify) and modest for prose. At batch 8 it ranges from break-even (code) to a loss (prose), because draft+verify eats the compute batching wanted. The draft block size is the dial: larger (15) for single-stream latency, smaller (8) for concurrency. Net: enable it for low-concurrency / latency-bound traffic; skip it for hot batches.
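The link between acceptance rate and bonus tokens per verify can be sanity-checked with a simple model. The sketch assumes each draft token is accepted independently with the same probability and that acceptance stops at the first rejection. That independence is an assumption, and the lab's 82% figure may be defined differently, so treat the output as a rough consistency check rather than a prediction.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected draft tokens accepted per verify.

    Token i is kept only if tokens 1..i were all accepted, which has
    probability alpha**i under the independence assumption.
    """
    return sum(alpha ** i for i in range(1, k + 1))


def tokens_per_pass(alpha: float, k: int) -> float:
    """Accepted draft tokens plus the one token the target always contributes."""
    return 1.0 + expected_accepted(alpha, k)


ALPHA = 0.82  # draft-token acceptance reported for code in the lab run

for block in (8, 15):  # the two block sizes mentioned above
    bonus = expected_accepted(ALPHA, block)
    print(f"block {block:>2}: ~{bonus:.1f} bonus tokens, "
          f"~{tokens_per_pass(ALPHA, block):.1f} tokens per verify")
```

With 82% acceptance and a block of 15, this model gives roughly four bonus tokens per verify, in line with the figure quoted for code. It also shows diminishing returns from larger blocks, since long runs of correct guesses become unlikely.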