Inference Engineering · Lesson 6 · The Forward Pass & Sampling
The Forward Pass & Sampling
From blocks to one token — one guess at a time.
Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict what a forward pass outputs and how decoding
(greedy, temperature, top-k/p) turns it into the token you get — the knobs you set per request.
The setup
Embeddings (Lesson 4) and attention (Lesson 5) run all the way through the stack. What exactly comes out the end,
and how do we get from that to a single token?
Step 1 — the forward pass output
Step 2 — the simplest decoder
Recall (cover the screen): forward pass → token, in one line. The blocks produce a logit (score) for every vocab token; softmax turns logits into probabilities; a decoder picks one (greedy = the top; or sample). That picked token is the model's output for this step.
Step 3 — temperature (real numbers)
Your Qwen's top guess for "Kubernetes pods are scheduled by the" is " kube".
Step 4 — a special case
Step 5 — why top-k / top-p?
Recall (say it): your three sampling knobs and what each does. temperature reshapes the distribution (low=safe/peaky, high=creative/flat; 0=greedy/deterministic); top-k keeps the k most likely tokens; top-p keeps the smallest set of tokens whose probabilities sum to at least p. top-k/p cut the unreliable tail so sampling stays sane.
On YOUR cluster (live Qwen3.6)
Real reshaping of the same logits: T=0.1 → kube ~100% · T=1.0 → 48/23/15% ·
T=2.0 → 29/20/16% (coin-flippy). Set temperature/top_p/top_k
per request; temperature 0 = greedy = deterministic (use it for evals).
Explain to a colleague: "A forward pass gives us a distribution; we get one token by…" …decoding it: greedy takes the argmax (deterministic), or we sample with a temperature that reshapes the distribution and top-k/top-p that trim the tail. Temperature 0 = greedy. Those are the per-request knobs that trade determinism for creativity.
Embeddings → blocks → logits → one token. And how you steer that last step.
Today's win: you'll explain what a single forward pass actually outputs (a score for
every token), and how decoding — greedy, temperature, top-k/top-p — turns that distribution
into the one token you get back. These are the request knobs you tune in production.
The picture: score every option, then pick
Lessons 4–5 fed your tokens through embeddings and attention. The
forward pass runs that all the way to the end and produces a confidence score for every
possible next token (the logits). Decoding is the cook's policy for choosing:
always take the top (greedy), or roll weighted dice (sampling) — with temperature setting how
loaded the dice are.
a score for every menu item: logits → softmax → probabilities
always pick the top item: greedy (argmax) decoding
how loaded the dice are: temperature
1 · A forward pass outputs a score for every token
Run the input vectors through all the transformer blocks and a final projection, and you get
logits: one raw score per vocabulary token — 248,320 of them for your Qwen. A
softmax turns those scores into a probability distribution. This is the
distribution you met in Lesson 1; now you know where it
comes from. [1]
The whole stack exists to turn your tokens into one number per vocabulary entry. Softmax makes
it a probability; decoding picks from it.
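The logits-to-probabilities step can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical 4-token vocabulary with made-up logits (not real Qwen output); a real model does the same thing across its full vocabulary.

```python
import math


def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and logits, invented for illustration only.
vocab = [" kube", " scheduler", " control", " banana"]
logits = [4.0, 3.2, 2.8, -1.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")
```

Softmax preserves the ordering of the logits: the highest score always gets the highest probability.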
2 · Greedy: just take the top
The simplest decoder picks the single highest-probability token (argmax). It's
deterministic — same prompt, same output every time — which is great for reproducibility but
tends to be repetitive. [2] For "Kubernetes pods are scheduled by
the", greedy always returns " kube".
Pantry: always cook the single most-ordered dish. Safe, predictable, a
little boring.
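Greedy decoding is small enough to sketch directly. The vocabulary and logits below are hypothetical stand-ins, and `greedy_pick` is an illustrative helper name, not part of any library.

```python
def greedy_pick(vocab, logits):
    """Return the token with the highest score (argmax)."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best_index]


# Hypothetical vocabulary and logits, invented for illustration only.
vocab = [" kube", " scheduler", " control", " banana"]
logits = [4.0, 3.2, 2.8, -1.0]

# No randomness is involved, so repeated calls agree with each other.
picks = [greedy_pick(vocab, logits) for _ in range(3)]
print(picks)
```

Because softmax preserves ordering, taking the argmax of the logits picks the same token as taking the argmax of the probabilities, so greedy decoding can skip the softmax entirely.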
3 · Temperature: reshape the dice before rolling
Instead of always taking the top, you can sample from the distribution — and
temperature reshapes it first. Low temperature sharpens it toward the top token (safe);
high temperature flattens it (more variety, more risk). [2]
Pantry: temperature is how loaded the dice are — near 0 they always land
on the favorite; crank it up and long-shots start winning.
Real Qwen numbers. Low temperature ≈ greedy; high temperature spreads probability into the
long shots. Temperature 0 just is greedy.
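A common way to implement temperature is to divide every logit by T before the softmax. The sketch below assumes that formulation and uses made-up logits, so its percentages will not match the Qwen figures in this lesson. Treating T = 0 as a one-hot pick of the top token is written here as an explicit special case, since dividing by zero is undefined.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature 0 means greedy."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Special case: all probability mass on the top-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, invented for illustration only.
logits = [4.0, 3.2, 2.8, -1.0]
for t in (0.0, 0.1, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(f"T={t}: " + " / ".join(f"{p:.0%}" for p in probs))
```

Dividing by a small T stretches the gaps between logits, so the top token dominates; dividing by a large T shrinks the gaps, so the distribution flattens.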
4 · Top-k / top-p: cut the unreliable tail
A 248,320-way distribution has an enormous tail of near-zero, often nonsensical tokens. At
high temperature one of them occasionally wins — and derails the output. Top-k keeps only
the k most likely tokens; top-p (nucleus) keeps the smallest set whose probability sums
to p (e.g. 0.9). You sample only from that trusted head. [2]
Top-k/top-p trim the unreliable tail, so sampling stays sane even at higher temperature. Most
serving defaults combine a modest temperature with top-p.
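The two filters can be sketched over a small, already-normalized distribution. The probabilities and helper names below are hypothetical, and a production server may implement the filtering differently (for example on logits rather than probabilities); this only illustrates the idea of keeping a head and renormalizing it.

```python
import random


def renormalize(probs, keep):
    """Zero out everything outside `keep`, rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


# Hypothetical distribution, invented for illustration only.
vocab = [" kube", " scheduler", " control", " node", " banana"]
probs = [0.50, 0.25, 0.12, 0.08, 0.05]

head = top_k_filter(probs, 2)       # only the two most likely tokens survive
nucleus = top_p_filter(probs, 0.9)  # the least likely token is cut

rng = random.Random(0)  # seeded so the sketch is repeatable
token = rng.choices(vocab, weights=nucleus, k=1)[0]
print(head, nucleus, token)
```

Top-k always keeps a fixed number of tokens, while top-p adapts: a peaky distribution yields a small nucleus and a flat one yields a larger nucleus.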
On YOUR cluster: these are live request knobs
Every distribution here is real Qwen3.6 output for "Kubernetes pods are
scheduled by the" (same call as Lesson 1). In your API requests you set temperature,
top_p, top_k per request; temperature 0 = greedy = deterministic. vLLM
applies them to the logits before sampling each token.
Tip: for reproducible evals use temperature 0; for chat/creative use ~0.7 + top_p
0.9. To inspect the top-8 candidates for the next token (returned as log-probabilities, not raw logits):
curl …/v1/completions … "max_tokens":1,"logprobs":8
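The request body for that call could be assembled as below. This is a sketch under assumptions: `build_completion_payload` is a hypothetical helper, the model name is a placeholder you would replace with whatever your server reports, and no endpoint URL is included because the one above is elided. The field names (`temperature`, `top_p`, `top_k`, `max_tokens`, `logprobs`) are the ones this lesson uses.

```python
import json


def build_completion_payload(model, prompt, temperature, top_p=None, top_k=None):
    """Build a completions request body; omitted knobs use server defaults."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 1,   # one decoding step, as in the curl example
        "logprobs": 8,     # ask for the top-8 candidates for that step
        "temperature": temperature,
    }
    if top_p is not None:
        payload["top_p"] = top_p
    if top_k is not None:
        payload["top_k"] = top_k
    return payload


MODEL = "your-served-model-name"  # hypothetical placeholder, not a real model id
PROMPT = "Kubernetes pods are scheduled by the"

# Deterministic settings for evals, and the chat-style settings from the tip.
eval_payload = build_completion_payload(MODEL, PROMPT, temperature=0)
chat_payload = build_completion_payload(MODEL, PROMPT, temperature=0.7, top_p=0.9)

print(json.dumps(eval_payload, indent=2))
print(json.dumps(chat_payload, indent=2))
```

Leaving `top_p` and `top_k` out of the body when they are unset avoids guessing at the server's "disabled" values.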