Inference Engineering · Lesson 6 · The Forward Pass & Sampling

The Forward Pass & Sampling

From blocks to one token — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict what a forward pass outputs and how decoding (greedy, temperature, top-k/p) turns it into the token you get — the knobs you set per request.

The setup

Embeddings (L4) and attention (L5) run all the way through the stack. Out the end comes… what, exactly — and how do we get from that to a single token?

Step 1 — the forward pass output

Step 2 — the simplest decoder

Recall — cover the screen: forward pass → token, in one line.
The blocks produce a logit (score) for every vocab token; softmax turns logits into probabilities; a decoder picks one (greedy takes the top token; sampling draws from the distribution). That picked token is the model's output for this step.

Step 3 — temperature (real numbers)

Your Qwen's top guess for "Kubernetes pods are scheduled by the" is " kube".

Step 4 — a special case

Step 5 — why top-k / top-p?

Recall — say it: your three sampling knobs and what each does.
Temperature reshapes the distribution (low = safe/peaky, high = creative/flat; 0 = greedy/deterministic); top-k keeps the k most likely tokens; top-p keeps the smallest set of top tokens whose probabilities sum to at least p. Top-k and top-p cut the unreliable tail so sampling stays sane.

On YOUR cluster (live Qwen3.6)

Real reshaping of the same logits: T=0.1 → kube ~100% · T=1.0 → 48/23/15% · T=2.0 → 29/20/16% (coin-flippy). Set temperature/top_p/top_k per request; temperature 0 = greedy = deterministic (use it for evals).

Read this next. Primary source: HF, how to generate text. Runnable: day01 notebook (logits + temperature table).

Final check — teach it back

Explain to a colleague: "A forward pass gives us a distribution; we get one token by…"
…decoding it: greedy takes the argmax (deterministic), or we sample with a temperature that reshapes the distribution and top-k/top-p that trim the tail. Temperature 0 = greedy. Those are the per-request knobs that trade determinism for creativity.


Embeddings → blocks → logits → one token. And how you steer that last step.

Today's win: you'll explain what a single forward pass actually outputs (a score for every token), and how decoding — greedy, temperature, top-k/top-p — turns that distribution into the one token you get back. These are the request knobs you tune in production.

The picture: score every option, then pick

Lessons 4–5 fed your tokens through embeddings and attention. The forward pass runs that all the way to the end and produces a confidence score for every possible next token (the logits). Decoding is the cook's policy for choosing: always take the top (greedy), or roll weighted dice (sampling) — with temperature setting how loaded the dice are.

a score for every menu item: logits → softmax → probabilities
always pick the top item: greedy (argmax) decoding
how loaded the dice are: temperature

1 · A forward pass outputs a score for every token

Run the input vectors through all the transformer blocks and a final projection, and you get logits: one raw score per vocabulary token (248,320 of them for your Qwen). A softmax turns those scores into a probability distribution. This is the distribution you met in Lesson 1; now you know where it comes from. [1]

[Figure: input vectors (L4 + L5) → 64 transformer blocks (attention + FFN) → logits (248,320 scores) → softmax → probabilities. One pass = one distribution over the next token (e.g. " kube" 39%, " scheduler" 18% …).]
The whole stack exists to turn your tokens into one number per vocabulary entry. Softmax makes it a probability; decoding picks from it.
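To make the logits-to-probabilities step concrete, here is a minimal sketch of softmax in plain Python. The four-token vocabulary and the logit values are made up for illustration; a real model would produce one logit per vocabulary entry.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max before exponentiating avoids overflow
    # and does not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary with made-up logits (illustration only).
vocab = [" kube", " scheduler", " K8s", " banana"]
logits = [2.0, 1.2, 0.8, -3.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")
```

The highest logit always gets the highest probability; softmax only rescales the scores so they are positive and sum to 1.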

2 · Greedy: just take the top

The simplest decoder picks the single highest-probability token (argmax). It's deterministic (same prompt, same output every time), which is great for reproducibility but tends to be repetitive. [2] For "Kubernetes pods are scheduled by the", greedy always returns " kube".

Pantry: always cook the single most-ordered dish. Safe, predictable, a little boring.
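A greedy decoder is a one-liner once you have the probabilities. The sketch below uses the top-three percentages quoted in this lesson as stand-in values; the helper name `greedy_pick` is hypothetical.

```python
def greedy_pick(vocab, probs):
    """Return the highest-probability token (argmax)."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best_index]

# Stand-in values: the lesson's top three candidates and their shares.
vocab = [" kube", " scheduler", " K8s"]
probs = [0.48, 0.23, 0.15]

# No randomness is involved, so repeated calls give the same token.
picks = [greedy_pick(vocab, probs) for _ in range(5)]
print(picks)
```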

3 · Temperature: reshape the dice before rolling

Instead of always taking the top, you can sample from the distribution, and temperature reshapes it first. Low temperature sharpens it toward the top token (safe); high temperature flattens it (more variety, more risk). [2]

Pantry: temperature is how loaded the dice are — near 0 they always land on the favorite; crank it up and long-shots start winning.
Same logits, three temperatures (top candidates kube / sched / K8s, renormalized):
  T = 0.1 (peaky): ≈ greedy (100% kube)
  T = 1.0 (default): 48% / 23% / 15%
  T = 2.0 (flat): 29% / 20% / 16% (coin-flippy)
Real Qwen numbers. Low temperature ≈ greedy; high temperature spreads probability into the long shots. Temperature 0 just is greedy.
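The usual way to apply temperature is to divide every logit by T before the softmax. The sketch below assumes that definition and uses the log of the lesson's T = 1.0 figures as stand-in logits. Only the three listed candidates are included, so the absolute percentages will differ from the lesson's; the ratios between candidates are what to compare.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax of logits / temperature. Requires temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy (argmax)")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in logits: log of the lesson's T = 1.0 shares (an assumption;
# the real distribution has many more candidates).
t1_shares = [0.48, 0.23, 0.15]
logits = [math.log(p) for p in t1_shares]

for t in (0.1, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Under this definition, the ratio between two candidates at temperature T is their T = 1.0 ratio raised to the power 1/T. For the top two candidates at T = 2.0 that is sqrt(0.48 / 0.23) ≈ 1.44, in line with the lesson's 29% / 20%.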

4 · Top-k / top-p: cut the unreliable tail

A 248,320-way distribution has an enormous tail of near-zero, often nonsensical tokens. At high temperature one of them occasionally wins and derails the output. Top-k keeps only the k most likely tokens; top-p (nucleus) keeps the smallest set of top tokens whose cumulative probability reaches p (e.g. 0.9). You sample only from that trusted head. [2]

[Figure: the head of the distribution is kept (top-p); the long tail is cut before sampling, so a garbage token can't sneak in.]
Top-k/top-p trim the unreliable tail, so sampling stays sane even at higher temperature. Most serving defaults combine a modest temperature with top-p.
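Both filters can be sketched as "zero out everything outside the head, then renormalize". The function names and the seven-token distribution below are made up for illustration; real implementations typically work on logits over the full vocabulary, but the idea is the same.

```python
def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

# Made-up distribution: a clear head plus a thin tail.
probs = [0.50, 0.25, 0.17, 0.03, 0.02, 0.02, 0.01]

print(top_k_filter(probs, 3))
print(top_p_filter(probs, 0.9))
```

Top-k always keeps a fixed number of tokens, while top-p adapts: a confident distribution keeps few tokens, a flat one keeps many.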

On YOUR cluster: these are request knobs (live)

Every distribution here is real Qwen3.6 output for "Kubernetes pods are scheduled by the" (same call as Lesson 1). In your API requests you set temperature, top_p, top_k per request; temperature 0 = greedy = deterministic. vLLM applies them to the logits before sampling each token.
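As a rough picture of what "applied to the logits before sampling" can look like, here is a minimal sketch that chains the three knobs and then draws one token. It is not vLLM's implementation: the function name, the order of steps (temperature, then top-k, then top-p), and the convention that `top_k=0` means "off" are all assumptions for this example.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick one token index from logits (illustrative ordering of steps)."""
    if temperature == 0:
        # Temperature 0 is treated as greedy: no randomness at all.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1) Temperature-scaled softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2) Rank tokens, then apply the top-k and top-p cutoffs.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # 3) Weighted draw from the surviving tokens.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Made-up logits for a 4-token vocabulary; index 3 is a "garbage" token.
logits = [2.0, 1.2, 0.8, -3.0]

greedy = [sample_next_token(logits, temperature=0) for _ in range(5)]
rng = random.Random(0)  # seeded so the sampled run is repeatable
sampled = [sample_next_token(logits, 1.0, top_k=3, top_p=0.9, rng=rng)
           for _ in range(20)]
print(greedy, sampled)
```

With temperature 0 every call returns the same index, which is why it suits evals. With sampling on, the tail token at index 3 is filtered out before the draw and can never be picked.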

Tip: for reproducible evals use temperature 0; for chat/creative use ~0.7 + top_p 0.9. Pull the top log-probabilities for the next token (logprobs returns log-probabilities, not raw logits): curl …/v1/completions … "max_tokens":1,"logprobs":8.
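The two presets from the tip can be written as request bodies for an OpenAI-style `/v1/completions` endpoint. This is a sketch under assumptions: `BASE_URL`, `MODEL`, the helper name, and the `max_tokens=32` value are placeholders, and `top_k` as a top-level field is something vLLM's OpenAI-compatible server accepts as an extra sampling parameter as far as I know, so check your server's documentation. The code only builds the JSON and does not send anything.

```python
import json

# Placeholder values: replace with your own server address and model name.
BASE_URL = "http://localhost:8000"
MODEL = "your-model-name"

def build_completion_request(prompt, temperature, top_p=1.0, top_k=None,
                             max_tokens=1, logprobs=None):
    """Build a JSON body for an OpenAI-style /v1/completions call."""
    body = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    if top_k is not None:
        body["top_k"] = top_k  # extra sampling field (assumed supported)
    if logprobs is not None:
        body["logprobs"] = logprobs  # number of top log-probabilities to return
    return body

prompt = "Kubernetes pods are scheduled by the"

# Eval preset: greedy, one token, top 8 log-probabilities.
eval_request = build_completion_request(prompt, temperature=0, logprobs=8)

# Chat/creative preset from the tip; max_tokens here is illustrative.
chat_request = build_completion_request(prompt, temperature=0.7, top_p=0.9,
                                        max_tokens=32)

url = BASE_URL + "/v1/completions"
print(url)
print(json.dumps(eval_request, indent=2))
```

POST either body to `url` with a `Content-Type: application/json` header using the HTTP client you prefer.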

Read this next. Primary source: How to generate text, Hugging Face (greedy, sampling, top-k, nucleus). Runnable companion: day01 notebook (logits, softmax, and a temperature table).

References
  1. The forward pass & logits — day01 (llm-inference-mechanics.ipynb).
  2. How to generate text (decoding strategies) — Hugging Face. huggingface.co