Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict how attention works (Q/K/V, scores, the causal
mask, heads) — the exact operation the KV cache, GQA, and FlashAttention later optimize.
The setup
Embeddings (Lesson 4) gave each token a vector that knows only itself. But in "the pod crashed
because it ran out of memory", handling "it" requires the other tokens.
Step 1 — what each token produces
Step 2 — how relevance is scored
Recall (cover the screen): the attention recipe in one line. Each token makes a Query, a Key, and a Value. Score every earlier token by Query·Key, softmax the scores into weights, and output the weighted sum of their Values. So "it" ends up carrying mostly "pod"'s value.
Step 3 — what can a token look at?
Step 4 — what's a "head"?
Step 5 — the cost, and the fixes (real)
Recall (say it): three optimizations that all target attention. KV cache: don't recompute past Keys/Values (Lesson 10). GQA: many query heads share a few K/V heads → smaller cache. FlashAttention: compute the n×n scores in on-chip SRAM, never materializing the full matrix (Lesson 13).
On YOUR cluster (from config.json)
Qwen3.6: 24 query heads, 4 KV heads (head_dim 256) = GQA. 6 query heads share
each K/V head, so the KV cache is 6× smaller than with 24 K/V heads (you'll do the math in Lesson 10). It's also a hybrid:
cheap linear attention in most layers, full attention in every 4th.
Explain to a colleague: "Attention lets a token…" …pull information from the relevant earlier tokens: it forms a Query, matches it against every earlier token's Key (Q·K → softmax), and blends their Values. Multiple heads do this in parallel, each tracking a different relationship; a causal mask keeps it from seeing the future.
How each token reads the others — the operation everything later optimizes.
Today's win: you'll explain attention — Query/Key/Value, the relevance scores, the
causal mask, and what a "head" is — and see why this is exactly the operation the KV cache (Lesson 10),
GQA, and FlashAttention all exist to make cheaper.
The picture: which earlier items matter right now?
A token's embedding (Lesson 4) only knows itself. To understand context, the cook plating
the current bite asks: which earlier items on the order ticket matter for this one? Each
token poses a Query ("what am I looking for?"), advertises a Key ("here's what I offer"),
and carries a Value ("here's my content"). Match query to keys → mix the matching values.
the question this item asks → Query (Q)
the label each item advertises → Key (K)
the content each item carries → Value (V)
one specialist doing this lookup → an attention head (many run in parallel)
1 · Why attention: a token needs the others
Take "the pod crashed because it ran out of memory". To handle "it", the model must
look back and find that "it" = "pod". Attention is the mechanism that lets every token pull
in information from the relevant earlier tokens. [1]
Resolving "it" = "pod" is attention at work — every token builds its meaning from the relevant
earlier ones.
2 · Query · Key · Value — the lookup
From each token's vector, three learned projections produce its Q, K,
and V. The relevance of one token to another is Query · Key (a dot product, in practice
divided by the square root of the key dimension to keep the scores in a stable range);
those scores go through softmax to become weights; the output is the weighted sum of the
Values. [1]
Pantry: the cook holds up the current item's question (Q) against every
earlier item's label (K), sees which match best, and blends those items' contents (V) into the answer.
Score every earlier token by Q·K, softmax into weights, blend their Values. "it" ends up
carrying mostly "pod"'s content. That's one attention computation.
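To make the lookup concrete, here is a minimal sketch of one attention computation in plain Python. The 2-dimensional Key, Value, and Query vectors are made up for illustration (a real model learns them), the function names `softmax` and `attend` are hypothetical, and the division by the square root of the key dimension is the standard scaled dot-product convention rather than something this lesson spells out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One query scores every key, then blends the matching values."""
    scale = math.sqrt(len(query))          # assumed: standard 1/sqrt(d_k) scaling
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors, invented for this example only.
tokens = ["the", "pod", "crashed", "it"]
keys   = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.5], [0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.2]]
query_it = [2.0, 1.0]                      # the Query produced by "it"

weights, output = attend(query_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:8s} {weight:.3f}")
```

With these invented numbers the Query of "it" lines up with the Key of "pod", so "pod" receives most of the weight and the output is dominated by its Value. The list also includes the Key of "it" itself, as standard causal attention does.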
3 · Two essentials: the causal mask, and heads
Two things make it work for generation. Causal mask: a token may only attend to
itself and the tokens before it, so it can't peek at the future it's trying to predict. Multi-head:
the model runs several attentions in parallel, each a head with its own
Q/K/V projections, learning a different kind of relationship (one tracks what-refers-to-what, another
tracks code syntax, …). Their outputs are concatenated and mixed by one more learned projection. [1]
Pantry: several specialist cooks each scan the ticket for a different
pattern (one for allergies, one for timing, one for sauces) — then combine notes. Each specialist is a
head.
Heads = parallel specialists, each its own Q/K/V. The causal triangle keeps each token from
seeing the future — essential, since generation predicts that future.
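The causal triangle can be sketched in a few lines of plain Python. The score matrix below is invented, and the helper names `causal_mask` and `masked_softmax` are hypothetical; the point is only that blocked positions are set to negative infinity before the softmax, so they end up with exactly zero weight. In a multi-head model each head would apply the same mask to its own scores.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when token i may look at token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(row, allowed):
    """Softmax over one row of scores, ignoring disallowed positions."""
    masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
    top = max(masked)                       # position i itself is always allowed
    exps = [math.exp(s - top) for s in masked]   # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw Q·K scores for a 3-token sequence (row = the token doing the looking).
scores = [
    [0.5, 1.0, 0.2],
    [0.1, 0.3, 2.0],
    [1.5, 0.4, 0.9],
]

mask = causal_mask(len(scores))
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]

for row in weights:
    print([round(w, 3) for w in row])       # upper-right triangle is all zeros
```

Note that the second token's largest raw score points at the third token, but the mask removes it, so all of its weight is spread over positions it is allowed to see.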
4 · The cost — and what comes next
Every token attends to every earlier token, so attention is O(n²) in sequence
length: doubling the context roughly quadruples the score computation, which makes it the expensive part of long contexts. Three optimizations flow directly from this lesson, and each
comes back in a later lesson: [2]
KV cache — the Keys and Values of past tokens don't change, so cache them
instead of recomputing every step (Lesson 10).
GQA — let many query heads share a few K/V heads, shrinking that cache.
FlashAttention — compute the n×n scores in fast on-chip SRAM, never writing the
full matrix to memory (Lesson 13).
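A rough counting sketch shows where the quadratic cost comes from and what the KV cache saves. It is simplified on purpose: it counts only how many Q·K scores and how many Key/Value projections get computed, assumes generation proceeds one token per step from an empty context, and uses hypothetical function names.

```python
def causal_score_count(n):
    """Q·K scores for n tokens when each token scores itself and all earlier ones."""
    return n * (n + 1) // 2

def kv_projections_without_cache(n_steps):
    """Each step recomputes K and V for every token seen so far."""
    return sum(step for step in range(1, n_steps + 1))

def kv_projections_with_cache(n_steps):
    """Each step computes K and V only for the newest token; the rest are reused."""
    return n_steps

for n in (1_000, 2_000, 4_000):
    print(
        f"n={n:>5}  scores={causal_score_count(n):>9}  "
        f"K/V no cache={kv_projections_without_cache(n):>9}  "
        f"K/V cached={kv_projections_with_cache(n):>5}"
    )
```

Doubling n roughly quadruples the score count, which is the O(n²) growth. The cache turns the K/V projection work from quadratic into linear, at the price of keeping every past Key and Value in memory, and that memory is what GQA then shrinks.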
On YOUR cluster: GQA, set in the config (from config.json)
Qwen3.6 has 24 query heads but only 4 key/value heads
(head_dim 256). That's Grouped-Query Attention: 6 query heads share each K/V head, so the
KV cache is 6× smaller than full multi-head (24 K/V heads) would be.
This single config choice is a large part of why long contexts fit in memory; you'll do the memory math in Lesson 10.
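As a preview of that math, the 6× figure can be checked with a small calculation. The head counts and head_dim come from the config quoted above; the 2 bytes per stored number (a 16-bit format) and the function name are assumptions for illustration, and the figure is per token for a single full-attention layer.

```python
def kv_cache_bytes_per_token_per_layer(n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache one token's Keys and Values in one attention layer.

    bytes_per_value=2 assumes a 16-bit number format (an assumption, not from the config).
    """
    keys_and_values = 2                    # one Key vector and one Value vector per head
    return keys_and_values * n_kv_heads * head_dim * bytes_per_value

head_dim = 256
gqa_bytes = kv_cache_bytes_per_token_per_layer(n_kv_heads=4, head_dim=head_dim)
mha_bytes = kv_cache_bytes_per_token_per_layer(n_kv_heads=24, head_dim=head_dim)

print(f"GQA (4 K/V heads):  {gqa_bytes} bytes per token per layer")
print(f"MHA (24 K/V heads): {mha_bytes} bytes per token per layer")
print(f"ratio: {mha_bytes // gqa_bytes}x")
```

The ratio depends only on the head counts (24 / 4), so it stays 6× whatever number format is used.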
Advanced note: Qwen3.6 is a hybrid. Most layers use a cheap linear attention,
with full attention only in every 4th layer, a frontier trick to dodge the O(n²) wall. The foundation
here (full attention) is what those variants optimize.