The ubiquitous language for this course. Every lesson uses these
terms exactly as defined here. Grows as we go.
The whole course in one picture: arithmetic intensity places every
workload on this spectrum. Most terms below are about moving along it.
Token
The atomic unit of model input and output: a piece of text (usually a
subword) paired with an integer ID — its index in the
model's fixed vocabulary. The model only ever sees these integers, never
your letters. A token averages ≈ ¾ of an English word (~4 characters). See
Lesson 3.
Tokenizer
The component that converts text ↔ token IDs (encode and decode), using a learned
vocabulary plus BPE merge rules. Fixed per model and
loaded by the server (vLLM exposes it at /tokenize).
Vocabulary
The fixed set of all tokens a model knows (your Qwen3.6 = 248,320). A token's position in this set
is its integer ID.
Subword
A token that is a fragment of a word. Frequent words are a single subword; rare words split into
several, so any string is representable and nothing is "out-of-vocabulary" (the tokenizer can fall
back to raw bytes).
Byte-Pair Encoding (BPE)
The algorithm that builds the vocabulary: starting from raw bytes, repeatedly merge the
most frequent adjacent pair into a new token, saving each rule. Byte-level BPE
operates on UTF-8 bytes, so a leading space becomes part of the next piece (shown as Ġ) and
any character is always encodable.
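A minimal sketch of the merge loop described above, assuming a toy training string and greedy left-to-right replacement. The name `train_bpe` is hypothetical, and real tokenizers add pre-tokenization and tie-breaking rules that are omitted here.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`."""
    # Start with one token per raw byte
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would save nothing
        merges.append((left, right))  # save the rule
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # the pair becomes one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("aaabdaaabac", 3)
print(merges)
print(tokens)
```

Joining the final tokens always reproduces the original bytes, which is why any input stays encodable.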
Special tokens
Reserved tokens that aren't ordinary text — end-of-sequence (EOS), and chat-role markers like
<|im_start|> / <|im_end|>. Added around your content before
prefill.
Chat template
The model-specific pattern that wraps your messages in special tokens
(system / user / assistant roles) before tokenization. It is what actually gets prefilled — fixed
overhead on every turn (Qwen3.6 even auto-opens a <think> block). See
Lesson 3.
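A rough sketch of what a chat template does, assuming a ChatML-style layout built from the `<|im_start|>` / `<|im_end|>` markers mentioned above. The function `render_chat` and the exact whitespace are illustrative assumptions; the real template ships with the model and may differ.

```python
def render_chat(messages, add_generation_prompt=True):
    """Wrap each message in ChatML-style role markers (illustrative layout)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

Every marker in the rendered string is tokenized and prefilled along with your content, which is the fixed per-turn overhead. In practice you would not hand-write this; Hugging Face tokenizers, for example, expose `apply_chat_template` for the model's own template.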
Context window
The maximum number of tokens (prompt + generation, all held in the KV cache)
a model can attend to at once — your Qwen3.6 serves 131,072. Exceeding it forces
truncation or eviction.
Prefill
The first phase of inference: the model processes the entire input prompt in
one forward pass to produce the first output token (and to populate the KV cache).
Work scales with prompt length; compute-bound.
Decode
The second phase: the model generates output tokens one at a time, each step
consuming the previously generated token. Each step does little arithmetic but
must re-read all weights + the growing KV cache; memory-bandwidth-bound
at typical batch sizes.
KV cache
The stored key and value vectors for every past token, kept so attention
doesn't recompute them each decode step. Grows linearly with sequence length and
batch size; after the weights, usually the largest consumer of GPU memory during decode.
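The linear growth can be made concrete with a small calculator. This is a sketch that assumes a Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128) and 2-byte values; the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes of KV cache for `batch` sequences of `seq_len` tokens each."""
    # The leading 2 counts one Key vector plus one Value vector
    # per token, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch

# Assumed Llama-2-7B-like shape, FP16/BF16 cache
per_token = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1)
full_batch = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=8)

print(per_token / 2**20, "MiB per token")
print(full_batch / 2**30, "GiB for 8 sequences of 4096 tokens")
```

Doubling either the sequence length or the batch size doubles the result.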
Memory-bandwidth-bound
A workload limited by how fast data moves from GPU memory (HBM), not by how
fast the GPU can compute. Speeding up the math doesn't help; you must move less
data or move it faster. Decode lives here.
Compute-bound
A workload limited by the GPU's arithmetic throughput (FLOPs / tensor cores),
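not by memory traffic. Prefill lives here. Decode also crosses into this regime
once the batch is large enough (roughly batch 32+ as a rule of thumb; the exact
threshold depends on the GPU's ridge point, precision, and sequence length).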
Arithmetic intensity
FLOPs performed per byte read from memory. High intensity → compute-bound;
low intensity → memory-bound. Prefill has high intensity (~200–400 ops/byte);
decode has very low intensity. This single number explains the phase asymmetry.
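A back-of-the-envelope sketch of why the two phases differ, assuming a single linear layer in 2-byte precision. Only the weight matrix and the input/output activations are counted as memory traffic; attention and the KV cache are ignored. The function name and the 4096-wide layer are illustrative assumptions.

```python
def linear_layer_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for multiplying `tokens` activation rows by one weight matrix."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight, per token
    bytes_moved = bytes_per_value * (
        d_in * d_out        # read the weights once
        + tokens * d_in     # read the input activations
        + tokens * d_out    # write the output activations
    )
    return flops / bytes_moved

decode_like = linear_layer_intensity(tokens=1, d_in=4096, d_out=4096)
prefill_like = linear_layer_intensity(tokens=256, d_in=4096, d_out=4096)
print(round(decode_like, 2), round(prefill_like, 1))
```

With one token per step the weights are read for almost no arithmetic; with hundreds of prompt tokens the same read is amortized across all of them.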
Roofline
A plot of achievable throughput vs arithmetic
intensity: a rising bandwidth-limited slope that flattens into a compute-limited
roof. Tells you, for any workload, whether speeding up compute or memory will help.
See Lesson 11.
Ridge point
Where the roofline's slope meets its roof: ridge = peak FLOPs ÷ memory
bandwidth (FLOP/byte). Intensity below it → memory-bound; above → compute-bound.
H100 NVL ≈ 214 FLOP/byte (BF16 dense).
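The ridge formula and the roofline itself fit in a few lines. This sketch assumes a BF16 dense peak of 835 TFLOP/s and 3.9 TB/s of memory bandwidth, picked to be consistent with the ≈ 214 FLOP/byte figure above; check your GPU's datasheet before relying on them. All function names are hypothetical.

```python
def ridge_point(peak_flops, mem_bandwidth):
    """FLOP/byte at which the bandwidth slope meets the compute roof."""
    return peak_flops / mem_bandwidth

def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * mem_bandwidth)

def regime(intensity, peak_flops, mem_bandwidth):
    ridge = ridge_point(peak_flops, mem_bandwidth)
    return "compute-bound" if intensity > ridge else "memory-bound"

PEAK_FLOPS = 835e12      # assumed BF16 dense peak, FLOP/s
MEM_BANDWIDTH = 3.9e12   # assumed HBM bandwidth, bytes/s

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
print(round(ridge, 1))
print(regime(1, PEAK_FLOPS, MEM_BANDWIDTH))
print(regime(400, PEAK_FLOPS, MEM_BANDWIDTH))
```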
Tensor parallelism (TP)
Split each layer's weight matrices across N GPUs so one model runs on several GPUs at
once. Needs all-reduces to recombine partial results in every layer.
Use when a model won't fit on one GPU, or for lower single-stream latency over
NVLink. See Lesson 19.
All-reduce
A collective that sums a tensor across all GPUs and returns the result to each. TP does
two per transformer layer (after attention, after the MLP) — in decode that's per token,
on the critical path, so it lands directly in token latency.
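A toy, single-process sketch of the semantics, with plain lists standing in for per-GPU tensors. Real collectives run over NVLink through a communication library; the function names here are hypothetical, and the count assumes the two-per-layer pattern described above.

```python
def all_reduce_sum(partials):
    """Give every rank the element-wise sum of all ranks' partial tensors."""
    total = [sum(values) for values in zip(*partials)]
    return [list(total) for _ in partials]  # each rank receives its own copy

def all_reduces_per_token(num_layers, per_layer=2):
    """Collectives on the critical path of one decode step under TP."""
    return num_layers * per_layer

# Two ranks, each holding a partial result for the same 2-element output
print(all_reduce_sum([[1, 2], [3, 4]]))
print(all_reduces_per_token(num_layers=32))
```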
NVLink
NVIDIA's high-bandwidth GPU-to-GPU interconnect (H100 ≈ 900 GB/s) — ~7× faster than PCIe
(≈ 128 GB/s). Decisive for TP, whose all-reduce rides it. In
the analogy: the express lane between prep stations.
Latency vs. throughput
Latency = time for one request (e.g. time-to-first-token, then
per-token). Throughput = tokens/sec across all requests. Batching trades
latency for throughput; the two phases sit on opposite sides of this trade. The
knee is where throughput saturates but latency keeps climbing — run just
below it.
TTFT (time to first token)
Request arrival → first output token, dominated by prefill. The
first SLO users feel; protected by chunked prefill + prefix caching.
TPOT / ITL (time per output token)
Average gap between output tokens once generation starts, dominated by
decode. The streaming-smoothness SLO; per-request throughput ≈ 1 ÷ TPOT.
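Both latency metrics can be computed from per-token timestamps. A minimal sketch, assuming you have the request arrival time and the emission time of each output token in seconds; the function name is hypothetical.

```python
def ttft_and_tpot(arrival, token_times):
    """Return (TTFT, TPOT) in seconds; TPOT is None with fewer than two tokens."""
    ttft = token_times[0] - arrival
    if len(token_times) < 2:
        return ttft, None
    # Average gap between consecutive output tokens after the first
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

ttft, tpot = ttft_and_tpot(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(ttft, tpot, 1 / tpot)  # the last value is per-request tokens/sec
```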
Goodput
The throughput that meets both the TTFT and TPOT SLOs. Past the knee, raw
throughput can stay flat while goodput collapses — so optimize for goodput, not throughput.
See Lesson 24.
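One way to measure goodput, sketched under the assumption that it is counted in output tokens per second from requests that met both SLOs. The record fields, the SLO values, and the function names are illustrative.

```python
def throughput(requests, window_s):
    """Raw output tokens per second over the measurement window."""
    return sum(r["tokens"] for r in requests) / window_s

def goodput(requests, ttft_slo, tpot_slo, window_s):
    """Output tokens per second from requests that met both SLOs."""
    good = sum(
        r["tokens"] for r in requests
        if r["ttft"] <= ttft_slo and r["tpot"] <= tpot_slo
    )
    return good / window_s

requests = [
    {"tokens": 100, "ttft": 0.2, "tpot": 0.03},  # meets both SLOs
    {"tokens": 100, "ttft": 1.5, "tpot": 0.03},  # misses the TTFT SLO
    {"tokens": 200, "ttft": 0.3, "tpot": 0.08},  # misses the TPOT SLO
]
raw = throughput(requests, window_s=10)
good = goodput(requests, ttft_slo=0.5, tpot_slo=0.05, window_s=10)
print(raw, good)
```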
Little's Law
in-flight requests = throughput × latency. Turns a latency target + the
concurrency cap (max-num-seqs) into max QPS and replica count — the bridge from
the knee to autoscaling.
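The capacity arithmetic, as a sketch. It assumes every in-flight slot is occupied and that end-to-end latency is the target you chose; the numbers and function names are made up for illustration.

```python
import math

def max_qps(max_num_seqs, latency_s):
    """Little's Law rearranged: throughput = in-flight requests / latency."""
    return max_num_seqs / latency_s

def replicas_needed(target_qps, max_num_seqs, latency_s):
    """Round up, since a fraction of a replica cannot be deployed."""
    return math.ceil(target_qps / max_qps(max_num_seqs, latency_s))

per_replica = max_qps(max_num_seqs=256, latency_s=8.0)
replicas = replicas_needed(target_qps=100, max_num_seqs=256, latency_s=8.0)
print(per_replica, replicas)
```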
HBM (high-bandwidth memory)
The GPU's main memory, where model weights and the KV cache live. Big but
"far" — reaching it is the cost that makes decode
memory-bound. In the analogy: the pantry across town.
Attention head
One of several parallel "tasters" inside attention. Per token, each head emits a Query
(what it's looking for), a Key (how it labels itself), and a Value (what it
carries) — each a vector of head_dim numbers. The
KV cache stores the K and V of each KV head.
head_dim
The length of one head's Key (or Value) vector — how many numbers describe a token from that
head's angle (128 in Llama 2 7B). In the formula, kv_heads × head_dim is the width of
one token's cache entry per layer (the index card's length).
GQA (grouped-query attention)
An attention variant where several query heads share one key/value
head, so the model stores fewer KV heads. Shrinks the KV cache
by the group factor (e.g. Llama 3 8B: 32 query ÷ 8 KV heads = 4× smaller).
The middle ground between full multi-head attention (MHA) and single-KV MQA.
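A sketch of the head sharing, assuming query heads are grouped contiguously (heads 0 to 3 share KV head 0, and so on), which is a common layout but not guaranteed for every implementation. The function name is hypothetical.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_q_heads // num_kv_heads  # also the KV cache shrink factor
    return q_head // group_size

# Llama-3-8B-like shape: 32 query heads, 8 KV heads
mapping = [kv_head_for_query_head(q, 32, 8) for q in range(32)]
print(mapping[:8], mapping[-1])
```

Setting `num_kv_heads` equal to `num_q_heads` gives MHA, and setting it to 1 gives MQA.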
Quantization
Store weights, activations, and/or the KV cache in fewer bits
(FP16 → FP8 → INT4), rounding each number onto a coarser grid. Shrinks memory and the
bytes moved per token (faster decode), at a small accuracy cost — the
risk lives in a few outlier values. See Lesson 16.
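A small sketch of why outliers carry the risk. It uses symmetric absmax quantization onto a signed integer grid, chosen here for simplicity; production schemes typically use per-channel or per-group scales. All names and values are illustrative.

```python
def quantize(values, bits=8):
    """Symmetric absmax quantization onto a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

def max_error(values, bits=8):
    """Largest round-trip error after quantizing and dequantizing."""
    q, scale = quantize(values, bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale)))

small = [0.1, -0.2, 0.3]
with_outlier = small + [10.0]  # one large value stretches the grid for all

print(max_error(small), max_error(with_outlier))
print(max_error(small, bits=4))
```

The outlier sets the scale, so the ordinary values are rounded onto a much coarser grid.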
FP8 (E4M3 / E5M2)
8-bit floating point. E4M3 (4 exponent, 3 mantissa bits) has higher precision and a smaller
range, and is used for weights/activations/KV; E5M2 (5 exponent, 2 mantissa bits; wider range)
is used for gradients. ~Lossless for inference on H100-class GPUs, which also run FP8 at ~2×
their BF16 tensor-core throughput.
Static batching
Group N requests, run them together, and wait for all to finish before
starting the next batch. Because requests vary in length, finished slots idle until
the slowest ends — ~60% GPU idle on real traffic. The thing continuous batching fixes.
Continuous batching (iteration-level scheduling)
Schedule one decode step at a time: after each step, evict finished requests and
admit waiting ones, so the running batch never drains. Introduced by Orca; keeps the
memory-bound decode phase fed. See Lesson 12.
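A toy simulation comparing the two schedulers by counting decode steps. It assumes each request is just an output length, each running request emits one token per step, and prefill is free; real schedulers also account for memory and prefill. Function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """After every step, drop finished requests and admit waiting ones."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())      # admit into a free slot
        running = [left - 1 for left in running]   # one decode step for everyone
        running = [left for left in running if left > 0]  # evict finished
        steps += 1
    return steps

lengths = [8, 2, 2, 2, 2]  # one long request and four short ones
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
print(static_steps, continuous_steps)
```

In the static run the short request's slot idles while the long one finishes; in the continuous run that slot is refilled immediately.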
PagedAttention
Store the KV cache in small fixed-size blocks
allocated on demand and non-contiguously — virtual memory / paging for the GPU.
Nearly eliminates the 60–80% fragmentation waste of contiguous per-request reservation, so
far more sequences fit. vLLM's core trick. See Lesson 12.
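A simplified sketch of on-demand block allocation, assuming a block size of 16 tokens and a flat free list. The class `BlockAllocator` is hypothetical and is not vLLM's actual implementation; it only illustrates the block-table idea.

```python
class BlockAllocator:
    """Maps each sequence to a list of physical KV blocks, allocated lazily."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:    # first token, or last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = length + 1

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=4)
for _ in range(17):
    alloc.append_token("a")  # 17 tokens need two 16-token blocks
alloc.append_token("b")
print(alloc.tables, len(alloc.free))
```

A sequence only ever wastes the unused tail of its last block, rather than a whole reservation sized for the maximum length.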
Prefix caching
Reuse PagedAttention blocks across requests that share a prompt prefix (e.g. a
common system prompt) instead of recomputing their KV. A big win when many requests
repeat the same leading context, as in RAG or multi-turn chat.
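One common way to detect shared prefixes is to hash each full block together with the hash of everything before it, so equal hashes imply equal prefixes. The sketch below assumes that scheme with 16-token blocks and SHA-256; the function names are hypothetical and real servers differ in detail.

```python
import hashlib

def block_hashes(token_ids, block_size=16):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, previous = [], b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(previous + repr(block).encode()).digest()
        hashes.append(digest)
        previous = digest
    return hashes

def shared_prefix_blocks(a, b, block_size=16):
    """Number of leading blocks whose KV could be reused between two prompts."""
    shared = 0
    for hash_a, hash_b in zip(block_hashes(a, block_size), block_hashes(b, block_size)):
        if hash_a != hash_b:
            break
        shared += 1
    return shared

system_prompt = list(range(32))            # stand-in token IDs
request_1 = system_prompt + [100, 101, 102]
request_2 = system_prompt + [200, 201]
request_3 = [7] + system_prompt[1:] + [100]
print(shared_prefix_blocks(request_1, request_2))
print(shared_prefix_blocks(request_1, request_3))
```

A change in the very first token invalidates every later block, which is why stable content belongs at the front of the prompt.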
Prompt caching
The productized, billed form of prefix caching exposed by LLM
APIs: the provider stores a stable prefix's prefill KV under a TTL so
repeat requests skip re-prefilling it. A cache read (hit) costs ~0.1× the input price; a
cache write (cold, or after expiry) costs ~1.25–2×. It's the KV cache with a clock and a price
tag. See Lesson 12.
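The billing effect, as a sketch. It assumes the multipliers quoted above (0.1× for a read, 1.25× for a write) apply only to the cached prefix and that the rest of the prompt is billed at the normal input price. The price unit and function name are placeholders.

```python
def input_cost(prefix_tokens, new_tokens, price_per_token, cache_hit,
               read_mult=0.1, write_mult=1.25):
    """Input-side cost of one request with a cacheable prefix."""
    prefix_mult = read_mult if cache_hit else write_mult
    return (prefix_tokens * prefix_mult + new_tokens) * price_per_token

# Placeholder price of 1.0 per token keeps the ratios easy to read
miss = input_cost(prefix_tokens=10_000, new_tokens=100, price_per_token=1.0, cache_hit=False)
hit = input_cost(prefix_tokens=10_000, new_tokens=100, price_per_token=1.0, cache_hit=True)
print(miss, hit)
```

The first request pays a premium to write the cache; each hit within the TTL then costs a fraction of an uncached request.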
Cache TTL
How long a cached prefix survives before eviction — Anthropic's default is 5 minutes
(refreshed on each use), with a 1-hour option. Idle past the TTL and the next request is a cache
miss that re-prefills from scratch.
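A minimal sketch of refresh-on-use expiry, assuming a 5-minute TTL and explicit timestamps in seconds so the behavior is easy to follow. The class `TTLCache` is hypothetical and models only the clock, not the stored KV.

```python
class TTLCache:
    """Tracks when each cached prefix expires; every lookup resets the clock."""

    def __init__(self, ttl_s=300.0):
        self.ttl_s = ttl_s
        self.expires_at = {}  # prefix key -> expiry time in seconds

    def lookup(self, key, now):
        hit = key in self.expires_at and now < self.expires_at[key]
        # A hit refreshes the entry; a miss re-prefills and writes it again
        self.expires_at[key] = now + self.ttl_s
        return hit

cache = TTLCache(ttl_s=300.0)
results = [
    cache.lookup("system-prompt", now=0),    # cold: miss, written
    cache.lookup("system-prompt", now=200),  # within TTL: hit, refreshed
    cache.lookup("system-prompt", now=450),  # alive only because of the refresh
    cache.lookup("system-prompt", now=800),  # idle past the TTL: miss
]
print(results)
```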