Glossary

The ubiquitous language for this course. Every lesson uses these terms exactly as defined here. Grows as we go.

The whole course in one picture: arithmetic intensity places every workload on this spectrum. Most terms below are about moving along it.

Token: The atomic unit of model input and output: a piece of text (usually a subword) paired with an integer ID — its index in the model's fixed vocabulary. The model only ever sees these integers, never your letters. Averages ≈ ¾ of an English word (~4 characters). See Lesson 3.
Tokenizer: The component that converts text ↔ token IDs (encode and decode), using a learned vocabulary plus BPE merge rules. Fixed per model and loaded by the server (vLLM exposes it at /tokenize).
Vocabulary: The fixed set of all tokens a model knows (your Qwen3.6 = 248,320). A token's position in this set is its integer ID.
Subword: A token that is a fragment of a word. Frequent words are a single subword; rare words split into several — so any string is representable and nothing is "out-of-vocabulary" (it can fall back to raw bytes).
Byte-Pair Encoding (BPE): The algorithm that builds the vocabulary: starting from raw bytes, repeatedly merge the most frequent adjacent pair into a new token, saving each rule. Byte-level BPE operates on UTF-8 bytes, so a leading space becomes part of the next piece (shown as Ġ) and any character is always encodable.
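The merge loop in the BPE entry can be sketched in a few lines. This is a toy illustration under simplifying assumptions: it works on characters rather than UTF-8 bytes, uses a four-word corpus, and `train_bpe` is a hypothetical helper name, not any tokenizer library's API.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the first pair seen.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

Each saved rule becomes a new vocabulary entry; encoding new text replays the rules in the order they were learned.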
Special tokens: Reserved tokens that aren't ordinary text — end-of-sequence (EOS), and chat-role markers like <|im_start|> / <|im_end|>. Added around your content before prefill.
Chat template: The model-specific pattern that wraps your messages in special tokens (system / user / assistant roles) before tokenization. It is what actually gets prefilled — fixed overhead on every turn (Qwen3.6 even auto-opens a <think> block). See Lesson 3.
Context window: The maximum number of tokens (prompt + generation, all held in the KV cache) a model can attend to at once — your Qwen3.6 serves 131,072. Exceeding it forces truncation or eviction.
Prefill: The first phase of inference: the model processes the entire input prompt in one forward pass to produce the first output token (and to populate the KV cache). Work scales with prompt length; compute-bound.
Decode: The second phase: the model generates output tokens one at a time, each step consuming the previously generated token. Each step does little arithmetic but must re-read all weights + the growing KV cache; memory-bandwidth-bound at typical batch sizes.
KV cache: The stored key and value vectors for every past token, kept so attention doesn't recompute them each decode step. Grows linearly with sequence length and batch size; usually the dominant consumer of GPU memory during decode.
Memory-bandwidth-bound: A workload limited by how fast data moves from GPU memory (HBM), not by how fast the GPU can compute. Speeding up the math doesn't help; you must move less data or move it faster. Decode lives here.
Compute-bound: A workload limited by the GPU's arithmetic throughput (FLOPs / tensor cores), not by memory traffic. Prefill lives here. Decode also crosses into this regime once the batch is large enough (roughly batch 32+).
Arithmetic intensity: FLOPs performed per byte read from memory (FLOP/byte). High intensity → compute-bound; low intensity → memory-bound. Prefill has high intensity (~200–400 FLOP/byte); decode has very low intensity. This single number explains the phase asymmetry.
Roofline: A plot of achievable throughput vs arithmetic intensity: a rising bandwidth-limited slope that flattens into a compute-limited roof. Tells you, for any workload, whether speeding up compute or memory will help. See Lesson 11.
Ridge point: Where the roofline's slope meets its roof: ridge = peak FLOPs ÷ memory bandwidth (FLOP/byte). Intensity below it → memory-bound; above → compute-bound. H100 NVL ≈ 214 FLOP/byte (BF16 dense).
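The ridge-point formula and the roofline rule above can be checked with a short calculation. The hardware figures below (about 835 TFLOP/s dense BF16 and about 3.9 TB/s of HBM bandwidth per H100 NVL GPU) are assumptions taken from published spec sheets, and the function names are hypothetical.

```python
def ridge_point(peak_flops, mem_bandwidth):
    """Ridge = peak FLOP/s divided by memory bandwidth in bytes/s (FLOP/byte)."""
    return peak_flops / mem_bandwidth

def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * mem_bandwidth)

def regime(intensity, ridge):
    return "compute-bound" if intensity > ridge else "memory-bound"

# Assumed per-GPU figures for an H100 NVL (not measured values).
PEAK_BF16 = 835e12   # FLOP/s, dense BF16
HBM_BW = 3.9e12      # bytes/s

ridge = ridge_point(PEAK_BF16, HBM_BW)
print(round(ridge))          # about 214 FLOP/byte
print(regime(300, ridge))    # an intensity in the prefill range quoted above
print(regime(2, ridge))      # an illustrative low-intensity workload
```

Below the ridge, `attainable_flops` grows with intensity, so moving fewer bytes helps; above it, only more compute helps.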
Tensor parallelism (TP): Split each layer's weight matrices across N GPUs so one model runs on several GPUs at once. Needs an all-reduce to recombine partial results every layer. Use when a model won't fit on one GPU, or for lower single-stream latency over NVLink. See Lesson 19.
All-reduce: A collective that sums a tensor across all GPUs and returns the result to each. TP does two per transformer layer (after attention, after the MLP) — in decode that's per token, on the critical path, so it lands directly in token latency.
NVLink: NVIDIA's high-bandwidth GPU-to-GPU interconnect (H100 ≈ 900 GB/s), roughly 7× faster than PCIe Gen5 x16 (≈ 128 GB/s). Decisive for TP, whose all-reduce rides it. In the analogy: the express lane between prep stations.
Latency vs. throughput: Latency = time for one request (e.g. time-to-first-token, then per-token). Throughput = tokens/sec across all requests. Batching trades latency for throughput; the two phases sit on opposite sides of this trade. The knee is where throughput saturates but latency keeps climbing — run just below it.
TTFT (time to first token): Request arrival → first output token, dominated by prefill. The first SLO users feel; protected by chunked prefill + prefix caching.
TPOT / ITL (time per output token / inter-token latency): Average gap between output tokens once generation starts, dominated by decode. The streaming-smoothness SLO; per-request throughput ≈ 1 ÷ TPOT.
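Both latency metrics fall out of the timestamps of a streamed response. The sketch below assumes you have recorded the request arrival time and the wall-clock time of each output token; `ttft_and_tpot` is a hypothetical helper, not a vLLM function.

```python
def ttft_and_tpot(arrival_s, token_times_s):
    """Return (TTFT, TPOT) in seconds from per-token timestamps."""
    # TTFT: request arrival to the first output token.
    ttft = token_times_s[0] - arrival_s
    # TPOT: mean gap between consecutive output tokens after the first.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot

ttft, tpot = ttft_and_tpot(0.0, [0.50, 0.55, 0.60, 0.65])
print(ttft, tpot)
print(1 / tpot)  # approximate per-request tokens/sec
```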
Goodput: The throughput that meets both the TTFT and TPOT SLOs. Past the knee, raw throughput can stay flat while goodput collapses — so optimize for goodput, not throughput. See Lesson 24.
Little's Law: in-flight requests = throughput × latency. Turns a latency target + the concurrency cap (max-num-seqs) into max QPS and replica count — the bridge from the knee to autoscaling.
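Rearranged, Little's Law gives the sizing arithmetic directly. The numbers below (a cap of 64 concurrent sequences, an 8-second end-to-end latency target, 50 QPS of demand) are illustrative assumptions, and the function names are hypothetical.

```python
import math

def max_qps(max_num_seqs, latency_s):
    """Little's Law rearranged: throughput = in-flight requests / latency."""
    return max_num_seqs / latency_s

def replicas_needed(target_qps, max_num_seqs, latency_s):
    """Replicas required so that no replica exceeds its concurrency cap."""
    return math.ceil(target_qps / max_qps(max_num_seqs, latency_s))

per_replica = max_qps(max_num_seqs=64, latency_s=8.0)
replicas = replicas_needed(target_qps=50, max_num_seqs=64, latency_s=8.0)
print(per_replica, replicas)
```

The latency plugged in should be the one you measure just below the knee; past it, latency grows and the same cap supports fewer requests per second.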
HBM (high-bandwidth memory): The GPU's main memory, where model weights and the KV cache live. Big but "far" — reaching it is the cost that makes decode memory-bound. In the analogy: the pantry across town.
Attention head: One of several parallel "tasters" inside attention. Per token, each head emits a Query (what it's looking for), a Key (how it labels itself), and a Value (what it carries) — each a vector of head_dim numbers. The KV cache stores the K and V of each KV head.
head_dim: The length of one head's Key (or Value) vector — how many numbers describe a token from that head's angle (128 in Llama 2 7B). In the formula, kv_heads × head_dim is the width of one token's cache entry per layer (the index card's length).
GQA (grouped-query attention): An attention variant where several query heads share one key/value head, so the model stores fewer KV heads. Shrinks the KV cache by the group factor (e.g. Llama 3 8B: 32 query ÷ 8 KV heads = 4× smaller). The middle ground between full multi-head attention (MHA) and single-KV MQA.
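The effect of GQA on cache size can be worked out with the usual per-token KV formula, sketched here assuming 16-bit cache values and the published shapes of Llama 2 7B (32 layers, 32 KV heads) and Llama 3 8B (32 layers, 8 KV heads). The function names are hypothetical.

```python
def kv_bytes_per_token(num_layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one Key vector plus one Value vector per KV head per layer.
    return 2 * num_layers * kv_heads * head_dim * bytes_per_value

def kv_bytes_total(per_token, seq_len, batch_size):
    """The cache grows linearly with sequence length and batch size."""
    return per_token * seq_len * batch_size

mha = kv_bytes_per_token(num_layers=32, kv_heads=32, head_dim=128)  # Llama 2 7B shape
gqa = kv_bytes_per_token(num_layers=32, kv_heads=8, head_dim=128)   # Llama 3 8B shape
print(mha // 1024, "KiB/token vs", gqa // 1024, "KiB/token")
print(kv_bytes_total(gqa, seq_len=8192, batch_size=16) / 2**30, "GiB")
```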
Quantization: Store weights, activations, and/or the KV cache in fewer bits (FP16 → FP8 → INT4), rounding each number onto a coarser grid. Shrinks memory and the bytes moved per token (faster decode), at a small accuracy cost — the risk lives in a few outlier values. See Lesson 16.
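The "coarser grid" and the outlier risk can be seen in a toy example. The sketch below assumes symmetric per-tensor INT8 quantization, which is only one of several schemes and is used here purely for illustration; the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization onto the integer grid [-127, 127]."""
    # The largest magnitude sets the grid spacing for every value.
    scale = (max(abs(v) for v in values) / 127) or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

# One outlier (50.0) stretches the grid so far that small values collapse.
q, scale = quantize_int8([0.1, -0.2, 0.3, 50.0])
print(q)
print(dequantize(q, scale))
```

With the outlier present, 0.1 rounds to zero; without it, the same values would spread across the whole grid. Finer-grained scales (per channel or per block) are a common way to limit this.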
FP8 (E4M3 / E5M2): 8-bit floating point. E4M3 (4 exponent, 3 mantissa bits) — higher precision, small range — is used for weights/activations/KV; E5M2 (wider range) for gradients. ~Lossless for inference on H100-class GPUs, which also run FP8 at ~2× their BF16 tensor-core throughput.
Static batching: Group N requests, run them together, and wait for all to finish before starting the next batch. Because requests vary in length, finished slots idle until the slowest ends — ~60% GPU idle on real traffic. The thing continuous batching fixes.
Continuous batching (iteration-level scheduling): Schedule one decode step at a time: after each step, evict finished requests and admit waiting ones, so the running batch never drains. Introduced by Orca; keeps the memory-bound decode phase fed. See Lesson 12.
PagedAttention: Store the KV cache in small fixed-size blocks allocated on demand and non-contiguously, like virtual-memory paging for the GPU. Nearly eliminates the 60–80% fragmentation waste of contiguous per-request reservation, so far more sequences fit. vLLM's core trick. See Lesson 12.
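A back-of-the-envelope comparison shows where the savings come from. The sketch assumes a block size of 16 tokens and a request that reserves 2,048 token slots up front but only uses 500; all names and numbers are illustrative, not vLLM internals.

```python
import math

def paged_blocks(seq_len, block_size=16):
    """Blocks allocated on demand for a sequence of seq_len tokens."""
    return math.ceil(seq_len / block_size)

def waste_contiguous(seq_len, reserved_len):
    """Fraction of a contiguous up-front reservation left unused."""
    return 1 - seq_len / reserved_len

def waste_paged(seq_len, block_size=16):
    """Only the tail of the last block is unused."""
    allocated = paged_blocks(seq_len, block_size) * block_size
    return 1 - seq_len / allocated

print(f"contiguous: {waste_contiguous(500, 2048):.1%}")
print(f"paged:      {waste_paged(500):.1%}")
```

Paged waste is bounded by one partial block per sequence, regardless of how long the sequence could have become.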
Prefix caching: Reuse PagedAttention blocks across requests that share a prompt prefix (e.g. a common system prompt) instead of recomputing and re-storing their KV. A big win for RAG.
Prompt caching: The productized, billed form of prefix caching exposed by LLM APIs: the provider stores a stable prefix's prefill KV under a TTL so repeat requests skip re-prefilling it. A cache read (hit) costs ~0.1× the input price; a cache write (cold, or after expiry) costs ~1.25–2× the input price. It's the KV cache with a clock and a price tag. See Lesson 12.
Cache TTL: How long a cached prefix survives before eviction — Anthropic's default is 5 minutes (refreshed on each use), with a 1-hour option. Idle past the TTL and the next request is a cache miss that re-prefills from scratch.
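The read and write multipliers in the Prompt caching entry translate into a simple cost comparison. The sketch assumes a 10,000-token cached prefix, a 500-token uncached suffix, a 1.25× write multiplier, and a price of one arbitrary unit per input token; `input_cost` is a hypothetical helper, not a provider API.

```python
def input_cost(prefix_tokens, suffix_tokens, price_per_token, cache_hit,
               read_mult=0.1, write_mult=1.25):
    """Input-side cost of one request with a cacheable prefix."""
    # A hit reads the prefix at a discount; a miss pays the write premium.
    prefix_mult = read_mult if cache_hit else write_mult
    return (prefix_tokens * prefix_mult + suffix_tokens) * price_per_token

miss = input_cost(10_000, 500, price_per_token=1.0, cache_hit=False)
hit = input_cost(10_000, 500, price_per_token=1.0, cache_hit=True)
print(miss, hit)
```

A request that arrives after the TTL has lapsed pays the miss price again, so steady traffic on a stable prefix is what makes the discount pay off.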
