
Glossary

The ubiquitous language for this course. Every lesson uses these terms exactly as defined here. Grows as we go.

[Figure: the arithmetic-intensity spectrum, from memory-bandwidth-bound (decode, low ops/byte) to compute-bound (prefill, high ops/byte).]
The whole course in one picture: arithmetic intensity places every workload on this spectrum. Most terms below are about moving along it.
Token
The atomic unit of model input and output: a piece of text (usually a subword) paired with an integer ID (its index in the model's fixed vocabulary). The model only ever sees these integers, never your letters. One token averages ≈ ¾ of an English word (~4 characters). See Lesson 3.
Tokenizer
The component that converts text ↔ token IDs (encode and decode), using a learned vocabulary plus BPE merge rules. Fixed per model and loaded by the server (vLLM exposes it at /tokenize).
Vocabulary
The fixed set of all tokens a model knows (your Qwen3.6 = 248,320). A token's position in this set is its integer ID.
Subword
A token that is a fragment of a word. Frequent words are a single subword; rare words split into several — so any string is representable and nothing is "out-of-vocabulary" (it can fall back to raw bytes).
Byte-Pair Encoding (BPE)
The algorithm that builds the vocabulary: starting from raw bytes, repeatedly merge the most frequent adjacent pair into a new token, saving each rule. Byte-level BPE operates on UTF-8 bytes, so a leading space becomes part of the next piece (shown as Ġ) and any character is always encodable.
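The merge loop described above can be sketched in a few lines. This is a toy trainer, not a production tokenizer: it skips the pre-tokenization step real byte-level BPE uses, and the names `train_bpe` and `num_merges` are made up for illustration.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Learn up to `num_merges` merge rules from `text`, starting from raw UTF-8 bytes."""
    # Start with one symbol per byte (each symbol is a bytes object of length 1).
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no further merge is worth learning
        merges.append((left, right))  # save the rule
        # Rewrite the sequence, replacing every occurrence of the pair.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

merges, pieces = train_bpe("low lower lowest", num_merges=3)
print(merges[:2])  # the first two learned rules build b"lo", then b"low"
print(pieces)
```

Because every merge only concatenates existing pieces, joining the final pieces always reproduces the original bytes, which is why nothing is ever out-of-vocabulary.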
Special tokens
Reserved tokens that aren't ordinary text — end-of-sequence (EOS), and chat-role markers like <|im_start|> / <|im_end|>. Added around your content before prefill.
Chat template
The model-specific pattern that wraps your messages in special tokens (system / user / assistant roles) before tokenization. It is what actually gets prefilled — fixed overhead on every turn (Qwen3.6 even auto-opens a <think> block). See Lesson 3.
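A minimal sketch of the wrapping, assuming a ChatML-style layout built from the `<|im_start|>` / `<|im_end|>` markers above. The real template ships with the model's tokenizer and may add more (such as the auto-opened `<think>` block), so treat `render_chatml` as a hypothetical stand-in.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Wrap role/content messages in ChatML-style special tokens (illustrative only)."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

Everything in the output except the two content strings is the fixed per-turn overhead that gets tokenized and prefilled.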
Context window
The maximum number of tokens (prompt + generation, all held in the KV cache) a model can attend to at once — your Qwen3.6 serves 131,072. Exceeding it forces truncation or eviction.
Prefill
The first phase of inference: the model processes the entire input prompt in one forward pass to produce the first output token (and to populate the KV cache). Work scales with prompt length; compute-bound.
Decode
The second phase: the model generates output tokens one at a time, each step consuming the previously generated token. Each step does little arithmetic but must re-read all weights + the growing KV cache; memory-bandwidth-bound at typical batch sizes.
KV cache
The stored key and value vectors for every past token, kept so attention doesn't recompute them each decode step. Grows linearly with sequence length and batch size; usually the dominant consumer of GPU memory during decode.
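To see the linear growth, here is a back-of-envelope sizing sketch. The shape used (32 layers, 32 KV heads, head_dim 128, 2-byte values) is an assumption matching a Llama 2 7B-style model in FP16; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    # Leading 2 = one Key vector plus one Value vector per token, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch

per_token = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1)
batch_total = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=8)

print(per_token / 2**20, "MiB per token")                    # 0.5 MiB per token
print(batch_total / 2**30, "GiB for 8 sequences of 4096")    # 16.0 GiB
```

Doubling either the sequence length or the batch size doubles the total, which is why long contexts and large batches compete for the same memory.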
Memory-bandwidth-bound
A workload limited by how fast data moves from GPU memory (HBM), not by how fast the GPU can compute. Speeding up the math doesn't help; you must move less data or move it faster. Decode lives here.
Compute-bound
A workload limited by the GPU's arithmetic throughput (FLOPs / tensor cores), not by memory traffic. Prefill lives here. Decode also moves into this regime once the batch is large enough (roughly batch 32+ as a rule of thumb; the exact crossover depends on the GPU's ridge point and the weight precision).
Arithmetic intensity
FLOPs performed per byte read from memory. High intensity → compute-bound; low intensity → memory-bound. Prefill has high intensity (~200–400 ops/byte); decode has very low intensity. This single number explains the phase asymmetry.
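A rough way to see the asymmetry is to count FLOPs and bytes for a single weight matrix. This is an idealized sketch: it assumes 2-byte values, counts each operand once, and ignores attention and caching effects, so it will not match whole-model measurements. `gemm_intensity` and the 4096-wide layer are illustrative choices.

```python
def gemm_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for multiplying a (tokens x d_in) activation by a (d_in x d_out) weight."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight, per token
    bytes_moved = bytes_per_value * (
        d_in * d_out        # read the weights once
        + tokens * d_in     # read the input activations
        + tokens * d_out    # write the output activations
    )
    return flops / bytes_moved

decode_intensity = gemm_intensity(tokens=1, d_in=4096, d_out=4096)     # one new token
prefill_intensity = gemm_intensity(tokens=512, d_in=4096, d_out=4096)  # a 512-token chunk

print(round(decode_intensity, 3), round(prefill_intensity, 1))
```

With one token the weights are read to do almost no work (about 1 FLOP/byte); with hundreds of tokens the same weight read is amortized across all of them.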
Roofline
A plot of achievable throughput vs arithmetic intensity: a rising bandwidth-limited slope that flattens into a compute-limited roof. Tells you, for any workload, whether speeding up compute or memory will help. See Lesson 11.
Ridge point
Where the roofline's slope meets its roof: ridge = peak FLOPs ÷ memory bandwidth (FLOP/byte). Intensity below it → memory-bound; above → compute-bound. H100 NVL ≈ 214 FLOP/byte (BF16 dense).
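The ridge formula and the roofline it belongs to fit in two small functions. The peak of 835 TFLOP/s and bandwidth of 3.9 TB/s below are assumed spec-sheet-style values, picked because they reproduce the ≈ 214 FLOP/byte figure; substitute your own GPU's numbers.

```python
def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOP/byte) where the bandwidth slope meets the compute roof."""
    return peak_flops / mem_bandwidth

def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline: throughput is capped by whichever limit is lower at this intensity."""
    return min(peak_flops, intensity * mem_bandwidth)

PEAK_FLOPS = 835e12      # assumed BF16 dense peak, FLOP/s
MEM_BANDWIDTH = 3.9e12   # assumed HBM bandwidth, bytes/s

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
print(round(ridge))  # 214

for intensity in (1, 100, 1000):
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(intensity, regime, attainable_flops(intensity, PEAK_FLOPS, MEM_BANDWIDTH))
```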
Tensor parallelism (TP)
Split each layer's weight matrices across N GPUs so one model runs on several GPUs at once. Needs an all-reduce to recombine partial results every layer. Use when a model won't fit on one GPU, or for lower single-stream latency over NVLink. See Lesson 19.
All-reduce
A collective that sums a tensor across all GPUs and returns the result to each. TP does two per transformer layer (after attention, after the MLP) — in decode that's per token, on the critical path, so it lands directly in token latency.
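The sketch below simulates the idea with plain Python lists standing in for GPUs: a weight matrix is split by rows across two "GPUs", each computes a partial result, and an all-reduce sums them so every GPU holds the full answer. Real systems use a communication library for this; `matvec` and `all_reduce_sum` here are toy stand-ins.

```python
def matvec(x, w):
    """Row vector x (length k) times matrix w (k rows, n columns) -> vector of length n."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def all_reduce_sum(partials):
    """Sum the per-GPU vectors elementwise and hand every GPU a copy of the total."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]

x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]

# Row-parallel split: GPU 0 holds rows 0-1 (and the matching slice of x), GPU 1 holds rows 2-3.
shards = [(x[:2], w[:2]), (x[2:], w[2:])]
partials = [matvec(xs, ws) for xs, ws in shards]  # each GPU's partial output
results = all_reduce_sum(partials)                # every GPU now has the full output

print(partials)    # [[1.0, 2.0], [10.0, 15.0]]
print(results[0])  # [11.0, 17.0]
```

No GPU can proceed to the next layer until the sum arrives, which is why this step sits on the critical path in decode.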
NVLink
NVIDIA's high-bandwidth GPU-to-GPU interconnect (H100 ≈ 900 GB/s), ~7× faster than PCIe (≈ 128 GB/s). Decisive for TP, whose all-reduce rides it. In the analogy: the express lane between prep stations.
Latency vs. throughput
Latency = time for one request (e.g. time-to-first-token, then per-token). Throughput = tokens/sec across all requests. Batching trades latency for throughput; the two phases sit on opposite sides of this trade. The knee is where throughput saturates but latency keeps climbing — run just below it.
TTFT (time to first token)
Request arrival → first output token, dominated by prefill. The first SLO users feel; protected by chunked prefill + prefix caching.
TPOT / ITL (time per output token)
Average gap between output tokens once generation starts, dominated by decode. The streaming-smoothness SLO; per-request throughput ≈ 1 ÷ TPOT.
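Both TTFT and TPOT fall out of a request's token timestamps. A minimal sketch, assuming you have recorded the arrival time and the time each output token was emitted (`ttft_and_tpot` is a hypothetical helper, and the timestamps are invented):

```python
def ttft_and_tpot(arrival, token_times):
    """Return (TTFT, TPOT) in seconds from an arrival time and per-token emit times."""
    ttft = token_times[0] - arrival
    if len(token_times) < 2:
        return ttft, None  # TPOT is undefined with a single output token
    # Average gap between consecutive output tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

ttft, tpot = ttft_and_tpot(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tpot, 1 / tpot)  # 0.5 0.25 4.0
```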
Goodput
The throughput that meets both the TTFT and TPOT SLOs. Past the knee, raw throughput can stay flat while goodput collapses — so optimize for goodput, not throughput. See Lesson 24.
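One way to make this measurable is to count only the tokens from requests that met both SLOs. That counting rule is an assumption of this sketch (some teams count requests instead of tokens), and the request records and thresholds are invented for illustration.

```python
def throughput(requests, window_s):
    """Raw tokens per second across all requests in the window."""
    return sum(r["tokens"] for r in requests) / window_s

def goodput(requests, window_s, ttft_slo, tpot_slo):
    """Tokens per second from requests that met BOTH the TTFT and TPOT SLOs."""
    good = sum(
        r["tokens"] for r in requests
        if r["ttft"] <= ttft_slo and r["tpot"] <= tpot_slo
    )
    return good / window_s

requests = [
    {"tokens": 200, "ttft": 0.4, "tpot": 0.03},  # meets both SLOs
    {"tokens": 300, "ttft": 2.5, "tpot": 0.03},  # misses TTFT
    {"tokens": 100, "ttft": 0.3, "tpot": 0.09},  # misses TPOT
]

raw = throughput(requests, window_s=10)
good = goodput(requests, window_s=10, ttft_slo=1.0, tpot_slo=0.05)
print(raw, good)  # 60.0 20.0
```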
Little's Law
in-flight requests = throughput × latency. Turns a latency target + the concurrency cap (max-num-seqs) into max QPS and replica count — the bridge from the knee to autoscaling.
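Rearranged, the law gives throughput = in-flight requests ÷ latency, so the concurrency cap sets a ceiling on QPS per replica. A small sketch with invented numbers (64 concurrent sequences, 8 s end-to-end latency, a 50 QPS target); `max_qps` and `replicas_needed` are hypothetical helpers.

```python
import math

def max_qps(max_num_seqs, latency_s):
    """Little's Law rearranged: throughput = in-flight requests / latency."""
    return max_num_seqs / latency_s

def replicas_needed(target_qps, max_num_seqs, latency_s):
    """Replicas required to serve target_qps without exceeding the concurrency cap."""
    return math.ceil(target_qps / max_qps(max_num_seqs, latency_s))

per_replica = max_qps(max_num_seqs=64, latency_s=8.0)
replicas = replicas_needed(target_qps=50, max_num_seqs=64, latency_s=8.0)
print(per_replica, replicas)  # 8.0 7
```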
HBM (high-bandwidth memory)
The GPU's main memory, where model weights and the KV cache live. Big but "far" — reaching it is the cost that makes decode memory-bound. In the analogy: the pantry across town.
Attention head
One of several parallel "tasters" inside attention. Per token, each head emits a Query (what it's looking for), a Key (how it labels itself), and a Value (what it carries) — each a vector of head_dim numbers. The KV cache stores the K and V of each KV head.
head_dim
The length of one head's Key (or Value) vector — how many numbers describe a token from that head's angle (128 in Llama 2 7B). In the formula, kv_heads × head_dim is the width of one token's cache entry per layer (the index card's length).
GQA (grouped-query attention)
An attention variant where several query heads share one key/value head, so the model stores fewer KV heads. Shrinks the KV cache by the group factor (e.g. Llama 3 8B: 32 query ÷ 8 KV heads = 4× smaller). The middle ground between full multi-head attention (MHA) and single-KV MQA.
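The sharing rule is just integer division: consecutive query heads map to the same KV head. A minimal sketch using the 32-query / 8-KV split above; `kv_head_for` is a hypothetical helper, and contiguous grouping is the usual convention rather than something every model must follow.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

NUM_Q, NUM_KV = 32, 8
mapping = [kv_head_for(q, NUM_Q, NUM_KV) for q in range(NUM_Q)]
shrink_factor = NUM_Q // NUM_KV  # how much smaller the KV cache is than with MHA

print(mapping[:8])    # [0, 0, 0, 0, 1, 1, 1, 1]
print(shrink_factor)  # 4
```

Setting `NUM_KV = NUM_Q` recovers MHA (every query head has its own KV head), and `NUM_KV = 1` recovers MQA.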
Quantization
Store weights, activations, and/or the KV cache in fewer bits (FP16 → FP8 → INT4), rounding each number onto a coarser grid. Shrinks memory and the bytes moved per token (faster decode), at a small accuracy cost — the risk lives in a few outlier values. See Lesson 16.
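To see why outliers carry the risk, here is a toy symmetric absmax quantizer. It is a sketch of one simple scheme, not what any particular library does: one scale per tensor, 4-bit signed integers, and hypothetical function names.

```python
def quantize_absmax(values, bits=8):
    """Symmetric quantization: map the largest magnitude onto the largest integer level."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale for all-zero input
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

q_clean, scale_clean = quantize_absmax([0.1, -0.2, 0.3], bits=4)
q_outlier, scale_outlier = quantize_absmax([0.1, -0.2, 0.3, 50.0], bits=4)

print(q_clean)    # [2, -5, 7]
print(q_outlier)  # [0, 0, 0, 7]
```

A single large value stretches the grid so far that the ordinary values all round to zero, which is why practical schemes use finer-grained scales or treat outliers separately.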
FP8 (E4M3 / E5M2)
8-bit floating point. E4M3 (4 exponent, 3 mantissa bits) — higher precision, small range — is used for weights/activations/KV; E5M2 (wider range) for gradients. ~Lossless for inference on H100-class GPUs, which also run FP8 at ~2× their BF16 tensor-core throughput.
Static batching
Group N requests, run them together, and wait for all to finish before starting the next batch. Because requests vary in length, finished slots idle until the slowest ends — ~60% GPU idle on real traffic. The thing continuous batching fixes.
Continuous batching (iteration-level scheduling)
Schedule one decode step at a time: after each step, evict finished requests and admit waiting ones, so the running batch never drains. Introduced by Orca; keeps the memory-bound decode phase fed. See Lesson 12.
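A tiny simulation makes the difference concrete. It counts decode steps only, assumes every request is already waiting, and ignores prefill and memory limits; the request lengths and function names are invented for illustration.

```python
from collections import deque

def static_steps(lengths, slots):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_steps(lengths, slots):
    """Continuous batching: refill free slots from the queue before every step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())  # admit waiting requests
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps

lengths = [2, 8, 2, 8, 2, 2, 2, 2]  # output tokens per request
static = static_steps(lengths, slots=2)
continuous = continuous_steps(lengths, slots=2)
print(static, continuous)                                  # 20 14
print(sum(lengths) / (static * 2), sum(lengths) / (continuous * 2))  # slot utilization
```

In the static run, every short request paired with a long one leaves its slot idle for six steps; the continuous run refills that slot immediately.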
PagedAttention
Store the KV cache in small fixed-size blocks allocated on demand and non-contiguously: virtual memory / paging for the GPU. Eliminates the 60–80% fragmentation waste of contiguous per-request reservation, so far more sequences fit. vLLM's core trick. See Lesson 12.
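The bookkeeping behind the idea resembles a page table. The sketch below is a toy allocator, not vLLM's implementation: `BlockAllocator`, its method names, and the block size of 16 are assumptions for illustration, and it tracks only block IDs, not the tensors themselves.

```python
class BlockAllocator:
    """Hand out fixed-size KV blocks on demand and track each sequence's block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n == len(table) * self.block_size:  # current blocks are full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())      # any free block will do; no contiguity needed
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens -> 1 block
print(alloc.tables, len(alloc.free))
```

The only waste is the unfilled tail of each sequence's last block, rather than a full maximum-length reservation per request.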
Prefix caching
Reuse PagedAttention blocks across requests that share a prompt prefix (e.g. a common system prompt) instead of recomputing and re-storing their KV; a big win for RAG.
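One common way to detect a shared prefix is to key each full block by its tokens plus the key of the block before it, so a block only matches if everything ahead of it matches too. The sketch below assumes that scheme; the function names, the set used as a cache, and the block size of 16 are illustrative, and only block keys are tracked, not KV tensors.

```python
def block_keys(token_ids, block_size):
    """Chain-hash each FULL block with its parent so a key identifies the whole prefix."""
    keys, parent = [], None
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        parent = hash((parent, tuple(token_ids[start:start + block_size])))
        keys.append(parent)
    return keys

def cached_prefix_tokens(token_ids, cache, block_size):
    """Return how many leading tokens are already cached, then register this request's blocks."""
    keys = block_keys(token_ids, block_size)
    hits = 0
    for key in keys:
        if key not in cache:
            break
        hits += 1
    cache.update(keys)
    return hits * block_size

cache = set()
system_prompt = list(range(32))  # stand-in token ids for a shared system prompt
first = cached_prefix_tokens(system_prompt + [100, 101, 102, 103], cache, block_size=16)
second = cached_prefix_tokens(system_prompt + [200, 201, 202, 203, 204], cache, block_size=16)
shifted = cached_prefix_tokens([999] + system_prompt, cache, block_size=16)
print(first, second, shifted)  # 0 32 0
```

The last call shows the catch: one extra token at the front changes every block, so nothing is reused.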
Prompt caching
The productized, billed form of prefix caching exposed by LLM APIs: the provider stores a stable prefix's prefill KV under a TTL so repeat requests skip re-prefilling it. A cache read (hit) costs ~0.1× of input price; a cache write (cold, or after expiry) ~1.25–2×. It's the KV cache with a clock and a price tag. See Lesson 12.
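The multipliers make the economics easy to estimate. A rough sketch using the ~0.1× read and 1.25× write figures above, with an arbitrary price of 1 unit per token; `input_cost` is a hypothetical helper, and real bills depend on the provider's current pricing.

```python
def input_cost(prefix_tokens, suffix_tokens, price_per_token, cache_hit,
               read_mult=0.1, write_mult=1.25):
    """Input cost when the prefix is cacheable and the suffix is billed at the normal rate."""
    prefix_mult = read_mult if cache_hit else write_mult
    return price_per_token * (prefix_tokens * prefix_mult + suffix_tokens)

PREFIX, SUFFIX, PRICE = 10_000, 500, 1.0

uncached = PRICE * (PREFIX + SUFFIX)                    # no caching at all
cold = input_cost(PREFIX, SUFFIX, PRICE, cache_hit=False)  # first request pays the write premium
warm = input_cost(PREFIX, SUFFIX, PRICE, cache_hit=True)   # later requests pay the read rate

print(uncached, cold, warm)
print(cold + warm < 2 * uncached)  # caching already pays off by the second request
```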
Cache TTL
How long a cached prefix survives before eviction — Anthropic's default is 5 minutes (refreshed on each use), with a 1-hour option. Idle past the TTL and the next request is a cache miss that re-prefills from scratch.
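The refresh-on-use behavior can be modeled with one timestamp per prefix. This is a toy model of the semantics described above, not any provider's implementation; `TTLPrefixCache` and the timeline are invented, and the clock is passed in explicitly so the example is deterministic.

```python
class TTLPrefixCache:
    """Track when each cached prefix expires; every use pushes the expiry forward."""

    def __init__(self, ttl_s=300.0):
        self.ttl_s = ttl_s
        self.expires_at = {}  # prefix key -> expiry time in seconds

    def lookup(self, key, now):
        """Return True on a hit. A hit refreshes the TTL; a miss writes a fresh entry."""
        hit = key in self.expires_at and now < self.expires_at[key]
        self.expires_at[key] = now + self.ttl_s
        return hit

cache = TTLPrefixCache(ttl_s=300.0)
timeline = [0.0, 200.0, 450.0, 1100.0]
hits = [cache.lookup("system-prompt", now=t) for t in timeline]
print(hits)  # [False, True, True, False]
```

The request at 450 s still hits because the one at 200 s reset the clock; the 650 s gap before the last request does not.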