Inference Engineering · my learning journal

Inference Engineering

This course bridges from the ops layer down into GPU internals, taught against a real 4×H100 cluster. Start with Lesson 1, or open the Lab.

Lessons

Part I · How an LLM works

Lesson 1
What Is an LLM?
Training vs inference, next-token prediction, the pipeline.
Lesson 2
What's Inside a Model
A model is just named arrays of numbers (tensors), stored as SafeTensors.
Lesson 3
Tokenization
Text → integer tokens — the unit everything is measured in.
Lesson 4
Embeddings
Token IDs → learned vectors that carry meaning + position.
Lesson 5
Attention
Q/K/V, softmax, causal mask, multi-head — tokens read each other.
Lesson 6
The Forward Pass & Sampling
Blocks → logits → token: greedy, temperature, top-k.
Lesson 7
The Autoregressive Loop & Cost
The naive loop, and why each step gets slower (the staircase).

Part II · The inference runtime

Lesson 8
What Is an Inference Engine?
The naive loop vs vLLM/SGLang/TensorRT-LLM/Dynamo: what they add.
Lesson 9
Prefill vs Decode
The asymmetry behind every latency/throughput tradeoff.
Lesson 10
The KV Cache & Its Memory
Compute the cache's size; see why it caps your batch size.
Lesson 11
The Roofline
ops:byte — when decode is memory- vs compute-bound.
Lesson 12
PagedAttention & Batching
Why naive batching wastes the cache; keeping decode fed.
Lesson 13
CUDA Kernels & Fusion
Cutting HBM round-trips; FlashAttention and SRAM tiling.
Lesson 14
Prefix Caching & the KV Hierarchy
Reusing KV across requests; the memory hierarchy; routing.
Lesson 15
Speculative Decoding
A draft model proposes; the target verifies in parallel.

Part III · Precision & formats

Lesson 16
Quantization: Number Formats
Fewer bits per number — FP8/INT8/INT4 on your cluster.
Lesson 17
Quantization Algorithms
GPTQ, AWQ, SmoothQuant — taming activation outliers.
Lesson 18
Model Formats & Compilation
SafeTensors, ONNX, TensorRT — serialize vs compile.

Part IV · Scaling across GPUs

Lesson 19
Model Parallelism & NVLink
Splitting a model across GPUs — and the wire that decides it.
Lesson 20
Disaggregated Serving
Split prefill & decode onto separate workers (Dynamo).

Part V · The hardware

Lesson 21
GPU Architecture: SMs & HBM
SMs, Tensor Cores, and the HBM/SRAM memory hierarchy.
Lesson 22
GPU Generations
Hopper → Blackwell → Rubin: precision, interconnect, capacity.
Lesson 23
Multi-Instance GPU (MIG)
Right-size instances; slice one GPU for multi-tenant serving.

Part VI · Production serving

Lesson 24
Latency, Throughput & SLOs
The knee, goodput, Little's Law — internals meet ops.
Lesson 25
Routing, Load Balancing & Queueing
Token-aware & KV-aware routing; queues under load.
Lesson 26
Autoscaling
Concurrency targets, scaling signals, and cold starts.
Lesson 27
Containerization: Docker & NIM
Reproducible inference containers — Docker and NVIDIA NIM.
Lesson 28
Multi-Cloud Capacity
Inference fleets across clouds — supply, latency, reliability.
Lesson 29
Zero-Downtime Deployment & Cost
Blue-green/canary deploys with rollback; $/1M-token cost.

Reference

Lab
Your 4×H100 Cluster
Real hardware, live models, and the vLLM-flag → lesson map.
Ops · action needed
Cluster Findings
NVLink pair down, imbalance, tuning proposal for the 4×H100.
Analogy
The Faraway Pantry
The running metaphor every lesson reuses.
Reference
Glossary
The course's shared vocabulary.
Why
Mission
The goal grounding every lesson.
Sources
Resources
Curated, high-trust reading.
Refresh live cluster numbers anytime: bash learning/tools/cluster-probe.sh