Inference Engineering

A course that bridges from the ops layer down into GPU internals, taught against a real 4×H100 cluster. Start with Lesson 1, or open the Lab.

Lessons

Part I · How an LLM works

What Is an LLM?

Training vs inference, next-token prediction, the pipeline.

What's Inside a Model

A model is just named arrays of numbers (tensors), stored as SafeTensors.

Text → integer tokens — the unit everything is measured in.

Token IDs → learned vectors that carry meaning + position.

Q/K/V, softmax, causal mask, multi-head — tokens read each other.

The Forward Pass & Sampling

Blocks → logits → token: greedy, temperature, top-k.

The Autoregressive Loop & Cost

The naive loop, and why each step gets slower (the staircase).

Part II · The inference runtime

What Is an Inference Engine?

The naive loop vs vLLM, SGLang, TensorRT-LLM, and Dynamo: what the engines add.

Prefill vs Decode

The asymmetry behind every latency/throughput tradeoff.

The KV Cache & Its Memory

Compute the cache; see why it caps your batch.

ops:byte — when decode is memory- vs compute-bound.

PagedAttention & Batching

Why naive batching wastes the cache; keeping decode fed.

CUDA Kernels & Fusion

Cutting HBM round-trips; FlashAttention and SRAM tiling.

Prefix Caching & the KV Hierarchy

Reusing KV across requests; the memory hierarchy; routing.

Speculative Decoding

A draft model proposes; the target verifies in parallel.

Part III · Precision & formats

Quantization: Number Formats

Fewer bits per number — FP8/INT8/INT4 on your cluster.

Quantization Algorithms

GPTQ, AWQ, SmoothQuant — taming activation outliers.

Model Formats & Compilation

SafeTensors, ONNX, TensorRT — serialize vs compile.

Part IV · Scaling across GPUs

Model Parallelism & NVLink

Splitting a model across GPUs — and the wire that decides it.

Disaggregated Serving

Split prefill & decode onto separate workers (Dynamo).

Part V · The hardware

GPU Architecture: SMs & HBM

SMs, Tensor Cores, and the HBM/SRAM memory hierarchy.

GPU Generations

Hopper → Blackwell → Rubin: precision, interconnect, capacity.

Multi-Instance GPU (MIG)

Right-size instances; slice one GPU for multi-tenant serving.

Part VI · Production serving

Latency, Throughput & SLOs

The knee, goodput, Little's Law — internals meet ops.

Routing, Load Balancing & Queueing

Token-aware & KV-aware routing; queues under load.

Concurrency targets, scaling signals, and cold starts.

Containerization: Docker & NIM

Reproducible inference containers — Docker and NVIDIA NIM.

Multi-Cloud Capacity

Inference fleets across clouds — supply, latency, reliability.

Zero-Downtime Deployment & Cost

Blue-green/canary deploys with rollback; $/1M-token cost.

Reference

Your 4×H100 Cluster

Real hardware, live models, and the vLLM-flag → lesson map.

Ops · action needed

Cluster Findings

NVLink pair down, imbalance, tuning proposal for the 4×H100.

The Faraway Pantry

The running metaphor every lesson reuses.

The course's shared vocabulary.

The goal grounding every lesson.

Curated, high-trust reading.

Refresh live cluster numbers anytime: bash learning/tools/cluster-probe.sh