Mission

The reason behind the course — every lesson traces back here.

Why

I cloned a 100-days-of-inference repo and want to turn it into my own learning journal — working through inference engineering for real. I'm already strong at the serving/ops layer (autoscaling, routing, multi-cloud capacity, zero-downtime deploys, cost). The goal is to push down into the GPU/runtime internals so I understand why serving systems behave the way they do — enough to reason about, tune, and debug a high-performance inference stack, not just operate one.

Success looks like

Explain, from memory, why decode is memory-bandwidth-bound and prefill is compute-bound — and use it to predict latency/throughput behavior.
Connect each ops decision (batching, routing, autoscaling, capacity) back to a concrete internals reason.
Read a serving framework's design (vLLM/TGI/TensorRT-LLM) and explain its KV-cache, batching, and scheduling choices.
Do back-of-envelope math: KV-cache memory, when decode turns compute-bound, the throughput cost of a latency SLO.
Use the lab: verify the theory against my real 4×H100 cluster.
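
A minimal sketch of the first two back-of-envelope calculations above, to make the target concrete. Everything in it is an assumption for illustration: the function names are hypothetical, the model shape is a generic 7B-class configuration, and the hardware figures are rough spec-sheet values for one H100 SXM that should be replaced with measured numbers from the lab. The crossover estimate counts only weight traffic and ignores KV-cache reads and attention FLOPs, so treat it as a first-order bound rather than a prediction.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers."""
    # Each layer stores one K and one V vector per KV head: the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def decode_crossover_batch(peak_flops, mem_bandwidth, bytes_per_param):
    """Batch size at which a decode step stops being bandwidth-bound.

    Weight-only model of one decode step (an assumption):
      FLOPs ~= 2 * params * batch
      bytes ~= params * bytes_per_param
    so arithmetic intensity = 2 * batch / bytes_per_param, and the step turns
    compute-bound once that exceeds the ridge point peak_flops / mem_bandwidth.
    """
    ridge = peak_flops / mem_bandwidth  # FLOP per byte
    return ridge * bytes_per_param / 2


# Hypothetical 7B-class shape: 32 layers, 32 KV heads, head dim 128, fp16.
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)
print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"KV cache for a 4096-token sequence: {per_token * 4096 / 2**30:.2f} GiB")

# Assumed single-GPU figures (approximate, dense BF16): replace with real ones.
PEAK_FLOPS = 989e12      # FLOP/s
MEM_BANDWIDTH = 3.35e12  # bytes/s
batch = decode_crossover_batch(PEAK_FLOPS, MEM_BANDWIDTH, 2)
print(f"Decode turns compute-bound near batch size {batch:.0f}")
```

The same two functions cover the prefill/decode split: prefill processes many tokens per weight read, so its effective "batch" in the intensity formula is the prompt length, which is why it lands on the compute-bound side much sooner.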

Constraints

Daily-notebook cadence — short lessons, one tangible win each.
Self-taught; lean on high-trust primary sources (see Resources).
Build from the known (ops) toward the unknown (internals): that's the zone of proximal development (ZPD).

Out of scope (for now)

Training / fine-tuning internals — this is an inference mission.
Writing custom CUDA kernels from scratch (revisit later).
Model-architecture research beyond what inference needs.

Canonical source: learning/MISSION.md. This page is the browsable view.
