Mission
The reason behind the course — every lesson traces back here.
Why
I cloned a 100-days-of-inference repo and want to turn it into my own
learning journal — working through inference engineering for real. I'm already strong
at the serving/ops layer (autoscaling, routing, multi-cloud capacity, zero-downtime
deploys, cost). The goal is to push down into the GPU/runtime internals so I
understand why serving systems behave the way they do — enough to
reason about, tune, and debug a high-performance inference stack, not just operate one.
Success looks like
- Explain, from memory, why decode is memory-bandwidth-bound and prefill is
compute-bound — and use it to predict latency/throughput behavior.
- Connect each ops decision (batching, routing, autoscaling, capacity) back to a
concrete internals reason.
- Read a serving framework's design (vLLM/TGI/TensorRT-LLM) and explain its
KV-cache, batching, and scheduling choices.
- Do back-of-envelope math: KV-cache memory, when decode turns compute-bound, the
throughput cost of a latency SLO.
- Use the lab: verify the theory against my real 4×H100 cluster.
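A minimal sketch of the back-of-envelope math named in the list above, assuming a standard decoder-only transformer that caches one K and one V tensor per layer and stores weights in fp16. The function names, the model shape, and the hardware figures are illustrative placeholders, not datasheet values or measurements from the cluster. The intensity estimate counts only weight reads and ignores KV-cache traffic, so real decode is likely more memory-bound than this suggests.

```python
import math


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """KV-cache size in bytes; the leading 2 covers the K and the V tensor."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def ridge_point(peak_flops, mem_bandwidth):
    """Roofline ridge: FLOPs per byte above which a kernel is compute-bound."""
    return peak_flops / mem_bandwidth


def decode_weight_intensity(batch_size, bytes_per_param=2):
    """Approximate FLOPs per weight byte for one decode step.

    Each parameter is read once per step and contributes roughly
    2 FLOPs (multiply + add) per sequence in the batch.
    """
    return 2 * batch_size / bytes_per_param


def min_compute_bound_batch(peak_flops, mem_bandwidth, bytes_per_param=2):
    """Smallest batch size whose weight-only intensity reaches the ridge."""
    ridge = ridge_point(peak_flops, mem_bandwidth)
    return math.ceil(ridge * bytes_per_param / 2)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                         seq_len=4096, batch_size=1)
print(f"KV cache for one 4096-token sequence: {per_seq / 2**20:.0f} MiB")

# Hypothetical accelerator: 1e15 FLOP/s peak, 2.5e12 bytes/s memory bandwidth.
# Replace both with the datasheet numbers for the actual GPU.
crossover = min_compute_bound_batch(peak_flops=1e15, mem_bandwidth=2.5e12)
print(f"Decode turns compute-bound near batch size {crossover}")
```

With these placeholder numbers a single decode step at batch size 1 sits far below the ridge, which is the usual argument for why batching raises throughput until either the ridge or the KV-cache memory budget is hit. Checking where that crossover lands on real hardware is a natural lab exercise.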
Constraints
- Daily-notebook cadence — short lessons, one tangible win each.
- Self-taught; lean on high-trust primary sources (see Resources).
- Build from the known (ops) toward the unknown (internals); that's the zone of proximal development (ZPD).
Out of scope (for now)
- Training / fine-tuning internals — this is an inference mission.
- Writing custom CUDA kernels from scratch (revisit later).
- Model-architecture research beyond what inference needs.
Canonical source: learning/MISSION.md. This page is the browsable view.