Resources
Curated, high-trust sources. Lesson knowledge is drawn from here.
Knowledge
- LLM Inference Handbook — BentoML — canonical mental models & vocabulary (prefill/decode, KV cache, batching, disaggregation).
- Mastering LLM Techniques: Inference Optimization — NVIDIA — hardware-grounded overview.
- Transformer Inference Arithmetic — kipply — THE rigorous arithmetic source (KV cache size, weight memory, arithmetic intensity).
- How to Scale Your Model: Inference — JAX ML — deep roofline/throughput treatment.
- vLLM: PagedAttention & Continuous Batching — RunPod — how a real engine reduces wasted KV-cache memory.
- Prefill Compute-Bound, Decode Memory-Bound — TDS — the phase-asymmetry roofline intuition.
- SARATHI (arXiv 2308.16369) — primary paper behind chunked prefill.
- PagedAttention / vLLM — Kwon et al., SOSP'23 — primary paper for paging the KV cache (Lesson 12).
- Orca — Yu et al., OSDI'22 — primary paper for continuous batching / iteration-level scheduling (Lesson 12).
- Roofline — Williams et al., CACM 2009 — the original roofline model (Lesson 11); see also the Modal GPU Glossary.
- Prompt caching — Anthropic docs — read/write multipliers + TTL; backs the Lesson 12 prompt-caching bridge.
- FP8 Formats (Micikevicius et al.), AWQ, SmoothQuant, and the vLLM FP8 KV-cache blog — the quantization sources for Lesson 16.
- Megatron-LM — Shoeybi et al., 2019 — primary paper for tensor parallelism (Lesson 19); plus NVLink vs PCIe.
- Sarathi-Serve — Agrawal et al., OSDI 2024 — the latency-throughput knee / chunked prefill (Lesson 24); plus DistServe (goodput) and Little's Law.
Wisdom (Communities)
- r/LocalLLaMA — practitioner serving tradeoffs, hardware sizing, framework comparisons.
- vLLM GitHub — where serving-engine details get argued in the open.
Community participation is optional; these are listed as places to test your understanding against practitioners.
Gaps
- Rigorous arithmetic source — closed (kipply + JAX scaling book above).
- PagedAttention / KV-fragmentation source for Lesson 12 — closed (vLLM + Orca papers above).
Canonical source: learning/RESOURCES.md. This page is the browsable view.