Inference Engineering · Lesson 24 · Latency, Throughput & SLOs

The Knee

Where the internals meet your ops layer — predicted step by step.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict the latency–throughput knee, what goodput is, and how Little's Law turns it all into a replica count.

The setup

Two clocks your users feel: TTFT (time to first token) and TPOT (time per output token). And one system-wide number: throughput.

Step 1 — which phase sets TTFT?

Step 2 — and TPOT?

Step 3 — the knee

Batching more requests raises throughput — but bigger batches and fuller queues raise latency.

Recall — cover the screen: what is "the knee"?
The load level where throughput saturates but latency starts climbing steeply. Run just below it; past it you mostly buy queueing delay, not more work done.

Step 4 — what actually counts?

Step 5 — sizing it

Recall — say it: the three signals that drive routing & autoscaling here.
TTFT/TPOT vs. their SLOs; the queue (num_requests_waiting), where growth means you are past the knee; and --max-num-seqs as the concurrency cap in Little's Law for sizing replicas.

On YOUR cluster

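vLLM emits these on its Prometheus metrics endpoint (the exported names carry a vllm: prefix, and the latency histograms a _seconds suffix): time_to_first_token (TTFT SLO), time_per_output_token (TPOT SLO), num_requests_running (live concurrency), and num_requests_waiting (the queue, your cleanest "autoscale now" signal). --max-num-seqs is the Little's Law cap; --enable-chunked-prefill keeps long prefills from stalling decode, which protects TPOT under load.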

Read this next (primary source): Sarathi-Serve, Agrawal et al., OSDI 2024 · Little's Law.

Final check — teach it back

Explain to a colleague: "We should autoscale when…"
…the queue (waiting requests) or TTFT-P99 starts rising. That means each replica is past its knee, so throughput won't improve and latency will only get worse. Size new replicas with Little's Law: max QPS ≈ concurrency cap ÷ latency.
References
  1. Sarathi-Serve — OSDI 2024 (2403.02310); Little's Law (Modal); goodput — DistServe (2401.09670).

Latency, Throughput & SLOs

Where the internals meet your ops layer — the knee that drives routing, autoscaling, and cost.

Today's win: you'll define TTFT, TPOT, throughput, and goodput; explain the latency–throughput knee; use Little's Law to size capacity; and connect all of it to the routing and autoscaling decisions you already make — measured on your cluster.
Pantry: two clocks the customer feels: time to first bite (TTFT) and time between bites (TPOT). Throughput is dishes-per-hour kitchen-wide. Pack the kitchen to serve more, but past a point everyone's wait explodes.

1 · The two clocks users actually feel

TTFT (time to first token) is set by prefill, which processes the whole prompt. TPOT (time per output token, a.k.a. inter-token latency) is set by decode. Those are per-request latencies; throughput (tokens or requests/sec) is the system-wide rate. They are different axes, and often in tension. [1]

[Figure: request timeline. Prompt in, then TTFT (prefill) until the first token, then one token every TPOT seconds (decode).]
One TTFT to the first token, then a steady drip every TPOT. Two different promises you make to a user — and two different SLOs.
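To make the two clocks concrete, here is a minimal sketch of how TTFT and mean TPOT could be computed from the arrival times of one streamed response. The function name and the timestamps are hypothetical; a real client would record these times itself while streaming.

```python
from typing import Sequence


def ttft_and_tpot(request_sent: float, token_times: Sequence[float]) -> tuple[float, float]:
    """Return (TTFT, mean TPOT) in seconds for one streamed request.

    request_sent: wall-clock time the request was sent.
    token_times:  wall-clock arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT: everything before the first token (queueing + prefill).
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0  # no inter-token gaps to average
    # TPOT: mean gap between consecutive tokens after the first one (decode).
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical timestamps: request at t=0.0 s, first token at 0.5 s,
# then one token every 0.05 s.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.55, 0.60, 0.65])
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

Note that TTFT measured this way includes any time spent waiting in the queue, which is why it moves first when a replica is overloaded.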

2 · The knee: throughput and latency pull apart

Batching more requests raises throughput (the memory-bound decode weight-read is shared; see Lessons 9–12). But bigger batches and fuller queues also raise latency. Push load up and throughput climbs until it saturates, while latency keeps climbing steeply. That bend is the knee. You want to run just below it, not past it. [1]

Pantry: seat more customers and you serve more meals per hour — up to a point. Past it the kitchen is slammed, the line spills out the door, and everyone waits longer for the same number of meals.
[Figure: throughput and latency vs. offered load (requests/sec). Throughput saturates while latency explodes; the knee marks the bend. Run just below it, not past it.]
Before the knee, more load buys throughput. After it, you mostly buy latency. The whole job of routing & autoscaling is keeping each replica just below its knee.
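The shape of the knee can be reproduced with a toy single-server queue. The sketch below uses the textbook M/M/1 formula for mean time in system, which is an assumption made purely for illustration: a batching inference server is not an M/M/1 queue, and the service rate of 10 requests/sec is invented. Only the qualitative shape carries over.

```python
def mm1_point(offered_rps: float, service_rps: float) -> tuple[float, float]:
    """Toy M/M/1 queue: return (throughput_rps, mean_latency_s).

    Below capacity, throughput tracks offered load and mean time in system
    is 1 / (service_rps - offered_rps). At or above capacity, throughput is
    capped and the queue (hence latency) grows without bound.
    """
    if offered_rps >= service_rps:
        return service_rps, float("inf")
    return offered_rps, 1.0 / (service_rps - offered_rps)


# Sweep offered load against a hypothetical capacity of 10 requests/sec.
for load in (2, 5, 8, 9, 9.9, 12):
    thr, lat = mm1_point(load, service_rps=10.0)
    print(f"load={load:>4} rps  throughput={thr:>4.1f} rps  latency={lat:.2f} s")
```

Going from 8 to 9 requests/sec adds one unit of throughput but doubles latency, and the last few percent of capacity cost far more than that. That asymmetry is the reason to hold each replica just below the bend.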

3 · Goodput: only SLO-meeting work counts

Raw throughput lies: a request served late is a failure, not a success. Goodput is the part of throughput that meets both the TTFT and TPOT SLOs. Past the knee, throughput may look flat while goodput collapses as requests blow their SLO, which is why you optimize for goodput, not throughput. [2]

[Figure: throughput (dashed) and goodput vs. offered load. Past the SLO knee, the gap between the two is work that was served but violated its SLO (wasted).]
The dashed line keeps rising; real goodput peels off at the SLO knee. The widening gap is work you paid for but can't count.
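As a rough sketch of the difference, goodput can be computed from per-request records by counting only the requests that met both SLOs. The record type, the SLO values (0.5 s TTFT, 0.05 s TPOT), and the four sample requests are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ttft_s: float  # measured time to first token
    tpot_s: float  # measured mean time per output token


def goodput_rps(records, window_s: float, ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Requests per second that met BOTH the TTFT and the TPOT SLO."""
    ok = sum(1 for r in records if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s)
    return ok / window_s


# Four hypothetical requests completed in a 2-second window.
records = [
    RequestRecord(0.3, 0.04),  # meets both SLOs
    RequestRecord(0.4, 0.05),  # meets both SLOs (exactly at the TPOT limit)
    RequestRecord(1.2, 0.04),  # TTFT violation
    RequestRecord(0.3, 0.09),  # TPOT violation
]
window_s = 2.0
throughput = len(records) / window_s
goodput = goodput_rps(records, window_s, ttft_slo_s=0.5, tpot_slo_s=0.05)
print(f"throughput={throughput:.1f} rps  goodput={goodput:.1f} rps")
```

Here the server completed 2.0 requests/sec but only 1.0 of them counted. A request that misses either SLO is excluded, even if the other was met comfortably.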

4 · Little's Law — from the knee to capacity

One equation turns all this into a replica count: [3]

in-flight requests = throughput × latency
→ max throughput = concurrency cap ÷ latency  ·  replicas needed = peak QPS ÷ per-replica QPS (at the knee)

Your --max-num-seqs is the concurrency cap. Divide it by your average request latency and you have a replica's max QPS; divide peak demand by that and you have how many replicas to autoscale to. When the queue grows, you're past the knee — add a replica.
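The arithmetic above can be worked through in a few lines. The numbers here (a cap of 64, 8 s average latency, 50 QPS peak) are invented for illustration, and the `headroom` parameter is an assumption added to reflect running just below the knee rather than exactly at it.

```python
import math


def replica_max_qps(concurrency_cap: int, avg_latency_s: float) -> float:
    """Little's Law rearranged: max throughput = concurrency cap / latency."""
    return concurrency_cap / avg_latency_s


def replicas_needed(peak_qps: float, concurrency_cap: int,
                    avg_latency_s: float, headroom: float = 1.0) -> int:
    """Replicas to cover peak_qps.

    headroom < 1.0 plans each replica at a fraction of its max QPS
    (hypothetical knob, not part of Little's Law itself).
    """
    per_replica = replica_max_qps(concurrency_cap, avg_latency_s) * headroom
    return math.ceil(peak_qps / per_replica)


# Hypothetical deployment: --max-num-seqs 64, 8 s average request latency,
# 50 QPS at peak.
print(replica_max_qps(64, 8.0))                    # QPS one replica can sustain
print(replicas_needed(50, 64, 8.0))                # planning at the cap
print(replicas_needed(50, 64, 8.0, headroom=0.8))  # planning at 80% of the cap
```

With these numbers one replica sustains 8 QPS, so 50 QPS needs 7 replicas at the cap and 8 with 20% headroom. The latency to plug in is the average end-to-end request latency measured at the load you intend to run, since latency itself rises as you approach the knee.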

In Kubernetes terms

This lesson is your world. TTFT/TPOT are your latency SLOs; the knee is where you'd burn your error budget; goodput is SLO-meeting throughput; and Little's Law (concurrency = throughput × latency) is the same math behind an HPA target and replica sizing. The next lessons — routing, autoscaling — are literally your day job, now wired to the internals underneath.
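The link to an HPA target can be sketched as follows. The scaling rule is the one documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric ÷ target metric), skipped when the ratio is within a tolerance that defaults to 0.1). Choosing in-flight requests per pod as the metric, and deriving its target from Little's Law, is an assumption for illustration, as are all the numbers.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float, tolerance: float = 0.1) -> int:
    """HPA scaling rule: scale by the ratio of observed to target metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


# Little's Law gives the per-pod target: in-flight = QPS x latency.
# Hypothetical: hold each pod at 6 QPS with 8 s average latency.
target_in_flight = 6 * 8.0  # 48 requests in flight per pod

# 4 pods averaging 60 in-flight requests each: ratio 1.25, scale out.
print(hpa_desired_replicas(4, 60.0, target_in_flight))
# 4 pods averaging 50 in-flight requests each: within tolerance, hold.
print(hpa_desired_replicas(4, 50.0, target_in_flight))
```

The first call returns 5 and the second returns 4. Setting the target below the concurrency cap is what keeps the autoscaler acting before the queue starts to build.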

On YOUR cluster: the metrics ARE the SLO dashboard

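Everything in this lesson is already emitted by your vLLM servers: time_to_first_token (TTFT), time_per_output_token (TPOT), num_requests_running (live concurrency), and num_requests_waiting (the queue).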

Ops loop: route to keep each replica below its knee, autoscale on num_requests_waiting / TTFT-P99, size replicas with Little's Law. Watch it: bash learning/tools/cluster-probe.sh
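A minimal sketch of the "autoscale on the queue" half of that loop: read the waiting-requests gauge out of Prometheus text exposition output and decide whether to scale out. The sample text is made up, the `vllm:`-prefixed metric names are assumed to match what your server version exports, and the parser assumes lines carry no trailing timestamp. This is not what cluster-probe.sh does; it is only an illustration of the signal.

```python
# Made-up sample of Prometheus text exposition output.
SAMPLE_METRICS = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 64.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
"""


def read_gauge(metrics_text: str, name: str) -> float:
    """Sum every sample of the gauge `name` (assumes no timestamps on lines)."""
    total, found = 0.0, False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:  # drop the {label="..."} part
            total += float(value)
            found = True
    if not found:
        raise KeyError(name)
    return total


def scale_out_needed(waiting: float, max_waiting: float = 0.0) -> bool:
    """A queue above the allowed depth means the replica is past its knee."""
    return waiting > max_waiting


running = read_gauge(SAMPLE_METRICS, "vllm:num_requests_running")
waiting = read_gauge(SAMPLE_METRICS, "vllm:num_requests_waiting")
print(f"running={running:.0f} waiting={waiting:.0f} scale_out={scale_out_needed(waiting)}")
```

In this sample the replica is running 64 requests with 12 more waiting, so the decision is to add a replica. In practice you would likely require the queue to persist for some interval before acting, to avoid reacting to short bursts.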

That closes the loop back to your day-job: prefill/decode (L9) set the two latencies; KV memory (L10) and paging/batching (L12) set the concurrency cap; the roofline (L11) says decode is bandwidth-bound; interconnect (L19) sets TP latency; and goodput + Little's Law (L24) turn all of it into routing, autoscaling, and cost.

Read this next (primary source): Taming the Throughput-Latency Tradeoff with Sarathi-Serve, Agrawal et al., OSDI 2024. The chunked-prefill / token-budget paper behind the exact flags you run. Pair with Little's Law (Modal) and DistServe (goodput).

Check yourself (recall, don't peek)

Picture the two clocks and the knee, then answer from memory.

References
  1. Taming the Throughput-Latency Tradeoff (Sarathi-Serve), Agrawal et al., OSDI 2024. arXiv:2403.02310.
  2. Goodput & SLO attainment: DistServe, Zhong et al., 2024. arXiv:2401.09670.
  3. Little's Law for inference, Modal GPU Glossary. modal.com