Autoscaling

Scaling LLM replicas — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict the right scaling signal for LLMs, why cold starts change the game, and how to size with Little's Law.

The setup

You autoscale stateless web on CPU. Now you're autoscaling GPU replicas serving Qwen.

Step 1 — the right signal

Step 2 — the cold-start surprise

Recall — cover the screen: two ways LLM autoscaling differs from stateless web.
(1) Scale on concurrency, queue depth (num_requests_waiting), or TTFT, not CPU; GPU utilization is misleading. (2) Cold starts take minutes (pull the image, then load tens of GB of weights into VRAM), so you must scale early and keep warm capacity.

Step 3 — the strategies

Step 4 — sizing

In Kubernetes terms

It's HPA/KEDA with two changes: scale on a custom metric (num_requests_waiting, concurrency, or TTFT) instead of CPU, and treat the cold start as pod startup latency on steroids (a multi-GB weight load). In practice that means warm pools, NVMe weight caches, and scale-up-early / slow-scale-down windows.

On YOUR cluster

Autoscale vLLM on num_requests_waiting and TTFT-P99, not GPU utilization. Budget the 27 GB FP8 weight load as cold-start time: keep a warm baseline and scale up early. --max-num-seqs (8 / 64) is your per-replica concurrency cap for Little's Law sizing.

Read this next (primary source, runnable): day23 notebook.

Final check — teach it back

Explain to a colleague: "Autoscaling our LLM is different from our web tier because…"
…we scale on the request queue, concurrency, or TTFT, not CPU (GPU utilization is misleading), and cold starts take minutes, because tens of GB of weights have to load into VRAM, so we keep a warm baseline and scale up early instead of reacting. We size replicas with Little's Law: replicas ≈ peak QPS ÷ (concurrency cap ÷ average latency), rounded up.



References

day23 — autoscaling/concurrency/cold-starts (notebook); Little's Law (Lesson 24).

Autoscaling

Like HPA — but scale on the queue, not CPU, and respect minute-long cold starts.

Today's win: you'll explain why autoscaling LLM replicas differs from stateless web — the right scaling signal isn't CPU, and cold starts are measured in minutes — plus the strategies that handle it.

The picture: opening another kitchen takes a while

Scaling a stateless web app is hiring a waiter — seconds. Scaling an LLM replica is opening a whole new kitchen: pull the equipment (the image), stock the pantry (load tens of GB of weights into VRAM) — minutes, not seconds. And you scale on how long the waitlist is (queued requests), not on how hot the stoves run.

Kitchen picture	LLM serving
length of the waitlist	queue depth (`num_requests_waiting`), the scale signal
time to open a new kitchen	cold start: image pull + weight load (minutes)
kitchens kept warm on standby	warm pool

1 · Scale on the right signal — not CPU

For stateless web you scale on CPU. The GPU equivalent is misleading for LLMs: GPU "utilization" as reported by nvidia-smi only says that some kernel was running during the sample window, so it can read high while goodput is fine, or the box can be memory-bound with spare compute. Scale on what actually reflects load: concurrency, queue depth (num_requests_waiting), or TTFT creeping toward its SLO. [1]
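To make "scale on queue depth" concrete, here is a minimal sketch of the proportional rule that the Kubernetes HPA documentation describes (desired = ceil(current replicas × current metric ÷ target metric)), applied to the average number of waiting requests per replica. The function name, the target of 4 waiting requests per replica, and the replica bounds are assumptions for illustration, not values from this lesson.

```python
import math


def desired_replicas(current_replicas, waiting_per_replica, target_waiting_per_replica,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """HPA-style proportional scaling on average queue depth per replica.

    All defaults here are illustrative assumptions.
    """
    if current_replicas < 1:
        # Nothing is running, so fall back to the floor.
        return min_replicas
    ratio = waiting_per_replica / target_waiting_per_replica
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: do not churn replicas.
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 2 replicas, each with about 10 requests waiting, target of 4 per replica.
print(desired_replicas(2, 10, 4))
```

The tolerance band is what keeps a queue hovering near its target from triggering a scale event on every evaluation.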

2 · The cold-start problem

This is the big difference. Spinning up a new replica means pulling a large container image and loading tens of GB of weights into VRAM, which often takes minutes. If you only start scaling when you're already overloaded, the new replica arrives far too late. [1]

Reactive scaling loses the race: the new replica is ready minutes after the spike. You must scale early, keep a warm buffer, or speed the cold start.
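A back-of-the-envelope sketch shows why reacting late hurts. It treats traffic as a steady flow (a simplifying assumption) and computes how many requests pile up while a new replica is still cold, then how long the enlarged fleet needs to drain them. The traffic, capacity, and cold-start numbers in the example are hypothetical.

```python
def backlog_at_ready(arrival_qps, capacity_qps, cold_start_s):
    """Requests queued by the time a cold replica becomes ready."""
    overload = max(0.0, arrival_qps - capacity_qps)
    return overload * cold_start_s


def drain_time_s(backlog, arrival_qps, new_capacity_qps):
    """Seconds to clear the backlog once the new capacity is online."""
    surplus = new_capacity_qps - arrival_qps
    if surplus <= 0:
        # Still overloaded: the queue never drains.
        return float("inf")
    return backlog / surplus


# Hypothetical spike: 12 req/s arriving, one replica serving 8 req/s,
# and a 180-second cold start for the second replica.
queued = backlog_at_ready(12, 8, 180)
print(queued, drain_time_s(queued, 12, 16))
```

In this made-up case the queue keeps hurting users for as long again after the replica arrives as the cold start itself took, which is the argument for scaling before the overload rather than during it.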

3 · The strategies

Baseline + headroom — run enough always-on replicas to absorb normal bursts.
Warm pool — pre-loaded replicas on standby for instant promotion.
Predictive — scale ahead of known patterns (business hours, batch jobs).
Faster cold start — cache weights on local NVMe, smaller images, or a quantized variant to shrink load time.
Scale-to-zero — only when a minutes-long first-request latency is acceptable.
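To compare the "faster cold start" options, a rough estimate of cold-start time can be built from sizes and bandwidths. This is a sketch under loud assumptions: the 27 GB weight size is the figure quoted for the cluster in this lesson, but the image size, the bandwidths, and the fixed init time are hypothetical placeholders you would replace with measurements.

```python
def cold_start_seconds(image_gb, weights_gb, pull_gb_per_s, load_gb_per_s, init_s=0.0):
    """Image pull time + weight load time + fixed engine start-up time."""
    if pull_gb_per_s <= 0 or load_gb_per_s <= 0:
        raise ValueError("bandwidths must be positive")
    return image_gb / pull_gb_per_s + weights_gb / load_gb_per_s + init_s


# Hypothetical uncached path: 10 GB image over the network, weights from remote storage.
uncached = cold_start_seconds(10, 27, pull_gb_per_s=0.25, load_gb_per_s=0.5, init_s=30)

# Hypothetical cached path: image already on the node, weights on local NVMe.
cached = cold_start_seconds(0, 27, pull_gb_per_s=0.25, load_gb_per_s=3.0, init_s=30)

print(uncached, cached)
```

Even a crude model like this makes it clear which term dominates on a given cluster, and therefore whether a node-level image cache or an NVMe weight cache buys more.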

In Kubernetes terms

It's HPA/KEDA, with two changes. Scale on a custom metric (num_requests_waiting, concurrency, or TTFT), not CPU: the built-in CPU target says nothing about GPU load. And the cold start is your pod-startup-latency problem on steroids: a multi-GB weight load dwarfs a normal image pull, so you lean on warm pools, node/NVMe weight caches, and generous scale-up-early / slow-scale-down windows to avoid thrashing.
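As one possible shape for those two changes, the sketch below builds an autoscaling/v2 HorizontalPodAutoscaler manifest as a Python dict and prints it as JSON (which kubectl accepts). It assumes a custom-metrics adapter already exposes a per-pod queue-depth metric; the metric name vllm_num_requests_waiting, the object names, and the numeric targets are hypothetical.

```python
import json


def build_hpa(name, deployment, metric_name, target_avg,
              min_replicas, max_replicas, scale_down_window_s=600):
    """HPA manifest: scale on a per-pod custom metric, scale down slowly."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            # minReplicas > 0 is the always-on warm baseline.
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Pods",
                "pods": {
                    # Assumed name; depends on how your adapter exposes it.
                    "metric": {"name": metric_name},
                    "target": {"type": "AverageValue",
                               "averageValue": str(target_avg)},
                },
            }],
            "behavior": {
                # React to load immediately, but wait before removing replicas.
                "scaleUp": {"stabilizationWindowSeconds": 0},
                "scaleDown": {"stabilizationWindowSeconds": scale_down_window_s},
            },
        },
    }


manifest = build_hpa("llm-hpa", "vllm-server", "vllm_num_requests_waiting",
                     target_avg=4, min_replicas=2, max_replicas=8)
print(json.dumps(manifest, indent=2))
```

The asymmetric stabilization windows are the point: a replica that took minutes to warm up should not be torn down because the queue dipped for thirty seconds.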

4 · Sizing — Little's Law again

How many replicas? Use Little's Law: a replica's max throughput ≈ its concurrency cap (--max-num-seqs, bounded by KV memory) ÷ average request latency; replicas needed ≈ peak QPS ÷ that throughput, rounded up. Scale on the queue, size with the math. [1]
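The sizing rule above can be written out in a few lines. This is a minimal sketch: the optional headroom factor is an added assumption, and the 8-second average latency and 20 QPS peak in the example are hypothetical. The caps of 8 and 64 are the --max-num-seqs values this lesson quotes for the cluster.

```python
import math


def replica_throughput_qps(max_num_seqs, avg_latency_s):
    """Little's Law rearranged: throughput = concurrency / latency."""
    return max_num_seqs / avg_latency_s


def replicas_needed(peak_qps, max_num_seqs, avg_latency_s, headroom=0.0):
    """Replicas to serve peak_qps, rounded up, with optional fractional headroom."""
    per_replica = replica_throughput_qps(max_num_seqs, avg_latency_s)
    return math.ceil(peak_qps * (1.0 + headroom) / per_replica)


# Hypothetical workload: 20 requests/s at peak, 8 s average request latency.
print(replicas_needed(20, max_num_seqs=64, avg_latency_s=8.0))
print(replicas_needed(20, max_num_seqs=8, avg_latency_s=8.0))
```

Note how strongly the answer depends on the concurrency cap: at the same latency, dropping the cap from 64 to 8 multiplies the replica count, so measure latency at the cap you actually run.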

On YOUR cluster: scale on the queue, mind the load time

Autoscale your vLLM deployments on num_requests_waiting (and TTFT-P99), not GPU util. Budget the cold start: loading 27 GB of FP8 weights into VRAM isn't instant, so keep a warm baseline and scale up early. Your --max-num-seqs (8 / 64) is the per-replica concurrency cap for the Little's Law sizing. Time-slicing (Lesson 23) gives you logical headroom, but watch for noisy neighbors when autoscaling.
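If you want to eyeball the signal before wiring up an autoscaler, a small parser for Prometheus text-format output is enough. The sketch assumes the server exposes the gauge under the name vllm:num_requests_waiting and that lines carry no trailing timestamp; the sample text is hand-written for illustration, not captured from a real server.

```python
def read_gauge(metrics_text, metric_name):
    """Sum a gauge across all label sets; return None if it is absent."""
    total = 0.0
    found = False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Assumes "<name>{labels} <value>" with no trailing timestamp.
        name_part, _, value_part = line.rpartition(" ")
        base_name = name_part.split("{", 1)[0]
        if base_name == metric_name:
            total += float(value_part)
            found = True
    return total if found else None


# Hand-written sample in Prometheus text format (illustrative only).
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="qwen"} 7.0
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen"} 64.0
"""

print(read_gauge(SAMPLE, "vllm:num_requests_waiting"))
```

A replica that reports running requests at its --max-num-seqs cap together with a non-zero waiting count is saturated, which is the condition the autoscaler should be reacting to.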

Read this next (primary source). Runnable companion: day23 notebook, covering concurrency targets, scaling signals, and cold-start strategies.




References

[1] Autoscaling: concurrency, batching & cold starts. day23 notebook (autoscaling-concurrency-cold-starts.ipynb).