Scaling LLM replicas — one guess at a time.
You already autoscale stateless web apps on CPU. Now you're autoscaling GPU replicas serving Qwen.
It's still HPA/KEDA, with two changes: scale on a custom metric (num_requests_waiting /
concurrency / TTFT), not CPU; and treat the cold start as pod-startup latency on steroids (multi-GB weight
load) → warm pools, NVMe weight caches, scale-up-early / slow-scale-down windows.
Autoscale vLLM on num_requests_waiting + TTFT-P99 (not GPU util). Budget the
27 GB FP8 weight load as cold-start time → keep a warm baseline, scale up early. --max-num-seqs
(8/64) is your per-replica cap for Little's-Law sizing.
Like HPA — but scale on the queue, not CPU, and respect minute-long cold starts.
Scaling a stateless web app is hiring a waiter — seconds. Scaling an LLM replica is opening a whole new kitchen: pull the equipment (the image), stock the pantry (load tens of GB of weights into VRAM) — minutes, not seconds. And you scale on how long the waitlist is (queued requests), not on how hot the stoves run.
| Kitchen analogy | LLM serving |
|---|---|
| length of the waitlist | queue depth (num_requests_waiting), the scale signal |
| time to open a new kitchen | cold start: image pull + weight load (minutes) |
| kitchens kept warm on standby | warm pool |
For a stateless web app you scale on CPU. For LLMs the GPU equivalent is misleading: GPU "utilization"
can read high while goodput is fine, or the replica can be memory-bound with compute
to spare. Scale on what actually reflects load: concurrency, queue depth
(num_requests_waiting), or TTFT creeping toward its SLO.¹
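As a rough illustration of reading that queue signal, the sketch below pulls a gauge out of Prometheus text-format output. It assumes the replica exposes the gauge as `vllm:num_requests_waiting` on its metrics endpoint; the sample text, the `model_name` label value, and the `read_gauge` helper are made up for illustration, and the parser ignores optional sample timestamps.

```python
# Sample metrics text. The label values are hypothetical.
SAMPLE_METRICS = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo-model"} 7.0
vllm:num_requests_running{model_name="demo-model"} 8.0
"""


def read_gauge(metrics_text: str, name: str) -> float:
    """Sum every sample of gauge `name` in Prometheus text exposition format.

    Simplification: assumes sample lines carry no trailing timestamp.
    """
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # A sample line is "<name>{labels} <value>" or "<name> <value>".
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]
        if metric_name == name:
            total += float(value)
    return total


waiting = read_gauge(SAMPLE_METRICS, "vllm:num_requests_waiting")
running = read_gauge(SAMPLE_METRICS, "vllm:num_requests_running")
print(f"waiting={waiting:.0f} running={running:.0f}")
```

In a real cluster this value would normally reach the autoscaler through Prometheus and a metrics adapter or KEDA scaler rather than through hand-rolled parsing; the point here is only which number drives the scaling decision.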
Cold start is the big difference. Spinning up a new replica means pulling a large container image and loading tens of GB of weights into VRAM, which often takes minutes. If you only start scaling when you're already overloaded, the new replica arrives far too late.¹
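To make "far too late" concrete, here is a back-of-the-envelope sketch. Only the 27 GB weight size comes from this lesson; the image size, bandwidths, init time, and traffic numbers are illustrative assumptions, and both helper functions are hypothetical.

```python
def cold_start_seconds(image_gb, weights_gb, pull_gbps, load_gbps, init_s=0.0):
    """Estimate seconds from 'pod scheduled' to 'ready to serve'.

    pull_gbps and load_gbps are in gigabytes per second.
    """
    return image_gb / pull_gbps + weights_gb / load_gbps + init_s


def queue_growth_during_cold_start(arrival_qps, capacity_qps, cold_start_s):
    """Requests that pile up while the new replica is still starting."""
    overload = max(0.0, arrival_qps - capacity_qps)
    return overload * cold_start_s


# Assumed: 10 GB image pulled at 0.25 GB/s, 27 GB of weights read at 0.5 GB/s,
# plus 30 s of engine initialization.
cold = cold_start_seconds(image_gb=10, weights_gb=27, pull_gbps=0.25,
                          load_gbps=0.5, init_s=30)

# Assumed: image already on the node, weights served from a local NVMe cache
# at 3 GB/s.
warm_cache = cold_start_seconds(image_gb=0, weights_gb=27, pull_gbps=0.25,
                                load_gbps=3.0, init_s=30)

# Assumed: traffic exceeds current capacity by 2 requests per second.
backlog = queue_growth_during_cold_start(arrival_qps=10, capacity_qps=8,
                                         cold_start_s=cold)

print(f"cold start: {cold:.0f} s, with caches: {warm_cache:.0f} s")
print(f"backlog if scaling starts at overload: {backlog:.0f} requests")
```

Under these assumptions the uncached start takes about two minutes and a couple of hundred requests queue up in the meantime, which is the argument for caches, a warm baseline, and scaling before the overload point.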
It's HPA/KEDA, with two changes. Scale on a custom metric
(num_requests_waiting / concurrency / TTFT), not CPU — the built-in CPU target is meaningless
for GPUs. And the cold start is your pod-startup-latency problem on steroids: a multi-GB weight load
dwarfs a normal image pull, so you lean on warm pools, node/NVMe weight caches, and generous
scale-up-early / slow-scale-down windows to avoid thrashing.
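The sketch below models how this plays out with a queue-depth target. It follows the scaling rule documented for the Kubernetes HPA (desired = ceil(current × metric ÷ target), skipped inside a tolerance band) and the scale-down stabilization idea of honoring the highest recent recommendation. The function names, the default bounds, and the sample numbers are hypothetical; this is a model of the behavior, not the controller's code.

```python
import math


def desired_replicas(current_replicas, avg_waiting, target_waiting,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """HPA-style recommendation from average queue depth per replica.

    min_replicas plays the role of the warm baseline.
    """
    ratio = avg_waiting / target_waiting
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


def hold_scale_down(recent_recommendations):
    """Slow scale-down: act on the highest recommendation in the window."""
    return max(recent_recommendations)


# Scale up immediately: 2 replicas, 12 waiting per replica, target of 4.
print(desired_replicas(2, avg_waiting=12, target_waiting=4))

# Load drops: the raw recommendation falls to 2 replicas...
print(desired_replicas(6, avg_waiting=1, target_waiting=4))

# ...but the stabilization window still remembers the recent peak of 6.
print(hold_scale_down([6, 5, 2]))
```

Reacting instantly on the way up and slowly on the way down is what keeps a replica that cost minutes to start from being thrown away during a brief lull.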
How many replicas? Use Little's Law (concurrency = throughput × latency): a replica's
max throughput ≈ its concurrency cap (--max-num-seqs, bounded by
KV memory) ÷ average request latency; replicas needed ≈ peak QPS ÷
that per-replica throughput, rounded up. Scale on the queue, size with the math.¹
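A worked version of that sizing, as a minimal sketch: the concurrency caps of 8 and 64 are the --max-num-seqs values mentioned in this lesson, while the 8-second average latency and 20 QPS peak are assumed numbers chosen only to make the arithmetic easy to follow.

```python
import math


def replicas_needed(peak_qps, max_num_seqs, avg_latency_s):
    """Little's Law sizing: throughput = concurrency / latency."""
    per_replica_qps = max_num_seqs / avg_latency_s
    return math.ceil(peak_qps / per_replica_qps)


# Assumed workload: 20 requests/s at peak, 8 s average end-to-end latency.
for cap in (8, 64):
    n = replicas_needed(peak_qps=20, max_num_seqs=cap, avg_latency_s=8.0)
    print(f"--max-num-seqs {cap}: {cap / 8.0:.0f} QPS per replica, {n} replicas")
```

This gives the steady-state floor for peak traffic. It says nothing about how quickly capacity arrives, which is why the queue signal still drives the autoscaler.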
Autoscale your vLLM deployments on num_requests_waiting (and TTFT-P99), not
GPU util. Budget the cold start: loading 27 GB of FP8 weights into VRAM isn't instant, so keep a
warm baseline and scale up early. Your --max-num-seqs (8 / 64) is the per-replica
concurrency cap for the Little's-Law sizing. Time-slicing (Lesson 23)
gives you logical headroom, but watch noisy neighbors under autoscale.