Shipping safely + costing tokens — one guess at a time.
You need to ship a new model version with no downtime — and finally put a dollar figure on a token.
Deployment rolling updates / Argo Rollouts canary — with a preStop hook + long terminationGracePeriodSeconds so pods finish in-flight generations (default 30s truncates), and maxSurge/maxUnavailable sized around the minutes-long weight-load readiness. Canary analysis hooks to your TTFT/P99 SLOs (L24).
Canary new Qwen versions with auto-rollback on TTFT/P99, a preStop drain for your longest generation, surge sized for the 27 GB load. Cost: ~$3/GPU-hr ÷ (~2,000 tok/s × 0.7 × 3600) × 1e6 ≈ $0.60/1M tokens — and every lesson moves that number. Your Lab →
Ship model updates with no downtime — and put a dollar figure on every token.
Swapping in a new model version is changing the crew during dinner rush without stopping service. Bring the new line up before retiring the old (blue-green), or swap a few cooks at a time (rolling/canary) while watching complaints (P99). The catch unique to you: a cook mid-dish (an in-flight generation) must be allowed to finish the plate, not be yanked off.
| new line up before old comes down | blue-green / rolling / canary |
| watch complaints, revert if bad | auto-rollback on P99 / errors |
| let the cook finish the plate | graceful drain of in-flight generations |
| cost per dish | $ / 1M tokens |
Three standard plays:1 blue-green (stand up a full new version, switch traffic, keep the old as instant rollback); rolling (replace replicas gradually); and canary (send 1% → 5% → 25% → 100%, watching metrics, with auto-rollback if P99 latency or error rate crosses a threshold).
Generations can run for many seconds. If a rollout kills a replica mid-generation, those users
get truncated. So on shutdown you must drain gracefully: stop admitting new requests, let
in-flight ones finish, then exit — via a preStop hook, SIGTERM handling,
and a drain timeout generous enough for the longest expected output. And remember: each new
replica has a minutes-long cold start, so rollouts are
inherently slow — plan the window.1
It's Deployment rolling updates / Argo Rollouts canary — you know this. Two GPU
twists: set a preStop hook + long terminationGracePeriodSeconds so a pod
finishes its in-flight generations before SIGKILL (a default 30s grace will truncate long outputs); and
size maxSurge/maxUnavailable around the minutes-long weight-load readiness
so you don't drop capacity mid-rollout. Canary analysis hooks straight to your TTFT/P99 SLOs from Lesson 24.
The number that ties it all to the business:2
The levers are exactly what this course has been about: tokens/sec (quantization L16, batching L12, speculative decoding L15, the right hardware L22) and utilization (routing L25, autoscaling L26, keeping replicas near the knee L24). Halve the bytes or double the batch and the cost per token drops with it.
Roll out new Qwen versions with a canary + auto-rollback on TTFT/P99,
a preStop drain long enough for your longest generation, and surge sizing that accounts for the
27 GB weight-load readiness. Then cost it: at, say, an amortized $3/GPU-hr and ~2,000 agg
tok/s at 70% utilization, that's ≈ $0.60 / 1M tokens — and every lever in this course moves that
figure.
That's the whole arc: a token (L1–7) → the runtime that serves it fast (L8–15) → fewer bits (L16–18) → spread across GPUs and hardware (L19–23) → routed, scaled, shipped, and costed in production (L24–29). You started in the ops layer; now you can see all the way down to the silicon and back up to the bill. · Your Lab →