Inference Engineering · Lesson 29 · Zero-Downtime Deployment & Cost

Zero-Downtime Deployment & Cost

Shipping safely + costing tokens — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict the zero-downtime rollout plays, the LLM-specific drain catch, and the cost-per-million-tokens formula — the capstone of the whole course.

The setup

You need to ship a new model version with no downtime — and finally put a dollar figure on a token.

Step 1 — the rollout plays

Step 2 — the LLM-specific catch

Recall — cover the screen: the deployment catch unique to LLMs.
In-flight generations run for seconds, so a rollout must drain gracefully: stop admitting new requests, let running ones finish (preStop hook + SIGTERM + a long enough drain/grace timeout), then exit. Plus, each new replica has a minutes-long cold start, so rollouts are inherently slow.

Step 3 — the cost formula

Step 4 — the levers

In Kubernetes terms (infra bridge)

Deployment rolling updates / Argo Rollouts canary, with a preStop hook + long terminationGracePeriodSeconds so pods finish in-flight generations (the default 30s truncates long outputs), and maxSurge/maxUnavailable sized around the minutes-long weight-load readiness. Canary analysis hooks into your TTFT/P99 SLOs (L24).

On YOUR cluster: the whole course in one number (capstone)

Canary new Qwen versions with auto-rollback on TTFT/P99, a preStop drain sized for your longest generation, and surge sized for the 27 GB weight load. Cost: ~$3/GPU-hr ÷ (~2,000 tok/s × 0.7 × 3600) × 1e6 ≈ $0.60/1M tokens, and every lesson moves that number.

Read this next (primary source): the runnable day26 notebook.

Final check — teach it back

Explain to a colleague: "We ship a new model with no downtime by…"
…canary or blue-green: ramp traffic to the new version while watching TTFT/P99, auto-rollback on regression, and drain in-flight generations gracefully (preStop + long grace) so nobody gets truncated, while accounting for the minutes-long cold start. And we cost it as $/hr ÷ (tokens/s × utilization × 3600) × 1e6, which every optimization in the course pushes down.
I'm your teacher — and that's the course. From "what is a token" to deploying and costing a fleet. Want to design the real NVLink/TP fix + a canary rollout, or work a live cost number on your cluster?

Zero-Downtime Deployment & Cost

Ship model updates with no downtime — and put a dollar figure on every token.

Today's win: you'll explain how to roll out model updates with zero downtime (blue-green / rolling / canary + auto-rollback), the LLM-specific catch (draining in-flight generations), and how to model GPU cost per million tokens. This is the last lesson — it closes the loop from internals to dollars.

The picture: re-staff the kitchen mid-service

Swapping in a new model version is changing the crew during dinner rush without stopping service. Bring the new line up before retiring the old (blue-green), or swap a few cooks at a time (rolling/canary) while watching complaints (P99). The catch unique to you: a cook mid-dish (an in-flight generation) must be allowed to finish the plate, not be yanked off.

new line up before old comes down → blue-green / rolling / canary
watch complaints, revert if bad → auto-rollback on P99 / errors
let the cook finish the plate → graceful drain of in-flight generations
cost per dish → $ / 1M tokens

1 · Zero-downtime rollout strategies

Three standard plays [1]: blue-green (stand up a full new version, switch traffic, keep the old as instant rollback); rolling (replace replicas gradually); and canary (send 1% → 5% → 25% → 100% of traffic, watching metrics, with auto-rollback if P99 latency or error rate crosses a threshold).

[Figure: canary ramp, 1% → 5% → 25% → 100% of traffic to the new version while watching P99 and errors; a regression triggers auto-rollback.]
Canary: ramp traffic to the new version in stages, auto-rolling-back the moment P99 or errors regress. Blue-green is the instant-switch variant with a kept-warm fallback.
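The canary loop above can be sketched in a few lines of Python. This is a minimal illustration, not how Argo Rollouts or any specific tool implements it: the `observe` callback, the `StageMetrics` type, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class StageMetrics:
    p99_latency_s: float  # observed P99 latency at this traffic share
    error_rate: float     # fraction of failed requests, 0.0 to 1.0


def run_canary(
    observe: Callable[[int], StageMetrics],
    stages: Sequence[int] = (1, 5, 25, 100),
    max_p99_s: float = 2.0,        # assumed threshold, not from the lesson
    max_error_rate: float = 0.01,  # assumed threshold, not from the lesson
) -> Tuple[str, int]:
    """Walk the traffic stages; stop and roll back on the first regression."""
    for pct in stages:
        # Hypothetical hook: shift pct% of traffic to the new version, then measure.
        metrics = observe(pct)
        if metrics.p99_latency_s > max_p99_s or metrics.error_rate > max_error_rate:
            return ("rolled_back", pct)
    return ("promoted", stages[-1])


# Example: a fake observer whose P99 regresses once 25% of traffic arrives.
def fake_observe(pct: int) -> StageMetrics:
    return StageMetrics(p99_latency_s=3.5 if pct >= 25 else 1.2, error_rate=0.001)


result = run_canary(fake_observe)  # rolls back at the 25% stage
```

In a real rollout the observer would hold each stage long enough to collect a meaningful P99 sample before moving on; the sketch skips that wait to stay short.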

2 · The LLM twist: drain in-flight generations

Generations can run for many seconds. If a rollout kills a replica mid-generation, those users get truncated output. So on shutdown you must drain gracefully: stop admitting new requests, let in-flight ones finish, then exit, using a preStop hook, SIGTERM handling, and a drain timeout generous enough for the longest expected output. And remember: each new replica has a minutes-long cold start, so rollouts are inherently slow; plan the window [1].
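As a rough sketch of the drain sequence (stop admitting, wait for in-flight work, then exit), the following Python uses only the standard library. `DrainGate` and `install_sigterm_flag` are hypothetical names for this example, and a real inference server would wire the same idea into its own request loop.

```python
import signal
import threading


class DrainGate:
    """Tracks in-flight generations and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_admit(self) -> bool:
        """Call before starting a generation; False means reject the request."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def finish(self) -> None:
        """Call when a generation completes."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout_s: float) -> bool:
        """Stop admitting, then wait for in-flight work. True if fully drained."""
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout_s)


def install_sigterm_flag() -> threading.Event:
    """Set an Event on SIGTERM; the main loop then calls gate.drain(...)."""
    stop = threading.Event()
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    return stop
```

The signal handler only sets a flag, and the blocking wait happens in normal code. That keeps lock handling out of the handler itself, which is a common way to avoid deadlocks in Python signal handling.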

In Kubernetes terms (infra bridge)

It's Deployment rolling updates / Argo Rollouts canary — you know this. Two GPU twists: set a preStop hook + long terminationGracePeriodSeconds so a pod finishes its in-flight generations before SIGKILL (a default 30s grace will truncate long outputs); and size maxSurge/maxUnavailable around the minutes-long weight-load readiness so you don't drop capacity mid-rollout. Canary analysis hooks straight to your TTFT/P99 SLOs from Lesson 24.
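One way to pick a terminationGracePeriodSeconds value is to work back from the longest generation you expect. The helper below is a back-of-envelope sketch: the function name, the safety factor, and the example figures (4,096 output tokens at 40 tok/s per request) are assumptions for illustration, not numbers from this lesson. It relies on the fact that the Kubernetes grace period covers the preStop hook as well as the time after SIGTERM.

```python
import math


def grace_period_seconds(
    max_output_tokens: int,
    per_request_tok_s: float,
    safety_factor: float = 1.5,  # assumed headroom for slow decodes under load
    pre_stop_s: float = 0.0,     # time spent in the preStop hook, if any
) -> int:
    """Estimate a grace period that lets the longest generation finish."""
    if max_output_tokens <= 0 or per_request_tok_s <= 0:
        raise ValueError("token count and decode rate must be positive")
    longest_generation_s = max_output_tokens / per_request_tok_s
    return math.ceil(pre_stop_s + longest_generation_s * safety_factor)


# Hypothetical workload: 4,096-token outputs decoded at 40 tok/s per request.
needed = grace_period_seconds(4096, 40.0)
exceeds_default = needed > 30  # the Kubernetes default grace period is 30s
```

With those assumed figures the estimate lands well above the 30s default, which is the truncation risk the paragraph above describes.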

3 · Cost: dollars per million tokens

The number that ties it all to the business [2]:

$ / 1M tokens = ( GPU $/hr ÷ ( tokens/sec × utilization × 3600 ) ) × 1,000,000

The levers are exactly what this course has been about: tokens/sec (quantization L16, batching L12, speculative decoding L15, the right hardware L22) and utilization (routing L25, autoscaling L26, keeping replicas near the knee L24). Halve the bytes or double the batch and the cost per token drops with it.

Pantry: cost per dish = the kitchen's hourly cost ÷ dishes served per hour. Everything you've learned either serves more dishes per hour or keeps the kitchen fuller.
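The formula is short enough to check directly. The sketch below implements it as written and plugs in the illustrative figures this lesson uses for its capstone example ($3/GPU-hr, ~2,000 aggregate tok/s, 70% utilization); the function name is a hypothetical choice for the example.

```python
def cost_per_million_tokens(
    gpu_dollars_per_hour: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """$ / 1M tokens = GPU $/hr / (tokens/sec * utilization * 3600) * 1,000,000."""
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("need tokens_per_second > 0 and 0 < utilization <= 1")
    effective_tokens_per_hour = tokens_per_second * utilization * 3600
    return gpu_dollars_per_hour / effective_tokens_per_hour * 1_000_000


baseline = cost_per_million_tokens(3.0, 2000, 0.7)           # about $0.60
double_throughput = cost_per_million_tokens(3.0, 4000, 0.7)  # about $0.30
fuller_replicas = cost_per_million_tokens(3.0, 2000, 0.9)    # about $0.46
```

Cost is inversely proportional to both tokens/sec and utilization, so doubling throughput halves the figure, while raising utilization from 70% to 90% trims it by roughly a fifth.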

On YOUR cluster, and the whole course, in one number (capstone)

Roll out new Qwen versions with a canary + auto-rollback on TTFT/P99, a preStop drain long enough for your longest generation, and surge sizing that accounts for the 27 GB weight-load readiness. Then cost it: at, say, an amortized $3/GPU-hr and ~2,000 agg tok/s at 70% utilization, that's ≈ $0.60 / 1M tokens — and every lever in this course moves that figure.

That's the whole arc: a token (L1–7) → the runtime that serves it fast (L8–15) → fewer bits (L16–18) → spread across GPUs and hardware (L19–23) → routed, scaled, shipped, and costed in production (L24–29). You started in the ops layer; now you can see all the way down to the silicon and back up to the bill.

Read this next (primary source). Runnable companion: day26 notebook, covering blue-green / canary, graceful drain, and the $/1M-token cost model.

Check yourself (recall, don't peek)

I'm your teacher — and that's the course. You've gone from "what is a token" to deploying and costing a fleet. Want to revisit any lesson, work a real number on your cluster, or design the NVLink/TP fix and a canary rollout for real? Just ask.
References
  1. Zero-downtime deployment (blue-green / canary / drain) — day26 (zero-downtime-deployment.ipynb).
  2. Cost per million tokens — day26 cost model.