Choosing silicon — one guess at a time.
NVIDIA ships a new GPU generation every couple of years: Ada → Hopper → Blackwell → Rubin. What actually changes that matters for inference?
Choosing a GPU generation = choosing an instance family (m5 vs m6i vs p5/p6). Newer usually = better perf/$, but availability/price/quota vary, so right-size to the workload; fleets often mix generations like a mixed node pool.
4× H100 NVL (Hopper): a great match for FP8 Qwen, thanks to FP8 Tensor Cores and ~3.9 TB/s of HBM3 per GPU. You'd move to Blackwell only for much bigger models, FP4, or large NVLink domains (disaggregation). For today's workload you're well-matched.
Hopper → Blackwell → Rubin: what changes, and how to choose.
Each GPU generation is a newer kitchen build: ovens that support a coarser but faster setting (lower precision — FP8, then FP4), wider delivery roads between stations (faster NVLink), and bigger pantries (more, faster HBM). Newer isn't automatically right — you match the kitchen to the menu.
| Kitchen analogy | GPU equivalent |
| --- | --- |
| a faster low-precision oven setting | precision: FP16 → FP8 (Hopper) → FP4 (Blackwell) |
| wider roads between stations | interconnect: NVLink 4 → 5, NVSwitch domains |
| a bigger, faster pantry | memory: HBM3 → HBM3e (more GB, more TB/s) |
Generation to generation, three things matter for inference: precision (new low-bit formats with hardware support), interconnect (how fast GPUs talk to each other, which is decisive for tensor parallelism and disaggregation), and memory (capacity and bandwidth, which set your KV ceiling and roofline).[1]
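The link between memory and the "KV ceiling and roofline" can be sketched with back-of-envelope arithmetic. This is a minimal sketch, not a benchmark: the function names are hypothetical, the model shape (48 layers, 8 KV heads, head dimension 128, 30B parameters in FP8) is made up for illustration, and the 94 GB capacity is an assumed H100 NVL figure that the text above does not state. It covers a single GPU and ignores activations, fragmentation, and tensor parallelism.

```python
GB = 10**9
TB = 10**12


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_kv_tokens(hbm_bytes: int, weight_bytes: int, kv_per_token: int) -> int:
    """KV ceiling: tokens that fit in the HBM left over after the weights."""
    return max(0, (hbm_bytes - weight_bytes) // kv_per_token)


def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s: float,
                                weight_bytes: int) -> float:
    """Memory-bound upper limit for one sequence: each decode step
    has to stream every weight from HBM once."""
    return bandwidth_bytes_per_s / weight_bytes


# Illustrative inputs only (assumptions, not measurements).
weights = 30 * GB                 # hypothetical 30B params at 1 byte each (FP8)
hbm = 94 * GB                     # assumed per-GPU capacity
bandwidth = 3.9 * TB              # ~3.9 TB/s, the figure quoted in the text
kv = kv_bytes_per_token(num_layers=48, num_kv_heads=8,
                        head_dim=128, bytes_per_elem=1)

print("KV bytes per token:", kv)
print("KV ceiling (tokens):", max_kv_tokens(hbm, weights, kv))
print("Decode ceiling (tokens/s, one sequence):",
      decode_ceiling_tokens_per_s(bandwidth, weights))
```

More capacity raises the first ceiling (more concurrent context), and more bandwidth raises the second (faster per-sequence decode), which is why both numbers move together from one generation to the next.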
Hopper introduced FP8 Tensor Cores and the Transformer Engine, up to ~3.9 TB/s of HBM3 (on the H100 NVL), and NVLink 4. For FP8 LLM serving it's the value sweet spot today, which is exactly the workload you run.[1]
Blackwell (B200/GB200) adds hardware FP4, HBM3e, and NVLink 5, and the GB200 NVL72 wires 72 GPUs into a single NVLink domain, so giant models and disaggregated serving behave almost like one machine. Blackwell is the tier for frontier-scale models and very large deployments; Rubin is the generation after it.[2]
Picking a GPU generation is choosing an instance family — like m5 vs m6i vs the p5/p6 GPU families. Newer usually means better perf-per-dollar, but availability, price, and quota vary, so you right-size to the workload rather than always grabbing the newest SKU. A fleet often mixes generations (older nodes for steady load, newest for the heaviest), exactly like a mixed node pool.
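One way to make "right-size rather than grab the newest SKU" concrete is to compare options by cost per million tokens and skip anything you cannot actually get. The sketch below assumes you have already measured throughput for your own workload; `NodeOption`, its fields, and the prices and throughputs in the example are hypothetical placeholders, not real quotes or benchmarks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeOption:
    name: str
    price_per_hour: float   # what you pay per node-hour (placeholder)
    tokens_per_s: float     # measured throughput on YOUR workload (placeholder)
    available: bool         # quota and capacity actually granted


def cost_per_million_tokens(opt: NodeOption) -> float:
    """Dollars to generate one million tokens at steady throughput."""
    return opt.price_per_hour / (opt.tokens_per_s * 3600) * 1_000_000


def cheapest_available(options: List[NodeOption]) -> Optional[NodeOption]:
    """Lowest cost per token among options you can provision, or None."""
    candidates = [o for o in options if o.available]
    if not candidates:
        return None
    return min(candidates, key=cost_per_million_tokens)


# Made-up numbers purely to exercise the functions.
fleet = [
    NodeOption("older-gen", price_per_hour=3.6, tokens_per_s=1000.0, available=True),
    NodeOption("newest-gen", price_per_hour=7.2, tokens_per_s=4000.0, available=False),
]
best = cheapest_available(fleet)
print(best.name if best else "nothing available")
```

In this toy input the newer node is cheaper per token but unavailable, so the older one wins, which is the situation that leads fleets to mix generations.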
Rule of thumb: FP8 inference at normal scale → Hopper is excellent value. Frontier models, FP4, or huge multi-GPU domains → Blackwell. Don't overbuy capability your model and traffic can't use.[2]
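The rule of thumb can be written down as a tiny decision helper. This is only a sketch of the heuristic as stated above; the function name, its parameters, and the returned labels are hypothetical, and a real choice would also weigh price, availability, and quota.

```python
def suggest_generation(precision: str,
                       model_fits_current_node: bool,
                       needs_large_nvlink_domain: bool) -> str:
    """Encode the rule of thumb: Hopper for FP8 at normal scale,
    Blackwell for FP4, oversized models, or large NVLink domains."""
    if precision.lower() == "fp4":
        return "blackwell"   # hardware FP4 arrives with Blackwell
    if not model_fits_current_node:
        return "blackwell"   # much larger models than the current node holds
    if needs_large_nvlink_domain:
        return "blackwell"   # e.g. disaggregated serving across many GPUs
    return "hopper"          # FP8 inference at normal scale


# The workload described in this section: FP8, fits on the node,
# no large NVLink domain required.
print(suggest_generation("fp8", model_fits_current_node=True,
                         needs_large_nvlink_domain=False))
```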
Your 4× H100 NVL (Hopper) are a great fit for FP8 Qwen serving: the model is FP8, and Hopper's FP8 Tensor Cores plus ~3.9 TB/s of HBM3 per GPU are precisely what that needs. You'd reach for Blackwell only to (a) run much larger models, (b) exploit FP4, or (c) build big NVLink domains for disaggregated serving. For today's workload, you're not leaving much on the table.