Inference Engineering · Lesson 22 · GPU Generations

GPU Generations

Choosing silicon — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict what changes across GPU generations and how to match one to your workload.

The setup

NVIDIA ships a new data-center GPU generation roughly every two years: Hopper → Blackwell → Rubin, with Ada as Hopper's consumer and workstation contemporary. What changes between generations that actually matters for inference?

Step 1 — the axes

Step 2 — your generation

Recall — cover the screen: the three things that improve each generation.
Precision (new low-bit formats with hardware support: FP8 on Hopper, FP4 on Blackwell), interconnect (faster NVLink/NVSwitch, decisive for tensor parallelism and disaggregation), and memory (more and faster HBM, which sets the KV-cache ceiling and the roofline).

Step 3 — the frontier

Step 4 — choosing

In Kubernetes terms · infra bridge

Choosing a GPU generation = choosing an instance family (m5 vs m6i vs p5/p6). Newer usually = better perf/$, but availability/price/quota vary, so right-size to the workload; fleets often mix generations like a mixed node pool.

On YOUR cluster · your hardware

H100 NVL (Hopper): a great match for FP8 Qwen, with FP8 Tensor Cores and ~3.9 TB/s HBM3. You'd move to Blackwell only for much bigger models, FP4, or large NVLink domains (disaggregation). For today's workload you're well-matched.

Read this next — primary source Blackwell architecture · runnable: day20 notebook.

Final check — teach it back

Explain to a colleague: "We chose H100s because…"
…Hopper's FP8 Tensor Cores and fast HBM3 are exactly what an FP8 model like our Qwen needs, which makes it the value sweet spot for inference at our scale. We'd only jump to Blackwell for frontier-size models, FP4, or big NVLink domains for disaggregated serving; otherwise we'd be overbuying.

GPU Generations

Hopper → Blackwell → Rubin: what changes, and how to choose.

Today's win: you'll explain the three axes that move across GPU generations — precision, interconnect, and memory — and how to match a generation to an inference workload (instead of just buying the newest).

The picture: kitchen model years

Each GPU generation is a newer kitchen build: ovens that support a coarser but faster setting (lower precision — FP8, then FP4), wider delivery roads between stations (faster NVLink), and bigger pantries (more, faster HBM). Newer isn't automatically right — you match the kitchen to the menu.

a faster low-precision oven setting = precision: FP16 → FP8 (Hopper) → FP4 (Blackwell)
wider roads between stations = interconnect: NVLink 4 → 5, NVSwitch domains
a bigger, faster pantry = memory: HBM3 → HBM3e (more GB, more TB/s)

1 · Three axes move each generation

Generation to generation, three things matter for inference: precision (new low-bit formats with hardware support), interconnect (how fast GPUs talk to each other, decisive for tensor parallelism and disaggregation), and memory (capacity and bandwidth, which set your KV-cache ceiling and roofline). [1]

Ada (RTX 40xx): FP8, no NVLink
Hopper (H100, you): FP8 · NVLink 4 · HBM3
Blackwell (B200 / GB200): FP4 · NVLink 5 · HBM3e
Rubin: next
Trend, older to newer: precision ↓ (FP16 → FP8 → FP4) · interconnect ↑ · memory ↑
Across the data-center line, each step so far has added a lower-precision format, a faster link, and more, faster memory. Ada is Hopper's consumer-class contemporary: it shares FP8 but has no NVLink. Your H100 sits at Hopper, the current FP8 workhorse.
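
To make the memory axis concrete, here is a rough roofline-style bound for single-sequence decoding, sketched in Python. It assumes decode is memory-bandwidth-bound, so every generated token requires reading all weights from HBM once. The function name and the 32 GB weight figure are hypothetical; only the ~3.9 TB/s bandwidth comes from this lesson.

```python
def decode_tokens_per_s_bound(bandwidth_bytes_per_s: float,
                              weight_bytes: float,
                              kv_bytes_read_per_token: float = 0.0) -> float:
    """Upper bound on tokens/s for one sequence when decode is bandwidth-bound.

    Each decode step must stream the weights (plus any KV cache it reads)
    from HBM, so time per token >= bytes moved / bandwidth.
    """
    if bandwidth_bytes_per_s <= 0 or weight_bytes <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes_read_per_token)


H100_NVL_BANDWIDTH = 3.9e12        # ~3.9 TB/s, the figure quoted in this lesson
HYPOTHETICAL_WEIGHT_BYTES = 32e9   # assumption: a 32B-parameter model in FP8

bound = decode_tokens_per_s_bound(H100_NVL_BANDWIDTH, HYPOTHETICAL_WEIGHT_BYTES)
print(f"{bound:.0f} tokens/s upper bound per sequence")
```

This is a ceiling, not a prediction: batching amortizes the weight reads across sequences, so aggregate throughput can be far higher, while real per-sequence speed lands below the bound.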

2 · Hopper (your H100) — the FP8 workhorse

Hopper introduced FP8 Tensor Cores and the Transformer Engine, paired with HBM3 (~3.9 TB/s on the H100 NVL) and NVLink 4. For FP8 LLM serving it's the value sweet spot today, and that is exactly the workload you run. [1]
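
The precision axis is easiest to see as bytes per parameter. The sketch below estimates weight memory at FP16, FP8, and FP4; it ignores the small overhead of quantization scale factors, and the 32-billion-parameter size is a hypothetical stand-in rather than the size of any specific Qwen model.

```python
# Nominal storage per parameter for each format (scale factors ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_footprint_gb(params_billions: float, fmt: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    try:
        bytes_per_param = BYTES_PER_PARAM[fmt.lower()]
    except KeyError:
        raise ValueError(f"unknown format: {fmt!r}") from None
    # billions of params * bytes per param = GB, since the 1e9 factors cancel
    return params_billions * bytes_per_param


# Hypothetical 32B-parameter model at each precision.
for fmt in ("fp16", "fp8", "fp4"):
    print(fmt, weight_footprint_gb(32, fmt), "GB")
```

Halving the bytes per parameter halves both the memory the weights occupy and the bytes streamed per decode step, which is why hardware support for a lower-precision format matters so much for inference.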

3 · Blackwell & beyond — frontier scale

Blackwell (B200/GB200) adds hardware FP4, HBM3e, and NVLink 5. The GB200 NVL72 wires 72 GPUs into a single NVLink domain, so giant models and disaggregated serving behave almost like one machine. Rubin is next. This is the tier for frontier-scale models and very large deployments. [2]
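
One way to reason about NVLink domain size is to ask whether a whole disaggregated deployment (prefill pool plus decode pool) fits inside one domain. The sketch below is illustrative only: the class, the function, and the GPU counts in the example are hypothetical, the 72-GPU figure comes from this lesson, and the 8-GPU Hopper figure assumes a standard 8-GPU NVSwitch node rather than your H100 NVL setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggregatedPlan:
    """GPU counts for a hypothetical prefill/decode split."""
    prefill_gpus: int
    decode_gpus: int

    @property
    def total_gpus(self) -> int:
        return self.prefill_gpus + self.decode_gpus


# GPUs reachable over NVLink without leaving the domain.
NVLINK_DOMAIN_SIZE = {
    "hopper_8gpu_node": 8,   # assumption: one 8-GPU Hopper NVSwitch node
    "gb200_nvl72": 72,       # from this lesson
}


def fits_in_one_domain(plan: DisaggregatedPlan, system: str) -> bool:
    """True if every GPU in the plan can sit in a single NVLink domain."""
    return plan.total_gpus <= NVLINK_DOMAIN_SIZE[system]


plan = DisaggregatedPlan(prefill_gpus=16, decode_gpus=32)  # hypothetical sizes
for system in NVLINK_DOMAIN_SIZE:
    print(system, fits_in_one_domain(plan, system))
```

When the plan does not fit, KV-cache transfers between prefill and decode workers have to cross a slower network fabric, which is the cost a large NVLink domain is meant to avoid.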

In Kubernetes terms · infra bridge

Picking a GPU generation is like choosing a cloud instance family for a node pool: m5 vs m6i, or the p5 vs p6 GPU families. Newer usually means better perf-per-dollar, but availability, price, and quota vary, so you right-size to the workload rather than always grabbing the newest SKU. A fleet often mixes generations (older nodes for steady load, newest for the heaviest), exactly like a mixed node pool.

4 · Choosing — match generation to workload

Rule of thumb: FP8 inference at normal scale → Hopper is excellent value. Frontier models, FP4, or huge multi-GPU domains → Blackwell. Don't overbuy capability your model and traffic can't use. [2]
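
The rule of thumb above can be written down as a tiny decision function. This is only a sketch of the lesson's heuristic, not a sizing tool: the function name, its parameters, and the "unspecified" fallback for formats the rule does not cover are all hypothetical.

```python
def recommend_generation(weight_format: str,
                         frontier_scale: bool = False,
                         needs_large_nvlink_domain: bool = False) -> str:
    """Encode the lesson's rule of thumb for picking a GPU generation."""
    fmt = weight_format.lower()
    # Any one of these three needs is enough to justify Blackwell.
    if fmt == "fp4" or frontier_scale or needs_large_nvlink_domain:
        return "Blackwell"
    # FP8 inference at normal scale: Hopper is the value choice.
    if fmt == "fp8":
        return "Hopper"
    # The rule of thumb says nothing about other formats.
    return "unspecified"


print(recommend_generation("fp8"))
print(recommend_generation("fp8", needs_large_nvlink_domain=True))
```

Price, availability, and quota still decide the final purchase; the function captures only the capability match.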

On YOUR cluster: well-matched · your hardware

Your 4× H100 NVL (Hopper) are a great fit for FP8 Qwen serving: the model is FP8, and Hopper's FP8 Tensor Cores and ~3.9 TB/s HBM3 are precisely what that needs. You'd reach for Blackwell only to (a) run much larger models, (b) exploit FP4, or (c) build big NVLink domains for disaggregated serving. For today's workload, you're not leaving much on the table.
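
A quick way to check the "well-matched" claim is to estimate how many tokens of KV cache fit after the weights are loaded. The sketch below uses the standard per-token KV size (keys and values for every layer and KV head). Everything numeric is an assumption: 94 GB per H100 NVL is not stated in this lesson, the model shape is a hypothetical stand-in for your Qwen variant, and the weights are assumed to be sharded once across the four GPUs rather than replicated.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_token_budget(num_gpus: int, hbm_bytes_per_gpu: int, weight_bytes: int,
                    per_token_bytes: int, usable_fraction: float = 0.9) -> int:
    """Tokens of KV cache that fit once weights are loaded (0 if weights do not fit)."""
    usable = int(num_gpus * hbm_bytes_per_gpu * usable_fraction)
    free = usable - weight_bytes
    return max(free, 0) // per_token_bytes


# Hypothetical model shape with an FP8 KV cache (1 byte per element).
per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128, bytes_per_elem=1)
print(per_token)  # 131072

budget = kv_token_budget(
    num_gpus=4,
    hbm_bytes_per_gpu=94 * 10**9,   # assumption: 94 GB per H100 NVL
    weight_bytes=32 * 10**9,        # assumption: 32B parameters in FP8
    per_token_bytes=per_token,
)
print(f"about {budget / 1e6:.1f} million cached tokens across the cluster")
```

Dividing the budget by your typical context length gives a rough ceiling on concurrent sequences, which is the number to compare against real traffic before considering a larger-memory generation.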

Read this next — primary source NVIDIA Blackwell architecture. Runnable companion: day20 notebook.


References
  1. GPU generations: Hopper / Ada — day20 (gpu-generations-hopper-blackwell.ipynb).
  2. Blackwell & GB200 NVL72 — NVIDIA.