
Your NVLink Is Half-Down

A real fault on your 4×H100 lab — diagnosed, with a fix-it plan.

The headline: the prefill/decode asymmetry is normal physics — not a fault. But the cluster is degraded: one of your two NVLink pairs is down, your tensor-parallel model is sitting on the broken pair, and one whole H100 is idle while another runs four servers. All fixable.
Pantry: NVLink is a private express lane between two prep stations so they can cook as one. A 4-GPU box should have two express lanes (two station-pairs). Yours has one, and your two-station dish (the TP model) is on the pair whose lane is closed, so it's stuck on the slow service road.

1 · You have one NVLink pair, not two

[Diagram: the four GPUs across NUMA 0 and NUMA 1. GPU0: qwen35-27b TP worker 0, 93 GB. GPU1: qwen35-27b TP worker 1, 93 GB. GPU2: qwen36 + 3 small models, 88 / 94 GB. GPU3: idle, 0 GB. GPU0↔GPU1: NVLink down, PCIe only. GPU2↔GPU3: ✓ NVLink ×12, but unused.]
Verified live: only GPU2↔GPU3 carries NVLink. GPU0↔GPU1 report zero links — and that's exactly where your tensor-parallel model is running.

2 · Why it bites

Tensor-parallel decode does an all-reduce every layer, every token to stitch the two GPUs' work together. On NVLink that's a blink; over PCIe it crawls — and your TP model has no NVLink option on its pair at all.

Pantry: the two stations must hand each other a dish on every step. Over the express lane that's instant; over the closed lane they walk it around — every layer, every token.
[Diagram: over NVLink (what you want): fast. Over PCIe (what's happening): slow, every layer, every token. Fix: put the TP model on the NVLink pair, or run it TP=1 (it fits on one 94 GB GPU).]
Same work, two very different lanes. Your TP model is on the slow one — with no fast lane available on that pair until the bridge is fixed.
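To put rough numbers on "every layer, every token", here is a back-of-envelope sketch. The layer count and the per-all-reduce latencies are hypothetical placeholders, not measurements from this cluster, and the factor of two assumes a Megatron-style split (one all-reduce after attention, one after the MLP). Swap in the real layer count from the model config and latencies from an NCCL benchmark on each pair.

```shell
# sync_ms_per_token LAYERS LATENCY_US
# Prints the all-reduce time added to every decoded token, in milliseconds.
# Assumption: 2 all-reduces per transformer layer (Megatron-style tensor parallelism).
sync_ms_per_token() {
  awk -v layers="$1" -v lat_us="$2" \
    'BEGIN { printf "%.2f\n", layers * 2 * lat_us / 1000 }'
}

# Hypothetical inputs: a 64-layer model at 10 us vs 50 us per small all-reduce.
#   sync_ms_per_token 64 10
#   sync_ms_per_token 64 50
```

Decode messages are small, so the per-call latency of the link matters more than its peak bandwidth here. That is why the cost scales with the number of layers rather than with the model size.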

3 · What to fix (by severity)

Sev | Finding | Fix
P0 | NVLink bridge down on GPU0–GPU1. Only GPU2↔GPU3 has links; GPU0/1 report zero. | Reseat / install the bridge on the 0–1 pair (hardware), then re-verify.
P1 | TP model on the broken pair. qwen35-27b all-reduces over PCIe. | Move it to GPU2+GPU3, or run TP=1 (fits on one GPU).
P1 | Imbalance. GPU3 idle; GPU2 runs 4 servers at 88/94 GB. | Spread across all 4 GPUs (layouts below).
P2 | Prefix caching off on qwen35, your most prefill-heavy model (47:1). | Add --enable-prefix-caching (one flag).
P2 | Driver upgrade failed (gpu-driver-upgrade-state). | Resolve the GPU-operator upgrade; re-check NVLink.
P3 | Time-slicing the big models: no memory isolation, non-deterministic placement. | Dedicated whole GPUs (or MIG) for big models; time-slice only the small ones.

4 · Where everything should go

Option B: TP=1 everywhere (recommended)

A 27B FP8 model (~27 GB) fits on one 94 GB GPU — qwen36 already proves it. Drop TP, one model per GPU: no all-reduce, best throughput, and it doesn't even need the broken bridge.

GPU0: qwen36-27b (TP=1)
GPU1: embedding + reranker + bert-ner
GPU2: qwen35-27b (TP=1)
GPU3: replica of the busiest model (or a new model)

Option A — keep TP=2 for lowest single-stream latency. Put qwen35-27b on the working NVLink pair (GPU2+GPU3), move qwen36 to GPU0, small models to GPU1. Gains ~2× decode bandwidth over NVLink, but uses two GPUs for one model. Use this only if per-token latency is a hard SLO.

Heads-up: under ×5 time-slicing the scheduler picks the physical GPU for you (that's how the TP model landed on the wrong pair). To pin models to specific GPUs, request whole GPUs (or MIG) for the big ones — don't blanket-time-slice them.
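For a quick bench test directly on the node (outside the scheduler), pinning can be sketched as a launch line. This assumes the servers are vLLM, which the flag name above suggests but the cluster output does not confirm. The model path is a placeholder, and the helper only prints the command so you can review it before running anything.

```shell
# pinned_serve_cmd GPUS MODEL TP
# Prints (does not run) a launch line pinned to the given GPU indices.
# CUDA_DEVICE_ORDER=PCI_BUS_ID makes CUDA indices follow nvidia-smi's ordering.
# Assumptions: vLLM is the serving engine; MODEL is a placeholder path.
pinned_serve_cmd() {
  printf 'CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=%s vllm serve %s --tensor-parallel-size %s --enable-prefix-caching\n' \
    "$1" "$2" "$3"
}

# Option A (TP=2 on the working pair):  pinned_serve_cmd 2,3 /models/qwen35-27b 2
# Option B (TP=1 on a single GPU):      pinned_serve_cmd 2 /models/qwen35-27b 1
```

In the cluster itself the equivalent is the whole-GPU (or MIG) resource request described above; this environment-variable form is only a stand-in for trying a layout by hand.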

5 · The receipts

What the cluster reported (read-only), in case ops wants to see it:

nvidia-smi nvlink --status      → GPU2, GPU3: 12 links @ 26.6 GB/s each
nvidia-smi nvlink --status -i 0 → (empty)      # GPU0: no NVLink
nvidia-smi nvlink --status -i 1 → (empty)      # GPU1: no NVLink
nvidia-smi topo -m              → GPU0-GPU1: NODE (PCIe) ; GPU2-GPU3: NV12

compute-apps:  GPU0 Worker_TP0 93GB | GPU1 Worker_TP1 93GB   ← qwen35 TP on broken pair
               GPU2 qwen36 76GB + embed/rerank/bert ~11GB     ← crowded
               GPU3 0 GB                                       ← idle
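
The link figures above can be turned into a rough bandwidth comparison. This is a sketch: it treats the reported per-link rate as one direction, and the PCIe number is an assumption (about 64 GB/s per direction for a Gen5 x16 slot), not something read from this cluster. Raw link bandwidth is also not the same thing as end-to-end decode speedup.

```shell
# aggregate_gbps LINKS PER_LINK_GBPS
# Prints total one-direction NVLink bandwidth in GB/s.
aggregate_gbps() {
  awk -v n="$1" -v bw="$2" 'BEGIN { printf "%.1f\n", n * bw }'
}

# bandwidth_ratio NVLINK_GBPS PCIE_GBPS
# Prints how many times more raw bandwidth NVLink offers than the PCIe path.
bandwidth_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# From the receipts: 12 links at 26.6 GB/s each.
#   aggregate_gbps 12 26.6
# Against an assumed 64 GB/s PCIe path:
#   bandwidth_ratio 319.2 64
```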

Verify after any fix

nvidia-smi nvlink --status     # both pairs should list 12 links
nvidia-smi topo -m             # GPU0-GPU1 should read NV12, not NODE
bash learning/tools/cluster-probe.sh   # prefill/decode + KV, live
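
If ops wants the topology check scripted rather than eyeballed, a minimal sketch follows. It assumes the usual `nvidia-smi topo -m` matrix layout (each row starts with its GPU name, followed by one cell per GPU); driver versions can add columns or formatting, so treat the parsing as illustrative. The helper name is made up for this example.

```shell
# pair_link TOPO_TEXT I J
# Prints the topo-matrix cell for GPU<I> vs GPU<J>, e.g. NV12 or NODE.
# Only rows that begin with the GPU name are considered, so the header
# row (which starts with whitespace) is skipped.
pair_link() {
  printf '%s\n' "$1" | awk -v row="GPU$2" -v col="$3" \
    'index($0, row) == 1 && $1 == row { print $(col + 2); exit }'
}

# Usage on the node:
#   topo="$(nvidia-smi topo -m)"
#   [ "$(pair_link "$topo" 0 1)" = "NV12" ] || echo "GPU0-GPU1 still not on NVLink"
```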
Want me to take this further? I can draft the exact Deployment edits (GPU pinning, the prefix-caching flag, TP changes) as a reviewable proposal, or turn this into a full Lesson: tensor parallelism & NVLink. Just say which.