Slicing a GPU — one guess at a time.
A 94 GB GPU running a model that needs only 20 GB wastes most of itself. You want several workloads to share one physical GPU.
In Kubernetes terms, MIG is like partitioning a node into Guaranteed-QoS slices with hard quotas that can't be overcommitted. Time-slicing is like overcommitting the node (Burstable QoS): more pods than guaranteed capacity, flexible but noisy-neighbor-prone. It is the same isolation-vs-density call you make with requests/limits.
You run time-slicing ×5, which gives 20 logical GPUs, with MIG disabled. That is flexible and oversubscribed, but it has no hardware isolation, so noisy neighbors are a risk to your SLOs. Your 27B Qwen wants a whole GPU regardless. MIG would be the switch if you needed guaranteed per-tenant isolation.
Slice one physical GPU into smaller isolated GPUs — or time-slice it. The tradeoff.
A 94 GB GPU is a big kitchen. If your dish only needs a third of it, the rest is wasted. Two ways to share: build permanent walls into separate stalls each with their own oven and counter (MIG — hard isolation), or let several cooks share the one kitchen on a timer (time-slicing — flexible, but they can step on each other).
| Kitchen analogy | GPU concept |
| --- | --- |
| permanent walls + dedicated equipment | MIG: hardware partition, isolated SMs/HBM |
| sharing the kitchen on a timer | time-slicing: soft, oversubscribed, no isolation |
| fitting the dish to the stall | right-sizing by VRAM + KV headroom |
A model needs its weights plus KV headroom. A small model on a 94 GB GPU leaves most of it idle. For multi-tenant or small-model serving, you want to subdivide the GPU so several workloads share it. [1]
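The weights-plus-KV rule can be turned into a rough sizing check. This is a minimal sketch, not a definitive sizing method: the model shape (layers, KV heads, head dimension), the token count, and the 2-byte precision are illustrative assumptions, not values from this page, so substitute your own model's config.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes). 2 bytes/param assumes FP16 or BF16."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, tokens: int,
                bytes_per_elem: float = 2.0) -> float:
    """KV cache in GB: one K and one V vector per layer for every cached token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * tokens / 1e9


# Hypothetical small model: 7B params, 32 layers, 8 KV heads, head_dim 128.
# 32,768 cached tokens is the total across all concurrent requests.
GPU_GB = 94
small_need = weights_gb(7) + kv_cache_gb(32, 8, 128, tokens=32_768)
print(f"small model needs ~{small_need:.1f} GB, idle ~{GPU_GB - small_need:.1f} GB")

# 27B model at 2 bytes/param (assumed precision): weights only, before any KV cache.
print(f"27B weights ~{weights_gb(27):.0f} GB of {GPU_GB} GB")
```

Under these assumptions the small model uses well under a quarter of the card, while a 27B model's weights alone take more than half of it, which is why it is a poor candidate for sharing.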
Multi-Instance GPU (MIG) splits one physical GPU into up to 7 isolated instances, each with its own dedicated SMs, HBM slice, and L2 cache. Profile names encode the size as compute slices and memory: 1g.10gb, 3g.40gb, and 7g.80gb are the names on an 80 GB GPU, and the memory figures differ on a 94 GB card. The isolation is in hardware: one tenant cannot touch another's compute or memory, so performance is guaranteed and there is no noisy neighbor. [1]
MIG gives hard, guaranteed isolation, but it is rigid: profiles are fixed and you can't oversubscribe. Time-slicing is soft sharing: it is flexible, and you can advertise more logical GPUs than you have physical ones, but there's no isolation, so a heavy tenant can starve the others (noisy neighbor). [2]
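The difference in guarantees can be illustrated with a toy model. It assumes equal time slices among busy tenants and ignores context-switch overhead and memory contention, so treat the numbers as a shape, not a prediction. Both function names are hypothetical.

```python
def time_sliced_share(busy_tenants: int) -> float:
    """Idealized fraction of GPU time one tenant gets when sharing with equal slices."""
    if busy_tenants < 1:
        raise ValueError("need at least one busy tenant")
    return 1.0 / busy_tenants


def mig_share(compute_slices: int, total_slices: int = 7) -> float:
    """Fraction of GPU compute a MIG instance owns, independent of its neighbors."""
    if not 1 <= compute_slices <= total_slices:
        raise ValueError("compute_slices out of range")
    return compute_slices / total_slices


# A tenant's share under time-slicing depends on how many neighbors are busy;
# under MIG it is fixed by the profile (here a 1-slice instance out of 7).
for busy in (1, 2, 5):
    print(f"busy={busy}  time-sliced={time_sliced_share(busy):.2f}  "
          f"mig_1g={mig_share(1):.2f}")
```

In this model the time-sliced tenant gets the whole GPU when it is alone but only a fifth of it when all five replicas are busy, while the MIG tenant's share never moves in either direction.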
In Kubernetes terms, MIG is like partitioning a node into smaller schedulable units with hard resource quotas: Guaranteed-QoS slices that can't be overcommitted, like dedicated node pools with strict limits. Time-slicing is like overcommitting the node (Burstable QoS): more pods than guaranteed capacity, flexible but subject to noisy-neighbor contention. It is the same isolation-vs-density call you make with resource requests/limits.
Your 4 H100s run time-slicing ×5, giving 20 logical GPUs, with MIG disabled. That's the flexible, oversubscribed choice: great for packing many workloads, but with no hardware isolation, so a heavy tenant can degrade neighbors sharing the same physical GPU (a real noisy-neighbor risk for your SLOs). If you needed guaranteed isolation per tenant, MIG would be the switch, at the cost of that flexibility. Your big Qwen, though, wants a whole GPU regardless.
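The lab numbers work out as follows. This is a small arithmetic sketch: the GPU count and replica factor come from the text, while the 94 GB per-GPU figure is an assumption carried over from the earlier example.

```python
PHYSICAL_GPUS = 4      # from the text: 4 H100s
REPLICAS_PER_GPU = 5   # from the text: time-slicing x5
GPU_MEMORY_GB = 94     # assumption: the 94 GB figure used earlier on this page


def logical_gpus(physical: int, replicas: int) -> int:
    """Number of schedulable GPU resources advertised under time-slicing."""
    return physical * replicas


def nominal_memory_per_replica(gpu_gb: float, replicas: int) -> float:
    """Even split of GPU memory per replica. Time-slicing does NOT enforce this."""
    return gpu_gb / replicas


total = logical_gpus(PHYSICAL_GPUS, REPLICAS_PER_GPU)
share = nominal_memory_per_replica(GPU_MEMORY_GB, REPLICAS_PER_GPU)
print(f"{total} logical GPUs, ~{share:.1f} GB each if tenants split memory evenly")
```

The per-replica memory figure is only a planning number. Since nothing enforces it, one tenant that allocates more than its even share leaves less for the others on the same physical GPU.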