Fleets across clouds — one guess at a time.
Demand outgrows one cluster/region. You consider spreading inference across clouds.
In platform terms: multi-cluster fleet management (a control plane like Karmada over regional clusters) plus GSLB/global ingress for geo-routing. Spot capacity maps to preemptible node pools (drain-on-reclaim), and data-residency/compliance rules become scheduling constraints. It is the GPU version of any global-service playbook.
You're single-site (on-prem OpenShift, 4×H100), which is fine for now. Scale-out path: keep on-prem as the reserved baseline and burst to cloud (on-demand/spot) under a global control plane with geo-aware routing.
Why inference fleets span clouds — for supply, latency, and reliability.
One location can't get enough ovens (GPU scarcity), can't be near every customer (latency), and is one fire away from total outage. So you run branches in many cities with a single head office coordinating them — buying ovens wherever they're available, seating diners at the nearest branch, and surviving any one branch going dark.
| Restaurant analogy | Fleet concept |
| --- | --- |
| head office | global control plane (one brain) |
| each city branch | per-cloud / per-region workload plane |
| seat diners at the nearest branch | geo-aware load balancing (by RTT) |
| owned vs rented vs day-rate ovens | reserved / on-demand / spot |
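Three drivers: supply (one location cannot get enough GPUs), latency (one location cannot be near every user), and reliability (one location is a single point of failure).[1]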
A global control plane holds policy, routing, and capacity state; per-cloud workload planes actually run the GPUs. A geo-aware load balancer sends each request to the nearest healthy region within its latency budget.[1]
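The routing rule described here (nearest healthy region within the latency budget) can be sketched in a few lines. This is a minimal illustration, not how any particular GSLB implements it: the `Region` type, the `pick_region` function, the region names, and the RTT figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    name: str
    rtt_ms: float   # measured round-trip time from the client (hypothetical values)
    healthy: bool   # result of the most recent health check


def pick_region(regions: list[Region], latency_budget_ms: float) -> Optional[Region]:
    """Return the lowest-RTT healthy region within the budget, or None."""
    candidates = [
        r for r in regions
        if r.healthy and r.rtt_ms <= latency_budget_ms
    ]
    return min(candidates, key=lambda r: r.rtt_ms, default=None)


regions = [
    Region("us-east", 18.0, healthy=False),  # nearest, but failing health checks
    Region("eu-west", 95.0, healthy=True),   # healthy, but over an 80 ms budget
    Region("us-west", 62.0, healthy=True),
]

choice = pick_region(regions, latency_budget_ms=80.0)
print(choice.name if choice else "no region within budget")
```

Returning `None` rather than falling back silently leaves the policy decision (reject, queue, or relax the budget) to the control plane, which is where the text places policy.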
Mix purchase types to balance cost and certainty: reserved (committed, discounted hourly rate, for baseline load), on-demand (flexible, priciest, for bursts), and spot/preemptible (usually the deepest discount, but can be reclaimed at short notice, so suited to interruptible or buffered work). A typical fleet is reserved baseline + on-demand burst + spot for slack.[2]
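One way to make the reserved / on-demand / spot mix concrete is a small allocation function. This is a sketch under stated assumptions: `plan_capacity` and its `spot_share` parameter (the fraction of overflow that is interruptible and can tolerate reclaim) are hypothetical, and a real planner would also weigh prices and reclaim rates.

```python
def plan_capacity(demand_gpus: int, reserved_gpus: int,
                  spot_share: float = 0.5) -> dict[str, int]:
    """Split GPU demand across purchase types.

    reserved_gpus: committed baseline capacity.
    spot_share: assumed fraction of the overflow that is interruptible
                and may therefore run on spot (hypothetical knob).
    """
    if demand_gpus < 0 or reserved_gpus < 0:
        raise ValueError("GPU counts must be non-negative")
    if not 0.0 <= spot_share <= 1.0:
        raise ValueError("spot_share must be between 0 and 1")

    reserved = min(demand_gpus, reserved_gpus)  # fill the baseline first
    overflow = demand_gpus - reserved           # what the baseline cannot serve
    spot = int(overflow * spot_share)           # interruptible part goes to spot
    on_demand = overflow - spot                 # the rest needs guaranteed capacity
    return {"reserved": reserved, "on_demand": on_demand, "spot": spot}


print(plan_capacity(demand_gpus=10, reserved_gpus=4))
print(plan_capacity(demand_gpus=3, reserved_gpus=4))
```

With a baseline of 4 GPUs and demand of 10, the overflow of 6 is split evenly at the default `spot_share`; when demand fits inside the baseline, nothing is rented.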
This is multi-cluster fleet management: a control plane (think Karmada / fleet manager) over regional clusters, with a GSLB / global ingress doing geo-routing — the GPU version of what you'd build for any global service. Spot is preemptible/spot node pools (drain-on-reclaim), reserved is committed node groups, and data-residency/compliance are scheduling constraints (affinity to in-region clusters). Same playbook, GPU-flavored.
Latency budgets (per-region RTT), active-active vs active-passive failover, and compliance (SOC 2, HIPAA, data residency) all bound where workloads can run. Cost and resilience are the dials; compliance is the fence.[1]
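"Compliance is the fence" suggests an ordering: apply compliance as a hard filter first, then rank whatever remains by latency. The sketch below illustrates that ordering only. The `Cluster` type, the `eligible_clusters` function, the cluster names, and the certification labels are hypothetical, and the returned list is read as primary first, failover targets after.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    jurisdiction: str               # where the data physically resides
    certifications: frozenset[str]  # e.g. {"SOC2", "HIPAA"} (illustrative labels)
    rtt_ms: float
    healthy: bool


def eligible_clusters(clusters: list[Cluster],
                      allowed_jurisdictions: set[str],
                      required_certs: set[str],
                      latency_budget_ms: float) -> list[Cluster]:
    """Hard-filter by compliance, then order healthy clusters by RTT."""
    # The fence: a cluster outside it is never considered, however fast or cheap.
    fenced = [
        c for c in clusters
        if c.jurisdiction in allowed_jurisdictions
        and required_certs <= c.certifications
    ]
    # The dials: among compliant clusters, keep healthy ones within budget.
    live = [c for c in fenced if c.healthy and c.rtt_ms <= latency_budget_ms]
    return sorted(live, key=lambda c: c.rtt_ms)


clusters = [
    Cluster("onprem-eu", "EU", frozenset({"SOC2", "HIPAA"}), 12.0, True),
    Cluster("cloud-eu", "EU", frozenset({"SOC2"}), 30.0, True),
    Cluster("cloud-us", "US", frozenset({"SOC2", "HIPAA"}), 8.0, True),
]

order = eligible_clusters(clusters, {"EU"}, {"SOC2"}, latency_budget_ms=50.0)
print([c.name for c in order])
```

Here `cloud-us` has the lowest RTT but is excluded by the residency rule, so it never appears as a primary or a failover target.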
Today you run one on-prem OpenShift cluster (4× H100) with no multi-cloud yet, and for many workloads that's fine. This lesson is the scale-out future: if demand outgrows your 4 GPUs or you need geo-presence, the pattern is to keep on-prem as the reserved baseline and burst to cloud (on-demand/spot) under a global control plane with geo-aware routing. It is the same fleet thinking you'd apply to any service, now sized in GPUs.