Packaging inference — one guess at a time.
You containerize services every day. An LLM server is the same idea — with a few GPU twists.
Your daily work + 3 GPU twists: the pod requests nvidia.com/gpu (device plugin), the
readiness/startup probe waits out the weight load (a normal probe reports ready far too early), and huge
images make layer caching / registry locality / baked-vs-mounted weights real choices. NIM = vendor Helm
chart; custom Dockerfile = build your own.
Your vLLM runs as containers on OpenShift (OpenAI API + Prometheus /metrics, GPU via Container Toolkit). With ~27 GB FP8 weights, readiness gating is the thing to get right, or rollouts (Lesson 29) route into cold pods.
Packaging an inference server — the GPU-specific bits, and NIM vs a custom image.
A container is a food truck: the kitchen, the recipes, and the equipment packed into one unit that runs the same anywhere. For inference you also need the gas line to the ovens (the GPU, exposed via the NVIDIA Container Toolkit) — and the truck can't take orders until the pantry is fully stocked (weights loaded). This is your world; the twists are GPU-specific.
| Food-truck analogy | Inference serving |
| --- | --- |
| the packed food truck | the container image (CUDA base + vLLM + config) |
| the gas line to the ovens | NVIDIA Container Toolkit: GPU access in the container |
| "open" sign only when stocked | readiness probe waits for weight load |
| a pre-built franchise truck | NVIDIA NIM: prebuilt optimized image |
A serving image is an nvidia/cuda base + the engine
(vLLM) + your model config. The NVIDIA Container Toolkit exposes the GPU to the container;
the engine serves an OpenAI-compatible API and exports Prometheus /metrics,
exactly what your cluster runs. [1]
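To make the /metrics side concrete, here is a minimal sketch of reading a Prometheus text-format payload in plain Python. The function name and the sample metric names are hypothetical (real metric names depend on the engine and its version), and the parser assumes samples carry no trailing timestamp.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {"name{labels}": value}.

    Simplification (assumption): each sample line is `name{labels} value`
    with no optional timestamp at the end.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; label values may
        # themselves contain spaces, so split from the right.
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples


if __name__ == "__main__":
    # Made-up payload for illustration only.
    sample = (
        "# HELP example_requests_running Requests currently running.\n"
        "# TYPE example_requests_running gauge\n"
        'example_requests_running{model="m"} 3\n'
        "example_gpu_cache_usage 0.25\n"
    )
    for key, value in parse_prometheus_text(sample).items():
        print(key, value)
```

In practice a Prometheus server scrapes this endpoint for you; a parser like this is only useful for quick checks from a script.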
The critical twist: a naive readiness probe passes as soon as the process starts, but an LLM server isn't
ready until it has loaded tens of GB of weights into VRAM (the
cold start). If readiness passes too early, the load balancer
routes traffic to a pod that cannot serve yet, so requests fail or time out. Gate readiness on the engine's actual ready state, with a
generous startupProbe (Kubernetes) or start-period (Docker HEALTHCHECK). [1]
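The gating logic can be sketched as a small polling loop. This is an illustration of the idea, not how the kubelet implements probes: `wait_until_ready` and `http_health_check` are hypothetical helper names, the timeout and interval defaults are placeholders, and using an HTTP GET against a health endpoint as the "ready" signal is an assumption about your engine.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    Failures while the weights are loading are expected and tolerated
    until the overall budget runs out, like a startup probe.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if check():
                return True
        except OSError:
            # Connection refused, timeouts, and HTTP error statuses
            # (urllib raises these as OSError subclasses) all mean "not yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_health_check(url: str) -> Callable[[], bool]:
    """Build a check that passes only on HTTP 200 from `url`."""

    def check() -> bool:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200

    return check
```

A caller would pass something like `http_health_check("http://localhost:8000/health")`, where the host, port, and path are assumptions to adjust for your server. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.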
NVIDIA NIM ships prebuilt, optimized inference containers (often with TensorRT-LLM engines) and a standard API: the fast path, with less to tune. A custom Dockerfile (CUDA base + vLLM) gives full control over versions, flags, and patches. The trade-off is NIM's convenience and tuned performance versus the flexibility of a custom image. [2]
This is your daily work — with three GPU twists: the pod must request
nvidia.com/gpu (scheduled via the device plugin + Container Toolkit, not plain CPU/mem);
the readiness/startup probe must wait out a multi-GB weight load (a normal probe marks it ready far
too early); and images are huge (CUDA + weights), so layer caching, registry locality, and
weights-mounted-vs-baked are real decisions. NIM is "use the vendor's Helm chart"; a custom Dockerfile is
"build your own."
Your vLLM deployments are containers on OpenShift, exposing the
OpenAI-compatible API + Prometheus /metrics you've been hitting all course, with the GPU wired
in via the Container Toolkit. Given the ~27 GB FP8 weight load, your readiness gating matters: make
sure the probe waits for the engine to report ready, or rollouts (Lesson
29) will route into cold pods.
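As a rough sketch of how the ~27 GB figure turns into probe settings, the snippet below estimates a startup budget and builds a minimal Pod manifest as a Python dict (Kubernetes accepts JSON manifests). Everything not stated in the lesson is an assumption for illustration: the storage read rate, the fixed overhead, the safety factor, the `/health` path, port 8000, the image name, and both function names.

```python
import json
import math


def startup_budget_s(weights_gb: float, read_gb_per_s: float,
                     overhead_s: float = 60.0,
                     safety_factor: float = 2.0) -> float:
    """Estimate how long the startup probe should tolerate failures."""
    load_s = weights_gb / read_gb_per_s  # time to read weights from storage
    return (load_s + overhead_s) * safety_factor


def vllm_pod_spec(image: str, budget_s: float, gpus: int = 1,
                  port: int = 8000, period_s: int = 10) -> dict:
    """Return a minimal Pod manifest with a GPU request and gated probes."""
    # Round up so failureThreshold * periodSeconds covers the whole budget.
    failure_threshold = math.ceil(budget_s / period_s)
    probe = {"httpGet": {"path": "/health", "port": port}}  # assumed endpoint
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "vllm-server"},
        "spec": {
            "containers": [{
                "name": "vllm",
                "image": image,
                "ports": [{"containerPort": port}],
                # GPUs are an extended resource, requested under limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
                # The startup probe holds off the readiness probe until
                # it succeeds once, which covers the weight load.
                "startupProbe": {**probe, "periodSeconds": period_s,
                                 "failureThreshold": failure_threshold},
                "readinessProbe": {**probe, "periodSeconds": period_s},
            }],
        },
    }


if __name__ == "__main__":
    # 27 GB at an assumed 0.5 GB/s: (54 s + 60 s) * 2 = 228 s budget.
    budget = startup_budget_s(weights_gb=27, read_gb_per_s=0.5)
    # Placeholder image reference, not a real registry path.
    spec = vllm_pod_spec("registry.example.com/vllm-serving:TAG", budget)
    print(json.dumps(spec, indent=2))
```

Measure the real cold-start time on your cluster before settling on numbers; the read rate in particular varies widely between network storage and local disk.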