Inference Engineering · Lesson 27 · Containerization: Docker & NIM

Containerization: Docker & NIM

Packaging inference — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict the GPU-specific parts of containerizing an inference server, and when NIM beats a custom Dockerfile.

The setup

You containerize services every day. An LLM server is the same idea — with a few GPU twists.

Step 1 — the GPU access piece

Step 2 — the probe twist

Recall — cover the screen: the GPU-specific gotchas of containerizing inference.
Request nvidia.com/gpu and use the NVIDIA Container Toolkit (GPU access); gate readiness/startup probes on the multi-GB weight load completing (not on process start); and handle huge images (CUDA + weights) through layer caching, registry locality, and the choice between baked and mounted weights.

Step 3 — NIM vs custom

In Kubernetes terms (infra bridge)

Your daily work + 3 GPU twists: pod requests nvidia.com/gpu (device plugin), readiness/startup probe waits out the weight load (normal probe is ready far too early), and huge images make layer caching / registry locality / baked-vs-mounted weights real choices. NIM = vendor Helm chart; custom Dockerfile = build your own.
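To make the baked-vs-mounted choice concrete, here is a rough back-of-the-envelope sketch. The 27 GB weight size comes from this lesson; the 8 GB runtime layer size and the link speeds are hypothetical, and the estimate ignores layer caching, compression, and unpacking time.

```python
def pull_seconds(size_gb: float, gbit_per_s: float) -> float:
    """Rough transfer time: gigabytes * 8 bits per byte / link speed in Gbit/s."""
    return size_gb * 8 / gbit_per_s

RUNTIME_GB = 8.0   # hypothetical size of the CUDA + engine layers
WEIGHTS_GB = 27.0  # FP8 weights, as quoted in this lesson

for link in (1.0, 10.0):  # hypothetical registry bandwidth in Gbit/s
    baked = pull_seconds(RUNTIME_GB + WEIGHTS_GB, link)  # weights inside the image
    mounted = pull_seconds(RUNTIME_GB, link)             # weights on a separate volume
    print(f"{link:>4} Gbit/s: baked ~{baked:.0f}s, mounted ~{mounted:.0f}s per cold node")
```

With mounted weights the image pull is smaller, but the weights still have to reach the node somehow, so the cost moves to the volume rather than disappearing.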

On YOUR cluster (your setup)

Your vLLM runs as containers on OpenShift (OpenAI API + Prometheus /metrics, GPU via Container Toolkit). With ~27 GB of FP8 weights, readiness gating is the thing to get right, or rollouts (Lesson 29) route into cold pods.

Read this next — primary source NVIDIA NIM docs · runnable: day22 notebook.

Final check — teach it back

Explain to a colleague: "Containerizing our LLM is like our other services except…"
…the pod needs a GPU (nvidia.com/gpu + Container Toolkit), the readiness probe must wait for ~27 GB of weights to load into VRAM (or we route traffic to a pod that can't serve yet), and the images are huge, so we care about layer caching and whether weights are baked in or mounted. NIM gives us a prebuilt optimized image; a custom Dockerfile gives us control.

Containerization: Docker & NIM

Packaging an inference server — the GPU-specific bits, and NIM vs a custom image.

Today's win: you'll explain how to package an inference server into a reproducible container, the GPU-specific pieces (CUDA base + the Container Toolkit + readiness gating), and when to reach for NVIDIA NIM versus rolling your own Dockerfile.

The picture: a fully-stocked food truck

A container is a food truck: the kitchen, the recipes, and the equipment packed into one unit that runs the same anywhere. For inference you also need the gas line to the ovens (the GPU, exposed via the NVIDIA Container Toolkit) — and the truck can't take orders until the pantry is fully stocked (weights loaded). This is your world; the twists are GPU-specific.

the packed food truck → the container image (CUDA base + vLLM + config)
the gas line to the ovens → NVIDIA Container Toolkit: GPU access in the container
"open" sign only when stocked → readiness probe waits for weight load
a pre-built franchise truck → NVIDIA NIM: prebuilt optimized image

1 · The inference container

A serving image is an nvidia/cuda base + the engine (vLLM) + your model config. The NVIDIA Container Toolkit exposes the GPU to the container; the engine serves an OpenAI-compatible API and exports Prometheus /metrics, which is exactly what your cluster runs. [1]
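As a minimal sketch of the GPU-access piece, the snippet below assembles a `docker run` command line. The `--gpus all` flag is what asks Docker for GPU access and relies on the NVIDIA Container Toolkit being installed on the host. The image name and model path are hypothetical, and the sketch assumes the image's entrypoint is the vLLM OpenAI-compatible server, so it accepts `--model` and `--port`.

```python
import shlex

def docker_run_args(image: str, model: str, port: int = 8000) -> list[str]:
    """Build a `docker run` command line for a GPU-backed serving container."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",         # GPU access; needs the NVIDIA Container Toolkit on the host
        "-p", f"{port}:{port}",  # publish the port that serves the API and /metrics
        image,
        "--model", model,        # assumed: the entrypoint accepts vLLM server flags
        "--port", str(port),
    ]

# Hypothetical image and model path, for illustration only.
args = docker_run_args("registry.example.com/llm/vllm-serving:dev", "/models/qwen-fp8")
print(shlex.join(args))
```

Without `--gpus` (or the equivalent runtime configuration) the container starts but sees no GPU, which is the first thing to check when an engine fails at startup.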

2 · Probes must wait for the weights

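The critical twist: a naive readiness probe (a TCP check, or an endpoint that answers as soon as the process is up) passes moments after the process starts, but an LLM server isn't ready until it has loaded tens of GB of weights into VRAM (the cold start). If readiness passes too early, the load balancer routes traffic to a pod that can't serve yet, and those requests fail. Gate readiness on the engine's actual ready state, and allow a generous startup window (a Kubernetes startupProbe, or --start-period on a Docker HEALTHCHECK). [1]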

[Figure: timeline. start process → load weights into VRAM → READY (take traffic). Readiness must NOT pass during the weight load ✗; it should pass only at READY ✓.]
Readiness gated on weight-load completion — not process start. Otherwise the LB sends traffic to a pod that can't serve yet.
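One way to size the startup window is to derive the probe's failure threshold from a measured weight-load time. The sketch below builds a Kubernetes-style startupProbe as a plain dict; the 300-second load time, the 1.5x safety factor, and the `/health` path on port 8000 are assumptions to replace with your own measurements and your engine's actual health endpoint.

```python
import math

def startup_probe(load_seconds: float, period_seconds: int = 10,
                  safety: float = 1.5, path: str = "/health", port: int = 8000) -> dict:
    """Size a startupProbe so its total budget covers the weight load with headroom."""
    # Kubernetes gives up after failureThreshold consecutive failures, so the
    # startup budget is roughly periodSeconds * failureThreshold.
    failure_threshold = math.ceil(load_seconds * safety / period_seconds)
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }

probe = startup_probe(load_seconds=300)  # hypothetical measured load time
budget = probe["periodSeconds"] * probe["failureThreshold"]
print(f"failureThreshold={probe['failureThreshold']}, startup budget={budget}s")
```

While the startupProbe is still failing, Kubernetes holds off the readiness and liveness probes, so the pod is neither sent traffic nor restarted during the load.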

3 · NIM vs a custom Dockerfile

NVIDIA NIM ships prebuilt, optimized inference containers (often with TensorRT-LLM engines) and a standard API: the fast path, with less to tune. A custom Dockerfile (CUDA base + vLLM) gives full control over versions, flags, and patches. Choosing the custom route trades NIM's convenience and pre-tuned performance for that flexibility. [2]

In Kubernetes terms (infra bridge)

This is your daily work — with three GPU twists: the pod must request nvidia.com/gpu (scheduled via the device plugin + Container Toolkit, not plain CPU/mem); the readiness/startup probe must wait out a multi-GB weight load (a normal probe marks it ready far too early); and images are huge (CUDA + weights), so layer caching, registry locality, and weights-mounted-vs-baked are real decisions. NIM is "use the vendor's Helm chart"; a custom Dockerfile is "build your own."
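The GPU request and the mounted-vs-baked choice can be sketched as a pod spec built in Python. The `nvidia.com/gpu` resource name comes from the lesson; the container name, image, mount path, and claim name are hypothetical, and a real manifest would also carry the probes discussed above.

```python
import json
from typing import Optional

def serving_pod_spec(image: str, gpus: int = 1,
                     weights_claim: Optional[str] = None) -> dict:
    """Pod spec for an inference container; mounts weights only if a claim is given."""
    container = {
        "name": "llm",
        "image": image,
        "ports": [{"containerPort": 8000}],
        # GPUs are an extended resource exposed by the device plugin and set under limits.
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }
    spec = {"containers": [container]}
    if weights_claim is not None:
        # Mounted weights: the image stays small, the model lives on a volume.
        container["volumeMounts"] = [
            {"name": "weights", "mountPath": "/models", "readOnly": True}
        ]
        spec["volumes"] = [
            {"name": "weights", "persistentVolumeClaim": {"claimName": weights_claim}}
        ]
    return spec

baked = serving_pod_spec("registry.example.com/llm/vllm-with-weights:dev")
mounted = serving_pod_spec("registry.example.com/llm/vllm-serving:dev",
                           weights_claim="qwen-weights")
print(json.dumps(mounted, indent=2))
```

The two variants differ only in where the weights live, which is why the decision is mostly about image size, pull time, and how often the weights change relative to the engine.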

On YOUR cluster: this is how Qwen runs (your setup)

Your vLLM deployments are containers on OpenShift, exposing the OpenAI-compatible API + Prometheus /metrics you've been hitting all course, with the GPU wired in via the Container Toolkit. Given the ~27 GB FP8 weight load, your readiness gating matters: make sure the probe waits for the engine to report ready, or rollouts (Lesson 29) will route into cold pods.
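To see what "waits for the engine to report ready" means in practice, here is a small client-side sketch that polls a health URL until it answers 200 or a deadline passes. It assumes the engine exposes an HTTP health endpoint (the `/health` path and port in the usage comment are assumptions), and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

def http_ready(url: str, timeout_s: float = 2.0) -> bool:
    """True only if the endpoint answers HTTP 200; errors and refusals count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(check: Callable[[], bool], timeout_s: float,
                     interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)

# Usage against a running server (endpoint path and port are assumptions):
# ok = wait_until_ready(lambda: http_ready("http://localhost:8000/health"), timeout_s=600)
```

Timing this loop during a cold start is also a simple way to measure the weight-load time that the probe settings need to cover.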

Read this next — primary source NVIDIA NIM docs. Runnable companion: day22 notebook — a vLLM Dockerfile vs NIM.

Check yourself (recall, don't peek)

References
  1. Containerizing inference (Docker + Container Toolkit, probes) — day22 (containerization-docker-nim.ipynb).
  2. NVIDIA NIM — docs.nvidia.com/nim.