Inference Engineering · Lesson 17 · Quantization Algorithms

Quantization Algorithms

Dropping bits without dropping quality — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict why naive rounding fails and how GPTQ, AWQ, and SmoothQuant each fix it — enough to choose a quantized checkpoint on purpose.

The setup

Lesson 16 said quantization writes each weight with fewer bits. The naive way, round-to-nearest (RTN), simply rounds every weight to the nearest low-bit value.

Step 1 — why RTN struggles

Step 2 — GPTQ's idea

Recall — cover the screen: what do all three algorithms have in common?
They're all outlier strategies. Naive rounding ruins a few large or important values; GPTQ compensates for the rounding error, AWQ protects the salient weights, and SmoothQuant migrates activation outliers into the weights. Each uses a small calibration set to learn what matters.

Step 3 — AWQ's idea

Step 4 — SmoothQuant's idea

Step 5 — which applies to you?

In Kubernetes terms (infra bridge)

RTN = a flat resource cap on every pod (throttles the critical ones). AWQ/SmoothQuant = profile-guided right-sizing: sample real activations (a calibration set ≈ representative traffic) to find the hot path and protect it, capping the rest. Measure before you squeeze.

On YOUR cluster (context)

Your Qwen is FP8 (W8A8). FP8's wide dynamic range usually tolerates outliers without these tricks, so FP8 largely "just works". The algorithms matter when you go lower: AWQ- or GPTQ-INT4 to fit a bigger model on a smaller GPU, or SmoothQuant-INT8 on hardware without FP8. Now ...-AWQ in a checkpoint name means something to you.

Read this next — primary source AWQ — Lin et al. · runnable: day14 notebook.

Final check — teach it back

Explain to a colleague: "We can't just round weights to 4 bits because…"
…a few outlier values round badly and wreck quality. So we use GPTQ (compensate for the rounding error using Hessian information), AWQ (protect the ~1% of salient weights tied to large activations), or SmoothQuant (shift activation outliers into the weights so both quantize well). Each uses calibration data to learn what to protect.
References
  1. day14 — quant algorithms (notebook); GPTQ, AWQ, SmoothQuant.

Quantization Algorithms

GPTQ, AWQ, SmoothQuant — how to drop bits without dropping quality.

Today's win: you'll explain why naive rounding loses accuracy, and how the three big post-training methods — GPTQ, AWQ, SmoothQuant — each beat it by handling outliers, so you can pick a quantized checkpoint with confidence.

The picture: which measurements must stay precise?

Lesson 16 said quantization = writing each number with fewer significant figures. But you learned the catch: most measurements survive a coarse cup, while a pinch of saffron (an outlier) ruins the dish if rounded. These algorithms are the smart strategies for which values to protect and how to compensate for the rounding you do.

round everything to the nearest cup → RTN (round-to-nearest), the naive baseline
adjust later steps to cancel the error → GPTQ
keep the saffron exact → AWQ (protect salient weights)
move the hard-to-measure part elsewhere → SmoothQuant (migrate outliers)

1 · The baseline: round-to-nearest, and why it hurts

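The naive method, RTN, just rounds every weight to the nearest value on the low-bit grid. It's instant and free, but accuracy slips: the grid has to span the largest weight, so a few outliers either stretch it (leaving every other weight in the group on a coarser grid) or get clipped, and the resulting error propagates through the layer. At INT4 especially, naive RTN can wreck a model. [1]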

[Figure: typical weights round cleanly to the grid; an outlier sits far from any grid point, so RTN rounds it badly and the error is large.]
Naive RTN treats every weight equally. The few outliers — the saffron — are exactly what breaks. Every good algorithm is an outlier strategy.
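To make the outlier problem concrete, here is a minimal NumPy sketch of RTN, assuming symmetric INT4 with a single absmax scale per group. The function name `rtn_quantize`, the group size of 128, and the toy weight values are illustrative assumptions, not taken from the lesson's notebook. It quantizes the same weights twice, once with a single planted outlier, and compares the rounding error on the ordinary weights.

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Symmetric round-to-nearest with one absmax scale for the whole group."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for INT4
    scale = np.abs(w).max() / qmax              # grid step: must span the largest weight
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale                     # dequantized weights, grid step

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=128)             # toy "typical" weights (assumed values)
w_out = w.copy()
w_out[0] = 0.5                                  # plant a single outlier

deq, step = rtn_quantize(w)
deq_out, step_out = rtn_quantize(w_out)

# Compare the error on the ordinary weights only (index 0 is the outlier slot).
err = np.abs(w[1:] - deq[1:]).mean()
err_out = np.abs(w_out[1:] - deq_out[1:]).mean()
print(f"grid step without outlier: {step:.4f}, with outlier: {step_out:.4f}")
print(f"mean error on ordinary weights: {err:.4f} vs {err_out:.4f}")
```

Under this scaling scheme the outlier itself lands on the grid; the damage shows up in everything that shares its scale. Smaller groups limit how many weights one outlier can hurt, which is one reason group-wise scales are common in low-bit checkpoints.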

2 · GPTQ — compensate the error as you go

GPTQ quantizes one layer at a time and, after rounding each weight, nudges the remaining weights to cancel the error it just introduced, using second-order (Hessian) information about which adjustments matter. It needs a small calibration dataset to estimate that Hessian. Result: solid INT4 weights with little quality loss. [1]

Pantry: after rounding the flour down, add a touch more butter to keep the recipe balanced — each rounding is compensated by the next adjustment.
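The "round, then rebalance" idea can be sketched in a few lines of NumPy. This is a simplified, unoptimized illustration in the spirit of GPTQ, not the paper's implementation: it downdates the inverse Hessian explicitly instead of using GPTQ's Cholesky-based, blocked formulation, and it uses one absmax scale per output row. The names `to_grid` and `quantize_with_compensation`, the damping value, and the synthetic calibration data are assumptions made for the example.

```python
import numpy as np

def to_grid(w, scale, bits=4):
    """Round to the nearest point on a symmetric low-bit grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def quantize_with_compensation(W, X, bits=4, damp=0.01):
    """Quantize W (out, in) one input column at a time, pushing each column's
    rounding error onto the columns that are not quantized yet.
    X (samples, in) is the calibration input to the layer."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax    # one scale per output row
    H = X.T @ X                                            # Hessian of the layer's squared output error (up to a constant)
    H = H + damp * np.mean(np.diag(H)) * np.eye(n_in)      # damping keeps H invertible
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(n_in):
        Q[:, i] = to_grid(W[:, i], scale[:, 0], bits)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])     # compensate on remaining columns
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]  # drop column i from the problem
    return Q, scale

# Synthetic calibration data with correlated features (an assumption for the demo).
rng = np.random.default_rng(0)
factors = rng.normal(size=(512, 4))
X = factors @ rng.normal(size=(4, 32)) + 0.5 * rng.normal(size=(512, 32))
W = rng.normal(size=(16, 32))

W_rtn = to_grid(W, np.abs(W).max(axis=1, keepdims=True) / 7)
W_cmp, scale = quantize_with_compensation(W, X)

def output_error(Wq):
    """Distance between the layer's original and quantized outputs."""
    return np.linalg.norm(X @ W.T - X @ Wq.T)

print(f"RTN output error:         {output_error(W_rtn):.3f}")
print(f"compensated output error: {output_error(W_cmp):.3f}")
```

Both results sit on the same INT4 grid; the only difference is which grid points get chosen. Compensation helps most when input features are correlated, because a rounding error in one column can then be offset by small shifts in the others.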

3 · AWQ — protect the weights that matter

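AWQ (Activation-aware Weight Quantization) notices that the weights multiplying the largest activations matter most. It uses calibration activations to identify the roughly 1% of weight channels that are salient and scales them up before quantization (dividing the matching activations by the same factor), so their relative rounding error shrinks while everything still goes low-bit. [2]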

Pantry: measure the saffron and chili to the milligram; eyeball the potatoes. Protect the few ingredients the dish actually hinges on.
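A rough sketch of the activation-aware scaling idea, in NumPy. It follows the general recipe described in the AWQ paper (per-channel scale = average activation magnitude raised to a power alpha, with alpha chosen by a small grid search against the layer's output error), but it is a toy under stated assumptions: the names `rtn_rows` and `awq_style_quantize`, the per-row absmax grid, and the synthetic data are all illustrative, and real implementations add group-wise scales, scale normalization, and fused kernels.

```python
import numpy as np

def rtn_rows(W, bits=4):
    """Plain round-to-nearest with one absmax scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def awq_style_quantize(W, X, bits=4, n_grid=11):
    """Search a per-input-channel scale s = mean|x|**alpha, quantize W * s,
    then fold 1/s back in. Returns (output error, chosen alpha, effective weights)."""
    act_mag = np.abs(X).mean(axis=0)            # how "loud" each input channel is
    Y_ref = X @ W.T
    best = None
    for alpha in np.linspace(0.0, 1.0, n_grid): # alpha = 0 is plain RTN
        s = act_mag ** alpha
        Wq = rtn_rows(W * s, bits) / s          # same as scaling X by 1/s at runtime
        err = np.linalg.norm(Y_ref - X @ Wq.T)
        if best is None or err < best[0]:
            best = (err, alpha, Wq)
    return best

# Toy layer where one input channel carries much larger activations (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))
X[:, 3] *= 30.0
W = rng.normal(size=(64, 64))

awq_err, alpha, W_awq = awq_style_quantize(W, X)
rtn_err = np.linalg.norm(X @ W.T - X @ rtn_rows(W).T)
print(f"RTN error: {rtn_err:.2f}  scaled error: {awq_err:.2f}  alpha: {alpha:.1f}")
```

Because alpha = 0 reproduces plain RTN, the search can never do worse than the baseline on the calibration data. Whether it also generalizes depends on how representative that data is.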

4 · SmoothQuant — move the difficulty

Quantizing activations (not just weights) is hard because activations have wild outliers. SmoothQuant rescales each input channel, dividing the activations and multiplying the matching weights by the same factor, to migrate that difficulty from the activations into the weights, so both become easy to quantize. That enables 8-bit weights and activations (W8A8). [3]
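The rescaling can be written down directly. Below is a minimal NumPy sketch using the per-channel smoothing factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with alpha = 0.5 as a balanced default. The function name `smooth`, the layer shapes, and the planted outlier channel are assumptions for illustration; in practice the division is folded into the preceding layer offline rather than applied at runtime.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Move quantization difficulty from activations X (samples, in)
    into weights W (out, in) with one factor per input channel."""
    act_max = np.abs(X).max(axis=0)             # per-channel activation range
    w_max = np.abs(W).max(axis=0)               # per-channel weight range
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return X / s, W * s, s

# Toy layer with one activation channel 50x larger than the rest (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))
X[:, 5] *= 50.0
W = rng.normal(size=(32, 64))

X_s, W_s, s = smooth(X, W)

print("layer output unchanged:", np.allclose(X_s @ W_s.T, X @ W.T))
print(f"largest activation before: {np.abs(X).max():.1f}, after: {np.abs(X_s).max():.1f}")
print(f"largest weight before:     {np.abs(W).max():.1f}, after: {np.abs(W_s).max():.1f}")
```

The product is mathematically identical before and after; only the ranges change. The activation outlier shrinks and the matching weight column grows, which is the trade the method makes so that both sides fit an 8-bit grid.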

GPTQ: round, then adjust the rest to cancel error (weight-only INT4)
AWQ: protect the ~1% salient weights, chosen by activation (weight-only INT4)
SmoothQuant: migrate outliers from activations → weights (W8A8, weights + activations)
Three angles on the same enemy — outliers. GPTQ & AWQ make weight-only low-bit work; SmoothQuant unlocks quantizing activations too.

In Kubernetes terms (infra bridge)

RTN is a flat resource cap on every pod: simple, but it throttles the latency-critical ones. AWQ/SmoothQuant are profile-guided right-sizing: they sample real activations (a calibration set ≈ a representative traffic sample) to learn which weights are the hot path and protect those, capping the rest hard. You measure before you squeeze, instead of capping blind.

On YOUR cluster: where this applies (context)

Your Qwen runs FP8 (W8A8), whose wide dynamic range often tolerates outliers without these tricks, so FP8 largely "just works." These algorithms earn their keep when you go lower: an AWQ- or GPTQ-INT4 checkpoint to fit a bigger model on a smaller GPU, or SmoothQuant for INT8 on hardware without FP8. Knowing them lets you read a checkpoint's name (e.g. ...-AWQ) and know what you're getting.
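For a feel of why going from FP8 to INT4 lets a bigger model fit, here is the back-of-envelope arithmetic for weight memory alone. The 32-billion parameter count is a hypothetical figure chosen for the example, not your model's actual size, and the helper name `weight_gib` is made up. The estimate ignores the KV cache, activations, and the per-group scales and zero-points that INT4 checkpoints carry, so real footprints are somewhat larger.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales and other overhead."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 32e9  # hypothetical parameter count, for illustration only

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name:>4}: {weight_gib(N_PARAMS, bits):6.1f} GiB")
```

Halving the bits halves the weight footprint, which is the whole appeal; the algorithms above are what keep quality acceptable once you get there.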

Read this next — primary source AWQ — Lin et al. Runnable companion: day14 notebook — GPTQ, AWQ, SmoothQuant compared.

Check yourself (recall, don't peek)

References
  1. GPTQ — Frantar et al. (2210.17323); day14 (notebook).
  2. AWQ: Activation-aware Weight Quantization — Lin et al. (2306.00978).
  3. SmoothQuant — Xiao et al. (2211.10438).