SmoothQuant: What Activation-Aware Quantization Fixes

Why naive INT8 breaks on large models, and how SmoothQuant and activation-aware methods like AWQ recover near-FP16 accuracy.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 12 min read

Quick answer: Naive INT8 quantization breaks on large language models because of activation outliers: a handful of feature dimensions whose values are 10–100× larger than the rest. These outliers stretch the quantization range so far that most normal values collapse onto the few integer levels around zero, destroying accuracy. SmoothQuant fixes this by mathematically shifting the difficulty from the hard-to-quantize activations into the easy-to-quantize weights, so both become INT8-friendly. Activation-aware weight quantization (AWQ) takes a different route: it keeps activations in FP16 and uses activation statistics to identify and protect the ~1% of weights that matter most, quantizing the rest aggressively to 4 bits. Both stay close to FP16 accuracy where naive INT8 fails.


Why naive INT8 fails on large language models

For small models and CNNs, you can quantize weights and activations to INT8 with barely any accuracy loss. For LLMs above ~6.7B parameters, the same recipe falls off a cliff. To understand why, you need to know what makes LLM activations special.

The activation outlier problem

When you run text through a transformer, the intermediate values (activations) flowing between layers are not evenly distributed. In specific, consistent feature dimensions — often just a handful out of thousands — the magnitudes explode. While most activation values sit in a tidy range like −1 to +1, these outlier channels routinely hit values of 70, 100, or more.

This matters enormously for quantization, because quantization maps a continuous range onto a fixed grid of integers using the maximum value to set the scale.

Here's the failure, made concrete. Suppose a tensor's normal values are in [−1, 1] but one outlier channel reaches 100. INT8 has 256 levels spread across the full range:

| Scenario | Range to cover | Scale (step size) | What happens to a normal value of 0.5 |
|---|---|---|---|
| No outlier | [−1, 1] | 2 / 255 ≈ 0.0078 | 0.5 → level 64, dequantizes to ~0.5 ✓ |
| One outlier of 100 | [−100, 100] | 200 / 255 ≈ 0.78 | 0.5 → level 1, dequantizes to ~0.78 ✗ |

With the outlier present, every normal value smaller than ~0.4 rounds to zero. The model's fine-grained information is annihilated to make room for one giant number. That is why naive INT8 — specifically INT8 activation quantization — wrecks LLM accuracy.
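
The collapse described above can be reproduced in a few lines. This is a minimal sketch in plain Python, not taken from any quantization library: the helper name and the sample values are made up for illustration, and it uses the common symmetric convention of 127 levels per side, so its step sizes differ slightly from the 2/255 and 200/255 figures in the table.

```python
def quantize_dequantize_int8(values):
    """Symmetric per-tensor INT8: a single scale derived from the max magnitude."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0  # width of one integer step
    restored = []
    for v in values:
        level = max(-127, min(127, round(v / scale)))  # nearest integer level
        restored.append(level * scale)                 # map back to float
    return restored, scale


normal = [0.5, -0.3, 0.1, 0.9, -1.0]   # illustrative "normal" activations
with_outlier = normal + [100.0]        # same values plus one outlier channel

clean, clean_scale = quantize_dequantize_int8(normal)
dirty, dirty_scale = quantize_dequantize_int8(with_outlier)

# Compare reconstruction error on the five normal values only
clean_err = max(abs(a - b) for a, b in zip(normal, clean))
dirty_err = max(abs(a - b) for a, b in zip(normal, dirty[:5]))

print(f"step without outlier: {clean_scale:.4f}, max error {clean_err:.4f}")
print(f"step with outlier:    {dirty_scale:.4f}, max error {dirty_err:.4f}")
```

The only difference between the two runs is the single value of 100, yet the step size grows by a factor of 100 and small values such as −0.3 and 0.1 land on level 0.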

Why weights are fine but activations aren't

A key asymmetry: LLM weights are easy to quantize, activations are hard. Weight distributions are flat and uniform, with no extreme outliers, so quantizing only the weights works well (GPTQ, for example, is a weight-only method). The trouble is on the activation side, and it appears when you try to quantize activations to INT8 to get faster integer matrix multiplication. bitsandbytes' LLM.int8() sidesteps it by keeping the outlier dimensions in FP16 and running only the remaining dimensions in INT8.

This asymmetry is exactly the lever SmoothQuant pulls.


How SmoothQuant works: migrating the difficulty

SmoothQuant's core idea is to mathematically transfer the quantization difficulty from activations (hard) to weights (easy), so that both end up in a range that INT8 can represent accurately.

It exploits a simple algebraic fact about how a linear layer computes its output: Y = X · W. You can divide each input channel of the activations by a per-channel factor s, and multiply the corresponding weight rows by the same s, and the result Y is mathematically unchanged:

Y = X · W = (X / s) · (s · W)

Choose s so that the outlier channels in X get scaled down (becoming quantization-friendly) while the weights W absorb the scaling and get scaled up by a manageable amount. Because weights were uniform to begin with, they tolerate this stretching without developing problematic outliers of their own.
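
The identity is easy to check numerically. The sketch below uses plain Python lists and hand-picked numbers (the matrices and the factor of 10 are assumptions chosen for illustration): it divides the columns of X by s, multiplies the matching rows of W by s, and confirms that the output is unchanged while the activation peak shrinks.

```python
def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# 2 tokens x 3 input channels; channel 1 carries the outlier
X = [[0.5, 80.0, -0.2],
     [-0.8, -95.0, 0.4]]
# 3 input channels x 2 output channels
W = [[0.3, -0.1],
     [0.02, 0.05],
     [-0.4, 0.2]]
s = [1.0, 10.0, 1.0]  # hypothetical per-input-channel smoothing factors

X_smooth = [[x / s[j] for j, x in enumerate(row)] for row in X]  # scale columns of X down
W_smooth = [[w * s[j] for w in row] for j, row in enumerate(W)]  # scale rows of W up

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)

max_diff = max(abs(a - b) for ra, rb in zip(Y, Y_smooth) for a, b in zip(ra, rb))
act_max_before = max(abs(x) for row in X for x in row)
act_max_after = max(abs(x) for row in X_smooth for x in row)
print(max_diff, act_max_before, act_max_after)
```

Any difference between the two outputs is floating-point rounding; the transformation itself is exact.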

The smoothing factor and migration strength

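For each input channel j, the scale is computed from the largest activation magnitude in that channel, max|X_j| (measured on calibration data), and the largest magnitude in the matching weight row, max|W_j|: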

s_j = (max|X_j|)^α / (max|W_j|)^(1−α)

The hyperparameter α (migration strength) controls how much difficulty moves from activations to weights:

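| α value | Effect |
|---|---|
| α = 0 | The scale ignores activation magnitudes, so the activation outliers stay put and INT8 still fails (difficulty actually shifts from weights onto activations) |
| α = 0.5 | Balanced split; the common default for most models |
| α → 1 | Push nearly all difficulty into the weights (can over-stress them) |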
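
To see how α trades activation range against weight range, here is a small sketch of the formula in plain Python. The per-channel maxima are invented numbers standing in for calibration statistics, and the helper names are hypothetical.

```python
def smoothing_scales(act_max, weight_max, alpha):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one value per input channel."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_max, weight_max)]


def ranges_after_smoothing(act_max, weight_max, alpha):
    """Largest activation and weight magnitude once X is divided and W multiplied by s."""
    s = smoothing_scales(act_max, weight_max, alpha)
    new_act = [a / sj for a, sj in zip(act_max, s)]
    new_w = [w * sj for w, sj in zip(weight_max, s)]
    return max(new_act), max(new_w)


act_max = [1.0, 100.0, 0.8, 1.2]     # made-up per-channel max|X_j|; channel 1 is the outlier
weight_max = [0.5, 0.4, 0.6, 0.5]    # made-up per-channel max|W_j|

for alpha in (0.25, 0.5, 0.75):
    act_peak, w_peak = ranges_after_smoothing(act_max, weight_max, alpha)
    print(f"alpha={alpha}: activation peak {act_peak:.2f}, weight peak {w_peak:.2f}")
```

With these numbers the activation peak falls and the weight peak rises as α grows, and at α = 0.5 the two peaks meet at the same value, which is the sense in which the split is balanced.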

The beauty is that this is a one-time, offline transformation. The scaling can be fused into the preceding layer's weights (e.g., the LayerNorm) so there is zero runtime overhead — the model just has pre-adjusted weights, and now both X and W quantize cleanly to INT8.
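
The fusion step can be sketched as follows, assuming the linear layer is fed by a LayerNorm with per-channel gain and bias. Dividing the gain and bias by s produces the smoothed activations directly, so no extra division happens at inference time. All values and helper names here are illustrative, not taken from a real model.

```python
import math


def layer_norm(h, gamma, beta, eps=1e-5):
    """LayerNorm over one token's hidden vector with per-channel gain and bias."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(h, gamma, beta)]


def linear(x, W):
    """x has one entry per input channel; W is [input channel][output channel]."""
    return [sum(xi * W[i][o] for i, xi in enumerate(x)) for o in range(len(W[0]))]


h = [0.2, -1.5, 0.7]              # one token's hidden state
gamma = [1.0, 40.0, 0.9]          # a large gain on channel 1 creates the outlier
beta = [0.0, 0.5, -0.1]
W = [[0.3, -0.1], [0.02, 0.05], [-0.4, 0.2]]
s = [1.0, 10.0, 1.0]              # hypothetical smoothing factors

# Offline: fold 1/s into the LayerNorm parameters and s into the weight rows
gamma_fused = [g / sj for g, sj in zip(gamma, s)]
beta_fused = [b / sj for b, sj in zip(beta, s)]
W_fused = [[w * s[i] for w in row] for i, row in enumerate(W)]

x_ref = layer_norm(h, gamma, beta)
x_fused = layer_norm(h, gamma_fused, beta_fused)
y_ref = linear(x_ref, W)
y_fused = linear(x_fused, W_fused)
print(max(abs(v) for v in x_ref), max(abs(v) for v in x_fused))
```

The fused model computes the same output from a much smaller activation range, using exactly the same operations as before.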

What SmoothQuant enables

Because it makes both activations and weights INT8-friendly, SmoothQuant unlocks W8A8 (8-bit weights, 8-bit activations). That means the actual matrix multiplications run as fast INT8 integer operations on the GPU's tensor cores — delivering real speedup and memory savings, not just smaller storage. This is the difference from weight-only methods, which still compute in FP16.
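
As a rough picture of what W8A8 compute means, the sketch below simulates it in plain Python: both operands are quantized to integers, the products are accumulated as integers, and a single floating-point rescale at the end recovers the result. It assumes per-tensor symmetric scales and toy matrices; real kernels do the same arithmetic on tensor cores with INT32 accumulators.

```python
def quantize_int8(matrix):
    """Per-tensor symmetric INT8 quantization; returns integer levels and the scale."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127.0
    levels = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return levels, scale


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


X = [[0.5, -0.25, 1.0], [0.75, 0.1, -0.6]]     # already-smooth activations (illustrative)
W = [[0.3, -0.1], [0.2, 0.05], [-0.4, 0.2]]

Xq, x_scale = quantize_int8(X)
Wq, w_scale = quantize_int8(W)

acc = matmul(Xq, Wq)  # integer-only multiply-accumulate
Y_int8 = [[v * x_scale * w_scale for v in row] for row in acc]  # one rescale per output
Y_ref = matmul(X, W)

max_err = max(abs(a - b) for ra, rb in zip(Y_int8, Y_ref) for a, b in zip(ra, rb))
print(f"max abs error vs float matmul: {max_err:.5f}")
```

Because the inner loop touches only integers, the expensive part of the layer can run on INT8 hardware paths, which is where the speedup comes from.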


How activation-aware quantization (AWQ) works: protect what matters

Activation-aware Weight Quantization (AWQ) takes a different route to the same goal: instead of moving outliers, it identifies the small fraction of weights that are most important, judged by the activations that flow through them, and protects them from quantization error.

The insight: not all weights matter equally. A small percentage (often ~1%) of weight channels are salient — quantizing them coarsely causes most of the accuracy loss, while the other 99% can be quantized aggressively with little harm.

How AWQ finds the important weights

Crucially, AWQ judges weight importance by the magnitude of the activations they multiply, not by the magnitude of the weights themselves. A weight that always gets multiplied by large activation values has a big impact on the output, even if the weight itself is small. By collecting activation statistics on a small calibration dataset, AWQ ranks which weight channels are salient.
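
A minimal sketch of that ranking, assuming saliency is measured as the mean absolute activation per input channel over the calibration tokens. The function names and numbers are hypothetical, and the toy layer has only four channels, so the "top fraction" is a single channel.

```python
def channel_saliency(calib_activations):
    """Mean absolute activation per input channel across calibration tokens."""
    n_tokens = len(calib_activations)
    n_channels = len(calib_activations[0])
    return [sum(abs(row[j]) for row in calib_activations) / n_tokens
            for j in range(n_channels)]


def top_salient(scores, fraction):
    """Indices of the highest-scoring channels, keeping at least one."""
    k = max(1, round(len(scores) * fraction))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# 3 calibration tokens x 4 input channels (made-up values)
calib = [
    [0.4, -0.2, 31.0, 0.1],
    [-0.6, 0.3, -28.5, 0.2],
    [0.5, -0.1, 35.2, -0.3],
]
# 4 input channels x 2 outputs; channel 2 has the smallest weights
W = [
    [0.90, -0.80],
    [0.70, 0.60],
    [0.05, -0.04],
    [0.30, 0.20],
]

by_activation = top_salient(channel_saliency(calib), fraction=0.25)
by_weight = top_salient([max(abs(w) for w in row) for row in W], fraction=0.25)
print("salient by activation:", by_activation)
print("salient by weight magnitude:", by_weight)
```

Ranking by weight magnitude picks channel 0, while ranking by activations picks channel 2, whose weights are the smallest in the layer but whose inputs are by far the largest.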

Protecting salient weights via scaling

Rather than store the salient weights in higher precision (which would complicate the hardware kernel with mixed precision), AWQ applies a per-channel scaling that effectively gives the important weight channels more of the quantization grid's resolution — reducing their relative quantization error while keeping the entire tensor in a uniform low-bit format. This keeps inference kernels simple and fast.
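
The effect of the scaling can be illustrated with a toy experiment. The sketch below quantizes random weights to INT4 with round-to-nearest, once as-is and once after multiplying the salient input channel by 2 (with the inverse factor undone afterwards, which in practice is folded into the activations). Everything here is an assumption made for illustration: the layer sizes, the Gaussian weights, the fixed factor of 2, and the single salient channel. AWQ itself searches for the scales rather than fixing them by hand.

```python
import random


def quantize_row_int4(row):
    """Symmetric round-to-nearest INT4 (levels -7..7) with one scale per row."""
    step = max(abs(w) for w in row) / 7.0
    return [max(-7, min(7, round(w / step))) * step for w in row]


def output_mse(W, x, scales):
    """Mean squared output error when input channel j is scaled by scales[j] before quantization."""
    total = 0.0
    for row in W:
        scaled = [w * s for w, s in zip(row, scales)]              # scale weights up
        effective = [q / s for q, s in zip(quantize_row_int4(scaled), scales)]  # undo the scale
        y_ref = sum(w * xi for w, xi in zip(row, x))
        y_quant = sum(w * xi for w, xi in zip(effective, x))
        total += (y_ref - y_quant) ** 2
    return total / len(W)


rng = random.Random(0)
n_in, n_out, salient = 64, 256, 5
W = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]  # one row per output
x = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]
x[salient] = 20.0  # the salient channel sees a much larger activation

no_protection = [1.0] * n_in
protected = [1.0] * n_in
protected[salient] = 2.0

err_rtn = output_mse(W, x, no_protection)
err_scaled = output_mse(W, x, protected)
print(f"round-to-nearest MSE: {err_rtn:.6f}")
print(f"with salient scaling: {err_scaled:.6f}")
```

Scaling the salient channel up mostly leaves each row's step size unchanged, so after the scale is undone that channel's rounding error is roughly halved, and since it dominates the output error the total drops with it.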

What AWQ is best at

AWQ shines at weight-only low-bit quantization, especially INT4 (W4A16): 4-bit weights with FP16 activations. It's one of the highest-quality 4-bit methods available and is widely used for running large models on consumer GPUs with minimal quality loss. It's particularly strong on instruction-tuned and multimodal models: because it needs no backpropagation or layer-wise reconstruction, it is less prone to overfitting the calibration set than reconstruction-based alternatives.
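
A back-of-the-envelope calculation shows why W4A16 matters on small GPUs. The helper below is a hypothetical estimate, not a measurement: it assumes every parameter is quantized, that each group of 128 weights stores one FP16 scale and one 4-bit zero point, and it ignores activations and the KV cache. Real checkpoints usually keep some layers in FP16, so actual sizes run somewhat higher.

```python
def weight_memory_gb(n_params, weight_bits, group_size=None, scale_bits=16, zero_bits=0):
    """Approximate weight storage in GB, including per-group scale and zero-point overhead."""
    total_bits = n_params * weight_bits
    if group_size:
        total_bits += (n_params / group_size) * (scale_bits + zero_bits)
    return total_bits / 8 / 1e9


n_params = 7e9  # a 7B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)
w4_gb = weight_memory_gb(n_params, 4, group_size=128, scale_bits=16, zero_bits=4)

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {w4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB to under 4 GB, which is the difference between needing a data-center card and fitting on a consumer one.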


SmoothQuant vs. AWQ vs. naive INT8: side by side

| Dimension | Naive INT8 (W8A8) | SmoothQuant | AWQ |
|---|---|---|---|
| Core problem solved | n/a (it's the baseline) | Activation outliers | Salient weight protection |
| Strategy | Quantize everything flat | Migrate difficulty A→W | Protect important weights |
| Typical target | W8A8 | W8A8 (8-bit weights+acts) | W4A16 (4-bit weights, FP16 acts) |
| Activations quantized? | Yes (and this breaks it) | Yes (now works) | No, kept in FP16 |
| Speedup source | INT8 matmul (if it worked) | Real INT8 matmul | Memory bandwidth (smaller weights) |
| Accuracy on big LLMs | Poor (collapses) | Near-FP16 | Near-FP16 |
| Best use case | Don't use for LLM acts | Fast INT8 serving | Large models on small GPUs (INT4) |
| Runtime overhead | None | None (fused offline) | None (fused scales) |

The key distinction

  • SmoothQuant is about making activation quantization possible so you can run true INT8 matrix multiplies for throughput. Reach for it when you want W8A8 speed on server GPUs.
  • AWQ is about squeezing weights to 4 bits with the least quality loss. Reach for it when memory is the constraint and you're running on consumer hardware.

They're not competitors so much as tools for different goals — one optimizes compute (INT8 activations), the other optimizes memory footprint (INT4 weights).


Why "activation-aware" beats "weight-only-aware"

Simple round-to-nearest quantization looks only at the weight values to decide how to quantize. The insight shared by both SmoothQuant and AWQ is that the activations carry the information you need to quantize well.

  • SmoothQuant uses activation magnitudes to compute how much difficulty to migrate.
  • AWQ uses activation magnitudes to decide which weights are salient.

In both cases, ignoring activations is what made naive quantization fail. A weight matrix that looks perfectly uniform and quantization-friendly can still be catastrophic to quantize naively, because of how the activations interact with it. Looking at the data flowing through the model — not just the static weights — is the whole insight.


How to use these methods in practice

AWQ with AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,   # block-wise scaling group size
    "w_bit": 4,            # 4-bit weights
    "version": "GEMM",
}

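# Calibrates on a small dataset to find salient weights, then quantizes
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # keep the tokenizer next to the weights so the folder loads on its own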

Serving an AWQ model with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model llama-2-7b-awq \
    --quantization awq
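
Once the server is up, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is listening on localhost port 8000 (vLLM's default) and that the model is served under the name passed to --model. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed host and port; adjust if you changed them


def build_completion_request(prompt, model="llama-2-7b-awq", max_tokens=64):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        return json.load(response)["choices"][0]["text"]


# Example, with the server running:
# print(complete("Quantization reduces model size by"))
```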

SmoothQuant via the original toolkit

SmoothQuant is typically applied as a preprocessing pass that computes per-channel smoothing scales from calibration activations, fuses them into the model, and then exports a W8A8 model for an INT8 inference backend (e.g., TensorRT-LLM, which has SmoothQuant support built in). The workflow is: collect activation scales → smooth → quantize to INT8 → deploy on an INT8-capable runtime.

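# Conceptual flow (smoothquant reference implementation)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from smoothquant.smooth import smooth_lm
from smoothquant.calibration import get_act_scales

model_name = "facebook/opt-1.3b"  # example checkpoint (an assumption); pick one smooth_lm supports
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
calibration_dataset = "path/to/calibration.jsonl"  # hypothetical path; see the repo for the expected format

act_scales = get_act_scales(model, tokenizer, calibration_dataset)
smooth_lm(model, act_scales, alpha=0.5)   # migrate difficulty, default alpha
# model is now ready for INT8 (W8A8) quantization + deployment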

Common pitfalls and how to avoid them

| Pitfall | Why it hurts | Fix |
|---|---|---|
| Using a tiny or unrepresentative calibration set | Activation statistics are wrong → bad scales / wrong salient weights | Use 128–512 diverse samples from your real domain |
| Expecting AWQ to give INT8-activation speedups | AWQ keeps activations in FP16; it saves memory and memory bandwidth, but the matmuls still run in FP16 | Use SmoothQuant/W8A8 if you need faster compute |
| Setting SmoothQuant α too high | Over-migrating creates weight outliers, which just moves the problem | Start at α = 0.5; tune per-model if needed |
| Quantizing activations naively to "save more" | Triggers the outlier collapse this whole article is about | Don't quantize LLM activations to INT8 without smoothing or another outlier-handling step |
| Assuming all models have the same outlier behavior | Outlier severity varies; some models need different α or methods | Profile activation ranges before choosing a method |
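
For the last row, profiling can be as simple as comparing each channel's peak magnitude with the typical channel. The sketch below works on a plain list of calibration activations; in a real model you would collect these per layer, for example with forward hooks. The function name, the threshold of 10×, and the sample values are all assumptions for illustration.

```python
import statistics


def outlier_report(calib_activations, ratio_threshold=10.0):
    """Flag channels whose peak magnitude exceeds ratio_threshold times the median channel peak."""
    n_channels = len(calib_activations[0])
    channel_max = [max(abs(row[j]) for row in calib_activations) for j in range(n_channels)]
    typical = statistics.median(channel_max)
    return {
        "channel_max": channel_max,
        "median_max": typical,
        "outlier_channels": [j for j, m in enumerate(channel_max) if m > ratio_threshold * typical],
        "peak_to_median": max(channel_max) / typical,
    }


# 2 calibration tokens x 5 channels (made-up values)
calib = [
    [0.9, -1.1, 72.0, 0.3, -1.0],
    [-0.4, 0.7, -65.0, 0.8, 0.6],
]
report = outlier_report(calib)
print(report["outlier_channels"], report["peak_to_median"])
```

A large peak-to-median ratio concentrated in a few channels is the pattern SmoothQuant is built for; a flat profile suggests plain INT8 activation quantization may already be acceptable for that layer.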

Frequently asked questions

Why does naive INT8 quantization fail on large language models?

Because of activation outliers: a small number of feature dimensions whose values are 10–100× larger than the rest. Quantization sets its scale based on the maximum value, so these outliers force a coarse grid where every normal value rounds to near-zero. The model's fine-grained information is destroyed. This emerges in models above roughly 6.7B parameters and is specific to quantizing activations, not weights.

What does SmoothQuant actually do?

It mathematically migrates quantization difficulty from activations to weights. Using the identity X·W = (X/s)·(sW), it divides the outlier-heavy activation channels by a per-channel factor and multiplies the corresponding weights by the same factor. Since weights are naturally uniform and tolerate the stretch, both activations and weights end up INT8-friendly. The scaling is fused offline, so there's no runtime overhead, and it enables true W8A8 INT8 matrix multiplication.

What does "activation-aware" mean in AWQ?

It means weight importance is judged by the activations that flow through each weight, not by the weight values themselves. A small weight that's always multiplied by large activations has a big effect on the output. AWQ uses a calibration dataset to measure activation magnitudes, identifies the ~1% of salient weight channels, and protects them via per-channel scaling, letting the other 99% be quantized aggressively to 4 bits with minimal accuracy loss.

SmoothQuant vs AWQ: which should I use?

Use SmoothQuant when you want faster compute through INT8 activations (W8A8) on server GPUs; it makes activation quantization viable. Use AWQ when memory is the constraint and you want high-quality 4-bit weights (W4A16) for running large models on consumer hardware. SmoothQuant optimizes compute; AWQ optimizes memory footprint. They target different bottlenecks.

Do these methods need calibration data?

Yes, both need a small calibration dataset to collect activation statistics: SmoothQuant to compute smoothing scales, AWQ to identify salient weights. Typically 128–512 samples suffice. The data should be reasonably representative of your deployment domain; an unrepresentative calibration set produces poor scales and degraded accuracy.

What is W8A8 vs W4A16?

The notation is W[weight bits]A[activation bits]. W8A8 means 8-bit weights and 8-bit activations: the matmul runs in INT8 for real speedup (SmoothQuant's target). W4A16 means 4-bit weights but 16-bit (FP16) activations: weights are tiny for memory savings, but compute stays in FP16 (AWQ's target). W8A8 is faster compute; W4A16 is smaller memory.


Key takeaways

  • Naive INT8 fails on LLMs because of activation outliers — a few channels 10–100× larger than the rest force every normal value to round toward zero.
  • Weights are easy to quantize; activations are hard. That asymmetry is the root cause and the lever both methods exploit.
  • SmoothQuant migrates difficulty from activations to weights via X·W = (X/s)·(sW), enabling true W8A8 INT8 compute with no runtime overhead.
  • AWQ is activation-aware: it protects the ~1% of salient weights (judged by activation magnitude) to enable high-quality W4A16 4-bit weights.
  • Choose by bottleneck: SmoothQuant for INT8 compute speed on servers; AWQ for 4-bit memory savings on consumer GPUs.
  • Both require representative calibration data (128–512 samples) — get this wrong and accuracy suffers.
  • The shared breakthrough: look at the activations, not just the weights, to decide how to quantize.