SmoothQuant: What Activation-Aware Quantization Fixes

Naive INT8 quantization drops OPT-175B accuracy from 71.6% to 32.3%. SmoothQuant fixes that - without retraining - by migrating quantization difficulty from activations to weights via a mathematically equivalent transform.

Mohammed Kafeel

Machine Learning Researcher

June 21, 2026


TL;DR: Naive INT8 quantization destroys large language model accuracy because activation outliers - values ~100× larger than typical - dominate the quantization range and compress most channels to just 2–3 effective levels. SmoothQuant (MIT + NVIDIA, ICML 2023) solves this with a training-free, per-channel scaling transform that migrates quantization difficulty from activations to weights. The result: W8A8 INT8 quantization that matches FP16 accuracy, with up to 1.56× speedup and 2× memory reduction - no retraining required.

The Problem: Why Activation Outliers Break INT8 Quantization

Naive INT8 quantization of large language models fails catastrophically. OPT-175B drops from 71.6% average accuracy in FP16 to 32.3% with naive W8A8 - that's near-random performance. The culprit is activation outliers.

When LLMs scale past ~6.7B parameters, a small number of activation channels develop extreme values. These outliers are roughly 100× larger than typical activation values in the same layer. They're not random noise either - they're persistent: if a channel has an outlier for one token, it has an outlier for every token.

Why Does This Destroy Quantization Accuracy?

INT8 quantization maps a floating-point tensor to 256 discrete levels (−128 to 127). The quantization step size Δ is calculated as:

Δ = max(|X|) / 127

When a handful of channels contain values 100× larger than the rest, Δ is dominated by those outliers. A channel whose peak magnitude is mᵢ gets only about 256 · mᵢ / max(|X|) effective levels, so the non-outlier channels - which represent the vast majority of values - are left with just 2–3 effective quantization levels instead of 256. At that resolution most of their values round to −1, 0, or +1, and the information they carried is lost. (For a primer on INT8 quantization and outlier handling across precisions, start here.)
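To make the arithmetic concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The activation values are hypothetical, chosen only so that one channel is about 100× larger than the others; the function name is illustrative and not taken from any library.

```python
def quantize_per_tensor(values, n_bits=8):
    """Symmetric per-tensor quantization: one step size for the whole tensor."""
    q_max = 2 ** (n_bits - 1) - 1                 # 127 for INT8
    step = max(abs(v) for v in values) / q_max    # delta = max(|X|) / 127
    return [round(v / step) for v in values], step


# Hypothetical activations: three ordinary channels and one outlier channel.
activations = [0.4, -0.7, 0.9, 100.0]
codes, step = quantize_per_tensor(activations)

# Effective levels left for a channel whose peak is 1.0: 256 * m_i / max(|X|).
effective_levels = 256 * 1.0 / max(abs(v) for v in activations)

print(codes)             # the ordinary channels collapse onto -1 and +1
print(effective_levels)  # about 2.56 levels instead of 256
```

With a step of roughly 0.79, the values 0.4 and 0.9 land on the same integer code, which is exactly the loss of resolution described above.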

Per-token quantization helps slightly but doesn't solve the root problem. Per-channel activation quantization would fix it - but it's not compatible with INT8 GEMM kernels on hardware like NVIDIA Tensor Cores, which only support scaling along outer dimensions (token and output-channel), not the inner input-channel dimension.
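The hardware constraint is easier to see with a toy example. The sketch below uses made-up integer codes and scales (none of them come from a real model) to show that per-token and per-output-channel scales can be applied after the integer GEMM, while a per-input-channel scale sits inside the sum and cannot be recovered by any post-hoc scaling of the accumulator.

```python
def matmul(a, b):
    """Plain nested-list matrix product: a is (T x C), b is (C x O)."""
    return [[sum(a[t][j] * b[j][o] for j in range(len(b)))
             for o in range(len(b[0]))] for t in range(len(a))]


# Hypothetical INT8 codes and scales.
x_q = [[3, -2], [1, 4]]        # tokens x input channels
w_q = [[5, 1], [-2, 6]]        # input channels x output channels
token_scale = [0.5, 0.25]      # one scale per token (outer dimension)
out_scale = [0.1, 0.2]         # one scale per output channel (outer dimension)
in_scale = [2.0, 0.5]          # one scale per input channel (inner dimension)

acc = matmul(x_q, w_q)         # pure integer GEMM, as an INT8 kernel would compute

# Outer-dimension scales: dequantize AFTER the integer GEMM.
y_outer = [[token_scale[t] * out_scale[o] * acc[t][o] for o in range(2)]
           for t in range(2)]

# Reference: dequantize first, then multiply in floating point.
x_f = [[token_scale[t] * x_q[t][j] for j in range(2)] for t in range(2)]
w_f = [[out_scale[o] * w_q[j][o] for o in range(2)] for j in range(2)]
y_ref = matmul(x_f, w_f)

# Inner-dimension scale: it has to be applied inside the sum over j.
x_in = [[in_scale[j] * x_q[t][j] for j in range(2)] for t in range(2)]
y_inner = matmul(x_in, w_q)

print(y_outer, y_ref)   # equal up to floating-point rounding
print(acc, y_inner)     # no row/column rescaling of acc reproduces y_inner
```

In this example one entry of y_inner is exactly zero while the matching accumulator entry is not, and the other entries in its row and column are non-zero, so no combination of per-row and per-column factors can map acc onto y_inner.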

This is the exact gap SmoothQuant was built to close.

What Is SmoothQuant?

SmoothQuant is a post-training quantization method that enables accurate, hardware-efficient W8A8 INT8 quantization for large language models - with no retraining.

Published at ICML 2023 (pages 38087–38099) by Guangxuan Xiao and Ji Lin (MIT, equal contribution), Mickael Seznec, Hao Wu, Julien Demouth (NVIDIA), and Song Han (MIT), it's now integrated into NVIDIA TensorRT-LLM, FasterTransformer, Amazon SageMaker, and Microsoft ONNX Runtime.

The core insight: weights are easy to quantize; activations are not. SmoothQuant doesn't fight the outliers directly. It moves the problem somewhere it can be handled.

How SmoothQuant Works: The Smoothing Formula

SmoothQuant applies a mathematically equivalent per-channel scaling transform that shifts quantization difficulty from activations to weights.

Start with a standard linear layer:

Y = X · W

SmoothQuant introduces a per-channel scaling vector s, applied as the diagonal matrix diag(s), such that:

Y = (X · diag(s)⁻¹) · (diag(s) · W) = X̂ · Ŵ

The output Y is identical. But now:

X̂ = X · diag(s)⁻¹ - activations divided by s, outliers smoothed out
Ŵ = diag(s) · W - weights multiplied by s, absorbing the scale

Because weights have a naturally flat, uniform distribution, they can absorb the increased scaling without significant quantization error. The activations, stripped of their outliers, now have a tight dynamic range that INT8 can represent cleanly.
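The equivalence is easy to check numerically. This is a small sketch with hand-picked, hypothetical values: channel 1 of X carries the outlier, and s is chosen by hand rather than computed from the formula.

```python
def matmul(a, b):
    """Plain nested-list matrix product: a is (T x C), b is (C x O)."""
    return [[sum(a[t][j] * b[j][o] for j in range(len(b)))
             for o in range(len(b[0]))] for t in range(len(a))]


# Hypothetical data: 2 tokens, 3 input channels, 2 output channels.
x = [[1.0, 80.0, -0.5],
     [0.5, -120.0, 0.8]]
w = [[0.2, -0.1],
     [0.05, 0.3],
     [-0.4, 0.25]]
s = [1.0, 10.0, 1.0]   # hand-picked scale that only shrinks the outlier channel

# X_hat = X * diag(s)^-1 : divide each activation column j by s[j].
x_hat = [[x[t][j] / s[j] for j in range(3)] for t in range(2)]
# W_hat = diag(s) * W : multiply each weight row j by s[j].
w_hat = [[s[j] * w[j][o] for o in range(2)] for j in range(3)]

y = matmul(x, w)
y_hat = matmul(x_hat, w_hat)

max_diff = max(abs(y[t][o] - y_hat[t][o]) for t in range(2) for o in range(2))
act_range_before = max(abs(v) for row in x for v in row)
act_range_after = max(abs(v) for row in x_hat for v in row)
print(max_diff, act_range_before, act_range_after)
```

The outputs agree up to floating-point rounding, while the activation range drops from 120 to 12 and the weight range grows from 0.4 to 3.0. That trade is the whole method.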

The Per-Channel Scaling Factor

The scaling factor for each input channel j is:

sⱼ = max(|Xⱼ|)^α / max(|Wⱼ|)^(1−α)

Where:

max(|Xⱼ|) is the peak absolute activation value for channel j (estimated from calibration data)
max(|Wⱼ|) is the peak absolute weight value for input channel j (row j of W, taken across all output channels)
α is the migration strength hyperparameter
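A minimal sketch of the formula, assuming the per-channel maxima have already been collected. The numbers are hypothetical, and the eps floor is an assumption added here to guard against division by zero; it is not part of the formula above.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    """Return s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) per input channel j.

    eps is an assumed floor for degenerate (all-zero) channels.
    """
    scales = []
    for a, w in zip(act_absmax, weight_absmax):
        w = max(w, eps)
        scales.append(max(a ** alpha / w ** (1.0 - alpha), eps))
    return scales


# Hypothetical calibration statistics for three input channels.
act_absmax = [1.0, 100.0, 4.0]      # channel 1 is the outlier
weight_absmax = [0.25, 0.25, 1.0]

s = smoothing_scales(act_absmax, weight_absmax, alpha=0.5)

# Peak magnitudes after smoothing: activations are divided by s, weights multiplied.
smoothed_act = [a / sj for a, sj in zip(act_absmax, s)]
smoothed_w = [w * sj for w, sj in zip(weight_absmax, s)]
print(s, smoothed_act, smoothed_w)
```

At α = 0.5 the smoothed activation and weight peaks come out equal for every channel (both are the geometric mean of the original peaks), which is one way to read "difficulty split evenly".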

The smoothing factors are computed offline using 512 random sentences from the validation set of the Pile pre-training dataset. One calibration run. Applied to all downstream tasks. No retraining, no labeled data.

The scaling factor s can also be fused into the preceding layer's weights (e.g., a LayerNorm or linear layer) at zero runtime cost - no extra kernel calls.
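The fusion step can be sketched for the LayerNorm case. The idea, under the assumption that the layer feeding the linear layer is a standard LayerNorm with per-channel gain and bias, is that dividing both parameters by s produces the smoothed activations directly. The layer_norm helper and all values below are illustrative.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over one token vector with per-channel gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


# Hypothetical token, LayerNorm parameters, and smoothing scales.
x = [0.3, -1.2, 2.0]
gamma = [1.0, 8.0, 0.5]
beta = [0.0, 0.5, -0.1]
s = [1.0, 4.0, 0.5]

# Explicit smoothing: run LayerNorm, then divide every channel by s at runtime.
explicit = [v / sj for v, sj in zip(layer_norm(x, gamma, beta), s)]

# Fused smoothing: fold 1/s into the LayerNorm parameters once, offline.
fused_gamma = [g / sj for g, sj in zip(gamma, s)]
fused_beta = [b / sj for b, sj in zip(beta, s)]
fused = layer_norm(x, fused_gamma, fused_beta)

print(explicit)
print(fused)
```

Both paths give the same smoothed activations, but the fused one needs no extra division at inference time.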

The Migration Strength Hyperparameter α

α controls how much quantization difficulty migrates from activations to weights. Getting it right is the difference between FP16-matching accuracy and a broken model.

α value	Effect	Use case
0.0	All difficulty stays in activations	Breaks activation quantization
0.4–0.6	Balanced - sweet spot	OPT, BLOOM models
0.5	Default	Most models
0.75	More difficulty pushed to weights	GLM-130B (~30% outlier channels)
0.8–0.9	Heavy migration to weights	Llama-2, Falcon, Mistral, Mixtral
1.0	All difficulty pushed to weights	Breaks weight quantization

When α is too small (< 0.4), activations remain hard to quantize. When it's too large (> 0.6 for OPT- and BLOOM-style models), the weights become the bottleneck. For those models the sweet spot sits between 0.4 and 0.6; architectures with more severe outliers need higher values, as the table shows.

For Llama-2-7B, the paper uses α = 0.85. For Llama-2-70B, α = 0.9. For Mistral-7B, α = 0.8. These aren't arbitrary - they reflect how severe the activation outlier problem is in each architecture.

How to find your α: Run a quick grid search on a small subset of your calibration data. The MIT-HAN-Lab repo includes scripts for this. It takes minutes, not hours.
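For intuition, a grid search over α on a single linear layer might look like the sketch below. This is not the repository's script: the synthetic data, the per-tensor fake quantizer, and the mean-squared-error criterion are all assumptions made for illustration. A real search would score perplexity or task accuracy on the full model.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product: a is (T x C), b is (C x O)."""
    return [[sum(a[t][j] * b[j][o] for j in range(len(b)))
             for o in range(len(b[0]))] for t in range(len(a))]


def fake_quant(matrix, n_bits=8):
    """Quantize to INT8 with one per-tensor step size, then dequantize."""
    q_max = 2 ** (n_bits - 1) - 1
    step = max(abs(v) for row in matrix for v in row) / q_max
    return [[round(v / step) * step for v in row] for row in matrix]


def w8a8_error(x, w, alpha):
    """Mean squared output error of smoothed W8A8 against the FP result."""
    n_in = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    s = [a ** alpha / wm ** (1 - alpha) for a, wm in zip(act_max, w_max)]
    x_hat = [[row[j] / s[j] for j in range(n_in)] for row in x]
    w_hat = [[s[j] * v for v in w[j]] for j in range(n_in)]
    y_ref = matmul(x, w)
    y_q = matmul(fake_quant(x_hat), fake_quant(w_hat))
    diffs = [(a - b) ** 2 for ra, rb in zip(y_ref, y_q) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Synthetic calibration data: input channel 3 is a ~100x outlier channel.
rng = random.Random(0)
n_tokens, n_in, n_out = 16, 8, 4
x = [[rng.uniform(-1, 1) * (100 if j == 3 else 1) for j in range(n_in)]
     for _ in range(n_tokens)]
w = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]

grid = [i / 20 for i in range(21)]           # alpha = 0.00, 0.05, ..., 1.00
errors = {alpha: w8a8_error(x, w, alpha) for alpha in grid}
best_alpha = min(errors, key=errors.get)
print(best_alpha, errors[best_alpha])
```

The winning α on a toy layer like this says nothing about a real model; the point is only the shape of the loop: compute s for each candidate, quantize, and compare against the FP16 output.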

Three Efficiency Levels: O1, O2, O3

SmoothQuant ships with three quantization schemes, trading accuracy for efficiency. All use INT8 for weights.

Level	Weight quant	Activation quant	Quantization type	Accuracy
O1	Per-tensor	Per-token	Dynamic	Matches FP16
O2	Per-tensor	Per-tensor	Dynamic	Matches FP16
O3	Per-tensor	Per-tensor	Static	Near FP16 (≤1% gap)

O1 is the most conservative. Per-token activation quantization computes a fresh scale for each token at runtime - accurate but slower.

O2 switches to per-tensor activation quantization. Slightly coarser, still dynamic. Matches FP16 on most models.

O3 is the production target. Static quantization means the scale factors are fixed at calibration time - no runtime computation. This is the configuration behind the headline figures: up to 1.56× speedup and roughly 2× memory reduction. The accuracy cost is typically under 1% on models like OPT-175B and GLM-130B.

The recommendation: Start with O1 to validate accuracy, then push to O3 for deployment. If O3 degrades accuracy beyond your threshold, fall back to O2.
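The difference between the three levels comes down to where the activation scale is computed. The sketch below assumes the static scale is simply the largest per-tensor scale seen during calibration; the batches and helper names are hypothetical.

```python
Q_MAX = 127  # largest INT8 magnitude used by symmetric quantization


def per_token_scales(x):
    """O1: dynamic, one scale per token (row), computed at runtime."""
    return [max(abs(v) for v in row) / Q_MAX for row in x]


def per_tensor_scale(x):
    """O2: dynamic, one scale for the whole runtime tensor."""
    return max(abs(v) for row in x for v in row) / Q_MAX


def quantize(v, scale):
    """Round to the nearest code and clamp to the INT8 range."""
    return max(-Q_MAX, min(Q_MAX, round(v / scale)))


# O3: static, one scale fixed offline from (hypothetical) calibration batches.
calib_batches = [[[0.5, -2.0], [1.0, 0.25]],
                 [[-3.0, 0.1], [0.2, 0.4]]]
static_scale = max(per_tensor_scale(batch) for batch in calib_batches)   # 3.0 / 127

# A runtime input whose peak (4.0) exceeds anything seen during calibration.
runtime_x = [[0.5, -1.0], [4.0, 0.2]]
o1 = per_token_scales(runtime_x)   # [1.0 / 127, 4.0 / 127]
o2 = per_tensor_scale(runtime_x)   # 4.0 / 127
o3 = static_scale                  # unchanged by the runtime input

# Under the static scale, 4.0 saturates at code 127 and dequantizes to about 3.0.
clipped = quantize(4.0, o3) * o3
print(o1, o2, o3, clipped)
```

This is why validating on O1 first is useful: it adapts to every token, so any accuracy gap that appears only at O3 points to calibration coverage rather than to the smoothing itself.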

Benchmark Results: What It Actually Fixes

Accuracy Recovery

The numbers from the ICML 2023 paper are stark:

Method	OPT-175B avg accuracy	BLOOM-176B avg accuracy	GLM-130B avg accuracy
FP16	71.6%	68.2%	73.8%
Naive W8A8	32.3%	64.2%	26.9%
ZeroQuant	31.7%	67.4%	26.7%
LLM.int8()	71.4%	68.0%	73.8%
SmoothQuant-O3	71.1%	67.4%	72.8%

SmoothQuant-O3 stays within about one point of LLM.int8() on all three models - but without the mixed-precision overhead that makes LLM.int8() slower than FP16 in practice.

On Llama-2 models, perplexity loss is negligible: Llama-2-7B goes from 5.474 (FP16) to 5.515 (SmoothQuant W8A8). Llama-2-13B actually improves slightly: 4.950 → 4.929.

Inference Speedup

Integrated into FasterTransformer, SmoothQuant-O3 delivers:

Up to 1.56× speedup vs. FP16 on OPT-13B and OPT-30B (single GPU)
OPT-66B on 1 GPU instead of 2 - same latency, half the hardware
OPT-175B on 4 GPUs instead of 8 - similar latency, half the cost
MT-NLG 530B on a single 8-GPU node - previously required two nodes in FP16

In the PyTorch implementation, OPT-30B with sequence length 256 goes from 343ms (FP16) to 227ms (SmoothQuant-O3) - a 1.51× speedup with 1.96× memory reduction.

Memory Savings

~2× memory reduction across the board. OPT-175B's weights drop from ~350GB to ~175GB. That's the difference between 8 A100s and 4.
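The weight-memory figures follow from simple arithmetic, sketched here. The helper name is illustrative, and the calculation counts weights only; activations and the KV cache need additional headroom.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(175e9, 16)   # 175B parameters at 2 bytes each
int8_gb = weight_memory_gb(175e9, 8)    # 175B parameters at 1 byte each
print(fp16_gb, int8_gb, fp16_gb / int8_gb)   # 350.0 175.0 2.0
```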

SmoothQuant vs. AWQ: Different Problems, Different Tools

SmoothQuant and AWQ (Activation-Aware Weight Quantization) both use activation statistics to guide quantization - but they solve different problems.

Dimension	SmoothQuant	AWQ
Target	W8A8 (weights + activations)	W4A16 (weights only)
Activation role	Activations are quantized	Activations guide weight quantization, stay in FP16
Speedup source	INT8 GEMM kernels	Reduced memory bandwidth (4-bit weights)
Memory reduction	~2× vs FP16	~4× vs FP16 (weight-only)
Accuracy	Near-lossless at 8-bit	Near-lossless at 4-bit
Hardware fit	Tensor Core INT8	Memory-bandwidth-bound inference

AWQ is the better choice when you need aggressive weight compression (4-bit) and your bottleneck is memory bandwidth - common in autoregressive decoding on consumer GPUs. (For how activation-aware methods beyond GPTQ benchmark against each other, see our AWQ vs GPTQ breakdown.) SmoothQuant wins when you need throughput and can leverage INT8 GEMM hardware acceleration, which is the norm in datacenter inference.

They're not mutually exclusive. Some production pipelines use SmoothQuant-style activation smoothing as a preprocessing step before applying AWQ-style weight quantization.

Practical Guidance: When to Use SmoothQuant and How

When SmoothQuant Is the Right Call

Use SmoothQuant when:

You're running W8A8 inference on hardware with INT8 GEMM support (NVIDIA A100, H100, Intel Sapphire Rapids)
Your model is >6.7B parameters - below that, activation outliers are less severe and naive quantization may work fine
You need production throughput - the paper's benchmarks show up to 1.56× speedup and roughly 2× memory reduction
You can't retrain - SmoothQuant is fully post-training, calibration takes minutes

Step-by-Step: Applying SmoothQuant

Install the library from github.com/mit-han-lab/smoothquant
Generate activation scales using generate_act_scales.py with 512 sentences from your domain (or the Pile)
Choose α: Start with 0.5 for OPT/BLOOM-family models; use 0.8–0.9 for Llama-2, Falcon, Mistral
Select efficiency level: O1 for accuracy validation, O3 for production deployment
Benchmark on your target hardware - measure both perplexity and latency before committing. (For more on production quantization in practice, see our deployment guide.)

What to Watch Out For

Static quantization (O3) can drift if your production distribution differs significantly from calibration data. Recalibrate on a representative sample of your actual inputs.
Models with high outlier rates, such as GLM-130B (~30% outlier channels), need a higher α; 0.75 is the value listed for GLM-130B above. Don't assume 0.5 is universal.
Attention BMMs also get quantized - SmoothQuant applies INT8 to all GEMMs in the transformer block, including batched matrix multiplications in attention. Verify attention accuracy separately.
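One way to watch for calibration drift under static quantization is to track how often production activations exceed the calibrated range. The following is a sketch under that assumption; the function, the sample values, and any alerting threshold you attach to it are hypothetical and not part of SmoothQuant itself.

```python
def saturation_rate(activations, calib_absmax):
    """Fraction of activation values that a static scale would clip.

    calib_absmax is the largest magnitude seen during calibration, i.e. the
    value that maps to the INT8 code 127 under a static per-tensor scale.
    """
    flat = [v for row in activations for v in row]
    return sum(1 for v in flat if abs(v) > calib_absmax) / len(flat)


calib_absmax = 3.0   # hypothetical peak from the calibration run

in_distribution = [[0.5, -2.0], [1.0, 2.9]]
shifted = [[0.5, -6.0], [4.5, 2.9]]

print(saturation_rate(in_distribution, calib_absmax))   # 0.0
print(saturation_rate(shifted, calib_absmax))           # 0.5
```

A rate that climbs over time is a signal to recalibrate on more representative inputs before accuracy degrades visibly.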

Key Takeaways

The 5 things that matter most:

Activation outliers are the root cause of LLM quantization failure - not weights. Weights are already easy to quantize.

SmoothQuant migrates the problem rather than eliminating it. It moves quantization difficulty from activations (hard) to weights (easy) via a mathematically equivalent transform.

α is the critical knob. Default 0.5 for OPT/BLOOM; 0.8–0.9 for Llama-2/Falcon/Mistral. Run a grid search if you're unsure.

O3 is the production target. Static per-tensor quantization delivers up to 1.56× speedup and roughly 2× memory savings with ≤1% accuracy loss on most models.

No retraining needed. 512 calibration sentences, one offline pass, done. This is post-training quantization that actually works at scale.

FAQ

What is SmoothQuant?

SmoothQuant is a training-free post-training quantization method from MIT and NVIDIA (ICML 2023) that enables accurate W8A8 INT8 quantization for large language models. It works by migrating quantization difficulty from activations to weights using a per-channel scaling transform, without any model retraining.

Why does naive INT8 quantization fail on large language models?

LLMs with more than ~6.7B parameters develop systematic activation outliers - values roughly 100× larger than typical activations - in a small number of fixed channels. These outliers dominate the quantization range and compress non-outlier channels to just 2–3 effective quantization levels, destroying accuracy. OPT-175B drops from 71.6% to 32.3% with naive W8A8.

What does the α hyperparameter control in SmoothQuant?

α (migration strength) controls how much quantization difficulty shifts from activations to weights. At α = 0.5 (the default), difficulty is split evenly. Higher values (0.75–0.9) push more difficulty to weights, which is needed for models with severe outlier rates like GLM-130B, Llama-2, and Mistral. For OPT and BLOOM models the sweet spot is 0.4–0.6.

What are the three SmoothQuant efficiency levels?

O1 uses per-token dynamic activation quantization (most accurate), O2 uses per-tensor dynamic quantization (balanced), and O3 uses per-tensor static quantization (most efficient). O3 delivers up to 1.56× speedup and roughly 2× memory reduction with typically less than 1% accuracy degradation.

How does SmoothQuant compare to AWQ?

SmoothQuant targets W8A8 quantization - both weights and activations go to INT8, enabling hardware-accelerated GEMM throughput. AWQ targets W4A16 - weights go to 4-bit while activations stay in FP16, reducing memory bandwidth. SmoothQuant wins on throughput; AWQ wins on memory compression. They address different bottlenecks and can be complementary.

Does SmoothQuant require retraining or labeled data?

No. SmoothQuant is fully post-training. Calibration requires 512 unlabeled sentences from the Pile dataset (or your own domain data) and runs in minutes. The resulting scaling factors are applied offline and fused into the model weights - zero runtime overhead.

Which models does SmoothQuant support?

SmoothQuant has been validated on OPT (all scales), BLOOM-176B, GLM-130B, MT-NLG 530B, Llama-1/2/3, Falcon, Mistral, and Mixtral. It works on any transformer architecture where the quantization bottleneck is activation outliers in linear layers.

Useful Sources

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (arXiv) - The original paper (ICML 2023)
MIT-HAN-Lab SmoothQuant GitHub Repository - Official implementation, activation scale scripts, demo notebooks
ICML 2023 Official Proceedings - Peer-reviewed publication record
NVIDIA Developer Blog: Optimizing LLMs with Post-Training Quantization - Production integration context
vLLM LLM Compressor: SmoothQuant Modifier - Practical deployment reference

Have you deployed SmoothQuant in production? What α value worked best for your model, and did you see the full 1.56× speedup? Drop your numbers in the comments - real-world benchmarks from diverse hardware setups are exactly what the community needs more of.
