AWQ vs GPTQ: What the Quantization Benchmarks Show

AWQ and GPTQ are the two dominant 4-bit quantization methods for LLMs - but the benchmarks tell a more nuanced story than most comparisons admit. Here's what the data actually shows.

Mohammed Kafeel

Machine Learning Researcher

June 9, 2026

Most comparisons of AWQ vs GPTQ stop at "AWQ is more accurate, GPTQ is faster." That's not wrong. But it's not the whole picture either. The real answer depends on your inference stack, your GPU generation, and whether you're running a 7B or a 70B model. Let's go through the actual numbers.

TL;DR

AWQ wins on accuracy at 4-bit: ~5.40 perplexity vs ~5.50+ for GPTQ on WikiText-2; ~83% vs ~81–82% on MMLU.
AWQ wins on inference speed when paired with the Marlin kernel: ~741 tok/s vs ~712 tok/s on Qwen2.5-32B.
GPTQ wins on extreme compression: it handles 3-bit and 2-bit quantization; AWQ is optimized for 4-bit.
GPTQ wins on ecosystem breadth: native support in llama.cpp, Ollama, and text-generation-webui without extra dependencies.

What Is LLM Quantization - and Why Does It Matter?

Quantization shrinks a model's weight precision - from 16-bit floats down to 4-bit integers - so it fits in less VRAM and runs faster.

A Llama-3.1-70B model in FP16 needs ~140 GB of VRAM. Quantize it to 4-bit and that drops to ~35–40 GB. That's the difference between needing a multi-GPU server and running on a single A100. (New to the precision levels here? Start with the INT4 quantization fundamentals.)
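The arithmetic behind those figures is just parameter count times bits per weight. Here's a minimal sketch, assuming 1 GB = 10^9 bytes and counting weights only (no KV cache, activations, or quantization metadata). The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


fp16_gb = weight_memory_gb(70, 16)  # 16-bit floats
int4_gb = weight_memory_gb(70, 4)   # 4-bit integers
print(f"FP16: {fp16_gb:.0f} GB, INT4: {int4_gb:.0f} GB")
# prints: FP16: 140 GB, INT4: 35 GB
```

The raw 4-bit figure sits at the bottom of the ~35–40 GB range because real checkpoints also carry scales and other metadata on top of the packed weights.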

The catch: every bit you drop introduces rounding error. The question isn't whether you lose accuracy - you do. The question is how much, and which method loses less.
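To make the rounding error concrete, here's a small sketch of naive symmetric round-to-nearest INT4 quantization. This is the baseline both methods improve on, not AWQ or GPTQ themselves. The function names and sample weights are illustrative, and the sketch assumes at least one non-zero weight.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest: map floats to integer codes in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7  # largest weight maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.82, -0.31, 0.05, -0.67, 0.44]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(codes)  # [7, -3, 0, -6, 4]
print(max(errors) <= scale / 2 + 1e-12)  # True: error is at most half a step
```

Every weight lands within half a quantization step of its original value. The methods below differ in how they keep that per-weight error from piling up in the layer's output.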

That's exactly what the AWQ vs GPTQ debate is about.

What Is GPTQ? (How It Works in Plain English)

GPTQ is a one-shot post-training quantization method that uses second-order information (the Hessian matrix) to minimize quantization error layer by layer.

Published at ICLR 2023 (arXiv:2210.17323) by Frantar, Ashkboos, Hoefler, and Alistarh, GPTQ was the first method to quantize a 175B-parameter model to 3–4 bits in roughly four GPU hours - on a single A100.

Here's how GPTQ quantization works:

Calibration - Feed a small dataset through the layer. Compute the Hessian of the layer's output reconstruction error, which depends only on the layer inputs X (H = 2XXᵀ) rather than on the full training loss. Use a Cholesky decomposition of its inverse for numerical stability.
Sequential quantization - Process weights column by column. For each weight, find the best low-bit integer value. Calculate the quantization error. Push that error onto the remaining unquantized weights using the inverse Hessian.
Batch processing - Group columns into blocks of ~128. Apply the error updates inside the current block immediately, and apply them to the rest of the matrix lazily, once per block.

The update rule, applied to every not-yet-quantized weight j after weight i has been rounded, is: w_j ← w_j − (w_i − quant(w_i)) × (H⁻¹)_ij / (H⁻¹)_ii

The key insight: GPTQ doesn't just round each weight independently. It compensates for each rounding error by adjusting the weights that haven't been quantized yet. That's why it preserves accuracy far better than naive rounding.
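The error-compensation idea can be sketched on a single weight row in plain Python. This is a toy illustration under stated assumptions, not the reference implementation. It inverts the Hessian directly instead of using a Cholesky factorization, skips the 128-column blocking, and uses a hand-picked 2×2 matrix as a stand-in for 2XXᵀ. All function names are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse; fine for the tiny, well-conditioned demo matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def round_to_grid(value, scale, qmax=7):
    """Nearest representable value on a symmetric 4-bit grid."""
    return max(-qmax - 1, min(qmax, round(value / scale))) * scale


def quantize_with_compensation(weights, hessian, scale):
    """Quantize one row left to right, pushing each rounding error
    onto the weights that are still unquantized."""
    w = list(weights)
    hinv = invert(hessian)
    n = len(w)
    quantized = []
    for i in range(n):
        q = round_to_grid(w[i], scale)
        quantized.append(q)
        err = (w[i] - q) / hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * hinv[i][j]
        # Drop weight i from the inverse Hessian before the next step.
        d = hinv[i][i]
        row_i = list(hinv[i])
        for r in range(i + 1, n):
            f = hinv[r][i] / d
            for c in range(i + 1, n):
                hinv[r][c] -= f * row_i[c]
    return quantized


def output_error(original, quantized, hessian):
    """Quadratic form e^T H e, proportional to the layer's output error."""
    e = [a - b for a, b in zip(original, quantized)]
    return sum(e[i] * hessian[i][j] * e[j]
               for i in range(len(e)) for j in range(len(e)))


weights = [0.14, 0.14]
hessian = [[2.0, 1.0], [1.0, 2.0]]  # toy stand-in for 2XX^T
naive = [round_to_grid(v, 0.1) for v in weights]
compensated = quantize_with_compensation(weights, hessian, 0.1)
print(naive, output_error(weights, naive, hessian))
print(compensated, output_error(weights, compensated, hessian))
```

In this toy case naive rounding sends both weights to 0.1, for an error of about 0.0096. With compensation, the second weight is nudged to 0.16 and rounds to 0.2, cutting the error to about 0.0056. Each individual weight is no closer to its original value, but the layer output is.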

GPTQ's strengths:

Compresses 175B models in ~4 GPU hours
Supports 3-bit and even 2-bit quantization
Enables single-GPU inference for massive models
Mature ecosystem: AutoGPTQ, llmcompressor, ExLlama v2

The limitation: GPTQ requires a GPU for the quantization step itself. The Hessian computation is expensive - 20–60 minutes for an 8B model, 4+ hours for 70B+.

What Is AWQ? (The Activation-Aware Approach)

AWQ (Activation-aware Weight Quantization) is a post-training quantization method that protects the ~1% of weights that matter most - identified by looking at activation magnitudes, not weight values.

Published as arXiv:2306.00978 and awarded MLSys 2024 Best Paper, AWQ came out of MIT's HAN Lab. The core insight: not all weights are equally important. A tiny fraction - roughly 1% - are "salient weights" connected to high-magnitude activations. Quantizing those aggressively destroys model quality. Protecting them doesn't.

Here's how AWQ works:

Calibration - Run 128–512 samples through the unquantized model. Record the average activation magnitude per channel.
Grid search - Identify salient weight channels (those connected to high-magnitude activations). Search over scaling factors to find the one that minimizes quantization error.
Scaling and fusion - Scale salient weight channels up. Apply the inverse scaling to the matching input activations (mathematically equivalent - the layer output doesn't change). Fuse the scaling factors into the weights and the preceding operation so they add no inference cost. Quantize to INT4.

No backpropagation. No retraining. Just a single forward pass. (AWQ isn't the only activation-aware method - here's activation-aware quantization beyond AWQ.)
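The three steps above can be sketched for one tiny linear layer. This is a simplified illustration, not the official AWQ code. It uses one quantization scale per output row, derives channel scales as mean activation magnitude raised to a power alpha, and leaves out the scale normalization and weight clipping a real implementation would add. All names and numbers are made up for the example.

```python
def quantize_row(row, qmax=7):
    """Symmetric 4-bit round-to-nearest with one scale per output row."""
    step = max(abs(v) for v in row) / qmax
    if step == 0:
        return list(row)
    return [max(-qmax - 1, min(qmax, round(v / step))) * step for v in row]


def layer_output(weights, x):
    """Plain matrix-vector product for a linear layer without bias."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def scaled_quant_error(weights, calib_inputs, scales):
    """Output error when weights are scaled per input channel, quantized,
    and the inputs are divided by the same scales."""
    q_weights = [quantize_row([w * s for w, s in zip(row, scales)])
                 for row in weights]
    total = 0.0
    for x in calib_inputs:
        reference = layer_output(weights, x)
        approx = layer_output(q_weights, [v / s for v, s in zip(x, scales)])
        total += sum((a - b) ** 2 for a, b in zip(approx, reference))
    return total


def search_scales(weights, calib_inputs, grid_steps=20):
    """Grid-search alpha in [0, 1] with scales = mean|activation| ** alpha."""
    n_in = len(weights[0])
    act_mag = [sum(abs(x[c]) for x in calib_inputs) / len(calib_inputs)
               for c in range(n_in)]
    best = None
    for step in range(grid_steps + 1):
        alpha = step / grid_steps
        scales = [max(m, 1e-8) ** alpha for m in act_mag]
        err = scaled_quant_error(weights, calib_inputs, scales)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best


# Input channel 1 carries much larger activations than the others.
weights = [[0.30, -0.02, 0.11, 0.07], [-0.25, 0.04, -0.09, 0.18]]
calib_inputs = [[0.1, 9.0, -0.2, 0.3],
                [-0.2, 11.0, 0.1, -0.1],
                [0.3, -8.0, 0.2, 0.2]]

baseline = scaled_quant_error(weights, calib_inputs, [1.0] * 4)
best_err, best_alpha, best_scales = search_scales(weights, calib_inputs)
print(f"plain rounding: {baseline:.5f}, best alpha {best_alpha}: {best_err:.5f}")
```

Because alpha = 0 gives scales of exactly 1, plain rounding is always one of the candidates, so the search can never do worse than it on the calibration inputs.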

AWQ's strengths:

~10 minutes to quantize an 8B model; ~1 hour for 70B
Generalizes across domains (coding, math, multimodal) without overfitting the calibration set
Hardware-friendly: INT4 weights + FP16 activations - no exotic mixed-precision hardware needed
Supported in vLLM, TensorRT-LLM, HuggingFace TGI, FastChat, LMDeploy

The limitation: AWQ needs specific inference stacks (vLLM, autoawq). It doesn't run natively in llama.cpp or Ollama the way GPTQ does.

AWQ vs GPTQ: Head-to-Head Benchmark Results

Perplexity & Accuracy

AWQ outperforms GPTQ at 4-bit precision on most major accuracy metrics, with MT-Bench the main near-tie.

The September 2024 paper arXiv:2409.11055 - a comprehensive evaluation across 13 benchmarks, models from 7B to 405B, run on H100/A100/RTX 6000 clusters - is the most thorough head-to-head we have. The finding is unambiguous: "AWQ consistently outperforms GPTQ across various LLMs on overall benchmark scores."

Metric	FP16 Baseline	AWQ INT4	GPTQ 4-bit	Winner
Perplexity (WikiText-2)	~5.25	~5.40	~5.50+	AWQ
MMLU Accuracy	~85%	~83%	~81–82%	AWQ
HumanEval (code)	~72%	~70%	~66–67%	AWQ
MT-Bench (Llama-2-70B)	7.20	7.06	7.08	Roughly tied
Avg. OpenLLM v1 (Vicuna-7B)	47.40	46.86 (↓0.54)	44.15 (↓3.25)	AWQ

The Vicuna-7B numbers tell the story clearly. AWQ drops 0.54 points from FP16 baseline on the OpenLLM Leaderboard-v1 average. GPTQ drops 3.25 points. That's a 6× larger accuracy hit for GPTQ on the same model at the same bit-width.
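For readers who want to check the 6× figure, here's the arithmetic as a short sketch using the Vicuna-7B row of the table. The helper name is illustrative.

```python
def points_lost(baseline: float, quantized: float) -> float:
    """Benchmark points lost relative to the FP16 baseline."""
    return baseline - quantized


awq_drop = points_lost(47.40, 46.86)
gptq_drop = points_lost(47.40, 44.15)
ratio = gptq_drop / awq_drop
print(f"AWQ -{awq_drop:.2f}, GPTQ -{gptq_drop:.2f}, ratio {ratio:.1f}x")
# prints: AWQ -0.54, GPTQ -3.25, ratio 6.0x
```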

For Llama-2-7B-Chat, AWQ's average drop is 1.17 points. GPTQ's is 3.50 points. (For a closer look at accuracy degradation at 4-bit precision, see our bit-width breakdown.)

One nuance: on very small models (Gemma-2B), GPTQ occasionally edges AWQ on specific sub-benchmarks. The gap narrows at 2B parameters. It widens significantly at 13B and 70B.

Inference Speed

Speed depends almost entirely on which kernel you're using. With the Marlin kernel, AWQ is the fastest of the configurations benchmarked below.

The Marlin kernel (optimized for Ampere/Ada/Hopper GPUs) changes the picture dramatically:

Configuration	Throughput (Qwen2.5-32B)	vs FP16 Baseline
FP16 baseline	~461 tok/s	-
AWQ (no Marlin)	~67 tok/s	−85%
GPTQ (standard)	~276 tok/s	−40%
Marlin-GPTQ	~712 tok/s	+54%
Marlin-AWQ	~741 tok/s	+61%

Without Marlin, standard AWQ is actually slower than standard GPTQ - a counterintuitive result that trips up a lot of practitioners. AWQ's weight layout isn't optimized for naive INT4 kernels.

With Marlin, both methods beat FP16 by 50–60%. AWQ edges ahead by ~4%.

On an RTX 4090 running Llama-2-7B, AWQ delivers ~194 tok/s vs ~133 tok/s for GPTQ - roughly 46% higher throughput, or about 31% less time per token.

On an A100 running Llama-3.1-70B, AWQ achieves ~1,800 tok/s vs ~1,200–1,400 tok/s for GPTQ - a 1.3–1.5× throughput advantage.
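Throughput and per-token latency are easy to mix up, so here's a short sketch that reproduces the percentages from the figures in this section. The helper names are illustrative.

```python
def pct_vs_baseline(tok_per_s: float, baseline: float) -> float:
    """Throughput change relative to a baseline, in percent."""
    return (tok_per_s / baseline - 1) * 100


def ms_per_token(tok_per_s: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000 / tok_per_s


# Qwen2.5-32B table: tokens per second, FP16 baseline at 461.
configs = {"AWQ (no Marlin)": 67, "GPTQ (standard)": 276,
           "Marlin-GPTQ": 712, "Marlin-AWQ": 741}
for name, tps in configs.items():
    print(f"{name}: {pct_vs_baseline(tps, 461):+.0f}% vs FP16")

# RTX 4090, Llama-2-7B: the same gap expressed two ways.
throughput_gain = pct_vs_baseline(194, 133)
latency_cut = (1 - ms_per_token(194) / ms_per_token(133)) * 100
print(f"throughput +{throughput_gain:.0f}%, time per token -{latency_cut:.0f}%")
# prints: throughput +46%, time per token -31%
```

The same 194 vs 133 tok/s gap reads as +46% throughput or −31% time per token, depending on which side of the ratio you put in the denominator.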

The community data point: r/LocalLLaMA benchmarks consistently rank EXL2 as the fastest option for local inference, followed by GPTQ via ExLlama v2, with AWQ competitive but dependent on the backend.

Memory Footprint

At 4-bit, AWQ and GPTQ use virtually identical VRAM.

Both compress weights from 2 bytes per parameter (FP16) to ~0.5 bytes (INT4) - a 4× reduction.

Model Size	FP16	AWQ INT4	GPTQ INT4
7B	~14 GB	~4.0–4.5 GB	~4.0–4.5 GB
13B	~26 GB	~7.0–8.0 GB	~7.0–8.0 GB
70B	~140 GB	~35–40 GB	~35–40 GB

If VRAM is your only constraint, the choice between AWQ and GPTQ doesn't matter. Pick based on accuracy and ecosystem instead.
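The table values sit a bit above a pure 0.5 bytes per parameter. Part of the reason is per-group metadata, which this sketch estimates. It assumes a group size of 128 with one FP16 scale and one 4-bit zero-point per group, a common configuration but not the only one. The function names are illustrative.

```python
def effective_bits(weight_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Bits per weight once per-group scale and zero-point are included."""
    return weight_bits + (scale_bits + zero_bits) / group_size


def quantized_weights_gb(params_billion, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


bits = effective_bits()
for size in (7, 13, 70):
    gb = quantized_weights_gb(size, bits)
    print(f"{size}B: {gb:.2f} GB at {bits:.3f} bits/weight")
# prints:
# 7B: 3.64 GB at 4.156 bits/weight
# 13B: 6.75 GB at 4.156 bits/weight
# 70B: 36.37 GB at 4.156 bits/weight
```

The remaining gap to the table is plausibly explained by layers that are usually left in FP16, such as embeddings and the output head, though that depends on the checkpoint.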

Quantization Time

AWQ is roughly 2–6× faster to quantize than GPTQ on the figures below.

Model	AWQ	GPTQ
8B (A100)	~10 minutes	~20–60 minutes
70B (A100)	~1 hour	~4+ hours

GPTQ's Hessian computation is the bottleneck. It's iterative, matrix-heavy, and scales poorly with model size. AWQ's calibration is a single forward pass - no matrix inversion required.

For teams that quantize frequently (new model releases, fine-tuned checkpoints), this difference compounds fast.

Where AWQ Wins

AWQ is the right choice when accuracy at 4-bit is non-negotiable.

Specifically, choose AWQ when:

You're running instruction-tuned models. AWQ's activation-aware approach preserves reasoning quality that GPTQ's Hessian method tends to degrade. The arXiv:2409.11055 paper reports AWQ ahead of GPTQ on overall benchmark scores for instruction-tuned models, although MT-Bench on Llama-2-70B is roughly tied (7.06 vs 7.08).
You're using vLLM, SGLang, or TensorRT-LLM. These stacks have first-class AWQ + Marlin support. You get both accuracy and speed.
You're quantizing frequently. Ten minutes vs an hour per 8B model adds up.
You're working with multimodal LLMs. AWQ generalizes across modalities without overfitting the calibration set. GPTQ can overfit to text-only calibration data.
You have limited calibration data. AWQ needs 128–512 samples. GPTQ benefits from larger, more representative datasets.

Where GPTQ Wins

GPTQ is the right choice when ecosystem compatibility or extreme compression matters more than peak accuracy. (If CPU and Apple Silicon are also on the table, GGUF rounds out the picture as a third quantization format.)

Specifically, choose GPTQ when:

You need 3-bit or 2-bit quantization. GPTQ handles extreme compression. AWQ is optimized for 4-bit and doesn't go lower reliably.
You're running llama.cpp, Ollama, or text-generation-webui. GPTQ models load natively. AWQ requires autoawq or a compatible inference stack.
You're on older hardware. GPTQ has broader legacy kernel support, including Turing-era GPUs (RTX 20xx) where AWQ can be unstable.
You want the widest choice of pre-quantized models. With 5,000+ GPTQ checkpoints on Hugging Face Hub, GPTQ has a massive head start in community-quantized model availability.
You're using ExLlama v2 for local inference. ExLlama's GPTQ kernels are highly optimized for consumer GPUs and deliver excellent tokens-per-second on RTX 3060/3070 hardware.

Hardware & Ecosystem: Which One Fits Your Stack?

Both AWQ and GPTQ support LLaMA, Mistral, Falcon, Gemma, and all major open-source model families. The real difference is in the inference engine.

Stack	AWQ	GPTQ
vLLM	✅ First-class (Marlin)	✅ First-class (Marlin)
TensorRT-LLM	✅	✅
HuggingFace TGI	✅	✅
llama.cpp / Ollama	❌ (needs conversion)	✅ Native
text-generation-webui	✅ (via autoawq)	✅ Native
ExLlama v2	✅	✅ (optimized)
LMDeploy	✅	✅

GPU architecture matters for the Marlin kernel. Marlin requires Ampere (RTX 30xx, A100), Ada Lovelace (RTX 40xx), or Hopper (H100). It does not run on Turing (RTX 20xx) or Volta (V100). Without Marlin, you lose the big speed advantage.
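A quick way to encode that rule is by CUDA compute capability. The sketch below assumes the Marlin requirement maps to compute capability 8.0 or higher, which matches the Ampere/Ada/Hopper list above. The lookup table and function name are illustrative. With PyTorch installed, `torch.cuda.get_device_capability()` returns the same kind of `(major, minor)` tuple for the active GPU.

```python
# Assumption: Marlin needs compute capability 8.0+ (Ampere and newer).
MARLIN_MIN_CAPABILITY = (8, 0)

# (major, minor) compute capability for a few representative GPUs.
GPU_CAPABILITY = {
    "V100": (7, 0),      # Volta
    "RTX 2080": (7, 5),  # Turing
    "A100": (8, 0),      # Ampere
    "RTX 3060": (8, 6),  # Ampere
    "RTX 4090": (8, 9),  # Ada Lovelace
    "H100": (9, 0),      # Hopper
}


def supports_marlin(capability) -> bool:
    """True if the (major, minor) capability meets the assumed minimum."""
    return tuple(capability) >= MARLIN_MIN_CAPABILITY


for gpu, cap in GPU_CAPABILITY.items():
    print(f"{gpu}: {'Marlin OK' if supports_marlin(cap) else 'no Marlin'}")
```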

Practical GPU recommendations:

RTX 4090 / 4070 (Ada): Use AWQ with vLLM. Best quality + speed for 13B–30B models on the 4090's 24 GB VRAM (the 4070's 12 GB fits up to about 13B).
RTX 3060 / 3070 (Ampere): Use GPTQ with ExLlama v2. Highest tokens-per-second for personal local inference.
RTX 2080 (Turing): Use GPTQ. Broader legacy kernel support; AWQ can be less stable.
A100 / H100 (data center): Either works well. AWQ + Marlin for accuracy; GPTQ + Marlin if you're squeezing every last token/s.

Which Should You Use? (Decision Framework)

Run through this in order:

01 / Do you need 3-bit or 2-bit compression? → Yes: Use GPTQ. AWQ doesn't support sub-4-bit reliably. → No: Continue.

02 / Are you running llama.cpp, Ollama, or text-generation-webui? → Yes: Use GPTQ. Native support, no extra dependencies. → No: Continue.

03 / Are you on Ampere, Ada, or Hopper GPUs with vLLM or SGLang? → Yes: Use AWQ + Marlin. Best combination of accuracy and throughput. → No: Continue.

04 / Is accuracy your top priority at 4-bit? → Yes: Use AWQ. Consistently lower perplexity and higher benchmark scores. → No: Use GPTQ. Mature, fast, and widely supported.
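The four questions translate directly into a small function. This is just the checklist above restated as code. The function name, argument names, and lowercase string labels are made up for illustration.

```python
def choose_method(needs_sub_4bit: bool, stack: str, gpu_arch: str,
                  accuracy_first: bool) -> str:
    """Walk the four questions in order and return the recommendation."""
    stack = stack.lower()
    gpu_arch = gpu_arch.lower()

    # 01: extreme compression rules out AWQ.
    if needs_sub_4bit:
        return "GPTQ"
    # 02: these stacks are listed as GPTQ-native in this article.
    if stack in {"llama.cpp", "ollama", "text-generation-webui"}:
        return "GPTQ"
    # 03: Marlin-capable GPU plus a stack that ships the kernel.
    if gpu_arch in {"ampere", "ada", "hopper"} and stack in {"vllm", "sglang"}:
        return "AWQ + Marlin"
    # 04: otherwise decide on accuracy priority.
    return "AWQ" if accuracy_first else "GPTQ"


print(choose_method(False, "vLLM", "Ada", accuracy_first=True))
# prints: AWQ + Marlin
```

The order of the checks matters: a hard constraint such as sub-4-bit compression or a fixed inference stack wins over the accuracy preference.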

The short version: AWQ for production inference on modern GPUs. GPTQ for local use, legacy stacks, or extreme compression. (Want to see both methods applied step by step? Here's a walkthrough of practical quantization of Llama 3 with both methods.)

Key Takeaways

📦 AWQ vs GPTQ - The 5 Things That Actually Matter

Accuracy at 4-bit: AWQ wins. Roughly 1–4 points better on MMLU, HumanEval, and OpenLLM benchmarks. Confirmed across 13 benchmarks in arXiv:2409.11055.

Speed with Marlin: AWQ edges ahead (~741 vs ~712 tok/s). Without Marlin, GPTQ is faster.

Memory: Identical. Both deliver ~4× compression from FP16 at 4-bit.

Quantization time: AWQ is roughly 2–6× faster. ~10 min vs ~20–60 min for an 8B model.

Ecosystem: GPTQ has broader native support (llama.cpp, Ollama). AWQ requires vLLM or autoawq.

FAQ

What is the main difference between AWQ and GPTQ?

AWQ uses activation statistics to identify and protect the ~1% of weights that matter most, then scales them before quantization. GPTQ uses second-order Hessian information to redistribute quantization error across remaining weights as it quantizes column by column. Both are post-training methods that don't require retraining, but AWQ is faster to run and more accurate at 4-bit; GPTQ supports lower bit-widths and has broader ecosystem support.

Is AWQ better than GPTQ for accuracy?

Yes, at 4-bit precision. AWQ consistently achieves lower perplexity (WikiText-2: ~5.40 vs ~5.50+) and higher benchmark scores (MMLU: ~83% vs ~81–82%). A September 2024 study across 13 benchmarks and models up to 405B found that AWQ shows less accuracy degradation than GPTQ on overall benchmark scores. The gap is largest on instruction-tuned models at 13B and 70B scale.

What is AWQ in the context of LLMs?

AWQ stands for Activation-aware Weight Quantization. It's a post-training quantization technique from MIT's HAN Lab (arXiv:2306.00978, MLSys 2024 Best Paper). It compresses LLM weights to 4-bit integers while preserving model quality by identifying "salient" weight channels - those connected to high-magnitude activations - and protecting them through equivalent scaling transformations.

Does GPTQ quantization support lower bit-widths than AWQ?

Yes. GPTQ supports 3-bit, 2-bit, and even ternary (~1.58-bit) compression. AWQ is optimized for 4-bit and doesn't reliably go lower. If you need extreme compression - for example, fitting a 70B model into under 25 GB - GPTQ is the practical choice of the two.

Which is faster for inference: AWQ or GPTQ?

It depends on the kernel. With the Marlin kernel (requires Ampere/Ada/Hopper GPUs, available in vLLM and SGLang), AWQ achieves ~741 tok/s vs ~712 tok/s for GPTQ on Qwen2.5-32B. Without Marlin, standard GPTQ (~276 tok/s) is significantly faster than standard AWQ (~67 tok/s). Always check whether your inference stack supports Marlin before benchmarking.

Can I use AWQ and GPTQ with LLaMA, Mistral, and Falcon models?

Yes. Both methods support LLaMA (1, 2, 3, 3.1), Mistral (7B, Mixtral-8x7B), Falcon (7B, 40B, 180B), Gemma, Vicuna, and most major open-source model families. Over 5,000 GPTQ-quantized models and a growing library of AWQ models are available on Hugging Face Hub.

Useful Sources

AWQ paper (arXiv:2306.00978): arxiv.org/abs/2306.00978 - Original AWQ paper, MLSys 2024 Best Paper Award.
GPTQ paper (arXiv:2210.17323): arxiv.org/abs/2210.17323 - Original GPTQ paper, ICLR 2023.
Comprehensive quantization evaluation (arXiv:2409.11055): arxiv.org/html/2409.11055v1 - 13-benchmark study across models up to 405B, September 2024.
MIT HAN Lab AWQ GitHub: github.com/mit-han-lab/llm-awq - Official AWQ implementation and model zoo.
AutoAWQ (HuggingFace integration): huggingface.co/docs/transformers/en/quantization/awq - Official HuggingFace AWQ documentation.
vLLM quantization benchmarks: jarvislabs.ai/blog/vllm-quantization-complete-guide-benchmarks - Marlin kernel speed benchmarks for AWQ and GPTQ.
AWS SageMaker AWQ + GPTQ guide: aws.amazon.com/blogs/machine-learning/accelerating-llm-inference-with-post-training-weight-and-activation-using-awq-and-gptq-on-amazon-sagemaker-ai/ - Practical deployment comparison on AWS.
