AWQ vs GPTQ: What the Quantization Benchmarks Show

A benchmark-driven comparison of AWQ and GPTQ post-training quantization — accuracy, speed, and memory — so you can pick the right method instead of guessing.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 13 min read

Quick answer: AWQ and GPTQ are both 4-bit post-training quantization (PTQ) methods, and on published benchmarks they land close to each other and close to FP16 — the differences are real but small. The consistent pattern across the literature: AWQ usually edges out GPTQ on accuracy at 4-bit (lower perplexity, slightly higher zero-shot accuracy), especially on instruction-tuned models, because it protects activation-salient weights. GPTQ is highly competitive when configured well (group size 128, activation-order on) and benefits from very fast inference kernels (ExLlamaV2). For inference speed, GPTQ with ExLlamaV2 and AWQ with optimized GEMM kernels are both fast and often within the same ballpark; the winner depends on your serving stack. The honest bottom line: the accuracy gap is small enough that your model, calibration data, and serving stack matter more than the AWQ-vs-GPTQ choice — so benchmark on your own setup.


A note on these numbers before we start

This post discusses benchmark results, so provenance matters. The figures and patterns below are drawn from published sources (primarily the GPTQ and AWQ papers and their reported WikiText perplexity and zero-shot accuracy tables) and from widely reproduced community comparisons. They are representative, not freshly measured here, and exact values shift with:

  • the model (LLaMA-2 vs Llama-3 vs Mistral, and the size),
  • the calibration dataset (domain and sample count),
  • the configuration (group size, activation order, kernel/version),
  • and the benchmark itself (WikiText perplexity vs MMLU vs your task).

So treat specific decimals as illustrative of the shape of the result, and reproduce on your own model and task before making a production decision. Where exact figures matter, consult the cited papers directly.


What the two methods do (quick recap)

Both are post-training quantization (PTQ): they quantize an already-trained model to 4-bit using a small calibration set, with no retraining.

  • GPTQ quantizes each weight matrix one column at a time, using approximate second-order (Hessian) information to adjust the not-yet-quantized weights so that they compensate for each rounding error. Mature, GPU-targeted, fast with ExLlamaV2 kernels.
  • AWQ identifies the ~1% of weight channels that are salient, judged by the magnitude of the activations passing through them, and scales those channels up before quantization so they lose less precision. All weights are still stored in 4-bit; nothing is kept in FP16. Activation-aware, GPU-targeted, strong support in vLLM/TGI.
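
To make the column-by-column idea concrete, here is a minimal NumPy sketch of a GPTQ-style pass over a single weight matrix. It is an illustration under simplifying assumptions, not the AutoGPTQ implementation: it uses one 4-bit grid per output row instead of groups of 128, processes columns one at a time without blocking, and the names make_grid, snap, and gptq_sketch are hypothetical. The act_order flag mimics the idea behind desc_act: columns with the largest Hessian diagonal are quantized first.

```python
import numpy as np


def make_grid(w, n_bits=4):
    """Per-row asymmetric grid: (scale, row minimum, highest integer level)."""
    qmax = 2 ** n_bits - 1
    w_min = w.min(axis=1)
    scale = np.maximum(w.max(axis=1) - w_min, 1e-8) / qmax
    return scale, w_min, qmax


def snap(col, scale, w_min, qmax):
    """Round one weight column (one value per output row) to its row's grid."""
    level = np.clip(np.round((col - w_min) / scale), 0, qmax)
    return level * scale + w_min


def gptq_sketch(w, x, n_bits=4, damp=0.01, act_order=True):
    """w: (out, in) weights. x: (samples, in) calibration inputs to the layer."""
    w = np.array(w, dtype=np.float64)          # work on a copy
    n_in = w.shape[1]
    h = x.T @ x                                 # proxy Hessian of the output error
    h = h + damp * np.mean(np.diag(h)) * np.eye(n_in)   # damping for stability

    # Activation ordering: visit columns with the largest diagonal first.
    perm = np.argsort(-np.diag(h)) if act_order else np.arange(n_in)
    w, h = w[:, perm], h[np.ix_(perm, perm)]

    scale, w_min, qmax = make_grid(w, n_bits)
    u = np.linalg.cholesky(np.linalg.inv(h)).T  # upper Cholesky factor of H^-1
    q = np.zeros_like(w)
    for i in range(n_in):
        q[:, i] = snap(w[:, i], scale, w_min, qmax)
        err = (w[:, i] - q[:, i]) / u[i, i]
        # Push the rounding error onto the columns not yet quantized.
        w[:, i + 1:] -= np.outer(err, u[i, i + 1:])
    return q[:, np.argsort(perm)]               # undo the column permutation
```

Calling gptq_sketch(w, x) returns dequantized weights with the same shape as w. Comparing the layer output error of this result against plain rounding, with act_order on and off, is a quick way to see the effect on your own calibration data rather than assuming it.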

(For the mechanism behind AWQ's activation-awareness, see the SmoothQuant / activation-aware quantization post.)
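
As a rough illustration of activation-aware scaling, the sketch below searches for a per-channel scale on a toy layer. It is a simplified stand-in, not AutoAWQ's code: the function names are hypothetical, the search is a plain grid over an exponent alpha, and the scale normalization is one reasonable choice among several. With alpha = 0 the scales are all 1, so the search starts from naive round-to-nearest and can only match or improve on it on the calibration data.

```python
import numpy as np


def quantize_rtn(w, n_bits=4, group_size=128):
    """Round-to-nearest with one asymmetric grid per group of input channels.

    Assumes the number of input channels is divisible by group_size.
    """
    n_out, n_in = w.shape
    g = w.reshape(n_out, n_in // group_size, group_size)
    g_min = g.min(axis=-1, keepdims=True)
    qmax = 2 ** n_bits - 1
    scale = np.maximum(g.max(axis=-1, keepdims=True) - g_min, 1e-8) / qmax
    level = np.clip(np.round((g - g_min) / scale), 0, qmax)
    return (level * scale + g_min).reshape(n_out, n_in)


def output_error(w_hat, w, x):
    """Mean squared difference between layer outputs for w_hat and w."""
    return float(np.mean((x @ w_hat.T - x @ w.T) ** 2))


def awq_style_search(w, x, n_grid=20, **quant_kwargs):
    """Grid-search alpha in [0, 1]; scale input channel j by act_mag[j] ** alpha."""
    act_mag = np.abs(x).mean(axis=0)            # per-input-channel activation size
    best_alpha, best_err = 0.0, float("inf")
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())      # keep scales centred around 1
        # Scale weights up, quantize, then fold the inverse scale back in.
        w_hat = quantize_rtn(w * s, **quant_kwargs) / s
        err = output_error(w_hat, w, x)
        if err < best_err:
            best_alpha, best_err = float(alpha), err
    return best_alpha, best_err
```

In a real deployment the inverse scale is folded into the preceding operation rather than divided out of the weights, so the stored weights stay 4-bit; dividing here just keeps the toy example self-contained.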


Accuracy: perplexity (lower is better)

The headline academic metric is WikiText perplexity at INT4 with group size 128. The pattern that holds across reported results:

FP16  <  AWQ  ≲  GPTQ(act-order)  <  GPTQ(no act-order)  <  RTN(naive)
(best)                                                      (worst)
  • FP16 is the unquantized reference: the lowest perplexity a quantized model can realistically hope to match.
  • AWQ typically lands closest to FP16 among the 4-bit methods.
  • GPTQ with activation ordering is very close behind AWQ — often within a small fraction of a perplexity point.
  • GPTQ without activation ordering is a bit worse.
  • Naive round-to-nearest (RTN) is clearly the worst, and is the baseline both methods beat.

| Method (INT4, g128) | Relative WikiText perplexity (illustrative) | Notes |
|---|---|---|
| FP16 (reference) | baseline | No quantization |
| AWQ | closest to FP16 | Usually best 4-bit accuracy |
| GPTQ (desc_act=True) | very close to AWQ | Activation order helps notably |
| GPTQ (desc_act=False) | slightly behind | Faster to produce, a touch worse |
| RTN (naive) | clearly worst | The baseline both methods beat |

The practical reading: at 4-bit, both AWQ and well-configured GPTQ sit close to FP16, with AWQ usually a hair ahead. The gap between them is small relative to the gap between either of them and naive RTN — meaning using a real method at all matters far more than which of the two you pick.
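
For reference, perplexity is the exponential of the average negative log-likelihood per token, so small differences in perplexity correspond to small differences in average per-token log-probability. A minimal sketch, assuming you already have the natural-log probability the model assigned to each true token (the function name is illustrative):

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of the true tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that gives every true token probability 0.5 has perplexity of about 2.
print(perplexity([math.log(0.5)] * 4))
```

When comparing methods, compute this over the same tokenized text with the same context length for every variant; otherwise the numbers are not comparable.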


Accuracy: downstream tasks

Perplexity doesn't always track task performance, so the more decision-relevant comparison is zero-shot/few-shot accuracy (MMLU, ARC, HellaSwag, GSM8K, etc.). The reported pattern:

  • On base models, AWQ and GPTQ are very close on most zero-shot benchmarks, both near FP16, AWQ often marginally ahead.
  • On instruction-tuned models, AWQ is frequently reported to hold up slightly better. This is one of its commonly cited strengths, usually attributed to its not overfitting to a calibration distribution that may differ from instruction data.
  • On reasoning/code tasks (GSM8K, HumanEval), both methods show more degradation than on simple zero-shot tasks — this is a property of 4-bit quantization generally, not a differentiator between them.

| Workload | Typical finding |
|---|---|
| Base-model zero-shot | AWQ ≈ GPTQ, both ≈ FP16 (AWQ often marginally ahead) |
| Instruction-tuned | AWQ often slightly more robust |
| Reasoning / code | Both degrade more than on easy tasks; pick by your eval |
| Multimodal | AWQ frequently cited as strong |


Inference speed and throughput

Accuracy is only half the decision; serving speed often decides it. Both methods have fast, specialized kernels, and the comparison is stack-dependent:

| Aspect | GPTQ | AWQ |
|---|---|---|
| Fast kernels | ExLlamaV2 (very fast) | Optimized GEMM/GEMV kernels |
| Serving support | Broad (AutoGPTQ, vLLM, TGI, exllama) | Strong in vLLM, TGI, AutoAWQ |
| Typical throughput | High, especially with ExLlamaV2 | High, comparable |
| Latency at low batch | Excellent with ExLlamaV2 | Excellent with AWQ GEMV |

Neither is a universal speed winner. With ExLlamaV2, GPTQ is extremely fast, particularly at low batch sizes; AWQ's kernels are comparably fast in vLLM-style serving. The right question isn't "which method is faster?" but "which is faster on the stack I'm deploying on?", so measure with your actual runtime and batch profile.
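
A small timing harness is enough to get comparable numbers for both variants. The sketch below assumes a hypothetical generate(prompt) callable that wraps whatever runtime you serve with and returns the number of tokens it produced; it sends requests one at a time, so it reflects low-batch latency rather than throughput under concurrent load.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a list sorted in ascending order."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def measure(generate, prompts, warmup=2):
    """Time sequential calls to generate(prompt), which must return a token count."""
    for prompt in prompts[:warmup]:             # warm up kernels and caches
        generate(prompt)

    latencies, tokens = [], 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        tokens += generate(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "tokens_per_s": tokens / elapsed,
    }
```

Run it with the same prompts and generation length for the AWQ and GPTQ builds, on the same GPU, and repeat at the batch size you expect in production using your server's own load-testing tool.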


Quantization cost (producing the model)

A practical difference that's easy to overlook: how long and how much memory it takes to create the quantized model.

  • GPTQ is computationally heavier to produce — its column-by-column, Hessian-based procedure takes longer and is more memory-intensive on large models.
  • AWQ is generally faster to quantize, since computing activation-based scales is lighter than GPTQ's second-order optimization.

For one-off conversions this rarely matters; for pipelines that re-quantize frequently (e.g., after fine-tunes), AWQ's lower quantization cost is a modest convenience.


So which should you use?

| Your priority | Pick |
|---|---|
| Best 4-bit accuracy, esp. instruction-tuned | AWQ (usually a small edge) |
| Maximum tooling maturity / compatibility | GPTQ (long track record, broad support) |
| Serving on vLLM/TGI with best quality | AWQ (first-class support) |
| Low-batch latency with ExLlamaV2 | GPTQ (extremely fast kernels) |
| Frequent re-quantization (after fine-tunes) | AWQ (cheaper to produce) |
| You already have a working pipeline | Keep it; the gap rarely justifies a switch |

The honest recommendation: default to AWQ for a small accuracy edge and modern serving support, or GPTQ if your stack is built around it / you want ExLlamaV2 speed. Then benchmark both on your model and task — the difference is often small enough that calibration quality and serving configuration dominate the outcome.


How to benchmark them yourself (the part that actually decides)

Because results are setup-dependent, the only numbers that should drive your decision are your own. A minimal protocol:

  1. Fix everything except the method — same model, same calibration data (128–512 representative samples), same group size (128), same eval set.
  2. Quantize both — AWQ (GEMM) and GPTQ (group_size=128, desc_act=True).
  3. Measure accuracy — WikiText perplexity for a quick check, plus your real task via lm-evaluation-harness (MMLU/GSM8K/etc.) or a domain eval set.
  4. Measure speed — throughput and p50/p95 latency on your serving stack (vLLM, TGI, exllama) at your real batch size.
  5. Compare against FP16 — so you know the absolute loss, not just the relative ranking.
  6. Decide on the blend — the method that meets your accuracy bar at the best speed on your hardware wins, regardless of which paper reported what.
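
The bookkeeping for this protocol can be sketched in a few lines. The settings dictionaries below use key names as they appear in AutoAWQ's quant_config and AutoGPTQ's BaseQuantizeConfig to the best of my knowledge; treat them as assumptions and check them against the versions you have installed. The Run class and the settings_match and pick helpers are hypothetical names for illustration.

```python
from dataclasses import dataclass

# Assumed key names; verify against your installed AutoAWQ / AutoGPTQ versions.
AWQ_SETTINGS = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
GPTQ_SETTINGS = {"bits": 4, "group_size": 128, "desc_act": True}


def settings_match(awq, gptq):
    """Step 1: bit width and group size must be identical across methods."""
    return awq["w_bit"] == gptq["bits"] and awq["q_group_size"] == gptq["group_size"]


@dataclass
class Run:
    name: str
    accuracy: float      # your task metric, higher is better
    tokens_per_s: float  # measured on your serving stack


def pick(runs, fp16_accuracy, max_drop):
    """Steps 5-6: keep runs within max_drop of FP16, then take the fastest."""
    eligible = [r for r in runs if fp16_accuracy - r.accuracy <= max_drop]
    if not eligible:
        return None      # nothing meets the accuracy bar
    return max(eligible, key=lambda r: r.tokens_per_s)
```

Fill one Run per quantized variant from your own measurements, then call pick(runs, fp16_accuracy=..., max_drop=...) with the accuracy loss you are willing to accept. A result of None means neither variant is acceptable at that threshold, which is itself a useful finding.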


Frequently asked questions

Is AWQ more accurate than GPTQ at 4-bit?
Usually, by a small margin. Across published benchmarks AWQ tends to land closest to FP16 in WikiText perplexity and is often marginally ahead on zero-shot accuracy, with a frequently cited edge on instruction-tuned models because it protects activation-salient weights. But well-configured GPTQ (group size 128, activation order enabled) is very close behind, and the gap between the two is small relative to the gap versus naive rounding. Benchmark on your own model to be sure.

Which is faster for inference, AWQ or GPTQ?
It depends on your serving stack. GPTQ with ExLlamaV2 kernels is extremely fast, particularly at low batch sizes, while AWQ's optimized kernels are comparably fast in vLLM/TGI serving. Neither is a universal winner, so measure throughput and latency on the runtime and batch profile you'll actually deploy.

Are these benchmark numbers reliable?
They reflect consistent patterns from the GPTQ and AWQ papers and widely reproduced community comparisons, but exact values vary with the model, calibration data, configuration, and benchmark. Treat published figures as the shape of the result (in perplexity, FP16 < AWQ ≲ well-tuned GPTQ < naive RTN, with AWQ and GPTQ both near FP16 at 4-bit), not precise guarantees, and reproduce on your own setup before committing.

Does GPTQ's activation order (desc_act) matter?
Yes. Enabling activation ordering (desc_act=True) noticeably improves GPTQ's accuracy, closing much of the gap to AWQ, at a small cost in quantization time and sometimes inference speed. If you're comparing GPTQ to AWQ, enable it; comparing AWQ to GPTQ without activation order is an unfair test that overstates AWQ's lead.

Which uses less time/memory to produce the quantized model?
AWQ is generally cheaper to produce, because computing activation-based scales is lighter than GPTQ's column-by-column second-order optimization, which is slower and more memory-intensive on large models. For one-off conversions this is minor; for pipelines that re-quantize often (e.g., after each fine-tune), AWQ's lower quantization cost is a practical advantage.

Should I switch my existing GPTQ pipeline to AWQ?
Probably not for the accuracy difference alone, which is usually small. Switch if you want AWQ's specific strengths (slight edge on instruction-tuned models, first-class vLLM support, cheaper re-quantization) or if a benchmark on your own task shows a meaningful gap. If your GPTQ setup meets your accuracy and latency bars, the migration cost rarely pays off.


Key takeaways

  • AWQ and GPTQ are both 4-bit PTQ methods that land close to each other and close to FP16 — the differences are real but small.
  • AWQ usually has a slight accuracy edge (lower perplexity, marginally better zero-shot, stronger on instruction-tuned models).
  • GPTQ is highly competitive when configured well (group size 128, desc_act=True) and is extremely fast with ExLlamaV2.
  • Speed is stack-dependent — GPTQ+ExLlamaV2 and AWQ+vLLM kernels are both fast; the winner depends on your runtime and batch size.
  • AWQ is cheaper to produce; GPTQ has the longer track record and broadest tooling.
  • The published numbers are representative, not guarantees — calibration quality and serving config often matter more than the method choice, so benchmark both on your own model, task, and stack.


References

  1. Lin, J., Tang, J., Tang, H., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (perplexity & accuracy vs GPTQ/RTN). MLSys 2024. https://arxiv.org/abs/2306.00978
  2. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. https://arxiv.org/abs/2210.17323
  3. AutoAWQ library. https://github.com/casper-hansen/AutoAWQ
  4. AutoGPTQ library. https://github.com/AutoGPTQ/AutoGPTQ
  5. ExLlamaV2 (fast GPTQ inference kernels). https://github.com/turboderp/exllamav2
  6. vLLM. Supported quantization methods. https://docs.vllm.ai/en/latest/features/quantization/
  7. EleutherAI. Language Model Evaluation Harness. https://github.com/EleutherAI/lm-evaluation-harness