GGUF vs AWQ vs GPTQ: Which Quantization Format to Use?

A practical guide to choosing between GGUF, AWQ, and GPTQ quantization formats based on your runtime, hardware, and accuracy needs.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 13 min read

Quick answer: These three names get compared as if they're the same kind of thing, but they aren't. GGUF is a file format (the container used by llama.cpp) that supports many quantization levels and runs on CPU, GPU, or a mix — it's the default for local/consumer use (Ollama, LM Studio, Macs). GPTQ and AWQ are quantization algorithms that produce GPU-optimized 4-bit weights. GPTQ is the mature, widely-supported method with fast GPU kernels; AWQ is activation-aware and usually preserves a bit more accuracy at 4-bit, with strong support in serving stacks like vLLM. Rule of thumb: GGUF for running locally / on CPU / on a Mac, AWQ for high-quality GPU serving, GPTQ for mature, broadly-compatible GPU inference.


First, clear up the category confusion

The biggest source of confusion is treating GGUF, AWQ, and GPTQ as three options on the same axis. They're not:

  • GGUF = a format (a container file). It defines how a quantized model is stored, not how it was quantized. It supports a whole family of quantization schemes (Q2 through Q8, K-quants, importance-matrix quants).
  • GPTQ = an algorithm. A post-training method that decides which integer to round each weight to using second-order (Hessian) information, then stores the result for GPU inference.
  • AWQ = an algorithm. A post-training method that uses activation magnitudes to find the most salient weight channels and rescales them before quantization, so they lose less precision when everything is rounded to low-bit integers.

So the honest framing is: GGUF is what file you ship; GPTQ and AWQ are how you computed the weights. In practice, though, the community packages them as distinct downloads ("the GGUF version", "the AWQ version", "the GPTQ version"), each tied to a different runtime — which is why people compare them directly. This post does too, but keep the distinction in mind: it explains why GGUF behaves so differently from the other two.


GGUF: the local-inference format

GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp and its ecosystem. It stores weights plus all metadata in a single file and supports a wide range of quantization levels.

  • Runtime: llama.cpp, and everything built on it — Ollama, LM Studio, text-generation-webui (llama.cpp backend).
  • Hardware: CPU, GPU, Apple Silicon (Metal), or mixed — you can offload some layers to GPU and keep the rest on CPU. This is GGUF's superpower for memory-constrained machines.
  • Quantization levels: highly flexible — Q2_K through Q8_0, with K-quants (Q4_K_M, Q5_K_M) and importance-matrix (imatrix) variants that allocate precision where it matters.
  • Best for: running models locally, on Macs, on CPUs, or on consumer GPUs with limited VRAM where layer offloading is essential.

GGUF's defining trait is portability and flexibility: one format, many bit-widths, runs almost anywhere, and gracefully spills over to CPU/RAM when VRAM runs out.
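The layer-offloading trade-off above comes down to simple arithmetic. The following is a rough sketch, not a llama.cpp API: the function names, the 1.5 GiB reserve for KV cache and runtime overhead, and the 4.5 bits-per-weight figure are all illustrative assumptions, and it treats weights as spread evenly across layers.

```python
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def gpu_layers_that_fit(
    n_params: float,
    bits_per_weight: float,
    n_layers: int,
    vram_gib: float,
    reserve_gib: float = 1.5,  # assumed headroom for KV cache and overhead
) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes weights are spread evenly across layers and ignores
    embeddings, so treat the result as a starting point only.
    """
    per_layer = weight_size_gib(n_params, bits_per_weight) / n_layers
    usable = max(0.0, vram_gib - reserve_gib)
    return min(n_layers, int(usable / per_layer))


# Hypothetical 8B-parameter, 32-layer model at an assumed 4.5 bits per weight.
for vram in (4.0, 8.0, 24.0):
    layers = gpu_layers_that_fit(8e9, 4.5, 32, vram)
    print(f"{vram:>4} GiB VRAM: offload about {layers} of 32 layers")
```

The result is the kind of number you would try first for llama.cpp's `--n-gpu-layers` option, then adjust after watching actual memory use.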

Reading GGUF tags: in Q4_K_M, the 4 is the nominal bit count, K marks the K-quant family (block-wise scaling within super-blocks), and M is the "medium" quality tier. Q4_K_M is the most-recommended 4-bit balance; go up to Q5_K_M or Q6_K for more quality, and down to Q3 or Q2 only when desperate for memory.
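The tag convention can be made concrete with a small parser. This is a minimal sketch covering only the tag shapes mentioned in this post (Q4_K_M, Q6_K, Q8_0 and similar); `QuantTag` and `parse_quant_tag` are hypothetical names, not part of llama.cpp, and other tag families such as the IQ variants are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class QuantTag(NamedTuple):
    bits: int             # nominal bit count, e.g. 4 in Q4_K_M
    k_quant: bool         # True for K-quants (Q4_K_M), False for legacy (Q8_0)
    tier: Optional[str]   # "small" / "medium" / "large", if present


_TIERS = {"S": "small", "M": "medium", "L": "large"}
# Matches Q<bits>_K, Q<bits>_K_<tier>, or legacy Q<bits>_<digit>.
_PATTERN = re.compile(r"^Q(\d)_(?:(K)(?:_([SML]))?|(\d))$")


def parse_quant_tag(tag: str) -> QuantTag:
    """Split a GGUF quantization tag into its parts."""
    match = _PATTERN.match(tag.upper())
    if match is None:
        raise ValueError(f"unrecognized quantization tag: {tag!r}")
    bits = int(match.group(1))
    if match.group(2):  # K-quant branch
        return QuantTag(bits, True, _TIERS.get(match.group(3)))
    return QuantTag(bits, False, None)


for name in ("Q4_K_M", "Q6_K", "Q8_0"):
    print(name, parse_quant_tag(name))
```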


GPTQ: the mature GPU algorithm

GPTQ is a post-training quantization algorithm that quantizes a layer's weights column by column. After each rounding step it adjusts the not-yet-quantized weights, using approximate second-order (Hessian) information, to compensate for the error that rounding introduced. It needs a small calibration dataset and produces GPU-ready low-bit weights (typically 4-bit; 3-bit and 8-bit are also used).

  • Runtime: AutoGPTQ, and fast inference kernels like ExLlamaV2; supported in many GPU serving stacks.
  • Hardware: GPU-focused (CUDA). Not designed for CPU offloading the way GGUF is.
  • Quality: good at 4-bit; mature and battle-tested across thousands of community models.
  • Best for: GPU inference where you want broad compatibility and well-optimized kernels.

GPTQ's defining trait is maturity and ecosystem breadth — it was one of the first practical 4-bit LLM methods and is supported almost everywhere on the GPU side.
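To make the error-compensation idea behind GPTQ concrete, here is a toy sketch in plain Python. It handles a single weight row on a fixed symmetric 4-bit grid and uses the plain inverse-Hessian update, without the Cholesky reformulation and lazy batched updates that the GPTQ paper adds for stability and speed. All function names and the damping value are illustrative assumptions, not AutoGPTQ's API.

```python
import random


def invert(matrix):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def round_to_grid(value, step, qmax):
    return max(-qmax, min(qmax, round(value / step))) * step


def gptq_like_row(weights, calib, bits=4, damp=0.01):
    """Quantize one weight row left to right, pushing each rounding
    error onto the remaining weights via the inverse Hessian."""
    n = len(weights)
    qmax = 2 ** (bits - 1) - 1
    step = (max(abs(w) for w in weights) / qmax) or 1.0
    # H = 2 * sum(x x^T), plus a damped diagonal for numerical stability.
    hess = [[2.0 * sum(x[i] * x[j] for x in calib) for j in range(n)] for i in range(n)]
    mean_diag = sum(hess[i][i] for i in range(n)) / n
    for i in range(n):
        hess[i][i] += damp * mean_diag
    hinv = invert(hess)
    w, out = list(weights), [0.0] * n
    for i in range(n):
        out[i] = round_to_grid(w[i], step, qmax)
        err = (w[i] - out[i]) / hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * hinv[i][j]          # compensate remaining weights
        d, row_i = hinv[i][i], hinv[i][:]
        for r in range(i + 1, n):             # drop weight i from the inverse
            f = hinv[r][i] / d
            for c in range(i + 1, n):
                hinv[r][c] -= f * row_i[c]
    return out, step


def output_mse(weights, approx, calib):
    return sum((sum((w - a) * xi for w, a, xi in zip(weights, approx, x))) ** 2
               for x in calib) / len(calib)


random.seed(0)
n = 8
weights = [random.gauss(0.0, 1.0) for _ in range(n)]
calib = []
for _ in range(64):                           # correlated synthetic inputs
    base = random.gauss(0.0, 1.0)
    calib.append([base + 0.3 * random.gauss(0.0, 1.0) for _ in range(n)])

quantized, step = gptq_like_row(weights, calib)
nearest = [round_to_grid(w, step, 7) for w in weights]
print("round-to-nearest output MSE:", output_mse(weights, nearest, calib))
print("GPTQ-like output MSE:       ", output_mse(weights, quantized, calib))
```

Both results use the same 4-bit grid; the only difference is that the GPTQ-like pass chooses each integer after accounting for errors already made, which is why the calibration inputs matter.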


AWQ: the activation-aware GPU algorithm

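AWQ (Activation-aware Weight Quantization) starts from the observation that a small fraction of weights (around 1%) matters most, and that these are best identified by the magnitude of the activations flowing through them. Rather than keeping those weights in higher precision, AWQ rescales the corresponding channels before quantization so they lose less to rounding, and all weights still end up at 4-bit. Like GPTQ, it needs calibration data and targets GPU inference.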

  • Runtime: AutoAWQ; strong first-class support in vLLM and Text Generation Inference (TGI).
  • Hardware: GPU-focused, with very fast inference kernels.
  • Quality: typically preserves slightly more accuracy than GPTQ at the same 4-bit budget, and is often reported to hold up well on instruction-tuned and multimodal models.
  • Best for: high-throughput GPU serving where you want the best 4-bit quality and fast kernels.

AWQ's defining trait is quality-per-bit on the GPU — by being activation-aware, it tends to lose less on hard tasks at 4-bit. (For the mechanism, see the SmoothQuant / activation-aware quantization post.)
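The activation-aware idea can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: it treats one weight row as a single quantization group, derives per-channel scales as mean activation magnitude raised to a power alpha, and picks alpha by a small grid search on calibration data. Function names and the synthetic data are illustrative, not AutoAWQ's API, and the real method also works across whole weight matrices with group-wise quantization.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest with one shared step for the group."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    step = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / step))) * step for w in weights]


def output_error(weights, approx, calib):
    """Mean squared error of the row's output over calibration inputs."""
    total = 0.0
    for x in calib:
        ref = sum(w * xi for w, xi in zip(weights, x))
        got = sum(a * xi for a, xi in zip(approx, x))
        total += (ref - got) ** 2
    return total / len(calib)


def awq_like_quantize(weights, calib, bits=4, grid=11):
    """Search a scaling exponent alpha in [0, 1]; alpha = 0 means no scaling."""
    n = len(weights)
    act_mag = [max(sum(abs(x[j]) for x in calib) / len(calib), 1e-8) for j in range(n)]
    best = None
    for k in range(grid):
        alpha = k / (grid - 1)
        scales = [m ** alpha for m in act_mag]
        scaled = [w * s for w, s in zip(weights, scales)]
        # Quantize the scaled weights, then fold the scale back out.
        approx = [q / s for q, s in zip(quantize_group(scaled, bits), scales)]
        err = output_error(weights, approx, calib)
        if best is None or err < best[0]:
            best = (err, alpha, approx)
    return best


random.seed(0)
n = 16
weights = [random.gauss(0.0, 1.0) for _ in range(n)]
# Channel 3 is made "salient": its activations are much larger than the rest.
calib = [[random.gauss(0.0, 1.0) * (20.0 if j == 3 else 1.0) for j in range(n)]
         for _ in range(32)]

rtn_err = output_error(weights, quantize_group(weights), calib)
awq_err, alpha, _ = awq_like_quantize(weights, calib)
print(f"round-to-nearest output MSE: {rtn_err:.4f}")
print(f"AWQ-like output MSE:         {awq_err:.4f} (alpha = {alpha:.1f})")
```

Because alpha = 0 (no scaling) is part of the search grid, the selected result is never worse than plain round-to-nearest on the calibration set; how much better it gets depends on how uneven the activation magnitudes are.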


Head-to-head comparison

| Dimension | GGUF | GPTQ | AWQ |
|---|---|---|---|
| What it is | File format / container | Quantization algorithm | Quantization algorithm |
| Primary runtime | llama.cpp (Ollama, LM Studio) | AutoGPTQ, ExLlamaV2 | AutoAWQ, vLLM, TGI |
| Hardware | CPU, GPU, Apple Silicon, mixed | GPU (CUDA) | GPU (CUDA) |
| CPU offloading | Yes (layer-by-layer) | No | No |
| Bit-width flexibility | Very high (Q2–Q8, K/i-quants) | Mainly 4-bit (3/8 possible) | Mainly 4-bit |
| Calibration data | Optional (needed for imatrix quants) | Required | Required |
| Accuracy at 4-bit | Good (Q4_K_M and up) | Good | Often best of the three |
| Inference speed (GPU) | Good | Fast (ExLlamaV2) | Fast |
| Best environment | Local / consumer / Mac / CPU | Mature GPU serving | High-quality GPU serving |
| Ecosystem | Huge for local use | Huge, mature | Growing fast, serving-focused |

Which one should you use?

Use GGUF when…

  • You're running locally — on a laptop, desktop, or Mac (Apple Silicon).
  • You're VRAM-constrained and need to offload layers to CPU/RAM to fit a model at all.
  • You use Ollama or LM Studio (they consume GGUF natively).
  • You want flexibility to pick from many quality tiers, or to run on CPU-only hardware.

Use AWQ when…

  • You're serving on GPUs and want the best 4-bit quality.
  • You're deploying on vLLM or TGI (first-class AWQ support).
  • Your model is instruction-tuned or multimodal, where AWQ tends to hold up well.

Use GPTQ when…

  • You're serving on GPUs and want maximum compatibility with mature tooling and fast kernels (ExLlamaV2).
  • You have an existing AutoGPTQ / GPTQ pipeline or rely on a stack where GPTQ is best supported.
  • You want a broadly-available 4-bit option with a long track record.

Simplest decision: Local, Mac, or low-VRAM? → GGUF. GPU server, want best quality? → AWQ. GPU server, want maturity/compatibility? → GPTQ.
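The same shortcut, written out as a function for anyone wiring it into a deployment script. This is only a sketch of the rule of thumb in this post; `choose_format` and its parameters are hypothetical names, and a real decision should still be checked against benchmarks on your own task.

```python
def choose_format(has_gpu_server: bool, needs_cpu_offload: bool,
                  priority: str = "quality") -> str:
    """Map a deployment situation to GGUF, AWQ, or GPTQ.

    priority is either "quality" or "compatibility" and only matters
    for GPU serving without CPU offloading.
    """
    if not has_gpu_server or needs_cpu_offload:
        return "GGUF"   # local, Mac, CPU-only, or low-VRAM setups
    if priority == "quality":
        return "AWQ"    # GPU serving, best 4-bit quality
    if priority == "compatibility":
        return "GPTQ"   # GPU serving, mature and broadly supported
    raise ValueError(f"unknown priority: {priority!r}")


print(choose_format(has_gpu_server=False, needs_cpu_offload=False))
print(choose_format(has_gpu_server=True, needs_cpu_offload=False))
print(choose_format(True, False, priority="compatibility"))
```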


Common questions and gotchas

| Question / gotcha | Answer |
|---|---|
| "Which is most accurate at 4-bit?" | Usually AWQ, narrowly, but the gap is small and model-dependent; measure on your task |
| "Can I run GPTQ/AWQ on CPU?" | Not practically; they're GPU-targeted, so use GGUF for CPU |
| "Can GGUF run on GPU?" | Yes, fully on GPU or split GPU/CPU; it's the offloading that's unique |
| "Is GGUF lower quality because it's for CPU?" | No; Q5/Q6 K-quants are high quality, and the format isn't the limiter |
| "Do AWQ/GPTQ need calibration data?" | Yes, both; use a representative set (128–512 samples) |
| "Which has the best ecosystem?" | GGUF for local, GPTQ for mature GPU tooling, AWQ for modern serving |

Frequently asked questions

What is the difference between GGUF, AWQ, and GPTQ? GGUF is a file format (the container used by llama.cpp) that supports many quantization levels and runs on CPU, GPU, or a mix. AWQ and GPTQ are quantization algorithms that produce GPU-optimized low-bit weights. So GGUF describes how the model is stored and run, while AWQ and GPTQ describe how the weights were quantized. GGUF is best for local/CPU/Mac use; AWQ and GPTQ are for GPU serving.

Which quantization format is best for running a model locally? GGUF, in almost all cases. It runs on CPUs and Apple Silicon, supports offloading some layers to GPU and the rest to CPU/RAM (essential when you're short on VRAM), and is the native format for Ollama and LM Studio. AWQ and GPTQ are GPU-only and don't offload to CPU, so they're not suited to typical local setups without an adequate GPU.

Is AWQ better than GPTQ? AWQ usually preserves slightly more accuracy at 4-bit because it protects the most salient weights using activation statistics, and it has strong support in modern serving stacks like vLLM. GPTQ is more mature, has very broad tooling support, and fast kernels via ExLlamaV2. The accuracy gap is small and model-dependent — choose AWQ for best 4-bit quality on supported stacks, GPTQ for maturity and compatibility, and benchmark on your own task if it matters.

Can GGUF be as accurate as AWQ or GPTQ? Yes. GGUF's higher-quality K-quants (Q5_K_M, Q6_K) and importance-matrix quants are competitive in quality; the format itself doesn't cap accuracy. The trade-off is that GGUF on GPU may not match the raw throughput of AWQ/GPTQ with their specialized kernels, but for quality at a given bit-width GGUF is not inherently worse.

Do I need calibration data for these? GPTQ and AWQ both require a small calibration dataset (typically 128–512 representative samples) to compute their quantization parameters. Plain GGUF K-quants don't strictly require calibration, though GGUF's importance-matrix (imatrix) variants use a calibration corpus to allocate precision better. For all calibration-based methods, a representative dataset matters for final quality.

Which should I use for serving on vLLM? AWQ has first-class support in vLLM and is a common choice for high-quality 4-bit GPU serving; GPTQ is also supported. GGUF is not the typical vLLM path — it's the llama.cpp ecosystem's format. For vLLM serving, pick AWQ for best quality (or GPTQ for compatibility), and reserve GGUF for llama.cpp-based local deployment.


Key takeaways

  • GGUF is a format; GPTQ and AWQ are algorithms — that category difference explains everything else.
  • GGUF runs on CPU/GPU/Mac with layer offloading — the default for local, consumer, and memory-constrained use (Ollama, LM Studio).
  • GPTQ is the mature GPU algorithm with broad tooling and fast ExLlamaV2 kernels.
  • AWQ is the activation-aware GPU algorithm — usually the best 4-bit quality, with strong vLLM/TGI support.
  • Decision shortcut: local/Mac/low-VRAM → GGUF; GPU server, best quality → AWQ; GPU server, maturity → GPTQ.
  • The 4-bit accuracy gap between AWQ and GPTQ is small and model-dependent — measure on your own task before optimizing for it.

References

  1. llama.cpp / GGUF format specification. https://github.com/ggml-org/llama.cpp/blob/master/docs/development/gguf.md
  2. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. https://arxiv.org/abs/2210.17323
  3. Lin, J., Tang, J., Tang, H., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024. https://arxiv.org/abs/2306.00978
  4. AutoAWQ library. https://github.com/casper-hansen/AutoAWQ
  5. AutoGPTQ library. https://github.com/AutoGPTQ/AutoGPTQ
  6. vLLM. Supported quantization methods (AWQ, GPTQ). https://docs.vllm.ai/en/latest/features/quantization/