LLM Quantization: 2-Bit vs 4-Bit vs 8-Bit Accuracy & Memory

Standard 2-bit quantization destroys model accuracy - perplexity spikes to 38,000+. Here's what the benchmarks actually say about 2-bit vs 4-bit vs 8-bit, and how to pick the right level for your hardware and task.

Mohammed Kafeel

Machine Learning Researcher

June 9, 2026

17 min read

Standard 2-bit GPTQ quantization sends perplexity (a measure of how surprised the model is by text - lower is better) from 6.14 all the way to >38,000. That's not a typo. The model stops making sense entirely. Meanwhile, 4-bit quantization costs you roughly 2% accuracy - and 8-bit costs you almost nothing.

If you're choosing a quantization level for a local or production LLM deployment, this guide gives you the numbers to decide fast.

TL;DR - Key Takeaways

The one-line answer: Use 8-bit when accuracy is non-negotiable and you have 16 GB+ VRAM. Use 4-bit (Q4_K_M or AWQ) for most real-world deployments. Avoid standard 2-bit in production - it catastrophically breaks instruction-following. If you must go below 4 bits, use an advanced method like SpinQuant or picoLLM.

8-bit (Q8_0): ~0.3% accuracy drop vs FP16. Needs ~10 GB VRAM for an 8B model. Up to 1.8× faster than FP16.
4-bit (AWQ / Q4_K_M): ~1.8–2.1% accuracy drop. Needs ~5–6 GB VRAM for an 8B model. Up to 2.4× faster than FP16.
2-bit (standard GPTQ): ~43% relative accuracy drop. MMLU collapses from 65.2% → 37.1%. Unusable.
Sub-4-bit (advanced - SpinQuant, picoLLM): SpinQuant holds a 4.4-point MMLU gap vs FP16 with weights, activations, and KV cache all at 4 bits (W4A4KV4); picoLLM reports near-FP16 MMLU at 2-bit. Viable for edge deployment.
Format matters as well as bit-width: AWQ-4bit beats GPTQ-4bit by 0.8 MMLU points, a gap comparable to the 1.2 points between Q4_K_M and Q8_0.
Red Hat's 500K+ evaluations confirm: W8A8-INT recovers 99%+ accuracy; W4A16-INT recovers 98.9% on HumanEval.

What Is LLM Quantization?

Quantization compresses model weights from high-precision floats (FP16/FP32) down to lower-bit integers, dramatically cutting memory and speeding up inference.

A standard Llama 3.1 8B model in FP16 weighs ~16 GB and needs about 17 GB of VRAM at runtime. That rules out most consumer GPUs under $1,000. Quantization maps those 16-bit floats to 8-bit or 4-bit integers using scale factors that preserve approximate magnitude - shrinking the model to fit on hardware you actually own.

This is why quantization isn't optional for local LLM deployment. It's the mechanism that makes running 7B–70B models on a laptop or a single consumer GPU possible at all.

Why the bit-width choice matters so much

Each step down the precision ladder trades accuracy for memory and speed. The tradeoff isn't linear. Going from FP16 to 8-bit costs almost nothing. Going from 4-bit to 2-bit with standard methods is catastrophic. (If you want the head-to-head numbers first, here's INT4, INT8, and FP16 compared.)

The key insight from 2024–2025 benchmarks: the quantization format (AWQ vs GPTQ vs GGUF Q4_K_M) matters as well as the bit-width itself. In the Llama 3.1 8B results below, the gap between AWQ-4bit and GPTQ-4bit (0.8 MMLU points) is comparable to the gap between Q4_K_M and Q8_0 (1.2 points).

How Does Each Bit-Width Work?

8-bit quantization (INT8 / Q8_0)

8-bit maps FP16 weights to 256 possible integer values using per-group scale factors. Quality loss is near-zero.

The quantizer divides each weight by a scale factor derived from its group's value range, rounds to the nearest 8-bit integer, and reconstructs during inference by multiplying back. With 256 representable values, the dynamic range is wide enough that rounding errors stay tiny.
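
The round trip described above can be sketched in a few lines. This is a minimal illustration of symmetric absmax quantization for a single group, not the actual Q8_0 kernel; the function names and sample weights are hypothetical.

```python
def quantize_group(weights, bits=8):
    """Symmetric absmax quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per group
    # Divide by the scale, round to the nearest integer, clamp to range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Reconstruct approximate weights by multiplying back."""
    return [v * scale for v in q]


# Hypothetical group of 8 weights.
group = [0.12, -0.53, 0.07, 0.91, -0.33, 0.002, -0.78, 0.45]
q, scale = quantize_group(group, bits=8)
restored = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print("integers:", q)
print("scale:", scale)
print("max round-trip error:", max_err)
```

The worst-case round-trip error is half the scale, which is why a wide integer range keeps the error small.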

Result: the Llama 3.1 8B Q8_0 model scores a perplexity of 6.17 vs the FP16 baseline of 6.14 - a difference you'd never notice in practice.

4-bit quantization (INT4 / Q4_K_M / AWQ / GPTQ)

4-bit cuts the representable values to just 16 per group. Modern methods compensate with groupwise quantization and importance-aware weight selection.

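Halving the bit-width halves weight memory again, but the cost is steep: the number of representable levels falls from 256 to 16, so the rounding step for the same value range grows by roughly 16×. Modern 4-bit methods fight this in two ways: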

Groupwise quantization: small groups of weights (32–128 values) each get their own scale factor, preserving local variation.
Importance-aware selection: AWQ identifies "salient" weights - those that disproportionately affect output - and preserves them at higher fidelity.
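
To see why groupwise scales help, here is a rough sketch comparing one scale for a whole tensor against one scale per group of 32 at 4-bit. It illustrates groupwise quantization only, not AWQ's salient-weight handling; the toy tensor and function names are assumptions made for the example.

```python
def quantize_dequantize(weights, bits):
    """Symmetric absmax round trip with a single scale for `weights`."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def groupwise(weights, bits, group_size):
    """Same round trip, but each group gets its own scale."""
    out = []
    for i in range(0, len(weights), group_size):
        out.extend(quantize_dequantize(weights[i:i + group_size], bits))
    return out


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy tensor: 32 small-magnitude weights followed by 32 large ones.
weights = [0.01 * ((i % 7) - 3) for i in range(32)] \
        + [0.5 * ((i % 5) - 2) for i in range(32)]

per_tensor_err = mean_abs_error(weights, quantize_dequantize(weights, 4))
per_group_err = mean_abs_error(weights, groupwise(weights, 4, 32))

print("one scale for the tensor:", per_tensor_err)
print("one scale per group of 32:", per_group_err)
```

With a single scale, the large weights set the step size and the small ones all round to zero. Per-group scales keep the local variation.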

GGUF's K-quant system (Q4_K_M) goes further: it uses mixed precision across tensors, keeping the most sensitive ones (in llama.cpp's Q4_K_M recipe, part of the attention value and feed-forward down-projection weights) at higher precision than the rest. That's why Q4_K_M consistently outperforms naive Q4_0. (For a step-by-step on running 4-bit, the practical sweet spot, on Llama 3, follow our hands-on guide.)

2-bit quantization (INT2 / Q2_K / GPTQ-2bit)

2-bit leaves only 4 representable values per group. Standard methods fail completely. Advanced methods (SpinQuant, picoLLM, QuIP#) can recover much of the lost quality.

With just 4 values, uniform bit allocation can't handle the outlier weights that dominate model behavior. Standard GPTQ at 2-bit produces perplexity >38,000 and MMLU accuracy of ~37.1% - not far above the 25% a model would score by guessing randomly on a 4-option multiple-choice test.
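
A toy example makes the outlier problem concrete. The sketch below uses plain uniform (min/max) quantization, not GPTQ, and the weights are made up: eight ordinary values plus one outlier. It counts how many distinct values the ordinary weights can still take at each bit-width.

```python
def uniform_quantize(weights, bits):
    """Uniform quantization: 2**bits evenly spaced levels from min to max."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (levels - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


# Hypothetical group: eight ordinary weights and one outlier (2.5).
weights = [-0.30, -0.20, -0.10, -0.05, 0.05, 0.10, 0.20, 0.30, 2.5]

distinct_levels = {}
for bits in (8, 4, 2):
    restored = uniform_quantize(weights, bits)
    ordinary = restored[:-1]  # everything except the outlier
    distinct_levels[bits] = len({round(v, 9) for v in ordinary})
    print(f"{bits}-bit: ordinary weights occupy "
          f"{distinct_levels[bits]} distinct level(s)")
```

Because the outlier stretches the range, most of the few available levels are spent on empty space, and the ordinary weights are squeezed into whatever is left.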

Advanced methods solve this differently:

SpinQuant (Meta, 2024) uses learned rotations to eliminate outliers before quantization.
picoLLM (Picovoice) learns optimal bit allocation per weight, assigning more bits to high-importance parameters.
QuIP# (Cornell, arXiv:2402.04396, ICML 2024) uses Hadamard incoherence processing and E₈ lattice codebooks to achieve state-of-the-art 2-bit compression.
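
The rotation idea can be illustrated with a plain Hadamard transform. This is only a sketch of the underlying intuition: QuIP# applies randomized Hadamard transforms to weight matrices and SpinQuant learns its rotations, neither of which is reproduced here. The vector and function name are hypothetical.

```python
import math


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


# Hypothetical weight vector with a single large outlier.
weights = [6.0, 0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05]
rotated = hadamard_transform(weights)
recovered = hadamard_transform(rotated)  # the orthonormal transform is its own inverse

peak_before = max(abs(w) for w in weights)
peak_after = max(abs(w) for w in rotated)
print("largest magnitude before rotation:", peak_before)
print("largest magnitude after rotation:", peak_after)
```

The rotation preserves the vector's overall energy and is exactly invertible, but it spreads the outlier across all coordinates, so a quantizer no longer has to stretch its range for one value.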

Memory Savings: How Much VRAM Does Each Level Use?

The short answer: nominally, 8-bit halves weight memory, 4-bit cuts it by 75%, and 2-bit cuts it by ~87%, though at severe quality cost. Measured peak VRAM savings are somewhat smaller because of runtime overhead.

Here's the full picture for Llama 3.1 8B - the most widely benchmarked model for this comparison:

Quantization	File Size	Peak VRAM	Fits 8 GB?	Fits 16 GB?	Nominal Savings vs FP16
FP16 (baseline)	16.1 GB	~17 GB	❌	❌	-
Q8_0 (8-bit)	8.5 GB	~9.8 GB	❌	✅	~50%
Q4_K_M (4-bit)	4.9 GB	~5.7 GB	✅	✅	~75%
AWQ 4-bit	4.6 GB	~5.4 GB	✅	✅	~75%
GPTQ 4-bit	4.5 GB	~5.6 GB	✅	✅	~75%
Q2_K (2-bit)	~2.7 GB	~3 GB	✅	✅	~87%

Peak VRAM always exceeds file size because of KV cache, activations, and runtime buffers. A model that loads fine may OOM at longer context windows.
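
The arithmetic behind these numbers is simple enough to sketch. The estimate below assumes a rounded 8.0 billion parameters and Llama 3.1 8B's published attention shape (32 layers, 8 KV heads, head dimension 128); those architecture values are not from the benchmarks above, so treat them as assumptions. It ignores activations and runtime buffers.

```python
GB = 1e9


def weight_gb(n_params, bits_per_weight):
    """Nominal weight memory: parameters x bits, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / GB


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """KV cache size: keys and values (2x) for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / GB


N_PARAMS = 8.0e9  # assumed, rounded parameter count

nominal = {bits: weight_gb(N_PARAMS, bits) for bits in (16, 8, 4, 2)}
for bits, size in nominal.items():
    print(f"{bits:>2}-bit weights: {size:.1f} GB")

# Assumed architecture values; FP16 cache entries (2 bytes each).
kv_8k = kv_cache_gb(32, 8, 128, 8192)
kv_32k = kv_cache_gb(32, 8, 128, 32768)
print(f"KV cache at 8K context:  {kv_8k:.2f} GB")
print(f"KV cache at 32K context: {kv_32k:.2f} GB")
```

Real files come out larger than the nominal figure because formats also store scale factors and keep some tensors at higher precision. The KV cache grows linearly with context length, which is how a model that loads fine can still run out of memory later.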

Scaling to larger models

The same ratios apply at 70B scale:

FP16 Llama 3.1 70B: ~140 GB (requires multi-GPU)
8-bit: ~70 GB (dual A100/H100)
4-bit Q4_K_M: ~38–40 GB (feasible on dual GPU or high-memory Apple Silicon M2 Ultra)

On consumer hardware, 4-bit is the highest precision at which a 70B model is practical, and even then ~38–40 GB exceeds a single 24 GB card without offloading. (Here's the full walkthrough on fitting a 4-bit 70B model onto a single RTX 4090.)

Accuracy Trade-offs: What Do You Actually Lose?

8-bit loses almost nothing. 4-bit loses 1.8–2.9% on general benchmarks, up to 15–20% on multilingual or specialized tasks. Standard 2-bit loses ~43% of its MMLU score in relative terms - the model stops working.

Here's the full benchmark table for Llama 3.1 8B, measured on WikiText-2 perplexity, MMLU (knowledge/reasoning), and HellaSwag (commonsense reasoning):

Method	Perplexity	MMLU	HellaSwag	Delta vs FP16
FP16 (baseline)	6.14	65.2%	78.9%	-
Q8_0 (GGUF 8-bit)	6.17	65.0%	78.7%	−0.3%
AWQ 4-bit	6.38	64.0%	77.6%	−1.8%
Q4_K_M (GGUF 4-bit)	6.41	63.8%	77.4%	−2.1%
GPTQ 4-bit	6.52	63.2%	76.9%	−2.9%
2-bit GPTQ (standard)	>38,000	~37.1%	N/A	~−43%
SpinQuant W4A4KV4 (advanced)	-	65.2%	-	−4.4 pts vs its own FP16 baseline of 69.6%

Sources: SitePoint benchmarks (March 2026), Picovoice sub-4-bit analysis (March 2026), Red Hat 500K+ evaluations (October 2024)

What these numbers mean in practice

At 8-bit: The 0.3% delta is within benchmark noise. You won't notice it.

At 4-bit: A 2% drop on MMLU is invisible in chat. It shows up on complex code generation (subtle logic errors, missed edge cases) and specialized domains. Red Hat's 500K+ evaluations found W4A16-INT recovers 98.9% accuracy on HumanEval - solid for most production use.

At 4-bit for multilingual/specialized tasks: Accuracy can drop 15–20% on benchmarks like C-Eval. If you're deploying for non-English tasks or narrow domains, test explicitly before committing to 4-bit.

At 2-bit (standard): The model generates incoherent text and fails to follow instructions. MMLU drops from 65.2% to 37.1% - only about 12 points above the 25% random-chance floor on a 4-option multiple-choice test.
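
It helps to separate percentage points from relative percentages when reading these results. The sketch below uses the MMLU scores from the table above; the helper names are hypothetical, and the 25% chance level assumes 4-option multiple choice.

```python
FP16_MMLU = 65.2
RANDOM_CHANCE = 25.0  # assumes 4-option multiple choice

# MMLU scores from the benchmark table.
MMLU = {
    "Q8_0": 65.0,
    "AWQ 4-bit": 64.0,
    "Q4_K_M": 63.8,
    "GPTQ 4-bit": 63.2,
    "GPTQ 2-bit": 37.1,
}


def point_drop(score, baseline=FP16_MMLU):
    """Absolute drop in percentage points."""
    return baseline - score


def relative_drop_pct(score, baseline=FP16_MMLU):
    """Drop as a percentage of the baseline score."""
    return (baseline - score) / baseline * 100


for name, score in MMLU.items():
    print(f"{name:<11} -{point_drop(score):.1f} pts "
          f"({relative_drop_pct(score):.1f}% relative), "
          f"{score - RANDOM_CHANCE:.1f} pts above chance")
```

The 2-bit GPTQ row is the one where the two views diverge most: 28.1 points lost is about 43% of the baseline score.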

Speed: Which Quantization Is Fastest?

4-bit is faster than 8-bit, which is faster than FP16. The gains come from reduced memory bandwidth pressure, not raw compute.

LLM inference is memory-bandwidth-bound during autoregressive generation. Smaller weights mean fewer bytes transferred per token. That's the entire mechanism.

Red Hat's production benchmarks on Llama 3.1 with vLLM confirm:

W8A8-INT (8-bit): 1.8× throughput boost vs FP16 in multi-request server scenarios
W4A16-INT (4-bit): 2.4× throughput boost vs FP16 in single-stream/latency-critical scenarios

On consumer hardware (RTX 4090, Llama 3.1 8B GGUF):

Quantization	Generation Speed (RTX 4090)	Generation Speed (M2 Pro 16 GB)
Q8_0	68 tok/s	18 tok/s
Q4_K_M	105 tok/s	31 tok/s
AWQ 4-bit	98 tok/s	N/A (GPU only)
GPTQ 4-bit	94 tok/s	N/A (GPU only)

On Apple Silicon, the gap is even wider: Q4_K_M delivers ~72% higher throughput than Q8_0 on an M2 Pro. Unified memory architectures are especially bandwidth-constrained, so smaller weights help more.
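
The bandwidth argument can be sanity-checked with a back-of-the-envelope bound: if every generated token has to read all the weights once, tokens per second cannot exceed memory bandwidth divided by model size. The bandwidth figures below are assumed from published vendor specs (about 1008 GB/s for the RTX 4090, 200 GB/s for the M2 Pro), not taken from the benchmarks; file sizes and measured speeds come from the tables above.

```python
# Assumed peak memory bandwidth in GB/s (vendor specs, not benchmark data).
BANDWIDTH_GBPS = {"RTX 4090": 1008.0, "M2 Pro": 200.0}

# File sizes (GB) and measured generation speeds (tok/s) from the tables above.
FILE_SIZE_GB = {"Q8_0": 8.5, "Q4_K_M": 4.9}
MEASURED_TOK_S = {
    ("RTX 4090", "Q8_0"): 68,
    ("RTX 4090", "Q4_K_M"): 105,
    ("M2 Pro", "Q8_0"): 18,
    ("M2 Pro", "Q4_K_M"): 31,
}


def bandwidth_bound_tok_s(bandwidth_gbps, model_gb):
    """Upper bound on tokens/s if each token reads all weights once."""
    return bandwidth_gbps / model_gb


results = {}
for (device, quant), measured in MEASURED_TOK_S.items():
    bound = bandwidth_bound_tok_s(BANDWIDTH_GBPS[device], FILE_SIZE_GB[quant])
    results[(device, quant)] = (measured, bound)
    print(f"{device:<9} {quant:<7} measured {measured:>3} tok/s, "
          f"bound {bound:6.1f} tok/s")

speedup_4090 = MEASURED_TOK_S[("RTX 4090", "Q4_K_M")] / MEASURED_TOK_S[("RTX 4090", "Q8_0")]
speedup_m2 = MEASURED_TOK_S[("M2 Pro", "Q4_K_M")] / MEASURED_TOK_S[("M2 Pro", "Q8_0")]
print(f"Q4_K_M vs Q8_0 speedup: {speedup_4090:.2f}x (RTX 4090), {speedup_m2:.2f}x (M2 Pro)")
```

The bound is crude, since it ignores KV cache reads and kernel overhead, but it shows why shrinking the weights translates so directly into generation speed.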

2-bit is technically faster still - but the model is broken, so the speed gain is meaningless.

2-Bit Quantization: Is It Actually Usable in 2025?

With standard methods: no. With advanced learned quantization: yes, for specific use cases.

This is the most misunderstood part of the quantization landscape. "2-bit" doesn't mean one thing - it means very different things depending on the method.

Standard 2-bit: catastrophic failure

Standard GPTQ at 2-bit treats all weights equally. With only 4 representable values, the high-magnitude "salient" weights that dominate model behavior get rounded to the same bucket as irrelevant ones. The result:

Perplexity spikes to >38,000 (vs 6.14 at FP16)
MMLU drops from 65.2% → 37.1% (a 43% relative drop)
The model generates incoherent text and can't follow instructions

This is a quality cliff, not a gradual slope.

Advanced 2-bit: genuinely viable

Research methods from 2024 change the picture entirely:

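SpinQuant (Meta AI, 2024): Uses learned rotations to eliminate outlier weights before quantization. At W4A4KV4 (4-bit weights, activations, and KV cache), SpinQuant achieves 65.2 MMLU vs 69.6 FP16 - a 4.4-point gap. Compare that to GPTQ's 37.1 at the same setting. Note that this is a 4-bit result measured against a different FP16 baseline (69.6) than the Llama 3.1 8B table above (65.2), so read it as evidence that rotation tames outliers rather than as a direct 2-bit score. Meta has shipped SpinQuant-quantized Llama 3.2 models in production via ExecuTorch.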

picoLLM (Picovoice): Learns optimal bit allocation per weight rather than applying uniform compression. At 2-bit, picoLLM-quantized Gemma-2b maintains near-FP16 accuracy on MMLU - where GPTQ at the same level collapses to near-random.

QuIP# (Cornell, ICML 2024, arXiv:2402.04396): Uses Hadamard incoherence processing and E₈ lattice codebooks. Achieves state-of-the-art results at ≤4 bits per weight with fast inference support.

VPTQ (EMNLP 2024): Vector post-training quantization that outperforms AQLM by 11–22% on QA tasks at ~2-bit on LLaMA-3.

The bottom line on 2-bit

Use standard 2-bit only if you have no choice (sub-4 GB RAM, edge IoT). For real deployments needing sub-4-bit compression, use SpinQuant, picoLLM, or QuIP# - and benchmark on your specific task before shipping. (See our guide to sub-4-bit quantization for edge devices under 4 GB.)

Popular Quantization Methods Compared

Not all 4-bit quantization is equal. The method determines whether you're at 1.8% accuracy loss or 2.9%.

Method	Bits	Calibration?	Accuracy (4-bit MMLU)	Speed	Best For
GGUF Q4_K_M	4	No (imatrix optional)	63.8%	✅ Fast	CPU/GPU, Ollama, llama.cpp
AWQ	4	Yes (~10 min on A100)	64.0%	✅ Fastest	NVIDIA GPU, vLLM, highest 4-bit accuracy
GPTQ	2–8	Yes (~20 min on A100)	63.2%	✅ Fast	Extreme compression, broad HF support
bitsandbytes (bnb)	4/8	No	~64%	⚠️ Slower	Fine-tuning (QLoRA), rapid prototyping
GGUF Q8_0	8	No	65.0%	✅ Good	Near-lossless, CPU/GPU, Ollama
SpinQuant	2–4	Yes	~65.2% (W4A4KV4)	✅	Edge/mobile, Meta ExecuTorch
picoLLM	2–4	Yes	Near FP16 at 2-bit	✅	On-device, cross-platform

Key method notes

AWQ is the best default for 4-bit on NVIDIA GPUs. It consistently outperforms GPTQ on accuracy while being faster at runtime. Requires the AutoAWQ library or vLLM.

GGUF Q4_K_M is the best default for CPU inference and Ollama users. No calibration needed, works everywhere, and the K-quant mixed-precision system gives it an edge over naive Q4_0.

bitsandbytes is the right choice for QLoRA fine-tuning - it integrates directly into Hugging Face Transformers with load_in_4bit=True. Don't use it for production inference; it's slower than AWQ/GPTQ.

GPTQ is worth considering when you need extreme compression (3-bit, 2-bit with advanced settings) or when you need the widest HuggingFace Hub model availability.

How to Choose: 2-Bit vs 4-Bit vs 8-Bit Decision Guide

Start with your hardware constraint. Then optimize for task sensitivity. (Weighing this across a fleet of GPUs? See how bit-width decisions at enterprise scale reshape the cost math.)

Step 1: What's your VRAM?

< 8 GB VRAM: You must use 4-bit. No other option fits an 8B model with usable context.
8–16 GB VRAM: 4-bit for 8B models with headroom; 8-bit if you prioritize quality and have 16 GB.
16–32 GB VRAM: 8-bit is the quality-first choice. 4-bit if you want to run larger models.
32 GB+ VRAM: 8-bit or FP16 depending on whether you need every last point of accuracy.

Step 2: What's your task sensitivity?

Task Type	Recommended Level	Why
Chat, summarization, content creation	4-bit (Q4_K_M or AWQ)	2% accuracy drop is imperceptible
Code generation	8-bit (Q8_0)	Logic errors from 4-bit compound; boundary cases break
RAG pipelines	8-bit	Small quality drops compound across retrieval + generation
Legal / medical text	8-bit	Hallucination risk is too high at 4-bit
Multilingual / specialized domains	8-bit	4-bit can lose 15–20% on non-English benchmarks
Batch processing / throughput	4-bit	2.4× speed boost matters more than 2% accuracy
Edge / mobile deployment	4-bit or advanced 2-bit	Memory is the binding constraint

Step 3: Which format within your bit-width?

For 4-bit:

CPU or Ollama → GGUF Q4_K_M
NVIDIA GPU + vLLM → AWQ
Need extreme compression or HF Hub models → GPTQ
Fine-tuning → bitsandbytes NF4

For 8-bit:

CPU or Ollama → GGUF Q8_0
NVIDIA GPU (A100 and older) → bitsandbytes INT8 or GPTQ Q8
NVIDIA H100 → FP8 (native hardware support, 1.4–1.7× lift over INT8)

For 2-bit (only if forced by memory):

Research/edge → SpinQuant, picoLLM, or QuIP#
Never use standard GPTQ at 2-bit in production

Quick decision tree

Do you have ≥16 GB VRAM?
├── YES → Is accuracy critical (code, legal, medical, RAG)?
│         ├── YES → Use 8-bit (Q8_0 or W8A8-INT)
│         └── NO  → Use 4-bit (AWQ or Q4_K_M) for speed
└── NO  → Do you have ≥8 GB VRAM?
          ├── YES → Use 4-bit (Q4_K_M fits with headroom)
          └── NO  → Use 4-bit (only option); consider advanced 2-bit
                    only if 4-bit still doesn't fit
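
The same tree, written as a small function. This is just a transcription of the decision tree above for an 8B-class model; the function name and return strings are made up for illustration.

```python
def recommend_quantization(vram_gb, accuracy_critical):
    """Transcribes the decision tree above for an 8B-class model."""
    if vram_gb >= 16:
        if accuracy_critical:      # code, legal, medical, RAG
            return "8-bit"
        return "4-bit"             # AWQ or Q4_K_M, for speed
    if vram_gb >= 8:
        return "4-bit"             # Q4_K_M fits with headroom
    # Under 8 GB: 4-bit is the only standard option.
    return "4-bit (advanced 2-bit only if 4-bit does not fit)"


for vram, critical in [(24, True), (24, False), (12, True), (6, False)]:
    print(f"{vram} GB, accuracy critical={critical}: "
          f"{recommend_quantization(vram, critical)}")
```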

FAQ

What is LLM quantization and why does it matter?

LLM quantization compresses model weights from 16-bit floats (FP16) to lower-bit integers (8-bit, 4-bit, 2-bit), cutting VRAM requirements by 50–87% and speeding up inference by 1.8–2.4×. Without it, a standard 8B model requires ~16 GB of VRAM - out of reach for most consumer GPUs. Quantization is what makes local LLM deployment practical.

How much accuracy do you lose with 4-bit quantization?

On general benchmarks (MMLU, HellaSwag), well-implemented 4-bit quantization loses 1.8–2.9% vs FP16. Red Hat's 500,000+ evaluations found W4A16-INT recovers 98.9% accuracy on HumanEval. The caveat: multilingual and specialized tasks can see 15–20% drops at 4-bit. Always benchmark on your specific task.

Is 2-bit quantization usable in production?

Standard 2-bit (GPTQ): no. Perplexity spikes to >38,000 and MMLU drops from 65.2% to 37.1% - the model stops following instructions. Advanced sub-4-bit methods (SpinQuant, picoLLM, QuIP#): yes, for specific use cases. SpinQuant keeps the MMLU gap vs FP16 to 4.4 points at W4A4KV4. Meta ships SpinQuant-quantized Llama 3.2 models in production via ExecuTorch.

What's the difference between GPTQ, AWQ, and GGUF Q4_K_M?

All three are 4-bit quantization methods, but they work differently. GPTQ minimizes quantization error layer-by-layer using second-order optimization - requires ~20 min calibration on an A100. AWQ identifies and preserves "salient" weights that most affect output quality - requires ~10 min calibration, generally more accurate than GPTQ. GGUF Q4_K_M uses mixed-precision K-quants with no calibration required - the best choice for CPU inference and Ollama. AWQ wins on accuracy; GGUF Q4_K_M wins on portability.

Which quantization is fastest for inference?

4-bit is faster than 8-bit because LLM inference is memory-bandwidth-bound. On Llama 3.1 8B with an RTX 4090: Q4_K_M generates 105 tokens/sec vs Q8_0's 68 tokens/sec - a ~55% speed advantage. Red Hat's production data confirms: 8-bit gives a 1.8× speedup vs FP16; 4-bit gives 2.4×. On Apple Silicon (M2 Pro), the gap is even larger: Q4_K_M is ~72% faster than Q8_0.

Should I use 8-bit or 4-bit for code generation?

Use 8-bit. Code generation is highly sensitive to quantization because subtle logic errors and missed edge cases compound. At 4-bit, models more frequently omit boundary conditions and produce incorrect recursive logic. If you're on an 8 GB GPU where 8-bit doesn't fit, use AWQ 4-bit (not GPTQ) and test your specific coding tasks before deploying.

What does "perplexity" mean in quantization benchmarks?

Perplexity measures how surprised a language model is by a text sample - lower is better. A perplexity of 6.14 (FP16 Llama 3.1 8B) means the model is highly confident about the next token. A perplexity of >38,000 (2-bit GPTQ) means the model is essentially guessing randomly. It's the most sensitive early-warning signal for quantization quality degradation.
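
For the curious, perplexity is the exponential of the average negative log-probability the model assigns to each correct next token. A minimal sketch, with made-up token probabilities and a hypothetical function name:

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical probabilities a model assigned to each correct next token.
confident = [0.40, 0.10, 0.25, 0.15, 0.20]
guessing = [1 / 38000] * 5  # as if picking uniformly among 38,000 options

print("confident model:", perplexity(confident))
print("guessing model:", perplexity(guessing))
```

Read this way, a perplexity of 6.14 means the model behaves as if it were choosing among about 6 equally likely tokens at each step, while >38,000 means it is choosing among tens of thousands.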

Key Takeaways

8-bit is near-lossless. Q8_0 on Llama 3.1 8B: perplexity 6.17 vs 6.14 FP16. Use it when accuracy is non-negotiable and you have 16 GB+ VRAM.
4-bit is the practical standard. AWQ and Q4_K_M recover 98–99% of FP16 accuracy with about 75% less weight memory and up to 2.4× the throughput. This is the right choice for most deployments.
Format matters as well as bit-width. AWQ-4bit (MMLU: 64.0%) beats GPTQ-4bit (63.2%) by 0.8 points, comparable to the 1.2-point gap between Q4_K_M (63.8%) and Q8_0 (65.0%). Don't just pick a bit-width - pick the right method.
Standard 2-bit is broken. GPTQ at 2-bit: MMLU 37.1%, perplexity >38,000. Don't use it.
Advanced 2-bit works, but requires specialized tools. SpinQuant, picoLLM, and QuIP# recover much of the accuracy that standard 2-bit loses. Use them for edge/mobile deployments where 4-bit still doesn't fit.
Multilingual and specialized tasks are more sensitive. 4-bit can lose 15–20% on non-English benchmarks. Test before you ship.
Red Hat's 500K+ evaluations confirm the pattern. W8A8-INT: 99%+ accuracy recovery. W4A16-INT: 98.9% on HumanEval. The numbers hold at scale.

Useful Sources

SitePoint - Quantized Local LLMs: 4-bit vs 8-bit Performance Analysis (March 2026): Full benchmark methodology and Llama 3.1 8B results across GGUF, AWQ, GPTQ, EXL2 formats. sitepoint.com/quantized-local-llms-4bit-vs-8bit-analysis
Red Hat Developer - We Ran Over Half a Million Evaluations on Quantized LLMs (October 2024): Production-scale accuracy recovery data for W8A8-INT, W8A8-FP, and W4A16-INT on Llama 3.1 8B/70B/405B. developers.redhat.com/articles/2024/10/17/we-ran-over-half-million-evaluations-quantized-llms
Picovoice - Sub-4-Bit LLM Quantization: Enterprise Guide (March 2026): SpinQuant vs GPTQ comparison, picoLLM X-bit allocation, and 2-bit usability analysis. picovoice.ai/blog/sub-4-bit-llm-quantization
arXiv:2402.04396 - QuIP#: Even Better LLM Quantization (ICML 2024): Hadamard incoherence + E₈ lattice codebooks for state-of-the-art ≤4-bit compression. arxiv.org/abs/2402.04396
Hugging Face - Selecting a Quantization Method (2025): Official guide to bitsandbytes, AWQ, GPTQ, HQQ, torchao with production recommendations. huggingface.co/docs/transformers/en/quantization/selecting
DigitalApplied - Quantization Tradeoffs: 4-bit vs 8-bit vs FP8 (April 2026): Cross-model regression data across 6 frontier 70B models, throughput lift tables, and workload decision matrix. digitalapplied.com/blog/quantization-tradeoffs-4bit-8bit-fp8-performance-data

What quantization level are you running in production - and what pushed you toward that choice? Drop it in the comments. Real-world deployment stories are more useful than benchmarks, and we read every one.