LLM Quantization: 2-Bit vs 4-Bit vs 8-Bit Accuracy & Memory
How accuracy degrades and memory shrinks as you drop from 8-bit to 4-bit to 2-bit quantization, and how to find the sweet spot for your model and hardware.
Mohammed Kafeel
Machine Learning Researcher
Quick answer: Memory scales linearly with bit-width — 8-bit halves FP16, 4-bit quarters it, 2-bit cuts it to an eighth — but accuracy does not degrade linearly. 8-bit is essentially lossless (typically <1% degradation). 4-bit is the sweet spot: a small, usually-acceptable loss (often ~1–3% on benchmarks with good methods) for 4× memory savings. Below 4-bit the curve bends sharply — 3-bit shows clear degradation, and 2-bit falls off a cliff with naive methods, only becoming usable through specialized algorithms (QuIP#, AQLM) that add significant complexity. The key strategic insight: because the accuracy curve is non-linear, spending your memory budget on a larger model at 4-bit usually beats a smaller model at 8-bit. Match the bit-width to your memory ceiling, then validate accuracy on your actual task — the loss at each level is model- and task-dependent.
The memory side: simple and linear
Memory is the easy half — it's deterministic arithmetic. Each weight costs bits / 8 bytes, so memory scales directly with bit-width:
| Precision | Bits | Bytes/param | Memory vs FP16 | 7B weights | 70B weights |
|---|---|---|---|---|---|
| FP16/BF16 | 16 | 2.0 | 1× (baseline) | ~14 GB | ~140 GB |
| INT8 | 8 | 1.0 | 0.5× | ~7 GB | ~70 GB |
| INT4 | 4 | 0.5 | 0.25× | ~3.5 GB | ~35 GB |
| INT3 | 3 | ~0.375 | ~0.19× | ~2.6 GB | ~26 GB |
| INT2 | 2 | 0.25 | 0.125× | ~1.75 GB | ~17.5 GB |
These are weight-only figures. Add ~20–40% for the KV cache and activations during inference. Real quantized files also carry small overhead for scales/zero-points (more for smaller group sizes), so an effective "4-bit" file is often ~4.5 bits/weight.
The takeaway: each step down roughly halves (8→4) or further shrinks memory. If memory scaled the whole story, you'd always go as low as possible. It's the accuracy curve that stops you.
The accuracy side: non-linear, with a cliff
Accuracy degradation is not proportional to memory savings. The loss is negligible at 8-bit, modest at 4-bit, and then accelerates sharply below 4-bit. The qualitative picture, consistent across the quantization literature:
| Precision | Typical accuracy impact (good method) | Status |
|---|---|---|
| 8-bit | Negligible — within ~1% of FP16 | Essentially lossless |
| 4-bit | Small — often ~1–3% on benchmarks | The practical sweet spot |
| 3-bit | Noticeable — several percent, task-dependent | Usable with care, quality clearly drops |
| 2-bit | Severe with naive methods; large even with SOTA | Needs specialized algorithms; complex |
Provenance note: the percentages above are representative ranges from the quantization literature (e.g., the GPTQ and AWQ papers report WikiText perplexity and zero-shot accuracy across bit-widths), not freshly measured numbers. Exact values depend heavily on the model, the method, the calibration data, and the benchmark — always measure on your own model and task. Treat these as the shape of the curve, not precise guarantees.
Why the curve bends
A weight quantized to b bits can take only 2^b distinct values: 256 levels at 8-bit, 16 at 4-bit, 8 at 3-bit, just 4 at 2-bit. Each halving of bits roughly squares the coarseness of the grid. At 8-bit the grid is fine enough that rounding error is lost in the noise. At 4-bit it's coarse but still captures the weight distribution's structure with good block-wise scaling. At 2-bit — only four possible values per weight — there simply isn't enough resolution to represent the distribution, and naive methods collapse. This is why the damage accelerates rather than accumulating linearly.
The Pareto view: memory budget vs capability
The non-linear accuracy curve creates the most important practical result in low-bit quantization:
For a fixed memory budget, a larger model at lower precision usually beats a smaller model at higher precision.
Because 8-bit barely helps accuracy over 4-bit (you "waste" half your memory on precision the model doesn't need), that memory is better spent on more parameters. Some illustrative same-budget comparisons:
| Memory budget | Option A | Option B | Usual winner |
|---|---|---|---|
| ~7 GB | 13B @ 4-bit (~6.5 GB) | 7B @ 8-bit (~7 GB) | 13B @ 4-bit |
| ~14 GB | 30B @ 4-bit (~15 GB, tight) | 13B @ 8-bit (~13 GB) | 30B @ 4-bit (often) |
| ~35 GB | 70B @ 4-bit (~35 GB) | 34B @ 8-bit (~34 GB) | 70B @ 4-bit |
The extra capacity of the bigger model typically outweighs its lower per-weight precision — up to a point. The pattern holds down to ~4-bit and then reverses below it: a model so aggressively quantized that it's badly damaged (2-bit, naive) loses more than the extra parameters give back. The sweet spot is "the largest model you can fit at ~4-bit," not "the largest model you can fit at any bit-width."
Bit-width by bit-width
8-bit — the safe, near-lossless option
- Use when: you have the memory and want maximum quality with meaningful savings, or for precision-sensitive tasks (some math, code, complex reasoning).
- Accuracy: within ~1% of FP16 — usually indistinguishable in practice.
- Trade-off: only 2× savings; often "leaves memory on the table" that a larger 4-bit model would use better.
4-bit — the sweet spot
- Use when: you want the best balance of size and quality — the default for running large models on consumer or cost-constrained hardware.
- Accuracy: small loss with quality methods (AWQ, GPTQ, high GGUF K-quants); larger on reasoning/code, so validate.
- Trade-off: 4× savings for a usually-acceptable quality dip. This is where most production low-bit deployment lives.
3-bit — the cautious frontier
- Use when: 4-bit still doesn't fit and you've confirmed the model tolerates it on your task.
- Accuracy: clearly degraded versus 4-bit; quality methods help but can't fully close the gap.
- Trade-off: modest extra savings over 4-bit (~25%) for a disproportionate quality hit — often not worth it.
2-bit — specialized territory
- Use when: memory is extremely constrained and you can adopt advanced methods.
- Accuracy: naive 2-bit is unusable; specialized algorithms (QuIP#, AQLM) make 2-bit viable but with significant loss and added complexity, and they work better on larger models.
- Trade-off: the most aggressive savings (8× vs FP16), but the steepest quality cost and the hardest tooling. Reserve for when nothing else fits.
Model size changes the calculus
A critical modifier: larger models tolerate low-bit quantization better. A 70B model at 4-bit loses proportionally less than a 7B at 4-bit, and 2-bit only becomes plausible at all on large models. The intuition: bigger models have more redundancy, so coarse rounding destroys a smaller fraction of their representational capacity.
Practical consequences:
- On small models (≤8B), be conservative — 4-bit is fine, but 3-bit and below degrade fast.
- On large models (≥70B), you can push harder — aggressive 3-bit and specialized 2-bit are more realistic.
- This reinforces the Pareto rule: large-model-low-bit is favored not just for capability but because large models quantize more gracefully.
Note: newer, heavily-trained models (e.g., Llama 3) buck this slightly — they're more quantization-sensitive than earlier models of the same size because their weights are more information-dense. Size helps, but training-token density hurts; measure rather than assume.
How to choose: a decision flow
- Compute your memory ceiling — VRAM minus ~20–40% for KV cache/activations.
- Fit the largest model you can at 4-bit. This is the default starting point.
- If it fits with room to spare → consider 8-bit (more quality) or a bigger model at 4-bit (more capability — usually better).
- If 4-bit doesn't fit → try a smaller model at 4-bit before dropping to 3-bit on the bigger one; measure both.
- If nothing fits even at 4-bit → 3-bit on a large model, or specialized 2-bit (QuIP#/AQLM) as a last resort.
- Always validate the chosen configuration on your real task — perplexity for a quick check, task accuracy for the decision.
Frequently asked questions
How much accuracy do you lose at 2-bit, 4-bit, and 8-bit? 8-bit is essentially lossless — typically within about 1% of FP16. 4-bit loses a small amount with quality methods, often around 1–3% on benchmarks (more on reasoning and code). 2-bit is severe: naive methods are unusable, and even specialized algorithms (QuIP#, AQLM) retain meaningful loss. These are representative ranges from the quantization literature, not guarantees — accuracy depends on the model, method, calibration, and task, so you should measure your own.
Why is memory savings linear but accuracy loss isn't?
Memory is just bits/8 bytes per weight, so it scales linearly with bit-width. Accuracy depends on how many distinct values a weight can take — 2^b — which means each halving of bits roughly squares the grid's coarseness. At 8-bit (256 levels) rounding error is negligible; at 4-bit (16 levels) it's manageable with good scaling; at 2-bit (4 levels) there isn't enough resolution to represent the weight distribution, so quality collapses. The damage accelerates rather than scaling linearly.
Is 4-bit really the sweet spot? For most use cases, yes. It delivers 4× memory savings over FP16 with a small, usually-acceptable accuracy loss when paired with a quality method (AWQ, GPTQ, high GGUF K-quants). 8-bit is safer but only saves 2×, and below 4-bit the accuracy cost rises disproportionately to the extra memory saved. Most production low-bit deployment targets ~4-bit for this reason.
Should I run a bigger model at 4-bit or a smaller one at 8-bit? Usually the bigger model at 4-bit, for the same memory budget. Because 8-bit barely improves accuracy over 4-bit, the memory is better spent on more parameters, and the extra capacity typically outweighs the lower per-weight precision. This holds down to about 4-bit and reverses below it — a model so aggressively quantized that it's badly damaged loses more than the added parameters return. Aim for the largest model you can fit at ~4-bit.
Does model size affect how low I can quantize? Yes. Larger models tolerate low-bit quantization better because they have more redundancy, so coarse rounding destroys a smaller fraction of their capacity. A 70B can be pushed to aggressive 3-bit or specialized 2-bit more safely than an 8B. One caveat: newer, heavily-trained models like Llama 3 are more quantization-sensitive than older models of the same size, so size isn't the only factor — always validate.
When is 2-bit quantization actually worth it? Only when memory is extremely constrained and you can adopt specialized methods like QuIP# or AQLM, and ideally on a large model where 2-bit degrades more gracefully. Naive 2-bit is unusable. Even with state-of-the-art methods you accept significant accuracy loss and added tooling complexity, so 2-bit is a last resort for fitting a model that otherwise wouldn't run at all.
Key takeaways
- Memory scales linearly with bit-width (8-bit = ½, 4-bit = ¼, 2-bit = ⅛ of FP16); accuracy does not — it bends sharply below 4-bit.
- 8-bit ≈ lossless (<1%), 4-bit = the sweet spot (small loss, 4× savings), 3-bit = clear degradation, 2-bit = a cliff that needs specialized methods (QuIP#, AQLM).
- The bend happens because a weight has only
2^blevels — halving bits squares the coarseness. - Pareto rule: for a fixed memory budget, a larger model at ~4-bit usually beats a smaller model at 8-bit — but this reverses below 4-bit.
- Larger models tolerate low-bit better; newer heavily-trained models (Llama 3) are more sensitive — size helps, density hurts.
- The accuracy numbers are model- and task-dependent — treat published ranges as the curve's shape and always validate your chosen config on your real task.
References
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (perplexity/accuracy across bit-widths). https://arxiv.org/abs/2210.17323
- Lin, J., Tang, J., Tang, H., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024. https://arxiv.org/abs/2306.00978
- Tseng, A., Chee, J., Sun, Q., et al. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks (2-bit). https://arxiv.org/abs/2402.04396
- Egiazarian, V., Panferov, A., Kuznedelev, D., et al. (2024). AQLM: Extreme Compression of LLMs via Additive Quantization (2-bit). https://arxiv.org/abs/2401.06118
- Dettmers, T., & Zettlemoyer, L. (2022). The case for 4-bit precision: k-bit Inference Scaling Laws. https://arxiv.org/abs/2212.09720
- EleutherAI. Language Model Evaluation Harness. https://github.com/EleutherAI/lm-evaluation-harness
Keep reading
LLM Quantization Explained: INT4 vs INT8 vs FP16
A beginner's guide to LLM quantization: how INT4, INT8, and FP16 trade memory for quality, and the rule of thumb for sizing a model to your GPU.
Run 70B Models on a Single RTX 4090 With 4-Bit Quantization
A how-to for fitting a 70B-parameter model onto one 24 GB RTX 4090 using aggressive 4-bit and 2-bit quantization — what works, what breaks, and the accuracy cost.
SmoothQuant: What Activation-Aware Quantization Fixes
Why naive INT8 breaks on large models, and how SmoothQuant and activation-aware methods like AWQ recover near-FP16 accuracy.