LLM Quantization: 2-Bit vs 4-Bit vs 8-Bit Accuracy & Memory

How accuracy degrades and memory shrinks as you drop from 8-bit to 4-bit to 2-bit quantization, and how to find the sweet spot for your model and hardware.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 14 min read

Quick answer: Memory scales linearly with bit-width — 8-bit halves FP16, 4-bit quarters it, 2-bit cuts it to an eighth — but accuracy does not degrade linearly. 8-bit is essentially lossless (typically <1% degradation). 4-bit is the sweet spot: a small, usually-acceptable loss (often ~1–3% on benchmarks with good methods) for 4× memory savings. Below 4-bit the curve bends sharply — 3-bit shows clear degradation, and 2-bit falls off a cliff with naive methods, only becoming usable through specialized algorithms (QuIP#, AQLM) that add significant complexity. The key strategic insight: because the accuracy curve is non-linear, spending your memory budget on a larger model at 4-bit usually beats a smaller model at 8-bit. Match the bit-width to your memory ceiling, then validate accuracy on your actual task — the loss at each level is model- and task-dependent.


The memory side: simple and linear

Memory is the easy half — it's deterministic arithmetic. Each weight costs bits / 8 bytes, so memory scales directly with bit-width:

| Precision | Bits | Bytes/param | Memory vs FP16 | 7B weights | 70B weights |
|---|---|---|---|---|---|
| FP16/BF16 | 16 | 2.0 | 1× (baseline) | ~14 GB | ~140 GB |
| INT8 | 8 | 1.0 | 0.5× | ~7 GB | ~70 GB |
| INT4 | 4 | 0.5 | 0.25× | ~3.5 GB | ~35 GB |
| INT3 | 3 | ~0.375 | ~0.19× | ~2.6 GB | ~26 GB |
| INT2 | 2 | 0.25 | 0.125× | ~1.75 GB | ~17.5 GB |
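
The table's arithmetic can be reproduced in a few lines. This is a minimal sketch: `weight_memory_gb` is a hypothetical helper name, it assumes 1 GB = 10^9 bytes, and it counts weights only.

```python
def weight_memory_gb(n_params: float, bits: float) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes): params * bits / 8 bytes."""
    return n_params * bits / 8 / 1e9

# Reproduce the 7B and 70B columns for each precision in the table.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT3", 3), ("INT2", 2)]:
    print(f"{name:5s} 7B: {weight_memory_gb(7e9, bits):6.2f} GB   "
          f"70B: {weight_memory_gb(70e9, bits):7.2f} GB")
```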

These are weight-only figures. Add ~20–40% for the KV cache and activations during inference. Real quantized files also carry small overhead for scales/zero-points (more for smaller group sizes), so an effective "4-bit" file is often ~4.5 bits/weight.
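
The scale/zero-point overhead can be estimated by amortizing the per-group metadata over the group. The sketch below assumes one 16-bit scale per group and an optional zero-point; `effective_bits` and its parameters are illustrative names, and real file formats store metadata in different ways, so treat the result as a rough estimate.

```python
def effective_bits(weight_bits: int, group_size: int,
                   scale_bits: int = 16, zero_point_bits: int = 0) -> float:
    """Bits per weight once per-group metadata is spread across the group."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

# Assumed layouts, for illustration only:
print(effective_bits(4, 32))                      # 4.5     (16-bit scale per 32 weights)
print(effective_bits(4, 128))                     # 4.125   (16-bit scale per 128 weights)
print(effective_bits(4, 128, zero_point_bits=4))  # 4.15625 (scale plus 4-bit zero-point)
```

Smaller groups track the weight distribution more closely but cost more bits per weight, which is the trade-off behind the ~4.5 bits/weight figure.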

The takeaway: memory shrinks in direct proportion to bit-width, so each halving of bits (16→8, 8→4, 4→2) halves the weight footprint. If memory were the whole story, you'd always go as low as possible. It's the accuracy curve that stops you.


The accuracy side: non-linear, with a cliff

Accuracy degradation is not proportional to memory savings. The loss is negligible at 8-bit, modest at 4-bit, and then accelerates sharply below 4-bit. The qualitative picture, consistent across the quantization literature:

| Precision | Typical accuracy impact (good method) | Status |
|---|---|---|
| 8-bit | Negligible: within ~1% of FP16 | Essentially lossless |
| 4-bit | Small: often ~1–3% on benchmarks | The practical sweet spot |
| 3-bit | Noticeable: several percent, task-dependent | Usable with care, quality clearly drops |
| 2-bit | Severe with naive methods; large even with SOTA | Needs specialized algorithms; complex |

Provenance note: the percentages above are representative ranges from the quantization literature (e.g., the GPTQ and AWQ papers report WikiText perplexity and zero-shot accuracy across bit-widths), not freshly measured numbers. Exact values depend heavily on the model, the method, the calibration data, and the benchmark — always measure on your own model and task. Treat these as the shape of the curve, not precise guarantees.
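
For the perplexity side of that measurement, a minimal sketch follows. It assumes you already have per-token natural-log probabilities from your own evaluation run for both the FP16 and the quantized model. The function names are hypothetical and the numbers in the example are placeholders, not measurements.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_change(baseline, quantized):
    """Fractional change versus the baseline; positive means perplexity got worse."""
    return (quantized - baseline) / baseline

# Placeholder log-probs, NOT measurements: substitute values from your own eval.
fp16_logprobs = [-1.9, -2.1, -2.0, -2.0]
int4_logprobs = [-2.0, -2.2, -2.0, -2.1]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_int4 = perplexity(int4_logprobs)
print(f"FP16 ppl {ppl_fp16:.2f}, 4-bit ppl {ppl_int4:.2f}, "
      f"change {relative_change(ppl_fp16, ppl_int4):+.1%}")
```

Perplexity is only a quick check; a small perplexity change can still hide a larger drop on a specific downstream task.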

Why the curve bends

A weight quantized to b bits can take only 2^b distinct values: 256 levels at 8-bit, 16 at 4-bit, 8 at 3-bit, just 4 at 2-bit. Each bit removed halves the number of levels and doubles the grid spacing, so rounding error grows exponentially while memory shrinks only linearly. At 8-bit the grid is fine enough that rounding error is lost in the noise. At 4-bit it's coarse but still captures the weight distribution's structure with good block-wise scaling. At 2-bit, with only four possible values per weight, there simply isn't enough resolution to represent the distribution, and naive methods collapse. This is why the damage accelerates rather than accumulating linearly.
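
A toy experiment makes the shape visible. The sketch below applies naive per-tensor round-to-nearest quantization to synthetic Gaussian "weights" and reports the relative squared error at each bit-width. It is an illustration under those assumptions only: it is not GPTQ, AWQ, or any block-wise scheme, and weight error is a proxy for, not a measure of, task accuracy.

```python
import random

def quantize_rtn(weights, bits):
    """Round each weight to the nearest of 2**bits evenly spaced levels
    spanning [-max|w|, +max|w|] (one scale for the whole tensor)."""
    levels = 2 ** bits
    w_max = max(abs(w) for w in weights)
    step = 2 * w_max / (levels - 1)
    return [round((w + w_max) / step) * step - w_max for w in weights]

def relative_mse(weights, bits):
    """Squared quantization error divided by the squared norm of the weights."""
    quantized = quantize_rtn(weights, bits)
    err = sum((w - q) ** 2 for w, q in zip(weights, quantized))
    power = sum(w * w for w in weights)
    return err / power

random.seed(0)  # synthetic data, fixed seed for repeatability
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit: {2 ** bits:3d} levels, "
          f"relative MSE = {relative_mse(weights, bits):.2e}")
```

The error should rise by orders of magnitude between 8-bit and 2-bit while memory falls only 4×. Methods with per-group scales or error compensation reduce the error at a given bit-width but do not change the overall direction.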


The Pareto view: memory budget vs capability

The non-linear accuracy curve creates the most important practical result in low-bit quantization:

For a fixed memory budget, a larger model at lower precision usually beats a smaller model at higher precision.

Because 8-bit barely helps accuracy over 4-bit (you "waste" half your memory on precision the model doesn't need), that memory is better spent on more parameters. Some illustrative same-budget comparisons:

| Memory budget | Option A | Option B | Usual winner |
|---|---|---|---|
| ~7 GB | 13B @ 4-bit (~6.5 GB) | 7B @ 8-bit (~7 GB) | 13B @ 4-bit |
| ~14 GB | 30B @ 4-bit (~15 GB, tight) | 13B @ 8-bit (~13 GB) | 30B @ 4-bit (often) |
| ~35 GB | 70B @ 4-bit (~35 GB) | 34B @ 8-bit (~34 GB) | 70B @ 4-bit |

The extra capacity of the bigger model typically outweighs its lower per-weight precision — up to a point. The pattern holds down to ~4-bit and then reverses below it: a model so aggressively quantized that it's badly damaged (2-bit, naive) loses more than the extra parameters give back. The sweet spot is "the largest model you can fit at ~4-bit," not "the largest model you can fit at any bit-width."
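
Same-budget comparisons like these can be enumerated mechanically. The sketch below lists every (model size, bit-width) pair whose weights fit a budget, largest model first. The model sizes are taken from the comparisons above, the function names are hypothetical, and the ordering only mirrors the rule of thumb; it says nothing about measured quality.

```python
def weight_gb(params_b: float, bits: float) -> float:
    """Weight-only size in GB for a model with params_b billion parameters."""
    return params_b * bits / 8

def candidates(budget_gb, model_sizes_b=(7, 13, 30, 34, 70), bit_widths=(8, 4)):
    """All (params_b, bits, gb) combinations that fit, largest model first."""
    fits = [(p, b, weight_gb(p, b))
            for p in model_sizes_b
            for b in bit_widths
            if weight_gb(p, b) <= budget_gb]
    # Sort by model size (descending), then by precision (descending).
    return sorted(fits, key=lambda t: (-t[0], -t[1]))

for params_b, bits, gb in candidates(7.0):
    print(f"{params_b}B @ {bits}-bit: {gb:.1f} GB")
```

Restricting `bit_widths` to 8 and 4 reflects the point above that the rule stops holding below ~4-bit.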


Bit-width by bit-width

8-bit — the safe, near-lossless option

  • Use when: you have the memory and want maximum quality with meaningful savings, or for precision-sensitive tasks (some math, code, complex reasoning).
  • Accuracy: within ~1% of FP16 — usually indistinguishable in practice.
  • Trade-off: only 2× savings; often "leaves memory on the table" that a larger 4-bit model would use better.

4-bit — the sweet spot

  • Use when: you want the best balance of size and quality — the default for running large models on consumer or cost-constrained hardware.
  • Accuracy: small loss with quality methods (AWQ, GPTQ, high GGUF K-quants); larger on reasoning/code, so validate.
  • Trade-off: 4× savings for a usually-acceptable quality dip. This is where most production low-bit deployment lives.

3-bit — the cautious frontier

  • Use when: 4-bit still doesn't fit and you've confirmed the model tolerates it on your task.
  • Accuracy: clearly degraded versus 4-bit; quality methods help but can't fully close the gap.
  • Trade-off: modest extra savings over 4-bit (~25%) for a disproportionate quality hit — often not worth it.

2-bit — specialized territory

  • Use when: memory is extremely constrained and you can adopt advanced methods.
  • Accuracy: naive 2-bit is unusable; specialized algorithms (QuIP#, AQLM) make 2-bit viable but with significant loss and added complexity, and they work better on larger models.
  • Trade-off: the most aggressive savings (8× vs FP16), but the steepest quality cost and the hardest tooling. Reserve for when nothing else fits.

Model size changes the calculus

A critical modifier: larger models tolerate low-bit quantization better. A 70B model at 4-bit loses proportionally less than a 7B at 4-bit, and 2-bit only becomes plausible at all on large models. The intuition: bigger models have more redundancy, so coarse rounding destroys a smaller fraction of their representational capacity.

Practical consequences:

  • On small models (≤8B), be conservative — 4-bit is fine, but 3-bit and below degrade fast.
  • On large models (≥70B), you can push harder — aggressive 3-bit and specialized 2-bit are more realistic.
  • This reinforces the Pareto rule: large-model-low-bit is favored not just for capability but because large models quantize more gracefully.

Note: newer, heavily-trained models (e.g., Llama 3) buck this slightly — they're more quantization-sensitive than earlier models of the same size because their weights are more information-dense. Size helps, but training-token density hurts; measure rather than assume.


How to choose: a decision flow

  1. Compute your memory ceiling — VRAM minus ~20–40% for KV cache/activations.
  2. Fit the largest model you can at 4-bit. This is the default starting point.
  3. If it fits with room to spare → consider 8-bit (more quality) or a bigger model at 4-bit (more capability — usually better).
  4. If 4-bit doesn't fit → try a smaller model at 4-bit before dropping to 3-bit on the bigger one; measure both.
  5. If nothing fits even at 4-bit → 3-bit on a large model, or specialized 2-bit (QuIP#/AQLM) as a last resort.
  6. Always validate the chosen configuration on your real task — perplexity for a quick check, task accuracy for the decision.
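
The steps above can be sketched as a small helper. This is a heuristic under stated assumptions, not a definitive procedure: `choose_config` and `reserve` are hypothetical names, the 30% default reserve sits in the middle of the ~20–40% range, sizes are weight-only, and the result is a starting candidate that still needs the validation in step 6.

```python
def weights_gb(params_b: float, bits: int) -> float:
    """Weight-only size in GB for a model with params_b billion parameters."""
    return params_b * bits / 8

def choose_config(vram_gb, model_sizes_b, reserve=0.3):
    """Return (params_b, bits, note) following the decision flow, or None."""
    budget = vram_gb * (1 - reserve)            # step 1: memory ceiling
    sizes = sorted(model_sizes_b, reverse=True)

    # Steps 2 and 4: largest model that fits at 4-bit (smaller models are
    # tried at 4-bit before any model is dropped to 3-bit).
    for p in sizes:
        if weights_gb(p, 4) <= budget:
            # Step 3: room to spare and no larger 4-bit option in the list.
            if weights_gb(p, 8) <= budget:
                return (p, 8, "8-bit fits; compare against 4-bit on your task")
            return (p, 4, "default: largest model at 4-bit")

    # Step 5: nothing fits at 4-bit, so fall back to 3-bit, then 2-bit.
    for bits in (3, 2):
        for p in sizes:
            if weights_gb(p, bits) <= budget:
                return (p, bits, "last resort: needs a specialized method")
    return None

print(choose_config(24, [7, 13, 34, 70]))
print(choose_config(12, [7, 13, 34, 70]))
print(choose_config(4, [7, 13]))
```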

Frequently asked questions

How much accuracy do you lose at 2-bit, 4-bit, and 8-bit?

8-bit is essentially lossless, typically within about 1% of FP16. 4-bit loses a small amount with quality methods, often around 1–3% on benchmarks (more on reasoning and code). 2-bit is severe: naive methods are unusable, and even specialized algorithms (QuIP#, AQLM) retain meaningful loss. These are representative ranges from the quantization literature, not guarantees. Accuracy depends on the model, method, calibration, and task, so you should measure your own.

Why are memory savings linear when accuracy loss isn't?

Memory is just bits/8 bytes per weight, so it scales linearly with bit-width. Accuracy depends on how many distinct values a weight can take, 2^b, so each bit removed halves the number of levels and doubles the grid spacing. At 8-bit (256 levels) rounding error is negligible; at 4-bit (16 levels) it's manageable with good scaling; at 2-bit (4 levels) there isn't enough resolution to represent the weight distribution, so quality collapses. The damage accelerates rather than scaling linearly.

Is 4-bit really the sweet spot?

For most use cases, yes. It delivers 4× memory savings over FP16 with a small, usually-acceptable accuracy loss when paired with a quality method (AWQ, GPTQ, high GGUF K-quants). 8-bit is safer but only saves 2×, and below 4-bit the accuracy cost rises disproportionately to the extra memory saved. Most production low-bit deployment targets ~4-bit for this reason.

Should I run a bigger model at 4-bit or a smaller one at 8-bit?

Usually the bigger model at 4-bit, for the same memory budget. Because 8-bit barely improves accuracy over 4-bit, the memory is better spent on more parameters, and the extra capacity typically outweighs the lower per-weight precision. This holds down to about 4-bit and reverses below it: a model so aggressively quantized that it's badly damaged loses more than the added parameters return. Aim for the largest model you can fit at ~4-bit.

Does model size affect how low I can quantize?

Yes. Larger models tolerate low-bit quantization better because they have more redundancy, so coarse rounding destroys a smaller fraction of their capacity. A 70B can be pushed to aggressive 3-bit or specialized 2-bit more safely than an 8B. One caveat: newer, heavily-trained models like Llama 3 are more quantization-sensitive than older models of the same size, so size isn't the only factor. Always validate.

When is 2-bit quantization actually worth it?

Only when memory is extremely constrained and you can adopt specialized methods like QuIP# or AQLM, and ideally on a large model where 2-bit degrades more gracefully. Naive 2-bit is unusable. Even with state-of-the-art methods you accept significant accuracy loss and added tooling complexity, so 2-bit is a last resort for fitting a model that otherwise wouldn't run at all.


Key takeaways

  • Memory scales linearly with bit-width (8-bit = ½, 4-bit = ¼, 2-bit = ⅛ of FP16); accuracy does not — it bends sharply below 4-bit.
  • 8-bit ≈ lossless (<1%), 4-bit = the sweet spot (small loss, 4× savings), 3-bit = clear degradation, 2-bit = a cliff that needs specialized methods (QuIP#, AQLM).
  • The bend happens because a weight has only 2^b levels: each bit removed halves the level count and doubles the grid spacing.
  • Pareto rule: for a fixed memory budget, a larger model at ~4-bit usually beats a smaller model at 8-bit — but this reverses below 4-bit.
  • Larger models tolerate low-bit better; newer heavily-trained models (Llama 3) are more sensitive — size helps, density hurts.
  • The accuracy numbers are model- and task-dependent — treat published ranges as the curve's shape and always validate your chosen config on your real task.

References

  1. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (perplexity/accuracy across bit-widths). https://arxiv.org/abs/2210.17323
  2. Lin, J., Tang, J., Tang, H., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024. https://arxiv.org/abs/2306.00978
  3. Tseng, A., Chee, J., Sun, Q., et al. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks (2-bit). https://arxiv.org/abs/2402.04396
  4. Egiazarian, V., Panferov, A., Kuznedelev, D., et al. (2024). AQLM: Extreme Compression of LLMs via Additive Quantization (2-bit). https://arxiv.org/abs/2401.06118
  5. Dettmers, T., & Zettlemoyer, L. (2022). The case for 4-bit precision: k-bit Inference Scaling Laws. https://arxiv.org/abs/2212.09720
  6. EleutherAI. Language Model Evaluation Harness. https://github.com/EleutherAI/lm-evaluation-harness