LLM Quantization Explained: INT4 vs INT8 vs FP16
A beginner's guide to LLM quantization: how INT4, INT8, and FP16 trade memory for quality, and the rule of thumb for sizing a model to your GPU.
Mohammed Kafeel
Machine Learning Researcher
Quick answer: Quantization shrinks a large language model by storing its weights in lower-precision numbers. FP16 (16-bit) is the standard full-quality format. INT8 (8-bit) halves the memory with almost no quality loss. INT4 (4-bit) quarters the memory and lets you run large models on consumer GPUs, with a small, usually acceptable quality drop. The rule of thumb: each weight in FP16 takes 2 bytes, INT8 takes 1 byte, INT4 takes 0.5 bytes — so a 7-billion-parameter model needs ~14 GB in FP16, ~7 GB in INT8, or ~3.5 GB in INT4.
What is quantization in machine learning?
Quantization is the process of converting a model's numbers from a high-precision format to a lower-precision one, so the model takes less memory and runs faster — while trying to keep its answers as close to the original as possible.
A trained LLM is, at its core, billions of numbers called weights. By default these are stored as 16-bit or 32-bit floating-point values, each capable of representing a huge range with fine detail. The insight behind quantization is that you usually don't need all that detail. Most weights cluster in a narrow range, and rounding them to a coarser grid of values changes the model's output far less than you'd expect.
Think of it like image compression. A RAW photo stores enormous color detail; a JPEG throws away detail your eye can't notice and ends up 10× smaller. Quantization does the same to a neural network: discard precision the model barely uses, keep the file small enough to actually run.
Why precision is measured in bits
Every number a computer stores uses a fixed number of bits (binary digits). More bits means more distinct values it can represent, and therefore more precision — but also more memory.
| Format | Bits | Bytes per weight | What it stores |
|---|---|---|---|
| FP32 | 32 | 4 | Full-precision float — original training format |
| FP16 | 16 | 2 | Half-precision float — modern inference standard |
| BF16 | 16 | 2 | "Brain float": same exponent range as FP32, fewer mantissa bits than FP16 |
| INT8 | 8 | 1 | 8-bit integer — 256 possible values |
| INT4 | 4 | 0.5 | 4-bit integer — only 16 possible values |
The pattern is simple: halve the bits, halve the memory. The whole game of quantization is figuring out how to drop bits without dropping quality.
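The rule of thumb is simple enough to sketch as a few lines of Python. The function name `weight_memory_gb` is made up for this illustration, and 1 GB is taken as 10^9 bytes, so the figures are approximations for the weights only.

```python
# Bytes per weight for each format, copied from the table above.
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billions: float, fmt: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 10**9 bytes)."""
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * BYTES_PER_WEIGHT[fmt]


for fmt in ("FP16", "INT8", "INT4"):
    print(f"7B at {fmt}: ~{weight_memory_gb(7, fmt):g} GB")
# 7B at FP16: ~14 GB
# 7B at INT8: ~7 GB
# 7B at INT4: ~3.5 GB
```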
How does quantization actually work? (The core idea)
You can't just "round" a floating-point weight to an integer directly — you'd lose the scale. A weight of 0.0023 and a weight of 0.0019 would both round to 0, becoming identical. The trick is scaling.
Step by step
- Find the range. Look at a group of weights and find the maximum absolute value — say the weights range from −0.8 to +0.8.
- Compute a scale factor. Map that range onto the integer range. For INT8 (signed, −127 to +127): scale = 0.8 / 127 ≈ 0.0063.
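- Quantize. Divide each weight by the scale and round to the nearest integer. A weight of 0.315 becomes round(0.315 / 0.0063) = 50. The tiny weight 0.0023 becomes round(0.0023 / 0.0063) = round(0.365) = 0: with a range this wide it still collapses to zero, which is the problem block-wise scaling (covered next) is designed to reduce.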
- Store the integers plus the single scale factor.
- Dequantize at runtime. To use a weight, multiply the stored integer back by the scale. The integer 50 becomes 50 × 0.0063 = 0.315.
The error introduced is the difference between the original weight and its rounded-then-rescaled value. The fewer bits, the coarser the grid, the larger that error — which is why INT4 loses more quality than INT8.
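The steps above can be sketched in plain Python. This is a minimal illustration of symmetric "absmax" scaling under the assumptions in the walkthrough, not how any particular library implements it. The helper names `quantize`, `dequantize`, and `mean_abs_error` are invented for the example.

```python
def quantize(weights, bits=8):
    """Symmetric absmax quantization: returns (integers, scale)."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)  # step 1: find the range
    scale = max_abs / qmax or 1.0           # step 2: scale (1.0 if all weights are zero)
    ints = [round(w / scale) for w in weights]  # step 3: divide and round
    return ints, scale                      # step 4: store integers plus one scale


def dequantize(ints, scale):
    """Step 5: multiply the stored integers back by the scale."""
    return [q * scale for q in ints]


def mean_abs_error(weights, bits):
    """Average gap between each weight and its rounded-then-rescaled value."""
    ints, scale = quantize(weights, bits)
    restored = dequantize(ints, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


weights = [0.8, -0.8, 0.315, 0.0023]
ints, scale = quantize(weights, bits=8)
print(ints)  # [127, -127, 50, 0]
print(mean_abs_error(weights, 8) < mean_abs_error(weights, 4))  # True
```

The last line shows the effect described above on this toy input: the 4-bit grid is coarser, so its round-trip error is larger than the 8-bit one.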
Why group/block-wise scaling matters
Using one scale factor for an entire massive weight matrix is crude: a single huge outlier weight stretches the range and wastes precision on all the small weights. Modern methods use block-wise quantization — a separate scale factor for every small group of weights (e.g., every 64 or 128 values). This keeps precision high where it matters and is the reason INT4 today is far better than naive 4-bit rounding from a few years ago.
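To see why one outlier hurts, here is a small sketch that compares a single scale for the whole list against one scale per block. The weights are made-up toy values, the block size of 4 is chosen only to keep the example short (real methods use groups such as 64 or 128), and the function names are hypothetical.

```python
def roundtrip(block, bits=4):
    """Quantize a block with its own absmax scale, then dequantize it."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in block) / qmax or 1.0
    return [round(w / scale) * scale for w in block]


def total_error(weights, block_size, bits=4):
    """Sum of absolute round-trip errors when each block gets its own scale."""
    err = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        restored = roundtrip(block, bits)
        err += sum(abs(w - r) for w, r in zip(block, restored))
    return err


# Seven small weights and one large outlier (8.0).
weights = [0.01, -0.02, 0.015, 0.005, 8.0, -0.01, 0.02, 0.03]

per_tensor = total_error(weights, block_size=len(weights))  # one scale for everything
block_wise = total_error(weights, block_size=4)             # one scale per 4 weights

print(f"per-tensor error: {per_tensor:.4f}")
print(f"block-wise error: {block_wise:.4f}")
```

With a single scale, the outlier forces every small weight to round to zero. With blocks, only the weights that share a block with the outlier suffer, so the total error drops.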
INT4 vs INT8 vs FP16: the head-to-head comparison
| Aspect | FP16 | INT8 | INT4 |
|---|---|---|---|
| Bytes per weight | 2 | 1 | 0.5 |
| Memory vs FP16 | 1× (baseline) | ~0.5× (half) | ~0.25× (quarter) |
| Quality loss | None (reference) | Negligible (<1%) | Small but real (1–5% on benchmarks) |
| Speed | Baseline | Often faster (less memory traffic) | Often fastest (least memory traffic), depending on kernel support |
| Best for | Max quality, ample VRAM | Production sweet spot | Running big models on small GPUs |
| Typical use case | Cloud inference, training | Most production serving | Local/consumer hardware |
What model size means in practice
Here's how much GPU memory the weights alone need (add ~20–40% for the KV cache and activations during actual inference):
| Model size | FP16 | INT8 | INT4 | INT4 fits on… |
|---|---|---|---|---|
| 7B | ~14 GB | ~7 GB | ~3.5 GB | RTX 3060 (12 GB), most laptops |
| 13B | ~26 GB | ~13 GB | ~6.5 GB | RTX 4070 (12 GB), RTX 3090 |
| 70B | ~140 GB | ~70 GB | ~35 GB | 2× RTX 4090 or 3090 (48 GB total); too large for a single 24 GB card |
| 70B (typical hardware) | 2× A100 80 GB | 1× A100 80 GB | 2× 24 GB consumer GPUs | INT4 is what makes 70B local-feasible |
This table is the real reason quantization matters to beginners: INT4 is what lets a 70B model run on hardware you can actually afford.
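A quick way to apply the 20–40% allowance is to compute a low and a high estimate and compare them with your card. The sketch below assumes the overhead is a flat fraction of the weight size, which is only a rough budgeting rule; the real KV cache grows with context length and batch size. The function names are invented for the example.

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def vram_needed_gb(params_billions, fmt, overhead=0.2):
    """Weights plus a fractional allowance for KV cache and activations."""
    return params_billions * BYTES_PER_WEIGHT[fmt] * (1 + overhead)


def fits(params_billions, fmt, vram_gb, overhead=0.2):
    """True if the estimate fits in the given amount of VRAM."""
    return vram_needed_gb(params_billions, fmt, overhead) <= vram_gb


# A 7B model on a 12 GB card, at the low (20%) and high (40%) ends of the allowance.
for fmt in BYTES_PER_WEIGHT:
    low = vram_needed_gb(7, fmt, overhead=0.2)
    high = vram_needed_gb(7, fmt, overhead=0.4)
    verdict = "fits" if fits(7, fmt, vram_gb=12, overhead=0.4) else "does not fit"
    print(f"7B {fmt}: {low:.1f}-{high:.1f} GB, {verdict} in 12 GB")
```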
When should you use each precision?
Use FP16 when…
- You have plenty of GPU memory (cloud A100/H100, or the model is small).
- You need maximum quality and can't tolerate any degradation.
- You're doing tasks sensitive to small numerical differences (some math, code, or reasoning benchmarks).
Use INT8 when…
- You want to roughly halve memory and cost with essentially no visible quality loss.
- You're running production inference and want the best quality-per-dollar.
- This is the safest default for most people — the quality drop is hard to notice in practice.
Use INT4 when…
- You're running on consumer hardware (a single gaming GPU, a laptop, a Mac).
- You want to run a much larger model than your VRAM would otherwise allow — a quantized 70B often beats a full-precision 13B.
- A small quality drop is acceptable for chat, summarization, drafting, and most everyday tasks.
A useful heuristic: a larger model at INT4 usually beats a smaller model at FP16 for the same memory budget. A 13B model at INT4 (~6.5 GB) generally outperforms a 7B model at FP16 (~14 GB) while using less memory.
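The heuristic can be turned into a tiny picker: given a VRAM budget, list every size and precision that fits and take the one with the most parameters. This is only a sketch of the rule of thumb above, with an assumed 20% overhead and a made-up function name. It does not measure quality, so treat its answer as a starting point.

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def largest_model_that_fits(vram_gb, sizes_billions=(7, 13, 70), overhead=0.2):
    """Return (size_in_billions, format) for the biggest model within budget, or None."""
    candidates = []
    for size in sizes_billions:
        for fmt, nbytes in BYTES_PER_WEIGHT.items():
            needed = size * nbytes * (1 + overhead)
            if needed <= vram_gb:
                candidates.append((size, nbytes, fmt))
    if not candidates:
        return None
    # Prefer more parameters first; break ties with higher precision.
    size, _, fmt = max(candidates)
    return size, fmt


print(largest_model_that_fits(12))  # (13, 'INT4')
print(largest_model_that_fits(24))  # (13, 'INT8')
print(largest_model_that_fits(4))   # None
```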
Quantization formats you'll actually encounter
When you download quantized models (e.g., from Hugging Face), you'll see these names. Here's what they mean without the jargon:
| Format / Method | What it is | When you'll see it |
|---|---|---|
| GGUF | A file format for quantized models, used by llama.cpp. Runs on CPU + GPU. | Local inference, Macs, Ollama, LM Studio |
| GPTQ | A post-training 4-bit method optimized for GPU inference. | GPU serving of 4-bit models |
| AWQ | Activation-aware quantization; protects the most important weights. | High-quality 4-bit GPU inference |
| bitsandbytes | On-the-fly 8-bit/4-bit quantization, integrated into Hugging Face. | Easiest way to load a model quantized |
| Qn_K_M tags | GGUF quality levels (e.g., Q4_K_M). The number = bits; K_M = block scheme. | Choosing which GGUF file to download |
Reading GGUF tags: In a name like Q4_K_M, the 4 is the bit count, K means block-wise scaling, and M (medium) is the quality tier. As a beginner, Q4_K_M is the most-recommended balance of size and quality for 4-bit. Q5_K_M or Q8_0 give higher quality at larger size.
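If you want to sort or filter GGUF files by their tag, the reading above maps onto a short parser. This is a simplified sketch that handles only the naming patterns mentioned here (such as Q4_K_M and Q8_0); other tag families exist and will return None. The function `parse_gguf_tag` is hypothetical, not part of llama.cpp or any library.

```python
import re

TIERS = {"S": "small", "M": "medium", "L": "large"}


def parse_gguf_tag(tag):
    """Split a tag like 'Q4_K_M' into its parts; returns None if unrecognized."""
    # K-style tags: Q<bits>_K with an optional _S/_M/_L tier.
    match = re.fullmatch(r"Q(\d)_K(?:_([SML]))?", tag)
    if match:
        return {
            "bits": int(match.group(1)),       # nominal bit count
            "k_quant": True,                   # block-wise "K" scheme
            "tier": TIERS.get(match.group(2)),  # None when no tier suffix
        }
    # Older-style tags such as Q8_0 or Q4_0.
    match = re.fullmatch(r"Q(\d)_(\d)", tag)
    if match:
        return {"bits": int(match.group(1)), "k_quant": False, "tier": None}
    return None


print(parse_gguf_tag("Q4_K_M"))  # {'bits': 4, 'k_quant': True, 'tier': 'medium'}
print(parse_gguf_tag("Q8_0"))    # {'bits': 8, 'k_quant': False, 'tier': None}
```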
How to quantize and load a model (code examples)
Easiest: load in 4-bit with bitsandbytes (Hugging Face)
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                  # turn on 4-bit
        bnb_4bit_quant_type="nf4",          # "normal float 4": best for LLM weights
        bnb_4bit_compute_dtype="float16",   # compute in fp16 for accuracy
        bnb_4bit_use_double_quant=True,     # quantize the scale factors too (saves more)
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        quantization_config=quant_config,
        device_map="auto",
    )
That single config turns a ~14 GB model into a ~3.5 GB one that fits on a 12 GB consumer GPU.
Load in 8-bit (highest quality of the quantized options)
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        quantization_config=quant_config,
        device_map="auto",
    )
Running a GGUF model locally with Ollama (no code)
    # Pull a 4-bit quantized model and chat with it
    ollama run llama2:7b-chat-q4_K_M
Ollama, LM Studio, and llama.cpp all use GGUF files and handle the quantization details for you — you just pick the quality tier.
What quality actually gets lost? (Setting expectations)
The honest truth, because beginners often either over-worry or under-worry about this:
- INT8 loss is negligible. On standard benchmarks, INT8 models score within ~1% of FP16. You will almost never notice the difference in chat or writing tasks.
- INT4 loss is small but real. Expect a 1–5% drop on reasoning and knowledge benchmarks. For casual chat, summarization, and drafting, it's usually imperceptible. For precise math, complex multi-step reasoning, or code, you may occasionally see more mistakes.
- Below INT4 (INT3, INT2) loss grows fast. Quality degrades sharply. These exist but aren't recommended unless you have no other option.
- Outlier sensitivity. Some models quantize more gracefully than others. Larger models are generally more robust to quantization than small ones — counterintuitively, a 70B model loses less relative quality at INT4 than a 7B model does.
Common beginner mistakes and how to avoid them
| Mistake | Why it's a problem | Fix |
|---|---|---|
| Forgetting KV cache + activation memory | Weights at INT4 fit, but inference OOMs anyway | Budget 20–40% extra VRAM beyond the weight size |
| Using INT4 for math/code-critical tasks | The small quality drop hits exactly these tasks hardest | Use INT8 or FP16 for precision-sensitive workloads |
| Picking the smallest GGUF file (Q2/Q3) | Aggressive quantization tanks quality | Default to Q4_K_M; go up to Q5/Q8 if VRAM allows |
| Quantizing during training | Most beginner tools do post-training quantization | Quantize an already-trained model; don't train in 4-bit |
| Comparing a quantized small model to a full big one | Apples to oranges | Compare at equal memory budget, not equal parameter count |
Frequently asked questions
What is the difference between INT4, INT8, and FP16?
They are different numeric precisions for storing a model's weights. FP16 uses 16 bits (2 bytes) per weight and is the full-quality standard. INT8 uses 8 bits (1 byte), halving memory with negligible quality loss. INT4 uses 4 bits (0.5 bytes), quartering memory with a small but real quality drop. Lower precision means smaller, faster models that fit on cheaper hardware.
Does quantization make a model dumber?
Only slightly, and often imperceptibly. INT8 models perform within about 1% of the original on benchmarks. INT4 models lose roughly 1–5% on reasoning and knowledge tasks but remain perfectly usable for chat, writing, and summarization. The memory savings almost always outweigh the small quality cost, especially since you can run a much larger, smarter model in the freed-up space.
How much GPU memory do I need for a 7B model?
For the weights alone: about 14 GB at FP16, 7 GB at INT8, or 3.5 GB at INT4. Add 20–40% on top for the KV cache and activations during inference. This means a 7B model at INT4 comfortably runs on a 12 GB consumer GPU, while FP16 needs roughly 17–20 GB once that overhead is included, so a 16 GB card is not enough and a 24 GB card is the practical choice.
Is INT4 or INT8 better?
INT8 is better for quality (almost indistinguishable from FP16) and is the safest production default. INT4 is better for fitting large models on small hardware, at the cost of a small quality drop. Choose INT8 when you have the memory and want maximum quality; choose INT4 when memory is the constraint or you want to run a bigger model.
What does Q4_K_M mean?
It's a GGUF quantization label. The 4 is the bit count (4-bit), K means block-wise scaling (separate scale factors for small weight groups, which improves quality), and M is the "medium" quality tier. Q4_K_M is the most widely recommended balance of size and quality for beginners running local models.
Can I quantize any model myself?
Yes — for inference. Tools like bitsandbytes (built into Hugging Face Transformers) let you load almost any model in 4-bit or 8-bit with a few lines of config, no special expertise needed. For producing optimized GPTQ or AWQ files you'd use their respective libraries, but for getting started, load_in_4bit=True is all you need.
Key takeaways
- Quantization stores model weights in fewer bits: FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes per weight.
- Memory scales directly with bits: a 7B model needs ~14 GB (FP16), ~7 GB (INT8), or ~3.5 GB (INT4) for weights.
- INT8 is the safe default — half the memory, essentially no quality loss.
- INT4 unlocks big models on consumer hardware — a quantized 70B runs where it otherwise couldn't, with a small (1–5%) quality drop.
- A larger model at INT4 usually beats a smaller model at FP16 for the same memory budget.
- For local models, Q4_K_M GGUF is the recommended starting point; use bitsandbytes load_in_4bit=True for the easiest Hugging Face path.
- Always budget 20–40% extra VRAM beyond the weight size for the KV cache and activations.