LLM Quantization Explained: INT4 vs INT8 vs FP16

A 70B model needs 140 GB of GPU RAM in FP16. Most teams don't have that. This guide breaks down LLM quantization - INT4 vs INT8 vs FP16 - with real memory numbers, speed benchmarks, and a practical decision framework for deploying LLMs efficiently.

Mohammed Kafeel

Machine Learning Researcher

June 14, 2026

A 70B model needs 140 GB of GPU RAM in FP16. Most teams don't have that. Quantization fixes it - and in some cases makes inference 8.5× faster in the process.

This guide cuts through the theory. You'll get the exact memory numbers, a side-by-side format comparison, and a clear decision framework for choosing between INT4, INT8, and FP16 in production.

TL;DR

FP16 is the standard inference baseline: half the memory of FP32, minimal accuracy loss.
INT8 halves FP16 memory again - a 7B model drops from 14 GB to 7 GB - with low accuracy impact.
INT4 cuts that in half again (3.5 GB for 7B) and has been measured at up to 8.5× lower latency in an optimized W4A4 encoder pipeline, but it needs smart methods like GPTQ or AWQ to hold accuracy.
Post-training quantization (PTQ) - applying compression after training - is the practical path for most production deployments.
The three methods that matter: AWQ, GPTQ, and SmoothQuant.

What Is LLM Quantization?

LLM quantization is the process of compressing a model's weights and activations from high-precision formats (like FP32 or FP16) into lower-precision formats (like INT8 or INT4). Fewer bits per value = less memory, faster math, lower cost.

Think of it like reducing a 24-bit color image to 8-bit. The picture still looks recognizable. You just use a fraction of the storage.

The key insight: most of a model's weights don't need full 32-bit precision to produce accurate outputs. Research consistently shows that 4-bit or 8-bit representations preserve the vast majority of model quality - especially with modern calibration techniques.

This is why model quantization has become the default first step before any LLM deployment.
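To make the idea concrete, here is a minimal sketch of symmetric round-to-nearest INT8 quantization in plain Python. It assumes one scale for the whole tensor and no calibration, so it illustrates the basic mapping only, not how AWQ, GPTQ, or SmoothQuant work. The function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [qi * scale for qi in q]


weights = [0.02, -0.5, 0.313, 1.27, -0.8]   # toy values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-to-nearest error is bounded by half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one signed byte plus a shared scale, and the worst-case rounding error is half of that scale. The methods covered later exist to decide where that error can be tolerated.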

The Memory Problem in Plain Numbers

Here's the core math. Each parameter in a model takes up a fixed number of bytes depending on its format:

FP32: 4 bytes per parameter
FP16 / BF16: 2 bytes per parameter
INT8: 1 byte per parameter
INT4: 0.5 bytes per parameter

Apply that to a 7B parameter model:

Format	Memory (weights only)
FP32	28 GB
FP16	14 GB
INT8	7 GB
INT4	3.5 GB

Now scale to 70B parameters:

Format	Memory (weights only)
FP32	280 GB
FP16	140 GB
INT8	70 GB
INT4	35 GB

A 70B model in FP16 needs at least two A100 80GB GPUs just for the weights (140 GB against 160 GB of combined memory), which leaves little room for anything else. In INT4, it fits on a single A100 - with room left for the KV cache and activations.

These figures cover weights only. Runtime elements - attention caches, activations, framework overhead - add more. But the weight savings alone are transformative for deployment economics. (For the bigger picture, see quantization's impact on inference costs.)
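The arithmetic behind the two tables can be sketched in a few lines of Python. This is a weights-only estimate that uses 1 GB = 10^9 bytes to match the figures above; the helper name is made up for illustration.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, fmt):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


for params in (7e9, 70e9):
    for fmt in ("FP32", "FP16", "INT8", "INT4"):
        print(f"{params / 1e9:.0f}B {fmt}: {weight_memory_gb(params, fmt):.1f} GB")
```

Real quantized checkpoints are slightly larger than this estimate because they also store scales (and sometimes zero points) for each group of weights.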

Every Quantization Format, Ranked

Format	Size vs FP32	Accuracy Drop	Primary Use Case	Notes
FP32	100%	None	Training	Full precision; slow and memory-heavy
FP16	50%	Minimal	Training & inference	Standard baseline for most LLMs
FP8	25%	Low	Training & inference	Emerging; strong on Hopper/Blackwell GPUs
INT8	25%	Low	Inference	Excellent all-around trade-off
INT4	12.5%	Moderate	Inference	Needs GPTQ/AWQ to preserve accuracy
INT2	6.25%	High	Experimental	Accuracy often too poor for production

FP8 deserves a mention here. It's not yet universal, but NVIDIA's Hopper (H100) and Blackwell architectures have native FP8 Tensor Core support. For teams on those GPUs, FP8 is increasingly the first stop before INT4.

BF16 (Brain Float 16) is also worth knowing. It uses the same 16 bits as FP16 but allocates more bits to the exponent, giving it a wider dynamic range. It's common in training and on hardware that supports it natively (TPUs, Ampere+).
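The range difference follows directly from the bit layouts: FP16 uses 5 exponent bits and 10 mantissa bits, while BF16 uses 8 exponent bits and 7 mantissa bits. A small sketch, assuming IEEE-style encoding where the top exponent code is reserved for infinity and NaN (the function name is illustrative):

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style float with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias


fp16_max = max_finite(exponent_bits=5, mantissa_bits=10)
bf16_max = max_finite(exponent_bits=8, mantissa_bits=7)

print(f"FP16 max: {fp16_max:.6g}, step near 1.0: {2 ** -10}")
print(f"BF16 max: {bf16_max:.6g}, step near 1.0: {2 ** -7}")
```

BF16 trades three bits of mantissa precision for roughly the same dynamic range as FP32, which is why it overflows far less often during training.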

INT4 vs INT8 vs FP16: Head-to-Head

Dimension	FP16	INT8	INT4
Memory (7B model)	14 GB	7 GB	3.5 GB
Latency speedup vs FP16	1× (baseline)	~1.5–2×	up to 8.5×
Accuracy drop	Minimal	Low	Moderate (mitigated by AWQ/GPTQ)
Hardware support	Universal	Universal	Requires modern GPUs (Ampere+)
Best for	High-accuracy inference, fine-tuning	Production serving, balanced workloads	Max throughput, edge deployment, cost reduction
Typical tools	vLLM, TGI, TensorRT-LLM	SmoothQuant, bitsandbytes	GPTQ, AWQ, AutoAWQ

The INT4 latency advantage is real. Research published at ICML 2023 (Wu et al., arXiv:2301.12017) showed that an optimized W4A4 encoder pipeline runs 8.5× faster for latency-oriented scenarios and up to 3× faster for throughput-oriented scenarios compared to FP16 inference.

GPTQ - one of the leading INT4 methods - delivers approximately 3.25× speedup over FP16 with custom GPU kernels, and can quantize a 175B model like OPT-175B in roughly 4 GPU hours without retraining.

The trade-off is real too. Decoder-only models (GPT-style architectures) are more sensitive to aggressive INT4 quantization than encoder-only models. That's exactly why methods like AWQ and GPTQ exist - to recover accuracy that naive rounding would destroy. (For a deeper dive, see bit-width tradeoffs in detail.)

The 3 Quantization Methods That Actually Work in Production

Post-training quantization (PTQ) - compressing a model after training, without retraining - is the dominant approach for production LLM deployments. No GPU clusters needed. No training runs. You take an existing model and compress it.

Three methods dominate the PTQ landscape for LLMs.

01 - AWQ (Activation-Aware Weight Quantization)

AWQ protects the weights that matter most. The core insight: only ~1% of weights are "salient" - they disproportionately affect model output. AWQ identifies these weights by analyzing activation distributions, then scales them to preserve accuracy while aggressively quantizing the remaining 99%.

Precision target: INT4 weights (W4A16 or W4A8)
No retraining required: uses a small calibration dataset (128–512 samples)
Hardware-friendly: produces a format that runs efficiently on modern GPU Tensor Cores
Won MLSys 2024 Best Paper - it's not experimental, it's production-grade
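Speedup: over 3× vs. FP16 on desktop and mobile GPUs with the AWQ authors' TinyChat runtime, which also ran Llama-2-70B on a single 64 GB Jetson Orin. (At 4 bits, 70B weights alone are ~35 GB, so they do not fit in an RTX 4090's 24 GB without offloading or lower bit-widths.)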

AWQ is the go-to for INT4 deployment when you need accuracy close to FP16. (See our practical 4-bit quantization guide for a hands-on walkthrough.)
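The scaling idea can be shown on a toy example. The sketch below is a simplification under stated assumptions: a single output row, one calibration activation, per-row symmetric INT4 rounding, and a grid search over alpha in s = (activation magnitude)^alpha. Real AWQ works on grouped weights across a calibration set with optimized kernels; all names and numbers here are hypothetical.

```python
def quantize_row_int4(row):
    """Symmetric round-to-nearest INT4, one scale per row; returns dequantized values."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return [max(-8, min(7, round(w / scale))) * scale for w in row]


def output_error(row, x, channel_scales):
    """|w.x - Q(w*s).(x/s)| for one output row and one input vector."""
    scaled = [w * s for w, s in zip(row, channel_scales)]
    deq = quantize_row_int4(scaled)
    exact = sum(w * xi for w, xi in zip(row, x))
    approx = sum(d * (xi / s) for d, xi, s in zip(deq, x, channel_scales))
    return abs(exact - approx)


row = [0.2, 1.0]     # toy weights; input channel 0 is the "salient" one
x = [10.0, 1.0]      # toy calibration activation; channel 0 is much larger
act_mag = [abs(v) for v in x]

# Baseline: plain rounding, no scaling
naive_err = output_error(row, x, [1.0, 1.0])

# Search for the per-channel scaling exponent that minimizes output error
best_alpha, best_err = 0.0, naive_err
for step in range(1, 11):
    alpha = step / 10
    scales = [m ** alpha for m in act_mag]
    err = output_error(row, x, scales)
    if err < best_err:
        best_alpha, best_err = alpha, err

print(f"naive error: {naive_err:.4f}")
print(f"best alpha: {best_alpha}, error with scaling: {best_err:.4f}")
```

Scaling the salient channel up before rounding gives it finer effective resolution, and dividing the activation by the same factor keeps the product mathematically unchanged. With a single calibration sample the search overfits, which is why real calibration uses a few hundred samples.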

02 - GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ quantizes models to 3–4 bits per weight using a layer-by-layer approach. For each layer, it uses approximate second-order information - the inverse Hessian of that layer's output reconstruction error, estimated from calibration inputs - to measure how sensitive the output is to each weight, and it adjusts the not-yet-quantized weights to absorb the error from each rounding step.

Precision target: INT4 (3–4 bits), supports down to 2-bit
Scale: quantizes OPT-175B or BLOOM-176B in ~4 GPU hours
Speedup: ~3.25× over FP16 with custom GPU kernels
Enables 175B models on a single A100 (80 GB) or two A6000s at 3-bit precision (at 4 bits, 175B weights alone are ~87.5 GB)
Widely supported: AutoGPTQ, vLLM, TGI all support GPTQ out of the box

GPTQ is a long-standing standard in open-source model serving pipelines. If you're deploying a pre-quantized 4-bit Llama or Mistral checkpoint from Hugging Face, there's a good chance it was produced with GPTQ.
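The error-redistribution step can be illustrated with two weights. The sketch below is a toy under loud assumptions: a fixed rounding grid with step 1.0, two input features, two made-up calibration samples, and a closed-form 2×2 inverse. Real GPTQ processes thousands of columns in blocks and updates the inverse Hessian as it goes; none of that is shown here.

```python
def quantize(value, step=1.0):
    """Round to the nearest point on a uniform grid."""
    return round(value / step) * step


def squared_error(w, q, X):
    """Sum over samples of (w.x - q.x)^2; X[i][n] is feature i of sample n."""
    total = 0.0
    for n in range(len(X[0])):
        diff = sum((w[i] - q[i]) * X[i][n] for i in range(len(w)))
        total += diff * diff
    return total


# Toy calibration data: rows are input features, columns are samples
X = [[1.0, 1.0],
     [1.0, 0.0]]
w = [0.4, 0.4]

# Hessian of the layer-wise squared error, H = X X^T (constant factor dropped)
H = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(2)] for i in range(2)]
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
H_inv = [[H[1][1] / det, -H[0][1] / det],
         [-H[1][0] / det, H[0][0] / det]]

# Baseline: round each weight independently
rtn = [quantize(w[0]), quantize(w[1])]

# GPTQ-style: quantize weight 0, shift its error onto weight 1, then quantize weight 1
q0 = quantize(w[0])
err = (w[0] - q0) / H_inv[0][0]
w1_compensated = w[1] - err * H_inv[1][0]
gptq = [q0, quantize(w1_compensated)]

rtn_err = squared_error(w, rtn, X)
gptq_err = squared_error(w, gptq, X)
print(rtn, rtn_err)
print(gptq, gptq_err)
```

Independent rounding sends both weights to zero. The compensated version rounds the second weight up instead, because the two features are correlated in the calibration data and that choice offsets part of the first weight's error.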

03 - SmoothQuant

SmoothQuant solves a specific problem: activation outliers. When you try to quantize both weights and activations to INT8 (W8A8), outlier activation values destroy accuracy. SmoothQuant mathematically "smooths" these outliers by shifting quantization difficulty from activations to weights through an equivalent transformation.

Precision target: INT8 weights and activations (W8A8)
Training-free: pure post-training quantization; needs only a light offline calibration pass to collect activation statistics
Memory reduction: up to 2× vs FP16
Speedup: up to 1.56× over FP16
Plug-and-play: compatible with most transformer architectures

SmoothQuant is the right call when you want full INT8 quantization (weights and activations) with minimal accuracy drop and no retraining. It's NVIDIA's recommended INT8 PTQ method in their TensorRT-LLM stack. (More on INT8 quantization and activation awareness.)
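The "equivalent transformation" is a per-channel rescaling: divide each activation channel by a factor s_j and multiply the matching weight row by the same factor, with s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). A minimal sketch with toy numbers and alpha = 0.5 follows; the matrices and helper names are hypothetical, and no actual INT8 rounding is performed.

```python
def smoothing_scales(act_max, weight_max, alpha=0.5):
    """Per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_max, weight_max)]


def matmul(A, B):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Rows of X are tokens; input channel 1 carries activation outliers
X = [[1.0, 100.0],
     [-0.5, 80.0]]
# W[j][k] is the weight from input channel j to output channel k
W = [[1.0, -0.5],
     [0.5, 1.0]]

act_max = [max(abs(row[j]) for row in X) for j in range(2)]
weight_max = [max(abs(v) for v in W[j]) for j in range(2)]
s = smoothing_scales(act_max, weight_max)

# Divide activations by s_j and multiply the matching weight rows by s_j
X_smooth = [[row[j] / s[j] for j in range(2)] for row in X]
W_smooth = [[W[j][k] * s[j] for k in range(2)] for j in range(2)]

smoothed_act_max = max(abs(v) for row in X_smooth for v in row)
Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)
print(s, smoothed_act_max, Y, Y_smooth)
```

The layer output is unchanged, but the largest activation shrinks from 100 to 10 while the largest weight grows from 1 to 10. Both tensors now span similar ranges, so a single INT8 scale wastes far fewer levels on outliers.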

When to Quantize - and When Not To

Quantize when:

Your GPU has ≤24 GB VRAM - INT8 or INT4 is often the only way to run models much larger than 7B (a 13B model already needs 26 GB in FP16)
Latency is a constraint - optimized INT4 pipelines have been measured at up to 8.5× faster than FP16
You're serving at scale - smaller memory footprint = more concurrent requests per GPU = lower cost per token
You want to reduce serving costs - fewer GPUs needed, lower cloud spend
You can tolerate small accuracy trade-offs - for most chat, summarization, and code tasks, INT8 quality is indistinguishable from FP16

Don't quantize when:

You need maximum accuracy - safety-critical applications, medical, legal, financial reasoning where every decimal point matters
Your model already fits comfortably - quantizing a 1B model that runs fine in FP16 on your hardware adds complexity for no gain
Your hardware doesn't support it - older GPUs without INT8/INT4 Tensor Core support won't see the speedups
You're fine-tuning - quantization is for inference, not training (QAT is a separate, more complex approach)

The decision is almost always straightforward: if you're deploying a 7B+ model and you don't have unlimited GPU budget, you quantize. (For sub-4 GB targets specifically, see quantization for memory-constrained devices.)
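As a rough first pass, the memory side of this decision can be written down as a tiny helper. This is a sketch, not a rule: it assumes weights-only sizing, a hypothetical 20% VRAM headroom for the KV cache and activations, and it ignores accuracy and hardware-support considerations entirely.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def pick_format(num_params, vram_gb, headroom=0.2):
    """Return the highest-precision format whose weights fit, or None."""
    budget_gb = vram_gb * (1 - headroom)   # reserve room for KV cache, activations
    for fmt in ("FP16", "INT8", "INT4"):
        if num_params * BYTES_PER_PARAM[fmt] / 1e9 <= budget_gb:
            return fmt
    return None


for params, vram in [(7e9, 24), (13e9, 24), (70e9, 80), (70e9, 24)]:
    print(f"{params / 1e9:.0f}B on {vram} GB -> {pick_format(params, vram)}")
```

A result of None means the weights do not fit on one device even at INT4, so the options become multiple GPUs, offloading, or a smaller model.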

What to Quantize: Weights, Activations, KV Cache

Not everything in a model needs to be quantized equally. Here's the priority order:

01 - Model weights are the primary target. They're static, known before inference, and account for the bulk of memory. Quantizing weights is well-understood and low-risk with modern methods.

02 - Activations are trickier. They vary with every input and can contain outliers that degrade accuracy. SmoothQuant specifically addresses this. Dynamic quantization (computing scale factors per layer at runtime) is more accurate but slower; static quantization (pre-computed scale factors) is faster but less precise.

03 - KV cache is the third lever. In long-context serving, the key-value cache grows with sequence length and can dominate memory. Quantizing the KV cache to INT8 or INT4 reduces memory pressure without touching model weights - useful for RAG pipelines and long-document processing.

The practical starting point for most teams: quantize weights first, evaluate accuracy, then decide whether to push further into activation or KV cache quantization.
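To judge whether the KV cache is worth quantizing, it helps to estimate its size. The cache stores one key and one value vector per layer, per KV head, per token position. The sketch below uses a hypothetical 32-layer configuration with 32 KV heads of dimension 128; substitute the values from your model's config.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value):
    """Keys plus values for every layer, KV head, and position, in GB (1e9 bytes)."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9


# Hypothetical model shape and a 4096-token context, single request
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)

fp16 = kv_cache_gb(**cfg, bytes_per_value=2)
int8 = kv_cache_gb(**cfg, bytes_per_value=1)
int4 = kv_cache_gb(**cfg, bytes_per_value=0.5)
print(f"FP16: {fp16:.2f} GB, INT8: {int8:.2f} GB, INT4: {int4:.2f} GB")
```

Because the cache grows linearly with both sequence length and batch size, a long-context or high-concurrency workload can spend more memory here than on INT4 weights.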

Key Takeaways

Weight memory scales linearly with bit-width: relative to FP32, FP16 is 2× smaller, INT8 is 4× smaller, and INT4 is 8× smaller.
A 7B model in INT4 needs 3.5 GB - it runs on a laptop GPU. The same model in FP16 needs 14 GB.
INT4 can be much faster than FP16: up to 8.5× lower latency for an optimized W4A4 encoder pipeline, and ~3.25× speedup for GPTQ with custom GPU kernels.
SmoothQuant is the best path to W8A8 INT8 quantization: up to 2× memory reduction, 1.56× speedup, no retraining.
AWQ is the best path to INT4: protects the 1% of salient weights, near-FP16 accuracy, hardware-friendly.
Post-training quantization is the practical default - no retraining, no GPU clusters, works on existing model checkpoints.
When in doubt: start with INT8 (SmoothQuant or bitsandbytes), validate accuracy, then push to INT4 if latency or cost requires it.

FAQ

What is LLM quantization and why does it matter?

LLM quantization is the process of reducing the numerical precision of a model's weights and activations - for example, from 32-bit floats (FP32) to 8-bit integers (INT8). It matters because it dramatically reduces GPU memory requirements and speeds up inference, making large models deployable on hardware that would otherwise be too small or too expensive.

What is the difference between INT4, INT8, and FP16 quantization?

FP16 (16-bit floating point) is the standard inference baseline - half the memory of FP32, minimal accuracy loss. INT8 (8-bit integer) halves memory again and speeds up inference by ~1.5–2×. INT4 (4-bit integer) halves it once more, enabling up to 8.5× latency speedup, but requires smart methods like GPTQ or AWQ to avoid significant accuracy degradation.

Does quantization hurt model accuracy?

It depends on the format and method. FP16 has minimal accuracy impact. INT8 with SmoothQuant maintains accuracy close to FP16 for most tasks. INT4 with GPTQ or AWQ shows moderate accuracy drop - often imperceptible for chat, summarization, and code generation, but measurable on strict benchmarks. For safety-critical applications, stick with FP16 or INT8.

What is post-training quantization (PTQ)?

Post-training quantization (PTQ) applies compression to a model after it has been trained - no retraining required. You take an existing model checkpoint, run a calibration pass on a small dataset (typically 128–512 samples), and produce a quantized model. AWQ, GPTQ, and SmoothQuant are all PTQ methods. The alternative, Quantization-Aware Training (QAT), bakes quantization into the training loop for better accuracy but requires far more compute.

Which quantization method should I use: AWQ, GPTQ, or SmoothQuant?

Use AWQ for INT4 when you need near-FP16 accuracy and hardware-efficient inference on modern GPUs. Use GPTQ for INT4 when you're working with very large models (70B+) and need maximum compression with good open-source tooling support. Use SmoothQuant for INT8 when you want to quantize both weights and activations (W8A8) with no retraining and minimal accuracy loss.

Can I run a 70B LLM on a single GPU with quantization?

Yes. A 70B model in FP16 requires ~140 GB of VRAM - far beyond any single consumer or prosumer GPU. In INT4 (via GPTQ or AWQ), the weight footprint drops to ~35 GB. GPTQ specifically enables inference of 175B-parameter models on a single A100 (80 GB) or two A6000s (48 GB each) at 3-bit precision. For a 70B model, a single A100 80GB handles it comfortably in INT4, with memory left for the KV cache.

Useful Sources

Understanding INT4 Quantization for Transformer Models (arXiv:2301.12017) - ICML 2023 paper establishing the 8.5× latency speedup benchmark for W4A4 pipelines.
Optimizing LLMs for Performance and Accuracy with Post-Training Quantization - NVIDIA Developer Blog - NVIDIA's guide to AWQ, SmoothQuant, and the TensorRT-LLM quantization stack.
LLM Quantization - BentoML LLM Inference Handbook - Practical reference covering quantization formats, memory calculators, and method comparisons.
A Visual Guide to Quantization - Maarten Grootendorst - 50+ illustrations covering symmetric/asymmetric quantization, GPTQ internals, GGUF, and QAT.

Have a question about quantizing a specific model or architecture? Drop it in the comments. And if you're evaluating LLM deployment strategies, our LLM inference optimization guide is the logical next read.
