How to Quantize Llama 3 to 4-Bit With Minimal Accuracy Loss

A complete, code-first guide to quantizing Llama 3 to 4-bit using bitsandbytes NF4, AWQ, GPTQ, and GGUF - with real VRAM numbers, MMLU benchmarks, and tips to keep accuracy loss under 2%.

Mohammed Kafeel

Machine Learning Researcher

June 19, 2026

TL;DR

4-bit quantization cuts Llama 3 8B VRAM from ~16 GB to ~5 GB - no GPU upgrade needed.

AWQ is the most accurate 4-bit method (only -1.8% MMLU vs FP16 baseline).

bitsandbytes NF4 takes four lines of config and is the fastest way to get running.

GGUF Q4_K_M is your go-to for CPU inference and Ollama.

Always use bnb_4bit_compute_dtype=torch.bfloat16 - not float16 - to avoid NaN issues on Llama 3.

Llama 3 8B in full BF16 precision needs ~16 GB of VRAM, which rules out most consumer GPUs. The 4-bit version fits in ~5 GB - within reach of an RTX 3060 or even a laptop GPU. And the accuracy hit? With the right method, you're looking at less than 2% on MMLU. That's the deal this guide delivers.

We'll cover every major 4-bit quantization method - bitsandbytes NF4, AWQ, GPTQ, and GGUF - with complete, runnable code for each. You'll also get benchmark numbers, VRAM tables, and five concrete tips to keep accuracy loss as low as possible.

What Is 4-Bit Quantization? (And Why It Matters for Llama 3)

4-bit quantization means storing each model weight in 4 bits instead of as a 16-bit float. That's a 75% reduction in memory per parameter - which is why it's one of the most effective Llama 3 memory optimization techniques available today.

In plain English: a neural network is just billions of numbers (weights). By default, Llama 3 stores each weight in BF16 (bfloat16) format, which takes 2 bytes. Quantization shrinks that to 0.5 bytes. The math is simple - the impact is massive.
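The arithmetic behind those byte counts can be sketched in a few lines. This is a back-of-the-envelope estimate for the weights alone; the rounded 8e9 parameter count and the helper name weight_memory_gb are assumptions for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_8B = 8e9  # rounded parameter count, an assumption for illustration

bf16_gb = weight_memory_gb(PARAMS_8B, 16)  # 2 bytes per weight
int4_gb = weight_memory_gb(PARAMS_8B, 4)   # 0.5 bytes per weight

print(f"BF16: {bf16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
print(f"Reduction: {1 - int4_gb / bf16_gb:.0%}")
```

The raw 4-bit figure comes out below the ~5 GB quoted in this guide because a real checkpoint also stores quantization constants and typically keeps a few layers, such as the embeddings, in higher precision.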

The trade-off is precision. You're approximating each weight with a coarser value. Done naively, that tanks accuracy. Done well (with methods like AWQ or NF4), the degradation is surprisingly small. (New to the precision formats? Start with INT4 quantization explained.)

VRAM Comparison: Llama 3 at BF16 vs 4-Bit

Model	Precision	VRAM Required	Fits On
Llama 3 8B	BF16	~16 GB	RTX 3090, RTX 4090
Llama 3 8B	4-bit	~5 GB	RTX 3060, RTX 4060, laptop GPUs
Llama 3 70B	BF16	~140 GB	2× A100 80GB minimum
Llama 3 70B	4-bit	~40 GB	1× A100 80GB or 2× RTX 4090

The 8B model goes from requiring a high-end GPU to running on almost any modern card. The 70B model goes from a multi-GPU data center setup to something a well-equipped developer can actually run locally.

Which 4-Bit Quantization Method Should You Use?

Short answer: AWQ for GPU production, GGUF Q4_K_M for CPU/local, bitsandbytes NF4 for quick prototyping and QLoRA fine-tuning.

Here's the full breakdown with real MMLU numbers (Llama 3.1 8B, FP16 baseline = 65.2%):

Method	MMLU Score	MMLU Delta	Calibration	Best For	Tools
AWQ	64.0%	-1.8%	Required	GPU production inference	autoawq, vLLM, SGLang
GGUF Q4_K_M	63.8%	-2.1%	None	CPU / Ollama / local	llama.cpp, Ollama
GPTQ	63.2%	-2.9%	Required	GPU inference (fallback)	GPTQModel, AutoGPTQ
bitsandbytes NF4	~63.5–64.0%	~-1.5–2%	None	Prototyping, QLoRA	transformers, PEFT

Key insight: AWQ consistently outperforms GPTQ by about 1 percentage point on MMLU. That gap matters in production. For local use, GGUF Q4_K_M is the sweet spot - no calibration, runs everywhere, and loses only 2.1% vs FP16. (For the full head-to-head, see our AWQ vs GPTQ benchmark comparison.)
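One reading note on the table: the AWQ and GGUF deltas line up with a relative change against the 65.2% baseline rather than a gap in percentage points. A quick check, using only the scores from the table (the function names are illustrative):

```python
def relative_delta_pct(baseline: float, score: float) -> float:
    """Change relative to the baseline score, in percent."""
    return (score - baseline) / baseline * 100


def point_delta(baseline: float, score: float) -> float:
    """Plain difference in percentage points."""
    return score - baseline


BASELINE = 65.2  # FP16 MMLU score from the table above

for name, score in [("AWQ", 64.0), ("GGUF Q4_K_M", 63.8)]:
    rel = relative_delta_pct(BASELINE, score)
    pts = point_delta(BASELINE, score)
    print(f"{name}: {rel:.1f}% relative, {pts:.1f} points")
```

When comparing against numbers from other sources, it helps to confirm which of the two conventions they use.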

Method 1 - bitsandbytes NF4 (Fastest to Get Running)

bitsandbytes NF4 is the fastest path to a 4-bit Llama 3 model. No calibration dataset, no preprocessing - just four config lines and you're loading.

NF4 stands for 4-bit NormalFloat, a data type introduced in the QLoRA paper (Dettmers et al., 2023). It's specifically designed for weights that follow a normal distribution, which is exactly what you get in transformer models. That's why it preserves accuracy better than naive INT4.
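To make the idea concrete, here is a minimal pure-Python sketch of blockwise NF4 quantization: scale a block by its absolute maximum, then snap each weight to the nearest of 16 fixed levels. The level values are rounded to four decimals from the ones published with QLoRA, the sample block is made up, and the real bitsandbytes kernels use 64-weight blocks and packed storage, so treat this as an illustration rather than the library's implementation.

```python
# The 16 NF4 levels, rounded to 4 decimals from the values published with QLoRA
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Return (absmax, codes): one scale per block plus a 4-bit index per weight."""
    absmax = max(abs(w) for w in weights) or 1.0  # guard against an all-zero block
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in weights
    ]
    return absmax, codes


def dequantize_block(absmax, codes):
    """Map each 4-bit index back to a level and undo the scaling."""
    return [NF4_LEVELS[c] * absmax for c in codes]


# Made-up weights standing in for one small block
block = [0.021, -0.013, 0.004, -0.035, 0.017, 0.0, -0.008, 0.029]
absmax, codes = quantize_block(block)
restored = dequantize_block(absmax, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print(f"absmax={absmax}, max abs error={max_err:.4f}")
```

The levels are packed more densely near zero, which is where most normally distributed weights sit; that is the property an evenly spaced INT4 grid lacks.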

This method is also the standard choice for QLoRA fine-tuning - you freeze the 4-bit base model and train only lightweight LoRA adapters on top.

Install

pip install torch transformers bitsandbytes accelerate

Load Llama 3 in 4-Bit NF4

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Configure 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",                # NormalFloat 4-bit (best for LLMs)
    bnb_4bit_use_double_quant=True,           # Double quantization: saves ~0.4 bits/param
    bnb_4bit_compute_dtype=torch.bfloat16     # Use bfloat16 for compute (NOT float16)
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                        # Auto-distribute across available GPUs
    torch_dtype=torch.bfloat16                # Dtype for the layers that stay unquantized
)

# Check memory footprint
memory_gb = model.get_memory_footprint() / 1e9
print(f"Model loaded. Memory footprint: {memory_gb:.2f} GB")
# Expect roughly 5–6 GB for Llama 3 8B: the embedding and output layers stay in BF16

# Run a quick generation test
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Two Rules You Can't Skip

Rule 1: Always enable double quantization. Setting bnb_4bit_use_double_quant=True applies a second quantization pass to the quantization constants themselves. It saves an extra ~0.4 bits per parameter at no measurable accuracy cost. On a 70B model, that's several gigabytes.

Rule 2: Use bfloat16, not float16, for compute dtype. Llama 3 uses bfloat16 natively. Setting bnb_4bit_compute_dtype=torch.float16 can produce NaN values during inference on some configurations. Stick with torch.bfloat16.

Method 2 - AWQ (Best Accuracy for GPU Production)

AWQ (Activation-Aware Weight Quantization) is the most accurate 4-bit method available for Llama 3. Its MMLU score drops only 1.8% from FP16 - the best of any 4-bit approach.

The reason AWQ is so good: it doesn't treat all weights equally. It analyzes activation magnitudes to identify the ~1% of weights that have the most impact on output quality, then protects those weights by scaling them before quantization. The other 99% get quantized normally. This targeted approach is why AWQ consistently beats GPTQ by 1–3% on benchmarks like MMLU and HumanEval.
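Why scaling helps can be shown with a toy simulation. Multiplying a weight by s before quantization and dividing its activation by s leaves the exact product unchanged, but the rounding error on that channel shrinks by roughly 1/s as long as the group's quantization step stays the same. The sketch below assumes a symmetric round-to-nearest INT4 grid, a group maximum of 1.0, and an activation of 10 on the salient channel; it is not AutoAWQ's scale-search procedure.

```python
import random


def quantize_int4(w, step):
    """Symmetric round-to-nearest onto a 4-bit grid with the given step size."""
    q = max(-8, min(7, round(w / step)))
    return q * step


def mean_output_error(scale, activation=10.0, trials=5000, seed=0):
    """Average |error| contributed by one salient input channel.

    The weight is multiplied by `scale` before quantization and the
    activation is divided by `scale`, so the exact product is unchanged.
    """
    rng = random.Random(seed)
    step = 1.0 / 7  # assumes the largest weight in the group is 1.0
    total = 0.0
    for _ in range(trials):
        w = rng.uniform(-0.4, 0.4)  # small enough that scaling keeps the group max
        exact = w * activation
        approx = quantize_int4(w * scale, step) * (activation / scale)
        total += abs(approx - exact)
    return total / trials


plain = mean_output_error(scale=1.0)
scaled = mean_output_error(scale=2.0)
print(f"no scaling: {plain:.3f}, scale=2: {scaled:.3f}")
```

In this setup the scaled run shows about half the error of the unscaled one. The catch is that a scale large enough to raise the group maximum coarsens the grid for every other weight in the group, which is why AWQ searches for scales instead of picking a large constant.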

AWQ models are also fully compatible with vLLM and SGLang - the two most popular high-throughput inference engines. That makes AWQ the default choice for production GPU deployments. (Running quantized models is also a proven way to cut LLM API and serving costs.)

Install

pip install autoawq

Load a Pre-Quantized AWQ Model

The fastest approach is to load a pre-quantized AWQ model from Hugging Face. Many Llama 3 AWQ models are available from the community (search for Llama-3-8B-Instruct-AWQ on Hugging Face).

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "casperhansen/llama-3-8b-instruct-awq"  # Example AWQ model

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load AWQ quantized model (from_quantized is AutoAWQ's loader for already-quantized checkpoints)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True       # Fuse layers for faster inference
)

# Generate text
inputs = tokenizer("Explain 4-bit quantization in one sentence:", return_tensors="pt")
inputs = inputs.to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Load AWQ Model via Transformers (Alternative)

If you prefer the standard transformers interface:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "casperhansen/llama-3-8b-instruct-awq"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

For vLLM deployment, AWQ models load with a single flag:

python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3-8b-instruct-awq \
    --quantization awq \
    --dtype auto

Method 3 - GGUF Q4_K_M (Best for CPU and Local Inference)

GGUF is the format you want for CPU inference, Ollama, and local setups with no GPU. It runs on llama.cpp, which is highly optimized for CPU execution and supports partial GPU offloading.

GGUF (the successor to the older GGML format) is a binary format that bundles the quantized weights and model metadata into a single file. The community - particularly via Bartowski and similar contributors on Hugging Face - maintains up-to-date GGUF versions of every major Llama 3 release.

Which GGUF Quantization Level?

Q4_K_M is the sweet spot. Here's how it compares with the levels on either side:

Q2_K (2-bit): Loses too much accuracy. Reasoning and instruction-following degrade noticeably.
Q4_K_M (4-bit): Best balance of size and quality. MMLU delta of only -2.1% vs FP16.
Q8_0 (8-bit): Near-FP16 accuracy but uses ~8 GB for the 8B model - often not worth it if VRAM is tight.

Run with Ollama (One Command)

# Pull and run Llama 3 8B in Q4_K_M format
ollama run llama3:8b-instruct-q4_K_M

That's it. Ollama handles the download, quantization format selection, and serving automatically.

Run with llama.cpp (Command Line)

First, download a GGUF model file (e.g., from Bartowski's Hugging Face repos):

# Install llama.cpp (build from source or use pre-built binary)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent llama.cpp versions build with CMake; binaries land in build/bin/
cmake -B build
cmake --build build --config Release -j

# Download a Q4_K_M GGUF model
# (Replace URL with actual model from Hugging Face)
wget https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

# Run inference
./build/bin/llama-cli \
    -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
    -p "What is 4-bit quantization?" \
    -n 200 \
    --temp 0.7

Tip: If you have a GPU, add -ngl N to offload N layers to it. Llama 3 8B has 32 transformer layers, so -ngl 35 offloads the whole model; pick a smaller value when the full model doesn't fit in VRAM. Even a partial offload speeds up generation noticeably. Offloading needs a GPU-enabled build (for NVIDIA cards, configure with cmake -B build -DGGML_CUDA=ON).

./build/bin/llama-cli \
    -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
    -p "What is 4-bit quantization?" \
    -n 200 \
    -ngl 35    # Offload all layers to GPU

How to Minimize Accuracy Loss - 5 Proven Tips

You can keep the accuracy loss from 4-bit quantization under 2% with the right choices. Here are the five decisions that matter most. (For the deeper picture, see accuracy loss at 4-bit for Llama 3.)

Tip 1: Use AWQ Over GPTQ When Accuracy Matters

AWQ achieves -1.8% MMLU delta vs FP16. GPTQ lands at -2.9%. That's a full percentage point difference on a widely-used benchmark. When you're serving users in production, that gap is real. Use AWQ as your default for GPU deployments; fall back to GPTQ only if AWQ weights aren't available for your specific model.

Tip 2: Enable Double Quantization in bitsandbytes

Always set bnb_4bit_use_double_quant=True. This performs a second quantization on the quantization constants themselves, saving an extra ~0.4 bits per parameter with no measurable accuracy cost. On a 70B model, that's roughly 3–4 GB of free savings.
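The ~0.4 bits figure can be reproduced from the block sizes described in the QLoRA paper: one 32-bit constant per 64 weights without double quantization, versus an 8-bit constant per 64 weights plus one 32-bit constant per 256 blocks with it. A short sketch of that arithmetic, where the function name and the round 70e9 parameter count are assumptions:

```python
def constant_overhead_bits(block_size=64, double_quant=False, dq_block_size=256):
    """Bits per weight spent on quantization constants (QLoRA-style defaults)."""
    if not double_quant:
        return 32 / block_size  # one FP32 absmax per block of weights
    # 8-bit absmax per block, plus one FP32 constant per group of blocks
    return 8 / block_size + 32 / (block_size * dq_block_size)


plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
saved_bits = plain - double
saved_gb_70b = saved_bits * 70e9 / 8 / 1e9  # 70e9 is a rounded parameter count

print(f"overhead: {plain:.3f} -> {double:.3f} bits/param")
print(f"saved: {saved_bits:.3f} bits/param, about {saved_gb_70b:.1f} GB on 70B")
```

This lands at about 0.37 bits per parameter, or a little over 3 GB on a 70B model, in line with the rough figures above.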

Tip 3: Use bfloat16 Compute Dtype, Not float16

Set bnb_4bit_compute_dtype=torch.bfloat16. Llama 3 was trained with bfloat16, and using float16 for compute can cause numerical instability (NaN values) on some hardware configurations. This is a silent failure - the model loads fine but produces garbage output.

Tip 4: For GGUF, Never Go Below Q4_K_M

Q2_K and Q3_K_M quantization levels degrade Llama 3 significantly. Llama 3 models are notably more sensitive to aggressive quantization than Llama 2. The reasoning and instruction-following capabilities that make Llama 3 useful start breaking down below 4-bit. Q4_K_M is the floor for reliable quality.

Tip 5: Keep Attention Layers in Higher Precision (Advanced)

Some AWQ configurations support mixed-precision quantization - keeping attention layers (Q, K, V projections) at higher precision while quantizing the MLP layers more aggressively. If you're quantizing from scratch with AutoAWQ, explore the modules_to_not_convert parameter to exempt sensitive layers. This is advanced but can recover another 0.5–1% on accuracy-critical tasks.

How to Verify Your Quantized Model Isn't Broken

Run a generation test first, then check perplexity if you need a number. Most quantization failures are obvious - the model produces repetitive text, refuses to follow instructions, or outputs gibberish.

Quick Sanity Check

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Test 1: Basic instruction following
prompt = "List three capitals of European countries."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print("Test 1:", tokenizer.decode(outputs[0], skip_special_tokens=True))

# Test 2: Reasoning
prompt2 = "If a train travels 60 mph for 2.5 hours, how far does it go?"
inputs2 = tokenizer(prompt2, return_tensors="pt").to(model.device)
outputs2 = model.generate(**inputs2, max_new_tokens=100, do_sample=False)
print("Test 2:", tokenizer.decode(outputs2[0], skip_special_tokens=True))

If both answers are correct and coherent, your quantized model is working.
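Reading outputs by eye does not scale to many prompts. One rough way to flag the repetitive-text failure mode automatically is the share of distinct word n-grams in an output. The helper below is a hypothetical heuristic, not part of any library, and any pass/fail threshold would need tuning on your own outputs.

```python
def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams that are unique; 1.0 means no repeated n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0  # too short to judge
    return len(set(ngrams)) / len(ngrams)


# Made-up strings standing in for model outputs
healthy = "Paris, Berlin and Rome are the capitals of France, Germany and Italy."
looping = "the the the the the the the the the the"

print(f"healthy: {distinct_ngram_ratio(healthy):.3f}")
print(f"looping: {distinct_ngram_ratio(looping):.3f}")
```

A ratio close to 1.0 is no proof of correctness. It only rules out degenerate loops, so it complements the manual check instead of replacing it.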

Perplexity Check with lm-evaluation-harness

For a proper benchmark, use EleutherAI's lm-evaluation-harness:

pip install lm-eval

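# Evaluate perplexity on WikiText-2
# load_in_4bit alone uses the bitsandbytes default FP4 type, so request NF4 explicitly
# (assumes your lm-eval and transformers versions forward these keys to the model loader)
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,load_in_4bit=True,bnb_4bit_quant_type=nf4,bnb_4bit_compute_dtype=bfloat16,bnb_4bit_use_double_quant=True \
    --tasks wikitext \
    --device cuda:0 \
    --batch_size 1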

What Numbers to Expect

Model	Precision	WikiText-2 Perplexity	Status
Llama 3 8B	BF16	~6.1	Baseline
Llama 3 8B	4-bit NF4	~6.4	✅ Acceptable
Llama 3 8B	4-bit NF4	> 7.5	❌ Something went wrong

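A perplexity above 7.5 on WikiText-2 for the 8B model suggests a configuration error - wrong compute dtype, tokenizer mismatch, or a corrupted model file. Recheck your BitsAndBytesConfig settings. Treat the table as ballpark figures: lm-evaluation-harness reports word-level and byte-level perplexity for its wikitext task, which is not on the same scale as token-level perplexity, so the most reliable comparison is a BF16 run of the same model with the same command.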
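For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities; the values are hypothetical, chosen so the result is easy to verify by hand, and are not real model output.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-likelihood (natural log) per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# Hypothetical per-token probabilities, not real model output
logprobs = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]

ppl = perplexity(logprobs)
print(f"perplexity: {ppl:.2f}")
```

The geometric mean of these four probabilities is 0.25, so the perplexity is 4: on average the model is as uncertain as if it were choosing among four equally likely tokens. On that scale, moving from ~6.1 to ~6.4 is an increase of about 5%.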

VRAM Requirements at a Glance

This table covers the full picture for Llama 3 memory optimization planning. Numbers are for model weights only - add 1–5 GB for KV cache at typical context lengths.

Model	Precision	VRAM Required	Fits On
Llama 3 8B	BF16	~16 GB	RTX 3090 (24 GB), RTX 4090 (24 GB)
Llama 3 8B	INT8	~8 GB	RTX 3060 (12 GB), RTX 4070 (12 GB)
Llama 3 8B	4-bit	~5 GB	RTX 3060 (12 GB), RTX 4060 (8 GB), most laptops
Llama 3 70B	BF16	~140 GB	4× A100 40 GB or 2× A100 80 GB
Llama 3 70B	INT8	~70 GB	1× A100 80 GB
Llama 3 70B	4-bit	~40 GB	1× A100 80 GB or 2× RTX 4090

Practical notes:

Llama 3 8B at 4-bit is the most accessible setup. It runs on a single RTX 3060 with VRAM to spare for the KV cache.
Llama 3 70B at 4-bit fits on a single A100 80 GB - or two consumer RTX 4090s (48 GB combined). That's a legitimate local setup for serious practitioners. (More on this in our guide to quantization strategies for large models.)
Mac Studio with 64 GB+ unified memory can run the 70B 4-bit model via GGUF/Ollama. Unified memory architecture makes Apple Silicon surprisingly competitive here.
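If you're using the 128k context window (Llama 3.1 and later; the original Llama 3 tops out at 8k), the KV cache grows quickly. With an FP16 cache, the 70B model needs roughly 10 GB at 32k tokens, about 20 GB at 64k, and about 40 GB at the full 128k.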
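KV cache size follows directly from the model shape: two tensors (keys and values) per layer, each holding num_kv_heads × head_dim values per token. A small estimator, assuming an FP16 cache, a single sequence, and the published Llama 3 shapes (the function name is illustrative):

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in decimal GB for one sequence (keys plus values)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / 1e9


# Llama 3 shapes: 8B has 32 layers, 70B has 80; both use 8 KV heads of dim 128
kv_8b_8k = kv_cache_gb(32, 8, 128, 8192)
kv_70b_8k = kv_cache_gb(80, 8, 128, 8192)
kv_70b_128k = kv_cache_gb(80, 8, 128, 131072)

print(f"8B  @ 8k:   {kv_8b_8k:.1f} GB")
print(f"70B @ 8k:   {kv_70b_8k:.1f} GB")
print(f"70B @ 128k: {kv_70b_128k:.1f} GB")
```

This gives about 1.1 GB for the 8B model at 8k tokens and about 43 GB for the 70B model at the full 128k. Serving several sequences at once multiplies the figure by the batch size, and engines that quantize the cache bring it down proportionally.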

Key Takeaways

Summary

4-bit quantization reduces Llama 3 8B VRAM from ~16 GB to ~5 GB - a reduction of roughly 70%.

AWQ delivers the best accuracy at 4-bit: only -1.8% MMLU vs FP16. Use it for GPU production.

bitsandbytes NF4 is the fastest setup path. Four config lines, no calibration. Best for QLoRA fine-tuning.

GGUF Q4_K_M is the standard for CPU inference, Ollama, and local deployments. Never go below Q4_K_M.

GPTQ quantization is a solid fallback for GPU inference but loses 2.9% on MMLU - more than AWQ.

Always use bfloat16 compute dtype with bitsandbytes on Llama 3. Float16 can cause NaN errors.

Frequently Asked Questions

Does 4-bit quantization significantly reduce Llama 3 accuracy?

Not with the right method. AWQ loses only 1.8% on MMLU compared to the FP16 baseline of 65.2%. GGUF Q4_K_M loses 2.1%, and GPTQ loses 2.9%. For most production use cases - chat, coding assistance, summarization - that degradation is imperceptible. Where you'll notice it is in complex multi-step reasoning tasks. If that's your use case, stay at 8-bit or FP16.

What is the difference between GPTQ and AWQ for Llama 3?

Both are post-training quantization methods that require a calibration dataset. The key difference is what they optimize. GPTQ minimizes weight reconstruction error layer by layer. AWQ goes further - it analyzes activation magnitudes to identify the ~1% of weights that matter most, then scales those weights before quantization to protect them. That's why AWQ consistently outperforms GPTQ by 1–3% on benchmarks like MMLU and HumanEval.

Can I run 4-bit Llama 3 on a consumer GPU?

Yes. Llama 3 8B at 4-bit needs only ~5 GB of VRAM, which fits on an RTX 3060 (12 GB), RTX 4060 (8 GB), or even older 8 GB cards. Llama 3 70B at 4-bit needs ~40 GB - achievable with two RTX 4090s (48 GB combined) or a single A100 80 GB. For CPU-only machines, GGUF Q4_K_M via Ollama or llama.cpp works on any modern laptop.

What is NF4 quantization in bitsandbytes?

NF4 (4-bit NormalFloat) is a data type introduced in the QLoRA paper (Dettmers et al., 2023). Unlike standard INT4, NF4 is designed specifically for weights that follow a normal distribution - which is how transformer weights are typically initialized and trained. NF4 places quantization bins at positions that are information-theoretically optimal for normally distributed data, preserving more accuracy than naive integer quantization. It's the recommended bnb_4bit_quant_type for Llama 3.

Is GGUF or bitsandbytes better for local Llama 3 inference?

It depends on your hardware. GGUF (via Ollama or llama.cpp) is better if you're running on CPU, have limited VRAM, or want the simplest possible setup - one command with Ollama. bitsandbytes NF4 is better if you have a CUDA GPU and are working within the Python/transformers ecosystem, especially for fine-tuning with QLoRA. For pure inference on a GPU, AWQ (loaded via the autoawq library) is faster than bitsandbytes.

Can I fine-tune a 4-bit quantized Llama 3 model?

Yes - this is exactly what QLoRA is designed for. You load the base model in 4-bit NF4 with bitsandbytes, freeze those weights, then attach trainable LoRA adapters. Only the LoRA parameters (~1–2% of total weights) get updated during training. This lets you fine-tune Llama 3 8B on a single 16 GB GPU, and Llama 3 70B on a single A100 80 GB. Use the peft library with prepare_model_for_kbit_training() to set this up correctly.
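How large the trainable share ends up depends on the LoRA rank and on which layers get adapters. The sketch below counts adapter parameters for Llama 3 8B from its published layer shapes; the two rank and target-module combinations are illustrative choices, not recommendations.

```python
# Llama 3 8B shapes from the published model config
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of each linear layer inside one transformer block
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank, target_modules):
    """Each adapted layer adds matrices A (rank x in) and B (out x rank)."""
    per_block = sum(
        rank * (n_in + n_out)
        for n_in, n_out in (LINEAR_SHAPES[m] for m in target_modules)
    )
    return per_block * LAYERS


attn_only = lora_param_count(16, ["q_proj", "k_proj", "v_proj", "o_proj"])
all_linear = lora_param_count(64, list(LINEAR_SHAPES))

print(f"rank 16, attention only: {attn_only:,} params")
print(f"rank 64, all linear:     {all_linear:,} params")
```

Against roughly 8 billion base parameters, the rank-16 attention-only setup trains about 0.2% of the weights, while rank 64 across all linear layers reaches about 2%, so the 1–2% range corresponds to the higher-rank end of common configurations.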

How do I check if my quantized model is accurate?

Start with a qualitative sanity check: ask the model to answer a factual question and solve a simple math problem. If those pass, run a perplexity evaluation with lm-evaluation-harness. For Llama 3 8B at 4-bit NF4, expect a WikiText-2 perplexity of around 6.4. The BF16 baseline is ~6.1. If you see perplexity above 7.5, something is wrong - check your compute dtype, verify the model loaded correctly, and confirm you're using the right tokenizer.

Useful Sources

HuggingFace bitsandbytes documentation - Official transformers quantization docs with full BitsAndBytesConfig API reference.
Making LLMs accessible with bitsandbytes, 4-bit quantization and QLoRA - The original HuggingFace blog post introducing NF4 and double quantization.
AutoAWQ GitHub repository - Source code, model compatibility list, and quantization scripts for AWQ.
llama.cpp GitHub repository - The C++ inference engine powering GGUF quantization and CPU inference.
Meta AI: Quantized Lightweight Llama Models - Meta's official blog post on their quantization approach for Llama models.
EleutherAI lm-evaluation-harness - The standard tool for benchmarking quantized models on MMLU, WikiText-2, and dozens of other tasks.

Which method are you trying first? Drop a comment below - whether you're running bitsandbytes NF4 for a QLoRA project, deploying AWQ on vLLM, or just getting Llama 3 running locally with Ollama, we'd love to hear how it goes. And if you hit a specific error, describe your setup and we'll help you debug it.
