Quantization for Edge Devices: LLMs Under 4 GB VRAM

A complete technical guide to running LLMs under 4 GB VRAM using quantization. Covers GGUF, GPTQ, AWQ, Bitsandbytes, real model sizes, benchmarks, and a step-by-step Ollama walkthrough.

Mohammed Kafeel

Machine Learning Researcher

June 19, 2026

18 min read

On this page

TL;DR - In Summary
What Is Quantization? (And Why Does It Matter for Edge Devices?)
How Much VRAM Does an LLM Actually Need? (The Math)
The 4 Main Quantization Methods Compared
Which LLMs Actually Fit in 4 GB VRAM? (2025–2026 Models)
The Best Tools to Run Quantized LLMs on Edge Devices
Step-by-Step: Run a Quantized LLM on 4 GB VRAM Right Now
What About 1-Bit and 2-Bit Quantization? (The Frontier)
Accuracy vs. Compression: How Much Quality Do You Actually Lose?
Key Takeaways
FAQ
Useful Sources

A 7B parameter model in full precision needs 28 GB of VRAM. Your gaming laptop has 4 GB. Quantization closes that gap - and it does it without retraining a single weight.

This guide covers everything you need to run LLMs under 4 GB VRAM in 2025–2026: the math, the methods, the models, and the exact commands.

TL;DR - In Summary

Quantization reduces model weight precision (e.g., FP16 → INT4), cutting weight memory by roughly 75% (about 87% if you start from FP32).
A Mistral 7B model shrinks from 14 GB (FP16) to 4.1 GB at Q4_K_M - just over what a 4 GB card can hold, so it needs Q3_K_M or partial CPU offloading.
GGUF (Q4_K_M) via llama.cpp or Ollama is the only practical format for true edge/CPU inference.
GPTQ and AWQ are GPU-only and require CUDA - they don't run on CPU-only machines, and on low-VRAM GPUs they fail because they can't offload layers to system RAM.
4-bit quantization causes roughly 1–3% accuracy loss on most tasks. You won't notice it in conversation.

What Is Quantization? (And Why Does It Matter for Edge Devices?)

Quantization reduces the numerical precision of a model's weights - from 32-bit or 16-bit floating point down to 8-bit or 4-bit integers. Smaller numbers mean less memory. Less memory means you can run bigger models on smaller hardware.

Think of it like audio compression. A WAV file is lossless but huge. An MP3 at 128 kbps is 10× smaller and sounds nearly identical to most listeners. Quantization does the same thing to model weights - it trades a tiny bit of precision for a massive reduction in size.
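To make the idea concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization applied to one small block of weights. It is an illustration only, not the exact scheme any of the formats below use; the function names and the sample weights are made up for this example.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integers that fit in `bits` bits, plus one scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    qmin = -qmax - 1                        # -8 for 4-bit
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.0]  # made-up values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

Each weight is now stored as a 4-bit code plus a share of the block's scale, and the rounding error per weight is at most half the scale. That small error is the "tiny bit of precision" being traded away.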

Why does this matter for edge devices?

Running a 7B model at full FP32 precision requires 28 GB of VRAM. At FP16, that drops to 14 GB. At INT4 (4-bit), it's roughly 3.5–4.5 GB - suddenly within reach of a consumer laptop GPU. That's the entire value proposition of quantization for edge devices.

Without quantization, local LLM inference is locked to enterprise hardware. With it, you're running capable models on a $300 GPU.

How Much VRAM Does an LLM Actually Need? (The Math)

The rule of thumb: multiply the number of parameters by the bytes per parameter. A 7B model at FP16 (2 bytes/param) needs roughly 14 GB just for weights - before the KV cache, activations, or any overhead.

Here's the breakdown by precision:

Precision	Bytes/Param	7B Model	13B Model	70B Model
FP32	4 bytes	~28 GB	~52 GB	~280 GB
FP16 / BF16	2 bytes	~14 GB	~26 GB	~140 GB
INT8	1 byte	~7 GB	~13 GB	~70 GB
INT4 (Q4_K_M)	0.5 bytes	~3.5–4.5 GB	~6.5–7 GB	~35–40 GB
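The rule of thumb behind this table can be sketched in a few lines of Python. The helper name is hypothetical, and the calculation assumes 1 GB = 1e9 bytes and counts weights only.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    sizes = [weight_memory_gb(p, bits) for p in (7, 13, 70)]
    print(name, sizes)
```

Real Q4_K_M files come out somewhat larger than the pure 4-bit figure because they also store per-block scales and keep some tensors at higher precision, which is why the INT4 row shows a range rather than a single number.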

What "4 GB VRAM" means in practice:

You don't get the full 4 GB for the model. The OS, the inference runtime, and the KV cache (the memory used to store previous tokens during generation) all eat into that budget. Realistically, you have ~3–3.5 GB for model weights.

That means:

A 3B model at Q4_K_M fits comfortably (~1.9–2.5 GB).
A 7B model at Q4_K_M (~4.1 GB) is tight - it may need partial CPU offloading.
A 7B model at Q3_K_M (~3.9 GB) can squeeze in with a short context or a few layers offloaded to CPU, at a small quality cost.
Anything larger than 7B at 4-bit needs more than 4 GB VRAM.

The KV cache also grows with context length. A 4K context adds ~0.5 GB; a 32K context can add 4–8 GB. Keep your context windows short on constrained hardware.
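The growth is easy to estimate. The sketch below assumes an FP16 cache and a Mistral-7B-style shape (32 layers, 8 key/value heads, head dimension 128); those shape numbers are assumptions for illustration and should be read from your model's config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values for every layer and every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GIB = 1024 ** 3
# Assumed Mistral-7B-style shape: 32 layers, 8 KV heads, head dimension 128.
for context_len in (4096, 32768):
    size = kv_cache_bytes(32, 8, 128, context_len)
    print(context_len, size / GIB)
```

Under these assumptions a 4K context costs 0.5 GiB and a 32K context costs 4 GiB. Models without grouped-query attention have more key/value heads, so their cache is correspondingly larger.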

The 4 Main Quantization Methods Compared

There are four methods you'll actually encounter in the wild. Each has a different target hardware, accuracy profile, and toolchain.

GGUF (llama.cpp) - Best for CPU and Edge

GGUF is the only format designed from the ground up for edge inference. It's the native format for llama.cpp and Ollama, and it supports layer offloading - meaning you can split a model between your GPU and CPU RAM when VRAM runs out.
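As a rough sketch of how layer offloading plays out, the helper below estimates how many layers fit on the GPU. It assumes every layer is the same size, which is a simplification; the function name and the example numbers are hypothetical.

```python
import math


def gpu_layers_that_fit(model_gb, n_layers, vram_budget_gb):
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, math.floor(vram_budget_gb / per_layer_gb))


# Hypothetical case: a 4.1 GB, 32-layer model with 3.5 GB of usable VRAM.
print(gpu_layers_that_fit(4.1, 32, 3.5))
```

In llama.cpp the resulting count is what you would pass to `--n-gpu-layers`. Treat the estimate as a starting point: embeddings, the output layer, and the KV cache also take VRAM, so the real number is usually a little lower.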

GGUF uses K-quant variants (Q4_K_M, Q3_K_M, Q5_K_M, etc.) that apply different quantization strategies to different layers, preserving accuracy where it matters most. The naming follows the pattern Q[bits]_K_[size]: Q4 = 4-bit, K = the K-quant scheme (weights grouped into blocks, each with its own scale), and M = the medium mix, where S/M/L controls how many sensitive tensors are kept at higher precision.

Real numbers: Q4_K_M reduces a 7B model from ~14 GB (FP16) to 4.1 GB with 1–3% accuracy loss. Q3_K_M brings it to ~3.9 GB - the sweet spot for strict 4 GB limits. (For a full walkthrough of quantizing models for edge deployment, follow our Llama 3 guide.)

GGUF runs natively on ARM, Apple Silicon, AMD, Intel, and NVIDIA. Of the four formats covered here, it's the only one that runs practically on a Raspberry Pi.

GPTQ - Best for GPU Accuracy

GPTQ (a post-training quantization method named for GPT-style models) is a 4-bit GPU format that achieves near-FP16 accuracy. It uses approximate second-order (Hessian) information to minimize quantization error layer by layer, which is why it's so accurate - and why it takes 4+ GPU-hours to quantize a 175B model.

On Llama2-7B, the FP16 baseline perplexity (a measure of how well the model predicts text - lower is better) is 5.47. GPTQ at 4-bit stays within <1% of that. At 2-bit (GPTQ W2g64), perplexity degrades to 21.00 - essentially unusable.

The catch: GPTQ requires CUDA. It won't run on CPU-only hardware, and it can't offload layers between CPU and GPU the way GGUF can. If your VRAM is smaller than the model, it crashes.

AWQ - Best for GPU Speed

AWQ (Activation-aware Weight Quantization) is the fastest 4-bit format for GPU inference. Unlike GPTQ, which treats all weights equally, AWQ identifies the ~1% of "salient" weights that drive large activations and protects them. The result: slightly better accuracy than GPTQ at the same bit-width, and much faster quantization (10–30 minutes vs. hours).

With the Marlin kernel on an A100/H100, AWQ hits 741 tokens/second - roughly 3× faster than GPTQ inference. For a 70B model, AWQ cuts VRAM from 140 GB to ~35 GB.

AWQ is the go-to for production GPU serving via vLLM or SGLang. It's not designed for edge devices. (Scaling that across a cluster? Here's edge quantization in Kubernetes.)

Bitsandbytes - Best for Flexibility and Fine-Tuning

Bitsandbytes is the most flexible quantization library - it quantizes on the fly, no pre-quantized files needed. Load any HuggingFace model in 8-bit or 4-bit with a single argument.

Its killer feature is NF4 (Normal Float 4), the format used by QLoRA for fine-tuning quantized models. If you want to fine-tune a 7B model on a single GPU, Bitsandbytes is how you do it.
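Here is a hedged sketch of what on-the-fly NF4 loading looks like with Hugging Face Transformers. It assumes `transformers`, `bitsandbytes`, and `accelerate` are installed and a CUDA GPU is available; the model ID is only an example chosen for illustration, and the wrapper function name is hypothetical.

```python
def load_nf4_model(model_id="mistralai/Mistral-7B-Instruct-v0.2"):
    """Load a Hugging Face causal LM with on-the-fly 4-bit NF4 quantization."""
    # Imported lazily so this file can be imported without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # use load_in_8bit=True for 8-bit instead
        bnb_4bit_quant_type="nf4",              # the NF4 format used by QLoRA
        bnb_4bit_compute_dtype=torch.bfloat16,  # matrix multiplies run in BF16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",                      # let accelerate place the layers
    )
    return tokenizer, model
```

Calling `load_nf4_model()` downloads the full-precision checkpoint and quantizes it while loading, so the first run needs the disk space and bandwidth for the original weights.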

At 8-bit (W8A8), it delivers a 1.5× speedup on NVIDIA Tensor Cores with negligible accuracy loss. At 4-bit, accuracy is more variable than GPTQ or AWQ - it's better suited for experimentation than production deployment.

Comparison Table

Method	Best For	Bit-widths	Inference Speed	Accuracy Loss	Edge/CPU?
GGUF	CPU, edge, Apple Silicon	Q2–Q8	High on CPU	1–3% (Q4_K_M)	✅ Yes
GPTQ	GPU accuracy, large models	2-bit, 4-bit	High (with kernels)	<1% at 4-bit	❌ No
AWQ	GPU speed, production	4-bit, 8-bit	Highest (741 tok/s)	<2% at 4-bit	❌ No
Bitsandbytes	Fine-tuning, flexibility	4-bit, 8-bit	Good (W8A8)	Moderate at 4-bit	❌ No

Which LLMs Actually Fit in 4 GB VRAM? (2025–2026 Models)

Short answer: 3B models and smaller fit easily. 7B models need Q3 or aggressive offloading.

Here's what the data shows from the AscentCore Small LLM Benchmark (April 2026), which tested 22 quantized models via Ollama:

Model	Params	Format	VRAM Required	Speed (tok/s)	Best Use Case
Llama 3.2 3B	3B	Q4_K_M	~2.0 GB	98.7	General chat, narrative
Phi-3 Mini	3.8B	Q4_K_M	~3.5 GB	69.3	Reasoning, instruction-following
Qwen 2.5 1.5B	1.5B	Q4_K_M	~0.9 GB	167.5	Structured output, JSON, multilingual
Qwen 2.5 3B	3B	Q4_K_M	~2.0 GB	~120–140	Balanced quality and speed
Gemma 2 2B	2.6B	Q3_K_M	~1.4 GB	~140–160	Fast inference, edge deployment
Mistral 7B	7B	Q4_K_M	~4.1 GB	49.0	Highest text quality in class

Notes on each:

Qwen 2.5 1.5B is the standout for sub-3B models. At Q8_0, it hits 95.7% JSON parse rate and ROUGE-L of 0.421 - competitive with some 7B models. It is also the fastest model in the table at 167.5 tok/s.
Mistral 7B Q4_K_M leads on raw text quality (ROUGE-L 0.496, factual consistency 0.762) but at 4.1 GB it's tight for a 4 GB card. Use Q3_K_M (~3.9 GB) if you're hitting OOM errors.
Llama 3.2 3B is fast and good at text, but struggles with structured JSON output (47.8% parse rate). Don't use it for JSON pipelines.
Phi-3 Mini has the highest repetition rate (0.052) of any model in the benchmark - 5–50× higher than competitors. Avoid it for production text generation.
Gemma 2 2B at Q3_K_M fits in ~1.4 GB and runs at 140–160 tok/s. Excellent for ultra-constrained hardware.
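If you want to filter this list against your own memory budget, a small sketch like the following does the job. The VRAM figures are copied from the table above; the function name and the budgets are illustrative assumptions.

```python
# (model, format, approximate VRAM in GB) copied from the table above.
MODELS = [
    ("Llama 3.2 3B", "Q4_K_M", 2.0),
    ("Phi-3 Mini", "Q4_K_M", 3.5),
    ("Qwen 2.5 1.5B", "Q4_K_M", 0.9),
    ("Qwen 2.5 3B", "Q4_K_M", 2.0),
    ("Gemma 2 2B", "Q3_K_M", 1.4),
    ("Mistral 7B", "Q4_K_M", 4.1),
]


def models_that_fit(budget_gb):
    """Return the names of models whose VRAM requirement fits the budget."""
    return [name for name, _fmt, vram_gb in MODELS if vram_gb <= budget_gb]


print(models_that_fit(3.0))
print(models_that_fit(3.5))
```

With the realistic ~3–3.5 GB budget discussed earlier, Mistral 7B at Q4_K_M drops out and Phi-3 Mini only fits at the upper end.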

The Best Tools to Run Quantized LLMs on Edge Devices

Ollama

Ollama is the easiest way to run quantized LLMs locally. One command installs it, one command pulls a model, one command starts a chat. It wraps llama.cpp under the hood and automatically serves an OpenAI-compatible API on localhost:11434. (For how Ollama's GGUF quantization for edge stacks up against vLLM and TGI, see our serving comparison.)

The tradeoff: Ollama adds ~20–30% overhead compared to running llama.cpp directly. For interactive use, you won't notice. For high-throughput batch processing, you will.

Best for: Developers who want a working local LLM in under 5 minutes.

llama.cpp

llama.cpp is the engine that powers most edge LLM inference. Written in C/C++, it runs on virtually any hardware - ARM, x86, Apple Metal, NVIDIA, AMD, Vulkan. It supports GGUF natively, handles CPU/GPU layer splitting automatically, and has hit 100,000+ GitHub stars.

It's faster than Ollama (no wrapper overhead) and gives you full control over quantization parameters. The tradeoff is a steeper setup - you need to compile it or find pre-built binaries, and serving requires a separate llama-server binary.

Best for: Performance-focused engineers, embedded systems, and anyone who needs fine-grained control.

AutoGPTQ

AutoGPTQ is the standard Python library for quantizing and serving GPTQ models. It handles the Hessian-based quantization process and integrates with the ExLlamaV2 backend for fast inference. On NVIDIA GPUs, ExLlamaV2 generates ~64 tok/s vs. ~52 tok/s for ExLlamaV1 - a 23% speedup.

Best for: NVIDIA GPU users who need maximum accuracy at 4-bit and are comfortable with Python environments.

AutoAWQ

AutoAWQ quantizes models to AWQ format in 10–30 minutes (vs. hours for GPTQ). The resulting models are slightly smaller and slightly more accurate than GPTQ at the same bit-width. Combined with the ExLlamaV2 backend or vLLM, AWQ models hit the highest GPU inference throughput available - up to 741 tok/s on A100/H100 with the Marlin kernel.

Best for: Production GPU deployments where VRAM is limited and speed matters.

ExLlamaV2

ExLlamaV2 is the inference backend, not a standalone tool. It's the engine that makes GPTQ and AWQ models run fast on NVIDIA GPUs. When you load a GPTQ or AWQ model via AutoGPTQ or AutoAWQ, ExLlamaV2 is doing the actual computation.

It's ~2.2× faster than llama.cpp on GPU-only benchmarks, but it has no CPU fallback and no layer offloading. If the model doesn't fit in VRAM, it fails.

Best for: Maximum GPU throughput on NVIDIA hardware.

Which Tool Should You Pick?

Your Situation	Use This
Quick local setup, any hardware	Ollama
CPU-only, ARM, Apple Silicon, Raspberry Pi	llama.cpp
NVIDIA GPU, max accuracy, Python workflow	AutoGPTQ + ExLlamaV2
NVIDIA GPU, limited VRAM, production serving	AutoAWQ + vLLM
Fine-tuning a quantized model (QLoRA)	Bitsandbytes

Step-by-Step: Run a Quantized LLM on 4 GB VRAM Right Now

This walkthrough uses Ollama and a Q4_K_M model - the fastest path to a working local LLM.

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from https://ollama.com

Step 2: Pull a Q4_K_M model

For a 4 GB VRAM card, pull a model with 3B parameters or fewer. Qwen 2.5 1.5B is the best quality-per-MB option:

ollama pull qwen2.5:1.5b

Or for the highest text quality that still fits in 4 GB (tight):

ollama pull mistral:7b-instruct-q3_K_M

Step 3: Verify the model loaded correctly

ollama list

Expected output:

NAME                          ID              SIZE    MODIFIED
qwen2.5:1.5b                  abc123def456    940 MB  2 minutes ago

Step 4: Start a chat session

ollama run qwen2.5:1.5b

You'll see the prompt >>>. Type your message and hit Enter.

Step 5: Check VRAM usage (NVIDIA)

In a separate terminal while the model is running:

nvidia-smi

Expected output for Qwen 2.5 1.5B Q4_K_M:

| GPU  Name        | Memory-Usage |
| 0    RTX 3050    | 1200MiB / 4096MiB |
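If you would rather read the numbers from a script, nvidia-smi can emit plain CSV. The sketch below is one way to parse it; the function names are hypothetical, and `gpu_memory_mib` only works on a machine with an NVIDIA driver installed.

```python
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into two integers."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def gpu_memory_mib():
    """Query nvidia-smi for used and total memory on the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_line(out.splitlines()[0])


used, total = parse_memory_line("1200, 4096")  # sample line matching the output above
print(f"{used} MiB used, {total - used} MiB free")
```

The free figure is what remains for the KV cache, so it is worth checking again after a long conversation.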

Step 6: Use the API instead of the chat interface

Ollama automatically serves an OpenAI-compatible API:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:1.5b",
    "messages": [{"role": "user", "content": "Explain quantization in one sentence."}]
  }'
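The same request can be made from Python using only the standard library. This is a minimal sketch that assumes Ollama is running locally on the default port; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_request(prompt, model="qwen2.5:1.5b"):
    """Build the same POST request as the curl command."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="qwen2.5:1.5b"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, model)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# With the Ollama server running:
# print(chat("Explain quantization in one sentence."))
```

Because the endpoint follows the OpenAI chat format, most OpenAI-compatible client libraries should also work if you point their base URL at `http://localhost:11434/v1`.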

Step 7: Swap to a 7B model if you have headroom

ollama pull mistral:7b-instruct-q4_K_M
ollama run mistral:7b-instruct-q4_K_M

Expected performance on a 4 GB GPU (RTX 3050):

Model	VRAM Used	Tokens/sec
Qwen 2.5 1.5B Q4_K_M	~1.2 GB	~167 tok/s
Llama 3.2 3B Q4_K_M	~2.0 GB	~99 tok/s
Mistral 7B Q3_K_M	~3.9 GB	~25–35 tok/s

Tip: If you get an OOM error with Mistral 7B Q4_K_M (~4.1 GB), switch to Q3_K_M (~3.9 GB). The quality difference is minimal - roughly 2–4% perplexity increase.

What About 1-Bit and 2-Bit Quantization? (The Frontier)

1-bit and 2-bit quantization are real, working, and not yet ready for everyday use. Here's where things stand. (For the full breakdown of extreme bit-width tradeoffs, see our 2-bit vs 4-bit vs 8-bit deep dive.)

BitNet b1.58

BitNet b1.58 (Wang et al., JMLR 2025) is a Transformer architecture trained from scratch with ternary weights: every weight is -1, 0, or +1. That's 1.58 bits of information per parameter.
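The "1.58" is simply log2(3), the information content of a three-valued weight. The sketch below shows that figure alongside a simplified version of absmean ternarization (scale by the mean absolute weight, round, clip to -1, 0, +1). It works on a flat list of made-up weights for illustration; BitNet applies this inside training rather than as a one-off conversion.

```python
import math


def ternarize(weights, eps=1e-8):
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return codes, gamma


codes, gamma = ternarize([0.9, -0.8, 0.05, 0.0])  # made-up weights
print(codes, gamma)
print(math.log2(3))  # bits of information in one ternary weight
```

Large weights collapse to +1 or -1 and small ones to 0, which is why matrix multiplication reduces to additions and subtractions.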

The results are striking. BitNet b1.58 matches FP16 Transformer performance at the same model size and training token count, while consuming 55–82% less energy per token and using 3× less memory. A 13B BitNet b1.58 model is more energy-efficient than a 3B FP16 model.

On a Raspberry Pi 5, a 3B BitNet model generates 11 tokens/second. On a Surface Laptop 7 (Snapdragon X Elite), it hits 48 tokens/second - 4–5× faster than llama.cpp on the same hardware.

The catch: BitNet requires training from scratch. You can't take an existing Llama or Mistral model and BitNet-ize it. Native hardware support (ternary multiply-accumulate units) doesn't exist in consumer silicon yet. BitNet is the future of edge inference - just not today's solution.

NanoQuant

NanoQuant (Chong et al., 2026, arXiv:2602.06694) is the first post-training quantization method to achieve sub-1-bit compression. It formulates quantization as a low-rank binary factorization problem, compressing weights to 1-bit, 0.8-bit, and even 0.55-bit.

The headline result: NanoQuant compresses Llama2-70B from 138 GB to 5.35 GB - a 25.8× reduction - using a single H100 GPU in 13 hours. The resulting model runs on a consumer 8 GB GPU at 20.11 tokens/second.

At 1-bit, NanoQuant achieves a perplexity of 10.34 on Llama2-7B (vs. 5.47 for FP16 baseline). That's a meaningful gap, but it's functional - far better than naive 1-bit methods that produce perplexity in the tens of thousands.

Status: Research prototype. No production toolchain yet. Watch this space.

Accuracy vs. Compression: How Much Quality Do You Actually Lose?

At 4-bit (Q4_K_M), you lose 1–3% accuracy. In practice, you won't notice it in conversation.

Here's what the numbers actually mean:

Precision	Perplexity (Llama2-7B)	vs. FP16 Baseline	Practical Impact
FP16 (baseline)	5.47	-	Full quality
INT8	~5.52	<1%	Negligible
Q4_K_M (GGUF)	~5.63–5.75	~3–5%	Barely noticeable
Q3_K_M (GGUF)	~5.90–6.10	~8–12%	Slight degradation on complex tasks
Q2_K (GGUF)	~7.50+	~37%+	Noticeable quality drop
GPTQ W2g64	21.00	~284%	Essentially broken

Perplexity measures how surprised the model is by the next token - lower is better. A perplexity of 5.47 vs. 5.63 is a difference you'd struggle to detect in a conversation.
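For readers who want to check figures like these themselves, here is a short sketch of how perplexity is computed from per-token log-probabilities and how a relative increase is derived. The function names are hypothetical; the 5.47, 5.63, and 21.00 values come from the text above.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_increase(baseline, quantized):
    """Percent increase in perplexity relative to the baseline."""
    return (quantized - baseline) / baseline * 100


# A model that assigns every token probability 0.25 has a perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
print(round(relative_increase(5.47, 5.63), 1))
print(round(relative_increase(5.47, 21.00)))
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens, which gives an intuitive scale for reading the differences between rows.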

What you will notice at Q3 and below:

Slightly more hallucinations on factual questions
Weaker performance on complex multi-step reasoning
Occasional grammatical awkwardness in long outputs

What you won't notice:

General chat and Q&A quality
Code generation for common tasks
Summarization and extraction

The AscentCore benchmark (April 2026) confirmed this: Q4_K_M vs. Q8_0 ROUGE-L delta is only +0.013 for Mistral 7B - essentially zero. For speed-sensitive applications, Q4_K_M delivers 40–60% higher tokens/second with minimal quality cost.

The practical rule: Use Q4_K_M as your default. Drop to Q3_K_M only if you're hitting VRAM limits. Never use Q2_K for anything you'd show to a user.

Key Takeaways

The 4 GB VRAM Playbook - 2025–2026

Format: Use GGUF (Q4_K_M) for any edge or CPU-based inference. It's the only format with layer offloading.
Model size: 3B models are the sweet spot for 4 GB VRAM. 7B models need Q3_K_M or partial CPU offloading.
Best small model: Qwen 2.5 1.5B Q4_K_M - 167 tok/s, fits in ~1.2 GB VRAM (95.7% JSON parse rate measured at Q8_0).
Best quality in 4 GB: Mistral 7B Q3_K_M - highest text quality, ~3.9 GB, ~25–35 tok/s on a 4 GB GPU.
Tool: Ollama for ease of use; llama.cpp for raw performance and control.
Accuracy loss: Q4_K_M = ~1–3% on most tasks (perplexity rises ~3–5% on Llama2-7B). Unnoticeable in practice.
Frontier: BitNet b1.58 (JMLR 2025) and NanoQuant (2026) are pushing toward 1-bit inference - not production-ready yet, but coming fast.
Avoid: GPTQ and AWQ on edge hardware - they require CUDA and crash if the model doesn't fit entirely in VRAM.

FAQ

What is quantization for edge devices?

Quantization for edge devices is the process of reducing the numerical precision of an LLM's weights (e.g., from 16-bit to 4-bit) so the model fits within the limited memory of consumer hardware like laptops, mini-PCs, and single-board computers. It's the primary technique that makes local LLM inference possible on hardware with 4–8 GB of RAM or VRAM.

Can I run a 7B LLM on 4 GB VRAM?

Yes, but only with aggressive quantization. A Mistral 7B model at Q4_K_M requires ~4.1 GB - just over the limit. For more reliable operation, use Q3_K_M (~3.9 GB) and keep the context short or offload a few layers to CPU, since the KV cache needs room too. You'll see ~25–35 tokens/second on a 4 GB GPU like an RTX 3050. For a more comfortable experience, a 3B model at Q4_K_M is the better choice.

What is the best quantization format for CPU-only inference?

GGUF (specifically Q4_K_M or Q5_K_M) via llama.cpp or Ollama. It's the only format with native CPU kernels optimized for ARM and x86, and the only one that supports splitting model layers between CPU RAM and GPU VRAM. AWQ and GPTQ require CUDA and have no viable CPU inference path.

What is the difference between Q4_K_M and Q4_0?

Both are 4-bit GGUF formats. Q4_0 is the older scheme: weights are split into blocks of 32, and each block gets a single scaling factor. Q4_K_M uses the newer K-quant scheme, which groups blocks into larger super-blocks, stores a scale and a minimum per block, and keeps some of the most sensitive tensors at higher precision (the "M" mix). Q4_K_M is the recommended default - it's slightly larger than Q4_0 but noticeably more accurate.

How much accuracy do I lose with 4-bit quantization?

Roughly 1–3% on most task benchmarks; perplexity moves a little more. On Llama2-7B, FP16 baseline perplexity is 5.47; Q4_K_M brings it to ~5.63–5.75, about 3–5% higher. In practice, this difference is imperceptible in conversation, summarization, and most coding tasks. Degradation starts to show at Q3 on complex tasks and becomes obvious at Q2 or below, where perplexity can jump to 7.5+ and hallucinations increase.

Is BitNet b1.58 ready for production use?

Not yet. BitNet b1.58 (JMLR 2025) requires training from scratch with ternary weights - you can't apply it to existing models like Llama or Mistral. Native hardware support for ternary operations doesn't exist in consumer silicon. That said, early results on ARM hardware (11 tok/s on Raspberry Pi 5, 48 tok/s on Snapdragon X Elite) show it's a credible future path for ultra-low-power edge inference.

What's the fastest model I can run on 4 GB VRAM?

Qwen 2.5 1.5B at Q4_K_M - it runs at ~167 tokens/second and uses only ~1.2 GB VRAM, leaving plenty of headroom for the KV cache. For maximum throughput on tiny hardware, Llama 3.2 1B Q4_K_M hits 226 tokens/second (AscentCore benchmark, April 2026) but has weaker JSON output reliability.

Useful Sources

Have questions about your specific hardware setup? Drop them in the comments - we read every one.
