Run 70B Models on a Single RTX 4090 With 4-Bit Quantization

A single RTX 4090 can run 70B parameter models with 4-bit quantization. Here's the exact VRAM math, benchmark numbers, and four complete setup methods - llama.cpp, Ollama, AutoAWQ, and ExLlamaV2.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026

13 min read

On this page

The VRAM Math: Why 70B Is Now Possible
Quantization Types Decoded
Benchmark: Real Tokens/Sec on a Single RTX 4090
Method 01 - llama.cpp (Recommended for Most Users)
Method 02 - Ollama (Easiest, One Command)
Method 03 - AutoAWQ (Python/HuggingFace Ecosystem)
Method 04 - ExLlamaV2 (Fastest Inference)
Quality Tradeoffs: What You Actually Lose
Hard Limits You Need to Know
Key Takeaways
FAQ
Useful Sources

A $2,000 gaming GPU can run the same class of model that required a $30,000 A100 two years ago. That's what 4-bit quantization does. A 70B model that needs 140 GB at full precision fits - with the right approach - on a single RTX 4090 with 24 GB of VRAM.

Here's exactly how to do it.

TL;DR

FP16 70B = 140 GB VRAM. Impossible on a single 4090.

Q4_K_M (4-bit) = ~38–42 GB total. Needs CPU offloading on a single 4090.

IQ2_XS (2-bit) = ~21–24 GB. Fits entirely in VRAM. Fastest on a single card.

Best speed: ExLlamaV2 at 50–65 tok/s. Easiest setup: Ollama in one command (but expect 2–4 tok/s with Q4_K_M due to CPU bottleneck).

Recommended method for most users: llama.cpp with --n-gpu-layers 45 and Q4_K_M.

The VRAM Math: Why 70B Is Now Possible

4-bit quantization cuts a 70B model's memory footprint by ~75%. That's the entire story, compressed.

Here's the full breakdown:

Precision	VRAM Required	Single RTX 4090 (24 GB)
FP16 (full precision)	~140 GB	❌ Impossible
INT8 (8-bit)	~70–75 GB	❌ Still impossible
Q3_K_M (3-bit)	~28–30 GB	⚠️ Partial offloading
Q4_K_M (4-bit)	~38–42 GB	⚠️ CPU offloading needed
IQ2_XS (2-bit)	~21–24 GB	✅ Fits in VRAM

The math: 70 billion parameters × 0.5 bytes per parameter (4-bit) = 35 GB in weights alone. Add the KV cache and CUDA runtime overhead and you're at 38–42 GB total.

That's more than 24 GB. So you need CPU offloading - the GPU handles as many layers as fit in VRAM, and the rest run on system RAM.

System RAM requirement: 64 GB DDR5 is the practical minimum for Q4_K_M with a single 4090. Faster RAM (DDR5-6000+) directly improves inference speed for offloaded layers.

Quantization Types Decoded

Not all 4-bit quantization is equal. The format you pick determines both quality and speed.

GGUF formats (used by llama.cpp and Ollama):

Q4_K_M - 4-bit, K-quant method, Medium size/quality balance. Selectively raises sensitive layers (attention projections) to 5–6 bits. The standard recommendation. Perplexity: ~6.97 vs. FP16's ~6.76 - a ~3% increase.
Q4_0 - Older, uniform 4-bit. Lower quality than Q4_K_M at similar size. Avoid it unless you have a specific reason.
Q5_K_M - 5-bit, slightly better quality than Q4_K_M, slightly more VRAM. Good if you have headroom.
IQ2_XS - 2-bit importance-aware quantization. Smallest size (~21 GB for 70B), fits in a single 4090's VRAM. Noticeable quality drop vs. Q4_K_M.

Framework-specific formats:

AWQ (Activation-aware Weight Quantization) - Identifies and protects the ~1% of "salient" weights most critical to output quality. Consistently outperforms GPTQ in accuracy benchmarks. Scores ~1–3% higher on MMLU at the same 4-bit width.
GPTQ (GPU-optimized Post-Training Quantization) - Layer-wise Hessian-based compression. Mature, widely supported. Slightly slower than AWQ on consumer GPUs.
EXL2 (ExLlamaV2 format) - Optimized for NVIDIA GPU inference with FlashAttention. Fastest raw throughput of any method.

Quick rule: If you're on a single 4090 and want simplicity, use GGUF Q4_K_M. If you have a GPU with 40+ GB VRAM and want max speed, use AWQ or EXL2. (For a deeper quantization format selection for RTX 4090, compare GGUF, AWQ, and GPTQ head-to-head.)

Benchmark: Real Tokens/Sec on a Single RTX 4090

ExLlamaV2 is the fastest. Ollama with Q4_K_M is the slowest. Here's the full picture:

Method	Quantization	Tokens/sec	Notes
ExLlamaV2	EXL2 4.0bpw	50–65	Fastest. Requires EXL2 loader.
AutoAWQ	AWQ INT4	30–40	Strong quality + speed balance.
llama.cpp	GGUF Q4_K_M	25–35	Best for CPU/GPU hybrid.
GPTQ	4-bit	15–25	Slower than AWQ on consumer GPUs.
Ollama	IQ2_XS (2-bit)	15–25	Fits in VRAM, acceptable speed.
Ollama	Q4_K_M	2.4–4	CPU bottleneck. Barely usable.

The Ollama Q4_K_M number isn't a typo. When the model can't fit in VRAM and most layers run on CPU, PCIe bandwidth becomes the bottleneck. 2–4 tok/s is the real-world result.

The IQ2_XS exception: At 2-bit, the 70B model fits entirely in 24 GB VRAM. No CPU offloading. That's why it hits 15–25 tok/s through Ollama - the GPU runs everything. (For a broader look at serving frameworks for quantized 70B, compare vLLM, Ollama, and TGI.)

Method 01 - llama.cpp (Recommended for Most Users)

llama.cpp is the right choice for most people. It handles CPU/GPU hybrid inference natively, supports every GGUF quantization format, and exposes an OpenAI-compatible API server.

Build with CUDA for RTX 4090

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -S . -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89"
cmake --build build -j$(nproc)

The 89 CUDA architecture targets Ada Lovelace - the RTX 4090's architecture. Don't skip that flag.

Download the Model

huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
  Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

This pulls the ~42 GB Q4_K_M file from Hugging Face. bartowski maintains well-quantized GGUF builds for current models.

Run the Server

./build/bin/llama-server \
  -m models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 45 \
  --ctx-size 8192 \
  --flash-attn \
  --host 0.0.0.0 \
  --port 8080

Key flags:

--n-gpu-layers 45 - Offloads ~45 transformer layers to the GPU, consuming ~22 GB VRAM. Lower to 40 or 35 if you hit OOM.
--flash-attn - Enables Flash Attention. Measurable speed improvement on RTX 4090.
--ctx-size 8192 - 8K context is the practical limit on a single 4090 with 70B. Push to 4096 if you're tight on VRAM.

The server exposes an OpenAI-compatible API at http://localhost:8080/v1/chat/completions. Drop it into any tool that accepts an OpenAI endpoint.

Method 02 - Ollama (Easiest, One Command)

Ollama gets you running in under 5 minutes. It auto-downloads a quantized model and handles GPU/CPU splitting automatically.

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.3:70b

That's it. Ollama downloads Q4_K_M by default (~42 GB), splits the load between your GPU and CPU RAM, and opens an interactive chat.

The honest tradeoff: Ollama's hybrid Q4_K_M mode on a single 4090 delivers 2–4 tok/s. That's slower than human reading speed. It works, but it's not fast.

For better speed with Ollama, use the 2-bit variant:

ollama run llama3.3:70b:iq2_xs

This fits in VRAM and hits 15–25 tok/s. Quality is lower, but the experience is usable.

Ollama is the right choice for quick experiments, demos, or if you just want to verify the model works before committing to a more complex setup.

Method 03 - AutoAWQ (Python/HuggingFace Ecosystem)

AutoAWQ is the best choice if you're already in the HuggingFace/Python ecosystem. It delivers 30–40 tok/s and retains near-full quality through activation-aware weight protection.

Install

pip install autoawq transformers accelerate

Load a Pre-Quantized AWQ Model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    fuse_layers=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
)

fuse_layers=True enables kernel fusion for faster inference. The hugging-quants namespace on Hugging Face maintains official AWQ-INT4 builds of major models.

Important: AWQ models require ~38 GB of VRAM to run without offloading. On a single 4090 (24 GB), you'll need device_map="auto" to split across GPU and CPU. This reduces speed compared to a 40+ GB GPU setup.

AWQ quantization protects the ~1% of weights most critical to output quality. The result: MMLU scores within 1–3% of the FP16 baseline, versus 2–5% degradation with GPTQ.

Method 04 - ExLlamaV2 (Fastest Inference)

ExLlamaV2 is the fastest inference engine for 70B on NVIDIA hardware. It hits 50–65 tok/s - roughly 2x faster than llama.cpp - by using optimized FlashAttention kernels and a dedicated EXL2 quantization format.

Install

pip install exllamav2

Download an EXL2 Model

huggingface-cli download turboderp/Llama-3-70B-Instruct-exl2 \
  --revision 4.0bpw \
  --local-dir ./models/llama-3-70b-exl2-4.0bpw

Load and Run

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "./models/llama-3-70b-exl2-4.0bpw"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=8192, lazy=False)
model.load_autosplit(cache)

ExLlamaV2 automatically splits layers across available VRAM and CPU RAM. On a single 4090, it offloads ~35–40 layers to GPU and the rest to system RAM.

The catch: ExLlamaV2 is GPU-only. No CPU inference. If your model doesn't fit in VRAM at all, you need llama.cpp. But for maximum tokens-per-second on a single 4090, nothing beats it.

Quality Tradeoffs: What You Actually Lose

Q4_K_M retains ~97–99% of full-precision quality. For most use cases, the difference is imperceptible.

Here's the actual perplexity data on WikiText-2 for Llama 3 70B:

FP16: 6.7647
Q4_K_M: 6.9674 (+~3% increase)

That 3% perplexity increase translates to essentially nothing in everyday chat, summarization, and code generation. The 70B model's large parameter count makes it unusually robust to quantization - the 8B model suffers far more at the same bit width.

The quality ladder:

Q5_K_M - <1% degradation vs. FP16. Near-lossless. Needs ~47 GB total.
Q4_K_M - ~3% degradation. The sweet spot. Needs ~38–42 GB.
Q3_K_M - Noticeable degradation. Partial offloading on a single 4090.
IQ2_XS - Visible quality drop. Complex reasoning and code generation suffer. But it fits in 24 GB.

The 4-bit floor: Research consistently shows a "cliff" between 4-bit and 3-bit. Below 4 bits, reasoning quality, instruction following, and multilingual performance degrade significantly. Don't go below Q4 for production use. (For the data behind sub-4-bit quantization for 70B models, see our bit-width deep dive.)

AWQ vs. GPTQ quality: AWQ scores ~1–3% higher on MMLU and HumanEval at the same 4-bit width, because it identifies and protects the small fraction of weights most critical to output quality. If quality matters more than setup simplicity, use AWQ.

Hard Limits You Need to Know

Running 70B models on a single RTX 4090 works. It's not magic, though. These are the real constraints:

01 - Context window is tight. With Q4_K_M on a single 4090, practical context is 4K–8K tokens. The KV cache alone consumes 1.25–2 GB at 4K context. Push beyond 8K and you'll hit OOM.

02 - CPU offloading creates a speed ceiling. Ollama's hybrid Q4_K_M mode maxes out at 2–4 tok/s because PCIe bandwidth (not GPU compute) is the bottleneck. If you need faster Q4_K_M inference, you need more VRAM - a second 4090, or a 48+ GB card.

03 - Quantization is inference-only. You can't fine-tune a quantized model. Training and fine-tuning still require full FP16/BF16 precision. A 70B fine-tune needs ~140+ GB VRAM. That's multi-GPU A100/H100 territory.

04 - Not suitable for production serving. A single 4090 with CPU offloading collapses under concurrent users. At 5+ simultaneous requests, latency becomes unacceptable. For production multi-user serving, use vLLM on a 40+ GB GPU or a cloud endpoint. (Weighing local hardware against the cloud? See the self-hosting economics with quantization.)

05 - System RAM matters. For Q4_K_M with CPU offloading, you need at least 48 GB of system RAM (64 GB recommended). DDR5-6000+ makes a measurable difference in offloaded layer speed.

Key Takeaways

01 - Pick your quantization based on what fits. IQ2_XS fits in 24 GB VRAM and gives 15–25 tok/s. Q4_K_M needs CPU offloading but delivers better quality. Know the tradeoff before you start.

02 - Pick your framework based on your goal. Fastest: ExLlamaV2 (50–65 tok/s). Most flexible: llama.cpp. Easiest: Ollama. Python-native: AutoAWQ.

03 - Q4_K_M quality loss is real but small. ~3% perplexity increase. Imperceptible in chat. Noticeable in complex multi-step reasoning. Use Q5_K_M if you have the VRAM headroom.

04 - CPU offloading is the bottleneck, not the GPU. The RTX 4090's compute isn't the limiting factor. PCIe bandwidth between GPU and CPU RAM is. Minimize offloaded layers to maximize speed.

05 - This is for inference only. Fine-tuning a 70B model still needs full precision and enterprise hardware. Quantization solves the inference problem, not the training problem.

FAQ

Can you actually run a 70B model on a single RTX 4090?

Yes - with caveats. The RTX 4090 has 24 GB of VRAM. A 70B model at Q4_K_M needs ~38–42 GB total. You run it via CPU offloading: the GPU handles ~45 layers (~22 GB), and the rest runs on system RAM. Speed is 25–35 tok/s with llama.cpp. If you use IQ2_XS (2-bit), the model fits entirely in VRAM and you get 15–25 tok/s without offloading.

What is GGUF Q4_K_M and why is it the recommended format?

GGUF is the file format used by llama.cpp and Ollama. Q4_K_M means 4-bit quantization using the K-quant method, Medium size/quality balance. It applies mixed precision - most weights at 4-bit, but sensitive layers (like attention projections) at 5–6 bits. This recovers significantly more quality than the older Q4_0 format at nearly the same file size. For a 70B model, Q4_K_M is the standard recommendation because it hits the best quality-to-VRAM ratio.

What's the difference between AWQ quantization and GGUF Q4_K_M?

Both are 4-bit, but the approach differs. AWQ (Activation-aware Weight Quantization) analyzes model activations to identify and protect the ~1% of weights most critical to output quality. GGUF Q4_K_M uses a block-based mixed-precision approach. AWQ typically scores 1–3% higher on benchmarks like MMLU and HumanEval. GGUF Q4_K_M is more portable - it runs on CPU, Apple Silicon, and GPU with offloading. For a single 4090, GGUF is more practical. For a 40+ GB GPU, AWQ is faster and more accurate.

How fast is llama.cpp 70B on a single RTX 4090?

With Q4_K_M and --n-gpu-layers 45, expect 25–35 tokens per second. That's interactive speed - faster than you can read. The exact number depends on how many layers fit in VRAM (more layers = faster), your system RAM speed, and context length. With --flash-attn enabled and DDR5-6000 RAM, you'll hit the higher end of that range.

Why is Ollama so slow with a 70B model?

Ollama defaults to Q4_K_M for 70B models. That's ~42 GB - more than the 4090's 24 GB. Ollama automatically splits the model: GPU handles what fits, CPU handles the rest. The PCIe bus between GPU and CPU RAM becomes the bottleneck, capping generation at 2–4 tok/s. To get usable speed with Ollama on a single 4090, use the IQ2_XS variant (ollama run llama3.3:70b:iq2_xs), which fits in VRAM and hits 15–25 tok/s.

Can I fine-tune a 70B model on an RTX 4090?

No. Quantization is inference-only. Fine-tuning requires full FP16/BF16 precision, which means ~140 GB VRAM for a 70B model. Even QLoRA fine-tuning of a 70B model requires multiple high-VRAM GPUs. The RTX 4090 can fine-tune models up to ~13B parameters with QLoRA. For 70B fine-tuning, you need cloud compute or a multi-GPU A100/H100 setup.

Is ExLlamaV2 better than llama.cpp for 70B inference?

For raw speed on a single GPU: yes. ExLlamaV2 hits 50–65 tok/s vs. llama.cpp's 25–35 tok/s - roughly 2x faster. ExLlamaV2 uses optimized FlashAttention kernels that llama.cpp's hybrid CPU/GPU mode can't match. The tradeoff: ExLlamaV2 is GPU-only, requires the EXL2 model format, and has a steeper setup curve. If you need the fastest possible inference and don't need CPU fallback, ExLlamaV2 wins. For flexibility and ease of use, llama.cpp is the better default.

Useful Sources

llama.cpp GitHub repository - Build instructions, CUDA flags, and server documentation
bartowski's GGUF model collection on Hugging Face - Well-maintained Q4_K_M and IQ2_XS builds for current models
hugging-quants AWQ-INT4 models - Official AWQ-INT4 builds for Meta Llama models
ExLlamaV2 GitHub (turboderp) - ExLlamaV2 source, EXL2 format documentation, and benchmarks
Ollama model library - One-command model downloads with automatic quantization selection
oobabooga benchmark: GPTQ vs AWQ vs EXL2 vs llama.cpp - Detailed inference speed comparison across frameworks
SitePoint: Quantization explained for consumer GPUs - Accessible breakdown of quantization tradeoffs
NVIDIA Developer Blog: Accelerating LLMs with llama.cpp on RTX - Official NVIDIA guidance on RTX-optimized llama.cpp builds

Keep reading

llmquantizationllama

How to Quantize Llama 3 to 4-Bit With Minimal Accuracy Loss

A complete, code-first guide to quantizing Llama 3 to 4-bit using bitsandbytes NF4, AWQ, GPTQ, and GGUF - with real VRAM numbers, MMLU benchmarks, and tips to keep accuracy loss under 2%.

MKMohammed Kafeel

16 min read

llmquantizationoptimization

LLM Quantization Explained: INT4 vs INT8 vs FP16

A 70B model needs 140 GB of GPU RAM in FP16. Most teams don't have that. This guide breaks down LLM quantization - INT4 vs INT8 vs FP16 - with real memory numbers, speed benchmarks, and a practical decision framework for deploying LLMs efficiently.

MKMohammed Kafeel

12 min read

llmquantizationoptimization

LLM Quantization: 2-Bit vs 4-Bit vs 8-Bit Accuracy & Memory

Standard 2-bit quantization destroys model accuracy - perplexity spikes to 38,000+. Here's what the benchmarks actually say about 2-bit vs 4-bit vs 8-bit, and how to pick the right level for your hardware and task.

MKMohammed Kafeel

17 min read

Back to all posts