Run 70B Models on a Single RTX 4090 With 4-Bit Quantization
A how-to for fitting a 70B-parameter model onto one 24 GB RTX 4090 using aggressive 4-bit and 2-bit quantization — what works, what breaks, and the accuracy cost.
Mohammed Kafeel
Machine Learning Researcher
Quick answer: A 70B model in FP16 needs ~140 GB; even at 4-bit it's ~35 GB — which does not fit in the RTX 4090's 24 GB. To run a 70B fully on one 4090 you must go below 4-bit. The two practical routes: (1) sub-4-bit GPU quantization — an ExLlamaV2 EXL2 quant at ~2.2–2.5 bits-per-weight (~19–22 GB) or a GGUF i-quant (IQ2_XXS / IQ2_M) — which fits entirely in 24 GB and runs fast; or (2) higher-bit with CPU offload — a GGUF Q3/Q4 quant where some layers live on the GPU and the rest spill to system RAM, giving better quality but much slower generation. Pair either with a quantized KV cache (4-bit/8-bit) and a modest context length to claw back VRAM. The trade-off is real quality loss from sub-4-bit, but a 70B at ~2.4 bpw often still beats a 13B at 4-bit — the size advantage tends to win.
The core problem: the math doesn't fit
Start with the memory reality, because it dictates everything:
| 70B precision | Bits/weight | Weights only | Fits in 24 GB? |
|---|---|---|---|
| FP16 | 16 | ~140 GB | No (6× over) |
| INT8 | 8 | ~70 GB | No |
INT4 (Q4_K_M) |
~4.8 | ~42 GB | No |
INT3 (Q3_K_M) |
~3.9 | ~34 GB | No |
| ~2.4 bpw (EXL2/IQ2) | ~2.4 | ~21 GB | Yes (tight) |
| ~2.06 bpw (IQ2_XXS) | ~2.06 | ~18 GB | Yes |
The RTX 4090 has 24 GB. To fit 70B weights on the card you need roughly ≤2.5 bits per weight — and you still need headroom for the KV cache and activations. That's why running 70B on a 4090 is fundamentally an exercise in aggressive (sub-4-bit) quantization or offloading, not standard 4-bit.
A reassuring counterpoint from quantization scaling: large models tolerate low-bit quantization better than small ones (more redundancy per weight). A 70B at ~2.4 bpw is damaged but still highly capable — often more so than a 13B at clean 4-bit. The size advantage is exactly why this is worth doing.
Route 1 — Sub-4-bit, fully on the GPU (recommended for speed)
This keeps the entire model in VRAM, so generation is fast. You trade quality for fit by going to ~2–2.5 bpw.
Option A: ExLlamaV2 (EXL2) at ~2.2–2.5 bpw
EXL2 supports fractional bits-per-weight, so you can target exactly what fits. A 70B at ~2.4 bpw lands around 21 GB, leaving room for a small KV cache.
# Download a pre-quantized EXL2 70B at ~2.4 bpw, or quantize your own with exllamav2
# Run with the ExLlamaV2 loader (e.g., via TabbyAPI or text-generation-webui)
# Key settings for a 24 GB card:
# - model: 70B @ ~2.4 bpw EXL2
# - cache: Q4 (4-bit KV cache) to save VRAM
# - max_seq_len: keep modest (e.g., 4096-8192) to bound KV memory
EXL2's fractional bpw plus a 4-bit KV cache is the single best speed/quality combo for 70B on a 4090 — fully GPU-resident, fast tokens/sec, and you tune bpw to the last gigabyte.
Option B: GGUF i-quants (IQ2_XXS / IQ2_M) with llama.cpp
llama.cpp's importance-matrix 2-bit quants are designed for exactly this. IQ2_XXS (~2.06 bpw, ~18 GB) fits comfortably; IQ2_M (~2.7 bpw) is higher quality but tighter.
# Run fully on GPU: offload all layers (-ngl 99), use a quantized KV cache
./llama-cli -m llama-3-70b-IQ2_M.gguf \
-ngl 99 \ # put all layers on the GPU
--ctx-size 4096 \ # modest context to bound KV memory
--cache-type-k q8_0 \ # quantized KV cache (key)
--cache-type-v q8_0 \ # quantized KV cache (value)
-p "Your prompt"
i-quants use a calibration importance matrix to allocate the precious few bits where they matter most, which is why IQ2 is far more usable than naive 2-bit.
Option C: AQLM (advanced 2-bit)
AQLM (Additive Quantization) achieves usable 2-bit quality through learned codebooks and is a strong choice when you want the best quality at ~2 bpw fully on-GPU. It's more complex to set up and slower to quantize, but the inference quality at 2-bit is among the best available. Use it when IQ2/EXL2 quality isn't enough and you can invest in the tooling.
Route 2 — Higher-bit with CPU offload (recommended for quality)
If sub-4-bit quality is unacceptable, keep more bits (Q3_K_M or even Q4_K_M) and offload the overflow layers to system RAM. The GPU holds as many layers as fit; the CPU handles the rest. Quality is much better; speed is much worse (the CPU layers bottleneck generation).
# Q4_K_M 70B (~42 GB) won't fit in 24 GB — offload only what fits.
# -ngl N puts N layers on the GPU; the rest run on CPU/RAM.
./llama-cli -m llama-3-70b-Q4_K_M.gguf \
-ngl 40 \ # tune: as many layers as fit in ~22 GB
--ctx-size 4096 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-p "Your prompt"
Tuning -ngl: increase the number of offloaded GPU layers until you're near the VRAM limit, then back off one step. More GPU layers = faster; too many = out-of-memory. You'll need ample system RAM (64 GB+ is comfortable for a Q4 70B) since the non-GPU layers live there.
Expect a large speed gap: full-GPU sub-4-bit might do tens of tokens/sec; partial CPU offload of a Q4 70B often drops to low single digits. Choose based on whether you're optimizing for quality (offload) or interactivity (sub-4-bit on-GPU).
Clawing back VRAM: the KV cache and context
Weights aren't the only consumer. The KV cache grows with context length and can eat several gigabytes — exactly what you don't have to spare. Two high-impact levers:
| Lever | Effect |
|---|---|
| Quantized KV cache | q8_0 (8-bit) or q4 KV roughly halves/quarters cache memory |
| Shorter context | KV memory is linear in context length — 4K uses half of 8K |
| GQA models | Llama 3 70B uses grouped-query attention → already small KV cache |
| Lower bpw weights | Frees VRAM that can go to cache or more GPU layers |
On a 24 GB card every gigabyte counts: a 4-bit KV cache plus a 4K context can be the difference between fitting and OOM. Llama 3 70B's GQA helps a lot here — its KV cache is far smaller than a multi-head-attention model of the same size.
The decision: which route for you?
| Your priority | Choose |
|---|---|
| Interactive speed (chat, agents) | Route 1 — EXL2 ~2.4 bpw or IQ2, fully on GPU |
| Best possible quality, speed secondary | Route 2 — Q4_K_M with CPU offload |
| Best quality at 2-bit, fully on-GPU | Route 1, Option C — AQLM |
| Simplicity / broad tooling | GGUF (IQ2 or Q3/Q4 offload) via Ollama/llama.cpp |
| Squeeze the absolute largest context | Sub-4-bit weights + 4-bit KV cache |
The honest recommendation: for most people wanting a usable, interactive 70B on a single 4090, start with an EXL2 ~2.4 bpw quant (or GGUF IQ2_M) fully on the GPU with a 4-bit KV cache. If the quality isn't good enough for your task, move to Q4 with CPU offload and accept the speed hit — or step down to a 32B/34B model at clean 4-bit, which fits comfortably and may serve you better than a heavily-damaged 70B.
A reality check on quality
Be clear-eyed: 2–2.5 bpw is aggressive, and quality drops. Expect more degradation on reasoning, math, and code than on casual chat. Mitigations and expectations:
- Validate on your task — perplexity for a quick check, real task accuracy for the decision. Don't assume the 70B@2bit beats your alternatives; measure it.
- Use importance-matrix / learned-codebook quants (IQ2, AQLM, good EXL2) — never naive 2-bit, which is unusable.
- Consider the alternative: a 32–34B model at 4-bit (~18–20 GB) fits cleanly on a 4090 with good quality and full speed. For many workloads it's the smarter choice than a crippled 70B. Run both and compare before committing.
Frequently asked questions
Can a 70B model run on a single RTX 4090? Yes, but not at standard 4-bit — a 4-bit 70B is ~35–42 GB and the 4090 has only 24 GB. You must go below 4-bit (an EXL2 quant at ~2.2–2.5 bits-per-weight, or a GGUF i-quant like IQ2_XXS/IQ2_M) to fit the model fully in VRAM, or keep higher bits and offload some layers to system RAM at a large speed cost. Pair either with a quantized KV cache and modest context.
What bits-per-weight do I need to fit 70B in 24 GB? Roughly 2.5 bits-per-weight or lower for the weights, leaving headroom for the KV cache and activations. A ~2.4 bpw quant is about 21 GB of weights; ~2.06 bpw (IQ2_XXS) is about 18 GB. EXL2's fractional bpw lets you target exactly what fits, and a quantized KV cache plus a shorter context buys the remaining room.
Is a 70B at 2-bit better than a smaller model at 4-bit? Often, but not always — measure. Larger models tolerate low-bit quantization better because they have more redundancy, so a 70B at ~2.4 bpw is damaged but frequently still more capable than a 13B at clean 4-bit. However, a 32–34B model at 4-bit fits cleanly on a 4090 with full quality and speed, and for many tasks it's a better choice than a heavily-quantized 70B. Compare them on your real workload.
Should I use EXL2, GGUF i-quants, or AQLM? EXL2 (~2.4 bpw) is the best speed/quality balance for a fully-GPU 70B on a 4090, with fractional bpw control. GGUF i-quants (IQ2_XXS/IQ2_M) via llama.cpp are simple, well-supported, and run on the broad Ollama/llama.cpp ecosystem. AQLM offers among the best 2-bit quality but is more complex to set up. Start with EXL2 or IQ2; reach for AQLM if you need more quality at 2-bit.
How do I save VRAM beyond quantizing the weights?
Quantize the KV cache (8-bit q8_0 or 4-bit), shorten the context length (KV memory is linear in context), and prefer models with grouped-query attention (Llama 3 70B uses GQA, which already shrinks the KV cache). These levers can free several gigabytes — often the difference between fitting and an out-of-memory error on a 24 GB card.
Why is CPU offload so much slower? When some layers live in system RAM, every token must move data across the PCIe bus and compute those layers on the CPU, which is far slower than the GPU. The offloaded layers become the bottleneck, dropping generation from tens of tokens per second (fully on GPU) to low single digits. Offload is the price of keeping higher-bit quality when the model doesn't fit in VRAM.
Key takeaways
- A 70B at standard 4-bit (~35–42 GB) does not fit in the 4090's 24 GB — running it requires sub-4-bit quantization or CPU offload.
- Route 1 (speed): EXL2 at ~2.2–2.5 bpw or GGUF i-quants (IQ2_XXS/IQ2_M), fully on GPU — fast, with real quality loss.
- Route 2 (quality):
Q3/Q4GGUF with-ngllayer offload to system RAM — better quality, much slower; needs 64 GB+ RAM. - Claw back VRAM with a quantized KV cache (q8_0/q4), a modest context, and GQA models (Llama 3 70B).
- Use smart low-bit methods (IQ2, AQLM, good EXL2) — never naive 2-bit.
- Sanity-check the alternative: a 32–34B at clean 4-bit fits comfortably and may beat a crippled 70B — measure both on your task before committing.
References
- ExLlamaV2 (EXL2 fractional-bpw quantization and fast inference). https://github.com/turboderp/exllamav2
- llama.cpp — i-quants (IQ2/IQ3), KV cache quantization, layer offload (
-ngl). https://github.com/ggml-org/llama.cpp - Egiazarian, V., Panferov, A., Kuznedelev, D., et al. (2024). AQLM: Extreme Compression of LLMs via Additive Quantization. https://arxiv.org/abs/2401.06118
- Tseng, A., Chee, J., et al. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. https://arxiv.org/abs/2402.04396
- Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. https://arxiv.org/abs/2305.13245
- Dettmers, T., & Zettlemoyer, L. (2022). The case for 4-bit precision: k-bit Inference Scaling Laws. https://arxiv.org/abs/2212.09720
Keep reading
LLM Quantization: 2-Bit vs 4-Bit vs 8-Bit Accuracy & Memory
How accuracy degrades and memory shrinks as you drop from 8-bit to 4-bit to 2-bit quantization, and how to find the sweet spot for your model and hardware.
LLM Quantization Explained: INT4 vs INT8 vs FP16
A beginner's guide to LLM quantization: how INT4, INT8, and FP16 trade memory for quality, and the rule of thumb for sizing a model to your GPU.
How to Quantize Llama 3 to 4-Bit With Minimal Accuracy Loss
A step-by-step guide to quantizing Llama 3 to 4-bit precision while keeping accuracy loss minimal — calibration data, method choice, and verification.