Run 70B Models on a Single RTX 4090 With 4-Bit Quantization

A how-to for fitting a 70B-parameter model onto one 24 GB RTX 4090 using aggressive 4-bit and 2-bit quantization — what works, what breaks, and the accuracy cost.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 14 min read

Quick answer: A 70B model in FP16 needs ~140 GB; even at 4-bit it's ~35 GB — which does not fit in the RTX 4090's 24 GB. To run a 70B fully on one 4090 you must go below 4-bit. The two practical routes: (1) sub-4-bit GPU quantization — an ExLlamaV2 EXL2 quant at ~2.2–2.5 bits-per-weight (~19–22 GB) or a GGUF i-quant (IQ2_XXS / IQ2_M) — which fits entirely in 24 GB and runs fast; or (2) higher-bit with CPU offload — a GGUF Q3/Q4 quant where some layers live on the GPU and the rest spill to system RAM, giving better quality but much slower generation. Pair either with a quantized KV cache (4-bit/8-bit) and a modest context length to claw back VRAM. The trade-off is real quality loss from sub-4-bit, but a 70B at ~2.4 bpw often still beats a 13B at 4-bit — the size advantage tends to win.


The core problem: the math doesn't fit

Start with the memory reality, because it dictates everything:

| 70B precision | Bits/weight | Weights only | Fits in 24 GB? |
| --- | --- | --- | --- |
| FP16 | 16 | ~140 GB | No (~6× over) |
| INT8 | 8 | ~70 GB | No |
| INT4 (Q4_K_M) | ~4.8 | ~42 GB | No |
| INT3 (Q3_K_M) | ~3.9 | ~34 GB | No |
| ~2.4 bpw (EXL2/IQ2) | ~2.4 | ~21 GB | Yes (tight) |
| ~2.06 bpw (IQ2_XXS) | ~2.06 | ~18 GB | Yes |

The RTX 4090 has 24 GB. To fit 70B weights on the card you need roughly ≤2.5 bits per weight — and you still need headroom for the KV cache and activations. That's why running 70B on a 4090 is fundamentally an exercise in aggressive (sub-4-bit) quantization or offloading, not standard 4-bit.
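
The table's "weights only" column is just parameters × bits ÷ 8. Here is a minimal sketch that reproduces it, assuming a round 70 billion parameters and decimal gigabytes; the helper name `weight_gb` is made up for illustration, and real quant files differ slightly because some tensors are stored at a higher precision.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 24  # RTX 4090

# Same rows as the table above (bits-per-weight values are approximate).
for label, bpw in [("FP16", 16), ("INT8", 8), ("Q4_K_M", 4.8),
                   ("Q3_K_M", 3.9), ("EXL2/IQ2", 2.4), ("IQ2_XXS", 2.06)]:
    size = weight_gb(70, bpw)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{label:>9}: {size:6.1f} GB -> {verdict}")
```

"Fits" here only means the weights are under 24 GB. It says nothing yet about the KV cache and activations that have to share the card.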

A reassuring counterpoint from quantization scaling: large models tolerate low-bit quantization better than small ones (more redundancy per weight). A 70B at ~2.4 bpw is damaged but still highly capable — often more so than a 13B at clean 4-bit. The size advantage is exactly why this is worth doing.


Route 1 — Sub-4-bit, fully on the GPU (recommended for speed)

This keeps the entire model in VRAM, so generation is fast. You trade quality for fit by going to ~2–2.5 bpw.

Option A: ExLlamaV2 (EXL2) at ~2.2–2.5 bpw

EXL2 supports fractional bits-per-weight, so you can target exactly what fits. A 70B at ~2.4 bpw lands around 21 GB, leaving room for a small KV cache.

# Download a pre-quantized EXL2 70B at ~2.4 bpw, or quantize your own with exllamav2
# Run with the ExLlamaV2 loader (e.g., via TabbyAPI or text-generation-webui)

# Key settings for a 24 GB card:
#   - model: 70B @ ~2.4 bpw EXL2
#   - cache: Q4 (4-bit KV cache) to save VRAM
#   - max_seq_len: keep modest (e.g., 4096-8192) to bound KV memory

EXL2's fractional bpw plus a 4-bit KV cache is the single best speed/quality combo for 70B on a 4090 — fully GPU-resident, fast tokens/sec, and you tune bpw to the last gigabyte.
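
Because EXL2 lets you pick the bpw, you can run the arithmetic backwards: decide how much VRAM to hold back, then solve for the largest bpw that fits. This is a rough sketch; the 3 GB reserve for KV cache, activations, and runtime overhead is an assumed figure, and `max_bpw` is a hypothetical helper, not part of ExLlamaV2.

```python
def max_bpw(params_billion: float, vram_gb: float, reserve_gb: float) -> float:
    """Largest bits-per-weight whose weights fit in (vram_gb - reserve_gb)."""
    usable_gb = vram_gb - reserve_gb
    if usable_gb <= 0:
        raise ValueError("reserve_gb must be smaller than vram_gb")
    return usable_gb * 8 / params_billion


# Assumed reserve: 3 GB for KV cache, activations, and runtime overhead.
target = max_bpw(params_billion=70, vram_gb=24, reserve_gb=3)
print(f"target about {target:.2f} bpw")
```

With a 3 GB reserve the target comes out at 2.4 bpw, which matches the ~21 GB figure above. If you shrink the reserve (shorter context, 4-bit cache), the target bpw goes up.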

Option B: GGUF i-quants (IQ2_XXS / IQ2_M) with llama.cpp

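llama.cpp's importance-matrix 2-bit quants are designed for exactly this. IQ2_XXS (~2.06 bpw, ~18 GB) fits comfortably. IQ2_M (~2.7 bpw, roughly 23.6 GB by the same arithmetic) is higher quality but leaves almost no room for the KV cache on a 24 GB card, so expect to pick a smaller i-quant or keep a few layers off the GPU.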

# Run fully on GPU with a quantized KV cache.
#   -ngl 99            offload all layers to the GPU
#   --ctx-size 4096    modest context to bound KV memory
#   --cache-type-k/v   quantized KV cache (key / value); a quantized V cache
#                      needs flash attention, so enable it (-fa) if your build
#                      does not turn it on automatically
./llama-cli -m llama-3-70b-IQ2_XXS.gguf \
    -ngl 99 \
    --ctx-size 4096 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    -p "Your prompt"

i-quants use a calibration importance matrix to allocate the precious few bits where they matter most, which is why IQ2 is far more usable than naive 2-bit.

Option C: AQLM (advanced 2-bit)

AQLM (Additive Quantization) achieves usable 2-bit quality through learned codebooks and is a strong choice when you want the best quality at ~2 bpw fully on-GPU. It's more complex to set up and slower to quantize, but the inference quality at 2-bit is among the best available. Use it when IQ2/EXL2 quality isn't enough and you can invest in the tooling.


Route 2 — Higher-bit with CPU offload (recommended for quality)

If sub-4-bit quality is unacceptable, keep more bits (Q3_K_M or even Q4_K_M) and offload the overflow layers to system RAM. The GPU holds as many layers as fit; the CPU handles the rest. Quality is much better; speed is much worse (the CPU layers bottleneck generation).

# Q4_K_M 70B (~42 GB) won't fit in 24 GB, so offload only what fits.
# -ngl N puts N layers on the GPU; the rest run on CPU/RAM.
# Tune -ngl (40 here) to as many layers as fit in ~22 GB.
./llama-cli -m llama-3-70b-Q4_K_M.gguf \
    -ngl 40 \
    --ctx-size 4096 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    -p "Your prompt"

Tuning -ngl: increase the number of offloaded GPU layers until you're near the VRAM limit, then back off one step. More GPU layers = faster; too many = out-of-memory. You'll need ample system RAM (64 GB+ is comfortable for a Q4 70B) since the non-GPU layers live there.
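
You can get a first guess for -ngl before any trial and error by dividing the VRAM budget by the average size of one layer. This is a sketch with loud assumptions: it treats all 80 transformer layers of Llama 3 70B as equal in size, ignores the embedding and output tensors, and `estimate_ngl` is a hypothetical helper rather than anything llama.cpp provides.

```python
import math


def estimate_ngl(model_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Rough upper bound on GPU layers, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, math.floor(vram_budget_gb / per_layer_gb))


# Q4_K_M 70B: ~42 GB across 80 layers, with ~22 GB of VRAM given to weights.
start = estimate_ngl(model_gb=42, n_layers=80, vram_budget_gb=22)
print(f"start near -ngl {start}, then back off if you hit OOM")
```

For these inputs the estimate is 41 layers. The KV cache and compute buffers also live in VRAM, so the value that actually runs is usually a little lower, which is why the command above uses 40.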

Expect a large speed gap: full-GPU sub-4-bit might do tens of tokens/sec; partial CPU offload of a Q4 70B often drops to low single digits. Choose based on whether you're optimizing for quality (offload) or interactivity (sub-4-bit on-GPU).
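
A back-of-envelope model shows where that gap comes from. Generation is mostly memory-bandwidth-bound: each token has to read every weight once, so time per token is roughly bytes read ÷ bandwidth on each device. The sketch below assumes the RTX 4090's 1008 GB/s spec bandwidth and an assumed 60 GB/s for dual-channel desktop RAM; it ignores compute, KV cache reads, and transfer overhead, so treat the results as optimistic ceilings rather than benchmarks.

```python
def tokens_per_sec(gpu_gb: float, cpu_gb: float,
                   gpu_bw_gbps: float = 1008.0,   # RTX 4090 spec bandwidth
                   cpu_bw_gbps: float = 60.0) -> float:  # assumed system RAM
    """Bandwidth-bound ceiling: every weight is read once per token."""
    seconds_per_token = gpu_gb / gpu_bw_gbps + cpu_gb / cpu_bw_gbps
    return 1.0 / seconds_per_token


# Route 1: ~21 GB of weights, all on the GPU.
print(f"full GPU : {tokens_per_sec(gpu_gb=21, cpu_gb=0):.1f} tok/s ceiling")

# Route 2: ~42 GB Q4_K_M, split about half on GPU and half in system RAM.
print(f"offloaded: {tokens_per_sec(gpu_gb=21, cpu_gb=21):.1f} tok/s ceiling")
```

Under these assumptions the fully on-GPU ceiling is about 48 tokens/sec, while the half-offloaded case lands under 3, almost all of it spent waiting on system RAM.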


Clawing back VRAM: the KV cache and context

Weights aren't the only consumer. The KV cache grows with context length and can eat several gigabytes — exactly what you don't have to spare. Two high-impact levers:

| Lever | Effect |
| --- | --- |
| Quantized KV cache | q8_0 (8-bit) or q4 KV cuts cache memory to roughly a half or a quarter of FP16 |
| Shorter context | KV memory is linear in context length: 4K uses half of 8K |
| GQA models | Llama 3 70B uses grouped-query attention → already small KV cache |
| Lower bpw weights | Frees VRAM that can go to cache or more GPU layers |

On a 24 GB card every gigabyte counts: a 4-bit KV cache plus a 4K context can be the difference between fitting and OOM. Llama 3 70B's GQA helps a lot here — its KV cache is far smaller than a multi-head-attention model of the same size.
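
To put numbers on those levers, the KV cache holds one key and one value vector per layer, per KV head, per token. The sketch below assumes Llama 3 70B's published architecture (80 layers, 8 KV heads, head dimension 128, 64 query heads), which comes from the model config rather than from this post. It also uses idealized 8-bit and 4-bit sizes; real q8_0/q4_0 blocks carry a small per-block scale, so actual usage is slightly higher. `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bits_per_element: int = 16) -> float:
    """KV cache size in GiB (2**30 bytes) for one sequence."""
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # 2 = keys + values
    return elements * bits_per_element / 8 / 2**30


# Assumed Llama 3 70B config: 80 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (4096, 8192):
    for bits in (16, 8, 4):
        size = kv_cache_gib(80, 8, 128, ctx, bits)
        print(f"ctx={ctx:5d}  {bits:2d}-bit KV: {size:.3f} GiB")

# Same model shape without GQA (one KV head per query head, 64 total).
print(f"MHA, ctx=4096, FP16: {kv_cache_gib(80, 64, 128, 4096):.1f} GiB")
```

At 4K context the FP16 cache is 1.25 GiB with GQA versus 10 GiB without it, and a 4-bit cache brings the GQA figure down to about 0.31 GiB. Doubling the context doubles every number.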


The decision: which route for you?

| Your priority | Choose |
| --- | --- |
| Interactive speed (chat, agents) | Route 1: EXL2 ~2.4 bpw or IQ2, fully on GPU |
| Best possible quality, speed secondary | Route 2: Q4_K_M with CPU offload |
| Best quality at 2-bit, fully on-GPU | Route 1, Option C: AQLM |
| Simplicity / broad tooling | GGUF (IQ2 or Q3/Q4 offload) via Ollama/llama.cpp |
| Squeeze the absolute largest context | Sub-4-bit weights + 4-bit KV cache |

The honest recommendation: for most people wanting a usable, interactive 70B on a single 4090, start with an EXL2 ~2.4 bpw quant (or GGUF IQ2_M) fully on the GPU with a 4-bit KV cache. If the quality isn't good enough for your task, move to Q4 with CPU offload and accept the speed hit — or step down to a 32B/34B model at clean 4-bit, which fits comfortably and may serve you better than a heavily-damaged 70B.


A reality check on quality

Be clear-eyed: 2–2.5 bpw is aggressive, and quality drops. Expect more degradation on reasoning, math, and code than on casual chat. Mitigations and expectations:

  • Validate on your task — perplexity for a quick check, real task accuracy for the decision. Don't assume the 70B@2bit beats your alternatives; measure it.
  • Use importance-matrix / learned-codebook quants (IQ2, AQLM, good EXL2) — never naive 2-bit, which is unusable.
  • Consider the alternative: a 32–34B model at 4-bit (~18–20 GB) fits cleanly on a 4090 with good quality and full speed. For many workloads it's the smarter choice than a crippled 70B. Run both and compare before committing.
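
For the quick perplexity check, the computation itself is small: perplexity is the exponential of the average negative log-likelihood per token. Here is a minimal sketch, assuming your inference stack can return per-token natural-log probabilities for a held-out text; the `perplexity` helper and the sample numbers are illustrative only, not measurements.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp(-mean log-probability); lower means the text surprised the model less."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probs for the same text scored by two candidate models.
candidate_a = [-1.2, -0.4, -2.1, -0.9]
candidate_b = [-1.5, -0.6, -2.6, -1.1]
print(f"A: {perplexity(candidate_a):.2f}  B: {perplexity(candidate_b):.2f}")
```

Only compare perplexities between models that share a tokenizer and were scored on the same text. A 70B and a 32B from different families need a task-level benchmark instead.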

Frequently asked questions

Can a 70B model run on a single RTX 4090? Yes, but not at standard 4-bit — a 4-bit 70B is ~35–42 GB and the 4090 has only 24 GB. You must go below 4-bit (an EXL2 quant at ~2.2–2.5 bits-per-weight, or a GGUF i-quant like IQ2_XXS/IQ2_M) to fit the model fully in VRAM, or keep higher bits and offload some layers to system RAM at a large speed cost. Pair either with a quantized KV cache and modest context.

What bits-per-weight do I need to fit 70B in 24 GB? Roughly 2.5 bits-per-weight or lower for the weights, leaving headroom for the KV cache and activations. A ~2.4 bpw quant is about 21 GB of weights; ~2.06 bpw (IQ2_XXS) is about 18 GB. EXL2's fractional bpw lets you target exactly what fits, and a quantized KV cache plus a shorter context buys the remaining room.

Is a 70B at 2-bit better than a smaller model at 4-bit? Often, but not always — measure. Larger models tolerate low-bit quantization better because they have more redundancy, so a 70B at ~2.4 bpw is damaged but frequently still more capable than a 13B at clean 4-bit. However, a 32–34B model at 4-bit fits cleanly on a 4090 with full quality and speed, and for many tasks it's a better choice than a heavily-quantized 70B. Compare them on your real workload.

Should I use EXL2, GGUF i-quants, or AQLM? EXL2 (~2.4 bpw) is the best speed/quality balance for a fully-GPU 70B on a 4090, with fractional bpw control. GGUF i-quants (IQ2_XXS/IQ2_M) via llama.cpp are simple, well-supported, and run on the broad Ollama/llama.cpp ecosystem. AQLM offers among the best 2-bit quality but is more complex to set up. Start with EXL2 or IQ2; reach for AQLM if you need more quality at 2-bit.

How do I save VRAM beyond quantizing the weights? Quantize the KV cache (8-bit q8_0 or 4-bit), shorten the context length (KV memory is linear in context), and prefer models with grouped-query attention (Llama 3 70B uses GQA, which already shrinks the KV cache). These levers can free several gigabytes — often the difference between fitting and an out-of-memory error on a 24 GB card.

Why is CPU offload so much slower? When some layers live in system RAM, the CPU has to compute them for every token, reading their weights from memory that has a small fraction of the GPU's bandwidth; the intermediate activations also have to cross the PCIe bus between the two devices. The offloaded layers become the bottleneck, dropping generation from tens of tokens per second (fully on GPU) to low single digits. Offload is the price of keeping higher-bit quality when the model doesn't fit in VRAM.


Key takeaways

  • A 70B at standard 4-bit (~35–42 GB) does not fit in the 4090's 24 GB — running it requires sub-4-bit quantization or CPU offload.
  • Route 1 (speed): EXL2 at ~2.2–2.5 bpw or GGUF i-quants (IQ2_XXS/IQ2_M), fully on GPU — fast, with real quality loss.
  • Route 2 (quality): Q3/Q4 GGUF with -ngl layer offload to system RAM — better quality, much slower; needs 64 GB+ RAM.
  • Claw back VRAM with a quantized KV cache (q8_0/q4), a modest context, and GQA models (Llama 3 70B).
  • Use smart low-bit methods (IQ2, AQLM, good EXL2) — never naive 2-bit.
  • Sanity-check the alternative: a 32–34B at clean 4-bit fits comfortably and may beat a crippled 70B — measure both on your task before committing.

References

  1. ExLlamaV2 (EXL2 fractional-bpw quantization and fast inference). https://github.com/turboderp/exllamav2
  2. llama.cpp — i-quants (IQ2/IQ3), KV cache quantization, layer offload (-ngl). https://github.com/ggml-org/llama.cpp
  3. Egiazarian, V., Panferov, A., Kuznedelev, D., et al. (2024). AQLM: Extreme Compression of LLMs via Additive Quantization. https://arxiv.org/abs/2401.06118
  4. Tseng, A., Chee, J., et al. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. https://arxiv.org/abs/2402.04396
  5. Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. https://arxiv.org/abs/2305.13245
  6. Dettmers, T., & Zettlemoyer, L. (2022). The case for 4-bit precision: k-bit Inference Scaling Laws. https://arxiv.org/abs/2212.09720