Quantization for Edge Devices: LLMs Under 4 GB VRAM
A case study on shrinking LLMs to run under 4 GB of VRAM for edge deployment — which quantization methods survive the squeeze and where quality falls apart.
Mohammed Kafeel
Machine Learning Researcher
Quick answer: Under a 4 GB memory budget, the winning strategy is a small model (1–4B parameters) quantized to 4-bit, not a large model squeezed to an extremely low bit-width. A 3B model at 4-bit needs only ~1.5–2 GB for weights, leaving room for the KV cache and the rest of the system, whereas a 7–8B model at 4-bit (~4–4.5 GB) won't fit a 4 GB device once you account for overhead. Use an edge-friendly runtime (llama.cpp/GGUF, MLC LLM, ExecuTorch, or ONNX Runtime) that runs on ARM CPUs, mobile GPUs, and NPUs; pick a strong small model (Llama 3.2 1B/3B, Gemma 2B, Phi-family, Qwen 2.5 0.5–3B); quantize the KV cache and cap context to fit the memory ceiling; and fine-tune for your specific task to recover the quality the small size costs you. The mantra for edge: smallest capable model + 4-bit + task specialization, not biggest model + most aggressive quantization.
The scenario
Consider a team — call them FieldOps — building an offline voice assistant for a handheld industrial device. Constraints that define the problem:
- ≤4 GB memory shared between the model, the OS, and the app (typical for mid-range phones, Jetson Nano-class boards, and embedded SoCs with unified memory).
- No reliable network — inference must run fully on-device, so a cloud API is off the table.
- Modest, well-scoped task — answer questions about equipment, parse commands, summarize logs. Not open-ended general intelligence.
- Acceptable latency — a few tokens per second is fine; this isn't a datacenter.
The instinct is to cram in the biggest model that "fits." That instinct is wrong for edge. The right question is: what is the smallest model that does the job well? Every gigabyte and every millisecond is scarce.
Why "small model + 4-bit" beats "big model + 2-bit" on edge
On a 24 GB desktop GPU, pushing a 70B to 2-bit can make sense (large models tolerate it, and capability is the goal). On a <4 GB edge device the calculus inverts:
- The memory ceiling is brutal. Even a 7B at 4-bit (~4 GB weights alone) overruns a 4 GB device once you add the KV cache, activations, and OS. A 1–4B model leaves breathing room.
- Extreme quantization needs big models. 2-bit only works tolerably on large models; a small model at 2-bit is wrecked. So you can't compensate for the budget by quantizing a mid-size model harder — it just breaks.
- The task is narrow. A well-scoped, fine-tuned small model often matches a giant generalist on that task. Edge workloads are usually specific, which plays to small models' strengths.
The edge rule: choose the smallest model that passes your task eval, quantize it to a quality-preserving 4-bit, and spend your remaining effort on fine-tuning — not on fitting a bigger model.
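The selection rule above can be sketched as a simple search over candidates sorted by size. This is a minimal illustration, not a prescribed workflow: the `Candidate` type, the model names, and the scores are hypothetical placeholders, and the scores would come from your own task eval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    params_b: float    # parameter count in billions
    eval_score: float  # task-eval accuracy in [0, 1], measured by you


def pick_smallest_passing(
    candidates: list[Candidate], min_score: float
) -> Optional[Candidate]:
    """Return the smallest candidate whose task-eval score meets the bar."""
    for cand in sorted(candidates, key=lambda c: c.params_b):
        if cand.eval_score >= min_score:
            return cand
    return None


# Hypothetical scores for illustration only; these are not measured results.
candidates = [
    Candidate("model-3b", 3.0, 0.91),
    Candidate("model-1b", 1.0, 0.78),
    Candidate("model-8b", 8.0, 0.94),
]

choice = pick_smallest_passing(candidates, min_score=0.90)
print(choice.name if choice else "no candidate passes")
```

The search stops at the first model that clears the bar, so a larger model is only considered when every smaller one fails the eval.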
Step 1 — Pick the right small model
The 1–4B class has become genuinely capable. Strong candidates for under-4 GB deployment:
| Model family | Sizes | Notes |
|---|---|---|
| Llama 3.2 | 1B, 3B | Strong small instruct models, broad support |
| Gemma 2 / 3 | ~2B | Compact, good quality, mobile-friendly |
| Phi family | ~1.5–4B | Punch above their weight on reasoning |
| Qwen 2.5 | 0.5B–3B | Wide size range, multilingual |
| SmolLM / TinyLlama | <1.5B | For the tightest budgets |
Memory at 4-bit (weights only), to match against your budget:
| Model size | 4-bit weights | Fits under 4 GB? (with KV + OS overhead) |
|---|---|---|
| 1B | ~0.5 GB | Easily |
| 3B | ~1.5–2 GB | Yes — the sweet spot |
| 7–8B | ~4–4.5 GB | No — overruns once overhead is added |
The 3B-at-4-bit slot (~1.5–2 GB) is the edge sweet spot: capable enough for scoped tasks, small enough to leave room for everything else.
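The weight figures in the table follow from parameter count times bits per weight. A rough sketch of that arithmetic is below; the 4.8 bits-per-weight figure is an assumed approximation for a mixed 4-bit scheme such as Q4_K_M (which stores scales and keeps some tensors at higher precision), and the parameter counts are nominal rather than exact.

```python
GB = 1e9  # decimal gigabytes, matching the rough figures in the table


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone (no KV cache, no runtime)."""
    return n_params * bits_per_weight / 8


# Nominal parameter counts; real checkpoints differ slightly.
for label, n_params in [("1B", 1e9), ("3B", 3e9), ("8B", 8e9)]:
    plain = weight_bytes(n_params, 4.0) / GB
    # Assumption: ~4.8 effective bits per weight for a mixed 4-bit format.
    mixed = weight_bytes(n_params, 4.8) / GB
    print(f"{label}: {plain:.2f} GB at 4.0 bpw, {mixed:.2f} GB at ~4.8 bpw")
```

The spread between the two columns is why the table gives ranges rather than single numbers.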
Step 2 — Choose an edge runtime
Desktop GPU runtimes (vLLM, ExLlamaV2) don't target phones and embedded boards. Use a runtime built for edge hardware:
| Runtime | Best for | Hardware reach |
|---|---|---|
| llama.cpp / GGUF | The default — broad device support, easy quant | ARM/x86 CPU, mobile GPU (Metal/Vulkan), some NPU |
| MLC LLM | Compiled, GPU-accelerated mobile inference | iOS/Android GPUs, WebGPU |
| ExecuTorch | PyTorch's on-device runtime | Mobile CPU/GPU/NPU |
| ONNX Runtime | Cross-platform, NPU acceleration | CPU, mobile, NPUs (QNN, etc.) |
| MediaPipe LLM | Turnkey on-device LLM on Android/iOS | Mobile CPU/GPU |
For most edge projects, llama.cpp with a GGUF model is the pragmatic starting point — it runs on nearly anything, supports aggressive quantization and KV-cache quantization, and has the widest device coverage. Move to MLC LLM / ExecuTorch / ONNX Runtime when you need NPU acceleration or tighter platform integration.
Step 3 — Quantize to fit, without wrecking quality
Target 4-bit with a quality-preserving scheme; reserve lower bits only if you truly can't fit.
# Convert and quantize a 3B model to a high-quality 4-bit GGUF.
# The positional argument is a local directory containing the downloaded
# Hugging Face checkpoint (here assumed at ./meta-llama/Llama-3.2-3B-Instruct).
python convert_hf_to_gguf.py meta-llama/Llama-3.2-3B-Instruct \
  --outfile llama-3.2-3b-f16.gguf --outtype f16
# Build an importance matrix from calibration text for minimal-loss 4-bit
./llama-imatrix -m llama-3.2-3b-f16.gguf -f calibration.txt -o m.imatrix
./llama-quantize --imatrix m.imatrix \
  llama-3.2-3b-f16.gguf llama-3.2-3b-Q4_K_M.gguf Q4_K_M
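The calibration.txt file passed to llama-imatrix is plain text. One way to assemble it, assuming you have representative task inputs on hand, is sketched below. The function name, the sample strings, and the blank-line separator are illustrative assumptions, not a format llama.cpp requires.

```python
import tempfile
from pathlib import Path


def write_calibration_file(samples: list[str], path: Path) -> int:
    """Write non-empty samples separated by blank lines; return how many were kept."""
    cleaned = [s.strip() for s in samples if s.strip()]
    path.write_text("\n\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# Hypothetical task-style samples; replace with text from your own domain.
samples = [
    "What does error code E42 on the compressor mean?",
    "   ",  # blank entries are dropped
    "Summarize the last 24 hours of pump P-101 logs.",
]

# Written to a temp directory here; point this at your working directory in practice.
out_path = Path(tempfile.gettempdir()) / "calibration.txt"
n_written = write_calibration_file(samples, out_path)
print(f"wrote {n_written} samples to {out_path}")
```

Calibration text that resembles the deployed workload is a common choice, on the assumption that the importance matrix should reflect the activations the model will actually see.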
Then trim runtime memory so it fits the device:
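# Short context: KV memory is linear in context length.
# Quantized KV cache: q8_0 for both K and V. In llama.cpp a quantized V cache
# typically also needs flash attention enabled; check `llama-cli --help` for your build.
./llama-cli -m llama-3.2-3b-Q4_K_M.gguf \
  --ctx-size 2048 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -p "Equipment question..."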
Memory levers specific to tight edge budgets:
| Lever | Effect |
|---|---|
| 4-bit weights (Q4_K_M) | The baseline; ~¼ of FP16 |
| Quantized KV cache | 8-bit/4-bit KV roughly halves/quarters cache memory relative to FP16 |
| Short context (1–2K) | KV memory scales linearly with context length |
| Smaller model | The biggest lever — drop 3B→1B if 3B won't fit |
| 3-bit (only if forced) | Last resort; small models degrade fast below 4-bit |
Note the asymmetry with desktop: on edge you'd rather drop to a smaller model at clean 4-bit than quantize a bigger one to 3-bit, because small models degrade badly below 4-bit. Model size is the lever, not extreme bit-width.
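To see how the KV-cache and context levers interact, the cache size can be estimated from the model shape. The sketch below assumes a Llama-3.2-3B-style configuration (28 layers, 8 KV heads, head dimension 128); treat those values as assumptions and check them against the model's config.json. It also assumes q8_0 costs about 8.5 bits per element (32 int8 values plus one 16-bit scale per block).

```python
MIB = 1024 ** 2


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    ctx_len: int,
    bits_per_elem: float,
) -> float:
    """Approximate KV-cache size: one K and one V vector per layer per token."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim
    return elems_per_token * ctx_len * bits_per_elem / 8


# Assumed shape for a Llama-3.2-3B-style model; verify against config.json.
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128

for ctx in (2048, 4096):
    f16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx, 16) / MIB
    q8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx, 8.5) / MIB
    print(f"ctx={ctx}: f16 ~{f16:.0f} MiB, q8_0 ~{q8:.0f} MiB")
```

Under these assumptions the f16 cache at 2K context comes to 224 MiB and the q8_0 cache to about 119 MiB, and doubling the context doubles both.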
Step 4 — Recover quality with task specialization
A 3B model won't match a 70B as a generalist — but FieldOps doesn't need a generalist. The highest-leverage move on edge is fine-tuning the small model for the narrow task (often with LoRA, then merged and quantized). This frequently lets a 3B match or beat a much larger generic model on the specific workload, which is all that matters on-device.
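A small worked example may help show why LoRA is cheap to train and why it can be merged before quantization. The sketch below uses the standard LoRA update, W' = W + (alpha / r) * B * A, on tiny hand-written matrices in pure Python. The hidden size of 3072 and the rank of 16 are assumed values for illustration; a real fine-tune would use a training library rather than this code.

```python
Matrix = list[list[float]]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def merge_lora(W: Matrix, A: Matrix, B: Matrix, alpha: float, rank: int) -> Matrix:
    """Fold the adapter into the base weight: W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    d_out, d_in = len(W), len(W[0])
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(d_in)
        ]
        for i in range(d_out)
    ]


# Assumed layer size and rank, for illustration only.
d, r = 3072, 16
adapter = lora_param_count(d, d, r)
print(f"adapter params: {adapter} vs full matrix: {d * d} "
      f"({100 * adapter / (d * d):.2f}%)")

# Tiny merge example: a 2x2 identity weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank x d_in
B = [[1.0], [0.0]]    # d_out x rank
merged = merge_lora(W, A, B, alpha=2.0, rank=1)
print(merged)
```

After merging, the result is an ordinary dense weight matrix, so it can go through the same GGUF conversion and 4-bit quantization path as the base model.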
Combine with context engineering: a tight, well-structured prompt and on-device retrieval over a small local knowledge base let the small model punch above its weight without growing the model. (See the context-engineering and LoRA posts for the techniques.)
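As a rough illustration of on-device retrieval feeding a tight prompt, the sketch below ranks a tiny local knowledge base by keyword overlap and builds a compact prompt from the top matches. Keyword overlap is a deliberately simple stand-in chosen here for brevity (a real system might use embeddings or BM25), and the knowledge-base entries and prompt wording are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, as a set for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return up to k docs sharing at least one token with the query, best first."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return [d for d in ranked[:k] if q & tokenize(d)]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Assemble a short prompt that keeps the context window small."""
    notes = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Use only the notes below.\n{notes}\n\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge-base entries for illustration.
knowledge_base = [
    "Pump P-101 requires a seal inspection every 500 operating hours.",
    "Error code E42 on the compressor indicates low oil pressure.",
    "The conveyor belt tension is checked weekly.",
]

prompt = build_prompt("What does error code E42 mean?", knowledge_base)
print(prompt)
```

Capping `k` keeps the prompt short, which matters on a device where the context length is already bounded to limit KV-cache memory.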
Step 5 — Validate on-device, not just on your laptop
Edge performance is hardware-specific. Before shipping:
- Measure real memory use on the target device — laptop RAM headroom lies about phone/embedded limits.
- Measure tokens/sec on the target — CPU/NPU throughput varies wildly across SoCs.
- Run your task eval — confirm the quantized, fine-tuned small model passes the accuracy bar on real inputs.
- Test thermals and battery — sustained inference on a handheld throttles; a model that's fast for ten seconds may crawl after a minute of load.
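The throughput and sustained-load checks above can be scripted with a small timing harness. This is a sketch under stated assumptions: `generate` is a hypothetical callable that yields tokens from whatever runtime binding you use on the device, and the stub below only simulates one so the example runs anywhere.

```python
import time
from typing import Callable, Iterable


def measure_windows(
    generate: Callable[[], Iterable[str]],
    n_windows: int,
    clock: Callable[[], float] = time.perf_counter,
) -> list[float]:
    """Run generate() n_windows times and return tokens/sec for each run."""
    rates = []
    for _ in range(n_windows):
        start = clock()
        n_tokens = sum(1 for _tok in generate())
        elapsed = clock() - start
        rates.append(n_tokens / elapsed if elapsed > 0 else float("inf"))
    return rates


def throttling_ratio(rates: list[float]) -> float:
    """Last-window throughput divided by first-window throughput."""
    return rates[-1] / rates[0]


def fake_generate() -> Iterable[str]:
    """Stand-in for a real on-device generation call (hypothetical)."""
    for i in range(20):
        time.sleep(0.001)  # simulate per-token latency
        yield f"tok{i}"


rates = measure_windows(fake_generate, n_windows=3)
print([f"{r:.0f} tok/s" for r in rates])
print(f"last/first ratio: {throttling_ratio(rates):.2f}")
```

On real hardware, run enough windows to cover a minute or more of continuous load; a ratio that falls well below 1.0 suggests thermal throttling.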
The result (illustrative)
For a FieldOps-style deployment, the converged configuration looks like:
| Decision | Choice | Why |
|---|---|---|
| Model | 3B instruct (e.g., Llama 3.2 3B) | Capable for scoped tasks, ~1.5–2 GB at 4-bit |
| Quantization | Q4_K_M GGUF (imatrix) | Minimal-loss 4-bit, fits budget |
| Runtime | llama.cpp (or MLC/ONNX for NPU) | Broad device support |
| KV cache | 8-bit, 2K context | Bounds runtime memory under the ceiling |
| Quality recovery | LoRA fine-tune on task data | Closes the gap to larger generalists |
| Total footprint | ~2–3 GB | Fits under 4 GB with OS/app headroom |
The lesson FieldOps internalized: the budget was met by choosing a smaller model, not by quantizing a bigger one harder — and the quality bar was met by specializing the small model, not by adding parameters.
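The total-footprint row can be sanity-checked by summing the components against the device budget. The numbers below are illustrative assumptions, not measurements: the OS and app reservation and the runtime overhead in particular vary by platform and should be replaced with figures measured on the target device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    device_gb: float  # total memory on the device
    os_app_gb: float  # assumed reservation for the OS and the app


def fits(
    weights_gb: float, kv_gb: float, runtime_overhead_gb: float, budget: Budget
) -> tuple[bool, float]:
    """Return (fits?, headroom in GB) for the inference footprint."""
    total = weights_gb + kv_gb + runtime_overhead_gb
    headroom = budget.device_gb - budget.os_app_gb - total
    return headroom >= 0, headroom


budget = Budget(device_gb=4.0, os_app_gb=1.0)

# Illustrative component sizes for a 3B and an 8B model at 4-bit.
configs = {
    "3B Q4, 2K ctx, 8-bit KV": (1.9, 0.12, 0.4),
    "8B Q4, 2K ctx, 8-bit KV": (4.5, 0.14, 0.4),
}

for label, (weights, kv, overhead) in configs.items():
    ok, headroom = fits(weights, kv, overhead, budget)
    print(f"{label}: {'fits' if ok else 'does not fit'} ({headroom:+.2f} GB headroom)")
```

With these assumed figures the 3B configuration leaves positive headroom while the 8B one overruns the budget on weights alone.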
Frequently asked questions
What size LLM can run under 4 GB of VRAM? A 1–4B parameter model at 4-bit. A 3B at 4-bit needs about 1.5–2 GB for weights, leaving room for the KV cache, activations, and OS — making it the practical sweet spot for a 4 GB budget. A 1B model (~0.5 GB) fits very comfortably for the tightest devices. A 7–8B model at 4-bit (~4–4.5 GB) overruns a 4 GB device once overhead is included.
Why not just run a 7B at 2-bit on an edge device? Because extreme quantization needs large models to work. A 7B at 2-bit is badly degraded — small and mid-size models lose far more accuracy below 4-bit than large ones do. You can't compensate for a tiny memory budget by quantizing a bigger model harder; you'd just break it. The reliable path is a smaller model at clean 4-bit.
Which runtime should I use for on-device LLMs? llama.cpp with a GGUF model is the pragmatic default — it runs on ARM/x86 CPUs, mobile GPUs, and some NPUs, and supports aggressive weight and KV-cache quantization. For NPU acceleration or tighter mobile integration, consider MLC LLM (compiled GPU inference), ExecuTorch (PyTorch on-device), ONNX Runtime (cross-platform with NPU support), or MediaPipe LLM Inference on Android/iOS.
How do I make a small edge model accurate enough? Specialize it. Fine-tune the small model (e.g., with LoRA) on your specific task, which often lets a 3B match or beat a much larger generic model on that workload. Pair this with context engineering — a tight prompt and on-device retrieval over a small local knowledge base — so the model gets the right information without growing in size. Edge tasks are usually narrow, which plays to a specialized small model's strengths.
How do I reduce memory beyond 4-bit weights? Quantize the KV cache (8-bit or 4-bit), cap the context length (KV memory grows linearly with it — 2K uses half of 4K), and, most importantly, drop to a smaller model if needed. On edge, moving from 3B to 1B at clean 4-bit is usually better than keeping 3B and dropping to 3-bit, because small models degrade quickly below 4-bit.
Do I need to test on the actual device? Yes — always. A laptop's spare RAM and CPU don't reflect a phone or embedded board's real limits, throughput, thermals, or battery behavior. Measure memory use, tokens/sec, and task accuracy on the target hardware, and test sustained load, since handheld devices throttle when they heat up and a model that's fast briefly may slow under continuous use.
Key takeaways
- Under 4 GB, the answer is a small model (1–4B) at 4-bit, not a big model at an extremely low bit-width; a 3B at 4-bit (~1.5–2 GB) is the sweet spot.
- Extreme quantization needs large models; small models degrade fast below 4-bit, so shrink the model, not the bit-width.
- Use an edge runtime — llama.cpp/GGUF as the default, or MLC/ExecuTorch/ONNX Runtime for NPU acceleration.
- Bound runtime memory with a quantized KV cache and a short context (1–2K).
- Recover quality by specializing — LoRA fine-tuning plus context engineering let a small model match larger generalists on a narrow task.
- Validate on the actual device — real memory, throughput, thermals, and battery, not laptop estimates.
References
- llama.cpp — GGUF quantization, KV-cache quantization, edge/ARM support. https://github.com/ggml-org/llama.cpp
- MLC LLM — universal on-device LLM deployment. https://github.com/mlc-ai/mlc-llm
- PyTorch ExecuTorch — on-device inference runtime. https://pytorch.org/executorch/
- ONNX Runtime — cross-platform inference with NPU acceleration. https://onnxruntime.ai/
- Google. MediaPipe LLM Inference — on-device LLMs for Android/iOS. https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference
- Meta. Llama 3.2 (1B/3B) model card. https://github.com/meta-llama/llama-models