Quantization for Edge Devices: LLMs Under 4 GB VRAM

A case study on shrinking LLMs to run under 4 GB of VRAM for edge deployment — which quantization methods survive the squeeze and where quality falls apart.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 14 min read

Quick answer: Under a 4 GB memory budget, the winning strategy is a small model (1–4B parameters) quantized to 4-bit, not a large model squeezed to an extremely low bit-width. A 3B model at 4-bit needs only ~1.5–2 GB for weights, leaving room for the KV cache and the rest of the system, whereas a 7–8B model at 4-bit (~4–4.5 GB) won't fit a 4 GB device once you account for overhead. Use an edge-friendly runtime (llama.cpp/GGUF, MLC LLM, ExecuTorch, or ONNX Runtime) that runs on ARM CPUs, mobile GPUs, and NPUs; pick a strong small model (Llama 3.2 1B/3B, Gemma 2B, Phi-family, Qwen 2.5 0.5–3B); quantize the KV cache and cap context to fit the memory ceiling; and fine-tune for your specific task to recover the quality that the small size costs you. The mantra for edge: smallest capable model + 4-bit + task specialization, not biggest model + most aggressive quantization.


The scenario

Consider a team — call them FieldOps — building an offline voice assistant for a handheld industrial device. Constraints that define the problem:

  • ≤4 GB memory shared between the model, the OS, and the app (typical for mid-range phones, Jetson Nano-class boards, and embedded SoCs with unified memory).
  • No reliable network — inference must run fully on-device, so a cloud API is off the table.
  • Modest, well-scoped task — answer questions about equipment, parse commands, summarize logs. Not open-ended general intelligence.
  • Acceptable latency — a few tokens per second is fine; this isn't a datacenter.

The instinct is to cram in the biggest model that "fits." That instinct is wrong for edge. The right question is: what is the smallest model that does the job well? Every gigabyte and every millisecond is scarce.


Why "small model + 4-bit" beats "big model + 2-bit" on edge

On a 24 GB desktop GPU, pushing a 70B to 2-bit can make sense (large models tolerate it, and capability is the goal). On a <4 GB edge device the calculus inverts:

  1. The memory ceiling is brutal. Even a 7B at 4-bit (~4 GB weights alone) overruns a 4 GB device once you add the KV cache, activations, and OS. A 1–4B model leaves breathing room.
  2. Extreme quantization needs big models. 2-bit only works tolerably on large models; a small model at 2-bit is wrecked. So you can't compensate for the budget by quantizing a mid-size model harder — it just breaks.
  3. The task is narrow. A well-scoped, fine-tuned small model often matches a giant generalist on that task. Edge workloads are usually specific, which plays to small models' strengths.

The edge rule: choose the smallest model that passes your task eval, quantize it to a quality-preserving 4-bit, and spend your remaining effort on fine-tuning — not on fitting a bigger model.
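
The edge rule can be sketched as a small selection loop. This is an illustration under stated assumptions, not a prescribed procedure: the `Candidate` type, the eval scores, the 1.5 GB reserve for KV cache, OS, and app, and the 4.5 effective bits per weight are all hypothetical placeholders that you would replace with your own measurements.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    params_b: float    # parameter count, in billions
    eval_score: float  # task-eval score in [0, 1]; placeholder values below


def pick_smallest_passing(candidates, bar, budget_gb, reserved_gb, bits_per_weight=4.5):
    """Return the smallest candidate that passes the eval bar and fits, or None."""
    for c in sorted(candidates, key=lambda c: c.params_b):
        # billions of params * bits / 8 bits-per-byte = GB of weights
        weights_gb = c.params_b * bits_per_weight / 8
        if c.eval_score >= bar and weights_gb + reserved_gb <= budget_gb:
            return c
    return None


# Hypothetical scores, for illustration only.
candidates = [
    Candidate("1B", 1.0, 0.71),
    Candidate("3B", 3.0, 0.88),
    Candidate("8B", 8.0, 0.93),
]

choice = pick_smallest_passing(candidates, bar=0.85, budget_gb=4.0, reserved_gb=1.5)
print(choice.name if choice else "nothing fits")
```

Iterating smallest-first means the loop stops at the first model that is good enough, which is the point of the rule: a larger model is never considered once a smaller one clears the bar.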


Step 1 — Pick the right small model

The 1–4B class has become genuinely capable. Strong candidates for under-4 GB deployment:

Model family          Sizes      Notes
Llama 3.2             1B, 3B     Strong small instruct models, broad support
Gemma 2 / 3           ~2B        Compact, good quality, mobile-friendly
Phi family            ~1.5–4B    Punch above their weight on reasoning
Qwen 2.5              0.5B–3B    Wide size range, multilingual
SmolLM / TinyLlama    <1.5B      For the tightest budgets

Memory at 4-bit (weights only), to match against your budget:

Model size    4-bit weights    Fits under 4 GB? (with KV + OS overhead)
1B            ~0.5 GB          Easily
3B            ~1.5–2 GB        Yes, the sweet spot
7–8B          ~4–4.5 GB        No, overruns once overhead is added

The 3B-at-4-bit slot (~1.5–2 GB) is the edge sweet spot: capable enough for scoped tasks, small enough to leave room for everything else.
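
The weight figures above follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below reproduces the ranges. The 4.0 figure is the nominal bit-width, and the 5.0 upper end is an assumption standing in for per-block scales and the tensors that 4-bit schemes keep at higher precision; actual GGUF file sizes will vary by model and scheme.

```python
def weight_gb(params_billion, bits_per_weight):
    """Weight memory in GB (1 GB = 1e9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


# 4.0 = nominal 4-bit; 5.0 = assumed upper bound on effective bits per weight.
for size in (1, 3, 7, 8):
    low, high = weight_gb(size, 4.0), weight_gb(size, 5.0)
    print(f"{size}B: {low:.2f} to {high:.2f} GB")
```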


Step 2 — Choose an edge runtime

Desktop and server GPU runtimes (vLLM, ExLlamaV2) don't target phones and embedded boards. Use a runtime built for edge hardware:

Runtime             Best for                                            Hardware reach
llama.cpp / GGUF    The default: broad device support, easy quant       ARM/x86 CPU, mobile GPU (Metal/Vulkan), some NPU
MLC LLM             Compiled, GPU-accelerated mobile inference          iOS/Android GPUs, WebGPU
ExecuTorch          PyTorch's on-device runtime                         Mobile CPU/GPU/NPU
ONNX Runtime        Cross-platform, NPU acceleration                    CPU, mobile, NPUs (QNN, etc.)
MediaPipe LLM       Turnkey on-device LLM on Android/iOS                Mobile CPU/GPU

For most edge projects, llama.cpp with a GGUF model is the pragmatic starting point — it runs on nearly anything, supports aggressive quantization and KV-cache quantization, and has the widest device coverage. Move to MLC LLM / ExecuTorch / ONNX Runtime when you need NPU acceleration or tighter platform integration.


Step 3 — Quantize to fit, without wrecking quality

Target 4-bit with a quality-preserving scheme; drop below 4 bits only if the model truly won't fit otherwise.

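# Convert and quantize a 3B model to a high-quality 4-bit GGUF.
# By default, convert_hf_to_gguf.py expects a local directory holding the
# downloaded meta-llama/Llama-3.2-3B-Instruct checkpoint, not a Hub repo ID.
python convert_hf_to_gguf.py ./Llama-3.2-3B-Instruct \
    --outfile llama-3.2-3b-f16.gguf --outtype f16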

# Use an importance matrix for minimal-loss 4-bit
./llama-imatrix -m llama-3.2-3b-f16.gguf -f calibration.txt -o m.imatrix
./llama-quantize --imatrix m.imatrix \
    llama-3.2-3b-f16.gguf llama-3.2-3b-Q4_K_M.gguf Q4_K_M
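
To get a feel for why bit-width matters, here is a deliberately simplified sketch of block quantization: symmetric absmax rounding of one block of weights to n-bit integers, then dequantizing and measuring the worst-case error. It is not the Q4_K_M algorithm (which uses a more elaborate block layout, and the importance matrix to weight errors), and the `block` values are made up for illustration.

```python
def quantize_block(block, bits):
    """Symmetric absmax quantization of one block to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 3 for 3-bit, 1 for 2-bit
    scale = max(abs(x) for x in block) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return q, scale


def dequantize_block(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def max_error(block, bits):
    """Largest absolute difference between original and round-tripped weights."""
    q, scale = quantize_block(block, bits)
    restored = dequantize_block(q, scale)
    return max(abs(a - b) for a, b in zip(block, restored))


# Made-up weights for one block.
block = [0.9, -0.42, 0.13, 0.07, -0.88, 0.31, -0.05, 0.6]
for bits in (8, 4, 3, 2):
    print(f"{bits}-bit max error: {max_error(block, bits):.4f}")
```

Even in this toy version the error grows sharply as the bit-width drops: at 2-bit most mid-range weights collapse to zero, which gives some intuition for why the sub-4-bit settings are treated as a last resort.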

Then trim runtime memory so it fits the device:

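# --ctx-size 2048: short context, since KV memory is linear in length
# --cache-type-k / --cache-type-v q8_0: quantized KV cache
# (a quantized V cache may require flash attention to be enabled, depending on the llama.cpp build)
./llama-cli -m llama-3.2-3b-Q4_K_M.gguf \
    --ctx-size 2048 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    -p "Equipment question..."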

Memory levers specific to tight edge budgets:

Lever                     Effect
4-bit weights (Q4_K_M)    The baseline; roughly ¼ to ⅓ the size of FP16
Quantized KV cache        8-bit/4-bit KV roughly halves/quarters cache memory relative to FP16
Short context (1–2K)      KV memory scales linearly with context length
Smaller model             The biggest lever: drop 3B→1B if 3B won't fit
3-bit (only if forced)    Last resort; small models degrade fast below 4-bit

Note the asymmetry with desktop: on edge you'd rather drop to a smaller model at clean 4-bit than quantize a bigger one to 3-bit, because small models degrade badly below 4-bit. Model size is the lever, not extreme bit-width.
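
The KV-cache levers can be estimated with a back-of-the-envelope formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a layer count, KV-head count, and head dimension for a Llama-3.2-3B-class model; treat those numbers as assumptions and read the real ones from your model's config. The 8.5 and 4.5 bits per element reflect the q8_0 and q4_0 block layouts (32 values plus one 16-bit scale per block).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bits_per_elem):
    """Bytes for keys plus values across all layers at a given context length."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # 2 = K and V
    return elems * bits_per_elem / 8


# Assumed shape for a Llama-3.2-3B-class model; verify against the model config.
shape = dict(n_layers=28, n_kv_heads=8, head_dim=128)

for ctx in (2048, 4096):
    for label, bits in (("f16", 16), ("q8_0", 8.5), ("q4_0", 4.5)):
        mib = kv_cache_bytes(ctx_len=ctx, bits_per_elem=bits, **shape) / 2**20
        print(f"ctx={ctx} {label}: {mib:.0f} MiB")
```

Under these assumptions the cache is a few hundred MiB at most, so it is a secondary lever next to model size, but on a 4 GB device that margin can still decide whether the app survives alongside the OS.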


Step 4 — Recover quality with task specialization

A 3B model won't match a 70B as a generalist — but FieldOps doesn't need a generalist. The highest-leverage move on edge is fine-tuning the small model for the narrow task (often with LoRA, then merged and quantized). This frequently lets a 3B match or beat a much larger generic model on the specific workload, which is all that matters on-device.
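
The "merged" step is worth making concrete. LoRA trains two small matrices, B and A, whose scaled product is added to a frozen weight matrix; merging folds that product back into the weights so the deployed model has no extra layers, and the result can then be quantized like any other checkpoint. The toy example below uses plain Python lists and made-up 2×2 values; in practice a fine-tuning library performs this per layer on real tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Made-up values: a 2x2 base weight and a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, rank)
lora_a = [[0.5, -1.0]]    # shape (rank, 2)

merged = merge_lora(w, lora_b, lora_a, alpha=2.0, rank=1)
print(merged)
```

Merging before quantization matters on edge: the device then loads a single quantized file and pays no runtime cost for the adapter.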

Combine with context engineering: a tight, well-structured prompt and on-device retrieval over a small local knowledge base let the small model punch above its weight without growing the model. (See the context-engineering and LoRA posts for the techniques.)
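
As a rough illustration of on-device retrieval, the sketch below ranks a handful of local notes by word overlap with the question and builds a compact prompt from the best match. Everything here is hypothetical: the knowledge-base entries, the prompt wording, and the overlap scoring, which stands in for whatever retrieval method (keyword index, small embedding model) fits the device.

```python
import re


def tokens(text):
    """Lowercased alphanumeric word set for crude overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, k=2):
    """Rank docs by word overlap with the query; return the top k with any overlap."""
    q = tokens(query)
    scored = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in scored[:k] if q & tokens(d)]


def build_prompt(query, docs, k=2):
    """Assemble a short prompt: instruction, retrieved notes, then the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only the notes below.\nNotes:\n{context}\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge-base entries.
KNOWLEDGE_BASE = [
    "Pump P-101 error E42 means the inlet pressure is below threshold.",
    "Conveyor belt C-7 requires lubrication every 200 operating hours.",
    "Error E17 on valve V-3 indicates a stuck actuator.",
]

question = "What does error E42 on pump P-101 mean?"
print(build_prompt(question, KNOWLEDGE_BASE, k=1))
```

Keeping `k` small serves the memory budget as well as quality: every retrieved note consumes context, and context is what the KV cache pays for.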


Step 5 — Validate on-device, not just on your laptop

Edge performance is hardware-specific. Before shipping:

  • Measure real memory use on the target device — laptop RAM headroom lies about phone/embedded limits.
  • Measure tokens/sec on the target — CPU/NPU throughput varies wildly across SoCs.
  • Run your task eval — confirm the quantized, fine-tuned small model passes the accuracy bar on real inputs.
  • Test thermals and battery — sustained inference on a handheld throttles; a model that's fast for ten seconds may crawl after a minute of load.
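
A sustained-load check along these lines can be scripted. The sketch below measures tokens per second over successive windows and flags a run whose last window is markedly slower than its first. The `generate` callable is a hypothetical stand-in for your runtime's streaming API, `fake_generate` exists only so the example runs anywhere, and the 0.8 tolerance is an arbitrary assumption. Note that the first window also includes prompt processing time.

```python
import time


def measure_tokens_per_sec(generate, n_tokens, window, clock=time.perf_counter):
    """Consume n_tokens from generate(); return tokens/sec for each full window."""
    rates = []
    start = clock()
    for i, _ in enumerate(generate(n_tokens), start=1):
        if i % window == 0:
            now = clock()
            rates.append(window / (now - start))
            start = now
    return rates


def throttled(rates, tolerance=0.8):
    """True if the last window is slower than `tolerance` times the first."""
    return len(rates) >= 2 and rates[-1] < tolerance * rates[0]


def fake_generate(n):
    """Stand-in for a real runtime's token stream (hypothetical)."""
    for i in range(n):
        time.sleep(0.001)
        yield i


rates = measure_tokens_per_sec(fake_generate, n_tokens=40, window=10)
print([round(r) for r in rates], "throttled:", throttled(rates))
```

On real hardware, run this for minutes rather than seconds, and record memory alongside throughput so a slow leak or thermal throttle shows up in the same log.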

The result (illustrative)

For a FieldOps-style deployment, the converged configuration looks like:

Decision            Choice                             Why
Model               3B instruct (e.g., Llama 3.2 3B)   Capable for scoped tasks, ~1.5–2 GB at 4-bit
Quantization        Q4_K_M GGUF (imatrix)              Minimal-loss 4-bit, fits budget
Runtime             llama.cpp (or MLC/ONNX for NPU)    Broad device support
KV cache            8-bit, 2K context                  Bounds runtime memory under the ceiling
Quality recovery    LoRA fine-tune on task data        Closes the gap to larger generalists
Total footprint     ~2–3 GB                            Fits under 4 GB with OS/app headroom

The lesson FieldOps internalized: the budget was met by choosing a smaller model, not by quantizing a bigger one harder — and the quality bar was met by specializing the small model, not by adding parameters.


Frequently asked questions

What size LLM can run under 4 GB of VRAM?
A 1–4B parameter model at 4-bit. A 3B at 4-bit needs about 1.5–2 GB for weights, leaving room for the KV cache, activations, and OS, which makes it the practical sweet spot for a 4 GB budget. A 1B model (~0.5 GB) fits very comfortably on the tightest devices. A 7–8B model at 4-bit (~4–4.5 GB) overruns a 4 GB device once overhead is included.

Why not just run a 7B at 2-bit on an edge device?
Because extreme quantization needs large models to work. A 7B at 2-bit is badly degraded: small and mid-size models lose far more accuracy below 4-bit than large ones do. You can't compensate for a tiny memory budget by quantizing a bigger model harder; you'd just break it. The reliable path is a smaller model at clean 4-bit.

Which runtime should I use for on-device LLMs?
llama.cpp with a GGUF model is the pragmatic default: it runs on ARM/x86 CPUs, mobile GPUs, and some NPUs, and supports aggressive weight and KV-cache quantization. For NPU acceleration or tighter mobile integration, consider MLC LLM (compiled GPU inference), ExecuTorch (PyTorch on-device), ONNX Runtime (cross-platform with NPU support), or MediaPipe LLM Inference on Android/iOS.

How do I make a small edge model accurate enough?
Specialize it. Fine-tune the small model (e.g., with LoRA) on your specific task, which often lets a 3B match or beat a much larger generic model on that workload. Pair this with context engineering (a tight prompt and on-device retrieval over a small local knowledge base) so the model gets the right information without growing in size. Edge tasks are usually narrow, which plays to a specialized small model's strengths.

How do I reduce memory beyond 4-bit weights?
Quantize the KV cache (8-bit or 4-bit), cap the context length (KV memory grows linearly with it, so 2K uses half of what 4K does), and, most importantly, drop to a smaller model if needed. On edge, moving from 3B to 1B at clean 4-bit is usually better than keeping 3B and dropping to 3-bit, because small models degrade quickly below 4-bit.

Do I need to test on the actual device?
Yes, always. A laptop's spare RAM and CPU don't reflect a phone or embedded board's real limits, throughput, thermals, or battery behavior. Measure memory use, tokens/sec, and task accuracy on the target hardware, and test sustained load, since handheld devices throttle when they heat up and a model that's fast briefly may slow under continuous use.


Key takeaways

  • Under 4 GB, the answer is a small model (1–4B) at 4-bit, not a big model at an extremely low bit-width; a 3B at 4-bit (~1.5–2 GB) is the sweet spot.
  • Extreme quantization needs large models; small models degrade fast below 4-bit, so shrink the model, not the bit-width.
  • Use an edge runtime — llama.cpp/GGUF as the default, or MLC/ExecuTorch/ONNX Runtime for NPU acceleration.
  • Bound runtime memory with a quantized KV cache and a short context (1–2K).
  • Recover quality by specializing — LoRA fine-tuning plus context engineering let a small model match larger generalists on a narrow task.
  • Validate on the actual device — real memory, throughput, thermals, and battery, not laptop estimates.

References

  1. llama.cpp — GGUF quantization, KV-cache quantization, edge/ARM support. https://github.com/ggml-org/llama.cpp
  2. MLC LLM — universal on-device LLM deployment. https://github.com/mlc-ai/mlc-llm
  3. PyTorch ExecuTorch — on-device inference runtime. https://pytorch.org/executorch/
  4. ONNX Runtime — cross-platform inference with NPU acceleration. https://onnxruntime.ai/
  5. Google. MediaPipe LLM Inference — on-device LLMs for Android/iOS. https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference
  6. Meta. Llama 3.2 (1B/3B) model card. https://github.com/meta-llama/llama-models