Quantization

Shrinking models with INT4/INT8, AWQ, GPTQ, and GGUF — fitting bigger models on smaller hardware without losing the plot on accuracy.

llm, quantization, smoothquant

SmoothQuant: What Activation-Aware Quantization Fixes

Naive INT8 quantization drops OPT-175B accuracy from 71.6% to 32.3%. SmoothQuant fixes that - without retraining - by migrating quantization difficulty from activations to weights via a mathematically equivalent transform.

Mohammed Kafeel

12 min read
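
As a rough illustration of the "mathematically equivalent transform" mentioned in the teaser above, the sketch below divides each activation channel by a smoothing factor and multiplies the matching weight row by the same factor, so the layer output is unchanged while activation outliers shrink. The per-channel factor follows the form given in the SmoothQuant paper; the function name `smooth`, the toy shapes, and the `alpha=0.5` default are assumptions made for this example, not code from the article.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate quantization difficulty from activations to weights.

    X: activations, shape (tokens, in_features)
    W: weights, shape (in_features, out_features)
    alpha: migration strength (assumed default for this sketch)
    """
    act_max = np.abs(X).max(axis=0)   # per-input-channel activation range
    w_max = np.abs(W).max(axis=1)     # per-input-channel weight range
    s = act_max**alpha / w_max**(1 - alpha)
    # Scale activations down and weights up by the same per-channel factor.
    return X / s, W * s[:, None], s

# Toy data with one outlier activation channel (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
X[:, 2] *= 100.0
W = rng.normal(size=(4, 3))

X_s, W_s, s = smooth(X, W)

# The product is preserved up to floating-point error.
print(np.allclose(X @ W, X_s @ W_s))
# The largest activation magnitude is smaller after smoothing.
print(np.abs(X).max(), np.abs(X_s).max())
```

Because the scaling cancels inside the matrix product, no retraining is involved; the smoothed weights can be computed offline, and quantization is then applied to `X_s` and `W_s` instead of the originals.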

llm, quantization, edge

Quantization for Edge Devices: LLMs Under 4 GB VRAM

A complete technical guide to running LLMs under 4 GB VRAM using quantization. Covers GGUF, GPTQ, AWQ, Bitsandbytes, real model sizes, benchmarks, and a step-by-step Ollama walkthrough.

Mohammed Kafeel

18 min read

llm, quantization, llama

How to Quantize Llama 3 to 4-Bit With Minimal Accuracy Loss

A complete, code-first guide to quantizing Llama 3 to 4-bit using bitsandbytes NF4, AWQ, GPTQ, and GGUF - with real VRAM numbers, MMLU benchmarks, and tips to keep accuracy loss under 2%.

Mohammed Kafeel

16 min read

llm, quantization, optimization

LLM Quantization Explained: INT4 vs INT8 vs FP16

A 70B model needs 140 GB of GPU RAM in FP16. Most teams don't have that. This guide breaks down LLM quantization - INT4 vs INT8 vs FP16 - with real memory numbers, speed benchmarks, and a practical decision framework for deploying LLMs efficiently.

Mohammed Kafeel

12 min read
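
The 140 GB figure in the teaser above comes from simple arithmetic: parameter count times bytes per weight. A minimal sketch of that calculation follows; it counts weights only and ignores KV cache, activations, and any per-group scale overhead that real quantization formats add. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8

# A 70B-parameter model at the three precisions named in the title.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70, bits):.0f} GB")
```

Actual memory use is typically somewhat higher than these numbers, since runtime buffers and quantization metadata are not included here.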

llm, quantization, gguf

GGUF vs AWQ vs GPTQ: Which Quantization Format to Use?

GGUF, AWQ, and GPTQ compress LLMs to run on less hardware - but each format wins in a different scenario. Here's the data-backed decision framework you need.

Mohammed Kafeel

14 min read

llm, quantization, awq

AWQ vs GPTQ: What the Quantization Benchmarks Show

AWQ and GPTQ are the two dominant 4-bit quantization methods for LLMs - but the benchmarks tell a more nuanced story than most comparisons admit. Here's what the data actually shows.

Mohammed Kafeel

13 min read

llm, quantization, optimization

LLM Quantization: 2-Bit vs 4-Bit vs 8-Bit Accuracy & Memory

Standard 2-bit quantization destroys model accuracy - perplexity spikes to 38,000+. Here's what the benchmarks actually say about 2-bit vs 4-bit vs 8-bit, and how to pick the right level for your hardware and task.

Mohammed Kafeel

17 min read

llm, quantization, gpu

Run 70B Models on a Single RTX 4090 With 4-Bit Quantization

A single RTX 4090 can run 70B-parameter models with 4-bit quantization. Here's the exact VRAM math, benchmark numbers, and four complete setup methods: llama.cpp, Ollama, AutoAWQ, and ExLlamaV2.

Mohammed Kafeel

13 min read