Quantization

Shrinking models with INT4/INT8, AWQ, GPTQ, and GGUF — fitting bigger models on smaller hardware without losing the plot on accuracy.

llm, quantization, gpu

Run 70B Models on a Single RTX 4090 With 4-Bit Quantization

A how-to for fitting a 70B-parameter model onto one 24 GB RTX 4090 using aggressive 4-bit and 2-bit quantization — what works, what breaks, and the accuracy cost.

Mohammed Kafeel
14 min read
llm, quantization, awq

AWQ vs GPTQ: What the Quantization Benchmarks Show

A benchmark-driven comparison of AWQ and GPTQ post-training quantization — accuracy, speed, and memory — so you can pick the right method instead of guessing.

Mohammed Kafeel
13 min read
llm, quantization, optimization

LLM Quantization: 2-Bit vs 4-Bit vs 8-Bit Accuracy & Memory

How accuracy degrades and memory shrinks as you drop from 8-bit to 4-bit to 2-bit quantization, and how to find the sweet spot for your model and hardware.

Mohammed Kafeel
14 min read
llm, quantization, gguf

GGUF vs AWQ vs GPTQ: Which Quantization Format to Use?

A practical guide to choosing between GGUF, AWQ, and GPTQ quantization formats based on your runtime, hardware, and accuracy needs.

Mohammed Kafeel
13 min read
llm, quantization, optimization

LLM Quantization Explained: INT4 vs INT8 vs FP16

A beginner's guide to LLM quantization: how INT4, INT8, and FP16 trade memory for quality, and the rule of thumb for sizing a model to your GPU.

Mohammed Kafeel
12 min read
llm, quantization, edge

Quantization for Edge Devices: LLMs Under 4 GB VRAM

A case study on shrinking LLMs to run under 4 GB of VRAM for edge deployment — which quantization methods survive the squeeze and where quality falls apart.

Mohammed Kafeel
14 min read
llm, quantization, llama

How to Quantize Llama 3 to 4-Bit With Minimal Accuracy Loss

A step-by-step guide to quantizing Llama 3 to 4-bit precision while keeping accuracy loss minimal — calibration data, method choice, and verification.

Mohammed Kafeel
14 min read
llm, quantization, smoothquant

SmoothQuant: What Activation-Aware Quantization Fixes

Why naive INT8 breaks on large models, and how SmoothQuant and activation-aware methods like AWQ recover near-FP16 accuracy.

Mohammed Kafeel
12 min read