Quantization
Shrinking models with INT4/INT8, AWQ, GPTQ, and GGUF — fitting bigger models on smaller hardware without losing the plot on accuracy.
Run 70B Models on a Single RTX 4090 With 4-Bit Quantization
A how-to for fitting a 70B-parameter model onto one 24 GB RTX 4090 using aggressive 4-bit and 2-bit quantization — what works, what breaks, and the accuracy cost.
AWQ vs GPTQ: What the Quantization Benchmarks Show
A benchmark-driven comparison of AWQ and GPTQ post-training quantization — accuracy, speed, and memory — so you can pick the right method instead of guessing.
LLM Quantization: 2-Bit vs 4-Bit vs 8-Bit Accuracy & Memory
How accuracy degrades and memory shrinks as you drop from 8-bit to 4-bit to 2-bit quantization, and how to find the sweet spot for your model and hardware.
GGUF vs AWQ vs GPTQ: Which Quantization Format to Use?
A practical guide to choosing between GGUF, AWQ, and GPTQ quantization formats based on your runtime, hardware, and accuracy needs.
LLM Quantization Explained: INT4 vs INT8 vs FP16
A beginner's guide to LLM quantization: how INT4, INT8, and FP16 trade memory for quality, and the rule of thumb for sizing a model to your GPU.
Quantization for Edge Devices: LLMs Under 4 GB VRAM
A case study on shrinking LLMs to run under 4 GB of VRAM for edge deployment — which quantization methods survive the squeeze and where quality falls apart.
How to Quantize Llama 3 to 4-Bit With Minimal Accuracy Loss
A step-by-step guide to quantizing Llama 3 to 4-bit precision while keeping accuracy loss minimal — calibration data, method choice, and verification.
SmoothQuant: What Activation-Aware Quantization Fixes
Why naive INT8 breaks on large models, and how SmoothQuant and activation-aware methods like AWQ recover near-FP16 accuracy.