Inference & Serving
Getting tokens out fast — vLLM, throughput, batching, and the serving stack behind production LLMs.
llm, vllm, inference
vLLM KV Cache Reuse: A Guide to Cutting Inference Costs
How to configure and verify KV cache reuse in vLLM to cut repeated-prefix inference costs, with concrete steps and the metrics to watch.
Mohammed Kafeel
14 min read · llm, inference, vllm
PagedAttention in vLLM: 14× Throughput with KV Caching
How PagedAttention borrows OS virtual-memory paging to eliminate KV cache fragmentation, and why it lets vLLM reach up to 14× higher throughput.
Mohammed Kafeel
11 min read · llm, self-hosting, vllm
vLLM vs Ollama vs TGI: LLM Serving Framework Comparison
A framework decision that's easy to get wrong — they look similar on the surface but are built for fundamentally different use cases. Plus a step-by-step guide to running Llama 4 Scout on a single GPU.
Shubham Yadav
13 min read