Quick answer: vLLM can reuse the KV cache (the stored attention keys and values) of any prompt prefix that has already been computed, so repeated shared content (system prompts, few-shot examples, RAG documents, conversation history) is not recomputed while its blocks remain in cache. This feature is Automatic Prefix Caching (APC). You enable it with --enable-prefix-caching (in recent vLLM versions on the V1 engine it is on by default), then maximize its impact by putting shared content at the front of every prompt so that requests share a long leading run of identical tokens. A high prefix-cache hit rate directly cuts the compute spent on prompt processing, which lowers cost per request and time-to-first-token for prefix-heavy workloads, often dramatically for multi-turn chat and RAG.
What is the KV cache, and why reuse it?
During text generation, a transformer computes attention keys (K) and values (V) for every token in the prompt and every token it generates. To avoid recomputing them at each decoding step, these K/V tensors are stored in the KV cache. This is what makes autoregressive generation tractable — each new token attends to the cached K/V of all previous tokens instead of reprocessing the whole sequence.
The KV cache normally lives only for the duration of a single request. The insight behind KV cache reuse is that if two requests share the same prefix — the same system prompt, the same document, the same conversation so far — then the K/V for that prefix is identical. Recomputing it for the second request is pure waste. vLLM keeps those KV blocks around and lets the next matching request reuse them.
The payoff: the expensive "prefill" phase (processing the prompt) is skipped for the shared portion. You pay compute only for the genuinely new tokens. For a chatbot with a 2,000-token system prompt, or a RAG app reusing a long document, this eliminates the bulk of per-request prompt compute.
This is built on vLLM's PagedAttention, which stores the KV cache in fixed-size, non-contiguous blocks — the same block structure that makes cross-request sharing possible. (For the memory mechanism itself, see the companion post on PagedAttention.)
How Automatic Prefix Caching works in vLLM
vLLM's Automatic Prefix Caching (APC) reuses KV blocks across requests using a simple, robust mechanism:
- Block-based KV cache. The KV cache is divided into fixed-size blocks (default block_size = 16 tokens). A prompt of 1,000 tokens occupies ~63 blocks.
- Content hashing. Each block is identified by a hash of the tokens it contains plus all preceding tokens (the prefix up to that block). Two requests with an identical leading token sequence produce identical block hashes.
- Cache lookup. When a new request arrives, vLLM hashes its prompt blocks and checks whether matching blocks already exist in the cache. Matching leading blocks are reused; computation starts only from the first block that differs.
- LRU eviction. Cached blocks live in GPU memory. When space is needed, vLLM evicts least-recently-used blocks. A reused prefix that stays hot survives; cold prefixes are evicted.
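The four pieces above can be sketched as a toy model. This is an illustration only, not vLLM's implementation: the SHA-256 chaining, the ToyPrefixCache class, and the "kv-placeholder" value are assumptions made for the example, and the real engine's hash function and eviction order differ in detail.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per block, matching the default quoted above


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of the prompt.

    Each hash covers the block's tokens plus the previous block's hash,
    so it identifies the whole prefix up to and including that block.
    """
    hashes = []
    parent = b""
    full_blocks = len(token_ids) // block_size
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """LRU map from block hash to a placeholder for that block's KV data."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()

    def lookup(self, token_ids):
        """Count leading blocks already cached; stop at the first miss."""
        hits = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break
            self.blocks.move_to_end(h)  # mark as recently used
            hits += 1
        return hits

    def insert(self, token_ids):
        """Store every full block, evicting least-recently-used ones."""
        for h in block_hashes(token_ids):
            self.blocks[h] = "kv-placeholder"
            self.blocks.move_to_end(h)
            while len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)


shared_prefix = list(range(1000))  # stand-in for a shared system prompt
request_a = shared_prefix + [5001, 5002]
request_b = shared_prefix + [7001, 7002, 7003]

cache = ToyPrefixCache(capacity_blocks=128)
cold_hits = cache.lookup(request_a)   # nothing cached yet
cache.insert(request_a)
warm_hits = cache.lookup(request_b)   # shares the first 62 full blocks
print(cold_hits, warm_hits)  # 0 62
```

The second request reuses 62 blocks (992 tokens) and only its tail needs fresh computation. Because each hash folds in its parent, a block can only match when everything before it matched too.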
Two consequences follow directly from this design:
- It's an exact token-prefix match. Reuse happens only for an identical leading sequence of token IDs. A difference at token #5 means nothing is reused: everything from the first block onward is recomputed.
- It aligns on block boundaries. Reuse is granular to block_size. The shared prefix is matched block by block, so the more contiguous shared content at the front, the more blocks are reused.
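Both consequences can be checked with a few lines of arithmetic. The helper below is a hypothetical sketch (reusable_tokens is not a vLLM function): it measures the common token prefix of two prompts and rounds it down to a whole number of blocks, assuming only full blocks are reusable.

```python
def reusable_tokens(cached_prompt, new_prompt, block_size=16):
    """Tokens of new_prompt that a block-aligned prefix match could reuse."""
    common = 0
    for a, b in zip(cached_prompt, new_prompt):
        if a != b:
            break
        common += 1
    # Only whole blocks are reusable, so round down to a block multiple.
    return (common // block_size) * block_size


cached = list(range(100))

early_diff = cached.copy()
early_diff[4] = -1   # token #5 differs

late_diff = cached.copy()
late_diff[40] = -1   # first difference falls inside the third block

print(reusable_tokens(cached, early_diff))  # 0
print(reusable_tokens(cached, late_diff))   # 32
print(reusable_tokens(cached, cached))      # 96
```

Even a fully identical 100-token prompt reuses only 96 tokens here, because the trailing 4 tokens do not fill a block.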
Step-by-step: enabling and using KV cache reuse
Step 1 — Enable Automatic Prefix Caching
Offline (Python LLM class):
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,    # turn on KV cache reuse
    gpu_memory_utilization=0.90,   # leave headroom; more memory = more cache
)
Online (OpenAI-compatible server):
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-prefix-caching \
--gpu-memory-utilization 0.90
Version note: on vLLM's V1 engine (default in recent releases), prefix caching is enabled by default, and you'd use --no-enable-prefix-caching to turn it off. On older/V0 paths you must opt in with --enable-prefix-caching. Check your installed version; the flag is harmless to pass explicitly.
Step 2 — Structure prompts so prefixes are shared
This is where you actually capture the savings. KV reuse only helps if requests share a leading token sequence, so put stable, shared content first and variable content last:
[ shared system prompt ] ← identical across all requests (cached)
[ shared few-shot examples ] ← identical across all requests (cached)
[ retrieved document(s) ] ← shared within a doc's Q&A session (cached)
───────────────────────────── ← reuse boundary
[ user's specific question ] ← unique per request (recomputed — cheap)
Anything you place before the first varying token is a candidate for reuse. Anything after the first difference is recomputed. The rule mirrors prefix caching everywhere: stable first, volatile last.
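A minimal sketch of that ordering rule, with hypothetical names and sample strings (build_prompt, SYSTEM, FEW_SHOT, and DOC are all made up for the example). It compares characters rather than tokens, which is only a proxy for what the cache sees, but it shows how much of the prompt survives as a shared prefix under each layout.

```python
import os


def build_prompt(system_prompt, few_shot, document, question, request_meta=""):
    """Assemble a prompt with stable parts first and per-request parts last."""
    stable = [system_prompt, few_shot, document]
    volatile = [request_meta, question]
    return "\n\n".join(part for part in stable + volatile if part)


SYSTEM = "You are a support assistant for ExampleCo."  # hypothetical
FEW_SHOT = "Q: How do I reset my password?\nA: Use the account page."
DOC = "Refund policy: purchases can be refunded within 30 days."

p1 = build_prompt(SYSTEM, FEW_SHOT, DOC, "Can I get a refund after 10 days?",
                  request_meta="request_id=a1")
p2 = build_prompt(SYSTEM, FEW_SHOT, DOC, "Is shipping refundable?",
                  request_meta="request_id=b2")

# Good layout: the whole stable section is a common prefix.
shared_chars = len(os.path.commonprefix([p1, p2]))
stable_chars = len("\n\n".join([SYSTEM, FEW_SHOT, DOC]))
print(shared_chars >= stable_chars)  # True

# Anti-pattern: per-request metadata at the top breaks the prefix at once.
bad1 = "request_id=a1\n\n" + SYSTEM
bad2 = "request_id=b2\n\n" + SYSTEM
print(len(os.path.commonprefix([bad1, bad2])))  # 11
```

In the anti-pattern the two prompts agree only on the literal text "request_id=", so essentially nothing after it could be reused.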
Step 3 — Give the cache enough memory
KV blocks live in the GPU memory left over after model weights. More free memory means more blocks cached and a higher hit rate before eviction kicks in:
- --gpu-memory-utilization (e.g., 0.90): raises the fraction of GPU memory vLLM uses for weights + KV cache. Push it as high as is stable.
- --kv-cache-dtype fp8: stores the KV cache in 8-bit instead of 16-bit, roughly doubling cache capacity (and thus how many prefixes stay hot) at a small, usually acceptable accuracy cost. Options include fp8, fp8_e4m3, fp8_e5m2.
- --max-model-len: don't set it larger than you need; oversized context budgets reserve memory that could hold more cached prefixes.
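To get a feel for the numbers, here is a rough capacity estimate. It is a back-of-the-envelope sketch under stated assumptions: the layer count, KV-head count, and head dimension are assumed values for an 8B Llama-style model with grouped-query attention, the 20 GiB budget is hypothetical, and quantization scales and allocator overhead are ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def cacheable_tokens(budget_gib, per_token_bytes):
    """How many tokens of KV cache fit in the given memory budget."""
    return int(budget_gib * 1024**3) // per_token_bytes


# Assumed shape for an 8B Llama-style model with grouped-query attention.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET_GIB = 20  # hypothetical memory left for KV cache after weights

fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

print(fp16 // 1024, "KiB/token at 16-bit")                 # 128
print(cacheable_tokens(BUDGET_GIB, fp16), "tokens at 16-bit")  # 163840
print(cacheable_tokens(BUDGET_GIB, fp8), "tokens at 8-bit")    # 327680
```

Under these assumptions the same budget holds about 164K tokens at 16-bit and about 328K at 8-bit, which is where the "roughly doubling" figure comes from.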
Step 4 — Verify you're actually getting hits
Enabling the flag is not the same as benefiting from it. Measure the prefix-cache hit rate. vLLM exposes Prometheus metrics from the server:
vllm:prefix_cache_queries_total # blocks queried against the cache
vllm:prefix_cache_hits_total # blocks served from cache
# hit rate = hits_total / queries_total
Scrape the /metrics endpoint or watch the engine's logged stats. A near-zero hit rate means your prompts don't actually share a leading token sequence — almost always because something variable (a per-request timestamp, user ID, or reordered content) sits too early in the prompt. Fix the prompt structure, not the flag.
# Quick check against a running vLLM server
curl -s http://localhost:8000/metrics | grep prefix_cache
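If you would rather compute the ratio than eyeball the counters, a small parser is enough. This is a sketch, not part of vLLM: the sample text and its model_name label are made up, the helper names are hypothetical, and the metric names are the ones quoted above (check the exact names your version exposes).

```python
import re

# Made-up sample in Prometheus text format, for illustration only.
SAMPLE_METRICS = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="demo"} 12000.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="demo"} 9000.0
"""

_LINE = re.compile(r"^(?P<name>[A-Za-z_:][\w:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)")


def sum_metric(metrics_text, name):
    """Sum all samples of one metric across label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = _LINE.match(line)
        if m and m.group("name") == name:
            total += float(m.group("value"))
    return total


def prefix_cache_hit_rate(metrics_text):
    """Return hits / queries, or None if nothing has been queried yet."""
    queries = sum_metric(metrics_text, "vllm:prefix_cache_queries_total")
    hits = sum_metric(metrics_text, "vllm:prefix_cache_hits_total")
    return hits / queries if queries else None


print(prefix_cache_hit_rate(SAMPLE_METRICS))  # 0.75
```

Feed it the body returned by the /metrics endpoint. The counters are cumulative since server start, so for a recent-window hit rate, take two scrapes and divide the difference in hits by the difference in queries.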
Where KV cache reuse pays off most
| Workload | Why it benefits |
|---|---|
| Multi-turn chat | Each turn reuses the entire prior conversation as a cached prefix |
| RAG with a shared document | Many questions about the same retrieved doc reuse its KV blocks |
| Long fixed system prompt | The system prompt is computed once and reused across all users |
| Few-shot prompting | Shared exemplars at the front are cached across every request |
| Batch jobs over one context | Summarize/extract many things from the same large input |
| Agent loops | The tool definitions and instructions prefix is stable across steps |
Where it does not help
- Every request is unique from the first token — no shared prefix, nothing to reuse.
- Variable content placed early — a timestamp/UUID/user-ID at the top invalidates the prefix for everything after it.
- Prefixes that are too short to fill a block, or that are evicted before they are reused (typical of sparse, low-traffic prompts).
Cost and latency: what actually improves
KV cache reuse cuts the prefill cost — the compute to process the prompt — for the shared portion. The concrete effects:
- Lower compute per request → lower cost. On self-hosted vLLM, cost is GPU-time. Skipping prefill for a long shared prefix means each request consumes less GPU time, so the same hardware serves more requests (higher throughput, lower cost per request).
- Faster time-to-first-token (TTFT). Prefill dominates TTFT for long prompts. Reusing the prefix's KV means generation starts almost immediately.
- Higher effective batch size. Less prefill work per request frees the scheduler to run more concurrent sequences.
The magnitude scales with your prefix-to-unique ratio: the larger the shared prefix relative to each unique question, the bigger the win. A RAG app with a 6,000-token document and a 30-token question benefits enormously; an app with 50-token prompts and no shared content does not benefit at all.
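The ratio can be made concrete with a small estimate. This sketch assumes a warm cache and uses the share of prompt tokens as a stand-in for the share of prefill work; real prefill cost is not perfectly linear in token count, so treat the result as an approximation. The function name is hypothetical.

```python
def prefill_savings(shared_tokens, unique_tokens, block_size=16):
    """Fraction of prompt tokens whose prefill is skipped on a cache hit."""
    total = shared_tokens + unique_tokens
    if total == 0:
        return 0.0
    # Only whole blocks of the shared prefix can be reused.
    reusable = (shared_tokens // block_size) * block_size
    return reusable / total


rag = prefill_savings(shared_tokens=6000, unique_tokens=30)
short = prefill_savings(shared_tokens=0, unique_tokens=50)
print(f"{rag:.3f} {short:.3f}")  # 0.995 0.000
```

For the two cases described above, the RAG request skips about 99.5% of its prompt tokens while the short unique prompt skips none.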
Note the difference from API-provider prompt caching (Anthropic/OpenAI): there you save on billed input tokens. On self-hosted vLLM you save on GPU compute time. Same idea, different cost currency.
Advanced tuning and gotchas
| Lever / gotcha | Effect / risk | Guidance |
|---|---|---|
| block_size | Granularity of reuse; smaller = finer matching, more overhead | Default 16 is fine for most; rarely needs changing |
| --kv-cache-dtype fp8 | ~2× cache capacity → higher hit rate; small accuracy cost | Enable for memory-bound, prefix-heavy serving |
| gpu_memory_utilization too low | Few blocks cached → premature eviction → low hit rate | Raise it as high as stays stable |
| Variable content placed early | Breaks the prefix; near-zero hit rate | Move timestamps/IDs/user data to the end |
| Non-deterministic prompt assembly | Same logical prompt, different bytes → no match | Build prompts deterministically (stable ordering) |
| Expecting reuse across different models | KV cache is model- and config-specific | One cache per model; don't expect cross-model sharing |
| Multi-tenant privacy | Shared prefixes are reused across users by content hash | Reuse is content-based, not a data leak, but review for sensitive shared prefixes |
| Assuming the flag alone saves money | No structural prefix sharing → no benefit | Always verify with the hit-rate metric |
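The non-deterministic assembly row is the easiest to trip over in agent code, where tool definitions are often built from dicts or sets in whatever order they happen to arrive. One way to get stable bytes, sketched here with hypothetical tool definitions and a hypothetical render_tools helper, is to sort both the list and the keys before serializing.

```python
import json


def render_tools(tools):
    """Serialize tool definitions with a stable ordering and key order."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


# Hypothetical tool definitions, built in different orders by two code paths.
tools_a = [
    {"name": "search", "description": "Search the docs"},
    {"name": "calc", "description": "Evaluate arithmetic"},
]
tools_b = [
    {"description": "Evaluate arithmetic", "name": "calc"},
    {"description": "Search the docs", "name": "search"},
]

print(render_tools(tools_a) == render_tools(tools_b))  # True
print(json.dumps(tools_a) == json.dumps(tools_b))      # False
```

With the naive serialization the two prompts describe the same tools but differ in their text, so their token sequences diverge early and the prefix does not match.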
Frequently asked questions
What is KV cache reuse in vLLM?
It's vLLM's ability to reuse the stored attention keys and values (the KV cache) of a prompt prefix that has already been computed, so repeated shared content isn't recomputed. The feature is called Automatic Prefix Caching (APC). When a new request shares a leading token sequence with a previous one, vLLM serves those KV blocks from cache and computes only the new tokens, cutting prefill compute, cost, and time-to-first-token.
How do I enable prefix caching in vLLM?
Pass --enable-prefix-caching to vllm serve, or set enable_prefix_caching=True in the LLM constructor. On vLLM's V1 engine (default in recent versions) it's already enabled, and you'd disable it with --no-enable-prefix-caching. Then structure prompts so shared content comes first, give the cache enough GPU memory, and verify the hit rate via the Prometheus metrics.
How much does KV cache reuse save?
It depends on your prefix-to-unique ratio: how large the shared prefix is relative to each unique query. It eliminates prefill compute for the shared portion, so a RAG app with a 6,000-token shared document and short questions can avoid most per-request prompt compute, while an app with short, all-unique prompts sees no benefit. The savings show up as lower GPU time per request, higher throughput, and faster time-to-first-token.
Why is my vLLM prefix cache hit rate near zero?
Your prompts don't share a leading token sequence. The most common cause is variable content placed too early — a per-request timestamp, UUID, user ID, or non-deterministically ordered content near the top of the prompt invalidates the prefix for everything after it. Move all stable, shared content to the front and all variable content to the end, then re-check vllm:prefix_cache_hits_total / vllm:prefix_cache_queries_total.
Does fp8 KV cache hurt accuracy?
Storing the KV cache in fp8 (--kv-cache-dtype fp8) roughly doubles cache capacity, letting more prefixes stay hot and raising the hit rate, at a small accuracy cost that is acceptable for most applications. For accuracy-critical workloads, benchmark fp8 against the default on your own eval set before adopting it in production.
Is vLLM KV cache reuse the same as Anthropic/OpenAI prompt caching?
Conceptually yes: both reuse a computed prompt prefix to avoid reprocessing it. The difference is the cost currency: API providers discount billed input tokens (e.g., ~0.1× for Anthropic cache reads), while self-hosted vLLM saves GPU compute time. The prompt-structuring rule is identical: stable content first, volatile content last.
Key takeaways
- vLLM reuses the KV cache of shared prompt prefixes via Automatic Prefix Caching, skipping prefill for content that's already been computed.
- Enable it with --enable-prefix-caching (on by default on the V1 engine), but the flag alone does nothing without structural prefix sharing.
- Stable content first, volatile content last: any variable token early in the prompt destroys reuse for everything after it.
- Give the cache memory: raise --gpu-memory-utilization and consider --kv-cache-dtype fp8 to roughly double capacity and lift the hit rate.
- Always verify with vllm:prefix_cache_hits_total / vllm:prefix_cache_queries_total; a near-zero hit rate is a prompt-structure problem, not a config one.
- Savings scale with the prefix-to-unique ratio; multi-turn chat, RAG over shared docs, and fixed system prompts benefit most.
- Unlike API prompt caching (which discounts billed tokens), vLLM reuse saves GPU compute time — the same idea in a different cost currency.
References
- vLLM. Automatic Prefix Caching (vLLM documentation). https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html
- vLLM. Engine arguments (--enable-prefix-caching, --kv-cache-dtype, --gpu-memory-utilization). https://docs.vllm.ai/en/latest/serving/engine_args.html
- Kwon, W., Li, Z., Zhuang, S., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP '23. https://arxiv.org/abs/2309.06180
- vLLM. Production metrics (Prometheus vllm:prefix_cache_*). https://docs.vllm.ai/en/latest/serving/metrics.html
- vLLM. FP8 KV Cache (quantization documentation). https://docs.vllm.ai/en/latest/features/quantization/fp8.html