vLLM KV Cache Reuse: A Guide to Cutting Inference Costs

vLLM KV cache reuse cuts Time to First Token by 78% and triples throughput. This guide covers how Automatic Prefix Caching works, how to enable it, and how to extend it across distributed clusters.

Mohammed Kafeel

Machine Learning Researcher

June 13, 2026

17 min read

On this page

What Is KV Cache Reuse in vLLM?
Why Inference Costs Spiral Without It
How vLLM's Automatic Prefix Caching (APC) Works
The Real Numbers: What KV Cache Reuse Actually Saves
Where KV Cache Reuse Has the Biggest Impact
How to Enable Prefix Caching in vLLM (Step by Step)
Advanced: Going Beyond Single-Instance Caching
Common Mistakes That Kill Your Cache Hit Rate
Key Takeaways
FAQ
Useful Sources

TL;DR - Key Takeaways

vLLM KV cache reuse (Automatic Prefix Caching / APC) stores computed attention states and reuses them when requests share the same token prefix - skipping expensive prefill entirely.

In our tests on Qwen3-32B with 20 questions over a shared document: TTFT dropped 78% (4.3s → 0.97s) and output throughput jumped 254% (427 → 1,513 tok/s).

Anthropic charges $0.30 vs. $3.00 per million tokens for cached vs. uncached - a 10x cost difference. The same pattern holds at OpenAI.

Enabled by default in vLLM v0.6+. One flag: --enable-prefix-caching.

For distributed clusters, naive load balancers destroy cache locality. llm-d's precise scheduling delivers 57x faster TTFT vs. approximate scheduling on the same hardware.

Beyond single-instance GPU memory: LMCache gives you 3–10x TTFT reduction and access to 100x more KV caches.

What Is KV Cache Reuse in vLLM?

vLLM KV cache reuse means storing the Key-Value attention tensors from a processed prompt and reusing them - instead of recomputing them - when the next request shares the same token prefix.

Here's the core mechanic. Every transformer model computes attention by comparing every token against every other token. The result - the KV tensors - gets stored in GPU memory during the prefill phase. Without reuse, every new request recomputes those tensors from scratch, even if the first 10,000 tokens are identical to the previous request.

Reuse skips that work entirely. The new request jumps straight to decoding the novel suffix. Time to First Token (TTFT) collapses. GPU cycles go toward serving more requests instead of repeating math you've already done.

vLLM implements this as Automatic Prefix Caching (APC) - the engine detects shared prefixes automatically, with zero changes to your application code. It's not a manual "warm-up" step. It just works. (For how this stacks up against other serving stacks, see vLLM's KV cache advantages over alternatives.)

Why Inference Costs Spiral Without It

Without prefix caching, every request pays full prefill cost - regardless of how much of the prompt it shares with previous requests.

At small scale, this is annoying. At production scale, it's a budget problem.

The Anthropic number makes it concrete. Claude Sonnet charges $3.00 per million tokens for uncached input, and $0.30 per million tokens for cached input. That's a 10x price difference - not a rounding error. OpenAI's API pricing shows the same pattern. A high cache hit rate doesn't just make your app faster; it makes it fundamentally cheaper to operate.

Now run the math on a real workload. Imagine 150 enterprise customers, each with a 6,000-token context, served by 5 concurrent users each. That's the kind of B2B SaaS load the llm-d team benchmarked. The total KV-cache demand hit 73% of cluster capacity - six times what any single pod could hold. Without cache-aware scheduling, every request recomputes from scratch. With it, you serve the same load on the same hardware at a fraction of the cost.

The GPU waste is the hidden cost. Prefill is compute-bound. Every millisecond your GPU spends recomputing a shared system prompt is a millisecond it can't spend generating tokens for paying users. Without vLLM prefix caching, you're burning H100 cycles on work you've already paid for.

How vLLM's Automatic Prefix Caching (APC) Works

APC caches KV blocks from processed requests and reuses them when a new request shares the same prefix. The implementation is more precise than it sounds.

The Paged Attention Foundation

vLLM divides GPU memory into fixed-size pages (blocks) of 16 tokens by default. This is the same paged attention architecture that makes vLLM's memory management efficient in the first place - we cover PagedAttention as the foundation for KV cache reuse in its own deep dive. Prefix caching builds directly on top of it - only full blocks are cached. A partial block at the end of a prefix doesn't get stored.

This has one practical implication: if your shared prefix is 1,847 tokens and your block size is 16, vLLM caches 115 full blocks (1,840 tokens) and leaves the last 7 tokens out. Keep your shared prefixes aligned to block boundaries for maximum hit rate.

Hash-Based Block Matching

vLLM doesn't use a prefix tree. It uses a hash table. Each KV-cache block gets a hash computed from:

Parent hash - the hash of the preceding block
Block tokens - the exact token IDs in this block
Extra hashes - LoRA IDs, multimodal input hashes, or cache_salt values for tenant isolation

This chained hashing means every block's identity depends on its entire history. Block 50 in a prompt has a different hash than block 50 in a different prompt, even if the tokens in that block happen to be identical. No false cache hits.

When a new request arrives, the scheduler calls kv_cache_manager.get_computed_blocks(), hashes the prompt tokens, and looks up matching blocks. Cache hits are "touched" (reference count incremented) to prevent eviction while the request runs.

LRU Eviction: What Gets Dropped and When

When GPU memory fills up, vLLM evicts blocks using Least Recently Used (LRU) policy. The free queue is a doubly linked list - O(1) operations for moving blocks.

One design detail worth knowing: when a request finishes, its blocks are added to the free queue in reverse order. The last block (which hashes the most tokens and is least likely to be reused) goes to the front of the eviction queue. Earlier blocks - more likely to be shared prefixes - stay cached longer.

Eviction is not deletion. The block hash is removed from the cache map, but the physical memory is reused for new allocations. If a future request needs that prefix again, it recomputes from scratch.

The Real Numbers: What KV Cache Reuse Actually Saves

Numbers from real benchmarks, not synthetic toy examples.

Single-instance, document QA workload (Qwen3-32B, 20 questions on a shared document):

Metric	Without Prefix Caching	With Prefix Caching	Change
Mean TTFT	4,343 ms	970 ms	−78%
Output throughput	427 tok/s	1,513 tok/s	+254%
Mean TPOT	102 ms	112 ms	+10% (negligible)

The 10% TPOT increase is the overhead of hash lookups. It's real, but it's swamped by the TTFT and throughput gains.

Multi-tenant production workload (Nexus Labs, Qwen2.5-32B, 4× H100 nodes):

Tenant A (fixed 1,847-token system prompt): 94% cache hit rate, TTFT p50 dropped from 480ms to 110ms - a 77% reduction
Tenant B (after prompt restructuring to move volatile fields to the tail): 87% cache hit rate, TTFT p50 dropped from 510ms to 145ms - a 71% reduction

Distributed scheduling (llm-d, 8 vLLM pods, 16× H100 GPUs):

Scheduler	Output tok/s	TTFT p90
Precise (llm-d)	8,730	0.54s
Approximate	6,944	31.1s
Cache-blind	4,429	92.6s

Precise scheduling is 57x faster than approximate and 170x faster than random on the same hardware. Throughput doubles vs. cache-blind configurations.

Cost impact (Anthropic Claude Sonnet API):

Uncached tokens: $3.00 per million
Cached tokens: $0.30 per million
10x cheaper at high cache hit rates

Where KV Cache Reuse Has the Biggest Impact

01. Multi-Turn Conversations

Every turn in a conversation appends to the history. By turn 10, the chat history might be 8,000 tokens. Without caching, every turn reprocesses all 8,000 tokens. With caching, only the new user message (a few hundred tokens) gets prefilled.

The latency impact compounds. A 400ms TTFT on a 12-step agent plan means 4.8 seconds of dead time before the user sees anything. Cache the history, and each turn feels instant.

02. RAG Pipelines

RAG workloads are trickier. Retrieved documents change between queries, and document order matters for prefix matching. If document A appears before document B in one query and after in another, the prefix breaks.

The practical fix: fix your document ordering. Sort retrieved chunks by a stable key (chunk ID, relevance score) before building the prompt. Consistent ordering means consistent prefixes, which means cache hits. The llm-d team notes that position-independent KV-fusion is on the roadmap for cases where ordering can't be controlled.

03. Long System Prompts

This is the easiest win. A fixed system prompt - tool definitions, persona instructions, safety guidelines - is identical across every request. Cache it once, reuse it forever (until eviction).

In the Nexus Labs deployment, a 1,847-token fixed system prompt drove a 94% cache hit rate at steady state. The 6% misses were cold starts after pod restarts. That's about as good as it gets.

04. Agentic Workflows

Agents are the most extreme case. The prefix contains goals, tool schemas, and a growing history of actions and observations. Production data from Manus shows input-to-output ratios exceeding 100:1 in real agent deployments. The prefix is overwhelmingly large relative to each new output.

Without caching, agents become prohibitively expensive at scale. With vLLM prompt caching, each reasoning step only prefills the new observation - a tiny fraction of the total context. This is what makes complex multi-step agents computationally viable.

How to Enable Prefix Caching in vLLM (Step by Step)

Step 1: Check Your vLLM Version

Prefix caching is enabled by default in vLLM v0.6.0+. If you're on v0.6.0 or later, you already have it. The v0.6.0 release (July 2024) stabilized APC and delivered a 2.7x throughput improvement over v0.5.x.

Check your version:

python -c "import vllm; print(vllm.__version__)"

If you're below v0.6.0, upgrade. The performance difference is not marginal.

Step 2: Enable via CLI Flag

For vllm serve, the flag is explicit:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.9

To explicitly disable it (e.g., for benchmarking without caching):

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --no-enable-prefix-caching

Step 3: Enable via Python API

For offline inference or programmatic serving:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# First request - computes and caches the prefix
outputs = llm.generate(
    ["[Long system prompt...] What is the capital of France?"],
    sampling_params
)

# Second request - reuses the cached prefix, much faster TTFT
outputs = llm.generate(
    ["[Long system prompt...] What is the capital of Germany?"],
    sampling_params
)

Step 4: Choose Your Hash Algorithm

Control the hash algorithm with --prefix-caching-hash-algo:

# Default: cryptographically secure, recommended for multi-tenant
vllm serve [model] --prefix-caching-hash-algo sha256

# Cross-language reproducible hashes (recommended for distributed setups)
vllm serve [model] --prefix-caching-hash-algo sha256_cbor

# Fast non-cryptographic (single-tenant only - collision risk in multi-tenant)
vllm serve [model] --prefix-caching-hash-algo xxhash

Use sha256 (default) for any multi-tenant deployment. As of v0.11, sha256 is the default - prior versions had non-collision-free hashes. The xxhash option is faster but increases collision risk, which can leak information across tenants. Don't use it in shared environments.

For multi-tenant isolation, add a cache_salt per request:

{
  "messages": [...],
  "cache_salt": "tenant-specific-salt-value"
}

This prevents timing-based attacks where an adversary infers cached content from latency differences.

Step 5: Verify Cache Hit Rate in Metrics

vLLM exposes Prometheus metrics. Check your cache hit rate:

# Start with metrics enabled (default)
vllm serve [model] --enable-prefix-caching

# Query the metric
curl http://localhost:8000/metrics | grep prefix_cache

Look for vllm:gpu_prefix_cache_hit_rate. A healthy multi-turn or fixed-system-prompt workload should hit 80%+ at steady state. If you're seeing under 20%, check the common mistakes section below.

Advanced: Going Beyond Single-Instance Caching

CPU Offloading (v0.11+)

GPU memory is finite. When the cache fills up, LRU eviction kicks in and you lose cached blocks. CPU offloading extends the effective cache size by moving evicted blocks to CPU RAM instead of discarding them.

Available since vLLM v0.11.0:

vllm serve [model] \
  --enable-prefix-caching \
  --kv-offloading-backend native \
  --kv-offloading-size 50  # GB of CPU RAM to use

The trade-off: loading from CPU is slower than GPU. On modern hardware (NVLink, PCIe 5.0), the latency is often still faster than recomputing a long prefix. On older interconnects, measure before committing.

vLLM v0.20.0 added TurboQuant 2-bit KV cache compression - 4x capacity on the same GPU memory - and FP8 KV cache support. These let you fit more cached blocks in GPU memory before needing to offload at all.

Single-instance caching has a hard limit: one pod's GPU memory. In a multi-replica deployment, a request routed to Pod B gets a cold cache even if Pod A has the exact prefix cached.

LMCache solves this by creating a shared KV cache layer across instances:

# Install
pip install lmcache lmcache_vllm

# Start shared cache server
lmcache_server localhost 65432

# Start vLLM instances pointing to the shared cache
LMCACHE_CONFIG_FILE=example.yaml \
  lmcache_vllm serve [model] --gpu-memory-utilization 0.8 --port 8000

LMCACHE_CONFIG_FILE=example.yaml \
  lmcache_vllm serve [model] --gpu-memory-utilization 0.8 --port 8001

LMCache benchmarks show 3–10x TTFT reduction vs. vLLM baseline, 7x faster access to distributed KV caches, and access to 100x more KV caches than fit in a single GPU. It supports CPU RAM, local SSD, Redis, S3, and RDMA backends - effectively turning KV cache into a multi-tier serving architecture that spills across storage layers.

The LMCache paper (arXiv:2510.09665) reports up to 15x throughput improvement on multi-round QA workloads vs. basic vLLM with GPU-only caching.

Distributed Prefix-Cache Aware Scheduling (llm-d)

Even with LMCache, you need the router to know which pod holds which prefix. Without that, you're routing blind and destroying cache locality.

llm-d solves this with precise prefix-cache aware scheduling. Each vLLM pod streams KVEvents - a live feed of cache block creation and eviction - to a global index. The scheduler queries this index for every incoming request and routes it to the pod with the highest cache affinity score.

The benchmark results (8 pods, 16× H100, B2B SaaS workload with 150 enterprise customers × 6,000-token contexts × 5 concurrent users = 73% cluster KV-cache demand):

Precise scheduling: 8,730 tok/s, TTFT p90 = 0.54s
Cache-blind scheduling: 4,428 tok/s, TTFT p90 = 92.6s

Same hardware. Same model. 2x throughput and 170x lower latency - purely from smarter routing.

Common Mistakes That Kill Your Cache Hit Rate

01. Volatile fields at the start of the prompt. Timestamps, session IDs, request UUIDs - if any of these appear in the first 16 tokens, they invalidate the entire prefix. vLLM caches at block boundaries. One differing token in block 1 kills all downstream blocks. Fix: push volatile fields to the end of the prompt. Static instructions first, dynamic content last.

02. Non-deterministic prefix construction. If your system prompt is assembled at request time from database queries, feature flags, or A/B test variants, the prefix changes per request. Even small changes - a different whitespace character, a reordered list - break the hash match. Fix: serialize your prompt construction. Cache the assembled prompt string and reuse it.

03. Multi-tenant isolation not configured. In shared deployments, different tenants' caches can bleed into each other if you're using non-cryptographic hashes. Worse, a malicious tenant could infer another tenant's cached content from TTFT differences (timing attack). Fix: use sha256 (default in v0.11+) and add per-tenant cache_salt values.

04. Wrong preemption mode under memory pressure. When GPU memory fills up, vLLM preempts requests. The default swap mode copies KV blocks to CPU and back - this can cause cache thrashing under burst load. Fix: set --preemption-mode recompute. It discards preempted blocks cleanly instead of evicting cached prefixes to make room for swap buffers.

05. Expecting gains on decode-heavy workloads. APC only accelerates the prefill phase. If your workload generates very long outputs (e.g., code generation with 2,000+ output tokens), the decode phase dominates and prefix caching won't move the needle. It's a prefill optimization. Know your bottleneck.

06. Round-robin load balancing in multi-replica setups. A standard round-robin load balancer routes the same user's requests to different pods on every turn. Pod B has a cold cache for a prefix that Pod A just computed. Fix: use session-sticky routing at minimum, or llm-d's precise prefix-cache aware scheduling for maximum reuse across users.

Key Takeaways

vLLM KV cache reuse eliminates redundant prefill computation by storing and reusing KV attention blocks across requests that share the same token prefix.
Enable it with one flag: --enable-prefix-caching (default on in v0.6+). Use enable_prefix_caching=True in the Python API.
Real gains are large: 78% TTFT reduction, 254% throughput increase on document QA workloads. Multi-tenant deployments hit 77–94% cache hit rates with properly structured prompts.
The cost case is clear: Anthropic charges 10x less for cached tokens ($0.30 vs. $3.00 per million). At scale, cache hit rate is a direct cost multiplier.
Prompt structure determines hit rate. Static content first, volatile content last. One changed token in the first block invalidates everything downstream.
Single-instance caching has limits. For distributed clusters, add LMCache for cross-instance sharing and llm-d for cache-aware routing. The distributed gains (57x TTFT, 2x throughput) dwarf single-instance gains.
Use sha256 in multi-tenant environments. Non-cryptographic hashes (xxhash) increase collision risk and can leak information across tenants.

FAQ

What is vLLM KV cache reuse? It's vLLM's mechanism for storing the Key-Value attention tensors computed during prompt processing and reusing them when a subsequent request shares the same token prefix. Instead of recomputing the prefill phase, vLLM retrieves the cached tensors and skips straight to decoding. The feature is called Automatic Prefix Caching (APC).

How do I enable vLLM prefix caching? Add --enable-prefix-caching to your vllm serve command, or set enable_prefix_caching=True in the Python API. In vLLM v0.6.0 and later, it's enabled by default. To disable it explicitly, use --no-enable-prefix-caching.

What is vLLM enable prefix caching and when should I use it? --enable-prefix-caching is the CLI flag that activates Automatic Prefix Caching. Use it whenever your workload has requests that share a common prefix - fixed system prompts, multi-turn chat history, repeated document context, or agentic reasoning loops. It has negligible overhead when there are no cache hits, so there's rarely a reason to disable it.

What is vLLM prompt caching and how does it differ from prefix caching? They're the same thing. "Prompt caching" is the user-facing term (used by API providers like Anthropic and OpenAI); "prefix caching" or "KV cache reuse" is the implementation-level term. vLLM's APC is the engine-side implementation of what API providers call prompt caching.

What is Automatic Prefix Caching (APC) in vLLM? APC is vLLM's name for its KV cache reuse system. It automatically detects when incoming requests share a token prefix with previously processed requests, retrieves the cached KV blocks for that prefix, and skips the prefill computation for those tokens. No application code changes are required - it's fully transparent.

Does prefix caching work in distributed multi-replica deployments? Not automatically. Each vLLM pod maintains its own isolated cache. A standard load balancer routes requests without cache awareness, so the same prefix gets recomputed on different pods. To get cache reuse across instances, use LMCache (cross-instance KV sharing) and llm-d (precise prefix-cache aware routing). The llm-d benchmarks show 57x faster TTFT vs. approximate scheduling on a 16-GPU cluster.

What hash algorithm should I use for vLLM prefix caching? Use sha256 (the default since v0.11) for any multi-tenant deployment. It's cryptographically secure and prevents hash collisions that could leak information across tenants. Use sha256_cbor if you need reproducible hashes across different Python or vLLM versions. Only use xxhash in single-tenant environments where you need maximum speed and security is not a concern.

Useful Sources

Keep reading

llminferencevllm

PagedAttention in vLLM: 14× Throughput with KV Caching

PagedAttention is the memory management algorithm inside vLLM that eliminates KV cache fragmentation, cuts GPU memory waste from 60–80% to under 4%, and delivers up to 24x higher throughput than HuggingFace Transformers.

MKMohammed Kafeel

14 min read

llmself-hostingvllm

vLLM vs Ollama vs TGI: LLM Serving Framework Comparison

A data-backed comparison of vLLM, Ollama, and TGI - covering throughput benchmarks, concurrency behavior, quantization support, and a 3-question decision framework to pick the right LLM serving framework fast.

SYShubham Yadav

15 min read

routingllmreasoning

When to Use Reasoning Models vs Standard LLMs

Reasoning models don't just generate text - they think before they answer. Here's what that actually means, how they're built, and when to use one over a standard LLM.