Multi-Tier LLM Cache: Semantic, Prefix & Inference Layers

A multi-tier LLM cache stacks semantic, prefix, and inference caching layers to slash costs by up to 80% and cut latency by 78%. Here's exactly how each layer works and when to use them.

Mohammed Kafeel

Machine Learning Researcher

June 16, 2026

19 min read

On this page

What Is a Multi-Tier LLM Cache?
The Three Layers: How They Stack
Layer 1: Semantic Caching - Match by Meaning, Not Exact Text
Layer 2: Prefix Caching - Reuse the KV Cache at the Token Level
Layer 3: Inference Caching - Cache the Final Output
How the Three Layers Work Together
Real-World Benchmarks: What the Numbers Actually Say
How to Choose Your Caching Strategy
FAQ
Key Takeaways
Useful Sources

Every repeated LLM call you don't cache is money on fire. At scale - say, 1 million queries per day - the difference between a naive inference setup and a properly architected multi-tier LLM cache can be $980+ per month saved on a single workload. That's not a projection. That's a real number from a documented GPU cost analysis at a 70% cache hit rate.

This post breaks down exactly how the three caching layers work, how they stack, and which one you should implement first.

TL;DR / Key Takeaways

A multi-tier LLM cache combines three distinct layers: semantic, prefix, and inference caching.

Semantic caching (application layer): 61.6–68.8% API call reduction, up to 80% cost savings, 25–250x speedup on repeated prompts.

Prefix caching (inference engine layer): +254% throughput, −78% TTFT using vLLM or SGLang RadixAttention.

Inference caching (output layer): full output memoization and CDN-level caching for LLM APIs.

Stack all three for combined savings exceeding 80% vs. naive full-inference.

Always implement in this order: prefix caching first (highest leverage), then semantic, then output caching.

What Is a Multi-Tier LLM Cache?

A multi-tier LLM cache is a layered architecture that intercepts inference requests at multiple points in the pipeline - before the model runs, during prefill, and after generation - to eliminate redundant computation.

Single-layer caching leaves most of the savings on the table. A semantic cache alone won't help you if the same 10,000-token system prompt gets reprocessed on every request. A prefix cache alone won't save you if 40% of your traffic is semantically identical FAQ queries phrased differently. You need all three working together.

Here's why it matters now: LLM API prices have dropped roughly 80% since 2024, yet total inference spend keeps rising because token volume is exploding. LLM cost optimization is no longer about negotiating better rates - it's about not calling the model when you don't have to.

The three-layer architecture addresses this at every level of the stack:

Application layer - semantic similarity matching before the request hits the model
Inference engine layer - KV state reuse across requests sharing a common prefix
Output layer - full response memoization for identical or near-identical calls

The Three Layers: How They Stack

Think of the three layers as a funnel. Each layer catches a different class of redundant work before it reaches the GPU.

Here's how the architecture looks in practice, from outermost to innermost:

Incoming Request
       │
       ▼
┌─────────────────────────────┐
│  Layer 3: Inference Cache   │  ← Exact output memoization / CDN
│  (Application / Gateway)   │
└────────────┬────────────────┘
             │ Miss
             ▼
┌─────────────────────────────┐
│  Layer 1: Semantic Cache    │  ← Vector similarity match
│  (Application / Gateway)   │    (GPTCache, Redis, Qdrant)
└────────────┬────────────────┘
             │ Miss
             ▼
┌─────────────────────────────┐
│  Layer 2: Prefix Cache      │  ← KV tensor reuse (prefill skip)
│  (Inference Engine)         │    (vLLM, SGLang, TensorRT-LLM)
└────────────┬────────────────┘
             │ Miss
             ▼
       Full Inference
       (GPU compute)

Layer 3 (inference cache) sits outermost because it's the cheapest check - a simple key-value lookup. Layer 1 (semantic cache) runs next, using vector similarity to catch paraphrased queries. Layer 2 (prefix cache) operates inside the inference engine itself, eliminating prefill computation for shared prompt prefixes.

Each layer has a different cost profile, a different hit rate, and a different implementation complexity. (For a focused look at how prefix and semantic caching compose in a single stack, the two innermost layers are worth studying together.) We'll cover each in detail.

Layer 1: Semantic Caching - Match by Meaning, Not Exact Text

Semantic caching intercepts requests at the application layer and returns a cached response when the incoming query is semantically similar to one already answered - even if the wording is completely different.

This is the layer that turns "What are your business hours?" and "When are you open?" into a single cached response. Traditional caches fail here. Semantic caching doesn't.

How Does Semantic Caching Work?

The mechanism has four steps:

Embed the query - convert the incoming prompt into a high-dimensional vector using a model like all-MiniLM-L6-v2 (384 dimensions) or BGE-M3 (512 dimensions)
Search the vector store - run an approximate nearest-neighbor search against cached embeddings using HNSW (O(log n) vs. O(n) for exhaustive search)
Compare similarity - calculate cosine similarity between the new query and cached entries
Return or miss - if similarity exceeds the threshold, return the cached response; otherwise, call the model and cache the result

What Similarity Threshold Should You Use?

This is the most consequential tuning decision in semantic caching. Get it wrong and you either miss valid hits or return wrong answers.

Threshold Range	Use Case	Risk
0.95–1.0	Near-duplicate detection, factual Q&A	Very low false positives
0.85–0.95	Paraphrases, conversational assistants	Balanced
0.75–0.85	Broader semantic equivalence	Moderate false positive risk
< 0.75	High-recall, low-precision scenarios	High false positive risk

The GPTCache research paper (Regmi & Pun, 2024) found that 0.8 was the empirically optimal threshold - achieving 68.8% cache hit rates while keeping positive hit accuracy above 97%.

Which Vector Stores Work Best?

The main options in production today:

Redis / Valkey with Search - fastest for in-memory lookup; vector search overhead is just 5–20ms
Milvus - best for large-scale deployments with billions of embeddings
Qdrant - strong filtering capabilities, good for multi-tenant setups
Pinecone - managed, low-ops, higher cost at scale

For most teams starting out, Redis is the right call. The latency overhead is negligible compared to the 1–5 seconds you're saving on each LLM call.

What Are the Real Performance Numbers?

The GPTCache benchmarks across 8,000 query-answer pairs (Python programming, customer support, shipping, shopping QA) showed:

61.6–68.8% cache hit rate across all categories
92.5–97.3% positive hit accuracy - cached responses were correct
API call reduction: up to 68.8%
Cost reduction: 40–80% depending on hit rate
Latency improvement: 2–4x on cache hits; 25–250x for highly repetitive workloads

At a 70% hit rate on 1 million daily queries, the math works out to roughly $980/month saved in a documented H100 on-demand scenario - dropping from ~$1,410/month to ~$427/month.

Layer 2: Prefix Caching - Reuse the KV Cache at the Token Level

Prefix caching eliminates redundant prefill computation by reusing the KV (key-value) tensors from the attention layers for any tokens that appear at the start of multiple requests.

This is the highest-leverage optimization for most production LLM workloads. If you have a 2,000-token system prompt that's identical across every request, you're recomputing those 2,000 tokens from scratch every single time - unless you have prefix caching enabled.

What Is the KV Cache, Exactly?

During inference, the transformer's attention mechanism generates key and value tensors for every token it processes. These tensors are stored in the KV cache so the model doesn't recompute them when generating subsequent tokens in the same sequence.

Prefix caching extends this across requests. If request A and request B both start with the same 1,000-token system prompt, the KV tensors for those 1,000 tokens only need to be computed once. Every subsequent request reuses them directly.

How Do the Major Inference Engines Implement It?

Three engines dominate here, and they differ significantly:

vLLM - Flat Prefix Caching vLLM uses PagedAttention to partition the KV cache into fixed-size blocks, reducing memory fragmentation from 60–80% waste down to < 4%. Its prefix caching stores KV pages for a single shared prefix per request. Good for straightforward system-prompt reuse. (More on PagedAttention's role in multi-tier cache efficiency.)

SGLang - RadixAttention SGLang's RadixAttention builds a radix tree (trie) of KV pages across all active requests. It dynamically finds the longest common prefix path, enabling KV reuse across few-shot examples, conversation history, and tool definitions - not just the system prompt. This is why SGLang eliminates 70–90% of prefill computation in few-shot and multi-turn workloads.

The overhead is minimal: RadixAttention management adds only 0.2 seconds across 100 requests - less than 0.3% of total time.

TensorRT-LLM - Static Compilation TensorRT-LLM achieves the highest raw throughput (~3,400 tok/s for Llama-3.1 70B on H100) but uses static compilation rather than a dynamic radix tree. It's the fastest for fixed-model, high-throughput production - but lacks SGLang's dynamic prefix reuse.

What Are the vLLM Prefix Caching Benchmarks?

The numbers are hard to ignore:

Metric	Without Prefix Caching	With Prefix Caching	Change
Output token throughput	427 tok/s	1,513 tok/s	+254%
Mean TTFT	4.34s	0.97s	−78%
Cache hit rate	-	~50% (realistic)	-

At 90% prefix overlap, you get a 32–49% throughput boost. At near-100% hit rates, TTFT drops from seconds to sub-second - effectively making long-context requests feel instant.

How Does Provider-Level Prompt Caching Compare?

OpenAI, Anthropic, and Google all offer prefix caching at the API level - they just call it "prompt caching":

OpenAI: Automatic for prompts ≥1,024 tokens. Cache reads cost 50% of standard input price. TTL: 5 minutes.
Anthropic: Developer-controlled via explicit cache_control breakpoints. Cache reads cost 10% of standard input price (90% discount). TTL: 5 minutes or 1 hour.
Google: Explicit CachedContent objects. ~50% discount on cached reads. Default TTL: 1 hour.

A 2026 study across 500+ agentic sessions (DeepResearch Bench) found prompt caching reduces API costs by 41–80% and TTFT by 13–31% across providers. The key finding: system-prompt-only caching outperforms naive full-context caching - caching dynamic tool results can paradoxically increase latency.

Layer 3: Inference Caching - Cache the Final Output

Inference caching (also called output memoization) stores the complete generated response for a given input and returns it instantly on an exact match - no model call, no prefill, no decode.

This is the simplest layer conceptually and the most powerful when it hits. The tradeoff: it only works for truly identical inputs.

What Does Inference Caching Cover?

Three main patterns:

Request-response memoization - store the full prompt + parameters + response. Return instantly on exact match. Ideal for static content, pre-computed answers, and batch pipelines.
CDN-level LLM API caching - platforms like Cloudflare Workers KV, Fastly AI Accelerator, and Helicone cache LLM responses at the edge. Responses served in milliseconds from the node closest to the user. Cloudflare distributes this across 300+ global locations automatically.
Gateway-level caching - tools like Gravitee and Azure API Management implement caching as a policy layer in the AI gateway, intercepting requests before they reach the model provider.

When Should You Use Inference Caching?

It's the right tool when:

Responses are deterministic - same input always produces the same output (temperature = 0)
Queries are highly repetitive - FAQ bots, product description generators, code snippet lookups
Latency is critical - you need sub-100ms responses that a model call can't provide
Cost per call is high - caching a $0.05 Claude call that gets hit 1,000 times saves $50

It's the wrong tool when:

Responses need to be personalized or time-sensitive
Prompts include dynamic variables (session IDs, timestamps) in the prefix
Creative generation is required (temperature > 0)

TTL management matters here. Add random TTL jitter to prevent cache expiration storms - where all cached entries expire simultaneously and suddenly spike GPU load.

How the Three Layers Work Together

The three layers are complementary, not competing. Each catches a different class of redundant work. Together, they can eliminate 80%+ of GPU compute for the right workloads.

Here's a concrete architecture walkthrough for a RAG-based customer support system:

Request arrives: "How do I reset my password?"

Inference cache check (Layer 3): Is this exact prompt + parameters in the cache? If yes → return in <10ms. If no → continue.
Semantic cache check (Layer 1): Embed the query. Search Redis vector store. Is there a cached response with cosine similarity ≥ 0.85? The system finds a cached response for "What's the process for resetting my account password?" (similarity: 0.91). → Return cached response in ~20ms. If no → continue.
Prefix cache check (Layer 2): The request reaches the inference engine (vLLM or SGLang). The 3,000-token system prompt + knowledge base context is already in the KV cache from a previous request. Prefill is skipped for those tokens. Only the new user query (8 tokens) needs to be processed. TTFT drops from 4.3s to 0.97s.
Full inference (fallback): Only genuinely novel queries reach this stage.

The Right Prompt Structure for Maximum Cache Efficiency

Structure matters enormously. Always put static content first:

[System prompt - STATIC, cacheable]
[Knowledge base / documents - STATIC, cacheable]
[Few-shot examples - STATIC, cacheable]
[User query - DYNAMIC, not cached]

A single dynamic element (timestamp, session ID, user name) inserted before the static content breaks the prefix cache entirely. Keep dynamic content at the end.

Multi-Tier KV Cache Storage

For enterprise deployments, the KV cache itself can span multiple storage tiers (as implemented in LMCache) - a matter of KV cache management across inference layers:

GPU VRAM - active working set (fastest, limited capacity)
CPU DRAM - hot cache buffer using pinned memory for fast GPU↔CPU transfers
Local NVMe / remote storage (Redis, Mooncake, InfiniStore) - persistent cross-session cache

LMCache + vLLM benchmarks show 3.7–19x faster TTFT compared to GPU-only configurations, with throughput reaching 33K–66K tokens/sec vs. ~14K tok/s baseline.

Real-World Benchmarks: What the Numbers Actually Say

We've pulled together the key numbers from production benchmarks and peer-reviewed research. Here's the full picture:

Semantic Caching (GPTCache + Redis)

Metric	Value	Source
API call reduction	61.6–68.8%	GPTCache paper (Regmi & Pun, 2024)
Cost reduction	40–80%	Percona benchmark
Positive hit accuracy	92.5–97.3%	GPTCache paper
Vector search overhead	5–20ms	Redis Vector Sets
Latency improvement	2–4x (cache hits)	Redis semantic caching
Peak speedup	25–250x (repeated prompts)	Percona benchmark

Prefix Caching (vLLM / SGLang)

Metric	Value	Source
Throughput gain	+254% (427 → 1,513 tok/s)	vLLM benchmarks
TTFT reduction	−78% (4.34s → 0.97s)	vLLM benchmarks
Prefill elimination (SGLang)	70–90%	RadixAttention paper
SGLang overhead	<0.3% of total time	SGLang benchmarks
Realistic cache hit rate	~50%	vLLM production data

Provider-Level Prompt Caching

Provider	Cost Discount	TTFT Improvement	TTL
Anthropic (Claude)	90% on cache reads	Up to 85%	5 min / 1 hour
OpenAI (GPT-4o+)	50% on cache reads	~50%	5 minutes
Google (Gemini)	~50% on cache reads	~50%	1 hour
Amazon Bedrock	Up to 90%	Up to 85%	5 minutes

Inference Engine Comparison (Llama-3.1 70B, H100, TP=4)

Engine	Throughput	TTFT (p50)	Best For
TensorRT-LLM	~3,400 tok/s	~105ms	Raw throughput, fixed models
SGLang	~2,900 tok/s	~112ms	Shared-prefix, multi-turn, RAG
vLLM	~2,800 tok/s	~120ms	General use, broad model support

SGLang's p95 latency is consistently 5–8% lower than vLLM across all concurrency levels. For small models (7B–8B), SGLang delivers ~29% higher throughput than vLLM on H100.

How to Choose Your Caching Strategy

Start with prefix caching. It's the highest-leverage, lowest-risk optimization available - and it's often a single config flag.

Here's a decision framework:

Step 1: Enable Prefix Caching First

If you're using a managed API (OpenAI, Anthropic, Google), prompt caching is either automatic or a one-line change. Do this today. The cost savings are immediate and require zero changes to your application logic.

If you're self-hosting with vLLM, enable automatic prefix caching with --enable-prefix-caching. If you're running multi-turn or few-shot workloads, switch to SGLang for RadixAttention.

Target: 80%+ cache hit rate with stable system prompts.

Step 2: Add Semantic Caching for Repetitive Query Patterns

Add semantic caching if your workload includes:

FAQ-style queries with natural language variation
Customer support bots
Product lookup or documentation search
Any high-traffic endpoint where users ask the same thing differently

Use Redis + all-MiniLM-L6-v2 for a fast start. Set your similarity threshold at 0.85 and tune from there based on false positive rates.

Target: 30–60% hit rate on your semantic cache layer.

Step 3: Add Inference Caching for Deterministic Endpoints

Layer inference caching on top for any endpoint where:

Temperature = 0 (deterministic outputs)
Inputs are highly repetitive and exact-match is possible
You want CDN-level distribution (Cloudflare, Fastly AI Accelerator)

What to Avoid

Don't inject timestamps or session IDs into your system prompt prefix - it breaks prefix caching entirely
Don't set semantic similarity thresholds below 0.75 without extensive validation - false positives will return wrong answers (and false-hit monitoring across cache tiers is how you catch them)
Don't cache tool results in agentic workflows - dynamic tool outputs break prefix reuse and add cache write overhead without corresponding read benefits
Don't skip TTL jitter - synchronized cache expiration causes GPU load spikes

Quick Decision Matrix

Your Workload	Best Layer to Start
Long static system prompts	Prefix caching (vLLM/SGLang or provider API)
FAQ / repetitive natural language queries	Semantic caching (GPTCache + Redis)
Identical API calls at high volume	Inference caching (memoization / CDN)
Multi-turn conversations	Prefix caching (SGLang RadixAttention)
RAG with shared document context	Prefix caching + semantic caching
Agentic workflows	Prefix caching (system prompt only)

FAQ

What is a multi-tier LLM cache?

A multi-tier LLM cache is an architecture that applies caching at three distinct levels of the inference pipeline: the application layer (semantic caching), the inference engine layer (prefix/KV caching), and the output layer (inference caching). Each layer catches a different class of redundant computation. Together, they can reduce LLM inference costs by 80%+ and cut latency by up to 78% compared to uncached inference.

What is the difference between semantic caching and prefix caching?

Semantic caching operates at the application layer and matches incoming queries to cached responses based on vector similarity - it can return a cached answer even if the wording is completely different. Prefix caching operates inside the inference engine and reuses the computed KV (key-value) tensors for tokens that appear at the start of multiple requests. Semantic caching skips the model call entirely; prefix caching reduces the cost of the model call by eliminating redundant prefill computation.

How does vLLM prefix caching work?

vLLM's prefix caching stores the KV cache blocks (computed attention key-value tensors) for shared prompt prefixes. When a new request arrives that shares a prefix with a previously processed request, vLLM reuses the cached KV blocks instead of recomputing them during the prefill phase. This is enabled with the --enable-prefix-caching flag. Benchmarks show +254% throughput and −78% TTFT improvement with prefix caching enabled vs. disabled.

What is SGLang RadixAttention and how does it differ from vLLM prefix caching?

SGLang's RadixAttention builds a radix tree (trie) of KV cache pages across all active requests, dynamically finding the longest common prefix path. Unlike vLLM's flat prefix caching (which handles a single shared prefix per request), RadixAttention can reuse KV states across few-shot examples, conversation history, and tool definitions simultaneously. This enables 70–90% prefill elimination in multi-turn and few-shot workloads, with management overhead under 0.3% of total inference time.

What cosine similarity threshold should I use for semantic caching?

The recommended starting point is 0.85, which balances hit rate against false positive risk. Thresholds of 0.85–0.95 catch paraphrases and semantically equivalent queries with high accuracy. Below 0.75, false positive rates increase significantly - the cache may return incorrect answers for unrelated queries. The GPTCache research found 0.8 to be empirically optimal, achieving 68.8% hit rates with 97%+ positive accuracy. Tune based on your specific domain and acceptable false positive rate.

How much can I realistically save with LLM caching?

It depends heavily on your workload's repetition patterns. With semantic caching at a 70% hit rate, documented benchmarks show ~$980/month savings on a 1M queries/day workload running on H100 on-demand. Provider-level prompt caching (Anthropic) delivers up to 90% cost reduction on cached input tokens. A well-architected multi-tier system combining all three layers can exceed 80% total cost reduction vs. naive full-inference for workloads with stable system prompts and repetitive query patterns. Low-repetition creative workloads will see much lower gains.

Key Takeaways

The three-layer framework, summarized:

Semantic caching (application layer) - match by meaning, not exact text. Tools: GPTCache, Redis, Qdrant, Milvus. Threshold: 0.85. Expected savings: 40–80%.

Prefix caching (inference engine layer) - reuse KV tensors for shared prompt prefixes. Tools: vLLM (flat), SGLang (RadixAttention), TensorRT-LLM. Expected gains: +254% throughput, −78% TTFT.

Inference caching (output layer) - memoize full responses for identical inputs. Tools: Redis, Cloudflare Workers KV, Fastly AI Accelerator, Helicone. Best for: deterministic, high-repetition endpoints.

Implementation order: Prefix caching first → semantic caching second → inference caching third.

The golden rule: Static content first in your prompt. Dynamic content last. A single dynamic element in the prefix breaks prefix caching entirely.

Combined savings potential: 80%+ cost reduction and sub-second TTFT for workloads with stable prompts and repetitive query patterns.

What's your current caching setup? Are you running all three layers, or still on a single-tier approach? Drop a comment below - we're particularly interested in what similarity thresholds teams are using in production and whether SGLang's RadixAttention is living up to the benchmarks in real workloads.

Useful Sources

Regmi, S. & Pun, C.P. (2024). GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching. arXiv:2411.05276. https://arxiv.org/abs/2411.05276
Biton, D. & Friedman, R. (2026). From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings. arXiv:2603.03301. https://arxiv.org/abs/2603.03301
Lumer, E. et al. (2026). Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks. arXiv:2601.06007. https://arxiv.org/abs/2601.06007
Pan, Z. et al. (2025). KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows. arXiv:2507.07400. https://arxiv.org/abs/2507.07400
vLLM Documentation - Automatic Prefix Caching. https://docs.vllm.ai/en/stable/design/prefix_caching/
LMCache Architecture Overview. https://docs.lmcache.ai/developer_guide/architecture.html
Redis - What is Semantic Caching? https://redis.io/blog/what-is-semantic-caching/
Percona - Semantic Caching for LLM Apps: Reduce Costs by 40–80%. https://www.percona.com/blog/semantic-caching-for-llm-apps-reduce-costs-by-40-80-and-speed-up-by-250x/
Anthropic - Prompt Caching Documentation. https://platform.claude.com/docs/en/build-with-claude/prompt-caching
SGLang - RadixAttention Concepts. https://sgl-project-sglang-93.mintlify.app/concepts/radix-attention

Keep reading

llmcachingcost optimization

LLM Cache Pre-Warming for Off-Peak Customer Service Bots

Your customer service bot is recomputing the same system prompt thousands of times a day. LLM cache pre-warming stops that waste - and the benchmarks are staggering: 57x faster TTFT, 2x throughput, up to 90% cost reduction.

MKMohammed Kafeel

17 min read

llmcachingsemantic caching

Category-Aware Semantic Caching for LLM Workloads

Traditional caching misses 85–90% of LLM queries. Category-aware semantic caching fixes that - delivering 250× faster responses, 40–80% cost cuts, and cache coverage across your entire workload distribution.

MKMohammed Kafeel

22 min read

llmroutingproduction

Signal-Driven Routing for Mixture-of-Models in Production

Signal-driven routing replaces static LLM classification with composable keyword, embedding, and domain signals - cutting costs 3.66x while preserving 95% of GPT-4 quality in production mixture-of-models deployments.

SYShubham Yadav

16 min read

Back to all posts