Multi-Tier LLM Cache: Semantic, Prefix & Inference Layers
How to stack semantic, prefix, and inference-layer caches into a single pipeline that maximises hit rate while controlling cost and staleness.
Mohammed Kafeel
Machine Learning Researcher
Quick answer: A multi-tier LLM cache layers three independent caching strategies so each request is served as cheaply as possible. Tier 1 (semantic cache): if a new query is semantically similar to a past one, return the stored answer and skip the model entirely (≈100% saving). Tier 2 (prefix cache): on a semantic miss, the model runs but reuses the KV/processed form of any shared prompt prefix (≈90% saving on the shared input). Tier 3 (inference reuse): within the running engine, identical in-flight or recently computed work is shared (KV blocks, deduplicated requests). Requests fall through the tiers from cheapest to most expensive: free semantic hits at the top, discounted model calls in the middle, full compute only at the bottom. Done right, this captures both kinds of savings, eliminating calls and cheapening the ones that remain, without sacrificing correctness where it matters.
Why one cache is never enough
Each caching strategy solves a different problem and fails at the others:
- A semantic cache can return a stored answer for free — but only for questions that genuinely repeat, and it risks returning a stale or wrong answer if tuned loosely.
- A prefix cache never changes the output and works for any shared context — but the model still runs, so it only discounts a call, never eliminates it.
- Inference-level reuse squeezes the running engine — but only helps requests that already reached the engine.
No single tier covers the full space of "how can this request be cheaper?" A multi-tier cache composes them so that every request is handled by the cheapest tier that can correctly serve it, falling through to the next only when the current one misses.
The mental model is a waterfall: free hits at the top, discounted compute in the middle, full compute at the bottom. The goal is to push as much traffic as high up the waterfall as is safe.
The three tiers
Tier 1 — Semantic cache (skip the model)
Stores complete question→answer pairs and returns a stored answer when a new query is semantically similar to a previous one. On a hit, the LLM is never called.
- Match: embedding similarity above a threshold (e.g., cosine > 0.95).
- Saving: ≈100% of the call — just an embedding + vector search.
- Risk: false hits. A query that only looks similar can return the wrong answer.
- Fits: repeated, stable questions (FAQ, support, docs assistants).
- Never use for: personalized, live, or high-stakes answers.
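As a rough illustration of the Tier 1 match rule, here is a minimal in-memory sketch. The class name `ToySemanticCache`, the three-dimensional vectors, and the linear scan are all assumptions made for the example; a real deployment would use an embedding model and a vector store.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToySemanticCache:
    """Hypothetical stand-in for a vector store: linear scan over stored pairs."""
    threshold: float = 0.95
    entries: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, embedding: list[float], answer: str) -> None:
        self.entries.append((embedding, answer))

    def search(self, embedding: list[float]) -> Optional[str]:
        """Return the most similar stored answer, but only above the threshold."""
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None


cache = ToySemanticCache(threshold=0.95)
cache.add([1.0, 0.0, 0.0], "Reset it from Settings > Account.")

near = cache.search([0.99, 0.05, 0.0])  # almost parallel to the stored vector: hit
far = cache.search([0.0, 1.0, 0.0])     # orthogonal to the stored vector: miss
```

The threshold is the only thing standing between a near-duplicate and a false hit, which is why the policy gates listed above matter more than the similarity math.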
Tier 2 — Prefix cache (cheapen the model call)
Reuses the processed form (KV cache / billed tokens) of a shared, exact prompt prefix, so the model doesn't reprocess the system prompt, few-shot examples, or documents on every call. The model still generates a fresh answer.
- Match: exact byte/token prefix match.
- Saving: ≈90% on the cached input portion once the prefix is cached (Anthropic cache reads ≈0.1× input price, while the initial cache write is billed at a premium over normal input; self-hosted vLLM saves prefill GPU time).
- Risk: none — output is identical to uncached.
- Fits: any workload with a large shared prefix (always safe to enable).
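To see why the match has to be an exact prefix, the sketch below measures how many leading tokens two prompts share. Whitespace-split words stand in for real tokenizer output, and the prompt text and `ts=` markers are invented for the example.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count how many leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Whitespace tokens are an illustrative substitute for a real tokenizer.
SYSTEM = "You are a support assistant . Answer from the handbook .".split()

# Stable content first, unique question last: the whole system prompt is reusable.
stable_a = SYSTEM + "How do I reset my password ?".split()
stable_b = SYSTEM + "Where is my invoice ?".split()

# A volatile value at the front breaks the match at the first token.
volatile_a = ["ts=1001"] + SYSTEM + "How do I reset my password ?".split()
volatile_b = ["ts=1002"] + SYSTEM + "Where is my invoice ?".split()

reusable_stable = shared_prefix_len(stable_a, stable_b)
reusable_volatile = shared_prefix_len(volatile_a, volatile_b)
```

Both pairs contain the same system prompt, yet only the stable-first ordering leaves any of it reusable.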
Tier 3 — Inference reuse (squeeze the engine)
Within the serving engine, share computation across concurrent and recent requests. This is the lowest, most mechanical tier.
- KV block reuse across requests (vLLM Automatic Prefix Caching — overlaps with Tier 2 when self-hosting).
- Request deduplication / coalescing — collapse identical in-flight requests into one computation, fan the result back out.
- Continuous batching — the scheduler packs concurrent sequences to maximize GPU utilization.
- Saving: higher throughput and utilization → lower cost per request.
- Fits: self-hosted, high-concurrency serving.
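Request coalescing from the list above can be sketched with `asyncio`. The `Coalescer` class and `fake_model` function are hypothetical names for this example, and keying on the raw prompt string is a simplifying assumption; a gateway would normally key on the full request (model, parameters, prompt).

```python
import asyncio
from typing import Any, Callable, Coroutine


class Coalescer:
    """Share one in-flight computation among callers that pass the same key."""

    def __init__(self, compute: Callable[[str], Coroutine[Any, Any, str]]) -> None:
        self._compute = compute
        self._inflight: dict[str, asyncio.Task] = {}

    async def get(self, key: str) -> str:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the work; later callers await it.
            task = asyncio.create_task(self._compute(key))
            self._inflight[key] = task
            # Forget the task once it finishes so later requests recompute.
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        return await task


calls = 0


async def fake_model(prompt: str) -> str:
    """Stand-in for an expensive model call; counts how often it really runs."""
    global calls
    calls += 1
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


async def demo() -> list[str]:
    coalescer = Coalescer(fake_model)
    return await asyncio.gather(*(coalescer.get("same prompt") for _ in range(5)))


results = asyncio.run(demo())
```

Five concurrent identical requests trigger a single call to `fake_model`, and all five callers receive the same result. This only deduplicates work that is in flight at the same moment; reuse of finished answers is the job of Tier 1.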
The request flow
                   Incoming request
                           │
                           ▼
           ┌───────────────────────────────┐
TIER 1 ───▶│ Semantic cache (embed+search) │── hit ─▶ return stored answer
           └───────────────────────────────┘          (no model call, free)
                           │ miss
                           ▼
           ┌───────────────────────────────┐
TIER 2 ───▶│ Model call w/ PREFIX caching  │── shared prefix served at ≈0.1×
           └───────────────────────────────┘   (model runs, fresh answer)
                           │
                           ▼
           ┌───────────────────────────────┐
TIER 3 ───▶│ Inference engine reuse        │── KV-block reuse, dedup,
           │ (vLLM APC, dedup, batching)   │   continuous batching
           └───────────────────────────────┘
                           │
                           ▼
                     Fresh answer
                           │
                           ▼
          write back to Tier 1 (if cacheable)
The cost gradient runs top to bottom: free → discounted → full compute. A request only descends a tier when the tier above can't serve it correctly.
Step-by-step: building the stack
Step 1 — Put a semantic cache in front
Embed the query, search a vector store, and serve on a confident hit. Be conservative with the threshold and explicit about what is not cacheable.
def handle(query, user_ctx):
    # Skip Tier 1 entirely for content that must never be reused
    if is_personalized(query) or is_time_sensitive(query):
        return tier2_model_call(query, user_ctx)

    emb = embed(query)
    hit = semantic_store.search(emb, threshold=0.97)     # conservative
    if hit and not hit.is_stale():
        record_hit("semantic")
        return hit.answer                                # free, instant

    answer = tier2_model_call(query, user_ctx)           # miss → descend
    semantic_store.add(emb, answer, ttl=hit_ttl(query))  # write-back
    return answer
Design decisions that matter here:
- Threshold trades hit rate against false hits — tune on real traffic with an eval set.
- TTL / invalidation — give cached answers a freshness window; invalidate when the underlying data changes.
- Bypass list — personalized/live/high-stakes queries skip Tier 1 by policy, not by threshold luck.
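One way to ground the threshold decision is to sweep candidate thresholds over a labelled eval set and compare hit rate against false-hit rate. The sketch below assumes each eval pair already carries a similarity score and a human or LLM-judge label; the `EvalPair` type and the eight sample rows are invented for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class EvalPair:
    similarity: float  # cosine similarity between a new query and a cached query
    same_answer: bool  # judge label: would the cached answer have been correct?


def sweep(pairs: list[EvalPair], thresholds: list[float]) -> list[tuple[float, float, float]]:
    """Return (threshold, hit_rate, false_hit_rate) for each candidate threshold."""
    rows = []
    for t in thresholds:
        hits = [p for p in pairs if p.similarity > t]
        false_hits = [p for p in hits if not p.same_answer]
        hit_rate = len(hits) / len(pairs)
        # False-hit rate is measured among served hits, not among all queries.
        false_hit_rate = len(false_hits) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


# Made-up eval set for illustration only.
EVAL = [
    EvalPair(0.99, True), EvalPair(0.98, True), EvalPair(0.96, True),
    EvalPair(0.96, False), EvalPair(0.93, True), EvalPair(0.92, False),
    EvalPair(0.85, False), EvalPair(0.70, False),
]

table = sweep(EVAL, [0.90, 0.95, 0.97])
for threshold, hit_rate, false_hit_rate in table:
    print(f"threshold={threshold:.2f}  hit_rate={hit_rate:.2f}  false_hit_rate={false_hit_rate:.2f}")
```

On this toy data, raising the threshold from 0.90 to 0.97 cuts the hit rate from 0.75 to 0.25 and the false-hit rate from about 0.33 to 0.00, which is the trade-off the first bullet describes.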
Step 2 — Enable prefix caching on the model call
Everything that misses Tier 1 hits the model — so make those calls cheap by reusing the shared prefix. Structure every prompt stable-first, volatile-last so the prefix matches.
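import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def tier2_model_call(query, user_ctx):
    # Stable prefix (cached) first; unique question last.
    response = client.messages.create(
        model="claude-opus-4-8",  # substitute a model ID available to your account
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SHARED_SYSTEM_PROMPT,            # identical across all calls
            "cache_control": {"type": "ephemeral"},  # prefix-cache it
        }],
        messages=[{"role": "user", "content": query}],  # volatile, last
    )
    # Return the answer text so Tier 1 can store it, not the whole response object.
    return response.content[0].text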
For self-hosted vLLM, Tier 2 and Tier 3 partly merge: enabling Automatic Prefix Caching gives you prefix reuse and KV-block inference reuse from the same mechanism.
vllm serve <model> --enable-prefix-caching --gpu-memory-utilization 0.90
Step 3 — Optimize the inference engine (self-hosted)
If you run your own serving layer, capture Tier 3:
- Automatic Prefix Caching — KV blocks reused across requests (also your Tier 2 on self-hosted).
- Continuous batching — on by default in vLLM; keeps the GPU saturated.
- Request coalescing — collapse identical concurrent requests where your gateway supports it.
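- fp8 KV cache: roughly doubles cache capacity relative to a 16-bit KV cache, raising reuse hit rates. It quantizes the cached keys and values, so validate output quality before enabling it.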
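The capacity claim for an fp8 KV cache follows from simple arithmetic: each token stores one key and one value vector per layer and per KV head. The model dimensions and the 20 GiB budget below are hypothetical numbers chosen for the example, and the estimate ignores any quantization scale metadata. In vLLM the setting is, to my knowledge, exposed as `--kv-cache-dtype`; check the documentation for your version.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape and memory budget, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
KV_BUDGET_BYTES = 20 * 1024**3  # 20 GiB set aside for the KV cache

fp16_tokens = KV_BUDGET_BYTES // kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_tokens = KV_BUDGET_BYTES // kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

print(f"16-bit KV cache holds {fp16_tokens:,} tokens")
print(f"fp8 KV cache holds    {fp8_tokens:,} tokens")
```

With these assumed dimensions the same budget holds twice as many cached tokens at one byte per element, so more prefixes stay resident before eviction.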
On a managed API (Anthropic/OpenAI), Tier 3 is handled by the provider — you get it implicitly and tune only Tiers 1–2.
Step 4 — Write-back and invalidation
The fresh answer flows back up to Tier 1 so the next identical question is free. This is where correctness lives or dies:
- Only write back cacheable answers — never personalized or time-sensitive ones.
- Set a TTL matched to how fast the underlying truth changes.
- Invalidate on source change — if the document or data behind an answer updates, evict the dependent cache entries.
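The write-back rules above can be sketched as a small cache that attaches a TTL and a set of source IDs to each entry. The class `InvalidatingCache`, the exact-string keys, and the file names are assumptions made for the example; a semantic store would key on embeddings, but the TTL and eviction logic would look similar.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Entry:
    answer: str
    expires_at: float
    sources: frozenset


class InvalidatingCache:
    """Toy cache with per-entry TTLs and eviction by source document."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, Entry] = {}
        self._by_source: dict[str, set[str]] = {}

    def put(self, key: str, answer: str, ttl_s: float, sources: Iterable[str]) -> None:
        sources = frozenset(sources)
        self._entries[key] = Entry(answer, self._clock() + ttl_s, sources)
        for source_id in sources:
            self._by_source.setdefault(source_id, set()).add(key)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return entry.answer

    def invalidate_source(self, source_id: str) -> int:
        """Evict every entry that depended on the changed source; return the count."""
        evicted = 0
        for key in self._by_source.pop(source_id, set()):
            if self._entries.pop(key, None) is not None:
                evicted += 1
        return evicted


now = [0.0]  # injectable fake clock so the example is deterministic
cache = InvalidatingCache(clock=lambda: now[0])
cache.put("refund policy?", "30 days", ttl_s=60, sources=["policy.md"])
cache.put("shipping time?", "3-5 days", ttl_s=60, sources=["shipping.md"])

evicted = cache.invalidate_source("policy.md")  # the policy document changed
after_invalidate = cache.get("refund policy?")
still_cached = cache.get("shipping time?")
now[0] = 61.0                                   # move past the TTL
after_ttl = cache.get("shipping time?")
```

Tracking sources per entry lets a document update evict only the answers that depended on it, while the TTL acts as a backstop for changes nobody signalled.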
How the tiers compose: a cost walkthrough
Say 100 requests arrive, of which 30 are semantic repeats and the other 70 share a 4,000-token system prompt with a ~50-token unique question:
| Stage | Requests | Cost driver |
|---|---|---|
| Tier 1 hits (semantic repeats) | 30 | ≈ free (embed + vector search only) |
| Tier 2 (model call, prefix cached) | 70 | ≈0.1× on 4,000 shared tokens + full on 50 |
| Tier 3 (engine reuse, self-hosted) | (the 70) | higher throughput → lower GPU-time per call |
Versus a naive no-cache baseline where all 100 requests pay full price for 4,050 tokens each, the multi-tier stack:
- Eliminates 30 calls outright (Tier 1),
- Discounts the shared 4,000 tokens by ~90% on the remaining 70 (Tier 2),
- Packs those 70 more efficiently onto the hardware (Tier 3).
The savings multiply across tiers rather than competing — which is the entire point of layering.
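The walkthrough can be checked with a few lines of arithmetic. The sketch counts input tokens only, in units of full-price token equivalents, and deliberately ignores output tokens, the one-time cache-write premium, and the cost of the embedding lookups; those omissions are simplifying assumptions.

```python
TOTAL_REQUESTS = 100
SEMANTIC_HITS = 30
PREFIX_TOKENS = 4_000
QUESTION_TOKENS = 50
CACHE_READ_MULTIPLIER = 0.1  # cached prefix billed at about 0.1x input price

full_call = PREFIX_TOKENS + QUESTION_TOKENS
cached_call = CACHE_READ_MULTIPLIER * PREFIX_TOKENS + QUESTION_TOKENS

baseline = TOTAL_REQUESTS * full_call                       # no caching at all
tier1_only = (TOTAL_REQUESTS - SEMANTIC_HITS) * full_call   # semantic cache alone
tier2_only = TOTAL_REQUESTS * cached_call                   # prefix cache alone
both = (TOTAL_REQUESTS - SEMANTIC_HITS) * cached_call       # tiers 1 and 2 stacked

for name, cost in [("tier 1 only", tier1_only), ("tier 2 only", tier2_only), ("both", both)]:
    print(f"{name:12s} saving = {1 - cost / baseline:.1%}")

# The remaining-cost fractions multiply: (70/100) * (450/4050) of the baseline.
remaining = (tier1_only / baseline) * (tier2_only / baseline)
```

Under these assumptions Tier 1 alone saves 30%, Tier 2 alone about 89%, and the two together about 92%, because the fraction of cost left after each tier multiplies rather than adds.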
Tier comparison at a glance
| Property | Tier 1: Semantic | Tier 2: Prefix | Tier 3: Inference reuse |
|---|---|---|---|
| What's reused | Whole answers | Processed input prefix | KV blocks / in-flight compute |
| Match | Semantic similarity | Exact token prefix | Exact blocks / identical work |
| Model called? | No (on hit) | Yes | Yes |
| Saving | ≈100% of the call | ≈90% of shared input | Throughput / utilization |
| Correctness risk | Real (false hits) | None | None |
| Where it lives | App layer | App + provider/engine | Serving engine |
| Always safe to use? | No (policy-gated) | Yes | Yes |
Common pitfalls
| Pitfall | Tier | Fix |
|---|---|---|
| Semantic threshold too loose | 1 | Tighten; measure false-hit rate on real queries |
| Caching personalized/live answers semantically | 1 | Policy-based bypass list; never write them back |
| No TTL or source-change invalidation | 1 | Attach TTLs; evict on underlying-data updates |
| Volatile content early in the prompt | 2 | Stable-first, volatile-last so the prefix matches |
| Low prefix-cache hit rate unnoticed | 2/3 | Monitor hit-rate metrics; it's a prompt-structure issue |
| Treating tiers as alternatives | all | They compose — layer them, don't pick one |
| No per-tier observability | all | Track hit rate, false-hit rate, and cost delta per tier |
| Over-caching low-repeat traffic semantically | 1 | If repeats are rare, skip Tier 1 — it adds latency for no gain |
Observability: measure each tier
You cannot tune what you don't measure. Track, per tier:
- Tier 1: semantic hit rate, false-hit rate (sample and human/LLM-judge cached answers), latency saved.
- Tier 2: prefix-cache hit rate (`cache_read_input_tokens` on APIs; `vllm:prefix_cache_hits_total` on vLLM), token-cost delta.
- Tier 3: GPU utilization, throughput (req/s), batch occupancy.
- End-to-end: blended cost per request and p50/p95 latency, before vs after.
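For the Tier 2 metric on an API that reports cache usage, one possible calculation is the share of input tokens served from cache. The sketch assumes usage records shaped like the Anthropic `usage` object, with `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` fields; the three sample records are invented, and the function name is hypothetical.

```python
from typing import Mapping, Optional, Sequence

USAGE_FIELDS = ("input_tokens", "cache_creation_input_tokens", "cache_read_input_tokens")


def prefix_cache_read_ratio(usages: Sequence[Mapping[str, Optional[int]]]) -> float:
    """Fraction of all input tokens that were read from the prompt cache."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    total = sum((u.get(name) or 0) for u in usages for name in USAGE_FIELDS)
    return read / total if total else 0.0


# Invented sample: the first call writes the 4,000-token prefix, the next two read it.
sample = [
    {"input_tokens": 50, "cache_creation_input_tokens": 4000, "cache_read_input_tokens": 0},
    {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 4000},
    {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 4000},
]

ratio = prefix_cache_read_ratio(sample)
print(f"prefix-cache read ratio: {ratio:.1%}")
```

A ratio that stays low despite a large shared prompt usually points to volatile content early in the prompt rather than to a caching fault.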
The single most important guardrail is Tier 1 false-hit rate — it's the only tier that can return a wrong answer, so it deserves continuous sampling, not a one-time tuning pass.
Frequently asked questions
What is a multi-tier LLM cache? It's a caching architecture that layers three strategies so each request is served by the cheapest tier that can handle it correctly. Tier 1 is a semantic cache that returns stored answers for repeated questions (skipping the model). Tier 2 is a prefix cache that discounts the shared input of model calls that still run. Tier 3 is inference-level reuse inside the serving engine. Requests fall through from free to discounted to full compute.
Why use three tiers instead of just a semantic cache? A semantic cache only helps repeated questions and carries false-hit risk, so it can't safely cover everything. Most traffic is unique queries that still share context (a system prompt or document) — those can't be semantically cached but can be prefix-cached, which has zero correctness risk. And inside the engine, KV-block reuse and batching cut cost further. Each tier covers what the others can't, and their savings multiply.
In what order should requests hit the tiers? Cheapest first: semantic cache → prefix-cached model call → inference-engine reuse. A request descends past Tier 1 only on a semantic miss; every request that reaches the model then benefits from prefix caching and engine-level reuse together. This pushes as much traffic as safely possible to the free top tier, falls to discounted model calls in the middle, and only reaches full compute at the bottom.
How do I avoid serving wrong answers from the semantic tier? Use a conservative similarity threshold, exclude personalized/time-sensitive/high-stakes queries via a policy-based bypass list (not threshold luck), attach TTLs and invalidate on source-data changes, and continuously sample the false-hit rate with human or LLM-judge review. Only Tier 1 can return a wrong answer, so it needs the most guardrails.
Do Tier 2 and Tier 3 overlap? On self-hosted vLLM, yes — Automatic Prefix Caching provides both prefix reuse (Tier 2) and KV-block inference reuse (Tier 3) from one mechanism. On a managed API, the provider handles Tier 3 internally and exposes Tier 2 as prompt/prefix caching, so you tune only Tiers 1 and 2. The conceptual separation still helps you reason about where savings come from.
Is a multi-tier cache worth it for low-traffic apps? Not always. The semantic tier adds embedding + search latency and only pays off when questions actually repeat; for low-repeat traffic, skip it and keep just prefix caching (which is always safe and cheap to enable). Start with Tier 2, add Tier 1 when you observe real query repetition, and add Tier 3 tuning only if you self-host at meaningful concurrency.
Key takeaways
- A multi-tier cache layers semantic (skip the call) → prefix (cheapen the call) → inference reuse (squeeze the engine) so each request is served by the cheapest tier that can handle it correctly.
- The tiers compose, not compete — Tier 1 eliminates calls, Tier 2 discounts the survivors, Tier 3 packs them efficiently; savings multiply.
- Tier 2 (prefix) is always safe; Tier 1 (semantic) is the only tier that can return a wrong answer — gate it with a conservative threshold, a bypass policy, TTLs, and false-hit monitoring.
- On self-hosted vLLM, Tiers 2 and 3 partly merge via Automatic Prefix Caching; on a managed API, the provider owns Tier 3.
- Measure per tier — semantic hit/false-hit rate, prefix-cache hit rate, GPU utilization — and the blended cost/latency delta end-to-end.
- For low-repeat traffic, start with prefix caching alone and add the semantic tier only once you observe real repetition.
References
- Anthropic. Prompt caching — Claude API documentation. https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- OpenAI. Prompt caching — API documentation. https://platform.openai.com/docs/guides/prompt-caching
- vLLM. Automatic Prefix Caching — vLLM documentation. https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html
- Bang, F. (2023). GPTCache: An Open-Source Semantic Cache for LLM Applications. Proceedings of the 3rd Workshop on NLP Open Source Software (NLP-OSS). https://github.com/zilliztech/GPTCache
- Kwon, W., Li, Z., Zhuang, S., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP '23. https://arxiv.org/abs/2309.06180
- Redis. Semantic caching for LLMs (RedisVL). https://redis.io/docs/latest/develop/ai/