Category-Aware Semantic Caching for LLM Workloads

Traditional caching misses 85–90% of LLM queries. Category-aware semantic caching fixes that - delivering 250× faster responses, 40–80% cost cuts, and cache coverage across your entire workload distribution.

Mohammed Kafeel

Machine Learning Researcher

June 10, 2026

22 min read

On this page

What Is Semantic Caching for LLMs - and Why Does It Matter?
What "Category-Aware" Actually Means
The Four Category Properties That Drive Policy Decisions
The Hybrid Architecture: 4-Step Pipeline
Similarity Threshold Tuning by Category
Cache Eviction Policies: LRU, LFU, TTL, and the Category-Aware Formula
Real Performance Numbers
Prompt Caching vs. Semantic Caching: What's the Difference?
Practical Tools: GPTCache, Redis + LangChain
Challenges in Production
Best Practices for Production Deployment
Step-by-Step Implementation Guide
Key Takeaways
FAQ
Useful Sources

Your LLM is answering the same question five different ways - and billing you five times for it.

Traditional exact-match caching catches maybe 10–15% of chatbot traffic. Semantic caching LLM systems push that to 40–70%. But here's what most teams miss: a single global similarity threshold destroys accuracy in some categories and kills hit rates in others. The fix is category-aware caching - and it drops your break-even hit rate from 15–20% all the way to 3–5%.

This post covers the full architecture, the math behind threshold tuning, eviction formulas, real performance numbers, and a step-by-step implementation guide.

What Is Semantic Caching for LLMs - and Why Does It Matter?

Semantic caching stores LLM responses by meaning, not by exact text - so "How do I reset my password?" and "I forgot my login credentials" return the same cached answer.

Traditional caching uses exact string matching. It only fires when a query is byte-for-byte identical to a previous one. In natural language, that almost never happens. Hit rates hover around 10–15% for typical chatbots.

Semantic caching changes the equation entirely. It converts each query into a vector embedding - a high-dimensional numerical representation of meaning - then runs an approximate nearest-neighbor (ANN) search against cached embeddings. If cosine similarity exceeds a threshold (commonly 0.85–0.95), the system returns the cached response instantly. No LLM call. No tokens consumed.

The result: 40–70% hit rates for typical chatbot workloads. Cached queries return in ~27ms versus 6,800ms for full inference - a 250× speed-up.

At 10,000 queries per day using Claude Sonnet, a 60% hit rate saves roughly $8,256 per year in API costs alone.

How the 4-Step Lookup Works

01. Embed - The incoming query is converted to a vector (e.g., 384 or 1,536 dimensions) using a model like all-mpnet-base-v2 or OpenAI text-embedding-3-large.

02. Search - The system performs an ANN search against cached embeddings using cosine similarity.

03. Decide - If similarity ≥ threshold τ, return the cached response. Otherwise, forward to the LLM.

04. Store - On a cache miss, the LLM response is generated and stored with its embedding for future reuse.

This is fast. But it's not enough. A single global threshold fails heterogeneous workloads. That's where category-aware caching comes in.

What "Category-Aware" Actually Means

Category-aware caching applies different similarity thresholds, TTLs, and eviction quotas to different query types - because code queries and conversational queries live in completely different embedding spaces.

A uniform threshold of 0.80 causes 15% false matches in dense code embeddings. The same threshold misses valid paraphrases in sparse conversational embeddings. A fixed 1-hour TTL wastes memory on code patterns that stay stable for months, and serves stale stock prices that change every second.

The core insight from Wang et al. (arXiv:2510.26835, IBM Research / Tencent, October 2025): production LLM workloads exhibit long-tail cache hit rate distributions. Two or three head categories - code generation, API documentation - achieve 40–60% hit rates and account for 60–70% of traffic. Five to ten tail categories - conversational chat, financial data, legal, medical - achieve only 5–15% hit rates but represent 30–40% of traffic.

Traditional vector databases require 15–20% hit rates to break even on their 30ms remote search cost. They exclude the entire long tail. Category-aware hybrid architecture drops that break-even to 3–5% - making every category economically viable.

Category	Traffic	Hit Rate	Vector DB Viable?	Hybrid Viable?
Code generation	35%	55%	✅ Yes	✅ Yes
API documentation	25%	45%	✅ Yes	✅ Yes
Conversational chat	15%	12%	❌ No	✅ Yes
Financial data	10%	8%	❌ No	✅ Yes
Legal queries	8%	10%	❌ No	✅ Yes
Medical queries	4%	6%	❌ No	✅ Yes

The Four Category Properties That Drive Policy Decisions

Each query category has four measurable properties that determine the right caching policy.

01. Embedding Space Density

Code queries use constrained vocabulary - keywords, API names, syntax elements. They cluster densely. The 10th nearest neighbor in code embedding space sits at distance ≈ 0.12.

Conversational queries use varied phrasings. They distribute sparsely. The 10th nearest neighbor sits at distance ≈ 0.38.

Dense spaces need tight thresholds to avoid false positives. Sparse spaces need loose thresholds to capture valid paraphrases.

02. Query Repetition Patterns

Code and documentation queries follow a Zipfian power-law distribution (α ≈ 1.2). The top 10% of queries account for 45% of traffic. This justifies large cache quotas and long TTLs.

Conversational queries distribute uniformly with minimal repetition. Lower expected hit rates warrant smaller allocations and looser thresholds to capture semantic variations.

03. Content Staleness Rates

Code patterns: Change at ~0.01%/day → TTL of 7 days or more
Technical documentation: Changes at ~2%/day → TTL of hours to 1 day
Stock prices / news: Change at ~80%/hour → TTL of 5 minutes

A fixed TTL either wastes memory (code) or serves stale data (financial). Per-category TTLs eliminate both failure modes.

04. Computational Cost of the Model

Reasoning models (o1, GPT-4o) cost significantly more per call than smaller models (Claude 3.5 Haiku, Gemini 2.0 Flash). A cache hit on an expensive model produces larger savings.

The category-aware formula: allocate 40% quota to 30% of traffic when that traffic uses expensive models. Weight by economic value, not raw volume.

The Hybrid Architecture: 4-Step Pipeline

The hybrid architecture separates in-memory HNSW search from external document storage - cutting miss cost from 30ms to 2ms.

This is the architectural breakthrough that makes low-hit-rate categories viable. Here's how it works:

[Query + Category] → [In-Memory HNSW Index] → [TTL Check] → [External Store Fetch]
                              ↓ miss (2ms)
                         [LLM Forward]

Components

In-memory HNSW index - Stores only embedding vectors (~1.5KB per entry for 384 dimensions) and category metadata (threshold, TTL, priority). Total footprint: ~2KB per entry. Supports O(log n) search - 2–3ms for 1M entries.

External document store - Holds full request/response bodies and timestamps. Accessed by primary key lookup (5ms) rather than vector search (30ms). Options: Redis, SQL, S3.

Category policy engine - Manages per-category configurations. Applies thresholds during HNSW traversal. Validates TTLs before external access. Handles compliance (HIPAA/GDPR categories can set allowCaching=false).

ID mapping layer - Connects in-memory index positions to external storage identifiers via hash maps.

The 4-Step Pipeline

Step 01 - Query ingestion + categorization. The client submits a query with a category tag (via explicit routing, endpoint-based routing, or a lightweight prompt classifier). The policy engine retrieves the category config.

Step 02 - Embedding generation. The query is converted to a vector using the same embedding model across all categories (e.g., all-mpnet-base-v2 for balanced precision/recall/latency).

Step 03 - ANN/HNSW similarity search. The in-memory HNSW index runs a category-specific threshold during graph traversal. On a miss, it returns NULL immediately - no external access, no 30ms remote call. Just 2ms.

Step 04 - Cache hit/miss decision + external fetch. On a match, the system validates TTL before fetching the full document by ID from the external store (5ms). On a miss, the query goes to the LLM; the response is stored with category metadata and TTL.

Why This Changes the Economics

With a vector database, every query - hit or miss - incurs ~30ms remote search cost. Break-even requires ≥15–20% hit rate.

With the hybrid architecture, misses cost 2ms (local return). Break-even drops to:

Fast model (200ms inference): ~1% hit rate
Slow model (500ms inference): ~0.4% hit rate

That's a 15× reduction in break-even threshold. Every tail category becomes economically viable.

Similarity Threshold Tuning by Category

Dense embedding spaces need τ ≥ 0.88–0.90. Sparse spaces need τ ≤ 0.75–0.78. A single global threshold breaks both.

At threshold 0.80 in dense code embeddings, 15% of matches are false positives - sort_ascending matches sort_descending. Tightening to 0.90 drops false matches to 3%.

At threshold 0.80 in sparse conversational embeddings, valid paraphrases get missed. Loosening to 0.75 captures semantic equivalents without increasing false positives in the sparse space.

Per-Category Threshold Reference

Category	Embedding Space	Recommended τ	TTL
Code generation	Dense	0.88–0.90	7 days
API documentation	Dense	0.88–0.90	1–3 days
Customer support	Sparse	0.75–0.78	1–4 hours
Financial data	Sparse	0.75–0.78	5 minutes
Technical docs	Medium	0.82–0.86	12–24 hours
Medical / Legal	Compliance	`allowCaching=false`	N/A

AWS Benchmark: Threshold vs. Cost Savings

AWS tested multiple thresholds on real chatbot queries (Claude 3 Haiku + Titan Embeddings):

Threshold	Hit Rate	Accuracy	Cost Savings
0.99 (strict)	23.5%	92.1%	15.8%
0.95	56.0%	92.6%	51.9%
0.90	74.5%	92.3%	72.5%
0.80	87.6%	91.8%	84.6%
0.75 (loose)	90.3%	91.2%	86.3%

Moving from 0.99 to 0.75 increases cost savings by ~70 percentage points with less than 1 point of accuracy loss - for general chatbot scenarios. High-stakes domains (medical, legal, financial) require stricter thresholds.

How to Tune in Practice

Start at 0.90–0.92 as a baseline. Build a validation set with two types of pairs: queries that express the same intent, and queries that look similar but need different answers. Lower the threshold gradually. When false positives exceed 3–5%, you've hit the limit of your embedding model - threshold tuning alone won't fix it. Switch to a domain-specific embedding model.

Cache Eviction Policies: LRU, LFU, TTL, and the Category-Aware Formula

Standard LRU doesn't account for economic value. Category-aware eviction does.

Traditional Policies

LRU (Least Recently Used) - Evicts the item not accessed for the longest time. Good for general-purpose caching. Doesn't weight by model cost or category priority.

LFU (Least Frequently Used) - Retains popular content. High tracking overhead. Works well for stable hot keys.

TTL (Time-to-Live) - Expires entries after a fixed duration. Essential for volatile categories. No usage tracking needed.

FIFO (First In, First Out) - Simple, predictable. Ignores access patterns. Best for streaming workloads.

The Category-Aware Eviction Formula

When the cache is full, the hybrid architecture uses:

eviction_score = priority × (1 / age) × hitRate

Where:

priority reflects the economic value of the category (expensive models = higher priority)
age is the time since last access
hitRate is the observed hit rate for that category's entries

Entries with the lowest eviction score get removed first. This weights by economic value rather than pure recency - keeping high-value, frequently-hit entries from expensive model categories in cache longer.

Adaptive Load-Based Policy Adjustment

Under high model load, the system can dynamically relax thresholds and extend TTLs to reduce traffic to overloaded models. Theoretical projections from Wang et al. show 9–17% traffic reduction to overloaded models through threshold relaxation of 0.05.

Safety bounds prevent excessive relaxation: minimum threshold τ_min = 0.80 for dense spaces, maximum TTL = 2× baseline.

Real Performance Numbers

Cached queries run 250× faster. Costs drop 40–80%. Break-even hit rate falls from 15–20% to 3–5%.

Here's what the data actually shows:

Latency

Scenario	Without Cache	With Cache Hit	Improvement
Gemini API call	~6,800ms	~27ms	250×
AWS ElastiCache (threshold 0.80)	~4,350ms	~600ms	7.3×
Redis benchmark	~2,700ms	~300ms	9×
Tail latency (P95/P99)	27–36 seconds	Milliseconds	Dramatic

Cost Savings

At 10,000 queries/day using Claude Sonnet with a 60% hit rate:

Daily cost without cache: $41.00
Daily cost with cache: $16.40
Annual savings: $8,856
Infrastructure cost: ~$50/month

At a 36% hit rate on a larger deployment (~$39,200/year spend):

Annual savings: ~$12,000

Break-Even Analysis

Architecture	Miss Cost	Break-Even Hit Rate (fast model)
Pure vector DB	30ms	15–20%
Hybrid (in-memory HNSW)	2ms	1%

The hybrid architecture makes tail categories with 5–15% hit rates - conversational chat, financial data, legal queries - economically viable. Vector databases exclude them entirely.

Prompt Caching vs. Semantic Caching: What's the Difference?

Prompt caching is provider-side and skips input token processing. Semantic caching is application-side and skips the entire LLM call.

These are complementary, not competing - it's worth understanding the difference between prefix and semantic caching approaches before you stack them.

Feature	Prompt Caching	Semantic Caching
Location	Provider (Anthropic, OpenAI)	Application / gateway layer
Matching	Exact prefix match	Semantic similarity (vector)
What's skipped	Input token re-processing	Entire LLM inference call
Cost savings	Up to 90% on input tokens	100% of API cost on cache hits
TTL	5–10 minutes (auto-refresh)	Hours to days (configurable)
Best for	Long static system prompts, RAG context	Repetitive user queries, FAQs

Anthropic's prompt caching requires the anthropic-beta: prompt-caching-2024-07-31 header and cache_control markers. Up to 4 cache breakpoints per prompt. TTL: 5 minutes.

OpenAI's prompt caching is automatic for prompts ≥ 1,024 tokens on GPT-4o and newer. Place static content at the start of the prompt. TTL: 5–10 minutes.

The production recommendation: enable prompt caching first (it's often automatic), then layer semantic caching on top for high-frequency user queries. Combined, you can cut total costs by 80%+.

Practical Tools: GPTCache, Redis + LangChain

You don't need to build this from scratch. GPTCache and Redis + LangChain cover 90% of production use cases.

GPTCache

GPTCache (github.com/zilliztech/gptcache) is an open-source Python library that intercepts LLM queries before they reach the model. It supports:

Embedding generators: OpenAI, Cohere, Huggingface, ONNX, SentenceTransformers
Vector stores: FAISS, Milvus, Redis, Qdrant
Cache storage: SQLite, MySQL, PostgreSQL
Eviction policies: LRU, FIFO
LLM adapters: OpenAI, LangChain, LlamaIndex

Basic setup with semantic similarity:

from gptcache import Cache
from gptcache.adapter.api import init_similar_cache

init_similar_cache()  # Uses ONNX embeddings + FAISS + SQLite by default

GPTCache supports cache_context for namespace isolation - critical for multi-tenant deployments where you don't want legal queries matching customer support responses.

Redis + LangChain (`RedisSemanticCache`)

LangChain's langchain-redis package provides RedisSemanticCache - a production-ready semantic cache backed by Redis vector search:

from langchain_redis import RedisSemanticCache
from langchain_core.globals import set_llm_cache
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

semantic_cache = RedisSemanticCache(
    embeddings=embeddings,
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,  # Lower = stricter matching
    ttl=3600,                 # 1 hour TTL
    name="production-cache-v1"
)

set_llm_cache(semantic_cache)

Once set, every LangChain LLM call automatically checks the cache first. Redis handles vector search via RediSearch/RedisVL, TTL expiration natively, and scales horizontally.

Embedding model recommendations:

all-mpnet-base-v2 (768 dims) - Redis identifies this as the top overall model for semantic caching (best precision, recall, memory, latency, F1)
text-embedding-3-large (256–3072 dims) - Better retrieval quality, more forgiving of messy queries
Fine-tuned ModernBERT (LangCache-Embed) - +6% precision over text-embedding-3-large for domain-specific use cases

Challenges in Production

Semantic caching has four failure modes that will hurt you if you don't plan for them.

01. The Cold Start Problem

A fresh cache has no entries. Every query is a miss. Hit rates start at 0% and climb as the cache warms up.

Fix: Pre-populate with 10,000–50,000 common queries before launch. For code assistants, seed with the top 1,000 most-asked programming questions. For support bots, seed with your FAQ database.

02. Cache Poisoning

If an incorrect LLM response gets cached, it gets served to every semantically similar query that follows. Silent failure - no error, just wrong answers.

Fix: Never cache responses that begin with error markers. Implement a hit-count threshold before trusting a cached entry. Add a force_refresh flag for admins to invalidate specific entries. Monitor false positive rates; if they exceed 5%, tighten the threshold or improve the embedding model. (Left unchecked, false-hit rates become a hidden cost driver.)

03. Multi-Turn Conversation Handling

A single-turn cache treats "What about the timeout?" as a standalone query. In a multi-turn conversation about a specific service, that question has a completely different meaning.

Fix: Two approaches work. Context-aware embedding - embed the query plus relevant conversation history together. This ensures the vector reflects actual intent, not isolated phrasing. Trade-off: longer inputs generate more unique embeddings, reducing hit rates. Query rewriting - a lightweight model rewrites follow-up questions into self-contained queries before cache lookup. "What about the timeout?" becomes "What is the default timeout for the XYZ service?" This preserves intent without embedding long context.

04. Threshold Tuning Complexity

A single global threshold is wrong for heterogeneous workloads. But per-category tuning requires ongoing A/B testing and monitoring.

Fix: Start with category defaults (dense: 0.90, sparse: 0.75). Route 5–10% of traffic to an alternate threshold. Measure hit rate and false positive rate. Adopt the winning configuration. Automate this with a feedback loop that adjusts thresholds based on observed staleness rates and user rejection signals.

Best Practices for Production Deployment

These are the decisions that separate a working semantic cache from one that silently degrades.

Deploy at the gateway layer. Place the semantic cache between your application and the LLM API - not inside individual services. This ensures shared cache benefits across all services and consistent policy enforcement. The gateway is also where you compose semantic caching with other cache layers.

Use separate namespaces per category. Prevent cross-domain contamination. A legal query matching a customer support response is a false positive waiting to happen.

Always normalize embeddings. Set normalize_embeddings=True. Without normalization, cosine similarity scores can be negative or exceed 1.0, making threshold comparisons meaningless.

Add TTL jitter. Without random jitter on TTL values, all entries expire simultaneously - causing a cache stampede. Spread expirations over time.

Implement event-driven invalidation for critical data. When a product price updates, flush related cache entries immediately rather than waiting for TTL expiry. TTL is a safety net; event-driven invalidation is the primary mechanism for high-consistency requirements.

Monitor hit/miss rates separately from latency. A 90% hit rate can mask a 10% miss rate with catastrophic P99 latency. Track cache hits, cache misses, false positive rate, and latency differential between hits and misses.

Set fallback behavior. If the cache layer fails, requests must still reach the LLM. Never let cache infrastructure failures cascade into application downtime.

Respect compliance boundaries. Categories subject to HIPAA or GDPR should set allowCaching=false. Queries never enter the cache, creating no temporary data presence.

Step-by-Step Implementation Guide

Here's how to go from zero to production-ready category-aware semantic caching.

01 - Audit Your Workload

Before writing a line of code, analyze your query logs. Identify 3–7 distinct query categories. Measure approximate hit rates if you have exact-match caching already. Identify which categories use expensive models (o1, GPT-4o) vs. cheap models (Haiku, Flash). This audit determines your category policy table.

02 - Choose Your Stack

For most teams: Redis + LangChain (langchain-redis) for managed infrastructure, or GPTCache for maximum flexibility. Choose your embedding model based on your domain - all-mpnet-base-v2 for general use, fine-tuned ModernBERT for domain-specific precision.

03 - Define Category Policies

Create a policy table mapping each category to its threshold, TTL, quota weight, and compliance flag:

CATEGORY_POLICIES = {
    "code_generation":    {"threshold": 0.90, "ttl": 604800, "quota": 0.40},  # 7 days
    "api_documentation":  {"threshold": 0.88, "ttl": 86400,  "quota": 0.25},  # 1 day
    "customer_support":   {"threshold": 0.75, "ttl": 3600,   "quota": 0.20},  # 1 hour
    "financial_data":     {"threshold": 0.78, "ttl": 300,    "quota": 0.10},  # 5 minutes
    "medical_records":    {"threshold": None, "ttl": None,   "allow_caching": False},
}

04 - Build the Category Classifier

Use explicit routing where possible (zero latency overhead). If you need automatic classification, a lightweight classifier like a fine-tuned distilbert or a simple keyword-based router adds minimal latency (< 5ms) and avoids the complexity of embedding-based classification. (For the embedding-based alternative, see semantic routing for query classification.)

05 - Implement the Hybrid Cache Layer

Separate your in-memory HNSW index (embeddings + metadata only) from your external document store (full responses). On a cache miss, return NULL immediately from the HNSW layer - don't touch the external store. On a hit, validate TTL before fetching the document by ID.

06 - Warm the Cache

Pre-populate with your top queries per category before launch. Target 10,000–50,000 entries for the head categories. For code assistants, the top 5% of queries account for ~80% of traffic (Zipfian distribution) - seed those first.

07 - Monitor and Iterate

Track these metrics from day one:

Cache hit rate per category
False positive rate (sampled + human/LLM-judged)
Latency differential (hits vs. misses)
Cost savings per day
TTL expiration patterns (are entries expiring before being reused?)

Run A/B tests on thresholds monthly. Adjust category policies based on observed staleness rates. The LLM inference optimization gains compound over time as the cache warms.

Key Takeaways

TL;DR - The numbers that matter.

Traditional caching: 10–15% hit rate. Semantic caching LLM: 40–70% hit rate.
Speed: Cached queries return in ~27ms vs. ~6,800ms - a 250× improvement.
Cost: 40–80% reduction in LLM API costs. $8,000+/year savings at 10,000 queries/day.
Category-aware vs. uniform: A single threshold causes 15% false positives in dense code spaces AND misses valid paraphrases in sparse conversational spaces. Per-category policies fix both.
Hybrid architecture: Drops break-even hit rate from 15–20% (vector DB) to 3–5% - making every tail category economically viable.
Eviction formula: priority × (1/age) × hitRate - weights by economic value, not pure recency.
Threshold defaults: Dense spaces (code) → τ ≥ 0.88–0.90. Sparse spaces (conversation) → τ ≤ 0.75–0.78.
Prompt caching ≠ semantic caching. Prompt caching (Anthropic/OpenAI) is provider-side, skips input token processing. Semantic caching is application-side, skips the entire LLM call.
Tools: GPTCache for flexibility. Redis + LangChain (RedisSemanticCache) for production.
Top embedding model: all-mpnet-base-v2 (768 dims) - best overall precision, recall, memory, and latency for semantic cache workloads.

FAQ

What is semantic caching for LLMs?

Semantic caching for LLMs stores model responses indexed by the meaning of the query - not the exact text. When a new query arrives, the system converts it to a vector embedding and searches for semantically similar cached queries using cosine similarity. If similarity exceeds a threshold, the cached response is returned instantly without calling the LLM. This raises cache hit rates from ~10–15% (exact match) to 40–70% for typical chatbot workloads, cutting latency from seconds to milliseconds.

What is the difference between prompt caching and semantic caching?

Prompt caching (offered natively by Anthropic and OpenAI) operates at the provider level. It caches the processed KV states of identical prompt prefixes, reducing input token re-processing costs by up to 90%. The LLM still runs and generates a fresh response. Semantic caching operates at the application level. It caches complete responses and returns them for semantically similar queries - skipping the LLM call entirely. Prompt caching saves on compute; semantic caching eliminates the API call. Use both together for maximum cost reduction.

What similarity threshold should I use for semantic caching?

It depends on your query category. For dense embedding spaces like code generation, use τ ≥ 0.88–0.90 to avoid false positives (e.g., sort_ascending matching sort_descending). For sparse spaces like conversational chat, use τ ≤ 0.75–0.78 to capture valid paraphrases. A global starting point of 0.90–0.92 works for mixed workloads. When false positives exceed 3–5%, threshold tuning alone won't fix it - you need a better embedding model.

What is GPTCache and how does it implement semantic caching?

GPTCache is an open-source Python library (github.com/zilliztech/gptcache) that intercepts LLM queries before they reach the model. It converts queries to embeddings, searches a vector store (FAISS, Milvus, Redis) for similar past queries, and returns cached responses on hits. It supports multiple embedding models (OpenAI, Huggingface, ONNX), cache storage backends (SQLite, MySQL, PostgreSQL), and LLM adapters (OpenAI, LangChain). It integrates directly into LangChain via GPTCacheCache.

How does category-aware caching reduce LLM API costs?

Category-aware caching applies different similarity thresholds, TTLs, and eviction quotas to different query types. This matters because a single global policy either causes false positives in dense embedding spaces (code) or misses valid matches in sparse spaces (conversation). By tuning per category, you maximize hit rates where they're achievable and maintain accuracy where it's critical. The hybrid architecture (in-memory HNSW + external store) drops the break-even hit rate from 15–20% to 3–5%, making even low-hit-rate categories like financial data and customer support economically viable to cache.

What are the main challenges with semantic caching in production?

Four challenges dominate: (1) Cold start - the cache is empty at launch; pre-populate with common queries. (2) Cache poisoning - incorrect responses get cached and served to similar queries; never cache error responses, monitor false positive rates. (3) Multi-turn conversations - single-query caching ignores context; use query rewriting or context-aware embeddings. (4) Threshold tuning - a single global threshold fails heterogeneous workloads; implement per-category thresholds with A/B testing and automated feedback loops.

What is the best embedding model for semantic caching?

For general-purpose semantic caching, Redis research identifies all-mpnet-base-v2 (768 dimensions) as the top overall model - best balance of precision, recall, memory usage, latency, and F1 score. For higher retrieval quality, OpenAI's text-embedding-3-large (configurable 256–3072 dims) outperforms text-embedding-ada-002. For domain-specific workloads (medical, legal, customer support), fine-tuned ModernBERT variants (e.g., LangCache-Embed) achieve +6% precision over text-embedding-3-large.

Useful Sources

Wang, C. et al. (2025). Category-Aware Semantic Caching for Heterogeneous LLM Workloads. arXiv:2510.26835. https://arxiv.org/abs/2510.26835
Redis. What is Semantic Caching? Guide to Faster, Smarter LLM Apps. https://redis.io/blog/what-is-semantic-caching/
Percona. Semantic Caching for LLM Apps: Reduce Costs by 40–80% and Speed Up by 250×. https://www.percona.com/blog/semantic-caching-for-llm-apps-reduce-costs-by-40-80-and-speed-up-by-250x/
AWS. Semantic Caching Benchmarks - Amazon ElastiCache. https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/semantic-caching-benchmarks.html
Portkey. Semantic Caching Thresholds and Why They Matter. https://portkey.ai/blog/semantic-caching-thresholds
Zilliz. GPTCache GitHub Repository. https://github.com/zilliztech/gptcache
LangChain. RedisSemanticCache Reference. https://reference.langchain.com/python/langchain-redis/cache/RedisSemanticCache
Redis. What's the Best Embedding Model for Semantic Caching? https://redis.io/blog/whats-the-best-embedding-model-for-semantic-caching/
Microsoft Research. Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation. https://www.microsoft.com/en-us/research/publication/semantic-caching-for-low-cost-llm-serving-from-offline-learning-to-online-adaptation/

Keep reading

llmcachingprefix caching

Prefix Caching vs Semantic Caching: Which Fits Your App?

Prefix caching and semantic caching both cut LLM costs and latency - but they work at completely different layers. Here's how to choose, and when to run both.

MKMohammed Kafeel

13 min read

llmcachingarchitecture

Multi-Tier LLM Cache: Semantic, Prefix & Inference Layers

A multi-tier LLM cache stacks semantic, prefix, and inference caching layers to slash costs by up to 80% and cut latency by 78%. Here's exactly how each layer works and when to use them.