Category-Aware Semantic Caching for LLM Workloads
How to partition your semantic cache by query category so similar-but-different intents don't collide, and why heterogeneous workloads break naive semantic caching.
Mohammed Kafeel
Machine Learning Researcher
Quick answer: A single global semantic cache with one similarity threshold breaks down when your LLM app serves heterogeneous workloads — FAQs, code generation, factual lookups, personalized queries, all mixed together. Different query categories need different similarity thresholds, freshness rules, and cacheability policies, and a one-size-fits-all threshold is simultaneously too loose for precision-critical queries (causing wrong answers) and too strict for paraphrase-heavy ones (killing the hit rate). Category-aware semantic caching fixes this by classifying each query into a category first, then routing it to a category-specific cache namespace with its own threshold, TTL, embedding strategy, and cache/no-cache policy. The result is a higher hit rate and fewer false hits at the same time — because each category is tuned for its own risk and reuse profile.
The problem: one threshold can't fit every query
Semantic caching matches an incoming query against stored queries by embedding similarity, returning a cached answer when similarity exceeds a threshold. That single threshold is the whole ballgame — and on a heterogeneous workload, no single value is correct.
Consider four query types flowing through the same bot:
| Query type | Example | What it needs from the cache |
|---|---|---|
| FAQ | "how do I reset my password" | Loose threshold — many paraphrases, same answer |
| Code generation | "write a function to merge two sorted lists" | Strict threshold — tiny wording changes alter intent |
| Factual lookup | "what's the capital of France" | Medium threshold, stable answer, long TTL |
| Personalized | "what's the status of my order" | Must NOT be cached at all |
A global threshold of, say, 0.95 is:
- Too loose for code generation: "merge two sorted lists" and "merge two unsorted lists" differ by a single word and can score above 0.95, yet they need different answers. You return a confidently wrong result.
- Too strict for FAQs: "how do I reset my password" and "I forgot my password, help" may fall below 0.95 yet deserve the same answer. You miss a hit and pay for a needless generation.
- Dangerous for personalized queries: any threshold risks serving one user's data to another.
The core failure: heterogeneous workloads have heterogeneous caching requirements, but a global cache applies homogeneous rules. Category-aware caching restores the match between requirement and rule.
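To make the conflict concrete, here is a minimal sketch. The similarity scores are invented for illustration and are not measurements from any embedding model; the only point is that when a should-reuse pair scores lower than a must-not-reuse pair, no global threshold classifies both correctly.

```python
# Hypothetical (category, similarity, should_reuse) triples.
# The scores are illustrative assumptions, not measured values.
PAIRS = [
    ("faq", 0.91, True),    # "reset my password" vs "I forgot my password, help"
    ("code", 0.97, False),  # "merge two sorted lists" vs "merge two unsorted lists"
]


def errors_at(threshold):
    """Return the categories that a single global threshold gets wrong."""
    wrong = []
    for category, similarity, should_reuse in PAIRS:
        served_from_cache = similarity >= threshold
        if served_from_cache != should_reuse:
            wrong.append(category)
    return wrong


def find_safe_global_threshold():
    """Scan 0.00..1.00 in steps of 0.01; return an error-free threshold or None."""
    for i in range(101):
        threshold = i / 100
        if not errors_at(threshold):
            return threshold
    return None


print(errors_at(0.90))                # the code pair is a false hit
print(errors_at(0.99))                # the FAQ pair is a missed hit
print(find_safe_global_threshold())   # no threshold satisfies both pairs
```

Under these assumed scores, a loose threshold produces a false hit on the code pair, a strict one misses the FAQ pair, and the scan finds no value that avoids both.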
The idea: classify first, then cache per category
Category-aware semantic caching inserts a lightweight classification step before the cache lookup, then routes the query to a per-category cache configuration. Each category gets its own:
- Similarity threshold — tuned to how much paraphrase tolerance is safe for that category.
- TTL / freshness policy — how long an answer stays valid.
- Cacheability policy — some categories (personalized, real-time) are never cached.
- Namespace — a separate partition of the vector store, so categories don't cross-match.
- (Optionally) embedding strategy — a domain-appropriate embedding model or normalization per category.
```text
Incoming query
       │
       ▼
┌─────────────┐
│ Classifier  │ ── category ──┐
└─────────────┘               │
                              ▼
┌──────────────────────────────────────────────────┐
│       Route to category config & namespace       │
├────────────────┬────────────────┬────────────────┤
│ FAQ            │ Code-gen       │ Personalized   │
│ thr 0.90       │ thr 0.99       │ NO CACHE       │
│ TTL 7d         │ TTL 1d         │ → live model   │
│ namespace:faq  │ namespace:code │ (bypass)       │
└────────────────┴────────────────┴────────────────┘
        │                │
        ▼                ▼
per-namespace semantic lookup → hit / miss → live model
```
The classifier is the new component; everything downstream is a normal semantic cache, just partitioned and parameterized per category.
Why per-category namespaces matter
Routing to separate namespaces isn't just bookkeeping — it prevents cross-category false hits. In a single shared vector space, a code-generation query could match a FAQ entry that happens to be embedding-close, returning nonsense. Partitioning by category guarantees a query only matches within its own kind, which:
- Eliminates cross-category collisions — a code query never matches a billing FAQ.
- Shrinks each search space — smaller, homogeneous indexes are faster and more precise.
- Enables per-category invalidation: flush the `faq` namespace when help docs change without touching `code`.
- Allows per-category embeddings: use a code-tuned embedding for the code namespace and a general one for FAQs.
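A toy in-memory version shows how partitioning behaves. This is a sketch under stated assumptions: `NamespacedCache`, its method names, and the two-dimensional embeddings are hypothetical stand-ins, and a production system would use its vector store's own namespace or index-partitioning feature instead.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NamespacedCache:
    """Toy semantic cache with one partition per category (hypothetical class)."""

    def __init__(self):
        # namespace -> list of (embedding, answer)
        self._partitions = defaultdict(list)

    def add(self, embedding, answer, namespace):
        self._partitions[namespace].append((embedding, answer))

    def search(self, embedding, namespace, threshold):
        """Best answer in this namespace scoring at or above threshold, else None."""
        best_score, best_answer = -1.0, None
        for stored, answer in self._partitions.get(namespace, []):
            score = cosine(embedding, stored)
            if score >= threshold and score > best_score:
                best_score, best_answer = score, answer
        return best_answer

    def flush(self, namespace):
        """Invalidate one category without touching the others."""
        self._partitions.pop(namespace, None)


cache = NamespacedCache()
cache.add([1.0, 0.0], "billing FAQ answer", namespace="faq")

# An identical embedding never matches across namespaces.
print(cache.search([1.0, 0.0], namespace="code", threshold=0.90))
print(cache.search([1.0, 0.0], namespace="faq", threshold=0.90))
```

Because the lookup only scans the requested partition, a code query cannot return the FAQ entry even when the embeddings are identical, and `flush("faq")` leaves every other namespace intact.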
Designing per-category policies
The heart of the system is the policy table. Each category's parameters reflect its reuse profile (how often near-duplicates recur) and its risk profile (how bad a false hit is).
| Category | Threshold | TTL | Cache? | Rationale |
|---|---|---|---|---|
| FAQ / how-to | Loose (~0.90) | Days | Yes | Heavy paraphrasing, stable answers, low risk |
| Factual lookup | Medium (~0.94) | Days–weeks | Yes | Stable facts, but distinct facts must not collide |
| Code generation | Strict (~0.99) | Hours | Yes | Small wording changes flip intent; high false-hit cost |
| Definitional | Medium (~0.95) | Weeks | Yes | Stable, paraphrase-tolerant |
| Real-time data | — | — | No | Answer changes constantly; caching guarantees staleness |
| Personalized | — | — | No | Cross-user leakage risk; must hit live path |
| High-stakes (legal/medical) | Very strict / No | Short | Review-gated | A near-miss is dangerous |
Two design principles drive these values:
- Threshold scales with false-hit cost. The worse a wrong answer is, the stricter (higher) the threshold, up to "never cache."
- TTL scales inversely with answer volatility. The faster the truth changes, the shorter the TTL, down to "never cache."
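The policy table can be expressed as plain configuration. The sketch below is one possible encoding, not a prescribed schema: the `CategoryConfig` fields, the category keys, and the exact TTL values (the table only says "Days", "Hours", and so on) are assumptions chosen for illustration. High-stakes queries are modeled as uncacheable here, which is the stricter of the two options the table allows.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class CategoryConfig:
    namespace: str
    cacheable: bool
    threshold: Optional[float] = None  # similarity cutoff; None when not cacheable
    ttl: Optional[timedelta] = None    # freshness window; None when not cacheable

    def __post_init__(self):
        # A cacheable category without a threshold or TTL is a config error.
        if self.cacheable and (self.threshold is None or self.ttl is None):
            raise ValueError(f"{self.namespace}: cacheable categories need a threshold and a TTL")


# Thresholds follow the table above; concrete TTLs are illustrative assumptions.
CATEGORY_CONFIG = {
    "faq": CategoryConfig("faq", True, 0.90, timedelta(days=7)),
    "factual": CategoryConfig("factual", True, 0.94, timedelta(days=14)),
    "code": CategoryConfig("code", True, 0.99, timedelta(hours=12)),
    "definitional": CategoryConfig("definitional", True, 0.95, timedelta(weeks=4)),
    "realtime": CategoryConfig("realtime", False),
    "personalized": CategoryConfig("personalized", False),
    "high_stakes": CategoryConfig("high_stakes", False),
}
```

Keeping the table as data rather than scattered `if` statements makes each category's risk decision reviewable in one place.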
Building it: step by step
Step 1 — Build the classifier
The classifier maps a query to a category. Options, cheapest first:
- Rules / keyword routing — fast and free for obvious signals ("write a function" → code; "my order" → personalized). Brittle alone.
- Embedding + centroid — embed the query and assign to the nearest category centroid. Cheap, no extra LLM call.
- Small classifier model — a fine-tuned small model or a fast LLM call for ambiguous queries.
Keep it cheap: the classifier runs on every request, so its cost and latency must be far below the model call it's protecting. A common pattern is rules-first, embedding-centroid fallback, with a small-model tiebreaker only for the ambiguous remainder.
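```python
def classify(query, emb):
    # Cheap fast path: keyword/rule routing (evaluated once).
    rule_category = rule_match(query)
    if rule_category:
        return rule_category
    # Fallback: nearest category centroid on the query embedding.
    cat, confidence = nearest_centroid(emb, category_centroids)
    if confidence < 0.6:  # ambiguous → escalate
        return small_model_classify(query)
    return cat
```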
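The two cheap helpers used by `classify` could look like the sketch below. Everything here is an assumption for illustration: the regex rules, the choice to define confidence as the cosine similarity to the winning centroid (a margin over the runner-up is another common option), and the tiny two-dimensional centroids.

```python
import math
import re

# Hypothetical keyword rules; first match wins, so sensitive categories come first.
RULES = [
    (re.compile(r"\bmy (order|account|invoice)\b", re.I), "personalized"),
    (re.compile(r"\bwrite (a|an|the) (function|script|class)\b", re.I), "code"),
]


def rule_match(query):
    """Return a category when an obvious keyword signal is present, else None."""
    for pattern, category in RULES:
        if pattern.search(query):
            return category
    return None


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_centroid(emb, centroids):
    """Return (category, confidence), with confidence = similarity to the best centroid."""
    best = max(centroids, key=lambda name: cosine(emb, centroids[name]))
    return best, cosine(emb, centroids[best])


# Toy centroids; real ones would be the mean embedding of labeled queries per category.
category_centroids = {"faq": [1.0, 0.0], "code": [0.0, 1.0]}

print(rule_match("what's the status of my order"))
print(rule_match("how do I reset my password"))
print(nearest_centroid([1.0, 0.1], category_centroids))
```

The rule list is deliberately short: rules catch the unambiguous cases, and anything they miss falls through to the centroid step.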
Step 2 — Route to the per-category config
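```python
# Sketch: embed, classify, live_model_call, vector_store, record_hit and
# CATEGORY_CONFIG are assumed to be defined elsewhere in the application.
def handle(query):
    emb = embed(query)
    category = classify(query, emb)
    cfg = CATEGORY_CONFIG[category]

    if not cfg.cacheable:  # personalized / real-time / high-stakes
        return live_model_call(query)

    # Search only this category's namespace, with its own threshold.
    hit = vector_store.search(
        emb, namespace=cfg.namespace, threshold=cfg.threshold,
    )
    if hit and not hit.is_stale():
        record_hit(category)
        return hit.answer

    # Miss or stale entry: generate, then store under the category's TTL.
    answer = live_model_call(query)
    vector_store.upsert(emb, answer, namespace=cfg.namespace, ttl=cfg.ttl)
    return answer
```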
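The handler relies on `hit.is_stale()` to enforce the per-category TTL. One way a cache entry could implement that check is sketched below; the `CacheEntry` class and its field names are hypothetical, and many vector stores expire entries natively, in which case this check is unnecessary.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    """Hypothetical cached answer that remembers when it was written."""
    answer: str
    ttl_seconds: float
    created_at: float = field(default_factory=time.time)

    def is_stale(self, now=None):
        """True once the entry is older than its TTL. `now` is injectable for tests."""
        now = time.time() if now is None else now
        return now - self.created_at > self.ttl_seconds


entry = CacheEntry("Use the reset link on the login page.", ttl_seconds=60, created_at=1000.0)
print(entry.is_stale(now=1030.0))  # 30 s old, within the 60 s TTL
print(entry.is_stale(now=1061.0))  # 61 s old, past the TTL
```

Checking staleness at read time keeps the TTL a per-entry property, so changing a category's TTL only affects entries written afterwards.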
Step 3 — Tune each category independently
Because categories are isolated, you can tune one without regressing the others. For each category, track hit rate and false-hit rate (sample cached answers and judge them) and move the threshold along that trade-off curve. A loose-threshold category like FAQ targets high hit rate; a strict-threshold category like code targets near-zero false hits.
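A threshold sweep over judged samples is one way to walk that trade-off curve. This is a minimal sketch: the sample format (similarity of the best cached match plus a human or LLM judgment of whether reusing it would have been correct) and the function names are assumptions, and the sample values are made up.

```python
def sweep(samples, thresholds):
    """samples: [(similarity, reuse_was_correct)] for ONE category.
    Returns [(threshold, hit_rate, false_hit_rate)]."""
    rows = []
    for t in thresholds:
        hits = [ok for sim, ok in samples if sim >= t]
        hit_rate = len(hits) / len(samples) if samples else 0.0
        false_hit_rate = hits.count(False) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


def pick_threshold(samples, thresholds, max_false_hit_rate):
    """Lowest threshold (so highest hit rate) whose false-hit rate fits the budget."""
    for t, _hit_rate, false_hit_rate in sweep(samples, sorted(thresholds)):
        if false_hit_rate <= max_false_hit_rate:
            return t
    return None


# Illustrative judged samples for a single category.
samples = [(0.99, True), (0.96, True), (0.93, False), (0.91, True)]
candidates = [0.90, 0.95, 0.98]

print(pick_threshold(samples, candidates, max_false_hit_rate=0.0))   # strict budget
print(pick_threshold(samples, candidates, max_false_hit_rate=0.30))  # loose budget
```

A code-like category would run this with a near-zero false-hit budget, while an FAQ-like category can accept a larger budget in exchange for a higher hit rate.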
Step 4 — Monitor and invalidate per category
- Per-category metrics: hit rate, false-hit rate, latency saved, cost saved.
- Per-category invalidation: flush only the affected namespace when its source data changes.
- Misclassification monitoring: a sudden hit-rate or false-hit shift in one category often signals the classifier drifting, not the cache.
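Per-category counters are enough to get started. The sketch below is an assumption about how such tracking might be structured, not a specific monitoring library: `CategoryMetrics`, `shifted`, and the 0.10 tolerance are all hypothetical, and false hits are assumed to come from reviewing a sample of served cache hits.

```python
from collections import defaultdict


class CategoryMetrics:
    """Per-category counters; false hits come from reviewing a sample of served hits."""

    def __init__(self):
        self.lookups = defaultdict(int)
        self.hits = defaultdict(int)
        self.reviewed = defaultdict(int)
        self.false_hits = defaultdict(int)

    def record_lookup(self, category, hit):
        self.lookups[category] += 1
        if hit:
            self.hits[category] += 1

    def record_review(self, category, was_correct):
        self.reviewed[category] += 1
        if not was_correct:
            self.false_hits[category] += 1

    def hit_rate(self, category):
        n = self.lookups[category]
        return self.hits[category] / n if n else 0.0

    def false_hit_rate(self, category):
        n = self.reviewed[category]
        return self.false_hits[category] / n if n else 0.0


def shifted(baseline, current, tolerance=0.10):
    """Flag a metric that moved more than `tolerance` (absolute) from its baseline."""
    return abs(current - baseline) > tolerance


metrics = CategoryMetrics()
for was_hit in (True, True, False, True):
    metrics.record_lookup("faq", was_hit)
print(metrics.hit_rate("faq"))
print(shifted(baseline=0.75, current=0.40))  # a drop this large deserves a classifier check
```

Comparing each category against its own baseline, rather than a global average, is what lets a shift in one category stand out.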
The classifier is now a failure point — handle it
Category-aware caching adds power but also a new dependency: a misclassification routes a query to the wrong policy. Send a code query to the FAQ namespace (loose threshold) and you risk a false hit; send a personalized query to a cacheable category and you risk a data leak. Mitigations:
| Risk | Mitigation |
|---|---|
| Personalized query misrouted to a cacheable category | Hard rules for personalization (account/order/“my”) override the classifier; default-deny on uncertainty |
| Ambiguous query misclassified | Confidence threshold → escalate to a stronger classifier or default to a strict/no-cache policy |
| Classifier drift over time | Monitor per-category false-hit rate; retrain/recalibrate on drift |
| Classifier latency too high | Keep it cheap (rules + centroid); reserve LLM classification for the ambiguous tail |
Safety default: when classification is uncertain, fall back to the most conservative applicable policy (strict threshold or no cache). Never let uncertainty resolve toward looser caching.
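The override and default-deny rules can sit in a thin wrapper around whatever classifier is in use. This is a sketch under assumptions: the signal regex, the `no_cache` sentinel, and the function name are hypothetical, and the 0.6 cutoff simply mirrors the classifier example earlier in this article.

```python
import re

# Hypothetical personalization signals (account / order / "my"), per the table above.
PERSONAL_SIGNALS = re.compile(r"\b(my|account|order)\b", re.I)
NO_CACHE = "no_cache"  # sentinel category that always bypasses the cache


def resolve_category(query, predicted, confidence, min_confidence=0.6):
    """Apply hard overrides first, then default-deny when the classifier is unsure."""
    if PERSONAL_SIGNALS.search(query):
        return "personalized"  # the hard rule wins regardless of the prediction
    if confidence < min_confidence:
        return NO_CACHE        # uncertainty never resolves toward looser caching
    return predicted


print(resolve_category("what's the status of my order", "faq", 0.95))
print(resolve_category("merge two lists", "code", 0.40))
print(resolve_category("what is the capital of France", "factual", 0.90))
```

A rule this broad will also bypass the cache for harmless queries such as "how do I reset my password". That costs some hit rate, which is the intended direction of error: a missed hit is cheap, a cross-user leak is not.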
When category-aware caching is worth it
Worth it when…
- Your workload is genuinely heterogeneous — multiple distinct query types with different risk/reuse profiles.
- A global threshold visibly underperforms — you're stuck choosing between false hits and a low hit rate.
- Some categories are strictly uncacheable (personalized, real-time) and need reliable bypass.
- You operate at enough scale that the tuning and classifier overhead pay for themselves.
Overkill when…
- Your workload is homogeneous — one query type, where a single tuned threshold suffices.
- Volume is low — the classifier complexity isn't justified by the savings.
- You can't yet measure per-category false-hit rates — without that, you can't tune categories safely, so start with a single conservative cache.
Frequently asked questions
What is category-aware semantic caching? It's a semantic caching design that classifies each incoming query into a category before the cache lookup, then routes it to a per-category configuration — its own similarity threshold, TTL, namespace, and cache/no-cache policy. This lets each query type be tuned for its own reuse and risk profile, instead of forcing every query through one global threshold that can't fit them all.
Why isn't a single similarity threshold enough? Because heterogeneous workloads have conflicting needs. A threshold loose enough to catch FAQ paraphrases ("forgot my password" ≈ "reset password") is too loose for code generation, where "merge two sorted lists" and "merge two unsorted lists" are highly similar but need different answers — producing false hits. And no threshold is safe for personalized queries, which must never be cached. One value can't simultaneously be loose, strict, and disabled.
How does the system decide a query's category? A classifier runs before the cache lookup. It typically uses cheap signals first — keyword/rule routing and nearest-centroid assignment on the query embedding — and escalates only ambiguous queries to a small classifier model. The classifier must be far cheaper and faster than the model call it protects, since it runs on every request.
What happens if a query is misclassified? Misclassification routes the query to the wrong policy, which can cause a false hit or, worse, route a personalized query to a cacheable category. Mitigate this with hard override rules for sensitive categories (personalization signals always bypass the cache), a confidence threshold that escalates ambiguous queries, and a safety default that resolves uncertainty toward the most conservative policy (strict threshold or no cache). Monitor per-category false-hit rates to catch classifier drift.
Why use separate namespaces per category? Separate namespaces prevent cross-category false hits — a code query can't match a billing FAQ that happens to be embedding-close. They also shrink each search space (faster, more precise lookups), enable per-category invalidation (flush FAQs without touching code), and allow per-category embedding models. Partitioning is what makes per-category tuning actually isolated.
When is category-aware caching overkill? When your workload is homogeneous — a single query type served well by one tuned threshold — or when volume is too low to justify the classifier and per-category tuning overhead. It's also premature if you can't yet measure per-category false-hit rates, since you'd be tuning blind. In those cases, start with a single conservative semantic cache and add category-awareness once heterogeneity and scale demand it.
Key takeaways
- A single global threshold can't serve heterogeneous workloads — it's simultaneously too loose for precision-critical queries and too strict for paraphrase-heavy ones.
- Category-aware caching classifies first, then routes each query to a per-category threshold, TTL, namespace, and cache/no-cache policy.
- Threshold rises with false-hit cost and TTL shrinks as answer volatility rises, up to "never cache" for personalized, real-time, and high-stakes categories.
- Per-category namespaces prevent cross-category false hits, shrink search spaces, and enable isolated invalidation and tuning.
- The classifier is a new failure point — keep it cheap, use hard overrides for sensitive categories, and default to the most conservative policy on uncertainty.
- It's worth it for genuinely heterogeneous, high-volume workloads and overkill for homogeneous or low-volume ones — start simple and add it when measurement shows a global threshold failing.