LLM Cache Pre-Warming for Off-Peak Customer Service Bots
A case study on warming LLM caches with predictable queries overnight so support bots hit cache on the first message of the day instead of paying full inference cost.
Mohammed Kafeel
Machine Learning Researcher
Quick answer: Customer service traffic is bursty — quiet overnight, slammed during business hours. You can exploit that by pre-populating your semantic cache during off-peak hours: batch-generate answers to anticipated and historically-common questions when the bot is idle, store them in the cache, and let peak-hour traffic hit a warm cache instead of the model. Because off-peak generation can use batch APIs at ~50% discount, you produce the answers cheaply and shift load off the expensive peak window. The result for a typical support bot: a high peak-hour cache-hit rate, lower per-conversation cost, and faster responses exactly when volume is highest. The technique applies to answer-level (semantic) caches — short-lived prefix caches expire in minutes and can't be pre-warmed hours ahead.
The scenario
Consider a SaaS company — call it NorthDesk — running an LLM customer-service bot. Its traffic looks like almost every support workload:
- Peak (9am–6pm): ~80% of daily volume, latency-sensitive, expensive at full per-call pricing.
- Off-peak (nights/weekends): the GPUs and API budget sit nearly idle.
- Question distribution: highly repetitive — the top 200 questions ("how do I reset my password", "how do I change my plan", "where's my invoice") account for the majority of conversations.
NorthDesk's problem: at peak, every repeated question still triggered a full model call, driving up both cost and tail latency right when the system was most loaded. The questions were predictable, but the answers were being regenerated from scratch, live, thousands of times a day.
The insight: if the questions are predictable and the answers are stable, why generate them during the expensive, latency-critical peak at all? Generate them ahead of time, when capacity is free.
Why off-peak pre-warming works
Three properties of support workloads make this strategy pay off:
- Repetition. A small set of questions dominates volume, so a modest pre-computed cache covers a large share of traffic.
- Answer stability. The correct answer to "how do I reset my password" doesn't change minute to minute, so a pre-computed answer stays valid for hours or days.
- Predictable bursts. Traffic is concentrated in known windows, so there's a reliable idle period to do the work and a reliable peak to reap the benefit.
When all three hold, pre-warming converts expensive, latency-critical, on-peak generation into cheap, relaxed, off-peak batch generation.
Why this is semantic caching, not prefix caching
A crucial design point: off-peak pre-warming targets the semantic (answer) cache, not the prefix cache. Prefix caches (Anthropic 5-min/1-hour TTL, OpenAI's inactivity-based cache, vLLM KV blocks) live for minutes — they cannot survive the hours between "off-peak" and "peak." What can persist is a semantic cache of complete answers, stored in your own vector store with a TTL you control. Pre-warming fills that store ahead of demand. (Prefix caching still helps the live misses during peak — see the multi-tier cache post.)
The architecture
── OFF-PEAK (idle window) ───────────────────────────────
Question sources Batch generation Semantic cache
┌────────────────┐ build ┌──────────────┐ store ┌──────────────┐
│ • Top historical │ ───────▶ │ Batch API call │ ──────▶ │ vector store │
│ questions │ question │ (~50% cheaper)│ Q→A │ (embeddings + │
│ • Anticipated/ │ list │ off-peak │ pairs │ answers, TTL)│
│ seasonal Qs │ └──────────────┘ └──────────────┘
└────────────────┘
── PEAK (live traffic) ──────────────────────────────────
User question ─▶ embed ─▶ semantic cache lookup
│ hit ─▶ return warm answer (cheap, instant)
│ miss ─▶ live model call (prefix-cached)
└─▶ write answer back to cache
The off-peak job and the live path share the same semantic cache. Off-peak fills it proactively; the live path fills any gaps reactively and serves hits.
Step-by-step: how NorthDesk built it
Step 1 — Mine the question list
Pre-warming is only as good as the questions you anticipate. NorthDesk assembled its list from:
- Historical logs: the top N questions by frequency from the last 30–90 days (clustered semantically so near-duplicates collapse into one canonical question).
- Anticipated events: questions expected to spike from known triggers — a pricing change, a product launch, a billing cycle date, a seasonal event.
- Editorial seeds: support leads add known-important questions even if not yet high-volume.
Step 2 — Generate answers off-peak with the batch API
During the idle window, a scheduled job sends the question list through a batch API — both OpenAI's Batch API and Anthropic's Message Batches offer roughly 50% off standard pricing in exchange for asynchronous (within-24h) completion, which is perfect for non-urgent off-peak work.
# Off-peak scheduled job (conceptual)
questions = build_question_list() # historical + anticipated + seeds
# Submit as a batch (≈50% cheaper, async — fine overnight)
batch = client.batches.create(
requests=[make_request(q) for q in questions],
)
results = wait_for_batch(batch) # completes within the idle window
for q, answer in results:
emb = embed(q)
semantic_cache.upsert(
embedding=emb,
answer=answer,
ttl=ttl_for(q), # e.g., 24h for stable FAQs
metadata={"source": "prewarm", "generated_at": now()},
)
Step 3 — Serve from the warm cache at peak
The live path checks the semantic cache first; warm entries are served instantly, misses fall through to a live (prefix-cached) model call and are written back:
def answer(query):
if is_personalized(query) or is_time_sensitive(query):
return live_model_call(query) # never serve these from cache
emb = embed(query)
hit = semantic_cache.search(emb, threshold=0.96)
if hit and not hit.is_stale():
return hit.answer # warm hit — cheap, instant
ans = live_model_call(query) # miss — generate + backfill
semantic_cache.upsert(emb, ans, ttl=ttl_for(query))
return ans
Step 4 — Refresh on a schedule and on change
Pre-warmed answers must not go stale. NorthDesk:
- Re-ran the batch nightly to refresh TTLs and pick up new top questions.
- Invalidated on source change — when a help-center article or policy changed, the dependent cache entries were evicted immediately, not left to expire.
- Versioned answers so a model or prompt change could invalidate the whole pre-warmed set at once.
The results (illustrative)
For a workload like NorthDesk's, the pattern produces gains along three axes:
| Metric | Before (live-only) | After (off-peak pre-warm + live) |
|---|---|---|
| Peak-hour cache-hit rate | ~0% (cold) | High — pre-warmed top questions hit warm |
| Cost of repeated questions | Full price, every time, at peak | Generated once off-peak at ~50% batch discount |
| Peak latency (repeats) | Full generation latency | Near-instant (embedding + lookup) |
| Peak compute load | All generation on-peak | Repeats shifted off-peak |
The savings compound: each pre-warmed question is (a) generated at a batch discount rather than full price, (b) generated once instead of thousands of times, and (c) served at peak without a model call at all. Meanwhile the most expensive, most contended window — peak — carries only the genuinely novel questions.
The strategic point isn't just "cache more." It's moving work from when capacity is scarce and expensive to when it's abundant and cheap.
When this strategy fits — and when it doesn't
Fits when…
- Traffic is bursty with predictable peaks and a reliable idle window.
- A small set of questions dominates volume (high repetition).
- Answers are stable over hours/days and safe to reuse.
- You can anticipate demand from logs and known events.
Doesn't fit when…
- Every conversation is unique — there's nothing to pre-compute.
- Answers are personalized or time-sensitive — pre-warming would serve stale or wrong responses; these must always hit the live path.
- Traffic is flat 24/7 — there's no cheap off-peak window to exploit (though batch-discount generation can still help).
- Content changes constantly — pre-warmed answers would be stale before peak.
Pitfalls and how NorthDesk avoided them
| Pitfall | Risk | Mitigation |
|---|---|---|
| Pre-warming personalized/account-specific answers | Serving one user's answer to another | Policy bypass — never pre-warm or cache personalized Qs |
| Stale pre-computed answers | Confidently wrong info at peak | Nightly refresh + invalidate on source change |
| Similarity threshold too loose | False hits return the wrong canned answer | Conservative threshold; sample false-hit rate |
| Over-investing in rare questions | Wasted off-peak generation | Pre-warm by frequency; let the live path cover the tail |
| Cache and live model drift apart | Cached answers reflect an old prompt/model | Version the cache; invalidate on model/prompt change |
| Assuming prefix cache can be pre-warmed | It expires in minutes, not hours | Pre-warm the semantic cache; prefix-cache the live misses |
Frequently asked questions
What does it mean to pre-populate an LLM cache off-peak? It means generating answers to anticipated and historically-common questions during low-traffic hours and storing them in your semantic cache, so that high-traffic peak requests are served from the warm cache instead of triggering live model calls. Off-peak generation can use discounted batch APIs, so you produce the answers cheaply and shift load away from the expensive, latency-critical peak window.
Why use the semantic cache and not prompt/prefix caching for this? Prefix caches (Anthropic, OpenAI, vLLM) live for only minutes, so they can't survive the hours between off-peak generation and peak demand. A semantic cache stores complete question–answer pairs in your own vector store with a TTL you control, so it can persist from the idle window into peak. Prefix caching still helps the live misses during peak; the two layer together.
How do I decide which questions to pre-warm? Mine your historical logs for the highest-frequency questions (cluster them so near-duplicates collapse), add questions you anticipate from known events like pricing changes or seasonal spikes, and let support staff seed known-important ones. Prioritize by frequency — pre-warming rare questions wastes generation effort that the live path can cover reactively.
How much can off-peak pre-warming save? The savings come from three multiplying effects: pre-warmed answers are generated at a batch discount (~50% off) rather than full price, generated once instead of regenerated thousands of times, and served at peak without any model call. The magnitude depends on your repetition rate — the larger the share of traffic covered by a small set of stable questions, the bigger the win.
How do I keep pre-warmed answers from going stale? Refresh the cache on a schedule (e.g., nightly re-generation to renew TTLs and capture new top questions), invalidate entries immediately when the underlying help content or policy changes, and version the cache so a model or prompt update can invalidate the whole pre-warmed set. Never pre-warm personalized or time-sensitive answers — route those to the live path always.
Does this work for a bot with flat, 24/7 traffic? Less so — the strategy's leverage comes from a cheap idle window and a contended peak. With flat traffic there's no off-peak discount window to exploit. However, you can still benefit from caching repeated questions reactively (a normal semantic cache) and from generating any batchable, non-urgent work through discounted batch APIs.
Key takeaways
- Pre-warming moves work from expensive, contended peak hours to cheap, idle off-peak hours — generate anticipated answers ahead of demand.
- It targets the semantic (answer) cache, not prefix caches, which expire in minutes and can't be pre-warmed hours ahead.
- Generate off-peak with batch APIs (~50% off) — answers are produced cheaply, once, and served at peak with no live model call.
- Savings compound: batch discount × generate-once × serve-without-a-call.
- Anticipate questions from historical logs plus known events; prioritize by frequency and let the live path cover the long tail.
- Guard freshness and correctness: nightly refresh, invalidate on source change, version the cache, conservative similarity threshold, and never pre-warm personalized/time-sensitive answers.
References
- OpenAI. Batch API — asynchronous requests at reduced cost. https://platform.openai.com/docs/guides/batch
- Anthropic. Message Batches — batch processing at reduced cost. https://docs.claude.com/en/docs/build-with-claude/batch-processing
- Bang, F. (2023). GPTCache: An Open-Source Semantic Cache for LLM Applications. Proceedings of the 3rd Workshop on NLP Open Source Software (NLP-OSS). https://github.com/zilliztech/GPTCache
- Anthropic. Prompt caching — Claude API documentation (prefix cache TTLs). https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- Redis. Semantic caching for LLMs (RedisVL). https://redis.io/docs/latest/develop/ai/
Keep reading
Multi-Tier LLM Cache: Semantic, Prefix & Inference Layers
How to stack semantic, prefix, and inference-layer caches into a single pipeline that maximises hit rate while controlling cost and staleness.
Anthropic Prompt Caching: How It Works + When to Use It
How Anthropic prompt caching works, what it costs to write and read the cache, and the conditions under which it cuts input token spend by up to 90%.
Category-Aware Semantic Caching for LLM Workloads
How to partition your semantic cache by query category so similar-but-different intents don't collide, and why heterogeneous workloads break naive semantic caching.