Caching
Prompt, prefix, semantic, and KV caching — the techniques that cut repeated LLM work and the cost that comes with it.
Anthropic Prompt Caching: How It Works + When to Use It
How Anthropic prompt caching works, what it costs to write and read the cache, and the conditions under which it cuts input token spend by up to 90%.
Category-Aware Semantic Caching for LLM Workloads
How to partition your semantic cache by query category so similar-but-different intents don't collide, and why heterogeneous workloads break naive semantic caching.
vLLM KV Cache Reuse: A Guide to Cutting Inference Costs
How to configure and verify KV cache reuse in vLLM to cut repeated-prefix inference costs, with concrete steps and the metrics to watch.
Multi-Tier LLM Cache: Semantic, Prefix & Inference Layers
How to stack semantic, prefix, and inference-layer caches into a single pipeline that maximises hit rate while controlling cost and staleness.
LLM Cache Pre-Warming for Off-Peak Customer Service Bots
A case study on warming LLM caches with predictable queries overnight so support bots hit the cache on the first message of the day instead of paying full inference cost.
OpenAI vs Anthropic Prompt Caching: Key Differences
A side-by-side comparison of how OpenAI and Anthropic implement prompt caching — automatic vs manual, TTLs, pricing, and which fits which workload.
Prefix Caching vs Semantic Caching: Which Fits Your App?
The practical difference between prefix caching (exact match on token sequences) and semantic caching (embedding similarity), and how to pick the right one for your use case.
Prompt Caching Break-Even: How Many Reads to Save Money?
The exact formula for calculating your prompt caching break-even point — factoring in write premium, read discount, TTL, and request volume — so you know whether caching is worth it before you turn it on.