Anthropic Prompt Caching: How It Works + When to Use It

How Anthropic prompt caching works, what it costs to write and read the cache, and the conditions under which it cuts input token spend by up to 90%.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 9 min read

Quick answer: Anthropic prompt caching stores the repeated prefix of your prompt (system instructions, tool definitions, long documents) on Anthropic's servers. The first request writes the cache at a small premium (1.25× the base input price with the default 5-minute TTL). Every later request that reuses that exact prefix reads it for about 0.1× the base input price, a saving of up to 90% on the cached tokens. You get that saving whenever many calls share a large, byte-identical prefix and arrive within the cache's time-to-live (TTL) window.


What is Anthropic prompt caching?

Prompt caching is a feature of the Claude Messages API that lets you reuse a large, unchanging portion of your prompt across many requests without paying full price to reprocess it each time. Instead of re-reading a 30,000-token system prompt or document on every call, Claude reads it once, caches the internal representation, and serves it from cache on subsequent calls.

The cached portion is billed at a steep discount, while only the new, request-specific tokens (the user's latest question) are processed at full price. For applications that send the same context repeatedly (chatbots, coding agents, document Q&A, RAG pipelines), this is often the highest-leverage cost optimization available.


How does prompt caching work under the hood?

There is one rule that everything else follows from:

Prompt caching is a prefix match. Any byte change anywhere in the prefix invalidates the cache for everything after it.

The cache key is derived from the exact bytes of your rendered prompt up to each cache breakpoint. The API renders your request in a fixed order:

tools → system → messages

You mark where the stable prefix ends by placing a cache_control breakpoint on a content block. On the next request, Claude compares the bytes up to that breakpoint; if they match exactly, it reads from cache. If a single character differs — a changed timestamp, a reordered JSON key, an added tool — the match breaks and that portion is reprocessed at full price.
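
To make the prefix-match rule concrete, here is a toy model in Python. It is not Anthropic's actual implementation: render_prefix and prefix_key are hypothetical helpers, and the hashing scheme is an assumption chosen only to show that one changed character produces a different key.

```python
import hashlib
import json


def render_prefix(tools, system, messages):
    """Toy rendering: concatenate in the tools -> system -> messages order."""
    parts = [
        json.dumps(tools, sort_keys=True),
        system,
        json.dumps(messages, sort_keys=True),
    ]
    return "\n".join(parts).encode("utf-8")


def prefix_key(tools, system, messages):
    """Toy cache key: a hash over the exact bytes of the rendered prefix."""
    return hashlib.sha256(render_prefix(tools, system, messages)).hexdigest()


tools = [{"name": "search", "description": "Search the docs"}]
history = [{"role": "user", "content": "Hello"}]

key_a = prefix_key(tools, "You are a helpful assistant.", history)
key_b = prefix_key(tools, "You are a helpful assistant.", history)
key_c = prefix_key(tools, "You are a helpful assistant!", history)  # one byte differs

print(key_a == key_b)  # identical bytes give the same key
print(key_a == key_c)  # a single changed character gives a different key
```

In this model, the first comparison is a hit and the second is a miss, which is the behavior the rule above describes.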

The golden rule: stable content first, volatile content last

Because caching is prefix-based, you want your never-changing content (frozen system prompt, deterministic tool list) at the front, and your per-request content (the user's question, timestamps, request IDs) after the last breakpoint. Put a timestamp at the top of your system prompt and the system prompt and every message after it become uncacheable; only the tool definitions, which render before it, can still match.
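
One way to apply the golden rule is sketched below. The build_request helper, the ExampleCo prompt text, and the choice to put the timestamp in the user message are all illustrative assumptions. The function only builds the keyword arguments you would pass to client.messages.create, so it runs without an API key.

```python
from datetime import datetime, timezone

# Frozen text: never interpolate per-request values into this string.
STABLE_SYSTEM_PROMPT = "You are a support assistant for ExampleCo. <long, frozen instructions>"


def build_request(question: str, now: datetime) -> dict:
    """Keep the stable prefix first and the volatile content after the breakpoint."""
    return {
        "model": "claude-opus-4-8",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                # Breakpoint: everything up to here should be byte-identical across calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                # Volatile data (timestamp, question) lives after the breakpoint.
                "content": f"Current time: {now.isoformat()}\n\n{question}",
            }
        ],
    }


first = build_request("Where is my order?", datetime(2026, 6, 8, 9, 0, tzinfo=timezone.utc))
second = build_request("How do I reset my password?", datetime(2026, 6, 8, 9, 1, tzinfo=timezone.utc))
print(first["system"] == second["system"])  # the cacheable prefix is unchanged between calls
```

Both requests carry different timestamps and questions, yet the system block stays identical, so the second call can read what the first call wrote.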


Where does the 90% saving come from? (Cache read vs. write pricing)

Prompt caching bills input tokens at four rates: one for cache reads, two for cache writes (one per TTL), and one for uncached input. Understanding them is the key to knowing when you actually save money:

| Token type | Cost vs. base input price | What it means |
| --- | --- | --- |
| Cache read (cache_read_input_tokens) | ~0.1× (90% cheaper) | Tokens served from cache. This is the 90% saving. |
| Cache write (5-min TTL) | 1.25× (25% premium) | One-time cost to store the prefix for 5 minutes. |
| Cache write (1-hour TTL) | 2× (100% premium) | One-time cost to store the prefix for 1 hour. |
| Uncached input (input_tokens) | 1× (full price) | New, request-specific tokens processed normally. |

So a cache read costs roughly one-tenth of the normal input price. Once your large prefix has been written once, every reuse of it is 90% cheaper. The "write" premium is what you pay up front to unlock those cheap reads.

Worked example with current Claude pricing

Say you have a 30,000-token shared context and you're using Claude Opus 4.8 (base input price $5.00 per million tokens). You make 100 requests in five minutes that all reuse that context:

| Approach | Cost of the 30K shared context across 100 calls |
| --- | --- |
| No caching (30K × 100 × $5/M) | $15.00 |
| With 5-min caching (1 write at 1.25× + 99 reads at 0.1×) | ~$1.67 |

That's roughly an 89% reduction on the shared-context portion — the headline 90% figure in practice. (Your user-specific tokens are still billed at full price, so blended savings depend on how large your shared prefix is relative to each unique question.)
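
If you want to rerun this arithmetic with your own numbers, a short Python sketch follows. The function name shared_context_cost is hypothetical, the multipliers are the ones quoted in this post, and it assumes every request after the first arrives inside the TTL and hits the cache.

```python
BASE_INPUT_PRICE_PER_MTOK = 5.00   # USD per million input tokens, from the example above
CACHE_WRITE_5M_MULTIPLIER = 1.25   # 5-minute TTL write premium
CACHE_READ_MULTIPLIER = 0.10       # cache read discount


def shared_context_cost(prefix_tokens: int, requests: int, use_cache: bool) -> float:
    """Cost in USD of the shared prefix only, across all requests."""
    per_call = prefix_tokens * BASE_INPUT_PRICE_PER_MTOK / 1_000_000
    if not use_cache:
        return per_call * requests
    write = per_call * CACHE_WRITE_5M_MULTIPLIER
    reads = per_call * CACHE_READ_MULTIPLIER * (requests - 1)
    return write + reads


uncached_cost = shared_context_cost(30_000, 100, use_cache=False)
cached_cost = shared_context_cost(30_000, 100, use_cache=True)
savings = 1 - cached_cost / uncached_cost

print(f"uncached: ${uncached_cost:.2f}")  # $15.00
print(f"cached:   ${cached_cost:.2f}")    # $1.67
print(f"savings:  {savings:.0%}")         # 89%
```

Lowering the request count shows how quickly the saving shrinks when the write premium is spread over fewer reads.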


5-minute vs. 1-hour TTL: which should you use?

Anthropic offers two cache lifetimes. The trade-off is write cost vs. how long the entry survives:

| | 5-minute TTL (default) | 1-hour TTL |
| --- | --- | --- |
| Syntax | {"type": "ephemeral"} | {"type": "ephemeral", "ttl": "1h"} |
| Write premium | 1.25× | 2× |
| Break-even point | ~2 requests | ~3 requests |
| Best for | Steady, continuous traffic | Bursty traffic with idle gaps > 5 min |

Break-even math: With the 5-minute TTL, two requests already pay it off (1.25× write + 0.1× read = 1.35×, versus 2× for two uncached reads). With the 1-hour TTL you need at least three requests (2× write + 0.2× for two reads = 2.2×, versus 3× uncached). The 1-hour option costs more to write but keeps the entry warm across longer pauses, which matters when your users come in waves rather than a steady stream.
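
The same break-even reasoning can be checked in a few lines of Python. This is a minimal sketch: break_even_requests is a hypothetical helper, costs are expressed in multiples of one uncached pass over the prefix, and every request after the first is assumed to be a cache hit.

```python
def break_even_requests(write_multiplier, read_multiplier=0.1, max_requests=1000):
    """Smallest request count at which caching is cheaper than not caching."""
    for n in range(1, max_requests + 1):
        cached_cost = write_multiplier + read_multiplier * (n - 1)  # 1 write, n-1 reads
        uncached_cost = 1.0 * n                                     # n full-price passes
        if cached_cost < uncached_cost:
            return n
    return None


print(break_even_requests(1.25))  # 2 (5-minute TTL)
print(break_even_requests(2.0))   # 3 (1-hour TTL)
```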


When does prompt caching save you 90%? (And when it doesn't)

Caching pays off when…

  • Many requests share a large, identical prefix. A long system prompt, a fixed set of tool definitions, few-shot examples, or retrieved documents reused across calls.
  • Conversations are multi-turn. Each new turn reuses the entire prior conversation as a cached prefix, so savings accrue as the chat grows.
  • Requests arrive within the TTL window. The cached entry must still be alive when the next matching request comes in.
  • The prefix exceeds the model's minimum cacheable length (see the table below).

Caching does NOT help when…

  • Every request is unique from the start. If the first 1,000 tokens differ each time, there is no reusable prefix — adding cache_control only pays the write premium with zero reads.
  • Traffic is too sparse. If matching requests are minutes or hours apart and the cache expires between them, you keep paying to re-write.
  • The shared portion is tiny. Below the minimum cacheable length, nothing is cached — silently, with no error.
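
The effect of sparse traffic is easy to see in a small simulation. Treat this as a sketch under stated assumptions: simulate_prefix_cost is a hypothetical helper, costs are in multiples of one uncached pass over the prefix, and the model assumes an entry stays alive for ttl_s seconds after its most recent write or read.

```python
def simulate_prefix_cost(arrival_times_s, ttl_s=300, write_mult=1.25, read_mult=0.1):
    """Return (total prefix cost, cache hits) for requests at the given times."""
    expires_at = None
    cost = 0.0
    hits = 0
    for t in arrival_times_s:
        if expires_at is not None and t < expires_at:
            cost += read_mult   # entry still alive: cheap read
            hits += 1
        else:
            cost += write_mult  # entry missing or expired: pay the write premium again
        expires_at = t + ttl_s  # assumed: each use keeps the entry warm
    return cost, hits


steady = [i * 60 for i in range(10)]    # one request per minute
sparse = [i * 600 for i in range(10)]   # one request every ten minutes

steady_cost, steady_hits = simulate_prefix_cost(steady)
sparse_cost, sparse_hits = simulate_prefix_cost(sparse)

print(steady_hits, sparse_hits)  # 9 0
print(round(steady_cost, 2), round(sparse_cost, 2), "vs 10.0 uncached")
```

With steady traffic the ten requests cost about 2.15 passes instead of 10. With ten-minute gaps and a 5-minute TTL, every request re-writes the cache and the total comes to 12.5 passes, which is worse than not caching at all.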

Minimum cacheable prefix by model

| Model | Minimum cacheable tokens |
| --- | --- |
| Claude Opus 4.8 / 4.7 / 4.6 / 4.5, Haiku 4.5 | 4,096 |
| Claude Sonnet 4.6, Haiku 3.5, Haiku 3 | 2,048 |
| Claude Sonnet 4.5 / 4 / 3.7 | 1,024 |

Note the practical trap: a 3,000-token prompt caches on Sonnet 4.5 but silently won't cache on Opus 4.8, because the minimum is higher.
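
Because the failure is silent, a pre-flight check can help. The sketch below copies the thresholds from the table above into a lookup keyed by display name (not by API model ID). The will_cache helper is hypothetical, and treating the minimum as inclusive is an assumption.

```python
# Thresholds copied from the table above, keyed by display name.
MIN_CACHEABLE_TOKENS = {
    "Opus 4.8": 4096, "Opus 4.7": 4096, "Opus 4.6": 4096, "Opus 4.5": 4096,
    "Haiku 4.5": 4096,
    "Sonnet 4.6": 2048, "Haiku 3.5": 2048, "Haiku 3": 2048,
    "Sonnet 4.5": 1024, "Sonnet 4": 1024, "Sonnet 3.7": 1024,
}


def will_cache(model: str, prefix_tokens: int) -> bool:
    """True if the prefix is long enough to be cached on this model (assumed inclusive)."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model]


print(will_cache("Sonnet 4.5", 3000))  # True
print(will_cache("Opus 4.8", 3000))    # False
```

You would still need a token count for your prefix from somewhere; this check only compares that number against the table.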


How to enable prompt caching (code examples)

You enable caching by adding a cache_control marker. The simplest approach is automatic caching of the last cacheable block; the manual approach gives you fine-grained placement.

Python (automatic caching — recommended)

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=16000,
    cache_control={"type": "ephemeral"},  # auto-caches the last cacheable block
    system="You are an expert on this large document...",
    messages=[{"role": "user", "content": "Summarize the key points"}],
)

Python (manual placement with 1-hour TTL)

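import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=16000,
    system=[{
        "type": "text",
        "text": "<large shared prompt>",
        # Breakpoint at the end of the shared prompt, kept for 1 hour
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }],
    messages=[{"role": "user", "content": "Summarize the key points"}],
)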

A few hard limits worth knowing: you can place a maximum of 4 cache breakpoints per request, and cache_control can sit on any content block — system text, tool definitions, or message content (text, image, tool_use, tool_result, document).
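
If your code assembles requests dynamically, you may want to count breakpoints before sending. A minimal sketch follows, assuming the request is a plain dict shaped like the examples above. count_breakpoints is a hypothetical helper, and it counts only explicit block-level cache_control markers.

```python
MAX_BREAKPOINTS = 4  # limit stated above


def count_breakpoints(request: dict) -> int:
    """Count explicit cache_control markers on tools, system blocks, and message blocks."""
    count = sum(1 for tool in request.get("tools", []) if "cache_control" in tool)

    system = request.get("system", [])
    if isinstance(system, list):  # a plain string system prompt carries no marker
        count += sum(1 for block in system if "cache_control" in block)

    for message in request.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            count += sum(1 for block in content if "cache_control" in block)
    return count


request = {
    "tools": [{"name": "search", "cache_control": {"type": "ephemeral"}}],
    "system": [{"type": "text", "text": "<large shared prompt>",
                "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "Summarize the key points"}],
}

used = count_breakpoints(request)
print(used, used <= MAX_BREAKPOINTS)  # 2 True
```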


How do you verify a cache hit?

The response usage object tells you exactly what happened. Check three fields:

| Field | Meaning |
| --- | --- |
| cache_creation_input_tokens | Tokens written to cache this request (you paid the write premium) |
| cache_read_input_tokens | Tokens served from cache (you paid ~0.1×, the 90% saving) |
| input_tokens | Uncached remainder, processed at full price |

print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)
print(response.usage.input_tokens)

If cache_read_input_tokens is zero across repeated identical-prefix requests, a silent invalidator is at work. Total prompt size = input_tokens + cache_creation_input_tokens + cache_read_input_tokens, so don't read input_tokens alone — check the sum.
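
The sum described above can be wrapped in a small helper for logging. In this sketch, cache_report is a hypothetical function and the SimpleNamespace object stands in for response.usage, so the example runs without an API call. Treating missing values as zero is an assumption.

```python
from types import SimpleNamespace


def cache_report(usage) -> dict:
    """Summarize the three usage fields into a total and a hit ratio."""
    created = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    uncached = usage.input_tokens or 0
    total = created + read + uncached
    return {
        "total_prompt_tokens": total,
        "cache_hit_ratio": read / total if total else 0.0,
        "wrote_cache": created > 0,
    }


# Stand-in for response.usage on a request that hit a 30,000-token cached prefix.
example_usage = SimpleNamespace(
    cache_creation_input_tokens=0,
    cache_read_input_tokens=30_000,
    input_tokens=120,
)

report = cache_report(example_usage)
print(report["total_prompt_tokens"])        # 30120
print(round(report["cache_hit_ratio"], 3))  # 0.996
```

A hit ratio that stays at zero across repeated requests with the same prefix is the signal to go hunting for an invalidator.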


Common reasons caching silently fails

When the savings don't materialize, it's almost always one of these prefix-breaking patterns:

| Pattern | Why it breaks caching |
| --- | --- |
| datetime.now() / Date.now() in the system prompt | The prefix changes on every request |
| uuid4() or request IDs early in the content | Every request becomes unique |
| json.dumps(d) without sort_keys=True | Non-deterministic key order changes the bytes |
| Session/user ID interpolated into the system prompt | No cross-request or cross-user sharing |
| Tool set that varies per user/request | Tools render first; any change invalidates everything |
| Switching models mid-conversation | Caches are model-scoped |

The fix is always the same: move the dynamic piece after the last breakpoint, make serialization deterministic, or remove it if it isn't load-bearing.
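
For the serialization case, "deterministic" can look like the sketch below. stable_json is a hypothetical helper name. In Python, json.dumps follows dict insertion order, so two dicts with the same contents built in a different order serialize to different bytes unless you sort the keys.

```python
import json


def stable_json(value) -> str:
    """Serialize with sorted keys and fixed separators so equal data yields equal bytes."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Same contents, different insertion order (for example, built by two code paths).
tool_a = {"name": "search", "limit": 10}
tool_b = {"limit": 10, "name": "search"}

print(json.dumps(tool_a) == json.dumps(tool_b))    # False: the bytes differ
print(stable_json(tool_a) == stable_json(tool_b))  # True: the bytes match
```

Anything you embed in the cached prefix as JSON (tool schemas, retrieved records, config) is a candidate for this treatment.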


Frequently asked questions

How much does Anthropic prompt caching save? Reading from the cache costs about 0.1× the base input token price — up to a 90% reduction on the cached portion of your prompt. Writing to the cache costs 1.25× (5-minute TTL) or 2× (1-hour TTL), so savings apply from the second matching request onward.

When does prompt caching actually save 90%? You save the full ~90% on every cache read after the initial write. This happens when many requests share a large, identical prefix — a long system prompt, fixed tool definitions, retrieved documents, or a growing conversation history — and those requests arrive within the cache TTL before it expires.

What is the difference between 5-minute and 1-hour prompt cache TTL? The 5-minute TTL (the default) costs 1.25× to write and breaks even after two requests. The 1-hour TTL costs 2× to write and breaks even after about three requests, but keeps the cache alive across longer gaps. Use 5-minute for steady traffic and 1-hour for bursty workloads with idle gaps longer than five minutes.

Why is my Anthropic cache read count zero? A cache_read_input_tokens of zero across repeated requests means a silent invalidator is changing the prefix. Common causes: a timestamp or UUID in the system prompt, non-deterministic JSON serialization, a varying tool set, or a prefix shorter than the model's minimum cacheable length.

What is the minimum prompt length for caching with Claude? It depends on the model. Opus 4.8, 4.7, 4.6, 4.5, and Haiku 4.5 require at least 4,096 tokens. Sonnet 4.6, Haiku 3.5, and Haiku 3 require 2,048 tokens. Sonnet 4.5, 4, and 3.7 require 1,024 tokens. Shorter prefixes silently will not cache.


Key takeaways

  • Prompt caching is a prefix match: keep stable content first, volatile content last.
  • Cache reads cost ~0.1× base input price — that's the up-to-90% saving.
  • Cache writes cost 1.25× (5-min) or 2× (1-hour); the discount kicks in from the second matching request.
  • The 90% materializes when many requests share a large, byte-identical prefix within the TTL window.
  • Always verify with cache_read_input_tokens — a zero means a silent invalidator broke your prefix.