Anthropic Prompt Caching: How It Works + When to Use It

Anthropic prompt caching lets you reuse a stable prefix of your Claude prompts instead of reprocessing it from scratch - cutting input costs on the cached portion by up to 90% and time-to-first-token by more than 2×.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026

14 min read

TL;DR: Anthropic prompt caching stores a reusable prefix of your Claude prompt so subsequent requests don't reprocess it from scratch. Cache reads cost 0.1× the base input price - a 90% discount. One developer went from $720/month to $72/month on repetitive API calls. Cache hits also cut latency by more than 2×. Works on all active Claude models; automatic caching is not available on Bedrock or Vertex AI.

What Is Prompt Caching? (The 30-Second Answer)

Prompt caching is a Claude API feature that stores a reusable prefix of your prompt so the model doesn't reprocess it on every request. Instead of paying full input-token prices every time, you pay a one-time write cost and then a tiny read cost on every subsequent hit.

The mental model is simple:

[stable prefix - cached] + [new tail - processed fresh]

Everything in the stable prefix gets reused. Only the new tail gets processed from scratch. That's where the savings come from.

What is prompt caching in Claude, specifically? It's an ephemeral key-value cache tied to your workspace. Claude checks whether the incoming prompt prefix matches a cached entry. If it does, it skips recomputation. If it doesn't, it processes the full prompt and writes a new cache entry once the response begins.

How Does Anthropic Prompt Caching Work?

The system checks your prompt prefix against a cached hash. A match = cache hit; you pay ~10% of normal input costs. No match = cache miss; the full prefix is processed and written to cache.

Here's the exact flow on every request:

Claude receives your request.
It computes a hash of the prefix up to your cache breakpoint.
It checks for a matching entry in the workspace cache.
Hit: Returns the cached computation, charges cache-read rates.
Miss: Processes the full prompt, writes the prefix to cache once the response starts, charges cache-write rates.
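
The flow above can be mimicked with a toy model. This is only an illustrative sketch: ToyPrefixCache is a made-up class, and the use of SHA-256 over the prefix text is an assumption for demonstration, not a description of Anthropic's internals.

```python
import hashlib


class ToyPrefixCache:
    """Illustrative stand-in for a prefix cache; not Anthropic's implementation."""

    def __init__(self):
        self._entries = set()

    def lookup(self, prefix: str) -> str:
        # Hash the exact bytes of the prefix up to the breakpoint.
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key in self._entries:
            return "hit"   # would be billed at cache-read rates
        self._entries.add(key)
        return "miss"      # would be billed at cache-write rates


cache = ToyPrefixCache()
prefix = "TOOLS|SYSTEM: You are a helpful assistant."

first = cache.lookup(prefix)          # nothing cached yet
second = cache.lookup(prefix)         # identical bytes
changed = cache.lookup(prefix + " ")  # one extra character

print(first, second, changed)  # prints: miss hit miss
```

The third lookup misses because a single trailing space changes the hash, which is the behaviour the prefix-matching rule describes.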

The Prefix-Matching Rule

Cache hits require 100% byte-level identical prefixes. One changed token = cache miss. No exceptions.

The prefix is built in a fixed order: tools → system → messages. This hierarchy matters. Change your tool definitions and you invalidate the system and messages cache too. Change only the system prompt and you invalidate the messages cache. Change only the latest user message and nothing upstream breaks.

Common mistake: Putting a timestamp or per-request variable inside the cached block. The hash then changes on every request, so you never get a hit and you pay cache-write prices for nothing. Always place your cache_control breakpoint on the last block that stays identical across requests.
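
One way to avoid that mistake is to keep per-request values out of the cached block entirely. The sketch below only builds request arguments (it makes no API call); build_request, STATIC_INSTRUCTIONS, and the "Acme" prompt are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone

# Stable text only: nothing in here may vary between requests.
STATIC_INSTRUCTIONS = (
    "You are a support assistant for Acme. [... long, stable instructions ...]"
)


def build_request(user_question: str, now: datetime) -> dict:
    """Return system/messages arguments with dynamic data after the breakpoint."""
    return {
        "system": [
            {
                "type": "text",
                "text": STATIC_INSTRUCTIONS,
                # Breakpoint sits on the last block that never changes.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                # The timestamp lives in the uncached tail.
                "content": f"Current time: {now.isoformat()}\n\n{user_question}",
            }
        ],
    }


a = build_request("Where is my order?", datetime(2026, 6, 8, 9, 0, tzinfo=timezone.utc))
b = build_request("Cancel my plan.", datetime(2026, 6, 8, 9, 5, tzinfo=timezone.utc))

print(a["system"] == b["system"])  # prints: True
```

Because the system block is identical across both requests, the cached prefix stays reusable while the timestamp and question change freely.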

Cache Windows: 5 Minutes vs. 1 Hour

There are two TTL options:

Window	Cost	Best for
5 minutes (default)	1.25× base input	High-frequency requests (> 1 per 5 min)
1 hour	2× base input	Slower cadences, agentic tasks, long user sessions

The 5-minute cache refreshes for free each time it's hit. So if your system prompt is called every few seconds, you write once and read indefinitely within that window - the TTL resets on every hit.

The 1-hour cache costs more to write but is worth it when requests are spaced further apart - for example, a user who might not respond for 10 minutes, or a background agent that runs every 30 minutes.

To use the 1-hour TTL, add "ttl": "1h" to your cache_control object:

{ "cache_control": { "type": "ephemeral", "ttl": "1h" } }

One constraint: If you mix TTLs in a single request, longer TTLs must appear before shorter ones in the prompt hierarchy.
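
A quick local check can catch an ordering mistake before a request is sent. This is a minimal sketch: ttls_in_valid_order is a hypothetical helper, it expects the content blocks flattened in prompt order (tools, then system, then messages), and it assumes a breakpoint with no "ttl" key uses the 5-minute default.

```python
TTL_SECONDS = {"5m": 300, "1h": 3600}


def ttls_in_valid_order(blocks: list[dict]) -> bool:
    """True if no longer-TTL breakpoint appears after a shorter-TTL one."""
    shortest_seen = None
    for block in blocks:
        control = block.get("cache_control")
        if control is None:
            continue  # not a breakpoint
        ttl = TTL_SECONDS[control.get("ttl", "5m")]
        if shortest_seen is not None and ttl > shortest_seen:
            return False
        shortest_seen = ttl if shortest_seen is None else min(shortest_seen, ttl)
    return True


good = [
    {"type": "text", "text": "tools + docs", "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    {"type": "text", "text": "no breakpoint here"},
    {"type": "text", "text": "recent turns", "cache_control": {"type": "ephemeral"}},
]
bad = list(reversed(good))

print(ttls_in_valid_order(good), ttls_in_valid_order(bad))  # prints: True False
```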

Automatic vs. Explicit Cache Breakpoints

Claude prompt caching offers two modes - automatic (easiest) and explicit (more control). The explicit mode is a key distinction from OpenAI's prompt caching, which is applied automatically.

Automatic caching: Add a single top-level cache_control field. The system places the breakpoint on the last cacheable block and moves it forward automatically as conversations grow. Best for multi-turn chat.
Explicit breakpoints: Place cache_control directly on individual content blocks. Up to 4 breakpoints per request. Best when different sections change at different frequencies (e.g., tool definitions rarely change, but context updates daily).

You can combine both: use an explicit breakpoint to anchor your system prompt, and let automatic caching handle the growing conversation history.

Prompt Caching Pricing: What You Actually Pay

Three pricing tiers apply: cache write (more than base), cache read (much less than base), and regular input (base rate) for anything after your last breakpoint.

The multipliers:

5-minute cache write: 1.25× base input price
1-hour cache write: 2× base input price
Cache read: 0.1× base input price (90% cheaper)

Pricing Table by Model

Model	Base Input	5m Write	1h Write	Cache Hit	Output
Claude Fable 5 / Mythos 5	$10/MTok	$12.50/MTok	$20/MTok	$1/MTok	$50/MTok
Claude Opus 4.8 / 4.7 / 4.6 / 4.5	$5/MTok	$6.25/MTok	$10/MTok	$0.50/MTok	$25/MTok
Claude Sonnet 4.6 / 4.5	$3/MTok	$3.75/MTok	$6/MTok	$0.30/MTok	$15/MTok
Claude Haiku 4.5	$1/MTok	$1.25/MTok	$2/MTok	$0.10/MTok	$5/MTok

Prices per million tokens (MTok). Multipliers stack with Batch API discounts.
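
Every row in the table follows from the three multipliers, so the cache prices for any model can be derived from its base input price. A small sketch, where cache_prices is a hypothetical helper name:

```python
def cache_prices(base_input_per_mtok: float) -> dict:
    """Derive cache prices ($/MTok) from a model's base input price."""
    return {
        "write_5m": round(base_input_per_mtok * 1.25, 4),
        "write_1h": round(base_input_per_mtok * 2.0, 4),
        "read": round(base_input_per_mtok * 0.1, 4),
    }


# Base input prices taken from the table above.
sonnet = cache_prices(3.0)
haiku = cache_prices(1.0)

print("Sonnet 4.6:", sonnet)
print("Haiku 4.5:", haiku)
```

The results line up with the Sonnet and Haiku rows of the table.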

The Real Cost Math (with Example)

Say you're running a coding assistant on Claude Sonnet 4.6 with a 10,000-token system prompt. You get 1,000 requests per day.

Without caching:

10,000 tokens × 1,000 requests × $3/MTok = $30/day

With caching (5-minute window, high request frequency):

Day 1 cache write: 10,000 tokens × $3.75/MTok = $0.0375 (once)
999 cache reads: 10,000 tokens × 999 × $0.30/MTok = $2.997/day

That's roughly $27 saved per day on the system prompt alone, assuming traffic is steady enough to keep the 5-minute cache warm all day. Over a 30-day month that comes to about $810 in input-token savings from one prompt. Output tokens are billed at the normal rate either way. That's the core appeal of prompt caching as a cost reduction technique.
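
The arithmetic for this example can be reproduced in a few lines. The sketch uses the same simplification as the worked example: one cache write per day, with every other request a cache read. The constant and function names are illustrative.

```python
PROMPT_TOKENS = 10_000
REQUESTS_PER_DAY = 1_000
# $/MTok for Claude Sonnet 4.6, from the pricing table.
BASE, WRITE_5M, READ = 3.00, 3.75, 0.30


def dollars(tokens: int, price_per_mtok: float) -> float:
    """Cost in dollars for a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_mtok


uncached = dollars(PROMPT_TOKENS * REQUESTS_PER_DAY, BASE)
cached = dollars(PROMPT_TOKENS, WRITE_5M) + dollars(
    PROMPT_TOKENS * (REQUESTS_PER_DAY - 1), READ
)
savings = uncached - cached

print(f"uncached=${uncached:.2f}/day cached=${cached:.2f}/day saved=${savings:.2f}/day")
```

If traffic has gaps longer than five minutes, each gap adds another cache write, so real savings will be somewhat lower than this best case.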

How Much Can You Actually Save?

Up to 90% on input costs for repetitive, high-volume workloads. Latency drops by more than 2× on cache hits.

The most concrete real-world data point: a developer running repetitive API calls cut their monthly bill from $720 to $72 - a 90% reduction - purely by enabling prompt caching on a stable system prompt.

That's not a cherry-picked edge case. It's the expected outcome when:

Your system prompt is large (thousands of tokens)
You send many requests per day
The prefix stays identical across requests

For latency, cache hits skip the full KV computation for the cached prefix. The result is more than 2× faster time-to-first-token on long-context requests. (Self-hosted stacks get a similar win through KV cache reuse in vLLM.) For a coding assistant loading 50,000 tokens of codebase context, that difference is immediately noticeable.

Where savings are smaller: Short prompts, highly dynamic prefixes, or low request volume. If you only make 10 requests per day, they are unlikely to land within the same 5-minute window, so most of them pay the write surcharge and few or none get a discounted read.

When Should You Use Prompt Caching? (5 Best Use Cases)

Prompt caching pays off whenever you have a large, stable prefix that gets reused across many requests. Here are the five scenarios where it delivers the most value.

1. Long System Prompts Reused Across Many Requests

The classic case. If your system prompt is 2,000+ tokens and you're making hundreds of requests per day, caching it is a no-brainer. Write once, read at 10% of base cost on every subsequent call.

2. Multi-Turn Conversations with Large Context

Use automatic caching for chat applications. As the conversation grows, the cache breakpoint moves forward automatically - each new turn reads the prior history from cache and only processes the new messages fresh.

3. RAG Pipelines with Static Document Context

If you're including a fixed knowledge base (a legal document, a product manual, a codebase README) in every request, cache it. The document doesn't change; there's no reason to reprocess it 500 times a day. (For heavier RAG setups, see multi-tier caching architectures that stack prefix and semantic layers.)

4. Claude Code Prompt Caching - Coding Assistants with Large Codebases

Claude Code prompt caching is one of the highest-ROI use cases. Load your repo structure, CLAUDE.md, architecture notes, and relevant file contents into the cached prefix. Every autocomplete or Q&A request reads that context from cache instead of reprocessing thousands of tokens.

5. Batch Processing with Shared Instructions

Running batch jobs where every item shares the same instructions? Cache the instructions block. Each item in the batch pays only the cache-read rate for the shared prefix.

When NOT to Use Prompt Caching

Caching adds a write surcharge (25% for 5 minutes, 100% for 1 hour). If you won't get enough reads to offset that cost, skip it.

Avoid prompt caching when:

Your prompt is below the minimum token threshold. Claude won't cache it even if you mark it - no error, just no cache. Check the minimums for your model (see Limitations section).
The prefix changes on every request. Timestamps, per-request IDs, or dynamic context in the cached block = perpetual cache misses + paying write surcharges for nothing.
You're making one-off requests. No repetition means no cache hits. You'd pay the write premium and never recoup it.
The 1-hour window cost doesn't justify the savings. At 2× base input for the write, you need enough reads within the hour to break even. For very low-volume workloads, do the math first. (We work through the actual break-even point for prompt caching separately.)
You're on Bedrock or Vertex AI and need automatic caching. Automatic caching is not available on those platforms (though explicit breakpoints and 1-hour TTL are).
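
For the "do the math first" cases, a rough break-even check is easy to sketch. It assumes every request shares the same prefix and lands while the cache entry is still alive, so there is exactly one write; min_requests_to_break_even is a hypothetical helper name.

```python
import math


def min_requests_to_break_even(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest request count n where caching is strictly cheaper than not caching.

    Cost in units of (prefix tokens x base input price):
      with caching:    write_multiplier + read_multiplier * (n - 1)
      without caching: n
    """
    threshold = (write_multiplier - read_multiplier) / (1 - read_multiplier)
    return math.floor(threshold) + 1


five_minute = min_requests_to_break_even(1.25)
one_hour = min_requests_to_break_even(2.0)

print(five_minute, one_hour)  # prints: 2 3
```

Under these assumptions the 5-minute cache pays for itself on the second request inside the window, and the 1-hour cache on the third.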

How to Implement Prompt Caching (Step-by-Step)

Two paths: automatic (one line of config) or explicit breakpoints (more control). Start with automatic unless you have a specific reason not to.

Automatic Caching (Easiest)

Add cache_control={"type": "ephemeral"} at the top level of your messages.create() call. The system handles the rest.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="You are an expert software engineer with deep knowledge of "
           "distributed systems, microservices architecture, and Python. "
           "[... rest of your long system prompt ...]",
    messages=[
        {"role": "user", "content": "Explain this function..."}
    ]
)

# Check cache performance
print(response.usage.model_dump_json())

The usage object will show cache_creation_input_tokens on the first call and cache_read_input_tokens on subsequent hits. If both are 0, the most likely cause is that your prompt didn't meet the minimum token threshold for the model.

For multi-turn conversations, automatic caching moves the breakpoint forward each turn. You don't need to update anything as the conversation grows.

Explicit Cache Breakpoints (More Control)

Place cache_control directly on the content block you want to cache. Useful when you have a large static context (codebase, document) that should be cached independently from the conversation history.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert software engineer..."
        },
        {
            "type": "text",
            "text": "[Large codebase context - 10,000+ tokens here]",
            "cache_control": {"type": "ephemeral"}
            # Cache breakpoint: everything up to and including this block
        }
    ],
    messages=[
        {"role": "user", "content": "What does this function do?"}
    ]
)

To pre-warm the cache before users arrive (eliminating first-request latency), send a request with max_tokens=0:

import anthropic

client = anthropic.Anthropic()

# Fire at app startup to warm the cache
prewarm = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=0,
    system=[
        {
            "type": "text",
            "text": "You are an expert software engineer...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "warmup"}]
)
# stop_reason will be "max_tokens", content will be []
# cache_creation_input_tokens confirms the write happened

Track your cache performance with these response fields:

cache_creation_input_tokens - tokens written to cache this request
cache_read_input_tokens - tokens read from cache this request
input_tokens - tokens after the last breakpoint (not cached)

Total input = cache_read_input_tokens + cache_creation_input_tokens + input_tokens
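
These three fields are enough to compute a per-request cache read fraction for logging or dashboards. A minimal sketch, where cache_summary is a hypothetical helper that takes the usage fields as a plain dict (with the Python SDK, something like response.usage.model_dump() should produce one):

```python
def cache_summary(usage: dict) -> dict:
    """Summarize how much of a request's input was served from cache."""
    # Treat missing or None fields as zero.
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    fresh = usage.get("input_tokens") or 0
    total = read + written + fresh
    return {
        "total_input_tokens": total,
        "read_fraction": read / total if total else 0.0,
    }


summary = cache_summary(
    {
        "cache_read_input_tokens": 9000,
        "cache_creation_input_tokens": 0,
        "input_tokens": 1000,
    }
)

print(summary)  # prints: {'total_input_tokens': 10000, 'read_fraction': 0.9}
```

A read fraction that stays near zero across many requests usually points to a prefix that is changing between calls.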

Prompt Caching Limitations to Know

Know these before you ship - a few of them will bite you if you don't.

Minimum token thresholds (by model):

Minimum	Models
512 tokens	Claude Fable 5, Claude Mythos 5
1,024 tokens	Claude Opus 4.8, Sonnet 4.6/4.5, Opus 4.1
2,048 tokens	Claude Mythos Preview, Opus 4.7, Haiku 3.5
4,096 tokens	Claude Opus 4.6/4.5, Haiku 4.5

Note: On Amazon Bedrock, Fable 5 and Mythos 5 require 1,024 tokens minimum.
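
Because a prompt below the minimum fails silently, it can help to check the prefix size up front. The sketch below simply encodes the table above as a lookup; the dictionary keys are display names from the table (not API model IDs), is_cacheable is a hypothetical helper, and the token count has to come from whatever counting method you already use.

```python
# Minimum cacheable prefix length per model, copied from the table above.
MIN_CACHEABLE_TOKENS = {
    "Claude Fable 5": 512,
    "Claude Mythos 5": 512,
    "Claude Opus 4.8": 1024,
    "Claude Sonnet 4.6": 1024,
    "Claude Sonnet 4.5": 1024,
    "Claude Opus 4.1": 1024,
    "Claude Mythos Preview": 2048,
    "Claude Opus 4.7": 2048,
    "Claude Haiku 3.5": 2048,
    "Claude Opus 4.6": 4096,
    "Claude Opus 4.5": 4096,
    "Claude Haiku 4.5": 4096,
}


def is_cacheable(model_name: str, prefix_tokens: int) -> bool:
    """True if the prefix meets the model's minimum cacheable length."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_name]


print(is_cacheable("Claude Sonnet 4.6", 3000))  # prints: True
print(is_cacheable("Claude Haiku 4.5", 3000))   # prints: False
```

The same 3,000-token prefix is cacheable on one model and silently uncached on another, which is why the threshold is worth checking per model.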

Other hard limits:

Max 4 cache breakpoints per request. Automatic caching uses one slot, so you have 3 left for explicit breakpoints.
20-block lookback window per breakpoint. If a growing conversation pushes your breakpoint more than 20 blocks past the last cache write, the lookback misses it. Add a second breakpoint to stay within range.
Cache entry only available after the first response begins. Parallel requests sent simultaneously will all miss the cache - the first one to respond creates the entry. If you need parallel cache hits, pre-warm first.
Byte-level exact match required. One token difference = full cache miss. This includes key ordering in JSON: some languages, such as Go and Swift, can emit keys in a different order from one serialization to the next, which breaks caches.
Workspace-level isolation as of February 5, 2026. On the Claude API, Claude Platform on AWS, and Microsoft Foundry, caches are isolated per workspace - not just per organization. If you use multiple workspaces, each one has its own cache. Bedrock and Vertex AI still use organization-level isolation.
Automatic caching not available on Bedrock or Vertex AI. Explicit breakpoints and the 1-hour TTL work on both platforms.
Thinking blocks can't be directly marked with cache_control, though they get cached as part of assistant turns in multi-turn conversations.
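
For the key-ordering limitation, one defensive habit is to serialize any JSON you embed in a prompt (or any request body you build by hand) in a canonical form. A minimal sketch using Python's standard json module; canonical_json and the get_weather tool are hypothetical names for illustration.

```python
import json


def canonical_json(value) -> str:
    """Serialize with sorted keys and fixed separators so output is byte-stable."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Same tool definition, keys inserted in a different order.
tool_a = {
    "name": "get_weather",
    "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}},
}
tool_b = {
    "input_schema": {"properties": {"city": {"type": "string"}}, "type": "object"},
    "name": "get_weather",
}

print(canonical_json(tool_a) == canonical_json(tool_b))  # prints: True
```

Sorting keys at serialization time means two logically identical definitions always produce the same bytes, regardless of how the dictionary was built.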

FAQ

What is prompt caching in Claude? Prompt caching is a Claude API feature that stores a reusable prefix of your prompt in a temporary cache. When subsequent requests send an identical prefix, Claude reads it from cache instead of reprocessing it - reducing costs by up to 90% and latency by more than 2×.

How does Anthropic prompt caching work? When you send a request with caching enabled, Claude computes a hash of your prompt prefix up to the cache breakpoint. If a matching hash exists in the workspace cache, it's a hit - you pay 0.1× the base input price. If not, it's a miss - Claude processes the full prefix and writes it to cache once the response begins.

How much does prompt caching cost? Cache writes cost 1.25× the base input price (5-minute TTL) or 2× (1-hour TTL). Cache reads cost 0.1× the base input price - 90% cheaper than standard input tokens. For Claude Sonnet 4.6/4.5, that's $0.30/MTok for reads vs. $3/MTok for standard input.

How much can prompt caching save? Up to 90% on input costs for repetitive workloads. A documented real-world case: a developer cut their monthly API bill from $720 to $72 by caching a stable system prompt across high-volume requests. Latency also drops by more than 2× on cache hits.

What is the minimum prompt length for caching? It depends on the model. The minimum is 512 tokens for Claude Fable 5 and Mythos 5; 1,024 tokens for Opus 4.8 and Sonnet 4.6/4.5; 2,048 tokens for Opus 4.7 and Haiku 3.5; and 4,096 tokens for Opus 4.6/4.5 and Haiku 4.5. Prompts below the threshold are processed normally - no error, no cache.

What's the difference between the 5-minute and 1-hour cache windows? The 5-minute cache (default) costs 1.25× base input to write and refreshes for free on every hit. It's ideal for high-frequency requests. The 1-hour cache costs 2× base input to write but stays alive for an hour - better for slower cadences, agentic tasks, or user sessions where responses might be spaced more than 5 minutes apart.

Does prompt caching work on AWS Bedrock or Vertex AI? Partially. Automatic caching is not available on Bedrock or Vertex AI. Explicit cache breakpoints and the 1-hour TTL do work on both platforms. Also note: on Bedrock, cache isolation remains at the organization level (not workspace level), unlike the Claude API which moved to workspace-level isolation on February 5, 2026.

What happens if my prompt changes slightly between requests? Any change to the cached prefix - even a single token - produces a different hash and results in a cache miss. You'll pay the full cache-write cost again with no read benefit. This is the most common caching mistake: placing a cache_control breakpoint on a block that contains dynamic content (timestamps, per-request IDs, user messages). Always anchor the breakpoint on the last block that stays identical across all requests you want to share a cache.

Useful Sources

Anthropic Prompt Caching - Official Documentation - Full mechanics, pricing tables, code examples, and edge cases.
Anthropic Prompt Caching Cookbook - Practical patterns: large context caching, tool definition caching, multi-turn conversation examples.
Amazon Bedrock Prompt Caching Documentation - Per-model minimums and platform-specific behavior for Bedrock deployments.
Anthropic Models Overview - Current model list with supported features and deprecation status.
