Prompt Caching Break-Even: How Many Reads to Save Money?

Prompt caching advertises a 90% discount on cache hits - but the write premium means a low cache hit rate costs you more than no caching at all. Here's the exact break-even math and the architecture decisions that determine whether you capture the savings.

Mohammed Kafeel

Machine Learning Researcher

June 18, 2026

14 min read

On this page

What Is Prompt Caching (and Why Does the Math Matter)?
The Break-Even Formula - Explained Simply
How Many Reads Do You Actually Need? (By Provider)
Does Prefix Length Change the Break-Even?
What Hit Rates Look Like in the Real World
Two Real Cost Scenarios - Before vs After
When Should You NOT Use Prompt Caching?
How to Maximize Your Cache Hit Rate (3 Steps)
Key Takeaways
FAQ
Useful Sources

TL;DR

You need just 2 cache reads (5-min TTL on Claude Sonnet 4.6) to start saving money. The math breaks even at 1.39 reads.

The write premium is real: 1.25× for a 5-min cache, 2× for a 1-hour cache. Low hit rates cost you more than no caching.

Architecture determines your hit rate - not your prompt wording. A single structural fix took one team from 7.4% → 84% hit rate and cut costs by 70%.

What Is Prompt Caching (and Why Does the Math Matter)?

Prompt caching stores the KV tensors from your prompt's static prefix so the model skips recomputing them on repeat requests. You pay a write premium once, then a deeply discounted read rate on every subsequent hit.

The cost mechanic has two sides:

Cache write premium: You pay more than the base input rate to create the cache entry (1.25× or 2× depending on TTL).
Cache read discount: Every hit costs only 10% of the base input rate - a 90% discount.

The tension between these two numbers is where the prompt caching break-even lives. Get the ratio wrong and you're subsidizing wasted cache writes. Get it right and you're cutting LLM API cost optimization by 40–80%. (This is one of the highest-leverage ways to treat prompt caching as a cost optimization lever.)

The Break-Even Formula - Explained Simply

You need approximately 1.4 cache reads within a 5-minute window to recover the write cost. In practice, 2 reads is your minimum viable threshold.

For Claude Sonnet 4.6, the numbers are:

Base input: $3.00/M tokens
5-min cache write: $3.75/M tokens (1.25× premium)
Cache read: $0.30/M tokens (90% discount)
Net savings per cache hit: $3.00 − $0.30 = $2.70/M

The break-even formula for the 5-minute TTL:

cache_write_cost = N × savings_per_read
$3.75 = N × $2.70
N = 1.39 reads

Round up to 2 reads in practice. Anything below that is a guaranteed loss. (We break down Anthropic's pricing structure and break-even math on its own if you want the full derivation.)

For the 1-hour cache window, the write cost doubles to 2× base ($6.00/M):

$6.00 = N × $2.70
N = 2.22 reads

You need 3 reads within an hour. The math is essentially identical across Claude and OpenAI - both use a ~1.25× write premium and a 90% read discount, so the break-even lands at about 1.4 reads for the short TTL.

The key insight: the break-even threshold is low. Two reads and you're in the black. The real risk isn't the math - it's whether your architecture actually delivers those reads.

How Many Reads Do You Actually Need? (By Provider)

Each major provider has a different pricing structure, which shifts the break-even point and the cache hit rate you need to justify the write cost.

Provider	Cache Write Cost	Cache Read Cost	Break-Even Reads	TTL Window
Claude Sonnet 4.6	$3.75/M (1.25× base)	$0.30/M	~1.4 (5-min) / ~2.2 (1-hr)	5 min or 1 hr
OpenAI GPT-4o	$5.00/M (no surcharge)	$2.50/M (50% off)	~2 reads	~15 min
Google Gemini 3.1 Pro	$0.50/M (explicit)	$0.20/M (90% off)	~1.3 reads	Configurable

A few things stand out:

OpenAI's model is different: no write surcharge, but only a 50% read discount instead of 90%. The break-even is still ~2 reads, but the savings ceiling is lower. (We go deeper on comparing break-even points across providers in a dedicated head-to-head.)
Gemini's explicit caching has the lowest write cost and the lowest break-even (~1.3 reads), but adds hourly storage fees ($4.50/M/hr for Pro models) that change the economics at low request volumes.
Gemini's implicit caching (automatic, no setup) gives ~75% savings on repetitive tokens with zero write overhead - worth enabling as a baseline before reaching for explicit caching.

Minimum cacheable prefix: 1,024 tokens for Claude and OpenAI. 4,096 tokens for Gemini 3.1 Pro. Prompts shorter than these thresholds get zero benefit from the cache infrastructure.

Does Prefix Length Change the Break-Even?

No. The percentage savings stay constant at ~77% regardless of prefix length. What changes is the absolute dollar magnitude - and that changes how aggressively you should pursue caching.

For Claude Sonnet 4.6, here's what 10 cache reads looks like across prefix sizes:

Prefix Length	Write Cost	10 Read Costs	Total w/ Cache	Total w/o Cache	Savings
2,000 tokens	$0.0075	$0.006	$0.0135	$0.060	77%
50,000 tokens	$0.1875	$0.150	$0.3375	$1.500	77%
200,000 tokens	$0.7500	$0.600	$1.3500	$6.000	77%

The percentage is identical. But a 200K-token system prompt caching correctly saves $4.65 per 10 requests. At 10,000 daily requests with a 90% cache hit rate, that's $4,185/day in prompt caching cost savings.

At 10 requests a day with the same hit rate, it's $4.18/day. The ROI logic is the same - only the business case changes. (At high request volumes, this is where break-even analysis at enterprise scale starts to dominate the bill.)

The practical implication: large, stable prefixes are where the economic leverage lives. A 1,500-token system prompt above mostly unique fresh content produces modest savings no matter how perfect your hit rate.

What Hit Rates Look Like in the Real World

Production hit rates vary wildly - and the gap between theoretical and actual is where most implementations fail.

The ProjectDiscovery case study is the most instructive data point available. Their security audit automation tool started with a 7.4% cache hit rate. The cause: dynamic content (timestamps, request IDs, session tokens) embedded before the stable content in the system prompt. Every request hashed differently. Moving that dynamic content after the stable prefix - a single structural change - pushed hit rates to 84% and cut LLM costs by 59–70%. They served 9.8 billion tokens from cache.

Multi-step vs. single-step tasks tell a similar story. The same team measured 91.8% hit rates on multi-step agent tasks (where the same large context recurs across steps) versus 35.5% on single-step tasks (where each request is essentially unique). The architecture of the task, not the prompt wording, drove the difference.

The arxiv study (Lumer et al., 2026) tested 500 agent sessions across GPT-5.2, Claude Sonnet 4.5, GPT-4o, and Gemini 2.5 Pro using 10,000-token system prompts. Results:

41–80% cost reduction across all providers
13–31% improvement in time to first token (TTFT)
GPT-5.2: 79–81% cost reduction
Claude Sonnet 4.5: 78–79% cost reduction
GPT-4o: 46–48% cost reduction
Gemini 2.5 Pro: 28–41% cost reduction

The consistent finding: cache hit rate is an architectural property, not a prompt property. It depends on how you structure request flow, where you place volatile vs. stable content, and whether your traffic pattern produces repeated prefix matches within the cache TTL.

Two Real Cost Scenarios - Before vs After

The same feature produces wildly different savings depending on your architecture. Here are two concrete examples.

Scenario 01: Customer Support Bot

Setup: 10,000 tickets/day, 1,500-token system prompt, 500-token fresh context per ticket
Without caching: $105/day, $3,150/month
With caching (99.99% hit rate): $94.50/day, $2,835/month
Savings: 10%

Why so modest? The system prompt is only 25% of total input tokens. Even with a near-perfect cache hit rate, the uncached fresh context dominates the bill. Prompt caching reads help, but the ratio of cached to uncached tokens is unfavorable.

Scenario 02: Document Q&A

Setup: 1,000 queries/day, 50,000-token document, 200-token query per request
Without caching: $155/day, $4,653/month
With caching (90% hit rate): $34/day, $1,030/month
Savings: 78%

The document dominates input at 99.6% of total tokens. Every cache hit skips reprocessing 50,000 tokens. The math is compelling because the cached prefix is the whole point.

The lesson: architecture matters more than the feature. Before implementing prompt caching, calculate your ratio of cached tokens to total input tokens. If that ratio is below 30%, your savings ceiling is low regardless of hit rate.

When Should You NOT Use Prompt Caching?

Caching hurts when your hit rate stays below the break-even threshold - or when the write premium accumulates without enough reads to recover it.

Specific failure modes to watch for:

Dynamic content contamination: Any timestamp, user ID, or session token placed before your stable content fragments the cache prefix at that point. Everything after it becomes uncacheable. A single Today is {date} early in a system prompt breaks caching for the entire prompt that follows.
Single-step tasks: Each request is essentially unique. With no repeated prefix, you pay the cache write premium on every request and collect zero reads. The arxiv study found only 35.5% hit rates on single-step tasks - often below break-even.
Parallel request race conditions: Fire 100 concurrent requests simultaneously and the first one triggers a cache write that takes 2–4 seconds to materialize. The other 99 arrive during that write window, before the cache entry is available. All 100 pay write overhead; none can read from cache. Fix: issue a single warm-up request before firing the parallel batch.
Prefixes under 1,024 tokens: Below the minimum threshold, caching doesn't activate. Every cache write is wasted overhead with no possible read benefit.
Low-volume workloads with long TTLs: If requests arrive at intervals longer than your cache TTL, the write cost never amortizes. A system prompt cached once per hour, triggered by a slow trickle that never accumulates 3 hits before expiry, costs more than no caching.

How to Maximize Your Cache Hit Rate (3 Steps)

Cache hit rate is an architectural property. These three structural changes determine whether you capture the discount or subsidize wasted writes.

Step 01: Anchor Static Content at the Top

Place your system prompt, reference documents, few-shot examples, and tool definitions at the very beginning of the prompt - before anything dynamic. This is the cacheable prefix. Everything before the first volatile element gets cached; everything after does not.

The structure to aim for:

[System prompt - never changes]
[Reference document or few-shot examples - stable]
[Tool definitions - fixed set]
[Cache breakpoint]
[Dynamic content: user input, query, session context]

Step 02: Move Dynamic Content to the End

Timestamps, user IDs, session tokens, current date strings - all of these must appear after the stable prefix, outside the cached block. This is the most common fix that transforms broken implementations into high-performing ones. ProjectDiscovery went from 7.4% to 84% hit rate with this change alone.

Never embed dynamic values in the middle of a system prompt. The cache prefix breaks at the first volatile token.

Step 03: Use Multi-Turn Architecture

Each conversation turn that extends the same prefix gets a cache read on the stable portion. Multi-step agent tasks achieve 91.8% hit rates precisely because the same large context recurs across steps. Single-step tasks achieve 35.5% because each request is essentially a fresh start.

For multi-turn chat, the effective pattern is:

[Cached block 1: System prompt - never changes]
[Cached block 2: Conversation history - grows each turn]
[Uncached: Current user message]

After the second turn, the full conversation prefix is cached. Each new turn reads the system prompt from cache and writes an extended history entry. Hit rates for this pattern in customer support or assistant applications typically land in the 40–60% range.

Key Takeaways

The break-even is 1.39 reads (5-min TTL, Claude Sonnet 4.6). Use 2 reads as your practical minimum viable threshold.
The 1-hour TTL raises the bar: break-even jumps to 2.22 reads. You need 3 reads per hour to justify it.
Percentage savings are constant (~77%) regardless of prefix length. Absolute savings scale linearly - a 200K-token prefix saves 100× more per request than a 2K-token prefix.
Production hit rates range from 7% to 92% depending entirely on architecture. Dynamic content contamination is the #1 cause of low cache hit rates.
The document Q&A pattern delivers 78% cost savings. The customer support pattern delivers 10%. The difference is the ratio of cached tokens to total input.
Parallel request race conditions are real. 100 concurrent requests can all pay write overhead and collect zero reads. Warm up the cache before firing batch workloads.

FAQ

What is the break-even for prompt caching?

For Claude Sonnet 4.6 with a 5-minute TTL, the break-even is 1.39 cache reads per write. In practice, you need 2 reads to start saving money. For the 1-hour TTL (2× write premium), the break-even rises to 2.22 reads - treat 3 reads as your minimum. OpenAI GPT-4o breaks even at ~2 reads. Gemini explicit caching breaks even at ~1.3 reads.

Does prompt caching work for all use cases?

No. Caching delivers strong savings for document Q&A, multi-turn agents, and any workload with a large stable prefix and moderate-to-high request density. It underperforms - or actively costs more - for single-step tasks, low-volume workloads where the write premium never amortizes, and any prompt with dynamic content embedded before the stable portion.

Which LLM provider has the best prompt caching pricing?

It depends on your workload. Claude offers the deepest read discount (90%) with explicit control over cache breakpoints. OpenAI's automatic caching requires zero implementation effort but caps savings at 50% per cache hit. Gemini's implicit caching has no write overhead at all - useful as a baseline - while explicit caching offers 90% read discounts with added storage fees. For high-volume agentic tasks, the arxiv study found GPT-5.2 and Claude Sonnet 4.5 delivered the highest absolute savings (78–81%).

How do I check my cache hit rate?

Every major provider exposes cache metrics in the API response. For Anthropic, check cache_creation_input_tokens (writes) and cache_read_input_tokens (reads) in the usage object. For OpenAI, look at cached_tokens in the prompt_tokens_details field. For Gemini, check cachedContentTokenCount. Log these values for at least 100 requests before drawing conclusions about your actual hit rate.

Can prompt caching hurt performance?

Yes, in two ways. First, if your cache hit rate falls below break-even, you pay more than without caching. Second, the arxiv study found that naive full-context caching can increase latency for some providers - specifically when dynamic tool results trigger cache writes for content that won't be reused. System-prompt-only caching consistently outperformed full-context caching for TTFT improvement. Cache strategically, not blindly.

What's the minimum token count to enable caching?

1,024 tokens for Claude (all models) and OpenAI GPT-4o. 4,096 tokens for Gemini 3.1 Pro. Claude Haiku 3.5 requires 2,048 tokens. Prompts shorter than these thresholds get no benefit - every cache write is pure overhead with no possible read discount.

Useful Sources

Anthropic prompt caching documentation - official pricing, TTL options, and implementation guide: platform.claude.com/docs/en/build-with-claude/prompt-caching
OpenAI prompt caching guide - automatic caching mechanics, minimum thresholds, and pricing: platform.openai.com/docs/guides/prompt-caching
ProjectDiscovery: How We Cut LLM Cost with Prompt Caching - the 7.4% → 84% hit rate case study with real cost data: projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching
"Don't Break the Cache" - arxiv.org/html/2601.06007v2 - the PwC study evaluating prompt caching across 500 agent sessions on GPT-5.2, Claude Sonnet 4.5, GPT-4o, and Gemini 2.5 Pro: arxiv.org/html/2601.06007v2
Google Gemini caching documentation - explicit vs. implicit caching, storage fees, and model-specific thresholds: ai.google.dev/gemini-api/docs/caching

What's your cache hit rate in production? Drop it in the comments - we want to see the real numbers.

Keep reading

llmcost optimizationanthropic

Anthropic Prompt Caching: How It Works + When to Use It

Anthropic prompt caching lets you reuse a stable prefix of your Claude prompts instead of reprocessing them from scratch - slashing costs by up to 90% and cutting latency by more than 2x.

MKMohammed Kafeel

14 min read

llmroutingcost optimization

RouteLLM vs vLLM Semantic Router: Which One Actually Cuts Costs?

RouteLLM and vLLM Semantic Router both reduce LLM costs - but they solve fundamentally different problems. Here's the benchmark data, the architecture breakdown, and the exact decision framework to pick the right one.

SYShubham Yadav

15 min read

llmself-hostingcost optimization

Run LLMs Locally vs OpenAI API: Real Cost Comparison

At 50M tokens/day, OpenAI costs $126,000/year. We model the full 36-month TCO across three usage tiers - hardware, electricity, ops labor - so you know exactly when self-hosting wins.

SYShubham Yadav

17 min read

Back to all posts