
Prompt Caching Break-Even: How Many Reads to Save Money?

The exact formula for calculating your prompt caching break-even point — factoring in write premium, read discount, TTL, and request volume — so you know whether caching is worth it before you turn it on.


Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 12 min read

Quick answer: Prompt caching costs more on the first request (you pay a write premium) and far less on every request after (cheap reads). You break even — start saving money — once enough requests reuse the cached prefix. The break-even point is governed by one clean formula:

Break-even requests N* = (w − r) / (1 − r), where w is the cache-write multiplier and r is the cache-read multiplier (both relative to the base input price).

For Anthropic's pricing (read r = 0.1×), the 5-minute TTL (w = 1.25×) breaks even at N* ≈ 1.28 → you profit from the 2nd request, and the 1-hour TTL (w = 2×) breaks even at N* ≈ 2.11 → you profit from the 3rd request. The surprising part: this break-even count does not depend on how big your prompt is. Prompt size determines how much you save, not whether you cross break-even.


The setup: what you pay, and when

Prompt caching changes the price of the shared prefix tokens (system prompt, tools, documents, history) across three states, expressed as multiples of the normal "base" input price:

| Token state | Multiplier | When you pay it |
|---|---|---|
| Cache write | w | The first request, to store the prefix |
| Cache read | r | Every later request that reuses the prefix |
| Uncached (base) | 1× | What you'd pay with no caching at all |

For Anthropic Claude: r = 0.1× (reads are ~90% cheaper), w = 1.25× for the 5-minute TTL, and w = 2× for the 1-hour TTL. The mechanics behind these numbers are covered in the companion post on how prompt caching works; here we focus purely on the economics.


Deriving the break-even point

Consider N requests that all reuse the same cached prefix of T tokens, within one TTL window.

Without caching, every request pays full price for the prefix:

Cost_uncached = N · T · 1  =  N·T

With caching, the first request writes (at w), and the remaining N − 1 read (at r):

Cost_cached = T·w  +  (N − 1)·T·r
            = T · [ w + (N − 1)·r ]

Caching wins when Cost_cached < Cost_uncached:

T·[ w + (N − 1)r ]  <  N·T
       w + (N − 1)r  <  N            ← T cancels out
       w − r         <  N(1 − r)
       N             >  (w − r) / (1 − r)
┌─────────────────────────────────────┐
│   Break-even:  N* = (w − r)/(1 − r)   │
└─────────────────────────────────────┘

The prefix size T cancels. The break-even threshold is a function of the pricing multipliers only, not your prompt length. Traffic volume decides whether you actually reach N* within a TTL window, but it does not move the threshold itself. This is the single most useful fact in caching cost analysis: a 1,000-token prefix and a 100,000-token prefix break even after the same number of requests.
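A quick way to convince yourself that T cancels is to brute-force the comparison for several prefix sizes. The sketch below is a minimal check, not part of any SDK; the helper name `first_profitable_request` and the `max_n` cutoff are illustrative assumptions, and the multipliers are the ones quoted in this post.

```python
def first_profitable_request(T, w, r, max_n=1000):
    """Smallest N at which N cached requests cost less than N uncached ones."""
    for n in range(1, max_n + 1):
        cached = T * w + (n - 1) * T * r   # one write, then n - 1 reads
        uncached = n * T                   # every request pays full price
        if cached < uncached:
            return n
    return None  # never profitable within max_n (for example when r >= 1)

# Multipliers from this post: r = 0.1, w = 1.25 (5-minute) or 2.0 (1-hour).
for T in (1_000, 10_000, 100_000):
    print(T, first_profitable_request(T, 1.25, 0.1), first_profitable_request(T, 2.0, 0.1))
```

Every prefix size reports the same pair: request 2 for the 5-minute multiplier and request 3 for the 1-hour multiplier.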


The numbers for Anthropic pricing

Plugging in r = 0.1:

| TTL | w | N* = (w − r)/(1 − r) | You profit from… | Why |
|---|---|---|---|---|
| 5-minute | 1.25× | (1.25 − 0.1)/0.9 = 1.28 | the 2nd request | 1.25 + 0.1 = 1.35× < 2× for two uncached |
| 1-hour | 2× | (2 − 0.1)/0.9 = 2.11 | the 3rd request | 2 + 0.1 = 2.10× > 2× (narrowly loses at two); 2 + 0.2 = 2.2× < 3× at three |

So the rule of thumb everyone quotes — "5-minute pays off at 2 requests, 1-hour at 3" — falls straight out of the formula. The 1-hour TTL costs more to write, so you need one extra reuse to amortize it, but it survives much longer between requests.

A single cached call (N = 1) always loses

If a prefix is used exactly once before it expires, caching costs you money: you pay the write premium (1.25× or 2×) and never collect a discounted read. Caching only pays off when reuse actually happens within the TTL. This is why sprinkling cache_control on prompts that don't repeat is a net negative.
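One way to avoid paying for writes that never get read is to gate the cache marker on expected reuse. The sketch below only builds the `system` content blocks for a request; it does not call any API. The block shape (`type`, `text`, `cache_control` with `{"type": "ephemeral"}`) follows Anthropic's prompt caching documentation as I understand it, so check it against the current reference. The function name `system_blocks` and the `expected_requests` estimate are hypothetical, and the default `w = 1.25` assumes the 5-minute TTL.

```python
def system_blocks(prefix_text, expected_requests, w=1.25, r=0.1):
    """Build system content blocks, marking the prefix cacheable only if reuse should pay off."""
    block = {"type": "text", "text": prefix_text}
    break_even = (w - r) / (1 - r)  # N* from the derivation above
    if expected_requests > break_even:
        # Assumed block shape; with no ttl field this should use the default TTL.
        block["cache_control"] = {"type": "ephemeral"}
    return [block]

one_off = system_blocks("You are a helpful assistant.", expected_requests=1)
reused = system_blocks("You are a helpful assistant.", expected_requests=20)
print("cache_control" in one_off[0], "cache_control" in reused[0])
```

The estimate of `expected_requests` is the hard part in practice; the gate itself is just the break-even inequality.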


How much do you actually save? (Blended savings)

Break-even tells you whether you save. The magnitude depends on two more things: how many requests reuse the prefix (N), and how large the shared prefix is relative to the unique, per-request tokens.

Real prompts have a shared prefix of T tokens (cached) plus a unique portion of U tokens per request (the user's question — always billed at 1×, never cached). The total saving fraction on input tokens is:

Savings(N) = 1 −  [ T·w + (N−1)·T·r + N·U ]  /  [ N·(T + U) ]

As N → ∞, the write premium amortizes to nothing and this converges to:

Max savings  =  1 − (T·r + U) / (T + U)

Two levers fall out of this:

  1. More reuse (higher N) pushes you toward the maximum — the write premium gets spread thinner with every read.
  2. A bigger prefix-to-unique ratio raises the ceiling. If U = 0 (pure shared prefix), max savings = 1 − r = 90%. As the unique portion U grows, the ceiling drops, because those tokens are never discounted.
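The two levers can be separated algebraically: Savings(N) equals the ceiling minus a write penalty of T·(w − r) / (N·(T + U)), which shrinks as 1/N. The sketch below uses that split to estimate how many requests are needed to reach a target savings level. The function names are illustrative, and the closed form is my own rearrangement of the formulas above rather than something taken from provider documentation.

```python
import math

def max_savings(T, U, r):
    """Ceiling on input-token savings as N grows without bound."""
    return 1 - (T * r + U) / (T + U)

def write_penalty(N, T, U, w, r):
    """Savings lost to the one-time write premium after N requests."""
    return T * (w - r) / (N * (T + U))

def requests_for_target(target, T, U, w, r):
    """Smallest N whose savings reach `target`, or None if the ceiling is too low."""
    headroom = max_savings(T, U, r) - target
    if headroom <= 0:
        return None  # the target sits at or above the ceiling
    return math.ceil(T * (w - r) / (headroom * (T + U)))

# 4,000-token prefix, 50 unique tokens, 5-minute multipliers.
print(requests_for_target(0.80, T=4000, U=50, w=1.25, r=0.1))
print(requests_for_target(0.95, T=4000, U=50, w=1.25, r=0.1))
```

For this prompt shape, 80% savings needs 13 requests in the window, and 95% is unreachable because the ceiling is about 88.9%.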

Worked example

A 4,000-token shared prefix (T) with a 50-token unique question (U), at Anthropic 5-minute pricing (w = 1.25, r = 0.1):

| Requests (N) | Savings on input tokens | Notes |
|---|---|---|
| 1 | −25% (you lose) | Paid the write, no reads yet |
| 2 | +32% | Break-even already crossed |
| 5 | +66% | Write premium amortizing |
| 10 | +78% | |
| 50 | +87% | Approaching the ceiling |
| 100 | +88% | |
| ∞ | +88.9% (ceiling) | = 1 − (4000·0.1 + 50)/(4000 + 50) |

The headline "90%" is the asymptotic ceiling for a pure prefix. In practice you land slightly below it (here ~89%) because of the small unique portion, and you only approach it after dozens of reuses.

Why the prefix ratio dominates

| Prefix T | Unique U | Prefix ratio | Max savings (N→∞) |
|---|---|---|---|
| 4,000 | 50 | 98.8% | ~88.9% |
| 2,000 | 500 | 80% | ~72% |
| 1,000 | 1,000 | 50% | ~45% |
| 500 | 2,000 | 20% | ~18% |

If your unique per-request content is large relative to the shared prefix, caching can't deliver 90% no matter how many reads you get — the unique tokens are always full price. Caching rewards big, stable prefixes with small, variable tails.
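The table collapses to one line of arithmetic: since 1 − (T·r + U)/(T + U) simplifies to T·(1 − r)/(T + U), the ceiling is just the prefix ratio multiplied by (1 − r). A small sketch, with `ceiling_from_ratio` as an illustrative name and r = 0.1 assumed as the default:

```python
def ceiling_from_ratio(prefix_ratio, r=0.1):
    """Max savings as N grows, given prefix_ratio = T / (T + U)."""
    return prefix_ratio * (1 - r)

for T, U in [(4000, 50), (2000, 500), (1000, 1000), (500, 2000)]:
    ratio = T / (T + U)
    print(f"T={T:>5} U={U:>5} ratio={ratio:.1%} ceiling={ceiling_from_ratio(ratio):.1%}")
```

This makes the rule of thumb easy to apply: estimate what fraction of each prompt is shared, multiply by 0.9, and that is the best case you can hope for.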


The TTL trap: break-even resets per window

The formula assumes all N requests land within one TTL window. If they don't, the cache expires between reuses and each "session" pays the write premium again — effectively resetting break-even.

  • Steady traffic (requests arrive faster than the TTL): one write, many reads, savings accrue as derived.
  • Sparse/bursty traffic (gaps longer than the TTL): you re-write each burst, so break-even applies per burst. A single-request burst always loses, and a two-request burst pays off on the 5-minute TTL (1.35× vs 2×) but not on the 1-hour TTL (2.1× vs 2×).

This is exactly why the 1-hour TTL exists: it raises the write premium (w = 2×, so break-even moves to 3 requests) but keeps the entry alive across long gaps, so bursty workloads still clear break-even within each window. Choose the TTL by matching it to your inter-request gap, not just the per-window request count.

5-min TTL:  ✓ continuous traffic      ✗ gaps > 5 min (re-writes)
1-hour TTL: ✓ bursty traffic w/ gaps   ✗ higher write cost (need 3+ reuses)
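To see the reset effect on a concrete traffic pattern, the sketch below replays a list of arrival times and charges a write whenever the entry has expired. It is a simplified model under stated assumptions: costs are in units of one uncached prefix, every hit is assumed to refresh the TTL, and the arrival times and the name `window_cost` are made up for illustration.

```python
def window_cost(arrivals_min, ttl_min, w, r):
    """Prefix cost, in units of one uncached prefix, for requests at the given minutes."""
    cost = 0.0
    last = None  # time of the most recent request that touched the cache
    for t in sorted(arrivals_min):
        if last is not None and t - last < ttl_min:
            cost += r   # entry still alive: discounted read
        else:
            cost += w   # first request or expired entry: pay the write again
        last = t        # assumption: every hit refreshes the TTL
    return cost

bursty = [0, 1, 2, 20, 21, 22, 50, 51]   # hypothetical arrival times, in minutes
print(window_cost(bursty, 5, 1.25, 0.1))   # 3 writes + 5 reads
print(window_cost(bursty, 60, 2.0, 0.1))   # 1 write + 7 reads
print(float(len(bursty)))                   # uncached baseline: 8 full-price prefixes
```

On this pattern both TTLs beat the uncached baseline, but the 1-hour TTL is cheaper because it pays for one write instead of three. With very sparse arrivals such as `[0, 30, 100, 200]`, both options cost more than not caching at all.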

A break-even calculator (code)

def breakeven_requests(w, r):
    """Fractional break-even N*; caching saves money once N > N*."""
    if r >= 1:
        raise ValueError("read multiplier r must be below 1 for caching to pay off")
    return (w - r) / (1 - r)

def savings_fraction(N, T, U, w, r):
    """Fraction of input-token cost saved over N requests in one TTL window."""
    cached = T * w + (N - 1) * T * r + N * U   # one write, N - 1 reads, unique tokens at 1x
    uncached = N * (T + U)                     # everything billed at the base price
    return 1 - cached / uncached

# Anthropic pricing multipliers used in this post
R = 0.1
for label, W in [("5-min", 1.25), ("1-hour", 2.0)]:
    print(label, "break-even:", round(breakeven_requests(W, R), 2), "requests")

print(savings_fraction(N=100, T=4000, U=50, w=1.25, r=0.1))  # ≈ 0.8775

Drop in your own w, r, prefix size T, unique size U, and expected reuse N to get your real numbers before committing to a TTL.


A decision rule

| Your situation | Verdict |
|---|---|
| Prefix reused ≥ 2× within 5 min | Cache with 5-minute TTL |
| Reused ≥ 3× but with gaps > 5 min | Cache with 1-hour TTL |
| Prefix used once then discarded | Don't cache: you'd only pay the write premium |
| Large shared prefix, tiny unique question | Cache: you'll approach the ~90% ceiling |
| Small prefix, large unique question | Cache only if it reuses a lot; ceiling is low |
| Sparse, unpredictable traffic | 1-hour TTL, or skip caching if reuse is rare |

One-line heuristic: cache when you expect the prefix to be reused at least 2–3 times within the TTL and the shared portion is a large fraction of each prompt. Otherwise the write premium isn't worth it.
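The traffic-related rows of the decision rule can be turned into a small helper. This is a rough sketch rather than a definitive policy: `choose_ttl`, `requests_per_window`, and `max_gap_min` are hypothetical names, the TTL lifetimes and multipliers are the ones quoted in this post, and the prefix-ratio rows are deliberately left out.

```python
import math

TTL_OPTIONS = [  # (label, lifetime in minutes, write multiplier) as quoted in this post
    ("5m", 5, 1.25),
    ("1h", 60, 2.0),
]

def choose_ttl(requests_per_window, max_gap_min, r=0.1):
    """Return the lowest-write-cost TTL that survives the gaps and clears break-even, else None."""
    for label, ttl_min, w in TTL_OPTIONS:
        # First request count strictly above N* = (w - r) / (1 - r).
        first_profitable = math.floor((w - r) / (1 - r)) + 1
        if max_gap_min < ttl_min and requests_per_window >= first_profitable:
            return label
    return None  # caching would not pay for itself under these assumptions

print(choose_ttl(requests_per_window=2, max_gap_min=1))    # steady traffic
print(choose_ttl(requests_per_window=3, max_gap_min=20))   # bursty, gaps over 5 min
print(choose_ttl(requests_per_window=1, max_gap_min=1))    # single use
```

The options are checked in order of write cost, so a workload that fits the 5-minute TTL never gets assigned the more expensive 1-hour write.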


Frequently asked questions

How many requests before prompt caching saves money?

Break-even is N* = (w − r) / (1 − r), where w is the write multiplier and r is the read multiplier. With Anthropic's read price of 0.1×, the 5-minute TTL (write 1.25×) breaks even at about 1.28 requests, so you profit from the second request. The 1-hour TTL (write 2×) breaks even at about 2.11 requests, so you profit from the third. A prefix used only once costs more than not caching.

Does prefix size affect the break-even point?

No. The prefix size cancels out of the break-even formula, so a small prefix and a huge prefix break even after the same number of requests. Prefix size affects how much you save (bigger prefixes save more in absolute terms and raise the savings ceiling), not whether you cross break-even.

Why don't I get the full 90% savings?

The 90% figure is the asymptotic ceiling for a pure shared prefix with no unique tokens, reached only after many reuses. In reality each request has a unique portion (the user's question) that's always billed at full price, which lowers the ceiling, and the one-time write premium drags down savings until it's amortized over enough reads. With a large prefix and small unique tail you'll approach ~88–89%; with a large unique portion the ceiling can be far lower.

What happens if my cache expires between requests?

Break-even resets. The formula assumes all reusing requests fall within one TTL window. If traffic is sparse and the cache expires between requests, each burst pays the write premium again: a single-request burst always loses, and a two-request burst pays off on the 5-minute TTL but not on the 1-hour TTL. Use the 1-hour TTL when gaps exceed five minutes and each window still sees at least three requests.

Should I use the 5-minute or 1-hour TTL for cost?

Match the TTL to your inter-request gap. The 5-minute TTL has a lower write premium (it pays off from the 2nd request) and wins for continuous traffic. The 1-hour TTL costs more to write (it pays off from the 3rd request) but keeps the cache alive across long gaps, so it wins for bursty workloads where the 5-minute entry would expire and force repeated re-writes.

When is prompt caching not worth it?

When the prefix is used only once before expiring (you pay the write premium for nothing), when traffic is so sparse the cache always expires before reuse, or when the shared prefix is tiny relative to the unique per-request content (the savings ceiling is too low to matter). Caching rewards large, stable prefixes reused several times within the TTL.


Key takeaways

  • Break-even: N* = (w − r)/(1 − r). With Anthropic's r = 0.1, the 5-minute TTL pays off at the 2nd request, the 1-hour TTL at the 3rd.
  • Prefix size cancels — break-even depends only on the write/read multipliers, not your prompt length. Size affects how much you save, not whether.
  • A single-use cached prefix always loses — you pay the write premium and collect no discounted reads.
  • Max savings = 1 − (T·r + U)/(T + U) — the 90% ceiling holds only for a pure prefix (U → 0); a large unique portion lowers it.
  • The write premium amortizes with reuse — savings climb from negative (N=1) toward the ceiling as N grows.
  • Break-even resets per TTL window — match the TTL to your inter-request gap; use 1-hour for bursty traffic, 5-minute for steady streams.

References

  1. Anthropic. Prompt caching — Claude API documentation (pricing multipliers, TTL options). https://docs.claude.com/en/docs/build-with-claude/prompt-caching
  2. Anthropic. Pricing — base input/output token prices by model. https://www.anthropic.com/pricing
  3. OpenAI. Prompt caching — API documentation (automatic caching, discount tiers). https://platform.openai.com/docs/guides/prompt-caching