All posts

OpenAI vs Anthropic Prompt Caching: Key Differences

A side-by-side comparison of how OpenAI and Anthropic implement prompt caching — automatic vs manual, TTLs, pricing, and which fits which workload.

MK

Mohammed Kafeel

Machine Learning Researcher

June 8, 202612 min read

Quick answer: Both OpenAI and Anthropic cache a repeated prompt prefix to cut input costs, but they hand you opposite levels of control. OpenAI caching is fully automatic — it turns on for any prompt over 1,024 tokens with no code changes, charges no write premium, and discounts cached input tokens by up to 90%. You can't choose what gets cached; you only influence routing. Anthropic caching is manual — you place explicit cache_control breakpoints (up to 4) to mark exactly what to cache, choose a 5-minute or 1-hour TTL, and pay a write premium (1.25× or 2×) in exchange for ~0.1× reads. The trade-off in one line: OpenAI optimizes for zero effort and never costs more; Anthropic optimizes for precise control at the cost of an up-front write premium.


The core philosophical difference

This comparison comes down to one design choice: who decides what gets cached?

  • OpenAI → the platform decides. Caching is an invisible optimization. You write a normal request; if a long prefix repeats, OpenAI caches it and discounts the reads automatically. There are no breakpoints, no TTL knob, and no write premium — but also no fine-grained control.
  • Anthropic → you decide. Caching is an explicit instruction. You mark the exact content blocks to cache with cache_control, pick the TTL, and accept a write premium. More work and more cost up front, but deterministic control over precisely what is cached and for how long.

Everything else — pricing shape, TTL behavior, usage reporting — flows from that one difference.


How OpenAI automatic caching works

OpenAI prompt caching works automatically on all API requests with no code changes, enabled for GPT-4o and all newer models.

  • Eligibility: prompts of 1,024 tokens or more. Shorter prompts are never cached (cached_tokens will be 0).
  • What's cached: the exact prefix of the prompt — the messages array (system/user/assistant), images, tool definitions, and structured-output schemas — matched by an exact prefix.
  • Routing: requests are routed by a hash of the initial prefix (roughly the first 256 tokens). You can pass the optional prompt_cache_key parameter to steer similar requests to the same cache and raise hit rates.
  • Pricing: cached input tokens are discounted by up to 90%, with no write premium — the first request that populates the cache is billed normally, and there's no extra fee to "store" it.
  • TTL: the cache persists for 5–10 minutes of inactivity, up to a maximum of about 1 hour. Newer models (the GPT-5 family) support extended retention up to 24 hours.
  • Verify with: usage.prompt_tokens_details.cached_tokens.

The mental model: you do nothing, and it never costs you more than not caching. The downside is you can't force, scope, or extend a cache beyond what the platform's heuristics decide.


How Anthropic manual prefix caching works

Anthropic prompt caching is explicit: you place cache_control breakpoints on the content blocks you want cached.

  • Eligibility: a minimum cacheable prefix length that varies by model — 1,024 tokens (Sonnet 3.x/4/4.5), 2,048 (Sonnet 4.6, Haiku 3/3.5), or 4,096 (Opus 4.x, Haiku 4.5).
  • What's cached: any content block you mark — system text, tool definitions, and message content (text, images, documents, tool_use, tool_result). The request renders in a fixed order — toolssystemmessages — and the cache is a prefix up to each breakpoint.
  • Control: up to 4 cache breakpoints per request, so you can cache, say, a tool block and a long document independently.
  • Pricing: cache reads cost ~0.1× the base input price (the ~90% saving), but writes cost a premium1.25× for the 5-minute TTL or for the 1-hour TTL.
  • TTL: 5 minutes by default ({"type": "ephemeral"}), or 1 hour with {"type": "ephemeral", "ttl": "1h"}.
  • Verify with: usage.cache_creation_input_tokens (write), usage.cache_read_input_tokens (read), and usage.input_tokens (uncached).

The mental model: you pay a small premium up front to write the cache, then collect deep discounts on every read — and you decide exactly what and how long.


Head-to-head comparison

Dimension OpenAI (automatic) Anthropic (manual)
Who controls caching The platform (automatic) You (explicit cache_control breakpoints)
Code changes needed None Yes — add breakpoints
Granularity control None (only prompt_cache_key for routing) Up to 4 breakpoints; choose exact blocks
Minimum prefix length 1,024 tokens (all eligible models) 1,024 / 2,048 / 4,096 tokens (varies by model)
Write premium None — first request billed normally 1.25× (5-min) or 2× (1-hour)
Cached-read discount Up to 90% ~90% (reads ≈ 0.1×)
Can a single use cost more? No — caching is free Yes — single-use pays the write premium
TTL 5–10 min inactivity, ~1 hr max; up to 24 hr (GPT-5) 5 min default, or 1 hour (opt-in)
TTL control None (platform-managed) You choose 5-min or 1-hour
Match type Exact prefix (routed by first ~256 tokens) Exact prefix up to each breakpoint
Usage field prompt_tokens_details.cached_tokens cache_read_input_tokens + cache_creation_input_tokens

The pricing difference that matters most

The single most consequential difference is the write premium.

  • OpenAI charges nothing extra to populate the cache. Because there's no write premium, caching is strictly non-negative: it either saves you money or does nothing — it can never cost more than not caching. A prefix used only once simply isn't discounted; you weren't charged extra to try.
  • Anthropic charges a write premium (1.25× or 2×). This means a prefix used only once before it expires costs you more than not caching at all. Anthropic caching has a genuine break-even point: ~2 requests for the 5-minute TTL, ~3 for the 1-hour TTL. (For the full break-even math, see the dedicated break-even analysis post.)

Practical consequence: with OpenAI you never have to ask "is this worth caching?" — it's always free to try. With Anthropic you do, because the write premium is real. The flip side: Anthropic's explicit model lets you guarantee a cache write and pin a 1-hour TTL, which OpenAI's heuristics won't promise.


TTL and persistence

OpenAI Anthropic
Default lifetime 5–10 min of inactivity 5 minutes
Maximum lifetime ~1 hour (24 hr on GPT-5 family) 1 hour (explicit opt-in)
Who chooses Platform You (ttl field)
Refresh behavior Activity resets inactivity timer Reuse refreshes the entry

OpenAI's TTL is inactivity-based and platform-managed — you can't pin it, though the GPT-5 family's extended 24-hour retention is a meaningful edge for long-lived contexts. Anthropic's TTL is a deliberate choice: pick 1-hour (and pay 2× to write) when your traffic is bursty with gaps longer than five minutes, so the entry survives between bursts.


What both providers share

Despite the control difference, the underlying mechanics are the same, and so is the optimization playbook:

  1. It's a prefix match. Both cache an exact leading portion of the prompt. Any byte change in the prefix invalidates the cache from that point on.
  2. Stable content first, volatile content last. Put never-changing instructions, tools, and documents at the front; put the user's unique input at the end. This rule is identical across both providers (and vLLM).
  3. Silent invalidators break both. A timestamp, UUID, per-user ID, or non-deterministic JSON ordering early in the prompt destroys cache hits on either platform.
  4. Verify, don't assume. Both expose a usage field for cached tokens — check it, because a zero means your prefix isn't actually matching.

Which should you optimize for?

You usually don't choose between them — you're on one provider or the other. But the way you should think about caching differs:

If you're on… Do this
OpenAI Nothing required — just structure prompts stable-first. Add prompt_cache_key to raise hit rates for similar requests. Never worry about break-even; caching is free.
Anthropic Add cache_control breakpoints on your large stable blocks. Choose the TTL by traffic pattern (5-min steady / 1-hour bursty). Make sure the prefix is reused ≥2–3× within the TTL so the write premium pays off.
Multi-provider Apply the shared playbook (stable-first, no early volatility) so the same prompt structure caches well on both, and add provider-specific markers where needed.

Rule of thumb: On OpenAI, caching is a free side effect of good prompt structure. On Anthropic, caching is a deliberate cost decision you opt into and tune.


Frequently asked questions

What is the difference between OpenAI and Anthropic prompt caching? OpenAI caching is automatic — it applies to any prompt over 1,024 tokens with no code changes, charges no write premium, and discounts cached input by up to 90%. Anthropic caching is manual — you place explicit cache_control breakpoints (up to 4) on the blocks you want cached, choose a 5-minute or 1-hour TTL, and pay a write premium (1.25× or 2×) in exchange for ~0.1× reads. OpenAI trades control for zero effort; Anthropic trades effort for precise control.

Does OpenAI charge extra to write to the cache? No. OpenAI has no cache-write premium — the first request that populates the cache is billed at the normal rate, and subsequent cache hits are discounted by up to 90%. This means OpenAI caching can never cost more than not caching. Anthropic, by contrast, charges 1.25× (5-minute) or 2× (1-hour) to write, so a single-use prefix can cost more than not caching.

How much does each provider discount cached tokens? Both advertise up to roughly 90% off cached input tokens — Anthropic reads cost about 0.1× the base input price, and OpenAI states up to 90% reduction (the exact discount can vary by model). The key difference is the write side: Anthropic adds a one-time write premium, while OpenAI does not.

Can I control what gets cached on OpenAI? Not directly. OpenAI caches the prompt prefix automatically and you cannot place breakpoints or pin a TTL. You can influence cache routing with the optional prompt_cache_key parameter to steer similar requests to the same cache and improve hit rates, and you control hits indirectly by structuring prompts stable-first. Anthropic gives explicit control through up to 4 cache_control breakpoints and a TTL choice.

What are the cache lifetimes (TTL)? OpenAI's cache persists for 5–10 minutes of inactivity, up to about an hour, with extended retention up to 24 hours on the GPT-5 family — all platform-managed. Anthropic defaults to a 5-minute TTL and offers an opt-in 1-hour TTL (at a higher 2× write premium) that you set explicitly with the ttl field. Use the longer TTL for bursty traffic with gaps over five minutes.

How do I verify a cache hit on each platform? On OpenAI, check usage.prompt_tokens_details.cached_tokens. On Anthropic, check usage.cache_read_input_tokens (served from cache), usage.cache_creation_input_tokens (written this request), and usage.input_tokens (uncached). On both, a zero cached count across repeated requests means a silent invalidator — often a timestamp, ID, or reordered content — is breaking your prefix.


Key takeaways

  • OpenAI = automatic, Anthropic = manual. That single difference drives everything else.
  • OpenAI charges no write premium, so caching can never cost more than not caching — but you can't scope or pin it (only influence routing via prompt_cache_key).
  • Anthropic charges a write premium (1.25× / 2×) for explicit control: up to 4 breakpoints and a 5-minute or 1-hour TTL you choose — but single-use prefixes can cost more (break-even ~2–3 requests).
  • Both discount cached reads by up to ~90% and both are exact prefix matches — so the stable-first, volatile-last rule applies identically.
  • Minimum prefix: 1,024 tokens on OpenAI (all eligible models); 1,024–4,096 on Anthropic depending on the model.
  • Verify hits with cached_tokens (OpenAI) or cache_read_input_tokens (Anthropic) — a zero means a silent invalidator broke the prefix.

References

  1. OpenAI. Prompt caching — API documentation (automatic caching, 1,024-token minimum, up to 90% discount, TTL, prompt_cache_key, cached_tokens). https://platform.openai.com/docs/guides/prompt-caching
  2. Anthropic. Prompt caching — Claude API documentation (cache_control breakpoints, read 0.1×, write 1.25×/2×, 5-min and 1-hour TTL, render order, usage fields). https://docs.claude.com/en/docs/build-with-claude/prompt-caching
  3. Anthropic. Pricing — base input/output token prices and cache multipliers by model. https://www.anthropic.com/pricing
  4. OpenAI. Pricing — cached input token rates by model. https://openai.com/api/pricing/