Both Anthropic and OpenAI offer prompt caching in 2026, and the pitch is identical: reuse the processed form of a static prefix, pay less on repeats, answer faster. Look closer and the designs diverge. Anthropic asks you to mark cache boundaries explicitly; OpenAI detects repeated prefixes automatically. That difference is small on paper and substantial in practice — it changes how you structure prompts and what cache hygiene even means. This post sits under the context engineering pillar and is the side-by-side companion to the prompt caching guide.
The Two Approaches at a Glance
The headline difference is where the control lives.
On Anthropic, caching is opt-in per request with explicit breakpoints. You attach a cache_control marker to a content block, and the provider caches everything up to and including that block. You pick where the line goes, and you can draw more than one.
On OpenAI, caching is on by default and automatic. Nothing marks the cached region. The provider checks whether the incoming prefix matches a recent one you sent (above a minimum size) and reuses the processed representation if so. Your control point is prompt structure: put stable content first, and the cache catches it.
Both land at the same principle — stable-first, variable-last — but they reward different habits. Anthropic rewards deliberate layering; OpenAI rewards prompt discipline. The prompt caching glossary entry has the short definition; this post is about what changes between providers.
Anthropic's Explicit Cache Breakpoints
Anthropic's cache_control marker is a content-block annotation. You tag a block — typically the last block of a section you want cached — and the provider treats the request as "everything up through this block is cacheable; process the rest fresh." On a later call with the same prefix, the provider loads the cached representation and skips reprocessing.
The design lets you layer cache regions. A production prompt often has tiers of stability: identity and rules (stable per deploy), tool definitions (stable per deploy), reference docs (stable per week), retrieved context (per query), user message (per turn). Place one breakpoint after identity and another after tool definitions, and each tier gets its own reusable cache. When reference docs update but tools do not, the first tier still fires.
The cost is thinking. You decide where breakpoints go and maintain them; moving a block above or below the marker changes what caches. The upside is precision: several cached layers can run in one request. The breakpoint matters at the block level — rewriting a cached block's contents, even whitespace, misses that block and everything after it. Treat cached blocks as immutable between deploys.
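The tier layout above can be sketched as a request body. This is an illustrative skeleton, not production code: the model id is a placeholder, and the block contents are stand-ins, but the `cache_control` marker uses Anthropic's documented `{"type": "ephemeral"}` shape.

```python
def build_layered_request(identity: str, reference_docs: str,
                          user_msg: str) -> dict:
    """Two cache tiers: identity (stable per deploy) and reference docs
    (stable per week). Each marked block closes its own cacheable region,
    so a weekly docs update leaves the identity tier's cache intact."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            # Tier 1 boundary: everything up to and including this block caches.
            {"type": "text", "text": identity,
             "cache_control": {"type": "ephemeral"}},
            # Tier 2 boundary: when reference docs change, tier 1 still hits.
            {"type": "text", "text": reference_docs,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Per-turn content stays below every marker, so it never
        # invalidates the cached tiers.
        "messages": [{"role": "user", "content": user_msg}],
    }
```

Note that the markers sit on the last block of each stable tier; moving a block above or below a marker changes what that tier caches.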
OpenAI's Automatic Prefix Caching
OpenAI takes the opposite approach. There is no marker. The provider checks whether the incoming prefix matches a recent one (above a minimum size, within TTL), and if so reuses the processed representation for the cached portion. The match starts at the first token and runs until the prompts diverge.
The virtue is simplicity. You do not configure caching per request or forget a breakpoint on a path. Ship a prompt with a stable front and the cache fires. The tax is control: you cannot say "cache exactly up to here." The cached region ends wherever prompts diverge. If a per-call field sneaks into the prefix, everything downstream becomes uncacheable for that call — silently.
Automatic caching also ties hit rate tightly to how consistently you construct prompts. Two services rendering "the same" system prompt with slightly different whitespace or field order create two prefixes and two separate cache entries. Anthropic has the same constraint, but the explicit boundary makes the drift easier to spot during review.
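One way to prevent that drift is to route every service through a single canonical renderer, so "the same" prompt is byte-identical wherever it is built. The helper below is an illustrative sketch, not part of either SDK:

```python
import json

def render_prefix(system_prompt: str, tools: list) -> str:
    """Render the stable prefix deterministically: tools sorted by name,
    JSON with sorted keys and fixed separators, trailing whitespace
    stripped. Two services calling this with equivalent inputs emit
    identical bytes, so automatic prefix matching sees one cache entry."""
    canonical_tools = sorted(tools, key=lambda t: t["name"])
    return (system_prompt.strip() + "\n"
            + json.dumps(canonical_tools, sort_keys=True,
                         separators=(",", ":")))

# Field order and stray whitespace differences disappear at the renderer.
a = render_prefix("You are an analyst.", [{"name": "b"}, {"name": "a"}])
b = render_prefix("You are an analyst.  ", [{"name": "a"}, {"name": "b"}])
assert a == b
```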
Side-by-Side Comparison
| Dimension | Anthropic (Claude) | OpenAI |
|---|---|---|
| Activation | Explicit via cache_control marker on a content block | Automatic when prefixes repeat above a threshold |
| Boundary control | You pick where the cached region ends; multiple breakpoints supported | Cached region ends wherever prompts diverge — no explicit boundary |
| Typical minimum size | Higher — caching below a floor is not stored | Lower — smaller prefixes can cache |
| TTL | Minutes-range; check current docs for exact duration | Minutes-range; check current docs for exact duration |
| Layering | Natural — place breakpoints after each stable tier | Implicit — longest shared prefix is what caches |
| Hit telemetry | Response metadata reports cached vs. uncached input tokens | Response metadata reports cached vs. uncached input tokens |
| Invalidation surface | Any change in or above a cached block | Any change in the shared prefix region |
| Per-call setup | Must attach marker; easy to forget on a new code path | Nothing to set — works if structure is right |
| Developer cost | Think once about breakpoints, then maintain | Think once about prompt structure, then maintain |
| Control trade-off | More precision, more bookkeeping | Less bookkeeping, less precision |
Pricing shape is similar across both — cached input tokens cost meaningfully less than fresh ones; output costs what it costs. Exact discount percentages differ between providers and change over time; check current pricing. For where caching sits in the broader picture, see the token economics guide 2026.
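The shape of the savings is easy to model. The numbers below are deliberately made up (a placeholder base rate and a placeholder 90% cache discount) to show the arithmetic, not either provider's actual pricing:

```python
def cached_input_cost(total_input: int, cached: int,
                      base_rate: float, cache_discount: float) -> float:
    """Input cost with prompt caching. Rates are per token; cache_discount
    is the fraction of the base rate charged for cached tokens (0.1 means
    cached tokens cost 10% of fresh ones). Illustrative only: check each
    provider's current pricing page for real numbers."""
    fresh = total_input - cached
    return fresh * base_rate + cached * base_rate * cache_discount

# Illustrative: a 10,000-token prompt with an 8,000-token cached prefix.
baseline = cached_input_cost(10_000, 0, 3e-6, 0.1)       # every call misses
with_cache = cached_input_cost(10_000, 8_000, 3e-6, 0.1)  # warm cache
savings = 1 - with_cache / baseline  # about 0.72 with these placeholder rates
```

The lever is the cached fraction of the prompt: the deeper the stable prefix, the closer real savings get to the full discount.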
Prompt Structure Implications
Both approaches converge on stable-first, variable-last. The difference is where the flex points are.
On Anthropic, flex points are wherever you place (or decline to place) a breakpoint. You can cache tool definitions but not reference docs by placing the marker between them. You can cache the system prompt only, and treat tools as dynamic. Layering lets you match cache regions to how content actually changes.
On OpenAI, the flex point is wherever prompts diverge. If the system prompt is identical but tool definitions are in a different order, the cache catches the system prompt and stops. You get a single implicit boundary; you do not choose where it lands. The lever is ensuring stable content is genuinely stable, token for token.
In both cases, content in the cached region has to be exactly identical across calls — token-level match, not paraphrase. No timestamps, no request IDs, no per-user details unless the cache is scoped per user. The prompt caching guide covers the general hygiene rules; the provider-specific version is that Anthropic lets you draw the fence, OpenAI infers it.
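A cheap way to enforce that hygiene is a pre-flight scan of the would-be cached region for content that looks per-call. This is a hypothetical check with illustrative patterns, not an exhaustive list; extend it for whatever your own middleware injects:

```python
import re

# Patterns for common sources of per-call variation in a "stable" prefix.
VOLATILE_PATTERNS = {
    "ISO timestamp": re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}"),
    "UUID": re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
                       r"[0-9a-f]{4}-[0-9a-f]{12}", re.I),
    "epoch millis": re.compile(r"\b1\d{12}\b"),
}

def find_volatile_content(prefix: str) -> list:
    """Return (label, match) pairs for anything that looks per-call.
    A non-empty result means the prefix will miss the cache on every call."""
    return [(label, m.group()) for label, pat in VOLATILE_PATTERNS.items()
            for m in pat.finditer(prefix)]
```

Run it in CI against the rendered stable prefix; a failing check here is far cheaper than a silently flat hit rate in production.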
Cache Hit Matching Rules
Both providers require the prompt to start with exactly the tokens the cache has seen. The match runs from the first token until prompts diverge; the moment they diverge, caching for that call ends.
Three things break matches on both:
- Any character change in the cached region — whitespace, punctuation, model hints, or reordered tool definitions.
- Minimum size not met — prefixes shorter than a provider-specific floor never cache.
- TTL expired — if the window passed between calls, the first call on the next window misses.
The Anthropic-specific gotcha: a block's content must not change. Dynamically formatting a "reference docs" block to include today's date changes the block's hash and misses every request downstream. The OpenAI-specific gotcha is subtle prefix contamination — a header auto-injected by an SDK wrapper, a trace ID prepended in middleware, a "current time is X" line baked into the template. You will not see an error; the hit rate just will not climb.
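When the hit rate will not climb, the fastest diagnosis is to capture two rendered prompts that should match and locate where they split. The helper below is an illustrative character-level approximation of the providers' token-level match:

```python
def divergence_point(a: str, b: str) -> int:
    """Index of the first differing character between two rendered prompts,
    or -1 if one is a prefix of the other (including equal strings).
    Everything before this index is what a prefix cache could share."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    return -1

def show_divergence(a: str, b: str, context: int = 20) -> str:
    """Human-readable snippet around the split, suitable for log output."""
    i = divergence_point(a, b)
    if i < 0:
        return "prompts share a full prefix"
    start = max(0, i - context)
    return f"diverge at char {i}: ...{a[start:i]}[{a[i]!r} vs {b[i]!r}]"
```

An early divergence point (inside the system prompt or tool definitions) is the signature of prefix contamination; a late one means the structure is fine and only per-call content differs.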
TTL and Invalidation
Both providers hold cached prefixes for minutes, not hours or days. Exact TTLs differ and change; design for tens of minutes. If traffic density within that window is high, caching pays off spectacularly. If calls are spaced further apart, each call re-warms and re-loses the cache.
Invalidation happens for four reasons on both providers:
- TTL expiry — time-based, not under your control.
- Prefix change — any edit in or above the cached region.
- Model version swap — switching variants restarts caching from zero.
- Capacity eviction — under memory pressure, older entries may drop before TTL.
Treat these as facts of life. Cache-hit telemetry (both providers surface it) is how you find out which one bit you this week.
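Both providers report cached token counts in response usage metadata, though under different field names (as of this writing, OpenAI nests `cached_tokens` under `prompt_tokens_details`; Anthropic reports cache reads and writes as separate counters; confirm against current SDK docs). A sketch of normalizing the two into one hit-ratio metric:

```python
def cache_hit_ratio_openai(usage: dict) -> float:
    """usage dict from a chat completion response. cached_tokens counts
    the portion of the prompt served from cache."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

def cache_hit_ratio_anthropic(usage: dict) -> float:
    """usage dict from a Messages API response. Fresh input, cache writes,
    and cache reads are reported as separate counters."""
    read = usage.get("cache_read_input_tokens", 0)
    total = (usage.get("input_tokens", 0)
             + usage.get("cache_creation_input_tokens", 0) + read)
    return read / total if total else 0.0
```

Emit the ratio per request and watch the trend; a sudden drop after a deploy points at a prefix change, while a slow decay points at TTL gaps or eviction.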
A Migration Example
The cleanest way to write prompts that cache well on both is to assume explicit breakpoints (Anthropic) while enforcing stable structure (OpenAI). Here is a hypothetical request restructured for cross-provider use.
[System identity — stable across all calls]
You are an operations analyst for internal support tickets.
[Rules and guardrails — stable across all calls]
Follow these rules:
- Respond only with JSON matching the schema below.
- If information is missing, ask rather than guess.
[Tool definitions — stable per deploy]
Tools available: ticket_lookup, user_lookup, escalate.
[...schemas...]
[Reference material — stable per week]
Escalation policy summary: [...]
Common ticket categories: [...]
# <-- Anthropic: cache_control marker here -->
# <-- OpenAI: structural break; per-call content below -->
[Retrieved context — varies per call]
Recent tickets from this user: [...]
[User message — varies per call]
Ticket: [...]
On Anthropic, you attach cache_control to the last block of reference material. On OpenAI, you do nothing extra — the cache fires because the front is identical across calls. The structure is the same; only activation differs. If you later layer caches on Anthropic (one after tools, one after reference docs), you can do so without rewriting anything above the boundary.
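The shared-template idea can be sketched as one render function with two paths. Shapes follow each provider's chat API at the time of writing; tier names and contents are placeholders:

```python
def render(provider: str, tiers: dict, user_msg: str) -> dict:
    """One template, two render paths. `tiers` maps tier names to text and
    must be rendered identically on both paths; only activation differs."""
    stable = [tiers["identity"], tiers["rules"],
              tiers["tools"], tiers["reference"]]
    if provider == "anthropic":
        blocks = [{"type": "text", "text": t} for t in stable]
        # The one provider-specific line: mark the end of the stable region.
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
        return {"system": blocks,
                "messages": [{"role": "user", "content": user_msg}]}
    # OpenAI: nothing to mark. The identical stable front IS the cache key.
    return {"messages": [{"role": "system", "content": "\n\n".join(stable)},
                         {"role": "user", "content": user_msg}]}
```

Because both paths consume the same `tiers` dict, a change to any tier invalidates both providers' caches at once, which keeps deploys and cache warm-ups aligned.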
Migration Tips
- Pin token-level stability in the shared prefix. No timestamps, no IDs, no dynamic date lines. If it varies per call, it goes below the boundary.
- Keep tool definitions in a fixed order. A sort-by-name pass at build time prevents accidental reordering from different code paths.
- Version prompt changes like schema migrations. A one-word rewrite invalidates every cached prefix on both providers. Batch changes; deploy at low-traffic windows if hit rate is business-critical.
- Log cache-hit ratios on both providers. Different SDKs, same signal — watch it trend.
- Do not rely on the cache for correctness. If behavior changes when caching misses, that is a bug. Caching is a cost optimization; it should not affect output.
- Partition per-tenant prompts where isolation matters. Cross-tenant prefix sharing is a security issue, not a savings opportunity.
Common Anti-Patterns
- Assuming OpenAI's automatic caching "just handles it" — it handles repeated prefixes, not repeated intent. Two callers rendering "the same" prompt differently will not share a cache.
- Putting an Anthropic cache_control marker on a block that changes per call — caches the wrong thing, wastes the breakpoint, yields no hits.
- Using multiple Anthropic breakpoints without telemetry — layering without measurement is guessing. Start with one, add more when data justifies it.
- Treating the two providers as identical — same in principle, different in where control lives. Ported prompts need the marker added or removed.
- Changing prompts on a hot path without staging — both providers invalidate on edit. A whitespace-only change can drop hit rate to zero for a whole TTL.
- Conflating prompt caching with response caching — they are different. See semantic caching vs. prompt caching.
FAQ
Do I need different prompts for Claude and OpenAI to get caching?
Not structurally. The same stable-first, variable-last layout works on both. The only provider-specific piece is attaching Anthropic's cache_control marker. If you build prompts as a shared template with a provider-specific render step, the marker is a one-line addition on the Anthropic path.
Is OpenAI's automatic caching always cheaper than no caching?
Yes — when it fires. The discount applies only to the cached portion of the input. If prefixes are not repeating above the minimum size, or calls are spaced wider than the TTL, you will not see savings. Worst case is break-even; best case is large.
Does Anthropic charge extra for using cache_control?
The mechanism is not a surcharge — you pay the regular input rate to write the cache on the first call, and a reduced rate to read it on hits. Exact write/read ratios differ from OpenAI's and change over time; check the current pricing page.
Can I use both providers in the same pipeline and share a prompt?
Yes, and many teams do for redundancy or routing. Keep the prompt template shared, render it identically for both, and add Anthropic's marker in the render step. Cache telemetry is per-provider, so monitor hit rates separately.
Which provider's caching is better?
Neither in the abstract. Anthropic gives more control and rewards deliberate layering; OpenAI gives less setup cost and rewards prompt discipline. The right answer is the one whose habits match your team's.
Wrap-Up
The two providers landed on the same economic argument and picked different ergonomics. Anthropic's explicit breakpoints favor teams that want to layer cache tiers and are willing to maintain boundaries. OpenAI's automatic prefix caching favors teams that want caching without per-request configuration and are willing to enforce stable prompt structure. Both reward the same discipline: keep the front of the prompt identical across calls, put everything volatile below the line, and verify hit rates in telemetry rather than trusting that caching is working.
For the general mechanics, see the prompt caching guide 2026. For where caching sits in the cost picture, see the token economics guide 2026. For the adjacent concept people confuse with prompt caching, see semantic caching vs. prompt caching.