prompt caching · context engineering · LLM costs · Anthropic · OpenAI · prompt engineering

Prompt Caching Guide (2026): Cutting LLM Costs With Cache Hits

How prompt caching works at Anthropic and OpenAI in 2026 — cache markers, hit requirements, TTL, and how to structure prompts so the cache actually fires.

SurePrompts Team
April 20, 2026
10 min read

TL;DR

Prompt caching reuses the processed form of static prompt prefixes — but only if you structure prompts so the cache can recognize them. Stable content first, variable content last; the savings follow.

Prompt caching sounds like a micro-optimization and turns out to reshape architecture. When a provider has seen the start of your prompt before, it reuses the processed form of that prefix instead of recomputing it — charging less and answering faster. In 2026 that is standard at both Anthropic and OpenAI. How you structure a prompt now determines whether the cache fires, and most prompts written before caching went mainstream leave the savings on the table. This sits under the context engineering pillar — caching is a big reason "context engineering" became the right frame.

What Prompt Caching Is

When the model processes a prompt, it converts tokens into an internal representation — the "processed prefix" — then runs attention over that representation to generate output. Most of the cost of a long prompt is in processing, not generation. If the same prefix shows up on the next call, the provider skips processing, reuses the representation, and runs generation on top. You still pay to generate output, but the cached input costs much less, and the request returns faster. See the prompt caching glossary entry for the short definition.

Two things the cache is not: it is not a semantic cache (identical prefixes only, not "similar" prompts), and it is not a response cache (the model still generates fresh output). See semantic caching vs. prompt caching — people conflate them often enough to cause bugs.

Why It Matters in 2026

Three trends made caching a first-class design concern.

System prompts got long. Production apps routinely run multi-thousand-token system prompts — identity, rules, tool definitions, reference material. Without caching, every call pays full input cost. With caching, that cost amortizes across every call in the window.

Agents loop. An agent makes many calls per task. If the system prompt and tool definitions are stable (they usually are), caching turns a 10-step task from 10x the system-prompt cost into roughly 1x.
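The amortization is easy to see with back-of-envelope arithmetic. The prices and discount below are hypothetical placeholders for illustration, not current provider rates:

```python
# Illustrative only: price and discount are hypothetical placeholders,
# not current provider rates.
SYSTEM_TOKENS = 8_000             # stable system prompt + tool definitions
PRICE_PER_TOKEN = 3 / 1_000_000   # hypothetical fresh-input price
CACHED_DISCOUNT = 0.1             # hypothetical: cached reads cost 10% of fresh
STEPS = 10                        # calls in one agent task

# Without caching: every step pays full price for the system prompt.
fresh_cost = STEPS * SYSTEM_TOKENS * PRICE_PER_TOKEN

# With caching: the first step processes the prefix fresh; later steps hit.
cached_cost = (
    SYSTEM_TOKENS * PRICE_PER_TOKEN
    + (STEPS - 1) * SYSTEM_TOKENS * PRICE_PER_TOKEN * CACHED_DISCOUNT
)

print(f"fresh:  ${fresh_cost:.4f}")   # $0.2400
print(f"cached: ${cached_cost:.4f}")  # $0.0456
```

Under these made-up numbers the 10-step task costs roughly 5x less on its system prompt; the exact multiplier depends on current pricing, but the shape holds.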

Long context is mainstream. With windows past 1M tokens, people pack entire codebases or document sets into a prompt. If that content is reused across queries, caching it makes the design tractable. Without caching, paying full rates on a 500k-token prompt many times is often prohibitive.

If you call the same model more than a few times per session, caching is one of your largest cost levers. See token economics guide 2026 for where it sits in the cost picture.

How Caching Works Under the Hood

The flow:

  • You send a request.
  • The provider tokenizes the prompt and checks whether the beginning matches a cached prefix.
  • If the match is long enough, it loads the cached representation, skips re-processing, and processes only the uncached tail.
  • The model generates output.
  • On no match, the provider processes the whole prompt fresh — and may store the new prefix for future reuse.

The cache is keyed on the exact prefix, from the first token. A hit requires the prompt to start with the token sequence the cache has seen, continuing as long as they agree. The moment variable content appears in the prefix, everything after it is uncacheable.

Cache Hit Requirements

Three conditions need to line up. Exact numbers vary by provider and change over time — the shape is what matters.

| Condition | What it means | What varies |
| --- | --- | --- |
| Exact prefix match | The prompt must start with the same tokens the cache processed | Non-negotiable — this is how the cache is keyed |
| Minimum cached size | Prefixes below a token threshold may not cache at all | Exact minimum varies by provider; check current docs |
| Within TTL window | The cached prefix must still be live when the next call arrives | Typically minutes, not hours — design for tens of minutes, not days |

The "exact prefix match" condition trips people up. It is token-level, not paraphrase-level. Changing wording, reordering two sentences, or inserting a timestamp all break the match. If you want a hit, nothing in the cached region can move.
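A toy sketch makes the token-level strictness concrete. The "tokenization" here is whitespace splitting for illustration; real caches compare provider tokenizer output, but the failure mode is the same:

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Toy "tokenization" by whitespace; real caches compare provider tokens.
stable = "You are a support agent. Follow the rules below.".split()
with_ts = "Current time: 09:41. You are a support agent. Follow the rules below.".split()

# A timestamp at the front means the very first token differs,
# so nothing after it can be served from cache.
print(common_prefix_len(stable, with_ts))  # 0
print(common_prefix_len(stable, stable))   # full length: identical prompt hits
```

One line of volatile content at position zero zeroes out the whole cacheable region behind it.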

Writing Cache-Friendly Prompts

The rule: stable content at the top, variable content at the bottom. Per-call content goes last; constant content goes first.

```
# CACHE REGION — stable across all calls
[System identity and role]
[Operating rules and guardrails]
[Tool and schema definitions]
[Reference docs, glossary, style guide]

--- cache breakpoint (explicit on Anthropic, implicit on OpenAI) ---

# NON-CACHE REGION — varies per call
[Retrieved chunks for this specific query]
[Recent conversation turns]
[User's actual request]
```

Belongs in the stable region: identity, rules, tool definitions, reference material that applies every turn, and few-shot examples if the set is fixed. Keep out: timestamps, request IDs, per-user account data (unless the cache is scoped per user), retrieved chunks, and the user's question.

The test: if the same prompt were sent a thousand times in the next hour, which parts would be identical across every copy? That is your cache region.
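In code, the layout reduces to one discipline: the stable region is a constant, and everything per-call is injected after it. A minimal sketch, with illustrative names and a hypothetical tool list:

```python
# Sketch of cache-friendly prompt assembly. Names and tools are illustrative.
STABLE_PREFIX = (
    "You are the billing assistant for Acme.\n"
    "Rules: never reveal internal IDs. Answer in the user's language.\n"
    "Tools: lookup_invoice(id), refund(id, amount)\n"
)

def build_messages(retrieved_chunks, user_question):
    # Constant prefix first; everything per-call goes after it.
    context = "\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": STABLE_PREFIX},                 # cacheable
        {"role": "user", "content": f"{context}\n\n{user_question}"}, # varies
    ]

m1 = build_messages(["chunk A"], "Why was I charged twice?")
m2 = build_messages(["chunk B"], "Where is my refund?")
assert m1[0] == m2[0]  # stable region is byte-identical across calls
```

The assertion at the bottom is the property to protect: if two calls ever produce different system content, the prefix cannot hit.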

Anthropic Cache Markers

Anthropic's caching uses explicit cache breakpoints. You mark where the cached region ends — typically via a cache_control marker on a content block. Everything up to that block is eligible for reuse; everything after is processed fresh.

The benefit is layered caching. Several breakpoints — one after the system prompt, one after tool definitions, one after reference docs — give each layer its own cache behavior. For a prompt with stable identity, semi-stable reference docs, and volatile per-query retrieval, you can cache the first two and skip the third. The cost is deciding where the breakpoints go — thinking done per pipeline, not per call.
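As a concrete shape, a request with one breakpoint might look like the following. The `cache_control` block follows Anthropic's documented pattern, but the model name and prompt text here are placeholders, and field names can drift; verify against the current API reference:

```python
# Request-shape sketch following Anthropic's documented cache_control
# pattern. Model name and prompt text are placeholders; check current docs.
LONG_STABLE_SYSTEM_PROMPT = "…identity, rules, tools, reference docs…"

request = {
    "model": "claude-model-name",  # placeholder, not a real model ID
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,
            # Breakpoint: everything up to and including this block
            # is eligible for caching.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Per-query content goes here."}
    ],
}
```

Additional breakpoints follow the same pattern: a `cache_control` marker on the last block of each layer you want independently cacheable.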

OpenAI Automatic Caching

OpenAI's caching is automatic. No explicit markers — the provider detects when a prefix has been repeated across recent calls and, if it is above the minimum size and within TTL, reuses it transparently. The developer-facing action is: keep the front stable, and the cache takes care of itself.

Automatic caching is simpler — nothing to configure, no forgotten breakpoint. The cost is less control over where the cached region ends. You cannot say "cache exactly up to here"; the system matches whatever is consistently repeated.
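Verification replaces configuration here: read the usage metadata on each response. The sketch below assumes the `prompt_tokens_details.cached_tokens` field shape that OpenAI's chat-completions usage object documented at the time of writing; confirm the exact names against current docs:

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of input tokens served from cache, from a chat-completions
    usage object. Field names assume OpenAI's documented shape at the time
    of writing; verify against current docs."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# Example usage payload (values are illustrative):
print(cached_fraction({
    "prompt_tokens": 12_000,
    "prompt_tokens_details": {"cached_tokens": 10_000},
}))  # ≈ 0.833
```

A fraction near zero on a high-volume pipeline is the signal that something in the supposedly stable prefix is rotating.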

Both approaches converge on the same guidance: keep the front stable, variable content at the end, verify cache-hit rates in logs. See Claude vs. OpenAI prompt caching for the side-by-side.

Cache Invalidation Pitfalls

The most common caching bug is silent — the cache does not fire, cost does not drop, and nothing in the API response screams. The symptom is a pipeline costing the same as before. Almost always, something in the supposedly-stable region is rotating.

Things that silently break caching:

  • Dynamic timestamps — a "current time is X" line, different every call.
  • Per-user personalization in the cached region — name, account ID, or preferences baked in rather than injected after the boundary.
  • Retrieved context above the boundary — retrieval changes per query, so anything placed there rotates the prefix every call.
  • Tool definitions reordered — same tools in a different order are a different token sequence. Pin the order.
  • Wording drift across deploys — changing "helpful assistant" to "helpful AI assistant" invalidates every prefix with the old line. Treat prompt changes like schema migrations.
  • Model version changes — switching variants invalidates old caches. First call on a new model misses; savings ramp from there.

The diagnostic is cache-hit telemetry. Both providers surface whether a request hit. If the rate is 0% or suspiciously low, something in the prefix is rotating — find it and fix it.
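A minimal version of that telemetry is a per-pipeline aggregate of cached versus total input tokens, checked against an alert threshold. The pipeline names and numbers below are illustrative:

```python
# Minimal cache-hit telemetry: aggregate cached vs. total input tokens per
# pipeline. Names and token counts are illustrative.
from collections import defaultdict

totals = defaultdict(lambda: {"cached": 0, "input": 0})

def record(pipeline, cached_tokens, input_tokens):
    totals[pipeline]["cached"] += cached_tokens
    totals[pipeline]["input"] += input_tokens

def hit_ratio(pipeline):
    t = totals[pipeline]
    return t["cached"] / t["input"] if t["input"] else 0.0

record("support-agent", 0, 9_000)      # first call: miss, prefix gets written
record("support-agent", 8_000, 9_000)  # later calls: prefix hits
record("support-agent", 8_000, 9_000)

print(f"hit ratio: {hit_ratio('support-agent'):.2f}")  # 16k / 27k ≈ 0.59
```

Graph this ratio per deploy: a step change down after a release points at a prompt edit that invalidated the prefix.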

When Caching Does Not Help

Caching rewards stable, reused prefixes at real volume. Low-value cases:

  • Short prompts — below the provider minimum, caching does not apply.
  • One-off requests — no reuse to capture.
  • High-variability prefixes — per-call customization makes each variant its own entry; low-volume variants age out before a second hit.
  • Low call volume — calls an hour apart likely age out between them. Caching pays when density within the TTL window is high.
  • Rapid development iteration — every prompt change invalidates the cache. Caching is a production optimization, not a development one.

Common Anti-Patterns

  • Rotating content in the cached region — timestamps, request IDs, dynamic date lines. Move them out or drop them.
  • Rebuilding the system prompt per call — templating with per-call variables prevents reuse. Keep it identical; inject below the boundary.
  • Assuming caching is on by default — some SDK paths require explicit parameters. Verify via cache-hit metrics.
  • Caching prefixes that mix tenants — tenant A's data leaking into tenant B's prompt via a shared key is a security bug. Partition where isolation matters.
  • Over-layering breakpoints — five when two would do adds complexity without benefit. Start with one; add more when measurements justify it.
  • Treating caching as a reason to skip context budgeting — it reduces input cost on hits; it does not reduce attention cost or context-rot risk.

FAQ

How much does prompt caching save?

On hit calls, the cached portion costs meaningfully less than the same tokens processed fresh. Discount rates vary by provider and change — check current pricing. Qualitatively: for workloads with a long stable prefix and many calls, caching typically turns input cost from a dominant line item into a minor one.

What is the TTL for a cached prefix?

Varies by provider and configuration. Design as if the window is tens of minutes, not hours or days. If calls arrive more than an hour apart, assume the cache has aged out.

Does caching work for streamed responses?

Yes. Caching applies to input processing, not output generation. Streaming does not affect whether the input prefix was a hit.

Can I cache per-user or per-tenant?

Yes, with a trade-off. Per-tenant caches are great for hit rates within a tenant and wasteful if many tenants have low volume. Partition where isolation requires it; consolidate where it does not.

How do I know if my cache is firing?

Both providers surface cache metadata on responses — typically a count of cached versus uncached input tokens. Log it and graph hit ratio over time. A sudden drop usually means a prompt change invalidated the prefix; an always-zero ratio usually means something in the supposed cache region rotates.

Wrap-Up

Prompt caching changed the economics of long prompts in a way that shows up in architecture, not just billing. The rule is short — stable first, variable last, nothing rotating in the cached region — and the payoff is substantial for systems with long prefixes and real volume. Failure modes are silent: a timestamp drifts in, a prompt gets tweaked across a deploy, a retrieval block sneaks above the boundary, and the hit rate quietly collapses. Treat the cached region like a schema — something that changes deliberately, with measurement.

For the broader picture, see the context engineering pillar. For provider comparison, see Claude vs. OpenAI prompt caching. For the adjacent concept people confuse with it, see semantic caching vs. prompt caching. For the overall cost picture, see token economics guide 2026.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder

Get ready-made ChatGPT prompts

Browse our curated ChatGPT prompt library — tested templates you can use right away, no prompt engineering required.

Browse ChatGPT Prompts