prompt caching · context engineering · LLM costs · Anthropic · OpenAI · prompt engineering

Prompt Caching Guide (2026): Cutting LLM Costs With Cache Hits

How prompt caching works at Anthropic and OpenAI in 2026 — cache markers, hit requirements, TTL, and how to structure prompts so the cache actually fires.

SurePrompts Team
April 20, 2026
10 min read

TL;DR

Prompt caching reuses the processed form of static prompt prefixes — but only if you structure prompts so the cache can recognize them. Stable content first, variable content last; the savings follow.

Prompt caching sounds like a micro-optimization and turns out to reshape architecture. When a provider has seen the start of your prompt before, it reuses the processed form of that prefix instead of recomputing it — charging less and answering faster. In 2026 that is standard at both Anthropic and OpenAI. How you structure a prompt now determines whether the cache fires, and most prompts written before caching went mainstream leave the savings on the table. This sits under the context engineering pillar — caching is a big reason "context engineering" became the right frame.

What Prompt Caching Is

When the model processes a prompt, it converts tokens into an internal representation — the "processed prefix" — then runs attention over that representation to generate output. Most of the cost of a long prompt is in processing, not generation. If the same prefix shows up on the next call, the provider skips processing, reuses the representation, and runs generation on top. You still pay to generate output, but the cached input costs much less, and the request returns faster. See the prompt caching glossary entry for the short definition.

Two things the cache is not: it is not a semantic cache (identical prefixes only, not "similar" prompts), and it is not a response cache (the model still generates fresh output). See semantic caching vs. prompt caching — people conflate them often enough to cause bugs.

Why It Matters in 2026

Three trends made caching a first-class design concern.

System prompts got long. Production apps routinely run multi-thousand-token system prompts — identity, rules, tool definitions, reference material. Without caching, every call pays full input cost. With caching, that cost amortizes across every call in the window.

Agents loop. An agent makes many calls per task. If the system prompt and tool definitions are stable (they usually are), caching turns a 10-step task from 10x the system-prompt cost into roughly 1x.
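The amortization is easy to see with back-of-envelope arithmetic. The prices and discount below are hypothetical placeholders for illustration, not current provider rates:

```python
# Illustrative only: price and discount are hypothetical placeholders,
# not current provider rates.
SYSTEM_TOKENS = 8_000             # stable system prompt + tool definitions
PRICE_PER_TOKEN = 3 / 1_000_000   # hypothetical fresh-input price
CACHED_DISCOUNT = 0.1             # hypothetical: cached reads cost 10% of fresh
STEPS = 10                        # calls in one agent task

# Without caching: every step pays full price for the system prompt.
fresh_cost = STEPS * SYSTEM_TOKENS * PRICE_PER_TOKEN

# With caching: the first step processes the prefix fresh; later steps hit.
cached_cost = (
    SYSTEM_TOKENS * PRICE_PER_TOKEN
    + (STEPS - 1) * SYSTEM_TOKENS * PRICE_PER_TOKEN * CACHED_DISCOUNT
)

print(f"fresh:  ${fresh_cost:.4f}")   # $0.2400
print(f"cached: ${cached_cost:.4f}")  # $0.0456
```

Under these made-up numbers the 10-step task costs roughly 5x less on its system prompt; the exact multiplier depends on current pricing, but the shape holds.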

Long context is mainstream. With windows past 1M tokens, people pack entire codebases or document sets into a prompt. If that content is reused across queries, caching it makes the design tractable. Without caching, paying full rates on a 500k-token prompt many times is often prohibitive.

If you call the same model more than a few times per session, caching is one of your largest cost levers. See token economics guide 2026 for where it sits in the cost picture.

How Caching Works Under the Hood

The flow:

  • You send a request.
  • The provider tokenizes the prompt and checks whether the beginning matches a cached prefix.
  • If the match is long enough, it loads the cached representation, skips re-processing, and processes only the uncached tail.
  • The model generates output.
  • On no match, the provider processes the whole prompt fresh — and may store the new prefix for future reuse.

The cache is keyed on the exact prefix, from the first token. A hit requires the prompt to start with the token sequence the cache has seen, continuing as long as they agree. The moment variable content appears in the prefix, everything after it is uncacheable.

Cache Hit Requirements

Three conditions need to line up. Exact numbers vary by provider and change over time — the shape is what matters.

| Condition | What it means | What varies |
| --- | --- | --- |
| Exact prefix match | The prompt must start with the same tokens the cache processed | Non-negotiable — this is how the cache is keyed |
| Minimum cached size | Prefixes below a token threshold may not cache at all | Exact minimum varies by provider; check current docs |
| Within TTL window | The cached prefix must still be live when the next call arrives | Typically minutes, not hours — design for tens of minutes, not days |

The "exact prefix match" condition trips people up. It is token-level, not paraphrase-level. Changing wording, reordering two sentences, or inserting a timestamp all break the match. If you want a hit, nothing in the cached region can move.
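A toy sketch makes the token-level strictness concrete. The "tokenization" here is whitespace splitting for illustration; real caches compare provider tokenizer output, but the failure mode is the same:

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Toy "tokenization" by whitespace; real caches compare provider tokens.
stable = "You are a support agent. Follow the rules below.".split()
with_ts = "Current time: 09:41. You are a support agent. Follow the rules below.".split()

# A timestamp at the front means the very first token differs,
# so nothing after it can be served from cache.
print(common_prefix_len(stable, with_ts))  # 0
print(common_prefix_len(stable, stable))   # full length: identical prompt hits
```

One line of volatile content at position zero zeroes out the whole cacheable region behind it.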

Writing Cache-Friendly Prompts

The rule: stable content at the top, variable content at the bottom. Per-call content goes last; constant content goes first.

```
# CACHE REGION — stable across all calls
[System identity and role]
[Operating rules and guardrails]
[Tool and schema definitions]
[Reference docs, glossary, style guide]

--- cache breakpoint (explicit on Anthropic, implicit on OpenAI) ---

# NON-CACHE REGION — varies per call
[Retrieved chunks for this specific query]
[Recent conversation turns]
[User's actual request]
```

Belongs in the stable region: identity, rules, tool definitions, reference material that applies every turn, and few-shot examples if the set is fixed. Keep out: timestamps, request IDs, per-user account data (unless the cache is scoped per user), retrieved chunks, and the user's question.

The test: if the same prompt were sent a thousand times in the next hour, which parts would be identical across every copy? That is your cache region.
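In code, the layout reduces to one discipline: the stable region is a constant, and everything per-call is injected after it. A minimal sketch, with illustrative names and a hypothetical tool list:

```python
# Sketch of cache-friendly prompt assembly. Names and tools are illustrative.
STABLE_PREFIX = (
    "You are the billing assistant for Acme.\n"
    "Rules: never reveal internal IDs. Answer in the user's language.\n"
    "Tools: lookup_invoice(id), refund(id, amount)\n"
)

def build_messages(retrieved_chunks, user_question):
    # Constant prefix first; everything per-call goes after it.
    context = "\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": STABLE_PREFIX},                 # cacheable
        {"role": "user", "content": f"{context}\n\n{user_question}"}, # varies
    ]

m1 = build_messages(["chunk A"], "Why was I charged twice?")
m2 = build_messages(["chunk B"], "Where is my refund?")
assert m1[0] == m2[0]  # stable region is byte-identical across calls
```

The assertion at the bottom is the property to protect: if two calls ever produce different system content, the prefix cannot hit.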

Anthropic Cache Markers

Anthropic's caching uses explicit cache breakpoints. You mark where the cached region ends — typically via a cache_control marker on a content block. Everything up to that block is eligible for reuse; everything after is processed fresh.

The benefit is layered caching. Several breakpoints — one after the system prompt, one after tool definitions, one after reference docs — give each layer its own cache behavior. For a prompt with stable identity, semi-stable reference docs, and volatile per-query retrieval, you can cache the first two and skip the third. The cost is deciding where the breakpoints go — thinking done per pipeline, not per call.
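As a concrete shape, a request with one breakpoint might look like the following. The `cache_control` block follows Anthropic's documented pattern, but the model name and prompt text here are placeholders, and field names can drift; verify against the current API reference:

```python
# Request-shape sketch following Anthropic's documented cache_control
# pattern. Model name and prompt text are placeholders; check current docs.
LONG_STABLE_SYSTEM_PROMPT = "…identity, rules, tools, reference docs…"

request = {
    "model": "claude-model-name",  # placeholder, not a real model ID
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,
            # Breakpoint: everything up to and including this block
            # is eligible for caching.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Per-query content goes here."}
    ],
}
```

Additional breakpoints follow the same pattern: a `cache_control` marker on the last block of each layer you want independently cacheable.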

OpenAI Automatic Caching

OpenAI's caching is automatic. No explicit markers — the provider detects when a prefix has been repeated across recent calls and, if it is above the minimum size and within TTL, reuses it transparently. The developer-facing action is: keep the front stable, and the cache takes care of itself.

Automatic caching is simpler — nothing to configure, no forgotten breakpoint. The cost is less control over where the cached region ends. You cannot say "cache exactly up to here"; the system matches whatever is consistently repeated.
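Verification replaces configuration here: read the usage metadata on each response. The sketch below assumes the `prompt_tokens_details.cached_tokens` field shape that OpenAI's chat-completions usage object documented at the time of writing; confirm the exact names against current docs:

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of input tokens served from cache, from a chat-completions
    usage object. Field names assume OpenAI's documented shape at the time
    of writing; verify against current docs."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# Example usage payload (values are illustrative):
print(cached_fraction({
    "prompt_tokens": 12_000,
    "prompt_tokens_details": {"cached_tokens": 10_000},
}))  # ≈ 0.833
```

A fraction near zero on a high-volume pipeline is the signal that something in the supposedly stable prefix is rotating.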

Both approaches converge on the same guidance: keep the front stable, variable content at the end, verify cache-hit rates in logs. See Claude vs. OpenAI prompt caching for the side-by-side.

Cache Invalidation Pitfalls

The most common caching bug is silent — the cache does not fire, cost does not drop, and nothing in the API response screams. The symptom is a pipeline costing the same as before. Almost always, something in the supposedly-stable region is rotating.

Things that silently break caching:

  • Dynamic timestamps — a "current time is X" line, different every call.
  • Per-user personalization in the cached region — name, account ID, or preferences baked in rather than injected after the boundary.
  • Retrieved context above the boundary — retrieval changes per query, so anything placed there rotates the prefix every call.
  • Tool definitions reordered — same tools in a different order are a different token sequence. Pin the order.
  • Wording drift across deploys — changing "helpful assistant" to "helpful AI assistant" invalidates every prefix with the old line. Treat prompt changes like schema migrations.
  • Model version changes — switching variants invalidates old caches. First call on a new model misses; savings ramp from there.

The diagnostic is cache-hit telemetry. Both providers surface whether a request hit. If the rate is 0% or suspiciously low, something in the prefix is rotating — find it and fix it.
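A minimal version of that telemetry is a per-pipeline aggregate of cached versus total input tokens, checked against an alert threshold. The pipeline names and numbers below are illustrative:

```python
# Minimal cache-hit telemetry: aggregate cached vs. total input tokens per
# pipeline. Names and token counts are illustrative.
from collections import defaultdict

totals = defaultdict(lambda: {"cached": 0, "input": 0})

def record(pipeline, cached_tokens, input_tokens):
    totals[pipeline]["cached"] += cached_tokens
    totals[pipeline]["input"] += input_tokens

def hit_ratio(pipeline):
    t = totals[pipeline]
    return t["cached"] / t["input"] if t["input"] else 0.0

record("support-agent", 0, 9_000)      # first call: miss, prefix gets written
record("support-agent", 8_000, 9_000)  # later calls: prefix hits
record("support-agent", 8_000, 9_000)

print(f"hit ratio: {hit_ratio('support-agent'):.2f}")  # 16k / 27k ≈ 0.59
```

Graph this ratio per deploy: a step change down after a release points at a prompt edit that invalidated the prefix.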

When Caching Does Not Help

Caching rewards stable, reused prefixes at real volume. Low-value cases:

  • Short prompts — below the provider minimum, caching does not apply.
  • One-off requests — no reuse to capture.
  • High-variability prefixes — per-call customization makes each variant its own entry; low-volume variants age out before a second hit.
  • Low call volume — calls an hour apart likely age out between them. Caching pays when density within the TTL window is high.
  • Rapid development iteration — every prompt change invalidates the cache. Caching is a production optimization, not a development one.

Common Anti-Patterns

  • Rotating content in the cached region — timestamps, request IDs, dynamic date lines. Move them out or drop them.
  • Rebuilding the system prompt per call — templating with per-call variables prevents reuse. Keep it identical; inject below the boundary.
  • Assuming caching is on by default — some SDK paths require explicit parameters. Verify via cache-hit metrics.
  • Caching prefixes that mix tenants — tenant A's data leaking into tenant B's prompt via a shared key is a security bug. Partition where isolation matters.
  • Over-layering breakpoints — five when two would do adds complexity without benefit. Start with one; add more when measurements justify it.
  • Treating caching as a reason to skip context budgeting — it reduces input cost on hits; it does not reduce attention cost or context-rot risk.

FAQ

How much does prompt caching save?

On hit calls, the cached portion costs meaningfully less than the same tokens processed fresh. Discount rates vary by provider and change — check current pricing. Qualitatively: for workloads with a long stable prefix and many calls, caching typically turns input cost from a dominant line item into a minor one.

What is the TTL for a cached prefix?

Varies by provider and configuration. Design as if the window is tens of minutes, not hours or days. If calls arrive more than an hour apart, assume the cache has aged out.

Does caching work for streamed responses?

Yes. Caching applies to input processing, not output generation. Streaming does not affect whether the input prefix was a hit.

Can I cache per-user or per-tenant?

Yes, with a trade-off. Per-tenant caches are great for hit rates within a tenant and wasteful if many tenants have low volume. Partition where isolation requires it; consolidate where it does not.

How do I know if my cache is firing?

Both providers surface cache metadata on responses — typically a count of cached versus uncached input tokens. Log it and graph hit ratio over time. A sudden drop usually means a prompt change invalidated the prefix; an always-zero ratio usually means something in the supposed cache region rotates.

Wrap-Up

Prompt caching changed the economics of long prompts in a way that shows up in architecture, not just billing. The rule is short — stable first, variable last, nothing rotating in the cached region — and the payoff is substantial for systems with long prefixes and real volume. Failure modes are silent: a timestamp drifts in, a prompt gets tweaked across a deploy, a retrieval block sneaks above the boundary, and the hit rate quietly collapses. Treat the cached region like a schema — something that changes deliberately, with measurement.

For the broader picture, see the context engineering pillar. For provider comparison, see Claude vs. OpenAI prompt caching. For the adjacent concept people confuse with it, see semantic caching vs. prompt caching. For the overall cost picture, see token economics guide 2026.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder

Get ready-made ChatGPT prompts

Browse our curated ChatGPT prompt library — tested templates you can use right away, no prompt engineering required.

Browse ChatGPT Prompts