Prompt Caching
Prompt caching is a performance optimization where the model's computed internal representations (key-value attention states) of a static prompt prefix are stored and reused across multiple requests. Instead of recomputing these states every time the same system prompt or reference text is sent, the cached version is loaded directly, significantly reducing latency and computational cost for the repeated portion.
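The mechanism can be sketched as a server-side lookup table keyed by the prompt prefix: on a hit, the stored key-value states are returned instead of being recomputed. This is a minimal illustrative sketch, not any provider's real implementation; the class and function names are invented for the example.

```python
import hashlib

# Illustrative sketch of prompt caching: store the model's computed
# key-value (KV) states for a static prompt prefix, keyed by a hash
# of the prefix tokens, and reuse them across requests.

class PrefixKVCache:
    def __init__(self):
        self._store = {}  # prefix hash -> precomputed KV states

    def _key(self, prefix_tokens):
        joined = " ".join(map(str, prefix_tokens))
        return hashlib.sha256(joined.encode()).hexdigest()

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = self._key(prefix_tokens)
        if key in self._store:
            return self._store[key], True   # cache hit: skip recomputation
        kv = compute_kv(prefix_tokens)      # full forward pass over the prefix
        self._store[key] = kv
        return kv, False                    # cache miss: computed and stored

cache = PrefixKVCache()
system_prompt = list(range(8000))  # stand-in for an 8,000-token system prompt

# compute_kv is a placeholder for the expensive attention computation
kv1, hit1 = cache.get_or_compute(system_prompt, lambda toks: f"kv({len(toks)})")
kv2, hit2 = cache.get_or_compute(system_prompt, lambda toks: f"kv({len(toks)})")
print(hit1, hit2)  # first request computes, second request reuses the cache
```

Only the static prefix benefits: any tokens after the first point of divergence between two requests must still be computed from scratch.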
Example
An API application sends the same 8,000-token system prompt with every user request. With prompt caching enabled, the model computes the internal states for those 8,000 tokens once; on subsequent requests it loads the cached states instead of re-processing the prefix, reducing time to first token from roughly 2 seconds to around 200 milliseconds.
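To get cache hits like the example above, requests must be structured so the static content comes first and is byte-identical across calls. The sketch below shows that pattern; the message format and names are generic illustrations, not a specific provider's API.

```python
# Illustrative sketch: put the static system prompt first so repeated
# requests share an identical, cacheable prefix. Any per-request content
# goes after it, where divergence does not invalidate the prefix cache.

STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Inc. ..."  # ~8,000 tokens in practice

def build_request(user_message):
    # Static content first -> identical prefix across requests -> cache hit
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]

req_a = build_request("Where is my order?")
req_b = build_request("Cancel my subscription.")

# The shared prefix (the system message) is byte-identical, so a provider's
# prompt cache can reuse its computed states on the second request.
print(req_a[0] == req_b[0])
```

A common pitfall is injecting dynamic values (timestamps, user IDs) into the system prompt itself: even a one-character difference changes the prefix and forces a full recomputation.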