Prompt Caching
Prompt caching is a performance optimization where the model's computed internal representations (key-value attention states) of a static prompt prefix are stored and reused across multiple requests. Instead of recomputing these states every time the same system prompt or reference text is sent, the cached version is loaded directly, significantly reducing latency and computational cost for the repeated portion.
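The mechanism can be sketched as a server-side lookup table keyed by the prompt prefix: on a hit, the stored key-value states are returned instead of being recomputed. This is a minimal illustrative sketch, not any provider's real implementation; the class and function names are invented for the example.

```python
import hashlib

# Illustrative sketch of prompt caching: store the model's computed
# key-value (KV) states for a static prompt prefix, keyed by a hash
# of the prefix tokens, and reuse them across requests.

class PrefixKVCache:
    def __init__(self):
        self._store = {}  # prefix hash -> precomputed KV states

    def _key(self, prefix_tokens):
        joined = " ".join(map(str, prefix_tokens))
        return hashlib.sha256(joined.encode()).hexdigest()

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = self._key(prefix_tokens)
        if key in self._store:
            return self._store[key], True   # cache hit: skip recomputation
        kv = compute_kv(prefix_tokens)      # full forward pass over the prefix
        self._store[key] = kv
        return kv, False                    # cache miss: computed and stored

cache = PrefixKVCache()
system_prompt = list(range(8000))  # stand-in for an 8,000-token system prompt

# compute_kv is a placeholder for the expensive attention computation
kv1, hit1 = cache.get_or_compute(system_prompt, lambda toks: f"kv({len(toks)})")
kv2, hit2 = cache.get_or_compute(system_prompt, lambda toks: f"kv({len(toks)})")
print(hit1, hit2)  # first request computes, second request reuses the cache
```

Only the static prefix benefits: any tokens after the first point of divergence between two requests must still be computed from scratch.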
Example
An API application sends the same 8,000-token system prompt with every user request. With prompt caching enabled, the model computes the internal states for those 8,000 tokens once; on subsequent requests it loads the cached states instead of re-processing the prefix, reducing time to first token from roughly 2 seconds to around 200 milliseconds.
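To get cache hits like the example above, requests must be structured so the static content comes first and is byte-identical across calls. The sketch below shows that pattern; the message format and names are generic illustrations, not a specific provider's API.

```python
# Illustrative sketch: put the static system prompt first so repeated
# requests share an identical, cacheable prefix. Any per-request content
# goes after it, where divergence does not invalidate the prefix cache.

STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Inc. ..."  # ~8,000 tokens in practice

def build_request(user_message):
    # Static content first -> identical prefix across requests -> cache hit
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]

req_a = build_request("Where is my order?")
req_b = build_request("Cancel my subscription.")

# The shared prefix (the system message) is byte-identical, so a provider's
# prompt cache can reuse its computed states on the second request.
print(req_a[0] == req_b[0])
```

A common pitfall is injecting dynamic values (timestamps, user IDs) into the system prompt itself: even a one-character difference changes the prefix and forces a full recomputation.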