KV-Cache
A KV-cache (key-value cache) stores the attention key and value tensors computed for previously processed tokens so the model does not need to recalculate them when generating each new token. Without a KV-cache, the model would recompute attention for the entire input sequence at every generation step, so the cost of each step would grow quadratically with sequence length rather than linearly. This cache is what makes autoregressive text generation fast enough for real-time conversations.
Example
When ChatGPT generates a 500-word response, it produces one token at a time. After processing your 100-token prompt, the KV-cache stores the keys and values for those 100 tokens. For each subsequent token, the model computes attention only for the new token against the cached entries, instead of reprocessing all previous tokens from scratch.
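The per-token mechanics can be sketched in plain Python. This is a toy single-head dot-product attention loop, not a real transformer implementation; the KVCache class and its step method are illustrative names invented for this example, not any library's API.

```python
import math

class KVCache:
    """Toy single-head attention with a KV-cache (illustrative sketch)."""

    def __init__(self):
        self.keys = []    # cached key vectors, one per processed token
        self.values = []  # cached value vectors, one per processed token

    def step(self, q, k, v):
        """Process one new token: cache its key/value, then attend the
        new query against ALL cached keys. Old tokens are never re-encoded."""
        self.keys.append(k)
        self.values.append(v)
        d = len(q)
        # Scaled dot-product scores of the new query against every cached key
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in self.keys]
        # Numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Attention output: weighted sum of all cached value vectors
        return [sum(w * val[i] for w, val in zip(weights, self.values))
                for i in range(d)]
```

Each call to step does work proportional to the number of cached tokens, rather than re-running attention for the whole sequence; the cache simply grows by one key/value pair per generated token.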