Semantic Caching

Semantic caching is a pattern for caching LLM responses keyed by meaning similarity rather than exact prompt match. The incoming prompt is embedded, a vector store is queried for near-duplicates, and if cosine similarity exceeds a threshold, the cached response is returned.
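
The lookup flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `SemanticCache`, `toy_embed`, and `get_or_compute` are hypothetical names, the in-memory list stands in for a real vector store, and the bag-of-words `toy_embed` stands in for a real embedding model. A useful threshold depends on the embedding model, so the value a caller passes is something to tune.

```python
import math
import re
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text: str) -> Vector:
    """Placeholder embedding (assumption): word counts over a tiny fixed vocabulary."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(w)) for w in ("reset", "password", "forgot", "2fa", "how")]


class SemanticCache:
    """Hypothetical in-memory cache keyed by embedding similarity."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []  # (prompt embedding, response)

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        """Return the nearest cached response if its similarity exceeds the threshold."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:  # linear scan; a vector store would index this
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def get_or_compute(self, prompt: str, call_model: Callable[[str], str]) -> str:
        """Serve from the cache on a hit; otherwise call the model and store the result."""
        cached = self.get(prompt)
        if cached is not None:
            return cached
        response = call_model(prompt)
        self.put(prompt, response)
        return response
```

A caller would construct the cache once, for example `SemanticCache(toy_embed, threshold=0.8)`, and route every request through `get_or_compute`, passing whatever function actually calls the model.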

Semantic caching is distinct from prompt caching, which reuses computed prompt prefixes at the model layer rather than entire responses at the application layer. The trade-off is that a semantic cache hit skips the model call entirely, which makes it cheaper and faster, but it risks returning a stale or wrong answer when two prompts sit close together in embedding space yet ask for different things.

Example

A FAQ assistant embeds every incoming user question and checks a vector store of past Q&A pairs. If a user asks "how do I reset my password?" and a previously answered "I forgot my password — how do I reset it?" sits above the 0.92 cosine threshold, the cached answer is returned with no model call. If the user then asks "how do I reset my 2FA?" — lexically similar but a different task — the cache must miss, or the user gets wrong instructions; this is why threshold tuning and per-intent cache partitioning matter.
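
Per-intent partitioning can be sketched as follows. This is an illustrative example under stated assumptions rather than a recommended implementation: `PartitionedSemanticCache`, `toy_embed`, and `toy_intent` are hypothetical names, and the keyword-based `toy_intent` stands in for whatever intent classifier the application uses. The toy embedding deliberately ignores the words "password" and "2FA", so the two prompts get identical vectors. That mimics the failure mode where the embedding alone cannot tell the two tasks apart.

```python
import math
import re
from collections import defaultdict
from typing import Callable, DefaultDict, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text: str) -> Vector:
    """Placeholder embedding (assumption) that cannot see 'password' or '2fa'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(w)) for w in ("how", "do", "i", "reset", "my")]


def toy_intent(text: str) -> str:
    """Placeholder intent classifier (assumption): simple keyword matching."""
    lowered = text.lower()
    if "2fa" in lowered:
        return "two_factor"
    if "password" in lowered:
        return "password"
    return "general"


class PartitionedSemanticCache:
    """Hypothetical cache that only compares prompts sharing the same intent."""

    def __init__(
        self,
        embed: Callable[[str], Vector],
        classify_intent: Callable[[str], str],
        threshold: float = 0.92,
    ) -> None:
        self.embed = embed
        self.classify_intent = classify_intent
        self.threshold = threshold
        self.partitions: DefaultDict[str, List[Tuple[Vector, str]]] = defaultdict(list)

    def put(self, prompt: str, response: str) -> None:
        self.partitions[self.classify_intent(prompt)].append((self.embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        # .get() reads the partition without creating an empty one on a miss.
        for vector, response in self.partitions.get(self.classify_intent(prompt), []):
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


cache = PartitionedSemanticCache(toy_embed, toy_intent)
cache.put("how do I reset my password?", "Use the 'Forgot password' link.")
password_hit = cache.get("How do I reset my password")  # same intent, same vector: hit
two_factor_miss = cache.get("how do I reset my 2FA?")   # different partition: None
```

With a single unpartitioned cache and this embedding, the 2FA question would score a similarity of about 1.0 against the password entry and receive the wrong instructions. Partitioning forces the miss regardless of how the threshold is set.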
