Semantic Caching
Semantic caching is a pattern for caching LLM responses keyed by meaning similarity rather than exact prompt match. The incoming prompt is embedded, a vector store is queried for near-duplicates, and if cosine similarity exceeds a threshold, the cached response is returned. This is distinct from prompt caching, which reuses computed prompt prefixes at the model layer rather than entire responses at the application layer. The trade-off is clear: semantic caching is cheaper and faster on hits, but risks returning a stale or wrong answer when two prompts look similar in embedding space yet are semantically distinct.
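The lookup path above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the token-count embedding and the linear scan are toy stand-ins for a real embedding model and a vector store, and the `SemanticCache` name is invented for this example.

```python
import math

def embed(text):
    # Toy stand-in embedding: lowercase token counts. A real system would
    # call an embedding model; the cosine logic below is the same either way.
    counts = {}
    for tok in text.lower().split():
        tok = tok.strip("?.,!:;")
        if tok:
            counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a, b):
    # Cosine similarity over sparse token-count dicts.
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Illustrative cache: a linear scan stands in for a vector store."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (embedding, cached response)

    def get(self, prompt):
        query = embed(prompt)
        best_sim, best_resp = 0.0, None
        for emb, resp in self.entries:
            sim = cosine(query, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        # Serve the cached response only above the similarity threshold.
        return best_resp if best_sim >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))
```

On a miss, the application calls the model and `put`s the new prompt–response pair; a production system would swap the linear scan for an approximate-nearest-neighbor index and tune the threshold against real traffic.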
Example
A FAQ assistant embeds every incoming user question and checks a vector store of past Q&A pairs. If a user asks "how do I reset my password?" and a previously answered question, "I forgot my password — how do I reset it?", scores above the 0.92 cosine threshold, the cached answer is returned with no model call. If the user then asks "how do I reset my 2FA?", which is lexically similar but a different task, the cache must miss or the user gets wrong instructions; this is why threshold tuning and per-intent cache partitioning matter.
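The per-intent partitioning idea can be sketched as follows. The keyword router and the Jaccard token overlap are hypothetical stand-ins for a real intent classifier and embedding similarity; partitioning is what forces the 2FA question to miss even when it looks similar to a cached password question.

```python
from collections import defaultdict

def classify_intent(question):
    # Stand-in router using keyword rules; a production system would use a
    # trained classifier or a cheap routing model. Labels are illustrative.
    q = question.lower()
    if "2fa" in q:
        return "reset_2fa"
    if "password" in q:
        return "reset_password"
    return "other"

def token_similarity(a, b):
    # Jaccard overlap of lowercase tokens, standing in for embedding cosine.
    ta = {t.strip("?.,!") for t in a.lower().split()}
    tb = {t.strip("?.,!") for t in b.lower().split()}
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

class PartitionedCache:
    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.partitions = defaultdict(list)  # intent -> [(question, answer)]

    def put(self, question, answer):
        self.partitions[classify_intent(question)].append((question, answer))

    def get(self, question):
        # Only entries in the query's own intent partition are candidates,
        # so a lexically similar question from another intent cannot hit.
        for cached_q, answer in self.partitions[classify_intent(question)]:
            if token_similarity(question, cached_q) >= self.threshold:
                return answer
        return None
```

Here "how do I reset my 2FA?" routes to an empty `reset_2fa` partition and misses regardless of how similar it looks to cached password questions, which is the safety property partitioning buys.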