The three layers of LLM caching
An exact-match (response) cache is the simplest: hash the full request, and if you've seen it before, return the stored answer for free. It's perfect for deterministic, high-repeat calls — a fixed prompt over a fixed input. A semantic cache goes further: it embeds incoming prompts and serves a cached answer when a new prompt is close enough in meaning to one you've already answered, which catches paraphrases that an exact-match cache would miss.
Provider prompt caching is different and often overlooked. Both OpenAI and Anthropic discount the repeated prefix of a prompt — if your long system prompt is identical and sits at the front of the message, the provider reuses its internal computation and charges less for that portion. You don't store anything yourself; you just keep the stable part at the front and unchanged. It stacks with your own caches.
When to cache, and when not to
Cache aggressively when answers are stable: documentation Q&A, classification, summaries of unchanging content, and any prompt with a fixed long prefix. Be careful with personalized or time-sensitive responses — a cache that serves yesterday's stock price or another user's data is worse than no cache. Use a time-to-live (TTL) so entries expire, and scope keys by user where the answer depends on who's asking.
For semantic caching, the similarity threshold is the dial that matters: too loose and you serve a near-miss answer to a question it doesn't fit; too tight and you barely cache anything. Start conservative, sample the cache hits to confirm they're genuinely equivalent, then loosen carefully. The goal is free correct answers, never cheap wrong ones.
how it works
- 01
Find what's actually repeated
Upload a usage CSV to token·flow (or group your own logs by prompt hash) to see how many requests are exact or near-duplicates. If repeats are rare, caching won't help much — measure before you build.
- 02
Add an exact-match cache
Hash the full request and store the response keyed by that hash, with a sensible TTL. Serve hits straight from storage. This alone clears the easy, high-repeat traffic.
- 03
Layer in a semantic cache
Embed incoming prompts and serve a cached answer when a new prompt is close enough to one you've answered. Start with a conservative similarity threshold and sample hits to confirm they're equivalent.
- 04
Enable provider prompt caching
Keep your long, stable prompt prefix identical and at the front of the message so OpenAI or Anthropic discounts the repeated portion automatically. Don't reshuffle the prefix between calls.
- 05
Scope and expire correctly
Set TTLs so stale answers don't linger, and scope cache keys by user for anything personalized. Re-check a usage CSV afterward to confirm the cache is actually cutting paid calls.
frequently asked
- What's the difference between a response cache and prompt caching?
- A response cache is yours: you store the model's full answer and skip the call entirely on a repeat. Provider prompt caching is the provider's: OpenAI and Anthropic charge less for the repeated prefix of a prompt because they can reuse internal computation. The first saves the whole call; the second discounts part of the input. They stack.
- Isn't caching risky if answers should be fresh?
- It is if you cache the wrong things. Use it for stable answers — documentation, classification, summaries of unchanging content — and avoid it for personalized or time-sensitive responses, or scope and expire those entries aggressively with TTLs and per-user keys. Cache correctness over cache hit rate, always.
- How do I cache paraphrased questions, not just identical ones?
- That's what a semantic cache does. It embeds prompts and serves a stored answer when a new prompt is close enough in meaning. The trade-off is the similarity threshold: too loose serves near-miss answers, too tight caches almost nothing. Start tight, sample the hits, then loosen carefully.
Last updated June 15, 2026