Caching LLM responses: stop paying twice for the same answer

the short answer

Caching LLM responses means storing the model's answer so that identical or near-identical requests are served from storage instead of regenerated. An exact-match cache handles repeated requests, a semantic cache handles paraphrases, and provider prompt caching discounts repeated prompt prefixes. Each one cuts spend on work you'd otherwise pay for twice.

A surprising share of LLM calls are repeats: the same FAQ asked a thousand ways, the same document summarized again, the same system prompt sent on every single request. You pay full price for each of those, even when the answer is identical to one you generated minutes ago. Caching is how you stop.

There are three layers of caching, and they solve different problems. This page explains each one, when to reach for it, and how they stack. To see whether caching will help you, upload a usage CSV to token·flow — it flags identical and near-identical prompts so you know how much of your traffic is cacheable before you build anything.

10-50% of requests are exact or near-duplicate prompts in typical production LLM traffic. Source: token·flow usage analysis

The three layers of LLM caching

An exact-match (response) cache is the simplest: hash the full request, and if you've seen it before, return the stored answer for free. It's perfect for deterministic, high-repeat calls — a fixed prompt over a fixed input. A semantic cache goes further: it embeds incoming prompts and serves a cached answer when a new prompt is close enough in meaning to one you've already answered, which catches paraphrases that an exact-match cache would miss.
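A minimal sketch of the exact-match layer, assuming an in-memory dict as the store and a hypothetical `call_model` function standing in for whatever client you actually use. The key covers the model name and sampling parameters as well as the messages, on the assumption that any of them can change the answer.

```python
import hashlib
import json
from typing import Callable, Dict


def request_key(model: str, messages: list, **params) -> str:
    """Hash everything that can change the answer: model, messages, parameters."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,             # stable key regardless of argument order
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory response cache; `call_model` is a hypothetical stand-in for your client."""

    def __init__(self, call_model: Callable[..., str]):
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, messages: list, **params) -> str:
        key = request_key(model, messages, **params)
        if key in self._store:
            self.hits += 1          # repeat request: no paid call
            return self._store[key]
        self.misses += 1
        answer = self._call_model(model=model, messages=messages, **params)
        self._store[key] = answer
        return answer
```

In production the dict would likely be replaced by a shared store, but the keying logic stays the same.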

Provider prompt caching is different and often overlooked. Both OpenAI and Anthropic discount the repeated prefix of a prompt: if your long system prompt is identical and sits at the very front of the request, the provider reuses its internal computation and charges less for that portion. You don't store anything yourself; you keep the stable part at the front and unchanged. OpenAI applies the discount automatically once a prompt is long enough, while Anthropic asks you to mark the cacheable prefix in the request. It stacks with your own caches.
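To see why prefix order matters, here is a small sketch that compares how much of two prompts is shared from the start. It is only an illustration: providers match on tokens rather than characters, and `SYSTEM_PROMPT` and both builder functions are hypothetical names.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


# Hypothetical long, stable instruction block.
SYSTEM_PROMPT = "You are a support assistant for Example Co. " * 20


def prompt_prefix_first(question: str) -> str:
    """Stable part first, variable part last: the long prefix repeats across calls."""
    return SYSTEM_PROMPT + "\nUser question: " + question


def prompt_prefix_last(question: str) -> str:
    """Variable part first: prompts diverge almost immediately."""
    return "User question: " + question + "\n" + SYSTEM_PROMPT


q1 = "How do I reset my password?"
q2 = "Where is my invoice?"
good = shared_prefix_len(prompt_prefix_first(q1), prompt_prefix_first(q2))
bad = shared_prefix_len(prompt_prefix_last(q1), prompt_prefix_last(q2))
```

With the stable block first, `good` covers the whole system prompt; with it last, `bad` is only the few characters before the questions diverge, so there is almost nothing for a provider to reuse.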

When to cache, and when not to

Cache aggressively when answers are stable: documentation Q&A, classification, summaries of unchanging content, and any prompt with a fixed long prefix. Be careful with personalized or time-sensitive responses — a cache that serves yesterday's stock price or another user's data is worse than no cache. Use a time-to-live (TTL) so entries expire, and scope keys by user where the answer depends on who's asking.
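A rough sketch of TTLs and per-user scoping, assuming an in-memory store. The class name, the `user:`/`shared` scope labels, and the injectable `clock` are illustrative choices, not part of any particular library.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Response store whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock                      # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def make_key(prompt: str, user_id: Optional[str] = None) -> str:
        """Include the user in the key when the answer depends on who is asking."""
        scope = f"user:{user_id}" if user_id is not None else "shared"
        raw = f"{scope}\x00{prompt}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self._clock() >= expires_at:
            del self._store[key]                 # stale: drop it and report a miss
            return None
        return answer

    def put(self, key: str, answer: str) -> None:
        self._store[key] = (self._clock() + self._ttl, answer)
```

Passing `user_id` for personalized prompts means two users asking the same question never share an entry, while shared content such as documentation answers can omit it.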

For semantic caching, the similarity threshold is the dial that matters: too loose and you serve a near-miss answer to a question it doesn't fit; too tight and you barely cache anything. Start conservative, sample the cache hits to confirm they're genuinely equivalent, then loosen carefully. The goal is free correct answers, never cheap wrong ones.
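The threshold logic can be sketched as follows, assuming cosine similarity over embedding vectors and a linear scan of stored entries. The `embed` callable is a placeholder for whichever embedding model you use, and the default threshold of 0.95 is an arbitrary conservative starting point, not a recommendation from the source.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Serves a stored answer when a new prompt is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self._embed = embed              # hypothetical: plug in your embedding call
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str, str]] = []  # (vector, prompt, answer)

    def add(self, prompt: str, answer: str) -> None:
        self._entries.append((list(self._embed(prompt)), prompt, answer))

    def lookup(self, prompt: str) -> Optional[Tuple[str, str, float]]:
        """Return (matched_prompt, answer, similarity) for the best match, or None."""
        vec = self._embed(prompt)
        best: Optional[Tuple[str, str, float]] = None
        for stored_vec, stored_prompt, answer in self._entries:
            score = cosine(vec, stored_vec)
            if best is None or score > best[2]:
                best = (stored_prompt, answer, score)
        if best is not None and best[2] >= self.threshold:
            return best
        return None
```

Returning the matched prompt and score alongside the answer makes it easy to log hits and sample them by hand before loosening the threshold.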

how it works

  1. Find what's actually repeated

    Upload a usage CSV to token·flow (or group your own logs by prompt hash) to see how many requests are exact or near-duplicates. If repeats are rare, caching won't help much — measure before you build.

  2. Add an exact-match cache

    Hash the full request and store the response keyed by that hash, with a sensible TTL. Serve hits straight from storage. This alone clears the easy, high-repeat traffic.

  3. Layer in a semantic cache

    Embed incoming prompts and serve a cached answer when a new prompt is close enough to one you've answered. Start with a conservative similarity threshold and sample hits to confirm they're equivalent.

  4. Enable provider prompt caching

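    Keep your long, stable prompt prefix identical and at the front of the request so the provider can discount the repeated portion. OpenAI does this automatically for sufficiently long prompts; Anthropic expects the cacheable prefix to be marked in the request. Don't reshuffle the prefix between calls.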

  5. Scope and expire correctly

    Set TTLs so stale answers don't linger, and scope cache keys by user for anything personalized. Re-check a usage CSV afterward to confirm the cache is actually cutting paid calls.
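If you would rather measure repeats from your own logs, as the first step suggests, grouping by prompt hash can be sketched like this. It assumes a CSV with a column named `prompt`, which is a guess at your log layout, and it counts only exact duplicates after collapsing whitespace.

```python
import csv
import hashlib
import io
from collections import Counter


def normalize(prompt: str) -> str:
    """Collapse runs of whitespace so trivially different prompts hash alike."""
    return " ".join(prompt.split())


def repeat_rate(csv_text: str, prompt_column: str = "prompt") -> float:
    """Share of requests that repeat an earlier prompt (0.0 when the log is empty)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        hashlib.sha256(normalize(row[prompt_column]).encode("utf-8")).hexdigest()
        for row in reader
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeats = total - len(counts)   # every request after the first per hash
    return repeats / total
```

A rate near zero suggests an exact-match cache will not pay for itself; near-duplicates would need an embedding-based comparison on top of this.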

frequently asked

What's the difference between a response cache and prompt caching?
A response cache is yours: you store the model's full answer and skip the call entirely on a repeat. Provider prompt caching is the provider's: OpenAI and Anthropic charge less for the repeated prefix of a prompt because they can reuse internal computation. The first saves the whole call; the second discounts part of the input. They stack.
Isn't caching risky if answers should be fresh?
It is if you cache the wrong things. Use it for stable answers — documentation, classification, summaries of unchanging content — and avoid it for personalized or time-sensitive responses, or scope and expire those entries aggressively with TTLs and per-user keys. Cache correctness over cache hit rate, always.
How do I cache paraphrased questions, not just identical ones?
That's what a semantic cache does. It embeds prompts and serves a stored answer when a new prompt is close enough in meaning. The trade-off is the similarity threshold: too loose serves near-miss answers, too tight caches almost nothing. Start tight, sample the hits, then loosen carefully.

Last updated June 15, 2026
