Where the wasted input tokens hide
Most prompt bloat lives in three places. The first is the system prompt: long, defensive instructions added over months, many of which the model already follows unprompted or which contradict each other. The second is few-shot examples: three near-identical examples teach the same thing one good example would, at three times the token cost. The third is retrieved context in RAG: pipelines routinely send far more chunks than the question needs, and every unused chunk is billed as input without improving the answer.
Compress in that order, because that's roughly the order of payoff. Read your system prompt line by line and delete anything the model honors without it. Cut few-shot examples to the minimum that holds accuracy. Then tune your retriever to return fewer, more relevant chunks, or summarize long documents before they enter the prompt.
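Before cutting anything, it can help to see which of the three places carries the most weight in a given prompt. The sketch below is a rough illustration, not a real tokenizer: `estimate_tokens` assumes about four characters per token, and every function and argument name here is hypothetical. Swap in your provider's tokenizer for real counts.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate assuming ~4 characters per token (an assumption, not a tokenizer)."""
    return (len(text) + 3) // 4


def token_breakdown(system_prompt, few_shot_examples, retrieved_chunks, question):
    """Return {component: (estimated_tokens, percent_of_prompt)}."""
    counts = {
        "system_prompt": estimate_tokens(system_prompt),
        "few_shot": sum(estimate_tokens(e) for e in few_shot_examples),
        "retrieved_context": sum(estimate_tokens(c) for c in retrieved_chunks),
        "question": estimate_tokens(question),
    }
    total = sum(counts.values()) or 1  # avoid dividing by zero on an empty prompt
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}


# Placeholder strings stand in for real prompt parts.
breakdown = token_breakdown(
    system_prompt="s" * 400,
    few_shot_examples=["e" * 200] * 3,
    retrieved_chunks=["c" * 800] * 5,
    question="q" * 40,
)
# Print the heaviest component first.
for name, (tokens, percent) in sorted(breakdown.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: ~{tokens} tokens ({percent}%)")
```

Whichever component tops the list is the one to compress first for that prompt, even if it differs from the usual order.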
Compress without breaking quality
The risk with compression is cutting something load-bearing. Avoid it by changing one thing at a time and checking output on a small evaluation set before and after — if accuracy holds, the tokens were waste. Keep the instructions that actually steer behavior (format requirements, tone, hard constraints) and drop the ones that restate common sense.
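A minimal sketch of that before-and-after check, under stated assumptions: `run_model` is a hypothetical callable that wraps your own model call, scoring is exact match (real evaluations often need something looser), and `fake_model` exists only so the example runs offline.

```python
from typing import Callable, Sequence

# (question, expected answer) pairs; a real set would hold your own cases.
Case = tuple[str, str]


def accuracy(run_model: Callable[[str, str], str], prompt: str, cases: Sequence[Case]) -> float:
    """Fraction of cases where the model's answer exactly matches the expected one."""
    hits = sum(1 for question, expected in cases if run_model(prompt, question).strip() == expected)
    return hits / len(cases)


def cut_is_safe(run_model, baseline_prompt, trimmed_prompt, cases, tolerance=0.0) -> bool:
    """True if the trimmed prompt scores within `tolerance` of the baseline."""
    before = accuracy(run_model, baseline_prompt, cases)
    after = accuracy(run_model, trimmed_prompt, cases)
    return after >= before - tolerance


def fake_model(prompt: str, question: str) -> str:
    """Offline stand-in for a model: it only uppercases when the prompt asks for it."""
    answer = "yes"
    return answer.upper() if "uppercase" in prompt else answer


cases = [("Is 4 even?", "YES"), ("Is 10 even?", "YES")]
baseline = "Be helpful. Answer in uppercase. Be polite."

print(cut_is_safe(fake_model, baseline, "Answer in uppercase.", cases))    # drops filler only
print(cut_is_safe(fake_model, baseline, "Be helpful. Be polite.", cases))  # drops the format rule
```

In this toy setup the first cut removes only filler and the second removes the format instruction, which is the kind of load-bearing line the check is meant to catch.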
Two structural tricks help a lot. First, put long, stable content (a big reference block, a fixed instruction set) at the front of the prompt and the parts that change per request at the end, so your provider's prompt caching can discount the repeated prefix on later calls. Second, for very long inputs, summarize first: a cheap model can compress a 10,000-token document into a 1,000-token brief that the expensive model then reasons over. That is cheaper than feeding the full text to the expensive model every time, provided the cheap model's per-token price is well below the expensive model's.
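The summarize-first arithmetic can be checked with a few lines. This is a sketch with made-up numbers: the per-million-token prices are hypothetical placeholders, not any provider's rates, and the function assumes the brief is written once and reused across `calls` requests.

```python
def usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD for `tokens` at a price quoted per million tokens."""
    return tokens * price_per_million / 1_000_000


def summarize_first_savings(doc_tokens, brief_tokens, calls, cheap_in, cheap_out, expensive_in):
    """Savings in USD from summarizing once and reusing the brief.

    Positive means summarize-first is cheaper. The expensive model's output
    cost is left out because it is assumed to be the same either way.
    """
    direct = calls * usd(doc_tokens, expensive_in)
    summarized = (
        usd(doc_tokens, cheap_in)                  # cheap model reads the full document once
        + usd(brief_tokens, cheap_out)             # cheap model writes the brief once
        + calls * usd(brief_tokens, expensive_in)  # expensive model reads the brief on each call
    )
    return direct - summarized


# Hypothetical prices in USD per million tokens.
one_call = summarize_first_savings(10_000, 1_000, calls=1, cheap_in=0.5, cheap_out=1.5, expensive_in=5.0)
many_calls = summarize_first_savings(10_000, 1_000, calls=20, cheap_in=0.5, cheap_out=1.5, expensive_in=5.0)
no_price_gap = summarize_first_savings(10_000, 1_000, calls=1, cheap_in=5.0, cheap_out=5.0, expensive_in=5.0)

print(f"1 call: {one_call:.4f}  20 calls: {many_calls:.4f}  same-price models: {no_price_gap:.4f}")
```

The last case goes negative: when the summarizer costs as much as the main model, a single-use summary adds cost instead of removing it.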
How it works
- 01
Audit the system prompt
Read it line by line. Delete any instruction the model already follows without it, and merge contradictory or duplicated rules. Keep only the lines that visibly change behavior.
- 02
Trim few-shot examples
Replace several near-identical examples with one or two clear ones. Verify accuracy holds on a small test set — if it does, you just removed paid tokens from every call.
- 03
Filter retrieved context
For RAG, return only the few most relevant chunks instead of casting a wide net. Tighten your similarity threshold or add a re-ranking step so unused context never enters the prompt.
- 04
Summarize long inputs first
Use a cheap model to compress long documents into a short brief, then send the brief to your main model. You pay a little on the small model to save a lot on the expensive one.
- 05
Order for caching, then measure
Move stable prefixes to the front so prompt caching can discount them. Re-run a usage CSV through token·flow to confirm input tokens per call actually dropped.
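Step 03 is the easiest of these to show in code. The sketch below assumes your retriever already returns a similarity score per chunk, with higher meaning more relevant. `ScoredChunk`, `filter_chunks`, and the default threshold and `top_k` values are hypothetical and should be tuned against your own evaluation set.

```python
from typing import NamedTuple


class ScoredChunk(NamedTuple):
    text: str
    score: float  # similarity score; higher is assumed to mean more relevant


def filter_chunks(chunks, min_score=0.75, top_k=3):
    """Drop chunks below the threshold, then keep the top_k highest-scoring ones."""
    relevant = [c for c in chunks if c.score >= min_score]
    relevant.sort(key=lambda c: c.score, reverse=True)
    return relevant[:top_k]


retrieved = [
    ScoredChunk("refund policy", 0.90),
    ScoredChunk("office hours", 0.50),
    ScoredChunk("refund deadlines", 0.80),
    ScoredChunk("shipping costs", 0.76),
    ScoredChunk("refund exceptions", 0.95),
]

kept = filter_chunks(retrieved)
print([c.text for c in kept])
```

Applying the threshold before the `top_k` cut means a weak query returns fewer chunks rather than padding the prompt with marginal ones.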
Frequently asked
- Will compressing my prompt make answers worse?
- Only if you cut something the model relied on. The safe method is to change one thing, then check output quality on a small fixed set of test cases. If accuracy holds after a cut, those tokens were genuinely waste. Keep the instructions that steer behavior; remove the ones that restate the obvious.
- Does prompt compression help with output tokens too?
- Indirectly. A tighter prompt with a clear format instruction tends to produce a tighter, on-format response, which means fewer output tokens and fewer retries. But the direct savings of compression are on the input side — pair it with a max_tokens cap to control the output side as well.
- Is it worth compressing prompts on low-volume endpoints?
- Usually not. Compression pays off on prompts that run constantly or carry very heavy context. A rarely-called endpoint with a small prompt isn't worth the effort. Rank your prompts by total token weight — call volume times prompt size — and compress from the top.
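The ranking in that last answer is a one-line sort. In this sketch the endpoint names and numbers are invented for illustration, and "total token weight" is taken to be daily call volume times prompt size in tokens.

```python
def rank_by_token_weight(endpoints):
    """Sort endpoints by calls_per_day * prompt_tokens, heaviest first."""
    return sorted(
        endpoints,
        key=lambda e: e["calls_per_day"] * e["prompt_tokens"],
        reverse=True,
    )


# Invented example data.
endpoints = [
    {"name": "support-chat", "calls_per_day": 50_000, "prompt_tokens": 1_200},
    {"name": "weekly-report", "calls_per_day": 5, "prompt_tokens": 8_000},
    {"name": "doc-search", "calls_per_day": 2_000, "prompt_tokens": 6_000},
]

for e in rank_by_token_weight(endpoints):
    print(e["name"], e["calls_per_day"] * e["prompt_tokens"])
```

Here the endpoint with the largest prompt lands last because it is called so rarely, which is the point of ranking by weight rather than by prompt size alone.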
Last updated June 15, 2026