Prompt compression: send fewer tokens without losing the answer

the short answer

Prompt compression means sending fewer input tokens for the same result. Trim the system prompt to what changes behavior, replace long few-shot examples with concise ones, summarize or filter retrieved context, and remove redundant instructions, so you pay for less input on every call.

Every token in your prompt is a token you pay for, on every call, forever. A bloated system prompt or an over-stuffed context window doesn't just cost more once — it taxes every request that uses it. Prompt compression is the practice of cutting that input down to the smallest version that still produces the answer you need.

This isn't about making prompts cryptic or unreadable. It's about removing the parts that don't earn their keep: filler instructions the model already follows, examples that repeat the same lesson, and retrieved context the question never needed. This page walks through the techniques and where to apply them. To find your heaviest prompts, upload a usage CSV to token·flow — it ranks prompts by token weight so you compress the ones that matter.

40-70% of system-prompt tokens are often removable without changing model behavior. Source: token·flow usage analysis

Where the wasted input tokens hide

Most prompt bloat lives in three places. The first is the system prompt: long, defensive instructions added over months, many of which the model already respects or which contradict each other. The second is few-shot examples — three near-identical examples teach the same thing one good example would, at three times the token cost. The third is retrieved context in RAG: pipelines routinely send far more chunks than the question needs, and each unused chunk is pure waste.

Compress in that order, because that's roughly the order of payoff. Read your system prompt line by line and delete anything the model honors without it. Cut few-shot examples to the minimum that holds accuracy. Then tune your retriever to return fewer, more relevant chunks, or summarize long documents before they enter the prompt.
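As a minimal sketch of the retrieval step, assuming your retriever already returns (chunk, similarity score) pairs: the hypothetical `select_chunks` helper below drops chunks under a score threshold and caps how many survive. The function name, the 0.75 threshold, and the cap of 3 are illustrative assumptions, not values from any particular retriever.

```python
from typing import List, Tuple


def select_chunks(
    scored_chunks: List[Tuple[str, float]],
    min_score: float = 0.75,  # assumed threshold; tune on your own data
    max_chunks: int = 3,      # assumed cap; tune on your own data
) -> List[str]:
    """Keep at most max_chunks chunks whose score is at least min_score."""
    relevant = [(text, score) for text, score in scored_chunks if score >= min_score]
    # Highest-scoring chunks first, so the cap drops the weakest matches.
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:max_chunks]]


# Made-up retrieval results for illustration.
candidates = [
    ("Refund policy: 30 days.", 0.91),
    ("Office opening hours.", 0.42),
    ("Refunds need a receipt.", 0.83),
    ("Company history.", 0.30),
]
kept = select_chunks(candidates)  # only the two refund chunks pass the threshold
```

Whether a threshold, a cap, or a re-ranker works best depends on your data, so treat the numbers as starting points to verify against answer quality.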

Compress without breaking quality

The risk with compression is cutting something load-bearing. Avoid it by changing one thing at a time and checking output on a small evaluation set before and after — if accuracy holds, the tokens were waste. Keep the instructions that actually steer behavior (format requirements, tone, hard constraints) and drop the ones that restate common sense.
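The before-and-after check can be sketched roughly like this. Everything here is an assumption for illustration: `accuracy` and `compression_is_safe` are hypothetical helpers, exact string match stands in for whatever scoring fits your task, and `fake_model` is a stub that replaces a real API call so the example runs on its own.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[str, str]         # (user input, expected answer)
Model = Callable[[str, str], str]  # (system prompt, user input) -> output


def accuracy(run_model: Model, system_prompt: str, cases: List[TestCase]) -> float:
    """Fraction of cases where the output exactly matches the expected answer."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        1
        for user_input, expected in cases
        if run_model(system_prompt, user_input).strip() == expected
    )
    return hits / len(cases)


def compression_is_safe(
    run_model: Model,
    original: str,
    compressed: str,
    cases: List[TestCase],
    tolerance: float = 0.0,  # allowed accuracy drop; an assumption to tune
) -> bool:
    """True if the compressed prompt scores no worse than original minus tolerance."""
    before = accuracy(run_model, original, cases)
    after = accuracy(run_model, compressed, cases)
    return after >= before - tolerance


def fake_model(system_prompt: str, user_input: str) -> str:
    """Stand-in for a real API call, used only to make this sketch runnable."""
    if "uppercase" in system_prompt.lower():
        return user_input.upper()
    return user_input


cases = [("hi", "HI"), ("ok", "OK")]
original = "Be helpful. Be polite. Reply in uppercase."
safe_cut = compression_is_safe(fake_model, original, "Reply in uppercase.", cases)
unsafe_cut = compression_is_safe(fake_model, original, "Be helpful. Be polite.", cases)
```

In this toy setup the first cut keeps the one load-bearing instruction and passes, while the second removes it and fails. With a real model, outputs vary between runs, so a larger test set or a small tolerance may be more realistic than exact equality.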

Two structural tricks help a lot. First, put long, stable content (a big reference block, a fixed instruction set) at the front of the prompt, because provider prompt caching typically matches on an identical leading prefix and can discount it on repeat calls. Second, for very long inputs, summarize first: a cheap model can compress a 10,000-token document into a 1,000-token brief that the expensive model then reasons over, which is cheaper than feeding the full text to the expensive model every time.
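To make the summarize-first arithmetic concrete, here is a small sketch using the 10,000-token document and 1,000-token brief from above. The per-million-token prices are invented placeholders, not any provider's real rates, and the function names are hypothetical. Output tokens from the expensive model are left out because they are the same in both setups.

```python
def direct_cost(doc_tokens: int, calls: int, big_in_price: float) -> float:
    """Cost of sending the full document to the expensive model on every call."""
    return calls * doc_tokens * big_in_price / 1_000_000


def summarize_first_cost(
    doc_tokens: int,
    brief_tokens: int,
    calls: int,
    big_in_price: float,
    small_in_price: float,
    small_out_price: float,
) -> float:
    """Cost of summarizing once on the cheap model, then sending only the brief."""
    summarize_once = (doc_tokens * small_in_price + brief_tokens * small_out_price) / 1_000_000
    reasoning = calls * brief_tokens * big_in_price / 1_000_000
    return summarize_once + reasoning


# Assumed prices in dollars per million tokens (placeholders, not real rates).
BIG_IN, SMALL_IN, SMALL_OUT = 10.00, 0.50, 1.50

full = direct_cost(10_000, calls=100, big_in_price=BIG_IN)
brief = summarize_first_cost(
    10_000, 1_000, calls=100,
    big_in_price=BIG_IN, small_in_price=SMALL_IN, small_out_price=SMALL_OUT,
)
```

With these assumed prices, 100 calls cost about $10.00 when the full document is sent each time and about $1.01 with the summarize-first approach. The sketch ignores prompt-caching discounts and any accuracy lost in summarization, both of which you would want to check for your own workload.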

how it works

  1. Audit the system prompt

    Read it line by line. Delete any instruction the model already follows without it, and merge contradictory or duplicated rules. Keep only the lines that visibly change behavior.

  2. Trim few-shot examples

    Replace several near-identical examples with one or two clear ones. Verify accuracy holds on a small test set. If it does, you just removed paid tokens from every call.

  3. Filter retrieved context

    For RAG, return the top few most-relevant chunks instead of a wide net. Tighten your similarity threshold or re-rank so unused context never enters the prompt.

  4. Summarize long inputs first

    Use a cheap model to compress long documents into a short brief, then send the brief to your main model. You pay a little on the small model to save a lot on the expensive one.

  5. Order for caching, then measure

    Move stable prefixes to the front so prompt caching can discount them. Re-run a usage CSV through token·flow to confirm input tokens per call actually dropped.
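The ordering step can be illustrated with a short sketch. It assumes caching rewards an identical leading prefix and uses shared leading characters as a rough proxy for shared tokens. `build_prompt` and `shared_prefix_chars` are hypothetical helpers, and real caching rules (minimum prefix length, explicit cache markers, expiry) vary by provider.

```python
import os
from typing import List


def build_prompt(first_blocks: List[str], later_blocks: List[str]) -> str:
    """Join blocks in order; pass stable content first to maximize the shared prefix."""
    return "\n\n".join(first_blocks + later_blocks)


def shared_prefix_chars(prompt_a: str, prompt_b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))


stable = ["You are a support assistant.", "REFERENCE: refunds within 30 days."]
question_1 = ["Question: can I return shoes?"]
question_2 = ["Question: where is my order?"]

# Stable content first: both calls share the whole instruction and reference block.
stable_first = shared_prefix_chars(
    build_prompt(stable, question_1), build_prompt(stable, question_2)
)

# Dynamic content first: the prompts diverge almost immediately.
dynamic_first = shared_prefix_chars(
    build_prompt(question_1, stable), build_prompt(question_2, stable)
)
```

Here `stable_first` covers the entire stable block, while `dynamic_first` stops right after the word "Question: ", so almost nothing would be reusable between calls.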

frequently asked

Will compressing my prompt make answers worse?
Only if you cut something the model relied on. The safe method is to change one thing, then check output quality on a small fixed set of test cases. If accuracy holds after a cut, those tokens were genuinely waste. Keep the instructions that steer behavior; remove the ones that restate the obvious.

Does prompt compression help with output tokens too?
Indirectly. A tighter prompt with a clear format instruction tends to produce a tighter, on-format response, which means fewer output tokens and fewer retries. But the direct savings of compression are on the input side, so pair it with a max_tokens cap to control the output side as well.

Is it worth compressing prompts on low-volume endpoints?
Usually not. Compression pays off on prompts that run constantly or carry very heavy context. A rarely-called endpoint with a small prompt isn't worth the effort. Rank your prompts by total token weight (call volume times prompt size) and compress from the top.
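A rough sketch of that ranking, assuming a usage export with one row per call. The column names `prompt_id` and `input_tokens` are hypothetical and may not match your actual export, so adjust them to whatever your CSV contains. Summing input tokens per prompt gives the same total as call volume times average prompt size.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_by_token_weight(csv_text: str) -> List[Tuple[str, int]]:
    """Sum input tokens per prompt and return prompts sorted heaviest first."""
    totals: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["prompt_id"]] += int(row["input_tokens"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up usage data: one row per call.
usage = """prompt_id,input_tokens
support_bot,1200
support_bot,1180
weekly_report,3000
support_bot,1210
classifier,300
classifier,310
"""
ranking = rank_by_token_weight(usage)
```

In this sample the frequently called `support_bot` prompt outweighs the larger but rarely called `weekly_report` prompt, which is why ranking by total weight rather than prompt size alone points you at the right target.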

Last updated June 15, 2026
