how to

How to reduce your OpenAI API costs without breaking your product

the short answer

You reduce OpenAI API costs by right-sizing the model to the task, capping max_tokens, trimming the prompt and context you send, caching repeated calls, and using the Batch API for non-urgent work — output tokens cost the most, so cutting completion length usually saves more than anything else.

OpenAI bills you per token, split into input (the prompt and context you send) and output (the completion the model generates). Output tokens almost always cost more per token than input, so the single biggest lever is usually how much the model writes back, not how much you send in. Most teams discover this only after the bill arrives.

This guide walks the cheapest, lowest-risk wins first, in the order we'd actually do them. None of them require a rewrite of your app. If you want the cuts ranked for your own account, you can upload a usage CSV from the OpenAI dashboard to token·flow and it points at the costliest prompts and the ones you're sending over and over.

up to 30%of LLM spend is typically wasted on redundant or oversized promptsSource: token·flow usage analysis

Why output tokens are the expensive half

On most OpenAI models, generated (output) tokens are priced higher than prompt (input) tokens — often two to four times higher. That means an endpoint that returns a 1,500-token essay when a one-line answer would do is burning money on every call. Capping max_tokens, asking for terse answers, and using structured outputs (so the model returns clean JSON instead of rambling prose that you then have to retry) all attack the expensive half of the bill directly.

Structured outputs help twice: shorter responses, and fewer failed parses that trigger a second paid call. A retry is two completions where you wanted one. If 10% of your calls retry because the model returned malformed output, you're paying a 10% tax for nothing.

Send less, and send it less often

On the input side, the wins are trimming and reuse. Trim the system prompt down to what actually changes behavior, and for retrieval-augmented generation (RAG), stop stuffing the whole knowledge base into context — return the top few chunks, not the top fifty. Then look at repetition: if many calls share an identical long prefix (a big system prompt, a fixed instruction block), OpenAI's prompt caching discounts the repeated portion automatically when you keep that prefix stable and at the front of the message.

Finally, separate urgent calls from background ones. Anything that doesn't need an instant answer — nightly summaries, bulk classification, enrichment jobs — can go through the Batch API at a lower rate. Streaming, by contrast, changes the user experience but not the price: you pay for the same tokens whether they arrive all at once or word by word.

how it works

  1. 01

    Right-size the model

    Don't run a frontier model for classification, extraction, or routing. Move simple, high-volume calls to a smaller, cheaper model and keep the expensive one for genuinely hard reasoning. This is usually the single largest saving.

  2. 02

    Cap max_tokens and ask for brevity

    Set a max_tokens ceiling so a runaway generation can't cost 10x. Tell the model to be concise, and use structured outputs (JSON schema) where you need machine-readable results — shorter answers and fewer retries.

  3. 03

    Trim the prompt and the context

    Cut the system prompt to what changes behavior. For RAG, return the top few chunks instead of dumping the whole corpus. Every token you don't send is a token you don't pay for.

  4. 04

    Cache and deduplicate

    Keep long, stable prompt prefixes at the front so OpenAI's prompt caching discounts them. Add a response cache for identical requests so you never pay twice for the same answer.

  5. 05

    Batch the non-urgent work

    Move background jobs to the Batch API for a lower rate. Measure before and after — pull a usage CSV and feed it to token·flow to confirm the spend actually dropped where you expected.

frequently asked

What's the fastest way to cut my OpenAI bill?
Right-sizing the model. Most teams default to their most capable model for everything, including tasks a much cheaper model handles perfectly — classification, tagging, short extractions. Moving those calls to a smaller model often cuts spend more than any prompt tweak, and it takes minutes.
Does capping max_tokens hurt answer quality?
Only if you set it too low for the task. A cap stops runaway generations and accidental essays; it doesn't shorten a genuinely needed answer unless the ceiling is below what the task requires. Set it to a comfortable headroom above your typical good response, not at the average length.
How do I know which prompts are costing me the most?
Export your usage from the OpenAI dashboard as a CSV and look at total tokens per endpoint or per prompt, not just call count — a low-volume endpoint with huge prompts can outspend a high-volume one. token·flow does this ranking for you and flags the prompts you're repeating.
Is streaming cheaper than waiting for the full response?
No. Streaming changes how the response arrives (token by token, so the user sees it sooner) but you pay for exactly the same tokens. It's a UX improvement, not a cost one.

Last updated June 15, 2026

ready to try token·flow?

analyze your usage