Why output tokens are the expensive half
On most OpenAI models, generated (output) tokens are priced higher than prompt (input) tokens — often two to four times higher. That means an endpoint that returns a 1,500-token essay when a one-line answer would do is burning money on every call. Capping max_tokens, asking for terse answers, and using structured outputs (so the model returns clean JSON instead of rambling prose that you then have to retry) all attack the expensive half of the bill directly.
Structured outputs help twice: shorter responses, and fewer failed parses that trigger a second paid call. A retry is two completions where you wanted one. If 10% of your calls retry because the model returned malformed output, you're paying a 10% tax for nothing.
Send less, and send it less often
On the input side, the wins are trimming and reuse. Trim the system prompt down to what actually changes behavior, and for retrieval-augmented generation (RAG), stop stuffing the whole knowledge base into context — return the top few chunks, not the top fifty. Then look at repetition: if many calls share an identical long prefix (a big system prompt, a fixed instruction block), OpenAI's prompt caching discounts the repeated portion automatically when you keep that prefix stable and at the front of the message.
Finally, separate urgent calls from background ones. Anything that doesn't need an instant answer — nightly summaries, bulk classification, enrichment jobs — can go through the Batch API at a lower rate. Streaming, by contrast, changes the user experience but not the price: you pay for the same tokens whether they arrive all at once or word by word.
how it works
- 01
Right-size the model
Don't run a frontier model for classification, extraction, or routing. Move simple, high-volume calls to a smaller, cheaper model and keep the expensive one for genuinely hard reasoning. This is usually the single largest saving.
- 02
Cap max_tokens and ask for brevity
Set a max_tokens ceiling so a runaway generation can't cost 10x. Tell the model to be concise, and use structured outputs (JSON schema) where you need machine-readable results — shorter answers and fewer retries.
- 03
Trim the prompt and the context
Cut the system prompt to what changes behavior. For RAG, return the top few chunks instead of dumping the whole corpus. Every token you don't send is a token you don't pay for.
- 04
Cache and deduplicate
Keep long, stable prompt prefixes at the front so OpenAI's prompt caching discounts them. Add a response cache for identical requests so you never pay twice for the same answer.
- 05
Batch the non-urgent work
Move background jobs to the Batch API for a lower rate. Measure before and after — pull a usage CSV and feed it to token·flow to confirm the spend actually dropped where you expected.
frequently asked
- What's the fastest way to cut my OpenAI bill?
- Right-sizing the model. Most teams default to their most capable model for everything, including tasks a much cheaper model handles perfectly — classification, tagging, short extractions. Moving those calls to a smaller model often cuts spend more than any prompt tweak, and it takes minutes.
- Does capping max_tokens hurt answer quality?
- Only if you set it too low for the task. A cap stops runaway generations and accidental essays; it doesn't shorten a genuinely needed answer unless the ceiling is below what the task requires. Set it to a comfortable headroom above your typical good response, not at the average length.
- How do I know which prompts are costing me the most?
- Export your usage from the OpenAI dashboard as a CSV and look at total tokens per endpoint or per prompt, not just call count — a low-volume endpoint with huge prompts can outspend a high-volume one. token·flow does this ranking for you and flags the prompts you're repeating.
- Is streaming cheaper than waiting for the full response?
- No. Streaming changes how the response arrives (token by token, so the user sees it sooner) but you pay for exactly the same tokens. It's a UX improvement, not a cost one.
Last updated June 15, 2026