The four signs that cost the most
Sign one: a frontier model doing grunt work. If your most expensive model handles classification, routing, tagging, or short extraction, you're overpaying, because a smaller model usually handles those just as well. Fix: route simple, high-volume calls to a cheaper model and reserve the expensive one for hard reasoning.
Sign two: the same prompt over and over. Identical or near-identical requests that are regenerated every time are pure waste. Fix: an exact-match or semantic cache, plus provider prompt caching for stable prefixes.
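
The first two fixes can be sketched in a few lines. This is a minimal illustration, not a production router: `call_model` is a hypothetical stand-in for whatever client your provider gives you, the model names `small-model` and `frontier-model` are placeholders, and the set of "simple" task labels is an assumption you would replace with your own. It shows exact-match caching only; semantic caching and provider-side prompt caching are not covered here.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical task labels treated as "simple"; adjust to your own workload.
SIMPLE_TASKS = {"classification", "routing", "tagging", "short_extraction"}


def pick_model(task: str) -> str:
    """Send simple, high-volume tasks to the cheaper model (names are placeholders)."""
    return "small-model" if task in SIMPLE_TASKS else "frontier-model"


def cache_key(model: str, prompt: str) -> str:
    """Exact-match key: same model and same prompt text after whitespace normalization."""
    normalized = " ".join(prompt.split())
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


class CachedClient:
    """Wraps a model-calling function with routing and an in-memory exact-match cache."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model  # hypothetical: (model, prompt) -> response text
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, task: str, prompt: str) -> str:
        model = pick_model(task)
        key = cache_key(model, prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        response = self._call_model(model, prompt)
        self._cache[key] = response
        return response
```

The `hits` and `misses` counters are there so you can check how much repetition the cache is actually absorbing before you invest in anything more elaborate.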
Sign three: runaway output. Responses far longer than the task needs, or frequent retries caused by malformed output, inflate the output side of the bill, which is the more expensive side per token. Fix: cap max_tokens, ask for brevity, and use structured outputs so you parse cleanly the first time.
Sign four: oversized context. RAG pipelines that send fifty chunks when five would answer, or conversation histories that grow without bound, pay for input the model never needed. Fix: trim retrieval, summarize long history, and compress the system prompt.
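
For the oversized-context fix, a rough sketch of trimming might look like the following. Two assumptions to flag: `estimate_tokens` uses a crude characters-divided-by-four heuristic rather than a real tokenizer, and `trim_history` simply drops the oldest turns instead of summarizing them, which is the simpler of the two options. All function names are illustrative.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); use a real tokenizer for billing-grade counts."""
    return max(1, len(text) // 4)


def top_chunks(scored_chunks: List[Tuple[float, str]], k: int = 5) -> List[str]:
    """Keep only the k highest-scoring retrieved chunks."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:k]]


def trim_history(messages: List[str], budget_tokens: int) -> List[str]:
    """Keep the most recent messages that fit in the budget, in original order."""
    kept: List[str] = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget_tokens:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    kept.reverse()
    return kept
```

Dropping old turns loses information that a summary would keep, so treat this as a starting point and compare answer quality before and after.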
How to confirm it's actually happening
Each sign is measurable. Group your usage by model and you'll see whether a frontier model carries volume it shouldn't. Group by prompt hash and you'll see the repeats. Compare output tokens to input tokens per endpoint and you'll spot runaway generation. Look at average prompt size per endpoint and the context bloat appears. None of this needs guesswork: it's all in your provider's usage export.
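
As a sketch of those four checks, the function below takes usage rows as dictionaries, the shape `csv.DictReader` would give you. The column names (`model`, `endpoint`, `prompt`, `input_tokens`, `output_tokens`) are assumptions; real exports vary by provider, and if yours has no prompt text, a hash logged by your own application would fill the same role.

```python
import hashlib
from collections import Counter, defaultdict
from typing import Dict, Iterable


def summarize_usage(rows: Iterable[Dict[str, str]]) -> Dict[str, dict]:
    """Compute the four diagnostics from usage rows (column names are assumed)."""
    calls_by_model: Counter = Counter()
    repeats_by_hash: Counter = Counter()
    totals = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})

    for row in rows:
        calls_by_model[row["model"]] += 1
        # Short hash of the prompt text, used only to group identical prompts.
        prompt_hash = hashlib.sha256(row["prompt"].encode("utf-8")).hexdigest()[:12]
        repeats_by_hash[prompt_hash] += 1
        stats = totals[row["endpoint"]]
        stats["calls"] += 1
        stats["input"] += int(row["input_tokens"])
        stats["output"] += int(row["output_tokens"])

    per_endpoint = {}
    for endpoint, s in totals.items():
        per_endpoint[endpoint] = {
            "avg_input_tokens": s["input"] / s["calls"],
            "output_to_input_ratio": s["output"] / s["input"] if s["input"] else 0.0,
        }

    return {
        "calls_by_model": dict(calls_by_model),
        "repeated_prompts": {h: n for h, n in repeats_by_hash.items() if n > 1},
        "per_endpoint": per_endpoint,
    }
```

What counts as a suspicious ratio or prompt size depends on the task, so the function reports the numbers and leaves the thresholds to you.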
The practical move is to rank the findings by money, not by count, and fix the top few first. A single endpoint is often responsible for a large share of waste. token·flow turns an uploaded CSV into exactly this ranked list, so the signs above stop being a checklist you run manually and become a report you read.
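
To make "rank by money, not by count" concrete, here is a minimal sketch that prices each row and sorts endpoints by total cost. The prices in `PRICES_PER_MILLION` are placeholders, not any provider's real rates, and the column names are the same assumed ones as before. This is an illustration of the idea, not a description of how token·flow computes its report.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Placeholder prices in dollars per million tokens: (input, output).
# These are NOT real provider prices; substitute your own rate card.
PRICES_PER_MILLION: Dict[str, Tuple[float, float]] = {
    "frontier-model": (10.0, 30.0),
    "small-model": (0.5, 1.5),
}


def rank_endpoints_by_cost(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, float, float]]:
    """Return (endpoint, cost_in_dollars, share_of_total), most expensive first."""
    cost_by_endpoint: Dict[str, float] = defaultdict(float)
    for row in rows:
        in_price, out_price = PRICES_PER_MILLION[row["model"]]
        cost = (
            int(row["input_tokens"]) * in_price
            + int(row["output_tokens"]) * out_price
        ) / 1_000_000
        cost_by_endpoint[row["endpoint"]] += cost

    total = sum(cost_by_endpoint.values())
    ranked = sorted(cost_by_endpoint.items(), key=lambda item: item[1], reverse=True)
    return [(ep, cost, cost / total if total else 0.0) for ep, cost in ranked]
```

An endpoint with fewer calls can still top this list if those calls run on the expensive model or produce long outputs, which is exactly the case a count-based ranking would hide.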
Frequently asked
- How do I know if I'm using too powerful a model?
- Look at what the model is doing, not just what it can do. If your most capable (and expensive) model is handling classification, routing, short extraction, or formatting, that's a task a smaller model does for a fraction of the price. Move those calls down and keep the frontier model for genuinely hard reasoning.
- Repeated prompts seem normal — is that really waste?
- It is if you regenerate them. The same FAQ answered thousands of times, or the same document summarized twice, costs full price each time unless you cache. Repeated prompts aren't a problem in themselves; paying full price for each repeat is. A cache turns that repetition into near-free responses.
- My answers are sometimes too long. Does that matter for cost?
- Yes — output tokens are the expensive direction, so long answers hit the bill harder than long prompts. Cap max_tokens, instruct the model to be concise, and use structured outputs to avoid retries. Runaway output is one of the most common and most fixable sources of waste.
Last updated June 15, 2026