What's the most overlooked LLM cost?

Retries from malformed output, closely followed by unbounded conversation history. Both inflate the bill without showing up as anything unusual per call — you just make more calls, or longer ones, than you think. Structured outputs cut retries, and trimming or summarizing history caps the runaway input growth.

Why does conversation history get expensive?

Because most chat implementations resend the entire history as input on every turn. The conversation's tokens accumulate, so the later messages carry the weight of everything before them. Capping how much history you include, or summarizing older turns into a short brief, stops the input from growing without bound.
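A minimal sketch of the capping idea, assuming messages are dicts with `role` and `content` keys and using a rough four-characters-per-token estimate instead of a real tokenizer. The names `estimate_tokens` and `trim_history` are illustrative and not part of any provider SDK.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep system messages plus the newest turns that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    # Whatever the system messages use is no longer available for turns.
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    # Walk backwards so the most recent turns are kept first.
    for message in reversed(turns):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    # Restore chronological order before returning.
    return system + kept[::-1]
```

Calling `trim_history(history, max_tokens=...)` before each request keeps the input bounded no matter how long the session runs. Summarizing the dropped turns into a short brief, rather than discarding them, is the alternative the answer above mentions.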

How do background jobs sneak onto the bill?

They run unattended (nightly summaries, enrichment pipelines, agent loops), so no one notices them the way they'd notice a slow user-facing feature. They can be your single biggest line item. Find them by looking in your usage export for steady token spend at hours when real users aren't active.
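One way to look for that off-hours spend, sketched under assumptions: the export has been loaded into rows with a `timestamp` field (ISO 8601, already in your users' local time) and a `total_tokens` field. Both column names are hypothetical, as is the 08:00 to 20:00 active window; real exports differ by provider.

```python
from collections import defaultdict
from datetime import datetime


def off_hours_tokens(rows, active_start=8, active_end=20):
    """Sum tokens per hour of day for calls outside [active_start, active_end)."""
    by_hour = defaultdict(int)
    for row in rows:
        hour = datetime.fromisoformat(row["timestamp"]).hour
        if not (active_start <= hour < active_end):
            by_hour[hour] += int(row["total_tokens"])
    # Largest off-hours totals first.
    return dict(sorted(by_hour.items(), key=lambda item: item[1], reverse=True))
```

An hour that shows a large, similar total day after day is a good candidate for an unattended job.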

Do these hidden costs apply to both OpenAI and Anthropic?

Yes. Retries, history bloat, oversized context, runaway output, and unwatched background jobs are pattern problems, not provider problems — they happen on any token-billed model. Both OpenAI and Anthropic let you export usage you can analyze, and token·flow accepts CSVs from both.

The hidden costs of large language models and how to find them

The costs that don't show on the sticker

Retries are the quietest. When the model returns malformed output and your code calls again, you pay for two completions to get one result. Because output tokens are typically priced higher than input tokens, and every retry regenerates the whole output, retries hurt more than their count suggests. Structured outputs (a JSON schema the model must follow) cut these sharply. Next is unbounded conversation history: a chat that appends every turn to the prompt grows the input on every message, so the hundredth message in a session can cost many times the first.
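To put rough numbers on that history growth, here is a small sketch under a simplifying assumption: every turn adds the same number of tokens to the history, and the full history is resent on each turn. The function names and the per-turn figure are illustrative only.

```python
def input_tokens_for_turn(turn: int, tokens_per_turn: int) -> int:
    """Input size of a given turn when all earlier turns are resent with it."""
    return turn * tokens_per_turn


def total_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens billed across the whole session (grows quadratically)."""
    return sum(input_tokens_for_turn(k, tokens_per_turn) for k in range(1, turns + 1))
```

With an assumed 200 tokens per turn, turn 1 sends 200 input tokens and turn 100 sends 20,000. The whole 100-turn session bills 1,010,000 input tokens, against 20,000 if each turn were sent on its own.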

Then there's context bloat in RAG. A retriever tuned to return fifty chunks 'to be safe' when the model draws on only five pays for forty-five unused chunks on every query. And background jobs are the easiest to forget: nightly summaries, enrichment loops, and agent chains that run unattended can be the single largest line on the bill precisely because nobody is watching them. Each of these is invisible in a headline 'cost per call' number and obvious the moment you group by direction and prompt.
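For the RAG case, right-sizing can be as simple as capping the number of chunks and dropping low-relevance ones before they reach the prompt. The sketch below assumes the retriever returns `(score, text)` pairs where a higher score means more relevant. `select_chunks` and its default thresholds are hypothetical placeholders to tune against your own retrieval quality.

```python
def select_chunks(scored_chunks, max_chunks=5, min_score=0.5):
    """Return the texts of the top-scoring chunks, at most max_chunks of them.

    scored_chunks: list of (score, text) pairs; higher score = more relevant.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [text for score, text in ranked[:max_chunks] if score >= min_score]
```

Whether five chunks is the right cap depends on the task. The point is that the cap is chosen deliberately rather than left at a generous default.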

How to surface them in your own data

Each of these costs has a fingerprint in your usage export. Retries show up as duplicate calls clustered in time. Unbounded history shows up as input tokens climbing within a session. RAG bloat shows up as large average prompt sizes on retrieval endpoints. Runaway completions, where the model generates far more output than the task needs, show up as a high output-to-input ratio. Background jobs show up as steady spend at hours when no users are active.
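Two of these fingerprints can be checked with a few lines of code. The sketch below assumes each row carries `timestamp` (ISO 8601), `endpoint`, `prompt_hash`, `input_tokens`, and `output_tokens`. All of those column names are hypothetical, and a provider export may need to be joined with your own request logs to get them. The 30-second retry window is also an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def likely_retries(rows, window_seconds=30):
    """Count, per endpoint, calls that repeat the same prompt shortly after the last one."""
    window = timedelta(seconds=window_seconds)
    calls = sorted(
        (datetime.fromisoformat(r["timestamp"]), r["endpoint"], r["prompt_hash"])
        for r in rows
    )
    last_seen = {}
    retries = defaultdict(int)
    for when, endpoint, prompt_hash in calls:
        key = (endpoint, prompt_hash)
        previous = last_seen.get(key)
        if previous is not None and when - previous <= window:
            retries[endpoint] += 1
        last_seen[key] = when
    return dict(retries)


def output_input_ratio(rows):
    """Output tokens divided by input tokens, per endpoint."""
    totals = defaultdict(lambda: [0, 0])  # endpoint -> [input, output]
    for r in rows:
        totals[r["endpoint"]][0] += int(r["input_tokens"])
        totals[r["endpoint"]][1] += int(r["output_tokens"])
    return {e: out / inp for e, (inp, out) in totals.items() if inp > 0}
```

Identical prompts sent close together are not always retries (two users can ask the same thing), so treat the counts as leads to investigate rather than a verdict.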

Pulling these apart by hand is doable but tedious, because the provider dashboards tend to show totals, not the slices that reveal the cause. token·flow groups an uploaded CSV by prompt, model, and direction, then ranks by cost — which turns 'the bill is high' into a named list: this endpoint retries, this one bloats its context, this one runs all night. From there the fixes (caps, caching, trimming, right-sizing) are the easy part.
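For readers who want to try the grouping by hand first, here is a rough sketch of the idea. It is not token·flow's implementation. It assumes a CSV with `prompt`, `model`, `input_tokens`, and `output_tokens` columns (hypothetical names) and takes per-million-token prices from the caller, since prices vary by model and change over time.

```python
import csv
import io
from collections import defaultdict


def rank_by_cost(csv_text, prices_per_million):
    """Group usage by (prompt, model, direction) and rank groups by cost.

    prices_per_million: {model: {"input": usd, "output": usd}}, supplied by the caller.
    Returns a list of ((prompt, model, direction), cost) pairs, most expensive first.
    """
    costs = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for direction in ("input", "output"):
            tokens = int(row[f"{direction}_tokens"])
            price = prices_per_million[row["model"]][direction]
            costs[(row["prompt"], row["model"], direction)] += tokens * price / 1_000_000
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)
```

The top few entries of the returned list are usually where the fixes pay off first.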


Last updated June 15, 2026
