Input tokens, output tokens, and why the split matters
Every call has two token counts. Input (prompt) tokens are everything you send: the system prompt, the conversation history, the retrieved context, the user's message. Output (completion) tokens are what the model generates back. Providers price these separately, and output is almost always the more expensive direction, often several times the input rate. That's why a chat feature that returns long answers can cost far more than its input size suggests.
Cached tokens add a third line. When provider prompt caching kicks in, the repeated prefix of your input is billed at a reduced cached rate rather than the full input rate. So a single call can have full-price input tokens, cheaper cached input tokens, and full-price output tokens all at once. Reading your bill means seeing all three, not just a single total.
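The three billing lines described above can be worked through with a little arithmetic. The sketch below is illustrative only: the `Rates` class, the `call_cost` function, and every price in it are hypothetical placeholders rather than any provider's real rates. It also assumes the cached and uncached input counts have already been separated, which your provider's usage fields may or may not do for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Values passed in are placeholders, not real prices."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def call_cost(rates: Rates, uncached_input: int, cached_input: int, output: int) -> dict:
    """Return the cost of one call, split into the three billing lines."""
    lines = {
        "input": uncached_input * rates.input_per_m / 1_000_000,
        "cached_input": cached_input * rates.cached_input_per_m / 1_000_000,
        "output": output * rates.output_per_m / 1_000_000,
    }
    lines["total"] = sum(lines.values())
    return lines


# Hypothetical rates: output at 4x input, cached input at a tenth of input.
example_rates = Rates(input_per_m=3.00, cached_input_per_m=0.30, output_per_m=12.00)
breakdown = call_cost(example_rates, uncached_input=2_000, cached_input=8_000, output=1_000)
for line, usd in breakdown.items():
    print(f"{line:>12}: ${usd:.4f}")
```

With these made-up numbers, the 1,000 output tokens cost more than the 10,000 input tokens combined, which is the asymmetry a single token total hides.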
How to read your usage export
Both OpenAI and Anthropic let you export usage. The common mistake is sorting by call count: a low-volume endpoint with enormous prompts or long completions can quietly outspend a high-volume one with tiny calls. Sort by total cost, or by total tokens with output weighted more heavily than input, and the real culprits surface immediately.
Look for three patterns. One: a single endpoint or prompt dominating the total — that's your first target. Two: output tokens far exceeding what the use case needs — a candidate for max_tokens caps and brevity instructions. Three: the same prompt appearing over and over — a caching candidate. token·flow runs exactly this grouping on an uploaded CSV, so you see the costliest prompts and the repeats without writing a query.
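If you would rather run the grouping by hand, a minimal sketch using only the Python standard library might look like the following. The column names (`endpoint`, `model`, `input_tokens`, `output_tokens`), the sample rows, and the prices are all assumptions made for illustration. Real usage exports use different headers, so adjust the keys to match yours. This is not a description of how token·flow is implemented.

```python
import csv
import io
from collections import defaultdict

# Hypothetical USD prices per million tokens; replace with your model's rates.
INPUT_PER_M = 3.0
OUTPUT_PER_M = 12.0


def rank_by_cost(csv_text: str) -> list:
    """Group rows by (endpoint, model) and rank the groups by estimated cost."""
    totals = defaultdict(lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = totals[(row["endpoint"], row["model"])]
        group["calls"] += 1
        group["input_tokens"] += int(row["input_tokens"])
        group["output_tokens"] += int(row["output_tokens"])

    ranked = []
    for (endpoint, model), g in totals.items():
        cost = (g["input_tokens"] * INPUT_PER_M + g["output_tokens"] * OUTPUT_PER_M) / 1_000_000
        ranked.append({"endpoint": endpoint, "model": model, "cost": cost, **g})
    # Costliest group first, regardless of how many calls it made.
    ranked.sort(key=lambda r: r["cost"], reverse=True)
    return ranked


# Made-up export: one busy endpoint with tiny calls, one quiet endpoint with a huge call.
SAMPLE = """endpoint,model,input_tokens,output_tokens
/autocomplete,small-model,200,20
/autocomplete,small-model,180,25
/autocomplete,small-model,220,15
/report,small-model,40000,6000
"""

ranked = rank_by_cost(SAMPLE)
for r in ranked:
    print(f'{r["endpoint"]:<14} calls={r["calls"]} cost=${r["cost"]:.5f}')
```

In this sample, `/report` makes a single call yet ranks first, while `/autocomplete` makes three calls and costs a small fraction as much. A sort by call count would put them in the opposite order.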
Frequently asked questions
- Why is my output (completion) cost higher than my input cost?
- Because output tokens are priced higher per token on most models, frequently several times the input rate, and because chat-style features often generate long answers. If your bill is dominated by output, the fix is on the generation side: cap max_tokens, ask for concise answers, and use structured outputs to avoid retries.
- What are cached tokens on my bill?
- When a long prompt prefix is identical across calls, OpenAI and Anthropic can reuse internal computation and bill that repeated portion at a lower cached rate. Cached input tokens show up cheaper than full input tokens. Keeping your stable prefix unchanged and at the front of the prompt is what triggers the discount.
- How do I find which feature is driving my bill?
- Don't sort by number of calls — sort by total cost or total tokens. A handful of endpoints with big prompts or long completions usually account for most of the spend. Export a usage CSV and group it by prompt and model; token·flow does this automatically and ranks the costliest first.
- Does temperature or model speed affect the price?
- No. Temperature, top-p, and latency don't change pricing — only token counts and the model's per-token rates do. A faster or slower response with the same token usage costs the same. The levers that move the bill are model choice, prompt size, and completion length.
Last updated June 15, 2026