Token economics is the hinge between "AI demo works" and "AI product ships." The demo can afford any prompt; the product cannot. Between them sit three levers that decide what's affordable in 2026: the asymmetry between input and output pricing, the amortization you get from caching, and the savings from routing to the right model tier. This post, under the context engineering pillar, walks through those levers and their trade-offs so the prompts and workflows you design aim at cost-per-outcome, not cost-per-token.
Why Token Economics Matters for Production Apps
At one call per user per day, nothing here matters. At one agent loop per request times a hundred thousand requests a day, everything here matters. Three forces push the conversation from theoretical to operational.
Prompts got longer. Multi-thousand-token system prompts are standard. Long-context features pack whole documents or codebases in. Every added token multiplies across every call.
Agents loop. One user action becomes ten model calls. Cost is per call, not per user action — and per-call costs compound.
Output is slow and expensive. Generation scales with output length, and output tokens cost more than input. Verbose models feel fine in a demo and bleed money at scale.
For an LLM feature in a product people pay for, you're engineering for cost-per-outcome. The levers below are how.
Input vs Output Pricing — The Asymmetry
Across mainstream providers in 2026, the shape is consistent: output tokens cost more than input tokens, often several times more. Input is processed once; output is generated autoregressively, one token at a time, with attention over everything so far. Output is intrinsically more expensive.
Two implications for how prompts get written.
You can afford a long prompt to get a short answer. Carefully structured input leading to a concise response is cheaper than vague input producing a sprawling one. "Write a 3-bullet summary" beats "summarize" — for quality and for cost.
Verbose output is where cost silently escalates. Models that answer in five paragraphs where one would do multiply your output spend. Tightening output format — JSON schemas, bullet lists, explicit length caps — is a cost lever disguised as a quality lever.
Design principle: push instructions into input; ask for outputs shaped to be short and structured. The economics reward it; so does quality.
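The asymmetry is easy to make concrete. A minimal sketch, using hypothetical rates of $5/M input and $20/M output (illustrative, not any provider's real pricing):

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float = 5.0, output_rate: float = 20.0) -> float:
    """Cost of one call in dollars; rates are hypothetical $ per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Vague prompt, sprawling answer: 50 tokens in, 900 tokens out.
vague = call_cost(50, 900)
# Structured prompt with a length cap: 400 tokens in, 120 tokens out.
structured = call_cost(400, 120)

print(f"vague: ${vague:.4f}  structured: ${structured:.4f}")
```

The structured call spends 8x the input tokens and still comes out roughly 4x cheaper, because the savings live on the expensive side of the asymmetry.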
Caching Amortization — Cached Inputs Are Much Cheaper
The second lever is prompt caching. When a provider has seen the beginning of your prompt before, it reuses the processed form of that prefix instead of recomputing. Cached input tokens are charged at a small fraction of the standard rate — exact ratios vary by provider, but the shape is dramatic.
The break-even math. Writes cost slightly more than a normal input token; reads cost much less. Caching pays off when the same prefix is reused across enough calls within the cache TTL to cover the write premium. For any workflow that calls the same model more than a couple of times per session, caching is almost always a win.
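The break-even math can be sketched directly. The multipliers below are assumptions for illustration (a 1.25x write premium and a 0.1x read rate); real ratios vary by provider:

```python
def caching_total(prefix_tokens: int, reuses: int,
                  write_mult: float = 1.25, read_mult: float = 0.10,
                  input_rate: float = 5.0) -> float:
    """Cost of one cache write on the prefix plus `reuses` cached reads.
    Multipliers are hypothetical, expressed relative to the normal input rate."""
    per_token = input_rate / 1_000_000
    return prefix_tokens * per_token * (write_mult + reuses * read_mult)

def uncached_total(prefix_tokens: int, calls: int,
                   input_rate: float = 5.0) -> float:
    """Cost of reprocessing the same prefix on every call, no cache."""
    return prefix_tokens * (input_rate / 1_000_000) * calls

prefix = 3_000
for calls in (1, 2, 5):
    print(calls, caching_total(prefix, reuses=calls - 1), uncached_total(prefix, calls))
```

Under these assumed ratios, caching loses on a single call (you paid the write premium for nothing) and wins from the second call onward, which is why "reused more than a couple of times within the TTL" is the practical threshold.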
What to cache: stable parts — system instructions, tool definitions, reference material, schemas. What not to cache: the variable tail — user input, retrieval results, recent history. Mutating the cached prefix breaks the cache, so the prompt must be ordered stable-first, variable-last. For depth, see prompt caching guide 2026. Caching is not the same as semantic caching, which people conflate often enough to cause bugs — semantic caching vs. prompt caching spells the difference.
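Stable-first, variable-last is an ordering discipline in the code that assembles the prompt. A minimal sketch, with placeholder names (`SYSTEM_PROMPT`, `TOOL_DEFS`, and the string-concatenation shape are illustrative, not any provider's API):

```python
# Stable parts: byte-for-byte identical across calls, so a prefix cache can hit.
SYSTEM_PROMPT = "You are a support agent for Acme. Follow the policies below..."
TOOL_DEFS = "search_orders(query: str), issue_refund(order_id: str)"

def build_prompt(retrieved_docs: str, history: str, user_input: str) -> str:
    # Stable prefix first -- never interleave variable content here,
    # or every call breaks the cache.
    stable = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFS])
    # Variable tail last -- changing it never invalidates the prefix.
    variable = "\n\n".join([retrieved_docs, history, user_input])
    return stable + "\n\n" + variable
```

Any edit to `SYSTEM_PROMPT` or `TOOL_DEFS` is effectively a cache flush, which is a reason to version stable content deliberately rather than tweak it per-call.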
Caching changes an agent loop from "system-prompt cost times N steps" to "system-prompt cost roughly once, plus step-variable cost times N" — often the difference between viable and not.
Model Tiering — Cheap Model for Routine, Expensive for Hard
Providers ship families — flagship, mid-tier, small fast model — with prices that can differ by an order of magnitude. Running everything on the flagship is overpaying when most requests don't need it.
How tiering works. Classify each request by difficulty, route to the cheapest model that handles it reliably, escalate on low-confidence or explicit complexity signals. Routing can be rule-based (keyword, length, category) or model-based (a tiny classifier in front).
Push down to the small model: routing and classification, extraction and reformatting, short factual answers on well-scoped domains, first-pass drafts before a premium polish.
Keep on the flagship: multi-step reasoning that holds a long thread, open-ended generation where quality compounds, tasks that tolerate no errors (billing, legal, financial).
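The rule-based version of this routing fits in a few lines. A sketch under stated assumptions: the model names, the 4,000-token threshold, and the 0.7 confidence cutoff are all illustrative, not recommendations:

```python
CHEAP_MODEL = "small-fast-model"      # hypothetical model names
FLAGSHIP_MODEL = "flagship-model"

# Task categories the small model handles reliably (per the list above).
ROUTINE_TASKS = {"route", "classify", "extract", "reformat"}

def pick_model(task_type: str, input_tokens: int) -> str:
    """Route to the cheapest model expected to handle the request reliably."""
    if task_type in ROUTINE_TASKS and input_tokens < 4_000:
        return CHEAP_MODEL
    return FLAGSHIP_MODEL

def escalate_if_unsure(model: str, confidence: float) -> str:
    """Escalate to the flagship on a low-confidence signal from the cheap tier."""
    if model == CHEAP_MODEL and confidence < 0.7:
        return FLAGSHIP_MODEL
    return model
```

The model-based variant replaces `pick_model` with a tiny classifier call; the escalation path stays the same either way.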
Tiering compounds with caching — small models cache too, and cheap per-call plus cached prefix is often the most economical path for routine work. Compression matters here as well — see context compression techniques for how shrinking variable tails compounds with tiering.
Reasoning Tokens — How Extended Thinking Counts
Reasoning models (Claude with extended thinking, GPT reasoning modes) produce internal "thinking" tokens before the answer. These are billed, typically at output rates, even when they're not returned to the user.
Reasoning is a cost multiplier you opted into. A reasoning call on a hard problem can cost several times a non-reasoning call — sometimes worth it, sometimes not. Reasoning budget is tunable. Most providers let you cap thinking tokens; default caps are often higher than needed. Not every task needs it. Classification, extraction, and reformatting rarely benefit from reasoning and always pay for it. Gate it behind a difficulty signal, not a blanket toggle.
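Gating reasoning behind a difficulty signal can look like this sketch. The budget numbers and the 0.5 threshold are assumptions; `thinking_budget` stands in for whatever cap your provider exposes:

```python
ROUTINE_TASKS = {"classify", "extract", "reformat"}

def thinking_budget(task_type: str, difficulty: float) -> int:
    """Reasoning-token cap per call. Zero disables thinking entirely,
    so routine work never pays output-rate prices for hidden tokens."""
    if task_type in ROUTINE_TASKS:
        return 0                # routine tasks rarely benefit, always pay
    if difficulty < 0.5:
        return 1_024            # light budget for moderately hard tasks
    return 8_192                # full budget only where difficulty warrants it
```

The point of the function shape is that the default is zero, not the provider's default cap: reasoning is something a request earns, not something it inherits.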
Batch vs Live — When Delay Is Worth the Discount
Most major providers offer a batch API: submit a collection of requests, get results within a 24-hour window, pay a substantial discount (often around half the live rate) on both input and output.
Where batch fits: bulk generation (product descriptions, embeddings preprocessing, content migration), overnight analytics (classify yesterday's records, summarize daily logs), training data pipelines, anything that doesn't need to respond inside a user-facing request.
Where it doesn't: interactive UX (chat, autocomplete, live assistants), anything on the critical path of a user waiting for output.
Batch turns cost-per-outcome math on its head for volume workloads. Work prohibitive at live rates can be comfortable at batch rates. The trade is latency, not quality.
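The batch math is simple enough to sketch. Prices and the 50% discount are hypothetical, per the shape described above:

```python
def job_cost(n_requests: int, in_tok: int, out_tok: int,
             input_rate: float = 5.0, output_rate: float = 20.0,
             batch_discount: float = 0.5) -> tuple[float, float]:
    """(live, batch) cost in dollars for a bulk job; rates are hypothetical
    $ per million tokens, discount applied to both input and output."""
    per_call = (in_tok * input_rate + out_tok * output_rate) / 1_000_000
    live = n_requests * per_call
    return live, live * batch_discount

# 100k overnight classifications: 1,200 tokens in, 300 out per call.
live, batch = job_cost(100_000, 1_200, 300)
print(f"live: ${live:,.2f}  batch: ${batch:,.2f}")
```

At this hypothetical scale the discount is the difference of hundreds of dollars per day for a job nobody is waiting on, which is why "could this run overnight?" belongs in every workload review.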
Cost-Per-Outcome vs Cost-Per-Token
The most common token-economics mistake is optimizing the wrong quantity.
Cost-per-token is what the bill is denominated in. Minimizing it means shorter prompts, smaller models, less reasoning. Cost-per-outcome is what the business cares about — the bill to get one successful user-facing result, which includes retries on failure, escalations to bigger models, and human review when quality misses.
A prompt that halves token spend but doubles the failure rate can raise cost-per-outcome, because each failure burns another full call. A prompt that uses 20% more tokens but eliminates retries is cheaper per outcome even though it looks more expensive per token.
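That trade-off is one division away. A sketch with the same numbers as the prose, assuming failures are simply retried until success:

```python
def cost_per_outcome(cost_per_call: float, success_rate: float) -> float:
    """Expected spend per successful result. With retry-until-success,
    expected calls per outcome is 1 / success_rate."""
    return cost_per_call / success_rate

# "Lean" prompt: cheaper per call, but fails 20% of the time.
lean = cost_per_outcome(0.020, success_rate=0.80)
# "Solid" prompt: 20% more tokens, but almost never retries.
solid = cost_per_outcome(0.024, success_rate=0.99)

print(f"lean: ${lean:.4f}/outcome  solid: ${solid:.4f}/outcome")
```

The per-token loser wins per outcome, and this simple model is generous to the lean prompt: it ignores escalations to bigger models and human review, both of which widen the gap.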
Measure cost against a success metric that matters to the product, not against tokens. Every optimization — shorter prompt, smaller model, compressed context — gets checked against outcome quality, not just bill size. Over-optimizing per-token is how you ship cheap unreliable features that churn users.
Example: A Cost Breakdown
Illustrative with hypothetical prices — do not treat as current rates. This is about the shape of cost in a real-looking call, not the numbers.
Scenario: agent loop, 10 steps per task, 1,000 tasks/day
Per step:
Static system prompt + tools: ~3,000 input tokens
Variable context (history): ~1,500 input tokens
User/tool output chunk: ~800 input tokens
Model output: ~400 output tokens
Hypothetical prices (illustrative, not real):
Input (uncached): $5 per million tokens
Input (cached): $0.50 per million tokens
Output: $20 per million tokens
Per-step cost, naive (no caching, flagship model):
5,300 input * $5/M = $0.0265
400 output * $20/M = $0.0080
Total per step = $0.0345
Per-step cost, with caching on the 3,000-token prefix:
3,000 cached * $0.50/M = $0.0015
2,300 uncached * $5/M = $0.0115
400 output * $20/M = $0.0080
Total per step = $0.0210 (~39% cheaper)
Per-task cost, with caching + small model for routing/extraction
(assume 6 of 10 steps route to a 5x-cheaper model):
4 flagship steps @ $0.0210 = $0.0840
6 small-model steps @ roughly $0.0042 each = $0.0252
Total per task = $0.1092
vs naive flagship total = $0.3450 (~68% cheaper)
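The whole breakdown above fits in a few lines of arithmetic, reproduced here so the stack-up is checkable (same hypothetical rates, not real prices):

```python
# Hypothetical rates, $ per million tokens -- matching the prose above.
IN_RATE, CACHED_RATE, OUT_RATE = 5.0, 0.50, 20.0

def step_cost(uncached_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of one agent step at the hypothetical rates."""
    return (uncached_in * IN_RATE + cached_in * CACHED_RATE + out * OUT_RATE) / 1e6

naive = step_cost(5_300, 0, 400)        # no caching, flagship
cached = step_cost(2_300, 3_000, 400)   # 3,000-token prefix cached
small = cached / 5                      # 5x-cheaper model, same token shape

task_naive = 10 * naive                 # 10 flagship steps, no caching
task_tiered = 4 * cached + 6 * small    # 4 flagship + 6 small-model steps

print(round(naive, 4), round(cached, 4), round(task_naive, 4), round(task_tiered, 4))
```

(For simplicity this mirrors the prose and ignores the cache-write premium on the first step; including it nudges the cached numbers up slightly without changing the shape.)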
The point isn't the numbers — it's the stack-up. Input/output asymmetry shapes the prompt, caching amortizes the stable prefix, tiering routes easy steps to cheap models. Each lever alone helps; together they cut this bill to roughly a third, and stacking batch discounts on any offline steps pushes the gap wider still.
Common Anti-Patterns
- Optimizing per-token without measuring outcomes. Shorter prompts that raise failure rates cost more per successful outcome, not less. Always measure against a product-relevant success metric.
- Leaving output unbounded. Verbose models in unconstrained prompts silently inflate the output side of the bill. Set explicit length caps and output schemas.
- Caching the wrong part. Putting user input or retrieval results before the system prompt breaks caching on every call. Stable-first, variable-last — always.
- One model for everything. Running flagship on routing and extraction wastes money by a wide margin. Route cheap work to cheap models.
- Reasoning on by default. Extended thinking budgets add up fast on tasks that don't need reasoning at all. Gate it.
- Ignoring batch. Treating every workload as real-time when much of it could run overnight leaves the batch-API discount untouched.
- Over-compressing cached prefixes. Compressing what you could have cached pays an inference cost to lose a bigger cache-hit discount. Compress variable tails; cache stable prefixes. See context compression techniques.
FAQ
Is input always cheaper than output?
Across mainstream providers in 2026, yes — often by several times. Output generates token-by-token with attention over everything prior, making it intrinsically more expensive than input processed in one pass. Push instructions into input; constrain output to be short and structured.
How much does caching actually save?
Cached input costs a small fraction of uncached input. Exact ratios vary by provider and change over time. For any workflow where the same prefix reappears more than a couple of times within the cache TTL, caching pays off — often by large margins. See prompt caching guide 2026 for details that decide whether the cache actually fires.
Should I always use the cheapest model?
Only for work the cheap model handles reliably. Cost-per-outcome punishes cheap-model failures: retries, escalations, and rework cost more than running the right model once. Route by difficulty, measure failure rate per tier, pay for quality where quality matters.
Does batch make sense for my workload?
If any part tolerates a 24-hour delay — bulk generation, overnight analytics, data pipeline steps — yes. The trade is strictly latency; interactive requests stay live.
What's the single biggest lever?
Depends on workload. Agent loops with long stable prompts: caching. High-volume simple tasks: tiering. Document-heavy workflows: compression plus caching. Bulk background work: batch. The practical move is to profile your calls — where tokens go, which prefixes recur, which steps don't need the flagship — and apply levers where the profile says they fit.
Wrap-Up
Token economics is a design discipline, not an accounting detail. Input costs less than output — push instructions in, constrain output out. Cached inputs cost a fraction of uncached — structure prompts stable-first, variable-last. Models tier by capability — route routine work cheap, save the flagship for work that earns it. Reasoning bills at output rates — gate it. Batch discounts exist — use them where latency allows. Above all, measure cost-per-outcome, not cost-per-token. The cheapest prompt is the one that shipped a reliable feature, not the one with the smallest bill per call.
For the broader frame, the context engineering pillar. For how to actually make the cache fire, prompt caching guide 2026. For shrinking the variable tail, context compression techniques. For the concept people confuse with prompt caching, semantic caching vs. prompt caching. For the vocabulary, prompt caching.