AI coding agents often burn far more tokens than expected: input tokens drive costs, usage can vary 30x between runs on the same task, some models consume over a million more tokens than others, and none reliably predicts its own token bill.
The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
Summary
Main Finding
Agentic coding workflows (autonomous coding agents that read repos, call tools, and iterate) consume orders of magnitude more tokens than single-turn code reasoning or multi-turn code chat. Token costs are dominated by repeated input/context ingestion, highly variable across runs, poorly predicted by human difficulty ratings, strongly model-dependent, and difficult for models to estimate before execution.
Key Points
- Scale of the gap
  - Agentic tasks consume dramatically more tokens: ~3,500× a single-turn code reasoning task and ~1,200× a multi-turn code chat task (paper aggregates).
  - Input tokens, not output tokens, drive this gap (context accumulation and repeated re-feeding across rounds).
- High variance and stochasticity
  - Token usage varies widely across problems and across repeated runs on the same problem: some runs differ by up to ~30× in total tokens; the most expensive instance is ~7M tokens more than the cheapest.
  - Within the same problem, the most expensive run is on average ~2× the cheapest.
- Cost vs. performance
  - More tokens do not reliably improve accuracy. Accuracy often peaks at intermediate token usage and then saturates or declines (inverse test-time scaling phenomenon).
  - High-cost failing runs show more repeated file views/edits (inefficient, redundant exploration).
- Model-level token efficiency differences
  - Large per-model differences in token consumption on the same tasks: e.g., Kimi-K2 and Claude Sonnet-4.5 on average consume >1.5M more tokens than GPT-5 on the SWE-bench set.
  - Relative cost rankings of models persist on both shared-success and shared-failure subsets, indicating behavior-driven (model-specific) differences rather than task selection.
- Human difficulty ratings are weak predictors
  - Expert-rated developer time/difficulty correlates only modestly with agent token consumption (Kendall τb ≈ 0.32); substantial overlap and many outliers exist.
- Self-prediction of token use is poor
  - Frontier models achieve only weak-to-moderate correlations between predicted and actual token usage (max reported r ≈ 0.39).
  - Output-token usage is easier to predict than input-token usage. Models systematically underestimate real token costs.
- Practical observation
  - Even with token caching enabled, repeated context ingestion across rounds makes input-side cost dominant.
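To make the input-side dominance described above concrete, here is a minimal arithmetic sketch of context accumulation. It assumes a simplified agent loop in which every round re-sends the entire conversation history as input; the function name `cumulative_tokens` and all per-round token counts are hypothetical illustrations, not figures or code from the paper, and caching discounts are ignored.

```python
def cumulative_tokens(rounds, new_context_per_round, output_per_round):
    """Total (input, output) tokens when the full history is re-sent each round.

    Simplifying assumptions (not from the paper): every round adds a fixed
    amount of new context (tool results, file contents) and produces a fixed
    amount of output, and no caching discount is applied.
    """
    history = 0        # tokens currently in the conversation history
    total_input = 0
    total_output = 0
    for _ in range(rounds):
        history += new_context_per_round  # new observations join the history
        total_input += history            # the whole history is read as input
        total_output += output_per_round
        history += output_per_round       # the model's reply joins the history
    return total_input, total_output


# Three rounds, 1000 new context tokens and 100 output tokens per round.
print(cumulative_tokens(3, 1000, 100))  # (6300, 300)
```

Under these assumptions output grows linearly with the number of rounds while input grows roughly quadratically, which is one way to see why long agentic sessions end up input-heavy.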
Data & Methods
- Environment and benchmark
  - Agent framework: OpenHands.
  - Benchmark: SWE-Bench-Verified, 500 real-world GitHub issues with repositories and tests (realistic, multi-step coding problems).
- Models evaluated
  - Eight frontier LLMs: Claude Sonnet-3.7, Sonnet-4, Sonnet-4.5, GPT-5, GPT-5.2, Qwen3-Coder-480B-A35B-Instruct, Kimi-K2, Gemini-3-Pro.
- Experimental design
  - Four independent runs per problem per model (to capture run-level variability).
  - Agents act autonomously (no human in the loop); multiple rounds per problem where full conversation history is carried forward.
  - Instrumentation: parsed structured JSON agent logs to extract per-type token counts (input vs. output), monetary cost estimates, action types (file view/modify/tool calls), and intermediate trajectories.
  - Aggregation: reported averages across the four runs per problem for most analyses; also analyzed run-level quartiles (MinCost → MaxCost).
- Prediction task
  - Formalized pre-execution token consumption prediction: the agent must estimate input and output token usage prior to executing the task, given tools and environment.
  - Performance measured via correlation with actual usage and estimation bias.
- Open data
  - All trajectories and logs from the experiments are open-sourced on the project website (per paper).
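The two measures named for the prediction task, correlation with actual usage and estimation bias, can be sketched as follows. This is an illustration under assumptions: the summary does not say which correlation coefficient or bias definition the paper uses, so Pearson's r and mean relative error are stand-ins, and the helper names and sample numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences (assumed metric)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def mean_relative_bias(predicted, actual):
    """Average of (predicted - actual) / actual; negative means underestimation."""
    return sum((p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical per-task token counts: predictions track the ordering of
# actual usage but are consistently half of it.
predicted = [100_000, 200_000, 300_000]
actual = [200_000, 400_000, 600_000]
print(pearson_r(predicted, actual))
print(mean_relative_bias(predicted, actual))
```

The toy data shows why both measures are needed: a predictor can rank tasks well (high correlation) while still underestimating every one of them (negative bias).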
Implications for AI Economics
- Pricing and billing models
  - Current per-token billing is opaque and unpredictable for agentic workflows. Because input-context ingestion dominates cost and is hard to predict, users face large unknown bills.
  - Providers should consider pricing designs that account for agentic patterns: e.g., session/round caps, clearer pre-run estimates, tiered pricing for input-heavy sessions, or pricing tied to successful outcomes to mitigate paying for failed explorations.
- Product controls and transparency
  - Systems should offer budget caps, early alerts, and coarse pre-execution cost estimates (self-prediction gives a useful coarse signal despite imperfect instance-level accuracy).
  - Expose model-level “token-efficiency” benchmarks so users can choose models based on cost–accuracy trade-offs; raw accuracy alone is insufficient.
- Cost-management strategies
  - Prefer token-efficient models (empirically measured) for high-volume production uses; even if absolute accuracy differences are small, model-specific behavior can massively change token spend.
  - Invest in engineering mitigations: aggressive caching, truncated/unrolled context strategies, better file-access policies, and tooling that prevents redundant file re-reads/edits.
- SLA, procurement, and budgeting
  - Organizations deploying agents should build stochasticity into budgets and SLAs (expect heavy tails and run-to-run variance).
  - Human estimates of task difficulty are unreliable for cost forecasting; empirical measurement on target models/environments is necessary.
- Research and product priorities
  - Develop improved pre-execution token estimation models (especially for input/context growth), stop criteria that detect unproductive exploration, and training/evaluation metrics that penalize token-inefficient behavior.
  - Consider incentives in model design and fine-tuning to reduce unnecessary context growth (token-efficiency as a first-class training objective).
- Policy & user protection
  - Given systemic underestimation and unpredictability, regulators and marketplaces could require clearer pre-run cost disclosures and easy-to-set hard caps to protect users from runaway charges.
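The budget caps and early alerts suggested above could take a shape like the following. This is a minimal sketch, not an interface from OpenHands or any provider: the `TokenBudget` class, the `TokenBudgetExceeded` exception, and the 80% alert threshold are all hypothetical choices made for illustration.

```python
class TokenBudgetExceeded(Exception):
    """Raised when a session spends more than its hard token cap."""


class TokenBudget:
    """Tracks cumulative token spend for one agent session (hypothetical helper)."""

    def __init__(self, max_tokens, alert_fraction=0.8):
        self.max_tokens = max_tokens
        self.alert_fraction = alert_fraction  # assumed early-alert threshold
        self.used = 0

    def charge(self, input_tokens, output_tokens):
        """Record one round's usage.

        Returns True once spend has reached the alert threshold, and raises
        TokenBudgetExceeded when the hard cap is passed.
        """
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"used {self.used} tokens, cap is {self.max_tokens}"
            )
        return self.used >= self.alert_fraction * self.max_tokens


# Usage: call charge() after every agent round with that round's token counts.
budget = TokenBudget(max_tokens=1000)
print(budget.charge(500, 100))  # False: 600 tokens used, below the 800 alert line
print(budget.charge(200, 50))   # True: 850 tokens used, alert threshold reached
```

Charging after each round rather than once per session matters here, because input cost accumulates round by round and a runaway session would otherwise only be detected after the fact.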
Summary takeaway: Agentic LLM workflows fundamentally change the token-economics picture — they are input-heavy, highly variable, model-dependent, and hard to predict. For practical deployments and fair pricing, stakeholders must measure model-specific token behavior, provide coarse pre-execution estimates and budget controls, and prioritize token-efficiency in both model selection and system design.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat. [Organizational Efficiency] | negative | high | total token consumption (agentic vs. code reasoning/code chat) | 1000x; 0.18 |
| Input tokens rather than output tokens drive the overall cost of agentic tasks. [Organizational Efficiency] | negative | high | share/contribution of input tokens vs output tokens to total token consumption | 0.18 |
| Token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens. [Organizational Efficiency] | mixed | high | run-to-run variability in total token consumption for the same task | 30x; 0.18 |
| Higher token usage does not translate into higher accuracy; accuracy often peaks at intermediate cost and saturates at higher costs. [Output Quality] | null_result | high | task accuracy as a function of token usage | 0.18 |
| Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5. [Organizational Efficiency] | negative | high | average total token consumption per model (tokens consumed by model A minus model B) | over 1.5 million more tokens; 0.18 |
| Task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend. [Task Allocation] | null_result | high | correspondence/alignment between human-rated task difficulty and measured token costs | 0.18 |
| Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. [Organizational Efficiency] | negative | high | correlation and bias between model self-predicted token usage and actual token usage | correlations up to 0.39; 0.18 |
| This paper presents the first systematic study of token consumption patterns in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified and evaluating models' ability to predict their own token costs before task execution. [Research Productivity] | positive | high | scope of study (presence of systematic analysis and self-prediction evaluation) | n=8; 0.03 |