AI coding agents often burn far more tokens than expected. Input tokens drive the cost, usage can vary by up to 30x between runs of the same task, some models consume millions more tokens than others, and none reliably predicts its own token bill.

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei · April 24, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

Agentic coding tasks consume orders of magnitude more tokens than other code tasks, token usage is highly variable and model-dependent, and frontier models both differ substantially in efficiency and systematically underestimate their own token costs.

The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.

Summary

Main Finding

Agentic coding workflows (autonomous coding agents that read repos, call tools, and iterate) consume orders of magnitude more tokens than single-turn code reasoning or multi-turn code chat. Token costs are dominated by repeated input/context ingestion, highly variable across runs, poorly predicted by human difficulty ratings, strongly model-dependent, and difficult for models to estimate before execution.
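
The input-side dominance described above follows from simple arithmetic: if the full history is re-sent every round, cumulative input grows roughly quadratically in the number of rounds while output grows linearly. The sketch below illustrates this under assumed, purely illustrative numbers (a 2,000-token initial prompt, 1,500 tokens of tool observation and 300 output tokens per round); none of these figures or names come from the paper.

```python
def cumulative_tokens(rounds, system_tokens=2000, obs_tokens=1500, out_tokens=300):
    """Return (total_input, total_output) when the full history is re-sent each round.

    All default sizes are hypothetical values chosen for illustration only.
    """
    total_input = 0
    total_output = 0
    history = system_tokens  # tokens already in context before the first round
    for _ in range(rounds):
        total_input += history              # the whole history is ingested again
        total_output += out_tokens          # the model emits one action
        history += out_tokens + obs_tokens  # action and tool result join the history
    return total_input, total_output


total_in, total_out = cumulative_tokens(50)
print(total_in, total_out, round(total_in / total_out, 1))
```

With these assumed sizes, 50 rounds ingest about 2.3M input tokens against 15K output tokens, which is why input dominates the bill even when each individual round looks cheap.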

Key Points

Scale of the gap
- Agentic tasks consume dramatically more tokens: ~3,500× a single-turn code reasoning task and ~1,200× a multi-turn code chat task (paper aggregates).
- Input tokens, not output tokens, drive this gap (context accumulation and repeated re-feeding across rounds).
High variance and stochasticity
- Token usage varies widely across problems and across repeated runs on the same problem: some runs differ by up to ~30× in total tokens; the most expensive instance is ~7M tokens more than the cheapest.
- Within the same problem, the most expensive run is on average ~2× the cheapest.
Cost vs. performance
- More tokens do not reliably improve accuracy. Accuracy often peaks at intermediate token usage and then saturates or declines (inverse test-time scaling phenomenon).
- High-cost failing runs show more repeated file views/edits (inefficient, redundant exploration).
Model-level token efficiency differences
- Large per-model differences in token consumption on the same tasks: e.g., Kimi-K2 and Claude Sonnet-4.5 on average consume >1.5M more tokens than GPT-5 on the SWE-bench set.
- Relative cost rankings of models persist on both shared-success and shared-failure subsets, indicating behavior-driven (model-specific) differences rather than task selection.
Human difficulty ratings are weak predictors
- Expert-rated developer time/difficulty correlates only modestly with agent token consumption (Kendall τb ≈ 0.32); substantial overlap and many outliers exist.
Self-prediction of token use is poor
- Frontier models achieve only weak-to-moderate correlations between predicted and actual token usage (max reported r ≈ 0.39).
- Output-token usage is easier to predict than input-token usage. Models systematically underestimate real token costs.
Practical observation
- Even with token caching enabled, repeated context ingestion across rounds makes input-side cost dominant.
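
The run-to-run variability figures above (up to ~30× between runs, ~2× on average between the most and least expensive run of a problem) can be computed as a per-problem max/min ratio. The following is a minimal sketch with made-up token totals; the function name, the problem ids and the numbers are hypothetical, and each problem is assumed to have at least one run with a positive token count.

```python
from statistics import mean


def run_spread(tokens_by_problem):
    """Map each problem id to (max total tokens) / (min total tokens) across its runs."""
    return {pid: max(runs) / min(runs) for pid, runs in tokens_by_problem.items()}


# Hypothetical total-token counts for four runs of two problems.
example = {
    "problem-a": [400_000, 520_000, 610_000, 800_000],
    "problem-b": [150_000, 300_000, 900_000, 4_500_000],
}

spread = run_spread(example)
mean_spread = mean(spread.values())
print(spread, mean_spread)
```

Reporting both the mean ratio and the largest ratio is useful here because the distribution is heavy-tailed: a typical problem may show a modest spread while a few show extreme ones.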

Data & Methods

Environment and benchmark
- Agent framework: OpenHands.
- Benchmark: SWE-bench Verified, a set of 500 real-world GitHub issues with repositories and tests (realistic, multi-step coding problems).
Models evaluated
- Eight frontier LLMs: Claude Sonnet-3.7, Sonnet-4, Sonnet-4.5, GPT-5, GPT-5.2, Qwen3-Coder-480B-A35B-Instruct, Kimi-K2, Gemini-3-Pro.
Experimental design
- Four independent runs per problem per model (to capture run-level variability).
- Agents act autonomously (no human in the loop); multiple rounds per problem where full conversation history is carried forward.
- Instrumentation: parsed structured JSON agent logs to extract per-type token counts (input vs. output), monetary cost estimates, action types (file view/modify/tool calls), and intermediate trajectories.
- Aggregation: reported averages across the four runs per problem for most analyses; also analyzed run-level quartiles (MinCost → MaxCost).
Prediction task
- Formalized pre-execution token consumption prediction: the agent must estimate input and output token usage prior to executing the task, given tools and environment.
- Performance measured via correlation with actual usage and estimation bias.
Open data
- All trajectories and logs from the experiments are open-sourced on the project website (per paper).
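
The prediction task is scored by correlation with actual usage and by estimation bias. As a hedged illustration, the sketch below computes a Pearson correlation and a mean signed relative error; this summary does not state the paper's exact formulas, so both metric definitions and the function names are assumptions, and the data are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_relative_bias(predicted, actual):
    """Mean of (predicted - actual) / actual; a negative value means underestimation."""
    return sum((p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predictions that track the trend but are half the true size.
predicted = [100_000, 200_000, 300_000]
actual = [200_000, 400_000, 600_000]

r = pearson_r(predicted, actual)
bias = mean_relative_bias(predicted, actual)
print(r, bias)
```

The two numbers answer different questions: a predictor can rank tasks well (high correlation) and still underestimate every one of them (negative bias), so both are worth reporting.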

Implications for AI Economics

Pricing and billing models
- Current per-token billing is opaque and unpredictable for agentic workflows. Because input-context ingestion dominates cost and is hard to predict, users face large unknown bills.
- Providers should consider pricing designs that account for agentic patterns: e.g., session/round caps, clearer pre-run estimates, tiered pricing for input-heavy sessions, or pricing tied to successful outcomes to mitigate paying for failed explorations.
Product controls and transparency
- Systems should offer budget caps, early alerts, and coarse pre-execution cost estimates (self-prediction gives a useful coarse signal despite imperfect instance-level accuracy).
- Expose model-level “token-efficiency” benchmarks so users can choose models based on cost–accuracy trade-offs — raw accuracy alone is insufficient.
Cost-management strategies
- Prefer token-efficient models (empirically measured) for high-volume production uses; even if absolute accuracy differences are small, model-specific behavior can massively change token spend.
- Invest in engineering mitigations: aggressive caching, truncated/unrolled context strategies, better file-access policies, and tooling that prevents redundant file re-reads/edits.
SLA, procurement, and budgeting
- Organizations deploying agents should build stochasticity into budgets and SLAs (expect heavy tails and run-to-run variance).
- Human estimates of task difficulty are unreliable for cost forecasting; empirical measurement on target models/environments is necessary.
Research and product priorities
- Develop improved pre-execution token estimation models (especially for input/context growth), stop criteria that detect unproductive exploration, and training/evaluation metrics that penalize token-inefficient behavior.
- Consider incentives in model design and fine-tuning to reduce unnecessary context growth (token-efficiency as a first-class training objective).
Policy & user protection
- Given systemic underestimation and unpredictability, regulators and marketplaces could require clearer pre-run cost disclosures and easy-to-set hard caps to protect users from runaway charges.
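
The budget caps and early alerts suggested above could take the form of a small guard that an agent loop consults after every round. This is a minimal sketch, not a feature of any existing agent framework: the `TokenBudget` class, its method names and the 80% alert threshold are all hypothetical.

```python
class TokenBudget:
    """Track cumulative token usage against a hard cap with one early alert."""

    def __init__(self, hard_cap, alert_fraction=0.8):
        self.hard_cap = hard_cap
        self.alert_at = hard_cap * alert_fraction  # assumed alert threshold
        self.used = 0
        self.alerted = False

    def record(self, input_tokens, output_tokens):
        """Add one round's usage and return 'ok', 'alert', or 'stop'."""
        self.used += input_tokens + output_tokens
        if self.used >= self.hard_cap:
            return "stop"       # the caller should end the run
        if self.used >= self.alert_at and not self.alerted:
            self.alerted = True
            return "alert"      # fires once, when the threshold is first crossed
        return "ok"


budget = TokenBudget(hard_cap=1000)
statuses = [
    budget.record(500, 100),
    budget.record(200, 50),
    budget.record(50, 0),
    budget.record(100, 50),
]
print(statuses)
```

Counting input and output together matches the finding that input-side growth is what exhausts a budget; a production version would presumably weight the two by their actual prices.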

Summary takeaway: Agentic LLM workflows fundamentally change the token-economics picture — they are input-heavy, highly variable, model-dependent, and hard to predict. For practical deployments and fair pricing, stakeholders must measure model-specific token behavior, provide coarse pre-execution estimates and budget controls, and prioritize token-efficiency in both model selection and system design.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper presents a systematic, multi-model measurement study with clear empirical regularities (orders-of-magnitude differences, variability across runs and models, prediction correlations). However, it is not making causal claims and its findings rest on a single benchmark (SWE-bench Verified), a particular set of agent implementations and model settings, leaving open alternative explanations (prompt/agent design, tokenization differences, sampling settings) and limited external validation.

Methods Rigor: medium. The authors evaluate eight frontier LLMs, compare agentic tasks to other code-task modalities, run multiple stochastic trajectories, and quantify token breakdowns and prediction performance, showing attention to variance and robustness. Missing or unclear details (e.g., number of tasks and runs per model, seed/control of sampling parameters, exact prompt/agent wrappers, tokenization normalization across models, and potential selection of models/tasks) limit reproducibility and make it harder to rule out confounders.

Sample: Trajectories from eight frontier LLMs (including GPT-5, Kimi-K2, Claude-Sonnet-4.5 and others) executed on agentic coding tasks drawn from the SWE-bench Verified benchmark; comparisons are made to code reasoning and code chat modalities, with multiple stochastic runs per task to capture token-usage variance and analyses of input vs. output token breakdowns and models' pre-execution cost predictions.

Themes: productivity, adoption

Generalizability
- Focused on software engineering / coding tasks (SWE-bench Verified); may not generalize to other domains (e.g., summarization, customer support, scientific workflows).
- Results depend on specific agent implementations, prompt templates, tool usage patterns and API settings (temperature, sampling), which vary in real deployments.
- Tokenization schemes and model-specific token accounting differ across providers and can affect measured costs; cross-model normalization may be imperfect.
- Limited model sample and benchmark scope: newer model releases or different frontier models might show different efficiency patterns.
- Benchmarks may not capture real-world long-running workflows, multi-agent interactions, or production orchestration overheads.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat.	Organizational Efficiency	negative	high	total token consumption (agentic vs. code reasoning/code chat)	1000x 0.18
Input tokens rather than output tokens drive the overall cost of agentic tasks.	Organizational Efficiency	negative	high	share/contribution of input tokens vs output tokens to total token consumption	0.18
Token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens.	Organizational Efficiency	mixed	high	run-to-run variability in total token consumption for the same task	30x 0.18
Higher token usage does not translate into higher accuracy; accuracy often peaks at intermediate cost and saturates at higher costs.	Output Quality	null_result	high	task accuracy as a function of token usage	0.18
Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5.	Organizational Efficiency	negative	high	average total token consumption per model (tokens consumed by model A minus model B)	over 1.5 million more tokens 0.18
Task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend.	Task Allocation	null_result	high	correspondence/alignment between human-rated task difficulty and measured token costs	0.18
Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs.	Organizational Efficiency	negative	high	correlation and bias between model self-predicted token usage and actual token usage	correlations up to 0.39 0.18
This paper presents the first systematic study of token consumption patterns in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified and evaluating models' ability to predict their own token costs before task execution.	Research Productivity	positive	high	scope of study (presence of systematic analysis and self-prediction evaluation)	n=8 0.03