AI models misreport abilities and costs, undermining market-style coordination: six recent LLMs systematically miscalibrate success probability and token usage on software-engineering tasks, causing auction outcomes to diverge from full-information allocation, and adding prior capability information only partly closes the gap.
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.
Summary
Main Finding
Current frontier LLMs are poor market participants because they cannot reliably self-assess task‑level success probability or token (cost) usage. This miscalibration meaningfully distorts auction outcomes and limits the welfare gains markets promise for routing heterogeneous AI agents. Simple context interventions (self‑history cards) improve calibration modestly but do not close the gap to a full‑information benchmark. A market‑like scaffold can help route work and benefits from agent diversity, but its performance is capped by agents’ weak self‑knowledge.
Key Points
- Motivation: Markets coordinate decentralized, heterogeneous agents by using prices and private reports about fit and cost. That requires agents to produce informative, calibrated self‑reports on success probability and expected cost.
- MarketBench: a benchmark designed to evaluate whether AI agents have the metacognitive signals needed to participate in markets (ex ante p_success and expected token usage).
- Empirical result summary (Phase I, 93 SWE‑bench Lite coding tasks, six LLMs):
  - Realized pass rates cluster tightly around 75–81% across models.
  - Mean reported p_success varies widely across models (≈61% to 92.9%), showing systematic miscalibration (some models are overconfident, others underconfident).
  - Only two models (Claude Opus 4.5, Claude Sonnet 4.5) show positive Brier skill versus a base‑rate forecast; the others do worse than the base rate.
  - Token forecasts are severely understated (median estimated:actual token ratio ≈ 0.02), so cost forecasts are highly inaccurate.
- Auction simulation (reserve‑price procurement, bids mechanically derived from self‑reports):
  - Auctions built from these self‑reports diverge sharply from full‑information (oracle) allocation: realized profits per task are far below the oracle.
  - Example: GPT‑5.2 realized profit ≈ $0.006/task vs oracle ≈ $0.385/task.
  - Overconfident models (e.g., Gemini 3 Pro Preview) win many auctions (≈84.6%) via aggressive bids, producing distorted allocation.
- Simple intervention: prepending per‑task self‑history (past performance, average stated confidence, token underestimation) to the prompt improves calibration (Brier score and ECE) and token estimates, but only modestly reduces the gap to the oracle allocation.
- Live scaffold (operator‑run market‑like router):
  - Uses asks, p_success, expected time, and a routing score, with up to two attempts per task and worker exclusion on failure.
  - The scaffold benefits from agent diversity, but overall gains are limited by weak self‑assessment.
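The calibration quantities named in the key points above (Brier score, Brier skill against a base rate, ECE, and the estimated:actual token ratio) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes the base-rate reference forecast is the model's own realized pass rate on the same tasks and that ECE uses ten equal-width bins; the source does not specify either choice, and all function names are hypothetical.

```python
from statistics import mean, median


def brier_score(probs, outcomes):
    """Mean squared error between stated p_success and binary pass/fail."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def brier_skill(probs, outcomes):
    """1 - BS / BS_ref, where BS_ref always forecasts the realized pass rate.

    Assumption: the reference is the in-sample base rate. Positive values
    mean the self-reports beat that constant forecast.
    """
    base_rate = mean(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0:
        raise ValueError("skill is undefined when all outcomes are identical")
    return 1 - brier_score(probs, outcomes) / reference


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean |confidence - accuracy| over equal-width bins (assumed)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = mean(p for p, _ in members)
        accuracy = mean(y for _, y in members)
        ece += len(members) / total * abs(confidence - accuracy)
    return ece


def median_token_ratio(estimated, actual):
    """Median of per-task estimated/actual token counts (< 1 is an underestimate)."""
    return median(e / a for e, a in zip(estimated, actual))
```

With these definitions, a model that states 0.9 on every task while passing 75% of them gets a negative Brier skill, which is the pattern the summary describes for overconfident models.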
Data & Methods
- Task set: 93 tasks drawn from SWE‑bench Lite (real GitHub issue→fix pairs, executable test suites, binary pass/fail).
- Models: six recently released LLMs (named in paper: Claude Opus 4.5, Claude Sonnet 4.5, Gemini 3 Pro Preview, GPT‑5.2, GPT‑5.2‑pro, GPT‑5‑mini).
- Calibration elicitation:
  - Prompt: before any solve attempt, the model is asked to return JSON with p_success ∈ [0,1], estimated_tokens_total, and an optional rationale. Temperature is set to 0 for determinism.
  - Ground truth: a stronger external scaffold (interactive shell, test feedback, multi‑turn revisions) produces the realized pass/fail outcome and actual token usage (converted to dollars using model‑specific blended pricing).
- Evaluation metrics: realized pass rate, mean stated p_success, Brier score / Brier skill vs base‑rate, Expected Calibration Error (ECE), token forecast error (estimated vs realized tokens).
- Auction simulation:
  - Reserve‑price procurement with bids computed mechanically from elicited p_success and token cost:
    - b* = token_cost + penalty × (1 − p_success) / p_success, where penalty = $2 in main experiments.
  - Reserve prices drawn Uniform[0,1]; second‑price payment rule with independent reserve draws; 100 reserve draws per row (deterministic seeds).
  - Outcomes: win rate, expected profit (from model’s reported p), realized profit (using actual pass/fail), oracle profit (perfect knowledge).
- Conceptual model (Section 3):
  - Two agents H (higher capability, higher cost) and L (lower capability, lower cost). Agents may observe a task‑specific signal ε_i; allocation rules (assign H, assign L, run both in parallel, market with success‑contingent payments) are compared.
  - Proposition: a market that conditions on agents’ private capability signals weakly dominates fixed non‑market routings when v > c_H > c_L > 0, with strict gains in states where fixed rules err (wrong agent, redundant payment, paying when no one can solve).
- Live scaffold experiment:
  - Operator computes Score_i = p_i × (Utility − Ask_i) − (1 − p_i) × Penalty_i − E[Cost_i] and assigns tasks using that score; each task allowed up to two one‑shot attempts; worker exclusion on first‑attempt failure; operator pays costs directly.
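The two formulas in this section, the mechanical bid b* and the operator's routing score, can be written out as a short sketch. It reads the bid formula with standard operator precedence, so only the penalty term is divided by p_success. It also assumes the operator routes to the highest-scoring worker that has not been excluded; the source says only that tasks are assigned "using that score". The `Offer` type, field names, and function names are hypothetical.

```python
from dataclasses import dataclass


def mechanical_bid(token_cost, p_success, penalty=2.0):
    """b* = token_cost + penalty * (1 - p_success) / p_success.

    penalty defaults to the $2 used in the main experiments.
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return token_cost + penalty * (1 - p_success) / p_success


@dataclass
class Offer:
    """One worker's self-report for a task (hypothetical container)."""
    worker: str
    p_success: float
    ask: float
    penalty: float
    expected_cost: float


def routing_score(offer, utility):
    """Score_i = p_i * (Utility - Ask_i) - (1 - p_i) * Penalty_i - E[Cost_i]."""
    return (
        offer.p_success * (utility - offer.ask)
        - (1 - offer.p_success) * offer.penalty
        - offer.expected_cost
    )


def pick_worker(offers, utility, excluded=frozenset()):
    """Assumed rule: choose the highest-scoring worker not yet excluded.

    Passing the first-attempt worker in `excluded` mimics worker exclusion
    on failure before the second attempt.
    """
    candidates = [o for o in offers if o.worker not in excluded]
    if not candidates:
        return None
    return max(candidates, key=lambda o: routing_score(o, utility))
```

Because both quantities take p_success as an input, an inflated self-report lowers the bid and raises the score at the same time, which is how overconfidence translates directly into winning more allocations.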
Implications for AI Economics
- Markets can, in principle, extract value from decentralized private information about task‑specific fit. But that value is only realizable if agents can credibly and accurately report success probabilities and expected costs.
- Self‑assessment is a key bottleneck for market‑style coordination of AI agents:
  - Miscalibrated probabilities distort bids and allocation; aggressive overconfidence can lead to inefficient winners and low principal recovery.
  - Systematic underestimation of tokens inflates deviations between expected and realized costs, complicating pricing and payment design.
- Mechanism design and platform design must account for weak self‑reports:
  - Blind adoption of standard procurement auctions (second‑price, reserve) without calibration adjustments risks poor allocation and transfers.
  - Robust mechanisms could combine self‑reports with operator/market signals: reputational histories, ex post verification, audits, calibrated scoring rules, or explicit penalties/incentives tied to realized outcomes.
  - Hybrid designs (operator scaffolds, central planners) may be optimal in the short run when agent metacognition is unreliable.
- Policy and market‑structure consequences:
  - Standardization of performance reporting (e.g., verifiable execution histories, test‑set calibration cards) could improve market functioning.
  - Marketplace rules (reserve policies, failure penalties, insurance layers) should be set recognizing high uncertainty in agent self‑reports.
- Research agenda:
  - Improve elicitation techniques and model metacognition (calibration across tasks and cost estimates).
  - Expand benchmarks beyond coding tasks to other economic activities (creative work, decision making, forecasting).
  - Study incentive‑compatible information elicitation (proper scoring rules, mechanism designs robust to miscalibration and strategic reporting) in multi‑agent AI settings.
  - Investigate dynamic reputational and learning mechanisms where agents improve calibration through verified market participation.
Limitations noted by authors: a single domain (software engineering), a 93‑task subset, six models drawn from a moving frontier, blended pricing simplifications, and scaffolds that are operator‑run rather than decentralized markets. These limit external validity, but the results still point to one specific capability, calibrated ex ante beliefs about success and cost, as the crucial bottleneck for marketizing AI agents.
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. | Governance And Regulation | positive | high | suitability of markets for coordinating AI agents (theoretical promise) | 0.03 |
| In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. | Decision Quality | null_result | high | informativeness/calibration of self-reported ability and cost signals | 0.03 |
| We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. | Other | null_result | high | existence of the MarketBench benchmark | 0.09 |
| We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. | Other | null_result | high | experimental dataset size and model set used for demonstration | n=93; 0.09 |
| These LLMs are miscalibrated on both success probability and token usage. | Decision Quality | negative | high | calibration of self-reported success probability and token usage | n=93; 0.18 |
| Auctions built from these self-reports diverge from a full-information allocation. | Task Allocation | negative | high | difference between allocations produced by auctions using self-reports and full-information allocation | n=93; 0.18 |
| A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration. | Decision Quality | positive | high | change in calibration of predicted success probability and token usage after adding prior capability information | n=93; 0.18 |
| The intervention only modestly narrows the gap to a full-information benchmark. | Task Allocation | mixed | high | remaining gap between post-intervention outcomes and full-information benchmark (calibration/allocation) | n=93; 0.18 |
| We document the performance of a market-based scaffolding with these LLMs. | Task Allocation | null_result | high | performance metrics of a market-based scaffolding using LLM self-reports | n=93; 0.09 |
| Self-assessment is a key bottleneck for market-style coordination of AI agents. | Decision Quality | negative | high | importance of self-assessment calibration for successful market coordination | n=93; 0.18 |