The Commonplace

AI models misreport abilities and costs, undermining market-style coordination: six recent LLMs systematically miscalibrate success probability and token usage on software-engineering tasks, causing auction outcomes to diverge from full-information allocation, and adding prior capability information only partly closes the gap.

MarketBench: Evaluating AI Agents as Market Participants
Andrey Fradkin, Rohit Krishnan · April 26, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
MarketBench shows six recent LLMs are poorly calibrated on self-reported success probability and token usage across 93 software-engineering tasks, causing auction allocations based on those self-reports to diverge from full-information allocations, and a simple context-based intervention only modestly improves calibration.

Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.

Summary

Main Finding

Current frontier LLMs are poor market participants because they cannot reliably self-assess task‑level success probability or token (cost) usage. This miscalibration meaningfully distorts auction outcomes and limits the welfare gains markets promise for routing heterogeneous AI agents. Simple context interventions (self‑history cards) improve calibration modestly but do not close the gap to a full‑information benchmark. A market‑like scaffold can help route work and benefits from agent diversity, but its performance is capped by agents’ weak self‑knowledge.

Key Points

  • Motivation: Markets coordinate decentralized, heterogeneous agents by using prices and private reports about fit and cost. That requires agents to produce informative, calibrated self‑reports on success probability and expected cost.
  • MarketBench: a benchmark designed to evaluate whether AI agents have the metacognitive signals needed to participate in markets (ex ante p_success and expected token usage).
  • Empirical result summary (Phase I, 93 SWE‑bench Lite coding tasks, six LLMs):
    • Realized pass rates cluster tightly around 75–81% across models.
    • Reported p_success values vary widely across models (≈61% to 92.9%), showing systematic miscalibration: some models are overconfident, others underconfident.
    • Only two models (Claude Opus 4.5 and Claude Sonnet 4.5) show positive Brier skill relative to a base‑rate forecast; the other four do worse than that baseline.
    • Token forecasts are severely understated (median estimated:actual token ratio ≈ 0.02), so cost forecasts are highly inaccurate.
  • Auction simulation (reserve‑price procurement, bids mechanically derived from self‑reports):
    • Auctions built from these self‑reports diverge sharply from full‑information (oracle) allocation: realized profits per task are far below the oracle.
    • Example: GPT‑5.2 realized profit ≈ $0.006/task vs oracle ≈ $0.385/task.
    • Overconfident models (e.g., Gemini 3 Pro Preview) win many auctions (≈84.6%) via aggressive bids, producing distorted allocation.
  • Simple intervention: prepending per‑task self‑history (past performance, average stated confidence, token underestimation) to the prompt improves calibration (Brier score and ECE) and token estimates, but only modestly reduces the gap to the oracle allocation.
  • Live scaffold (operator‑run market‑like router):
    • The operator collects each worker's ask, p_success, and expected time, combines them into a routing score, and allows up to two attempts per task, excluding a worker that fails the first attempt.
    • The scaffold benefits from agent diversity, but overall gains are limited by weak self‑assessment.
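
The calibration quantities named above (Brier score, Brier skill versus a base rate, ECE, and the estimated:actual token ratio) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the ten equal-width bins, and the sample numbers are all assumptions made here for the example.

```python
from statistics import median
from typing import Sequence


def brier_score(p: Sequence[float], y: Sequence[int]) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(p)


def brier_skill(p: Sequence[float], y: Sequence[int]) -> float:
    """1 - BS / BS_ref, where the reference always forecasts the realized pass rate.

    Positive means the self-reports beat the base rate. Undefined (division by
    zero) if every task passes or every task fails.
    """
    base_rate = sum(y) / len(y)
    reference = brier_score([base_rate] * len(y), y)
    return 1.0 - brier_score(p, y) / reference


def expected_calibration_error(
    p: Sequence[float], y: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (an assumed binning scheme)."""
    bins = [[] for _ in range(n_bins)]
    for pi, yi in zip(p, y):
        idx = min(int(pi * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[idx].append((pi, yi))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(pi for pi, _ in members) / len(members)
        pass_rate = sum(yi for _, yi in members) / len(members)
        ece += len(members) / len(p) * abs(confidence - pass_rate)
    return ece


def median_token_ratio(estimated: Sequence[float], actual: Sequence[float]) -> float:
    """Median of estimated / actual tokens; values far below 1 mean underestimation."""
    return median(e / a for e, a in zip(estimated, actual))


if __name__ == "__main__":
    # Hypothetical data: four tasks, a model that always states 0.9.
    outcomes = [1, 1, 1, 0]
    stated = [0.9, 0.9, 0.9, 0.9]
    print(brier_score(stated, outcomes))
    print(brier_skill(stated, outcomes))
    print(expected_calibration_error(stated, outcomes))
    print(median_token_ratio([100, 200, 300], [5000, 10000, 15000]))
```

In this toy case the model passes 75% of tasks but always states 0.9, so its Brier skill is negative: a constant forecast at the base rate would have scored better, which is the pattern the summary reports for four of the six models.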

Data & Methods

  • Task set: 93 tasks drawn from SWE‑bench Lite (real GitHub issue→fix pairs, executable test suites, binary pass/fail).
  • Models: six recently released LLMs (named in paper: Claude Opus 4.5, Claude Sonnet 4.5, Gemini 3 Pro Preview, GPT‑5.2, GPT‑5.2‑pro, GPT‑5‑mini).
  • Calibration elicitation:
    • Prompt: before any solve attempt, the model is asked to return JSON with p_success ∈ [0,1], estimated_tokens_total, and an optional rationale. Temperature is set to 0 for determinism.
    • Realized outcomes from a stronger external scaffold (interactive shell, test feedback, multi‑turn revisions) produce ground truth pass/fail and actual token usage (converted to dollars using model‑specific blended pricing).
    • Evaluation metrics: realized pass rate, mean stated p_success, Brier score / Brier skill vs base‑rate, Expected Calibration Error (ECE), token forecast error (estimated vs realized tokens).
  • Auction simulation:
    • Reserve‑price procurement with bids computed mechanically from elicited p_success and token cost:
      • b* = token_cost + penalty × (1 − p_success) / p_success, where penalty = $2 in main experiments.
    • Reserve prices drawn Uniform[0,1]; second‑price payment rule with independent reserve draws; 100 reserve draws per row (deterministic seeds).
    • Outcomes: win rate, expected profit (from model’s reported p), realized profit (using actual pass/fail), oracle profit (perfect knowledge).
  • Conceptual model (Section 3):
    • Two agents H (higher capability, higher cost) and L (lower capability, lower cost). Agents may observe a task‑specific signal εi; allocation rules (assign H, assign L, run both in parallel, market with success‑contingent payments) are compared.
    • Proposition: a market that conditions on agents’ private capability signals weakly dominates fixed non‑market routings when v > cH > cL > 0, with strict gains in states where fixed rules err (wrong agent, redundant payment, paying when no one can solve).
  • Live scaffold experiment:
    • Operator computes Score_i = p_i × (Utility − Ask_i) − (1 − p_i) × Penalty_i − E[Cost_i] and assigns tasks using that score; each task allowed up to two one‑shot attempts; worker exclusion on first‑attempt failure; operator pays costs directly.
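
The bid formula and routing score above can be written out as a short Python sketch. Several things here are assumptions rather than details confirmed by the summary: the bid formula is transcribed as printed, read with standard operator precedence; the auction rule (lowest bid wins if it clears the reserve and is paid the smaller of the second-lowest bid and the reserve) is one common reading of "second‑price payment rule"; and all function names and sample numbers are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def break_even_bid(token_cost: float, p_success: float, penalty: float = 2.0) -> float:
    """b* = token_cost + penalty * (1 - p_success) / p_success, as printed above."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return token_cost + penalty * (1.0 - p_success) / p_success


def run_procurement_auction(
    bids: Sequence[float], reserve: float
) -> Optional[Tuple[int, float]]:
    """Assumed rule: lowest bid wins if it is at or below the reserve.

    Returns (winner index, payment), with payment = min(second-lowest bid,
    reserve), or None when no bid clears the reserve.
    """
    order = sorted(range(len(bids)), key=lambda i: bids[i])
    winner = order[0]
    if bids[winner] > reserve:
        return None
    runner_up = bids[order[1]] if len(order) > 1 else reserve
    return winner, min(runner_up, reserve)


def routing_score(
    p: float, utility: float, ask: float, penalty: float, expected_cost: float
) -> float:
    """Score_i = p_i * (Utility - Ask_i) - (1 - p_i) * Penalty_i - E[Cost_i]."""
    return p * (utility - ask) - (1.0 - p) * penalty - expected_cost


if __name__ == "__main__":
    # Hypothetical agents with the same token cost but different stated confidence.
    cautious = break_even_bid(token_cost=0.05, p_success=0.80)
    confident = break_even_bid(token_cost=0.05, p_success=0.95)
    print(cautious, confident)
    print(run_procurement_auction([cautious, confident], reserve=0.6))
    print(routing_score(p=0.8, utility=1.0, ask=0.3, penalty=0.5, expected_cost=0.05))
```

Because the bid falls as stated p_success rises, an agent that overstates its confidence submits a lower bid and wins more often under this rule, whether or not it can actually solve the task.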

Implications for AI Economics

  • Markets can, in principle, extract value from decentralized private information about task‑specific fit. But that value is only realizable if agents can credibly and accurately report success probabilities and expected costs.
  • Self‑assessment is a key bottleneck for market‑style coordination of AI agents:
    • Miscalibrated probabilities distort bids and allocation; aggressive overconfidence can lead to inefficient winners and low principal recovery.
    • Systematic underestimation of tokens inflates deviations between expected and realized costs, complicating pricing and payment design.
  • Mechanism design and platform design must account for weak self‑reports:
    • Blind adoption of standard procurement auctions (second‑price, reserve) without calibration adjustments risks poor allocation and transfers.
    • Robust mechanisms could combine self‑reports with operator/market signals: reputational histories, ex post verification, audits, calibrated scoring rules, or explicit penalties/incentives tied to realized outcomes.
    • Hybrid designs (operator scaffolds, central planners) may be optimal in the short run when agent metacognition is unreliable.
  • Policy and market‑structure consequences:
    • Standardization of performance reporting (e.g., verifiable execution histories, test‑set calibration cards) could improve market functioning.
    • Marketplace rules (reserve policies, failure penalties, insurance layers) should be set recognizing high uncertainty in agent self‑reports.
  • Research agenda:
    • Improve elicitation techniques and model metacognition (calibration across tasks and cost estimates).
    • Expand benchmarks beyond coding tasks to other economic activities (creative work, decision making, forecasting).
    • Study incentive‑compatible information elicitation (proper scoring rules, mechanism designs robust to miscalibration and strategic reporting) in multi‑agent AI settings.
    • Investigate dynamic reputational and learning mechanisms where agents improve calibration through verified market participation.
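
To make the reference to proper scoring rules concrete, the sketch below checks numerically that an agent scored with the Brier loss minimizes its expected loss by reporting its true success probability. This is a standard property of the Brier score, not a result from the paper; the function names and the grid search are illustrative assumptions, and the example says nothing about strategic behavior inside an auction.

```python
def expected_brier_loss(report: float, true_p: float) -> float:
    """Expected squared error when the task succeeds with probability true_p."""
    return true_p * (1.0 - report) ** 2 + (1.0 - true_p) * report ** 2


def best_report(true_p: float, grid_size: int = 101) -> float:
    """Report on an evenly spaced grid over [0, 1] that minimizes expected loss."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda r: expected_brier_loss(r, true_p))


if __name__ == "__main__":
    # Hypothetical agent whose true success probability is 0.78.
    print(best_report(0.78))
    # Overstating confidence raises the expected loss.
    print(expected_brier_loss(0.93, 0.78) > expected_brier_loss(0.78, 0.78))
```

Expanding the loss gives r² − 2pr + p, which is minimized at r = p, so the grid search lands on the truthful report whenever that value is on the grid.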

Limitations noted by the authors: a single domain (software engineering), a 93‑task subset, six models sampled from a fast‑moving frontier, simplified blended pricing, and scaffolds that are operator‑run rather than decentralized markets. These limit external validity, but the results still point to one specific capability, calibrated ex ante beliefs about success and cost, as the crucial bottleneck for marketizing AI agents.

Assessment

  • Paper Type: descriptive
  • Evidence Strength: medium. The paper presents systematic empirical measurements using a benchmark (93 software-engineering tasks) and six recent LLMs, with experiments and an intervention that plausibly link self-report miscalibration to suboptimal auction outcomes; however, evidence is limited to simulated market settings, a narrow task domain, and a small set of models, so causal claims about real-world market performance are not strongly supported.
  • Methods Rigor: medium. The study uses a structured benchmark (MarketBench), compares self-reports to realized outcomes and a full-information benchmark, and tests an intervention using prior capability information; but it is constrained by a modest number of models, tasks from a single domain (software engineering), simulated auctions rather than deployed markets, and limited discussion (in the summary) of robustness checks, variance, or statistical uncertainty.
  • Sample: A 93-task subset of SWE-bench Lite (software engineering tasks) evaluated across six recently released large language models; measures include models' self-reported success probabilities and token-usage estimates, realized task success and token consumption, simulated auctions built from self-reports, and a follow-up intervention that injects capability information from prior experiments into the model context.
  • Themes: org_design, adoption
  • Generalizability:
    • Limited to software-engineering tasks (SWE-bench subset) and may not generalize to other task types or domains.
    • Only six LLMs evaluated; results may not hold across other model families, sizes, or future versions.
    • Experiments use simulated auctions and self-reports rather than real economic incentives or deployed multi-agent markets.
    • Token-cost reporting may not map cleanly to real-world resource or monetary costs across different deployment environments.
    • Intervention tested (adding prior capability info in context) may behave differently under repeated interactions or strategic reporting with economic incentives.

Claims (10)

Claim | Direction | Confidence | Outcome | Details
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. [Governance And Regulation] | positive | high | suitability of markets for coordinating AI agents (theoretical promise) | 0.03
In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. [Decision Quality] | null_result | high | informativeness/calibration of self-reported ability and cost signals | 0.03
We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. [Other] | null_result | high | existence of the MarketBench benchmark | 0.09
We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. [Other] | null_result | high | experimental dataset size and model set used for demonstration | n=93; 0.09
These LLMs are miscalibrated on both success probability and token usage. [Decision Quality] | negative | high | calibration of self-reported success probability and token usage | n=93; 0.18
Auctions built from these self-reports diverge from a full-information allocation. [Task Allocation] | negative | high | difference between allocations produced by auctions using self-reports and full-information allocation | n=93; 0.18
A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration. [Decision Quality] | positive | high | change in calibration of predicted success probability and token usage after adding prior capability information | n=93; 0.18
The intervention only modestly narrows the gap to a full-information benchmark. [Task Allocation] | mixed | high | remaining gap between post-intervention outcomes and full-information benchmark (calibration/allocation) | n=93; 0.18
We document the performance of a market-based scaffolding with these LLMs. [Task Allocation] | null_result | high | performance metrics of a market-based scaffolding using LLM self-reports | n=93; 0.09
Self-assessment is a key bottleneck for market-style coordination of AI agents. [Decision Quality] | negative | high | importance of self-assessment calibration for successful market coordination | n=93; 0.18

Notes