A simulated multi-agent marketplace shows that only a few LLM retailers reliably grow capital while most break even, revealing a winner-take-most pattern in economic task performance; semantic matching alone does not predict economic success.

Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai · April 07, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Market-Bench is a reproducible multi-agent supply-chain benchmark showing large performance disparities among LLM retailers in procurement and retail tasks, with a small subset consistently achieving capital appreciation while many hover near break-even despite similar semantic scores.

The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce Market-Bench, a comprehensive benchmark that evaluates the capabilities of LLMs in economically relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise. In the procurement stage, LLMs bid for limited inventory in budget-constrained auctions. In the retail stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role-based attention mechanism for purchase. Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking on 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon, i.e., only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around the break-even point despite similar semantic matching scores. Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.

Summary

Main Finding

Market-Bench is a reproducible, closed‑loop multi‑agent supply‑chain benchmark that jointly tests LLMs’ quantitative optimization (procurement, pricing) and generative language (marketing) under hard scarcity. Across 20 LLM-driven retailer agents, the benchmark reveals large performance dispersion and a winner‑take‑most dynamic: a small subset of models consistently compound capital while many remain near break‑even. Successful performance is driven primarily by procurement (auction) success; language matters because of the Persona‑Gated Attention mechanism, but securing inventory is the threshold skill.

Key Points

Purpose: Evaluate LLMs on combined numeric and semantic economic decision‑making in a competitive market (auctions, pricing, marketing).
New mechanism: Persona‑Gated Attention. Each seller’s visibility to a buyer depends on the cosine similarity between the seller’s slogan embedding and the buyer’s hidden persona embedding; attention weight ∝ exp(λ · similarity / τ). Language thus directly affects which sellers enter each buyer’s consideration set.
Two-stage environment:
- Stage A: Per‑item multi‑unit first‑price procurement auction with reserve prices and hard budget constraints.
- Stage B: Retail: agents set per‑item retail prices and a free‑form slogan; buyers sample consideration sets by persona alignment and then buy from the lowest‑priced seller in their set.
Logged outputs: full trajectories of bids, allocations, prices, slogans, sales, funds and inventories — enabling economic, operational and semantic metrics and process‑level analysis.
Empirical findings:
- Large dispersion among 20 LLM agents; top performers (Gemini 2.5 Pro / Flash) achieve much higher profits and margins. Some smaller reasoning‑focused models (e.g., Phi‑4) compete well versus larger models.
- Procurement success (BidEfficiency) and downstream FillRate strongly correlate with profit (Spearman ρ ≈ 0.68 and 0.88 respectively); StockoutRate is strongly negatively correlated with profit (ρ ≈ −0.88). Procurement under scarcity is a threshold skill producing bimodality (winners vs effectively inactive agents).
- Market‑level inequality increases rapidly over the horizon (for example, Gini ≈ 0.07 → 0.21; Theil ≈ 0.02 → 0.10; CV ≈ 0.13 → 0.45), while concentration stays in a competitive range (HHI ≈ 0.08–0.10); CR4 rises (≈ 0.23 → 0.33) and Active Ratio falls toward ≈ 0.4.
- Dynamics: a strong entry shock between steps 0→1 after which strategies and slogan‑persona similarities largely stabilize; early success compounds.
- Architecture matters more than raw scale: reasoning‑enabled models tend to outperform purely larger models in these tasks.
Metrics reported: economic (profit, net profit margin), operational (FillRate, StockoutRate, BidEfficiency, OSI, IEI), semantic/cognitive (mean matching score / MMS, persona inference), and market indices (Gini, Theil, CV, HHI, CR4, Active Ratio).
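The rank correlations quoted above (Spearman ρ between profit and BidEfficiency, FillRate, or StockoutRate) can be reproduced from per-agent logs along the following lines. This is a minimal sketch, not the authors' evaluation code: the function names are illustrative, and the input lists are assumed to hold one value per agent.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman(fill_rates, profits)` with two equal-length per-agent lists (hypothetical names) returns a value in [−1, 1]; any strictly increasing relationship gives 1 regardless of its shape, which is why a rank correlation suits heavy-tailed profit distributions.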

Data & Methods

Environment:
- Retailers m = 20; items X = 8 with tiered base prices (example tiers: 50 / 150 / 800 / 2000) and specified quantities summing to S.
- Horizon: 6 steps; 2 bidding rounds per step.
- Supply‑demand ratio r = 0.95 (aggregate buyer demand D = r · S).
- Initial funds per agent: K_init = α · (Σ_x Q_x · P_base(x)) / m with α = 1.5; for example, a total catalog value of 300,000 gives K_init = 1.5 · 300,000 / 20 = 22,500.
- Buyers: k = β · m with β = 10 (k = 200 buyers); buyer personas are latent and sampled from tribes (Thrifty, Ethical, Hype, Quality); each buyer j has a sensitivity λ_j and an attention coefficient ρ_j controlling consideration set size (K_max = 20, τ = 1.0).
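The environment sizing above is simple arithmetic, and a short check makes the quoted numbers easy to verify. The sketch below uses only the values stated in the list; the function names are illustrative and not taken from the benchmark code.

```python
def initial_funds(catalog_value: float, n_retailers: int, alpha: float) -> float:
    """Initial funds per agent: alpha * (sum_x Q_x * P_base(x)) / m."""
    return alpha * catalog_value / n_retailers

def buyer_count(n_retailers: int, beta: int) -> int:
    """Number of buyers: k = beta * m."""
    return beta * n_retailers

# Values stated in the summary: m = 20, alpha = 1.5, beta = 10,
# example total catalog value = 300,000.
k_init = initial_funds(catalog_value=300_000, n_retailers=20, alpha=1.5)
n_buyers = buyer_count(n_retailers=20, beta=10)
print(k_init, n_buyers)
```

With α = 1.5, the agents' combined starting funds are 1.5 times the catalog value at base prices, so scarcity comes from limited supply (r = 0.95) rather than from a shortage of aggregate capital.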
Procurement: agents submit bids b_{i,x} = (q_{i,x}, p^bid_{i,x}) subject to the budget constraint Σ_x q_{i,x} · p^bid_{i,x} ≤ Funds_i. The supplier allocates each item to the highest bids above the reserve price; winners pay their own bids.
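The procurement rule can be sketched as a per-item, pay-as-bid allocation. This is an illustrative reading of the description, not the benchmark's implementation: the `Bid` class and function names are hypothetical, bids exactly at the reserve are assumed to be eligible, and ties are assumed to be broken by submission order.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    agent: str
    quantity: int   # units requested, q_{i,x}
    price: float    # per-unit bid, p^bid_{i,x}

def within_budget(bids: list[Bid], funds: float) -> bool:
    """Budget constraint for one agent: sum of q * p over its bids <= funds."""
    return sum(b.quantity * b.price for b in bids) <= funds

def allocate_item(bids: list[Bid], supply: int, reserve: float) -> dict[str, tuple[int, float]]:
    """Allocate one item's supply to the highest bids; winners pay their own bid.

    Returns {agent: (units_won, total_payment)}.
    """
    allocation: dict[str, tuple[int, float]] = {}
    remaining = supply
    # Assumption: a bid equal to the reserve is accepted.
    eligible = [b for b in bids if b.price >= reserve and b.quantity > 0]
    # sorted() is stable, so equal prices keep submission order (assumed tie-break).
    for bid in sorted(eligible, key=lambda b: b.price, reverse=True):
        if remaining == 0:
            break
        units = min(bid.quantity, remaining)
        allocation[bid.agent] = (units, units * bid.price)
        remaining -= units
    return allocation

example = allocate_item(
    [Bid("A", 5, 60.0), Bid("B", 4, 55.0), Bid("C", 3, 45.0)],
    supply=7,
    reserve=50.0,
)
```

In this toy case A wins all 5 requested units, B is rationed to the remaining 2, and C is excluded by the reserve. An agent that underbids can therefore end a step with no inventory at all.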
Retail: each agent posts a price P_{i,x}(t) and a free‑form slogan Slogan_i(t). For each buyer j, compute Sim(i,j) = cos(E(Slogan_i), E(Persona_j)). Attention weights are w_{i,j} = exp(λ_j · Sim(i,j) / τ); the buyer samples a consideration set V_j(t) with probability proportional to w_{i,j} (size controlled by ρ_j and K_max). Purchases: the buyer buys the desired item from the lowest‑price seller in the consideration set that has inventory available.
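The retail step can be sketched in three small functions: attention weights, consideration-set sampling, and the lowest-price purchase rule. This is a hedged illustration only. The embedding model E is not specified here, so toy vectors stand in for embeddings; the summary does not say how ρ_j and K_max determine the set size, so `size` is passed in directly; and sampling without replacement is an assumption.

```python
import math
import random

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def attention_weights(slogan_embs, persona_emb, lam, tau=1.0):
    """w_{i,j} = exp(lam_j * Sim(i, j) / tau) for every seller i."""
    return {s: math.exp(lam * cosine(e, persona_emb) / tau)
            for s, e in slogan_embs.items()}

def sample_consideration_set(weights, size, rng):
    """Draw up to `size` distinct sellers with probability proportional to weight."""
    pool = dict(weights)
    chosen = []
    for _ in range(min(size, len(pool))):
        sellers = list(pool)
        pick = rng.choices(sellers, weights=[pool[s] for s in sellers], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # without replacement (assumption)
    return chosen

def choose_seller(consideration, prices, inventory):
    """Lowest-price seller in the consideration set that still has stock, else None."""
    in_stock = [s for s in consideration if inventory.get(s, 0) > 0]
    return min(in_stock, key=lambda s: prices[s]) if in_stock else None

# Toy 2-d "embeddings": seller a is aligned with the persona, seller b is orthogonal.
w = attention_weights({"a": [1.0, 0.0], "b": [0.0, 1.0]}, [1.0, 0.0], lam=2.0)
visible = sample_consideration_set(w, size=2, rng=random.Random(0))
winner = choose_seller(visible, {"a": 10.0, "b": 8.0}, {"a": 1, "b": 0})
```

Here seller b posts the lower price but is out of stock, so the sale goes to a. The same structure shows why a well-matched slogan does not help a seller with no inventory.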
Logged outputs: per‑step and per‑agent bids, allocations, prices, slogans, sales events, funds, inventory. Multiple runs: 10 independent runs in the reported experiments.
Models evaluated: 20 LLM backends (closed and open source). Top performers reported include Gemini 2.5 Pro/Flash, O3, Sonnet 4.5, GPT‑4o; Phi‑4 (14B) notably competitive among open models.
Example quantitative results (from Table 2 & figures):
- Gemini 2.5 Pro: profit Π ≈ 36,589; NPM ≈ 0.167.
- Gemini 2.5 Flash: Π ≈ 26,104; NPM ≈ 0.190.
- Phi‑4 (open): Π ≈ 7,565; comparable to GPT‑4o (Π ≈ 7,619).
- Market inequality: Gini ≈ 0.07 → 0.21 over the horizon; CR4 ≈ 0.23 → 0.33; HHI ≈ 0.08–0.10.
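The market indices quoted above follow standard definitions, sketched below for reference. The summary does not state which per-agent quantity they are computed on, so the input is assumed to be a list of non-negative values such as funds or sales, one per agent; the function names are illustrative.

```python
import math

def shares(values):
    """Convert per-agent values to market shares that sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def gini(values):
    """Gini coefficient from sorted values (0 = perfect equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

def hhi(values):
    """Herfindahl-Hirschman index on a 0..1 scale: sum of squared shares."""
    return sum(s * s for s in shares(values))

def concentration_ratio(values, k=4):
    """CR_k: combined share of the k largest agents."""
    return sum(sorted(shares(values), reverse=True)[:k])

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n) / mean

def theil(values):
    """Theil T index; zero-valued agents contribute 0 by convention."""
    n = len(values)
    mean = sum(values) / n
    return sum((v / mean) * math.log(v / mean) for v in values if v > 0) / n

equal_market = [1.0] * 20  # 20 agents with identical values
```

With 20 identical agents these functions give Gini = 0, HHI = 1/20 = 0.05, and CR4 = 4/20 = 0.20, which is the equal-share baseline against which the reported HHI (≈ 0.08–0.10) and CR4 (≈ 0.23 → 0.33) can be read.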

Implications for AI Economics

For LLM deployment in retail and supply‑chain settings:
- Procurement and upstream resource allocation are critical bottlenecks: models must manage scarce capital and bid effectively under uncertainty. Language/marketing cannot substitute for lack of inventory.
- Early performance matters: small early advantages compound; systems should prioritize robust early bidding and inventory policies to avoid lock‑out.
- Integrating semantic strategy with numeric optimization is necessary: models must optimize slogans for persona reach while aligning price/inventory strategy—language is a strategic economic resource when visibility is endogenous.
For model design and training:
- Reasoning‑focused architectures and chain‑of‑thought capabilities transfer to economic decision tasks; scale alone is insufficient.
- Training objectives that jointly reward numeric planning (budget constraints, auction reasoning) and persona‑aware language generation may improve market performance. Reinforcement learning with environment traces from Market‑Bench could be fruitful.
For evaluation and research:
- Market‑Bench fills an important gap: combined numeric + free‑form language evaluation under hard scarcity in a closed‑loop multi‑agent market. It provides rich interaction logs for studying strategy evolution, language adaptation, and market structure emergence.
- The benchmark enables study of systemic effects of automated agents (inequality, participation decline, multi‑winner oligopoly) and can support research on competition policy, market fairness, and stability under AI agents.
For policy and market outcomes:
- Winner‑take‑most dynamics and reduced active participation suggest potential risks of competitive exclusion if automated LLM agents are widely deployed without safeguards.
- Regulators and platform designers should consider monitoring procurement dynamics and visibility mechanisms (how language affects platform exposure) to prevent unfair lock‑ins or amplified inequality.

Overall, Market‑Bench demonstrates that economically consequential language plus hard resource constraints produce rich, realistic tests for LLMs; success requires combined numeric reasoning and persona‑sensitive language strategies, with procurement under scarcity the primary determinant of profitability.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper presents systematic, reproducible simulation evidence comparing 20 LLM agents on economically framed tasks with detailed operational and economic metrics; however, results come from a synthetic multi-agent environment rather than observational or experimental data from real markets, so they provide moderate internal evidence about model capabilities but limited causal external validity for real-world economic outcomes.
Methods Rigor: medium. Market-Bench appears carefully constructed (configurable supply-chain simulator, explicit procurement and retail stages, logged trajectories and multiple metric families) and benchmarks many models, but it relies on simulated buyer behavior, simplified market primitives, potential sensitivity to prompt engineering and agent design choices, and does not validate against real transaction data or field experiments.
Sample: A configurable multi-agent supply-chain simulation in which LLMs act as retailer agents that bid in budget-constrained procurement auctions, set retail prices, produce marketing slogans, and interact with simulated buyers via a role-based attention mechanism; benchmarked across 20 open- and closed-source LLM agents with logged trajectories of bids, prices, slogans, sales, and balance-sheet states used to compute economic, operational, and semantic metrics.
Themes: innovation adoption
Generalizability:
- Synthetic simulation rather than field data: buyer behavior and market dynamics are modeled, not observed.
- Simplified market primitives (single retailer role, limited product types, simplified auctions and demand) reduce realism of complex real-world supply chains.
- Results sensitive to prompt design, agent architecture, and hyperparameters; closed-source model opacity limits reproducibility and interpretation.
- Short-run, controlled interactions that may not capture long-run strategic adaptation, learning, or regulatory effects.
- No integration of real-world frictions like taxes, credit constraints, inventory perishability, supply-side strategic behavior, or heterogeneous firms/consumers across geographies.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
We introduce Market-Bench, a comprehensive benchmark that evaluates the capabilities of LLMs in economically-relevant tasks through economic and trade competition.	Other	positive	high	None	0.3
We construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise.	Other	positive	high	None	0.3
In the procurement stage, LLMs bid for limited inventory in budget-constrained auctions.	Other	positive	high	None	0.3
In the retail stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role-based attention mechanism for purchase.	Other	positive	high	None	0.3
Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics.	Other	positive	high	None	0.3
Benchmarking on 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon.	Firm Revenue	mixed	high	performance (financial/competitive outcomes of retailer agents)	n=20 0.18
Only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around the break-even point.	Firm Revenue	mixed	high	capital appreciation / agent profitability	n=20 0.18
Many agents hover around the break-even point despite similar semantic matching scores.	Firm Revenue	negative	high	profitability relative to semantic matching score	n=20 0.18
Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.	Other	positive	high	None	0.18