A simulated multi-agent marketplace shows that only a few LLM retailers reliably grow capital while most break even, revealing a winner-take-most pattern in economic task performance; semantic matching alone does not predict economic success.
The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce Market-Bench, a comprehensive benchmark that evaluates the capabilities of LLMs in economically relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise. In the procurement stage, LLMs bid for limited inventory in budget-constrained auctions. In the retail stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role-based attention mechanism for purchase. Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking on 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon, i.e., only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around the break-even point despite similar semantic matching scores. Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.
Summary
Main Finding
Market-Bench is a reproducible, closed‑loop multi‑agent supply‑chain benchmark that jointly tests LLMs’ quantitative optimization (procurement, pricing) and generative language (marketing) under hard scarcity. Across 20 LLM-driven retailer agents, the benchmark reveals large performance dispersion and a winner‑take‑most dynamic: a small subset of models consistently compound capital while many remain near break‑even. Successful performance is driven primarily by procurement (auction) success; language matters because of the Persona‑Gated Attention mechanism, but securing inventory is the threshold skill.
Key Points
- Purpose: Evaluate LLMs on combined numeric and semantic economic decision‑making in a competitive market (auctions, pricing, marketing).
- New mechanism: Persona‑Gated Attention — buyers’ visibility to sellers depends on cosine similarity between seller slogans and hidden buyer persona embeddings; attention weight ∝ exp(λ · similarity / τ). Language thus directly affects which sellers enter each buyer’s consideration set.
- Two-stage environment:
  - Stage A: Per-item multi-unit first-price procurement auction with reserve prices and hard budget constraints.
  - Stage B: Retail: agents set per-item retail prices and a free-form slogan; buyers sample consideration sets by persona alignment and then buy from the lowest-priced seller in their set.
- Logged outputs: full trajectories of bids, allocations, prices, slogans, sales, funds and inventories — enabling economic, operational and semantic metrics and process‑level analysis.
- Empirical findings:
  - Large dispersion among 20 LLM agents; top performers (Gemini 2.5 Pro / Flash) achieve much higher profits and margins. Some smaller reasoning-focused models (e.g., Phi-4) compete well versus larger models.
  - Procurement success (BidEfficiency) and downstream FillRate strongly correlate with profit (Spearman ρ ≈ 0.68 and 0.88 respectively); StockoutRate is strongly negatively correlated with profit (ρ ≈ −0.88). Procurement under scarcity is a threshold skill producing bimodality (winners vs effectively inactive agents).
  - Market-level inequality increases rapidly (example: Gini ≈ 0.07 → 0.21; Theil ≈ 0.02 → 0.10; CV ≈ 0.13 → 0.45) but concentration remains competitive (HHI ≈ 0.08–0.10); CR4 rises (≈ 0.23 → 0.33) and Active Ratio falls toward ≈ 0.4.
  - Dynamics: a strong entry shock between steps 0→1 after which strategies and slogan-persona similarities largely stabilize; early success compounds.
  - Architecture matters more than raw scale: reasoning-enabled models tend to outperform purely larger models in these tasks.
- Metrics reported: economic (profit, net profit margin), operational (FillRate, StockoutRate, BidEfficiency, OSI, IEI), semantic/cognitive (mean matching score / MMS, persona inference), and market indices (Gini, Theil, CV, HHI, CR4, Active Ratio).
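The market indices named above have standard textbook definitions, and a minimal sketch of them may help when reading the reported values. The summary does not say which per-agent quantity the indices are computed over, so the code below assumes a list of non-negative per-agent values (for example funds or revenue); the function names are illustrative and not taken from the benchmark. Active Ratio is omitted because its definition is not given here.

```python
from math import log, sqrt


def gini(values):
    """Gini coefficient via the sorted-rank formula (0 = perfect equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def theil(values):
    """Theil T index; zero entries contribute 0 by the usual convention."""
    n = len(values)
    mean = sum(values) / n if n else 0.0
    if mean == 0:
        return 0.0
    return sum((x / mean) * log(x / mean) for x in values if x > 0) / n


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n if n else 0.0
    if mean == 0:
        return 0.0
    variance = sum((x - mean) ** 2 for x in values) / n
    return sqrt(variance) / mean


def shares(values):
    """Convert raw per-agent values into market shares that sum to 1."""
    total = sum(values)
    return [v / total for v in values] if total else [0.0 for _ in values]


def hhi(values):
    """Herfindahl-Hirschman Index: sum of squared market shares."""
    return sum(s * s for s in shares(values))


def cr4(values):
    """Four-firm concentration ratio: combined share of the top 4 agents."""
    return sum(sorted(shares(values), reverse=True)[:4])
```

As a sanity check on the reported ranges: with 20 identical agents, HHI is 1/20 = 0.05 and CR4 is 4/20 = 0.20, so the reported HHI ≈ 0.08–0.10 and CR4 ≈ 0.23 → 0.33 sit modestly above the floor for a 20-agent market.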
Data & Methods
- Environment:
  - Retailers m = 20; items X = 8 with tiered base prices (example tiers: 50 / 150 / 800 / 2000) and specified quantities summing to S.
  - Horizon: 6 steps; 2 bidding rounds per step.
  - Supply-demand ratio r = 0.95 (aggregate buyer demand D = r · S).
  - Initial funds per agent K_init = α · (Σ_x Q_x · P_base(x)) / m with α = 1.5; example total catalog value = 300,000 → K_init = 22,500.
  - Buyers: k = β · m with β = 10 (k = 200 buyers); buyer personas are latent and sampled from tribes (Thrifty, Ethical, Hype, Quality); each buyer has a sensitivity λ_j and an attention coefficient ρ_j controlling consideration set size (K_max = 20, τ = 1.0).
- Procurement: agents submit bids b_{i,x} = (q_{i,x}, p^bid_{i,x}) subject to the budget constraint Σ_x q_{i,x} · p^bid_{i,x} ≤ Funds_i. The supplier allocates each item to the highest bids above the reserve price, and winners pay their own bids.
- Retail: each agent posts a price P_{i,x}(t) and a free-form slogan Slogan_i(t). For each buyer j, compute Sim(i,j) = cos(E(Slogan_i), E(Persona_j)). Attention weights are w_{i,j} = exp(λ_j · Sim(i,j) / τ); the buyer samples a consideration set V_j(t) with probability proportional to w_{i,j} (size controlled by ρ_j and K_max). Purchases: the buyer buys the desired item from the lowest-price seller in the consideration set that has available inventory.
- Logged outputs: per‑step and per‑agent bids, allocations, prices, slogans, sales events, funds, inventory. Multiple runs: 10 independent runs in the reported experiments.
- Models evaluated: 20 LLM backends (closed and open source). Top performers reported include Gemini 2.5 Pro/Flash, O3, Sonnet 4.5, GPT‑4o; Phi‑4 (14B) notably competitive among open models.
- Example quantitative results (from Table 2 & figures):
  - Gemini 2.5 Pro: profit Π ≈ 36,589; NPM ≈ 0.167.
  - Gemini 2.5 Flash: Π ≈ 26,104; NPM ≈ 0.190.
  - Phi-4 (open): Π ≈ 7,565; comparable to GPT-4o (Π ≈ 7,619).
  - Market inequality: Gini ≈ 0.07 → 0.21 over horizon; CR4 ≈ 0.23 → 0.33; HHI ≈ 0.08–0.10.
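The retail-stage visibility rule described in this section (cosine similarity, exponential attention weights, a sampled consideration set, then lowest-price purchase) can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the embedding function E is not specified here, so slogans and personas are passed in as already-embedded vectors; the consideration set size is taken as an input because the exact mapping from ρ_j and K_max to a size is not given; and sampling is assumed to be without replacement. All function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attention_weights(slogan_vecs, persona_vec, lam, tau=1.0):
    """Unnormalized weights exp(lam * Sim / tau), one per seller."""
    return [math.exp(lam * cosine(s, persona_vec) / tau) for s in slogan_vecs]


def sample_consideration_set(weights, size, rng):
    """Draw seller indices without replacement, proportional to weight.

    ASSUMPTION: sampling without replacement; the source only says the
    set is sampled proportionally to the attention weights.
    """
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(min(size, len(remaining))):
        pick = rng.choices(
            remaining, weights=[weights[i] for i in remaining], k=1
        )[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def choose_seller(consideration, prices, inventory):
    """Lowest-price seller in the consideration set that still has stock."""
    in_stock = [i for i in consideration if inventory[i] > 0]
    if not in_stock:
        return None
    return min(in_stock, key=lambda i: prices[i])
```

A single buyer's step would then chain the three calls, for example `choose_seller(sample_consideration_set(attention_weights(vecs, persona, lam), size, random.Random(0)), prices, inventory)`. The sketch makes the gating effect visible: a seller with a poorly aligned slogan can post the lowest price and still make no sale if it is never sampled into the consideration set.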
Implications for AI Economics
- For LLM deployment in retail and supply‑chain settings:
  - Procurement and upstream resource allocation are critical bottlenecks: models must manage scarce capital and bid effectively under uncertainty. Language/marketing cannot substitute for lack of inventory.
  - Early performance matters: small early advantages compound; systems should prioritize robust early bidding and inventory policies to avoid lock-out.
  - Integrating semantic strategy with numeric optimization is necessary: models must optimize slogans for persona reach while aligning price/inventory strategy, because language is a strategic economic resource when visibility is endogenous.
- For model design and training:
  - Reasoning-focused architectures and chain-of-thought capabilities transfer to economic decision tasks; scale alone is insufficient.
  - Training objectives that jointly reward numeric planning (budget constraints, auction reasoning) and persona-aware language generation may improve market performance. Reinforcement learning with environment traces from Market-Bench could be fruitful.
- For evaluation and research:
  - Market-Bench fills an important gap: combined numeric + free-form language evaluation under hard scarcity in a closed-loop multi-agent market. It provides rich interaction logs for studying strategy evolution, language adaptation, and market structure emergence.
  - The benchmark enables study of systemic effects of automated agents (inequality, participation decline, multi-winner oligopoly) and can support research on competition policy, market fairness, and stability under AI agents.
- For policy and market outcomes:
  - Winner-take-most dynamics and reduced active participation suggest potential risks of competitive exclusion if automated LLM agents are widely deployed without safeguards.
  - Regulators and platform designers should consider monitoring procurement dynamics and visibility mechanisms (how language affects platform exposure) to prevent unfair lock-ins or amplified inequality.
Overall, Market‑Bench demonstrates that economically consequential language plus hard resource constraints produce rich, realistic tests for LLMs; success requires combined numeric reasoning and persona‑sensitive language strategies, with procurement under scarcity the primary determinant of profitability.
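Because procurement is singled out as the primary determinant, a minimal sketch of the per-item first-price allocation with a reserve price and a budget check may be useful. Several details are assumptions made for illustration rather than facts from the source: bids at exactly the reserve are treated as eligible, ties are broken by submission order, and a bid may be partially filled when supply runs out. The `Bid` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str    # retailer identifier
    qty: int      # units requested, q_{i,x}
    price: float  # per-unit bid, p^bid_{i,x}


def within_budget(bids_by_item, agent, funds):
    """Budget constraint: sum over items of qty * price must not exceed funds."""
    spend = sum(
        b.qty * b.price
        for bids in bids_by_item.values()
        for b in bids
        if b.agent == agent
    )
    return spend <= funds


def allocate_item(bids, supply, reserve):
    """First-price allocation for one item.

    Highest bids are served first and each winner pays its own bid.
    ASSUMPTIONS: bids equal to the reserve are eligible, ties keep
    submission order (stable sort), and partial fills are allowed.
    Returns a list of (agent, units_won, payment) tuples.
    """
    eligible = sorted(
        (b for b in bids if b.price >= reserve and b.qty > 0),
        key=lambda b: b.price,
        reverse=True,
    )
    awards = []
    for b in eligible:
        if supply <= 0:
            break
        won = min(b.qty, supply)
        awards.append((b.agent, won, won * b.price))
        supply -= won
    return awards
```

With 5 units on offer, a reserve of 50, and bids of (3 units at 60), (4 units at 70), and (2 units at 40), the highest bidder is filled in full, the second receives the single remaining unit, and the sub-reserve bid wins nothing. Scarcity under this kind of rule leaves low bidders with no inventory to sell, which is consistent with the threshold effect the summary describes.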
Assessment
Claims (9)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We introduce Market-Bench, a comprehensive benchmark that evaluates the capabilities of LLMs in economically-relevant tasks through economic and trade competition. | Other | positive | high | None | 0.3 |
| We construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise. | Other | positive | high | None | 0.3 |
| In the procurement stage, LLMs bid for limited inventory in budget-constrained auctions. | Other | positive | high | None | 0.3 |
| In the retail stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role-based attention mechanism for purchase. | Other | positive | high | None | 0.3 |
| Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. | Other | positive | high | None | 0.3 |
| Benchmarking on 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon. | Firm Revenue | mixed | high | performance (financial/competitive outcomes of retailer agents) | n=20; 0.18 |
| Only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around the break-even point. | Firm Revenue | mixed | high | capital appreciation / agent profitability | n=20; 0.18 |
| Many agents hover around the break-even point despite similar semantic matching scores. | Firm Revenue | negative | high | profitability relative to semantic matching score | n=20; 0.18 |
| Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets. | Other | positive | high | None | 0.18 |