Benchmarking AI in isolation is misleading: marketplace dynamics like user switching, routing, and operational constraints determine post-deployment success. The authors propose a simulation-driven 'Marketplace Evaluation' framework to measure retention, market share and other longitudinal outcomes that accuracy metrics miss.

Evaluation of Agents under Simulated AI Marketplace Dynamics

To Eun Kim, Alireza Salemi, Hamed Zamani, Fernando Diaz · April 15, 2026

arXiv · theoretical · n/a evidence · 7/10 relevance

The paper proposes 'Marketplace Evaluation', a simulation-based paradigm for assessing information access systems as competing marketplace participants, producing longitudinal marketplace metrics (e.g., retention, market share) that complement traditional accuracy benchmarks.

Modern information access ecosystems consist of mixtures of systems, such as retrieval systems and large language models, and increasingly rely on marketplaces to mediate access to models, tools, and data, making competition between systems inherent to deployment. In such settings, outcomes are shaped not only by benchmark quality but also by competitive pressure, including user switching, routing decisions, and operational constraints. Yet evaluation is still largely conducted on static benchmarks with accuracy-focused measures that assume systems operate in isolation. This mismatch makes it difficult to predict post-deployment success and obscures competitive effects such as early-adoption advantages and market dominance. We introduce Marketplace Evaluation, a simulation-based paradigm that evaluates information access systems as participants in a competitive marketplace. By simulating repeated interactions and evolving user and agent preferences, the framework enables longitudinal evaluation and marketplace-level metrics, such as retention and market share, that complement and can extend beyond traditional accuracy-based metrics. We formalize the framework and outline a research agenda, motivated by business and economics, around marketplace simulation, metrics, optimization, and adoption in evaluation campaigns like TREC.

Summary

Main Finding

Evaluating information-access agents within a simulated marketplace—where users choose providers, preferences evolve, and providers (generators, retrievers, routers) make strategic choices—reveals qualitatively different outcomes than static, standalone benchmarks. Small capability differences and market-structural factors (timing of entry, concentration, routing, capacity, pricing) can produce large differences in adoption, retention, and market share (including winner-take-all outcomes) that static accuracy metrics alone do not predict.

Key Points

Motivation
- Traditional information access (IA) evaluation (Cranfield, pairwise A/B) treats systems in isolation or as anonymous variants and therefore misses the competitive, temporal, and behavioral dynamics that determine real deployment success.
- Real-world ecosystems (LLMs, RAG systems, marketplaces) create interdependencies: generator value depends on retriever adoption, routing decisions, cost/latency, and users form persistent preferences from experience.
Marketplace Evaluation (contribution)
- Proposes an agent-based simulation framework that models users, generators, retrievers, routers and their interactions over time.
- Emphasizes longitudinal, ecosystem-level metrics (market share, retention, Herfindahl-Hirschman Index (HHI), deviation from “fair share” (ΔFS)) that complement accuracy-based metrics.
- Formalizes single interaction dynamics: user selects generator (or is routed), generator may select retriever, response is produced, utility evaluated, preferences updated.
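
The single-interaction loop described above can be sketched as follows. This is a minimal illustration, not the authors' formalization: the `User` class, the proportional sampling rule in `choose`, the moving-average update, and the learning rate `lr` are all assumptions introduced here, and router agents are omitted for brevity.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class User:
    # Hypothetical preference weights over generators (higher = chosen more often).
    prefs: Dict[str, float]
    lr: float = 0.2  # assumed learning rate for the preference update


def choose(weights: Dict[str, float], rng: random.Random) -> str:
    """Sample a key with probability proportional to its (floored) weight."""
    names = list(weights)
    floored = [max(weights[n], 1e-9) for n in names]  # avoid an all-zero weight vector
    return rng.choices(names, weights=floored, k=1)[0]


def interact(
    user: User,
    retriever_prefs: Dict[str, Dict[str, float]],
    respond: Callable[[str, str], str],
    utility: Callable[[str], float],
    rng: random.Random,
) -> Tuple[str, str, float]:
    """One interaction: select generator, select retriever, respond, score, update."""
    generator = choose(user.prefs, rng)                   # user selects a generator
    retriever = choose(retriever_prefs[generator], rng)   # generator selects a retriever
    response = respond(generator, retriever)              # response is produced
    u = utility(response)                                 # utility is evaluated
    # Preference update (assumed form): exponential moving average of observed utility.
    user.prefs[generator] = (1 - user.lr) * user.prefs[generator] + user.lr * u
    return generator, retriever, u
```

Repeating `interact` over many steps is what makes the dynamics path dependent: each observed utility changes the probability of the next choice.
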
Empirical illustration (motivating experiment)
- Static benchmark (500 fact-seeking questions from SimpleQA) produces a ranking by F1.
- Minimal marketplace simulation: 10 users, sampling 5 per step, 200 steps; users update preference based on per-question correctness; seventh model introduced at step 100 to study entry effects.
- Findings: rankings and realized market shares under interaction differ from static ranking; late entry of a high-performing model into a lightly concentrated market can compress shares and reorder standing, whereas entry into highly concentrated markets may fail to capture expected share despite good standalone performance.
- Fluctuating windowed market shares show path dependence and transient dominance not visible from aggregate static scores.
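
The marketplace metrics mentioned above (windowed market share, HHI, deviation from fair share) can be computed along these lines. This is a sketch under stated assumptions: the function names are hypothetical, HHI is taken as the sum of squared fractional shares, and ΔFS is taken as the signed difference between realized share and an F1-proportional fair share, which may differ from the exact definition used in the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence


def market_share(choices: Sequence[str], models: Sequence[str]) -> Dict[str, float]:
    """Fraction of queries each model received among the given choices."""
    counts = Counter(choices)
    total = len(choices)
    return {m: (counts[m] / total if total else 0.0) for m in models}


def windowed_shares(
    choices_per_step: List[List[str]], models: Sequence[str], window: int = 10
) -> List[Dict[str, float]]:
    """Market share over a sliding window of the most recent `window` steps."""
    out = []
    for t in range(len(choices_per_step)):
        start = max(0, t - window + 1)
        recent = [c for step in choices_per_step[start : t + 1] for c in step]
        out.append(market_share(recent, models))
    return out


def hhi(shares: Dict[str, float]) -> float:
    """Herfindahl-Hirschman Index on fractional shares (1/N = even split, 1 = monopoly)."""
    return sum(s * s for s in shares.values())


def fair_share(f1: Dict[str, float]) -> Dict[str, float]:
    """Share each model would hold if adoption were proportional to static F1."""
    total = sum(f1.values())
    return {m: v / total for m, v in f1.items()}


def delta_fair_share(shares: Dict[str, float], f1: Dict[str, float]) -> Dict[str, float]:
    """Signed deviation of realized share from fair share (assumed definition)."""
    fs = fair_share(f1)
    return {m: shares.get(m, 0.0) - fs[m] for m in f1}
```

A positive ΔFS under this definition means a model holds more of the market than its static score alone would suggest; a negative value means it holds less.
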
Research agenda (high-level)
- RQ1: How to simulate IA marketplaces realistically (agent behaviors, routing, capacity, pricing, multi-agent strategies).
- RQ2: Which metrics best capture market and agent performance (market share, retention, concentration, welfare measures) and how to relate them to standard accuracy metrics.
- RQ3: How to integrate marketplace evaluation into benchmarking campaigns and infrastructures (reproducibility, standard scenarios, calibration against field data).
Practical considerations
- Marketplace dynamics matter for product strategy (timing of entry, differentiation, pricing), platform design (routing policies, capacity guarantees), and policy (concentration, innovation incentives).
- Simulation enables counterfactual and longitudinal studies without live-user experiments when calibrated carefully.

Data & Methods

Framework
- Agent-based simulation of IA marketplaces, instantiated in the paper for a retrieval-augmented generation (RAG) ecosystem with users, generators (LLMs), retrievers, and routers.
- Agents are heterogeneous, adaptive, and make sequential decisions; feedback signals (utility from responses) update preferences and routing behavior, producing path-dependent dynamics.
Motivating experiment (concrete setup)
- Static baseline: seven models evaluated on 500 fact-seeking questions (SimpleQA), scored by question-level correctness and aggregated F1 to produce a static ranking and a “fair share” proportional to F1.
- Simulation parameters:
  - User population: 10 users with individual preference distributions.
  - Per-step sampling: 5 users sampled per time step; each is assigned a question without replacement.
  - Interaction window: 200 steps; first 100 use six models, the 7th introduced at step 100 to model market entry.
  - Preference updating: users update preferences based solely on per-question correctness of the response; small exploration probability allows non-greedy choices.
- Metrics tracked:
  - Market share: fraction of queries received in sliding windows (window size 10).
  - HHI: concentration index across market shares.
  - ΔFS: deviation of realized market share from “fair share” implied by static F1.
- Experimental variations:
  - Two entry scenarios: (a) strong model (Qwen3) enters a lightly concentrated market; (b) average model (DeepSeek V3.2) enters a highly concentrated market.
- Observed dynamics:
  - Local windowed rankings fluctuate; a model's market share sometimes drops to near zero and then recovers.
  - Entry shocks yield different effects depending on pre-entry concentration—discrepancy between static “fair” expectations and realized shares increases after entry.
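
The setup above can be approximated with a short simulation. This is only a sketch, not the authors' code: per-question correctness is replaced by a Bernoulli draw from a made-up accuracy per model (so no SimpleQA data or question sampling is involved), the epsilon-greedy choice rule, the neutral 0.5 prior, the learning rate, and the model names are all assumptions, and only the population size, per-step sampling, horizon, and entry step follow the description.

```python
import random
from collections import Counter
from typing import Dict, List


def simulate(
    accuracy: Dict[str, float],  # placeholder accuracies, NOT the paper's scores
    entrant: str,
    entry_step: int = 100,
    n_users: int = 10,
    per_step: int = 5,
    n_steps: int = 200,
    epsilon: float = 0.05,  # assumed small exploration probability
    lr: float = 0.1,        # assumed preference learning rate
    seed: int = 0,
) -> List[List[str]]:
    """Return, for each step, the list of models chosen by the sampled users."""
    rng = random.Random(seed)
    incumbents = [m for m in accuracy if m != entrant]
    # Each user keeps an estimated success rate per available model.
    prefs = [{m: 0.5 for m in incumbents} for _ in range(n_users)]
    history: List[List[str]] = []
    for t in range(n_steps):
        if t == entry_step:
            for p in prefs:
                p[entrant] = 0.5  # entrant starts from an assumed neutral prior
        chosen = []
        for u in rng.sample(range(n_users), per_step):
            p = prefs[u]
            if rng.random() < epsilon:
                model = rng.choice(list(p))  # explore
            else:
                best = max(p.values())       # exploit, breaking ties at random
                model = rng.choice([m for m, v in p.items() if v == best])
            correct = 1.0 if rng.random() < accuracy[model] else 0.0
            p[model] += lr * (correct - p[model])  # update from correctness only
            chosen.append(model)
        history.append(chosen)
    return history


if __name__ == "__main__":
    acc = {"m1": 0.45, "m2": 0.40, "m3": 0.35, "m4": 0.30, "m5": 0.25, "m6": 0.20, "m7": 0.50}
    hist = simulate(acc, entrant="m7")
    after = Counter(c for step in hist[100:] for c in step)
    total = sum(after.values())
    for m in acc:
        print(m, round(after[m] / total, 3))  # realized post-entry share per model
```

Because users exploit their current best estimate, an entrant is sampled mostly through exploration, which is one simple mechanism by which realized share can lag behind standalone quality.
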
Methodological claims & assumptions
- Agent-based simulation is necessary to capture emergent aggregate behavior from heterogeneous, adaptive agents.
- The paper’s illustrative simulation is deliberately minimal, intended to demonstrate the effects in principle; realistic calibration (larger user bases, more complex utility models, cost/latency, strategic provider behavior) is left for future work.

Implications for AI Economics

Market structure and entry timing
- Early-adoption advantages, path dependence, and incumbent concentration can block meritocratic outcomes: higher intrinsic quality does not guarantee proportional market share if the market is already concentrated.
- Late entry effects: introducing a strong model into a lightly concentrated market can reshuffle shares; entering a highly concentrated market is costly and may yield much less adoption than standalone performance predicts.
Winner-take-all and concentration risks
- Small differences in observed utility (accuracy, latency, cost) can be amplified by user learning and routing policies, producing high HHI and market dominance—relevant to antitrust considerations and innovation policy.
Platform design and routing economics
- Routing policies, intermediaries, and API/marketplace constraints (capacity limits, pricing, access to proprietary collections) strongly affect which technologies are adopted. Platform designers influence competition by controlling exposure and routing incentives.
Provider strategy and product design
- Providers should consider not just raw model capability but strategies for differentiation, pricing, capacity guarantees, partnerships (e.g., retrieval access), and timing of release.
- Portfolio effects: generators may hedge by using multiple retrievers—this affects demand allocation and the value of complementary goods.
Evaluation policy and benchmarking
- Benchmarks should be extended to include marketplace-aware metrics (market share under dynamics, retention, concentration) and simulation-based scenarios to better forecast deployment success.
- Benchmark campaigns (e.g., TREC-like) could supply standardized market scenarios and calibrated simulators for more deployment-relevant evaluation.
Welfare and regulatory considerations
- Simulation-based marketplace evaluation enables analysis of consumer surplus, provider profits, and social welfare across market structures; useful for assessing regulation, platform interventions, and incentives for diversity/innovation.
Research & empirical needs
- To be decision-useful for policymakers and firms, simulations must be calibrated to observed user behavior, routing algorithms, pricing contracts, and empirical traffic patterns; otherwise, results remain qualitative.
- Need to model richer utility signals (beyond correctness) including relevance, hallucination risk, latency, cost, privacy, and trust, as well as strategic responses by providers (price competition, feature gating).

Limitations noted by the authors
- The illustrative simulation is stylized (small user pool, simple correctness-based preference updating, no explicit strategic provider optimization).
- Real-world validation and calibration to live marketplace data are necessary next steps.
- Extending metrics to welfare, pricing, and multi-dimensional quality signals is part of the proposed agenda.

Overall, the paper argues that marketplace-aware, simulation-based evaluation is essential to understand economic dynamics of AI information-access systems, and it lays out a program for developing simulations, metrics, and integration with existing benchmarking infrastructures to better predict post-deployment outcomes.

Assessment

Paper Type: theoretical

Evidence Strength: n/a. The paper is conceptual and proposes a simulation-based evaluation paradigm rather than presenting empirical estimates or causal identification; it does not provide data-driven results or causal inference to support specific substantive claims.

Methods Rigor: medium. The work offers a formalized framework and sensible simulation ingredients (repeated interactions, evolving preferences, marketplace metrics) showing methodological awareness, but it lacks implemented simulations, calibration/validation with real-world data, sensitivity analyses, or demonstrations of robustness to modelling choices.

Sample: No empirical sample or dataset is used; the paper presents a conceptual framework and outlines simulation components and research agenda rather than reporting analyses on real-world or synthetic datasets.

Themes: adoption, governance

Generalizability
- Conceptual proposals may not generalize without empirical calibration or validation against real marketplaces.
- Simulation outcomes will depend strongly on model assumptions and parameter choices (user heterogeneity, switching costs, routing rules).
- May not capture strategic firm/platform incentives, regulatory constraints, or operational realities of large providers.
- Applicability varies across domains (open web search vs. vertical model marketplaces) and may require domain-specific extensions.
- Does not address measurement limitations from noisy user-feedback signals or observability constraints in deployed systems.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Modern information access ecosystems consist of mixtures of systems, such as retrieval systems and large language models, and increasingly rely on marketplaces to mediate access to models, tools, and data, making competition between systems inherent to deployment.	Market Structure	positive	high	marketplace composition and competition	0.06
Outcomes are shaped not only by benchmark quality but also by competitive pressure, including user switching, routing decisions, and operational constraints.	Market Structure	mixed	high	post-deployment system outcomes (e.g., success influenced by competition factors)	0.06
Evaluation is still largely conducted on static benchmarks with accuracy-focused measures that assume systems operate in isolation.	Adoption Rate	negative	high	evaluation practice (use of static accuracy-focused benchmarks)	0.12
This mismatch makes it difficult to predict post-deployment success and obscures competitive effects such as early-adoption advantages and market dominance.	Market Structure	negative	high	predictability of post-deployment success and visibility of competitive effects (early-adoption advantages, market dominance)	0.06
We introduce Marketplace Evaluation, a simulation-based paradigm that evaluates information access systems as participants in a competitive marketplace.	Adoption Rate	positive	high	evaluation paradigm (Marketplace Evaluation)	0.2
By simulating repeated interactions and evolving user and agent preferences, the framework enables longitudinal evaluation and marketplace-level metrics, such as retention and market share, that complement and can extend beyond traditional accuracy-based metrics.	Market Structure	positive	high	marketplace-level metrics (retention, market share) and longitudinal evaluation capability	0.12
We formalize the framework and outline a research agenda, motivated by business and economics, around marketplace simulation, metrics, optimization, and adoption in evaluation campaigns like TREC.	Adoption Rate	positive	high	research agenda and proposed adoption of Marketplace Evaluation in evaluation campaigns	0.2