Evidence (13661 claims)

| Topic | Claims |
|---|---|
| Adoption | 8339 |
| Productivity | 7479 |
| Governance | 6715 |
| Human-AI Collaboration | 6267 |
| Org Design | 4098 |
| Innovation | 3987 |
| Labor Markets | 3488 |
| Skills & Training | 2888 |
| Inequality | 2016 |

Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 740 | 192 | 95 | 871 | 1945 |
| Governance & Regulation | 796 | 388 | 185 | 119 | 1512 |
| Organizational Efficiency | 765 | 186 | 123 | 82 | 1166 |
| Technology Adoption Rate | 610 | 227 | 121 | 95 | 1061 |
| Research Productivity | 409 | 121 | 56 | 331 | 928 |
| Output Quality | 464 | 174 | 58 | 47 | 743 |
| Decision Quality | 318 | 173 | 75 | 42 | 615 |
| Firm Productivity | 432 | 55 | 88 | 20 | 601 |
| AI Safety & Ethics | 214 | 273 | 65 | 33 | 589 |
| Market Structure | 175 | 165 | 120 | 24 | 489 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 161 | 57 | 57 | 16 | 291 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Fiscal & Macroeconomic | 130 | 69 | 43 | 26 | 275 |
| Employment Level | 104 | 50 | 105 | 13 | 274 |
| Consumer Welfare | 116 | 62 | 42 | 11 | 231 |
| Firm Revenue | 149 | 45 | 26 | 3 | 223 |
| Inequality Measures | 43 | 120 | 49 | 6 | 218 |
| Task Completion Time | 164 | 29 | 8 | 12 | 214 |
| Worker Satisfaction | 89 | 60 | 20 | 12 | 181 |
| Error Rate | 69 | 89 | 9 | 2 | 169 |
| Regulatory Compliance | 74 | 67 | 14 | 4 | 159 |
| Training Effectiveness | 91 | 19 | 13 | 19 | 144 |
| Wages & Compensation | 77 | 33 | 25 | 6 | 141 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Automation Exposure | 49 | 50 | 22 | 12 | 136 |
| Developer Productivity | 91 | 17 | 14 | 5 | 128 |
| Job Displacement | 12 | 80 | 19 | 1 | 112 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Skill Obsolescence | 5 | 43 | 6 | 1 | 55 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
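
The matrix can also be read as direction shares within each outcome. The sketch below is illustrative only: it copies three rows from the table above whose four direction counts add up to the printed total, and the helper name `direction_shares` is an assumption introduced here, not part of the source.

```python
# Rows copied from the table above: (positive, negative, mixed, null).
ROWS = {
    "Job Displacement": (12, 80, 19, 1),
    "Inequality Measures": (43, 120, 49, 6),
    "Error Rate": (69, 89, 9, 2),
}

LABELS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the row's summed counts."""
    total = sum(counts)
    return {label: n / total for label, n in zip(LABELS, counts)}


for outcome, counts in ROWS.items():
    shares = direction_shares(counts)
    print(f"{outcome}: {shares['negative']:.1%} of claims report a negative finding")
```

For several other rows the printed total is larger than the sum of the four listed directions, so shares for those rows depend on which denominator is chosen.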
This paper presents a taxonomy of seven failure modes unique to production agentic systems.
Author contribution: taxonomy presented in the paper (count = seven failure modes).
These findings provide insights for designing flexible yet reliable constraint-based workflows.
Synthesis and discussion of the study results and technical evaluation in the paper's conclusion.
User-defined constraint types improve user satisfaction.
Reported user study measures showing higher satisfaction for participants using U-Define compared to baselines (no sample size or numeric effects provided).
User-defined constraint types improve performance.
Reported results from user studies and/or technical evaluation indicating better task performance when users can set hard/soft constraint types (no numeric effect size or sample size in excerpt).
User-defined constraint types improve perceived usefulness.
Results from the reported user studies comparing U-Define (user-defined constraint types) to baselines; based on participant responses and measures of perceived usefulness (sample sizes/details not provided in excerpt).
U-Define verifies hard constraints using formal model checking and verifies soft constraints using an LLM-as-judge evaluation.
Description of the complementary verification methods employed in the U-Define system (technical design/implementation).
We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility.
System implementation and description in paper (design and implementation of U-Define).
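The split between hard rules and soft preferences can be sketched as a small dispatcher. This is a hypothetical illustration, not U-Define's implementation: the names `Kind`, `Constraint`, and `verify` are assumptions, and the two checker functions are toy stand-ins for the formal model checker and the LLM-as-judge described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Kind(Enum):
    HARD = "hard"  # rule that must not be violated
    SOFT = "soft"  # preference that allows flexibility


@dataclass(frozen=True)
class Constraint:
    text: str  # natural-language statement written by the user
    kind: Kind


Plan = List[str]


def verify(
    plan: Plan,
    constraints: List[Constraint],
    check_hard: Callable[[Plan, Constraint], bool],
    judge_soft: Callable[[Plan, Constraint], float],
) -> Dict[str, object]:
    """Route each constraint to the verifier that matches its kind."""
    hard_violations: List[str] = []
    soft_scores: Dict[str, float] = {}
    for c in constraints:
        if c.kind is Kind.HARD:
            if not check_hard(plan, c):
                hard_violations.append(c.text)
        else:
            soft_scores[c.text] = judge_soft(plan, c)
    # Any hard violation rejects the plan; soft scores are only reported.
    return {
        "accepted": not hard_violations,
        "hard_violations": hard_violations,
        "soft_scores": soft_scores,
    }


# Toy stand-ins (NOT a model checker or an LLM judge).
def toy_hard_check(plan: Plan, c: Constraint) -> bool:
    return not any("delete" in step for step in plan)


def toy_soft_judge(plan: Plan, c: Constraint) -> float:
    return 1.0 if len(plan) <= 3 else 0.4


constraints = [
    Constraint("Never delete files", Kind.HARD),
    Constraint("Prefer short plans", Kind.SOFT),
]
result = verify(["read file", "delete file"], constraints, toy_hard_check, toy_soft_judge)
print(result)
```

Keeping the two verifiers behind separate callables mirrors the idea that a hard violation is a binary rejection while a soft preference yields a graded score.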
KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.
Theoretical claim about economic and cumulative effects of adopting KOs; no cost-benefit analysis, pilot results, or quantitative evidence reported in the paper.
We propose Knowledge Objects (KOs) — structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse.
Proposed solution described in the paper; conceptual design and intended properties presented, without reported deployments, trials, or empirical evaluation.
Evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, provides added value compared to focusing on benchmark performance only.
Argument/interpretation in the paper based on the study's multi-turn human-in-the-loop evaluation showing differences between objective performance gains and participant perceptions.
Hybrid systems (human + RAG assistant) are beneficial in information-seeking scenarios.
Conclusion drawn from the experiment showing human-AI collaboration outperforms model-only baselines across model sizes in a realistic multi-turn information-seeking task with N=112 participants.
The performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size.
Reported results from the experimental comparison across conditions and three model sizes (3B, 8B, 70B) with N=112 participants; paper states the performance gain is significant across sizes (no numeric effect sizes or p-values provided in the excerpt).
The framework addresses AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.
Paper lists and provides guidance on AI-specific methodological issues (model versioning, interaction dynamics, contamination/spillover, equity). This is a descriptive claim about topics the framework covers, not an empirical evaluation of solutions.
The framework implements a graded scheme for transparency and repeatability.
Paper extends TOP-guideline-derived transparency principle into a graded scheme for transparency and repeatability; described as an operational feature of the proposed framework.
The framework integrates heterogeneity analysis and practical significance assessment.
Paper reports inclusion of guidance on analyzing heterogeneous treatment effects and assessing practical significance; presented as part of guidelines rather than tested across datasets.
The framework formalizes causal inference through RCT methodology for AI contexts.
Paper states adoption of randomized controlled trial methods and causal inference framing for AI impact evaluation; described as methodological proposition rather than validated application.
Our framework extends prior work by centering evaluation on human performance rather than model output alone.
Paper claims a conceptual shift: focus on human performance metrics; supported by argumentative rationale and literature references rather than empirical demonstration.
The principles and guidelines serve three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms.
Paper's stated intended uses/positioning of the framework; presented as roles in the discussion/positioning section rather than empirically validated roles.
We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases.
Paper reports a concrete output: 33 guidelines derived from the five principles, with each guideline presented as requirement + rationale + implementation instructions + evidence base (documented in paper content).
The paper adopts the four-validity framework of Shadish et al. (2002) and extends it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025).
Explicit methodological choice described in the paper: adoption of Shadish et al. four-validity framework and addition of a transparency/repeatability principle based on TOP Guidelines; documented in the text as design decision.
The framework draws on experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology.
Paper reports literature review and cross-disciplinary synthesis as the methodological foundation for the framework (references to those disciplines). No empirical cross-disciplinary experiment reported.
This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies).
Paper's stated contribution: development of a conceptual framework integrating RCT design principles for AI evaluation. Based on literature synthesis and methodological argumentation rather than empirical testing.
The paper introduces a Specification Governance Model (SGM), grounded in Transaction Cost Economics, and provides a practical governance decision guide.
Conceptual/modeling contribution described in the paper: SGM grounded in TCE with an applied decision guide (theoretical plus prescriptive).
The paper proposes the AI-Augmented Methodology Taxonomy (AAMT), classifying six methodologies under three AI integration tiers.
Conceptual contribution: taxonomy introduced and described in the paper (six methodologies, three tiers).
Telemetry across 10,000+ developers shows a 98% increase in pull requests.
Observational telemetry data aggregated across >10,000 developers reported in the paper; metric reported is percent increase in pull request count.
Controlled studies report 20-56% productivity gains on well-scoped tasks.
Aggregate of multiple controlled experimental studies cited in the paper (2022–2026); reported as observed productivity improvements on well-scoped tasks in those studies. Specific study-level sample sizes not reported in the claim text.
Practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration can be articulated, and calibrated beliefs plus utility-aware policies can improve agentic AI orchestration (illustrated via concrete examples and design patterns).
Paper provides articulated properties, examples, and design patterns but no empirical validation; claims of improvement are illustrated conceptually.
Coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily in the LLM agent's parameters.
Central prescriptive claim of the position paper; supported by conceptual argumentation and illustrative examples rather than empirical tests.
Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions.
Argumentative/theoretical claim in the position paper; illustrated with conceptual examples and design patterns rather than empirical evaluation.
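The maintain, update, and choose loop described in this claim can be sketched with a Beta-Bernoulli belief over each action's success rate and a simple expected-utility rule. This is an illustrative assumption, not the paper's method: the class `BetaBelief`, the function `choose_action`, the action names, and all costs and priors are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief over an action's unknown success probability."""
    alpha: float = 1.0  # prior pseudo-count of successes
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, success: bool) -> None:
        # Conjugate Bayesian update after observing one outcome.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def choose_action(beliefs: Dict[str, BetaBelief], costs: Dict[str, float], value: float) -> str:
    """Pick the action maximizing expected utility: P(success) * value - cost."""
    return max(beliefs, key=lambda a: beliefs[a].mean * value - costs[a])


# Hypothetical orchestration choice: call a cheap tool or consult a costly expert.
beliefs = {"cheap_tool": BetaBelief(), "ask_expert": BetaBelief(8.0, 2.0)}
costs = {"cheap_tool": 0.1, "ask_expert": 0.5}

first = choose_action(beliefs, costs, value=1.0)

# Three observed failures of the cheap tool shift the belief and the decision.
for _ in range(3):
    beliefs["cheap_tool"].update(success=False)
later = choose_action(beliefs, costs, value=1.0)
print(first, "->", later)
```

The belief state lives in the orchestrator, outside any model weights, which is the level at which the claim places the Bayesian machinery.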
Many high-value deployments rely on decisions under uncertainty (for example, which tool to call, which expert to consult, or how many resources to invest).
Stated as a motivating observation in the paper; no quantitative data or sample provided.
LLMs excel at predictive tasks and complex reasoning tasks.
Asserted in the paper's opening motivation; no empirical evaluation or sample reported in the paper itself.
The platform was used to support compound AI use cases at Salesforce, specifically Agentforce (autonomous AI agents) and ApexGuru (AI-powered code analysis).
Paper states the deployment was developed at Salesforce and lists Agentforce and ApexGuru as supported use cases; this is an implementation/adoption claim rather than a quantitative result.
The architecture enables compound AI systems to: (a) scale model invocations in parallel, (b) handle bursty multi-agent workloads, and (c) support rapid model iteration — capabilities essential for operationalizing agentic AI at enterprise scale.
Paper provides case studies (Agentforce, ApexGuru) and operational lessons from production deployment to support these functional claims; the provided text includes neither numerical benchmarks for each individual capability nor sample sizes.
The modular, platform-agnostic inference architecture integrates serverless execution, dynamic autoscaling, and MLOps pipelines to deliver consistent low-latency inference across multi-component agent workflows.
System design and production deployment description in the paper; claim supported by implementation details and reported production performance (qualitative and operational evidence), but no detailed experimental protocol or sample sizes are given in the provided text.
The platform delivered 30 to 40% cost savings relative to prior static deployments.
Reported production cost comparisons between the new modular inference architecture and prior static deployments (paper states "30 to 40% cost savings"); the provided text does not include details on cost components, time period, or sample size.
The deployment produced up to 3.9x throughput improvement compared to prior static deployments.
Reported production results comparing throughput of the modular inference architecture to prior static deployments (statement in the paper: "up to 3.9x throughput improvement"); no sample size or confidence intervals are given in the provided text.
The production deployment achieved over 50% reduction in tail latency (P95) compared to prior static deployments.
Reported production results comparing the modular inference architecture to prior static deployments (production measurements of P95 tail latency); paper states this was observed in production but does not report sample size or detailed statistical tests in the provided text.
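For readers unfamiliar with the metric, a P95 tail-latency comparison can be computed as below. This is a minimal sketch: the latency samples are invented for illustration and do not reproduce any measurement from the paper, and the nearest-rank percentile definition is one common convention chosen here as an assumption.

```python
def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def relative_reduction(before, after):
    """Fractional reduction of `after` relative to `before`."""
    return (before - after) / before


# Invented latency samples in milliseconds (NOT data from the paper).
static_ms = [100] * 18 + [400, 500]
modular_ms = [90] * 18 + [150, 300]

p95_static = percentile_nearest_rank(static_ms, 95)
p95_modular = percentile_nearest_rank(modular_ms, 95)
print(f"P95 reduction: {relative_reduction(p95_static, p95_modular):.1%}")
```

A tail percentile is reported instead of the mean because a few slow requests dominate perceived latency in multi-component workflows.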
We release the benchmark, harness, sweep configurations, and full run corpus.
Statement of artifact release in the paper; verifiable by checking the project's repository or supplementary materials.
These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control.
Synthesis/recommendation drawn from the empirical results on AgentFloor showing where small/mid models suffice and where frontier models have advantage; prescriptive claim rather than a direct empirical measurement.
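The design principle amounts to a routing rule. The sketch below assumes each task arrives labeled with a capability tier from 1 to 6 and that only the top tier is escalated; the tier labels, the threshold, and the model names are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    tier: int  # hypothetical scale: 1 = routine action, 6 = long-horizon planning


def route(task: Task, frontier_from_tier: int = 6) -> str:
    """Send routine tiers to a small open-weight model, escalate the rest."""
    if not 1 <= task.tier <= 6:
        raise ValueError(f"tier must be in 1..6, got {task.tier}")
    if task.tier >= frontier_from_tier:
        return "frontier-model"
    return "small-open-weight-model"


tasks = [
    Task("format a tool call", 2),
    Task("plan a 40-step migration under constraints", 6),
]
for t in tasks:
    print(t.name, "->", route(t))
```

The threshold is exposed as a parameter because where the hand-off should sit is an empirical question that depends on the models being compared.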
The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability.
Performance breakdown by capability tier on AgentFloor showing frontier (GPT-5) advantage on long-horizon planning/constraint-tracking tasks; both model groups have low absolute reliability on these tasks according to reported results.
We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs.
Empirical evaluation reported in the paper: 16 open-weight models spanning specified parameter sizes, inclusion of GPT-5, and a total of 16,542 scored runs (reported counts).
We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints.
Paper describes the design of the benchmark: deterministic, 30 tasks, organized into six tiers covering specified capabilities. This is a descriptive claim about the artifact introduced in the work.
The paper proposes five forms of online and offline issuance of RSDM, providing a prototype for creating a globally recognized modern honest money.
Authors' stated contribution in the paper (enumeration of five issuance forms and provision of a prototype); the excerpt explicitly refers to 'five forms'.
RSDM is an innovative version of Jiaozi (a deposit receipt for base metal coin that emerged in Sichuan, China, about a thousand years ago).
Comparative/analogical claim by the authors linking the proposed design to a historical instrument; no empirical analysis provided in the excerpt.
Redeemable Self-Decaying/Devaluing Money (RSDM) is a tokenized commodity money whose essential innovation is to cover the storage fee of metal coins through the self-devaluing of the metal weight recorded on the coins' deposit certificate (warehouse receipt).
Design/specification proposed in the paper (conceptual mechanism); no empirical evaluation or sample size reported in the excerpt.
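The self-devaluing of the recorded metal weight can be illustrated with a simple decay schedule. The excerpt does not specify a schedule, so the geometric (compounding) form, the function name `redeemable_weight`, and the 1% annual rate below are assumptions made purely for illustration.

```python
def redeemable_weight(initial_grams: float, annual_decay_rate: float, years: float) -> float:
    """Metal weight still redeemable after `years`, under assumed geometric decay.

    The decayed portion stands in for the storage fee retained by the issuer.
    """
    if not 0 <= annual_decay_rate < 1:
        raise ValueError("annual_decay_rate must be in [0, 1)")
    if years < 0:
        raise ValueError("years must be non-negative")
    return initial_grams * (1 - annual_decay_rate) ** years


# Hypothetical certificate: 100 g of metal, 1% assumed annual decay.
for year in (0, 1, 10):
    grams = redeemable_weight(100.0, 0.01, year)
    print(f"after {year:>2} years: {grams:.2f} g redeemable")
```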
When AI acts as an agent for cross-border capital pooling and cross-cycle asset allocation, it needs sound money that can resist the depreciation of fiat currency and store long-term value.
Theoretical argument in the paper about functional requirements of AI agents managing cross-border capital; no empirical sample reported in the excerpt.
In the AI world, however, the medium of exchange tends to be a globally recognized currency.
Author's theoretical assertion / forward-looking claim in the paper; no empirical data or sample provided in the excerpt.
TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.
Author statement about the intended use of the benchmark and transparency practices (publication of provenance and limitations).
We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0.
Author statement of artifact release (license explicitly CC BY 4.0).
We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference).
Methodological contribution described by the authors; framework specification and composite metrics defined in the paper.
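Two of the headline composites can be read as simple ratios over a batch of scored requests. The sketch below takes that natural reading as an assumption; it is not TokenArena's published formula. The record type `EndpointRun`, the function `per_correct_answer`, and the sample numbers are hypothetical, and endpoint fidelity is omitted because it needs a reference output distribution.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EndpointRun:
    energy_joules: float  # modeled energy estimate for the request
    cost_usd: float       # price paid for the request
    correct: bool         # whether the answer was scored correct


def per_correct_answer(runs: List[EndpointRun]) -> Optional[Dict[str, float]]:
    """Total energy and total cost, each divided by the number of correct answers."""
    n_correct = sum(1 for r in runs if r.correct)
    if n_correct == 0:
        return None  # ratios are undefined when nothing was answered correctly
    return {
        "joules_per_correct": sum(r.energy_joules for r in runs) / n_correct,
        "usd_per_correct": sum(r.cost_usd for r in runs) / n_correct,
    }


# Invented sample: four requests, two answered correctly.
runs = [
    EndpointRun(10.0, 0.5, True),
    EndpointRun(10.0, 0.5, False),
    EndpointRun(10.0, 0.5, True),
    EndpointRun(10.0, 0.5, False),
]
composites = per_correct_answer(runs)
print(composites)
```

Dividing by correct answers rather than by all requests charges an endpoint for the energy and money it spends on wrong answers.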