Evidence (13661 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	740	192	95	871	1945
Governance & Regulation	796	388	185	119	1512
Organizational Efficiency	765	186	123	82	1166
Technology Adoption Rate	610	227	121	95	1061
Research Productivity	409	121	56	331	928
Output Quality	464	174	58	47	743
Decision Quality	318	173	75	42	615
Firm Productivity	432	55	88	20	601
AI Safety & Ethics	214	273	65	33	589
Market Structure	175	165	120	24	489
Task Allocation	206	64	70	31	376
Skill Acquisition	161	57	57	16	291
Innovation Output	201	27	41	18	288
Fiscal & Macroeconomic	130	69	43	26	275
Employment Level	104	50	105	13	274
Consumer Welfare	116	62	42	11	231
Firm Revenue	149	45	26	3	223
Inequality Measures	43	120	49	6	218
Task Completion Time	164	29	8	12	214
Worker Satisfaction	89	60	20	12	181
Error Rate	69	89	9	2	169
Regulatory Compliance	74	67	14	4	159
Training Effectiveness	91	19	13	19	144
Wages & Compensation	77	33	25	6	141
Team Performance	86	17	27	9	140
Automation Exposure	49	50	22	12	136
Developer Productivity	91	17	14	5	128
Job Displacement	12	80	19	1	112
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	16	7	2	57
Skill Obsolescence	5	43	6	1	55
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
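
The Total column can be cross-checked against the four direction columns. The sketch below uses three rows copied from the matrix above and computes each row's positive share and the gap between Total and the row sum. Treating a dash in the table as zero is an assumption, and so is the reading that a non-zero gap reflects claims with no coded direction; the page itself does not say.

```python
# Rows copied from the matrix above: (positive, negative, mixed, null, total).
ROWS = {
    "Governance & Regulation": (796, 388, 185, 119, 1512),
    "Job Displacement": (12, 80, 19, 1, 112),
    "Labor Share of Income": (17, 17, 17, 0, 51),  # dash in the table treated as 0 (assumption)
}


def summarize(counts):
    """Return (positive share of total, total minus the sum of the four direction columns)."""
    positive, negative, mixed, null, total = counts
    directional = positive + negative + mixed + null
    return positive / total, total - directional


for outcome, counts in ROWS.items():
    share, gap = summarize(counts)
    print(f"{outcome}: {share:.1%} positive, {gap} claims outside the four columns")
```

Rows such as Job Displacement reconcile exactly, while larger categories such as Governance & Regulation leave a small remainder.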

This paper presents a taxonomy of seven failure modes unique to production agentic systems.

Author contribution: taxonomy presented in the paper (count = seven failure modes).

high positive Evaluating Agentic AI in the Wild: Failure Modes, Drift Patt... cataloging of distinct failure modes in production agentic systems

These findings provide insights for designing flexible yet reliable constraint-based workflows.

Synthesis and discussion of study results and technical evaluation in paper's conclusion.

high positive U-Define: Designing User Workflows for Hard and Soft Constra... design guidance for constraint-based workflows

User-defined constraint types improve user satisfaction.

Reported user study measures showing higher satisfaction for participants using U-Define compared to baselines (no sample size or numeric effects provided).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... user satisfaction (self-reported)

User-defined constraint types improve performance.

Reported results from user studies and/or technical evaluation indicating better task performance when users can set hard/soft constraint types (no numeric effect size or sample size in excerpt).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... performance (task success / quality of generated plans)

User-defined constraint types improve perceived usefulness.

Results from the reported user studies comparing U-Define (user-defined constraint types) to baselines; based on participant responses and measures of perceived usefulness (sample sizes/details not provided in excerpt).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... perceived usefulness (user-reported)

U-Define verifies hard constraints using formal model checking and verifies soft constraints using an LLM-as-judge evaluation.

Description of the complementary verification methods employed in the U-Define system (technical design/implementation).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... verification of constraint types (hard via model checking, soft via LLM evaluati...

We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility.

System implementation and description in paper (design and implementation of U-Define).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... ability to specify constraints (natural-language input and categorization into h...

KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.

Theoretical claim about economic and cumulative effects of adopting KOs; no cost-benefit analysis, pilot results, or quantitative evidence reported in the paper.

high positive Reliable AI Needs to Externalize Implicit Knowledge: A Human... cost-effectiveness of verification and cumulative improvement in AI reliability

We propose Knowledge Objects (KOs) — structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse.

Proposed solution described in the paper; conceptual design and intended properties presented, without reported deployments, trials, or empirical evaluation.

high positive Reliable AI Needs to Externalize Implicit Knowledge: A Human... externalization and human verifiability of implicit knowledge via KOs

Evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, provides added value compared to focusing on benchmark performance only.

Argument/interpretation in the paper based on the study's multi-turn human-in-the-loop evaluation showing differences between objective performance gains and participant perceptions.

high positive Seeking Information with RAG-Assistants: Does Model Size Mat... evaluation methodology value (usability, satisfaction, accuracy)

Hybrid systems (human + RAG assistant) are beneficial in information-seeking scenarios.

Conclusion drawn from the experiment showing human-AI collaboration outperforms model-only baselines across model sizes in a realistic multi-turn information-seeking task with N=112 participants.

high positive Seeking Information with RAG-Assistants: Does Model Size Mat... task performance in information-seeking

The performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size.

Reported results from the experimental comparison across conditions and three model sizes (3B, 8B, 70B) with N=112 participants; paper states the performance gain is significant across sizes (no numeric effect sizes or p-values provided in the excerpt).

high positive Seeking Information with RAG-Assistants: Does Model Size Mat... task accuracy / performance

The framework addresses AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.

Paper lists and provides guidance on AI-specific methodological issues (model versioning, interaction dynamics, contamination/spillover, equity). This is a descriptive claim about topics the framework covers, not an empirical evaluation of solutions.

high positive Principles and Guidelines for Randomized Controlled Trials i... coverage of AI-specific methodological challenges in evaluation guidelines

The framework implements a graded transparency and repeatability scheme.

Paper extends TOP-guideline-derived transparency principle into a graded scheme for transparency and repeatability; described as an operational feature of the proposed framework.

high positive Principles and Guidelines for Randomized Controlled Trials i... graded transparency and repeatability practices for AI RCTs

The framework integrates heterogeneity analysis and practical significance assessment.

Paper reports inclusion of guidance on analyzing heterogeneous treatment effects and assessing practical significance; presented as part of guidelines rather than tested across datasets.

high positive Principles and Guidelines for Randomized Controlled Trials i... inclusion of heterogeneity and practical significance analysis in evaluation pra...

The framework formalizes causal inference through RCT methodology for AI contexts.

Paper states adoption of randomized controlled trial methods and causal inference framing for AI impact evaluation; described as methodological proposition rather than validated application.

high positive Principles and Guidelines for Randomized Controlled Trials i... use of RCTs to support causal inference in AI evaluations

Our framework extends prior work by centering evaluation on human performance rather than model output alone.

Paper claims a conceptual shift: focus on human performance metrics; supported by argumentative rationale and literature references rather than empirical demonstration.

high positive Principles and Guidelines for Randomized Controlled Trials i... focus of evaluation metrics (human performance vs. model output)

The principles and guidelines serve three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms.

Paper's stated intended uses/positioning of the framework; presented as roles in the discussion/positioning section rather than empirically validated roles.

high positive Principles and Guidelines for Randomized Controlled Trials i... utility of the framework in planning, evaluating, and standard-setting

We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases.

Paper reports a concrete output: 33 guidelines derived from the five principles, with each guideline presented as requirement + rationale + implementation instructions + evidence base (documented in paper content).

high positive Principles and Guidelines for Randomized Controlled Trials i... availability of operational guidelines for AI RCTs

The paper adopts the four-validity framework of Shadish et al. (2002) and extends it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025).

Explicit methodological choice described in the paper: adoption of Shadish et al. four-validity framework and addition of a transparency/repeatability principle based on TOP Guidelines; documented in the text as design decision.

high positive Principles and Guidelines for Randomized Controlled Trials i... methodological framework / validity criteria

The framework draws on experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology.

Paper reports literature review and cross-disciplinary synthesis as the methodological foundation for the framework (references to those disciplines). No empirical cross-disciplinary experiment reported.

high positive Principles and Guidelines for Randomized Controlled Trials i... methodological comprehensiveness / interdisciplinary grounding

This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies).

Paper's stated contribution: development of a conceptual framework integrating RCT design principles for AI evaluation. Based on literature synthesis and methodological argumentation rather than empirical testing.

high positive Principles and Guidelines for Randomized Controlled Trials i... standardization of AI evaluation RCTs / evaluation methodology

The paper introduces a Specification Governance Model (SGM), grounded in Transaction Cost Economics, and provides a practical governance decision guide.

Conceptual/modeling contribution described in the paper: SGM grounded in TCE with an applied decision guide (theoretical plus prescriptive).

high positive The Productivity-Reliability Paradox: Specification-Driven G... governance decision-making for specification practices

The paper proposes the AI-Augmented Methodology Taxonomy (AAMT), classifying six methodologies under three AI integration tiers.

Conceptual contribution: taxonomy introduced and described in the paper (six methodologies, three tiers).

high positive The Productivity-Reliability Paradox: Specification-Driven G... existence and classification of methodologies (taxonomic contribution)

Telemetry across 10,000+ developers shows a 98% increase in pull requests.

Observational telemetry data aggregated across >10,000 developers reported in the paper; metric reported is percent increase in pull request count.

high positive The Productivity-Reliability Paradox: Specification-Driven G... number of pull requests (pull_request_count)

Controlled studies report 20-56% productivity gains on well-scoped tasks.

Aggregate of multiple controlled experimental studies cited in the paper (2022–2026); reported as observed productivity improvements on well-scoped tasks in those studies. Specific study-level sample sizes not reported in the claim text.

high positive The Productivity-Reliability Paradox: Specification-Driven G... developer productivity

Practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration can be articulated, and calibrated beliefs plus utility-aware policies can improve agentic AI orchestration (illustrated via concrete examples and design patterns).

Paper provides articulated properties, examples, and design patterns but no empirical validation; claims of improvement are illustrated conceptually.

high positive Position: agentic AI orchestration should be Bayes-consisten... improvement in agentic AI orchestration from calibrated beliefs and utility-awar...

Coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily in the LLM agent parameters.

Central prescriptive claim of the position paper; supported by conceptual argumentation and illustrative examples rather than empirical tests.

high positive Position: agentic AI orchestration should be Bayes-consisten... coherence of decision-making in agentic systems as a function of orchestration-l...

Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions

Argumentative/theoretical claim in the position paper; illustrated with conceptual examples and design patterns rather than empirical evaluation.

high positive Position: agentic AI orchestration should be Bayes-consisten... decision quality of agentic control via belief maintenance and updating
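
The belief-maintenance, updating, and action-selection loop named in this claim can be illustrated with a minimal sketch. It is not the paper's method: the Beta-Bernoulli model, the uniform prior, the tool names, the costs, and the observations are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief over a tool's latent success probability."""
    alpha: float = 1.0  # uniform prior (assumption)
    beta: float = 1.0

    def update(self, success: bool) -> None:
        # Conjugate Bayesian update for one Bernoulli observation.
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        # Posterior mean of the success probability.
        return self.alpha / (self.alpha + self.beta)


def choose_tool(beliefs, reward, cost):
    """Pick the tool with the highest expected utility under current beliefs."""
    return max(beliefs, key=lambda tool: beliefs[tool].mean * reward - cost[tool])


# Hypothetical tools, per-call costs, and observed outcomes.
beliefs = {"search": BetaBelief(), "calculator": BetaBelief()}
cost = {"search": 0.2, "calculator": 0.1}
for outcome in (True, True, False, True):
    beliefs["search"].update(outcome)

best = choose_tool(beliefs, reward=1.0, cost=cost)
print(best, round(beliefs["search"].mean, 3))
```

The orchestrator here keeps the beliefs and the utility calculation outside any model, which matches the claim's emphasis on the orchestration level rather than the LLM parameters.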

Many high-value deployments rely on decisions under uncertainty (for example, which tool to call, which expert to consult, or how many resources to invest).

Stated as a motivating observation in the paper; no quantitative data or sample provided.

high positive Position: agentic AI orchestration should be Bayes-consisten... prevalence of decision-under-uncertainty requirements in high-value deployments

LLMs excel at predictive tasks and complex reasoning tasks.

Asserted in the paper's opening motivation; no empirical evaluation or sample reported in the paper itself.

high positive Position: agentic AI orchestration should be Bayes-consisten... LLM performance on predictive and reasoning tasks

The platform was used to support compound AI use cases at Salesforce, specifically Agentforce (autonomous AI agents) and ApexGuru (AI-powered code analysis).

Paper states the deployment was developed at Salesforce and lists Agentforce and ApexGuru as supported use cases; this is an implementation/adoption claim rather than a quantitative result.

high positive Scalable Inference Architectures for Compound AI Systems: A ... support/adoption by named applications (Agentforce, ApexGuru)

The architecture enables compound AI systems to: (a) scale model invocations in parallel, (b) handle bursty multi-agent workloads, and (c) support rapid model iteration — capabilities essential for operationalizing agentic AI at enterprise scale.

Paper provides case studies (Agentforce, ApexGuru) and operational lessons from production deployment to support these functional claims; the provided text does not include numerical benchmarks for each capability individually nor sample sizes.

high positive Scalable Inference Architectures for Compound AI Systems: A ... scalability of model invocations, ability to handle bursty workloads, support fo...

The modular, platform-agnostic inference architecture integrates serverless execution, dynamic autoscaling, and MLOps pipelines to deliver consistent low-latency inference across multi-component agent workflows.

System design and production deployment description in the paper; claim supported by implementation details and reported production performance (qualitative and operational evidence), but no detailed experimental protocol or sample sizes are given in the provided text.

high positive Scalable Inference Architectures for Compound AI Systems: A ... consistency of low-latency inference (multi-component agent workflows)

The platform delivered 30 to 40% cost savings relative to prior static deployments.

Reported production cost comparisons between the new modular inference architecture and prior static deployments (paper states "30 to 40% cost savings"); the provided text does not include details on cost components, time period, or sample size.

high positive Scalable Inference Architectures for Compound AI Systems: A ... infrastructure / inference cost

The deployment produced up to 3.9x throughput improvement compared to prior static deployments.

Reported production results comparing throughput of the modular inference architecture to prior static deployments (statement in the paper: "up to 3.9x throughput improvement"); no sample size or confidence intervals provided in the provided text.

high positive Scalable Inference Architectures for Compound AI Systems: A ... inference throughput

The production deployment achieved over 50% reduction in tail latency (P95) compared to prior static deployments.

Reported production results comparing the modular inference architecture to prior static deployments (production measurements of P95 tail latency); paper states this was observed in production but does not report sample size or detailed statistical tests in the provided text.

high positive Scalable Inference Architectures for Compound AI Systems: A ... P95 tail latency
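
For readers unfamiliar with the metric, the sketch below shows one common way to compute a P95 latency and the fractional reduction between two deployments. The nearest-rank definition is an assumption, since the excerpt does not say how the paper computes P95, and the sample values are hypothetical rather than measurements from the paper.

```python
import math


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def reduction(before_ms, after_ms):
    """Fractional reduction in P95 latency from `before_ms` to `after_ms`."""
    return 1 - p95(after_ms) / p95(before_ms)


# Hypothetical samples: most requests are fast, a few are slow.
before = [100] * 94 + [400] * 6
after = [90] * 94 + [180] * 6

print(f"P95 before: {p95(before)} ms, after: {p95(after)} ms")
print(f"Reduction: {reduction(before, after):.0%}")
```

Because P95 is driven by the slowest few percent of requests, it can fall sharply even when the typical request barely changes, as in these made-up samples.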

We release the benchmark, harness, sweep configurations, and full run corpus.

Statement of artifact release in the paper; verifiable by checking the project's repository or supplementary materials.

high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... availability of released materials (benchmark and run corpus)

These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control.

Synthesis/recommendation drawn from the empirical results on AgentFloor showing where small/mid models suffice and where frontier models have advantage; prescriptive claim rather than a direct empirical measurement.

high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... recommended task routing strategy for agentic systems (model assignment to task ...
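
The routing principle in this claim could be sketched as a simple tier-based dispatcher. The numbering of tiers from 1 to 6, the cutoff tier, and the model labels are hypothetical; the excerpt does not state at which tier frontier models become necessary.

```python
def route(tier: int, frontier_from_tier: int = 6) -> str:
    """Return a model class for a task at the given capability tier.

    Tiers are assumed to run from 1 (routine) to 6 (long-horizon planning).
    `frontier_from_tier` is a hypothetical cutoff, not a value from the paper.
    """
    if not 1 <= tier <= 6:
        raise ValueError(f"tier must be between 1 and 6, got {tier}")
    return "frontier" if tier >= frontier_from_tier else "small-open-weight"


# Routine tasks go to the small model; only the top tier is escalated here.
assignments = {tier: route(tier) for tier in range(1, 7)}
print(assignments)
```

In practice the cutoff would be chosen from per-tier results such as those the benchmark reports, not fixed in advance.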

The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability.

Performance breakdown by capability tier on AgentFloor showing frontier (GPT-5) advantage on long-horizon planning/constraint-tracking tasks; both model groups have low absolute reliability on these tasks according to reported results.

high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... performance on long-horizon planning tasks (ability to sustain coordination and ...

We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs.

Empirical evaluation reported in the paper: 16 open-weight models spanning specified parameter sizes, inclusion of GPT-5, and a total of 16,542 scored runs (reported counts).

high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... evaluation runs (model-by-task performance across 16,542 scored runs)

We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints.

Paper describes the design of the benchmark: deterministic, 30 tasks, organized into six tiers covering specified capabilities. This is a descriptive claim about the artifact introduced in the work.

high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... benchmark construction (30 tasks, six-tier capability ladder)

The paper proposes five forms of online and offline issuance of RSDM, providing a prototype for creating a globally recognized modern honest money.

Authors' stated contribution in the paper (enumeration of five issuance forms and provision of a prototype); the excerpt explicitly refers to 'five forms'.

high positive RSDM: The Consensus Honest Money in the AI Era number_of_issuance_forms_proposed_and_provision_of_a_prototype

RSDM is an innovative version of Jiaozi (a deposit receipt for base metal coin that emerged in Sichuan, China, about a thousand years ago).

Comparative/analogical claim by the authors linking the proposed design to a historical instrument; no empirical analysis provided in the excerpt.

high positive RSDM: The Consensus Honest Money in the AI Era similarity_between_RSDM_and_historical_Jiaozi

Redeemable Self-Decaying/Devaluing Money (RSDM) is a tokenized commodity money whose essential innovation is to fill the hole in the storage fee of metal coins through the self-devaluing of metal weight recorded on the deposit certificate (warehouse receipt) of metal coins.

Design/specification proposed in the paper (conceptual mechanism); no empirical evaluation or sample size reported in the excerpt.

high positive RSDM: The Consensus Honest Money in the AI Era design_feature_RSDM_self-devaluation_to_cover_storage_fee

When AI acts as an agent for cross-border capital pools and cross-cyclical asset allocation, it needs a sound money that can resist the depreciation of fiat currency and store long-term value.

Theoretical argument in the paper about functional requirements of AI agents managing cross-border capital; no empirical sample reported in the excerpt.

high positive RSDM: The Consensus Honest Money in the AI Era need_for_sound_money_by_AI_agents_in_cross-border_capital_allocation

In the AI world, however, the medium of exchange tends to be a globally recognized currency.

Author's theoretical assertion / forward-looking claim in the paper; no empirical data or sample provided in the excerpt.

high positive RSDM: The Consensus Honest Money in the AI Era likelihood_of_global_currency_becoming_medium_of_exchange_for_AI

TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.

Author statement about the intended use of the benchmark and transparency practices (publication of provenance and limitations).

high positive Token Arena: A Continuous Benchmark Unifying Energy and Cogn... positioning of TokenArena as a methodological framework with published provenanc...

We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0.

Author statement of artifact release (license explicitly CC BY 4.0).

high positive Token Arena: A Continuous Benchmark Unifying Energy and Cogn... availability of TokenArena artifacts and leaderboard under CC BY 4.0

We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference).

Methodological contribution described by the authors; framework specification and composite metrics defined in the paper.

high positive Token Arena: A Continuous Benchmark Unifying Energy and Cogn... five core axes (output speed, time to first token, workload-blended price, effec...
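
Two of the headline composites, joules per correct answer and dollars per correct answer, can be read as a total divided by the number of correct answers. The sketch below follows that reading, which is an assumption because the excerpt gives only the metric names and not their formulas. The endpoint names and figures are hypothetical.

```python
def per_correct(total: float, n_correct: int) -> float:
    """Total resource spent divided by correct answers; infinite if none are correct."""
    if n_correct == 0:
        return float("inf")
    return total / n_correct


# Hypothetical endpoints: modeled energy (J), spend (USD), and correct answers.
endpoints = {
    "endpoint-a": {"joules": 5000.0, "dollars": 2.00, "correct": 80},
    "endpoint-b": {"joules": 3000.0, "dollars": 1.50, "correct": 40},
}

for name, e in endpoints.items():
    j = per_correct(e["joules"], e["correct"])
    d = per_correct(e["dollars"], e["correct"])
    print(f"{name}: {j:.1f} J per correct answer, ${d:.4f} per correct answer")
```

Under this reading, an endpoint that is cheaper in total can still be more expensive per correct answer if its accuracy is lower, as with the second hypothetical endpoint.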
