Evidence (13661 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	740	192	95	871	1945
Governance & Regulation	796	388	185	119	1512
Organizational Efficiency	765	186	123	82	1166
Technology Adoption Rate	610	227	121	95	1061
Research Productivity	409	121	56	331	928
Output Quality	464	174	58	47	743
Decision Quality	318	173	75	42	615
Firm Productivity	432	55	88	20	601
AI Safety & Ethics	214	273	65	33	589
Market Structure	175	165	120	24	489
Task Allocation	206	64	70	31	376
Skill Acquisition	161	57	57	16	291
Innovation Output	201	27	41	18	288
Fiscal & Macroeconomic	130	69	43	26	275
Employment Level	104	50	105	13	274
Consumer Welfare	116	62	42	11	231
Firm Revenue	149	45	26	3	223
Inequality Measures	43	120	49	6	218
Task Completion Time	164	29	8	12	214
Worker Satisfaction	89	60	20	12	181
Error Rate	69	89	9	2	169
Regulatory Compliance	74	67	14	4	159
Training Effectiveness	91	19	13	19	144
Wages & Compensation	77	33	25	6	141
Team Performance	86	17	27	9	140
Automation Exposure	49	50	22	12	136
Developer Productivity	91	17	14	5	128
Job Displacement	12	80	19	1	112
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	16	7	2	57
Skill Obsolescence	5	43	6	1	55
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
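
The counts above can be turned into direction shares to compare outcomes of very different sizes. The sketch below does this for four rows copied from the matrix. It rests on two assumptions not stated by the source: a missing cell is read as 0, and shares are taken over the four listed columns rather than over Total, because the listed columns do not always add up to Total (Firm Productivity lists 595 claims against a Total of 601). The function and variable names are illustrative only.

```python
# Rows copied from the matrix above: (outcome, positive, negative, mixed, null).
ROWS = [
    ("Firm Productivity", 432, 55, 88, 20),
    ("AI Safety & Ethics", 214, 273, 65, 33),
    ("Job Displacement", 12, 80, 19, 1),
    ("Labor Share of Income", 17, 17, 17, 0),  # missing Null cell read as 0 (assumption)
]


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the four listed counts."""
    listed = positive + negative + mixed + null
    return {
        "positive": positive / listed,
        "negative": negative / listed,
        "mixed": mixed / listed,
        "null": null / listed,
    }


SHARES = {name: direction_shares(*counts) for name, *counts in ROWS}

for name, shares in SHARES.items():
    print(f"{name}: {shares['positive']:.0%} positive, {shares['negative']:.0%} negative")
```

Read this way, Job Displacement is dominated by negative findings while Firm Productivity is dominated by positive ones, which the raw counts alone make harder to see.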

AI integration significantly enhances customer satisfaction.

The paper reports a statistically significant positive association between AI integration and customer satisfaction, estimated with System GMM and supported by robustness checks. The supplied text gives no details on how customer satisfaction was measured or on the sample size.

high positive Research on the Transformation Acceleration of Financial Ins... Customer satisfaction

AI integration significantly enhances risk-adjusted returns.

Reported empirical results using System GMM with FE and RE robustness checks; the paper states statistical significance but does not provide effect magnitudes in the supplied summary.

high positive Research on the Transformation Acceleration of Financial Ins... Risk-adjusted returns

AI integration significantly enhances operational efficiency.

Same empirical analysis using System GMM, with FE and RE models for robustness (no sample size or numeric estimates provided in the supplied text).

high positive Research on the Transformation Acceleration of Financial Ins... Operational efficiency

AI integration significantly enhances return on assets (ROA).

Empirical analysis reported in the paper using System Generalized Method of Moments (System GMM) estimator, with Fixed Effects (FE) and Random Effects (RE) models used as robustness checks. (No sample size or test statistics provided in the text supplied.)

high positive Research on the Transformation Acceleration of Financial Ins... Return on assets (ROA)

Synthesizing evidence, the paper identifies gaps and opportunities in current responsible AI research: (1) to engage with the diverse range of levers that influence organizations to abandon AI development, and (2) to better support appropriate engagement or disengagement with AI system development.

Synthesis and discussion section combining the taxonomy and empirical case analysis to produce research agenda and recommendations.

high positive To Build or Not to Build? Factors that Lead to Non-Developme... research and practical opportunities to influence AI development decisions

Decisions taken in earlier stages of development shape which systems are ultimately released, representing potential points for intervention to influence AI deployment outcomes.

Conceptual argument supported by the paper's taxonomy and case analyses showing pre-deployment factors that lead to abandonment.

high positive To Build or Not to Build? Factors that Lead to Non-Developme... influence of early-stage development decisions on eventual system release/abando...

Academic responsible AI communities often emphasize ethical risks as reasons to not develop AI.

Observation from the scoping review and literature synthesis comparing academic emphases with other sources.

high positive To Build or Not to Build? Factors that Lead to Non-Developme... frequency/degree of emphasis on ethical risks in responsible AI academic literat...

The authors collected data on real-world cases of AI system abandonment via an AI incident database and a practitioner survey to evidence and compare factors that drive abandonment both prior to and following system deployment.

Empirical data collection described in the paper: use of an AI incident database and a practitioner survey; summary does not report sample sizes or survey response counts.

high positive To Build or Not to Build? Factors that Lead to Non-Developme... observed drivers of AI abandonment in real-world cases (pre- vs post-deployment)

Through thematic analysis of reviewed sources, the paper develops a taxonomy of six categories of factors contributing to AI abandonment: ethical concerns, stakeholder feedback, development lifecycle challenges, organizational dynamics, resource constraints, and legal/regulatory concerns.

Qualitative thematic analysis of the scoping review materials; the resulting taxonomy is enumerated in the paper. The number of documents/sources is not stated in the summary provided.

high positive To Build or Not to Build? Factors that Lead to Non-Developme... categorization of factors driving AI abandonment

The authors performed a scoping review of academic literature, civil society resources, and grey literature (including journalism and industry reports) to identify factors influencing AI abandonment.

Methods statement in the paper describing a systematic scoping review of multiple source types; no numeric sample size reported in the summary.

high positive To Build or Not to Build? Factors that Lead to Non-Developme... scope and composition of reviewed sources

The paper reframes AI safety as layered control, authorization, and externally reviewable limits, linking alignment, security engineering, organizational economics, and institutional design.

Synthesis and prescriptive claim based on the paper's theoretical analysis and proposed framework; supported by conceptual integration rather than empirical testing.

high positive AI Safety as Control of Irreversibility: A Systems Framework... safety governance approach (layered controls and limits)

The main result is a boundary stabilization theorem showing that safety need not require proving that advanced systems are always correct; instead it requires institutional and technical designs that prevent irreversible power from being released by a single high-efficiency node.

Formal/theoretical claim presented as the paper's primary theorem (a 'boundary stabilization theorem') demonstrated within the paper's formal model.

high positive AI Safety as Control of Irreversibility: A Systems Framework... safety (effectiveness of layered controls vs. proof-of-correctness)

The index diverges sharply from existing AI exposure measures for specific occupation groups: power plant operators, railroad conductors, and aircraft cargo handling supervisors score high on RL feasibility but low on general AI exposure.

Empirical comparison between the RL Feasibility Index and existing AI-exposure measures, with named occupation groups showing opposite rankings.

high positive What Jobs Can AI Learn? Measuring Exposure by Reinforcement ... relative RL feasibility vs. general AI exposure for named occupations

Using LLM annotators guided by a rubric developed with RL experts and validated against confirmed deployment cases, we score all 17,951 O*NET tasks for training feasibility and aggregate to the occupation level, producing an RL Feasibility Index.

Empirical method described in paper: LLM-based annotation process guided by expert-developed rubric; validation against confirmed deployment cases; explicit enumeration of 17,951 O*NET tasks scored and aggregated into an index.

high positive What Jobs Can AI Learn? Measuring Exposure by Reinforcement ... training feasibility of O*NET tasks; RL Feasibility Index at task and occupation...
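
The task-to-occupation aggregation step can be sketched as follows. The summary does not say how task scores are combined, so the unweighted mean used here is an assumption, and the occupation codes, task IDs, and 0 to 1 score scale are hypothetical placeholders rather than values from the paper.

```python
from collections import defaultdict
from statistics import mean


def occupation_index(task_scores):
    """Average task-level feasibility scores within each occupation.

    task_scores: iterable of (occupation_code, task_id, score) tuples.
    The unweighted mean is an assumed aggregation rule, not the paper's.
    """
    by_occupation = defaultdict(list)
    for occupation, _task_id, score in task_scores:
        by_occupation[occupation].append(score)
    return {occ: mean(scores) for occ, scores in by_occupation.items()}


# Hypothetical scores on a 0-1 scale; not values from the paper.
EXAMPLE = [
    ("occ-A", "t1", 0.8),
    ("occ-A", "t2", 0.6),
    ("occ-B", "t3", 0.2),
]
INDEX = occupation_index(EXAMPLE)
```

A weighted mean (for example by task importance) would be an equally plausible choice; only the paper itself can settle which rule was used.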

We examine this for every occupation in the US economy.

Statement of study scope in the paper (methodological claim about coverage).

high positive What Jobs Can AI Learn? Measuring Exposure by Reinforcement ... coverage of US occupations in the RL feasibility analysis

The no-talk baseline establishes that communication is necessary.

Experimental no-talk baseline showing worse coordination without communication between agents.

high positive Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... coordination performance with vs without communication

These results highlight dynamic grounding as a critical and understudied axis of multi-agent coordination.

Synthesis/interpretation of the experimental findings reported in the paper.

high positive Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... importance of dynamic grounding for multi-agent coordination

We introduce an iterated, multi-turn negotiation game in which two agents allocate shared resources toward private projects with verifiable jointly optimal outcomes.

Methodological contribution described in the paper (design of a new multi-turn negotiation game).

high positive Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... existence of a multi-turn negotiation benchmark with verifiable optimal outcomes

Grounding is the collaborative process of establishing mutual belief sufficient for the current communicative purpose.

Conceptual/definitional statement presented by the authors (no empirical data reported).

high positive Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... definition of grounding

The frontier for AI-augmented science is not acceleration; it is the redesign of the certifying infrastructure around these new scarcities.

Prescriptive conclusion in the paper arguing priority of institutional redesign over mere speed gains; presented without empirical testing in the excerpt.

high positive AI-Augmented Science and the New Institutional Scarcities prioritization of redesigning certifying infrastructure versus accelerating scie...

Competent-looking judgment, including selecting, ranking, attributing, and certifying, is now produced at scale at marginal cost approaching zero, inverting the dominant economics-of-AI reading that treats judgment as the scarce complement to cheap prediction.

Argumentative/theoretical claim in the paper; no empirical sample, experiment, or quantitative data reported in the excerpt (implicit basis: observation of scalable AI outputs).

high positive AI-Augmented Science and the New Institutional Scarcities production of competent-looking judgment (selecting, ranking, attributing, certi...

Policy recommendations: invest in digital infrastructure, human capital development, and inclusive technology diffusion strategies to ensure more equitable distribution of AI-driven economic value.

Policy implications drawn from study findings (heterogeneous effects and mediation by structural conditions).

high positive The Economic Value of Agentic AI: A Comparative Analysis of ... equitable distribution of AI-driven economic value (policy interventions)

The magnitude of AI's growth effects varies across economic contexts: developed economies experience substantially stronger growth impacts (approximately 0.33) than emerging economies (approximately 0.15).

Heterogeneity analysis / subgroup comparisons (developed vs emerging economies) using the panel data regressions and/or quantile regressions on the 2015–2024 dataset; exact sample sizes per subgroup not reported.

high positive The Economic Value of Agentic AI: A Comparative Analysis of ... economic growth (heterogeneous treatment effects by country group)

AI adoption has a comparatively weaker direct effect on economic growth (direct effect β = 0.09).

Mediation/structural decomposition from the paper showing direct (non-mediated) coefficient from AI adoption to growth.

high positive The Economic Value of Agentic AI: A Comparative Analysis of ... economic growth (direct effect)

Agentic AI influences economic growth primarily through a productivity channel (mediated effect β = 0.35, p < 0.01).

Mediation analysis (panel data) estimating indirect effect of AI adoption on GDP growth via measured productivity channel; data sources: World Bank and OECD indicators, 2015–2024.

high positive The Economic Value of Agentic AI: A Comparative Analysis of ... economic growth (mediated via productivity)
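
To make "primarily through a productivity channel" concrete, the reported direct effect (0.09) and mediated effect (0.35) can be combined into a share mediated. The sketch assumes the two coefficients are on the same scale and add up to a total effect, as in a standard linear mediation model; the summary does not confirm either point, so the result is illustrative only.

```python
def mediation_summary(direct, indirect):
    """Combine direct and indirect effects under an additive (linear) mediation model."""
    total = direct + indirect
    return {"total": total, "share_mediated": indirect / total}


# Coefficients as reported in the claim cards; additivity is an assumption.
SUMMARY = mediation_summary(direct=0.09, indirect=0.35)
print(f"total effect: {SUMMARY['total']:.2f}")
print(f"share mediated by productivity: {SUMMARY['share_mediated']:.0%}")
```

Under these assumptions roughly four fifths of the total effect runs through productivity.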

AI adoption significantly improves firm-level productivity (β = 0.18, p < 0.01).

Fixed-effects panel regression using an AI Adoption Index as predictor on firm-level productivity; data drawn from World Bank (World Development Indicators and Enterprise Surveys) and OECD AI indicators for 2015–2024 (sample size not reported in text).

high positive The Economic Value of Agentic AI: A Comparative Analysis of ... firm-level productivity

Agentic AI has strong potential to boost productivity and growth.

Statement in paper motivated by literature review and the study's empirical results linking AI adoption to productivity and growth.

high positive The Economic Value of Agentic AI: A Comparative Analysis of ... productivity and economic growth (general)

HAAS can serve as a pre-deployment workbench for comparing and inspecting human–AI allocation policies before organisational commitment.

Claim about intended use and demonstration of HAAS as an implemented tool; based on the framework implementation and benchmark experiments reported. No deployment-scale evaluation or sample sizes provided in the excerpt.

high positive HAAS: A Policy-Aware Framework for Adaptive Task Allocation ... ability to compare and inspect allocation policies prior to deployment

In manufacturing, stronger governance can improve operational performance and reduce fatigue simultaneously — a workload-buffering effect.

Domain-specific empirical result reported for the manufacturing benchmark in the paper, comparing operational performance and fatigue under different governance strengths. No numeric sample size or effect sizes provided in the excerpt.

high positive HAAS: A Policy-Aware Framework for Adaptive Task Allocation ... operational performance and worker fatigue

Task–agent fit is represented through five auditable cognitive dimensions and a five-mode autonomy spectrum (from human-only to fully autonomous) embedded in a reproducible benchmark spanning software engineering and manufacturing.

Design and benchmark description within the paper; specification of five cognitive dimensions and a five-mode autonomy spectrum and a reproducible benchmark across two domains. No numeric sample size provided.

high positive HAAS: A Policy-Aware Framework for Adaptive Task Allocation ... representation of task–agent fit and benchmarking across domains

HAAS combines a rule-based expert system that enforces governance constraints before any learning occurs, and a contextual-bandit learner that selects among feasible collaboration modes from outcome feedback.

Descriptive claim about the implemented HAAS framework as presented in the paper; method description of system architecture (rule-based expert system + contextual-bandit learner). No sample size reported.

high positive HAAS: A Policy-Aware Framework for Adaptive Task Allocation ... mechanism for adaptive task allocation (selected collaboration mode)
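
The two-stage design described above (rules first, learning second) can be sketched in a few lines. This is not the HAAS implementation: the governance rule, the three middle mode names, the task fields, and the epsilon-greedy learner with one reward table per context are all stand-ins chosen for brevity, since the excerpt does not specify the paper's rules or bandit algorithm.

```python
import random
from collections import defaultdict

# Five modes from human-only to fully autonomous; the three middle names are placeholders.
MODES = ["human_only", "ai_assisted", "shared", "ai_led", "fully_autonomous"]


def feasible_modes(task):
    """Hypothetical governance rule: safety-critical tasks exclude the two most autonomous modes."""
    if task["safety_critical"]:
        return MODES[:3]
    return list(MODES)


class EpsilonGreedyBandit:
    """Tracks the mean reward per (context, mode) pair and picks among allowed modes."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def select(self, context, allowed):
        # Explore with probability epsilon, otherwise exploit the best-known mode.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(allowed)
        return max(allowed, key=lambda mode: self.means[(context, mode)])

    def update(self, context, mode, reward):
        # Incremental update of the running mean reward.
        key = (context, mode)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


def allocate(task, bandit):
    """Apply the rules first, then let the learner choose within what remains."""
    allowed = feasible_modes(task)
    return bandit.select(task["domain"], allowed)
```

Because the filter runs before selection, no amount of reward feedback can push the learner into a mode the rules forbid, which is the property the claim emphasizes.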

The field's near-term research agenda should explicitly include collecting and using triadic data.

Normative recommendation in the paper; presented as the authors' advised research priority rather than empirically justified within the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... inclusion of triadic data collection/use in near-term research agendas in the SW...

This data is the empirical key to four open questions in agent training.

Argumentative claim in the paper asserting that triadic data is central to four open research questions, which the excerpt does not specify; no empirical demonstration is included in the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... resolvability of four open questions in agent training using triadic data

This triadic data is capturable in 12-18 months with methods already mature in adjacent fields.

Claim in the paper based on authors' assessment of methodological maturity in adjacent fields; no empirical project timeline or pilot data is provided in the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... time required to collect a triadic dataset using existing methods

Any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher through a four-tier evidence framework: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation.

Methodological proposal in the paper outlining a four-tier evidence framework; presented as normative guidance rather than validated by application to a corpus in the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... quality and trustworthiness of fine-tuning corpora as judged by the four-tier fr...

The canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure.

Prescriptive specification in the paper proposing two concrete dataset types as canonical instantiations; presented as design/recommendation rather than empirically tested.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... availability and suitability of dataset modalities (stimulated-recall expert tra...

The substrate for the next generation of software-engineering (SWE) agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs; it is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both.

Argument and conceptual proposal in the paper; no empirical validation or comparative experiments are provided in the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... effectiveness of training data substrates for improving agent performance on lon...

SCDPs are a useful framework for policy simulation for the digital economy, mechanism design for information systems, and digital twin modeling of cyberinfrastructure.

Paper posits these applications as prospective uses of the framework (argumentative/speculative; no empirical evaluation reported in abstract).

high positive The Design and Composition of Structural Causal Decision Pro... usefulness for policy simulation, mechanism design, and digital twin modeling

SCDPs are capable of modeling variable discounting, a tool used widely in social scientific modeling.

Paper states the capability as part of SCDP definition and examples (theoretical claim).

high positive The Design and Composition of Structural Causal Decision Pro... modeling of variable discounting

An SCDP can endogenously model the memory-formation process and is thus useful for modeling resource‑rational agents in dynamic settings.

Paper asserts SCDP can represent memory-formation endogenously and discusses application to resource-rational agents (theoretical modeling capability).

high positive The Design and Composition of Structural Causal Decision Pro... ability to model endogenous memory formation / resource-rational agents

SCDPs are strictly more expressive than POMDPs because they do not assume rational belief formation.

Comparative expressiveness claim stated in the paper; supported by theoretical argument or formal separation result (paper text states the claim explicitly).

high positive The Design and Composition of Structural Causal Decision Pro... expressiveness relative to POMDPs (ability to represent non-rational belief form...

SCDPs inherit the composition properties of SCDMs (i.e., SCDPs benefit from SCDM composability).

Logical consequence argued in the paper from SCDP being constructed from SCDMs; likely supported by formal argumentation in the text.

high positive The Design and Composition of Structural Causal Decision Pro... inheritance of composability by SCDPs

A Structural Causal Decision Process (SCDP) is defined as a recurring SCDM with a discount variable.

Formal definition introduced in the paper (theoretical definition).

high positive The Design and Composition of Structural Causal Decision Pro... definition of SCDP as recurring SCDM with discounting
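
One way to read "a discount variable" is that the discount factor may differ from step to step instead of being a fixed constant. The sketch below computes a return under that reading, weighting each reward by the product of all earlier discount values. This is an assumed interpretation for illustration; the paper's formal definition is not in the excerpt, and the function name is hypothetical.

```python
def discounted_return(rewards, discounts):
    """Sum rewards, weighting step t by the product of all earlier discount values.

    With a constant discount this reduces to the usual geometric discounting.
    """
    total, weight = 0.0, 1.0
    for reward, discount in zip(rewards, discounts):
        total += weight * reward
        weight *= discount
    return total


# Constant discount of 0.5: weights are 1, 0.5, 0.25.
CONSTANT = discounted_return([1, 1, 1], [0.5, 0.5, 0.5])

# Step-dependent discount: weights are 1, 0.5, 0.5.
VARIABLE = discounted_return([1, 1, 1], [0.5, 1.0, 0.9])
```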

SCDMs have a well-defined and computationally useful property of composability.

The paper states and demonstrates ("We show") the composability property, presumably via formal proofs or constructive arguments in the text (theoretical proofs/exposition).

high positive The Design and Composition of Structural Causal Decision Pro... composability of causal decision models

SCDMs can have open root variables for which no probability distribution or structural equation is given.

Model definitions in the paper explicitly allow open root variables (theoretical description).

high positive The Design and Composition of Structural Causal Decision Pro... support for open root variables in model formalism

In SCDMs, agent decisions can be constrained by their causal antecedents (i.e., decisions can be constrained by their causal parents).

Model specification and definitions in the paper describing constraints on decisions as part of SCDM structure (theoretical construction).

high positive The Design and Composition of Structural Causal Decision Pro... decision constraints by causal antecedents

Structural Causal Decision Models (SCDMs) expand on Structural Causal Influence Models by explicitly representing the causal relationships between model variables and the payoffs of agent decisions.

Formal model development and comparison to existing SCIMs provided in the paper (theoretical definitions and arguments).

high positive The Design and Composition of Structural Causal Decision Pro... explicit representation of causal relationships between variables and payoffs

We present two new classes of causal models of decision-making agents: Structural Causal Decision Models (SCDMs) and Structural Causal Decision Processes (SCDPs).

Paper introduces formal definitions for two model classes and describes their properties in the text (theoretical exposition).

high positive The Design and Composition of Structural Causal Decision Pro... introduction of new model classes (SCDMs and SCDPs)

We propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs.

Author contribution: design and open-source implementation of PAEF described in the paper.

high positive Evaluating Agentic AI in the Wild: Failure Modes, Drift Patt... provision of a continuous, production-focused evaluation framework (PAEF)

The taxonomy and its failure modes are grounded in observations from systems operating at billion-event scale.

Author statement that observations underlying the taxonomy come from systems operating at billion-event scale.

high positive Evaluating Agentic AI in the Wild: Failure Modes, Drift Patt... empirical grounding (scale) of observations used to derive the taxonomy
