Evidence (13661 claims)
Adoption: 8339 claims
Productivity: 7479 claims
Governance: 6715 claims
Human-AI Collaboration: 6267 claims
Org Design: 4098 claims
Innovation: 3987 claims
Labor Markets: 3488 claims
Skills & Training: 2888 claims
Inequality: 2016 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 740 | 192 | 95 | 871 | 1945 |
| Governance & Regulation | 796 | 388 | 185 | 119 | 1512 |
| Organizational Efficiency | 765 | 186 | 123 | 82 | 1166 |
| Technology Adoption Rate | 610 | 227 | 121 | 95 | 1061 |
| Research Productivity | 409 | 121 | 56 | 331 | 928 |
| Output Quality | 464 | 174 | 58 | 47 | 743 |
| Decision Quality | 318 | 173 | 75 | 42 | 615 |
| Firm Productivity | 432 | 55 | 88 | 20 | 601 |
| AI Safety & Ethics | 214 | 273 | 65 | 33 | 589 |
| Market Structure | 175 | 165 | 120 | 24 | 489 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 161 | 57 | 57 | 16 | 291 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Fiscal & Macroeconomic | 130 | 69 | 43 | 26 | 275 |
| Employment Level | 104 | 50 | 105 | 13 | 274 |
| Consumer Welfare | 116 | 62 | 42 | 11 | 231 |
| Firm Revenue | 149 | 45 | 26 | 3 | 223 |
| Inequality Measures | 43 | 120 | 49 | 6 | 218 |
| Task Completion Time | 164 | 29 | 8 | 12 | 214 |
| Worker Satisfaction | 89 | 60 | 20 | 12 | 181 |
| Error Rate | 69 | 89 | 9 | 2 | 169 |
| Regulatory Compliance | 74 | 67 | 14 | 4 | 159 |
| Training Effectiveness | 91 | 19 | 13 | 19 | 144 |
| Wages & Compensation | 77 | 33 | 25 | 6 | 141 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Automation Exposure | 49 | 50 | 22 | 12 | 136 |
| Developer Productivity | 91 | 17 | 14 | 5 | 128 |
| Job Displacement | 12 | 80 | 19 | 1 | 112 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Skill Obsolescence | 5 | 43 | 6 | 1 | 55 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
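The counts in the matrix can be turned into per-direction shares, which makes rows of very different size easier to compare. The sketch below is illustrative only: it copies four rows from the table, treats a dash as zero (an assumption), and divides by the listed Total. In some rows the four direction columns add up to less than the listed Total (Firm Productivity sums to 595 against a Total of 601), so the shares need not sum to 100%.

```python
# (positive, negative, mixed, null, listed total), copied from the matrix above.
# A dash in the table is entered as 0 here; that reading is an assumption.
rows = {
    "Firm Productivity": (432, 55, 88, 20, 601),
    "Job Displacement": (12, 80, 19, 1, 112),
    "Inequality Measures": (43, 120, 49, 6, 218),
    "Labor Share of Income": (17, 17, 17, 0, 51),
}


def direction_shares(pos, neg, mixed, null, total):
    """Return each direction's share of the listed total for one row."""
    return {
        "positive": pos / total,
        "negative": neg / total,
        "mixed": mixed / total,
        "null": null / total,
    }


shares = {name: direction_shares(*counts) for name, counts in rows.items()}

for name, row_shares in shares.items():
    summary = ", ".join(f"{k} {v:.1%}" for k, v in row_shares.items())
    print(f"{name}: {summary}")
```

Read this way, Job Displacement is dominated by negative findings (80 of 112 claims), while Labor Share of Income is split evenly across positive, negative, and mixed.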
AI integration significantly enhances customer satisfaction.
The paper reports a statistically significant positive association between AI integration and customer satisfaction, estimated with System GMM and robustness checks (the supplied text gives no details on how customer satisfaction was measured or on sample size).
AI integration significantly enhances risk-adjusted returns.
Empirical results from a System GMM estimator with Fixed Effects (FE) and Random Effects (RE) robustness checks; the paper states statistical significance, but the supplied summary gives no effect magnitudes.
AI integration significantly enhances operational efficiency.
Same empirical analysis using System GMM, with FE and RE models for robustness (no sample size or numeric estimates provided in the supplied text).
AI integration significantly enhances return on assets (ROA).
Empirical analysis reported in the paper using System Generalized Method of Moments (System GMM) estimator, with Fixed Effects (FE) and Random Effects (RE) models used as robustness checks. (No sample size or test statistics provided in the text supplied.)
Synthesizing evidence, the paper identifies gaps and opportunities in current responsible AI research: (1) to engage with the diverse range of levers that influence organizations to abandon AI development, and (2) to better support appropriate engagement or disengagement with AI system development.
Synthesis and discussion section combining the taxonomy and empirical case analysis to produce research agenda and recommendations.
Decisions taken in earlier stages of development shape which systems are ultimately released, representing potential points for intervention to influence AI deployment outcomes.
Conceptual argument supported by the paper's taxonomy and case analyses showing pre-deployment factors that lead to abandonment.
Academic responsible AI communities often emphasize ethical risks as reasons not to develop AI.
Observation from the scoping review and literature synthesis comparing academic emphases with other sources.
The authors collected data on real-world cases of AI system abandonment via an AI incident database and a practitioner survey to evidence and compare factors that drive abandonment both prior to and following system deployment.
Empirical data collection described in the paper: use of an AI incident database and a practitioner survey; summary does not report sample sizes or survey response counts.
Through thematic analysis of reviewed sources, the paper develops a taxonomy of six categories of factors contributing to AI abandonment: ethical concerns, stakeholder feedback, development lifecycle challenges, organizational dynamics, resource constraints, and legal/regulatory concerns.
Qualitative thematic analysis of the scoping review materials, resulting taxonomy enumerated in the paper; number of documents/sources not stated in the summary provided.
The authors performed a scoping review of academic literature, civil society resources, and grey literature (including journalism and industry reports) to identify factors influencing AI abandonment.
Methods statement in the paper describing a systematic scoping review of multiple source types; no numeric sample size reported in the summary.
The paper reframes AI safety as layered control, authorization, and externally reviewable limits, linking alignment, security engineering, organizational economics, and institutional design.
Synthesis and prescriptive claim based on the paper's theoretical analysis and proposed framework; supported by conceptual integration rather than empirical testing.
The main result is a boundary stabilization theorem showing that safety need not require proving that advanced systems are always correct; instead it requires institutional and technical designs that prevent irreversible power from being released by a single high-efficiency node.
Formal/theoretical claim presented as the paper's primary theorem (a 'boundary stabilization theorem') demonstrated within the paper's formal model.
The index diverges sharply from existing AI exposure measures for specific occupation groups: power plant operators, railroad conductors, and aircraft cargo handling supervisors score high on RL feasibility but low on general AI exposure.
Empirical comparison between the RL Feasibility Index and existing AI-exposure measures, with named occupation groups showing opposite rankings.
Using LLM annotators guided by a rubric developed with RL experts and validated against confirmed deployment cases, we score all 17,951 O*NET tasks for training feasibility and aggregate to the occupation level, producing an RL Feasibility Index.
Empirical method described in paper: LLM-based annotation process guided by expert-developed rubric; validation against confirmed deployment cases; explicit enumeration of 17,951 O*NET tasks scored and aggregated into an index.
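The aggregation step from task scores to an occupation-level index can be sketched as follows. The summary does not say how tasks are weighted, so the unweighted mean used here is an assumption, and the occupation names, task labels, and scores are hypothetical placeholders rather than values from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (occupation, task, feasibility score in [0, 1]) records.
# The real index covers 17,951 O*NET tasks; none are reproduced here.
task_scores = [
    ("Occupation A", "task 1", 0.9),
    ("Occupation A", "task 2", 0.7),
    ("Occupation B", "task 1", 0.2),
    ("Occupation B", "task 2", 0.4),
    ("Occupation B", "task 3", 0.3),
]


def occupation_index(records):
    """Group task scores by occupation and average them (assumed weighting)."""
    scores_by_occupation = defaultdict(list)
    for occupation, _task, score in records:
        scores_by_occupation[occupation].append(score)
    return {occ: mean(scores) for occ, scores in scores_by_occupation.items()}


index = occupation_index(task_scores)
for occupation, value in sorted(index.items()):
    print(f"{occupation}: {value:.2f}")
```

An importance-weighted mean would be a natural alternative if task weights were available; the unweighted version is kept only because it needs no extra inputs.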
We examine this for every occupation in the US economy.
Statement of study scope in the paper (methodological claim about coverage).
The no-talk baseline establishes that communication is necessary.
Experimental no-talk baseline showing worse coordination without communication between agents.
These results highlight dynamic grounding as a critical and understudied axis of multi-agent coordination.
Synthesis/interpretation of the experimental findings reported in the paper.
We introduce an iterated, multi-turn negotiation game in which two agents allocate shared resources toward private projects with verifiable jointly optimal outcomes.
Methodological contribution described in the paper (design of a new multi-turn negotiation game).
Grounding is the collaborative process of establishing mutual belief sufficient for the current communicative purpose.
Conceptual/definitional statement presented by the authors (no empirical data reported).
The frontier for AI-augmented science is not acceleration; it is the redesign of the certifying infrastructure around these new scarcities.
Prescriptive conclusion in the paper arguing priority of institutional redesign over mere speed gains; presented without empirical testing in the excerpt.
Competent-looking judgment, including selecting, ranking, attributing, and certifying, is now produced at scale at marginal cost approaching zero, inverting the dominant economics-of-AI reading that treats judgment as the scarce complement to cheap prediction.
Argumentative/theoretical claim in the paper; no empirical sample, experiment, or quantitative data reported in the excerpt (implicit basis: observation of scalable AI outputs).
Policy recommendations: invest in digital infrastructure, human capital development, and inclusive technology diffusion strategies to ensure more equitable distribution of AI-driven economic value.
Policy implications drawn from study findings (heterogeneous effects and mediation by structural conditions).
The magnitude of AI's growth effects varies across economic contexts: developed economies experience substantially stronger growth impacts (approximately 0.33) than emerging economies (approximately 0.15).
Heterogeneity analysis / subgroup comparisons (developed vs emerging economies) using the panel data regressions and/or quantile regressions on the 2015–2024 dataset; exact sample sizes per subgroup not reported.
AI adoption has a comparatively weaker direct effect on economic growth (direct effect β = 0.09).
Mediation/structural decomposition from the paper showing direct (non-mediated) coefficient from AI adoption to growth.
Agentic AI influences economic growth primarily through a productivity channel (mediated effect β = 0.35, p < 0.01).
Mediation analysis (panel data) estimating indirect effect of AI adoption on GDP growth via measured productivity channel; data sources: World Bank and OECD indicators, 2015–2024.
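The direct and mediated coefficients reported above can be combined into a total effect and a proportion mediated. This is a back-of-the-envelope sketch that assumes the two coefficients are on the same scale and that the standard linear decomposition (total = direct + indirect) applies; the summary does not confirm either point.

```python
# Coefficients as reported in the claims above.
direct_effect = 0.09    # AI adoption -> growth, not via productivity
indirect_effect = 0.35  # AI adoption -> productivity -> growth

# Assumed linear mediation decomposition: total = direct + indirect.
total_effect = direct_effect + indirect_effect
proportion_mediated = indirect_effect / total_effect

print(f"total effect: {total_effect:.2f}")
print(f"proportion mediated: {proportion_mediated:.1%}")
```

Under these assumptions the productivity channel accounts for roughly 80% of the total effect, which is consistent with the claim that growth effects run primarily through productivity.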
AI adoption significantly improves firm-level productivity (β = 0.18, p < 0.01).
Fixed-effects panel regression using an AI Adoption Index as predictor on firm-level productivity; data drawn from World Bank (World Development Indicators and Enterprise Surveys) and OECD AI indicators for 2015–2024 (sample size not reported in text).
Agentic AI has strong potential to boost productivity and growth.
Statement in paper motivated by literature review and the study's empirical results linking AI adoption to productivity and growth.
HAAS can serve as a pre-deployment workbench for comparing and inspecting human–AI allocation policies before organisational commitment.
Claim about intended use and demonstration of HAAS as an implemented tool; based on the framework implementation and benchmark experiments reported. No deployment-scale evaluation or sample sizes provided in the excerpt.
In manufacturing, stronger governance can improve operational performance and reduce fatigue simultaneously — a workload-buffering effect.
Domain-specific empirical result reported for the manufacturing benchmark in the paper, comparing operational performance and fatigue under different governance strengths. No numeric sample size or effect sizes provided in the excerpt.
Task–agent fit is represented through five auditable cognitive dimensions and a five-mode autonomy spectrum (from human-only to fully autonomous) embedded in a reproducible benchmark spanning software engineering and manufacturing.
Design and benchmark description within the paper; specification of five cognitive dimensions and a five-mode autonomy spectrum and a reproducible benchmark across two domains. No numeric sample size provided.
HAAS combines a rule-based expert system that enforces governance constraints before any learning occurs, and a contextual-bandit learner that selects among feasible collaboration modes from outcome feedback.
Descriptive claim about the implemented HAAS framework as presented in the paper; method description of system architecture (rule-based expert system + contextual-bandit learner). No sample size reported.
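The two-stage architecture described above (hard governance rules first, learning second) can be illustrated with a minimal sketch. Everything concrete here is an assumption: the three middle mode names, the single governance rule, the task fields, and the tabular epsilon-greedy learner, which stands in for whatever contextual-bandit algorithm the paper actually uses.

```python
import random
from collections import defaultdict

# Only the two endpoints are named in the source; the middle three are placeholders.
MODES = ["human_only", "mode_2", "mode_3", "mode_4", "fully_autonomous"]


def feasible_modes(task):
    """Apply hypothetical governance rules before any learning happens."""
    allowed = list(MODES)
    if task["safety_critical"]:
        # Illustrative rule: safety-critical tasks always keep a human involved.
        allowed.remove("fully_autonomous")
    return allowed


class EpsilonGreedyBandit:
    """Tabular epsilon-greedy learner keyed on (context, mode)."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def select(self, context, allowed):
        # Explore with probability epsilon, otherwise pick the best-known mode.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(allowed)
        return max(allowed, key=lambda mode: self.values[(context, mode)])

    def update(self, context, mode, reward):
        # Incremental running mean of observed rewards.
        key = (context, mode)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


bandit = EpsilonGreedyBandit(epsilon=0.1, seed=0)
task = {"domain": "manufacturing", "safety_critical": True}

allowed = feasible_modes(task)                   # rules run first
mode = bandit.select(task["domain"], allowed)    # learner chooses among the rest
bandit.update(task["domain"], mode, reward=1.0)  # outcome feedback
print(mode, allowed)
```

The point of the ordering is that the learner never sees a mode the rules have excluded, so exploration cannot violate a governance constraint.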
The field's near-term research agenda should explicitly include collecting and using triadic data.
Normative recommendation in the paper; presented as the authors' advised research priority rather than empirically justified within the excerpt.
This data is the empirical key to four open questions in agent training.
Argumentative claim in the paper asserting that triadic data is central to four open research questions, which the excerpt does not specify; no empirical demonstration is included in the excerpt.
This triadic data is capturable in 12-18 months with methods already mature in adjacent fields.
Claim in the paper based on authors' assessment of methodological maturity in adjacent fields; no empirical project timeline or pilot data is provided in the excerpt.
Any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher through a four-tier evidence framework: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation.
Methodological proposal in the paper outlining a four-tier evidence framework; presented as normative guidance rather than validated by application to a corpus in the excerpt.
The canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure.
Prescriptive specification in the paper proposing two concrete dataset types as canonical instantiations; presented as design/recommendation rather than empirically tested.
The substrate for the next generation of software-engineering (SWE) agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs; it is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both.
Argument and conceptual proposal in the paper; no empirical validation or comparative experiments are provided in the excerpt.
SCDPs are a useful framework for policy simulation for the digital economy, mechanism design for information systems, and digital twin modeling of cyberinfrastructure.
The paper posits these applications as prospective uses of the framework (argumentative/speculative; no empirical evaluation is reported in the abstract).
SCDPs are capable of modeling variable discounting, a tool used widely in social scientific modeling.
Paper states the capability as part of SCDP definition and examples (theoretical claim).
An SCDP can endogenously model the memory-formation process and is thus useful for modeling resource-rational agents in dynamic settings.
Paper asserts SCDP can represent memory-formation endogenously and discusses application to resource-rational agents (theoretical modeling capability).
SCDPs are strictly more expressive than POMDPs because they do not assume rational belief formation.
Comparative expressiveness claim stated in the paper; supported by theoretical argument or formal separation result (paper text states the claim explicitly).
SCDPs inherit the composition properties of SCDMs (i.e., SCDPs benefit from SCDM composability).
Logical consequence argued in the paper from SCDP being constructed from SCDMs; likely supported by formal argumentation in the text.
A Structural Causal Decision Process (SCDP) is defined as a recurring SCDM with a discount variable.
Formal definition introduced in the paper (theoretical definition).
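The role of a discount variable in a recurring decision model can be illustrated with a short sketch. The paper's formal definition is not reproduced in this summary, so the code follows one common reading of variable discounting, in which the reward at step t is weighted by the product of all discount values realized before t. The function name and the example numbers are hypothetical.

```python
def discounted_return(rewards, discounts):
    """Sum rewards, weighting step t by the product of discounts before step t.

    `discounts[t]` is the discount value realized at step t; it scales every
    later reward. A constant list recovers ordinary geometric discounting.
    """
    if len(rewards) != len(discounts):
        raise ValueError("rewards and discounts must have the same length")
    total, weight = 0.0, 1.0
    for reward, discount in zip(rewards, discounts):
        total += weight * reward
        weight *= discount
    return total


# Constant discounting: 1 + 0.5 + 0.25
constant = discounted_return([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])

# Variable discounting: 1 + 0.9 + 0.9 * 0.5
variable = discounted_return([1.0, 1.0, 1.0], [0.9, 0.5, 0.0])

print(constant, variable)
```

Letting the discount change from step to step is what allows the same machinery to express, for example, a process that may terminate (a discount of zero) or an agent whose patience depends on the state.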
SCDMs have a well-defined and computationally useful property of composability.
The paper states and demonstrates ("We show") the composability property, presumably via formal proofs or constructive arguments in the text (theoretical exposition).
SCDMs can have open root variables for which no probability distribution or structural equation is given.
Model definitions in the paper explicitly allow open root variables (theoretical description).
In SCDMs, agent decisions can be constrained by their causal antecedents (i.e., decisions can be constrained by their causal parents).
Model specification and definitions in the paper describing constraints on decisions as part of SCDM structure (theoretical construction).
Structural Causal Decision Models (SCDMs) expand on Structural Causal Influence Models by explicitly representing the causal relationships between model variables and the payoffs of agent decisions.
Formal model development and comparison to existing SCIMs provided in the paper (theoretical definitions and arguments).
We present two new classes of causal models of decision-making agents: Structural Causal Decision Models (SCDMs) and Structural Causal Decision Processes (SCDPs).
Paper introduces formal definitions for two model classes and describes their properties in the text (theoretical exposition).
We propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs.
Author contribution: design and open-source implementation of PAEF described in the paper.
The taxonomy and its failure modes are grounded in observations from systems operating at billion-event scale.
Author statement that observations underlying the taxonomy come from systems operating at billion-event scale.