Evidence (7953 claims)

Claim counts by topic:

- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Training footprint is the largest cluster in the mapped Green AI literature.
Result from the paper's literature mapping / clustering (statement in abstract; no numeric cluster sizes given).
We map the recent Green AI literature into seven themes: training footprint is the largest cluster, while inference efficiency and system-level optimisation are growing rapidly, alongside measurement protocols, green algorithms, governance, and security and efficiency trade-offs.
Bibliometric / thematic mapping of recent Green AI literature described in the paper (method: literature mapping; exact number of papers or mapping procedure not specified in abstract).
Compared with relationship-based debt, stable equity more strongly promotes high-quality development in the high-end equipment manufacturing and new energy industries.
Comparative subgroup regression analysis on the same dataset (743 listed enterprises, 2014–2023) indicating that the coefficient for stable equity is significantly larger than that for relationship-based debt in the high-end equipment manufacturing and new energy industry subsamples.
The effects of two distinct forms of patient capital—stable equity and relationship-based debt—are more pronounced in promoting high-quality development in the new energy vehicle industry, energy conservation and environmental protection industry, biotechnology industry, new materials industry, and next-generation information technology industry.
Industry heterogeneity / subgroup analyses on the 2014–2023 panel of 743 listed firms showing stronger estimated effects of both stable equity and relationship-based debt on firm high-quality development within these specified industries.
The impact of patient capital on the high-quality development of enterprises exhibits regional heterogeneity: the high-quality development of enterprises in the central region is more sensitive to patient capital.
Subsample/regional heterogeneity analysis on the panel of 743 listed enterprises (2014–2023) comparing region-specific coefficients and finding a larger/stronger effect in the central region.
The application of artificial intelligence enhances the positive impact of patient capital on the high-quality development of enterprises in strategic emerging industries.
Moderation analysis using the same firm panel (743 listed enterprises, 2014–2023) that includes an interaction term between patient capital and measures of AI application, with the interaction reported as positive and statistically significant.
Patient capital promotes the high-quality development of these enterprises by easing financing constraints.
Mediation analysis on panel data of 743 listed firms (2014–2023) reporting that financing-constraint indicators mediate the impact of patient capital on firm high-quality development.
Patient capital promotes the high-quality development of these enterprises by alleviating information asymmetry.
Mediation tests using firm-level panel data (743 listed enterprises, 2014–2023) that include measures of information asymmetry and show a mediating effect in the patient capital → high-quality development pathway.
Patient capital promotes the high-quality development of these enterprises by enhancing the level of synergy in digital and green transformation (digital-green transformation synergy).
Mediation analysis on the same panel (743 listed enterprises, 2014–2023) showing that measures of digital-green transformation synergy mediate the relationship between patient capital and firm high-quality development.
Patient capital plays a significant role in promoting the high-quality development of enterprises in strategic emerging industries.
Empirical analysis using panel data from 743 listed enterprises in China’s strategic emerging industries over 2014–2023; regression analysis reporting a statistically significant positive coefficient for patient capital on a firm-level measure of high-quality development.
Average ratings for same-caste matches were up to 25% higher (on a 10-point scale) than for inter-caste matches.
Quantitative result reported in the analysis comparing average ratings (10-point scale) between same-caste and inter-caste matches; statement specifies magnitude 'up to 25%'.
Our analysis reveals consistent hierarchical patterns across models: same-caste matches are rated most favorably.
Reported results across evaluated LLMs showing consistent patterns where same-caste profile pairings received higher ratings than inter-caste pairings.
We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
Paper states intention and contribution: releasing methodology and lessons to allow replication by other organizations.
We detail data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks to address challenges in constructing reliable evaluation signals from monorepo environments.
Methodological description in paper listing specific practices (LLM-based classification, test relevance validation, multi-run stability checks) aimed at producing reliable evaluation signals in monorepos.
Models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates.
Reported relationship from paper's analysis correlating models' use of verification tools (test execution, static analysis) with higher solve rates across evaluated models.
Systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%.
Empirical evaluation reported in paper: four foundation models were evaluated on the ProdCodeBench benchmark producing reported solve-rate range.
Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages.
Descriptive dataset claim in paper specifying components of each sample and that samples cover seven programming languages.
We present ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant.
Paper describes methodology and introduces ProdCodeBench explicitly as constructed from real production assistant sessions.
Benchmarks that reflect production workloads are better suited to evaluating AI coding agents in industrial settings.
Argument presented in paper motivating creation of production-derived benchmark; no specific empirical comparison to alternative benchmarks reported in the abstract.
Carbon emissions initially increase with the expansion of robotics manufacturing.
Panel regressions across 277 Chinese prefecture-level cities (2008–2019) showing the left-hand (rising) portion of the inverted U-shaped relationship.
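An inverted-U relationship of this kind is typically detected by adding a quadratic term to the regression and checking its sign. A minimal sketch on synthetic data (the variables, parameters, and specification below are illustrative assumptions, not the paper's actual data or model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic city-year observations: robotics-manufacturing intensity x
# and carbon emissions y following an inverted U (illustrative values
# only, not the paper's data).
n = 2000
x = rng.uniform(0, 20, n)
y = 2.0 * x - 0.1 * x**2 + rng.normal(0, 1.0, n)

# Fit y = b0 + b1*x + b2*x^2 by OLS; an inverted U requires b2 < 0.
X = np.column_stack([np.ones(n), x, x**2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Emissions rise until the turning point -b1/(2*b2), then decline;
# observations below the turning point trace the "rising portion".
turning_point = -b1 / (2.0 * b2)
print(f"b2 = {b2:.3f}, turning point = {turning_point:.2f}")
```

A finding that most cities sit to the left of the estimated turning point corresponds to the rising portion the claim describes.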
A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution.
Incident ISS-004 report in the paper giving specific timings for detection latency (10 minutes), user exposure (zero), and resolution (80 minutes).
The multi-agent approach improved reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication.
Incident detection reported in the SF2Bench deployment where audited handoffs prevented publication of a coordinate transformation error that would have affected all 2,452 stations.
The multi-agent approach improved efficiency: the SF2Bench deployment was completed by a single operator in two days with repeated artifact reuse across deployments.
Operational report from the production deployment: single operator completion time of two days and reuse of artifacts across deployments as stated in the paper.
SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow.
Reported dataset composition and use in the paper: SF2Bench with stated counts and temporal span used to validate the multi-agent workflow.
EnviSmart treats reliability as an architectural property through two mechanisms: (1) a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and (2) a role-separated multi-agent design where deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps.
System architecture and design description in the paper; presented as the core reliability mechanisms implemented in EnviSmart.
We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research.
System description and statement of deployment in the paper; presented as a production deployment (no randomized evaluation reported).
Embedding LLM-driven agents into environmental FAIR data management can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions.
Conceptual / argumentative claim made in the paper as a motivation for the system; no quantitative experiment tied to this statement in the excerpt.
For Nigeria to successfully leverage the AI revolution for inclusive economic growth and ensure long-term workforce resilience, it must overcome the structural skill deficit through deliberate investment in tertiary-education reform and strong public-private partnerships for continuous vocational learning.
Study conclusion synthesizing survey results (150 firms) and qualitative policy/workforce analysis to make policy recommendations.
The rate of new job creation hinges critically on the immediate implementation of targeted, scalable reskilling programs.
Paper's projections and analysis drawing on the survey of 150 firms and qualitative interviews; presented as a conditional/projection based on current skills gap and training initiatives.
The agentic-specificity classification helps organizations distinguish challenges that require novel approaches from those that are addressable with established practices.
Authors' proposed classification (agentic-specific vs. carried-over/amplified) intended as a practical decision aid; derived from the coding and comparative analysis.
The taxonomy provides a diagnostic framework for identifying priority barrier dimensions and understanding cross-dimensional amplification mechanisms.
Authors present a taxonomy derived from the review and claim it can be used diagnostically by organizations; supported by the coded barrier classification and STS mapping.
Azar et al. (2023) show that monopsonistic employers have stronger incentives to automate, and US commuting zones with higher labor market concentration experienced more robot adoption.
Citation to Azar et al. (2023) empirical evidence reported in the paper.
Noy and Zhang (2023) and Brynjolfsson et al. (2025) provide emerging empirical evidence that AI can function as a labor-complementary technology when designed to do so.
Cited empirical studies referenced in the paper arguing that certain AI applications complement human labor.
Eloundou et al. (2024) predict that half of US jobs are significantly exposed to recent advances in generative AI.
Citation to Eloundou et al. (2024) empirical study reported in the paper's introduction.
Firms may not sufficiently account for non-monetary aspects (safety, meaning of work) when choosing technologies; a planner would include these non-monetary considerations in steering technological progress.
Theoretical argument and model extension in Section 6 on monetary vs non-monetary aspects of technology choices.
In multi-good economies, a planner can raise poor agents' real incomes not only by affecting factor incomes but also by focusing technological progress on making goods cheaper that are disproportionately consumed by poorer agents.
Extension of the baseline model to multiple goods (Section 5) identifying distributional consumption-channel effects.
When capital and labor are gross complements, a planner concerned with workers' welfare would favor capital-augmenting innovations to raise wages.
Analytical result from a factor-augmenting application of the paper's model examining complementarity conditions between capital and labor.
A welfare-maximizing planner will impose positive robot taxes when robots substitute for human labor, with the optimal tax rate increasing in the planner's concern for workers' welfare.
Model application to robot taxation presented in the paper; comparative statics on planner weights.
When redistribution is costly or incomplete, production efficiency is no longer optimal and a planner will distort technology choice to improve distribution (i.e., engage more in steering).
Theoretical derivation extending Atkinson-Stiglitz framework with endogenous technology and costly redistribution; comparative statics on redistribution cost.
The welfare benefits of steering technological progress are greater the less efficient social safety nets are.
Theoretical result derived in the paper's baseline and extended models analyzing a planner who can shape technology choices and faces costly/incomplete redistribution.
In the short run, with fixed human capital, wages, and job boundaries, AI raises productivity by reducing the time required to perform steps.
Model distinction between short-run (fixed job design and skills) and long-run horizons; short-run optimization shows AI reduces expected execution times for steps, thereby raising productivity.
Aggregating heterogeneous firms that deploy a commonly available AI technology yields an aggregate production function that admits a constant elasticity of substitution (CES) representation with three inputs: aggregate manual labor, aggregate AI-assisted labor, and aggregate capital.
Theoretical aggregation argument drawing on Houthakker (1955) and Levhari (1968), deriving a macro-level CES representation from a microfounded algorithmic cost function defined by firms' joint optimization over AI deployment and job design.
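The aggregate representation described above can be written in schematic CES form; the notation below is illustrative, not necessarily the paper's own:

```latex
Y \;=\; A\left[\alpha L_m^{\rho} + \beta L_a^{\rho} + \gamma K^{\rho}\right]^{1/\rho},
\qquad \sigma = \frac{1}{1-\rho},
```

where $L_m$ is aggregate manual labor, $L_a$ aggregate AI-assisted labor, $K$ aggregate capital, and $\sigma$ the elasticity of substitution across the three inputs.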
Improvements in AI quality generate non-linear effects on labor demand and wages because firms' cost-minimizing AI deployment and job designs change discretely at particular AI quality thresholds (microfoundation for the productivity J-curve).
Theoretical analysis of discrete switches in the cost-minimizing arrangement as AI success probability and execution times change; characterization of threshold effects and discussion linking to the J-curve phenomenon (model results and comparative statics).
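The threshold mechanism can be sketched as a lower envelope over a finite set of configurations; this is a schematic in my own notation, not the paper's model:

```latex
C(q) \;=\; \min_{d \in \mathcal{D}} c_d(q),
\qquad
d^{*}(q) \;=\; \arg\min_{d \in \mathcal{D}} c_d(q),
```

where $q$ is AI quality (e.g. a success probability), $\mathcal{D}$ the finite set of feasible (AI deployment, job design) configurations, and $c_d(q)$ the unit cost under configuration $d$. Because $\mathcal{D}$ is finite, $d^{*}(q)$ jumps at the crossing points of the $c_d$ curves, so labor demand and wages can respond discontinuously to smooth improvements in $q$.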
Adjacency to AI-executed steps increases the likelihood that a given step is executed by AI (local complementarities): a step is more likely to be AI-executed in occupations where its neighboring steps are also AI-executed.
Empirical comparisons of conceptually similar steps across occupations paired with workflow adjacency information and realized AI execution outcomes from Anthropic’s Economic Index; statistical tests reported in the paper.
AI-executed steps co-occur in contiguous chains rather than being randomly scattered across a production workflow.
Empirical analysis linking O*NET tasks to human assessments of AI exposure (Eloundou et al., 2024), realized AI execution outcomes from Anthropic’s Economic Index (Handa et al., 2025), and GPT-generated workflow orderings for occupations; statistical tests comparing observed contiguity to random/scaled baselines reported in the paper.
Instrumenting AI use cases with treatment assignment suggests each additional AI use case prompted by treatment leads to approximately 26% higher revenue.
Instrumental variable analysis using randomized treatment as instrument for number of AI use cases in the 515-firm sample; outcome measured as revenue.
Instrumenting AI use cases with treatment assignment suggests each additional AI use case prompted by treatment leads to 0.85 more completed tasks.
Instrumental variable analysis using randomized treatment as instrument for number of AI use cases in the 515-firm sample; outcome measured as completed tasks.
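The IV logic behind the two estimates above (randomized treatment as an instrument for the number of AI use cases) can be sketched as a hand-rolled two-stage least squares on synthetic data; all variable names and parameter values below are illustrative assumptions, not the study's data or estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic firm sample: randomized treatment z shifts the number of AI
# use cases d, which raises the outcome y; u is an unobserved confounder
# that biases naive OLS (all parameters illustrative).
n = 5000
z = rng.integers(0, 2, n).astype(float)             # treatment assignment
u = rng.normal(0, 1, n)                             # unobserved confounder
d = 1.0 + 1.5 * z + 0.5 * u + rng.normal(0, 1, n)   # AI use cases
y = 2.0 + 0.85 * d + 1.0 * u + rng.normal(0, 1, n)  # outcome (e.g. tasks)

def ols(X, target):
    return np.linalg.lstsq(X, target, rcond=None)[0]

# Stage 1: predict use cases from the instrument (treatment assignment).
Z = np.column_stack([np.ones(n), z])
d_hat = Z @ ols(Z, d)

# Stage 2: regress the outcome on the predicted use cases.
beta_iv = ols(np.column_stack([np.ones(n), d_hat]), y)[1]

# Naive OLS of y on d is biased upward by the confounder u.
beta_ols = ols(np.column_stack([np.ones(n), d]), y)[1]
print(f"IV: {beta_iv:.2f}  OLS: {beta_ols:.2f}")
```

Because z is randomly assigned, it affects y only through d, so the second-stage coefficient recovers the per-use-case effect that naive OLS overstates.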
Revenue and investment gains are largest at the 90th percentile and above, suggesting AI expands the upper range of what firms achieve.
Quantile/upper-tail analysis of revenue and investment outcomes in the randomized sample (515 firms); reported concentration of gains at the 90th percentile+.
Treated firms generate 1.9x the revenue of control firms.
RCT with 515 firms; revenue reported by firms during and after the accelerator; comparison of mean revenues between treated and control groups.
Treated firms are 11 percentage points (18%) more likely to acquire paying customers.
RCT with 515 firms; customer acquisition measured in weekly reports / traction outcomes; treatment vs control comparison.