Evidence (4793 claims)

Claim counts by category:
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Productivity
The widening effect of AI adoption on the corporate electricity output growth gap is more pronounced for firms in economically advanced regions.
Heterogeneity analysis by regional economic development level using the firm-level electricity consumption dataset; stratified or interaction regressions showing larger estimated effects in more advanced regions. Exact subgroup sizes not provided in the summary.
The main result (initial widening of electricity growth gap) is robust to alternative variable definitions, exclusion of firms relying on outsourced AI services or non-AI adoption samples, and controls for endogeneity.
Robustness checks reported in the paper: alternative variable definitions, sample restrictions (excluding outsourced-AI-reliant firms and non-AI samples), and application of endogeneity control methods (e.g., instrumental variables or panel fixed effects). Exact methods and sample sizes not specified in the summary.
AI adoption initially widens the corporate electricity output growth gap at the firm level in China.
Empirical analysis using unique firm-level data on corporate electricity consumption in China; econometric estimation comparing electricity output growth between AI-adopting firms and non-adopting peers (panel/firm-level analysis). Sample size not stated in the summary.
To optimize agentic AI integration and ensure responsible innovation across financial services, interdisciplinary, longitudinal research and robust governance frameworks are needed.
Authors' conclusions and recommendations based on the identified findings and gaps in the reviewed literature.
Diverse architectural models such as multi-agent systems and cloud-based frameworks enable scalable, adaptive agentic AI deployments in financial services.
Synthesis of architecture-focused studies and framework descriptions within the reviewed literature (architectural benchmarking across papers).
Findings reveal substantial productivity gains and operational efficiencies predominantly in banking and investment.
Systematic review synthesizing multidisciplinary qualitative, quantitative, and bibliometric studies of agentic AI applications in financial services published up to mid-2024 (review-level synthesis).
The Manager-Worker two-agent pipeline (an expensive, text-only manager plus a cheaper worker with repository access) can substitute for expensive execution by concentrating expensive reasoning in the manager and delegating execution to the cheaper worker.
System design description plus empirical results on 200 SWE-bench Lite instances showing parity in success rates between a strong-manager/weak-worker pipeline and a strong single agent while using fewer strong-model tokens.
A minimal review-only manager loop adds only 2 percentage points over the baseline, whereas structured exploration and planning by the manager add 11 percentage points, demonstrating that active direction (not mere reviewing) produces most of the benefit.
Ablation-style comparison of pipeline variants on the 200-instance SWE-bench Lite evaluation: review-only manager loop versus manager with structured exploration and planning; reported improvements in percentage points.
A strong manager directing a weak worker achieves a 62% success rate on software-engineering tasks, matching a strong single agent (60%), while using a fraction of the strong-model token usage.
Empirical evaluation on 200 instances from SWE-bench Lite across five pipeline configurations and model pairings; measured task success rates and token usage for manager-worker pipelines versus single-agent baselines.
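The division of labor described in the claims above can be sketched as a minimal loop (a toy illustration; the function names and the plan/execute split are assumptions, not the paper's implementation): the manager stands in for an expensive, text-only reasoning model that produces a plan, and the worker stands in for a cheaper model with repository access that executes each step.

```python
def manager_plan(task: str) -> list[str]:
    """Stand-in for an expensive, text-only reasoning model:
    decomposes the task into ordered steps (no repo access)."""
    return [f"locate code relevant to: {task}",
            f"draft a patch for: {task}",
            f"run tests for: {task}"]

def worker_execute(step: str) -> str:
    """Stand-in for a cheaper model with repository access:
    carries out one concrete step and returns a report."""
    return f"done: {step}"

def manager_worker_pipeline(task: str) -> list[str]:
    """Expensive reasoning in the manager, cheap execution in the
    worker; only short plan/report text touches the strong model."""
    reports = []
    for step in manager_plan(task):
        reports.append(worker_execute(step))
    return reports

reports = manager_worker_pipeline("fix off-by-one in pagination")
for r in reports:
    print(r)
```

The ablation result above (review-only loop adds little; structured exploration and planning add most of the gain) corresponds, in this sketch, to making `manager_plan` do active decomposition rather than merely inspecting worker output.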
Under economy-wide deployment, the share of computer-vision-exposed labor compensation that is cost-effectively automatable rises sharply (relative to the firm-level 11% estimate).
Model counterfactuals or calibration scenarios comparing firm-level deployment vs economy-wide deployment; qualitative statement that share increases substantially.
At the firm level, cost-effective automation captures approximately 11% of computer-vision-exposed labor compensation.
Calibration and implementation in computer vision; reported firm-level estimate from the framework.
Scale of deployment is a key determinant: AI-as-a-Service and AI agents spread fixed costs across users, sharply expanding economically viable tasks.
Modeling and calibration arguments showing fixed-cost spreading effects increase set of tasks for which automation is cost-effective; qualitative and quantitative comparisons in implementation.
Because higher accuracy is disproportionately costly (convex cost), full automation is often not cost-minimizing; partial automation, where firms retain human workers for residual tasks, frequently emerges as the equilibrium.
Theoretical model combined with calibration (scaling laws + task mappings); equilibrium outcomes reported from the framework implementation.
We model automation intensity as a continuous choice in which firms minimize costs by selecting an AI accuracy level, from no automation through partial human-AI collaboration to full automation.
The paper develops a theoretical framework / model that treats automation intensity as a continuous decision variable; described as the central modeling approach.
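As a toy numerical illustration of this continuous-intensity logic (the functional forms and parameter values are assumptions for exposition, not the paper's calibration): let the AI accuracy cost be convex in the automation share a, with humans paid a wage on the residual 1 - a share of tasks. The convexity yields an interior optimum, i.e. partial automation.

```python
def total_cost(a, wage=1.0, c0=0.15, k=3.0):
    """Cost at automation intensity a in [0, 1): a convex AI accuracy
    cost c0 / (1 - a)**k (higher accuracy is disproportionately
    costly) plus human wages on the residual 1 - a share of tasks."""
    return c0 / (1.0 - a) ** k + wage * (1.0 - a)

# Grid search for the cost-minimizing automation intensity.
grid = [i / 1000 for i in range(999)]
a_star = min(grid, key=total_cost)
print(f"cost-minimizing automation intensity: {a_star:.3f}")
```

Because the accuracy cost diverges as a approaches 1, full automation is never cost-minimizing here, matching the equilibrium pattern described in the claim above.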
The findings demonstrate that technological innovation strategies, when effectively implemented, provide measurable competitive advantages for banks and offer evidence-based insights for policymakers and practitioners.
Authors' interpretation/conclusion drawing on the reported statistically significant relationships between innovation (product and technological) and competitiveness.
Technological innovation is positively and statistically significantly related to bank competitiveness (simple linear regression result reported).
Simple linear regression reported in the paper testing the hypothesis that technological innovation influences competitiveness; data collected from innovation-focused executives across licensed banks (paper states data from 39 licensed banks).
Product innovation strategy has a positive and statistically significant effect on competitiveness (F(1,134) = 74.983, p < .001).
Bivariate regression analysis reported in the paper with F(1,134)=74.983, p < .001; based on survey data from innovation-focused executives (regression degrees of freedom indicate n≈136 observations).
In the user study, AI-expanded 5W3H prompts increase user satisfaction from 3.16 to 4.04.
Reported pre/post or baseline vs AI-expanded satisfaction scores in the N=50 user study with numeric scores 3.16 and 4.04.
In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent.
Reported comparison in the N=50 user study between baseline interaction rounds and rounds after AI-assisted 5W3H expansion; percentage reduction reported as 60%.
A weak-model compensation pattern was observed: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217).
Model-level comparison of D-A gain (difference between structured and unstructured conditions) across three models (Claude, GPT-4o, Gemini) on the evaluated outputs; reported gains for Gemini and Claude.
The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020.
Reported numeric comparison of sigma (standard deviation) between the unstructured baseline and the strongest structured prompting conditions across evaluated outputs.
Structured prompting substantially reduces cross-language score variance relative to unstructured baselines.
Empirical comparison across 3,240 outputs evaluated by DeepSeek-V3, comparing structured vs. unstructured prompting across three languages.
Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese.
Statement referring to prior work (not new experiments in this paper); no sample size or methods provided in this text excerpt.
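For context on the 5W3H structure referenced above, a minimal template sketch (the field names follow the generic 5W3H breakdown of who/what/when/where/why plus how/how much/how many; they are an assumption here, not necessarily the paper's PPS schema):

```python
# Toy 5W3H prompt template (illustrative; not the paper's PPS spec).
FIELDS = ["who", "what", "when", "where", "why",
          "how", "how_much", "how_many"]

def expand_5w3h(intent: dict) -> str:
    """Render a structured intent dict into a prompt, keeping
    unfilled fields explicit so the model can ask about them."""
    lines = []
    for f in FIELDS:
        lines.append(f"{f}: {intent.get(f, '(unspecified)')}")
    return "\n".join(lines)

prompt = expand_5w3h({"who": "data team",
                      "what": "weekly sales report",
                      "how": "as a dashboard"})
print(prompt)
```

The reported gains (satisfaction 3.16 to 4.04, 60% fewer interaction rounds) plausibly come from surfacing the unspecified fields up front instead of discovering them over many conversational turns.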
Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.
Aggregate of reported experimental results across architecture design, pretraining data curation, reinforcement learning algorithm design, and preliminary transfer experiments.
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +5.04 points on OlympiadBench.
Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on OlympiadBench.
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +11.67 points on AIME24.
Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on AIME24.
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32.
Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on AMC32.
In pretraining data curation, gains exceed 18 points on MMLU.
Reported experimental result on MMLU benchmark within pretraining data curation experiments.
In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points.
Pretraining data curation experiments reported in the paper showing an average benchmark performance improvement of +3.96 points.
The best discovered model surpasses DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements.
Reported performance comparison between the best discovered model and DeltaNet in neural architecture experiments; statement comparing relative gain to recent human-designed improvements.
In neural architecture design, ASI-Evolve discovered 105 state-of-the-art (SOTA) linear attention architectures.
Neural architecture design experiments reported in the paper, with 105 discovered architectures labeled as SOTA.
ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations.
Method description of ASI-Evolve's architecture/components in the paper (cognition base and analyzer added to evolutionary agents).
We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle.
Methodological contribution described in the paper: presentation and implementation of the ASI-Evolve framework and its learn-design-experiment-analyze loop.
Large language model (LLM) use can improve observable output and short-term task performance.
Paper synthesizes empirical findings from human–AI interaction studies, learning-research experiments, and model-evaluation work indicating improved produced outputs and short-term task performance when humans use LLMs; no single pooled sample size or unified effect estimate is reported in the paper.
These empirical insights provide actionable guidelines advocating dynamically routed architectures that adapt their collaborative structures to real-time task complexity.
Authors' recommendation derived from reported empirical findings comparing architectures under varying time budgets and task complexities (prescriptive claim based on study results).
Given extended compute budgets, the agent team topology achieves the deep theoretical alignment necessary for complex architectural refactoring.
Empirical benchmarks run with longer/extended computational budgets showing agent teams perform better on complex architectural refactoring tasks (qualitative claim; no numeric effect sizes or sample counts provided in the abstract).
The subagent mode functions as a highly resilient, high-throughput search engine optimal for broad, shallow optimizations under strict time constraints.
Benchmark comparisons in the execution-based testbed under strictly fixed computational time budgets showing subagent architecture excels in throughput/resilience for broad, shallow optimization tasks (qualitative claim in paper; no numeric effect sizes provided).
Autor et al. (2024) show that the majority of current employment is in job specialties that did not exist in 1940, with new task creation driven by augmentation-type innovations.
Citation reported in the paper summarizing Autor et al. (2024); no sample size provided in excerpt.
Firms may not sufficiently account for non-monetary aspects of technological progress (well-being, safety, quality of work); a planner would include such considerations in steering technological progress.
Normative conclusion based on theoretical analysis comparing firm objective functions (profits) vs social planner objectives (including non-monetary utility).
The planner can raise social welfare by focusing technological progress on making goods cheaper that are disproportionately consumed by relatively poorer agents, thereby raising their real income.
Extension of the baseline model to multiple goods showing distributional gains via composition of price changes (real income channel).
When capital and labor are gross complements, a planner concerned with workers' welfare would favor capital-augmenting innovations to raise wages.
Analytical result from the model analyzing factor-augmenting technological progress and complementarity between capital and labor.
A planner with sufficient welfare weight on workers will impose positive robot taxes, with the tax rate increasing in the planner's concern for workers' welfare.
Application of the baseline model to robot taxation; analytical derivation of optimal robot tax under planner preferences.
As labor's economic value diminishes, steering progress focuses increasingly on enhancing human well-being (non-monetary aspects) rather than labor productivity.
Theoretical discussion and model results in the paper showing planner's shifting objective when labor is devalued.
The welfare benefits of steering technology are greater the less efficient social safety nets are.
Analytical result from the paper's theoretical model comparing a planner who can/cannot perform transfers and evaluating steering as second-best when redistribution is costly.
These household-level non-market productivity gains (ChatGPT making productive online tasks more efficient and freeing time for leisure) are economically large and likely constitute a substantial share of the overall economic impact of generative AI.
Combination of empirical IV estimates showing leisure increases and productivity-unchanged productive time, plus model-implied efficiency gains; authors' interpretation and welfare discussion in paper.
Mapping the empirical time-reallocation into a quantitative household time-allocation model implies generative AI approximately doubles the efficiency of productive online tasks for adopters; preferred calibration implies efficiency gains of 76%–176%.
Quantitative time-allocation model adapted from Aguiar et al. (2021); model uses empirical IV estimates for time reallocation and Engel curve elasticities estimated via IV (local precipitation shocks). Authors report implied efficiency gains of 76%–176% and state 'approximately doubles' efficiency.
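To see why a 76%-176% range is summarized as "approximately doubles" (my arithmetic, not the paper's): an efficiency gain g means the same productive online output takes 1/(1+g) of the former time, and g = 100% (an exact doubling) sits near the middle of that range.

```python
# Time needed per unit of productive online output after an
# efficiency gain g (illustrative arithmetic, not from the paper).
for g in (0.76, 1.00, 1.76):
    time_share = 1 / (1 + g)
    print(f"gain {g:.0%}: task time falls to {time_share:.0%} of baseline")
```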
Households predominantly utilize ChatGPT in the context of productive online activities (education, job search, informational research) rather than during leisure browsing, as inferred from the browsing context around ChatGPT use.
High-frequency analysis comparing 30-minute browsing intervals around ChatGPT visits to intervals of demographically similar non-users; LLM-based inference of website purpose; observed co-occurrence with productive-site categories.
ChatGPT adoption increases the leisure share of browsing duration by about 30 percentage points.
IV long-difference estimates from Comscore browsing data with LLM-based site classification; authors report a ~30 percentage point increase in leisure share after adoption.
In long-difference IV estimates, ChatGPT adoption raises total leisure browsing time by roughly 150 log points.
IV long-difference estimates using pre-ChatGPT exposure as instrument; reported effect described as 'roughly 150 log points' increase in total leisure browsing time.
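Since log points are easily misread as percentage points, a quick conversion (my arithmetic, not a figure from the paper): a change of x log points multiplies the level by exp(x/100), so 150 log points is roughly a 4.5-fold increase, not a 150% one.

```python
import math

log_points = 150
factor = math.exp(log_points / 100)  # multiplicative change implied
pct = (factor - 1) * 100
print(f"{log_points} log points = {factor:.2f}x, i.e. ~{pct:.0f}% increase")
# → 150 log points = 4.48x, i.e. ~348% increase
```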
A household's pre-ChatGPT ex-ante exposure (based on 2021 browsing composition) strongly predicts subsequent ChatGPT adoption: a 1 SD higher exposure predicts a 2.5 percentage point higher rate of having used ChatGPT by December 2024.
Constructed 'exposure' measure by aggregating site-level overlap with chatbot capabilities over household 2021 browsing; predictive regression (household-level) linking 1 SD change in exposure to 2.5pp higher adoption by Dec 2024 (statistic reported in paper).