Evidence (4560 claims)

Claims by category:
- Adoption: 5267 claims
- Productivity: 4560 claims
- Governance: 4137 claims
- Human-AI Collaboration: 3103 claims
- Labor Markets: 2506 claims
- Innovation: 2354 claims
- Org Design: 2340 claims
- Skills & Training: 1945 claims
- Inequality: 1322 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
Productivity
Structured prompting substantially reduces cross-language score variance relative to unstructured baselines.
Empirical comparison across 3,240 outputs evaluated by DeepSeek-V3, comparing structured vs. unstructured prompting across three languages.
Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese.
Statement referring to prior work (not new experiments in this paper); no sample size or methods provided in this text excerpt.
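For readers unfamiliar with 5W3H-style structuring, here is a minimal sketch of what a structured intent might look like in code. The field names, rendering format, and example content are our assumptions for illustration, not the PPS specification.

```python
# Hypothetical 5W3H-style structured prompt, loosely modeled on the PPS idea
# described above. Field names are assumptions, not the paper's actual spec.
FIELDS_5W3H = ["who", "what", "when", "where", "why", "how", "how_much", "how_many"]

def render_structured_prompt(intent: dict) -> str:
    """Render a structured intent into a deterministic prompt string."""
    missing = [f for f in FIELDS_5W3H if f not in intent]
    if missing:
        raise ValueError(f"missing 5W3H fields: {missing}")
    return "\n".join(f"{field.upper()}: {intent[field]}" for field in FIELDS_5W3H)

example = {
    "who": "a junior data analyst",
    "what": "summarize quarterly sales by region",
    "when": "before Friday's review",
    "where": "in a one-page memo",
    "why": "to inform budget reallocation",
    "how": "bullet points, plain language",
    "how_much": "under 300 words",
    "how_many": "top 3 regions only",
}
prompt = render_structured_prompt(example)
```

Because every field is mandatory and rendered in a fixed order, the resulting prompt is deterministic, which is one plausible mechanism for the reduced cross-language score variance reported above.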
Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.
Aggregate of reported experimental results across architecture design, pretraining data curation, reinforcement learning algorithm design, and preliminary transfer experiments.
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +5.04 points on OlympiadBench.
Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on OlympiadBench.
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +11.67 points on AIME24.
Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on AIME24.
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC23.
Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on AMC23.
In pretraining data curation, gains exceed 18 points on MMLU.
Reported experimental result on MMLU benchmark within pretraining data curation experiments.
In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points.
Pretraining data curation experiments reported in the paper showing an average benchmark performance improvement of +3.96 points.
The best discovered model surpasses DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements.
Reported performance comparison between the best discovered model and DeltaNet in neural architecture experiments; statement comparing relative gain to recent human-designed improvements.
In neural architecture design, ASI-Evolve discovered 105 SOTA linear attention architectures.
Neural architecture design experiments reported in the paper, with 105 discovered architectures labeled as SOTA.
ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations.
Method description of ASI-Evolve's architecture/components in the paper (cognition base and analyzer added to evolutionary agents).
We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle.
Methodological contribution described in the paper: presentation and implementation of the ASI-Evolve framework and its learn-design-experiment-analyze loop.
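The learn-design-experiment-analyze cycle with a cognition base can be sketched as follows. Every interface here is a hypothetical stand-in written to illustrate the loop's shape, not ASI-Evolve's actual implementation, which is not reproduced in this excerpt.

```python
# Minimal sketch of a learn-design-experiment-analyze loop in the spirit of
# ASI-Evolve. All class and function interfaces are hypothetical placeholders.
class CognitionBase:
    """Stores accumulated priors/insights injected into each round."""
    def __init__(self, priors):
        self.entries = list(priors)
    def retrieve(self):
        return list(self.entries)
    def add(self, insight):
        self.entries.append(insight)

def evolve(design_fn, run_experiment, analyze_fn, base, rounds=3):
    history = []
    for _ in range(rounds):
        candidate = design_fn(base.retrieve())   # learn + design from priors
        result = run_experiment(candidate)       # experiment
        insight = analyze_fn(candidate, result)  # analyze -> reusable insight
        base.add(insight)                        # feed insight back into the base
        history.append((candidate, result))
    return history

# Toy instantiation: "designs" are integers, the experiment caps scores at 10,
# and the analyzer simply stores the observed score as the new insight.
base = CognitionBase(priors=[1])
hist = evolve(
    design_fn=lambda priors: max(priors) + 1,
    run_experiment=lambda x: min(x, 10),
    analyze_fn=lambda x, r: r,
    base=base,
    rounds=3,
)
```

The design choice to route every experimental outcome back through an analyzer before it enters the cognition base is what distinguishes this loop from a plain evolutionary search, per the component description above.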
Large language model (LLM) use can improve observable output and short-term task performance.
Paper synthesizes empirical findings from human–AI interaction studies, learning-research experiments, and model-evaluation work indicating improved produced outputs and short-term task performance when humans use LLMs; no single pooled sample size or unified effect estimate is reported in the paper.
These empirical insights provide actionable guidelines advocating dynamically routed architectures that adapt their collaborative structures to real-time task complexity.
Authors' recommendation derived from reported empirical findings comparing architectures under varying time budgets and task complexities (prescriptive claim based on study results).
Given extended compute budgets, the agent team topology achieves the deep theoretical alignment necessary for complex architectural refactoring.
Empirical benchmarks run with longer/extended computational budgets showing agent teams perform better on complex architectural refactoring tasks (qualitative claim; no numeric effect sizes or sample counts provided in the abstract).
The subagent mode functions as a highly resilient, high-throughput search engine optimal for broad, shallow optimizations under strict time constraints.
Benchmark comparisons in the execution-based testbed under strictly fixed computational time budgets showing subagent architecture excels in throughput/resilience for broad, shallow optimization tasks (qualitative claim in paper; no numeric effect sizes provided).
Autor et al. (2024) show that the majority of current employment is in job specialties that did not exist in 1940, with new task creation driven by augmentation-type innovations.
Citation reported in the paper summarizing Autor et al. (2024); no sample size provided in excerpt.
Firms may not sufficiently account for non-monetary aspects of technological progress (well-being, safety, quality of work); a planner would include such considerations in steering technological progress.
Normative conclusion based on theoretical analysis comparing firm objective functions (profits) vs social planner objectives (including non-monetary utility).
The planner can raise social welfare by focusing technological progress on making goods cheaper that are disproportionately consumed by relatively poorer agents, thereby raising their real income.
Extension of the baseline model to multiple goods showing distributional gains via composition of price changes (real income channel).
When capital and labor are gross complements, a planner concerned with workers' welfare would favor capital-augmenting innovations to raise wages.
Analytical result from the model analyzing factor-augmenting technological progress and complementarity between capital and labor.
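The direction of this result can be checked numerically in a textbook CES economy. This is an illustration of the general mechanism under assumed parameter values, not the paper's model.

```python
# Illustrative CES economy: with capital and labor gross complements
# (substitution elasticity < 1, i.e. rho < 0), capital-augmenting progress
# (raising A_K) increases the marginal product of labor, i.e. the wage.
def wage(A_K, K=1.0, L=1.0, alpha=0.4, rho=-1.0):
    # CES output: Y = (alpha*(A_K*K)^rho + (1-alpha)*L^rho)^(1/rho)
    Y = (alpha * (A_K * K) ** rho + (1 - alpha) * L ** rho) ** (1 / rho)
    # wage = dY/dL = (1-alpha) * L^(rho-1) * Y^(1-rho)
    return (1 - alpha) * L ** (rho - 1) * Y ** (1 - rho)

w_before = wage(A_K=1.0)   # 0.6
w_after = wage(A_K=2.0)    # 0.9375, so the wage rises with A_K
```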
A planner with sufficient welfare weight on workers will impose positive robot taxes, with the tax rate increasing in the planner's concern for workers' welfare.
Application of the baseline model to robot taxation; analytical derivation of optimal robot tax under planner preferences.
As labor's economic value diminishes, steering progress focuses increasingly on enhancing human well-being (non-monetary aspects) rather than labor productivity.
Theoretical discussion and model results in the paper showing planner's shifting objective when labor is devalued.
The welfare benefits of steering technology are greater the less efficient social safety nets are.
Analytical result from the paper's theoretical model comparing a planner who can/cannot perform transfers and evaluating steering as second-best when redistribution is costly.
These household-level non-market productivity gains (ChatGPT making productive online tasks more efficient and freeing time for leisure) are economically large and likely constitute a substantial share of the overall economic impact of generative AI.
Combination of empirical IV estimates showing increased leisure time alongside unchanged productive time, plus model-implied efficiency gains; authors' interpretation and welfare discussion in the paper.
Mapping the empirical time-reallocation into a quantitative household time-allocation model implies generative AI approximately doubles the efficiency of productive online tasks for adopters; preferred calibration implies efficiency gains of 76%–176%.
Quantitative time-allocation model adapted from Aguiar et al. (2021); model uses empirical IV estimates for time reallocation and Engel curve elasticities estimated via IV (local precipitation shocks). Authors report implied efficiency gains of 76%–176% and state 'approximately doubles' efficiency.
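As a back-of-the-envelope reading of those efficiency figures (our arithmetic, not the paper's calibration): if an efficiency gain g lets a household produce the same task output in 1/(1+g) of the original time, the implied time saving is g/(1+g).

```python
# Back-of-the-envelope conversion from an efficiency gain g to the share of
# time saved on a fixed task load: time falls to 1/(1+g), so saving is g/(1+g).
def time_saving_share(g: float) -> float:
    return g / (1.0 + g)

low, high = 0.76, 1.76              # the reported 76%-176% efficiency gains
savings = (time_saving_share(low), time_saving_share(high))
# "approximately doubles" efficiency (g = 1.0) corresponds to halving the
# time spent on the same tasks:
double = time_saving_share(1.0)     # 0.5
```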
Households predominantly utilize ChatGPT in the context of productive online activities (education, job search, informational research) rather than during leisure browsing, as inferred from the browsing context around ChatGPT use.
High-frequency analysis comparing 30-minute browsing intervals around ChatGPT visits to intervals of demographically similar non-users; LLM-based inference of website purpose; observed co-occurrence with productive-site categories.
ChatGPT adoption increases the leisure share of browsing duration by about 30 percentage points.
IV long-difference estimates from Comscore browsing data with LLM-based site classification; authors report a ~30 percentage point increase in leisure share after adoption.
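The two-stage least squares (2SLS) logic behind such a long-difference IV design can be illustrated on synthetic data: instrument the endogenous adoption decision with ex-ante exposure, which is assumed unrelated to the confounder. Nothing below reproduces the paper's data or estimates; the "true effect" of 0.30 is a made-up value chosen only to echo the reported magnitude.

```python
import numpy as np

# Synthetic 2SLS illustration: adoption is endogenous (driven partly by an
# unobserved confounder u), but ex-ante exposure serves as an instrument.
rng = np.random.default_rng(0)
n = 5000
exposure = rng.normal(size=n)                  # instrument Z
u = rng.normal(size=n)                         # unobserved confounder
adopt = (0.8 * exposure + u + rng.normal(size=n) > 0).astype(float)
leisure_change = 0.30 * adopt - 0.5 * u + rng.normal(size=n)  # true effect 0.30

def two_sls(y, x, z):
    Z = np.column_stack([np.ones_like(z), z])
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # first stage: fit x on z
    X = np.column_stack([np.ones_like(x_hat), x_hat])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]     # second-stage slope

beta_iv = two_sls(leisure_change, adopt, exposure)
# Naive OLS is biased here because adopt correlates with the confounder u:
X_ols = np.column_stack([np.ones(n), adopt])
beta_ols = np.linalg.lstsq(X_ols, leisure_change, rcond=None)[0][1]
# beta_iv should be close to the true 0.30 in large samples; beta_ols is not.
```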
In long-difference IV estimates, ChatGPT adoption raises total leisure browsing time by roughly 150 log points.
IV long-difference estimates using pre-ChatGPT exposure as instrument; reported effect described as 'roughly 150 log points' increase in total leisure browsing time.
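Since "log points" are often misread as percent, a quick conversion (standard arithmetic, not a figure from the paper): a change of L log points multiplies the level by exp(L/100), so 150 log points is roughly a 3.5-fold increase, far larger than 150%.

```python
import math

# Convert "log points" to a percent change: an increase of L log points in
# log(level) corresponds to a multiplicative factor exp(L/100).
def log_points_to_pct(log_points: float) -> float:
    return (math.exp(log_points / 100.0) - 1.0) * 100.0

pct = log_points_to_pct(150)   # ~348% increase
small = log_points_to_pct(5)   # ~5.1%: for small changes, log points ~ percent
```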
A household's pre-ChatGPT ex-ante exposure (based on 2021 browsing composition) strongly predicts subsequent ChatGPT adoption: a 1 SD higher exposure predicts a 2.5 percentage point higher rate of having used ChatGPT by December 2024.
Constructed 'exposure' measure by aggregating site-level overlap with chatbot capabilities over household 2021 browsing; predictive regression (household-level) linking 1 SD change in exposure to 2.5pp higher adoption by Dec 2024 (statistic reported in paper).
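The "percentage points per standard deviation" reading corresponds to a linear probability model on a z-scored regressor. A toy version on synthetic data (coefficients chosen to echo the reported 2.5pp figure, not estimated from the paper's data):

```python
import numpy as np

# Toy standardized-predictor regression: a 1 SD higher exposure raising the
# adoption probability by 2.5pp appears as a slope of 0.025 in a linear
# probability model with a z-scored regressor. Data below are synthetic.
rng = np.random.default_rng(1)
n = 20000
exposure = rng.normal(loc=5.0, scale=2.0, size=n)
z = (exposure - exposure.mean()) / exposure.std()   # z-score the predictor
p_adopt = np.clip(0.10 + 0.025 * z, 0, 1)           # baseline 10%, +2.5pp per SD
adopted = rng.binomial(1, p_adopt)

X = np.column_stack([np.ones(n), z])
slope = np.linalg.lstsq(X, adopted.astype(float), rcond=None)[0][1]
# slope should recover roughly 0.025, i.e. 2.5 percentage points per SD
```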
ChatGPT adoption among private households has been rapid following release, but adoption is far from uniform.
Descriptive adoption patterns measured from Comscore browsing data over time (pre- and post-Nov 30, 2022) on the household panel (2021–2024); time-series of observed ChatGPT site visits and adoption rates.
Despite the diminishing returns scaling laws predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token.
Observed industry/empirical trend cited in the paper (example: falling cost per token). No numerical samples or sample size given in the excerpt.
Scaling laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes.
Paper asserts repeated empirical appearance across model families and training-adjacent regimes; claim is descriptive/observational without sample size in the excerpt.
Scaling laws make progress predictable, albeit at a declining rate.
Conceptual claim in the paper based on the power-law form of scaling laws (no numerical quantification or sample size provided in the excerpt).
Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form.
Stated observationally in the paper as established empirical regularity across pre-training runs and prior literature on scaling laws (no sample size or specific experiments reported in the excerpt).
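The power-law form implies progress that is predictable but yields diminishing absolute gains. A small illustration with made-up coefficients (a, b, and the irreducible term are for illustration only, not from any particular model family):

```python
# Illustrative power-law scaling curve: loss(C) = a * C**(-b) + E, where C is
# training compute and E is an irreducible loss floor. Coefficients are made up.
def loss(compute: float, a: float = 10.0, b: float = 0.3,
         e_irreducible: float = 1.0) -> float:
    return a * compute ** (-b) + e_irreducible

# Diminishing returns: each 10x increase in compute shrinks the *reducible*
# loss by the same factor (10**-b), so absolute improvements keep shrinking.
gains = []
for c in (1e3, 1e4, 1e5):
    gains.append(loss(c) - loss(c * 10))
```

The constant ratio between successive gains is what makes progress predictable, while the shrinking absolute size of each gain is the "declining rate" noted above.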
Task-level analyses show that activities expanded in AI-enabled projects—particularly ideation and experimentation—are increasingly compatible with large language model capabilities, suggesting potential for future productivity gains as these technologies mature.
Task-level classification mapping tasks described in proposals to LLM-relevant capabilities using LLM-based classification; finding that tasks expanded in AI-enabled projects cluster on ideation and experimentation, which align with current LLM strengths.
AI-enabled projects undertake a broader set of tasks.
Task-level analysis of proposal descriptions (task inventories) classifying tasks via keyword extraction and LLMs, showing AI-enabled proposals list a wider variety of activities than non-AI proposals.
AI-enabled projects involve larger teams.
Comparison of team structure in proposals (team size) between AI-enabled and non-AI projects using the same comprehensive proposal dataset and LLM-based classification of AI presence.
AI-enabled projects reallocate resources toward human capital (i.e., shift budget allocations toward labor / human capital).
Analysis of detailed budget allocations in the proposal dataset, comparing projects identified as AI-enabled versus non-AI projects using keyword extraction and LLM classification to identify AI presence and role.
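The keyword-screening stage of such a pipeline might look like the sketch below. The keyword list and matching logic are placeholders; the paper's actual keyword set and its follow-on LLM classification stage are not reproduced here.

```python
# Hypothetical first-pass keyword screen for flagging AI-enabled proposals.
# In the pipeline described above, candidates flagged here would then be
# confirmed or rejected by an LLM-based classifier (not implemented in this sketch).
AI_KEYWORDS = {"machine learning", "deep learning", "neural network",
               "artificial intelligence", "large language model"}

def flag_ai_presence(proposal_text: str) -> bool:
    text = proposal_text.lower()
    return any(kw in text for kw in AI_KEYWORDS)

proposals = [
    "We apply deep learning to protein folding.",
    "A field survey of wetland bird migration.",
]
flags = [flag_ai_presence(p) for p in proposals]
```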
In the short run, AI adoption is associated with modest improvements in scientific outcomes concentrated in the upper tail.
Observational analysis linking identified AI presence in a comprehensive dataset of research proposals (funded and unfunded) to subsequent publication outcomes; AI presence identified via keyword extraction combined with large language model (LLM) classification; publication outcomes measured after proposal submission.
Education and workforce development should shift focus from rote knowledge accumulation to cultivating skills in human-AI collaboration, creative problem-solving, and the design of novel economic domains.
Normative policy recommendation derived from the paper's framework and analysis of anticipated labor market changes (no empirical evaluation or trial data reported in the abstract).
Human-AI co-evolution will significantly increase individual productivity and open new frontiers of economic activity.
Projected outcome based on combined analysis of AI capabilities, historical patterns, and platform growth; the abstract does not report empirical measurement or sample sizes for this projection.
AI-driven productivity augmentation dramatically lowers the barriers to creating economic value, enabling the decentralized generation of employment.
Argument supported by paper's analysis of contemporary labor market dynamics and the growth of digital platforms; no quantified empirical estimates or sample sizes provided in the abstract.
The transition to an AI civilization will fundamentally restructure the mechanisms of employment creation from a centralized model (few organizations creating jobs for the many) to a decentralized ecosystem where individuals are empowered to generate their own employment opportunities.
Central thesis of the paper, motivated by theoretical argumentation and synthesis of contemporary data on labor markets and digital platforms (no empirical test or sample sizes specified in the abstract).
Historical precedents from past technological revolutions suggest that innovation tends to expand, rather than shrink, the scope of economic activity and employment in the long run.
Paper draws on analysis of economic history (qualitative historical analysis implied; no specific historical datasets or sample sizes provided in the abstract).
Google has been pioneering machine learning usage across dozens of products.
Contextual statement in the abstract about the organization's activity; asserted without empirical detail in abstract.
The techniques and approaches described can be generalized for other framework migrations and general code transformation tasks.
Authors' stated expectation/generalization claim in the abstract; no empirical evidence or cross-framework experiments reported in the abstract.
The system creates a virtuous circle where effectively AI supports its own development workflow.
Conceptual claim supported by the system's design and reported improvements that enable iterative AI-assisted development; described qualitatively in the paper.
Our approach dramatically reduces the time (6.4x-8x speedup) for deep learning model migrations.
Quantitative speedup figure reported in the paper's abstract (6.4x-8x); likely based on measured migration times on demonstrated cases, though the abstract does not state sample size or exact experimental setup.
The system accelerates code migrations in a large hyperscaler environment on commercial real-world use-cases.
Reported demonstration and evaluation in a hyperscaler (commercial) environment using real-world cases as described in the paper; no detailed sample size given in abstract.
We define quality metrics and AI-based judges that accelerate development when the code to evaluate has no tests and has to adhere to strict style and dependency requirements.
Design and implementation of quality metrics and AI-based judges described in the paper; claimed acceleration of development workflow (no numeric quantification in abstract).