Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
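
The matrix can be read as direction shares per outcome. The sketch below is illustrative only: it copies four rows from the table and computes each direction's share. In several rows the Total column exceeds the sum of the four direction columns (for example 1520 versus 1544 for Governance & Regulation). The page does not say why, so treating the remainder as claims without a listed direction is an assumption, as are the names `ROWS` and `direction_shares`.

```python
# Rows copied from the matrix above: (outcome, positive, negative, mixed, null, total).
ROWS = [
    ("Governance & Regulation", 817, 394, 188, 121, 1544),
    ("AI Safety & Ethics", 214, 276, 65, 33, 593),
    ("Job Displacement", 12, 80, 20, 1, 113),
    ("Developer Productivity", 94, 17, 14, 6, 132),
]


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the claims that have a listed direction."""
    counted = positive + negative + mixed + null
    return {
        "positive": positive / counted,
        "negative": negative / counted,
        "mixed": mixed / counted,
        "null": null / counted,
    }


for outcome, pos, neg, mix, nul, total in ROWS:
    shares = direction_shares(pos, neg, mix, nul)
    # Assumption: the gap between Total and the four columns is claims
    # with no listed direction; the page does not explain it.
    unlabelled = total - (pos + neg + mix + nul)
    print(
        f"{outcome}: {shares['positive']:.0%} positive, "
        f"{shares['negative']:.0%} negative, "
        f"{unlabelled} without a listed direction"
    )
```

Shares are computed over the four direction columns rather than over Total, so they sum to one even where the two disagree.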

In the user study, AI-expanded 5W3H prompts increase user satisfaction from 3.16 to 4.04.

Satisfaction scores reported for the N=50 user study, comparing the baseline (or pre-expansion) condition with AI-expanded prompts: 3.16 versus 4.04.

high positive Structured Intent as a Protocol-Like Communication Layer: Cr... user satisfaction (rating scale)

In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent.

Reported comparison in the N=50 user study between baseline interaction rounds and rounds after AI-assisted 5W3H expansion; percentage reduction reported as 60%.

high positive Structured Intent as a Protocol-Like Communication Layer: Cr... interaction rounds (number of back-and-forth interactions to reach goal)

A weak-model compensation pattern was observed: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217).

Model-level comparison of D-A gain (difference between structured and unstructured conditions) across three models (Claude, GPT-4o, Gemini) on the evaluated outputs; reported gains for Gemini and Claude.

high positive Structured Intent as a Protocol-Like Communication Layer: Cr... D-A gain (improvement in goal-alignment score from structured prompting)

The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020.

Reported numeric comparison of cross-language sigma (the standard deviation of scores across languages) between the unstructured baseline and the strongest structured prompting conditions across evaluated outputs.

high positive Structured Intent as a Protocol-Like Communication Layer: Cr... cross-language sigma (standard deviation of scores across languages)
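
A minimal sketch of how a cross-language sigma of this kind could be computed, assuming it is the population standard deviation of per-language mean scores. The excerpt does not say whether the paper uses the population or sample form. The per-language scores below are hypothetical placeholders, not values from the paper; only 0.470 and 0.020 come from the claim above.

```python
from statistics import pstdev


def cross_language_sigma(scores_by_language):
    """Population standard deviation of per-language mean scores (assumed definition)."""
    return pstdev(list(scores_by_language.values()))


# Hypothetical per-language mean scores, NOT taken from the paper.
unstructured = {"zh": 4.1, "en": 3.4, "ja": 3.0}
structured = {"zh": 4.52, "en": 4.50, "ja": 4.48}

print(round(cross_language_sigma(unstructured), 3))
print(round(cross_language_sigma(structured), 3))

# Relative reduction implied by the reported figures (0.470 -> about 0.020).
relative_reduction = (0.470 - 0.020) / 0.470
print(f"{relative_reduction:.1%}")  # 95.7%
```

On the reported figures, the strongest structured conditions remove roughly 96 percent of the cross-language spread.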

Structured prompting substantially reduces cross-language score variance relative to unstructured baselines.

Empirical comparison across 3,240 outputs evaluated by DeepSeek-V3, comparing structured vs. unstructured prompting across three languages.

high positive Structured Intent as a Protocol-Like Communication Layer: Cr... cross-language score variance (sigma)

Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese.

Statement referring to prior work (not new experiments in this paper); no sample size or methods provided in this text excerpt.

high positive Structured Intent as a Protocol-Like Communication Layer: Cr... goal alignment (language generalization)

The case for mutually beneficial industrial policy is stronger for product innovation than for process innovation, because product innovation directly affects demand and triggers stronger network effects while process innovation operates indirectly through supply.

Model variants distinguishing product vs. process R&D within the two-country framework; comparative analysis showing larger demand-driven network effects for product innovation (theoretical model results; no empirical sample).

high positive Industrial Policy with Network Externalities: Race to the Bo... magnitude/likelihood of welfare gains from industrial policy (product vs process...

Under sufficiently strong network externalities and weak substitutability (or weak complementarity) of the goods, industrial policy competition can make both countries simultaneously better off compared to the laissez-faire outcome because of a mutual business-enhancement effect.

Theoretical demonstration within the two-country model: parameter regions (strength of externality, degree of product differentiation) where simultaneous welfare improvements occur relative to laissez-faire (analytical/model results; no empirical sample).

high positive Industrial Policy with Network Externalities: Race to the Bo... aggregate welfare (comparison to laissez-faire)

Social security solutions must be adapted to evolving human-technology interactions to secure social justice and cohesion.

Normative conclusion/recommendation from the paper's discussion; advanced as a necessary policy direction without reported empirical validation in the provided text.

high positive IoT, artificial intelligence, cloud computing and robotics a... social justice and social cohesion via adapted social security solutions

Establishing contributory frameworks based on technology-generated income will ensure the sustainability of social protection in the era of labor displacement.

Presented as a novel policy proposal in the paper; stated as a solution with the asserted effect of ensuring sustainability rather than demonstrated via empirical testing or simulation within the text provided.

high positive IoT, artificial intelligence, cloud computing and robotics a... sustainability of social protection/social security financing

The Internet of Things (IoT) represents a transformative force, integrating digital intelligence with the physical world and catalyzing new relationships across economic sectors.

Stated as a conceptual assertion in the paper's introduction/overview; presented as a high-level literature-informed claim (no empirical sample or quantitative analysis reported).

high positive IoT, artificial intelligence, cloud computing and robotics a... integration of digital intelligence with the physical world and cross-sectoral e...

Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.

Aggregate of reported experimental results across architecture design, pretraining data curation, reinforcement learning algorithm design, and preliminary transfer experiments.

high positive ASI-Evolve: AI Accelerates AI feasibility and promise of closed-loop AI-driven research (ASI-Evolve) to accele...

In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +5.04 points on OlympiadBench.

Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on OlympiadBench.

high positive ASI-Evolve: AI Accelerates AI performance difference vs GRPO on OlympiadBench (points)

In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +11.67 points on AIME24.

Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on AIME24.

high positive ASI-Evolve: AI Accelerates AI performance difference vs GRPO on AIME24 (points)

In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32.

Reinforcement learning algorithm design experiments reported in the paper comparing discovered algorithms to GRPO on AMC32.

high positive ASI-Evolve: AI Accelerates AI performance difference vs GRPO on AMC32 (points)

In pretraining data curation, gains exceed 18 points on MMLU.

Reported experimental result on MMLU benchmark within pretraining data curation experiments.

high positive ASI-Evolve: AI Accelerates AI MMLU benchmark performance (points)

In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points.

Pretraining data curation experiments reported in the paper showing an average benchmark performance improvement of +3.96 points.

high positive ASI-Evolve: AI Accelerates AI average benchmark performance (points)

The best discovered model surpasses DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements.

Reported performance comparison between the best discovered model and DeltaNet in neural architecture experiments; statement comparing relative gain to recent human-designed improvements.

high positive ASI-Evolve: AI Accelerates AI performance difference vs DeltaNet (points)

In neural architecture design, ASI-Evolve discovered 105 SOTA linear attention architectures.

Neural architecture design experiments reported in the paper, with 105 discovered architectures labeled as SOTA.

high positive ASI-Evolve: AI Accelerates AI count of discovered state-of-the-art (SOTA) linear attention architectures

ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations.

Method description of ASI-Evolve's architecture/components in the paper (cognition base and analyzer added to evolutionary agents).

high positive ASI-Evolve: AI Accelerates AI design and inclusion of cognition base and dedicated analyzer components in the ...

We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle.

Methodological contribution described in the paper: presentation and implementation of the ASI-Evolve framework and its learn-design-experiment-analyze loop.

high positive ASI-Evolve: AI Accelerates AI existence and operation of a learn-design-experiment-analyze closed-loop framewo...

Large language model (LLM) use can improve observable output and short-term task performance.

Paper synthesizes empirical findings from human–AI interaction studies, learning-research experiments, and model-evaluation work indicating improved produced outputs and short-term task performance when humans use LLMs; no single pooled sample size or unified effect estimate is reported in the paper.

high positive Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupli... observable output quality and short-term task performance

Frontier models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve statistically indistinguishable semantic closeness scores above 4.6 out of 5.0.

Reported semantic closeness scores from the LLM-as-Judge evaluation on the 15-proposal dataset; the paper states frontier models scored above 4.6/5.0 and were statistically indistinguishable from each other.

high positive Can Commercial LLMs Be Parliamentary Political Companions? C... semantic closeness score (LLM-as-Judge)

These empirical insights provide actionable guidelines advocating dynamically routed architectures that adapt their collaborative structures to real-time task complexity.

Authors' recommendation derived from reported empirical findings comparing architectures under varying time budgets and task complexities (prescriptive claim based on study results).

high positive An Empirical Study of Multi-Agent Collaboration for Automate... effectiveness of dynamically routed architectures in matching collaborative stru...

Given extended compute budgets, the agent team topology achieves the deep theoretical alignment necessary for complex architectural refactoring.

Empirical benchmarks run with longer/extended computational budgets showing agent teams perform better on complex architectural refactoring tasks (qualitative claim; no numeric effect sizes or sample counts provided in the abstract).

high positive An Empirical Study of Multi-Agent Collaboration for Automate... ability to perform complex architectural refactoring / depth of theoretical alig...

The subagent mode functions as a highly resilient, high-throughput search engine optimal for broad, shallow optimizations under strict time constraints.

Benchmark comparisons in the execution-based testbed under strictly fixed computational time budgets showing subagent architecture excels in throughput/resilience for broad, shallow optimization tasks (qualitative claim in paper; no numeric effect sizes provided).

high positive An Empirical Study of Multi-Agent Collaboration for Automate... search throughput/resilience and effectiveness on broad, shallow optimization ta...

Proposition 2: An increase in the pace of technology creation (m(b) rising from m to m') generates only a transitory increase in the skill premium, even if the increase in the pace is itself permanent, because new technologies eventually age.

Analytical result (proposition) proved in the paper's model appendix; intuition and a special case (γ=σ) are illustrated in the text.

high positive THE SKILL PREMIUM IN TIMES OF RAPID TECHNOLOGICAL CHANGE transitional behavior of skill premium following a change in m(b)

The college premium rose first among young workers and later among older workers; a model extension that assumes younger workers have a comparative advantage in new technologies generates age-specific increases that account for half of the observed age gaps.

Extension of the model with worker demographics; calibration using CPS data on computer use by worker age (showing young workers used computers more intensively initially) and simulation comparing model to observed age-specific wage premium changes.

high positive THE SKILL PREMIUM IN TIMES OF RAPID TECHNOLOGICAL CHANGE college premium by worker age (timing and magnitude of increase)

Slow diffusion, combined with the rapid pace of technology creation, accounts for 6.2 of the 8.7 log-point differential increase in the skill premium between high- and low-density regions over 1980–2005.

Model calibrated with estimated diffusion rates across regions from the text-based dataset; quantitative decomposition attributing portions of the regional differential to the mechanism.

high positive THE SKILL PREMIUM IN TIMES OF RAPID TECHNOLOGICAL CHANGE regional differential increase in skill premium (log points) over 1980–2005

The mechanism explains why the college premium is higher in dense cities and why its increase was mainly urban.

Model extension incorporating regional diffusion of technologies combined with estimates of diffusion rates across locations (using the Kalyani et al. dataset); comparison of model predictions to documented urban–rural wage premium patterns.

high positive THE SKILL PREMIUM IN TIMES OF RAPID TECHNOLOGICAL CHANGE college premium by city density

Total demand for college-educated workers increased by 100 log points since 1980; changes in the pace of technology creation account for one-third of that increase, with the remainder attributed to residual structural changes in production.

Model-based decomposition calibrated to data (demand and supply of college-educated workers since 1980); quantitative accounting exercise reported in the paper.

high positive THE SKILL PREMIUM IN TIMES OF RAPID TECHNOLOGICAL CHANGE demand for college-educated workers (log points since 1980)

When calibrated to the observed pace of technology creation, the model generates a 28 log-point (32 percent) increase in the college premium between 1980 and 2010, which then flattens and begins to revert.

Quantitative calibration of the model to novel text-based technology data (arrival and diffusion) and wage series (CPS); simulation results.

high positive THE SKILL PREMIUM IN TIMES OF RAPID TECHNOLOGICAL CHANGE college premium over 1980–2010
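
Several claims on this page report changes in log points. As a worked check, under the standard convention that a change of x log points is 100 times the difference in natural logs, the 28 log-point figure above corresponds to about 32 percent. The helper names are illustrative, not from the paper.

```python
import math


def log_points_to_percent(log_points):
    """Convert a change in log points (100 x log difference) to a percent change."""
    return (math.exp(log_points / 100.0) - 1.0) * 100.0


def percent_to_log_points(percent):
    """Inverse conversion: percent change to log points."""
    return math.log(1.0 + percent / 100.0) * 100.0


print(round(log_points_to_percent(28)))  # 32
```

For small changes the two units nearly coincide, but they diverge quickly: a 100 log-point increase is roughly a 172 percent increase, not a doubling.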

The data show a temporary increase in the pace of new technology creation beginning in the 1970s, accelerating in the 1980s, and tapering off in the 2000s.

Time series of identified new technologies from text-based measures (patent text/job posting linkage) covering 1976–2007 (as in Kalyani et al., 2025) used to measure arrival rates by cohort.

high positive THE SKILL PREMIUM IN TIMES OF RAPID TECHNOLOGICAL CHANGE rate of arrival of new technologies (pace of technology creation)

The pace of technology creation is a key driver of the skill premium: a rapid pace of technology creation leads to a sustained increase in the skill premium (because skilled workers learn to use new technologies faster).

Theoretical model developed in the paper in which new technologies arrive exogenously and skilled workers have a comparative advantage in learning new technologies; supported by calibration using novel text-based data (patent text and job postings) and CPS wage data.

high positive THE SKILL PREMIUM IN TIMES OF RAPID TECHNOLOGICAL CHANGE skill premium (college wage premium)

Autor et al. (2024) show that the majority of current employment is in job specialties that did not exist in 1940, with new task creation driven by augmentation-type innovations.

Citation reported in the paper summarizing Autor et al. (2024); no sample size provided in excerpt.

high positive NBER WORKING PAPER SERIES share of employment in new job specialties (post-1940) and driver of new task cr...

Firms may not sufficiently account for non-monetary aspects of technological progress (well-being, safety, quality of work); a planner would include such considerations in steering technological progress.

Normative conclusion based on theoretical analysis comparing firm objective functions (profits) vs social planner objectives (including non-monetary utility).

high positive NBER WORKING PAPER SERIES attention to non-monetary aspects / inclusion in technological steering

The planner can raise social welfare by focusing technological progress on making goods cheaper that are disproportionately consumed by relatively poorer agents, thereby raising their real income.

Extension of the baseline model to multiple goods showing distributional gains via composition of price changes (real income channel).

high positive NBER WORKING PAPER SERIES real income of poorer agents / social welfare

When capital and labor are gross complements, a planner concerned with workers' welfare would favor capital-augmenting innovations to raise wages.

Analytical result from the model analyzing factor-augmenting technological progress and complementarity between capital and labor.

high positive NBER WORKING PAPER SERIES wages

A planner with sufficient welfare weight on workers will impose positive robot taxes, with the tax rate increasing in the planner's concern for workers' welfare.

Application of the baseline model to robot taxation; analytical derivation of optimal robot tax under planner preferences.

high positive NBER WORKING PAPER SERIES optimal robot tax rate

As labor's economic value diminishes, steering progress focuses increasingly on enhancing human well-being (non-monetary aspects) rather than labor productivity.

Theoretical discussion and model results in the paper showing planner's shifting objective when labor is devalued.

high positive NBER WORKING PAPER SERIES focus of technological steering (monetary productivity vs non-monetary well-bein...

The welfare benefits of steering technology are greater the less efficient social safety nets are.

Analytical result from the paper's theoretical model comparing a planner who can/cannot perform transfers and evaluating steering as second-best when redistribution is costly.

high positive NBER WORKING PAPER SERIES welfare benefits of steering technological progress

These household-level non-market productivity gains (ChatGPT making productive online tasks more efficient and freeing time for leisure) are economically large and likely constitute a substantial share of the overall economic impact of generative AI.

Combination of empirical IV estimates showing that leisure time increases while productive time is unchanged, plus model-implied efficiency gains; authors' interpretation and welfare discussion in paper.

high positive https://arxiv.org/pdf/2603.03144 household non-market productivity and welfare (implied aggregate economic impact...

Mapping the empirical time-reallocation into a quantitative household time-allocation model implies generative AI approximately doubles the efficiency of productive online tasks for adopters; preferred calibration implies efficiency gains of 76%–176%.

Quantitative time-allocation model adapted from Aguiar et al. (2021); model uses empirical IV estimates for time reallocation and Engel curve elasticities estimated via IV (local precipitation shocks). Authors report implied efficiency gains of 76%–176% and state 'approximately doubles' efficiency.

high positive https://arxiv.org/pdf/2603.03144 efficiency (productivity) of productive digital tasks

Households predominantly utilize ChatGPT in the context of productive online activities (education, job search, informational research) rather than during leisure browsing, as inferred from the browsing context around ChatGPT use.

High-frequency analysis comparing 30-minute browsing intervals around ChatGPT visits to intervals of demographically similar non-users; LLM-based inference of website purpose; observed co-occurrence with productive-site categories.

high positive https://arxiv.org/pdf/2603.03144 context/purpose of ChatGPT use (productive vs leisure)

ChatGPT adoption increases the leisure share of browsing duration by about 30 percentage points.

IV long-difference estimates from Comscore browsing data with LLM-based site classification; authors report a ~30 percentage point increase in leisure share after adoption.

high positive https://arxiv.org/pdf/2603.03144 leisure share of total browsing duration

In long-difference IV estimates, ChatGPT adoption raises total leisure browsing time by roughly 150 log points.

IV long-difference estimates using pre-ChatGPT exposure as instrument; reported effect described as 'roughly 150 log points' increase in total leisure browsing time.

high positive https://arxiv.org/pdf/2603.03144 total leisure browsing time (log change)

A household's pre-ChatGPT ex-ante exposure (based on 2021 browsing composition) strongly predicts subsequent ChatGPT adoption: a 1 SD higher exposure predicts a 2.5 percentage point higher rate of having used ChatGPT by December 2024.

Constructed 'exposure' measure by aggregating site-level overlap with chatbot capabilities over household 2021 browsing; predictive regression (household-level) linking 1 SD change in exposure to 2.5pp higher adoption by Dec 2024 (statistic reported in paper).

high positive https://arxiv.org/pdf/2603.03144 probability / rate of ChatGPT adoption by Dec 2024

ChatGPT adoption among private households has been rapid following release, but adoption is far from uniform.

Descriptive adoption patterns measured from Comscore browsing data over time (pre- and post-Nov 30, 2022) on the household panel (2021–2024); time-series of observed ChatGPT site visits and adoption rates.

high positive https://arxiv.org/pdf/2603.03144 ChatGPT adoption rate over time

Despite the diminishing returns that scaling laws predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token.

Observed industry/empirical trend cited in the paper (example: falling cost per token). No numerical samples or sample size given in the excerpt.

high positive The Unreasonable Effectiveness of Scaling Laws in AI cost per token and continued progress (performance improvements over time)

Scaling laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes.

Paper asserts repeated empirical appearance across model families and training-adjacent regimes; claim is descriptive/observational without sample size in the excerpt.

high positive The Unreasonable Effectiveness of Scaling Laws in AI generalizability (occurrence) of scaling-law patterns across model families and ...
