Evidence (8807 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other instead, use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1

Active filter: Productivity

Failures in engineering reasoning by AI systems may produce physically invalid yet superficially plausible solutions, posing risks for engineering education, scientific assistance, and technical decision-making.

Argumentative claim in the paper highlighting potential risks of reasoning failures in high-stakes engineering contexts (motivational/background statement).

high negative Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise ... risk of producing physically invalid but plausible solutions

Datasets are rarely standardized or shared.

Review synthesis and commentary across included studies and supplementary documents indicating limited data standardization and sharing.

high negative Artificial Intelligence-Driven Optimization in Pharmacy Inve... dataset standardization and data-sharing practices

Agents performed more weakly on a task requiring novel bioinformatics reasoning.

Reported ABC-Bench results indicating relatively lower agent scores on the task characterized by novel bioinformatics reasoning (authors' summary in the abstract).

high negative ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecu... performance on novel bioinformatics reasoning task

This regulatory pressure creates a direct conflict between multi-stakeholder transparency and corporate data privacy.

Paper's conceptual argument describing a tension between transparency requirements and proprietary data protection; no empirical study provided.

high negative Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... conflict between stakeholder transparency and corporate data privacy

Regulatory compliance demands have surpassed the capacity of manual corporate reporting.

Assertion in paper (conceptual observation about reporting capacity); no empirical measurement or sample size reported.

high negative Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... capacity of manual corporate reporting to meet regulatory demands

The convergence of the 2026 European Union Safe and Sustainable by Design (SSbD) framework, Corporate Sustainability Due Diligence Directive (CSDDD), and Carbon Border Adjustment Mechanism (CBAM) introduces a severe governance bottleneck for advanced semiconductor manufacturing facilities ("Smart Fabs").

Declarative claim in paper based on policy convergence analysis; no empirical dataset or sample size reported (conceptual/analytical argument).

high negative Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... governance bottleneck for Smart Fabs

Learning specialized simulator input languages can cost domain scientists hours to days.

Stated motivating claim in the paper (no experimental sample size or formal measurement reported in abstract).

high negative SIGA: Self-Evolving Coding-Agent Adapters for Scientific Sim... time required to learn simulator input languages

In hyperscale cloud network infrastructure, traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures.

Stated as background/motivation in the paper; no quantitative data, sample size, or empirical comparison provided in the abstract.

high negative Autonomous Incident Resolution at Hyperscale: An Agentic AI ... ability of human-driven incident response to keep pace with incident volume, vel...

Agents frequently overlook subtle yet critical details that are obvious to real human researchers.

Reported as a qualitative result/observation from the authors' experiments on AARRI-Bench; no numeric frequency or sample size provided in the excerpt.

high negative Act As a Real Researcher: A Suite of Benchmarks Evaluating F... frequency of overlooking subtle, critical research details

Extensive experiments across frontier models and agentic systems reveal that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only a 68.3% success rate on AARRI-Bench.

Empirical evaluation reported in the paper: experiments across multiple models/agentic systems; the excerpt reports the top configuration and its success rate. The excerpt does not state the number of tasks or sample size.

high negative Act As a Real Researcher: A Suite of Benchmarks Evaluating F... success rate on AARRI-Bench tasks

Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment, and consequently remain unable to fully replace human researchers.

Asserted in the paper as a high-level observation and motivation; the excerpt does not provide quantified evidence or sample sizes for these limitations.

high negative Act As a Real Researcher: A Suite of Benchmarks Evaluating F... ability to match human researchers on field sensitivity, research ethics, and nu...

Current research on AI-supported conflict techniques has focused predominantly on Devil's Advocate (DA) and has neglected Dialectical Inquiry (DI).

Literature review / gap statement in the paper pointing to relative emphasis on DA in prior research and lack of work on DI.

high negative Shaping The Tool Or Shaping The Mind: An Investigation Of Du... research attention on DA vs DI

Other methods, such as variants of prediction-powered inference, do not have the 'do no harm' guarantee.

Comparative methodological claim in the paper (abstract), likely supported by theoretical discussion and comparisons in the main text.

high negative AI-Assisted Variance Reduction in Randomized Experiments presence or absence of guarantee that adjustment does not worsen estimator when ...

Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage (i.e., B imposes an upper bound on non-proprietary informed decision-quality).

Analytic upper-bound calculation based on B's measured coverage on the curated gold record (exact derivation not provided in abstract).

high negative AI Scientists Are Only as Good as Their Evidence: A Stratifi... maximum achievable informed decision-quality for non-proprietary-data reports un...

GenAI usage significantly decreased creativity-relevant skills.

Experiment with 82 participants reported in the paper; authors report a statistically significant decrease in measures of creativity-relevant skills for participants using GenAI.

high negative When Ai Sparks Less: Generative Ai And The Decline Of Self-P... creativity-relevant skills

GenAI usage significantly decreased domain-relevant skills.

Experiment with 82 participants reported in the paper; authors report a statistically significant reduction in measures of domain-relevant skills for the GenAI condition.

high negative When Ai Sparks Less: Generative Ai And The Decline Of Self-P... domain-relevant skills

GenAI usage significantly decreased intrinsic task motivation.

Randomized experiment reported in the paper with 82 participants; authors report a statistically significant decrease in intrinsic task motivation for participants using GenAI.

high negative When Ai Sparks Less: Generative Ai And The Decline Of Self-P... intrinsic task motivation

AI cannot yet refute economic theory on its own.

Main conclusion: based on the experiments (models failed to autonomously find true errors) and caveats about data contamination, the author concludes models are not yet capable of independently refuting economic-theory papers.

high negative Can AI Refute Economic Theory? Evidence from Beyond the Know... autonomous_theory_refutation_capability

No model located a true error without substantial human guidance.

Author reports that in the experiments none of the models identified a real error autonomously; successful identifications required substantial human guidance.

high negative Can AI Refute Economic Theory? Evidence from Beyond the Know... error_detection_without_human_guidance

Other models (Gemini, Refine, Claude) fared worse than ChatGPT Pro at these tasks.

Reported qualitative performance differences across the four models on the 4 papers; other models did not match ChatGPT Pro's performance.

high negative Can AI Refute Economic Theory? Evidence from Beyond the Know... output_quality (relative performance across models)

Algeria lags behind peer countries on key indicators of digital infrastructure, human capital, and institutional frameworks as evidenced by World Bank (2022) and Oxford Insights indices.

Specific comparative claim based on the paper's use of World Bank (2022) indicators and Oxford Insights Government AI Readiness Index scores; the summary does not report numeric index values or sample sizes.

high negative Artificial Intelligence and Economic Productivity: A Compara... index scores for digital infrastructure, human capital, institutional readiness

Findings reveal that Algeria exhibits significant lag in digital infrastructure, human capital, and institutional frameworks compared to peers (Morocco, Egypt, Turkey).

Result reported from the paper's comparative analysis using World Bank indicators, the Oxford Insights Government AI Readiness Index, and sector-specific studies comparing Algeria to Morocco, Egypt, and Turkey; specific quantitative comparisons not provided in the summary.

high negative Artificial Intelligence and Economic Productivity: A Compara... digital infrastructure, human capital, institutional readiness for AI

Existing research has significant shortcomings in terms of local empirical evidence, micro task mechanisms, and the impact of cutting-edge AI.

Critical appraisal in the paper's discussion of gaps identified through the systematic literature review; no single-study sample size.

high negative Influence of Artificial Intelligence in the Labor Market completeness/coverage of empirical research

Skill mismatch constitutes the core contradiction of labor force transformation.

Interpretive conclusion from the literature review asserting that mismatches between worker skills and job/task requirements are central to the labor-market effects of AI.

high negative Influence of Artificial Intelligence in the Labor Market skill mismatch / skill obsolescence

Despite the growing prevalence of human-AI decision making, the human-AI team’s decision performance often remains suboptimal, partially due to insufficient examination of humans’ own reasoning.

Motivating claim stated in the paper's introduction/abstract (appears to be based on broader literature and motivation rather than a new empirical test in this paper).

high negative Understanding the Effects of AI-Assisted Critical Thinking o... human-AI team decision performance

AACT also triggers higher cognitive load.

Reported measurement of cognitive load in the same house price prediction case study comparing AACT to traditional AI support (details and sample size not provided in abstract).

high negative Understanding the Effects of AI-Assisted Critical Thinking o... cognitive load

Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%.

Empirical evaluation results reported by the authors summarizing ALE benchmark performance across mainstream harness and backbone configurations (no further detail on exact configurations or task/sample counts in excerpt).

high negative Agents' Last Exam average full pass rate (task success rate) on the hardest tier

The gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows.

Author argument presented in the paper; motivated by benchmarking limitations rather than an empirical test in the excerpt.

high negative Agents' Last Exam coverage and sustained measurement of benchmarks on real workflows

These gains have not translated into economically meaningful deployment across many professional domains.

Assertion in paper arguing a deployment gap between benchmark performance and real-world economic adoption; no quantitative deployment data provided in the excerpt.

high negative Agents' Last Exam translation of benchmark gains into economic deployment

Manual processing of these documents is time-consuming, inconsistent across reviewers, and unscalable.

Author claim / background motivation; no quantitative time or consistency metrics reported in the statement.

high negative Leveraging LLMs for Unstructured Claims Data Analysis effort, consistency, and scalability of manual document processing

Actuaries rely primarily on structured numerical data for reserving and ratemaking, while valuable predictive information in unstructured text including medical records, adjuster notes, and call transcripts remains largely unused.

Author statement/observation in paper introduction; no empirical data or sample size provided to support prevalence claim.

high negative Leveraging LLMs for Unstructured Claims Data Analysis use of unstructured text in actuarial processes

The determining barrier to adoption observed in the two studied public-service units was not technological but training-related.

Qualitative analysis and intervention observations across two auditable case studies (SES/CONT in 2024 and UCI/SEDET in 2025); author-developed intervention and outcome changes used to support inference.

high negative The Main Barrier to AI Adoption in the Public Sector is Lack... primary barrier to adoption (training vs. technology)

The adoption of generative artificial intelligence in the public sector has been treated predominantly as a technological problem, with the expectation that productivity gains would follow from more capable models.

Author statement / literature-positioning in paper (assertion about prevailing treatment); no quantitative data provided in text to support prevalence.

high negative The Main Barrier to AI Adoption in the Public Sector is Lack... framing of AI adoption (technological vs. training-related)

Government subsidies exert a negative moderating influence on the relationship between fintech development and corporate total factor productivity.

Moderation analysis reported in the paper on Chinese A-share listed manufacturing firms (2015–2023); paper states government subsidies weaken the positive fintech–TFP relationship (no numeric interaction estimates provided in the excerpt).

high negative Research on the Impact of Financial Technology on the Total ... corporate total factor productivity (moderated by government subsidies)

Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys.

Author's argument / positioning (identifying a gap between existing published twins and practical marketing use cases).

high negative Synthetic Personalities: How Well Can LLMs Mimic Individual ... applicability of existing twin construction approaches to pre-existing heterogen...

In binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses (including standard Bregman and many finite Bernoulli f-divergence losses); an analogous obstruction holds for multiclass aggregation under cross-entropy.

Impossibility results proved in the paper for binary classification under endpoint-monotone losses and for multiclass cross-entropy (formal mathematical proofs; no empirical sample).

high negative Tree-Based Formalization of Multi-Agent Complementarity in H... complementarity in classification aggregation

Selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality.

Formal impossibility theorem proved within the paper's tree-based HAI formalism (mathematical proof; no empirical sample).

high negative Tree-Based Formalization of Multi-Agent Complementarity in H... complementarity (HAI performance relative to best member)

Reliable deployment faces three obstacles: (1) no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; (2) no agent architecture adapted to the legal vertical, only general-purpose harnesses; and (3) no mechanism for systems to learn from their own outcomes in a changing setting.

Authors' diagnosis / framing of gaps in the literature and practice motivating the study and system design (stated in the paper's introduction/abstract).

high negative Parthenon Law: A Self-Evolving Legal-Agent Framework availability of prior large-scale evidence, existence of legal-specific agent ar...

Strict matter completion stalls (does not improve) despite stronger models.

Harvey LAB empirical results (12,510 agent trajectories) report that while per-criterion accuracy increases, strict matter completion does not show corresponding improvement.

high negative Parthenon Law: A Self-Evolving Legal-Agent Framework strict matter completion rate

Even frontier agents remain far from completing matters in a single pass.

Results reported from the Harvey LAB empirical study (12,510 agent trajectories) comparing end-to-end matter completion across agent runs.

high negative Parthenon Law: A Self-Evolving Legal-Agent Framework matter completion in a single pass (strict end-to-end completion)

GPU utilization surged from 57% to 94% following the mining software's public release, displacing legitimate research workloads.

Measurement of GPU utilization levels before (57%) and after (94%) the public release of mining software; authors attribute displacement of research workloads to the utilization surge.

high negative The Usefulness Gap in Proof-of-Useful-Work: An Empirical Stu... GPU utilization (and displacement of research workloads)

Budget GPU rental prices rose 38% following the mining software's public release.

Market measurements of budget GPU rental prices before and after the public release of the mining software, reporting a 38% increase.

high negative The Usefulness Gap in Proof-of-Useful-Work: An Empirical Stu... budget GPU rental price change

The mining computation is commodity integer arithmetic portable to any hardware platform, offering no vendor lock-in.

Analysis of the computation showing it relies on basic integer arithmetic operations and is implementable across diverse hardware architectures.

high negative The Usefulness Gap in Proof-of-Useful-Work: An Empirical Stu... hardware specificity / vendor lock-in of mining computation

Mining is unprofitable at current PRL prices ($0.21) across all GPU tiers (-54% to -72% ROI).

Profitability analysis/calculation across GPU tiers using current token price of $0.21; reported ROI range of -54% to -72%.

high negative The Usefulness Gap in Proof-of-Useful-Work: An Empirical Stu... economic profitability (ROI) of mining across GPU tiers

Statistical distribution checks are trivially defeated by adversarial Gaussian sampling.

Demonstration that adversarial Gaussian-sampled outputs pass the system's statistical distribution checks; experimental or analytic demonstration reported.

high negative The Usefulness Gap in Proof-of-Useful-Work: An Empirical Stu... robustness of statistical checks to adversarial sampling

The verification protocol accepts random matrices by design, confirmed by 44 pool-accepted shares from our open-source miner across NVIDIA, AMD, CPU, and Apple Silicon hardware.

Protocol analysis showing acceptance criteria; empirical confirmation via 44 pool-accepted shares generated by an open-source miner run on multiple hardware architectures.

high negative The Usefulness Gap in Proof-of-Useful-Work: An Empirical Stu... ability of verification protocol to accept non-useful/random computation

The dominant mining software contains no inference code.

Static/dynamic analysis of the dominant mining software deployed on the network showing absence of AI inference routines.

high negative The Usefulness Gap in Proof-of-Useful-Work: An Empirical Stu... presence/absence of inference code in mining software

Pearl's 24 EH/s network, representing approximately 320,000 GPU-equivalents consuming an estimated 112 MW, produces zero useful AI computation.

Empirical measurement of Pearl network hashrate (24 EH/s) and mapping to GPU-equivalents and power consumption; analysis of miner code and verification showing no useful AI inference performed.

high negative The Usefulness Gap in Proof-of-Useful-Work: An Empirical Stu... usefulness of AI computation performed by the network (zero useful AI computatio...

Across most risks, experts identified information, finance, and national security as the most vulnerable sectors.

Sector vulnerability ratings from the Delphi study (n=272); paper reports that information, finance, and national security sectors were most frequently judged vulnerable across risks.

high negative Prioritization of Risks from Artificial Intelligence: A Delp... sector vulnerability across listed risks

AI users and the general public were judged the most vulnerable to these risks.

Delphi panel rated actor vulnerability; results reported in paper indicate AI users and general public received highest vulnerability ratings (n=272).

high negative Prioritization of Risks from Artificial Intelligence: A Delp... actor vulnerability ratings
