Evidence (8807 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other instead, use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1
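
As a rough illustration of how the direction counts can be read, the sketch below computes the share of negative claims for a few outcomes. The four rows are copied from the table; treating a dash as 0 and dividing by the Total column are assumptions made for this example, and the names `ROWS` and `negative_share` are hypothetical.

```python
# Rows copied from the table above: (positive, negative, mixed, null, total).
# Assumption: a dash in the table means zero claims in that direction.
ROWS = {
    "Job Displacement": (12, 90, 24, 1, 127),
    "Skill Obsolescence": (6, 56, 9, 1, 72),
    "Inequality Measures": (48, 137, 51, 6, 242),
    "Worker Turnover": (15, 16, 0, 4, 35),
}


def negative_share(outcome: str) -> float:
    """Fraction of an outcome's claims whose direction is negative."""
    _, negative, _, _, total = ROWS[outcome]
    return negative / total


# Print the selected outcomes from most to least negative.
for name in sorted(ROWS, key=negative_share, reverse=True):
    print(f"{name}: {negative_share(name):.1%}")
```

For some larger rows (for example Other and Governance & Regulation) the four direction counts sum to less than the Total, so the choice of denominator matters there.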

Active filter: Productivity

The basin of attraction of the partial adoption trap is enlarged by a threshold coordination failure arising from the non-appropriable nature of systemic benefits.

Model analysis showing how non-appropriable systemic benefits (externalities) change payoff structure and enlarge the basin of attraction for partial adoption. Theoretical derivation; no empirical sample.

high negative The partial adoption trap: Coordination failure, trust, and ... size of basin of attraction for partial adoption (likelihood of landing in parti...

Observed failures in the pilot were localized primarily to external integrations.

Pilot outcome summary in the paper stating that failures were localized mainly to external integrations (no numeric breakdown provided).

high negative GraphFlow: An Architecture for Formally Verifiable Visual Wo... failure source localization (external integrations vs core system)

Agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit.

Author statement characterizing agentic (planning) AI systems and their inference-time sensitivity and auditability challenges.

high negative GraphFlow: An Architecture for Formally Verifiable Visual Wo... auditability / behavior sensitivity to prompts

Existing workflow platforms offer few semantic correctness guarantees.

Author statement contrasting current platforms' observability/durability with lack of semantic correctness guarantees.

high negative GraphFlow: An Architecture for Formally Verifiable Visual Wo... semantic correctness guarantees (presence/absence)

Under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only about 35% of the time.

Analytic, idealized independence model reported in the paper (mathematical calculation: 0.9^10 ≈ 0.3487).

high negative GraphFlow: An Architecture for Formally Verifiable Visual Wo... process completion probability
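
The arithmetic behind the 35% figure can be checked with a short sketch. It assumes exactly what the claim states (independent steps, identical per-step reliability); the function name `completion_probability` is hypothetical and not taken from the paper.

```python
def completion_probability(per_step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds when steps fail independently."""
    return per_step_reliability ** steps


# Ten steps at 90% reliability each, as in the claim above.
p = completion_probability(0.9, 10)
print(f"{p:.4f}")  # 0.3487
```

Under the same assumption, per-step reliability would have to reach roughly 99% before a ten-step process completes about nine times in ten.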

Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8–2.7× more tokens.

Empirical comparisons across the twelve configurations showing distributed deliberation vs. hierarchy-alone across five model families and six models; measured mean returns and token consumption over 3,475 episodes with token-level accounting.

high negative Context, Reasoning, and Hierarchy: A Cost-Performance Study ... mean return (primary) and token usage (secondary)

Our results show that multi-resource stranding materially changes deployable capacity, effective capital expenditure, and delivered performance.

Empirical/modeling results from the paper's framework (simulation results using projection models + Azure operational data); the abstract claims material effects but does not report numeric sample sizes or effect sizes in the excerpt provided.

high negative Designing Datacenter Power Delivery Hierarchies for the AI E... deployable capacity / effective capex / delivered performance (primary: deployab...

Designing an efficient power delivery hierarchy for the long run is difficult because rack placement feasibility, workload impact, and cost depend jointly on electrical topology, deployment granularity, placement policy, power oversubscription, and workload mix.

Analytic/methodological claim enumerating interacting factors; stated as a complexity motivating the modeling framework.

high negative Designing Datacenter Power Delivery Hierarchies for the AI E... difficulty/complexity of designing efficient power delivery hierarchies

Power utilization is particularly important as grid power capacity is a scarce resource in the AI era.

Contextual claim in the paper linking increased AI demand to constrained grid power capacity; supported by the paper's framing rather than reported empirical measurements in the abstract.

high negative Designing Datacenter Power Delivery Hierarchies for the AI E... grid power scarcity/importance of power utilization

As power densities increase, a datacenter designed for a different target density may strand power, i.e., may be unable to use all the power that its delivery hierarchy has provisioned.

Conceptual/mechanistic claim supported by the paper's modeling framework that examines mismatches between provisioned power and deployed demand; no numeric sample size provided in the abstract.

high negative Designing Datacenter Power Delivery Hierarchies for the AI E... power stranding (unused provisioned power)

This poses a major challenge for datacenter power delivery designers.

Argument based on the projected rise in rack power density and resulting engineering constraints; asserted in the paper's introduction/contextual framing rather than an experimental result.

high negative Designing Datacenter Power Delivery Hierarchies for the AI E... difficulty/challenge for datacenter power delivery design

Analysis indicates a significant negative relationship between perceived opportunities and challenges related to AI (i.e., higher perceived opportunities are associated with lower perceived challenges).

Correlation and regression analyses performed in SPSS on primary survey data showed a statistically significant negative association between measures of perceived opportunities and perceived challenges.

high negative Opportunities and Challenges of Human-AI Collaboration in W... association between perceived opportunities and perceived challenges

There exists employee resistance to change in response to AI adoption.

Survey-based measures of resistance included in the questionnaire and analyzed (descriptive/correlation/regression) using SPSS.

high negative Opportunities and Challenges of Human-AI Collaboration in W... self-reported resistance to organizational change related to AI

Employees identify ethical issues—particularly transparency and accountability of AI systems—as a notable challenge.

Survey items on ethical concerns analyzed with SPSS (descriptive and reliability analyses).

high negative Opportunities and Challenges of Human-AI Collaboration in W... perceived ethical concerns (transparency, accountability)

Employees have concerns regarding data privacy related to AI systems.

Primary survey data using a Likert-scale questionnaire; findings summarized with descriptive statistics and reliability analysis.

high negative Opportunities and Challenges of Human-AI Collaboration in W... level of concern about data privacy

Employees report lack of AI-related skills (skill gaps) as a significant challenge to human–AI collaboration.

Survey responses from employees in AI-enabled organizations collected via a structured questionnaire and analyzed (descriptive/correlation).

high negative Opportunities and Challenges of Human-AI Collaboration in W... self-reported AI-related skill gaps

Employees report fear of job displacement as a notable challenge associated with AI adoption.

Primary survey data (structured questionnaire) capturing perceived challenges; descriptive statistics reported.

high negative Opportunities and Challenges of Human-AI Collaboration in W... perceived risk/fear of job displacement

The document examines recent data on AI diffusion in the G7 economies, which point to large and persistent differences between SMEs and large firms.

Empirical examination of recent diffusion/adoption data across G7 economies as described in the paper; no sample size or specific datasets provided in the excerpt.

high negative Einführung von KI in kleinen und mittleren Unternehmen differences in AI diffusion between SMEs and large firms

Despite recent technological advances in AI tools, SMEs are more hesitant to adopt AI than they are to adopt other digital technologies, and more hesitant than larger firms.

Statement referencing "recent data on AI diffusion in the G7 economies" showing differences between SMEs and large firms; implies empirical analysis of diffusion/adoption data (no sample size given in excerpt).

high negative Einführung von KI in kleinen und mittleren Unternehmen adoption/diffusion of AI technologies in SMEs versus large firms

In algorithm-triggered emotional escalations, workers showed lower engagement: they sent fewer messages, contributed a smaller share of total chat rounds, and showed less proactivity in information seeking and solution provision.

Behavioral measures derived from chat logs in the randomized experiment comparing worker actions post-escalation across escalation types; reported differences in message counts, share of rounds, and proxies for proactivity.

high negative Agentic AI and Human-in-the-Loop Interventions: Field Experi... worker engagement measures (message count, share of chat rounds, proactivity ind...

Human intervention is less effective in algorithm-triggered emotional escalations (where customers express frustration or dissatisfaction).

Experimental subgroup analysis comparing intervention outcomes for algorithm-triggered emotional escalations versus technical escalations; emotional escalations showed worse post-intervention outcomes.

high negative Agentic AI and Human-in-the-Loop Interventions: Field Experi... service quality after emotional escalations

AI deployment substantially lowers ratings for AI-eligible chats.

Randomized field experiment measuring customer ratings for AI-eligible chats; treated condition (AI + human oversight) produced substantially lower ratings relative to control (humans only).

high negative Agentic AI and Human-in-the-Loop Interventions: Field Experi... customer ratings for AI-eligible chats

AI deployment reduces average chat duration.

Randomized field experiment on Alibaba's Taobao platform: workers in treatment supervised an agentic AI resolving AI-eligible chats while handling AI-ineligible chats; control workers resolved all chats without AI. Effect observed on average chat duration in experiment data.

high negative Agentic AI and Human-in-the-Loop Interventions: Field Experi... average chat duration

Parsing through LLM-generated code can be tedious and time-consuming, potentially negating the productivity gains promised by AI-coding tools.

Motivation/background statement in the paper: a qualitative claim about the cost (time/effort) of reviewing LLM-generated code; presented as motivation rather than empirically quantified evidence in the excerpt.

high negative Viverra: Text-to-Code with Guarantees time/effort required to review LLM-generated code

Overthinking is a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.

Conclusion drawn by authors based on their empirical findings described in the abstract (amplification of output length across multiple models and transferability experiments).

high negative Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... presence of shared vulnerability across models (qualitative security posture)

This overthinking behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS)-style resource exhaustion.

Authors assert increased latency and energy consumption as consequences of longer reasoning traces; framed as a potential attack vector in the abstract (no quantitative latency/energy measurements provided in abstract).

high negative Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... inference latency and energy consumption

Large reasoning models (LRMs) exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces when confronted with incomplete or logically inconsistent inputs.

Empirical observation reported by the authors based on experiments described in the paper (abstract references experiments across multiple SOTA reasoning models); no numerical sample size for inputs reported in abstract.

high negative Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... response length / reasoning trace length (verbosity and redundancy)

Distinct readability issue patterns and limited effectiveness of prompt engineering reveal a latent technical debt in LLM-generated code that could affect long-term maintainability.

Interpretation/conclusion in paper combining empirical findings (distinct issue patterns and limited prompt impact) to argue for potential technical debt and maintainability risks; presented as a forward-looking implication rather than a quantified causal estimate.

high negative The Readability Spectrum: Patterns, Issues, and Prompt Effec... maintainability_risk / technical_debt_inferred_from_readability

LLM-generated code displays distinct readability issue patterns compared to human-written code.

Empirical analysis of readability subcomponents/features showing different patterns of readability issues between LLM-generated and human-written code (paper reports qualitative/quantitative distinctions in issue patterns).

high negative The Readability Spectrum: Patterns, Issues, and Prompt Effec... readability_issue_patterns (feature-level readability problems)

Policy responses in Europe are fragmented across the EU and Member State levels and do not match the potential scale of disruption from AGI.

Paper's policy analysis of EU- and Member-State-level responses (stated in abstract); no quantitative metrics provided in the abstract.

high negative Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation

Europe has low rates of industrial AI adoption.

Paper's empirical/policy review claiming low industrial AI adoption in Europe (as stated in abstract); the abstract does not provide numeric adoption rates or sample sizes.

high negative Europe and the Geopolitics of AGI: The Need for a Preparedne... adoption_rate

Europe exhibits structural weaknesses in compute infrastructure and talent retention.

Paper's structural assessment of Europe's AI value-chain capabilities (stated in abstract); no numerical measures provided in the abstract.

high negative Europe and the Geopolitics of AGI: The Need for a Preparedne... adoption_rate

Europe has limited strategic awareness of frontier AI progress.

Paper's assessment of Europe's positioning based on policy analysis and review of capabilities monitoring (as stated in abstract); no supporting metrics or sample sizes provided in the abstract.

high negative Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation

AGI could strain existing governance frameworks.

Paper's policy analysis describing potential mismatches between governance capacity and AGI-induced disruptions (as stated in abstract); no empirical tests or quantification reported in the abstract.

high negative Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation

AGI could intensify interstate competition.

Paper's geopolitical analysis and scenario-based reasoning informed by trends in AI capabilities (stated in abstract); no quantitative measures reported in the abstract.

high negative Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation

AGI could fundamentally alter the global distribution of economic and military power.

Paper's geopolitical analysis drawing on capability trends and scenario reasoning (as stated in abstract); no empirical quantification provided in the abstract.

high negative Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation

Increased levels of AI assistance may degrade productivity, leading to potentially significant shortfalls under the model's identified conditions.

Model-based comparative-statics and steady-state analysis showing scenarios where marginal increases in AI assistance reduce expected task output; examples/parameter illustrations provided in the paper (theoretical, no empirical sample).

high negative Human-AI Productivity Paradoxes: Modeling the Interplay of S... expected task output / productivity shortfalls associated with increased AI assi...

Introducing AI unreliability (errors/noise in AI outputs) in the model can also generate a productivity paradox: greater AI assistance may lower productivity.

Analytical/theoretical model incorporating AI unreliability; model derivations and examples demonstrating conditions under which unreliability leads to reduced productivity (no empirical data).

high negative Human-AI Productivity Paradoxes: Modeling the Interplay of S... agent productivity (task output) as influenced by AI assistance and AI unreliabi...

Incorporating endogeneity in skill development into the model can induce a productivity paradox where increased AI assistance reduces productivity.

Analytical/theoretical model of human-AI interaction with utility-maximizing human agents and endogenous skill development; steady-state and comparative-static analysis reported in the paper (no empirical sample).

high negative Human-AI Productivity Paradoxes: Modeling the Interplay of S... agent productivity (task output) as a function of AI assistance and endogenous s...

AI integration simultaneously increases labor concerns about skill obsolescence by 33%.

Reported as a survey/result in the paper; the study includes surveys of 800 marketers (self-reported concerns about skill obsolescence are likely derived from that survey sample).

high negative Augmented Intelligence: Resolving the AI integration-obsoles... worker concerns about skill obsolescence

Rising data velocity renders legacy systems obsolete—threatening approximately $3.4 trillion in global marketing spending.

Paper reports an estimate/claim about threatened global marketing spending tied to legacy systems becoming obsolete (derivation likely from the study's quantitative analysis or economic estimate described in the paper).

high negative Augmented Intelligence: Resolving the AI integration-obsoles... value of global marketing spending at risk

62% of teams suffer from "AI paralysis," unable to scale pilot initiatives beyond isolated implementations.

Reported as a finding in the paper's mixed-methods study (paper states AI adoption audits of 120 organizations and surveys of 800 marketers as part of the study).

high negative Augmented Intelligence: Resolving the AI integration-obsoles... AI paralysis / inability to scale AI pilots

Autonomous software-engineering agents remain unreliable in realistic development settings.

Assertion in abstract summarizing the observed current state; likely based on prior literature and/or authors' observations (no empirical sample size given in abstract).

high negative AI Harness Engineering: A Runtime Substrate for Foundation-M... reliability of autonomous software-engineering agents (ability to perform correc...

Individuals low in trait self-efficacy experienced the steepest ownership erosion (i.e., AI-authorship reduced psychological ownership most for low self-efficacy participants).

Reported moderation analysis in the preregistered experiment showing trait self-efficacy moderated the authorship effect on psychological ownership; preregistered N = 470. (No numeric effect size reported in the abstract.)

high negative Optimized but Unowned: How AI-Authored Goals Undermine the M... change/erosion in psychological ownership as moderated by trait self-efficacy

Participants in the LLM condition reported lower perceived importance (d = 1.13).

Same preregistered experiment; reported effect size d = 1.13; preregistered N = 470.

high negative Optimized but Unowned: How AI-Authored Goals Undermine the M... perceived importance of goals (self-reported)

Participants in the LLM condition reported lower commitment (d = 1.19).

Same preregistered experiment comparing self-authored vs LLM-authored goals; reported effect size d = 1.19; preregistered N = 470.

high negative Optimized but Unowned: How AI-Authored Goals Undermine the M... commitment (self-reported)

Participants in the LLM condition reported lower psychological ownership (d = 1.38).

Same preregistered experiment (between-subjects comparison of authorship); reported effect size d = 1.38; preregistered N = 470.

high negative Optimized but Unowned: How AI-Authored Goals Undermine the M... psychological ownership (self-reported)
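
For readers unfamiliar with the d values reported above, the sketch below shows one common way to compute Cohen's d (difference in means divided by the pooled standard deviation). Whether the paper used exactly this pooled-SD variant is an assumption, and the ratings are made-up illustrative numbers, not data from the study.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical 7-point ratings, invented purely to exercise the function.
self_authored = [6, 7, 5, 6, 7]
llm_authored = [4, 5, 3, 4, 5]
print(round(cohens_d(self_authored, llm_authored), 2))
```

By the usual rule of thumb, values above 0.8 count as large effects, so the reported values of 1.13 to 1.38 indicate substantial gaps between conditions.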

The paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics.

Conceptual analysis and problem-framing presented in the paper (qualitative identification of five mismatch categories).

high negative Agent-First Tool API: A Semantic Interface Paradigm for Ente... architectural_mismatches_between_conventional_APIs_and_autonomous_agent_requirem...

Participants showed fewer creative moments when using LLMs (p=0.002).

Within-subject comparison between LLM-assisted and unassisted conditions with reported p-value p=0.002. Study sample N=20.

high negative "Like Taking the Path of Least Resistance": Exploring the Im... count of creative moments

Participants using LLMs had significantly shorter idea-generation periods (p=0.0004).

Within-subject comparison between LLM-assisted and unassisted conditions reported in paper; p-value reported as p=0.0004. Sample size N=20.

high negative "Like Taking the Path of Least Resistance": Exploring the Im... idea-generation period (time spent generating ideas)
