Evidence (7560 claims)

Search and filter individual claims pulled from the papers. Looking for a specific finding ("what's the effect on wages?"), you're in the right place. Want to compare whole outcome categories against each other instead? Use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1

Human Ai Collab Remove filter

Early evidence indicates AI is reducing the productivity difference between beginner and expert employees.

Reported 'early evidence' from the paper's empirical analysis (difference-in-differences on freelance platforms) indicating convergence in productivity between novices and experts; no numeric effect estimates given in the abstract.

high negative THE ASYMMETRIC IMPACT OF GENERATIVE ARTIFICIAL INTELLIGENCE ... productivity difference between beginner and expert employees

A strategic AI sender may withhold evidence or garble information in order to steer the human's decision.

Theoretical reasoning and examples within the Bayesian persuasion framework showing that sender-optimal signaling need not fully reveal the state and can be manipulative; supported by model analysis rather than empirical data.

high negative Quantifying Theoretical AI Alignment Guarantees: Receiver-Ut... amount/quality of information reaching the human (information transfer / resulti...

These results demonstrate how people's decision-making processes can be insufficient for overseeing AI in high-stakes domains.

Synthesis/interpretation of experimental findings (longer viewing when no AI, small increases in selection probability with more time for non-recommended candidates, IAT effects) to argue that human decision processes may not adequately supervise biased AI in high-stakes settings. This is an interpretive/concluding claim based on the experiment; not a direct empirical measure. Sample size not stated in the excerpt.

high negative Resume Screening, Fast and Slow: (Biased) AI Recommendations... adequacy of human decision-making processes for overseeing AI

Pooled across five AI coding agents, pull requests (PRs) with a human Co-Authored-By trailer merge less often than purely-autonomous ones (53.8% vs. 79.8%).

Aggregate analysis of PR merge rates across five AI coding agents in the AIDev dataset; pooled sample of PRs (33,596 PRs referenced elsewhere in the paragraph).

high negative Beyond Simpson's Paradox: A Cascade of Confounders in AI Age... PR merge rate (whether a PR was merged)

The paper identifies two distinct gaps that have widened as GPTs exposure scores traveled from their time and place of production: (1) a structural gap between what static exposure scores measure and what policy questions require, and (2) a coordination gap between researchers and policymakers.

Explicit framing and thesis presented in the paper summarizing the central arguments.

high negative AI Exposure Scores: what they measure, what they miss, and w... alignment between measurement and policy needs; researcher–policymaker coordinat...

Policy-relevant work that asks who is harmed or benefits, how, and when continues to reference static GPTs exposure scores without engaging with methodological updates needed to answer these questions more reliably.

Critical literature review and observed citation practices reported by the authors; claim based on review of how policy analyses cite/ use the scores.

high negative AI Exposure Scores: what they measure, what they miss, and w... quality of policy-relevant analyses and use of updated methods

These temporal, geographic, and ontological limitations compound when exposure scores are used in policy-facing analyses.

Conceptual argument and case-study approach in the paper showing how limitations interact and worsen policy analysis outcomes.

high negative AI Exposure Scores: what they measure, what they miss, and w... reliability/accuracy of policy-facing analyses using exposure scores

The GPTs exposure scores have temporal, geographic, and ontological limitations that do not always travel with the scores as they are reused.

Authors' methodological critique discussing the limits named by Eloundou et al. (2023) and how those limits are often ignored when scores are repurposed.

high negative AI Exposure Scores: what they measure, what they miss, and w... validity/applicability of exposure scores across time, place, and task ontology

Under individual selection, self-interested prompts dominate, causing populations to collapse into collective defection.

Simulation experiments with individual-level selection/transmission showing emergence and dominance of self-interested prompts and subsequent decline into collective defection.

high negative Group Selection Promotes Prosocial Prompts in Populations of... prevalence of defection / decline in cooperative behavior

As frontier training shifts toward individual rewards for verifiable tasks (e.g., mathematics and coding), this outcome-based focus may further undermine cooperation in multi-agent settings.

Argumentative/prognostic claim in the paper's motivation; not an empirical result from the study but framed as a risk informed by the literature and authors' reasoning.

high negative Group Selection Promotes Prosocial Prompts in Populations of... extent of cooperation in multi-agent settings

Current approaches to instill prosociality in LLM agents often rely on humans specifying desired behaviors at the individual level, which does not guarantee cooperation within LLM populations.

Background statement in paper; conceptual critique of human-specified, individual-level reward/behavior specification as commonly used in LLM alignment and fine-tuning literature (no new empirical test reported in this study).

high negative Group Selection Promotes Prosocial Prompts in Populations of... guarantee of cooperation in LLM populations

Existing frameworks address AI-assisted development maturity or the productivity-reliability tension but offer no mechanism for calibrating human oversight intensity to regulatory impact.

Comparative framework analysis and literature review reported in the paper (claims about gaps in existing frameworks).

high negative Governed AI-Assisted Engineering: Graduated Human Oversight ... absence of mechanisms to calibrate human oversight intensity with regulatory imp...

The adoption of agentic AI coding systems -- where autonomous agents generate, review, test, and deploy code with minimal human intervention -- creates a governance challenge in regulated industries.

Argumentation in the paper framing the problem; conceptual analysis of agentic AI capabilities and regulatory constraints (literature/contextual reasoning rather than empirical data).

high negative Governed AI-Assisted Engineering: Graduated Human Oversight ... governance challenge / regulatory risk arising from agentic AI code generation

Neither the task design nor the retrieval approach of Finance Agent v2 addresses the distinct challenges of IPO due diligence.

Author argument comparing periodic reporting tasks to IPO due-diligence requirements, noting Finance Agent v2's task and retrieval design do not address IPO-specific complexities.

high negative IPO Finance Agent: Evaluation of LLM Financial Analysts beyo... suitability for IPO due diligence

The Finance Agent v2 agentic harness relies on naive, unenriched chunk retrieval.

Author statement describing the retrieval approach used by Finance Agent v2 as naive chunk retrieval.

high negative IPO Finance Agent: Evaluation of LLM Financial Analysts beyo... retrieval architecture (chunk retrieval)

The essay introduces the concept of a 'vouching gap' to describe a growing divide between students who graduate with credible advocates willing to stake their reputations on their behalf and those who do not.

Conceptual contribution defined in the essay and motivated by social capital theory and mentoring research; no empirical quantification or sample provided.

high negative Vouching towards Bethlehem: what colleges and universities o... presence and growth of a gap in access to credible advocates among graduates

Automation of student work and candidate screening will widen existing inequalities between students.

Theoretical claim in the essay linking AI-driven automation to differential outcomes across students, motivated by social capital and mentoring literature; no empirical data or sample reported.

high negative Vouching towards Bethlehem: what colleges and universities o... distributional inequality in graduate outcomes/access to opportunities

This automation threatens to hollow out the value of a university degree.

Argument presented in the essay, grounded in social capital theory and mentoring research; no empirical test or sample size reported.

high negative Vouching towards Bethlehem: what colleges and universities o... market and signaling value of a university degree

Manual preparation of engineering designs for thousands of wells constitutes an enormous administrative burden and is prone to inconsistencies.

Introductory/background statement in the paper describing the pre-existing manual workflow burden; no numerical study reported for this specific statement.

high negative Transforming Engineering Workflows: A Data-Driven Generative... administrative burden and inconsistency in design preparation

The demand premium enjoyed by workers with strong human capital declines in more AI-exposed categories.

Heterogeneity analysis within the Upwork dataset: workers characterized by stronger human-capital signals (via profile embeddings) show a reduced demand premium in job categories more exposed to AI following ChatGPT; identified using difference-in-differences around ChatGPT release. (Sample size not reported in abstract.)

high negative Human Capital, AI, and Labor Commoditization demand premium for workers with strong human capital

In more AI-exposed job categories, the importance of human capital information in predicting labor demand declines.

Empirical analysis of Upwork platform data using high-dimensional text embeddings to represent worker profiles; the paper computes the predictive importance of human-capital-related profile information and uses a difference-in-differences design around the release of ChatGPT to estimate changes by AI exposure of job categories. (Sample size not reported in abstract.)

high negative Human Capital, AI, and Labor Commoditization importance of human capital information in predicting labor demand

Adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions.

Empirical comparisons across experimental sessions in the Collaborative Gym / DiscoveryBench setup; result reported across the study (1,482 sessions).

high negative Searching for Synergy in Shared Workspace Human-AI Collabora... team performance (task success / accuracy)

A wide range of empirical evidence shows that humans avoid complexity, delegate judgement, and prefer simplified social worlds.

Asserted as empirical background; paper references a broad empirical literature but does not report primary data, sample sizes, or specific studies in the provided text.

high negative The Simplicity Paradox: Why Evolution Does Not Produce Unive... propensity to avoid complexity / delegate judgment / preference for simplified s...

Most organizations (59%) approach AI implementation through a technology-first lens, layering intelligent systems onto legacy processes rather than intentionally redesigning how humans and machines collaborate.

Reported descriptive statistic from Deloitte's 2026 Global Human Capital Trends survey of over 3,000 business leaders across 15 countries (paper cites 59% figure).

high negative Designing Human-Machine Collaboration: Strategic Imperatives... percentage of organizations using a technology-first approach to AI implementati...

Only 14% of organizational leaders report proficiency in designing effective human-machine interactions.

Reported descriptive statistic from the same Deloitte 2026 Global Human Capital Trends survey of over 3,000 business leaders across 15 countries.

high negative Designing Human-Machine Collaboration: Strategic Imperatives... percentage of organizational leaders reporting proficiency in designing effectiv...

Current machine learning models commonly require large and well-annotated datasets, and the annotation process often becomes a bottleneck with increased complexity leading to higher chances of human errors.

Background statement in the paper summarizing common knowledge and prior literature about dataset requirements and annotation challenges.

high negative Speeding up the annotation process in semantic segmentation ... annotation bottleneck / annotation error likelihood

At the macro level, values-driven withdrawal from AI use has the potential to narrow the diversity of visible applications, amplifying risk-focused narratives and reinforcing perceptions of harm in public discourse.

Theoretical extension of the guarded engagement loop to societal/public discourse dynamics; based on synthesis of social amplification of risk literature rather than empirical measurement in the abstract.

high negative The guarded engagement loop: risk salience and interaction-d... market_structure

These constrained (guarded) interactions can lower output quality and increase the likelihood of visible errors, which may further erode trust and reinforce cautious engagement.

Theoretical causal chain posited by the authors within their conceptual framework; supported by literature-based argumentation rather than reported empirical results in the abstract.

high negative The guarded engagement loop: risk salience and interaction-d... output_quality

At the micro level, elevated risk salience related to privacy, safety, or ethical concerns may lead users to adopt guarded interaction strategies characterized by reduced contextual disclosure and limited iteration.

Theoretical proposition within the paper's guarded engagement loop framework, drawing on prior research in privacy calculus and algorithm aversion; no specific empirical data reported in the abstract.

high negative The guarded engagement loop: risk salience and interaction-d... automation_exposure

Generative AI adoption is often framed primarily as a question of learning technical skills, and this perspective overlooks a defining feature of large language models (LLMs): their output quality depends heavily on how users engage with them.

Conceptual argument presented in the paper's introduction/abstract; literature synthesis framing adoption debates (no empirical sample or experimental method reported in the abstract).

high negative The guarded engagement loop: risk salience and interaction-d... adoption_rate

Expertise moderated the effect of LLM guidance: novices exhibited passive AI reliance.

Stratified analyses by participant expertise level using behavioral and eye-tracking measures indicating novices shifted attention to the AI/chat and exhibited more passive acceptance of guidance.

high negative LLM-Mediated Human-AI Interaction in Search and Rescue: Impa... AI reliance / passive acceptance behavior (gaze patterns and decision behavior)

AI augmentation breaks the accounting link between labor time and productive contribution, yet firms continue to evaluate talent through time-based overhead bundles.

Theoretical argument and conceptual framing presented in the paper (no empirical sample reported for this specific proposition).

high negative What Capital After Labor? Forecasting the Talent ROI Transit... accounting link between labor time and productive contribution / use of time-bas...

Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation.

Paper's risk analysis listing finance-specific threats (regulatory compliance violations, facilitation of fraud, systemic trust erosion). This is a conceptual/risk framing rather than reported empirical incidence rates in the provided summary.

high negative FFinRED: An Expert-Guided Benchmark Generation and Evaluatio... presence of finance-specific risks (regulatory violations, fraud facilitation, t...

Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks.

Authors' comparative assertion in paper (conceptual analysis arguing gap between general LLM safety benchmarks and finance-specific threats). No numeric evaluation reported in the provided summary.

high negative FFinRED: An Expert-Guided Benchmark Generation and Evaluatio... coverage of finance-specific risks by existing LLM safety benchmarks

Investment is being directed toward AI deployment when achieving productivity gains requires prior development of convergence capacity (C), leading to a misallocation of investment.

Theoretical reasoning within the paper: conceptual argument that deployment-focused spending misses prerequisite cognitive capacity (C).

high negative Forecasting AI-Era Productivity: The Intellectually Converge... alignment of AI investment with productivity-enhancing prerequisites (convergenc...

Prevailing production-function frameworks encounter a structural boundary because they treat AI as a separable factor of production without modeling the cognitive mediation through which AI generates productive value.

Theoretical / conceptual argument presented in the paper (derivation and critique of existing production-function approaches).

high negative Forecasting AI-Era Productivity: The Intellectually Converge... adequacy of production-function frameworks to capture AI-driven productivity

Massive AI investment has failed to generate commensurate productivity gains (the "AI productivity paradox").

Stated as the motivating empirical paradox in the paper; presented as an observed phenomenon motivating the theoretical argument (no specific dataset or numeric evidence provided in the abstract).

high negative Forecasting AI-Era Productivity: The Intellectually Converge... productivity gains (total factor productivity / output per worker)

The translation of AI's potential into operational capability within government audit contexts requires navigating complex technical, institutional, legal, and ethical challenges that differ substantially from private sector environments.

Paper's conceptual analysis and comparative argument (paper contrasts government audit contexts with private sector origins of many AI tools); no quantitative empirical evidence or sample size reported.

high negative Towards AI-Augmented Public Audit Systems: A Policy and Impl... barriers to implementation / governance constraints

There are barriers and challenges that the labor force faces in meeting new skill requirements.

Review conclusion noting barriers and challenges reported in the empirical literature (types of barriers not enumerated in the excerpt; no measures or prevalence reported).

high negative Labor Market The Impact of Artificial Intelligence on Employ... existence of barriers to skill acquisition/upskilling

The root causes of these problems include the disruption of labor relations boundaries by the transformation of the means of production, the exclusion of implicit data labor from distribution rules, the concentration of capital driven by high industry barriers, and social structural constraints on technological dissemination.

Synthesis and causal argumentation grounded in Marx's theory of reproduction; conceptual reasoning rather than empirical testing.

high negative Challenges and Reconstruction of Human-Machine Collaboration... Structural causes of inequality and power concentration in human-machine collabo...

In the consumption phase, high costs lead to service stratification, making it difficult for technological dividends to benefit the general public.

Theoretical/qualitative argument about cost barriers and unequal access to AI-enabled services; no empirical evidence or sample sizes reported.

high negative Challenges and Reconstruction of Human-Machine Collaboration... Distribution of benefits / access to services (service stratification, consumer ...

In the exchange phase, high barriers to entry for technology and capital foster market monopolies.

Analytical claim based on structural characteristics of AI/embodied intelligence industries; no empirical sample or quantitative measures provided in the paper.

high negative Challenges and Reconstruction of Human-Machine Collaboration... Market concentration / monopoly formation

In the distribution phase, behavioral data unconsciously generated by workers drives algorithmic iteration yet remains excluded from the distribution system, resulting in hidden data exploitation.

Theoretical argument that worker-generated behavioral data fuels algorithmic development but is not accounted for in value distribution; no empirical data or sample reported.

high negative Challenges and Reconstruction of Human-Machine Collaboration... Value distribution of data contributions (hidden data exploitation)

In the production stage, workers are alienated into becoming data producers.

Conceptual claim based on Marxian analysis of labor and data extraction; no empirical sample or quantitative evidence presented.

high negative Challenges and Reconstruction of Human-Machine Collaboration... Role shift of workers toward producing data as labor

In the production stage, workers are disciplined by algorithms.

Theoretical/qualitative argument in the paper describing algorithmic management and control; no empirical measures or sample provided.

high negative Challenges and Reconstruction of Human-Machine Collaboration... Algorithmic control/discipline over workers

In the production stage, workers lose decision-making power.

Theoretical analysis of production relations using Marxist reproduction framework; qualitative claim without reported empirical data.

high negative Challenges and Reconstruction of Human-Machine Collaboration... Workers' decision-making power

The canonical manifestation of this failure pattern is called 'Phantom Legislation' (internally consistent symbolic outputs disconnected from real business semantics).

Terminology and descriptive example provided by the authors based on their analysis of observed failure cases in the Bang-v3 project.

high negative Written by AI, Managed by AI: Semantic Space Control and Ind... failure_pattern_description

Test file counts substantially overestimate verification strength.

Conclusion drawn from the high prevalence (80.2%) of test patches with weak/no oracle signals compared to mere presence-of-test-file counts.

high negative All Smoke, No Alarm: Oracle Signals in Agent-Authored Test C... accuracy of using test-file counts as proxy for verification strength

Raw merge rates are lower for strong-oracle PRs.

Unadjusted (raw) comparisons of merge rates between PRs classified by oracle strength in the study dataset.

high negative All Smoke, No Alarm: Oracle Signals in Agent-Authored Test C... raw merge rate

Applied at scale, 80.2% of test patches contain weak or no explicit oracle signals.

Automated/syntactic classification of oracle-signal categories applied to the full dataset of test-file patches (as described in methods).

high negative All Smoke, No Alarm: Oracle Signals in Agent-Authored Test C... presence of explicit oracle signals in test patches

« Prev 1 2 3 … 12 13 14 … 151 152 Next »