Evidence (7560 claims)

Search and filter individual claims pulled from the papers. If you're looking for a specific finding ("what's the effect on wages?"), you're in the right place. Want to compare whole outcome categories against each other instead? Use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1

Active filter: Human-AI Collaboration

Workers were assigned to no overrides, free overrides, or a two-per-machine limit on downward overrides.

Experimental design statement in paper: randomized assignment into three arms (no overrides, free overrides, constrained two-per-machine downward override limit).

high null result A Simple Solution to Improving Human Supervision of Algorith... Treatment assignment (experimental arms)

We tested [the policy] through a randomized field experiment with 553 workers at a major Chinese smart vending machine retailer that manages more than 59,000 machines and 4,000 SKUs.

Randomized field experiment described in paper; sample stated as 553 workers and operational context (retailer with >59,000 machines and >4,000 SKUs).

high null result A Simple Solution to Improving Human Supervision of Algorith... Experimental implementation / sample and setting description

The runs spanned several model generations, two agent harnesses, two reasoning effort levels, a testing tool, and two design-oriented prompts.

Description of experimental conditions reported in the study (factors varied across the 90 runs).

high null result Reasoning effort, not tool access, buys first-try reliabilit... experimental condition coverage (model generation, harness, effort level, testin...

Ninety independent agent runs built the same application from one detailed specification, each scored on a fixed 14-criterion functional rubric (42 point maximum) and a visual quality review.

Experimental study described in the paper: 90 independent agent runs, single specification, evaluated on a 14-criterion rubric (42-point max) plus visual quality review.

high null result Reasoning effort, not tool access, buys first-try reliabilit... functional score (14-criterion rubric) and visual quality review

The framework is intended primarily as a scholarly contribution to clarify the conceptual landscape and support future theoretical and empirical work, not as prescriptive guidance for practitioners.

Authors' explicit statement of intent in the abstract describing the purpose and scope of the proposed framework.

high null result Mapping Human–AI Relationships: Intellectual Structure and C... intended purpose of the framework (scholarly vs. prescriptive)

The authors propose a conceptual framework that classifies human–AI relationships into four categories—symbiotic, augmented, assisted, and substituted intelligence—according to the level of AI autonomy and human involvement.

Authors' conceptual synthesis and proposal based on thematic mapping and literature synthesis (framework described in abstract as an output of analysis).

high null result Mapping Human–AI Relationships: Intellectual Structure and C... categorization of human–AI relationship types by autonomy and human involvement

This study employs a bibliometric co-word analysis of 4093 peer-reviewed documents indexed in Scopus to map the intellectual structure of the field.

Authors report performing a bibliometric co-word analysis on 4,093 peer-reviewed documents from the Scopus database (method stated in abstract).

high null result Mapping Human–AI Relationships: Intellectual Structure and C... use of bibliometric co-word analysis on 4093 documents

The docs CLI used in the constrained condition is approximately 200 lines of code.

Paper text states the CLI used is about 200 lines of code.

high null result Steerability via constraints: a substrate for scalable overs... tool size (lines of code)

We report a controlled experiment in scalable oversight: a small reviewer (Gemma 4 e4b) inspects a Python codebase containing 11 inserted backdoors.

Described controlled experiment in the paper: a single automated reviewer (Gemma 4 e4b) evaluated a Python codebase where the authors inserted 11 backdoors.

high null result Steerability via constraints: a substrate for scalable overs... detection of inserted backdoors

The study relies on secondary evidence from the U.S. Census Bureau, U.S. Bureau of Labor Statistics, OECD, IMF, Stanford AI Index, McKinsey Global Institute, NBER, and recent experimental research published from 2020 onward.

Explicit methodological statement in the paper.

high null result Effect of Artificial Intelligence Adoption on Labour Product... data sources and evidence base

The paper synthesizes evidence drawing on reports from the World Economic Forum, PwC, McKinsey Global Institute, Gartner, and the International Monetary Fund.

Literature/report synthesis explicitly described in the paper (citation list to those organizations).

high null result AI-Driven Workforce Transformation: Displacement, Opportunit... sources and scope of evidence used in the paper

An evaluation agenda is outlined for future artifact development and evaluation.

Paper includes a proposed agenda for future empirical evaluation and artifact development (conceptual/recommendation content).

high null result Mitigating Attention Bias in Index-Fund Investment: AI-Enabl... future research and evaluation plans

Research on AI-enabled decision support in the index-fund context remains limited.

Finding from the paper's progressive systematic literature review (statement of limited prior research; number of studies not provided in excerpt).

high null result Mitigating Attention Bias in Index-Fund Investment: AI-Enabl... extent of empirical/theoretical research on AI-enabled decision support for inde...

Through a progressive systematic literature review, this study finds that incorporating cognitive-bias mechanisms as design drivers in AI-enabled investment artifacts has not been studied.

Progressive systematic literature review reported in the paper (review method stated; specific number of papers not provided in the excerpt).

high null result Mitigating Attention Bias in Index-Fund Investment: AI-Enabl... presence/absence of design practices in AI-enabled investment artifact literatur...

These pilot findings motivate a pre-registered replication that is now in preparation.

Statement in the paper reporting intention to run a pre-registered replication study following the pilot.

high null result Human Capital, Not Model Benchmarks, Predicts Hybrid Intelli... planned pre-registered replication

The results are preliminary but statistically robust.

Authors' characterization of the pilot findings (explicit statement in the paper indicating preliminary status and statistical robustness).

high null result Human Capital, Not Model Benchmarks, Predicts Hybrid Intelli... statistical significance / robustness of reported pilot effects

Raw cognitive ability or model benchmark metrics did not distinguish who engaged in complementary reasoning.

Pilot study reports lack of predictive power from cognitive ability measures and model benchmark scores for identifying participants who achieved complementary, high-performing collaboration.

high null result Human Capital, Not Model Benchmarks, Predicts Hybrid Intelli... engagement in complementary reasoning (prediction by cognitive ability / model b...

Most participants deferred to the model, producing forecasts that matched the model's predictions.

Statement in the paper summarizing distribution of individual behaviors in the pilot (majority reported as deferring/matching the model).

high null result Human Capital, Not Model Benchmarks, Predicts Hybrid Intelli... degree of agreement with model predictions / matching rate

The study used a real-money prediction market (Polymarket) as an objective, externally resolved benchmark.

Pilot study described in the paper explicitly states use of Polymarket as an external, real-money benchmark for forecast resolution.

high null result Human Capital, Not Model Benchmarks, Predicts Hybrid Intelli... benchmark outcomes (Polymarket market prices / resolution)

Awards do not increase user activity and downstream impact.

Reported experimental finding from the Reddit field experiment described in the paper; the authors compare subsequent user behavior (volume and downstream impact) after receiving symbolic awards with different rationales. Sample size not reported in the abstract.

high null result A field experiment of social influence and behavioral contag... user activity (volume) and downstream impact

In fixed-unit subsets where complexity rose (Python on the cognitive metric, and all languages on the cyclomatic metric), newcomer participation does not decline.

Subgroup (fixed-unit) analyses that split units by whether complexity rose; DiD estimates within subsets show no decline in newcomer participation despite increases in complexity.

high null result Decoupling Code Complexity from Newcomer Participation: A Ca... newcomer participation in subsets where code complexity increased

A sparse, correlational beginner-task measure (good-first-issue labels) shows no decline, but we cannot test it for parallel trends.

Correlational analysis of frequency of 'good-first-issue' labels before and after adoption; authors note inability to test parallel trends for this measure.

high null result Decoupling Code Complexity from Newcomer Participation: A Ca... availability/number of beginner-task labels (good-first-issue)

Onboarding and retention are unchanged after adoption.

Difference-in-differences estimates comparing onboarding and retention metrics between adopting projects and matched non-adopting controls; reported as no significant change post-adoption.

high null result Decoupling Code Complexity from Newcomer Participation: A Ca... onboarding and retention of newcomers

We find no evidence of crowding-out: across estimators newcomer inflow shows no significant decline after adoption (point estimates run from a small increase to, under the most conservative trend specification, a slight and insignificant dip).

Difference-in-differences analysis against matched non-adopting controls, applied to 603 adopters with pre-adoption periods; multiple estimators and trend specifications reported.

high null result Decoupling Code Complexity from Newcomer Participation: A Ca... newcomer inflow (new contributors joining projects)

We recruited 1,283 participants to play iterated Collective Risk Games in small groups.

Statement of sample recruitment and experimental procedure in the paper (iterated Collective Risk Games; total N = 1,283).

high null result AI Persuasive Framing in Collective Dilemmas study sample / experimental participants

We analyze more than 930,000 agent-authored pull requests.

Descriptive statement about the dataset used for the study: an analysis of >930,000 pull requests authored by autonomous coding agents.

high null result Govern the Repository, Not the Agent: Measuring Ecosystem-Le... number of agent-authored pull requests analyzed

ATHENA is not presented as a validated measurement instrument; rather, it is a conceptual and methodological scaffold for empirical validation and responsible organizational experimentation.

Explicit qualification in the paper that ATHENA is a conceptual scaffold and has not been validated as a measurement instrument (stated limitation).

high null result Reconceptualizing Competence through Facets: ATHENA as a Str... validation status of ATHENA as a measurement instrument

We hired 49 programmers to interact with GitHub Copilot to assess 148 HIPAA-derived NFRs against the iTrust codebase across three dimensions: requirement satisfaction level, reasoning, and code localization.

Study design reported in the paper: recruitment of 49 programmers, 148 NFR assessments, use of GitHub Copilot and iTrust codebase, and three specified assessment dimensions.

high null result Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NF... study sample and experimental setup

Evaluating how well LLM-based dialogue systems support collaborative reasoning about NFRs requires methods that go beyond single-turn accuracy to capture both the correctness of system outputs and the quality of multi-turn interaction.

Methodological argument presented by the authors to motivate multi-turn study design (conceptual / methodological claim).

high null result Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NF... evaluation methodology adequacy for NFR assessment

Non-Functional Requirements (NFRs) are inherently vague, context-dependent, and involve many parts of a program, making them difficult to assess with single-turn correctness benchmarks.

Conceptual claim motivating the study, based on properties of NFRs discussed in the paper (no empirical measurement reported for this claim).

high null result Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NF... characteristics of NFRs (vagueness, context-dependence, broad code impact)

LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness.

Positioning statement in the paper's introduction / literature overview (no new empirical data reported for this claim).

high null result Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NF... scope of evaluation benchmarks (functional correctness focus)

The same model rebuilt to withhold answers erased the harm (i.e., removed the negative effect on unaided exam performance).

Reported as a finding from the same causal evidence referenced above (details of the study design and sample size not provided in the excerpt).

high null result The Effortless Trap: Productive Struggle, AI, and the Illusi... score on an unaided exam (difference relative to control)

The study contrasts usage across three populations: external personal-account users, external organizational-account users, and workers within OpenAI using an automated, privacy-protecting pipeline.

Methodological description in the paper stating the populations compared and the privacy-preserving data pipeline used.

high null result The Shift to Agentic AI: Evidence from Codex methodological scope (populations compared)

We model misalignment as an information advantage: the AI sender observes the world state (a bit string) while the human receiver only has a prior and acts after seeing the sender's signal.

Model specification and definitions in the paper (conceptual/theoretical modeling choice).

high null result Quantifying Theoretical AI Alignment Guarantees: Receiver-Ut... model structure / information transfer mechanism

This study is the first to theorize the relationship between organizations' agentic AI adoption and circular procurement performance.

Author statement in the abstract claiming novelty of theory contribution (literature review / positioning claim).

high null result Agentic AI and Circular Procurement Performance: An Empirica... novelty of theoretical contribution

The analysis in the paper was conducted using covariance-based structural equation modeling (CB-SEM) and a Process analytical method.

Methods described in the abstract (explicitly names the analytical techniques used).

high null result Agentic AI and Circular Procurement Performance: An Empirica... analytical methods used

Data for this study were collected from a developing nation.

Explicit statement in the abstract indicating the sample source/setting for the empirical data.

high null result Agentic AI and Circular Procurement Performance: An Empirica... study sample/source (geographic setting)

The benchmark includes controlled evaluation settings for local improvement, cross-task transfer, cross-role transfer, and cross-model generalization.

Paper description of benchmark design and experimental protocol specifying controlled evaluation settings for local improvement, cross-task transfer, cross-role transfer, and cross-model generalization.

high null result Managing Procedural Memory in LLM Agents: Control, Adaptatio... availability of controlled evaluation settings

We introduce AFTER, a benchmark of 382 realistic enterprise tasks spanning six professional roles and 22 procedural skills.

Description of benchmark dataset introduced in the paper; the paper reports the benchmark contains 382 tasks, covers six professional roles and 22 procedural skills (dataset construction / annotation process described in methods).

high null result Managing Procedural Memory in LLM Agents: Control, Adaptatio... benchmark size and coverage (number of tasks, roles, skills)

When restricted to multi-commit PRs, the Copilot within-repo effect dissolves to +4.8 percentage points (p = 0.59).

Subset analysis limited to multi-commit PRs for Copilot in the AIDev dataset; reported point estimate and p-value.

high null result Beyond Simpson's Paradox: A Cascade of Confounders in AI Age... PR merge rate (Copilot co-authorship effect in multi-commit PR subset)

Within-repository controls eliminate Devin's co-authorship gap, reducing it from +33.5 percentage points to +1.6 pp (p = 0.73).

Within-repo controlled regression/analysis on Devin PRs in the AIDev dataset showing adjusted effect and p-value.

high null result Beyond Simpson's Paradox: A Cascade of Confounders in AI Age... PR merge rate (Devin co-authorship effect before and after within-repo control)

During experiments, the original Finance Agent v2 harness basically failed to deliver any output related to the SpaceX S-1 filing, due to document length.

Authors' experimental observation reporting failure of the original Finance Agent v2 harness to produce output for the SpaceX S-1 file; reason given: document length.

high null result IPO Finance Agent: Evaluation of LLM Financial Analysts beyo... ability to produce output / retrieval success

Artificial intelligence is taking on advising functions and automating both the production of student work and employer-side candidate screening.

Statement in the essay (perspective/argumentative piece). The claim is supported as a conceptual observation drawing on literature on AI adoption; no empirical sample or quantified measurement reported.

high null result Vouching towards Bethlehem: what colleges and universities o... degree of automation of advising, student work production, and candidate screeni...

We conclude by outlining implications for designing and evaluating human-AI teams as socio-technical systems and for prioritizing longitudinal and in-context studies that capture how teaming evolves over time.

Authors' conclusions and recommendations based on the systematic review and observed gaps in the literature (noted need for longitudinal, in-context studies).

high null result From testbeds to high-stakes work: a review of Human-AI team... research_design_priorities (longitudinal and in-context evaluation)

Bibliometric patterns suggest a shift since 2020 from foundational demonstrations in controlled settings toward applied, higher-stakes contexts where trust dynamics, communication, and ethical accountability more directly shape adoption and sustained performance.

Bibliometric analysis of the 104 studies showing temporal trends (pre- vs post-2020) in research contexts and topics.

high null result From testbeds to high-stakes work: a review of Human-AI team... temporal_shift_in_research_contexts (prevalence of applied/higher-stakes context...

Across studies, performance was the most frequently examined aspect, followed by trust, explainability and transparency, decision-making, and team processes.

Synthesis and frequency coding of outcomes/measured constructs across the 104 included empirical studies.

high null result From testbeds to high-stakes work: a review of Human-AI team... performance (and ranked prevalence of constructs like trust, explainability, dec...

Gaming and entertainment, aviation, military and defense operations, emergency response and public safety, and healthcare also represented substantial portions of the literature.

Domain breakdown from the systematic review of 104 empirical studies (frequency counts by domain reported in Results).

high null result From testbeds to high-stakes work: a review of Human-AI team... study_domain_representation_by_industry

Cross-domain and interdisciplinary studies were the largest category, representing broad workplace or team-based investigations not tied to a single industry and instead focused on general collaboration issues such as communication, teamwork, coordination, and coworker interaction.

Categorization / coding of the 104 included empirical studies; frequency counts by study domain reported in review.

high null result From testbeds to high-stakes work: a review of Human-AI team... study_domain_prevalence

We conducted a PRISMA-guided systematic review with bibliometric analysis of 104 peer-reviewed empirical studies published between 2015 and 2025 and identified through Engineering Village, IEEE Xplore, PubMed, ScienceDirect, and Web of Science.

Methods reported in paper: PRISMA-guided systematic review and bibliometric analysis; explicit statement of 104 peer-reviewed empirical studies and databases searched (Engineering Village, IEEE Xplore, PubMed, ScienceDirect, Web of Science).

high null result From testbeds to high-stakes work: a review of Human-AI team... number_of_studies_reviewed

Order, entropy, information, and useful energy are task-dependent and system-relative concepts whose meanings depend on the objectives of the system.

Conceptual argument and discussion in the paper about the context-dependence of informational and energetic notions within the proposed framework; no empirical evidence provided.

high null result Optimal Order of Multi-Agent and General Many-Body Systems other
