Evidence (8974 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other, use the Evidence Explorer instead.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	882	244	117	1097	2424
Governance & Regulation	1010	469	229	135	1875
Organizational Efficiency	977	235	149	90	1462
Technology Adoption Rate	781	299	143	128	1362
Research Productivity	506	155	74	363	1110
Output Quality	555	219	71	70	915
Decision Quality	395	200	95	54	751
Firm Productivity	523	67	101	27	724
AI Safety & Ethics	262	309	75	36	688
Market Structure	195	201	135	30	566
Task Allocation	248	77	96	38	464
Innovation Output	300	34	55	20	411
Skill Acquisition	207	75	65	21	368
Employment Level	138	67	119	24	350
Fiscal & Macroeconomic	156	80	53	33	329
Task Completion Time	211	38	13	16	280
Firm Revenue	183	52	29	5	270
Consumer Welfare	131	77	48	13	269
Inequality Measures	50	141	54	9	254
Worker Satisfaction	104	85	25	13	227
Error Rate	87	112	11	5	215
Automation Exposure	69	69	37	20	198
Wages & Compensation	102	49	31	11	193
Team Performance	115	30	30	11	187
Regulatory Compliance	88	74	17	7	186
Training Effectiveness	109	22	14	21	168
Developer Productivity	116	21	15	8	161
Job Displacement	12	92	26	1	131
Hiring & Recruitment	57	12	9	5	83
Skill Obsolescence	6	59	10	2	77
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	23	17	1	59
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1
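
As a rough illustration of how the table can be read, the sketch below turns a few rows into direction shares. The row values are copied from the table; the function name `direction_shares` and the `unlabelled` field are assumptions introduced here and are not part of the site. For some rows the four direction counts add up to less than the listed total (Other: 882 + 244 + 117 + 1097 = 2340 against a listed 2424), so the sketch divides by the listed total and reports the remainder separately. The page does not say why that remainder exists.

```python
# Rows copied from the table above:
# (outcome, positive, negative, mixed, null, listed total).
ROWS = [
    ("Other", 882, 244, 117, 1097, 2424),
    ("Wages & Compensation", 102, 49, 31, 11, 193),
    ("Job Displacement", 12, 92, 26, 1, 131),
]


def direction_shares(row):
    """Return each direction's share of the listed total for one table row."""
    outcome, positive, negative, mixed, null, total = row
    labelled = positive + negative + mixed + null
    return {
        "outcome": outcome,
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
        # Claims counted in the total but in none of the four columns
        # (hypothetical field; the page does not explain the difference).
        "unlabelled": total - labelled,
    }


for row in ROWS:
    shares = direction_shares(row)
    print(
        f"{shares['outcome']}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative, "
        f"{shares['unlabelled']} unlabelled"
    )
```

Dividing by the listed total rather than by the sum of the four columns keeps the shares comparable with the Total column as displayed, at the cost of shares that may not sum to one.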

Filtered by: Productivity

Evidence also includes pattern matching with documented agentic R&D deployments.

Methodological statement in the paper claiming pattern matching with documented agentic R&D deployments (unspecified number/source).

high null result From Replacement to Orchestration: A Socio-Technical Archite... similarity between proposed design and existing agentic R&D deployments

The study includes a foresight scenario analysis projecting four plausible 2040 R&D futures to stress-test design choices.

Methodological statement in the paper describing a four-scenario foresight analysis.

high null result From Replacement to Orchestration: A Socio-Technical Archite... plausibility and robustness of design across future scenarios

Empirical evidence for the design is triangulated from four semi-structured expert interviews with senior R&D leaders across industrial, healthcare, and academic settings.

Methodological statement in the paper specifying four semi-structured expert interviews.

high null result From Replacement to Orchestration: A Socio-Technical Archite... qualitative expert insights informing design

Because all observations come from a single practitioner, the inferential statistics are exploratory and hypothesis-generating rather than confirmatory; portability across the full portfolio awaits multi-practitioner replication.

Explicit limitation stated in the paper about the single-practitioner design and its implications for inference.

high null result Augment Engineering: A Methodology for Multi-Tool AI Orchest... generalizability/replicability of the findings

The framework is illustrated with an accounts-payable simulation and a companion spreadsheet.

Empirical illustration: the paper includes (or is accompanied by) an accounts-payable simulation and a spreadsheet to demonstrate the model and estimation approach.

high null result Modeling Agentic Technical Debt and Stochastic Tax: A Standa... practical illustration of framework through accounts-payable simulation and spre...

The note starts from a compact dashboard expression, expands it into a fuller structural model, defines all variables and parameters, and shows how each cost category can be estimated from operational data.

Methodological description in the paper: construction of dashboard, expansion to structural model, full variable/parameter definitions, and stated procedures for estimating cost categories from operational data; accompanied by worked examples.

high null result Modeling Agentic Technical Debt and Stochastic Tax: A Standa... methodological capacity to estimate agentic costs from operational data

Agentic Technical Debt is a stock of accumulated design and governance liability.

Definition provided in the paper as part of the conceptual framework that labels Agentic Technical Debt as a stock (accumulated) liability tied to design and governance.

high null result Modeling Agentic Technical Debt and Stochastic Tax: A Standa... conceptual characterization of Agentic Technical Debt (stock of design and gover...

This note develops a formal and managerially usable model that distinguishes Agentic Technical Debt from Stochastic Tax.

Author states development of a formal, managerially usable model and explicit distinction between the two constructs; supported by model construction in the paper (structural model and dashboard).

high null result Modeling Agentic Technical Debt and Stochastic Tax: A Standa... ability to distinguish Agentic Technical Debt from Stochastic Tax via a formal m...

Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workflow integration.

Conceptual/definitional statement in the paper; presented as the working characterization of 'Agentic AI systems' within the model specification.

high null result Modeling Agentic Technical Debt and Stochastic Tax: A Standa... structural composition of agentic AI systems (probabilistic reasoning + delegate...

We evaluate SIA across three contrasting domains: Chinese legal charge classification (LawBench), low-level GPU kernel optimisation, and single-cell RNA denoising.

Experimental design described in the paper (three benchmark domains used for evaluation).

high null result SIA: Self Improving AI with Harness & Weight Updates domains/tasks used for evaluation

We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent.

Methodological contribution described in the paper (proposal of a new combined approach; implementation details presumably in methods).

high null result SIA: Self Improving AI with Harness & Weight Updates capability of an agent to update both harness and weights

These two silos (harness-update and test-time training) operate in isolation.

Authors' characterization of the research landscape presented in the paper (conceptual claim/literature observation).

high null result SIA: Self Improving AI with Harness & Weight Updates degree of integration between research lines

Two largely disjoint research lines attack this bottleneck: the harness-update school (a meta-agent rewrites the scaffold while model weights are fixed) and the test-time training school (hand-written RL pipelines update model weights while the harness is fixed).

Paper's literature/positioning claim classifying prior work into two categories (conceptual/literature summary).

high null result SIA: Self Improving AI with Harness & Weight Updates classification of prior research approaches

Seventeen operators completed continuous search tasks under high cognitive workload while their spatial covariance was mapped using a 2D Adaptive Riemannian Oracle.

Methodological description in the paper: 17 human operators performed continuous search tasks in a Virtual Reality drone task; spatial covariance recorded using a 2D Adaptive Riemannian Oracle.

high null result The Timing Dependencies of Trust: Speed, Accuracy, and cBCI ... experiment sample and measurement modality (operators; spatial covariance mappin...

The paper proposes a policy framework consisting of six groups of solutions for Vietnam to both promote AI development and control risks in the digital age.

Declared in abstract: the paper presents a six-group policy framework for Vietnam; the framework itself is the paper's output (proposal), not empirically tested in the paper.

high null result Regulatory Policy for the Agent Economy in the Digital Age: ... existence of a six-group policy framework aimed at promoting AI development and ...

This study employs document synthesis and comparative analysis of international policies.

Methodological statement in the paper abstract describing the research approach; no sample size specified beyond document sources.

high null result Regulatory Policy for the Agent Economy in the Digital Age: ... research method used (document synthesis and comparative policy analysis)

The rise of artificial intelligence (AI) is shaping a new Agent Economy (AE), in which autonomous AI agents represent humans in performing a wide range of complex tasks.

Statement in paper abstract/intro (conceptual definition); no empirical data or sample reported.

high null result Regulatory Policy for the Agent Economy in the Digital Age: ... existence/definition of Agent Economy (autonomous AI agents representing humans ...

A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads.

Subset analysis of telemetry for a May 2026 trajectory reported by authors; counts of model-completed events and token logs, with cache-read classification.

high null result Persistent AI Agents in Academic Research: A Single-Investig... model-completed events, total recorded tokens, proportion of tokens served from ...

Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events.

Analysis/parsing of memory-derived records from the persistent environment yielding categorized event counts.

high null result Persistent AI Agents in Academic Research: A Single-Investig... counts of output-proxy events and counts of failure/verification/correction/prot...

Active system time was 579.7 hours (30-minute capped-gap estimate).

Computed runtime activity metric from system telemetry/logs over the study period; authors report a 30-minute capped-gap estimate to compute active system time.

high null result Persistent AI Agents in Academic Research: A Single-Investig... active system runtime (hours)
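
The claim above does not spell out how the capped-gap estimate is computed. One common reading, assumed here, is to sum the gaps between consecutive logged events and count any gap longer than the cap as exactly the cap. The sketch below follows that reading; the function name `capped_gap_active_time` and the sample timestamps are hypothetical and do not come from the paper.

```python
from datetime import datetime, timedelta


def capped_gap_active_time(timestamps, cap=timedelta(minutes=30)):
    """Sum gaps between consecutive events, counting each gap as at most `cap`.

    Assumption: this is one plausible reading of a "30-minute capped-gap
    estimate"; the paper's exact procedure is not given in the claim.
    """
    ordered = sorted(timestamps)
    total = timedelta()
    for earlier, later in zip(ordered, ordered[1:]):
        total += min(later - earlier, cap)
    return total


# Hypothetical event log: gaps of 10, 40, and 190 minutes.
events = [
    datetime(2026, 2, 1, 9, 0),
    datetime(2026, 2, 1, 9, 10),
    datetime(2026, 2, 1, 9, 50),
    datetime(2026, 2, 1, 13, 0),
]

active = capped_gap_active_time(events)
print(active.total_seconds() / 3600, "hours")
```

In this toy log the gaps contribute 10, 30, and 30 minutes, so the estimate is 70 minutes. The cap keeps long idle stretches from being counted as active time while still crediting the period around each event.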

The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files.

Inventory of the implemented persistent agent workspace reported by authors as part of the case study (counts extracted from workspace metadata/filesystem).

high null result Persistent AI Agents in Academic Research: A Single-Investig... counts of workspace memory files, agent directories, and skill files

Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages.

Structured self-observed implementation case study (unit: a single persistent human-agent environment) conducted Jan 31–May 25, 2026; authors report recoverable telemetry logs totaling these counts.

high null result Persistent AI Agents in Academic Research: A Single-Investig... number of telemetry records and role-specific messages

Identification limits prevent a strict causal claim; the paper outlines an agenda for cleaner tests.

Authors' explicit caveat in the abstract noting limits to identification and stating they outline future cleaner tests.

high null result Coding Beyond Your Training: Claude Code and the Technologic... causal identification credibility / limitations

The analysis exploits the staggered rollout of Claude Code across GitHub between May 2025 and January 2026, using a panel of 5,838 developers observed monthly over 28 months, with treatment defined by a developer's first Claude-co-authored commit and not-yet-treated developers as controls, and estimates obtained via the doubly robust Callaway and Sant'Anna (2021) estimator.

Methods and data description as stated in the abstract: staggered rollout timing, sample size (5,838), observation window (28 months), treatment definition (first Claude-co-authored commit), estimator (Callaway & Sant'Anna 2021).

high null result Coding Beyond Your Training: Claude Code and the Technologic... study design / identification strategy

Results are robust to two stricter activity filters.

Robustness checks reported in the paper applying two stricter activity filters to the sample; claim refers to consistency of estimated effects under these alternate sample definitions.

high null result Coding Beyond Your Training: Claude Code and the Technologic... sensitivity/robustness of estimated treatment effects to stricter activity filte...

We conducted a global large-scale randomized field experiment, delivering customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions.

Statement in paper describing experimental design and scale: randomized field experiment; sample described as >31,000 preprints, >45,000 researchers, 150 fields, 133 regions.

high null result Human-AI Collaboration in Science at Scale: A Global Large-s... n/a (description of experimental sample and coverage)

There is a significant deficiency in India-centric qualitative investigations on human-AI collaboration in the IT sector.

Authors' review of peer-reviewed literature and secondary data concluding a gap in India-focused qualitative studies (literature gap analysis). No numeric count provided.

high null result Human–AI Collaboration in the Indian IT Industry: A Qualitat... quantity/coverage of India-centric qualitative research

The same bias was not observed when imagining help from another human participant.

Empirical comparison reported in the abstract: predictions about receiving help from another human did not show the same faster-than-reality bias as predictions about AI assistance (from the same preregistered study, N = 1237).

high null result Cognitive offloading and the speedup illusion in human-AI in... predicted completion time when imagining help from another human

Actual completion times between independent completion and AI-assisted completion did not differ.

Empirical result reported in the abstract comparing measured completion times for independent vs. AI-assisted task completion in the preregistered study (N = 1237).

high null result Cognitive offloading and the speedup illusion in human-AI in... actual completion time

We conducted a preregistered large-scale behavioral study (N = 1237) to characterize mismatches between expectations and reality, with a focus on simple cognitive tasks.

Authors report study design and sample size in the abstract: preregistered behavioral experiment with N = 1237 participants.

high null result Cognitive offloading and the speedup illusion in human-AI in... study design / sample size (methodological claim)

The degree of persuasiveness for LLM-based narrative explanations did not meaningfully impact decision accuracy over a simple AI prediction alone.

Large-scale human behavioral experiment comparing decision accuracy with AI prediction alone versus AI prediction plus narrative explanations of varying persuasiveness (method described in paper).

high null result Human Decision-Making with Persuasive and Narrative LLM Expl... decision accuracy

This study is a systematic literature review conducted following PRISMA 2020 guidelines synthesizing peer-reviewed studies published between 2019 and 2025 identified via searches in Scopus, Web of Science and Google Scholar.

Author-stated methodology in the paper: PRISMA 2020 systematic literature review covering 2019–2025 with database searches in Scopus, Web of Science, and Google Scholar.

high null result Yapay Zeka Sistemleri ve İnsan İşbirliğinin Psikolojik, Sosy... scope and coverage of literature search / methodological transparency

This scoping review adhered to the PRISMA-ScR guidelines and encompassed 29 peer-reviewed empirical studies published from 2020 to 2025.

Methods statement in the paper (explicit methodological description).

high null result The influence of AI-Driven Employee Performance Management (... scope and methodological adherence of the review (PRISMA-ScR; n=29 studies)

The paper identifies five major research gaps and proposes future research directions in intelligent international marketing.

Author-reported outcome of the paper's systematic review and content analysis (2010–2025); descriptive claim about the paper's contributions.

high null result Research on International Marketing in the Context of Intell... identification of research gaps and proposed directions

Prior productivity does not predict AI use.

Analysis linking prior productivity measures to reported AI adoption in the Census Bureau survey data; finding of no predictive relationship reported.

high null result The Adoption of Industrial AI in America predictive relationship between prior productivity and AI adoption

The analysis uses a mandatory, purpose-designed Census Bureau survey of approximately 28,500 establishments.

Census Bureau mandatory survey specifically designed for this study; sample size stated as approximately 28,500 establishments.

high null result The Adoption of Industrial AI in America survey_sample_size / data source

When execution is standardized on a cheaper Gemini Flash scaffold (separating planning from execution), a pooled 32-game planner bakeoff is consistent with near-equality (p ≈ 0.821).

Empirical experiment: 32-game planner-only comparison where execution was standardized; reported p-value ≈ 0.821 indicating no significant difference among planners.

high null result Evaluating Large Language Models as Live Strategic Agents: P... planner performance equality (pooled test)

We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles.

Methodological description of the experimental environment used in the paper (timed multi-phase Risk environment with explicit victory targets and repeated cycles).

high null result Evaluating Large Language Models as Live Strategic Agents: P... experimental_environment_description

The study extends the Technology Acceptance Model (TAM), Dynamic Capabilities Theory, and the Technology-Organisation-Environment (TOE) framework into the qualitative, emerging-economy entrepreneurial context.

Authors' stated theoretical contribution based on mapping thematic results to TAM, Dynamic Capabilities, and TOE frameworks within analysis and discussion sections.

high null result Navigating the Intelligence Frontier: AI Adoption as a Succe... theoretical contribution / framework extension

This study employed an interpretivist, qualitative research design using sixteen in-depth semi-structured interviews with entrepreneurs across fintech, edtech, health-tech, logistics, retail, and SaaS in Delhi/NCR, India, and used Braun & Clarke's (2006) six-phase thematic analysis framework.

Explicit methodological description in the paper: interpretivist qualitative design; n=16 in-depth semi-structured interviews across specified sectors in Delhi/NCR; thematic analysis following Braun & Clarke (2006).

high null result Navigating the Intelligence Frontier: AI Adoption as a Succe... research design / data collection (qualitative interviews)

The paper's findings are based on three pre-registered user studies with a combined sample size of N = 2691.

Statement in the paper's abstract reporting three pre-registered user studies and combined N = 2691.

high null result The efficiency-gain illusion: People underestimate the rate ... study_sample_description

Light AI users perform similarly to matched users who do not use AI.

Same controlled logical reasoning experiment with on-demand AI assistance comparing light AI users to matched non-users (sample size not stated in abstract).

high null result The Impact of AI Usage and Informativeness on Skill Developm... post-AI performance / skill development

We map that space through six interconnected elements: sociotechnical context, decision-making frameworks, human decision participants, AI capabilities, interaction, and holistic evaluation.

The paper's proposed analytical/framework contribution listing six elements (descriptive of the authors' mapping work).

high null result Addressing the Synergy Gap: The Six Elements of the Design S... n/a (framework description)

Most current work treats human-AI combination as an engineering problem and concentrates on interpretability, trust calibration, or interface design.

Authors' characterization of the existing literature and dominant research foci (qualitative literature assessment; no quantitative breakdown provided).

high null result Addressing the Synergy Gap: The Six Elements of the Design S... research focus/themes in human-AI combination literature

We call this persistent shortfall the 'synergy gap.'

Terminology/definition introduced by the authors in the paper (conceptual claim, not an empirical finding).

high null result Addressing the Synergy Gap: The Six Elements of the Design S... n/a (terminology defining a phenomenon)

Current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes.

Synthesis of mixed results from controlled studies, meta-analyses, and benchmarks reported in the paper (no single sample size given in abstract).

high null result Agentic Agile-V: From Vibe Coding to Verified Engineering in... engineering outcomes (overall improvement from autonomous code generation)

We compared the traits causing the incidents with the traits that 197 developers building AI systems for those tasks would have preferred.

Study design: comparison between trait set responsible for incidents (from incident reports) and stated developer preferences collected from a sample of 197 developers working on those tasks.

high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... developers' preferred AI system traits (self-reported)

We compared the extracted traits with the traits that 202 workers highly familiar with those tasks would have preferred.

Study design: a comparison between LLM-extracted traits from incident reports and stated preferences from a sample of 202 workers familiar with the tasks.

high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... workers' preferred AI system traits (self-reported preferences)

We used an LLM-as-an-expert approach to extract the main traits of the AI systems involved in those incidents using an established framework of twelve traits.

Methods statement: applied a Large Language Model to code/extract AI system traits from the incident reports using an established 12-trait framework.

high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... trait classification of AI systems involved in incidents

We analyzed 1,524 reports of incidents in which AI systems were used to perform 171 occupational tasks across 12 industry sectors.

Descriptive statement in paper: dataset comprised 1,524 incident reports, covering 171 occupational tasks and 12 industry sectors (dataset construction / corpus used for analysis).

high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... scope and coverage of analyzed incident reports (number of incidents, tasks, and...
