Evidence (8807 claims)

Search and filter individual claims pulled from the papers. This is the place to look for a specific finding (for example, "what's the effect on wages?"). To compare whole outcome categories against each other instead, use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1
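
The counts above can be turned into shares per outcome. The following is a minimal sketch in Python, using a handful of rows copied from the table. The names `ROWS`, `summarize`, and `unaccounted` are introduced here for illustration and are not part of the site. A dash in the table is treated as 0, which is an assumption. For some rows the Total is larger than the sum of the four direction columns; the page does not say why, so the sketch only reports the difference.

```python
# Rows copied from the table above:
# (outcome, positive, negative, mixed, null, total).
# A dash in the table is treated as 0 here (an assumption).
ROWS = [
    ("Other", 870, 233, 116, 1066, 2363),
    ("Governance & Regulation", 976, 451, 218, 133, 1809),
    ("Wages & Compensation", 98, 47, 30, 10, 185),
    ("Job Displacement", 12, 90, 24, 1, 127),
    ("Worker Turnover", 15, 16, 0, 4, 35),
]


def summarize(row):
    """Return direction shares and the part of Total not covered by the four columns."""
    name, pos, neg, mixed, null, total = row
    classified = pos + neg + mixed + null
    return {
        "outcome": name,
        "positive_share": round(pos / total, 3),
        "negative_share": round(neg / total, 3),
        # Hypothetical label: Total minus the four listed direction counts.
        "unaccounted": total - classified,
    }


summaries = [summarize(r) for r in ROWS]
for s in summaries:
    print(s)
```

Shares are computed against the Total column rather than the sum of the four directions, so they stay comparable with the totals shown on the page.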

Productivity

Historical collaboration patterns (CPs) can be represented as knowledge-graph episodic memories and reused: graph representation learning with a node-classification objective identifies a representative and effective memory.

Methodological description in the paper: authors construct knowledge-graph episodic memories from prior CPs and apply graph representation learning with a node-classification objective to select memories for reuse; no external validation beyond the study's experimental tests.

high null result Improving Human-Robot Teamwork in Urban Search and Rescue Th... ability to represent and select prior CPs (method implementation)

The study uses cross-country evidence from OECD Productivity, OECD STAN, OECD Patents, INTAN-Invest, and Functional Urban Areas (FUAs) databases and combines descriptive analysis with panel and robust regression techniques.

Methods statement from the paper (datasets named and methods described in abstract/introduction).

high null result The Illusionary Model of Relative Economic Growth in the Era... methodological approach / data sources

AI patent intensity has weak and statistically insignificant associations with aggregate Total Factor Productivity (TFP).

Cross-country panel and robust regression analysis using OECD Productivity and OECD Patents data (descriptive analysis + panel/robust regressions reported in paper).

high null result The Illusionary Model of Relative Economic Growth in the Era... aggregate Total Factor Productivity (TFP)

The dataset contains large high-resolution images of dimensions 1280x959 and 960x703, which increase the complexity of the annotation task.

Dataset description in the paper specifying image dimensions (1280x959 and 960x703) and noting annotation complexity due to image size.

high null result Speeding up the annotation process in semantic segmentation ... image resolution / annotation complexity

LLM guidance did not increase the total number of victims saved relative to the no-LLM baseline.

Same experimental comparison (two LLM-guided conditions vs no-LLM) in the simulated SAR environment; behavioral measure of total victims saved reported.

high null result LLM-Mediated Human-AI Interaction in Search and Rescue: Impa... total victims saved

A 2015-2017 backward extension (224 firms, 601 observations) supplies pre-treatment data and provides evidence against pre-existing upward-trend confounds in SG&A-to-revenue.

Additional panel extension covering 2015-2017 with 224 firms and 601 firm-year observations, used to test pre-trends.

high null result What Capital After Labor? Forecasting the Talent ROI Transit... absence of pre-existing upward trend in SG&A-to-revenue

South Korea exemplifies national-scale under-augmentation: high human capital (H), substantial AI (A), but low convergence capacity (C) produce phi = 0.

Case/example presented in the paper as an illustrative national example (descriptive/case-study evidence).

high null result Forecasting AI-Era Productivity: The Intellectually Converge... augmentation factor phi (measured as zero for the South Korea example)

The identification strategy exploits the staggered establishment of National Supercomputing Centers (NSCs) as quasi-natural experiments and uses a staggered difference-in-differences model for causal identification.

Methodological design as described in the paper.

high null result Computing power infrastructure and corporate financializatio... identification strategy (method)

The study sample consists of Chinese A-share listed companies from 2012 to 2023.

Paper sample description.

high null result Computing power infrastructure and corporate financializatio... sample/time coverage

The study contributes theoretically by integrating perspectives from productivity economics, public administration, and systemic risk within a sociotechnical systems framework; empirically by providing a comprehensive synthesis of evidence on AI and public sector productivity; and methodologically by applying transparent PRISMA 2020 review procedures.

Author-stated contributions supported by the paper's literature integration, systematic review (68 studies), and use of PRISMA 2020 methods.

high null result AI Adoption in Local Government: Productivity, Systemic Risk... theoretical, empirical, and methodological contributions

This study systematically reviews 68 peer-reviewed empirical studies published between 2015 and 2025 using PRISMA 2020 methodology.

Methods statement in the paper describing the systematic review procedure and sample of included studies.

high null result AI Adoption in Local Government: Productivity, Systemic Risk... number and scope of empirical studies reviewed

Digital transformation, AI adoption, and foreign direct investment (FDI) do not display statistically significant direct effects on export performance in the baseline specification.

Null statistical significance reported for these predictors in the study's baseline pooled OLS / fixed-effects regressions (abstract statement); no specific coefficients reported in abstract.

high null result Internal capabilities, digital transformation, and SME expor... export performance

Across the subsequent ~150 sessions after deploying Baseline-Log Physical Separation, no recurrence of Index Sickness was observed.

Authors' observational report from Bang-v3 following deployment: subsequent ~150 collaborative sessions with no observed recurrence.

high null result Written by AI, Managed by AI: Semantic Space Control and Ind... occurrence_of_Index_Sickness

We used action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions.

Paper statement of study design: action research in Bang-v3, duration ~1 month, 391 collaborative sessions.

high null result Written by AI, Managed by AI: Semantic Space Control and Ind... study_design_and_scope

Whether wear-aware placement improves task value remains open: χ is measured against a value proxy, and the non-monotone optimum, while proven, is not yet observed in data.

Explicit caveat in the paper noting measurement limitations (value proxy) and that the proven non-monotone optimal policy has not been observed empirically. This is a statement about the limits of current empirical evidence.

high null result Memory as a Wasting Asset: Pricing Flash Endurance for Embod... effect_of_wear-aware_placement_on_task_value

The paper proposes an integration framework covering use case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact.

Paper presents a proposed framework (descriptive; the existence of the proposal is internal to the paper).

high null result The Integrator Advantage: Controlled Agentic AI for Small an... framework_components_for_integration

Unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi-step tasks, access tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy.

Descriptive definition and capability listing in the paper (conceptual/technical description). No empirical validation provided.

high null result The Integrator Advantage: Controlled Agentic AI for Small an... capabilities_of_ai_systems

Agentic AI marks a new phase of enterprise automation.

Author's high-level claim in the paper (conceptual/position statement). No empirical data or sample reported.

high null result The Integrator Advantage: Controlled Agentic AI for Small an... emergence_of_technology

We provide a candid assessment of the problems Mojo does and does not yet solve.

Paper claims to include evaluative discussion of limitations and unsolved problems; descriptive statement about paper content.

high null result Mojo: A Promising Tool for Scalable Financial AI Efficiency discussion/assessment of Mojo's current limitations

Larger-scale GPU workload results are projections calibrated from published benchmarks.

Paper states that larger GPU results are not directly measured but are projections calibrated using published benchmarks; no calibration dataset size given in the excerpt.

high null result Mojo: A Promising Tool for Scalable Financial AI Efficiency projected performance for larger-scale GPU workloads

We benchmark four core financial AI workloads: Monte Carlo option pricing, LLM sentiment inference, multi-asset backtesting, and portfolio Value at Risk.

Paper reports that these four workloads were benchmarked (method: benchmarking); number of distinct workload types = 4.

high null result Mojo: A Promising Tool for Scalable Financial AI Efficiency conducting benchmarks on four specified workloads

A qualitative analysis of 384 stratified patches informs a syntactic taxonomy of eight oracle signal categories.

Manual qualitative coding/analysis of a stratified sample of 384 test-file patches, resulting in an eight-category syntactic taxonomy.

high null result All Smoke, No Alarm: Oracle Signals in Agent-Authored Test C... taxonomy of oracle signal categories derived from qualitative analysis

We conduct an empirical study of 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories produced by five coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code.

Dataset collection and descriptive statistics reported by the study (counts of patches, PRs, repositories, and agents analyzed).

high null result All Smoke, No Alarm: Oracle Signals in Agent-Authored Test C... dataset composition (counts of patches/PRs/repos/agents)

The production technology is multiplicative and cognitive capital functions as collateral that determines the return to AI adoption.

Analytic model specification and derivation inside the paper (formal mechanism). No empirical data.

high null result Cognitive Debt: AI as Intellectual Leverage and the Dynamics... return to AI adoption (determined by cognitive capital collateral)

The model features two state variables per agent, cognitive capital and cognitive debt.

Formal theoretical model presented in the paper (model specification). No empirical sample; analytic construction.

high null result Cognitive Debt: AI as Intellectual Leverage and the Dynamics... model_structure (cognitive capital and cognitive debt as state variables)

A pilot deployment in Newham's secure environment evaluated operational performance relative to manual workflows.

Paper reports a pilot deployment and an operational evaluation comparing DOMUS to existing manual workflows (method: pilot deployment; specifics such as duration or sample size not stated in provided text).

high null result Optimising Temporary Accommodation Placement Across London w... operational performance relative to manual workflows

The authors evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both subtask and end-to-end levels on LabOSBench.

Paper reports an evaluation suite covering multiple model classes and evaluation granularities (subtask and end-to-end); specific model identities and counts are not provided in the excerpt.

high null result LabOSBench: Benchmarking Computer Use Agents for Scientific ... evaluation_coverage (model_types and evaluation_levels)

LabOSBench constructs 96 subtasks across eight instrument simulators, covering workflows from sample loading, alignment, parameter tuning, and data acquisition to result inspection.

Explicit specification in the paper of benchmark breadth: 96 subtasks and 8 simulators including the listed workflow stages.

high null result LabOSBench: Benchmarking Computer Use Agents for Scientific ... benchmark_scope (number_of_subtasks, number_of_simulators)

Scientific instrumentation scenarios require coordinated control over complex interfaces and feedback-driven parameter adjustment.

Conceptual claim in the paper describing the nature of scientific-instrument operation; not supported by empirical data in the excerpt.

high null result LabOSBench: Benchmarking Computer Use Agents for Scientific ... requirements_of_instrument_control

Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems.

Statement in paper framing the problem; based on literature/field observation rather than reported experiments or data in this excerpt.

high null result LabOSBench: Benchmarking Computer Use Agents for Scientific ... benchmark_scope

In evaluation runs, the evaluated model controls one coffee roaster while the remaining firms are controlled by fixed reference agents.

Experimental setup described in the paper: one roaster is controlled by the model under test; the other five firms use fixed reference agents.

high null result CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterog... agent-control configuration

Each firm in CoffeeBench seeks to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing.

Specification of agent objectives and state variables in the benchmark design (cumulative net income objective; resources: cash, inventory; decision variables: pricing and transactions).

high null result CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterog... cumulative net income

CoffeeBench simulates an economy of two farmers, two roasters, and two retailers operating autonomously over a 90-day simulation.

Environment description in the paper specifying the number and types of firms and the 90-day simulation horizon.

high null result CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterog... environment structure (number and types of agents and simulation horizon)

Frontier costs have stayed relatively stable between 2024 and 2026.

Authors' reported trend analysis of frontier model costs across the 2024–2026 period; excerpt lacks numeric cost trend data and sample size.

high null result WorkBench Revisited: Workplace Agents Two Years On trend in frontier model costs over time

Every output is valuable (in H), trivial (in F \ H), or a hallucination (not in F).

Definition/assumption in the paper's formal model classifying outputs into three mutually exclusive categories relative to the formal language F and the valuable language H.

high null result Flood and Harvest: The Provable Necessity of Trivia for Gene... categorization of generated outputs

The empirical analysis used archival microdata from 770 large Spanish firms and employed staged OLS regression models.

Statement of data source and method in the paper's abstract.

high null result Beyond AI Adoption: An Empirical Study on the Antecedents an... methodological description (data and analytical approach)

The complementarity between AI deployment depth and breadth offers a configurational explanation for the AI productivity paradox.

Theoretical interpretation plus empirical finding of a positive interaction between depth and breadth in staged OLS analyses of archival microdata from 770 large Spanish firms.

high null result Beyond AI Adoption: An Empirical Study on the Antecedents an... explanation for AI productivity paradox (interpretive/theoretical outcome)

AI capability can be conceptualized as two-dimensional: AI deployment depth (technological variety of AI implementations) and AI deployment breadth (organizational scope of AI diffusion).

Theoretical framing drawing on Resource-Based Theory and organizational search theory; conceptual argument presented in the paper.

high null result Beyond AI Adoption: An Empirical Study on the Antecedents an... conceptualization: AI deployment depth and breadth

Those preliminary experiments do not establish behavior preservation, scaling economics, or verified-change cost.

Authors' explicit limitation statement following the preliminary QLoRA experiments.

high null result No Accidental Software Agent First Canonical Code for Human ... establishment of behavior preservation / scaling economics / verified-change cos...

The review focuses on three core dimensions of impact: employee attitudes (job satisfaction, motivation, adaptability), workplace behaviours (performance, creativity, technology adoption), and organisational dynamics (leadership, trust, team cohesion).

Stated scope and focus areas in the abstract describing the review's analytical framework.

high null result Emotional AI in the Workplace: Systematic Review of Effects ... scope of outcomes assessed (attitudes, behaviours, organisational dynamics)

The study contributes a structured framework that clarifies the role of emotional AI in organisational contexts and outlines actionable, scalable strategies for real-world application.

Authors claim to have developed and presented a framework and strategies as part of the review paper (a descriptive/conceptual contribution rather than empirical evidence).

high null result Emotional AI in the Workplace: Systematic Review of Effects ... framework and recommended strategies (conceptual contribution)

The study identifies key patterns, methodological trends, and underexplored areas in research on emotional AI systems in organisational contexts.

Authors report findings of their comparative analysis of the state-of-the-art literature (exact patterns/trends and counts are detailed in the full review).

high null result Emotional AI in the Workplace: Systematic Review of Effects ... research patterns and methodological trends

This study follows the PRISMA framework to conduct a systematic evaluation and comparative analysis of the state-of-the-art literature on emotional AI in organisations.

Statement of methods in the abstract that the review used PRISMA; implies structured search, screening, and selection procedures described in the full paper.

high null result Emotional AI in the Workplace: Systematic Review of Effects ... methodological approach (use of PRISMA for systematic review)

The literature on AI-powered emotional intelligence systems is fragmented and insufficiently synthesised.

Authors' assessment based on a systematic literature review conducted following the PRISMA framework (details of databases, search terms, and included studies reported in the paper); exact number of studies not stated in the abstract.

high null result Emotional AI in the Workplace: Systematic Review of Effects ... state of the literature (comprehensiveness / synthesis)

The paper derives tight conditions that determine whether the economy is partially versus fully automated in the long run.

Analytical characterization in the model: derivation of necessary and sufficient (tight) conditions separating long-run partial automation from full automation (mathematical proofs within the paper).

high null result Data-Driven Automation long-run automation regime (partial vs full)

Data accumulates endogenously as a byproduct of economic activity.

Model assumption and mechanism in the theoretical dynamic model: data generation is modeled as an endogenous outcome of agents' economic activity (analytical model specification).

high null result Data-Driven Automation endogenous data accumulation

Data is heterogeneous and task-specific.

Model assumption stated in the paper's setup: the model is built with data that varies across tasks and is task-specific (analytical model specification).

high null result Data-Driven Automation data heterogeneity (task-specificity)

Density-normalized outcomes (e.g., smells per LOC) can mislead when treatment affects system size; raw counts and explicit decomposition are required for causal mining studies of AI tool adoption.

Interpretation and methodological recommendation derived from the observed pattern (unchanged smell counts + increased LOC leading to lower density) in the paper's empirical results.

high null result Mining Architectural Quality Under Agentic AI Adoption: A Ca... validity of density-normalized metrics (e.g., smells/LOC) under treatment that c...
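
The arithmetic behind the density caveat in the claim above can be sketched with made-up numbers. The figures below are hypothetical and are not taken from the paper; `pct_change` and `density` are illustrative helper names. The sketch shows how a nearly unchanged raw smell count combined with a growing codebase produces a falling smells-per-LOC figure.

```python
# Hypothetical repository metrics before and after adoption.
# Illustrative only: these numbers are NOT from the paper.
before = {"smells": 100, "loc": 50_000}
after = {"smells": 101, "loc": 60_000}


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


def density(metrics):
    """Smells per 1,000 lines of code."""
    return metrics["smells"] / metrics["loc"] * 1000


count_change = pct_change(before["smells"], after["smells"])
loc_change = pct_change(before["loc"], after["loc"])
density_change = pct_change(density(before), density(after))

print(f"raw smell count: {count_change:+.1f}%")
print(f"lines of code:   {loc_change:+.1f}%")
print(f"smell density:   {density_change:+.1f}%")
```

With these inputs the raw count rises by 1% while density falls by roughly 16%, purely because the denominator grew. Reporting the count and size changes separately, as here, is one simple form of the decomposition the claim recommends.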

Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the main pattern; pre-trends are flat (Wald p = 0.90), consistent with the parallel trends assumption.

Placebo and robustness analyses reported in the paper (per-type breakdowns and multiple sensitivity checks) applied to the 151-repository panel; pre-trend test result reported as Wald p = 0.90.

high null result Mining Architectural Quality Under Agentic AI Adoption: A Ca... pre-treatment trends and robustness of estimated effects

Total architectural smell counts are essentially unchanged after adoption (+1.1%, p = 0.82).

Estimated treatment effect from staggered DiD / Borusyak imputation on total smell counts using the 151-repository panel (74 treated, 77 controls).

high null result Mining Architectural Quality Under Agentic AI Adoption: A Ca... total architectural smell counts
