Evidence (8807 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other, use the Evidence Explorer instead.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1
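As a rough illustration of how the rows above can be read, here is a minimal Python sketch that parses one tab-separated row and computes the share of each direction. The function names `parse_row` and `direction_shares` are hypothetical, and treating a dash cell as zero is an assumption (it is consistent with the Total column in the Worker Turnover and Industry rows). In some rows the four direction counts sum to less than Total (for example Other: 2285 against 2363), so the sketch computes shares over the listed counts only and reports the remainder separately.

```python
DIRECTIONS = ("Positive", "Negative", "Mixed", "Null")


def parse_row(line):
    """Parse one row: Outcome, Positive, Negative, Mixed, Null, Total (tab-separated)."""
    outcome, *cells = line.rstrip("\n").split("\t")
    # Assumption: a dash (or empty cell) means no claims in that direction.
    counts = [0 if c.strip() in {"—", "-", ""} else int(c) for c in cells]
    row = {"Outcome": outcome}
    row.update(zip(DIRECTIONS + ("Total",), counts))
    return row


def direction_shares(row):
    """Return each direction's share of the listed direction counts,
    plus the number of claims in Total not covered by those counts."""
    listed = sum(row[d] for d in DIRECTIONS)
    shares = {d: row[d] / listed for d in DIRECTIONS} if listed else {}
    unlisted = row["Total"] - listed
    return shares, unlisted


row = parse_row("Job Displacement\t12\t90\t24\t1\t127")
shares, unlisted = direction_shares(row)
print(row["Outcome"], round(shares["Negative"], 3), unlisted)
```

For the Job Displacement row, 90 of the 127 listed claims are negative, which is about 71%.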

Productivity

Temporary accommodation has become a major fiscal and administrative pressure for English local authorities, particularly in London, where demand and costs have risen sharply.

Statement in paper introduction/background; contextual claim based on administrative observations and cited motivation for building DOMUS (no specific sample size or numerical data reported in the provided text).

high negative Optimising Temporary Accommodation Placement Across London w... demand and costs of Temporary accommodation

Directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and difficulty in ensuring reproducible evaluation.

Argument presented by authors as motivation for creating a simulated testbed; no empirical cost/safety/accessibility metrics provided in the excerpt.

high negative LabOSBench: Benchmarking Computer Use Agents for Scientific ... practicality_of_physical_evaluation

Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans.

Qualitative observation from agent trajectories showing that Claude Haiku 4.5 repeatedly selects no-action decisions over time while still generating coherent internal assessments and plans.

high negative CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterog... action frequency / inaction behavior (idle-drift) and coherence of assessments/p...

Frontier models still make some basic mistakes that occasionally result in irreversible harm (for example, sending an email to the wrong person).

Reported observed incidents from WorkBench evaluations indicating that even top-performing models sometimes make mistakes that can cause irreversible harm; no incident counts or sample size provided in the excerpt.

high negative WorkBench Revisited: Workplace Agents Two Years On incidence of serious irreversible errors (e.g., misdirected emails)

In June 2026 the best agent to date, Claude Opus 4.8, took an unintended harmful action on 2.5% of tasks.

Reported evaluation result on the WorkBench benchmark (June 2026) measuring incidence of unintended harmful actions by agents; exact sample size not stated in the excerpt.

high negative WorkBench Revisited: Workplace Agents Two Years On rate of unintended harmful actions

In March 2024 the best agent on WorkBench, GPT-4, took an unintended harmful action (such as emailing the wrong person) on 26% of tasks.

Reported evaluation result on the WorkBench benchmark (March 2024) measuring incidence of unintended harmful actions by agents; exact sample size not stated in the excerpt.

high negative WorkBench Revisited: Workplace Agents Two Years On rate of unintended harmful actions

A lack of strategic alignment is a critical barrier that leads AI initiatives to be unused despite technical success.

Paper identifies misalignment between AI projects and organizational strategy as a failure mode in its failure analysis; methodological details not specified in the summary.

high negative Zombie Ai Investments: From Technical Success To Business Fa... alignment with business strategy / adoption and value realization

User resistance is a critical barrier that prevents AI initiatives from delivering business value.

Paper lists user resistance among critical barriers based on analysis of failed projects; no sample size or quantitative method stated in the summary.

high negative Zombie Ai Investments: From Technical Success To Business Fa... user uptake / adoption of AI systems

Siloed deployments are a critical barrier causing AI initiatives to remain unused.

Identified in the paper's analysis of failure modes; presented as a key barrier (method and sample size not provided in the summary).

high negative Zombie Ai Investments: From Technical Success To Business Fa... barriers to adoption / integration of AI

AI initiatives meet functional specifications yet remain unused.

Paper's examination of AI projects that passed technical/functional tests but were not adopted; method/sample size not stated in the summary.

high negative Zombie Ai Investments: From Technical Success To Business Fa... actual usage / adoption of deployed AI systems

Organizations struggle with "zombie AI investments" that succeed technically but fail to generate tangible business value.

Paper's analysis of failed AI initiatives and described observations; methods not specified in the summary (likely qualitative/case analysis).

high negative Zombie Ai Investments: From Technical Success To Business Fa... generation of tangible business value / adoption of AI outputs

A perfect verifier cannot substitute for taste: the unbounded stream of correct-but-worthless statements is not an engineering accident but a provable necessity, since covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.

Theoretical proof within the model that achieving coverage of unrecorded valuable mathematics necessitates an infinite stream of verifiable-but-trivial outputs; argument that these outputs must be asymptotically negligible in rate yet unbounded in total count.

high negative Flood and Harvest: The Provable Necessity of Trivia for Gene... necessity of an unbounded stream of trivial (correct-but-worthless) outputs to c...

The verifier is not taste: the collections admitting generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition.

Theoretical model and proofs in the paper comparing a verifier-equipped nested-language generation model to an oracle-free model; characterization via Angluin's condition (formal, fiber-wise).

high negative Flood and Harvest: The Provable Necessity of Trivia for Gene... which collections (languages) can be generated with breadth

Those valuable signals are entangled with framework churn, naming drift, generated-source ambiguity, dependency rituals, CI dialects, weak proof routes, and human-oriented review customs.

Qualitative claim/analysis in the paper describing entanglement of signal and accidental complexity; no empirical quantification provided.

high negative No Accidental Software Agent First Canonical Code for Human ... degree of entanglement between signal and accidental repository noise

Frontier coding models may spend substantial capacity learning not only program behavior, but also accidental entropy in human repositories.

Conceptual/theoretical assertion presented in the paper (no empirical sample or experiment reported for this claim).

high negative No Accidental Software Agent First Canonical Code for Human ... model_capacity_usage (learning wasted/noisy patterns)

The review highlights critical challenges related to privacy, emotional surveillance, algorithmic bias, and employee trust associated with emotional AI in the workplace.

Aggregated observation from the systematic review; these concerns are reported as recurring themes across the surveyed literature (specific counts/examples not given in the abstract).

high negative Emotional AI in the Workplace: Systematic Review of Effects ... privacy concerns / emotional surveillance / algorithmic bias / employee trust

The economy is generically inefficient (under the laissez-faire equilibrium) and a planner can optimally tilt the direction of data accumulation to improve outcomes.

Welfare analysis within the model: comparison of decentralized equilibrium and planner's problem, demonstrating inefficiency and characterizing planner's optimal policy for directing data accumulation (analytical welfare results).

high negative Data-Driven Automation welfare/efficiency; direction of data accumulation under planner vs equilibrium

In the fully automated long-run case, short-run dynamics depend on the pattern of data spillovers, but automation is always slow in the long run: the share of tasks produced by labor decays asymptotically as a power law in time.

Analytical asymptotic result from the dynamic model showing that, under full automation, the labor-produced task share follows a power-law decay; short-run behavior is shown to depend on spillover structure (model derivation and asymptotic analysis).

high negative Data-Driven Automation share of tasks produced by labor over time (decay rate)

At min-cost, Brick incurs 11.85 points accuracy loss.

Empirical evaluation on the 5,504-query benchmark reporting accuracy loss at the min-cost operating point.

high negative Brick: Spatial Capability Routing for the Mixture-of-Models ... accuracy loss (percentage points)

Frontier models cost ten to one hundred times more than local open-weight models.

Cost comparison statement in the paper (asserted market/commercial cost multiples).

high negative Brick: Spatial Capability Routing for the Mixture-of-Models ... model_inference_cost

Existing LLM routers rely on surface features such as domain labels, keywords, and token count, ignoring the within-domain variance that actually determines model success.

Claim about prior work / existing systems presented in the paper; no explicit empirical test shown in the abstract.

high negative Brick: Spatial Capability Routing for the Mixture-of-Models ... router_feature_use_vs_within-domain_variance

AI adoption is significantly hampered by a lack of workforce skills and supporting infrastructure in these accounting organizations.

Qualitative interview findings and questionnaire responses synthesized via thematic analysis and inferential/statistical analysis (sample size not reported).

high negative Utilization of Artificial Intelligence Technology among Acco... barriers to AI adoption (skills and infrastructure)

Accounting organizations in the study are still in the early stages of AI adoption.

Synthesis of questionnaire and interview findings with thematic analysis indicating limited breadth/depth of AI use (sample size not reported).

high negative Utilization of Artificial Intelligence Technology among Acco... stage/level of AI adoption

AI is used mainly for repetitive and routine accounting tasks, with very little use for higher-level work.

Questionnaire responses and interview data summarized with descriptive statistics and thematic analysis (sample size not reported).

high negative Utilization of Artificial Intelligence Technology among Acco... types of tasks for which AI is used (routine vs higher-level)

The absence of standardized data governance policies and localized, language-accessible software platforms exacerbates the technological divide in digital agriculture.

Review synthesis identifying governance and software-localization as structural barriers; no empirical governance-audit sample sizes provided in the abstract.

high negative Digital Agriculture and Smart Farming: A Review of Emerging ... technological divide / barriers to adoption linked to governance and software lo...

In India, where the sector is dominated by smallholder farmers with fragmented landholdings, the transition to digital agriculture is significantly hindered by severe economic constraints, a lack of robust rural digital infrastructure, and pervasive digital illiteracy.

Targeted review analysis focusing on the Indian agricultural context; claim draws on country-specific literature but the abstract does not report specific empirical sample sizes or quantified barriers.

high negative Digital Agriculture and Smart Farming: A Review of Emerging ... adoption of digital agriculture technologies by Indian smallholder farmers

Despite these proven agronomic and environmental benefits, the global diffusion of digital agriculture remains highly uneven.

Review assertion based on cross-study synthesis that diffusion/adoption is not uniform globally; abstract provides no country-by-country adoption statistics or sample sizes.

high negative Digital Agriculture and Smart Farming: A Review of Emerging ... diffusion/adoption of digital agriculture

API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces.

Author assertion contrasting API-based approaches with GUI and COM approaches (conceptual/architectural argument rather than specific experiment).

high negative ComAct: Reframing Professional Software Manipulation via COM... difficulty/adoption barriers due to heterogeneous protocols and inaccessible com...

GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation.

Author assertion in paper introduction describing limitations of GUI-based agents (conceptual analysis / literature-grounding rather than new experimental data).

high negative ComAct: Reframing Professional Software Manipulation via COM... fragility of visual grounding and accumulation of errors over long-horizon GUI i...

An unconstrained multi-agent baseline produced critical failures in 72% of runs.

Reported experimental result from the 2x4 factorial experiment (failure rate for the unconstrained multi-agent baseline reported as 72%).

high negative (Human) Attention Is (Still) All You Need: Human oversight m... critical failure rate (binary outcome: critical failure vs. not)

The bottleneck is often not model capability but missing project memory.

Assertion made in the abstract without accompanying quantitative evidence in the abstract.

high negative PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment... primary_bottleneck_for_ai_coding_agents

Reconstructing this context can consume an estimated 5,000-20,000 tokens per session.

Statement in paper abstract presenting an estimate (no detailed method or sample described in the abstract).

high negative PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment... context_size_in_tokens_per_session

Existing benchmarks for time-series forecasting focus solely on prediction error metrics; the decision utility of advanced forecasting (foundation) models remains unverified.

Authors' literature/benchmark review and critique presented in the paper.

high negative CloudCons: A Comprehensive End-to-End Benchmark for Cloud Re... coverage of evaluation metrics (prediction error vs decision utility)

Resource utilization in cloud data centers remains at low levels due to conservative over-provisioning to guarantee service reliability.

Stated as background motivation in the paper (literature/operational observation); asserted by authors as common industry phenomenon.

high negative CloudCons: A Comprehensive End-to-End Benchmark for Cloud Re... resource utilization (low levels) driven by over-provisioning

With the instruction files, 26.35% of the projects decreased their merge rate.

Reported proportion of projects showing a decrease in merge rate after creating instruction files based on the pre/post comparison of projects in the dataset (148 projects, 15,549 PRs).

high negative Toward Instructions-as-Code: Understanding the Impact of Ins... merge rate of agentic pull requests (projects showing decrease)

Xie et al. (2026) show experimentally that job candidates are less satisfied with firms using AI evaluators than with human experts due to perceived loss of control; the negative effect is stronger for individuals with an internal locus of control.

Experimental study on recruitment using control theory as described (sample size not provided).

high negative Guest editorial: Digital age wisdom in Chinese management: a... candidate satisfaction with recruitment process

In the healthcare sector, Chou et al. (2026a, 2026b) identify AI anxiety as a multifaceted hurdle to adoption; emotional affect and outcome expectations are essential influences on usage intentions (two-stage SEM-ANN approach).

Two-stage SEM–ANN modeling grounded in social cognitive theory as reported; empirical data specifics not provided in text.

high negative Guest editorial: Digital age wisdom in Chinese management: a... usage intentions for AI in healthcare

Liu et al. (2026a, 2026b) find experimentally that the severity of AI service failure in hotel contactless services significantly decreases customers' forgiveness willingness, but high levels of brand attachment mitigate this negative effect.

Experimental studies in hotel contactless service contexts (details and sample sizes not provided in the text).

high negative Guest editorial: Digital age wisdom in Chinese management: a... forgiveness willingness following AI service failure

Allowing AI to take the lead in strategic decision-making without human wisdom may be inappropriate due to AI's inability to navigate tacit knowledge and ethical nuances in Chinese management wisdom.

Argumentative claim based on cited literature (e.g., De Cremer and Kasparov, 2021; Del Giudice et al., 2023) and authors' synthesis.

high negative Guest editorial: Digital age wisdom in Chinese management: a... appropriateness/effectiveness of AI-led strategic decision-making

Developers reject fixes for (a) incorrect implementation (e.g., incomplete, wrong approach), (b) fixes that do not pass CI pipelines and fail tests, (c) fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and (d) fixes whose priority is low.

Observed categories from the qualitative analysis of the 306 non-merged PRs described in the study.

high negative Understanding the Rejection of Fixes Generated by Agentic Pu... reasons for rejection of agent-generated fixes (implementation correctness, CI/t...

The qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes.

Result of the paper's qualitative analysis on the representative sample (306 non-merged PRs).

high negative Understanding the Rejection of Fixes Generated by Agentic Pu... number and categorization of reasons for rejection

From a first exploration of the AIDev dataset, 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected.

Empirical analysis of the AIDev dataset reported by the authors; agents named explicitly (Copilot, Devin, Cursor, Claude).

high negative Understanding the Rejection of Fixes Generated by Agentic Pu... proportion of proposed fixes that are rejected

Existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost.

Comparative analysis of performance versus computational cost across evaluated systems showing limited marginal gains despite higher cost (authors' analysis across experiments).

high negative The Illusion of Multi-Agent Advantage marginal utility (performance gains per unit cost)

Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), automatically generated MAS consistently underperform Chain-of-Thought with Self-Consistency (CoT-SC) despite being up to 10x more expensive.

Systematic empirical evaluation comparing automatically generated MAS to CoT-SC across multiple task suites including traditional reasoning datasets and interactive multi-step workflows such as BrowseComp-Plus (experimental comparisons reported in the paper).

high negative The Illusion of Multi-Agent Advantage task performance (accuracy/quality) and computational cost

Empirical support for MAS superiority relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess MAS advantages.

Critical literature review and analysis of prior empirical evaluations (authors' claim about the composition and limitations of existing benchmarks).

high negative The Illusion of Multi-Agent Advantage adequacy of existing benchmarks for evaluating MAS advantages

More than 70% of respondents cite organisational resistance as a barrier to digital adoption.

Industry MRO digital survey reported in the paper (more than 70% reported); method = secondary evidence from an industry MRO digital survey. Sample size not stated in abstract.

high negative Aviation 4.0: the impacts of digital transformation on the a... prevalence of organisational resistance cited as barriers

Over 80% of respondents cite data limitations as a barrier to scaling digital implementations.

Industry MRO digital survey reported in the paper (over 80% reported); method = secondary evidence from an industry MRO digital survey. Sample size not stated in abstract.

high negative Aviation 4.0: the impacts of digital transformation on the a... prevalence of data limitations cited as barriers

Only 6% of MROs have scaled digital and analytics across the enterprise.

Industry MRO digital survey reported in the paper (6% reported); method = secondary evidence from an industry MRO digital survey. Sample size not stated in abstract.

high negative Aviation 4.0: the impacts of digital transformation on the a... enterprise-scale implementation of digital and analytics

Benchmarking multiple state-of-the-art open and closed source VLMs on our evaluation framework demonstrates substantial limitations in current engineering reasoning capabilities.

Empirical claim based on the paper's benchmarking experiments using the EngVQA dataset and the 8-stage framework (models and detailed results not provided in the excerpt).

high negative Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise ... engineering reasoning capabilities of state-of-the-art VLMs

Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes.

Claim in paper contrasting EngVQA's process-oriented evaluation with prior benchmarks (literature/benchmark review claim; no specific benchmarks or quantitative comparison provided in the excerpt).

high negative Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise ... extent to which benchmarks assess intermediate reasoning processes
