Evidence (5157 claims)
Adoption: 7395 claims
Productivity: 6507 claims
Governance: 5877 claims
Human-AI Collaboration: 5157 claims
Innovation: 3492 claims
Org Design: 3470 claims
Labor Markets: 3224 claims
Skills & Training: 2608 claims
Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Filter: Human-AI Collaboration
These characteristics are properties of the tasks themselves rather than limitations of current AI models.
Conceptual argument in the paper asserting task-inherent properties drive resistance to automation; supported by theory and argumentation, not by empirical model-comparison experiments.
The resistance of Metis tasks to automation is not due to computational intractability but to institutional, social, and normative entanglements.
Theoretical argument differentiating computational from institutional/social/normative causes; supported by citations and cross-disciplinary theory rather than empirical causal identification.
There exists a class of entirely digital tasks, called 'Metis AI', that resist reliable AI automation.
Conceptual identification and definition introduced by the authors; supported by theoretical grounding in social sciences, philosophy, and humanitarian practice rather than empirical trials or quantified samples.
The digital-vs-physical framing misses the most consequential boundary: the one within digital tasks.
Normative/theoretical argument presented in the paper contrasting existing framing with a proposed alternative; grounded in cross-disciplinary literature rather than empirical measurement.
Parsing through LLM-generated code can be tedious and time-consuming, potentially negating the productivity gains promised by AI-coding tools.
Motivation/background statement in the paper: a qualitative claim about the cost (time/effort) of reviewing LLM-generated code; presented as motivation rather than empirically quantified evidence in the excerpt.
Employees experience technostress, anxiety and micro-political negotiation around AI tools in everyday work.
Reported experiences from semistructured interviews with 28 managers/professionals across 12 organizations; thematic analysis highlighting technostress and anxiety as themes.
An analysis of a 21-instrument inventory identifies an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification.
Empirical/qualitative analysis of an inventory of 21 governance instruments compiled and analysed in the paper (n=21 instruments).
Behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify.
The paper's normative and conceptual argument synthesising governance requirements and the epistemic limits of behavioural testing.
Current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify latent representations or long-horizon agentic behaviours.
Conceptual/analytic argument and review of existing assurance methodologies presented in the paper.
Distinct readability issue patterns and limited effectiveness of prompt engineering reveal a latent technical debt in LLM-generated code that could affect long-term maintainability.
Interpretation/conclusion in paper combining empirical findings (distinct issue patterns and limited prompt impact) to argue for potential technical debt and maintainability risks; presented as a forward-looking implication rather than a quantified causal estimate.
LLM-generated code displays distinct readability issue patterns compared to human-written code.
Empirical analysis of readability subcomponents/features showing different patterns of readability issues between LLM-generated and human-written code (paper reports qualitative/quantitative distinctions in issue patterns).
Increased levels of AI assistance may degrade productivity, leading to potentially significant shortfalls under the model's identified conditions.
Model-based comparative-statics and steady-state analysis showing scenarios where marginal increases in AI assistance reduce expected task output; examples/parameter illustrations provided in the paper (theoretical, no empirical sample).
Introducing AI unreliability (errors/noise in AI outputs) in the model can also generate a productivity paradox: greater AI assistance may lower productivity.
Analytical/theoretical model incorporating AI unreliability; model derivations and examples demonstrating conditions under which unreliability leads to reduced productivity (no empirical data).
Incorporating endogeneity in skill development into the model can induce a productivity paradox where increased AI assistance reduces productivity.
Analytical/theoretical model of human-AI interaction with utility-maximizing human agents and endogenous skill development; steady-state and comparative-static analysis reported in the paper (no empirical sample).
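A minimal worked illustration of how such a paradox can arise, assuming a simple delegation-plus-skill-erosion form rather than the paper's actual model (the symbols a, q, s̄, and γ below are assumptions of this sketch):

```latex
% Stylized sketch, not the paper's model.
% a = share of tasks delegated to AI, q = AI reliability,
% s*(a) = human steady-state skill, eroding as practice (1-a) falls.
\[
  s^{*}(a) = \bar{s}\,(1-a)^{\gamma}, \qquad
  Y(a) = a\,q + (1-a)\,s^{*}(a) = a\,q + \bar{s}\,(1-a)^{1+\gamma},
\]
\[
  Y'(a) = q - (1+\gamma)\,\bar{s}\,(1-a)^{\gamma}.
\]
```

Under these assumed forms the marginal effect of assistance is negative whenever q < (1+γ)·s̄·(1−a)^γ: raising the level of AI assistance lowers expected output even though the AI may be more reliable than the eroded human skill it replaces.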
Simulated users produce feedback dynamics that diverge from humans.
Temporal/interaction analysis in the replication showing differences in how simulators provide feedback across multi-turn interactions compared to humans.
Simulated users exhibit amplified position biases relative to human participants.
Behavioral comparison in the simulator replication showing stronger position biases in simulated responses than in human responses.
Simulated users discuss different topics compared to the human participants.
Analysis of conversation content in the simulator replication showing differences in topical distribution between simulators and humans.
Simulators perform far below human self-consistency baselines for individual judgements.
Comparison in the replication study between simulator consistency and human self-consistency on individual-level judgments; reported large performance gap (simulators far below humans).
Amplified sycophancy and relationship-seeking behaviours may introduce deleterious long-term consequences.
Authors' interpretation and cautionary note based on observed behavioral amplification after fine-tuning; presented as potential long-term risk rather than an empirically measured long-term outcome.
In a controlled experiment across six industry configurations (72 tool invocations using Qwen3-32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers.
Controlled experiment reported in the paper: six industry configurations, 72 tool invocations, model used: Qwen3-32B; reported unconstrained parameter condition resulted in 43% hallucination rate for domain identifiers.
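A minimal sketch of the kind of guardrail this contrast implies: constraining the tool parameter to a closed set of known domain identifiers instead of accepting free text. The tool name, schema fields, and identifier values below are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch: unconstrained vs. constrained tool-parameter schemas for a
# domain-identifier argument, plus a validator that rejects unknown values.
# All names and identifiers are hypothetical.

KNOWN_DOMAIN_IDS = {"plant-berlin-01", "line-assembly-3", "cell-weld-7"}  # hypothetical

# Unconstrained: any string the model emits is accepted, so hallucinated
# identifiers flow straight into downstream queries.
unconstrained_schema = {
    "name": "query_equipment_status",
    "parameters": {
        "type": "object",
        "properties": {"domain_id": {"type": "string"}},
        "required": ["domain_id"],
    },
}

# Constrained: an enum restricts the parameter to identifiers that exist.
constrained_schema = {
    "name": "query_equipment_status",
    "parameters": {
        "type": "object",
        "properties": {
            "domain_id": {"type": "string", "enum": sorted(KNOWN_DOMAIN_IDS)},
        },
        "required": ["domain_id"],
    },
}

def validate_call(arguments: dict) -> str:
    """Reject tool calls whose domain_id is not a known identifier."""
    domain_id = arguments.get("domain_id")
    if domain_id not in KNOWN_DOMAIN_IDS:
        raise ValueError(f"unknown (possibly hallucinated) domain_id: {domain_id!r}")
    return domain_id

validate_call({"domain_id": "line-assembly-3"})    # passes
# validate_call({"domain_id": "line-assembly-99"}) # raises ValueError
```

Constraining the schema turns a hallucinated identifier into an immediate, attributable validation error rather than a silently wrong downstream query.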
In multi-agent configurations the semantic training gap produces a compounding failure mode termed 'semantic drift'.
Analytical description and demonstration in the paper describing multi-agent interactions and observed/argued compounding failures (conceptual demonstration; no numeric sample stated).
The semantic training gap causes operationally incorrect outputs even when model responses are linguistically precise.
Demonstrations and examples reported in the paper showing cases where model outputs are linguistically fluent but operationally incorrect; supported by the paper's analysis and experimental illustrations (no numeric sample provided for this general claim).
There exists a 'semantic training gap': a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships.
Paper provides a formalization and conceptual framing of the gap (theoretical description and argumentation within the manuscript).
LLM-based AI agents deployed in manufacturing demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics.
Stated assertion in the paper describing observed behavior of deployed LLM agents; supported by conceptual analysis and examples/demonstrations reported in the paper (no numeric sample size given).
AI integration simultaneously increases labor concerns about skill obsolescence by 33%.
Reported as a survey/result in the paper; the study includes surveys of 800 marketers (self-reported concerns about skill obsolescence are likely derived from that survey sample).
Rising data velocity renders legacy systems obsolete—threatening approximately $3.4 trillion in global marketing spending.
Paper reports an estimate/claim about threatened global marketing spending tied to legacy systems becoming obsolete (derivation likely from the study's quantitative analysis or economic estimate described in the paper).
62% of teams suffer from "AI paralysis," unable to scale pilot initiatives beyond isolated implementations.
Reported as a finding in the paper's mixed-methods study (paper states AI adoption audits of 120 organizations and surveys of 800 marketers as part of the study).
Autonomous software-engineering agents remain unreliable in realistic development settings.
Assertion in abstract summarizing the observed current state; likely based on prior literature and/or authors' observations (no empirical sample size given in abstract).
Individuals low in trait self-efficacy experienced the steepest ownership erosion (i.e., AI-authorship reduced psychological ownership most for low self-efficacy participants).
Reported moderation analysis in the preregistered experiment showing trait self-efficacy moderated the authorship effect on psychological ownership; preregistered N = 470. (No numeric effect size reported in the abstract.)
Participants in the LLM condition reported lower perceived importance (d = 1.13).
Same preregistered experiment; reported effect size d = 1.13; preregistered N = 470.
Participants in the LLM condition reported lower commitment (d = 1.19).
Same preregistered experiment comparing self-authored vs LLM-authored goals; reported effect size d = 1.19; preregistered N = 470.
Participants in the LLM condition reported lower psychological ownership (d = 1.38).
Same preregistered experiment (between-subjects comparison of authorship); reported effect size d = 1.38; preregistered N = 470.
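For reference, the reported d values are standardized mean differences (Cohen's d); a sketch of the standard two-group definition, with the group labels as assumptions:

```latex
% Cohen's d for two independent groups (here: self-authored vs. LLM-authored goals),
% with s_p the pooled standard deviation.
\[
  d = \frac{\bar{x}_{\text{self}} - \bar{x}_{\text{LLM}}}{s_p},
  \qquad
  s_p = \sqrt{\frac{(n_1 - 1)\,s_1^{2} + (n_2 - 1)\,s_2^{2}}{n_1 + n_2 - 2}} .
\]
```

On that scale, d = 1.38 means the group means differ by roughly 1.4 pooled standard deviations, well past the conventional threshold for a large effect (d ≈ 0.8).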
The gap is prompt-resistant across seven variants.
Experiments applying seven different prompt variants to the evaluated models on IMAVB showing that the representation-action mismatch and failure modes persist despite prompt changes.
The gap is modality-asymmetric (audio grounding underperforms vision).
Within IMAVB's 2x2 design (vision vs audio), comparative performance metrics indicate worse grounding/rejection behavior for audio-targeted conditions versus vision-targeted conditions across evaluated models.
Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy.
Behavioral results on IMAVB showing distinct response patterns across tested models: some rarely reject misleading premises (under-rejection) while others reject too often including correct/standard questions (over-rejection), measured across the 500-clip benchmark.
Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise–perception mismatches even when the same models almost never reject the false claim in their outputs.
Empirical evaluation on IMAVB across 9 models (8 open-source + Gemini 3.1 Pro); internal probing of hidden states showing mismatch signal and behavioral output analysis showing low rejection rates for false premises.
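A minimal sketch of the probing setup this evidence describes: fit a linear classifier on per-example hidden states to detect the premise-perception mismatch, then compare probe accuracy with the model's behavioural rejection rate. All data below are random placeholders, and the array shapes and variable names are assumptions, not IMAVB features.

```python
# Hedged sketch of a hidden-state probe vs. behavioural rejection comparison.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 500, 4096                                   # e.g. one vector per clip/question
hidden_states = rng.normal(size=(n, d))            # placeholder activations
mismatch_label = rng.integers(0, 2, size=n)        # 1 = question carries a false premise
rejected_in_output = rng.integers(0, 2, size=n)    # 1 = model rejected the premise

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, mismatch_label, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

probe_acc = probe.score(X_te, y_te)                          # mismatch signal in hidden states
behav_rate = rejected_in_output[mismatch_label == 1].mean()  # rejections actually expressed
print(f"probe accuracy: {probe_acc:.2f}, behavioural rejection rate: {behav_rate:.2f}")
# High probe accuracy alongside a low rejection rate is the Representation-Action Gap.
```

The gap the authors report corresponds to the probe recovering the mismatch reliably while the behavioural rejection rate on false-premise items stays low.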
The paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics.
Conceptual analysis and problem-framing presented in the paper (qualitative identification of five mismatch categories).
Using LLMs led to fewer creative moments observed in participants (p=0.002).
Within-subject comparison between LLM-assisted and unassisted conditions with reported p-value p=0.002. Study sample N=20.
Participants using LLMs had significantly shorter idea-generation periods (p=0.0004).
Within-subject comparison between LLM-assisted and unassisted conditions reported in paper; p-value reported as p=0.0004. Sample size N=20.
Existing AI assistants (e.g., ChatGPT, Copilot) utilize pre-defined user preferences and chat interaction histories and are therefore confined to reactive exchanges lacking sufficient adaptability to users' psychophysiological states.
Authorial characterization/argument about current AI assistant behavior; no empirical data reported in abstract to substantiate beyond description.
Small-scale retail businesses remain structurally excluded from these advancements due to configuration complexity, technical overhead, and limited digital capabilities.
Asserted as a problem statement in the paper; no empirical evidence, sample size, or quantitative analysis provided in the excerpt.
Producing hardened, production-grade agent workflows may require extra compute and time, and these costs must be amortized through reuse across a broad user community.
Argument in paper reasoning that added rigor entails higher compute/time costs and that reuse across users is needed to amortize these costs; no empirical cost estimates provided.
By focusing on rapid, real-time synthesis, AI agents are effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them.
Conceptual argument presented in the paper asserting a qualitative mismatch between on-the-fly agents and high-stakes production needs; no empirical validation reported.
The on-the-fly paradigm short-circuits disciplined software engineering processes—iterative design, rigorous testing, adversarial evaluation, staged deployment, and more—that have delivered relatively reliable and secure systems.
Argumentative claim in paper linking the on-the-fly loop to reduced application of standard SE processes; no empirical study, sample, or quantitative evidence provided.
These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
Synthesis of empirical results (low agent success rates, identified bottlenecks) presented by authors to make a broader claim about agent readiness and the benchmark's relevance.
Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale;
Trajectory analyses of agent interactions with the benchmark reported by authors; observational claim from analysis of agent action sequences as the action space increases.
(2) over-confidence, where agents skip essential environment verifications;
Trajectory analyses showing agents often omit verification steps, leading to failed interactions; reported as an identified failure mode.
(3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery.
Qualitative/quantitative trajectory analysis indicating agents often choose rationalization/explanatory actions over recovery or retry strategies after failures.
We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance of 90%.
Empirical evaluation reported by authors comparing multiple LLM agents (full-context and RAG) against human performance on benchmark tasks; specific reported success rates: <=60% for top models, 90% for humans.
Without parallel investment in digital literacy, organizational culture, and inter-firm networks, AI will reproduce rather than reduce employment inequalities.
Authors' conclusion drawn from thematic analysis of interviews and conceptual framing; predictive statement based on qualitative findings.