Evidence (11633 claims)
| Category | Claims |
|---|---|
| Adoption | 7395 |
| Productivity | 6507 |
| Governance | 5877 |
| Human-AI Collaboration | 5157 |
| Innovation | 3492 |
| Org Design | 3470 |
| Labor Markets | 3224 |
| Skills & Training | 2608 |
| Inequality | 1835 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
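For readers who want to reproduce this kind of matrix from raw claim records, a minimal sketch follows, assuming each record carries `outcome` and `direction` fields (a hypothetical schema; the underlying dataset format is not shown in this section).

```python
from collections import Counter

# Hypothetical claim records; the real dataset schema is not shown here.
claims = [
    {"outcome": "Error Rate", "direction": "negative"},
    {"outcome": "Firm Productivity", "direction": "positive"},
    {"outcome": "Employment Level", "direction": "mixed"},
]

DIRECTIONS = ["positive", "negative", "mixed", "null"]

# Tally (outcome, direction) pairs, then emit one matrix row per outcome.
counts = Counter((c["outcome"], c["direction"]) for c in claims)
outcomes = sorted({c["outcome"] for c in claims})
for outcome in outcomes:
    row = [counts[(outcome, d)] for d in DIRECTIONS]
    print(outcome, row, "total:", sum(row))
```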
Simulators perform far below human self-consistency baselines for individual judgments.
Comparison in the replication study between simulator consistency and human self-consistency on individual-level judgments; reported large performance gap (simulators far below humans).
Amplified sycophancy and relationship-seeking behaviours may introduce deleterious long-term consequences.
Authors' interpretation and cautionary note based on observed behavioral amplification after fine-tuning; presented as potential long-term risk rather than an empirically measured long-term outcome.
Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored.
Literature/contextual positioning in the paper contrasting prior benchmarks' focus with the proposed task.
There is a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.
Synthesis of experimental findings indicating that existing detectors and MLLMs are insufficiently reliable for the specific task of claim-conditioned refund-evidence verification.
Current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets.
Experimental results reported in the paper comparing MLLM true positive rates (TPR) on real-damaged vs. fake-damaged subsets produced by multiple generators.
In a controlled experiment across six industry configurations (72 tool invocations using Qwen3-32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers.
Controlled experiment reported in the paper: six industry configurations, 72 tool invocations, model used: Qwen3-32B; the unconstrained-parameter condition produced a 43% hallucination rate for domain identifiers.
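One mitigation the 43% figure suggests is constraining tool parameters so the model cannot free-generate identifiers. A minimal sketch, assuming an OpenAI-style JSON-Schema function-calling interface; the tool name and identifiers below are hypothetical and this is not the paper's implementation.

```python
# Constrain a tool parameter with an enum so the model cannot invent
# domain identifiers. Schema shape follows the common JSON-Schema style
# used by function-calling APIs; the identifiers are placeholders.
KNOWN_DOMAIN_IDS = ["plant-eu-01", "plant-us-02", "lab-apac-03"]

lookup_tool = {
    "name": "lookup_domain_record",          # hypothetical tool name
    "description": "Fetch a record for a known domain identifier.",
    "parameters": {
        "type": "object",
        "properties": {
            "domain_id": {
                "type": "string",
                "enum": KNOWN_DOMAIN_IDS,    # constrained: no free-form IDs
            },
        },
        "required": ["domain_id"],
    },
}
```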
In multi-agent configurations the semantic training gap produces a compounding failure mode termed 'semantic drift'.
Analytical description and demonstration in the paper describing multi-agent interactions and observed/argued compounding failures (conceptual demonstration; no numeric sample stated).
The semantic training gap causes operationally incorrect outputs even when model responses are linguistically precise.
Demonstrations and examples reported in the paper showing cases where model outputs are linguistically fluent but operationally incorrect; supported by the paper's analysis and experimental illustrations (no numeric sample provided for this general claim).
There exists a 'semantic training gap': a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships.
Paper provides a formalization and conceptual framing of the gap (theoretical description and argumentation within the manuscript).
LLM-based AI agents deployed in manufacturing demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics.
Stated assertion in the paper describing observed behavior of deployed LLM agents; supported by conceptual analysis and examples/demonstrations reported in the paper (no numeric sample size given).
Direct demographic targeting excludes users whose demographics the platform cannot infer ('unknown users') when the platform provides no way to target such users directly, as is the case on Google Ads.
Platform capability statement about Google Ads (authors' description of Google Ads targeting options); no sample size provided.
Skewed ad delivery of public-service ads can prevent certain groups of individuals from accessing information about resources on the basis of their demographic identity.
Argument/implication drawn from observed demographic skew in ad delivery and its relevance to public-service outreach; no specific empirical sample size reported in the excerpt.
Ad delivery can be skewed by demographic attributes, such that ads are systematically under-delivered to certain groups despite advertiser intent to reach groups proportionally.
Cites prior audits of ad delivery (literature/audit studies referenced by the paper); descriptive claim based on prior empirical work (no sample size stated in the provided excerpt).
AI integration simultaneously increases labor concerns about skill obsolescence by 33%.
Reported as a survey/result in the paper; the study includes surveys of 800 marketers (self-reported concerns about skill obsolescence are likely derived from that survey sample).
Rising data velocity renders legacy systems obsolete—threatening approximately $3.4 trillion in global marketing spending.
Paper reports an estimate/claim about threatened global marketing spending tied to legacy systems becoming obsolete (derivation likely from the study's quantitative analysis or economic estimate described in the paper).
62% of teams suffer from "AI paralysis," unable to scale pilot initiatives beyond isolated implementations.
Reported as a finding in the paper's mixed-methods study (paper states AI adoption audits of 120 organizations and surveys of 800 marketers as part of the study).
Autonomous software-engineering agents remain unreliable in realistic development settings.
Assertion in the abstract summarizing the observed current state; likely based on prior literature and/or the authors' observations (no empirical sample size given in the abstract).
Individuals low in trait self-efficacy experienced the steepest ownership erosion (i.e., AI-authorship reduced psychological ownership most for low self-efficacy participants).
Reported moderation analysis in the preregistered experiment showing trait self-efficacy moderated the authorship effect on psychological ownership; preregistered N = 470. (No numeric effect size reported in the abstract.)
Participants in the LLM condition reported lower perceived importance (d = 1.13).
Same preregistered experiment; reported effect size d = 1.13; preregistered N = 470.
Participants in the LLM condition reported lower commitment (d = 1.19).
Same preregistered experiment comparing self-authored vs LLM-authored goals; reported effect size d = 1.19; preregistered N = 470.
Participants in the LLM condition reported lower psychological ownership (d = 1.38).
Same preregistered experiment (between-subjects comparison of authorship); reported effect size d = 1.38; preregistered N = 470.
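For context on the d values above: Cohen's d is the standardized mean difference between two groups. A minimal sketch with invented summary statistics, since the abstract reports only d and N = 470:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups using the pooled SD."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical summary statistics: self-authored vs LLM-authored
# psychological ownership on a 7-point scale, 235 participants per arm.
print(round(cohens_d(5.6, 1.0, 235, 4.2, 1.0, 235), 2))  # -> 1.4, near d = 1.38
```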
The gap is prompt-resistant across seven variants.
Experiments applying seven different prompt variants to the evaluated models on IMAVB showing that the representation-action mismatch and failure modes persist despite prompt changes.
The gap is modality-asymmetric (audio grounding underperforms vision).
Within IMAVB's 2x2 design (vision vs audio), comparative performance metrics indicate worse grounding/rejection behavior for audio-targeted conditions versus vision-targeted conditions across evaluated models.
Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy.
Behavioral results on IMAVB showing distinct response patterns across tested models: some rarely reject misleading premises (under-rejection) while others reject too often including correct/standard questions (over-rejection), measured across the 500-clip benchmark.
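A minimal sketch of how the two failure modes could be operationalized from rejection rates; the 0.5 thresholds are assumptions for illustration, not values from the paper:

```python
# Label a model's failure mode from two rejection rates, following the
# under-/over-rejection definitions above. Thresholds are assumed.
def failure_mode(reject_rate_misleading, reject_rate_standard):
    if reject_rate_misleading < 0.5 and reject_rate_standard < 0.5:
        return "under-rejection"   # answers as if the false premise were true
    if reject_rate_standard >= 0.5:
        return "over-rejection"    # rejects even ordinary questions
    return "balanced"

print(failure_mode(0.05, 0.02))  # -> under-rejection
print(failure_mode(0.70, 0.60))  # -> over-rejection
```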
Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise–perception mismatches even when the same models almost never reject the false claim in their outputs.
Empirical evaluation on IMAVB across 9 models (8 open-source + Gemini 3.1 Pro); internal probing of hidden states showing mismatch signal and behavioral output analysis showing low rejection rates for false premises.
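The probing setup is described only at a high level here. A minimal sketch of a linear probe over hidden states, with random placeholder data standing in for real model activations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: hidden-state vectors for clips, paired with a label for
# whether the question's premise mismatches the audio/video content.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 4096))    # 500 clips x hidden dim (assumed)
premise_mismatch = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, premise_mismatch, test_size=0.2, random_state=0
)

# A linear probe: if it classifies mismatch well while the model's text output
# rarely rejects false premises, that is the Representation-Action Gap.
# (Real activations would show signal; this random data yields ~chance.)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```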
The paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics.
Conceptual analysis and problem-framing presented in the paper (qualitative identification of five mismatch categories).
Using LLMs led to fewer observed creative moments among participants (p=0.002).
Within-subject comparison between LLM-assisted and unassisted conditions with reported p-value p=0.002. Study sample N=20.
Participants using LLMs had significantly shorter idea-generation periods (p=0.0004).
Within-subject comparison between LLM-assisted and unassisted conditions reported in the paper; p-value reported as p=0.0004. Sample size N=20.
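The abstract does not state which paired test produced these p-values; a paired t-test is one standard choice for a within-subject design like this. A minimal sketch with synthetic data for N=20:

```python
import numpy as np
from scipy import stats

# Synthetic per-participant counts of creative moments (N = 20, within-subject);
# the study's actual measurements are not reproduced here.
rng = np.random.default_rng(1)
unassisted = rng.poisson(6, size=20)
llm_assisted = np.clip(unassisted - rng.poisson(2, size=20), 0, None)

# Paired comparison of the two conditions for the same participants.
t_stat, p_value = stats.ttest_rel(unassisted, llm_assisted)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```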
AI-assisted engineering teams concurrently face a 19% risk of skill obsolescence.
Empirical finding reported by the study, presumably based on the mixed-methods data (survey/Delphi/case studies) described in the abstract.
Forecasts indicate that automation may supplant as much as 45% of traditional tasks by 2030.
Statement in the paper referencing external forecasts (no specific source or sample reported in the abstract).
Current surveys remain fragmented across system optimization, architecture design, and trust, lacking a unified framework to evaluate the fundamental trade-off between output quality and economic cost.
Authors' literature review and critique of existing surveys; based on mapping prior works into separate strands (qualitative assessment rather than quantified meta-analysis).
Exponential token consumption introduces severe computational, collaborative, and security bottlenecks.
Synthesis presented in the paper arguing that rising token usage causes system-level constraints; based on literature survey and conceptual analysis (no single empirical sample reported).
Existing AI assistants (e.g., ChatGPT, Copilot) utilize pre-defined user preferences and chat interaction histories and are therefore confined to reactive exchanges lacking sufficient adaptability to users' psychophysiological states.
Authorial characterization/argument about current AI assistant behavior; no empirical data reported in abstract to substantiate beyond description.
Small-scale retail businesses remain structurally excluded from these advancements due to configuration complexity, technical overhead, and limited digital capabilities.
Asserted as a problem statement in the paper; no empirical evidence, sample size, or quantitative analysis provided in the excerpt.
Producing hardened, production-grade agent workflows may require extra compute and time, and these costs must be amortized through reuse across a broad user community.
Argument in the paper reasoning that added rigor entails higher compute/time costs and that reuse across users is needed to amortize those costs; no empirical cost estimates provided.
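The amortization argument is easy to make concrete: a fixed hardening cost divided across more users quickly becomes negligible relative to marginal serving cost. A toy calculation with invented numbers:

```python
# Toy amortization: hardening cost per user falls with reuse. Numbers are
# illustrative assumptions, not estimates from the paper.
hardening_cost = 5_000.0        # one-time extra compute/engineering cost ($)
marginal_cost_per_run = 0.02    # per-invocation serving cost ($)

for users in (1, 100, 10_000):
    per_user = hardening_cost / users + marginal_cost_per_run
    print(f"{users:>6} users -> ${per_user:,.2f} per user")
```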
By focusing on rapid, real-time synthesis, AI agents effectively deliver improvised prototypes to users rather than systems fit for the high-stakes scenarios in which users may unwittingly apply them.
Conceptual argument presented in the paper asserting a qualitative mismatch between on-the-fly agents and high-stakes production needs; no empirical validation reported.
The on-the-fly paradigm short-circuits disciplined software engineering processes—iterative design, rigorous testing, adversarial evaluation, staged deployment, and more—that have delivered relatively reliable and secure systems.
Argumentative claim in paper linking the on-the-fly loop to reduced application of standard SE processes; no empirical study, sample, or quantitative evidence provided.
These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
Synthesis of empirical results (low agent success rates, identified bottlenecks) presented by authors to make a broader claim about agent readiness and the benchmark's relevance.
(3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery.
Qualitative/quantitative trajectory analysis indicating agents often choose rationalization/explanatory actions over recovery or retry strategies after failures.
(2) over-confidence, where agents skip essential environment verifications;
Trajectory analyses showing agents often omit verification steps leading to failed interactions; reported as an identified failure mode.
Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale;
Trajectory analyses of agent interactions with the benchmark reported by authors; observational claim from analysis of agent action sequences as action space increases.
We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance of 90%.
Empirical evaluation reported by authors comparing multiple LLM agents (full-context and RAG) against human performance on benchmark tasks; specific reported success rates: <=60% for top models, 90% for humans.
Credential erosion is evident in the aggregate pattern (credentials losing signaling value relative to AI-augmented skill demonstrations).
Synthesis statement from included studies noting credential erosion alongside skill signaling changes; not quantified in the excerpt.
Developing economies reliant on cognitive services outsourcing face disproportionate disruption through both direct exposure and indirect demand-erosion channels.
Preliminary empirical evidence across included studies indicating larger negative impacts for economies dependent on cognitive-services exports; described as preliminary but material.
Observable labor market data already document patterns consistent with AI-driven displacement rather than mere transformation—concentrated among routine cognitive tasks and junior roles.
Synthesis of observed labor market indicators from retained empirical studies since 2020 showing concentration of declines in routine cognitive tasks and junior roles.
Evidence from online labor markets shows a 2%–21% reduction in posting volumes for automatable creative tasks following ChatGPT's release.
Empirical analyses of online labor market posting volumes reported in multiple studies included in the review; range reported across studies.
Across synthesized studies, there was a 14–41% reduction in postings for entry- and mid-level software development and content-creation roles in high-income economies between 2022 and 2024 (range across individual studies: −14% to −41%; median: −23%).
Synthesis of empirical studies retained in the systematic review (numerical range and median reported across non-overlapping study designs and geographies); no pooled meta-analytic estimate provided.
Without parallel investment in digital literacy, organizational culture, and inter-firm networks, AI will reproduce rather than reduce employment inequalities.
Authors' conclusion drawn from thematic analysis of interviews and conceptual framing; predictive statement based on qualitative findings.
AI adoption in peripheral economies is not a purely technological or financial challenge but a social and human capital challenge, embedded in a biocultural environment shaped by brain drain, institutional thinness, and weak civic intermediation.
Synthesis of interview findings using Bitsani's Biocultural City framework; qualitative evidence from 12 interviews supports this argument.
Knowledge deficits and financial constraints emerge as primary barriers [to AI adoption].
Thematic analysis of the twelve semi-structured interviews reporting these themes as primary barriers.