Evidence (6507 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Productivity
AGI could fundamentally alter the global distribution of economic and military power.
Paper's geopolitical analysis drawing on capability trends and scenario reasoning (as stated in abstract); no empirical quantification provided in the abstract.
Increased levels of AI assistance may degrade productivity, leading to potentially significant shortfalls under the model's identified conditions.
Model-based comparative-statics and steady-state analysis showing scenarios where marginal increases in AI assistance reduce expected task output; examples/parameter illustrations provided in the paper (theoretical, no empirical sample).
Introducing AI unreliability (errors/noise in AI outputs) in the model can also generate a productivity paradox: greater AI assistance may lower productivity.
Analytical/theoretical model incorporating AI unreliability; model derivations and examples demonstrating conditions under which unreliability leads to reduced productivity (no empirical data).
Incorporating endogeneity in skill development into the model can induce a productivity paradox where increased AI assistance reduces productivity.
Analytical/theoretical model of human-AI interaction with utility-maximizing human agents and endogenous skill development; steady-state and comparative-static analysis reported in the paper (no empirical sample).
AI integration simultaneously increases labor concerns about skill obsolescence by 33%.
Reported as a survey/result in the paper; the study includes surveys of 800 marketers (self-reported concerns about skill obsolescence are likely derived from that survey sample).
Rising data velocity renders legacy systems obsolete—threatening approximately $3.4 trillion in global marketing spending.
Paper reports an estimate/claim about threatened global marketing spending tied to legacy systems becoming obsolete (derivation likely from the study's quantitative analysis or economic estimate described in the paper).
62% of teams suffer from "AI paralysis," unable to scale pilot initiatives beyond isolated implementations.
Reported as a finding in the paper's mixed-methods study (paper states AI adoption audits of 120 organizations and surveys of 800 marketers as part of the study).
Autonomous software-engineering agents remain unreliable in realistic development settings.
Assertion in abstract summarizing the observed current state; likely based on prior literature and/or authors' observations (no empirical sample size given in abstract).
Individuals low in trait self-efficacy experienced the steepest ownership erosion (i.e., AI-authorship reduced psychological ownership most for low self-efficacy participants).
Reported moderation analysis in the preregistered experiment showing trait self-efficacy moderated the authorship effect on psychological ownership; preregistered N = 470. (No numeric effect size reported in the abstract.)
Participants in the LLM condition reported lower perceived importance (d = 1.13).
Same preregistered experiment; reported effect size d = 1.13; preregistered N = 470.
Participants in the LLM condition reported lower commitment (d = 1.19).
Same preregistered experiment comparing self-authored vs LLM-authored goals; reported effect size d = 1.19; preregistered N = 470.
Participants in the LLM condition reported lower psychological ownership (d = 1.38).
Same preregistered experiment (between-subjects comparison of authorship); reported effect size d = 1.38; preregistered N = 470.
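The three effect sizes above are Cohen's d values from the same preregistered N = 470 experiment. For reference, d is the difference in group means divided by the pooled standard deviation. A minimal sketch with hypothetical ratings (the study's raw data are not given here; both the function and the sample values are illustrative, not from the paper):

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Unbiased (n - 1) sample variances
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# Hypothetical 1-7 ownership ratings, NOT the study's data:
self_authored = [6, 5, 6, 7, 5, 6]
llm_authored = [4, 3, 5, 4, 3, 4]
print(round(cohens_d(self_authored, llm_authored), 2))
```

By the usual rules of thumb, d around 0.8 is already "large", so the reported 1.1-1.4 values indicate very large authorship effects.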
The paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics.
Conceptual analysis and problem-framing presented in the paper (qualitative identification of five mismatch categories).
Using LLMs led to fewer creative moments observed in participants (p=0.002).
Within-subject comparison between LLM-assisted and unassisted conditions with reported p-value p=0.002. Study sample N=20.
Participants using LLMs had significantly shorter idea-generation periods (p=0.0004).
Within-subject comparison between LLM-assisted and unassisted conditions reported in paper; p-value reported as p=0.0004. Sample size N=20.
AI-assisted engineering teams concurrently face a 19% risk of skills obsolescence.
Empirical finding reported by the study, presumably based on the mixed-methods data (survey/Delphi/case studies) described in abstract.
Forecasts indicate that automation may supplant as much as 45% of traditional tasks by 2030.
Statement in paper referencing external forecasts (no specific source or sample reported in abstract).
Existing AI assistants (e.g., ChatGPT, Copilot) rely on pre-defined user preferences and chat interaction histories, confining them to reactive exchanges that cannot adapt to users' psychophysiological states.
Authorial characterization/argument about current AI assistant behavior; no empirical data reported in abstract to substantiate beyond description.
Producing hardened, production-grade agent workflows may require extra compute and time, and these costs must be amortized through reuse across a broad user community.
Argument in paper reasoning that added rigor entails higher compute/time costs and that reuse across users is needed to amortize these costs; no empirical cost estimates provided.
By focusing on rapid, real-time synthesis, AI agents effectively deliver improvised prototypes rather than systems fit for the high-stakes scenarios in which users may unwittingly deploy them.
Conceptual argument presented in the paper asserting a qualitative mismatch between on-the-fly agents and high-stakes production needs; no empirical validation reported.
The on-the-fly paradigm short-circuits disciplined software engineering processes—iterative design, rigorous testing, adversarial evaluation, staged deployment, and more—that have delivered relatively reliable and secure systems.
Argumentative claim in paper linking the on-the-fly loop to reduced application of standard SE processes; no empirical study, sample, or quantitative evidence provided.
These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
Synthesis of empirical results (low agent success rates, identified bottlenecks) presented by authors to make a broader claim about agent readiness and the benchmark's relevance.
(3) strategic defeatism: a tendency to rationalize failure rather than pursue recovery.
Qualitative/quantitative trajectory analysis indicating agents often choose rationalization/explanatory actions over recovery or retry strategies after failures.
(2) over-confidence, where agents skip essential environment verifications;
Trajectory analyses showing agents often omit verification steps leading to failed interactions; reported as an identified failure mode.
Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale;
Trajectory analyses of agent interactions with the benchmark reported by authors; observational claim from analysis of agent action sequences as action space increases.
We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance of 90%.
Empirical evaluation reported by authors comparing multiple LLM agents (full-context and RAG) against human performance on benchmark tasks; specific reported success rates: <=60% for top models, 90% for humans.
Common failures include replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns.
Error-mode analysis described in the paper/abstract showing that models substitute complex CAD operations (sweep, loft, twist-extrude) with simpler sketch-and-extrude sequences.
Common failures include misinterpreting industrial design parameters.
Reported error analysis in the paper/abstract indicating models often misinterpret engineering/design parameters when generating CAD programs.
Common failures include missing fine 3D structure.
Qualitative and quantitative analysis of model outputs on BenchCAD reported in the paper/abstract noting missing fine 3D structural details as a frequent error mode.
Human capital and technological innovation channels show weaker or even negative effects on Lae, attributed to short-term resource misallocation and skill mismatches.
Spatial mediation analysis (channel analysis) using panel data for 30 provincial regions (2012–2022) assessing mediating roles of human capital and technological innovation.
Functional deployment and operational investment in AI are associated with employment declines.
Regression analyses from the BTOS AI supplement linking measures of functional AI deployment and operational AI investment to firm-reported employment changes; observational associations (sample size and exact model specification not shown in excerpt).
Employment reductions attributable to AI are rare: only 2% of firms report them.
Firm self-reports on employment outcomes related to AI from the BTOS AI supplement (Nov 2025–Jan 2026); descriptive statistic reported; sample size not excerpted.
Among firms with worker-level AI use, 65% restrict use to three or fewer tasks.
Descriptive statistic from BTOS AI supplement giving distribution of number of worker tasks using AI among firms that report worker-level use; sample size not shown.
Among adopter firms, scope remains limited: 57% use AI in three or fewer functions.
Descriptive distribution of number of business functions using AI among adopter firms in the BTOS AI supplement (Nov 2025–Jan 2026); sample restricted to adopter firms (sample size not provided).
Institutional inertia in property valuation poses risks to asset pricing, collateral risk modelling and investor confidence.
Analytical inference from interview findings and theoretical synthesis highlighting implications for property investment and financial market stability.
Despite advances in automation, data analytics and AI, the sector has been slow to digitise.
Background statement supported by interview data and sector observation reported in the study.
The IDOI framework provides a transferable model for understanding digital transformation in regulated, high-trust professions and highlights the market-level risks of institutional inertia in property valuation.
Development of the IDOI conceptual framework from qualitative data and theoretical integration; authors' claim about transferability and implications.
Generational divides, protectionist attitudes and fears of automation reinforce digital resistance.
Qualitative interview evidence reporting attitudes across cohorts of valuers and firm personnel; thematic analysis identifying cultural and attitudinal themes.
The Valuers Act (1948), fragmented infrastructure and sovereignty concerns limit innovation.
Interview data from practitioners, firm leaders and regulators in New Zealand citing specific regulatory and infrastructure constraints; thematic analysis.
Barriers to adoption arise primarily from institutional conservatism, outdated regulation and weak data governance rather than technical shortcomings.
Qualitative semi-structured interviews with valuers, firm leaders and regulators in New Zealand; thematic analysis guided by Rogers' diffusion of innovations and institutional theory synthesised into the IDOI framework.
Consequently, generated artifacts may exhibit brittle behavior and limited deployability.
Paper asserts that lack of production awareness leads to brittle artifacts and limited deployability; no quantitative measures or sample sizes provided in the abstract.
AI-assisted development tools often lack awareness of architectural constraints, infrastructure dependencies, and organizational standards required in production environments.
Asserted observation in the paper arguing limitations of general-purpose AI code generation when targeting production-ready systems; no empirical sample size or methodological details provided in the excerpt.
Current AI tools are not yet mature enough to replace developers.
Conclusion drawn from the controlled experiment and participant feedback comparing AI-assisted vs traditional task-splitting.
Breaking down user stories into actionable tasks is a critical yet time-consuming process in agile software development.
Background/introductory statement in the paper describing the problem motivation; no experimental sample size reported for this claim.
Nominally cheaper models can incur higher total cost due to token-intensive reasoning.
Cost and token usage analysis reported in the paper showing cheaper-per-token models may generate more tokens and thus higher total cost in practice.
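The underlying arithmetic is simple: total inference cost is unit price times tokens generated, so a lower per-token price loses whenever the model reasons verbosely enough. A minimal sketch with hypothetical prices and token counts (none of these figures are from the paper):

```python
def total_cost(price_per_1k_tokens, tokens_generated):
    """Inference cost = unit price x token volume."""
    return price_per_1k_tokens * tokens_generated / 1000

# Hypothetical: the "cheap" model burns 6x the tokens on long reasoning chains.
cheap = total_cost(price_per_1k_tokens=0.2, tokens_generated=12_000)
expensive = total_cost(price_per_1k_tokens=1.0, tokens_generated=2_000)
print(cheap, expensive)  # the nominally cheaper model ends up costing more
```

The crossover point is just the ratio of unit prices: here the cheap model loses once it generates more than 5x the tokens of the expensive one.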
Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets.
Stated as background/motivation in the paper (conceptual claim; no empirical sample size reported).
Cascade performance is limited primarily by structural cost (a cascade pays the cheap model before any escalation decision is made), rather than by a shortage of intermediate stages.
Synthesis of theoretical insights and empirical results reported in the paper (theoretical analysis of structural costs + empirical comparisons showing limited benefit from additional stages).
Optimized subsequence cascades do not deliver practically meaningful held-out gains over the pairwise envelope.
Empirical evaluation on the five benchmarks comparing optimized subsequence cascades to the pairwise envelope; reported lack of practically meaningful held-out improvement.
Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope.
Empirical comparison across the reported benchmarks and models showing that full fixed chains achieve worse cost-quality tradeoffs than the pairwise envelope (experimental results described in the paper).
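The structural-cost argument can be made concrete: in a threshold cascade the cheap model always runs first, so expected cost has an irreducible floor no matter how well the escalation threshold is tuned. A sketch of a two-stage cascade's expected per-query cost, using hypothetical prices and escalation rates (not values from the paper):

```python
def cascade_expected_cost(cheap_cost, big_cost, escalation_rate):
    """Two-stage threshold cascade: the cheap model always runs;
    the big model runs only on the fraction of queries that escalate."""
    return cheap_cost + escalation_rate * big_cost

# Hypothetical per-query costs: cheap model 0.01, big model 0.10.
for rate in (0.0, 0.3, 0.7, 1.0):
    cost = cascade_expected_cost(cheap_cost=0.01, big_cost=0.10,
                                 escalation_rate=rate)
    print(rate, round(cost, 3))
# Even at a 0% escalation rate the cascade pays the 0.01 floor, and at a
# 100% rate it pays more than calling the big model directly (0.11 > 0.10).
```

This is why adding intermediate stages helps little: each extra stage adds another always-paid floor before the final escalation, which is consistent with the pairwise envelope dominating full fixed chains.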
Municipal 311 call centers and complaint intake systems face a structural mismatch between incoming volume and classification capacity that produces a bottleneck and differential service quality that follows income and racial lines.
Stated in the paper's introduction; cites prior work (Liu 2024 SLA) as support for the differential service-quality / demographic claim. No sample size or quantitative result reported in the excerpt.