Evidence (6507 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Productivity
AGI could fundamentally alter the global distribution of economic and military power.
Paper's geopolitical analysis drawing on capability trends and scenario reasoning (as stated in abstract); no empirical quantification provided in the abstract.
Increased levels of AI assistance may degrade productivity, leading to potentially significant shortfalls under the model's identified conditions.
Model-based comparative-statics and steady-state analysis showing scenarios where marginal increases in AI assistance reduce expected task output; examples/parameter illustrations provided in the paper (theoretical, no empirical sample).
Introducing AI unreliability (errors/noise in AI outputs) in the model can also generate a productivity paradox: greater AI assistance may lower productivity.
Analytical/theoretical model incorporating AI unreliability; model derivations and examples demonstrating conditions under which unreliability leads to reduced productivity (no empirical data).
Incorporating endogeneity in skill development into the model can induce a productivity paradox where increased AI assistance reduces productivity.
Analytical/theoretical model of human-AI interaction with utility-maximizing human agents and endogenous skill development; steady-state and comparative-static analysis reported in the paper (no empirical sample).
AI integration simultaneously increases labor concerns about skill obsolescence by 33%.
Reported as a survey/result in the paper; the study includes surveys of 800 marketers (self-reported concerns about skill obsolescence are likely derived from that survey sample).
Rising data velocity renders legacy systems obsolete—threatening approximately $3.4 trillion in global marketing spending.
Paper reports an estimate/claim about threatened global marketing spending tied to legacy systems becoming obsolete (derivation likely from the study's quantitative analysis or economic estimate described in the paper).
62% of teams suffer from "AI paralysis," unable to scale pilot initiatives beyond isolated implementations.
Reported as a finding in the paper's mixed-methods study (paper states AI adoption audits of 120 organizations and surveys of 800 marketers as part of the study).
Autonomous software-engineering agents remain unreliable in realistic development settings.
Assertion in abstract summarizing the observed current state; likely based on prior literature and/or authors' observations (no empirical sample size given in abstract).
Individuals low in trait self-efficacy experienced the steepest ownership erosion (i.e., AI-authorship reduced psychological ownership most for low self-efficacy participants).
Reported moderation analysis in the preregistered experiment showing trait self-efficacy moderated the authorship effect on psychological ownership; preregistered N = 470. (No numeric effect size reported in the abstract.)
Participants in the LLM condition reported lower perceived importance (d = 1.13).
Same preregistered experiment; reported effect size d = 1.13; preregistered N = 470.
Participants in the LLM condition reported lower commitment (d = 1.19).
Same preregistered experiment comparing self-authored vs LLM-authored goals; reported effect size d = 1.19; preregistered N = 470.
Participants in the LLM condition reported lower psychological ownership (d = 1.38).
Same preregistered experiment (between-subjects comparison of authorship); reported effect size d = 1.38; preregistered N = 470.
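The three effect sizes above are Cohen's d values from the same preregistered N = 470 experiment. For reference, d is the difference in group means divided by the pooled standard deviation. A minimal sketch with hypothetical ratings (the study's raw data are not given here; both the function and the sample values are illustrative, not from the paper):

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Unbiased (n - 1) sample variances
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# Hypothetical 1-7 ownership ratings, NOT the study's data:
self_authored = [6, 5, 6, 7, 5, 6]
llm_authored = [4, 3, 5, 4, 3, 4]
print(round(cohens_d(self_authored, llm_authored), 2))
```

By the usual rules of thumb, d around 0.8 is already "large", so the reported 1.1-1.4 values indicate very large authorship effects.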
The paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics.
Conceptual analysis and problem-framing presented in the paper (qualitative identification of five mismatch categories).
Using LLMs led to fewer creative moments observed in participants (p=0.002).
Within-subject comparison between LLM-assisted and unassisted conditions with reported p-value p=0.002. Study sample N=20.
Participants using LLMs had significantly shorter idea-generation periods (p=0.0004).
Within-subject comparison between LLM-assisted and unassisted conditions reported in paper; p-value reported as p=0.0004. Sample size N=20.
AI-assisted engineering teams concurrently face a 19% risk of skills obsolescence.
Empirical finding reported by the study, presumably based on the mixed-methods data (survey/Delphi/case studies) described in abstract.
Forecasts indicate that automation may supplant as much as 45% of traditional tasks by 2030.
Statement in paper referencing external forecasts (no specific source or sample reported in abstract).
Existing AI assistants (e.g., ChatGPT, Copilot) rely on pre-defined user preferences and chat interaction histories, confining them to reactive exchanges that cannot adapt to users' psychophysiological states.
Authorial characterization/argument about current AI assistant behavior; no empirical data reported in abstract to substantiate beyond description.
Producing hardened, production-grade agent workflows may require extra compute and time, and these costs must be amortized through reuse across a broad user community.
Argument in paper reasoning that added rigor entails higher compute/time costs and that reuse across users is needed to amortize these costs; no empirical cost estimates provided.
By focusing on rapid, real-time synthesis, AI agents effectively deliver improvised prototypes rather than systems fit for the high-stakes scenarios in which users may unwittingly deploy them.
Conceptual argument presented in the paper asserting a qualitative mismatch between on-the-fly agents and high-stakes production needs; no empirical validation reported.
The on-the-fly paradigm short-circuits disciplined software engineering processes—iterative design, rigorous testing, adversarial evaluation, staged deployment, and more—that have delivered relatively reliable and secure systems.
Argumentative claim in paper linking the on-the-fly loop to reduced application of standard SE processes; no empirical study, sample, or quantitative evidence provided.
These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
Synthesis of empirical results (low agent success rates, identified bottlenecks) presented by authors to make a broader claim about agent readiness and the benchmark's relevance.
(3) strategic defeatism: a tendency to rationalize failure rather than pursue recovery.
Qualitative/quantitative trajectory analysis indicating agents often choose rationalization/explanatory actions over recovery or retry strategies after failures.
(2) over-confidence, where agents skip essential environment verifications;
Trajectory analyses showing agents often omit verification steps leading to failed interactions; reported as an identified failure mode.
Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale;
Trajectory analyses of agent interactions with the benchmark reported by authors; observational claim from analysis of agent action sequences as action space increases.
We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance of 90%.
Empirical evaluation reported by authors comparing multiple LLM agents (full-context and RAG) against human performance on benchmark tasks; specific reported success rates: <=60% for top models, 90% for humans.
Common failures include replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns.
Error-mode analysis described in the paper/abstract showing that models substitute complex CAD operations (sweep, loft, twist-extrude) with simpler sketch-and-extrude sequences.
Common failures include misinterpreting industrial design parameters.
Reported error analysis in the paper/abstract indicating models often misinterpret engineering/design parameters when generating CAD programs.
Common failures include missing fine 3D structure.
Qualitative and quantitative analysis of model outputs on BenchCAD reported in the paper/abstract noting missing fine 3D structural details as a frequent error mode.
Human capital and technological innovation channels show weaker or even negative effects on Lae, attributed to short-term resource misallocation and skill mismatches.
Spatial mediation analysis (channel analysis) using panel data for 30 provincial regions (2012–2022) assessing mediating roles of human capital and technological innovation.
Functional deployment and operational investment in AI are associated with employment declines.
Regression analyses from the BTOS AI supplement linking measures of functional AI deployment and operational AI investment to firm-reported employment changes; observational associations (sample size and exact model specification not shown in excerpt).
Employment reductions attributable to AI are rare: only 2% of firms report them.
Firm self-reports on employment outcomes related to AI from the BTOS AI supplement (Nov 2025–Jan 2026); descriptive statistic reported; sample size not excerpted.
Among firms with worker-level AI use, 65% restrict use to three or fewer tasks.
Descriptive statistic from BTOS AI supplement giving distribution of number of worker tasks using AI among firms that report worker-level use; sample size not shown.
Among adopter firms, scope remains limited: 57% use AI in three or fewer functions.
Descriptive distribution of number of business functions using AI among adopter firms in the BTOS AI supplement (Nov 2025–Jan 2026); sample restricted to adopter firms (sample size not provided).
Institutional inertia in property valuation poses risks to asset pricing, collateral risk modelling and investor confidence.
Analytical inference from interview findings and theoretical synthesis highlighting implications for property investment and financial market stability.
Despite advances in automation, data analytics and AI, the sector has been slow to digitise.
Background statement supported by interview data and sector observation reported in the study.
The IDOI framework provides a transferable model for understanding digital transformation in regulated, high-trust professions and highlights the market-level risks of institutional inertia in property valuation.
Development of the IDOI conceptual framework from qualitative data and theoretical integration; authors' claim about transferability and implications.
Generational divides, protectionist attitudes and fears of automation reinforce digital resistance.
Qualitative interview evidence reporting attitudes across cohorts of valuers and firm personnel; thematic analysis identifying cultural and attitudinal themes.
The Valuers Act (1948), fragmented infrastructure and sovereignty concerns limit innovation.
Interview data from practitioners, firm leaders and regulators in New Zealand citing specific regulatory and infrastructure constraints; thematic analysis.
Barriers to adoption arise primarily from institutional conservatism, outdated regulation and weak data governance rather than technical shortcomings.
Qualitative semi-structured interviews with valuers, firm leaders and regulators in New Zealand; thematic analysis guided by Rogers' diffusion of innovations and institutional theory synthesised into the IDOI framework.
Consequently, generated artifacts may exhibit brittle behavior and limited deployability.
Paper asserts that lack of production awareness leads to brittle artifacts and limited deployability; no quantitative measures or sample sizes provided in the abstract.
AI-assisted development tools often lack awareness of architectural constraints, infrastructure dependencies, and organizational standards required in production environments.
Asserted observation in the paper arguing limitations of general-purpose AI code generation when targeting production-ready systems; no empirical sample size or methodological details provided in the excerpt.
Current AI tools are not yet mature enough to replace developers.
Conclusion drawn from the controlled experiment and participant feedback comparing AI-assisted vs traditional task-splitting.
Breaking down user stories into actionable tasks is a critical yet time-consuming process in agile software development.
Background/introductory statement in the paper describing the problem motivation; no experimental sample size reported for this claim.
Nominally cheaper models can incur higher total cost due to token-intensive reasoning.
Cost and token usage analysis reported in the paper showing cheaper-per-token models may generate more tokens and thus higher total cost in practice.
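The underlying arithmetic is simple: total inference cost is unit price times tokens generated, so a lower per-token price loses whenever the model reasons verbosely enough. A minimal sketch with hypothetical prices and token counts (none of these figures are from the paper):

```python
def total_cost(price_per_1k_tokens, tokens_generated):
    """Inference cost = unit price x token volume."""
    return price_per_1k_tokens * tokens_generated / 1000

# Hypothetical: the "cheap" model burns 6x the tokens on long reasoning chains.
cheap = total_cost(price_per_1k_tokens=0.2, tokens_generated=12_000)
expensive = total_cost(price_per_1k_tokens=1.0, tokens_generated=2_000)
print(cheap, expensive)  # the nominally cheaper model ends up costing more
```

The crossover point is just the ratio of unit prices: here the cheap model loses once it generates more than 5x the tokens of the expensive one.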
Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets.
Stated as background/motivation in the paper (conceptual claim; no empirical sample size reported).
Cascade performance is limited primarily by structural cost (a cascade pays the cheap model before any escalation decision is made), rather than by a shortage of intermediate stages.
Synthesis of theoretical insights and empirical results reported in the paper (theoretical analysis of structural costs + empirical comparisons showing limited benefit from additional stages).
Optimized subsequence cascades do not deliver practically meaningful held-out gains over the pairwise envelope.
Empirical evaluation on the five benchmarks comparing optimized subsequence cascades to the pairwise envelope; reported lack of practically meaningful held-out improvement.
Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope.
Empirical comparison across the reported benchmarks and models showing that full fixed chains achieve worse cost-quality tradeoffs than the pairwise envelope (experimental results described in the paper).
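The structural-cost argument can be made concrete: in a threshold cascade the cheap model always runs first, so expected cost has an irreducible floor no matter how well the escalation threshold is tuned. A sketch of a two-stage cascade's expected per-query cost, using hypothetical prices and escalation rates (not values from the paper):

```python
def cascade_expected_cost(cheap_cost, big_cost, escalation_rate):
    """Two-stage threshold cascade: the cheap model always runs;
    the big model runs only on the fraction of queries that escalate."""
    return cheap_cost + escalation_rate * big_cost

# Hypothetical per-query costs: cheap model 0.01, big model 0.10.
for rate in (0.0, 0.3, 0.7, 1.0):
    cost = cascade_expected_cost(cheap_cost=0.01, big_cost=0.10,
                                 escalation_rate=rate)
    print(rate, round(cost, 3))
# Even at a 0% escalation rate the cascade pays the 0.01 floor, and at a
# 100% rate it pays more than calling the big model directly (0.11 > 0.10).
```

This is why adding intermediate stages helps little: each extra stage adds another always-paid floor before the final escalation, which is consistent with the pairwise envelope dominating full fixed chains.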
Municipal 311 call centers and complaint intake systems face a structural mismatch between incoming volume and classification capacity that produces a bottleneck and differential service quality that follows income and racial lines.
Stated in the paper's introduction; cites prior work (Liu 2024 SLA) as support for the differential service-quality / demographic claim. No sample size or quantitative result reported in the excerpt.