Evidence (6507 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	609	159	77	736	1615
Governance & Regulation	664	329	160	99	1273
Organizational Efficiency	624	143	105	70	949
Technology Adoption Rate	502	176	98	78	861
Research Productivity	348	109	48	322	836
Output Quality	391	120	44	40	595
Firm Productivity	385	46	85	17	539
Decision Quality	275	143	62	34	521
AI Safety & Ethics	183	241	59	30	517
Market Structure	152	154	109	20	440
Task Allocation	158	50	56	26	295
Innovation Output	178	23	38	17	257
Skill Acquisition	137	52	50	13	252
Fiscal & Macroeconomic	120	64	38	23	252
Employment Level	93	46	96	12	249
Firm Revenue	130	43	26	3	202
Consumer Welfare	99	51	40	11	201
Inequality Measures	36	105	40	6	187
Task Completion Time	134	18	6	5	163
Worker Satisfaction	79	54	16	11	160
Error Rate	64	78	8	1	151
Regulatory Compliance	69	64	14	3	150
Training Effectiveness	81	15	13	18	129
Wages & Compensation	70	25	22	6	123
Team Performance	74	16	21	9	121
Automation Exposure	41	48	19	9	120
Job Displacement	11	71	16	1	99
Developer Productivity	71	14	9	3	98
Hiring & Recruitment	49	7	8	3	67
Social Protection	26	14	8	2	50
Creative Output	26	14	6	2	49
Skill Obsolescence	5	37	5	1	48
Labor Share of Income	12	13	12	—	37
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Filter: Productivity

The near-uncorrelated rankings and rank shifts on the n=11 subset are driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset.

Subgroup analysis/observation within the 11-agent SWE-bench overlap indicating a negative correlation between Adoption and Capability for closed-source high-capability agents (no numerical coefficient reported in the excerpt).

high negative AgentPulse: A Continuous Multi-Signal Framework for Evaluati... Adoption-Capability correlation among closed-source high-capability agents

Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment.

Conceptual statement in the paper; no empirical sample cited for this specific claim (framing/argumentation).

high negative AgentPulse: A Continuous Multi-Signal Framework for Evaluati... scope of measurement of static benchmarks (capability vs. deployment/adoption)

Standard PayGo degrades substantially under classroom-scale concurrency.

Empirical latency measurements and comparative analysis across throughput tiers and concurrency levels in the instrumented deployment.

high negative Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response time (latency) degradation under concurrency

Each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect that single-agent systems do not face.

Architectural description and instrumentation of the four-agent ITAS system (paper reports measurements and latency analysis across tiers and concurrency levels).

high negative Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response latency (task completion time)
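
The parallel-phase maximum effect described in the claim above can be sketched as follows. This is a minimal illustration, not the paper's instrumentation: the phase layout and every latency value are hypothetical, and the function names are invented for the example. It assumes a phase of concurrent calls completes only when its slowest call returns, and that phases run in sequence.

```python
from typing import Sequence


def phase_latency_ms(call_latencies_ms: Sequence[int]) -> int:
    """A phase of concurrent calls finishes when its slowest call returns."""
    return max(call_latencies_ms)


def query_latency_ms(phases: Sequence[Sequence[int]]) -> int:
    """Phases run one after another, so their per-phase maxima add up."""
    return sum(phase_latency_ms(phase) for phase in phases)


# Hypothetical numbers: three agents called in parallel, then one synthesis call.
multi_agent = [[800, 1200, 2500], [900]]
# A single-agent system issues one call per query.
single_agent = [[900]]

print(query_latency_ms(multi_agent))   # 3400
print(query_latency_ms(single_agent))  # 900
```

Under these assumptions a single slow call (2500 ms here) sets the latency of its whole phase, which is why tail latency under concurrency matters more for a multi-agent design than for a single-agent one.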

In the absence of intervention, individually rational adoption of genAI will assuredly and profoundly reduce collective welfare.

Conclusion drawn from the paper's theoretical model (normative/predictive claim based on model dynamics; no empirical validation or sample reported in abstract).

high negative Generative artificial intelligence reduces social welfare th... collective (social) welfare

Habit formation around genAI use can couple otherwise separate domains, so that adoption in low-stakes tasks spills over into high-value tasks and amplifies welfare losses.

Theoretical/model-based claim showing coupling across domains via habit formation (model extension; no empirical sample reported in abstract).

high negative Generative artificial intelligence reduces social welfare th... spillover adoption and amplified welfare losses

The introduction of genAI—while initially beneficial at the individual level—will reduce social welfare for the most important types of tasks.

Model-derived result: theoretical analysis indicates social-welfare reductions in high-value tasks despite individual gains (no empirical sample reported in abstract).

high negative Generative artificial intelligence reduces social welfare th... social welfare for high-value tasks

Generative models are vulnerable to model collapse: when trained on data generated by earlier versions of themselves, their outputs can lose diversity and accuracy.

Theoretical claim / conceptual claim presented in the paper (no empirical sample size given in abstract); refers to degradation of model outputs when trained on self-generated data.

high negative Generative artificial intelligence reduces social welfare th... output diversity and accuracy

Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs.

Evaluation of models' self-predicted token cost versus realized token usage across agentic runs on SWE-bench Verified; reported correlations up to 0.39 and systematic underestimation bias.

high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... correlation and bias between model self-predicted token usage and actual token u...
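
The two quantities named in this claim, a correlation between predicted and realized token usage and a systematic underestimation bias, can be computed as in the sketch below. The data are hypothetical and the function names are invented for illustration; the paper's exact estimator is not given in the excerpt, so Pearson correlation and mean signed error are assumptions.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def mean_signed_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean of (predicted - actual); a negative value means underestimation."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical per-task token counts (in thousands), not taken from the paper.
predicted = [100, 150, 120, 300]
actual = [400, 350, 900, 700]

print(pearson_r(predicted, actual))
print(mean_signed_error(predicted, actual))  # -420.0
```

A weak correlation together with a consistently negative signed error would match the pattern the claim describes: predictions that track real usage poorly and fall short of it on average.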

Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5.

Cross-model comparisons of average total token consumption per task run across the eight evaluated LLMs on SWE-bench Verified; paper reports average differential between named models and GPT-5.

high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... average total token consumption per model (tokens consumed by model A minus mode...

Input tokens rather than output tokens drive the overall cost of agentic tasks.

Breakdown of token usage into input vs output token components from the analyzed agentic task trajectories on SWE-bench Verified (across the eight LLMs evaluated).

high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... share/contribution of input tokens vs output tokens to total token consumption

Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat.

Empirical measurement of token counts from agentic coding task runs compared to runs labeled as code reasoning and code chat across the evaluated trajectories (paper reports comparisons on SWE-bench Verified across eight frontier LLMs).

high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... total token consumption (agentic vs. code reasoning/code chat)

Industrial robots are widely used in manufacturing, yet most manipulation still depends on fixed waypoint scripts that are brittle to environmental changes.

Background statement in the paper's introduction; general literature/field observation (no new primary data reported for this claim in the abstract).

high negative Learning-augmented robotic automation for real-world manufac... robustness of fixed waypoint script manipulation

Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective.

Author assertion in the paper's introduction/abstract describing the state of practice; no empirical method, dataset, or sample size reported in the excerpt.

high negative The Last Harness You'll Ever Build need for human (expert) harness engineering

Vibe coding (unstructured GenAI-driven coding) promises rapid prototyping but often suffers from architectural drift, limited traceability, and reduced maintainability.

Paper asserts this as a motivating observation and characterizes vibe coding's weaknesses; the abstract frames these as commonly observed problems motivating the Shift-Up approach (no sample size given in abstract).

high negative Shift-Up: A Framework for Software Engineering Guardrails in... architectural drift, traceability, maintainability

Every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance.

Findings from the nine-variant ablation study reported in the paper; comparison of variants that add each listed mechanism versus the memory+reflection combination.

high negative AEL: Agent Evolving Learning for Open-Ended Environments performance (e.g., Sharpe ratio or other benchmark metrics) relative to memory+r...

There is a stark geopolitical divide between 'AI Core' nations and the Global South; the Global South faces acute risks of 'Digital Dependency' and eroded digital sovereignty.

Cross-study synthesis in the systematic review (2018-2026) identifying geopolitical patterns and risks; abstract does not quantify the number of studies or present empirical effect sizes.

high negative Artificial Intelligence, Public Policy and Governance - impl... digital dependency and digital sovereignty

The 'black box' nature of automated systems undermines the democratic social contract and principles of procedural justice, epitomised by the Australian 'Robo-debt' scandal.

Case study material and literature synthesized in the systematic review referencing the Australian Robo-debt case as an exemplar; abstract does not provide primary data or sample sizes.

high negative Artificial Intelligence, Public Policy and Governance - impl... democratic legitimacy and procedural justice

Traditional forecasting and optimization approaches often operate in isolation, limiting their real-world effectiveness in volatile-demand, uncertain-supply industries.

Positioning/background statement in the paper motivating the integrated framework (literature-based claim).

high negative Hybrid Deep Learning Approach for Coupled Demand Forecasting... effectiveness of isolated forecasting/optimization approaches

The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable: each decision the agent makes is recorded directly in cells that belong to and reflect on the user.

Conceptual / domain-specific argument made by the authors (no empirical sample attached to the claim).

high negative Auditing and Controlling AI Agent Actions in Spreadsheets risk associated with automated changes to user-owned artifacts

AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution: by the time users receive the output, all underlying decisions have already been made without their involvement.

Author assertion / conceptual description in the paper (no empirical quantification provided for this general statement).

high negative Auditing and Controlling AI Agent Actions in Spreadsheets process transparency / accessibility during execution

Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution.

Author assertion / literature-level observation presented in the paper (no empirical sample reported for this claim).

high negative Auditing and Controlling AI Agent Actions in Spreadsheets user oversight ability

Selective forgetting remains underexplored compared to retention in LLM agent memory research.

Authors' literature survey / position statement in paper (assertion made in abstract).

high negative FSFM: A Biologically-Inspired Framework for Selective Forget... extent of research coverage on forgetting vs retention

Beyond technical barriers there are organizational ones: a persistent AI literacy gap, cultural heterogeneity, and governance structures that have not yet caught up with agentic capabilities.

Data from over 30 interviews reporting organizational challenges, including limited AI literacy, diverse cultural attitudes across organizations, and governance that lags behind agentic AI capabilities.

high negative Agentic AI in Engineering and Manufacturing: Industry Perspe... organizational readiness factors (AI literacy, culture, governance alignment)

Adoption is constrained less by model capability than by fragmented and machine-unfriendly data, stringent security and regulatory requirements, and limited API-accessible legacy toolchains.

Stakeholder interviews (over 30) reporting barriers to deployment; qualitative synthesis identifies data fragmentation, security/regulatory requirements, and legacy toolchain access as primary constraints.

high negative Agentic AI in Engineering and Manufacturing: Industry Perspe... barriers to AI adoption in engineering/manufacturing

Users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns.

Turn-level coding of user behavior in the SWE-chat dataset: proportion of conversational turns containing correction/complaint/interrupt signals, computed across >63,000 user prompts and sessions.

high negative SWE-chat: Coding Agent Interactions From Real Users in the W... rate of user pushback per interaction turn

Agent-written code introduces more security vulnerabilities than code authored by humans.

Comparative analysis of security vulnerabilities attributed to agent-authored code versus human-authored code within the SWE-chat dataset (method details not specified in excerpt).

high negative SWE-chat: Coding Agent Interactions From Real Users in the W... security vulnerabilities introduced by agent-written code versus human-written c...

Just 44% of all agent-produced code survives into user commits.

Empirical measurement of code provenance and survival within the SWE-chat dataset: proportion of agent-produced code that becomes part of subsequent user commits across sessions.

high negative SWE-chat: Coding Agent Interactions From Real Users in the W... survival/usefulness of agent-produced code (proportion incorporated into commits...

Despite rapidly improving capabilities, coding agents remain inefficient in natural settings.

Authors' summary claim supported by dataset-derived metrics such as agent code survival rate (44%) and user pushback (44% of turns); observational analysis of SWE-chat.

high negative SWE-chat: Coding Agent Interactions From Real Users in the W... overall agent efficiency in natural developer workflows (qualitative synthesis)

Regulated deployment imposes four load-bearing systems properties — deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale — and stateful architectures violate them by construction.

Conceptual/architectural argument presented in the paper (theoretical analysis), not an empirical measurement in the abstract.

high negative Stateless Decision Memory for Enterprise AI Agents compatibility of stateful architectures with regulatory/system properties

The policy and research challenge posed by platform-mediated automation is not merely job quantity (technological unemployment) but institutional continuity — how societies reproduce practical competence when platforms optimize for efficiency rather than formation.

Normative and conceptual claim developed through literature synthesis (institutional economics, platform governance, workforce development); presented as an analytical reframing rather than an empirically tested hypothesis.

high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... institutional continuity and human capital reproduction (quality of workforce fo...

Entry-level roles have historically functioned as apprenticeships in which workers acquire tacit knowledge and critical judgment; if platforms curtail these formative occupational layers, organizations may lack future workers capable of exercising contextual reasoning required to manage complex systems.

Institutional economics and workforce development literature cited in the paper; conceptual synthesis without original empirical measurement reported.

high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... human capital formation (tacit knowledge acquisition and contextual reasoning ca...

Platform-mediated automation risks hollowing out labor structures from both directions: eroding repetitive, junior roles from below and automating supervisory coordination functions from above.

Theoretical argument synthesizing institutional economics and platform literature; articulated as a conceptual risk rather than demonstrated with original empirical data.

high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... structural change in occupational layers (hollowing out of junior and supervisor...

Algorithmic systems are displacing routine tasks across both low-wage entry-level work and middle-management functions.

Stated in paper's argumentation; supported by a literature-based review drawing on platform governance literature and recent research on AI-enhanced automation (no original empirical sample or quantitative study reported).

high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... displacement of routine tasks (across entry-level and middle-management roles)

The observed negative OPM effect is consistent with short-term 'J-curve' transition costs (process redesign and capability buildup) during early AI adoption.

Interpretation of empirical patterns (short-term decline in OPM concurrent with no ROA change) offered by the authors as an explanatory mechanism; not presented as separately estimated or experimentally tested.

high negative The Dynamic Causal Effects of Corporate AI Adoption on Profi... operating profit margin dynamics / transition costs interpretation

AI adoption had a significantly negative impact on the operating profit margin (OPM).

Causal analysis of KOSDAQ-listed companies (2018–2025) with AI-adoption timing identified via multi-step, contextually validated text analysis of DART business reports; endogeneity addressed using two-way fixed effects (TWFE) and Propensity Score Matching (PSM).

high negative The Dynamic Causal Effects of Corporate AI Adoption on Profi... operating profit margin (OPM)

An alternative specification that makes different choices about the timing of the pervasiveness of AI yields less robust results, though it also suggests that AI is labor saving.

Reported sensitivity analysis / alternative empirical specification in the paper; authors state the alternative yields less robust results but still indicates labor-saving effects.

high negative Early Estimates of the Impact of AI Within BEA’s Industry Ec... labor use (labor-saving effect)

Our baseline model finds evidence that AI is input saving.

Outcome reported from the baseline empirical specification indicating reductions in inputs associated with AI (authors' baseline model results).

high negative Early Estimates of the Impact of AI Within BEA’s Industry Ec... use of inputs (e.g., labor/capital inputs)

The infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it.

Author assertion in the paper framing the research gap; presented as observational/argumentative (no empirical audit reported).

high negative ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... availability of cross-user collaboration infrastructure and governance mechanism...

Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user.

Statement in paper's introduction/positioning; conceptual survey-style claim (no empirical study or systematic benchmark reported).

high negative ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... automation scope (single-user vs multi-user)

Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations.

Paper asserts that existing/standard benchmarks do not adequately isolate parsing and computation-orchestration abilities, motivating the new benchmark.

high negative Time Series Augmented Generation for Financial Applications benchmark adequacy for isolating parsing/computation orchestration

As multimodal AI achieves human-parity understanding of speech and gesture, [the keyboard's] necessity dissolves.

Theoretical claim supported by multidisciplinary review (history, neuroscience, technology, organizational studies); no quantified empirical test reported.

high negative The Instrumental Dissolution of Typing: Why AI Challenges th... necessity/usage of keyboard as default input

General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs.

Conceptual/argumentative claim stated in the paper's motivation; no empirical test reported in the abstract.

high negative Learning from AVA: Early Lessons from a Curated and Trustwor... misinformation risk / epistemic humility

Retest performance in the AI condition showed a nonsignificant absolute reduction, and the retest decrement was larger in the AI condition than in the comparison condition (i.e., retention decreased more after using Copilot).

Comparison of retest (one-week) performance across conditions reported in results; authors report a nonsignificant reduction and larger decrement for the AI/Copilot condition (n=22).

high negative Fast and Forgettable: A Controlled Study of Novices' Perform... retest performance (learning retention) after one week

Current operational approaches typically involve scattered testing tools, resulting in partial coverage and errors that surface only after deployment.

Authors' characterization of industry practice and limitations (assertion in paper; no empirical sample size reported in abstract).

high negative Aether: Network Validation Using Agentic AI and Digital Twin test coverage and post-deployment error incidence

Network change validation remains a critical yet predominantly manual, time-consuming, and error-prone process in modern network operations.

Statement in paper framing the problem; based on authors' characterization of current operational practice (no empirical sample size reported in abstract).

high negative Aether: Network Validation Using Agentic AI and Digital Twin manual effort / error-proneness of network change validation

The paper identifies governance challenges such as accountability gaps, digital sovereignty risks, ethical pluralism, and strategic weaponization arising from embedding AI in diplomatic practice.

Conceptual and normative analysis section of the paper outlining risks and governance challenges; illustrated by examples and argumentation.

high negative Strategic Cognition and Artificial Diplomacy: Designing Huma... presence of governance risks (accountability gaps, digital sovereignty, ethical ...

Thin training coverage fosters anxiety about substitution and slows diffusion of AI tools.

Reported associations from surveys of mid-level managers and technical staff, interviews, and document analysis across cases; thematic coding identified links between limited training, worker anxiety, and slower diffusion. (Sample size not reported.)

high negative Overcoming Resistance to Change: Artificial Intelligence in ... worker anxiety and speed of diffusion/adoption

Upstream textile SMEs frequently exhibit constrained supply chain resilience owing to persistent information latency and structural dependence on downstream orders.

Background/contextual claim stated in paper (motivation for study); no specific quantitative test reported in abstract.

high negative Enhancing Supply Chain Resilience in Textile SMEs: A Human-C... supply chain resilience (constrained due to information latency and downstream o...

The pharmaceutical R&D process is persistently challenged by high financial costs, protracted timelines, and remarkably low success rates.

Background statement in the review synthesizing prior literature and field knowledge; no original empirical data or sample sizes reported in the provided text.

high negative Artificial intelligence in drug discovery from advanced mole... financial costs, timelines, and success rates of pharmaceutical R&D
