Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
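
As a reading aid, the sketch below computes direction shares for three rows copied from the matrix. The function name `summarize` and the choice of rows are illustrative and not part of the source. In some rows the Total column exceeds the sum of the four direction columns (Firm Productivity has 598 coded claims against a stated total of 604). The page does not say what accounts for that gap, so the sketch only reports it.

```python
# Three rows copied from the matrix above: (positive, negative, mixed, null, total).
MATRIX = {
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Error Rate": (69, 91, 10, 2, 172),
}


def summarize(counts):
    """Return direction shares among coded claims and the gap to the stated total."""
    positive, negative, mixed, null, total = counts
    coded = positive + negative + mixed + null
    shares = {
        "positive": positive / coded,
        "negative": negative / coded,
        "mixed": mixed / coded,
        "null": null / coded,
    }
    # gap > 0 means the Total column counts claims not shown in any direction column.
    return shares, total - coded


for outcome, counts in MATRIX.items():
    shares, gap = summarize(counts)
    print(f"{outcome}: {shares['positive']:.1%} positive, "
          f"{shares['negative']:.1%} negative, gap {gap}")
```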

China's legal environment may offer certain advantages in terms of access to training data.

Stated as an analytical conclusion in the chapter based on comparative legal/regulatory assessment of data regimes; no empirical sample or quantitative evidence reported in the provided excerpt.

high positive Navigating Turbulence: The Challenge of Inclusive Innovation... access to training data for AI development

This work demonstrates how energy considerations can be embedded directly into AI-assisted coding workflows, supporting developers as they engage with energy implications through actionable feedback.

Concluding claim based on the system implementation and evaluation described (benchmarks and controlled study).

high positive EcoAssist: Embedding Sustainability into AI-Assisted Fronten... feasibility of embedding energy considerations into AI-assisted coding workflows

EcoAssist reduced per-website energy by 13-16% on average.

Reported result from the benchmark evaluation of 500 websites (effect size reported as 13-16%).

high positive EcoAssist: Embedding Sustainability into AI-Assisted Fronten... per-website energy consumption

We introduce EcoAssist, an energy-aware assistant integrated into an IDE that analyzes AI-generated frontend code, estimates its energy footprint, and proposes targeted optimizations.

Description of the system introduced by the authors (implementation claim).

high positive EcoAssist: Embedding Sustainability into AI-Assisted Fronten... availability of an IDE-integrated, energy-aware assistant

AI assistance improves short-term performance on tasks (people do better while using the AI).

Randomized controlled trials (N = 1,222) showing better immediate task outcomes when participants used AI assistance.

high positive AI Assistance Reduces Persistence and Hurts Independent Perf... short-term task performance (immediate accuracy/quality while assisted by AI)

Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.

Paper presents the benchmark design and logging mechanisms intended to enable reproducible experiments of multi-agent market interactions.

high positive Market-Bench: Benchmarking Large Language Models on Economic... None

Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics.

Paper states that the benchmark captures full transaction trajectories and exposes economic/operational/semantic metrics for automatic evaluation.

high positive Market-Bench: Benchmarking Large Language Models on Economic... None

In the retail stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role-based attention mechanism for purchase.

Methodological description of the retail-stage tasks and the role-based attention mechanism used to present offers to buyers.

high positive Market-Bench: Benchmarking Large Language Models on Economic... None

In the procurement stage, LLMs bid for limited inventory in budget-constrained auctions.

Design specification of the benchmark describing procurement-stage mechanics (auction/bidding mechanism, budget constraints).

high positive Market-Bench: Benchmarking Large Language Models on Economic... None
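
The excerpt does not specify the auction format used in the procurement stage. The sketch below is therefore only one possible reading, assuming a sealed-bid, pay-as-bid auction in which the highest unit price is served first and each bidder is capped by its own budget. The names `Bid` and `allocate` are hypothetical and do not come from Market-Bench.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str
    unit_price: float  # price offered per unit
    quantity: int      # units requested
    budget: float      # total cash the agent can spend


def allocate(bids, inventory):
    """Assumed pay-as-bid rule: serve the highest unit price first,
    limited by the bidder's budget and the remaining stock."""
    allocation = {}
    remaining = inventory
    for bid in sorted(bids, key=lambda b: b.unit_price, reverse=True):
        if remaining == 0:
            break
        if bid.unit_price <= 0:
            continue  # ignore non-positive prices in this sketch
        affordable = int(bid.budget // bid.unit_price)
        units = min(bid.quantity, affordable, remaining)
        if units > 0:
            allocation[bid.agent] = (units, units * bid.unit_price)
            remaining -= units
    return allocation


bids = [Bid("a", 10.0, 5, 30.0), Bid("b", 8.0, 5, 100.0)]
print(allocate(bids, inventory=6))
```

In this example agent "a" bids the higher price but can afford only three units, so the remaining stock goes to agent "b".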

We construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise.

Methodological description in the paper detailing the simulated multi-agent supply chain environment and the role of LLMs as retailer agents.

high positive Market-Bench: Benchmarking Large Language Models on Economic... None

We introduce Market-Bench, a comprehensive benchmark that evaluates the capabilities of LLMs in economically-relevant tasks through economic and trade competition.

Paper describes the design and release of Market-Bench as a benchmark/testbed (methodological contribution).

high positive Market-Bench: Benchmarking Large Language Models on Economic... None

Analyses use fixed-effects regression and structural equation modeling (SEM) on panel data from OECD countries.

Methods statement in the paper indicating use of fixed-effects and SEM applied to OECD-country panel data.

high positive AI-Augmented Peer Review and Scientific Productivity: A Cros... methodological approach (fixed-effects regression and SEM)
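
For readers unfamiliar with fixed-effects regression, the sketch below shows the within (entity-demeaned) estimator for a single regressor. The panel is synthetic and constructed so that the true slope is 2. It is not the paper's data, and the paper's actual specification, controls, and SEM stage are not reproduced here.

```python
from collections import defaultdict


def within_slope(rows):
    """rows: iterable of (entity, x, y). Returns the fixed-effects (within) slope
    obtained by demeaning x and y inside each entity before fitting."""
    groups = defaultdict(list)
    for entity, x, y in rows:
        groups[entity].append((x, y))
    sxy = 0.0
    sxx = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


# Synthetic panel: y = entity intercept + 2 * x, with intercepts 10 (A) and 50 (B).
panel = [
    ("A", 1.0, 12.0), ("A", 2.0, 14.0), ("A", 3.0, 16.0),
    ("B", 1.0, 52.0), ("B", 4.0, 58.0),
]
print(within_slope(panel))
```

Demeaning removes each entity's time-invariant intercept, which is why the estimator recovers the common slope even though the two entities sit at very different levels.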

This paper provides the first cross-country empirical validation of AI-augmented scientific evaluation systems.

Authors' stated novelty claim that prior work lacked cross-country empirical quantification and that their OECD panel study is the first such validation.

high positive AI-Augmented Peer Review and Scientific Productivity: A Cros... novelty / first empirical cross-country validation

A one standard deviation increase in AIRC is associated with an 18–25% increase in scientific productivity.

Reported point estimate/range from regression/SEM results linking a 1 SD change in the constructed AIRC to productivity outcomes in the OECD panel.

high positive AI-Augmented Peer Review and Scientific Productivity: A Cros... scientific productivity (percent change per 1 SD AIRC)

AI-assisted evaluation significantly enhances scientific productivity.

Fixed-effects regression and structural equation modeling (SEM) applied to panel data from OECD countries; reported association between AIRC and research output.

high positive AI-Augmented Peer Review and Scientific Productivity: A Cros... scientific productivity (research output)

We construct a novel AI Review Capability Index (AIRC).

Paper reports creation of a new composite index (AIRC) to measure national-level AI capability in peer review; constructed and applied to panel data from OECD countries.

high positive AI-Augmented Peer Review and Scientific Productivity: A Cros... AI Review Capability (AIRC) (index construction)

China's 'Global Community of Shared Future' white paper and Putin's 2024 Valdai address provide empirical evidence for an articulated alternative vision to the Western‑led global order.

Qualitative textual readings of the cited official documents (the white paper and the Valdai address) used in the paper as empirical support; no quantitative content analysis or sample coding is reported.

high positive Theorising the Interregnum: existence of articulated alternative geopolitical vision in official documents

Technical workers' potential for progressive transformation lies not just in their strategic importance and specialized knowledge but in their ability to build solidarity across the broader ecosystem of AI labour while operating between otherwise incommensurable philosophical and infrastructural systems.

Normative/theoretical claim combining philosophical analysis (Chinese Marxism, Bauman) with empirical literature on hidden AI labour and infrastructure competition (Muldoon et al., 2024); offered as an interpretive synthesis rather than an empirically validated causal finding.

high positive Theorising the Interregnum: capacity for progressive transformation via worker solidarity in AI labour ecosy...

Technical workers occupy a strategic position at the intersection of competing infrastructural systems and alternative visions of global order, making them potentially crucial actors in determining the outcome of the current interregnum.

Argumentative claim supported by secondary empirical literature cited in the paper (Muldoon, Graham, and Cant, 2024) on hidden labour supporting AI systems and on geopolitical competition over digital infrastructure; presented as qualitative/interpretive evidence rather than primary quantitative measurement.

high positive Theorising the Interregnum: technical workers' strategic influence over geopolitical/technical outcomes

The semi-core's challenge to Western hegemony creates unique conditions for systemic transformation.

The paper advances this as a theoretical argument synthesizing World‑Systems theory, Demirel (2024), Bauman's philosophical work, and interpretive readings of official Chinese and Russian documents; no quantitative causal test is reported.

high positive Theorising the Interregnum: potential for systemic transformation arising from semi‑core challenge

The emergence of a 'semi-core' is represented most prominently by China and Russia.

The paper cites Ege Demirel (2024) as the primary conceptual source and draws on textual evidence from China's 'Global Community of Shared Future' white paper and Putin's 2024 Valdai address; presented via World‑Systems theoretical framing and qualitative/discourse analysis.

high positive Theorising the Interregnum: emergence of a semi-core led by China and Russia

AI agents autonomously plan, invoke external tools, and execute multi-step action chains with reduced human involvement.

Definitional framing provided by the authors describing the technical/functional characteristics of 'AI agents' as used in the paper.

high positive AI Agents Under EU Law technical capability characteristics of AI agents (autonomous planning, tool inv...

The provider's foundational compliance task is an exhaustive inventory of the agent's external actions, data flows, connected systems, and affected persons.

Authors' recommendation/practical conclusion derived from the regulatory mapping (prescriptive guidance rather than empirical measurement).

high positive AI Agents Under EU Law recommended compliance practice (exhaustive inventory of actions, data flows, sy...

We propose a twelve-step compliance architecture and a regulatory trigger mapping connecting agent actions to applicable legislation.

Paper asserts it includes a proposed 12-step compliance architecture and a mapping between agent actions and regulatory triggers (explicit step count provided).

high positive AI Agents Under EU Law proposed compliance architecture (12 steps) and regulatory trigger mapping

We present a practical taxonomy of nine agent deployment categories mapping concrete actions to regulatory triggers.

Paper states it includes a taxonomy comprising nine deployment categories (explicit count provided).

high positive AI Agents Under EU Law taxonomy of agent deployment categories (count = 9)

This paper provides the first systematic regulatory mapping for AI agent providers integrating (a) draft harmonised standards under Standardisation Request M/613 to CEN/CENELEC JTC 21 as of January 2026, (b) the GPAI Code of Practice published in July 2025, (c) the CRA harmonised standards programme under Mandate M/606 accepted in April 2025, and (d) the Digital Omnibus proposals of November 2025.

Author claim about the paper's contribution and scope (novelty/first-of-its-kind mapping integrating specified standards and documents).

high positive AI Agents Under EU Law existence of an integrated, systematic regulatory mapping

AI agents - i.e. AI systems that autonomously plan, invoke external tools, and execute multi-step action chains with reduced human involvement - are being deployed at scale across enterprise functions ranging from customer service and recruitment to clinical decision support and critical infrastructure management.

Author assertion in the paper's introductory framing; no empirical sample size or quantified deployment statistics provided in the excerpt.

high positive AI Agents Under EU Law deployment/adoption of AI agents across enterprise functions

Ablation experiments and scalability analysis verify the effectiveness of each core module of HGA-MADDPG.

Ablation study and scalability analysis reported in the paper; experiments removing or altering core modules and reporting comparative performance.

high positive Research on Multi-Agent Collaborative Decision-Making Algori... contribution of core modules to algorithm performance (ablation results)

HGA-MADDPG maintains a cost reduction rate of 21.5% in a 120-node ultra-large-scale supply chain.

Scalability experiments reported in the paper on a 120-node simulated supply chain; reported cost reduction rate of 21.5% for HGA-MADDPG.

high positive Research on Multi-Agent Collaborative Decision-Making Algori... total cost (operational cost) in a 120-node supply chain

In the same extreme scenario of triple perturbation, HGA-MADDPG achieves a recovery time of 58 hours, outperforming existing methods.

Simulation experiments under triple perturbation reported in the paper; reported recovery time of 58 hours and stated superior performance relative to baselines.

high positive Research on Multi-Agent Collaborative Decision-Making Algori... recovery time (hours) after perturbation

In an extreme scenario of triple perturbation, HGA-MADDPG achieves a cost deviation rate of 29.6%, which is significantly better than existing methods.

Simulation experiments under an extreme scenario (triple perturbation) reported in the paper; comparison with existing methods and reported cost deviation rate of 29.6%.

high positive Research on Multi-Agent Collaborative Decision-Making Algori... cost deviation rate under perturbation

In the same baseline scenario, HGA-MADDPG controls the stockout rate at 3.2%.

Simulation experiments reported in the paper (baseline four-level supply chain using real data), reporting a stockout rate of 3.2% for HGA-MADDPG compared to baselines.

high positive Research on Multi-Agent Collaborative Decision-Making Algori... stockout rate

In the same baseline scenario, HGA-MADDPG achieves a service level improvement rate of 42.8% compared with eight baseline algorithms.

Simulation experiments reported in the paper (baseline four-level supply chain using SCDL and WSN data), compared to eight baselines; reported 42.8% service level improvement.

high positive Research on Multi-Agent Collaborative Decision-Making Algori... service level (supply/delivery performance)

In a baseline scenario (four-level supply chain, dynamic environment driven by real data from SCDL and WSN) and compared with eight baseline algorithms, HGA-MADDPG achieves a total cost reduction rate of 26.2%.

Simulation experiments reported in the paper: four-level supply chain baseline scenario driven by real data (SCDL and WSN), compared to eight baseline algorithms; reported aggregate result of 26.2% total cost reduction.

high positive Research on Multi-Agent Collaborative Decision-Making Algori... total cost (operational cost) of the supply chain

The paper constructs an adversarial disturbance and resilient training architecture that models three types of disturbances (demand mutation, node failure, transportation delay), adversarial agent injection, a dynamic environment replay buffer, and a two-stage training strategy.

Methodological description and implementation details of the training architecture and disturbance models in the paper.

high positive Research on Multi-Agent Collaborative Decision-Making Algori... presence and implementation of adversarial/resilient training components (method...

An adaptive fusion weight based on marginal returns is designed to dynamically balance local and global credit.

Methodological description (design and incorporation of adaptive fusion weight in algorithm).

high positive Research on Multi-Agent Collaborative Decision-Making Algori... adaptive weighting between local and global credit (method implementation)
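
The excerpt does not give the formula for the adaptive fusion weight. The sketch below assumes one simple form, a weight equal to the local share of total marginal return, used to blend local and global credit. Both the formula and the names `fusion_weight` and `fused_credit` are assumptions for illustration, not the HGA-MADDPG definition.

```python
def fusion_weight(marginal_local, marginal_global):
    """Assumed rule: weight on local credit is its share of total marginal return.
    Falls back to an even split when neither signal shows any marginal return."""
    total = marginal_local + marginal_global
    if total <= 0:
        return 0.5
    return marginal_local / total


def fused_credit(local_credit, global_credit, marginal_local, marginal_global):
    """Convex combination of local and global credit under the assumed weight."""
    w = fusion_weight(marginal_local, marginal_global)
    return w * local_credit + (1.0 - w) * global_credit


print(fused_credit(local_credit=1.0, global_credit=3.0,
                   marginal_local=0.25, marginal_global=0.75))
```

Under this assumed rule the blend shifts toward whichever credit signal currently yields the larger marginal return, which is one way to read "dynamically balance".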

The algorithm quantifies the contribution of individual actions to sub-chain objectives and system-level indicators through local and global credit networks.

Methodological description and algorithm design (local and global credit networks described in the paper).

high positive Research on Multi-Agent Collaborative Decision-Making Algori... quantification of action contributions to sub-chain and system-level objectives ...

HGA-MADDPG introduces a hierarchical graph attention mechanism to dynamically represent the state of the supply chain network topology.

Methodological description and algorithm design presented in the paper (development and implementation of the hierarchical graph attention mechanism).

high positive Research on Multi-Agent Collaborative Decision-Making Algori... dynamic representation of supply chain network topology (method implementation)

Rather than indiscriminate collection of context-relevant data, researchers and practitioners should adopt interactional practices to embed generative AI systems more appropriately into users' contexts of use.

Normative conclusion/provocation drawn from the paper's empirical findings and analysis of failure modes; presented as a recommendation (not an empirical effect; based on qualitative synthesis).

high positive Context Collapse: Barriers to Adoption for Generative AI in ... recommended design and deployment practices for contextual integration

Users deploy concrete strategies to address failures of generative AI systems to account for context.

Empirical observations from interviews describing user-devised workarounds and strategies; qualitative cases/examples (sample size not provided).

high positive Context Collapse: Barriers to Adoption for Generative AI in ... user practices and strategies for mitigating system-context misalignment

We hypothesize the emergent necessity of a 'Compliance Premium,' indicating wage resilience increasingly tied to risk-absorption capacity.

Hypothesis proposed by authors based on observed institutional/business risk differentials from HITL validation and OAI patterns; framed as a forward-looking interpretation rather than demonstrated empirical result.

high positive Bounded by Risk, Not Capability: Quantifying AI Occupational... wage resilience tied to compliance/risk-absorption capacity

Non-routine cognitive roles highly dependent on symbolic manipulation (e.g., Data Scientists) face unprecedented exposure, with OAI ≈ 0.70.

Reported OAI value for example occupation(s) (Data Scientists) derived from the algorithmic aggregation across DWAs; claim presented as a key empirical finding.

high positive Bounded by Risk, Not Capability: Quantifying AI Occupational... Relative Occupational Automation Index (OAI) for Data Scientists

We utilize a multi-agent LLM ensemble to score both technical feasibility and business risk for DWAs.

Method description: deployment of a multi-agent LLM ensemble to produce scores on technical feasibility and business risk per DWA. Specific ensemble composition and hyperparameters not provided in the excerpt.

high positive Bounded by Risk, Not Capability: Quantifying AI Occupational... LLM-derived technical feasibility and business risk scores

We introduce a Tech-Risk Dual-Factor Model that jointly scores technical feasibility and business risk to re-evaluate occupational exposure to LLMs.

Methodological contribution described in the paper (model specification). Implementation details described elsewhere in paper (see multi-agent scoring and aggregation), but claim itself is the introduction of the model.

high positive Bounded by Risk, Not Capability: Quantifying AI Occupational... joint technical feasibility and business risk scores

All code, infrastructure, and benchmark data are released to facilitate future research in realistic computer-use agents.

Statement of release in paper (availability claim).

high positive Gym-Anything: Turn any Software into an Agent Environment availability of code, infrastructure, and benchmark data

Applying the same auditing principle at test time — a separate VLM reviews completed trajectories and provides feedback — improves Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%.

Experimental result reported in paper: evaluation of Gemini-3-Flash with/without test-time VLM auditing on CUA-World-Long, reported scores 11.5% -> 14.0%.

high positive Gym-Anything: Turn any Software into an Agent Environment benchmark score (success rate) on CUA-World-Long
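
The reported change from 11.5% to 14.0% can be read as an absolute or a relative gain. The short calculation below makes both explicit; the variable names are illustrative.

```python
baseline = 11.5  # reported score without test-time auditing, in percent
audited = 14.0   # reported score with test-time auditing, in percent

absolute_gain = audited - baseline              # percentage points
relative_gain = absolute_gain / baseline * 100  # percent of the baseline score

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}% over baseline")
```

The gain is 2.5 percentage points, or roughly a 21.7% relative improvement over the unaudited score.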

Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size.

Modeling experiments reported in paper: distilled 2B VLM evaluated against larger models (2× size). Exact evaluation metrics and baseline model sizes not specified in excerpt.

high positive Gym-Anything: Turn any Software into an Agent Environment model performance on benchmark tasks (success metric unspecified in excerpt)

CUA-World-Long is a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks.

Benchmark description in paper reporting typical task lengths ("often requiring over 500 steps") and comparison to existing benchmarks.

high positive Gym-Anything: Turn any Software into an Agent Environment task horizon measured in number of steps

The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits.

Dataset release / creation claim specifying >10,000 tasks and train/test splits.

high positive Gym-Anything: Turn any Software into an Agent Environment number of long-horizon tasks and availability of realistic data and splits

Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage.

Dataset creation procedure and reported coverage claim (200 software applications), taxonomy derived from U.S. GDP data as stated.

high positive Gym-Anything: Turn any Software into an Agent Environment number of software applications covered and occupational coverage
