Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Filter: Human-AI Collaboration

At a 20x compression ratio, DPM improves reasoning coherence by +0.53 (Cohen's h=1.13, p=0.0034) compared to summarization-based memory (paired permutation, n=10).

Paired permutation test over 10 cases at a 20x compression ratio; reported effect +0.53 with Cohen's h=1.13 and p=0.0034.

high positive Stateless Decision Memory for Enterprise AI Agents reasoning coherence

At a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) compared to summarization-based memory (paired permutation, n=10).

Paired permutation test over 10 cases at a 20x compression ratio; reported effect +0.52 with Cohen's h=1.17 and p=0.0014.

high positive Stateless Decision Memory for Enterprise AI Agents factual precision
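The two claims above report an effect size (Cohen's h) and a p-value from a paired permutation test. As a rough sketch of how such quantities are commonly computed, and not the paper's own code, the Python below implements the standard arcsine-based Cohen's h for two proportions and an exact two-sided sign-flip permutation test on paired differences. The function names and the ten example differences are hypothetical; the source does not say which test variant (one- or two-sided, exact or Monte Carlo) the authors used.

```python
import math
from itertools import product


def cohens_h(p1, p2):
    """Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def paired_permutation_p(diffs):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis each paired difference is equally likely to
    be positive or negative, so every sign assignment is enumerated.
    """
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical per-case score differences (method A minus method B), n=10.
example_diffs = [0.6, 0.5, 0.7, 0.4, 0.5, 0.6, 0.3, 0.8, 0.5, 0.4]
p_value = paired_permutation_p(example_diffs)
print(f"two-sided p = {p_value:.5f}")
```

With n=10 the exact test enumerates 2^10 = 1024 sign assignments, which is cheap; larger samples would normally use random sampling of sign flips instead.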

On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds.

Empirical evaluation on 10 decisioning cases across three memory budgets; comparison between DPM and summarization-based memory as reported in the paper (n=10).

high positive Stateless Decision Memory for Enterprise AI Agents relative performance (match/outperform) of DPM vs summarization-based memory acr...

We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time.

Method/architectural proposal described in the paper.

high positive Stateless Decision Memory for Enterprise AI Agents architecture design (DPM specification)
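The claim above describes DPM only at a high level: an append-only event log and a single task-conditioned projection at decision time. The sketch below illustrates that shape under stated assumptions. The `Event`, `EventLog`, and `project` names, the filter-by-kind rule, and the recency-based budget are all hypothetical choices made for illustration; the source does not specify how the projection selects events.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Event:
    seq: int      # position in the log, assigned on append
    kind: str     # hypothetical event category
    payload: str


@dataclass
class EventLog:
    """Append-only: events can be added and read, never edited or removed."""
    _events: List[Event] = field(default_factory=list)

    def append(self, kind: str, payload: str) -> Event:
        event = Event(len(self._events), kind, payload)
        self._events.append(event)
        return event

    def snapshot(self) -> Tuple[Event, ...]:
        return tuple(self._events)


def project(log: EventLog, relevant_kinds: Set[str], budget: int) -> List[str]:
    """Deterministic projection (assumed rule): keep events whose kind is
    relevant to the task, then keep the `budget` most recent, in log order."""
    selected = [e for e in log.snapshot() if e.kind in relevant_kinds]
    kept = selected[-budget:] if budget > 0 else []
    return [f"{e.seq}:{e.kind}:{e.payload}" for e in kept]


log = EventLog()
log.append("income", "salary=52000")
log.append("chat", "greeting")
log.append("credit", "score=700")
log.append("income", "bonus=3000")

# One projection at decision time, conditioned on the task's relevant kinds.
view = project(log, {"income", "credit"}, budget=2)
print(view)
```

Because the projection is a pure function of the log and the task parameters, calling it again with the same inputs yields the same view, which is the property the "deterministic" label suggests.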

Presumptuousness in legal AI is systematic but addressable, and addressing it is a necessary step towards systems that reliably support, rather than supplant, human judgment wherever decisions must await sufficient evidence.

Synthesis conclusion in paper based on the benchmark experiments, comparisons across prompting methods, and SPEC results.

high positive Learning When Not to Decide: A Framework for Overcoming Fact... reliability of AI systems to support human judgment under insufficient evidence ...

SPEC achieves 89% overall accuracy, while appropriately deferring when evidence is insufficient.

Empirical evaluation of SPEC reported in paper: overall accuracy reported as 89% and behavior of proper deferral on insufficient-evidence cases.

high positive Learning When Not to Decide: A Framework for Overcoming Fact... overall accuracy and appropriate deferral on insufficient-evidence cases

We introduce SPEC (Structured Prompting for Evidence Checklists), a structured framework requiring explicit identification of missing information before any determination.

Methodological contribution described in paper: new prompting/framework (SPEC) that enforces explicit missing-information identification prior to decision.

high positive Learning When Not to Decide: A Framework for Overcoming Fact... framework implementation that forces evidence-checklist and missing-information ...
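SPEC is described above as requiring explicit identification of missing information before any determination. The source does not reproduce the SPEC prompt, so the following is only a plain-Python analog of that ordering constraint, not the authors' framework: a hypothetical `decide_with_checklist` function first lists which required facts are absent and defers if any are, and only otherwise applies a decision rule. The field names and the rule are made up for illustration.

```python
from typing import Callable, Dict, List, Optional


def decide_with_checklist(
    facts: Dict[str, Optional[str]],
    required: List[str],
    rule: Callable[[Dict[str, Optional[str]]], str],
) -> Dict[str, object]:
    """Check the evidence checklist first; decide only when it is complete."""
    missing = [name for name in required if facts.get(name) is None]
    if missing:
        # Defer rather than guess when required evidence is absent.
        return {"decision": "defer", "missing": missing}
    return {"decision": rule(facts), "missing": []}


# Hypothetical checklist and rule, for illustration only.
required_fields = ["reason_for_request", "supporting_document"]


def toy_rule(facts):
    return "approve" if facts["supporting_document"] == "verified" else "deny"


incomplete = decide_with_checklist(
    {"reason_for_request": "stated"}, required_fields, toy_rule
)
complete = decide_with_checklist(
    {"reason_for_request": "stated", "supporting_document": "verified"},
    required_fields,
    toy_rule,
)
print(incomplete)
print(complete)
```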

Through a collaboration with the Colorado Department of Labor and Employment, we secured access to official training materials and guidance to design a novel benchmark that systematically varies information completeness.

Methodological description in paper: collaboration with state agency and dataset/benchmark construction using official training materials and guidance.

high positive Learning When Not to Decide: A Framework for Overcoming Fact... creation of a benchmark varying information completeness

Long-term prospects of agentic AI include catalyzing accelerated innovation in physical design via autonomous algorithm discovery, continuous tool improvement, and closed-loop learning from large design corpora.

Forward-looking conclusion in the paper; framed as the authors' projection based on survey synthesis rather than as an empirically demonstrated outcome in the abstract.

high positive Invited: Agentic AI for Physical Design R&D: Status and Pros... autonomous algorithm discovery, continuous tool improvement, closed-loop learnin...

Interfaces between agentic systems and traditional EDA frameworks are a key area of focus and enable tighter integration of agent capabilities into existing design workflows.

Survey highlights interfaces between agents and EDA frameworks as a focus area; claim is descriptive of research direction rather than reporting empirical outcomes.

high positive Invited: Agentic AI for Physical Design R&D: Status and Pros... development and importance of interfaces between agents and EDA frameworks

Autonomous agents can explore heuristic spaces for placement, routing, and partitioning, enabling autonomous exploration of design heuristics.

Presented as an emphasized capability/area of research in the survey; the abstract asserts this possibility but does not report empirical benchmarks or sample sizes.

high positive Invited: Agentic AI for Physical Design R&D: Status and Pros... autonomous exploration of heuristic spaces (placement, routing, partitioning)

Tool-integrated agents can be used for algorithm evolution, debugging, and workflow automation in physical design R&D.

Paper emphasizes this as a primary area of application in the survey; rationale and examples are discussed but no quantitative trial sizes are given in the abstract.

high positive Invited: Agentic AI for Physical Design R&D: Status and Pros... use of agents for algorithm evolution, debugging, and workflow automation

Agentic AI systems can comprehend user specifications, modify code, run EDA tools, analyze results, perform multi-step reasoning, and iteratively refine design heuristics—unlike earlier ML uses that focused narrowly on prediction or optimization subroutines.

Descriptive claim in the paper contrasting agentic AI capabilities with earlier ML approaches; presented as an overview of functional capabilities rather than empirical measurement.

high positive Invited: Agentic AI for Physical Design R&D: Status and Pros... breadth of tasks agentic AI systems can perform (spec comprehension, code modifi...

Recent advances in large language models (LLMs) and tool-using autonomous agents present new opportunities for accelerating research and development in physical design.

Stated as a central thesis in the paper's abstract/survey; based on the authors' synthesis of recent advances and emerging applications (no empirical sample or quantified evaluation reported in the abstract).

high positive Invited: Agentic AI for Physical Design R&D: Status and Pros... acceleration of research and development in physical design

The framework is applied to Canada's 2025-2026 national AI Strategy consultation with n = 5,253 respondents across two independent policy topics.

Empirical application reported in the paper; dataset description gives sample size and two policy topics.

high positive Participatory provenance as representational auditing for AI... sample and context for empirical evaluation

This paper introduces 'participatory provenance': a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization.

Methodological contribution described in the paper (framework design combining optimal transport, causal inference, semantic analysis).

high positive Participatory provenance as representational auditing for AI... ability to track transformations/filtration/loss of individual submissions

AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.

Aggregate comparison from the preregistered experiment showing humans had nonzero endorsement and higher suppression rates while all tested LLMs showed 0% endorsements and lower suppression under pressure (human n=1,201; AI conversations n=3,360).

high positive Large Language Models Outperform Humans in Fraud Detection a... consistency of fraud warnings between advisors (LLMs vs. lay humans)

Human advisors endorsed fraudulent investments at baseline rates of 13-14%.

Human benchmark of 1,201 participants run in the preregistered experiment; reported baseline endorsement rates for fraudulent scenarios.

high positive Large Language Models Outperform Humans in Fraud Detection a... baseline endorsement rate of fraudulent investments by human advisors

Motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them.

Preregistered experiment across seven leading LLMs and twelve investment scenarios; 3,360 AI advisory conversations analyzed comparing motivated vs. baseline investor framings.

high positive Large Language Models Outperform Humans in Fraud Detection a... frequency of AI fraud warnings under motivated investor framing

Under these conditions (alignment of forces and AI-driven ideation cost reductions), PIM offers a framework for organising governed discovery in real time and provides the methodological foundation for later applied work.

The paper presents PIM as a proposed framework and positioning statement for future applied research and implementations (theoretical proposal; no applied trials reported).

high positive Probabilistic Innovation Methodology: A Scientific Methodolo... feasibility of using PIM to organise real-time governed discovery

Organised attacks on complex problems can generate an epistemic mode transition: a shift from predominantly Knightian uncertainty toward probabilistically characterisable innovation dynamics as relevant structures become more visible, decomposed, coordinated, and testable.

The paper states and formalises this methodological claim within PIM as a central proposition (theoretical argumentation; no empirical validation reported).

high positive Probabilistic Innovation Methodology: A Scientific Methodolo... degree of uncertainty characterization (Knightian vs probabilistic)

When problem-relevant causal, informational, and coordinative forces become sufficiently aligned, the epistemic character of search changes and open-ended uncertainty can be progressively transformed into structured probabilistic search.

The claim is presented as the central theoretical argument and formalised within the PIM conceptual framework (theoretical/model-based argumentation; no empirical sample).

high positive Probabilistic Innovation Methodology: A Scientific Methodolo... epistemic character of search (shift from Knightian uncertainty to probabilistic...

The RAPIDDS user study (n=32) also reports improvements in subjective measures, including fluency and user preference, over non-adaptive systems.

User study (n=32) reporting subjective questionnaire/ratings (fluency, preference) comparing RAPIDDS vs non-adaptive baselines.

high positive Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teamin... subjective fluency and user preference

A user study (n=32) shows significant plan improvement compared to non-adaptive systems across objective metrics such as efficiency and proximity.

User study reported in paper with sample size n=32 comparing RAPIDDS to non-adaptive systems on objective metrics (efficiency, proximity); significance claimed.

high positive Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teamin... efficiency and proximity (objective plan metrics)

An ablation study in simulation and a physical robot scenario demonstrates the importance of dual (task + motion) adaptation.

Ablation experiments reported in paper (simulation and physical robot experiments comparing full RAPIDDS to ablated variants).

high positive Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teamin... plan performance when removing components (effect of dual adaptation)

RAPIDDS jointly adapts task schedules and steers diffusion models of robot motions to maximize efficiency and minimize proximity, accounting for individualized models.

Algorithmic method described in paper combining schedule optimization with motion steering (method section).

high positive Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teamin... efficiency and proximity of joint plans

At the country level, digitalisation and workplace training provision steepen the exposure–adoption gradient.

Country-level heterogeneity analysis using the 2024 EWCS (35 countries) linking national measures of digitalisation and prevalence of workplace training to stronger occupational exposure–adoption relationships.

high positive Generative AI at Work: From Exposure to Adoption across 35 E... self-reported adoption of generative AI (interaction with exposure)

Individual skills, non-routine cognitive job content within occupations, and employee say in organisational decisions steepen the exposure–adoption gradient.

Interaction and stratified analyses from the 2024 EWCS showing stronger exposure–adoption associations among workers with higher individual skills, more non-routine cognitive job content (within occupations), and greater employee influence over organisational decisions; sample >36,600 workers.

high positive Generative AI at Work: From Exposure to Adoption across 35 E... self-reported adoption of generative AI (interaction with exposure)

Occupational exposure strongly predicts uptake.

Associational/regression analysis using the 2024 EWCS linking occupation-level measures of AI exposure to individual-level self-reported adoption; sample >36,600 workers across 35 countries.

high positive Generative AI at Work: From Exposure to Adoption across 35 E... self-reported adoption of generative AI

Adoption averages 12% but ranges from under 3% to 25% across countries.

Descriptive analysis of the 2024 European Working Conditions Survey (EWCS), sample of more than 36,600 workers in 35 countries; country-level tabulations of self-reported generative AI adoption.

high positive Generative AI at Work: From Exposure to Adoption across 35 E... self-reported adoption of generative AI

ClawNet enables multiple users to collaborate securely through their respective agents.

Capability claim about the instantiated system (authors assert that ClawNet enables secure multi-user collaboration; excerpt contains no empirical security evaluation or user study).

high positive ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... secure multi-user collaboration enabled by agent-mediated interactions

We instantiate this paradigm in ClawNet, an identity-governed agent collaboration framework that enforces identity binding and authorization verification through a central orchestrator.

Implementation claim: authors state they built ClawNet as an instantiation of their paradigm (paper describes framework/architecture; no experimental evaluation included in excerpt).

high positive ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... existence of an implemented framework (ClawNet) enforcing identity binding and a...

Action-level accountability logs every operation against its owner's identity and authorization, ensuring full auditability.

Design claim describing an accountability primitive (paper asserts logging and auditability as a property; no audit or verification evidence shown in excerpt).

high positive ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... auditability of agent actions (logging tied to owner identity/authorization)

Scoped authorization enforces per-identity access control and escalates boundary violations to the owner.

Design/specification claim describing the scoped authorization governance primitive in the proposed paradigm (no empirical or security evaluation provided in excerpt).

high positive ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... access control enforcement and escalation behavior
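The two ClawNet primitives described above, action-level accountability and scoped authorization with escalation, can be pictured with a small sketch. This is an illustration under assumptions, not ClawNet's implementation: the `Orchestrator` class, its fields, and the identity and action names are hypothetical, and the source excerpt gives no API details.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Orchestrator:
    # Per-identity scopes: identity name -> set of permitted actions (assumed model).
    scopes: Dict[str, Set[str]]
    audit_log: List[Tuple[str, str, str, bool]] = field(default_factory=list)
    escalations: List[Tuple[str, str, str]] = field(default_factory=list)

    def perform(self, owner: str, identity: str, action: str) -> bool:
        """Authorize an action, log it against its owner, escalate violations."""
        allowed = action in self.scopes.get(identity, set())
        # Every operation is recorded, whether or not it was permitted.
        self.audit_log.append((owner, identity, action, allowed))
        if not allowed:
            # Boundary violations go back to the owner instead of proceeding.
            self.escalations.append((owner, identity, action))
        return allowed


orchestrator = Orchestrator(scopes={"work": {"read_calendar", "send_invite"}})
ok = orchestrator.perform("alice", "work", "send_invite")
blocked = orchestrator.perform("alice", "work", "read_bank_statement")
print(orchestrator.audit_log)
print(orchestrator.escalations)
```

Logging before the allow/deny branch is a deliberate choice in this sketch: denied attempts appear in the audit trail too, which matches the "every operation" wording of the accountability claim.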

The paradigm rests on three governance primitives: (1) a layered identity architecture that separates a Manager Agent from multiple context-specific Identity Agents; the Manager Agent holds global knowledge but is architecturally isolated from external communication.

Architectural/design claim describing the proposed layered identity primitive (presentation of design; no empirical validation in excerpt).

high positive ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... identity architecture and information flow constraints

We propose a human-symbiotic agent paradigm in which each user owns a permanently bound agent system that collaborates on the owner's behalf, forming a network whose nodes are humans rather than agents.

Design proposal / conceptual architecture presented in the paper (no large-scale deployment or empirical evaluation described in excerpt).

high positive ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... structure of agent networks (human-centric vs agent-centric) and delegation mode...

The next frontier for AI agents lies not in stronger individual capability, but in the digitization of human collaborative relationships.

Normative/strategic claim advanced by the authors as the central thesis (conceptual argument, no empirical test reported).

high positive ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... focus of AI-agent development (individual capability vs collaboration digitizati...

Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate.

Theoretical/argumentative claim presented as background motivation (conceptual reasoning, citation not provided in excerpt).

high positive ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... human productivity as mediated by social/organizational relationships

Time Series Augmented Generation (TSAG) enables LLM agents to delegate quantitative tasks to verifiable external tools.

Description of TSAG framework in paper stating delegation mechanism to external verifiable tools for quantitative computations.

high positive Time Series Augmented Generation for Financial Applications delegation capability to external tools
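The delegation idea in the claim above, where the agent hands quantitative work to verifiable external tools rather than computing it in free text, can be sketched as a small tool registry. This is not the TSAG code: the tool names, the request format, and the `call_tool` dispatcher are hypothetical, and the price series is made up.

```python
from statistics import mean
from typing import Callable, Dict, List


def moving_average(series: List[float], window: int) -> List[float]:
    """Simple moving average over each full window of the series."""
    return [
        mean(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]


def pct_change(series: List[float]) -> List[float]:
    """Percent change between consecutive observations."""
    return [(b - a) / a * 100 for a, b in zip(series, series[1:])]


# Hypothetical registry of deterministic tools the agent may call.
TOOLS: Dict[str, Callable] = {
    "moving_average": moving_average,
    "pct_change": pct_change,
}


def call_tool(request: dict, series: List[float]) -> dict:
    """Run the requested tool and return the call alongside its result,
    so the computation can be re-run and checked independently."""
    name = request["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    args = request.get("args", {})
    return {"tool": name, "args": args, "result": TOOLS[name](series, **args)}


prices = [100.0, 110.0, 121.0]
# In a real agent the request would come from the model; here it is hard-coded.
answer = call_tool({"tool": "moving_average", "args": {"window": 2}}, prices)
print(answer)
```

Returning the tool name and arguments with the result is what makes the answer checkable: a reviewer can replay the same call on the same data.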

We publicly release the evaluation framework and empirical insights to foster standardized research on reliable financial AI.

Paper states that the framework, benchmark, and empirical results are released publicly by the authors.

high positive Time Series Augmented Generation for Financial Applications public release of resources

The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm.

Empirical results from the authors' experiments on the 100-question benchmark across multiple agents; paper states agents achieve 'near-perfect' tool-use accuracy and 'minimal' hallucination.

high positive Time Series Augmented Generation for Financial Applications tool-use accuracy; hallucination rate

We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools.

Paper reports applying the TSAG framework in an empirical study in which agents call external tools to perform quantitative computations; described as 'large-scale' and implemented by the authors.

high positive Time Series Augmented Generation for Financial Applications use of external/verifiable tools by LLM agents

We introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis.

Paper describes a new methodology and benchmark (Time Series Augmented Generation, TSAG) developed by the authors for evaluating LLM reasoning on financial time-series tasks.

high positive Time Series Augmented Generation for Financial Applications existence of a new evaluation methodology / benchmark

Effective evaluation-driven loop scaling is a central axis for advancing LLM-driven scientific discovery, and SimpleTES provides a simple yet practical framework for realizing these gains.

High-level claim supported by the aggregate experimental results and discussion in the paper.

high positive Evaluation-driven Scaling for Scientific Discovery impact of scaling evaluation-driven discovery loops on LLM-driven scientific dis...

When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover.

Experiments in which models were post-trained on successful SimpleTES trajectories and evaluated on both seen and unseen problems (paper claim of improved efficiency and generalization).

high positive Evaluation-driven Scaling for Scientific Discovery post-training efficiency on seen problems and generalization to unseen problems ...

SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning.

Methodological claim and supporting experiments where SimpleTES generates solution trajectories that are then used as supervision for learning.

high positive Evaluation-driven Scaling for Scientific Discovery availability and usefulness of trajectory-level histories for supervision

We discovered new Erdős minimum overlap constructions that surpass the best-known results.

Reported novel combinatorial constructions (Erdős minimum overlap) in the experiments that improve on prior best-known results.

high positive Evaluation-driven Scaling for Scientific Discovery quality of Erdős minimum overlap constructions (best-known benchmarks)

We designed quantum circuit routing policies that reduce gate overhead by 24.5%.

Experimental results reported for quantum circuit routing tasks showing a 24.5% reduction in gate overhead when using SimpleTES-designed policies.

high positive Evaluation-driven Scaling for Scientific Discovery quantum circuit gate overhead

We sped up the widely used LASSO algorithm by over 2x.

Benchmarking experiment reported in the paper comparing LASSO runtime/performance with and without SimpleTES (paper states >2x speedup).

high positive Evaluation-driven Scaling for Scientific Discovery LASSO algorithm runtime / speed

SimpleTES consistently outperforms both frontier-model baselines and sophisticated optimization pipelines.

Comparative experimental evaluation vs. frontier-model baselines and optimization pipelines across the reported problems (paper claim).

high positive Evaluation-driven Scaling for Scientific Discovery performance relative to baselines (solution quality / discovery success)
