Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
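
The counts in the matrix can be turned into per-outcome shares. The sketch below does this for three rows copied from the table. Note that the four direction columns do not always add up to Total (for Other, 758 + 199 + 100 + 900 = 1957 against a Total of 2007). The page does not explain the gap, so the `unlisted` label and the `direction_shares` helper are assumptions made for illustration only.

```python
# Counts copied from three rows of the matrix:
# (positive, negative, mixed, null, total)
rows = {
    "Other": (758, 199, 100, 900, 2007),
    "Governance & Regulation": (826, 400, 191, 122, 1563),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the row total.

    "unlisted" is a hypothetical label for claims counted in Total
    but not in any of the four direction columns.
    """
    counted = positive + negative + mixed + null
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
        "unlisted": (total - counted) / total,
    }


summary = {name: direction_shares(*counts) for name, counts in rows.items()}
for name, shares in summary.items():
    print(name, {key: round(value, 3) for key, value in shares.items()})
```

Dividing by Total rather than by the sum of the four columns keeps the shares comparable with the Total column as printed; dividing by the column sum would be an equally defensible choice.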

Filter: Human-AI Collab

SCDPs are a useful framework for policy simulation for the digital economy, mechanism design for information systems, and digital twin modeling of cyberinfrastructure.

Paper posits these applications as prospective uses of the framework (argumentative/speculative; no empirical evaluation reported in abstract).

high positive The Design and Composition of Structural Causal Decision Pro... usefulness for policy simulation, mechanism design, and digital twin modeling

SCDPs are capable of modeling variable discounting, a tool used widely in social scientific modeling.

Paper states the capability as part of SCDP definition and examples (theoretical claim).

high positive The Design and Composition of Structural Causal Decision Pro... modeling of variable discounting

An SCDP can endogenously model the memory-formation process and is thus useful for modeling resource‑rational agents in dynamic settings.

Paper asserts SCDP can represent memory-formation endogenously and discusses application to resource-rational agents (theoretical modeling capability).

high positive The Design and Composition of Structural Causal Decision Pro... ability to model endogenous memory formation / resource-rational agents

SCDPs are strictly more expressive than POMDPs because they do not assume rational belief formation.

Comparative expressiveness claim stated in the paper; supported by theoretical argument or formal separation result (paper text states the claim explicitly).

high positive The Design and Composition of Structural Causal Decision Pro... expressiveness relative to POMDPs (ability to represent non-rational belief form...

SCDPs inherit the composition properties of SCDMs (i.e., SCDPs benefit from SCDM composability).

Logical consequence argued in the paper from SCDP being constructed from SCDMs; likely supported by formal argumentation in the text.

high positive The Design and Composition of Structural Causal Decision Pro... inheritance of composability by SCDPs

A Structural Causal Decision Process (SCDP) is defined as a recurring SCDM with a discount variable.

Formal definition introduced in the paper (theoretical definition).

high positive The Design and Composition of Structural Causal Decision Pro... definition of SCDP as recurring SCDM with discounting
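
The listing does not reproduce the paper's formalism, so the following is only a generic sketch of what a per-step discount variable can mean for a recurring decision problem. It is not the paper's definition. The function name `discounted_return` and the convention that the reward at step t is weighted by the product of the discounts from earlier steps are assumptions.

```python
def discounted_return(rewards, discounts):
    """Sum rewards, each weighted by the product of all earlier discounts.

    Hypothetical convention: discounts[t] is applied after step t, so the
    first reward is undiscounted. A constant list recovers the usual
    fixed-discount sum.
    """
    if len(rewards) != len(discounts):
        raise ValueError("rewards and discounts must have the same length")
    total = 0.0
    weight = 1.0
    for reward, discount in zip(rewards, discounts):
        total += weight * reward
        weight *= discount  # the discount itself may vary from step to step
    return total


fixed = discounted_return([1, 1, 1], [0.5, 0.5, 0.5])
variable = discounted_return([1, 2, 4], [0.9, 0.5, 1.0])
print(fixed, variable)
```

With a constant discount this reduces to the familiar geometric weighting; letting the discount change per step is what the "variable discounting" claim refers to in general terms.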

SCDMs have a well-defined and computationally useful property of composability.

Paper states and demonstrates ("We show") the composability property, presumably via formal proofs or constructive arguments in the text (theoretical proofs/exposition).

high positive The Design and Composition of Structural Causal Decision Pro... composability of causal decision models

SCDMs can have open root variables for which no probability distribution or structural equation is given.

Model definitions in the paper explicitly allow open root variables (theoretical description).

high positive The Design and Composition of Structural Causal Decision Pro... support for open root variables in model formalism

In SCDMs, agent decisions can be constrained by their causal antecedents (i.e., decisions can be constrained by their causal parents).

Model specification and definitions in the paper describing constraints on decisions as part of SCDM structure (theoretical construction).

high positive The Design and Composition of Structural Causal Decision Pro... decision constraints by causal antecedents

Structural Causal Decision Models (SCDMs) expand on Structural Causal Influence Models by explicitly representing the causal relationships between model variables and the payoffs of agent decisions.

Formal model development and comparison to existing SCIMs provided in the paper (theoretical definitions and arguments).

high positive The Design and Composition of Structural Causal Decision Pro... explicit representation of causal relationships between variables and payoffs

We present two new classes of causal models of decision-making agents: Structural Causal Decision Models (SCDMs) and Structural Causal Decision Processes (SCDPs).

Paper introduces formal definitions for two model classes and describes their properties in the text (theoretical exposition).

high positive The Design and Composition of Structural Causal Decision Pro... introduction of new model classes (SCDMs and SCDPs)

These findings provide insights for designing flexible yet reliable constraint-based workflows.

Synthesis and discussion of study results and technical evaluation in paper's conclusion.

high positive U-Define: Designing User Workflows for Hard and Soft Constra... design guidance for constraint-based workflows

User-defined constraint types improve user satisfaction.

Reported user study measures showing higher satisfaction for participants using U-Define compared to baselines (no sample size or numeric effects provided).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... user satisfaction (self-reported)

User-defined constraint types improve performance.

Reported results from user studies and/or technical evaluation indicating better task performance when users can set hard/soft constraint types (no numeric effect size or sample size in excerpt).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... performance (task success / quality of generated plans)

User-defined constraint types improve perceived usefulness.

Results from the reported user studies comparing U-Define (user-defined constraint types) to baselines; based on participant responses and measures of perceived usefulness (sample sizes/details not provided in excerpt).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... perceived usefulness (user-reported)

U-Define verifies hard constraints using formal model checking and verifies soft constraints using an LLM-as-judge evaluation.

Description of the complementary verification methods employed in the U-Define system (technical design/implementation).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... verification of constraint types (hard via model checking, soft via LLM evaluati...

We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility.

System implementation and description in paper (design and implementation of U-Define).

high positive U-Define: Designing User Workflows for Hard and Soft Constra... ability to specify constraints (natural-language input and categorization into h...

KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.

Theoretical claim about economic and cumulative effects of adopting KOs; no cost-benefit analysis, pilot results, or quantitative evidence reported in the paper.

high positive Reliable AI Needs to Externalize Implicit Knowledge: A Human... cost-effectiveness of verification and cumulative improvement in AI reliability

We propose Knowledge Objects (KOs) — structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse.

Proposed solution described in the paper; conceptual design and intended properties presented, without reported deployments, trials, or empirical evaluation.

high positive Reliable AI Needs to Externalize Implicit Knowledge: A Human... externalization and human verifiability of implicit knowledge via KOs

Evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, provides added value compared to focusing on benchmark performance only.

Argument/interpretation in the paper based on the study's multi-turn human-in-the-loop evaluation showing differences between objective performance gains and participant perceptions.

high positive Seeking Information with RAG-Assistants: Does Model Size Mat... evaluation methodology value (usability, satisfaction, accuracy)

Hybrid systems (human + RAG assistant) are beneficial in information-seeking scenarios.

Conclusion drawn from the experiment showing human-AI collaboration outperforms model-only baselines across model sizes in a realistic multi-turn information-seeking task with N=112 participants.

high positive Seeking Information with RAG-Assistants: Does Model Size Mat... task performance in information-seeking

The performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size.

Reported results from the experimental comparison across conditions and three model sizes (3B, 8B, 70B) with N=112 participants; paper states the performance gain is significant across sizes (no numeric effect sizes or p-values provided in the excerpt).

high positive Seeking Information with RAG-Assistants: Does Model Size Mat... task accuracy / performance

The framework addresses AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.

Paper lists and provides guidance on AI-specific methodological issues (model versioning, interaction dynamics, contamination/spillover, equity). This is a descriptive claim about topics the framework covers, not an empirical evaluation of solutions.

high positive Principles and Guidelines for Randomized Controlled Trials i... coverage of AI-specific methodological challenges in evaluation guidelines

The framework implements a graded transparency and repeatability framework.

Paper extends TOP-guideline-derived transparency principle into a graded scheme for transparency and repeatability; described as an operational feature of the proposed framework.

high positive Principles and Guidelines for Randomized Controlled Trials i... graded transparency and repeatability practices for AI RCTs

The framework integrates heterogeneity analysis and practical significance assessment.

Paper reports inclusion of guidance on analyzing heterogeneous treatment effects and assessing practical significance; presented as part of guidelines rather than tested across datasets.

high positive Principles and Guidelines for Randomized Controlled Trials i... inclusion of heterogeneity and practical significance analysis in evaluation pra...

The framework formalizes causal inference through RCT methodology for AI contexts.

Paper states adoption of randomized controlled trial methods and causal inference framing for AI impact evaluation; described as methodological proposition rather than validated application.

high positive Principles and Guidelines for Randomized Controlled Trials i... use of RCTs to support causal inference in AI evaluations

Our framework extends prior work by centering evaluation on human performance rather than model output alone.

Paper claims a conceptual shift: focus on human performance metrics; supported by argumentative rationale and literature references rather than empirical demonstration.

high positive Principles and Guidelines for Randomized Controlled Trials i... focus of evaluation metrics (human performance vs. model output)

The principles and guidelines serve three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms.

Paper's stated intended uses/positioning of the framework; presented as roles in the discussion/positioning section rather than empirically validated roles.

high positive Principles and Guidelines for Randomized Controlled Trials i... utility of the framework in planning, evaluating, and standard-setting

We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases.

Paper reports a concrete output: 33 guidelines derived from the five principles, with each guideline presented as requirement + rationale + implementation instructions + evidence base (documented in paper content).

high positive Principles and Guidelines for Randomized Controlled Trials i... availability of operational guidelines for AI RCTs

The paper adopts the four-validity framework of Shadish et al. (2002) and extends it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025).

Explicit methodological choice described in the paper: adoption of Shadish et al. four-validity framework and addition of a transparency/repeatability principle based on TOP Guidelines; documented in the text as design decision.

high positive Principles and Guidelines for Randomized Controlled Trials i... methodological framework / validity criteria

The framework draws on established experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology.

Paper reports literature review and cross-disciplinary synthesis as the methodological foundation for the framework (references to those disciplines). No empirical cross-disciplinary experiment reported.

high positive Principles and Guidelines for Randomized Controlled Trials i... methodological comprehensiveness / interdisciplinary grounding

This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies).

Paper's stated contribution: development of a conceptual framework integrating RCT design principles for AI evaluation. Based on literature synthesis and methodological argumentation rather than empirical testing.

high positive Principles and Guidelines for Randomized Controlled Trials i... standardization of AI evaluation RCTs / evaluation methodology

The paper introduces a Specification Governance Model (SGM), grounded in Transaction Cost Economics, and provides a practical governance decision guide.

Conceptual/modeling contribution described in the paper: SGM grounded in TCE with an applied decision guide (theoretical plus prescriptive).

high positive The Productivity-Reliability Paradox: Specification-Driven G... governance decision-making for specification practices

The paper proposes the AI-Augmented Methodology Taxonomy (AAMT), classifying six methodologies under three AI integration tiers.

Conceptual contribution: taxonomy introduced and described in the paper (six methodologies, three tiers).

high positive The Productivity-Reliability Paradox: Specification-Driven G... existence and classification of methodologies (taxonomic contribution)

Telemetry across 10,000+ developers shows a 98% increase in pull requests.

Observational telemetry data aggregated across >10,000 developers reported in the paper; metric reported is percent increase in pull request count.

high positive The Productivity-Reliability Paradox: Specification-Driven G... number of pull requests (pull_request_count)

Controlled studies report 20-56% productivity gains on well-scoped tasks.

Aggregate of multiple controlled experimental studies cited in the paper (2022–2026); reported as observed productivity improvements on well-scoped tasks in those studies. Specific study-level sample sizes not reported in the claim text.

high positive The Productivity-Reliability Paradox: Specification-Driven G... developer productivity

Practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration can be articulated, and calibrated beliefs plus utility-aware policies can improve agentic AI orchestration (illustrated via concrete examples and design patterns).

Paper provides articulated properties, examples, and design patterns but no empirical validation; claims of improvement are illustrated conceptually.

high positive Position: agentic AI orchestration should be Bayes-consisten... improvement in agentic AI orchestration from calibrated beliefs and utility-awar...

Coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily in the LLM agent parameters.

Central prescriptive claim of the position paper; supported by conceptual argumentation and illustrative examples rather than empirical tests.

high positive Position: agentic AI orchestration should be Bayes-consisten... coherence of decision-making in agentic systems as a function of orchestration-l...

Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions.

Argumentative/theoretical claim in the position paper; illustrated with conceptual examples and design patterns rather than empirical evaluation.

high positive Position: agentic AI orchestration should be Bayes-consisten... decision quality of agentic control via belief maintenance and updating
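
The belief-update-act loop described in this claim can be illustrated with a toy example. The sketch below assumes a Beta-Bernoulli belief over whether each tool call succeeds and picks the tool with the highest expected utility. The class `ToolBelief`, the tool names, the costs, and the reward are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ToolBelief:
    """Beta(successes, failures) belief that a tool call succeeds."""
    successes: float = 1.0  # prior pseudo-count (uniform prior by default)
    failures: float = 1.0   # prior pseudo-count
    cost: float = 0.0       # hypothetical cost of one call, in utility units

    def mean(self) -> float:
        # Posterior mean of the success probability.
        return self.successes / (self.successes + self.failures)

    def update(self, succeeded: bool) -> None:
        # Conjugate update: add one pseudo-count for the observed outcome.
        if succeeded:
            self.successes += 1.0
        else:
            self.failures += 1.0


def choose_tool(beliefs, reward_on_success=1.0):
    """Pick the tool with the highest expected utility under current beliefs."""
    def expected_utility(name):
        belief = beliefs[name]
        return belief.mean() * reward_on_success - belief.cost
    return max(beliefs, key=expected_utility)


beliefs = {"search": ToolBelief(cost=0.1), "expert": ToolBelief(cost=0.3)}
first_choice = choose_tool(beliefs)   # cheaper tool wins under equal priors
for _ in range(3):
    beliefs["search"].update(succeeded=False)  # three observed failures
second_choice = choose_tool(beliefs)  # belief in "search" has dropped
print(first_choice, second_choice)
```

The choice rule here is purely greedy; an orchestration layer following the paper's position might also account for the value of gathering more information, which this sketch leaves out.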

Many high-value deployments rely on decisions under uncertainty (for example, which tool to call, which expert to consult, or how many resources to invest).

Stated as a motivating observation in the paper; no quantitative data or sample provided.

high positive Position: agentic AI orchestration should be Bayes-consisten... prevalence of decision-under-uncertainty requirements in high-value deployments

LLMs excel at predictive tasks and complex reasoning tasks.

Asserted in the paper's opening motivation; no empirical evaluation or sample reported in the paper itself.

high positive Position: agentic AI orchestration should be Bayes-consisten... LLM performance on predictive and reasoning tasks

Qiushi Engine performed thousands of LLM-mediated reasoning, measurement and revision actions during its investigations (e.g., 3,242 LLM calls, 1,242 tool calls).

Operational logs and activity counts reported in the paper: 145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes, 44 scripts.

high positive End-to-end autonomous scientific discovery on a real optical... scale of automated research activity (counts of LLM calls, tool calls, notes, sc...

Qiushi Engine combines nonlinear research phases, Meta-Trace memory and a dual-layer architecture to maintain adaptive and stable research trajectories across long-horizon investigations.

System architecture and methods section describing nonlinear research phases, Meta-Trace memory, and dual-layer architecture; demonstrated operation across long-horizon tasks in experiments (thousands of LLM and tool calls).

high positive End-to-end autonomous scientific discovery on a real optical... ability to maintain adaptive and stable research trajectories over long-horizon ...

The AI-discovered optical bilinear mechanism suggests a route towards high-speed, energy-efficient optical hardware for pairwise computation.

Interpretive claim based on the structural analogy between the discovered optical bilinear interaction and Transformer attention; conceptual argument provided in the paper rather than measured hardware speed or energy benchmarks.

high positive End-to-end autonomous scientific discovery on a real optical... potential for high-speed, energy-efficient optical hardware (conceptual implicat...

In an open-ended study (145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes and 44 scripts), Qiushi Engine proposes and experimentally validates an optical bilinear interaction, a physical mechanism structurally analogous to a core operation in Transformer attention.

Open-ended experimental study reported in the paper with the listed activity metrics (145.9M tokens, 3,242 LLM calls, etc.); the paper presents experimental measurements as validation of the optical bilinear interaction and draws a structural analogy to the pairwise operation in Transformer attention.

high positive End-to-end autonomous scientific discovery on a real optical... experimental validation of an optical bilinear interaction mechanism

Qiushi Engine autonomously reproduces a published transmission-matrix experiment on a non-original platform.

Experimental reproduction reported in the paper; it describes running the published transmission-matrix experiment with the Qiushi Engine on a different (non-original) optical platform and presents measured results compared against the published experiment.

high positive End-to-end autonomous scientific discovery on a real optical... successful reproduction of a published transmission-matrix experiment (experimen...

Qiushi Discovery Engine is an LLM-based agentic system for end-to-end autonomous scientific discovery on a real optical platform.

Description and implementation of the Qiushi Engine combining LLM-based agentic control with an optical experimental platform; system design and end-to-end experiments reported in the paper (no randomized trial; system demonstration).

high positive End-to-end autonomous scientific discovery on a real optical... existence and operation of an end-to-end autonomous LLM-driven discovery system ...

The practical aim is to help strategic leaders and system designers recognize the configuration at work, notice when it shifts, and judge whether it fits the decision before them.

Stated aim/objective of the paper (normative guidance; conceptual).

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... leaders' capacity to detect configuration, detect shifts, and assess fitness of ...

The framework introduces 'co-adaptability'—the capacity of a configuration to improve as human and non-human participants adjust together—and situates it within 'heterogeneous teaming' where participants may vary by number, substrate, model architecture, capability, speed, memory, and form of participation.

Conceptual/theoretical introduction of new constructs (co-adaptability and heterogeneous teaming) in the paper; definitional rather than empirical.

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... capacity for joint improvement through adaptation between human and AI participa...

The five positions serve as landmarks that help leaders recognize configurations as they layer, drift, or change in a single decision.

Normative/conceptual claim supported by the framework; no empirical validation or sample provided in the excerpt.

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... leaders' ability to recognize shifting decision configurations
