Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Filter: Human-AI Collaboration

The framework specifies five mandatory control points for high-judgment use cases: source grounding and traceability, independent verification and tie-out, contradiction testing, escalation and approval, and audit-trail logging.

Results section listing five control points as mandatory design elements for high-judgment accounting use cases; conceptual recommendation from synthesis.

high positive Collaborative Intelligence in Accounting: A Human + AI Compl... governance_and_regulation

The paper develops the C³ Framework—Complementarity, Controls, and Competencies—which maps accounting tasks by task structure and judgment/materiality to recommend collaboration modes.

Results section: conceptual framework developed by the authors based on synthesized literature and guidance; no reported empirical validation in the abstract.

high positive Collaborative Intelligence in Accounting: A Human + AI Compl... task_allocation

AI accelerates drafting, summarization, and pattern detection in accounting while professionals remain accountable for judgment, materiality, and defensibility in financial reporting and analysis.

Statement in paper summarizing literature and practitioner guidance (2023–2025); conceptual synthesis rather than new empirical data.

high positive Collaborative Intelligence in Accounting: A Human + AI Compl... task_completion_time

AI tools can serve as valuable aids in task splitting, provided there is human oversight to filter out irrelevant tasks.

Paper's conclusion synthesizing experimental results and participant feedback, recommending human-in-the-loop oversight when using AI for task-splitting.

high positive Splitting User Stories Into Tasks with AI -- A Foe or an All... effectiveness of AI-assisted task-splitting under human oversight

Participants favored a hybrid approach, combining AI tools with conventional methods to maintain high accuracy in planning.

Participant preferences and qualitative feedback reported from the controlled experiment indicating preference for combining AI assistance with human methods; sample size not provided.

high positive Splitting User Stories Into Tasks with AI -- A Foe or an All... participant preference for planning approach / planning accuracy

AI-assisted approaches can help ensure no important tasks are overlooked during task-splitting.

Reported finding from the experiment indicating AI assistance reduced omissions in task lists (paper statement based on experiment and participant observations); sample size not stated.

high positive Splitting User Stories Into Tasks with AI -- A Foe or an All... task omission rate / completeness of task lists

AI-assisted approaches can generate more granular task lists than traditional methods.

Experimental comparison reported in the paper showing AI-generated task lists were more granular (based on task lists produced during the controlled experiment); sample size not provided in summary.

high positive Splitting User Stories Into Tasks with AI -- A Foe or an All... task list granularity

The Analysis Contract framework generalizes across domains of vibe inference through domain-specific instantiation.

Theoretical claim and conceptual generalization proposed in the paper; no cross-domain empirical tests or case studies reported.

high positive Vibe Econometrics and the Analysis Contract applicability/generalizability of the Analysis Contract across domains

The Analysis Contract, a proposed pre-commitment framework, can adapt the logic of pre-analysis plans and the Causal Roadmap to the AI-assisted setting by imposing three conditions before a causal claim is made: a method-data contract, a data audit, and a pre-commitment statement defining what would count as a disconfirming result.

Proposed methodological/framework contribution in the paper; described and motivated conceptually, without empirical validation or implementation evidence.

high positive Vibe Econometrics and the Analysis Contract governance of AI-assisted causal claims / credibility of causal claims under AI ...
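
The three pre-commitment conditions named in the claim above can be pictured as a simple gate that must be filled in before a causal claim is reported. The following is a minimal sketch under assumptions: the class name, field names, and the "non-empty text" check are hypothetical illustrations, not the paper's specification.

```python
from dataclasses import dataclass


@dataclass
class AnalysisContract:
    """Hypothetical record of the three conditions listed in the claim."""
    method_data_contract: str = ""   # why the chosen method suits the data
    data_audit: str = ""             # what was checked in the data
    disconfirming_result: str = ""   # what outcome would count against the claim

    def missing(self):
        # Assumption: a condition counts as met when its text is non-empty.
        return [name for name, value in vars(self).items() if not value.strip()]

    def ready_for_causal_claim(self):
        return not self.missing()


draft = AnalysisContract(method_data_contract="DiD; parallel trends plotted")
print(draft.ready_for_causal_claim())  # False
print(draft.missing())                 # ['data_audit', 'disconfirming_result']
```

A real instantiation would presumably need richer checks than non-empty strings; the sketch only shows the pre-commitment ordering, in which the claim is blocked until all three items exist.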

Users maintain a moderate level of trust in AI even when their decisions diverge from those of AI.

Reported descriptive/analytic finding from the experiment with 59 pre-service teachers indicating measured trust remained at a moderate level in inconsistent decision conditions.

high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... trust in AI under decision divergence

The proportion of consistent decisions significantly moderates the impact of AI-assisted decision-making paradigms on users' confidence levels.

Moderation analysis reported in the study (N=59); authors indicate that proportion of consistent human-AI decisions significantly moderates the effect of AI-assisted decision-making paradigm on confidence.

high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... users' confidence (moderation effect)

Consistency between human and AI decisions significantly enhances task performance.

Within-subject consistency manipulation in the experimental sample of 59 pre-service teachers; authors report significant positive association between proportion of consistent decisions and measured task performance.

high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... task performance

Consistency between human and AI decisions significantly enhances users' confidence.

Within-subject manipulation of human-AI consistency in the study (N=59); authors report a significant positive effect of consistency on users' confidence in the measured models.

high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... users' confidence

Consistency between human and AI decisions significantly enhances users' trust in AI.

Within-subject manipulation of human-AI consistency in the experiment with 59 pre-service teachers; authors report a significant positive effect of consistency on trust measured and tested in their models.

high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... trust in AI

When human-AI decision consistency is taken into account, AI-assisted decision-making paradigms influence task performance indirectly through a sequential psychological pathway involving users’ confidence and their trust in the AI.

Same experimental sample (N=59); structural equation modeling showed a significant indirect (mediated) pathway from AI-assisted paradigms → users' confidence → trust in AI → task performance, with moderation by human-AI consistency taken into account.

high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... task performance (mediated effect)

Post-hoc SHAP attribution reveals that complaint recurrence and neighborhood-level statistics are stronger predictors of actionable violations than raw complaint volume.

Empirical claim based on post-hoc SHAP feature-attribution analysis applied to the paper's models; the excerpt reports a relative feature importance finding but provides no numeric effect sizes or sample counts.

high positive Scaling the Queue: Reinforcement Learning for Equitable Call... predictive importance for actionable violations (feature importance)

We formalize each domain as a Markov Decision Process (MDP) in which equitable classification coverage is a first-class reward objective.

Methodological specification in the paper asserting each operational domain was modeled as an MDP with equity-aware reward structure. No further empirical details in the excerpt.

high positive Scaling the Queue: Reinforcement Learning for Equitable Call... equitable classification coverage (as a modeled reward)

The proposed technique is designed to maximize throughput, minimize misclassification cost, and actively narrow historical equity gaps in service delivery.

Stated design objectives of the RL approach in the paper. No quantified outcomes or evaluation reported in the provided text.

high positive Scaling the Queue: Reinforcement Learning for Equitable Call... throughput; misclassification cost; historical equity gaps in service delivery

Rather than replacing human classifiers, our agents act as intelligent intake routers that learn to assign incoming complaints to action categories: escalate, batch, defer, inspect now.

Descriptive claim of agent behavior and intended design; asserts agents perform routing decisions into four action categories. No empirical performance numbers provided in the excerpt.

high positive Scaling the Queue: Reinforcement Learning for Equitable Call... complaint routing action assignment

We develop an equity-centered reinforcement learning (RL) framework that augments call classification capacity across six New York City Department of Buildings operational domains (boiler safety, crane and derrick oversight, heat and hot water, housing complaint triage, scaffold safety, and Natural Area District protection).

Methodological development described in the paper; claimed application domain spans six named DOB operational areas. No evaluation metrics or sample sizes provided in the excerpt.

high positive Scaling the Queue: Reinforcement Learning for Equitable Call... call classification capacity / intake routing capability

Design principle: effective AI assistance should clear a quality threshold suited to the target content, rather than simply be present.

Authors' proposed design principle based on empirical and qualitative results from their study.

high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... design guidance for AI assistance effectiveness

Qualitative findings suggest the required quality threshold for helpful AI drafts is content-dependent; as visual complexity increases, the quality needed from AI drafts increases.

Authors' qualitative analysis from the study (no numeric measures provided in the excerpt).

high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... relationship between visual complexity and required AI draft quality

There is a minimum quality threshold for AI drafts to be effective; simple presence of AI assistance is insufficient.

Synthesis of empirical results and comparisons between GenAD and baseline drafts reported by the authors (stated as an interpretation of the findings).

high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... effectiveness of AI assistance (dependent on draft quality)

Baseline drafts generated from simple, unguided prompts offered only modest benefits compared to authoring from scratch.

Empirical comparison reported in the within-subjects study contrasting GenAD drafts and baseline (unguided-prompt) drafts; no numeric effect sizes or sample sizes provided in the excerpt.

high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... benefit/effectiveness of baseline AI drafts (e.g., quality or efficiency gains)

GenAD drafts significantly reduced cognitive load.

Result reported from the within-subjects study (authors state a significant reduction in cognitive load when using GenAD drafts); specific measure, statistical values, and sample size not provided in the excerpt.

high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... cognitive load

GenAD drafts cut completion time by more than half.

Result reported from the within-subjects study comparing completion time when using GenAD drafts versus authoring from scratch; exact sample size and numeric reduction not provided in the excerpt.

high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... completion time

Recent work has shown that giving novice describers an AI-generated draft to start from helps produce higher-quality audio description (AD) and lowers the barrier to entry.

Statement refers to prior published work (no specific study, sample size, or citation provided in the excerpt).

high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... AD quality / barrier to entry for novice describers

Olava Extract reduced inference cost by 78% to 97% compared with the frontier models tested.

Reported cost comparison (inference cost) versus the five frontier models evaluated in the study; percentage reductions presented in the paper.

high positive A Few Good Clauses: Comparing LLMs vs Domain-Trained Small L... inference cost

Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842.

Reported evaluation results comparing Olava Extract to five frontier models on structured contract extraction; explicit macro and micro F1 scores presented in the paper.

high positive A Few Good Clauses: Comparing LLMs vs Domain-Trained Small L... F1 score (macro and micro)
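
For readers comparing the two scores reported above, macro F1 averages per-class F1 values equally, while micro F1 pools all true positives, false positives, and false negatives before computing one score. The sketch below illustrates the difference on toy single-label data; the labels and function names are hypothetical, and the paper's extraction scoring may be set up differently.

```python
from collections import Counter


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred):
    """gold, pred: equal-length lists of labels (toy single-label case)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label gains a false positive
            fn[g] += 1   # gold label gains a false negative
    # Macro: unweighted mean of per-label F1.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: F1 over the pooled counts.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


macro, micro = macro_micro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(round(macro, 3), round(micro, 3))
```

A micro score above the macro score, as in the reported 0.842 versus 0.812, is what one would typically see when rarer classes are extracted less accurately than common ones.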

Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models.

Reported intervention experiments where authors used ASR diagnostics to refine prompts and add deterministic routing guards, observing TSR improvements up to +93.8 percentage points.

high positive Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... Task Success Rate (TSR)

GPT-5.2 achieves perfect ASR.

Model-level evaluation reported in the paper indicating GPT-5.2 attained perfect ASR under the HMASP tests.

high positive Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... Agentic Success Rate (ASR)

We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision.

Methodological contribution described in the paper (definition of a new metric and its components).

high positive Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... trajectory fidelity / transition-level execution accuracy
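
One way to read "comparing observed and expected execution sequences at the transition level" is to treat each consecutive pair of steps as a transition and score the overlap. The sketch below does that with multiset matching. It is an assumption about the general idea, not the paper's definition: the step names are hypothetical, and how the paper combines the two components into ASR is not stated in this excerpt, so no combined score is computed.

```python
from collections import Counter


def transitions(seq):
    # Consecutive (from_step, to_step) pairs, counted as a multiset.
    return Counter(zip(seq, seq[1:]))


def transition_scores(expected, observed):
    exp, obs = transitions(expected), transitions(observed)
    matched = sum((exp & obs).values())  # multiset intersection
    # Assumption: scores default to 1.0 when there is nothing to match.
    recall = matched / sum(exp.values()) if exp else 1.0
    precision = matched / sum(obs.values()) if obs else 1.0
    return recall, precision


expected = ["intake", "validate", "authorize", "settle"]
observed = ["intake", "authorize", "settle"]  # skipped the validate step
recall, precision = transition_scores(expected, observed)
print(round(recall, 3), round(precision, 3))
```

In this toy run the final outcome (settle) is reached, yet only one of three expected transitions occurs, which is the kind of gap between task success and workflow fidelity the metric is meant to expose.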

LLM-based multi-agent systems are increasingly deployed for payment workflows.

Statement in the paper's introduction/abstract framing; no empirical deployment data or sample size provided.

high positive Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... deployment/adoption of LLM-based multi-agent systems for payment workflows

DePAI offers a path to scalable, resilient self-organization that integrates physical infrastructure, AI, and community ownership under transparent rules, on-chain incentives, and permissionless participation, aiming to preserve human autonomy.

Normative/conceptual claim and argument based on the proposed architecture and incentive design; presented without empirical evaluation.

high positive DAO-enabled decentralized physical AI: A new paradigm for hu... scalability and resilience of self-organization, integration of infrastructure/A...

These elements specify workflows that couple machine execution with human oversight, enabling enhanced self-organization of techno-socio-economic systems, which we call DePAI.

Theoretical workflow specification and argumentation in the paper; no reported experimental or observational validation.

high positive DAO-enabled decentralized physical AI: A new paradigm for hu... workflows coupling machine execution with human oversight and resulting self-org...

We connect DAO design with digital-democracy research on deliberation and voting, showing how each can advance the other.

Conceptual linkage and theoretical argumentation drawing on literature from DAO design and digital-democracy research; no empirical test or sample described.

high positive DAO-enabled decentralized physical AI: A new paradigm for hu... mutual advancement of DAO design and digital-democracy practices (deliberation a...

We synthesize foundations in blockchains, decentralized autonomous organizations (DAOs), and cryptoeconomics.

Literature synthesis and conceptual review within the paper; no empirical sample or experimental method reported.

high positive DAO-enabled decentralized physical AI: A new paradigm for hu... coverage/synthesis of foundational literature on blockchains, DAOs, and cryptoec...

We propose DAO-enabled decentralized physical AI (DePAI), a democratic architecture for coordinating humans and autonomous machines in the operation and governance of physical-digital systems.

Conceptual proposal and architectural synthesis presented in the paper (theory/design contribution). No empirical evaluation or sample reported.

high positive DAO-enabled decentralized physical AI: A new paradigm for hu... coordination of humans and autonomous machines in operation and governance of ph...

Human performance on the benchmark is 80.7%.

Human baseline reported in the paper (same evaluation/rubrics as agents).

high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... benchmark score (human performance)

We provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%.

Description of a reduced-size benchmark split (100 tasks) and reported cost reduction (~70%) in the paper.

high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... evaluation cost (and distributional fidelity of the subset)
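
The excerpt does not say how the 100-task subset was drawn. One common way to build a smaller split that preserves a distribution is proportional stratified sampling, sketched below under assumptions: the grouping key, function name, and toy task tuples are hypothetical, and the authors may have used a different procedure.

```python
import random
from collections import defaultdict


def stratified_subset(tasks, key, k, seed=0):
    """Pick k tasks so each group's share matches its share in `tasks`."""
    groups = defaultdict(list)
    for t in tasks:
        groups[key(t)].append(t)
    n = len(tasks)
    quotas = {g: k * len(m) / n for g, m in groups.items()}
    counts = {g: int(q) for g, q in quotas.items()}  # floor of each quota
    # Hand out the remaining slots to the largest fractional remainders.
    leftover = k - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    rng = random.Random(seed)
    subset = []
    for g, members in groups.items():
        subset.extend(rng.sample(members, counts[g]))
    return subset


# Toy data: 60 tasks for profile "a", 40 for profile "b".
tasks = [("a", i) for i in range(60)] + [("b", i) for i in range(40)]
lite = stratified_subset(tasks, key=lambda t: t[0], k=10)
print(len(lite), sum(1 for t in lite if t[0] == "a"))
```

With several stratification dimensions (for example worker profile and task difficulty), the key function would return a tuple, at the cost of smaller and noisier groups.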

Workspace-Bench includes files up to 20GB in size.

Dataset description in the paper specifying maximum file size.

high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... maximum file size in the benchmark

We construct Workspace-Bench with 5 worker profiles, 74 file types, 20,476 files (up to 20GB), 388 tasks, and 7,399 total rubrics, each task associated with its own file dependency graph.

Dataset construction described in the paper; counts and sizes reported by authors.

high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... benchmark size and heterogeneity (worker profiles, file types, file count, task ...

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively.

Conceptual definition and motivation presented by the authors in the paper (no empirical test reported).

high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... ability of AI agents to use file dependencies to complete tasks

A six-factor confirmatory factor analysis confirmed the measurement model used for perceived intimacy, perceived responsiveness, trust, purchase intention, hedonic motivation, and utilitarian motivation.

Reported six-factor CFA on the collected survey measures from the experimental sample (N = 439).

high positive Conditional trust pathways in live-streaming commerce: how c... measurement validity (factor structure)

The mediating effect of perceived intimacy on the anchor type → trust relationship was stronger when participants' hedonic motivation was moderate to high.

Moderated mediation analyses including hedonic motivation as a moderator on the intimacy→trust pathway; experimental sample N = 439; reported conditional mediation (stronger at moderate-to-high hedonic motivation).

high positive Conditional trust pathways in live-streaming commerce: how c... trust (conditional mediation by perceived intimacy moderated by hedonic motivati...

Perceived intimacy (a relational cue) mediated the effect of anchor type on trust.

Moderated mediation analyses (with heteroscedasticity-consistent standard errors) based on experimental manipulation of anchor type in sample N = 439; CFA supported measurement of the mediator and outcome constructs; mediation path reported for perceived intimacy.

high positive Conditional trust pathways in live-streaming commerce: how c... trust (mediated by perceived intimacy)

Human anchors generated higher purchase intention overall than AI anchors.

Between-subjects randomized experiment (N = 439) with a purchase-intention measure administered after exposure to either a human or AI anchor; reported higher purchase intention in the human-anchor condition.

high positive Conditional trust pathways in live-streaming commerce: how c... purchase intention

Human anchors generated higher trust overall than AI anchors.

Between-subjects randomized experiment comparing human vs AI live-streaming anchors; participants (N = 439) watched one video and completed a trust measure; analysis reported higher trust for human-anchor condition.

high positive Conditional trust pathways in live-streaming commerce: how c... trust

Conceptually, AI is positioned not as an automated controller but as an intelligence-augmenting co-regulator that supports learners' capacity to coordinate effort, attention, and understanding together.

Authors' conceptual framing and interpretation of empirical results showing that adaptive and proactive AI feedback supports shared regulation processes during dyadic programming tasks.

high positive Cognitive Alignment Drives Attention: Modeling and Supportin... conceptual role of AI in co-regulation of learning

Proactive, forecast-based feedback using machine-learning predictions of future collaboration states (Study 3) further enhances performance and sustains shared regulation by anticipating breakdowns before they manifest.

Study 3 intervention using ML-based forecasts of future collaboration states to provide proactive support; reported improvements relative to reactive/single-channel conditions (per-study sample sizes and ML model metrics not provided in abstract).

high positive Cognitive Alignment Drives Attention: Modeling and Supportin... collaboration performance and sustained shared regulation (reduction/prevention ...
