Evidence (6491 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
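As a reading aid, the directional balance of a row can be summarized as the share of positive findings. The sketch below uses three rows copied from the matrix above. It computes the share over the four listed direction columns only, since those columns do not always sum to the Total column; that choice is an assumption of this sketch, not something the source specifies.

```python
# Counts copied from the Evidence Matrix above: (positive, negative, mixed, null).
rows = {
    "Firm Productivity": (435, 57, 88, 20),
    "Job Displacement": (12, 80, 20, 1),
    "Error Rate": (69, 92, 10, 2),
}


def positive_share(counts):
    """Share of positive findings among the four listed directions."""
    positive = counts[0]
    return positive / sum(counts)


for outcome, counts in rows.items():
    print(f"{outcome}: {positive_share(counts):.1%} positive")
```

Dividing by the Total column instead would give slightly lower shares for rows where the four directions do not account for every claim.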
Human-AI Collaboration
The framework specifies five mandatory control points for high-judgment use cases: source grounding and traceability, independent verification and tie-out, contradiction testing, escalation and approval, and audit-trail logging.
Results section listing five control points as mandatory design elements for high-judgment accounting use cases; conceptual recommendation from synthesis.
The paper develops the C³ Framework—Complementarity, Controls, and Competencies—which maps accounting tasks by task structure and judgment/materiality to recommend collaboration modes.
Results section: conceptual framework developed by the authors based on synthesized literature and guidance; no reported empirical validation in the abstract.
AI accelerates drafting, summarization, and pattern detection in accounting while professionals remain accountable for judgment, materiality, and defensibility in financial reporting and analysis.
Statement in paper summarizing literature and practitioner guidance (2023–2025); conceptual synthesis rather than new empirical data.
AI tools can serve as valuable aids in task splitting, provided there is human oversight to filter out irrelevant tasks.
Paper's conclusion synthesizing experimental results and participant feedback, recommending human-in-the-loop oversight when using AI for task-splitting.
Participants favored a hybrid approach, combining AI tools with conventional methods to maintain high accuracy in planning.
Participant preferences and qualitative feedback reported from the controlled experiment indicating preference for combining AI assistance with human methods; sample size not provided.
AI-assisted approaches can help ensure no important tasks are overlooked during task-splitting.
Reported finding from the experiment indicating AI assistance reduced omissions in task lists (paper statement based on experiment and participant observations); sample size not stated.
AI-assisted approaches can generate more granular task lists than traditional methods.
Experimental comparison reported in the paper showing AI-generated task lists were more granular (based on task lists produced during the controlled experiment); sample size not provided in summary.
The Analysis Contract framework generalizes across domains of vibe inference through domain-specific instantiation.
Theoretical claim and conceptual generalization proposed in the paper; no cross-domain empirical tests or case studies reported.
The Analysis Contract, a proposed pre-commitment framework, can adapt the logic of pre-analysis plans and the Causal Roadmap to the AI-assisted setting by imposing three conditions before a causal claim is made: a method-data contract, a data audit, and a pre-commitment statement defining what would count as a disconfirming result.
Proposed methodological/framework contribution in the paper; described and motivated conceptually, without empirical validation or implementation evidence.
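The three conditions of the Analysis Contract can be read as a gate that must be fully populated before a causal claim is made. The following is a minimal sketch of that reading; the class name, field names, and the `permits_causal_claim` check are hypothetical illustrations, not an interface described in the paper.

```python
from dataclasses import dataclass


@dataclass
class AnalysisContract:
    """Hypothetical container for the three pre-commitment conditions."""
    method_data_contract: str   # which method is applied to which data
    data_audit: str             # summary of the audit performed on the data
    disconfirming_result: str   # what result would count against the claim

    def missing(self):
        """Names of conditions that are still empty."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def permits_causal_claim(self):
        """True only when all three conditions have been filled in."""
        return not self.missing()


# Example values are invented for illustration.
contract = AnalysisContract(
    method_data_contract="difference-in-differences on the 2020-2023 panel",
    data_audit="checked missingness and duplicate records",
    disconfirming_result="",
)
print(contract.permits_causal_claim(), contract.missing())
```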
Users maintain a moderate level of trust in AI even when their decisions diverge from those of AI.
Reported descriptive/analytic finding from the experiment with 59 pre-service teachers indicating measured trust remained at a moderate level in inconsistent decision conditions.
The proportion of consistent decisions significantly moderates the impact of AI-assisted decision-making paradigms on users' confidence levels.
Moderation analysis reported in the study (N=59); authors indicate that the proportion of consistent human-AI decisions significantly moderates the effect of the AI-assisted decision-making paradigm on confidence.
Consistency between human and AI decisions significantly enhances task performance.
Within-subject consistency manipulation in the experimental sample of 59 pre-service teachers; authors report significant positive association between proportion of consistent decisions and measured task performance.
Consistency between human and AI decisions significantly enhances users' confidence.
Within-subject manipulation of human-AI consistency in the study (N=59); authors report a significant positive effect of consistency on users' confidence in the measured models.
Consistency between human and AI decisions significantly enhances users' trust in AI.
Within-subject manipulation of human-AI consistency in the experiment with 59 pre-service teachers; authors report a significant positive effect of consistency on trust measured and tested in their models.
When human-AI decision consistency is taken into account, AI-assisted decision-making paradigms influence task performance indirectly through a sequential psychological pathway involving users’ confidence and their trust in the AI.
Same experimental sample (N=59); structural equation modeling showed a significant indirect (mediated) pathway from AI-assisted paradigms → users' confidence → trust in AI → task performance, with moderation by human-AI consistency taken into account.
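For readers unfamiliar with sequential mediation, the indirect effect along such a pathway is conventionally the product of the path coefficients, assuming linear relationships. The symbols below are labels introduced here for illustration; the excerpt does not report the coefficients or the authors' notation.

```latex
% a: paradigm -> confidence,  d: confidence -> trust,  b: trust -> task performance
\text{indirect effect} = a \cdot d \cdot b
```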
Post-hoc SHAP attribution reveals that complaint recurrence and neighborhood-level statistics are stronger predictors of actionable violations than raw complaint volume.
Empirical claim based on post-hoc SHAP feature-attribution analysis applied to the paper's models; the excerpt reports a relative feature importance finding but provides no numeric effect sizes or sample counts.
We formalize each domain as a Markov Decision Process (MDP) in which equitable classification coverage is a first-class reward objective.
Methodological specification in the paper asserting each operational domain was modeled as an MDP with equity-aware reward structure. No further empirical details in the excerpt.
The proposed technique is designed to maximize throughput, minimize misclassification cost, and actively narrow historical equity gaps in service delivery.
Stated design objectives of the RL approach in the paper. No quantified outcomes or evaluation reported in the provided text.
Rather than replacing human classifiers, our agents act as intelligent intake routers that learn to assign incoming complaints to action categories: escalate, batch, defer, inspect now.
Descriptive claim of agent behavior and intended design; asserts agents perform routing decisions into four action categories. No empirical performance numbers provided in the excerpt.
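The design objectives and action categories described above can be illustrated with a toy per-step reward. This is a minimal sketch under stated assumptions: the weights, the `StepOutcome` fields, and the max-minus-min definition of the equity gap are hypothetical, since the excerpt gives no formula for the paper's reward.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The four routing actions named in the claim above."""
    ESCALATE = "escalate"
    BATCH = "batch"
    DEFER = "defer"
    INSPECT_NOW = "inspect now"


@dataclass
class StepOutcome:
    """Hypothetical summary of what happened after one routing decision."""
    action: Action
    handled: int                    # complaints processed (throughput proxy)
    misclassification_cost: float   # cost of routing errors in this step
    coverage_by_group: dict         # group name -> fraction of complaints covered


def equity_gap(coverage_by_group):
    """Gap between the best- and worst-covered groups (assumed definition)."""
    values = list(coverage_by_group.values())
    return max(values) - min(values)


def reward(outcome, w_throughput=1.0, w_cost=1.0, w_equity=5.0):
    """Reward throughput, penalize misclassification cost and coverage gaps."""
    return (
        w_throughput * outcome.handled
        - w_cost * outcome.misclassification_cost
        - w_equity * equity_gap(outcome.coverage_by_group)
    )


even = StepOutcome(Action.BATCH, 10, 2.0, {"district_a": 0.8, "district_b": 0.8})
uneven = StepOutcome(Action.BATCH, 10, 2.0, {"district_a": 0.9, "district_b": 0.5})
print(reward(even), reward(uneven))
```

Placing the equity term inside the reward, rather than checking it after training, is what makes coverage a first-class objective in this reading.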
We develop an equity-centered reinforcement learning (RL) framework that augments call classification capacity across six New York City Department of Buildings operational domains (boiler safety, crane and derrick oversight, heat and hot water, housing complaint triage, scaffold safety, and Natural Area District protection).
Methodological development described in the paper; claimed application domain spans six named DOB operational areas. No evaluation metrics or sample sizes provided in the excerpt.
Design principle: effective AI assistance should clear a quality threshold suited to the target content, rather than simply be present.
Authors' proposed design principle based on empirical and qualitative results from their study.
Qualitative findings suggest the required quality threshold for helpful AI drafts is content-dependent; as visual complexity increases, the quality needed from AI drafts increases.
Authors' qualitative analysis from the study (no numeric measures provided in the excerpt).
There is a minimum quality threshold for AI drafts to be effective; simple presence of AI assistance is insufficient.
Synthesis of empirical results and comparisons between GenAD and baseline drafts reported by the authors (stated as an interpretation of the findings).
Baseline drafts generated from simple, unguided prompts offered only modest benefits compared to authoring from scratch.
Empirical comparison reported in the within-subjects study contrasting GenAD drafts and baseline (unguided-prompt) drafts; no numeric effect sizes or sample sizes provided in the excerpt.
GenAD drafts significantly reduced cognitive load.
Result reported from the within-subjects study (authors state a significant reduction in cognitive load when using GenAD drafts); specific measure, statistical values, and sample size not provided in the excerpt.
GenAD drafts cut completion time by more than half.
Result reported from the within-subjects study comparing completion time when using GenAD drafts versus authoring from scratch; exact sample size and numeric reduction not provided in the excerpt.
Recent work has shown that giving novice describers an AI-generated draft to start from helps produce higher-quality audio description (AD) and lowers the barrier to entry.
Statement refers to prior published work (no specific study, sample size, or citation provided in the excerpt).
Olava Extract reduced inference cost by 78% to 97% compared with the frontier models tested.
Reported cost comparison (inference cost) versus the five frontier models evaluated in the study; percentage reductions presented in the paper.
Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842.
Reported evaluation results comparing Olava Extract to five frontier models on structured contract extraction; explicit macro and micro F1 scores presented in the paper.
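The gap between the macro and micro F1 figures reflects how the two averages weight classes. The sketch below shows the standard definitions on made-up counts; the field names and numbers are hypothetical and are not taken from the study.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-field counts: (true positives, false positives, false negatives).
counts = {
    "party": (90, 5, 5),
    "term": (8, 4, 8),
    "governing_law": (45, 5, 5),
}

# Macro F1: average the per-field scores, so every field counts equally.
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro F1: pool the counts first, so frequent fields dominate.
tp, fp, fn = (sum(column) for column in zip(*counts.values()))
micro = f1(tp, fp, fn)

print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

A micro score above the macro score, as in this toy data, typically means that rarer classes are extracted less accurately than common ones.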
Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models.
Reported intervention experiments where authors used ASR diagnostics to refine prompts and add deterministic routing guards, observing TSR improvements up to +93.8 percentage points.
GPT-5.2 achieves perfect ASR.
Model-level evaluation reported in the paper indicating GPT-5.2 attained perfect ASR under the HMASP tests.
We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision.
Methodological contribution described in the paper (definition of a new metric and its components).
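One plausible way to compute transition-level recall and precision is to compare the consecutive step pairs of the observed and expected sequences. The sketch below assumes that reading; the multiset matching rule and the agent names are illustrative assumptions, and the paper's exact definition of ASR is not given in the excerpt.

```python
from collections import Counter


def transitions(sequence):
    """Consecutive (from, to) pairs of an execution sequence, as a multiset."""
    return Counter(zip(sequence, sequence[1:]))


def transition_scores(observed, expected):
    """Return (recall, precision) over transitions."""
    obs, exp = transitions(observed), transitions(expected)
    matched = sum((obs & exp).values())  # multiset intersection
    recall = matched / sum(exp.values()) if exp else 1.0
    precision = matched / sum(obs.values()) if obs else 1.0
    return recall, precision


# Hypothetical payment workflow with one unexpected retry loop.
expected = ["intake", "validate", "authorize", "settle"]
observed = ["intake", "validate", "retry", "validate", "authorize", "settle"]

recall, precision = transition_scores(observed, expected)
print(f"transition recall = {recall:.2f}, transition precision = {precision:.2f}")
```

Here every expected transition occurs, so recall is perfect, while the two extra transitions through the retry step lower precision.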
LLM-based multi-agent systems are increasingly deployed for payment workflows.
Statement in the paper's introduction/abstract framing; no empirical deployment data or sample size provided.
DePAI offers a path to scalable, resilient self-organization that integrates physical infrastructure, AI, and community ownership under transparent rules, on-chain incentives, and permissionless participation, aiming to preserve human autonomy.
Normative/conceptual claim and argument based on the proposed architecture and incentive design; presented without empirical evaluation.
These elements specify workflows that couple machine execution with human oversight, enabling enhanced self-organization of techno-socio-economic systems, which we call DePAI.
Theoretical workflow specification and argumentation in the paper; no reported experimental or observational validation.
We connect DAO design with digital-democracy research on deliberation and voting, showing how each can advance the other.
Conceptual linkage and theoretical argumentation drawing on literature from DAO design and digital-democracy research; no empirical test or sample described.
We synthesize foundations in blockchains, decentralized autonomous organizations (DAOs), and cryptoeconomics.
Literature synthesis and conceptual review within the paper; no empirical sample or experimental method reported.
We propose DAO-enabled decentralized physical AI (DePAI), a democratic architecture for coordinating humans and autonomous machines in the operation and governance of physical-digital systems.
Conceptual proposal and architectural synthesis presented in the paper (theory/design contribution). No empirical evaluation or sample reported.
Human performance on the benchmark is 80.7%.
Human baseline reported in the paper (same evaluation/rubrics as agents).
We provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%.
Description of a reduced-size benchmark split (100 tasks) and reported cost reduction (~70%) in the paper.
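One common way to draw a smaller split that preserves a distribution is proportional stratified sampling. The sketch below illustrates that idea only; the paper does not state how Workspace-Bench-Lite was selected, and the task records, profile labels, and largest-remainder allocation are assumptions made here.

```python
import random
from collections import defaultdict


def stratified_subset(tasks, key, size, seed=0):
    """Sample `size` tasks, allocating slots to groups in proportion to their size."""
    groups = defaultdict(list)
    for task in tasks:
        groups[key(task)].append(task)

    # Largest-remainder allocation so the slots sum exactly to `size`.
    quotas = {g: len(members) * size / len(tasks) for g, members in groups.items()}
    slots = {g: int(q) for g, q in quotas.items()}
    leftover = size - sum(slots.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - slots[g], reverse=True)
    for g in by_remainder[:leftover]:
        slots[g] += 1

    rng = random.Random(seed)
    subset = []
    for g, members in groups.items():
        subset.extend(rng.sample(members, slots[g]))
    return subset


# Hypothetical stand-in for 388 tasks spread over 5 worker profiles.
tasks = [(task_id, f"profile_{task_id % 5}") for task_id in range(388)]
lite = stratified_subset(tasks, key=lambda task: task[1], size=100)
print(len(lite))
```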
Workspace-Bench includes files up to 20GB in size.
Dataset description in the paper specifying maximum file size.
We construct Workspace-Bench with 5 worker profiles, 74 file types, 20,476 files (up to 20GB), 388 tasks, and 7,399 total rubrics, each task associated with its own file dependency graph.
Dataset construction described in the paper; counts and sizes reported by authors.
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively.
Conceptual definition and motivation presented by the authors in the paper (no empirical test reported).
A six-factor confirmatory factor analysis confirmed the measurement model used for perceived intimacy, perceived responsiveness, trust, purchase intention, hedonic motivation, and utilitarian motivation.
Reported six-factor CFA on the collected survey measures from the experimental sample (N = 439).
The mediating effect of perceived intimacy on the anchor type → trust relationship was stronger when participants' hedonic motivation was moderate to high.
Moderated mediation analyses including hedonic motivation as a moderator on the intimacy→trust pathway; experimental sample N = 439; reported conditional mediation (stronger at moderate-to-high hedonic motivation).
Perceived intimacy (a relational cue) mediated the effect of anchor type on trust.
Moderated mediation analyses (with heteroscedasticity-consistent standard errors) based on experimental manipulation of anchor type in sample N = 439; CFA supported measurement of the mediator and outcome constructs; mediation path reported for perceived intimacy.
Human anchors generated higher purchase intention overall than AI anchors.
Between-subjects randomized experiment (N = 439) with a purchase-intention measure administered after exposure to either a human or AI anchor; reported higher purchase intention in the human-anchor condition.
Human anchors generated higher trust overall than AI anchors.
Between-subjects randomized experiment comparing human vs AI live-streaming anchors; participants (N = 439) watched one video and completed a trust measure; analysis reported higher trust for human-anchor condition.
Conceptually, AI is positioned not as an automated controller but as an intelligence-augmenting co-regulator that supports learners' capacity to coordinate effort, attention, and understanding together.
Authors' conceptual framing and interpretation of empirical results showing that adaptive and proactive AI feedback supports shared regulation processes during dyadic programming tasks.
Proactive, forecast-based feedback using machine-learning predictions of future collaboration states (Study 3) further enhances performance and sustains shared regulation by anticipating breakdowns before they manifest.
Study 3 intervention using ML-based forecasts of future collaboration states to provide proactive support; reported improvements relative to reactive/single-channel conditions (per-study sample sizes and ML model metrics not provided in abstract).