The Commonplace

Evidence (6491 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome | Positive | Negative | Mixed | Null | Total
Other | 758 | 199 | 100 | 900 | 2007
Governance & Regulation | 826 | 400 | 191 | 122 | 1563
Organizational Efficiency | 777 | 193 | 124 | 84 | 1189
Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098
Research Productivity | 422 | 128 | 57 | 336 | 954
Output Quality | 476 | 179 | 59 | 47 | 761
Decision Quality | 328 | 177 | 81 | 47 | 640
Firm Productivity | 435 | 57 | 88 | 20 | 606
AI Safety & Ethics | 218 | 277 | 65 | 33 | 599
Market Structure | 180 | 170 | 123 | 24 | 502
Task Allocation | 213 | 64 | 72 | 33 | 387
Skill Acquisition | 170 | 61 | 61 | 17 | 309
Innovation Output | 203 | 27 | 43 | 18 | 292
Employment Level | 105 | 54 | 107 | 13 | 281
Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276
Consumer Welfare | 117 | 63 | 42 | 11 | 233
Firm Revenue | 153 | 48 | 26 | 3 | 230
Task Completion Time | 173 | 31 | 8 | 12 | 225
Inequality Measures | 44 | 122 | 49 | 6 | 221
Worker Satisfaction | 89 | 65 | 22 | 12 | 188
Error Rate | 69 | 92 | 10 | 2 | 173
Regulatory Compliance | 77 | 69 | 14 | 5 | 165
Automation Exposure | 56 | 56 | 26 | 13 | 154
Training Effectiveness | 94 | 21 | 13 | 19 | 149
Wages & Compensation | 77 | 36 | 25 | 6 | 144
Team Performance | 86 | 17 | 27 | 10 | 141
Developer Productivity | 95 | 17 | 14 | 6 | 133
Job Displacement | 12 | 80 | 20 | 1 | 113
Hiring & Recruitment | 52 | 7 | 8 | 3 | 70
Creative Output | 31 | 18 | 8 | 3 | 61
Skill Obsolescence | 5 | 46 | 6 | 1 | 58
Social Protection | 27 | 16 | 8 | 2 | 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Filter: Human-AI Collaboration
The framework specifies five mandatory control points for high-judgment use cases: source grounding and traceability, independent verification and tie-out, contradiction testing, escalation and approval, and audit-trail logging.
Results section listing five control points as mandatory design elements for high-judgment accounting use cases; conceptual recommendation from synthesis.
high positive Collaborative Intelligence in Accounting: A Human + AI Compl... governance_and_regulation
The paper develops the C³ Framework—Complementarity, Controls, and Competencies—which maps accounting tasks by task structure and judgment/materiality to recommend collaboration modes.
Results section: conceptual framework developed by the authors based on synthesized literature and guidance; no reported empirical validation in the abstract.
AI accelerates drafting, summarization, and pattern detection in accounting while professionals remain accountable for judgment, materiality, and defensibility in financial reporting and analysis.
Statement in paper summarizing literature and practitioner guidance (2023–2025); conceptual synthesis rather than new empirical data.
AI tools can serve as valuable aids in task splitting, provided there is human oversight to filter out irrelevant tasks.
Paper's conclusion synthesizing experimental results and participant feedback, recommending human-in-the-loop oversight when using AI for task-splitting.
high positive Splitting User Stories Into Tasks with AI -- A Foe or an All... effectiveness of AI-assisted task-splitting under human oversight
Participants favored a hybrid approach, combining AI tools with conventional methods to maintain high accuracy in planning.
Participant preferences and qualitative feedback reported from the controlled experiment indicating preference for combining AI assistance with human methods; sample size not provided.
high positive Splitting User Stories Into Tasks with AI -- A Foe or an All... participant preference for planning approach / planning accuracy
AI-assisted approaches can help ensure no important tasks are overlooked during task-splitting.
Reported finding from the experiment indicating AI assistance reduced omissions in task lists (paper statement based on experiment and participant observations); sample size not stated.
high positive Splitting User Stories Into Tasks with AI -- A Foe or an All... task omission rate / completeness of task lists
AI-assisted approaches can generate more granular task lists than traditional methods.
Experimental comparison reported in the paper showing AI-generated task lists were more granular (based on task lists produced during the controlled experiment); sample size not provided in summary.
The Analysis Contract framework generalizes across domains of vibe inference through domain-specific instantiation.
Theoretical claim and conceptual generalization proposed in the paper; no cross-domain empirical tests or case studies reported.
high positive Vibe Econometrics and the Analysis Contract applicability/generalizability of the Analysis Contract across domains
The Analysis Contract, a proposed pre-commitment framework, can adapt the logic of pre-analysis plans and the Causal Roadmap to the AI-assisted setting by imposing three conditions before a causal claim is made: a method-data contract, a data audit, and a pre-commitment statement defining what would count as a disconfirming result.
Proposed methodological/framework contribution in the paper; described and motivated conceptually, without empirical validation or implementation evidence.
high positive Vibe Econometrics and the Analysis Contract governance of AI-assisted causal claims / credibility of causal claims under AI ...
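The three conditions named in the claim above can be read as a gate that must be filled in before a causal claim is stated. The following is a minimal sketch under that reading only: the class name, field names, and the `may_state_causal_claim` helper are hypothetical illustrations, not an implementation taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisContract:
    """Hypothetical record of the three pre-commitment conditions."""
    method_data_contract: str   # why the chosen method fits the data at hand
    data_audit: str             # what was checked in the data, and what was found
    disconfirming_result: str   # stated in advance: what would count against the claim

    def missing(self) -> list[str]:
        """Names of conditions that are still blank."""
        return [name for name, value in vars(self).items() if not value.strip()]


def may_state_causal_claim(contract: AnalysisContract) -> bool:
    """Allow the claim only when all three conditions are filled in."""
    return not contract.missing()


draft = AnalysisContract(
    method_data_contract="Difference-in-differences on a two-period panel",
    data_audit="Checked panel balance and missing outcomes",
    disconfirming_result="",  # not yet pre-committed
)
print(draft.missing())                # ['disconfirming_result']
print(may_state_causal_claim(draft))  # False
```

The sketch checks only that each condition has been written down; judging whether the content is adequate remains a human task.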
Users maintain a moderate level of trust in AI even when their decisions diverge from those of AI.
Reported descriptive/analytic finding from the experiment with 59 pre-service teachers indicating measured trust remained at a moderate level in inconsistent decision conditions.
high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... trust in AI under decision divergence
The proportion of consistent decisions significantly moderates the impact of AI-assisted decision-making paradigms on users' confidence levels.
Moderation analysis reported in the study (N=59); authors indicate that proportion of consistent human-AI decisions significantly moderates the effect of AI-assisted decision-making paradigm on confidence.
high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... users' confidence (moderation effect)
Consistency between human and AI decisions significantly enhances task performance.
Within-subject consistency manipulation in the experimental sample of 59 pre-service teachers; authors report significant positive association between proportion of consistent decisions and measured task performance.
Consistency between human and AI decisions significantly enhances users' confidence.
Within-subject manipulation of human-AI consistency in the study (N=59); authors report a significant positive effect of consistency on users' confidence in the measured models.
Consistency between human and AI decisions significantly enhances users' trust in AI.
Within-subject manipulation of human-AI consistency in the experiment with 59 pre-service teachers; authors report a significant positive effect of consistency on trust measured and tested in their models.
When human-AI decision consistency is taken into account, AI-assisted decision-making paradigms influence task performance indirectly through a sequential psychological pathway involving users’ confidence and their trust in the AI.
In the same experimental sample (N=59), structural equation modeling showed a significant indirect (mediated) pathway from AI-assisted paradigms → users' confidence → trust in AI → task performance; moderation by human-AI consistency was taken into account.
high positive Shaping Human-AI Collaboration in Education: Effects of AI-A... task performance (mediated effect)
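A sequential (serial) mediation pathway of this kind is commonly written as a chain of regressions, with the indirect effect given by the product of the path coefficients along the chain. The notation below is a generic textbook sketch, not the paper's own specification: X (decision-making paradigm), M_1 (confidence), M_2 (trust in AI), Y (task performance), and all coefficient names are assumed symbols, and the moderation by human-AI consistency is omitted. The block assumes the amsmath package.

```latex
\begin{align}
M_1 &= i_1 + a_1 X + e_1 \\
M_2 &= i_2 + a_2 X + d_{21} M_1 + e_2 \\
Y   &= i_3 + c' X + b_1 M_1 + b_2 M_2 + e_3 \\
\text{serial indirect effect} &= a_1 \, d_{21} \, b_2
\end{align}
```

Here c' is the direct effect of X on Y that remains after both mediators are included.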
Post-hoc SHAP attribution reveals that complaint recurrence and neighborhood-level statistics are stronger predictors of actionable violations than raw complaint volume.
Empirical claim based on post-hoc SHAP feature-attribution analysis applied to the paper's models; the excerpt reports a relative feature importance finding but provides no numeric effect sizes or sample counts.
high positive Scaling the Queue: Reinforcement Learning for Equitable Call... predictive importance for actionable violations (feature importance)
We formalize each domain as a Markov Decision Process (MDP) in which equitable classification coverage is a first-class reward objective.
Methodological specification in the paper asserting each operational domain was modeled as an MDP with equity-aware reward structure. No further empirical details in the excerpt.
high positive Scaling the Queue: Reinforcement Learning for Equitable Call... equitable classification coverage (as a modeled reward)
The proposed technique is designed to maximize throughput, minimize misclassification cost, and actively narrow historical equity gaps in service delivery.
Stated design objectives of the RL approach in the paper. No quantified outcomes or evaluation reported in the provided text.
high positive Scaling the Queue: Reinforcement Learning for Equitable Call... throughput; misclassification cost; historical equity gaps in service delivery
Rather than replacing human classifiers, our agents act as intelligent intake routers that learn to assign incoming complaints to action categories: escalate, batch, defer, inspect now.
Descriptive claim of agent behavior and intended design; asserts agents perform routing decisions into four action categories. No empirical performance numbers provided in the excerpt.
high positive Scaling the Queue: Reinforcement Learning for Equitable Call... complaint routing action assignment
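The excerpts name four routing actions and three design objectives (throughput, misclassification cost, and equity gaps) but give no reward formula. As an illustration only, one common way to combine such objectives in an MDP is a weighted sum per step. Everything in the sketch below, including the `Outcome` fields, the weights, and the way the equity gap is measured, is an assumption rather than the paper's formulation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The four routing actions named in the claim."""
    ESCALATE = "escalate"
    BATCH = "batch"
    DEFER = "defer"
    INSPECT_NOW = "inspect now"


@dataclass(frozen=True)
class Outcome:
    """Hypothetical result of routing one complaint."""
    action: Action
    handled: bool                  # complaint left the intake queue this step
    misclassification_cost: float  # 0.0 when the routing was correct
    gap_before: float              # coverage gap between groups before the step
    gap_after: float               # coverage gap between groups after the step


def reward(o: Outcome, w_throughput: float = 1.0,
           w_cost: float = 1.0, w_equity: float = 1.0) -> float:
    """Weighted sum: reward throughput and gap narrowing, penalize errors."""
    throughput = 1.0 if o.handled else 0.0
    equity_gain = o.gap_before - o.gap_after  # positive when the gap narrows
    return (w_throughput * throughput
            - w_cost * o.misclassification_cost
            + w_equity * equity_gain)


step = Outcome(Action.INSPECT_NOW, handled=True,
               misclassification_cost=0.0, gap_before=0.30, gap_after=0.25)
print(round(reward(step), 2))  # 1.05
```

Putting the equity term inside the reward, rather than checking it after training, is what treating equity as a first-class objective would mean under this reading.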
We develop an equity-centered reinforcement learning (RL) framework that augments call classification capacity across six New York City Department of Buildings operational domains (boiler safety, crane and derrick oversight, heat and hot water, housing complaint triage, scaffold safety, and Natural Area District protection).
Methodological development described in the paper; claimed application domain spans six named DOB operational areas. No evaluation metrics or sample sizes provided in the excerpt.
high positive Scaling the Queue: Reinforcement Learning for Equitable Call... call classification capacity / intake routing capability
Design principle: effective AI assistance should clear a quality threshold suited to the target content, rather than simply be present.
Authors' proposed design principle based on empirical and qualitative results from their study.
high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... design guidance for AI assistance effectiveness
Qualitative findings suggest the required quality threshold for helpful AI drafts is content-dependent; as visual complexity increases, the quality needed from AI drafts increases.
Authors' qualitative analysis from the study (no numeric measures provided in the excerpt).
high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... relationship between visual complexity and required AI draft quality
There is a minimum quality threshold for AI drafts to be effective; simple presence of AI assistance is insufficient.
Synthesis of empirical results and comparisons between GenAD and baseline drafts reported by the authors (stated as an interpretation of the findings).
high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... effectiveness of AI assistance (dependent on draft quality)
Baseline drafts generated from simple, unguided prompts offered only modest benefits compared to authoring from scratch.
Empirical comparison reported in the within-subjects study contrasting GenAD drafts and baseline (unguided-prompt) drafts; no numeric effect sizes or sample sizes provided in the excerpt.
high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... benefit/effectiveness of baseline AI drafts (e.g., quality or efficiency gains)
GenAD drafts significantly reduced cognitive load.
Result reported from the within-subjects study (authors state a significant reduction in cognitive load when using GenAD drafts); specific measure, statistical values, and sample size not provided in the excerpt.
GenAD drafts cut completion time by more than half.
Result reported from the within-subjects study comparing completion time when using GenAD drafts versus authoring from scratch; exact sample size and numeric reduction not provided in the excerpt.
Recent work has shown that giving novice describers an AI-generated draft to start from helps produce higher-quality audio description (AD) and lowers the barrier to entry.
Statement refers to prior published work (no specific study, sample size, or citation provided in the excerpt).
high positive Making AI Drafts Count: A Quality Threshold in Audio Descrip... AD quality / barrier to entry for novice describers
Olava Extract reduced inference cost by 78% to 97% compared with the frontier models tested.
Reported cost comparison (inference cost) versus the five frontier models evaluated in the study; percentage reductions presented in the paper.
Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842.
Reported evaluation results comparing Olava Extract to five frontier models on structured contract extraction; explicit macro and micro F1 scores presented in the paper.
high positive A Few Good Clauses: Comparing LLMs vs Domain-Trained Small L... F1 score (macro and micro)
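Macro and micro F1 can differ because macro F1 averages the per-label scores equally, while micro F1 pools the counts across labels first. The sketch below shows the difference for a simplified single-label case; the label names are hypothetical, and the paper's extraction task may be scored differently (for example per field or multi-label).

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when the label never occurs or is predicted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold: list[str], pred: list[str]) -> tuple[float, float]:
    """Each item has exactly one gold label and one predicted label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative
    labels = sorted(set(gold) | set(pred))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


gold = ["party", "party", "party", "date", "term"]
pred = ["party", "party", "party", "date", "date"]
macro, micro = macro_micro_f1(gold, pred)
print(round(macro, 3), round(micro, 3))  # 0.556 0.8
```

In this toy example the missed rare label ("term") pulls macro F1 well below micro F1, which is why the two figures are usually reported together.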
Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models.
Reported intervention experiments where authors used ASR diagnostics to refine prompts and add deterministic routing guards, observing TSR improvements up to +93.8 percentage points.
GPT-5.2 achieves perfect ASR.
Model-level evaluation reported in the paper indicating GPT-5.2 attained perfect ASR under the HMASP tests.
high positive Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... Agentic Success Rate (ASR)
We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision.
Methodological contribution described in the paper (definition of a new metric and its components).
high positive Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... trajectory fidelity / transition-level execution accuracy
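To make the transition-level comparison concrete, the sketch below treats each execution sequence as a multiset of consecutive (from, to) steps and computes recall and precision over them. This is one plausible reading of the description, not the paper's definition: the step names, the multiset matching, and the handling of empty sequences are assumptions, and the way ASR combines the two components is not specified in the excerpt, so it is not computed here.

```python
from collections import Counter


def transitions(seq: list[str]) -> Counter:
    """Count consecutive (from, to) pairs in an execution sequence."""
    return Counter(zip(seq, seq[1:]))


def transition_scores(expected: list[str], observed: list[str]) -> tuple[float, float]:
    """Return (recall, precision) over transitions.

    Assumed convention: a side with no transitions scores 1.0.
    """
    exp, obs = transitions(expected), transitions(observed)
    matched = sum((exp & obs).values())  # multiset intersection
    recall = matched / sum(exp.values()) if exp else 1.0
    precision = matched / sum(obs.values()) if obs else 1.0
    return recall, precision


expected = ["intake", "validate", "authorize", "settle"]
observed = ["intake", "validate", "validate", "authorize", "settle"]
print(transition_scores(expected, observed))  # (1.0, 0.75)
```

In the example the agent reaches every expected transition (recall 1.0) but adds a redundant validate → validate step, which lowers precision even though the task would still count as successful.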
LLM-based multi-agent systems are increasingly deployed for payment workflows.
Statement in the paper's introduction/abstract framing; no empirical deployment data or sample size provided.
high positive Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... deployment/adoption of LLM-based multi-agent systems for payment workflows
DePAI offers a path to scalable, resilient self-organization that integrates physical infrastructure, AI, and community ownership under transparent rules, on-chain incentives, and permissionless participation, aiming to preserve human autonomy.
Normative/conceptual claim and argument based on the proposed architecture and incentive design; presented without empirical evaluation.
high positive DAO-enabled decentralized physical AI: A new paradigm for hu... scalability and resilience of self-organization, integration of infrastructure/A...
These elements specify workflows that couple machine execution with human oversight, enabling enhanced self-organization of techno-socio-economic systems, which we call DePAI.
Theoretical workflow specification and argumentation in the paper; no reported experimental or observational validation.
high positive DAO-enabled decentralized physical AI: A new paradigm for hu... workflows coupling machine execution with human oversight and resulting self-org...
We connect DAO design with digital-democracy research on deliberation and voting, showing how each can advance the other.
Conceptual linkage and theoretical argumentation drawing on literature from DAO design and digital-democracy research; no empirical test or sample described.
high positive DAO-enabled decentralized physical AI: A new paradigm for hu... mutual advancement of DAO design and digital-democracy practices (deliberation a...
We synthesize foundations in blockchains, decentralized autonomous organizations (DAOs), and cryptoeconomics.
Literature synthesis and conceptual review within the paper; no empirical sample or experimental method reported.
high positive DAO-enabled decentralized physical AI: A new paradigm for hu... coverage/synthesis of foundational literature on blockchains, DAOs, and cryptoec...
We propose DAO-enabled decentralized physical AI (DePAI), a democratic architecture for coordinating humans and autonomous machines in the operation and governance of physical-digital systems.
Conceptual proposal and architectural synthesis presented in the paper (theory/design contribution). No empirical evaluation or sample reported.
high positive DAO-enabled decentralized physical AI: A new paradigm for hu... coordination of humans and autonomous machines in operation and governance of ph...
Human performance on the benchmark is 80.7%.
Human baseline reported in the paper (same evaluation/rubrics as agents).
high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... benchmark score (human performance)
We provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%.
Description of a reduced-size benchmark split (100 tasks) and reported cost reduction (~70%) in the paper.
high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... evaluation cost (and distributional fidelity of the subset)
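The excerpt does not say how the 100-task subset was chosen. One standard way to draw a subset that keeps the full set's distribution is proportional stratified sampling, sketched below with largest-remainder rounding. The stratification key (worker profile) and the toy task records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(tasks, key, k, seed=0):
    """Pick k tasks so each stratum keeps roughly its share of the full set."""
    if not 0 < k <= len(tasks):
        raise ValueError("k must be between 1 and len(tasks)")
    groups = defaultdict(list)
    for task in tasks:
        groups[key(task)].append(task)
    # Ideal (fractional) number of tasks per stratum, then round down.
    quotas = {g: k * len(ts) / len(tasks) for g, ts in groups.items()}
    alloc = {g: int(q) for g, q in quotas.items()}
    # Largest remainder: leftover slots go to the biggest fractional parts.
    leftover = k - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return [t for g, ts in groups.items() for t in rng.sample(ts, alloc[g])]


# Toy data: (profile, task_id) pairs with a 60/30/10 split across profiles.
tasks = ([("analyst", i) for i in range(60)]
         + [("engineer", i) for i in range(30)]
         + [("designer", i) for i in range(10)])
subset = stratified_subset(tasks, key=lambda t: t[0], k=10)
print(sorted(profile for profile, _ in subset))
```

With k = 10 the subset keeps the 6/3/1 split of the toy data. Stratifying on more than one attribute (for example profile and file type) would need a composite key.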
Workspace-Bench includes files up to 20GB in size.
Dataset description in the paper specifying maximum file size.
high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... maximum file size in the benchmark
We construct Workspace-Bench with 5 worker profiles, 74 file types, 20,476 files (up to 20GB), 388 tasks, and 7,399 total rubrics, each task associated with its own file dependency graph.
Dataset construction described in the paper; counts and sizes reported by authors.
high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... benchmark size and heterogeneity (worker profiles, file types, file count, task ...
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively.
Conceptual definition and motivation presented by the authors in the paper (no empirical test reported).
high positive Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... ability of AI agents to use file dependencies to complete tasks
A six-factor confirmatory factor analysis confirmed the measurement model used for perceived intimacy, perceived responsiveness, trust, purchase intention, hedonic motivation, and utilitarian motivation.
Reported six-factor CFA on the collected survey measures from the experimental sample (N = 439).
high positive Conditional trust pathways in live-streaming commerce: how c... measurement validity (factor structure)
The mediating effect of perceived intimacy on the anchor type → trust relationship was stronger when participants' hedonic motivation was moderate to high.
Moderated mediation analyses including hedonic motivation as a moderator on the intimacy→trust pathway; experimental sample N = 439; reported conditional mediation (stronger at moderate-to-high hedonic motivation).
high positive Conditional trust pathways in live-streaming commerce: how c... trust (conditional mediation by perceived intimacy moderated by hedonic motivati...
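When a moderator acts on the mediator → outcome path, the indirect effect becomes a function of the moderator. The notation below is a generic sketch of that structure, not the paper's reported model: X (anchor type), M (perceived intimacy), W (hedonic motivation), Y (trust), and all coefficient names are assumed symbols. The block assumes the amsmath package.

```latex
\begin{align}
M &= i_1 + a X + e_1 \\
Y &= i_2 + c' X + b_1 M + b_2 W + b_3 (M \times W) + e_2 \\
\omega(W) &= a \, (b_1 + b_3 W)
\end{align}
```

Under this form, a positive b_3 makes the conditional indirect effect ω(W) grow with W, which is the pattern described as stronger mediation at moderate-to-high hedonic motivation.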
Perceived intimacy (a relational cue) mediated the effect of anchor type on trust.
Moderated mediation analyses (with heteroscedasticity-consistent standard errors) based on experimental manipulation of anchor type in sample N = 439; CFA supported measurement of the mediator and outcome constructs; mediation path reported for perceived intimacy.
high positive Conditional trust pathways in live-streaming commerce: how c... trust (mediated by perceived intimacy)
Human anchors generated higher purchase intention overall than AI anchors.
Between-subjects randomized experiment (N = 439) with a purchase-intention measure administered after exposure to either a human or AI anchor; reported higher purchase intention in the human-anchor condition.
Human anchors generated higher trust overall than AI anchors.
Between-subjects randomized experiment comparing human vs AI live-streaming anchors; participants (N = 439) watched one video and completed a trust measure; analysis reported higher trust for human-anchor condition.
Conceptually, AI is positioned not as an automated controller but as an intelligence-augmenting co-regulator that supports learners' capacity to coordinate effort, attention, and understanding together.
Authors' conceptual framing and interpretation of empirical results showing that adaptive and proactive AI feedback supports shared regulation processes during dyadic programming tasks.
high positive Cognitive Alignment Drives Attention: Modeling and Supportin... conceptual role of AI in co-regulation of learning
Proactive, forecast-based feedback using machine-learning predictions of future collaboration states (Study 3) further enhances performance and sustains shared regulation by anticipating breakdowns before they manifest.
Study 3 intervention using ML-based forecasts of future collaboration states to provide proactive support; reported improvements relative to reactive/single-channel conditions (per-study sample sizes and ML model metrics not provided in abstract).
high positive Cognitive Alignment Drives Attention: Modeling and Supportin... collaboration performance and sustained shared regulation (reduction/prevention ...