The Commonplace

Evidence (6491 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
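
The matrix can be summarised as the share of each direction within an outcome. The sketch below is one way to do that; the function name `direction_shares`, the selection of rows, and the choice of the listed Total as denominator are assumptions, not something the dashboard defines. The listed Total is not always the sum of the four direction columns shown (for "Other", 758 + 199 + 100 + 900 = 1957 against a listed total of 2007), so the sketch reports that remainder separately rather than guessing what it represents.

```python
# A few rows copied from the matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Other": (758, 199, 100, 900, 2007),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "AI Safety & Ethics": (218, 277, 65, 33, 599),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(row):
    """Return each direction's share of the listed total, plus the
    count not accounted for by the four direction columns."""
    positive, negative, mixed, null, total = row
    remainder = total - (positive + negative + mixed + null)
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
        "unlisted": remainder,  # claims in the total but in none of the four columns
    }


for outcome, row in ROWS.items():
    shares = direction_shares(row)
    print(
        f"{outcome}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative, "
        f"{shares['unlisted']} unlisted"
    )
```

Dividing by the listed total keeps the shares comparable across outcomes even when the four columns do not cover every claim.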
Filter: Human-AI Collaboration
Positive Alignment is a distinct and necessary agenda within AI alignment research.
Normative argumentation in the paper advocating for a separate research agenda (no empirical validation presented).
high positive Positive Alignment: Artificial Intelligence for Human Flouri... need for a distinct research agenda in alignment
Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative.
Paper's definitional proposal / conceptual framing (normative definition rather than empirical evidence).
high positive Positive Alignment: Artificial Intelligence for Human Flouri... definition and intended properties of 'Positive Alignment' systems
Policy frameworks are necessary to govern verifiable machine intelligence in modern socio-technical infrastructures.
Normative recommendation and policy discussion in the paper; no empirical policy evaluation or legislative case studies are presented in the supplied text.
high positive Optimizing Process Based Reward Models through Reinforcement... existence/need for governance and regulation
Process-based supervision has broader implications for algorithmic fairness and can reduce black-box opacity.
High-level discussion in the paper linking process-verifiability to fairness and reduced opacity; no empirical fairness audits or quantitative fairness metrics reported in the provided text.
high positive Optimizing Process Based Reward Models through Reinforcement... algorithmic fairness / model opacity
Integrating reinforcement learning with process-oriented feedback can foster a more transparent AI ecosystem where the path to a conclusion is as scrutinized as the conclusion itself.
Conceptual claim and proposed benefit in the paper; presented as an argument rather than supported by empirical transparency or interpretability studies in the supplied text.
high positive Optimizing Process Based Reward Models through Reinforcement... transparency / interpretability of model reasoning
Process-based supervision significantly improves the reliability of models in high-stakes domains such as law, medicine, and engineering.
Asserted by the authors as an advantage of PRMs for high-stakes applications; presented as argumentation rather than backed by reported empirical trials or case-study sample sizes in the provided text.
high positive Optimizing Process Based Reward Models through Reinforcement... model reliability in high-stakes domains
Optimizing PRMs through reinforcement learning enhances the verifiability and robustness of multi-step reasoning in large-scale model architectures.
Central argumentative claim of the paper (theoretical proposal and conceptual analysis); no experimental results or quantitative evaluation provided in the text supplied.
high positive Optimizing Process Based Reward Models through Reinforcement... verifiability and robustness of multi-step reasoning
Process-Based Reward Models (PRMs) assign value to each distinct stage of a reasoning chain, providing a more granular signal for training than outcome-only approaches.
Methodological description and conceptual argument in the paper; described as a design/approach rather than empirically validated with data.
high positive Optimizing Process Based Reward Models through Reinforcement... training signal granularity / training effectiveness
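
The contrast between a per-step signal and an outcome-only signal can be illustrated with a toy example. This is a minimal sketch, not the paper's method: `score_step` is a hypothetical stand-in for a learned process reward model, and the arithmetic-checking rule is an assumption chosen only to keep the example runnable.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def score_step(step: str) -> float:
    """Hypothetical step scorer: 1.0 if 'a op b = c' is arithmetically correct."""
    left, claimed = step.split("=")
    a, op, b = left.split()
    return 1.0 if OPS[op](int(a), int(b)) == int(claimed) else 0.0


def process_rewards(chain):
    """Process-style signal: one reward per reasoning step."""
    return [score_step(step) for step in chain]


def outcome_reward(chain, expected):
    """Outcome-only signal: a single reward based on the final answer."""
    final = int(chain[-1].split("=")[1])
    return 1.0 if final == expected else 0.0


# Hypothetical three-step chain whose last step contains an error.
chain = ["2 + 3 = 5", "5 * 4 = 20", "20 - 1 = 18"]

print(process_rewards(chain))     # locates the faulty step
print(outcome_reward(chain, 19))  # only says the final answer is wrong
```

The per-step list identifies which step failed, whereas the outcome-only reward gives the same single value no matter where the error occurred.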
Methodologically, the study demonstrates how expert reasoning can be operationalized as a benchmark for evaluating AI systems in urban infrastructure contexts, addressing gaps in empirical assessment and governance tools.
Study design: creation of Delphi-derived rubric from 20 experts and its use as an evaluation benchmark for six LLMs; reported as a methodological contribution.
high positive Governance risks of AI reasoning in urban infrastructure thr... feasibility of operationalizing expert reasoning as evaluation benchmark
The Delphi process elicited and refined expert reasoning criteria, producing a rubric that emphasized public safety, regulatory compliance, contextual judgment, financial stewardship, and system reliability.
Method: Delphi process with 20 infrastructure professionals that generated and refined reasoning criteria; resulting rubric content reported in paper.
high positive Governance risks of AI reasoning in urban infrastructure thr... content/themes of the expert-derived rubric
Experiments show consistent advantages in viewer engagement.
Reported experimental comparison vs named baselines; claim of consistent advantage in viewer engagement without numeric effect size provided in the excerpt.
Experiments show consistent advantages in tactfulness.
Reported experimental comparison vs named baselines; claim of consistent advantage in tactfulness without numeric effect size provided in the excerpt.
Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 18% on factual correctness.
Reported experimental comparison vs named baselines; specific numeric improvement stated (18% gain on factual correctness). Evaluation dataset or sample size not provided in the excerpt.
Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness.
Reported experimental comparison vs named baselines; specific numeric improvement stated (23% gain on informativeness). Evaluation dataset or sample size not provided in the excerpt.
We fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection.
Methodological contribution: fine-tuning an LLM on the collected annotated data, described in the paper.
high positive VerbalValue: A Socially Intelligent Virtual Host for Sales-D... ability to produce empathetic, commercially oriented responses
We collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents.
Dataset creation reported in the methods: explicitly states 1,475 annotated live-commerce interactions.
high positive VerbalValue: A Socially Intelligent Virtual Host for Sales-D... size of annotated dataset
We construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise.
Methodological contribution described in the paper: construction of a domain knowledge base and curated sales terminology lexicon.
high positive VerbalValue: A Socially Intelligent Virtual Host for Sales-D... availability of domain knowledge and sales lexicon (artifact creation)
A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure.
Conceptual description in the paper's introduction; no empirical data or experimental method cited in the excerpt.
high positive VerbalValue: A Socially Intelligent Virtual Host for Sales-D... purchase intent / sales conversion
A causal ablation confirms that each of the four mechanical enforcement primitives is individually necessary.
Causal ablation experiments reported by authors in the synthetic banking domain: removing each primitive degrades performance/governance, implying individual necessity. Abstract does not report exact experimental counts or effect sizes.
high positive Mechanical Enforcement for LLM Governance: Evidence of Govern... impact of removing each mechanical primitive on governance/task metrics (necessi...
Mechanical enforcement raises task accuracy from MCC ~0.43 to 0.88.
Reported Matthews correlation coefficient (MCC) for task accuracy under text-only governance (≈0.43) versus mechanical enforcement (≈0.88) in the paper's synthetic experiments; sample size not provided in abstract.
high positive Mechanical Enforcement for LLM Governance: Evidence of Govern... task accuracy (Matthews correlation coefficient)
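
For readers unfamiliar with the metric, the Matthews correlation coefficient is computed from the four cells of a binary confusion matrix and ranges from -1 to 1. The sketch below uses the standard definition; the counts passed to it are hypothetical and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a whole row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only.
print(round(mcc(tp=45, tn=45, fp=5, fn=5), 2))
```

Unlike plain accuracy, the coefficient uses all four cells, so a classifier that predicts only the majority class scores 0 under the convention above rather than a misleadingly high value.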
Mechanical enforcement more than doubles deferral information content.
Comparison of information-content measures for deferrals between text-only governance and mechanical enforcement in the synthetic banking domain experiments; exact numeric basis not given in abstract.
high positive Mechanical Enforcement for LLM Governance: Evidence of Govern... deferral information content (information-theoretic or content metric reported b...
Mechanical enforcement reduces the rate of deferrals that carry no decision-relevant information by 73%.
Head-to-head comparison between text-only governance and a mechanically enforced architecture (four primitives) in the paper's synthetic banking experiments; specific sample size not stated in abstract.
high positive Mechanical Enforcement for LLM Governance: Evidence of Govern... relative reduction in rate of non-informative deferrals
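
A 73% figure of this kind is a relative reduction, so its absolute size depends on the baseline rate, which the abstract does not report. The sketch below shows the arithmetic only; the 40% baseline is purely hypothetical and the function names are assumptions.

```python
def relative_reduction(baseline_rate: float, treated_rate: float) -> float:
    """Fraction by which the treated rate falls below the baseline rate."""
    return (baseline_rate - treated_rate) / baseline_rate


def apply_reduction(baseline_rate: float, reduction: float) -> float:
    """Rate that remains after a given relative reduction."""
    return baseline_rate * (1 - reduction)


# Hypothetical baseline: 40% of deferrals carry no decision-relevant information.
baseline = 0.40
treated = apply_reduction(baseline, 0.73)
print(f"{treated:.3f}")                                # remaining rate
print(f"{relative_reduction(baseline, treated):.2f}")  # recovers the reduction
```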
There is a positive spillover effect on AI-ineligible chats: treated workers adapted their multitasking workflow to devote greater attention to these chats.
Experiment-level observations comparing worker behavior on AI-ineligible chats between treatment and control; treated workers reallocated attention/effort (multitasking workflow changes) leading to improved attention on AI-ineligible chats.
high positive Agentic AI and Human-in-the-Loop Interventions: Field Experi... attention/effort devoted to AI-ineligible chats (spillover effect)
Early intervention is essential for sustaining high post-escalation intervention effort.
Temporal analysis of intervention timing within the randomized experiment showing an association between earlier human intervention after escalation and higher subsequent intervention effort.
high positive Agentic AI and Human-in-the-Loop Interventions: Field Experi... post-escalation intervention effort as a function of intervention timing
Human intervention preserves service quality in algorithm-triggered technical escalations (unresolved customer issues beyond the AI's capability).
Experimental subgroup analysis of escalations categorized as algorithm-triggered technical escalations; post-escalation human interventions were observed to maintain service quality in these cases.
high positive Agentic AI and Human-in-the-Loop Interventions: Field Experi... service quality after technical escalations
By reframing reskilling as a shared, supported, and bounded process, AI-driven change can foster long-term career resilience, professional identity renewal, and sustainable human–AI integration.
Conceptual conclusion/implication drawn by the authors from the proposed model and recommendations; no empirical validation included in the paper.
high positive AI-driven skill volatility and the emergence of re-skilling ... career resilience, professional identity renewal, quality of human–AI integratio...
The paper advances a set of sustainable, collective strategies—such as role-linked learning, protected learning time, skill prioritization, and phased AI adoption—to interrupt the reskilling loop and redistribute adaptive demands across organizations.
Prescriptive/theoretical recommendations proposed by the authors; no empirical evaluation or trial evidence presented.
high positive AI-driven skill volatility and the emergence of re-skilling ... effectiveness of organizational strategies in reducing reskilling burdens
The paper proposes a reconstructed labour law framework based on economic dependency rather than traditional employment classification, including recognition of dependent contractor status, platform liability for worker welfare, algorithmic transparency, social security obligations, and specialised grievance mechanisms.
Normative legal/policy proposal articulated by the author(s) based on theoretical argument and the comparative analysis of existing regulatory gaps; prescriptive recommendation rather than empirically tested intervention.
high positive Corporate Accountability in the Gig Economy: Re-examining La... recommended legal/regulatory reforms and institutional design
The appropriate design response to Metis tasks is centaur architectures in which humans lead and AI supports, rather than pursuing further automation.
Prescriptive recommendation based on the conceptual analysis and normative reasoning in the paper; not supported by empirical evaluation or quantified comparisons of architectures.
high positive Metis AI: The Overlooked Middle Zone Between AI-Native and W... recommended human-AI system design
These verified assertions improve users' performance on code-comprehension tasks in a user study with more than 400 participants.
User study reported in the paper: a study involving more than 400 participants measured performance on code-comprehension tasks with and without the verified assertions (sample size reported as >400 participants).
high positive Viverra: Text-to-Code with Guarantees users' performance on code-comprehension tasks
Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions.
Empirical evaluation reported in the paper: a test set of 18 programming tasks was used to evaluate Viverra's ability to generate code with verified assertions (sample size = 18 tasks).
high positive Viverra: Text-to-Code with Guarantees rate/ability to generate code with verified assertions
Viverra verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers.
Method description: the paper states that verification is done compositionally and in a best-effort way using a portfolio of bounded model checkers (implementation/algorithmic claim).
high positive Viverra: Text-to-Code with Guarantees verification of assertions using bounded model checkers
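
The idea of checking an assertion only up to a bound can be illustrated in a few lines. The sketch below is not Viverra's pipeline, which according to the paper targets C programs and uses a portfolio of bounded model checkers; it is a toy Python analogue with assumed names (`clamp`, `assertion`, `bounded_check`) that exhaustively tests a candidate assertion over a small input domain.

```python
from itertools import product


def clamp(x: int, lo: int, hi: int) -> int:
    """Program under check: restrict x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def assertion(x: int, lo: int, hi: int, result: int) -> bool:
    """Candidate property: the result lies inside [lo, hi]."""
    return lo <= result <= hi


def bounded_check(bound: int):
    """Try every (x, lo, hi) with values in [-bound, bound] and lo <= hi.

    Returns a counterexample tuple, or None if no violation is found
    within the bound.
    """
    domain = range(-bound, bound + 1)
    for x, lo, hi in product(domain, repeat=3):
        if lo > hi:
            continue  # precondition of clamp: a non-empty interval
        if not assertion(x, lo, hi, clamp(x, lo, hi)):
            return (x, lo, hi)
    return None


print(bounded_check(5))
```

A result of `None` means only that no violation exists within the chosen bound, which is why bounded checking is naturally described as best-effort rather than as a proof over all inputs.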
Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties.
Method section description: the workflow described in the paper explicitly states LLM prompting to produce C programs and candidate assertions (methodological claim, illustrated with examples).
high positive Viverra: Text-to-Code with Guarantees generation of C program plus candidate assertions
Viverra automatically produces formally verified annotations alongside generated code to aid users' understanding of the generated program.
System description in the paper: Viverra is presented as a system that generates code together with formally verified annotations; implementation details and demonstration are described (no precise external benchmark cited here).
high positive Viverra: Text-to-Code with Guarantees availability of formally verified annotations alongside generated code
Participants cited inclusivity as their primary reason for preferring LLM facilitators.
Post-task survey responses where participants reported reasons for preferring LLM-facilitated discussion; inclusivity reported as the primary reason.
high positive Real-Time Group Dynamics with LLM Facilitation: Evidence fro... self-reported reasons for facilitator preference (inclusivity)
Participants consistently preferred facilitated discussion.
Survey responses collected after deliberation across both studies indicating participant preference for facilitated discussions over no facilitation.
high positive Real-Time Group Dynamics with LLM Facilitation: Evidence fro... participant preference for facilitated discussion (self-report)
The study offers actionable insights for leaders seeking to balance innovation, capability development and ethical governance in AI-enabled workplaces while sustaining human interpretive authority, accountability and responsibility over time.
Implications and recommendations derived from the study's qualitative findings (28 interviews) and interpretive synthesis.
high positive Reimagining work in the age of intelligent automation: a qua... guidance for leadership on balancing innovation and governance
AI reshapes contemporary work by augmenting, rather than substituting, human roles.
Qualitative semistructured interviews with 28 managers and professionals from 12 organizations across technology, finance and knowledge-intensive services in Europe and Asia; thematic and interpretive analysis supported by organizational document review.
high positive Reimagining work in the age of intelligent automation: a qua... nature of human roles (augmentation vs substitution)
The paper proposes a technical and regulatory pivot: bounding the evidentiary weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes (specifically linear probes, activation patching, and before/after-training comparisons).
Policy and technical recommendations presented in the paper (proposal, not empirical test).
high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance and regulation
We introduce the concept of 'fragile assurance' to describe cases where the evidential structure does not support the asserted safety claim.
Paper's conceptual contribution defining 'fragile assurance' and illustrating the notion with argumentation/examples.
We formalize the structural mismatch between required and achievable verification access as the 'audit gap'.
Paper introduces a formal definition and conceptual framing called the 'audit gap'.
high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance and regulation
AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability.
Paper's review of AI governance frameworks enacted between 2019 and early 2026 (policy/literature review as reported in the paper).
high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance and regulation
Task complexity positively moderates the relationships between GenAI usage patterns and knowledge integration capability.
Moderation analysis using three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; interaction terms between task complexity and GenAI usage patterns reported to have positive effects on knowledge integration capability.
high positive The impact of generative artificial intelligence (GenAI) usa... knowledge integration capability
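
A moderation effect of this kind is conventionally tested with an interaction term. As a generic illustration (the symbols and the exact specification are assumptions, not taken from the paper), let U be a GenAI usage pattern, C task complexity, and K knowledge integration capability for employee i:

```latex
K_i = \beta_0 + \beta_1 U_i + \beta_2 C_i + \beta_3 \,(U_i \times C_i) + \varepsilon_i
```

Positive moderation corresponds to \beta_3 > 0: the marginal association of usage with capability, \beta_1 + \beta_3 C_i, grows as task complexity increases.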
Employees' knowledge integration capability plays a critical complementary mediating role in the relationships between GenAI usage patterns (exploitative and exploratory) and creativity.
Mediation analysis conducted on three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; knowledge integration capability measured and tested as mediator between GenAI usage patterns and creativity outcomes.
high positive The impact of generative artificial intelligence (GenAI) usa... creativity (incremental and radical) via mediator knowledge integration capabili...
Exploratory GenAI use is more strongly positively associated with radical creativity than incremental creativity.
Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploratory GenAI use with radical vs. incremental creativity (mediation and moderation models reported in paper).
high positive The impact of generative artificial intelligence (GenAI) usa... radical creativity (and compared to incremental creativity)
Exploitative GenAI use is more strongly positively associated with incremental creativity than radical creativity.
Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploitative GenAI use with incremental vs. radical creativity (mediation and moderation models reported in paper).
high positive The impact of generative artificial intelligence (GenAI) usa... incremental creativity (and compared to radical creativity)
Function signatures, constraints and style descriptions emerge as the most influential prompt dimensions affecting the readability of LLM-generated code.
Systematic examination of multiple prompt dimensions in the paper, reporting that function signatures, constraints, and style descriptions had the largest measured influence on readability scores.
high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... impact of prompt dimensions on readability
We evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode.
Empirical evaluation reported in paper using 5,869 scenarios drawn from WoC and LeetCode; LLM-generated code samples were produced and scored with the readability model.
high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... coverage of evaluation / dataset size for readability assessment
We establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code.
Description in paper of a newly constructed readability model combining textual, structural, program, and visual features; model development is presented as a methodological contribution (no numeric effect size).
high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... code readability (measured via the proposed readability model)
Visualizing spatial (localization) uncertainty in the annotation interface improves human-in-the-loop annotation (i.e., localization uncertainty is a lever to improve annotation quality/efficiency).
Synthesis/interpretation in the paper based on the controlled study results (120 participants) and box-level analysis showing improved label quality and reduced time when uncertainty cues were shown.
high positive From Model Uncertainty to Human Attention: Localization-Awar... human-in-the-loop annotation quality and efficiency