The Commonplace

Evidence (6491 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
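The matrix rows can also be read programmatically. The following is a minimal sketch, assuming each complete row is an outcome name followed by exactly five integers (Positive, Negative, Mixed, Null, Total). The names `MatrixRow`, `parse_row`, and `positive_share` are hypothetical helpers, not part of the site. Rows with fewer than five numbers are skipped, since the missing column cannot be inferred from the text alone.

```python
from typing import NamedTuple, Optional


class MatrixRow(NamedTuple):
    outcome: str
    positive: int
    negative: int
    mixed: int
    null: int
    total: int


def parse_row(line: str) -> Optional[MatrixRow]:
    """Split a flattened row into its outcome name and five counts.

    Returns None when the row does not end in exactly five integers.
    """
    tokens = line.split()
    numbers = []
    # Peel integers off the right-hand side; what remains is the name.
    while tokens and tokens[-1].isdigit():
        numbers.append(int(tokens.pop()))
    numbers.reverse()
    if len(numbers) != 5 or not tokens:
        return None
    return MatrixRow(" ".join(tokens), *numbers)


def positive_share(row: MatrixRow) -> float:
    """Fraction of an outcome's claims that report a positive direction."""
    return row.positive / row.total


row = parse_row("Error Rate 69 92 10 2 173")
short = parse_row("Worker Turnover 11 12 3 26")  # only four numbers: skipped
```

For the Error Rate row this gives a positive share of roughly 0.40, meaning most claims for that outcome are not in the positive direction.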
Filter: Human-AI Collaboration
A box-level analysis confirms that the uncertainty cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes.
Box-level analysis reported in paper comparing annotator behavior across predicted boxes with differing localization uncertainty; analysis shows effort reallocation toward boxes labeled as high-uncertainty.
high positive From Model Uncertainty to Human Attention: Localization-Awar... annotator effort allocation across predicted boxes
In the same controlled study, participants who received uncertainty cues were faster overall (reduced annotation time).
Same controlled user study with 120 participants comparing interfaces with and without spatial-uncertainty visualizations; paper reports that participants with cues were faster overall.
In a controlled study with 120 participants, those who received uncertainty cues achieved higher label quality.
Controlled user study reported in the paper; 120 participants; comparison between annotators who received visualized spatial-uncertainty cues via a purpose-built interface and those who did not; paper reports label quality outcomes.
The model identifies simple measures/conditions that characterize when productivity paradoxes and skill polarization arise.
Theoretical derivations and analytical characterizations within the model yielding threshold conditions and measures parameterizing when paradoxical outcomes occur (model-based; no empirical validation).
high positive Human-AI Productivity Paradoxes: Modeling the Interplay of S... predictive conditions/thresholds for productivity paradoxes and skill polarizati...
Replicating the within-subject experiment with simulated users recovers aggregate model hierarchies (i.e., the same ranking of models at the population level).
A replication of the human within-subject experiment using simulated users; authors report that aggregate model ranking/hierarchy is preserved between simulators and humans.
high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... agreement in aggregate model rankings between simulated-user evaluations and hum...
People reward sycophancy and relationship-seeking behaviours in short-term evaluations.
Participant judgments in the blinded multi-turn conversations (same 530-participant experiment) indicated higher short-term preference ratings for outputs exhibiting sycophancy/relationship-seeking.
high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... participant short-term preference ratings for model outputs showing sycophancy/r...
Preference fine-tuning (P-DPO) significantly outperforms both a generic model and personalised prompting in blinded multi-turn conversations with human participants.
Within-subject blinded multi-turn conversation experiment with 530 human participants comparing P-DPO, a generic model, and personalised prompting; statistical comparison reported in paper (claimed 'significantly outperforms').
high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... human preference / model ranking as judged by participants in blinded multi-turn...
A digital twin analytics platform validation shows that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.
Validation/demonstration reported in the paper using a digital twin analytics platform; platform demonstration claimed to eliminate tool-call hallucination and enable cross-domain configurability via configuration only.
high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... tool-call hallucination elimination and cross-domain configurability without app...
In the same controlled experiment, ontology-grounded parameters reduced domain-identifier hallucination to 0%.
Same controlled experiment (six industry configurations, 72 tool invocations with Qwen3-32B) reported in the paper; ontology-grounded parameter condition produced 0% hallucination.
high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... hallucination rate for domain identifiers (ontology-grounded condition)
The architecture is formalized as a three-operation interface contract — resolve, contextualize, annotate — with invariants enforced by an AIOps orchestration layer.
Design specification and formalization presented in the paper (architectural description).
high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... existence of a three-operation interface contract and invariant enforcement
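One way to picture the three-operation contract is as a small typed interface. This is a sketch under stated assumptions: the claim names only the operations (resolve, contextualize, annotate), so the signatures, the dict-backed ontology, the `DictOntologyLayer` class, and the `press_01` identifier below are all hypothetical illustrations, not the paper's design.

```python
from typing import Dict, Protocol


class OntologyToolLayer(Protocol):
    """The three operations named in the claim; signatures are assumed."""

    def resolve(self, identifier: str) -> str: ...
    def contextualize(self, entity: str) -> Dict[str, str]: ...
    def annotate(self, entity: str, result: Dict[str, float]) -> Dict[str, object]: ...


class DictOntologyLayer:
    """Toy implementation backed by a plain dict configuration."""

    def __init__(self, ontology: Dict[str, Dict[str, str]]) -> None:
        self._ontology = ontology

    def resolve(self, identifier: str) -> str:
        # Assumed invariant: only identifiers present in the ontology pass,
        # so an invented identifier fails before any tool is invoked.
        if identifier not in self._ontology:
            raise KeyError(f"unknown identifier: {identifier!r}")
        return identifier

    def contextualize(self, entity: str) -> Dict[str, str]:
        # Return a copy of the entity's ontology attributes.
        return dict(self._ontology[self.resolve(entity)])

    def annotate(self, entity: str, result: Dict[str, float]) -> Dict[str, object]:
        # Attach ontology context to a raw tool result.
        return {
            "entity": self.resolve(entity),
            "context": self.contextualize(entity),
            "result": result,
        }


layer = DictOntologyLayer({"press_01": {"type": "HydraulicPress", "line": "A"}})
annotated = layer.annotate("press_01", {"temperature_c": 71.5})
```

In this sketch, swapping the dict passed to the constructor changes the domain without touching the class, which is the kind of configuration-only reuse the claims describe.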
Embedding manufacturing ontology directly into the AI tool layer as a typed relational configuration enforces semantic constraints at runtime and closes the semantic training gap.
Proposed system architecture described and argued in the paper; validated via demonstrations and experiments described later in the paper.
high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... enforcement of semantic constraints at runtime / closure of semantic gap
Sustainable progress requires collaborative integration of humans and machines, rather than replacement.
Normative conclusion/recommendation stated in the paper based on study findings (argument for augmented intelligence over replacement).
high positive Augmented Intelligence: Resolving the AI integration-obsoles... approach to AI-human integration
This research presents the innovative Marketing Intelligence Operations (MIO) Framework and a practical AI Adoption Readiness Scorecard, enabling leaders to manage the operational balance between transformative efficiency improvements and human capital vulnerability.
Paper states that it introduces a new framework and a practical scorecard as deliverables of the research (descriptive claim about the paper's contributions).
high positive Augmented Intelligence: Resolving the AI integration-obsoles... AI adoption readiness / operational management capability
AI-integrated Marketing Intelligence Operations (MIO) quantitatively improves campaign Return on Investment (ROI) by 47%.
Reported as an empirical result from the paper's mixed-methods study (the paper states use of audits, surveys, and NLP analysis to evaluate MIO outcomes).
high positive Augmented Intelligence: Resolving the AI integration-obsoles... campaign Return on Investment (ROI)
Deploying LegalCheck in the Municipality of Amsterdam demonstrated substantial efficiency gains, improved legal consistency, and positive user acceptance.
Summary claim based on the real-world deployment outcomes described in the paper (timing improvements, consistency/factual accuracy statements, and reported positive reception by professionals); specific quantitative metrics and sample sizes are not fully reported in the excerpt.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... efficiency (time), legal consistency, user acceptance
The system produced explainable outputs based on actual regulations and prior cases, providing citations/explainability that support legal reasoning.
Paper describes retrieval from curated legal knowledge bases and generation of outputs grounded in regulations and prior cases during the Amsterdam deployment; presented as a feature of the system and supported by expert review.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... explainability / traceability of generated legal reasoning to source regulations...
LegalCheck uses a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) with curated legal knowledge bases and controlled prompting to retrieve relevant laws and precedents and incorporate case-specific details into coherent drafts.
System architecture and methodology described in the paper (design/implementation claim).
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... n/a (system design / method description)
Legal professionals found that the system ensured a consistent application of legal standards without replacing human judgment.
Reported qualitative feedback from professionals in the Municipality of Amsterdam deployment and the system design that includes an expert-in-the-loop review; no formal measurement of 'replacement' was reported.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... consistency in application of legal standards and preservation of human oversigh...
Legal professionals found that the system reduced their workload.
Reported user feedback from legal professionals during the Municipality of Amsterdam deployment; qualitative statements that professionals experienced workload reduction (no numeric workload metrics or sample size reported).
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... perceived workload of legal professionals
The system's output captured the vast majority of required legal reasoning—often 80% to 100% of essential content.
Reported coverage statistic from the deployment/evaluation described in the paper (phrased as 'often 80% to 100% of essential content'); exact evaluation method, sample size, and measurement protocol are not provided in the excerpt.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... proportion of essential legal reasoning/content captured in generated drafts
LegalCheck maintained high legal consistency and factual accuracy when generating draft letters.
Evaluation during real-world deployment with expert-in-the-loop review and feedback from legal professionals in the Municipality of Amsterdam; claims of high consistency and factual accuracy are reported but no formal numeric accuracy metric or sample size is provided in the text.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... legal consistency and factual accuracy of generated letters
LegalCheck produced near-final advice letters in minutes rather than hours.
Reported results from a real-world deployment within the Municipality of Amsterdam; system logs / timing comparisons between human drafting time (hours) and LegalCheck-assisted drafting time (minutes) are described in the paper (no explicit numeric sample size reported).
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... time to produce advice/objection response letters
We outline a research program for the runtime systems that foundation-model software agents will require.
Paper claims to present a forward-looking research agenda or program (stated in abstract); this is a conceptual contribution rather than an empirical finding.
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... research directions needed for runtime systems for foundation-model software age...
Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, while higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports.
Empirical application described in the abstract: framework applied to a controlled validation task showing systematic variation in episode-package evidence structure across harness levels. The abstract does not report sample size or statistical measures.
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... evidence structure of episode packages produced (types of artifacts: final patch...
We propose a trace-based evaluation protocol that converts each agent run into an auditable episode package.
Methodological proposal described in the abstract proposing a trace-based protocol and an auditable episode package format; no quantitative evaluation details provided in the abstract.
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... auditability of agent runs (availability of trace-based episode packages)
We operationalize the harness through a four-level ladder (H0–H3) that progressively exposes runtime support to the agent.
Design contribution described in the paper (abstract) introducing a four-level ladder (H0–H3) as an operationalization of the harness concept.
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... degree of runtime support exposed to an agent across harness levels
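The ladder and the episode packages described in these claims can be sketched as two small types. The level names H0 to H3 and the artifact kinds come from the claims; the field names, the `EpisodePackage` class, and the pairing of particular artifacts with particular levels in the example are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class HarnessLevel(IntEnum):
    """Four-level ladder; higher levels expose more runtime support."""
    H0 = 0
    H1 = 1
    H2 = 2
    H3 = 3


@dataclass
class EpisodePackage:
    """Auditable record of one agent run (hypothetical field names)."""
    level: HarnessLevel
    final_patch: str
    reproduction_log: Optional[str] = None
    failure_attribution: Optional[str] = None
    requirement_checks: Optional[List[bool]] = None
    verification_report: Optional[str] = None

    def evidence_types(self) -> List[str]:
        """Names of the artifacts actually present in this package."""
        optional = {
            "reproduction_log": self.reproduction_log,
            "failure_attribution": self.failure_attribution,
            "requirement_checks": self.requirement_checks,
            "verification_report": self.verification_report,
        }
        present = [name for name, value in optional.items() if value is not None]
        return ["final_patch"] + present


# Illustrative packages: which artifacts appear at which level is assumed.
minimal = EpisodePackage(HarnessLevel.H0, final_patch="<patch text>")
rich = EpisodePackage(
    HarnessLevel.H3,
    final_patch="<patch text>",
    reproduction_log="<log text>",
    failure_attribution="<attribution text>",
    requirement_checks=[True, True, False],
    verification_report="<report text>",
)
```

Comparing `minimal.evidence_types()` with `rich.evidence_types()` mirrors the reported pattern: the lowest level carries only a final patch, while a higher level carries the additional audit artifacts.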
Foundation models have transformed automated code generation.
Statement in paper's abstract referring to broad impact of foundation models on automated code generation; likely supported by citations and literature overview within the paper (no sample size or quantitative study reported in the abstract).
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... ability of foundation models to generate code (automation of coding tasks)
Authorship preservation should be a design priority for AI tools deployed in identity-relevant, behavior-dependent tasks.
Authors' recommendation based on experimental results showing negative motivational and behavioral consequences of delegating authorship to LLMs despite improved objective goal quality.
high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... design recommendation (no empirical outcome measured)
Mediation analyses identified psychological ownership as the mechanism: it mediated the authorship effect on every downstream motivational and behavioral outcome, while objective goal quality did not.
Mediation analyses reported in the preregistered experiment (authors tested psychological ownership and objective goal quality as mediators of authorship effects on multiple downstream outcomes); preregistered N = 470.
high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... mediating effect of psychological ownership on authorship => motivational and be...
At two-week follow-up, 72.8% of self-authored participants had acted on two or more of their goals, compared to 46.6% in the LLM condition.
Behavioral follow-up measure collected two weeks after the intervention in the preregistered experiment; percentages reported in the paper/abstract. (Follow-up completion N not specified in the abstract.)
high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... proportion of participants who acted on two or more goals within two weeks (beha...
LLM-generated goals scored higher on SMART criteria (specificity, measurability, achievability, relevance, and time-boundedness).
Preregistered randomized experiment comparing self-authored vs LLM-authored goals derived from a personal reflection; reported effect size d = 2.26; total preregistered N = 470.
high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... SMART criteria score (objective goal quality)
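For readers unfamiliar with the reported effect size, Cohen's d is the difference in group means divided by the pooled standard deviation. The sketch below uses the standard pooled-variance formula; the toy scores are invented for illustration and are not data from the study.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Invented toy scores, chosen only to make the arithmetic easy to follow.
toy_llm_scores = [4.0, 5.0, 6.0]
toy_self_scores = [2.0, 3.0, 4.0]
d = cohens_d(toy_llm_scores, toy_self_scores)
```

With these toy numbers the means differ by 2 and the pooled standard deviation is 1, so d is 2.0. A value above 2, like the one reported in the claim, indicates that the two score distributions barely overlap.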
As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior.
Intervention experiments applying PGLA to model decoding on IMAVB; reported consistent improvements in the models' tendency to reject misleading premises after logit adjustment guided by probes.
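The general idea of a probe-guided logit adjustment can be sketched as follows. The excerpt does not give the paper's formulation, so everything here is an assumption: a linear probe with a sigmoid reads a mismatch probability from a hidden state, and that probability, scaled by a hypothetical strength `alpha`, is added to the logits of tokens that begin a rejection. The function names, token strings, and numbers are illustrative.

```python
import math
from typing import Dict, List


def probe_score(hidden: List[float], weights: List[float], bias: float) -> float:
    """Linear probe with a sigmoid: estimated probability of a premise mismatch."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def adjust_logits(
    logits: Dict[str, float],
    rejection_tokens: List[str],
    mismatch_prob: float,
    alpha: float = 2.0,
) -> Dict[str, float]:
    """Add alpha * mismatch_prob to the logits of rejection-start tokens."""
    adjusted = dict(logits)  # leave the caller's logits untouched
    for token in rejection_tokens:
        if token in adjusted:
            adjusted[token] += alpha * mismatch_prob
    return adjusted


# Invented toy values: before adjustment the model prefers "Yes".
logits = {"Yes": 1.2, "No": 1.0, "The": 0.3}
mismatch = probe_score(hidden=[1.0, -0.5], weights=[2.0, -2.0], bias=0.0)
adjusted = adjust_logits(logits, rejection_tokens=["No"], mismatch_prob=mismatch)
```

In this toy case the probe reports a high mismatch probability, so the boosted "No" logit overtakes "Yes". When the probe reports a low probability, the adjustment is small and the original ranking is largely preserved.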
We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension.
Description of new benchmark introduced in paper: 500 clips, 2x2 design (vision vs audio × standard vs misleading premises); used to measure conflict detection independently of standard multimodal QA.
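The 2x2 design can be made concrete by enumerating its four cells and scoring each one separately. The modality and premise labels come from the claim; the record format, the `accuracy_by_cell` helper, and the toy records are assumptions for illustration, and no clip counts per cell are implied.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, Tuple

MODALITIES = ("vision", "audio")
PREMISES = ("standard", "misleading")
CELLS = list(product(MODALITIES, PREMISES))  # the four design cells

Record = Tuple[str, str, bool]  # (modality, premise, answered correctly)


def accuracy_by_cell(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Per-cell accuracy, keeping misleading-premise items apart from standard ones."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for modality, premise, correct in records:
        totals[(modality, premise)] += 1
        hits[(modality, premise)] += int(correct)
    return {cell: hits[cell] / totals[cell] for cell in totals}


# Invented toy records, not benchmark results.
toy_records = [
    ("vision", "standard", True),
    ("vision", "standard", True),
    ("vision", "misleading", False),
    ("vision", "misleading", True),
    ("audio", "standard", True),
    ("audio", "misleading", False),
]
scores = accuracy_by_cell(toy_records)
```

Reporting the misleading-premise cells separately is what lets a gap between ordinary comprehension and conflict detection show up, rather than being averaged away.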
The Agent-First paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
Conceptual argument and mapping presented in the paper asserting interoperability/orthogonality with transport-layer standards (e.g., MCP).
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... compatibility_with_transport_layer_standards
Agent-First APIs improve autonomous error recovery by 5.8x (compared to optimized CRUD baselines).
Reported comparative experiments on 50 real operational tasks measuring autonomous error recovery capability.
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... autonomous_error_recovery
Agent-First APIs reduce required human interventions by 72.7% (compared to optimized CRUD baselines).
Same set of comparative experiments on 50 real operational tasks reported in the paper.
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... required_human_interventions
Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve an 88% end-to-end task success rate versus 64% for optimized CRUD baselines (a 37.5% relative improvement, or 24 percentage points).
Empirical comparative experiments reported in the paper on 50 real operational tasks, comparing Agent-First APIs to optimized CRUD baselines.
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... end-to-end_task_success_rate
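The +37.5% figure is a relative change, not a percentage-point difference. A short sketch of the arithmetic, using the two success rates stated in the claim (the function name `relative_change` is a hypothetical helper):

```python
def relative_change(baseline: float, treatment: float) -> float:
    """Relative change of treatment over baseline, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treatment - baseline) / baseline


baseline_success = 0.64     # optimized CRUD baseline, from the claim
agent_first_success = 0.88  # Agent-First APIs, from the claim

absolute_gain = agent_first_success - baseline_success  # percentage points
relative_gain = relative_change(baseline_success, agent_first_success)
```

Here the absolute gain is 24 percentage points and the relative gain is 0.24 / 0.64 = 0.375, which matches the +37.5% quoted in the claim.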
The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains.
Reported production implementation and deployment statistics (platform with 85 registered tools spanning 6 business domains).
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... deployment_of_paradigm_on_production_SaaS_platform
We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation.
Design and specification presented in the paper (proposed architecture and components).
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... proposed_API_paradigm_and_components
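The components listed in this claim can be sketched as plain data types. The six verb names and the three kinds of metadata (confidence scores, evidence chains, suggested next actions) are taken from the claim; the class names, field names, the 0-to-1 confidence range, and the `needs_escalation` rule with its threshold are assumptions for illustration and do not reproduce the paper's specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Verb(Enum):
    """The six phases named in the claim, in the order listed there."""
    SEARCH = "search"
    RESOLVE = "resolve"
    PREVIEW = "preview"
    EXECUTE = "execute"
    VERIFY = "verify"
    RECOVER = "recover"


@dataclass
class NormalizedToolContract:
    """Hypothetical response envelope carrying decision-support metadata."""
    verb: Verb
    confidence: float  # assumed to lie in [0, 1]
    evidence_chain: List[str] = field(default_factory=list)
    suggested_next_actions: List[Verb] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def needs_escalation(contract: NormalizedToolContract, threshold: float = 0.7) -> bool:
    """Toy stand-in for dynamic risk escalation: flag low-confidence executes."""
    return contract.verb is Verb.EXECUTE and contract.confidence < threshold


preview = NormalizedToolContract(
    verb=Verb.PREVIEW,
    confidence=0.92,
    evidence_chain=["matched record by exact ID"],
    suggested_next_actions=[Verb.EXECUTE],
)
risky = NormalizedToolContract(verb=Verb.EXECUTE, confidence=0.4)
```

Returning suggested next actions as verbs, rather than free text, is one way an agent could choose its next step (for example moving from preview to execute, or from a failed verify to recover) without extra parsing.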
LLM-assisted solutions were more correct and functional than participant-generated solutions.
Comparative analysis of generated solutions reported in the paper (no sample size for the solutions is explicitly stated in the abstract). The paper states that LLM-assisted solutions were more correct and functional.
high positive "Like Taking the Path of Least Resistance": Exploring the Im... correctness and functionality of generated code
Qualitative analysis of participants' interactions and interviews revealed four different human-LLM collaboration modes supporting various problem-solving strategies.
Qualitative analysis of interaction logs and retrospective interviews from the study participants (N=20) reported in the paper; identification of four collaboration modes described.
high positive "Like Taking the Path of Least Resistance": Exploring the Im... types of collaboration modes
We conducted a within-subject study followed by retrospective interviews with programmers (N=20).
Stated methods in the paper: within-subject experimental design plus retrospective interviews; sample size explicitly given as N=20.
AwareLLM opens new avenues for Human-AI collaboration where technology adapts to users' needs rather than users adhering to technological constraints.
Authorial/conceptual claim based on the proposed framework and study results; presented as a broader implication rather than a direct empirical finding.
high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... human-AI collaboration potential
Participants described AwareLLM's personalized interventions as timely and relevant, helping them boost their confidence and deepen engagement with their work.
Qualitative user feedback reported in the study (participant descriptions); sample size 20. No coding details or counts provided in the abstract.
high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... confidence and engagement (subjective reports)
AwareLLM reduced mental demand for participants.
Reported results from the user study (comparison to a standard LLM assistant) with 20 participants; abstract reports reductions but gives no quantitative metrics.
AwareLLM led to reductions in cognitive fatigue.
Reported results from the user study comparing AwareLLM to a standard LLM assistant; sample size 20. No quantitative values provided in the abstract.
Compared to a standard LLM assistant, AwareLLM produced statistically significant improvements in task performance.
Results reported from the user study (comparison vs. standard LLM); sample size noted as 20 participants. No numerical effect size provided in the abstract.
AwareLLM dynamically adapts to users' psychophysiological states while analyzing temporal patterns and behavioral tendencies to provide personalized and timely interventions.
Design and claimed operational behavior of the proposed framework as described by authors.
high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... personalization/adaptivity of interventions
We introduce AwareLLM, a multimodal framework that integrates egocentric vision, pupillometry, eye-gaze tracking, posture detection, heart activity, and large language models to create a proactive and context-aware ecosystem.
System/methods description in paper (architecture/design claim).
high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... system capability to combine multimodal signals
Information workers' productivity is significantly influenced by their cognitive states and physiological responses.
Background statement in paper (literature-motivated claim); no study data provided within the abstract to support it.