The Commonplace

Evidence (2954 claims)

Adoption
5126 claims
Productivity
4409 claims
Governance
4049 claims
Human-AI Collaboration
2954 claims
Labor Markets
2432 claims
Org Design
2273 claims
Innovation
2215 claims
Skills & Training
1902 claims
Inequality
1286 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 369 105 58 432 972
Governance & Regulation 365 171 113 54 713
Research Productivity 229 95 33 294 655
Organizational Efficiency 354 82 58 34 531
Technology Adoption Rate 277 115 63 27 486
Firm Productivity 273 33 68 10 389
AI Safety & Ethics 112 177 43 24 358
Output Quality 228 61 23 25 337
Market Structure 105 118 81 14 323
Decision Quality 154 68 33 17 275
Employment Level 68 32 74 8 184
Fiscal & Macroeconomic 74 52 32 21 183
Skill Acquisition 85 31 38 9 163
Firm Revenue 96 30 22 148
Innovation Output 100 11 20 11 143
Consumer Welfare 66 29 35 7 137
Regulatory Compliance 51 61 13 3 128
Inequality Measures 24 66 31 4 125
Task Allocation 64 6 28 6 104
Error Rate 42 47 6 95
Training Effectiveness 55 12 10 16 93
Worker Satisfaction 42 32 11 6 91
Task Completion Time 71 5 3 1 80
Wages & Compensation 38 13 19 4 74
Team Performance 41 8 15 7 72
Hiring & Recruitment 39 4 6 3 52
Automation Exposure 17 15 9 5 46
Job Displacement 5 28 12 45
Social Protection 18 8 6 1 33
Developer Productivity 25 1 2 1 29
Worker Turnover 10 12 3 25
Creative Output 15 5 3 1 24
Skill Obsolescence 3 18 2 23
Labor Share of Income 7 4 9 20
Active filter: Human-AI Collaboration
Returns to AI are heterogeneous across firms; estimating treatment effects requires attention to selection, complementarities, and dynamic adoption pipelines.
Methodological argument referencing treatment-effect literature and observed firm heterogeneity; supported by conceptual examples rather than a single empirical treatment-effect estimate.
high neutral Modern Management in the Age of Artificial Intelligence: Str... heterogeneity in returns to AI adoption (firm-level productivity or performance ...
Methods combine targeted literature synthesis, comparative conceptual analysis, and framework building (with recent scholarly and institutional sources reviewed).
Explicit methodological statement in the paper describing the review and analytic approach; no primary-data methods used.
high null result Behavioral Factors as Determinants of Successful Scaling of ... methodological approach (literature synthesis and conceptual framework developme...
AI coding assistants are a high-visibility class of corporate AI and are given special attention as an illustrative case in the paper.
Paper specifically calls out AI coding assistants as a focal example in the conceptual analysis and discussion; based on literature review rather than original measurement.
high null result Behavioral Factors as Determinants of Successful Scaling of ... role of coding assistants as illustrative case for scaling and behavioral dynami...
The Article translates these insights into risk-sensitive guideposts for modernizing governance of AI-enabled tools and emerging modalities, from agentic systems to blockchain-deployed smart contracts.
Prescriptive/conceptual policy guidance presented in the Article (normative recommendations; governance framework).
high null result Rewired: Reconceptualizing Legal Services for the AI Age provision of governance guideposts for AI-enabled legal technologies
The Innovation Frontier traces LegalTech’s evolution from 2000s-vintage e-discovery to generative AI.
Historical/chronological analysis in the Article (literature review/history of LegalTech provided by authors).
high null result Rewired: Reconceptualizing Legal Services for the AI Age narrative/historical scope of LegalTech evolution covered in the Article
The Legal Services Value Chain disaggregates the lifecycle of a legal matter into five distinct nodes of activity.
Model description in the Article (conceptual architecture; decomposition of legal work).
high null result Rewired: Reconceptualizing Legal Services for the AI Age number and structure of nodes in the proposed value-chain model
The Article develops two core organizing models: the Legal Services Value Chain and the Innovation Frontier.
Explicit claim in the Article describing conceptual/model contributions (theoretical/model-building).
high null result Rewired: Reconceptualizing Legal Services for the AI Age presence of two organizing conceptual models in the Article
This Article provides a practical framework for navigating the shifting terrain of legal innovation and AI.
Statement of purpose in the Article (conceptual contribution; framework development). No empirical validation reported in the excerpt.
high null result Rewired: Reconceptualizing Legal Services for the AI Age existence of a practical framework for legal-AI governance and strategy
AI transparency alone did not significantly increase data-sharing.
Result reported from the randomized experiment (N=240) comparing actual data-sharing rates across human, white-box AI, and black-box AI conditions; authors state that transparency alone did not produce a significant increase in sharing.
high null result Understanding Data-Sharing with AI Systems: The Roles of Tra... actual data-sharing (behavioral sharing decisions)
The SRL did not generate designs with significantly better performance than RWL, even though it explored a different region of the design space.
Empirical comparison on the battery pack design task showing no significant performance improvement of SRL over RWL despite differing exploration; exact statistical tests, p-values, and sample sizes are not provided in the excerpt.
high null result Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regul... design performance (SRL vs RWL)
Three interlocking threads characterize AI for science: (1) AI as research instrument, (2) AI for research infrastructure, and (3) the reshaping of scholarly profiles and incentives by machine-readable metrics.
Conceptual framework presented in the paper; organization of topics rather than empirical measurement. The paper indicates these threads are followed through historical and contemporary examples.
high null result A Brief History of AI for Scientific Discovery: Open Researc... conceptual decomposition of AI-for-science developments
The history of artificial intelligence for scientific discovery is not a two-year story about chatbots learning to write papers; it is a sixty-year story beginning with DENDRAL (1965).
Historical narrative / literature review citing early systems such as DENDRAL (1965) and subsequent developments in scholarly infrastructure (arXiv, Google Scholar, ORCID). No empirical sample or statistical test reported.
high null result A Brief History of AI for Scientific Discovery: Open Researc... historical scope and timeline of AI for scientific discovery
Four control mechanisms emerged from the review: GPS tracking (panoptic surveillance), rating systems (emotional labour demands), dynamic pricing (income volatility), and automated sanctions (deactivation fear).
Thematic synthesis across the 48 reviewed studies identifying recurring algorithmic control mechanisms.
high null result Algorithmic Control and Psychological Risk in Digitally Mana... presence/identification of algorithmic control mechanisms
Thematic synthesis integrated Job Demand-Control Model, Conservation of Resources Theory, and Algorithmic Management Theory to develop an integrated multilevel theoretical framework.
Authors' stated method: thematic synthesis combining those three theoretical frameworks across the reviewed literature (48 studies).
high null result Algorithmic Control and Psychological Risk in Digitally Mana... theoretical integration
PRISMA-guided systematic integrative review of 48 peer-reviewed studies (2016-2025) sourced from 4,812 initial records (Scopus, Web of Science, PubMed).
Methods statement in the paper: PRISMA-guided systematic integrative review; search across Scopus, Web of Science, PubMed; initial yield 4,812 records; final included studies = 48.
high null result Algorithmic Control and Psychological Risk in Digitally Mana... number of studies and records screened/included
Both the positive (approach) and negative (avoidance) AI job crafting pathways failed to significantly affect life satisfaction, indicating domain specificity of AI-related psychological mechanisms.
Analysis of the same multi-source, multi-wave dataset of 287 employee–leader dyads; tests of effects on life satisfaction showed non-significant results for both pathways.
The frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately.
Analytic results reported in the study comparing model propensity (how often manipulative outputs are produced) with measures of success (induced belief/behavior changes), finding inconsistent or weak association.
high null result Evaluating Language Models for Harmful Manipulation association between model propensity (frequency of manipulative outputs) and man...
We ran a behavioral experiment (N = 200) in which participants predicted the AI's correctness across four AI calibration conditions: standard, overconfidence, underconfidence, and a counterintuitive "reverse confidence" mapping.
Reported experimental design and sample size in the paper (behavioral experiment with N = 200; four experimental conditions).
high null result Learning to Trust: How Humans Mentally Recalibrate AI Confid... experimental conditions / task setup (participants predicting AI correctness)
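The four calibration conditions can be pictured as different mappings from the AI's true probability of being correct to the confidence score shown to participants. The specific mappings below (including the ±0.2 shifts) are illustrative assumptions, not taken from the paper; only the four condition names come from the abstract.

```python
def displayed_confidence(p_correct: float, condition: str) -> float:
    """Map the AI's true probability of being correct to a displayed
    confidence score, under one hypothetical parameterization of the
    four conditions named in the study."""
    if condition == "standard":        # well calibrated: show the truth
        return p_correct
    if condition == "overconfidence":  # inflate toward certainty
        return min(1.0, p_correct + 0.2)
    if condition == "underconfidence": # deflate toward uncertainty
        return max(0.0, p_correct - 0.2)
    if condition == "reverse":         # counterintuitive inverse mapping
        return 1.0 - p_correct
    raise ValueError(f"unknown condition: {condition}")
```

Under the "reverse" mapping, an AI that is right 80% of the time displays 20% confidence, which is what makes recalibration by participants both necessary and measurable.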
This yields a common scale (bits of usable information) for comparing a wide range of interventions, contexts, and models.
Theoretical implication of the authors' formalization combining Bayesian persuasion and V-usable information (paper argues for a common information scale measured in bits).
high null result Mecha-nudges for Machines bits of usable information as a comparability metric
To formalize mecha-nudges, we combine the Bayesian persuasion framework with V-usable information, a generalization of Shannon information that is observer-relative.
Methodological/theoretical development described in the paper (formal combination of two theoretical frameworks).
high null result Mecha-nudges for Machines formal representation of information available to observers/agents
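For background, the standard V-usable-information definitions (Xu et al., 2020), which the paper builds on, can be sketched as follows; this is the general formulation, not the paper's specific combination with Bayesian persuasion. With logs taken base 2, the resulting quantity is measured in bits, which is what yields the common scale claimed above.

```latex
% V-entropy and conditional V-entropy, for a predictive family \mathcal{V}:
H_{\mathcal{V}}(Y \mid \varnothing)
  = \inf_{f \in \mathcal{V}} \mathbb{E}\!\left[-\log_2 f[\varnothing](Y)\right],
\qquad
H_{\mathcal{V}}(Y \mid X)
  = \inf_{f \in \mathcal{V}} \mathbb{E}\!\left[-\log_2 f[X](Y)\right].

% V-usable information: the reduction in V-entropy from observing X.
I_{\mathcal{V}}(X \to Y)
  = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X)
```

The observer-relativity enters through the family \mathcal{V}: a bounded observer (a particular AI agent, say) may extract fewer usable bits from the same signal X than an unconstrained Bayesian observer.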
We introduce mecha-nudges: changes to how choices are presented that systematically influence AI agents without degrading the decision environment for humans.
Conceptual/definitional contribution made in the paper (novel concept introduced by authors).
high null result Mecha-nudges for Machines influence on AI agents while preserving human decision environment
Nudges are subtle changes to the way choices are presented to human decision-makers (e.g., opt-in vs. opt-out by default) that shift behavior without restricting options or changing incentives.
Background/definition stated in the paper (conceptual; references to standard behavioral-economics definition of nudges).
high null result Mecha-nudges for Machines behavioral response to choice presentation
The visualization avoided redistributing value.
Reported result from the within-subjects experiment (N=32) stating that the visualization did not redistribute value between parties (i.e., it improved outcomes/efficiency without changing value split).
high null result From Overload to Convergence: Supporting Multi-Issue Human-A... distribution of value between negotiating parties (value split / surplus allocat...
Human-like presentations did not raise conformity pressure.
Reported experimental result: manipulation of presentation style (human-like vs. not) with measurement of conformity pressure; the abstract states that human-like presentation increased perceived usefulness and agency without increasing conformity pressure. No quantitative details are provided in the abstract.
Larger panels yielded no gains in accuracy relative to a single AI.
Reported experimental comparison manipulating panel size in the study (three tasks). The abstract states that larger panels did not produce accuracy gains versus a single AI. (No sample size or numerical effect reported in abstract.)
We evaluate our approach on spapi, a production in-vehicle API system at Volvo Group involving 192 endpoints, 420 properties, and 776 CAN signals across six functional domains.
Case study / evaluation dataset description (explicit counts provided in paper).
high null result LLM-Powered Workflow Optimization for Multidisciplinary Soft... evaluation dataset scale and scope (endpoints, properties, CAN signals, domains)
The analysis relies on partial least squares path modeling (PLS-PM) to test eight predictions linking technological perceptions, organizational factors, and adoption outcomes.
Author-stated analytical method: PLS-PM; eight predictions tested; uses the survey data described above.
high null result Artificial Intelligence Adoption in Talent Acquisition: Effe... analytical approach / hypothesis testing
The study uses cross-sectional survey data from 523 human resource professionals and hiring managers representing 184 organizations across multiple industries in the United States.
Author-stated sample description in the paper: cross-sectional survey; 523 HR professionals/hiring managers; 184 organizations; multiple industries; U.S.
high null result Artificial Intelligence Adoption in Talent Acquisition: Effe... sample composition / data source
Each task is evaluated under three agent configurations (no-skills, LLM-generated skills, and human-expert skills) and validated through real hardware execution.
Experimental design described in the paper specifying three agent configurations per task and hardware validation of task runs.
high null result Skilled AI Agents for Embedded and IoT Systems Development evaluation configuration and validation modality
IoT-SkillsBench spans three representative embedded platforms, 23 peripherals, and 42 tasks across three difficulty levels.
Benchmark composition statistics reported in the paper (counts of platforms, peripherals, tasks, and difficulty levels).
high null result Skilled AI Agents for Embedded and IoT Systems Development benchmark scope (platforms, peripherals, tasks, difficulty levels)
We introduce a skills-based agentic framework for HIL embedded development together with IoT-SkillsBench, a benchmark designed to systematically evaluate AI agents in real embedded programming environments.
Methodological contribution described in the paper (introduction of framework and benchmark; the paper reports design and implementation).
high null result Skilled AI Agents for Embedded and IoT Systems Development availability of a skills-based agentic framework and benchmark
The cooperative video game KeyWe, with a scripted agent, served as a valid testbed for studying human-agent teamwork and the effects of the training intervention.
Methodological choice: KeyWe was used as the experimental environment and the agent behavior was scripted for consistency; all behavioral and performance measures were collected within this setting.
high null result Teaming Up With an AI Agent: Training Humans to Develop Huma... experimental_testbed_description
Half of the participants received the teamwork training and half did not (between-subjects comparison).
Experimental design description: participants were split into trained and untrained groups (50/50).
high null result Teaming Up With an AI Agent: Training Humans to Develop Huma... experimental_assignment (trained vs. untrained)
The study observes five delivery configurations: a traditional baseline and four successive platform versions (V1–V4).
Study design described by the authors; outcomes measured across these five configurations for the three programs.
high null result Orchestrating Human-AI Software Delivery: A Retrospective Lo... delivery configuration variations (baseline, V1–V4)
The study covers three real software modernization programs: a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC).
Study design / sample description provided by the authors in the paper's methods section.
high null result Orchestrating Human-AI Software Delivery: A Retrospective Lo... study programs and codebase sizes (lines of code)
Evidence on AI in software engineering still leans heavily toward individual task completion, while evidence on team-level delivery remains scarce.
Paper's literature-context statement (intro); asserted by the authors as motivation for the study (no primary data supporting this meta-claim provided within the study).
high null result Orchestrating Human-AI Software Delivery: A Retrospective Lo... distribution of prior evidence (individual task vs team-level delivery) in the l...
The model yields two limits on the speed of learning and adoption: a structural limit determined by prerequisite reachability and an epistemic limit determined by uncertainty about the target.
Theoretical result stated in the paper (model-derived identification of two distinct limiting factors on learning speed).
high null result A Mathematical Theory of Understanding speed of learning / adoption
Teaching is modeled as sequential communication with a latent target.
Modeling assumption explicitly stated in the paper (formalization of teaching in the theoretical framework).
high null result A Mathematical Theory of Understanding model specification (teaching process)
The paper models the learner as a mind: an abstract learning system characterized by a prerequisite structure over concepts.
Modeling assumption explicitly stated in the paper (definition of the 'mind' in the theoretical model).
high null result A Mathematical Theory of Understanding model specification (representation of learner)
The findings provide evidence against concerns that AI mediation undermines people's ability to distinguish truth from lies.
Synthesis of experimental results showing unchanged lie-detection accuracy despite declines in perceived trust/confidence.
high null result Through the Looking-Glass: AI-Mediated Video Communication R... ability to distinguish truth from lies (lie-detection accuracy)
Participants were no more inclined to suspect those using AI tools of lying.
Experimental comparisons assessing participants' propensity to suspect AI-mediated speakers of deception showed no increase in suspicion for users of AI tools.
high null result Through the Looking-Glass: AI-Mediated Video Communication R... inclination to suspect AI-mediated speakers of lying
Participants' actual judgment accuracy (ability to detect lies) remained unchanged across AI-mediated and non-AI-mediated videos.
Primary experimental result comparing lie-detection accuracy (truthful vs deceptive statements) across the three AI mediation conditions in the preregistered experiments (N = 2,000).
high null result Through the Looking-Glass: AI-Mediated Video Communication R... judgment accuracy (lie-detection accuracy)
We conducted two preregistered online experiments (N = 2,000).
Methods statement in the paper: two preregistered online experiments with a combined sample size of 2,000 participants.
high null result Through the Looking-Glass: AI-Mediated Video Communication R... study design / sample size (methodological claim)
This Article presents the results of an experiment in which a transcript of a hypothetical client interview involving potential disability discrimination, retaliation, and wrongful termination claims was submitted to each AI system, with prompts requesting identification and assessment of viable legal theories.
Methodological description of the experiment: one hypothetical client interview transcript fed to each of four AI engines with prompts to identify and assess legal theories.
high null result Robot Wingman: Using AI to Assess an Employment Termination experimental procedure (input and prompts)
The experiment compared three prompt conditions: (A) simple prompts, (B) raw PPS JSON, and (C) natural-language-rendered PPS.
Method description of the three prompt conditions used in the controlled experiment.
The study used three specific LLMs: DeepSeek-V3, Qwen-Max, and Kimi.
Method section listing the three models evaluated in the experiment.
We ran a controlled three-condition study across 60 tasks in three domains (business, technical, and travel), three large language models (DeepSeek-V3, Qwen-Max, and Kimi), and three prompt conditions, collecting 540 AI-generated outputs evaluated by an LLM judge.
Authors report an experimental study design: 60 tasks × 3 models × 3 prompt conditions = 540 outputs, with outputs evaluated by an LLM judge (methodological description in the paper).
high null result Evaluating 5W3H Structured Prompting for Intent Alignment in... experimental_data_collection (AI outputs evaluated by LLM judge)
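The reported design is a full factorial crossing, which accounts for the 540 outputs: 60 tasks × 3 models × 3 prompt conditions. A minimal sketch of that crossing (task names are placeholders; model names and condition labels are from the paper):

```python
from itertools import product

# 60 tasks, reported as spanning three domains (business, technical, travel).
tasks = [f"task_{i:02d}" for i in range(60)]
models = ["DeepSeek-V3", "Qwen-Max", "Kimi"]
prompt_conditions = ["simple", "raw_pps_json", "nl_rendered_pps"]

# One AI-generated output per cell of the full factorial design.
cells = list(product(tasks, models, prompt_conditions))
print(len(cells))  # 540
```

Each of the 540 cells is then scored by an LLM judge rather than by human raters, per the methods description.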
Existing financial question answering benchmarks primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals.
Literature/background claim made in the paper motivating the new benchmark; authors contrast prior benchmarks' focus on balance sheet data with the lack of market/trading-signal evaluation.
high null result FinTradeBench: A Financial Reasoning Benchmark for LLMs scope of existing financial QA benchmarks (focus on balance sheet data vs. tradi...
Retrieval provides limited benefit for trading-signal reasoning.
Experimental comparison reported in the paper showing that retrieval-augmentation had little impact on performance for trading-signal-focused questions.
high null result FinTradeBench: A Financial Reasoning Benchmark for LLMs change in performance on trading-signal-focused questions with retrieval
To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment.
Methodological claim in the paper describing the QA and annotation pipeline; the paper reports using these components as part of their reliability framework.
high null result FinTradeBench: A Financial Reasoning Benchmark for LLMs benchmark annotation and validation procedure