The Commonplace

Evidence (14055 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance).
Controlled experiment reported in the paper's abstract: 600 runs across five named industries.
high neutral Ontology-Constrained Neural Reasoning in Enterprise Agentic ... experimental performance of ontology-coupled vs ungrounded agents across industr...
The paper addresses three institutional audiences: enterprise finance and operations teams; government and regulatory bodies developing AI labor displacement frameworks; and financial markets requiring a machine labor index as a long-duration economic signal.
Stated intended audiences in the paper (descriptive statement).
high neutral HEWU: A Standardized Framework for Measuring Machine-Generat... intended institutional audiences
BCR is a minimalist, single-stage training paradigm that trains the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy.
Methodological description in the paper of the training procedure and objective (single-stage, per-instance accuracy reward, N-problem batching in a shared context).
high neutral Batched Contextual Reinforcement: A Task-Scaling Law for Eff... training paradigm characteristics (simplicity, stage count, reward structure)
The framework is calibrated with O*NET task data, a survey of 3,778 domain experts, and GPT-4o-derived task decompositions, and implemented in computer vision.
Calibration and empirical implementation using O*NET, a domain expert survey (n=3,778), and GPT-4o task decompositions; applied to computer vision tasks.
high neutral Economics of Human and AI Collaboration: When is Partial Aut... validity of calibration / empirical grounding of the framework
We introduce an entropy-based measure of task complexity that maps model accuracy into a labor substitution ratio, quantifying human labor displacement at each accuracy level.
New metric proposed in the paper (entropy-based task complexity) and mapping procedure from accuracy to substitution ratio; implemented in the framework.
high neutral Economics of Human and AI Collaboration: When is Partial Aut... labor substitution ratio (human labor displaced per unit accuracy)
Costinot and Werning (2023) develop a sufficient-statistic approach and find optimal technology taxes of 1–3.7% on robots.
Citation reported in the paper summarizing Costinot and Werning (2023)'s quantitative sufficient-statistic estimate.
high neutral NBER WORKING PAPER SERIES optimal robot tax rate
Guerreiro et al. (2022) characterize the optimal Mirrleesian tax system with automation and find that robot taxes should be transitional: high when incumbent workers cannot retrain, converging to zero as new cohorts adjust skill investments.
Citation reported in the paper summarizing Guerreiro et al. (2022)'s theoretical result on transitional robot taxes.
high neutral NBER WORKING PAPER SERIES optimal robot tax path over time
If labor becomes economically redundant, the policy focus shifts from steering innovation to redesigning public finance and redistribution (e.g., new tax instruments, redistribution mechanisms).
Theoretical scenario analysis in the paper with references to related works (Korinek and Juelfs 2024; Korinek and Lockwood 2026).
high neutral NBER WORKING PAPER SERIES policy priority shift (steering -> public finance/redistribution)
Evaluation is carried out under three frozen context configurations (diff only: config_A; diff with file content: config_B; full context: config_C) enabling systematic ablation of context provision strategies.
Methodological description: three fixed context configurations defined and used for ablation experiments.
high neutral SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... effect of context-provision design on model performance
Traffic performance is evaluated using the Fundamental Diagram (FD) under varying driver heterogeneity, heterogeneous time-gap penetration levels, and different shares of RL-controlled vehicles.
Description of experimental/evaluation setup in the paper: macroscopic evaluation via Fundamental Diagram across varied scenario parameters. No numeric sample size provided in the claim text.
high neutral Macroscopic Characteristics of Mixed Traffic Flow with Deep ... traffic performance (via Fundamental Diagram) under varied heterogeneity and RL ...
CriQ is a sister app to Dream11, India's largest fantasy sports platform with over 250 million users.
Descriptive statement in the paper providing context about the application domain and user base.
We performed an extensive evaluation of 37 state-of-the-art Vision-Language Models on MultihopSpatial.
Empirical evaluation described in the paper listing the number of models evaluated (37).
high neutral MultihopSpatial: Multi-hop Compositional Spatial Reasoning B... benchmark coverage across models evaluated
We critically compare LLM-generated rulings against 10,000 real-world court judgments from China Judgments Online (CJOL).
Dataset statement: the paper compares model outputs to a corpus of 10,000 CJOL labor dispute judgments.
high neutral LLM Safety in Judicial AI: A Stress Test of Social Media Inf... agreement / deviation between LLM-generated rulings and CJOL judgments
We introduce a novel stress test that evaluates LLM-generated labor dispute outcomes by injecting social media sentiment as an external pressure.
Methodological description in the paper: a designed stress test where social media sentiment is used to perturb LLM outputs for labor dispute cases.
high neutral LLM Safety in Judicial AI: A Stress Test of Social Media Inf... sensitivity of LLM-generated labor dispute outcomes to injected social media sen...
The paper treats data as a new type of production factor and endogenizes it within the production function.
Theoretical/methodological: the paper constructs a macro-level theoretical model that explicitly includes data as an endogenous input in the production function (no empirical/sample data).
high neutral Study on the impact of big data sharing on individuals’ welf... inclusion of data as a production factor (model specification)
In the near term, the most plausible equilibrium is bounded autonomy, in which AI agents operate as supervised co-pilots, monitoring systems, and constrained execution modules embedded within human decision processes.
Theoretical argument and forward-looking assessment by the authors based on the proposed framework and plausibility considerations; not presented as the result of a causal empirical study in the excerpt.
high neutral AI Agents in Financial Markets: Architecture, Applications, ... expected equilibrium mode of AI agent autonomy in finance (bounded autonomy / su...
Economic evaluations of GLAI should account for end-to-end risk externalities (error propagation, institutional trust, rights impacts), not only short-term productivity gains.
Methodological recommendation grounded in conceptual synthesis of technical, behavioral, and legal risks; normative argument rather than empirical result.
high neutral Why Avoid Generative Legal AI Systems? Hallucination, Overre... comprehensiveness of economic evaluations (inclusion of externalities vs. narrow...
Generative Legal AI (GLAI) systems are built on token-prediction (LLM) architectures rather than formal legal-reasoning architectures.
Conceptual and technical analysis in the paper distinguishing GLAI from other legal-tech; literature synthesis on common LLM architectures. No original empirical dataset or sample size—qualitative/technical review.
high neutral Why Avoid Generative Legal AI Systems? Hallucination, Overre... underlying model architecture type (token-prediction vs. formal-reasoning)
The paper's formalism shows that prompt/system messages shape distributions over possible execution paths (indirect control) but do not evaluate actual partial paths at runtime.
Formal mapping in the paper that treats prompts as shaping prior over paths; conceptual argument and illustrative examples.
high neutral Runtime Governance for AI Agents: Policies on Paths degree of control over execution path (distributional shaping vs. path-specific ...
Through a thematic review of existing research, the authors identified recurring themes about incentive schemes: their components, how researchers manipulate them, and their impact on research outcomes.
Authors' stated method and findings: thematic review (the scope/number of reviewed papers not specified in excerpt).
high neutral Incentive-Tuning: Understanding and Designing Incentives for... themes in incentive design practices and reported impacts on empirical study out...
A critical aspect of conducting human–AI decision-making studies is the role of participants, often recruited through crowdsourcing platforms.
Claim based on the authors' thematic literature review noting participant sourcing practices (specific studies and counts not given in excerpt).
high neutral Incentive-Tuning: Understanding and Designing Incentives for... participant recruitment source (e.g., crowdsourcing) and its influence on study ...
Researchers conduct empirical studies investigating how humans use AI assistance for decision-making and how this collaboration impacts results.
Statement summarizing the research landscape; supported implicitly by the authors' thematic review of existing empirical studies (number of studies not specified in excerpt).
high neutral Incentive-Tuning: Understanding and Designing Incentives for... human behaviour and decision outcomes when assisted by AI (empirical study outco...
The study provides empirical evidence specific to a small open EU economy (Slovakia) on the relationship between AI adoption and labour productivity.
Use of harmonised Eurostat enterprise and productivity data for Slovakia and EU27 over 2021–2024, analysed with descriptive statistics, gap analysis, dynamics of change, correlation, and an illustrative regression model.
high neutral Artificial Intelligence Adoption and Labour Productivity in ... Empirical characterization of AI adoption and labour productivity relationship f...
Returns to AI are heterogeneous across firms; estimating treatment effects requires attention to selection, complementarities, and dynamic adoption pipelines.
Methodological argument referencing treatment-effect literature and observed firm heterogeneity; supported by conceptual examples rather than a single empirical treatment-effect estimate.
high neutral Modern Management in the Age of Artificial Intelligence: Str... heterogeneity in returns to AI adoption (firm-level productivity or performance ...
Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
Claim of artifact availability hosted on GitHub (URL provided) as part of the paper's resources.
high null result Collaborative Human-Agent Protocol (CHAP) availability of specification and accompanying artifacts
Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability.
Factual claim referencing existing standards (MCP and A2A) and their scopes; no citations or supporting documentation included in the provided excerpt.
high null result Collaborative Human-Agent Protocol (CHAP) scope of existing protocol standards
Production deployments are no longer one human supervising one model; they are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries.
Stated as a general characterization of modern production deployments; no quantitative data or case counts provided in the excerpt.
high null result Collaborative Human-Agent Protocol (CHAP) structure of production deployments (multi-human, multi-agent)
The six middle macros form a low-contrast band between the poles; equivalence testing (TOST at d = 0.2) admits only 1 out of 15 macro-pair comparisons as equivalent.
Authors' analysis of pairwise macro comparisons using Two One-Sided Tests (TOST) for equivalence at Cohen's d = 0.2.
high null result Stable Geometry, Reversing Poles: The Bipolar Structure of A... pairwise equivalence among middle macros (TOST results)
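As a rough illustration of the equivalence test named in the entry above, the sketch below runs two one-sided tests (TOST) with bounds of plus or minus 0.2 pooled standard deviations. It assumes a large-sample normal approximation and invented data; the function name `tost_equivalent` and both samples are hypothetical, and the paper's exact procedure is not given in the excerpt. Six macros yield 15 unordered pairs, which matches the count in the claim.

```python
from math import comb, sqrt
from statistics import NormalDist, mean, stdev


def tost_equivalent(x, y, d_bound=0.2, alpha=0.05):
    """Return True if x and y are equivalent within +/- d_bound pooled SDs.

    Assumption: large-sample normal approximation rather than a t-test.
    """
    nx, ny = len(x), len(y)
    pooled_sd = sqrt(
        ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    )
    diff = mean(x) - mean(y)
    se = pooled_sd * sqrt(1 / nx + 1 / ny)
    bound = d_bound * pooled_sd  # Cohen's d bound expressed in raw units
    z = NormalDist()
    p_lower = 1 - z.cdf((diff + bound) / se)  # tests H0: diff <= -bound
    p_upper = z.cdf((diff - bound) / se)      # tests H0: diff >= +bound
    # Equivalence is declared only if both one-sided nulls are rejected.
    return max(p_lower, p_upper) < alpha


# Number of unordered pairs among six macros.
n_pairs = comb(6, 2)

# Invented, deterministic samples (not data from the paper).
same = [i % 10 for i in range(2000)]
shifted = [v + 3 for v in same]

same_vs_same = tost_equivalent(same, same)
same_vs_shifted = tost_equivalent(same, shifted)
```

Failing to find a significant difference is not the same as demonstrating equivalence, which is why a TOST result of 1 in 15 says the middle macros mostly cannot be shown to be interchangeable.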
We decomposed 1,961 O*NET Detailed Work Activities (DWAs) into 15,817 micro-actions using a multi-agent LLM pipeline with 31-expert human-in-the-loop (HITL) calibration.
Empirical method reported by the authors: automated multi-agent LLM pipeline plus 31-expert HITL calibration producing the stated counts (1,961 DWAs -> 15,817 micro-actions).
high null result Stable Geometry, Reversing Poles: The Bipolar Structure of A... task decomposition (DWAs to micro-actions)
Empirical research since Frey and Osborne (2017) has converged on a continuous-gradient representation in which each occupation is assigned a real-valued exposure score on [0,1] obtained by linear aggregation across capability dimensions.
Literature synthesis / statement in the paper referencing Frey and Osborne (2017) and subsequent empirical work using continuous exposure scores.
high null result Stable Geometry, Reversing Poles: The Bipolar Structure of A... use of continuous-gradient occupational exposure scores (OAI-style representatio...
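The continuous-gradient representation described above can be sketched as a weighted average. This is a minimal illustration, assuming capability scores already scaled to [0,1] and non-negative weights; the dimension names, weights, and the function `exposure_score` are hypothetical and do not come from the paper or from Frey and Osborne (2017).

```python
def exposure_score(capabilities, weights):
    """Linear aggregation of per-dimension scores into one value in [0, 1]."""
    for name, value in capabilities.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalising by the total weight keeps the result inside [0, 1].
    return sum(weights[k] * capabilities[k] for k in weights) / total_weight


# Hypothetical occupation scored on three made-up capability dimensions.
weights = {"language": 0.5, "perception": 0.3, "motor": 0.2}
capabilities = {"language": 0.9, "perception": 0.4, "motor": 0.1}
score = exposure_score(capabilities, weights)
```

Because the aggregation is linear, two occupations with very different capability profiles can receive the same score, which is the kind of information loss a single scalar representation implies.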
Retrieval augmentation and scientist persona prompting yield only marginal gains.
Ablation/augmentation experiments comparing baseline LLM outputs to versions augmented with retrieval or scientist-persona prompting, showing only small improvements in judged quality.
high null result Contemporary AI lacks the imagination to diverge or negate i... change in judged quality due to retrieval augmentation or persona prompting
6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption.
Reported study participation and rating counts: 6,749 respondents providing 25,139 rating sets on specified dimensions.
high null result Contemporary AI lacks the imagination to diverge or negate i... number of respondents and rating sets
We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge follow-up ideas that large language models (LLMs) generated from the context and puzzles of their own papers.
Study recruitment described in paper: invitations sent to authors of 121,640 recent preprints across multiple fields (biology, medicine, chemistry, social sciences).
high null result Contemporary AI lacks the imagination to diverge or negate i... number of invited authors (study recruitment)
The findings provide empirical insights for managing employee wellbeing and refining human resource strategies during organizational digital transformation.
Authors' stated implications in the discussion, based on the reported empirical associations and moderation results from the survey of 411 employees.
high null result The impact of artificial intelligence application on employe... managerial implications for employee wellbeing and HR strategies
The study draws on the Conservation of Resources Theory and the Cognitive Appraisal Theory of Stress to explain how AI application influences employees' job insecurity via resource gain and resource threat mechanisms.
Theoretical framing stated in the introduction and discussion explaining the mechanisms (resource gain vs. resource threat) underlying the observed U-shaped association.
high null result The impact of artificial intelligence application on employe... theoretical explanation of mechanisms behind job insecurity
Data were collected via mixed online and offline questionnaires: 453 questionnaires were distributed (242 online, 211 offline); 449 were returned (242 online, 207 offline); following validity screening, 411 valid questionnaires were retained (219 online, 192 offline), yielding an effective response rate of 90.73%.
Reported survey administration and response counts provided in the methods section of the paper.
high null result The impact of artificial intelligence application on employe... survey response / valid sample size / response rate
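The counts in the survey entry above can be checked for internal consistency with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones quoted in the claim, and the effective response rate is taken here to mean valid questionnaires divided by distributed questionnaires, which is the reading that reproduces the reported figure.

```python
# Counts as quoted in the claim text, split by collection mode.
distributed = {"online": 242, "offline": 211}
returned = {"online": 242, "offline": 207}
valid = {"online": 219, "offline": 192}

total_distributed = sum(distributed.values())
total_returned = sum(returned.values())
total_valid = sum(valid.values())

# Effective response rate: valid questionnaires over distributed ones.
effective_rate = round(100 * total_valid / total_distributed, 2)
print(total_distributed, total_returned, total_valid, effective_rate)
```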
Devil's Advocate (DA) is an AI assistant that critiques the human's initial ideas, whereas Dialectical Inquiry (DI) provides alternatives and synthesizes a resolution.
Conceptual/definitional claim in the paper describing the operationalization of DA and DI for the experiments.
high null result Shaping The Tool Or Shaping The Mind: An Investigation Of Du... operational definition of AI-supported conflict techniques
This research empirically compares DA and DI in AI contexts.
Paper reports experimental comparison between AI behaviors implementing Devil's Advocate (DA) and Dialectical Inquiry (DI) across the studies.
high null result Shaping The Tool Or Shaping The Mind: An Investigation Of Du... comparative effects of DA vs DI on SDM outcomes
Both studies examine benefit (information elaboration) and cost (cognitive load) pathways when AI supports SDM.
Paper explicitly frames both studies to measure information elaboration as a benefit pathway and cognitive load as a cost pathway; stated measurement plan in methods.
high null result Shaping The Tool Or Shaping The Mind: An Investigation Of Du... information elaboration and cognitive load
Study 2 tests mind-shaping interventions through user strategy training.
Study design described in the paper: a second experiment (Study 2) manipulating user strategy training (mind-shaping) to evaluate effects on SDM processes and outcomes.
high null result Shaping The Tool Or Shaping The Mind: An Investigation Of Du... effects of user strategy training on information elaboration and cognitive load
Study 1 tests tool-shaping interventions by comparing three AI bot prototype conditions (Information-only, DA, DI) against a control treatment.
Study design described in the paper: randomized/controlled experiment (Study 1) with four conditions (three AI prototype conditions plus control).
high null result Shaping The Tool Or Shaping The Mind: An Investigation Of Du... effects of AI prototype conditions on information elaboration and cognitive load
The 'do no harm' property is confirmed empirically.
Abstract states empirical confirmation in simulations and applications; specifics (e.g., datasets, sample sizes) not included in abstract.
high null result AI-Assisted Variance Reduction in Randomized Experiments empirical verification that adjusted estimator does not worsen performance when ...
Including AI predictions as covariates has a 'do no harm' property: the adjusted estimator reverts to the unadjusted difference in means when predictions are uninformative.
Stated theoretical property in the paper and described as empirically confirmed in simulations and applications (per abstract).
high null result AI-Assisted Variance Reduction in Randomized Experiments bias/consistency and non-worsening of estimator when predictions uninformative
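One way to see the "do no harm" behaviour is with a simple covariate-adjusted difference in means. The sketch below is an assumption-laden illustration, not the paper's estimator: it uses a single pooled within-arm slope on the AI prediction, and the function name `adjusted_difference` and all data are invented. When the predictions have zero within-arm covariance with the outcome, the slope is zero and the adjusted estimate equals the unadjusted difference in means.

```python
from statistics import mean


def adjusted_difference(y_t, x_t, y_c, x_c):
    """Difference in means, adjusted by a pooled within-arm slope on x.

    y_* are outcomes and x_* are AI predictions for treated (t) and
    control (c) units. Assumption: one common slope across both arms.
    """
    raw = mean(y_t) - mean(y_c)
    sxy = 0.0
    sxx = 0.0
    for ys, xs in ((y_t, x_t), (y_c, x_c)):
        my, mx = mean(ys), mean(xs)
        sxy += sum((y - my) * (x - mx) for y, x in zip(ys, xs))
        sxx += sum((x - mx) ** 2 for x in xs)
    # With no variation in the predictions there is nothing to adjust for.
    beta = sxy / sxx if sxx > 0 else 0.0
    return raw - beta * (mean(x_t) - mean(x_c))


# Invented data: predictions are uncorrelated with outcomes within each arm.
y_treated, x_treated = [3, 5, 3, 5], [1, 1, 2, 2]
y_control, x_control = [1, 3, 1, 3], [2, 2, 4, 4]

unadjusted = mean(y_treated) - mean(y_control)
adjusted = adjusted_difference(y_treated, x_treated, y_control, x_control)
```

In this toy case the prediction means differ between arms, yet the adjustment term vanishes because the estimated slope is exactly zero.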
The model frames near-complete AGI substitution not merely as an efficiency transition but as a boundary case for value production under a strict political-economy theory of value.
Interpretive conclusion drawn from the theoretical model and its limiting-case implications (conceptual/theoretical claim; no empirical sample).
high null result AGI and the Limits of Value Production characterization of economic transition
Under the paper's core value-theoretic assumption, AGI transfers value but does not itself create new value.
Explicit model assumption / value-theoretic premise stated in the paper (theoretical assumption, no empirical backing).
high null result AGI and the Limits of Value Production value_creation
The paper distinguishes technical substitutability (the feasible replacement ceiling implied by AGI capability) from actual adoption (the realized replacement share chosen under cost, profitability, and adoption frictions).
Conceptual/theoretical definition introduced in the political-economy model (no empirical sample; definitional argument within the paper).
high null result AGI and the Limits of Value Production adoption_rate
Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96).
Blind-panel scoring of generated reports from agents A and B; panel size and panel methodology not specified in abstract.
high null result AI Scientists Are Only as Good as Their Evidence: A Stratifi... raw blind-panel decision-quality score
The value of an in-band cooperative deny signal (Recuse Signal) is an empirical question: it was previously unmeasured and the paper measures whether compliant LLM agents honor such a signal.
Motivation and framing in the paper; the authors position their controlled experiment as the measurement that addresses this previously unmeasured question.
high null result Will the Agent Recuse Itself? Measuring LLM-Agent Compliance... degree to which LLM agents honor an in-band cooperative deny signal
We searched seven databases (plus backward and forward citation searching) and synthesised 13 empirical studies published between 2018 and 2025.
Methods reported in abstract: PRISMA-ScR scoping review with a preregistered protocol; explicit count of included studies and publication date range.
high null result Artificial intelligence applications supporting women’s care... number of empirical studies identified and synthesized
Self-evaluated creative performance remained unchanged when using GenAI.
Same experiment with 82 participants; authors report no significant difference in self-evaluated creative performance between GenAI users and controls.
high null result When AI Sparks Less: Generative AI and the Decline of Self-P... self-evaluated creative performance