The Commonplace

Evidence (7156 claims)

Adoption: 5126 claims
Productivity: 4409 claims
Governance: 4049 claims
Human-AI Collaboration: 2954 claims
Labor Markets: 2432 claims
Org Design: 2273 claims
Innovation: 2215 claims
Skills & Training: 1902 claims
Inequality: 1286 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 369 105 58 432 972
Governance & Regulation 365 171 113 54 713
Research Productivity 229 95 33 294 655
Organizational Efficiency 354 82 58 34 531
Technology Adoption Rate 277 115 63 27 486
Firm Productivity 273 33 68 10 389
AI Safety & Ethics 112 177 43 24 358
Output Quality 228 61 23 25 337
Market Structure 105 118 81 14 323
Decision Quality 154 68 33 17 275
Employment Level 68 32 74 8 184
Fiscal & Macroeconomic 74 52 32 21 183
Skill Acquisition 85 31 38 9 163
Firm Revenue 96 30 22 148
Innovation Output 100 11 20 11 143
Consumer Welfare 66 29 35 7 137
Regulatory Compliance 51 61 13 3 128
Inequality Measures 24 66 31 4 125
Task Allocation 64 6 28 6 104
Error Rate 42 47 6 95
Training Effectiveness 55 12 10 16 93
Worker Satisfaction 42 32 11 6 91
Task Completion Time 71 5 3 1 80
Wages & Compensation 38 13 19 4 74
Team Performance 41 8 15 7 72
Hiring & Recruitment 39 4 6 3 52
Automation Exposure 17 15 9 5 46
Job Displacement 5 28 12 45
Social Protection 18 8 6 1 33
Developer Productivity 25 1 2 1 29
Worker Turnover 10 12 3 25
Creative Output 15 5 3 1 24
Skill Obsolescence 3 18 2 23
Labor Share of Income 7 4 9 20
AI appears to be a diffusing technology, not an emerging occupation.
Synthesis of empirical findings: presence of a shared vocabulary but lack of a coherent practitioner population in resume data, interpreted as diffusion of AI skills/vocabulary across existing roles.
high | negative | NLP Occupational Emergence Analysis: How Occupations Form an... | status of AI as technology diffusion versus occupation formation
Across heterogeneous learners, a common broadcast curriculum can be slower than personalized instruction by a factor linear in the number of learner types.
Theoretical comparative result in the model (analysis of broadcast vs personalized curricula across heterogeneous learner types; abstract states factor linear in number of types).
high | negative | A Mathematical Theory of Understanding | speed of instruction / time to learn under broadcast curriculum vs personalized ...
The findings provide evidence against cue-based accounts of lie detection more generally.
Authors' interpretation: because lie-detection accuracy did not decrease despite changes to visual cues (retouching, backgrounds, avatars), the results challenge theories that rely on superficial cues for lie detection.
high | negative | Through the Looking-Glass: AI-Mediated Video Communication R... | validity of cue-based accounts of lie detection
Participants' confidence in their judgments declined in AI-mediated videos, particularly when some participants used avatars while others did not.
Experimental comparisons across conditions with varying levels of AI mediation; subgroup/condition contrast highlighting larger declines in mixed-avatar settings.
high | negative | Through the Looking-Glass: AI-Mediated Video Communication R... | participants' confidence in their lie-detection judgments
Perceived trust in speakers declined in AI-mediated videos.
Experimental results from the two preregistered online experiments comparing perceived trust across varying levels of AI mediation (retouching, background replacement, avatars).
high | negative | Through the Looking-Glass: AI-Mediated Video Communication R... | perceived trust in speakers
AI-based tools that mediate, enhance or generate parts of video communication may interfere with how people evaluate trustworthiness and credibility.
Motivating claim stated in the paper's introduction/abstract; not an empirical finding but a hypothesis motivating the experiments.
high | negative | Through the Looking-Glass: AI-Mediated Video Communication R... | evaluation of trustworthiness and credibility (general)
AI adoption faces critical obstacles originating from digital illiteracy, poor Internet access, excessive application costs, and the rural-to-urban divide.
Survey findings and interview themes from the mixed-methods study (survey n=293; interviews n=12) identifying barriers to AI adoption.
Users still had concerns about how AI credit assessments and chatbots operate.
Qualitative interview data (n=12) and/or survey responses (n=293) reporting user concerns about AI credit scoring and chatbots.
high | negative | The Impact of Artificial Intelligence on Financial Inclusion... | user concerns / trust regarding AI credit assessments and chatbots
Compositional spatial reasoning remains a formidable challenge for state-of-the-art VLMs (as revealed by our evaluation).
Empirical results from the evaluation of the 37 VLMs on the MultihopSpatial benchmark showing poor performance on multi-hop/compositional queries.
high | negative | MultihopSpatial: Multi-hop Compositional Spatial Reasoning B... | performance on compositional/multi-hop spatial reasoning tasks
Existing benchmarks predominantly focus on elementary, single-hop relations and neglect multi-hop compositional spatial reasoning and precise visual grounding needed for real-world scenarios.
Literature/benchmark survey and motivation presented by the authors comparing characteristics of prior benchmarks vs. the proposed needs.
high | negative | MultihopSpatial: Multi-hop Compositional Spatial Reasoning B... | scope/complexity of spatial reasoning tasks in existing benchmarks
Adoption barriers exist, particularly for small and medium-sized enterprises and firms in emerging economies, where capability and data constraints limit impact.
Findings from the systematic review and mixed-methods assessment; the abstract references barriers observed across the reviewed studies and reports 104 studies in the systematic review.
high | negative | Artificial intelligence as a catalyst for the circular econo... | adoption barriers / limitations to AI impact (capability and data constraints)
AI can initially exacerbate distributional injustice.
Dimension-level analysis indicating negative (or initially negative) effects of AI on the distributional component of the energy justice index.
high | negative | Artificial intelligence adoption for advancing energy justic... | distributional justice component of energy justice index
There are few integrated frameworks (bridging ethics and technical controls) in the current AI governance landscape.
Result of the literature review and cluster analysis showing limited coverage of frameworks that integrate ethical principles with auditable technical controls.
high | negative | AI Governance Risk Tiering for Sustainable Digital Infrastru... | prevalence of integrated governance frameworks
Findings reveal a fragmented landscape dominated by ethics/privacy-centric and compliance/risk-focused approaches.
Synthesis of the reviewed literature and results of PCA/k-means clustering indicate thematic dominance of ethics/privacy and compliance/risk orientations across frameworks.
high | negative | AI Governance Risk Tiering for Sustainable Digital Infrastru... | dominant thematic focus of governance frameworks
Significant limitations emerged in case law citations, with most cited cases being non-existent or incorrectly referenced.
Authors' review of the case citations produced by the four AI engines for the single transcript, finding many citations were fabricated or misreferenced.
high | negative | Robot Wingman: Using AI to Assess an Employment Termination | accuracy of case law citations (error rate / hallucination rate)
These findings uncover critical threats to judicial integrity and public trust and underscore the urgent need for robust safeguards against non-legal influences in AI legal systems.
Interpretation/conclusion drawn from the empirical results (observed deviations, sentiment amplification, and subgroup vulnerabilities).
high | negative | LLM Safety in Judicial AI: A Stress Test of Social Media Inf... | potential impact on judicial integrity and public trust (qualitative/inferential...
These safety risks are compounded for emotionally charged topics.
Subgroup analyses where emotionally charged case topics showed larger deviations and stronger effects from injected sentiment.
high | negative | LLM Safety in Judicial AI: A Stress Test of Social Media Inf... | change in deviation/amplification of model outputs for emotionally charged topic...
These safety risks are compounded (stronger) for low-skilled occupational categories.
Subgroup analyses reported in the paper showing larger model deviations and/or greater sentiment amplification effects for cases involving low-skilled occupations.
high | negative | LLM Safety in Judicial AI: A Stress Test of Social Media Inf... | interaction effect: deviation/amplification magnitude by occupational skill leve...
The sentiment-induced divergences lead to unstable and often inflated compensation predictions by the models.
Analysis of model-predicted compensation amounts under sentiment perturbations showing increased variability and upward bias compared to CJOL amounts.
high | negative | LLM Safety in Judicial AI: A Stress Test of Social Media Inf... | predicted compensation amounts (inflation and instability) from LLMs versus CJOL...
Public opinion (social media sentiment) substantially amplifies deviations between LLM outputs and real rulings.
Stress-test experiments in which injected social media sentiment increased the divergence of model outputs from CJOL judgments across the sample.
high | negative | LLM Safety in Judicial AI: A Stress Test of Social Media Inf... | change in deviation between LLM outputs and CJOL rulings when social media senti...
Models exhibit inherent deviations from real rulings.
Empirical comparison of LLM outputs to CJOL judgments showing systematic differences (based on the paper's reported comparisons across the dataset).
high | negative | LLM Safety in Judicial AI: A Stress Test of Social Media Inf... | magnitude and frequency of deviations between LLM outputs and actual court judgm...
GDP growth is initially negatively affected by the ageing population.
Estimated negative association reported in panel threshold regressions using provincial panel data (31 provinces, 2000–2022); the primary specification uses an ageing measure, and the paper also tests the old-age dependency ratio.
The article argues that the idea of a “Pax Silica” is fragile.
Conclusion drawn from the paper's theoretical framework and comparative analysis; presented as an assessment rather than empirical measurement.
high | negative | The Logistics of Hegemony: Semiconductor Chokepoints, Global... | stability/fragility of a proposed techno-hegemonic order ('Pax Silica')
Contemporary struggles over semiconductor supply chains represent not a new hegemonic order but a logistical adaptation of Pax Americana.
Stated thesis supported by comparative/historical analysis and theoretical argumentation (comparative analysis of historical Pax orders and U.S. techno-security architecture); no quantitative sample size reported in abstract.
high | negative | The Logistics of Hegemony: Semiconductor Chokepoints, Global... | characterization of geopolitical order governing semiconductor supply chains
Initial adaptation challenges to AI integration were identified among employees.
Participants in semi-structured interviews (n=12) reported initial difficulties adapting to AI tools; themes relating to early adaptation challenges were coded.
high | negative | AI-AUGMENTED WORKFORCE: THE IMPACT OF ARTIFICIAL INTELLIGENC... | initial adaptation challenges to AI
Past machine learning applications to pricing have produced models that adapt slowly to real-time changes, depend heavily on historical data, and struggle to handle multi-agent scenarios.
Stated as literature/related-work critique in paper; no new empirical evidence or sample size provided in the excerpt.
high | negative | The Application of Adaptive Reinforcement Learning in Dynami... | model adaptivity to real-time changes and capability in multi-agent scenarios
Traditional methods, such as rule-based algorithms and statistical scale forecasting, struggle to adapt to rapidly changing market conditions, competitive maneuvers, and evolving consumer strategies, leading to sub-optimal pricing and decreased profitability.
Paper asserts this as background/motivation; no detailed empirical study or sample size provided in the excerpt.
high | negative | The Application of Adaptive Reinforcement Learning in Dynami... | adaptivity of pricing methods and resulting profitability (sub-optimal pricing, ...
In the short term, big data may inhibit welfare growth.
Theoretical comparative-static/dynamic analysis reported in the model showing that initial or short-run effects of increased data sharing can reduce welfare growth (no empirical/sample data).
high | negative | Study on the impact of big data sharing on individuals’ welf... | short-term growth of individuals' welfare
There is a measurement asymmetry in standard LLM evaluation: unconstrained prompts can inflate constraint-adherence scores and mask the practical value of structured prompting.
Analysis of evaluation results from the controlled study showing that unconstrained (simple) prompts sometimes achieve high constraint-adherence scores, leading to misleading evaluation of structured prompts' benefits.
high | negative | Evaluating 5W3H Structured Prompting for Intent Alignment in... | constraint_adherence_scores / evaluation_bias
Traditional paradigms, specifically the resource-based view and the dynamic capabilities framework, operate under closed-system, first-order cybernetic assumptions that fail to capture the dissipative nature of algorithmic agents.
Conceptual critique presented in the paper's theoretical argumentation (literature critique and re-framing); no empirical sample reported.
high | negative | Governing Human–AI Co-Evolution: Intelligentization Capabili... | explanatory_power_of_management_theory (ability to account for AI-driven organiz...
AI usage predicts work disengagement behavior via emotional exhaustion elicited by AI-associated technostressors.
Four-stage longitudinal study (survey) of finance professionals (N=285); mediation analysis testing AI usage -> technostressors -> emotional exhaustion -> work disengagement, based on SOR framework.
high | negative | Autonomous enhancement or emotional depletion? The dual-path... | work disengagement behavior (mediated by emotional exhaustion from technostresso...
These findings highlight fundamental challenges in numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.
Interpretation of experimental results in the paper: authors conclude that the observed limited gains (particularly on trading-signal/time-series aspects) indicate shortcomings in LLM numerical and time-series reasoning.
high | negative | FinTradeBench: A Financial Reasoning Benchmark for LLMs | LLMs' numerical and time-series reasoning capability (qualitative conclusion fro...
There is a central design tension in human-AI systems: maximizing short-term hybrid capability does not necessarily preserve long-term human cognitive competence.
Conceptual/theoretical claim derived from the framework and discussion in the paper (argument and mathematical framing), no empirical sample or longitudinal data presented in the excerpt.
high | negative | Cognitive Amplification vs Cognitive Delegation in Human-AI ... | long-term human cognitive competence
This result directly contradicts classical scaling laws which assume monotonic capability gains with model scale.
Comparative theoretical claim in the paper contrasting the Institutional Scaling Law with classical empirical/theoretical scaling laws in ML literature.
high | negative | Punctuated Equilibria in Artificial Intelligence: The Instit... | relationship between model scale and deployment-relevant fitness/capability
The Institutional Scaling Law proves that institutional fitness is non-monotonic in model scale.
Formal mathematical derivation/proof presented in the paper (the 'Institutional Scaling Law').
high | negative | Punctuated Equilibria in Artificial Intelligence: The Instit... | institutional fitness as a function of model scale
AI development proceeds not through smooth advancement but through extended periods of stasis interrupted by rapid phase transitions that reorganize the competitive landscape (punctuated equilibrium pattern).
Argument based on punctuated equilibrium theory from evolutionary biology and historical analysis presented in the paper identifying discrete transitions in AI history; the paper cites and classifies eras/events as evidence.
high | negative | Punctuated Equilibria in Artificial Intelligence: The Instit... | pattern of AI development (stasis vs. phase transitions)
The interaction of artificial intelligence and environmental regulation produces a '1 + 1 < 2' crowding-out effect (their combined effect is less than the sum of individual effects).
Spatial Durbin model with interaction term between AI and environmental regulation as summarized in the abstract; reported as a crowding-out interaction.
high | negative | How artificial intelligence and environmental regulation inf... | UCEE index (interaction effect of AI and environmental regulation)
Environmental regulation significantly inhibits local UCEE.
Spatial Durbin model results reported in the abstract indicating a significant negative local coefficient for environmental regulation.
high | negative | How artificial intelligence and environmental regulation inf... | UCEE index (local/provincial effect of environmental regulation)
Artificial intelligence significantly inhibits local UCEE.
Spatial Durbin model results reported in the abstract indicating a significant negative local coefficient for artificial intelligence.
high | negative | How artificial intelligence and environmental regulation inf... | UCEE index (local/provincial effect of AI)
Progress in agentic AI systems that generate and optimize GPU kernels is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution.
Author argument/observation in paper (conceptual claim about limitations of existing benchmarks); no empirical sample or experiment reported in the provided text.
high | negative | SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... | benchmark_alignment_with_hardware_efficiency
Rather than broad job losses, evidence points to a reallocation at the entry level: AI automates tasks typically assigned to junior staff, shifting the nature of entry-level roles.
Synthesis of firm- and task-level empirical studies reported in the brief documenting automation of routine/junior tasks and changes in job-task composition; specific sample sizes vary by cited study and are not provided in the brief.
high | negative | AI, Productivity, and Labor Markets: A Review of the Empiric... | automation of entry-level/junior tasks and changes to entry-level job content
Algorithmic credit systems are linked to higher levels of financial stress.
Study reports a positive association between algorithmic credit system use and reported financial stress from regression analysis on the 400-user cross-sectional dataset.
Confirmation bias poses a weakness in LLM-based code review, with implications for how AI-assisted development tools are deployed.
Synthesis of findings from Study 1 (framing-induced detection failures) and Study 2 (practical exploitability and partial mitigation via debiasing).
high | negative | Measuring and Exploiting Confirmation Bias in LLM-Assisted S... | reliability/security of LLM-based code review
Adversarial framing succeeds in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success.
Study 2 experiments in real project configurations with iterative adversary refinement evaluated against Claude Code (autonomous agent); reported 88% success rate.
high | negative | Measuring and Exploiting Confirmation Bias in LLM-Assisted S... | attack success rate (vulnerability reintroduction accepted/not detected)
Adversarial pull request framing (e.g., labeled as security improvements or urgent functionality fixes) succeeds in reintroducing known vulnerabilities in 35% of cases against GitHub Copilot under one-shot attacks.
Study 2 experiments simulating adversarial pull requests evaluated against GitHub Copilot (interactive assistant); reported success rate 35% for one-shot attacks.
high | negative | Measuring and Exploiting Confirmation Bias in LLM-Assisted S... | attack success rate (vulnerability reintroduction accepted/not detected)
The framing effect is strongly asymmetric: false negatives increase sharply while false positive rates change little.
Comparison of false negative and false positive rates across framing conditions in Study 1 experiments (250 CVE pairs across models).
high | negative | Measuring and Exploiting Confirmation Bias in LLM-Assisted S... | false negative rate and false positive rate
Framing a change as bug-free reduces vulnerability detection rates by 16-93%.
Result reported from Study 1 controlled experiments across models and framing conditions (250 CVE pairs).
high | negative | Measuring and Exploiting Confirmation Bias in LLM-Assisted S... | vulnerability detection rate
AI-only baselines perform near or below the median of competition participants.
Comparison of AI-only baseline performance to the distribution of competition participant results reported in the paper (competition with 29 teams / 80 participants).
high | negative | AgentDS Technical Report: Benchmarking the Future of Human-A... | relative performance rank of AI-only baselines vs participants
Our results show that current AI agents struggle with domain-specific reasoning.
Outcome of the competition reported in the paper comparing AI-only baselines to participant submissions across the AgentDS tasks (competition data from 29 teams / 80 participants); reported aggregate performance indicating AI weakness on domain-specific tasks.
high | negative | AgentDS Technical Report: Benchmarking the Future of Human-A... | domain-specific reasoning performance
LLM-generated peer reviews place significantly less weight on clarity and significance of the research.
Comparative analysis between LLM-generated reviews and human reviews from the conference dataset; reported as a statistically significant difference but exact statistics and sample size not provided in the excerpt.
high | negative | How LLMs Distort Our Written Language | importance/weight given to clarity and significance in peer review content