The Commonplace

Evidence (2954 claims)

Adoption: 5126 claims
Productivity: 4409 claims
Governance: 4049 claims
Human-AI Collaboration: 2954 claims
Labor Markets: 2432 claims
Org Design: 2273 claims
Innovation: 2215 claims
Skills & Training: 1902 claims
Inequality: 1286 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---:|---:|---:|---:|---:|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | 3 | | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | | 23 |
| Labor Share of Income | 7 | 4 | 9 | | 20 |
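Read as data, the matrix supports simple derived views such as the share of negative findings per outcome. A minimal sketch over four rows transcribed from the table above:

```python
# Share of negative findings per outcome, using rows transcribed from
# the Evidence Matrix (positive, negative, mixed, null, total).
rows = {
    "AI Safety & Ethics":     (112, 177, 43, 24, 358),
    "Firm Productivity":      (273, 33, 68, 10, 389),
    "Inequality Measures":    (24, 66, 31, 4, 125),
    "Developer Productivity": (25, 1, 2, 1, 29),
}

# Negative claims as a fraction of the row's listed total.
negative_share = {
    outcome: neg / total
    for outcome, (pos, neg, mixed, null, total) in rows.items()
}

for outcome, share in sorted(negative_share.items(), key=lambda x: -x[1]):
    print(f"{outcome}: {share:.0%} of claims negative")
```

On these rows the pattern is stark: Inequality Measures and AI Safety & Ethics skew heavily negative, while Firm Productivity and Developer Productivity skew positive, matching a visual read of the matrix.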
Active filter: Human-AI Collaboration
LLM design agents can fixate on existing paradigms and fail to explore alternatives when solving design challenges, potentially leading to suboptimal solutions (a pathology analogous to human designers).
Literature/background claim and authors' characterization of observed agent behavior; motivated the proposed metacognitive interventions. No numerical sample size reported.
high | negative | Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regul... | tendency to fixate on existing paradigms / lack of exploration leading to subopt...
Algorithmic management functions as 'psychological governance' that erodes worker mental health through surveillance, opacity, and precarity.
Synthesis/conclusion from integrating findings across the reviewed literature (48 studies) and the trilevel theoretical framework.
high | negative | Algorithmic Control and Psychological Risk in Digitally Mana... | worker mental health (general deterioration)
Fear of deactivation (automated sanctions) creates chronic precarity; 78% report chronic fear.
Reported prevalence in the paper's synthesis of studies that measured fear of deactivation / account suspension among platform workers.
high | negative | Algorithmic Control and Psychological Risk in Digitally Mana... | self-reported chronic fear of deactivation
Task fragmentation (the splitting of tasks via platform algorithms) leads to a reduced sense of accomplishment among drivers.
Thematic finding/proposition from the trilevel framework based on qualitative and quantitative evidence synthesized across studies.
high | negative | Algorithmic Control and Psychological Risk in Digitally Mana... | reduced sense of accomplishment
Rating pressure is associated with emotional exhaustion, with 41–67% reporting high burnout.
Reported prevalence range in the paper's synthesis of included studies measuring burnout/emotional exhaustion among workers exposed to rating systems.
high | negative | Algorithmic Control and Psychological Risk in Digitally Mana... | emotional exhaustion / high burnout prevalence
Income volatility from dynamic pricing is associated with depressive symptoms (reported prevalence range 23–41%).
Reported prevalence range in the paper's synthesized findings (from included empirical studies reporting depressive symptom prevalence among affected workers).
high | negative | Algorithmic Control and Psychological Risk in Digitally Mana... | prevalence of depressive symptoms
Algorithmic opacity is linked to procedural anxiety.
Thematic proposition from the trilevel framework reported in the paper synthesizing pathways from algorithmic control to psychological risk.
Real estate pro forma development remains one of the most time-intensive functions in property investment, typically requiring twenty to forty hours per multifamily project through manual research, Excel-based modeling, and iterative scenario analysis.
Statement in paper asserting typical industry practice; not tied to the paper's controlled test. No empirical sample size or survey data reported alongside this assertion.
Work autonomy weakens the positive effect of AI avoidance job crafting on work alienation (buffering moderation).
Moderation analysis in the same dataset (287 employee–leader dyads) showing a significant interaction between AI avoidance job crafting and work autonomy predicting lower work alienation when autonomy is higher.
The negative effect of AI avoidance job crafting on career-relevant outcomes (career satisfaction and performance) is mediated by increased work alienation.
Mediation analysis on the multi-wave, multi-source survey data (287 employee–leader dyads) showing a pathway from AI avoidance job crafting → work alienation → worse career outcomes.
high | negative | Approach or avoidance? A dual-pathway model of job crafting ... | career satisfaction and performance (mediated by work alienation)
AI avoidance job crafting negatively predicts career satisfaction and performance.
Multi-source, multi-wave survey of 287 employee–leader dyads in China linking employee-reported AI avoidance job crafting to lower career satisfaction and lower performance.
high | negative | Approach or avoidance? A dual-pathway model of job crafting ... | career satisfaction and performance
The competence shadow compounds multiplicatively to produce degradation far exceeding naive additive estimates.
Analytic/closed-form performance bounds derived in the paper showing multiplicative compounding (theoretical result; no empirical sample reported).
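The multiplicative-versus-additive gap can be illustrated with a toy calculation; the per-stage amplification factor and stage count below are hypothetical choices for illustration, not numbers from the paper:

```python
# Toy illustration of multiplicative compounding: if each review stage
# amplifies a degradation factor by 30% (hypothetical number), the
# compounded effect grows geometrically and soon outruns the naive
# additive (linear) estimate.
per_stage = 0.30
stages = 6

additive = 1 + per_stage * stages        # naive linear estimate: 2.80x
compounded = (1 + per_stage) ** stages   # geometric compounding: ~4.83x

print(f"additive estimate: {additive:.2f}x")
print(f"compounded:        {compounded:.2f}x")
```

With six stages the compounded factor is already about 1.7 times the additive guess, and the gap widens with every further stage.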
The competence shadow is a systematic narrowing of human reasoning induced by AI-generated safety analysis; it is defined not by what the AI presents, but by what it prevents from being considered.
Conceptual definition and formalization within the paper (theoretical exposition; no empirical test reported).
Safety engineering resists benchmark-driven evaluation because safety competence is irreducibly multidimensional, constrained by context-dependent correctness, inherent incompleteness, and legitimate expert disagreement.
Conceptual/theoretical argument and formalization presented in the paper (no empirical sample reported).
In experimental settings, the model is able to induce belief and behaviour changes in study participants.
Controlled experimental interventions reported in the study where participant beliefs and behaviors were measured pre/post or between conditions; aggregate result: model induced changes.
high | negative | Evaluating Language Models for Harmful Manipulation | participant beliefs and behaviour changes (manipulative efficacy)
The tested model can produce manipulative behaviours when prompted to do so.
Human-AI interaction tests in which the model was prompted to produce manipulative behaviours; empirical observations reported in study across participants and prompts.
high | negative | Evaluating Language Models for Harmful Manipulation | frequency/occurrence of manipulative behaviours (model propensity to produce man...
Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity).
Authors' conceptual argument and motivation for introducing a new evaluation framework; contrasted standard calibration metrics (ECE, Brier) with Type-1 vs Type-2 capacities in the paper's introduction and methods.
high | negative | Do LLMs Know What They Know? Measuring Metacognitive Efficie... | confounding of calibration metrics between Type-1 sensitivity (knowledge) and Ty...
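For reference, the two standard metrics named in the claim can be computed as follows. This is a minimal sketch, not the paper's code; the equal-width binning for ECE is one common convention among several:

```python
def brier_score(conf, correct):
    """Mean squared gap between stated confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error: occupancy-weighted gap between mean
    confidence and accuracy within equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    n = len(conf)
    gap = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            gap += (len(b) / n) * abs(mean_conf - accuracy)
    return gap

# A model that answers correctly 80% of the time and always reports 0.8
# is perfectly calibrated on average (ECE ~ 0) while having zero Type-2
# sensitivity: its confidence never separates right from wrong answers.
conf = [0.8] * 10
correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(brier_score(conf, correct), ece(conf, correct))
```

The toy run makes the conflation concrete: a confidence signal can score well on calibration metrics while carrying no per-answer information about which responses are wrong.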
Traditional expert-based assessment faces a critical scalability challenge in large systems (e.g., serving 36 million children across 250,000+ kindergartens in China), making continuous quality monitoring infeasible and relegating assessment to infrequent episodic audits.
Authors' contextual motivation citing scale figures (36 million children, 250,000+ kindergartens) and describing time/cost constraints of manual observation leading to infrequent audits.
high | negative | When AI Meets Early Childhood Education: Large Language Mode... | feasibility/scalability of manual expert-based assessment
The reverse confidence scenario marks a significant boundary: a substantial proportion of participants struggled to override their initial inductive biases and thus had difficulty learning in that condition.
Behavioral experiment (N = 200) reporting that many participants failed or struggled in the reverse confidence mapping condition; proportion described in paper (exact proportion not given here).
high | negative | Learning to Trust: How Humans Mentally Recalibrate AI Confid... | failure/struggle rate in reverse confidence condition (ability to learn mappings...
Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate).
Preliminary empirical evaluation reported by the authors; reported task failure rate ~60% (no sample size provided in abstract).
high | negative | CUA-Suite: Massive Human-annotated Video Demonstrations for ... | task failure rate of foundation action models on professional desktop applicatio...
The largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video.
Quantitative statement about ScaleCUA reported in paper: 2,000,000 screenshots and <20 hours equivalence.
high | negative | CUA-Suite: Massive Human-annotated Video Demonstrations for ... | size/coverage of existing open dataset (ScaleCUA)
Progress toward general-purpose CUAs is bottlenecked by the scarcity of continuous, high-quality human demonstration videos.
Asserted in paper as motivation; refers to the gap in available continuous video data for training CUAs.
high | negative | CUA-Suite: Massive Human-annotated Video Demonstrations for ... | availability of continuous, high-quality human demonstration videos (data scarci...
Refining the state (as above) raises state-action blind mass from 0.0165 at τ = 50 to 0.1253 at τ = 1000.
Empirical measurement reported on the instantiated model over the BPI 2019 log showing state-action blind mass values at two threshold (tau) settings.
high | negative | The Stochastic Gap: A Markovian Framework for Pre-Deployment... | state-action blind mass (measure of unsupported next-step decisions)
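The paper's formal definition of state-action blind mass is not in the excerpt; one plausible reading (an assumption here) is the share of logged transition mass that falls on (state, action) pairs observed fewer than τ times, which is monotone in τ as the reported numbers are:

```python
from collections import Counter

def blind_mass(log, tau):
    """Fraction of logged transitions whose (state, action) pair occurs
    fewer than tau times. Illustrative, assumed definition; not quoted
    from the paper."""
    counts = Counter(log)
    rare = sum(n for n in counts.values() if n < tau)
    return rare / len(log)

# Tiny synthetic event log of (state, action) pairs, not BPI 2019.
log = ([("open", "approve")] * 900
       + [("open", "escalate")] * 80
       + [("hold", "cancel")] * 20)

print(blind_mass(log, tau=50))   # 0.02
print(blind_mass(log, tau=100))  # 0.1
```

Raising τ demands more evidence per pair before a next-step decision counts as supported, so more mass becomes "blind", mirroring the 0.0165 to 0.1253 increase reported above.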
Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when AI is wrong and underuse when it is helpful.
Paper cites empirical literature (unspecified in excerpt) as the basis for this claim; no sample size or methods given here.
high | negative | From Accuracy to Readiness: Metrics and Benchmarks for Human... | failures due to miscalibrated reliance (overreliance/underreliance)
Evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively.
Paper-level critique / literature observation asserted in text; no empirical method or sample reported in excerpt.
high | negative | From Accuracy to Readiness: Metrics and Benchmarks for Human... | evaluation focus (accuracy vs. team readiness)
These harms increasingly translate into financial loss through litigation, enforcement penalties, brand erosion, and failed deployments.
Paper argues this linkage using conceptual reasoning and illustrative examples/case vignettes; cites regulatory and market incidents but does not provide systematic empirical estimates or a sample size.
AI systems can create material harms: discriminatory outcomes, privacy and security failures, opacity in decision logic, and regulatory noncompliance.
Paper lists these harms as core risks based on prior literature, regulatory developments, and conceptual risk analysis. Presented as well-documented categories rather than as new empirical findings; no sample size reported.
As artificial intelligence assumes cognitive labor, no existing quantitative framework predicts when human capability loss becomes catastrophic.
Introductory/background claim asserted by authors motivating the study (literature gap claim).
high | negative | The enrichment paradox: critical capability thresholds and i... | absence of prior quantitative frameworks for catastrophic human capability loss
Broader AI scope lowers the critical threshold K* (i.e., more general AI reduces the K* value at which capability collapse occurs).
Model sensitivity analysis / simulations showing K* varies with assumed scope of AI (reported in model calibration discussion).
high | negative | The enrichment paradox: critical capability thresholds and i... | change in critical threshold K* with AI scope
The model identifies a critical threshold K* approximately 0.85 (scope-dependent; broader AI scope lowers K*) beyond which capability collapses abruptly — the 'enrichment paradox.'
Model analysis and simulations calibrated across domains (paper reports computed threshold K* ≈ 0.85 and notes dependence on AI scope).
high | negative | The enrichment paradox: critical capability thresholds and i... | critical delegation/capability threshold (K*) at which human capability collapse...
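Threshold behaviour of this kind can be sketched with a toy dynamical model. The functional form and constants below are illustrative assumptions chosen so the toy collapse lands near the reported K* of roughly 0.85; this is not the paper's calibrated model:

```python
# Toy sketch of abrupt capability collapse past a delegation threshold.
# Assumed dynamics: practice on the retained (1 - k) share of work
# regrows human capability c; delegating a fraction k erodes it. Past a
# critical k the erosion term dominates and c falls to zero.
def steady_capability(k, steps=20000, growth=0.6, decay=0.105):
    """Long-run human capability under delegation fraction k."""
    c = 1.0
    for _ in range(steps):
        c += growth * (1 - k) * c * (1 - c) - decay * k * c
        c = max(0.0, min(1.0, c))  # keep capability in [0, 1]
    return c

for k in (0.5, 0.8, 0.85, 0.9):
    print(f"delegation {k:.2f} -> long-run capability {steady_capability(k):.3f}")
```

In this toy, long-run capability stays well above zero through k = 0.8 and then collapses just past k = 0.85, the kind of abrupt transition the claim describes.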
Fabrication risk is not an anomalous glitch but a foreseeable consequence of the technology's design, with direct implications for the evolving duty of technological competence.
Conclusion drawn from the paper's theoretical/physics-based analysis and the simulated scenario; stated in the abstract as the authors' interpretation and policy/legal implication.
high | negative | When AI output tips to bad but nobody notices: Legal implica... | foreseeability of fabrication risk and implications for professional duty/compet...
The paper presents the physics-based analysis in a legal-industry setting by walking through a simulated brief-drafting scenario.
Methodological claim explicitly stated in the abstract: use of a simulated brief-drafting scenario to demonstrate the analysis.
high | negative | When AI output tips to bad but nobody notices: Legal implica... | demonstration of fabrication risk in a simulated legal drafting task (output qua...
Although commonly dismissed as random 'hallucination', recent physics-based analysis of the Transformer's core mechanism reveals a deterministic component: the AI's internal state can cross a calculable threshold, causing its output to flip from reliable legal reasoning to authoritative-sounding fabrication.
Paper cites/relies on 'recent physics-based analysis' of Transformer mechanisms and states that it demonstrates a calculable threshold; the paper also purports to present this science in a legal setting (via simulation). No numeric experimental sample provided in the excerpt.
high | negative | When AI output tips to bad but nobody notices: Legal implica... | transition from reliable reasoning to fabricated outputs (failure mode / interna...
Courts confront a novel threat to the integrity of the adversarial process due to fabricated authorities produced by generative AI.
Asserted in the abstract as a consequence of fabricated outputs; supported by the paper's conceptual argument and simulation reference rather than empirical court-case analysis.
high | negative | When AI output tips to bad but nobody notices: Legal implica... | integrity of the adversarial process / decision quality in courts
Attorneys who unknowingly file such fabrications face professional sanctions, malpractice exposure, and reputational harm.
Stated as a legal/consequential claim in the abstract; no empirical evidence, case counts, or legal-statistics provided in the excerpt.
high | negative | When AI output tips to bad but nobody notices: Legal implica... | professional sanctions, malpractice exposure, reputational harm
For law in particular, generative AI introduces a perilous failure mode in which the AI fabricates fictitious case law, statutes, and judicial holdings that appear entirely authentic.
Claimed in the paper; supported by the paper's analytic argument and a simulated brief-drafting scenario referenced in the abstract (no numeric sample provided).
high | negative | When AI output tips to bad but nobody notices: Legal implica... | fabrication of legal authorities (authentic-appearing fake citations/holdings)
Measuring only technical model performance (such as predictive accuracy) is insufficient for assessing the strategic impact of AI in drug discovery.
Argued in the paper as a critique of current evaluation practices; presented as a conceptual point rather than supported by new empirical data in the excerpt.
high | negative | Strategic Key Performance Indicators for AI in Lead Optimiza... | adequacy of technical model performance metrics for capturing strategic impact
Pressure remains high to increase the probability of success and thereby improve the effectiveness of pharmaceutical R&D.
Asserted in the paper as motivational context for the work; framed as an industry pressure point rather than backed by a specific empirical sample or quantified survey in the excerpt.
high | negative | Strategic Key Performance Indicators for AI in Lead Optimiza... | probability of success in pharmaceutical R&D
Rising costs and failure rates in the pharmaceutical R&D process have not fundamentally improved over the last decade.
Stated as a contextual observation in the paper's opening paragraph; presented as a summary of industry trends (no specific dataset, sample size, or citation included in the excerpt).
high | negative | Strategic Key Performance Indicators for AI in Lead Optimiza... | cost and failure rates in pharmaceutical R&D
Without support, performance stays stable up to three issues but declines as additional issues increase cognitive load.
Empirical study / human-AI negotiation case study in a property rental scenario that varied the number of negotiated issues; the paper reports observed performance across different numbers of issues (no sample size for this specific comparison stated in the abstract).
high | negative | From Overload to Convergence: Supporting Multi-Issue Human-A... | negotiation performance (ability to find good agreements) under increasing numbe...
Reliance on automated content generation introduces risks of cognitive overreliance, algorithmic bias, and strategic misalignment.
The paper articulates these risks as conceptual/qualitative concerns in its discussion; no quantitative estimates or empirical tests of these specific risks are reported in the provided excerpt.
high | negative | The Strategic Impact of Generative Artificial Intelligence o... | risks to decision-making including cognitive overreliance, algorithmic bias, str...
Wide disagreement among AIs created confusion and undermined appropriate reliance on advice.
Reported experimental finding from the paper: manipulating within-panel disagreement across tasks produced wide disagreement conditions that, according to the abstract, led to confusion and reduced appropriate reliance. No quantitative metrics reported in abstract.
high | negative | More Isn't Always Better: Balancing Decision Accuracy and Co... | appropriate reliance on advice / decision-making
High within-panel consensus fostered overreliance on AI advice.
Experimental manipulation of within-panel consensus across the three tasks; the abstract reports that high consensus increased participants' reliance on AI (interpreted as overreliance). Specific measures and sample size not provided in abstract.
high | negative | More Isn't Always Better: Balancing Decision Accuracy and Co... | reliance on AI advice (overreliance)
Developers and experts still lack a shared view, resulting in repeated coordination, clarification rounds, and error-prone handoffs.
Observational/qualitative claim in paper describing current MSD practice (no numeric sample reported).
high | negative | LLM-Powered Workflow Optimization for Multidisciplinary Soft... | frequency of coordination rounds / error-prone handoffs
Even with AI coding assistants like GitHub Copilot, individual coding tasks are semi-automated, but the workflow connecting domain knowledge to implementation is not.
Qualitative observation/comparative statement in paper (no empirical sample reported).
high | negative | LLM-Powered Workflow Optimization for Multidisciplinary Soft... | degree of automation of coding tasks vs. end-to-end workflow automation
Multidisciplinary Software Development (MSD) requires domain experts and developers to collaborate across incompatible formalisms and separate artifact sets.
Conceptual/argument in paper framing the problem (no empirical sample reported).
high | negative | LLM-Powered Workflow Optimization for Multidisciplinary Soft... | collaboration/workflow efficiency between domain experts and developers
Only 12% of AI market value maps to physical activities.
Descriptive aggregate: authors categorize and report that 12% of estimated AI market value maps to physical activities.
high | negative | Where can AI be used? Insights from a deep ontology of work ... | share of AI market value by activity type (physical)
Applying AI coding agents to hardware-in-the-loop (HIL) embedded and Internet-of-Things (IoT) systems remains challenging due to the tight coupling between software logic and physical hardware behavior; code that compiles successfully may still fail when deployed on real devices because of timing constraints, peripheral initialization requirements, or hardware-specific behaviors.
Conceptual/engineering reasoning stated in the paper describing known HIL/IoT failure modes (no experimental quantification provided in this excerpt).
high | negative | Skilled AI Agents for Embedded and IoT Systems Development | code failure / runtime correctness when deployed to hardware
Across heterogeneous learners, a common broadcast curriculum can be slower than personalized instruction by a factor linear in the number of learner types.
Theoretical comparative result in the model (analysis of broadcast vs personalized curricula across heterogeneous learner types; abstract states factor linear in number of types).
high | negative | A Mathematical Theory of Understanding | speed of instruction / time to learn under broadcast curriculum vs personalized ...
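The linear factor can be seen in a toy model, under an assumed setup that is not the paper's formal one: k learner types, each needing its own sequence of m type-specific lessons, with one lesson broadcast per round:

```python
# Toy model of the linear slowdown: a shared broadcast channel teaching
# one type-specific lesson per round, round-robin over k learner types.
# Each learner only benefits from lessons aimed at its own type.
def broadcast_finish_time(k, m):
    """Rounds until every learner type has received all m of its
    type-specific lessons over the shared channel."""
    received = [0] * k
    t = 0
    while min(received) < m:
        received[t % k] += 1  # this round's lesson targets type t % k
        t += 1
    return t

k, m = 5, 20
print(broadcast_finish_time(k, m))  # 100 rounds: k times the m rounds
                                    # a personalized curriculum needs
```

Personalized instruction delivers each learner its own m lessons in m rounds, so the broadcast channel is slower by exactly the factor k, linear in the number of learner types as the abstract states.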
The findings provide evidence against cue-based accounts of lie detection more generally.
Authors' interpretation: because lie-detection accuracy did not decrease despite changes to visual cues (retouching, backgrounds, avatars), the results challenge theories that rely on superficial cues for lie detection.
high | negative | Through the Looking-Glass: AI-Mediated Video Communication R... | validity of cue-based accounts of lie detection