The Commonplace

Evidence (2954 claims)

Adoption: 5126 claims
Productivity: 4409 claims
Governance: 4049 claims
Human-AI Collaboration: 2954 claims
Labor Markets: 2432 claims
Org Design: 2273 claims
Innovation: 2215 claims
Skills & Training: 1902 claims
Inequality: 1286 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 369 105 58 432 972
Governance & Regulation 365 171 113 54 713
Research Productivity 229 95 33 294 655
Organizational Efficiency 354 82 58 34 531
Technology Adoption Rate 277 115 63 27 486
Firm Productivity 273 33 68 10 389
AI Safety & Ethics 112 177 43 24 358
Output Quality 228 61 23 25 337
Market Structure 105 118 81 14 323
Decision Quality 154 68 33 17 275
Employment Level 68 32 74 8 184
Fiscal & Macroeconomic 74 52 32 21 183
Skill Acquisition 85 31 38 9 163
Firm Revenue 96 30 22 148
Innovation Output 100 11 20 11 143
Consumer Welfare 66 29 35 7 137
Regulatory Compliance 51 61 13 3 128
Inequality Measures 24 66 31 4 125
Task Allocation 64 6 28 6 104
Error Rate 42 47 6 95
Training Effectiveness 55 12 10 16 93
Worker Satisfaction 42 32 11 6 91
Task Completion Time 71 5 3 1 80
Wages & Compensation 38 13 19 4 74
Team Performance 41 8 15 7 72
Hiring & Recruitment 39 4 6 3 52
Automation Exposure 17 15 9 5 46
Job Displacement 5 28 12 45
Social Protection 18 8 6 1 33
Developer Productivity 25 1 2 1 29
Worker Turnover 10 12 3 25
Creative Output 15 5 3 1 24
Skill Obsolescence 3 18 2 23
Labor Share of Income 7 4 9 20
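The direction shares implied by the matrix above can be computed directly. A minimal sketch, with two rows transcribed from the table; note that the reported Total is used as the denominator, since for several rows it exceeds the sum of the four listed directions (the site evidently tracks further direction categories not shown here):

```python
def positive_share(counts):
    """Share of a row's claims with a positive direction, using the
    reported Total (not the sum of the four listed directions) as the
    denominator."""
    return counts["positive"] / counts["total"]

# Counts transcribed from the Evidence Matrix above.
rows = {
    "Firm Productivity":  {"positive": 273, "negative": 33, "mixed": 68, "null": 10, "total": 389},
    "AI Safety & Ethics": {"positive": 112, "negative": 177, "mixed": 43, "null": 24, "total": 358},
}

for outcome, counts in rows.items():
    print(f"{outcome}: positive share = {positive_share(counts):.1%}")
```

The contrast is stark: roughly 70% of Firm Productivity claims point positive, versus about 31% for AI Safety & Ethics.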
Active filter: Human-AI Collaboration
Helicoid dynamics is a specific failure regime: a system engages competently, drifts into error, accurately names what went wrong, then reproduces the same pattern at a higher level of sophistication, recognizing it is looping and continuing nonetheless.
Definition introduced in the paper and illustrated by the reported case series; the claim is conceptual/phenomenological rather than a statistical result.
high mixed AI Knows What's Wrong But Cannot Fix It: Helicoid Dynamics i... incidence and qualitative characterization of the helicoid pattern in LLM intera...
A minimal linear specification (linearized model) demonstrates how coupling strength, persistence, and dissipation determine local stability and oscillatory regimes through spectral conditions on the Jacobian.
Analytic linear model and local stability analysis in the paper: computation of Jacobian, derivation of spectral conditions (eigenvalue locations) that separate stable/oscillatory regimes; illustrative examples within the paper (no empirical data).
high mixed How Intelligence Emerges: A Minimal Theory of Dynamic Adapti... local stability/oscillatory behavior characterized by Jacobian eigenvalues (spec...
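The spectral condition described above can be sketched numerically. In a two-variable linear system, the Jacobian's eigenvalues separate stable from unstable fixed points (sign of the real part) and oscillatory from non-oscillatory regimes (nonzero imaginary part). The parameterization below is an illustrative stand-in, not the paper's model or notation:

```python
def classify_fixed_point(p: float, d: float, c: float):
    """Local behavior of the illustrative 2-variable linear system
        dx/dt = (p - d)*x + c*y
        dy/dt = -c*x + (p - d)*y
    whose Jacobian [[p-d, c], [-c, p-d]] has eigenvalues (p - d) +/- i*c.
    Returns (stable, oscillatory). Names p (persistence), d (dissipation),
    c (coupling) are illustrative, not the paper's notation."""
    a = p - d
    stable = a < 0          # both eigenvalues share real part p - d
    oscillatory = c != 0    # nonzero imaginary part => spiraling
    return stable, oscillatory

print(classify_fixed_point(p=0.2, d=0.5, c=1.0))  # (True, True): stable spiral
print(classify_fixed_point(p=0.6, d=0.1, c=1.0))  # (False, True): growing oscillation
```

The sketch shows the qualitative point: dissipation exceeding persistence yields stability, while any nonzero coupling of this antisymmetric form produces oscillation.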
Distinct AI features (recommendation engines, chatbots, and comparison tools) influence consumer outcomes when modeled as latent constructs.
Methodological claim: the study modeled three AI features as latent constructs and analyzed their relationships with dependent variables using SEM (quantitative questionnaire data).
high mixed Role of artificial intelligence on consumer buying behavior:... influence on consumer trust, perceived decision-making support, and purchase int...
Both time constraints and LLM use significantly alter the characteristics of decision-makers' mental representations.
Results from the 2 × 2 experiment (N = 348) comparing representation-related measures across manipulated conditions; reported statistically significant differences associated with time constraints and with LLM use.
high mixed AI-Augmented Strategic Decision-Making Under Time Constraint... characteristics of mental representations (representation-related measures colle...
We develop a theoretical framework, the "productivity funnel", that traces how technological potential narrows through successive stages: from access and digital infrastructure, through organizational absorption and human capital adaptation, to ultimate value capture.
Conceptual/theoretical development presented in the paper; no empirical sample needed (framework-building).
high mixed The complementarity trap: AI adoption and value capture n/a (theoretical framework describing stages leading to value capture)
Effects of curated Skills are highly heterogeneous across domains (e.g., +4.5 pp in Software Engineering vs. +51.9 pp in Healthcare).
Per-domain pass-rate deltas reported in the paper (SkillsBench per-domain analysis). The example domain deltas (+4.5 pp and +51.9 pp) are taken from the reported per-domain results.
high mixed SkillsBench: Benchmarking How Well Agent Skills Work Across ... task pass rate (per-domain average delta)
The study's qualitative and exploratory design limits generalizability; the proposed framework requires quantitative testing and broader samples (practicing architects, firms, cross-cultural contexts).
Explicit limitations stated by authors; study is based on semi-structured interviews with architecture students (N unspecified) and inductive thematic analysis.
high mixed Human–AI Collaboration in Architectural Design Education: To... generalizability / external validity of findings and framework
XChronos reframes transhumanist technology evaluation in experiential terms, creating both market opportunities and measurement/regulatory challenges for AI economics.
Synthesis and concluding argument in the paper summarizing proposed implications; conceptual reasoning without empirical tests.
high mixed XChronos and Conscious Transhumanism: A Philosophical Framew... shift in evaluation criteria toward experiential measures and resultant market/r...
Across 182 reviewed studies, LLM-generated synthetic participants have modest and inconsistent fidelity to human participants.
Systematic review and synthesis of 182 empirical and methodological studies comparing LLM-generated participants to human samples; studies were coded and analyzed for fidelity outcomes.
high mixed Synthetic Participants Generated by Large Language Models: A... fidelity of synthetic participants to human participants (behavioral/response si...
Participant targeting: 44% of programs targeted doctors and 44% targeted medical students (with possible overlap), and 56% targeted entry‑to‑practice career stages.
Participant audience and career-stage data extracted from the 27 included programs; proportions reported in the review.
high mixed Assessing the effectiveness of artificial intelligence educa... target audience (doctors, medical students) and career stage distribution (entry...
Most programs were delivered in academic settings: 56% of evaluated programs reported an academic setting.
Setting information extracted from the 27 included programs, with 56% reported as delivered in academic settings.
high mixed Assessing the effectiveness of artificial intelligence educa... program delivery setting (academic vs non-academic)
A plurality of programs were short in duration: 44% of programs were categorized as short courses.
Extraction of program length from the 27 included studies; 44% were classified as short courses per the review's categorization.
high mixed Assessing the effectiveness of artificial intelligence educa... program duration (short vs longer formats)
Most programs were introductory in content: 67% of included programs taught introductory AI concepts rather than advanced/technical AI skills.
Program content extraction across the 27 included studies yielded that 67% were classified as teaching introductory AI.
high mixed Assessing the effectiveness of artificial intelligence educa... program content focus (introductory vs advanced/technical AI skills)
The methodological landscape of the evidence base is heterogeneous, consisting of cross-sectional surveys, case studies, quasi-experimental designs, and a limited number of longitudinal analyses.
Study design information was extracted from the 145 included studies revealing a mix of designs and relatively few longitudinal or experimental studies.
high mixed Digital transformation and its relationship with work produc... study design types (cross-sectional, case study, quasi-experimental, longitudina...
Human factors (training, trust calibration, workflows) determine whether clinicians accept, override, or ignore GenAI suggestions.
Qualitative and quantitative human-AI interaction studies and pilot deployments discussed in the paper; specific sample sizes and effect sizes are not reported in the paper.
high mixed GenAI and clinical decision making in general practice override/acceptance rates; clinician-reported trust and cognitive load; adherenc...
Safety and net benefit of GenAI CDS hinge on deployment details: user interface, real-time feedback, uncertainty quantification, calibration, and how recommendations are presented (strong vs. suggestive).
Human factors and implementation studies referenced; early A/B tests and human-AI interaction research suggest interface and presentation affect acceptance and error rates; no large-scale standardized implementation trial data cited.
high mixed GenAI and clinical decision making in general practice acceptance/override rates; error rates; calibration metrics; clinician trust
Reimbursement models (fee-for-service vs. capitation) will influence whether cost savings from GenAI are realized or offset by increased service volume.
Economic incentive framework and prior health-economics literature cited; the paper does not provide direct empirical tests but references plausible incentive channels.
high mixed GenAI and clinical decision making in general practice total spending; per-patient cost; service volume under different payment models
RL and adaptive methods are well suited to real-time adaptation but can be myopic, require large amounts of interaction data, and struggle to incorporate long-term preference structure and ethical constraints.
Surveyed properties of reinforcement learning and adaptive methods in HRI/RS literature; no new empirical evaluation in this paper.
high mixed Reimagining Social Robots as Recommender Systems: Foundation... real-time adaptation effectiveness, sample efficiency (amount of interaction dat...
The community knowledge functions both as practical how-to guidance and as collective experimentation with platform rules and revenue mechanisms.
Observed dual nature in the 377-video corpus: instructional workflows alongside demonstrations/testing of platform-tailored monetization tactics and workarounds.
high mixed Monetizing Generative AI: YouTubers' Collective Knowledge on... co-occurrence of instructional content and platform-experimentation practices
Typical practices emphasized by creators include rapid mass production of content, productizing prompt engineering, repurposing existing material via synthesis/localization, and packaging AI outputs as sellable creative services or assets.
Recurring practices surfaced through qualitative coding of workflows, tools, and pipelines described in the 377 videos.
high mixed Monetizing Generative AI: YouTubers' Collective Knowledge on... presence and frequency of recommended production and productization practices
Across the 377 videos, creators converge on a set of repeatable use cases and platform‑tailored monetization tactics.
Thematic coding of 377 videos produced a catalog of recurring use cases and tactics; the paper reports convergence across that sample.
high mixed Monetizing Generative AI: YouTubers' Collective Knowledge on... frequency and recurrence of specific use cases and monetization tactics in the s...
YouTube creators have collectively constructed and circulated a practical knowledge repository about how to monetize GenAI-driven creative work.
Systematic qualitative content analysis (thematic coding) of 377 publicly available YouTube videos in which creators promote GenAI workflows and monetization strategies.
high mixed Monetizing Generative AI: YouTubers' Collective Knowledge on... presence and characteristics of a community knowledge repository (practical guid...
Choice of scaffold materially affects outcomes: an open-source scaffold outperformed vendor-provided scaffolds by up to approximately 5 percentage points.
Comparative experiments across three scaffolding approaches (vendor scaffolds and at least one open-source scaffold) showing up to ~5 percentage point differences in measured outcomes.
high mixed Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... performance_difference_across_scaffolds (detection/exploitation_rates_difference...
Adoption of NFD approaches in regulated domains will depend on standards for validation, auditability, and update procedures.
Implications and governance discussion emphasizing regulatory constraints (finance, healthcare) and the need for validation/audit standards; a logical/normative claim rather than an empirical finding.
high mixed Nurture-First Agent Development: Building Domain-Expert AI A... adoption rate in regulated domains conditional on available validation/audit sta...
Limitations include generalizability beyond Chatbot Arena data, calibration of priors on novel tasks, audit costs/latency, user comprehension/cognitive load, and strategic manipulation.
Authors' stated limitations and open questions; these are candid acknowledgements rather than empirical findings.
high mixed Task-Aware Delegation Cues for LLM Agents generalizability, calibration, audit cost/latency, user comprehension, susceptib...
RAD requires estimating cost distributions and choosing a reference policy and quantile-weighting function; these choices determine the method's conservatism and sample efficiency.
Methodological and practical considerations discussed in the paper; noted dependency on estimation and design choices (no quantitative sample-efficiency results provided in the summary).
high mixed Safe RLHF Beyond Expectation: Stochastic Dominance for Unive... method conservatism (relative safety level) and sample efficiency (amount of dat...
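The dependence on a quantile-weighting function can be illustrated with a minimal sketch: given sampled costs, the weighting over empirical quantiles determines how conservative the aggregate estimate is. Uniform weights recover the mean; mass concentrated on upper quantiles approaches the worst case. All names here are illustrative, not the paper's API:

```python
def weighted_quantile_cost(costs, quantile_weights):
    """Aggregate sampled costs by weighting their empirical quantiles.
    `quantile_weights` assigns one nonnegative weight per sorted sample
    (low to high); more mass on upper quantiles => more conservative.
    Illustrative sketch only, not the paper's estimator."""
    assert len(costs) == len(quantile_weights)
    ordered = sorted(costs)
    total = sum(quantile_weights)
    return sum(w * c for w, c in zip(quantile_weights, ordered)) / total

costs = [1.0, 2.0, 10.0, 3.0]
# Uniform weights: the plain mean of the samples.
print(weighted_quantile_cost(costs, [1, 1, 1, 1]))   # 4.0
# All mass on the top quantile: the worst-case cost.
print(weighted_quantile_cost(costs, [0, 0, 0, 1]))   # 10.0
```

Sample efficiency enters through the same choice: upper-tail weightings are estimated from far fewer effective samples than the mean, which is the trade-off the claim describes.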
Explanations change workflows, shift responsibilities between humans and machines, and can reshape power dynamics—creating both opportunities (better oversight) and risks (over-reliance, gaming).
Qualitative and conceptual studies synthesized in the review, including socio-technical analyses and case studies reporting observed or theorized workflow and responsibility shifts; no meta-analytic causal estimate.
high mixed Explainable AI in High-Stakes Domains: Improving Trust, Tran... workflows, responsibility allocation, power dynamics, oversight quality
Explanations increase user trust principally when they are understandable, actionable, and aligned with users’ domain knowledge; opaque or overly technical explanations can fail to build trust or even decrease it.
Thematic synthesis of empirical and conceptual studies in the reviewed literature reporting conditional effects of explanation form and comprehensibility on trust; review notes heterogeneity in study designs and contexts.
high mixed Explainable AI in High-Stakes Domains: Improving Trust, Tran... user trust / changes in trust toward AI outputs
Explainability improves perceived legitimacy, user trust, and organizational accountability only when technical transparency is paired with human-centered explanation design and governance mechanisms.
Synthesis of studies from the reviewed literature showing conditional effects of algorithmic interpretability combined with explanation design and governance; derived via thematic coding across technical and social-science sources (no new primary experimental data reported).
high mixed Explainable AI in High-Stakes Domains: Improving Trust, Tran... perceived legitimacy, user trust, organizational accountability
Explainability is a necessary but not sufficient condition for trustworthy AI in high-stakes domains.
Systematic literature review (thematic coding and synthesis) of interdisciplinary scholarship (peer-reviewed research, technical reports, policy documents); the paper synthesizes conceptual and empirical studies rather than presenting new primary data. Emphasis on high-stakes domains (healthcare, finance, public sector).
high mixed Explainable AI in High-Stakes Domains: Improving Trust, Tran... overall trustworthiness of AI systems in high-stakes domains (multidimensional c...
Some patients value human contact for sensitive cases; automated interactions can feel impersonal.
Semi-structured interviews with patients/staff and open-ended survey responses documenting preferences for human interaction in sensitive/complex complaints.
high mixed The Role of Artificial Intelligence in Healthcare Complaint ... patient-reported preference for human contact and perceived interpersonal qualit...
Data‑driven policies can either amplify or mitigate inequalities depending on data representativeness, model design, and deployment governance.
Multiple empirical examples and theoretical analyses in the review highlighting cases of both harm (bias amplification) and mitigation, identified across the 103 items.
high mixed Models, applications, and limitations of the responsible ado... distributional equity outcomes (inequality amplification or mitigation)
Citizen acceptance, transparency, and perceived fairness strongly shape adoption trajectories and the political feasibility of AI tools in government.
Repeated empirical findings in the reviewed literature linking public trust, transparency measures, and fairness perceptions to successful or failed deployments (drawn from multiple case studies in the 103 items).
high mixed Models, applications, and limitations of the responsible ado... adoption trajectory/political feasibility of government AI tools (measured via d...
Adoption of AI and data-driven governance is highly uneven across jurisdictions and sectors, driven by institutional capacity, governance frameworks, and public trust.
Cross‑regional and cross‑sector comparisons in the review corpus (103 items) showing varying maturity levels and repeated identification of institutional capacity, governance arrangements, and trust factors as determinants.
high mixed Models, applications, and limitations of the responsible ado... adoption level/maturity of AI-driven governance systems
Productivity gains from generative AI depend on task mix, integration design, and the availability of complementary human skills.
Theoretical evaluation and synthesis of heterogeneous empirical findings; authors highlight variation across firms, sectors, and tasks.
high mixed The Use of ChatGPT in Business Productivity and Workflow Opt... productivity change conditional on task mix/integration/human skills (productivi...
Existing evidence is time-sensitive and heterogeneous: rapidly evolving models, heterogeneous study designs, and many short-term lab/microtask studies limit direct comparability and long-run inference.
Meta-observation from the review: documented methodological limitations across the literature (variation in models, tasks, metrics; prevalence of short-term studies).
high mixed ChatGPT as a Tool for Programming Assistance and Code Develo... generalizability and comparability of empirical findings (study heterogeneity)
Methodological caveats across the literature (heterogeneity of tasks/measures, publication bias, short-term studies) limit the generalizability of current findings.
Meta-level critique within the synthesis noting study heterogeneity, likely publication/short-term biases, and variable domain-specific performance dependent on user expertise and workflows.
high mixed ChatGPT as an Innovative Tool for Idea Generation and Proble... generalizability and external validity of LLM-assisted creativity findings
Standard productivity metrics are likely to undercount the value generated by AI-augmented ideation; quality-adjusted measures of creative output are required.
Measurement critique based on the mismatch between existing productivity statistics and the kinds of upstream idea-generation gains observed in empirical studies; supported by the review's methodological discussion.
high mixed ChatGPT as an Innovative Tool for Idea Generation and Proble... measured productivity vs. true quality-adjusted creative output
Realized value from AI methods (ML, predictive analytics, anomaly detection, XAI) is conditional: these technical methods deliver capabilities only when combined with strong data governance, standardized processes, and change management.
Thematic synthesis across the systematic review (2020–2025) showing repeated case-study and practitioner-report evidence that technical gains failed to scale without governance, process standardization, and organizational change efforts.
high mixed Integrating Artificial Intelligence and Enterprise Resource ... magnitude and durability of ERP-AI benefits (e.g., sustained accuracy gains, ado...
The hybrid estimator (GA+SQP) is computationally more intensive than single-stage MLE/local optimization, implying a trade-off between estimation reliability and runtime cost.
Reported runtime and computational cost comparisons in estimation experiments: the paper notes longer runtimes for GA+SQP versus standard optimizers while documenting improvements in objective values and convergence behavior.
high mixed k-QREM: Integrating Hierarchical Structures to Optimize Boun... computation time / runtime, convergence reliability
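The trade-off can be made concrete with a toy two-stage minimizer. Here a crude random search stands in for the genetic-algorithm stage and finite-difference descent stands in for SQP; this is a sketch of the general pattern, not the paper's estimator:

```python
import math
import random

def two_stage_minimize(f, bounds, n_global=200, n_local=100, step=0.01, seed=0):
    """Two-stage estimation sketch: a crude random global search (standing
    in for the GA stage) seeds a local refinement pass (standing in for
    SQP; shown as finite-difference descent). The global stage costs extra
    function evaluations -- the runtime/reliability trade-off noted above --
    but avoids starting local optimization in a poor basin."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Stage 1: global exploration with many cheap candidates.
    x = min((rng.uniform(lo, hi) for _ in range(n_global)), key=f)
    # Stage 2: local refinement from the best global candidate.
    for _ in range(n_local):
        grad = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
        x = min(max(x - step * grad, lo), hi)
    return x

# Multimodal objective: a purely local optimizer started naively can
# stall in the wrong basin; the global stage makes that unlikely.
objective = lambda v: (v - 2.0) ** 2 + math.sin(5 * v)
x_star = two_stage_minimize(objective, bounds=(-5.0, 5.0))
```

The extra cost is explicit: `n_global` objective evaluations are spent before any local step, which is the overhead the runtime comparison in the paper reports for GA+SQP over single-stage optimization.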
Results and implications are limited by the sample and context: evidence comes from law students on a single issue-spotting exam using one brief training intervention, so generalizability to experienced professionals, other tasks, or other models is untested.
Authors’ reported sample (164 law students) and explicit caution about generalizability in the study summary; the intervention and outcome are specific to one exam and one ~10-minute training.
high mixed Training for Technology: Adoption and Productive Use of Gene... Generalizability/applicability to other populations and tasks
Some mechanism-specific estimates are imprecise due to the sample size; confidence intervals for those estimates are wide.
Authors report wide confidence intervals for mechanism decomposition (principal stratification) results based on the randomized sample of 164 students.
high mixed Training for Technology: Adoption and Productive Use of Gene... Precision of mechanism estimates (confidence interval width for adoption vs prod...
Performance degradation persists even when context is provided via structured semantic layers including AST-extracted function context and import graph resolution.
Experiments comparing unstructured versus structured context provision; structured semantic layers (AST context, import graph resolution) were evaluated and models still degraded with more context.
high negative SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... model detection/performance when given structured semantic context
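What "AST-extracted function context" amounts to can be sketched with Python's standard `ast` module: parse a file and recover the innermost function enclosing a changed line. A minimal illustration of the idea, not the benchmark's pipeline:

```python
import ast

def enclosing_function(source: str, lineno: int):
    """Return the source of the innermost function containing `lineno`,
    a minimal stand-in for a structured 'function context' layer.
    Not the benchmark's implementation."""
    tree = ast.parse(source)
    hit = None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.lineno <= lineno <= node.end_lineno:
                # Prefer the innermost (latest-starting) match.
                if hit is None or node.lineno > hit.lineno:
                    hit = node
    return ast.get_source_segment(source, hit) if hit else None

src = """\
def add(a, b):
    return a + b

def scale(xs, k):
    return [k * x for x in xs]
"""
print(enclosing_function(src, 5))  # prints the source of `scale`
```

The finding above is that even this kind of structured context, supplied alongside import-graph resolution, did not prevent the performance degradation.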
Models' performance degrades monotonically from diff-only (config_A) to diff+file content (config_B) to full context (config_C) across all 8 models.
Systematic ablation across three frozen context configurations (config_A, config_B, config_C) reported; all 8 evaluated models show monotonic performance decline as more context is provided.
high negative SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... model performance score across context-provision configurations
Eight frontier models detect only 15–31% of human-flagged issues on the diff-only configuration (config_A).
Empirical evaluation across 8 models on SWE-PRBench (350 PRs) under the diff-only configuration; reported detection rates of 15–31% relative to human-flagged issues.
high negative SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... detection rate of human-flagged issues
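The headline metric here is simple set recall: the share of human-flagged issues that the model's review also surfaces. A minimal sketch, matching issues by an opaque key (the benchmark's actual matching procedure is more involved):

```python
def detection_rate(human_issues, model_issues):
    """Fraction of human-flagged issues also raised by the model.
    Issues are compared as opaque keys here; realistic matching
    (file, line span, issue type) is more involved."""
    human = set(human_issues)
    if not human:
        return 0.0
    return len(human & set(model_issues)) / len(human)

human = {"f.py:12:bug", "f.py:40:style", "g.py:7:security", "g.py:9:bug"}
model = {"f.py:12:bug", "h.py:3:style"}
print(detection_rate(human, model))  # 0.25
```

On this metric the reported 15-31% range means the best evaluated model still missed more than two thirds of human-flagged issues.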
There is a growing gap between rapid experimentation with AI tools and limited organizational capability to institutionalize them in everyday workflows.
Argument supported by targeted literature synthesis and review of recent scholarly and institutional sources; no primary empirical sample reported in this paper.
high negative Behavioral Factors as Determinants of Successful Scaling of ... organizational capability to institutionalize AI initiatives (pilot-to-productio...
Evaluations across eight state-of-the-art multimodal models reveal that models achieved only 55.0% accuracy on help prediction.
Experimental evaluation reported in the paper comparing eight multimodal models on the Help Prediction task with reported accuracy metric.
high negative GUIDE: A Benchmark for Understanding and Assisting Users in ... help prediction accuracy
Evaluations across eight state-of-the-art multimodal models reveal that models achieved only 44.6% accuracy on behavior state detection.
Experimental evaluation reported in the paper comparing eight multimodal models on the Behavior State Detection task with reported accuracy metric.
high negative GUIDE: A Benchmark for Understanding and Assisting Users in ... behavior state detection accuracy
Ikema is a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old.
Demographic/descriptive claim reported in the paper's background (likely citing prior surveys or census estimates); the abstract states the ~1,300 speakers figure and age distribution.
high negative Automatic Speech Recognition for Documenting Endangered Lang... number and age distribution of speakers
The financial planning and investment management profession is undergoing a radical transformation driven by Generative AI (GenAI) and Agentic AI, creating urgent workforce displacement challenges that require coordinated government policy intervention alongside educational reform.
Author assertion in the paper's introduction/abstract; framing argument based on the paper's synthesized analysis (no empirical sample, no reported statistical test).
high negative STRENGTHENING FINANCIAL WORKFORCE COMPETITIVENESS: A CURRICU... rate of workforce displacement in the financial planning and investment manageme...