The Commonplace

Evidence (14055 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                    Positive Negative    Mixed     Null    Total
Other                           758      199      100      900     2007
Governance & Regulation         826      400      191      122     1563
Organizational Efficiency       777      193      124       84     1189
Technology Adoption Rate        635      233      124       97     1098
Research Productivity           422      128       57      336      954
Output Quality                  476      179       59       47      761
Decision Quality                328      177       81       47      640
Firm Productivity               435       57       88       20      606
AI Safety & Ethics              218      277       65       33      599
Market Structure                180      170      123       24      502
Task Allocation                 213       64       72       33      387
Skill Acquisition               170       61       61       17      309
Innovation Output               203       27       43       18      292
Employment Level                105       54      107       13      281
Fiscal & Macroeconomic          131       69       43       26      276
Consumer Welfare                117       63       42       11      233
Firm Revenue                    153       48       26        3      230
Task Completion Time            173       31        8       12      225
Inequality Measures              44      122       49        6      221
Worker Satisfaction              89       65       22       12      188
Error Rate                       69       92       10        2      173
Regulatory Compliance            77       69       14        5      165
Automation Exposure              56       56       26       13      154
Training Effectiveness           94       21       13       19      149
Wages & Compensation             77       36       25        6      144
Team Performance                 86       17       27       10      141
Developer Productivity           95       17       14        6      133
Job Displacement                 12       80       20        1      113
Hiring & Recruitment             52        7        8        3       70
Creative Output                  31       18        8        3       61
Skill Obsolescence                5       46        6        1       58
Social Protection                27       16        8        2       53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Generative search platforms are non-deterministic: the same query at different times can yield different answers and different cited domains.
Repeated-query experiments on three platforms (Perplexity Search, OpenAI SearchGPT, Google Gemini) across three consumer-product topics, using multi-day sampling (one collection per day over nine days) and high-frequency sampling (repeated queries at 10-minute intervals), showed variation in responses and cited domains across runs.
high negative Quantifying Uncertainty in AI Visibility: A Statistical Fram... response variability (changes in generated answers) and cited domains per query
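One way to put a number on "different cited domains across runs" is the average pairwise Jaccard similarity of the domain sets returned for the same query. The sketch below is illustrative only: the metric choice, the function names, and the example domains are assumptions, not the paper's method or data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all pairs of runs of the same query."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical cited domains from three runs of one query (not data from the paper).
runs = [
    {"a.example", "b.example", "c.example"},
    {"a.example", "b.example", "d.example"},
    {"a.example", "e.example", "f.example"},
]
stability = mean_pairwise_jaccard(runs)
print(f"mean pairwise Jaccard: {stability:.2f}")
```

A value of 1.0 would mean every run cited exactly the same domains; lower values indicate less stable citations.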
Performance degrades when forecasted features are removed from the downstream regression model.
Ablation study results reported in the paper which compare full FutureBoosting against variants without TSFM-generated forecasted features using the same evaluation protocols.
high negative Regression Models Meet Foundation Models: A Hybrid-AI Approa... Increase in MAE (worse forecast error) after removing forecasted features
Despite LoRA being parameter-efficient, fine-tuning and iterative human-in-the-loop workflows still require compute resources and researcher time; governance/versioning of tuned models is necessary.
Caveat stated in the paper about remaining computational and governance costs; no quantitative resource usage reported in the summary.
high negative THETA: A Textual Hybrid Embedding-based Topic Analysis Frame... compute/resource requirements and governance burden
Embedding fine-tuning (DAFT) risks amplifying domain-specific biases present in the tuning corpus, so domain experts and robust evaluation protocols are necessary.
Paper caveat noting bias-amplification risk from fine-tuning embeddings; aligns with known risks in the literature but no empirical bias audit results provided in the summary.
high negative THETA: A Textual Hybrid Embedding-based Topic Analysis Frame... amplification of biases in tuned embeddings / need for bias mitigation
Mean emotional self-alignment between poster and responder is 32.7%, indicating systematic affective mismatch rather than congruence.
Pairwise comparison of emotion labels across post–response pairs in the dataset; computation of mean percentage where poster and immediate responder share the same emotion (32.7%).
high negative What Do AI Agents Talk About? Emergent Communication Structu... percentage of post–response pairs with identical emotion labels (emotional self-...
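The listed outcome measure (percentage of post–response pairs with identical emotion labels) can be sketched as a simple match rate. This is a minimal illustration under assumed inputs: the label names and the pair list are hypothetical, and the paper may aggregate differently (for example, per thread or per emotion).

```python
def emotion_alignment_rate(pairs):
    """Fraction of (post_emotion, response_emotion) pairs with identical labels."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    matches = sum(1 for post, response in pairs if post == response)
    return matches / len(pairs)


# Hypothetical labels for three post-response pairs (not data from the paper).
pairs = [
    ("joy", "joy"),
    ("anger", "neutral"),
    ("fear", "joy"),
]
rate = emotion_alignment_rate(pairs)
print(f"emotional self-alignment: {rate:.1%}")
```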
Conversational coherence declines rapidly with thread depth, indicating shallow, weakly connected multi-turn exchanges.
Lexical-semantic coherence metrics (e.g., embedding-based similarity) computed across comment threads of varying depth in the Moltbook dataset; observed rapid decrease in coherence scores as thread depth increases.
high negative What Do AI Agents Talk About? Emergent Communication Structu... coherence (similarity) metric as a function of thread depth
When pipelines have cross-cutting ties, prices oscillate, allocation quality drops, and management becomes difficult.
Empirical simulation results from the ablation study: configurations with non-hierarchical, cross-cutting graph structures produced larger price volatility, frequent oscillations in price updates, and lower allocation value/throughput compared to hierarchical graphs (measured across many runs and random seeds within the 1,620-run experimental set).
high negative Real-Time AI Service Economy: A Framework for Agentic Comput... price volatility and oscillation frequency; allocation quality (value/throughput...
On the 22 postdating (contamination-free) incidents, no agent achieved end-to-end exploitation success in any of the 110 agent–incident pairs evaluated.
Empirical evaluation of 110 agent–incident pairs reported in the study (end-to-end exploit attempts on the 22 incidents).
high negative Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... end_to_end_exploitation_success_rate (per_agent_per_incident)
The original EVMbench had a data contamination risk because it relied on audit-contest data published before every evaluated model's release, which could have been seen during model training.
Timing relationship between the audit-contest dataset used by EVMbench and the release dates of evaluated models (dataset predated model releases).
high negative Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... dataset_contamination_risk (potential_training_data_leakage)
The original EVMbench evaluation was narrow: it evaluated 14 agent configurations and most models were tested only with their vendor-provided scaffold.
Description of the original EVMbench experimental setup (number of agent configurations and scaffold usage) cited in this study.
high negative Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... evaluation_breadth (number_of_agent_configurations; scaffold_variety)
There is a risk that NFD will overfit to individual practices and lead to privacy/IP leakage if crystallization is not carefully governed.
Limitations and risk analysis in the paper; conceptual argument and case study discussion raising privacy/IP concerns. No empirical incidence rates provided.
high negative Nurture-First Agent Development: Building Domain-Expert AI A... degree of overfitting to individual practice; instances of privacy/IP leakage
NFD requires sustained practitioner engagement and incentive alignment to be effective.
Limitations and discussion sections of the paper explicitly state this requirement; logical inference from method (human-in-the-loop commercialization and continual crystallization).
high negative Nurture-First Agent Development: Building Domain-Expert AI A... practitioner engagement/time invested
Limitations of the study include reliance on self-reported perceptions (subject to response and survivorship bias), lack of experimental/causal identification, potential non-representative sample, and cross-sectional design limiting inference about long-term productivity effects.
Authors' stated limitations in the paper summary.
high negative Artificial Intelligence as a Catalyst for Innovation in Soft... validity threats (self-report bias, lack of causal design) as reported by author...
A mathematical analysis bounds or relates expected performance loss of the surrogate to measurable distribution mismatch between the training parameter distribution (samples) and the target parameter distribution.
Theoretical derivations presented in the paper that relate performance loss to distribution mismatch; the summary states the analysis provides a measurable diagnostic for when retraining or reweighting is needed.
high negative MCMC Informed Neural Emulators for Uncertainty Quantificatio... expected performance loss (e.g., increase in predictive loss) as a function of d...
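For intuition about how a loss gap can be tied to distribution mismatch, a standard textbook inequality is shown below. It is not the paper's bound: the symbols ($\ell$ for the surrogate's loss at parameter $\theta$, $p$ for the training parameter distribution, $q$ for the target distribution, $B$ for an assumed upper bound on the loss) are introduced here for illustration only.

```latex
% Assumption: 0 <= \ell(\theta) <= B for all \theta.
\[
  \bigl|\, \mathbb{E}_{\theta \sim q}[\ell(\theta)] - \mathbb{E}_{\theta \sim p}[\ell(\theta)] \,\bigr|
  \;\le\; B \cdot \mathrm{TV}(p, q),
  \qquad
  \mathrm{TV}(p, q) = \tfrac{1}{2} \int \lvert p(\theta) - q(\theta) \rvert \, d\theta .
\]
% Reweighting identity, valid when q is absolutely continuous with respect to p:
\[
  \mathbb{E}_{\theta \sim q}[\ell(\theta)]
  = \mathbb{E}_{\theta \sim p}\!\left[ \frac{q(\theta)}{p(\theta)} \, \ell(\theta) \right].
\]
```

Under these assumptions, a small total-variation distance guarantees a small change in expected loss, while a large one signals that retraining or reweighting may be needed.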
Neural estimators are less interpretable than closed-form or equilibrium-based estimators, which matters for policy applications and audits.
Conceptual claim/caveat: reasoning about model interpretability and regulatory transparency; not an empirical measurement in the summary.
high negative ForwardFlow: Simulation only statistical inference using dee... interpretability / transparency (qualitative)
Estimator performance depends on the fidelity of the simulation model to real data; misspecified simulation-generating processes can yield misleading estimates.
Methodological caveat: conceptual argument and standard concern about simulation-based inference; no specific empirical counterexamples provided in the summary, but stated as an important limitation.
high negative ForwardFlow: Simulation only statistical inference using dee... external validity / susceptibility to model misspecification (qualitative claim ...
MSE-trained point-estimator networks do not directly provide calibrated interval estimates or valid standard errors; integrating conditional density estimators or bootstrap-calibration is needed for uncertainty quantification.
Methodological caveat: logical/statistical argument and recommendation based on the fact that training with MSE produces point estimates; no empirical demonstration in the summary, but the limitation follows from standard statistical principles.
high negative ForwardFlow: Simulation only statistical inference using dee... availability of calibrated uncertainty quantification (absence of calibrated int...
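As a rough illustration of adding an interval around a point-only estimator, the sketch below uses a plain parametric bootstrap with a percentile interval. Everything in it is an assumption for illustration: the Normal simulator, the sample-mean stand-in for a trained network, and the function names are hypothetical, and this is not the calibration procedure described in the paper.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Hypothetical simulator: n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def point_estimator(data):
    """Stand-in for an MSE-trained network: returns a point estimate only."""
    return statistics.fmean(data)


def parametric_bootstrap_interval(data, n_boot=500, level=0.95, seed=0):
    """Percentile interval from re-estimating on data simulated at the estimate."""
    rng = random.Random(seed)
    theta_hat = point_estimator(data)
    replicates = sorted(
        point_estimator(simulate(theta_hat, len(data), rng))
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return theta_hat, replicates[lo_idx], replicates[hi_idx]


# Hypothetical observed data generated at theta = 2.0.
observed = simulate(2.0, 200, random.Random(1))
theta_hat, lower, upper = parametric_bootstrap_interval(observed)
print(f"estimate {theta_hat:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

The interval is only as trustworthy as the simulator: if the simulation model is misspecified, the bootstrap inherits that error.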
Basic/minimal BSBM architectures (without ancilla modes or generalized postprocessing) are not universal generative models.
Analytical proof/argument in the paper demonstrating non-universality of the minimal BSBM architecture; theoretical reasoning about expressive limitations of the plain model family (no empirical sample size).
high negative Universality of Classically Trainable, Quantum-Deployed Boso... generative universality / expressive power (failure of universality)
Current bottlenecks are disparate quantum and classical resources operating in isolation, causing manual job orchestration, inefficient scheduling, data-movement overheads, and slow iteration that limit productivity and algorithmic exploration.
Use-case-driven analysis and observations from early hybrid deployments and literature; systems design decomposition highlighting latency and data-staging requirements; no quantitative benchmark data.
high negative Reference Architecture of a Quantum-Centric Supercomputer developer/researcher productivity, iteration latency, scheduling and data-transf...
If deployment value is the time-average for one agent, optimizing the usual expected-value objective can lead to poor real-world outcomes.
Reasoning plus the paper's illustrative example demonstrating policies with high expected reward but poor or highly variable realized time-average outcomes; theoretical exposition, no empirical dataset.
high negative Ergodicity in reinforcement learning realized long-run (time-average) reward of deployed agent
Optimizing the expected cumulative reward (ensemble average across trajectories) can be misleading when reward-generating dynamics are non-ergodic because the ensemble expectation does not generally equal the time-average experienced by a single deployed agent.
Theoretical argumentation and a constructive illustrative example in the paper showing divergence between ensemble expectation and single-trajectory time-average; no empirical sample; analysis-based evidence.
high negative Ergodicity in reinforcement learning expected cumulative reward (ensemble expectation) vs. time-average realized rewa...
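The gap between ensemble expectation and time average can be seen in a classic multiplicative coin-flip gamble. This is a minimal sketch with hypothetical multipliers (1.5 on heads, 0.6 on tails); it is not the illustrative example from the paper.

```python
import math

# Hypothetical gamble: wealth is multiplied by UP or DOWN on a fair coin flip.
UP, DOWN = 1.5, 0.6
P_UP = 0.5

# Ensemble average: expected multiplier per round across many parallel agents.
ensemble_factor = P_UP * UP + (1 - P_UP) * DOWN

# Time average: long-run per-round growth factor for one agent,
# exp(E[log multiplier]), i.e. the geometric mean of the multipliers.
time_avg_factor = math.exp(P_UP * math.log(UP) + (1 - P_UP) * math.log(DOWN))

print(f"ensemble growth factor per round:     {ensemble_factor:.4f}")
print(f"time-average growth factor per round: {time_avg_factor:.4f}")
```

Here the expected value grows by about 5% per round, yet a single agent's wealth shrinks by about 5% per round in the long run, so a policy that maximizes the expectation would accept a gamble that ruins almost every individual trajectory.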
A small linear spatial disadvantage requires an exponentially larger population to obtain the same probability of early discovery (scaling relation).
Analytic scaling result derived from extreme-value analysis of first-passage times in the model, with confirmation by numerical simulations (stochastic realizations; number of runs not specified). The result is internal to the theoretical model.
high negative Macroscopic Dominance from Microscopic Extremes: Symmetry Br... population size required to match probability of early discovery (or probability...
Standard RLHF expected-cost constraints ignore distributional shape and can fail under heavy tails or rare catastrophic events.
Analytic/motivating argument presented in the paper contrasting expectation-based constraints with distributional behavior; illustrative examples and discussion of heavy-tailed/rare-event failure modes (no sample-size or dataset details provided in the summary).
high negative Safe RLHF Beyond Expectation: Stochastic Dominance for Unive... safety cost distribution properties (tail probability of high-cost/unsafe rollou...
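To see why a constraint on expected cost can miss tail risk, compare two hypothetical cost samples with the same mean but very different tails. The sketch below uses conditional value-at-risk (CVaR) as a simple tail-sensitive measure; CVaR is swapped in here for illustration and is not the stochastic-dominance criterion the paper proposes, and the cost values are made up.

```python
import math


def mean(costs):
    """Arithmetic mean of a non-empty list of costs."""
    return sum(costs) / len(costs)


def cvar(costs, alpha):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of costs."""
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(costs, reverse=True)
    # round() guards against float noise such as 9.999999999999998.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


# Hypothetical safety costs over 100 rollouts for two policies.
steady = [1.0] * 100                # always a small cost
heavy_tail = [0.0] * 99 + [100.0]   # usually free, rarely catastrophic

for name, costs in [("steady", steady), ("heavy_tail", heavy_tail)]:
    print(f"{name}: mean={mean(costs):.1f}, CVaR@0.9={cvar(costs, 0.9):.1f}")
```

Both policies satisfy the same expected-cost budget, but only the tail-sensitive measure separates them.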
Improving explainability can trade off with predictive performance, privacy, and robustness; these trade-offs must be managed rather than ignored.
Review aggregates technical literature and conceptual analyses documenting trade-offs reported by researchers (e.g., simpler interpretable models sometimes having lower predictive accuracy; disclosure risks to privacy; robustness concerns). No single causal estimate provided.
high negative Explainable AI in High-Stakes Domains: Improving Trust, Tran... predictive performance, privacy risk, model robustness
The evidence base presented is limited to a single SME pilot, so generalizability across sectors, firm sizes, and data regimes is untested and requires further research.
Explicit limitation noted in the paper and the fact that the pilot illustrated is a single case study (sample size = 1 SME pilot).
high negative ALGORITHM FOR IMPLEMENTING AI IN THE MANAGEMENT LOOP OF SMES... external validity / generalizability of results beyond the single pilot
Tasks that are routine, repetitive, or pattern‑based (e.g., boilerplate coding, refactoring, unit test generation, some accessibility fixes) will be increasingly automated by AI.
Task‑level decomposition and examples of current automation capabilities (code generation, test suggestion tools); conceptual projection rather than empirical measurement.
high negative How AI Will Transform the Daily Life of a Techie within 5 Ye... rate of automation for routine software development tasks (proportion of such ta...
Common barriers to effective RM implementation include siloed functions/weak coordination, limited resources or expertise, poor data quality/lack of metrics, and cultural resistance driven by short-term incentives.
Frequent identification of these barriers across the reviewed literature and practitioner sources synthesized via thematic analysis over the last ten years.
high negative The Role of Risk Management as an Organizational Management ... barriers to RM adoption/implementation; likelihood of successful RM
Hierarchy compresses: fewer organizational layers are needed for a given firm output as coordination costs fall.
Analytical proposition in the theoretical model and simulation results showing reduced number of layers under coordination compression.
high negative AI as Coordination-Compressing Capital: Task Reallocation, O... number of hierarchical layers per firm
Global median post-harvest losses are around 19.8% (FAO & Kaggle datasets).
Descriptive statistics cited from FAO and Kaggle datasets referenced in the paper for global context.
high negative AI in food inequality: Leveraging artificial intelligence to... post-harvest loss (percent, global median)
A one standard-deviation increase in AI adoption (2019–2025, 38 OECD countries) causally reduces employment in routine cognitive occupations by 2.3%.
Panel of 38 OECD countries, 2019–2025; AI Adoption Index (composite of enterprise AI investment, AI patent filings, workforce/firm AI-use surveys); instrumental-variable (IV) estimation to identify causal effect on occupational employment; country and year fixed effects and macro controls reported.
high negative Artificial Intelligence and Labor Market Transformation: Emp... Employment in routine cognitive occupations (percent change per 1 SD increase in...
Higher measured GDP need not imply higher aggregate welfare: the private costs of the arms race can outweigh the market gains from increased output.
Welfare comparisons performed in the model showing parameter regions where private equilibrium raises GDP but reduces aggregate welfare once investment costs are included.
high negative Janus-Faced Technological Progress and the Arms Race in the ... aggregate welfare (utility/net social surplus)
Because private incentives push agents toward tail outcomes, aggregate overinvestment occurs relative to the social optimum (the arms race is inefficient).
Welfare calculations and comparison of private vs social optima within the model; the paper shows private equilibrium investment exceeds the socially optimal investment given the externalities of the arms race.
high negative Janus-Faced Technological Progress and the Arms Race in the ... aggregate welfare (social welfare loss due to overinvestment)
Upfront costs for AI adoption are substantial: development, clinical validation, regulatory compliance, EHR integration, and ongoing monitoring.
Implementation and regulatory literature synthesized in the review documenting typical cost categories and reported expenditures for clinical AI projects.
high negative Will AI Replace Physicians in the Near Future? AI Adoption B... fixed and recurring implementation costs
Large language models (LLMs) suffer from hallucinations (fabricated facts), overconfidence, and unpredictable failure modes in open-ended tasks.
Technical papers and benchmarks on LLM factuality, calibration, and failure modes summarized in the review; empirical evaluations showing instances of fabricated outputs and calibration issues.
high negative Will AI Replace Physicians in the Near Future? AI Adoption B... factual accuracy of outputs; calibration (confidence vs accuracy); failure rate ...
Contemporary AI systems have no capacity for physical examination, sensorimotor procedures, or direct patient-contact diagnostics.
Technical limitations of CNNs and LLMs described in literature (lack of embodiment, no sensorimotor capabilities) and absence of credible empirical demonstrations of safe autonomous physical clinical procedures in reviewed studies.
high negative Will AI Replace Physicians in the Near Future? AI Adoption B... ability to perform physical exam / procedural tasks / direct patient-contact dia...
Current models exhibit poor out-of-distribution (OOD) generalization: performance degrades when inputs differ from training distributions.
Technical literature and robustness/domain-shift research reviewed in the paper documenting declines in model accuracy under domain shift and dataset changes.
high negative Will AI Replace Physicians in the Near Future? AI Adoption B... model accuracy/performance under domain shift / OOD inputs
High upfront costs and lack of tailored financing instruments are significant financial constraints on SME AI adoption.
Case studies, finance sector reports, and SME surveys cited in the review showing cost barriers and financing gaps; evidence descriptive rather than causal.
high negative Artificial Intelligence Adoption for Sustainable Development... upfront investment costs; access to tailored finance; adoption rates
Infrastructure deficits (unreliable power, inadequate broadband, limited local compute) materially constrain AI uptake by SMEs.
Policy reports and empirical studies in the literature documenting infrastructural limitations in LMIC contexts (including Botswana) that impede digital and AI deployment.
high negative Artificial Intelligence Adoption for Sustainable Development... infrastructure adequacy metrics (power reliability, broadband access); AI adopti...
Skills shortages (AI literacy, data science, digital management) are a primary constraint on SME AI adoption in developing economies.
Consistent findings across surveys, interviews, and case studies in the reviewed literature highlighting skill gaps as a common barrier; authors note multiple empirical sources pointing to this constraint.
high negative Artificial Intelligence Adoption for Sustainable Development... availability of AI-relevant skills; reported skills constraints limiting adoptio...
Heterogeneity in study designs and contexts within the literature limits direct comparability and generalizability of findings.
Limitation noted in the paper based on the authors' assessment of diversity across the 103 reviewed studies (varying methods, contexts, metrics).
high negative Models, applications, and limitations of the responsible ado... comparability/generalizability of evidence across studies
Institutional inertia, fragmented governance structures, limited technical capacity, and weak data stewardship impede scale‑up of AI systems in the public sector.
Thematic synthesis of barriers reported across empirical studies and institutional reports within the systematic review (103 items).
high negative Models, applications, and limitations of the responsible ado... ability to scale AI systems / scale‑up rate
Low‑ and middle‑income contexts face persistent gaps—infrastructure, data ecosystems, and talent retention—that slow AI adoption in public governance.
Consistent findings across multiple studies in the 103‑item corpus reporting infrastructure deficits, weak data ecosystems, and brain drain/retention issues in LMIC settings.
high negative Models, applications, and limitations of the responsible ado... rate/extent of AI adoption in public governance in low- and middle‑income contex...
On-Premise RAG requires internal technical capabilities (MLOps, infrastructure engineers) to maintain and update the system.
Organizational evaluation and implementation discussion noting operational responsibilities and skill requirements for on-prem deployment.
high negative An Empirical Study on the Feasibility Analysis of On-Premise... need for technical staff / internal capabilities (MLOps, infra)
On-Premise RAG incurs higher latency compared with cloud RAG.
Technology evaluations included measured system latency comparisons between architectures; exact latency values and statistical details not provided in summary.
high negative An Empirical Study on the Feasibility Analysis of On-Premise... system latency (response time)
On-Premise RAG requires upfront capital expenditure (hardware) and ongoing maintenance (operations, model updates, staff).
Organizational evaluations / cost accounting and implementation discussion indicating hardware, operations, and personnel requirements for on-prem deployment; specific cost figures not provided in summary.
high negative An Empirical Study on the Feasibility Analysis of On-Premise... upfront capital expenditure and ongoing maintenance costs and staffing needs
The January 2026 DoD AI Strategy memorandum establishes a Barrier Removal Board that provides expanded authority to waive established governance controls.
Primary source analysis: close reading of the Department of Defense January 2026 AI Strategy memorandum and related policy text (policy language describing the Barrier Removal Board and its waiver authorities). No sample size required; based on document text.
high negative FEATURE COMMENT: Governance as a "Blocker": How the Pentagon... existence and authority of the Barrier Removal Board (waiver authority over gove...
Risks include bias and discrimination, opacity in decision-making, privacy and cybersecurity threats, liability gaps, and uneven distribution of benefits that can exacerbate inequality.
Compilation from academic and policy literature, regulatory gap analyses, and examples of problematic AI use cases identified in the report's sectoral review.
high negative AI Governance and Data Privacy: Comparative Analysis of U.S.... bias/discrimination incidents, decision-making opacity, privacy/cybersecurity in...
AI creates significant ethical, legal and distributional risks.
Review of policy documents, academic and policy literature, and documented examples of AI deployment across multiple sectors highlighting harms (bias, privacy breaches, liability gaps, unequal benefits).
high negative AI Governance and Data Privacy: Comparative Analysis of U.S.... ethical risks, legal gaps, and distributional outcomes (inequality)
Except for the EU, jurisdictions surveyed generally lack AI-specific energy-disclosure requirements.
Comparative analysis across eleven jurisdictions identifying presence/absence of AI-specific energy disclosure rules; EU singled out as having such requirements.
high negative The Global Landscape of Environmental AI Regulation: From th... existence of AI-specific energy disclosure rules (binary presence/absence by jur...
Regulatory regimes in the surveyed jurisdictions focus on training emissions more than on inference-phase energy consumption.
Regulatory mapping and lifecycle-phase analysis showing which phases (training vs inference) are covered by existing rules in the eleven jurisdictions.
high negative The Global Landscape of Environmental AI Regulation: From th... regulated lifecycle phase (training coverage vs inference coverage)