Evidence (14055 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Generative search platforms are non-deterministic: the same query at different times can yield different answers and different cited domains.
Repeated-query experiments on three platforms (Perplexity Search, OpenAI SearchGPT, Google Gemini) across three consumer-product topics, using multi-day sampling (one collection per day over nine days) and high-frequency sampling (repeated queries at 10-minute intervals), showed variation in responses and cited domains across runs.
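One way to quantify the run-to-run variation in cited domains described above is the mean pairwise Jaccard overlap between the domain sets returned by repeated runs of the same query. This is a minimal sketch under stated assumptions: the summary does not say which stability metric the study used, and the function names and domain values below are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs: List[Set[str]]) -> float:
    """Average Jaccard overlap over every pair of runs of one query."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical cited-domain sets from three runs of one query
# (illustrative values, not data from the study).
runs = [
    {"a.example", "b.example", "c.example"},
    {"a.example", "b.example", "d.example"},
    {"a.example", "e.example", "f.example"},
]

stability = mean_pairwise_jaccard(runs)
print(f"mean pairwise Jaccard overlap: {stability:.2f}")
```

A value near 1 would indicate that repeated runs cite nearly the same domains; a value near 0 would indicate that the cited sources change almost completely between runs.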
Performance degrades when forecasted features are removed from the downstream regression model.
Ablation results reported in the paper, comparing full FutureBoosting against variants without TSFM-generated forecasted features under the same evaluation protocols.
Despite LoRA being parameter-efficient, fine-tuning and iterative human-in-the-loop workflows still require compute resources and researcher time; governance/versioning of tuned models is necessary.
Caveat stated in the paper about remaining computational and governance costs; no quantitative resource usage reported in the summary.
Embedding fine-tuning (DAFT) risks amplifying domain-specific biases present in the tuning corpus, so domain experts and robust evaluation protocols are necessary.
Paper caveat noting bias-amplification risk from fine-tuning embeddings; aligns with known risks in the literature but no empirical bias audit results provided in the summary.
Mean emotional self-alignment between poster and responder is 32.7%, indicating systematic affective mismatch rather than congruence.
Pairwise comparison of emotion labels across post–response pairs in the dataset; computation of mean percentage where poster and immediate responder share the same emotion (32.7%).
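The pairwise computation described above can be sketched as follows. This assumes each post-response pair carries a single emotion label per side; the function name and the example labels are hypothetical and are not taken from the dataset.

```python
from typing import Sequence, Tuple


def self_alignment_rate(pairs: Sequence[Tuple[str, str]]) -> float:
    """Percentage of (poster, responder) pairs with the same emotion label."""
    if not pairs:
        raise ValueError("need at least one post-response pair")
    matches = sum(1 for poster, responder in pairs if poster == responder)
    return 100.0 * matches / len(pairs)


# Hypothetical labels for three post-response pairs (not study data).
example_pairs = [
    ("joy", "joy"),
    ("anger", "sadness"),
    ("fear", "joy"),
]

rate = self_alignment_rate(example_pairs)
print(f"emotional self-alignment: {rate:.1f}%")
```

In this toy input one pair out of three matches, so the rate is one third. The reported 32.7% would correspond to roughly one matching pair in three across the full dataset.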
Conversational coherence declines rapidly with thread depth, indicating shallow, weakly connected multi-turn exchanges.
Lexical-semantic coherence metrics (e.g., embedding-based similarity) computed across comment threads of varying depth in the Moltbook dataset; observed rapid decrease in coherence scores as thread depth increases.
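An embedding-based coherence score of the kind mentioned above might be computed as the cosine similarity between each comment and its parent, averaged per thread depth. The sketch below is illustrative only: the summary does not specify the embedding model or the exact metric, and the two-dimensional vectors and function names are hypothetical stand-ins for real sentence embeddings.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence_by_depth(
    comments: List[Tuple[int, Vector, Vector]]
) -> Dict[int, float]:
    """Mean parent-child cosine similarity for each thread depth.

    Each item is (depth, parent_embedding, comment_embedding).
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for depth, parent_vec, child_vec in comments:
        buckets[depth].append(cosine(parent_vec, child_vec))
    return {d: sum(s) / len(s) for d, s in sorted(buckets.items())}


# Toy embeddings (hypothetical): the depth-1 reply matches its parent,
# the depth-2 reply is orthogonal to its parent.
toy = [
    (1, [1.0, 0.0], [1.0, 0.0]),
    (2, [1.0, 0.0], [0.0, 1.0]),
]
print(coherence_by_depth(toy))
```

A declining profile across depths, as in the toy input, is the pattern the claim describes.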
When pipelines have cross-cutting ties, prices oscillate, allocation quality drops, and management becomes difficult.
Empirical simulation results from the ablation study: configurations with non-hierarchical, cross-cutting graph structures produced larger price volatility, frequent oscillations in price updates, and lower allocation value/throughput compared to hierarchical graphs (measured across many runs and random seeds within the 1,620-run experimental set).
On the 22 postdating (contamination-free) incidents, no agent achieved end-to-end exploitation success across all 110 agent–incident pairs evaluated.
Empirical evaluation of 110 agent–incident pairs reported in the study (end-to-end exploit attempts on the 22 incidents).
The original EVMbench had a data contamination risk because it relied on audit-contest data published before every evaluated model's release, which could have been seen during model training.
Timing relationship between the audit-contest dataset used by EVMbench and the release dates of evaluated models (dataset predated model releases).
The original EVMbench evaluation was narrow: it evaluated 14 agent configurations and most models were tested only with their vendor-provided scaffold.
Description of the original EVMbench experimental setup (number of agent configurations and scaffold usage) cited in this study.
There is a risk that NFD will overfit to individual practices and lead to privacy/IP leakage if crystallization is not carefully governed.
Limitations and risk analysis in the paper; conceptual argument and case study discussion raising privacy/IP concerns. No empirical incidence rates provided.
NFD requires sustained practitioner engagement and incentive alignment to be effective.
Limitations and discussion sections of the paper explicitly state this requirement; logical inference from method (human-in-the-loop commercialization and continual crystallization).
Limitations of the study include reliance on self-reported perceptions (subject to response and survivorship bias), lack of experimental/causal identification, potential non-representative sample, and cross-sectional design limiting inference about long-term productivity effects.
Authors' stated limitations in the paper summary.
A mathematical analysis bounds or relates expected performance loss of the surrogate to measurable distribution mismatch between the training parameter distribution (samples) and the target parameter distribution.
Theoretical derivations presented in the paper that relate performance loss to distribution mismatch; the summary states the analysis provides a measurable diagnostic for when retraining or reweighting is needed.
Neural estimators are less interpretable than closed-form or equilibrium-based estimators, which matters for policy applications and audits.
Conceptual claim/caveat: reasoning about model interpretability and regulatory transparency; not an empirical measurement in the summary.
Estimator performance depends on the fidelity of the simulation model to real data; misspecified simulation-generating processes can yield misleading estimates.
Methodological caveat: conceptual argument and standard concern about simulation-based inference; no specific empirical counterexamples provided in the summary, but stated as an important limitation.
MSE-trained point-estimator networks do not directly provide calibrated interval estimates or valid standard errors; integrating conditional density estimators or bootstrap-calibration is needed for uncertainty quantification.
Methodological caveat: logical/statistical argument and recommendation based on the fact that training with MSE produces point estimates; no empirical demonstration in the summary, but the limitation follows from standard statistical principles.
Basic/minimal BSBM architectures (without ancilla modes or generalized postprocessing) are not universal generative models.
Analytical proof/argument in the paper demonstrating non-universality of the minimal BSBM architecture; theoretical reasoning about expressive limitations of the plain model family (no empirical sample size).
Current bottlenecks are disparate quantum and classical resources operating in isolation, causing manual job orchestration, inefficient scheduling, data-movement overheads, and slow iteration that limit productivity and algorithmic exploration.
Use-case-driven analysis and observations from early hybrid deployments and literature; systems design decomposition highlighting latency and data-staging requirements; no quantitative benchmark data.
If deployment value is the time-average for one agent, optimizing the usual expected-value objective can lead to poor real-world outcomes.
Reasoning plus the paper's illustrative example demonstrating policies with high expected reward but poor or highly variable realized time-average outcomes; theoretical exposition, no empirical dataset.
Optimizing the expected cumulative reward (ensemble average across trajectories) can be misleading when reward-generating dynamics are non-ergodic because the ensemble expectation does not generally equal the time-average experienced by a single deployed agent.
Theoretical argumentation and a constructive illustrative example in the paper showing divergence between ensemble expectation and single-trajectory time-average; no empirical sample; analysis-based evidence.
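The gap between ensemble expectation and single-trajectory time-average can be illustrated with a standard multiplicative gamble. This is a sketch under stated assumptions and is not the paper's own example: the multipliers 1.5 and 0.6, the round count, and the function names are hypothetical choices.

```python
import math
import random
import statistics

# Hypothetical gamble: each round multiplies wealth by UP or DOWN,
# each with probability 0.5.
UP, DOWN = 1.5, 0.6

# Ensemble average: expected multiplier per round (about 1.05, so > 1).
ensemble_growth = 0.5 * UP + 0.5 * DOWN

# Time average: long-run per-round growth factor along one trajectory,
# exp(E[log multiplier]) = sqrt(UP * DOWN) = sqrt(0.9), so < 1.
time_avg_growth = math.exp(0.5 * math.log(UP) + 0.5 * math.log(DOWN))


def simulate(rounds: int, seed: int) -> float:
    """Final wealth of one agent starting from 1.0."""
    rng = random.Random(seed)
    wealth = 1.0
    for _ in range(rounds):
        wealth *= UP if rng.random() < 0.5 else DOWN
    return wealth


finals = [simulate(rounds=200, seed=s) for s in range(1001)]
median_wealth = statistics.median(finals)

print(f"ensemble growth per round: {ensemble_growth:.4f}")
print(f"time-average growth per round: {time_avg_growth:.4f}")
print(f"median final wealth after 200 rounds: {median_wealth:.3g}")
```

The expected multiplier exceeds 1 while the typical trajectory shrinks, which is the sense in which a policy tuned to the expected-value objective can look good on paper and still perform poorly for a single deployed agent.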
A small linear spatial disadvantage requires an exponentially larger population to obtain the same probability of early discovery (scaling relation).
Analytic scaling result derived from extreme-value analysis of first-passage times in the model, with confirmation by numerical simulations (stochastic realizations; number of runs not specified). The result is internal to the theoretical model.
Standard RLHF expected-cost constraints ignore distributional shape and can fail under heavy tails or rare catastrophic events.
Analytic/motivating argument presented in the paper contrasting expectation-based constraints with distributional behavior; illustrative examples and discussion of heavy-tailed/rare-event failure modes (no sample-size or dataset details provided in the summary).
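To see why an expectation-based constraint can miss tail risk, compare two cost distributions with the same mean but different tails. The sketch below uses empirical conditional value-at-risk (CVaR) as one example of a tail-sensitive measure; this is an assumption for illustration, since the summary does not state which distributional constraint the paper proposes, and the cost values and function names are hypothetical.

```python
import math
from typing import Sequence


def mean_cost(costs: Sequence[float]) -> float:
    """Plain expected cost, as used by an expectation-based constraint."""
    return sum(costs) / len(costs)


def cvar(costs: Sequence[float], alpha: float = 0.95) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of costs."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(costs)
    # round() guards against floating-point noise such as 5.000000000000004
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)


# Hypothetical per-episode costs (not data from the paper).
benign = [1.0] * 100             # every episode costs 1
heavy = [0.0] * 99 + [100.0]     # one rare catastrophic episode

for name, costs in (("benign", benign), ("heavy-tailed", heavy)):
    print(name, "mean =", mean_cost(costs), "CVaR(0.95) =", cvar(costs))
```

Both policies satisfy the same expected-cost budget of 1.0, but the tail measure separates them: the worst 5% of episodes average 1.0 in the benign case and 20.0 in the heavy-tailed case.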
Improving explainability can trade off with predictive performance, privacy, and robustness; these trade-offs must be managed rather than ignored.
Review aggregates technical literature and conceptual analyses documenting trade-offs reported by researchers (e.g., simpler interpretable models sometimes having lower predictive accuracy; disclosure risks to privacy; robustness concerns). No single causal estimate provided.
The evidence base presented is limited to a single SME pilot, so generalizability across sectors, firm sizes, and data regimes is untested and requires further research.
Explicit limitation noted in the paper and the fact that the pilot illustrated is a single case study (sample size = 1 SME pilot).
Tasks that are routine, repetitive, or pattern‑based (e.g., boilerplate coding, refactoring, unit test generation, some accessibility fixes) will be increasingly automated by AI.
Task‑level decomposition and examples of current automation capabilities (code generation, test suggestion tools); conceptual projection rather than empirical measurement.
Common barriers to effective RM implementation include siloed functions/weak coordination, limited resources or expertise, poor data quality/lack of metrics, and cultural resistance driven by short-term incentives.
These barriers are identified frequently across literature and practitioner sources from the last ten years, synthesized via thematic analysis.
Hierarchy compresses: fewer organizational layers are needed for a given firm output as coordination costs fall.
Analytical proposition in the theoretical model and simulation results showing reduced number of layers under coordination compression.
Global median post-harvest losses are around 19.8% (FAO & Kaggle datasets).
Descriptive statistics cited from FAO and Kaggle datasets referenced in the paper for global context.
A one standard-deviation increase in AI adoption (2019–2025, 38 OECD countries) causally reduces employment in routine cognitive occupations by 2.3%.
Panel of 38 OECD countries, 2019–2025; AI Adoption Index (composite of enterprise AI investment, AI patent filings, workforce/firm AI-use surveys); instrumental-variable (IV) estimation to identify causal effect on occupational employment; country and year fixed effects and macro controls reported.
Higher measured GDP need not imply higher aggregate welfare: the private costs of the arms race can outweigh the market gains from increased output.
Welfare comparisons performed in the model showing parameter regions where private equilibrium raises GDP but reduces aggregate welfare once investment costs are included.
Because private incentives push agents toward tail outcomes, aggregate overinvestment occurs relative to the social optimum (the arms race is inefficient).
Welfare calculations and comparison of private vs social optima within the model; the paper shows private equilibrium investment exceeds the socially optimal investment given the externalities of the arms race.
Upfront costs for AI adoption are substantial: development, clinical validation, regulatory compliance, EHR integration, and ongoing monitoring.
Implementation and regulatory literature synthesized in the review documenting typical cost categories and reported expenditures for clinical AI projects.
Large language models (LLMs) suffer from hallucinations (fabricated facts), overconfidence, and unpredictable failure modes in open-ended tasks.
Technical papers and benchmarks on LLM factuality, calibration, and failure modes summarized in the review; empirical evaluations showing instances of fabricated outputs and calibration issues.
Contemporary AI systems have no capacity for physical examination, sensorimotor procedures, or direct patient-contact diagnostics.
Technical limitations of CNNs and LLMs described in literature (lack of embodiment, no sensorimotor capabilities) and absence of credible empirical demonstrations of safe autonomous physical clinical procedures in reviewed studies.
Current models exhibit poor out-of-distribution (OOD) generalization: performance degrades when inputs differ from training distributions.
Technical literature and robustness/domain-shift research reviewed in the paper documenting declines in model accuracy under domain shift and dataset changes.
High upfront costs and lack of tailored financing instruments are significant financial constraints on SME AI adoption.
Case studies, finance sector reports, and SME surveys cited in the review showing cost barriers and financing gaps; evidence descriptive rather than causal.
Infrastructure deficits (unreliable power, inadequate broadband, limited local compute) materially constrain AI uptake by SMEs.
Policy reports and empirical studies in the literature documenting infrastructural limitations in LMIC contexts (including Botswana) that impede digital and AI deployment.
Skills shortages (AI literacy, data science, digital management) are a primary constraint on SME AI adoption in developing economies.
Consistent findings across surveys, interviews, and case studies in the reviewed literature highlighting skill gaps as a common barrier; authors note multiple empirical sources pointing to this constraint.
Heterogeneity in study designs and contexts within the literature limits direct comparability and generalizability of findings.
Limitation noted in the paper based on the authors' assessment of diversity across the 103 reviewed studies (varying methods, contexts, metrics).
Institutional inertia, fragmented governance structures, limited technical capacity, and weak data stewardship impede scale‑up of AI systems in the public sector.
Thematic synthesis of barriers reported across empirical studies and institutional reports within the systematic review (103 items).
Low‑ and middle‑income contexts face persistent gaps in infrastructure, data ecosystems, and talent retention that slow AI adoption in public governance.
Consistent findings across multiple studies in the 103‑item corpus reporting infrastructure deficits, weak data ecosystems, and brain drain/retention issues in LMIC settings.
On-Premise RAG requires internal technical capabilities (MLOps, infrastructure engineers) to maintain and update the system.
Organizational evaluation and implementation discussion noting operational responsibilities and skill requirements for on-prem deployment.
On-Premise RAG incurs higher latency compared with cloud RAG.
Technology evaluations included measured system latency comparisons between architectures; exact latency values and statistical details not provided in summary.
On-Premise RAG requires upfront capital expenditure (hardware) and ongoing maintenance (operations, model updates, staff).
Organizational evaluations / cost accounting and implementation discussion indicating hardware, operations, and personnel requirements for on-prem deployment; specific cost figures not provided in summary.
The January 2026 DoD AI Strategy memorandum establishes a Barrier Removal Board that provides expanded authority to waive established governance controls.
Primary source analysis: close reading of the Department of Defense January 2026 AI Strategy memorandum and related policy text (policy language describing the Barrier Removal Board and its waiver authorities). No sample size required; based on document text.
Risks include bias and discrimination, opacity in decision-making, privacy and cybersecurity threats, liability gaps, and uneven distribution of benefits that can exacerbate inequality.
Compilation from academic and policy literature, regulatory gap analyses, and examples of problematic AI use cases identified in the report's sectoral review.
AI creates significant ethical, legal and distributional risks.
Review of policy documents, academic and policy literature, and documented examples of AI deployment across multiple sectors highlighting harms (bias, privacy breaches, liability gaps, unequal benefits).
Except for the EU, jurisdictions surveyed generally lack AI-specific energy-disclosure requirements.
Comparative analysis across eleven jurisdictions identifying presence/absence of AI-specific energy disclosure rules; EU singled out as having such requirements.
Regulatory regimes in the surveyed jurisdictions focus on training emissions more than on inference-phase energy consumption.
Regulatory mapping and lifecycle-phase analysis showing which phases (training vs inference) are covered by existing rules in the eleven jurisdictions.