Evidence (14055 claims)
Adoption (8570 claims)
Productivity (7631 claims)
Governance (6869 claims)
Human-AI Collaboration (6491 claims)
Org Design (4175 claims)
Innovation (4114 claims)
Labor Markets (3566 claims)
Skills & Training (2966 claims)
Inequality (2066 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
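One way to read the matrix is to convert counts into shares of each row's total. The sketch below does that for three rows copied from the table. It also reports how many claims in the Total column are not covered by the four direction columns, since in some rows the Total exceeds their sum (for example, Other lists 1957 claims across the four columns against a Total of 2007). The source does not explain the difference, so the code only reports it. The function name `summarize` and the field name `unlisted` are illustrative choices, not part of the source.

```python
# Selected rows copied from the Evidence Matrix:
# (positive, negative, mixed, null, total).
ROWS = {
    "Other": (758, 199, 100, 900, 2007),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def summarize(row):
    """Return direction shares of the row total and the count of claims
    not covered by the four direction columns."""
    positive, negative, mixed, null, total = row
    listed = positive + negative + mixed + null
    return {
        "positive_share": positive / total,
        "negative_share": negative / total,
        "unlisted": total - listed,  # cause not stated in the source
    }


for outcome, row in ROWS.items():
    s = summarize(row)
    print(
        f"{outcome}: positive {s['positive_share']:.1%}, "
        f"negative {s['negative_share']:.1%}, unlisted {s['unlisted']}"
    )
```

Shares are taken against the reported Total rather than the sum of the four columns, so they stay comparable with the figures shown in the table.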
Misleading believability: LLM outputs may look plausible but be incorrect or unrepresentative, risking overconfidence in synthetic data.
Instances reported in the literature, organized into a failure taxonomy, of plausible-looking but inaccurate synthetic responses.
Distortions: LLM outputs can exhibit systematic biases relative to target human distributions.
Empirical findings across reviewed studies showing output distributions from LLMs that deviate from human sample distributions; aggregated in the distortions failure category.
Cognitive misalignments: LLMs differ from humans in reasoning, goals, and bounded rationality, which can alter behavior in economic and strategic tasks.
Multiple studies in the review reported systematic differences in reasoning and goal-directed behavior when comparing LLM outputs to human participants; coded under the cognitive misalignment category.
Major failure modes limiting synthetic participants as direct substitutes for humans are: cognitive misalignments, distortions, misleading believability, and overfitting/contamination.
Standardized taxonomy developed by coding the 182 studies into generalizable indicators and organizing failure types into four categories.
The information-theoretic uncertainty measure provides a mechanism-level explanation for why deception value falls as transparency increases (residual uncertainty explains utility changes).
Analytical linkage in the model connecting the entropy-like residual uncertainty metric to equilibrium utility changes; theoretical argument and derivation in the paper.
The value of deception falls as the true system state becomes more observable; this decline is quantifiable via the price-of-transparency metric.
Analytical definition of price of transparency as marginal change and supporting theoretical results; computational experiments that sweep observability/transparency levels (simulated experiments, parameter sweeps; number of scenarios not specified).
The paper derives closed-form bounds and break-even conditions that delineate when deception is ineffective due to cost or detectability.
Theoretical proofs and closed-form inequalities presented in the analytical section (derivations of bounds and break-even conditions).
No evaluated program reported Kirkpatrick‑Barr level‑4 outcomes (organizational change, patient outcomes, or sustained metacognitive mastery).
Reviewers mapped reported outcomes from all 27 included programs and found none that demonstrated organizational-level impacts or patient‑level outcomes (level 4).
Because the design is cross-sectional and sampling purposive/geographically constrained, causal inference and generalizability are limited.
Authors' stated limitations in the summary: cross-sectional design and purposive, geographically constrained sample (Karnataka, India).
Workplace stress is associated with lower employee retention.
PLS-SEM analysis on a cross-sectional survey of N = 350 pharmaceutical workers in Karnataka, India (purposive sampling). Reported direct path: Stress → Retention, β = 0.321, p < 0.001. (Note: the paper interprets this as stress reducing retention; sign/coding conventions of the variables are not detailed in the summary.)
Regulatory uncertainty around blockchain/DeFi for corporate finance and cross-border data rules is a material risk to adoption.
Paper notes regulatory uncertainty as a risk; no jurisdictional legal analysis or compliance case studies provided in the summary.
Cybersecurity and data-privacy concerns arise from cloud provider centralization versus blockchain transparency.
Paper highlights this trade-off in its challenges section; discussion-based evidence rather than quantified security assessment in the summary.
Integration complexity with legacy ERPs and heterogeneous vendor ecosystems is a significant implementation challenge.
Paper lists this as a challenge/limitation based on pilot experience and analysis. No quantified measure of integration effort is provided in the summary.
EPC projects feature milestone-based payments, complex stakeholder flows, and large working-capital needs that strain traditional on-premise ERPs.
Problem context statement presented in the paper; consistent with commonly reported characteristics of EPC projects. The summary does not cite empirical industry-wide data.
If deployed without mitigation, GenAI CDS risks widening disparities by performing worse on underrepresented groups or being unequally distributed across resource-rich versus resource-poor settings.
Fairness literature, subgroup performance concerns, and distributional risk analysis cited in the paper; direct empirical demonstrations of widened disparities due to GenAI CDS are limited in the literature per the paper.
Limited public datasets and vendor lock-in constrain independent reproducible evaluations and audits of current generative models in healthcare.
Observation and policy analysis in the paper noting scarcity of public clinical datasets for state-of-the-art models and proprietary constraints; no dataset counts provided.
GenAI CDS creates data privacy and security risks because of high-value medical data and use of external cloud services.
Known cybersecurity risks and documented incidents in health IT; the paper cites the general risk context rather than specific breach sample counts tied to GenAI deployments.
GenAI CDS can amplify bias and inequities if training data underrepresent groups or reflect historical disparities.
Fairness and robustness audit literature and subgroup performance analyses referenced in the paper; specific empirical demonstrations for contemporary GenAI CDS are limited and sample sizes not given.
GenAI CDS systems hallucinate and can produce incorrect but plausible recommendations, which can cause patient harm if trusted unchecked.
Documented failure modes of generative models and examples from controlled evaluations; the paper references known hallucination behavior from model audits and case reports, though it does not quantify incidence rates or provide large-scale observational harm data.
Reproducibility and deployment gaps are widespread: missing code, inconsistent benchmarks, and insufficient productionization focus (monitoring, model updates, rollback).
Surveyed literature often lacks released code and consistent benchmarks; thematic analysis highlights absence of operational deployment practices.
Common ML pipeline pitfalls include overfitting, poor cross-validation practices, lack of real-time/online evaluation, and inadequate feature engineering.
Critical assessment of experimental practices in the surveyed literature identifying methodological shortcomings that can inflate reported performance.
There is a lack of large, labeled, realistic IoT datasets; class imbalance, concept drift, dataset bias, and synthetic datasets that poorly reflect real traffic are common problems.
Review of datasets (N-BaIoT, Bot-IoT, TON_IoT, UNSW-NB15, KDD variants, custom/synthetic datasets) and critical assessment of their limitations across studies.
Resource constraints (limited CPU, memory, energy, and network bandwidth on devices and edge nodes) significantly limit feasible ML model complexity and deployment choices.
Multiple surveyed studies report hardware constraints and evaluate runtime/memory/latency; survey synthesizes these resource limitations as a recurring challenge.
Despite high reported detection accuracies in academic work, there is a shortage of production-grade, deployable ML-IDS for IoT.
Critical review of surveyed papers showing many report lab metrics but few report deployment case studies, production rollouts, or provide deployment artifacts (code, runtime/energy measurements).
Limitations of the review include restricted sample size, Scopus-only coverage, emergent-literature timeframe, and heterogeneity in study designs and measures, which constrain generalizability.
Authors' limitations subsection explicitly listing these constraints from their SLR process.
There has been insufficient attention in the literature to ethics, fairness, and consumer welfare in algorithmic pricing.
Persistent gap identified in the SLR: few or no included studies focused on ethics/fairness/welfare issues according to authors' coding.
Existing empirical studies on digital VBP exhibit methodological limitations, including small/limited samples, short time windows, and inconsistent measures.
Authors' methodological critique from the SLR based on assessment of study designs and measures reported in the 30 articles.
Automated compliance and credentialing systems raise governance issues (auditability, appeals mechanisms) and risk incorrect automated deregistration if not properly governed.
Governance and algorithmic-risk discussion in the paper; logical argumentation rather than case-based evidence.
The paper models career progression as a continuous function and treats certification gaps as discontinuities that impede labour-market mobility.
Mathematical/conceptual modeling described in the methods (career-progression-as-continuous-function approach); this is a modeling choice reported in the paper rather than an empirical finding.
Industrial robotization (IR) is a robust negative predictor of provincial IWE after controlling for fixed effects and covariates.
Multiple regression specifications using province and year fixed effects and control variables; the negative IR–IWE coefficient remains statistically significant across alternative model specifications (robustness checks reported in the paper).
Adoption of industrial robots substantially reduces industrial wastewater emissions (IWE) across Chinese provinces (2013–2022).
Panel data covering 30 Chinese provinces for 2013–2022 (≈300 province-year observations); fixed-effects regressions with province and year fixed effects and covariates; estimated negative coefficient on provincial IR intensity.
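The province and year fixed effects described above can be illustrated with a two-way within (demeaning) estimator. This is a minimal sketch under stated assumptions, not the paper's specification: it handles one regressor, requires a balanced panel, omits covariates and standard errors, and runs on invented toy data. The unit labels, effect sizes, and `TRUE_BETA` are hypothetical.

```python
from collections import defaultdict


def two_way_within_slope(panel):
    """Slope of y on x after removing unit and period means.

    panel maps (unit, period) -> (x, y) and must be balanced.
    """
    n = len(panel)
    unit_sum = defaultdict(lambda: [0.0, 0.0, 0])
    period_sum = defaultdict(lambda: [0.0, 0.0, 0])
    grand_x = grand_y = 0.0
    for (unit, period), (x, y) in panel.items():
        for acc in (unit_sum[unit], period_sum[period]):
            acc[0] += x
            acc[1] += y
            acc[2] += 1
        grand_x += x
        grand_y += y
    grand_x /= n
    grand_y /= n

    sxy = sxx = 0.0
    for (unit, period), (x, y) in panel.items():
        ux, uy, un = unit_sum[unit]
        px, py, pn = period_sum[period]
        # Two-way demeaning: subtract unit mean and period mean, add grand mean.
        xd = x - ux / un - px / pn + grand_x
        yd = y - uy / un - py / pn + grand_y
        sxy += xd * yd
        sxx += xd * xd
    return sxy / sxx


# Invented toy panel: 3 units x 4 years, noise-free, slope fixed in advance.
TRUE_BETA = -0.5
unit_effect = {"A": 10.0, "B": 4.0, "C": 7.0}
year_effect = {2013: 0.0, 2014: 1.5, 2015: 0.5, 2016: 2.0}
panel = {}
for i, (unit, ue) in enumerate(unit_effect.items()):
    for t, (year, ye) in enumerate(year_effect.items()):
        x = float((i + 1) * (t + 1))  # varies within both unit and year
        panel[(unit, year)] = (x, ue + ye + TRUE_BETA * x)

print(round(two_way_within_slope(panel), 6))
```

Because the toy outcome is built as unit effect plus year effect plus `TRUE_BETA * x` with no noise, demeaning removes both sets of effects and the estimator recovers the built-in slope. Real panel data would also need the covariates and inference that the paper reports.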
There is limited long-term impact evidence and few system-level assessments of AI in developing-country agriculture.
Authors' methodological caveat based on the temporal scope and types of studies available in the >60-study review.
The evidence base is skewed toward pilots and high‑performer contexts; there is a lack of long‑panel, multi‑project longitudinal studies to validate typical returns and scalability.
Authors' assessment of evidence types in the 160 studies: mix of conceptual papers, case studies, pilots, and only limited larger empirical evaluations.
Substantial compute and resource requirements for training and inference concentrate capabilities among well‑resourced labs and firms.
Paper discusses large compute budgets for training/inference and states that performance scales with data, model size, and compute; it infers concentration of capabilities but provides no empirical market concentration measures.
Structure predictors depend on training data and exhibit biases; experimental validation remains necessary.
Paper notes dependence on training data biases and the need for experimental validation; references data sources (PDB, UniRef, metagenomic catalogs) but does not quantify bias magnitudes.
Current limitations include inaccurate prediction of multi‑chain complexes, flexible or rare conformational states, and limited prediction of dynamic ensembles.
Paper explicitly enumerates these limitations in the 'Ongoing limitations' section; no quantitative failure rates are given.
Traditional computational methods struggle without homologous templates or with complex folding/dynamics.
Paper discusses limitations of traditional computational methods, emphasizing dependence on homologous templates and difficulty with complex folding/dynamics; specific method comparisons or sample sizes are not provided.
Opacity, bias, and errors in AI systems demand auditing, standards, and governance (algorithmic accountability) to ensure trustworthy assessment.
Synthesis of literature on algorithmic bias and accountability plus policy analysis recommending audits and standards; supported by country cases that discuss governance concerns.
Student data used by AI vendors raises risks around consent, reuse, commercial exploitation, and other data-privacy concerns.
Policy analysis and literature on data governance, privacy law debates; examples from national policy documents in the comparative cases. No original data on breaches or misuse presented.
Empirical evaluation of integrated defenses, quantitative cost/benefit analyses, and standardized threat models for VR are research gaps that remain unaddressed in the literature window surveyed (2023–2025).
Authors' stated limitations from their comparative literature review of 31 studies noting an absence of primary empirical validation and quantitative economic analyses in the reviewed corpus.
Immersive VR systems collect continuous multimodal signals (motion tracking, gaze, voice, biometrics) that enable novel inference, spoofing, and manipulation attacks beyond traditional IT threats.
Synthesis of threat descriptions across the 31 reviewed peer‑reviewed studies (2023–2025) documenting sensor modalities and attack vectors; qualitative comparative evaluation of attack surfaces.
The Omnibus overlaps substantively with the DSA and other digital policies, creating potential jurisdictional and interpretive ambiguities about which rules apply to platforms and AI-enabled services.
Comparative mapping and legal/regulatory review identifying overlapping provisions; qualitative analysis of proposed texts (no quantitative sample).
Pakistan prioritizes economic and digital governance objectives, with comparatively weak governance of military AI.
Review of Pakistan’s economic and digital governance plans, export‑control materials, and secondary literature on Pakistan’s civil–military relations.
Large-scale machine learning enables invisible inferences about users from seemingly innocuous data.
Conceptual claim presented in the workshop and supported by referenced technical literature on inference capabilities of ML models (discussion in position papers); workshop itself did not present a new empirical experiment.
Inequities in climate-AI systems appear across three development phases—Inputs, Process, and Outputs—creating multiple failure points where Global North advantages propagate into final products.
Conceptual framework developed from cross-disciplinary synthesis, literature review, and illustrative examples (Inputs → Process → Outputs mapping).
Foundation-model development and high-performance computing (HPC) capacity are overwhelmingly located in the Global North.
Descriptive mapping of global HPC infrastructure and foundation-model authorship described in the paper (infrastructure mapping and authorship analysis). No single quantitative sample size reported; evidence based on spatial mapping and documented locations of compute centers and model-development institutions.
Ambiguity about the probability of data leaks (a 10–50% range) reduces user adoption of AI personalization relative to a neutral privacy presentation.
Between-subjects online experiment with a 2 (information environment: Risk vs. Ambiguity) × 3 (privacy-treatment conditions) design; N = 610 participants randomized across arms. Leak-probability ambiguity was presented as a 10–50% range, and adoption was measured as the choice of a personalized versus a standard basket. Under ambiguity, the privacy-threatening conditions produced a statistically significant reduction in adoption compared to the neutral condition.
Rank stability analysis across the whole citation distribution shows instability not only at the tail but across frequently cited domains; rankings shift substantially across samples.
Distribution-wide rank-stability methods applied to repeated-sample citation data from the three platforms and three topics, comparing domain ranks across samples and quantifying rank-change frequency and magnitude.
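Comparing domain ranks across samples can be sketched with two simple quantities: the absolute rank change per domain and the Spearman rank correlation between two runs. The source does not name the exact rank-stability statistics used, so these are stand-ins chosen for illustration. The domain names and citation counts are invented, and the code assumes both runs cover the same domains with no tied counts.

```python
def ranks(counts):
    """Rank domains by descending count (1 = most cited); assumes no ties."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {domain: pos for pos, domain in enumerate(ordered, start=1)}


def rank_shifts(sample_a, sample_b):
    """Absolute rank change per domain; both samples must share all domains."""
    ra, rb = ranks(sample_a), ranks(sample_b)
    return {domain: abs(ra[domain] - rb[domain]) for domain in ra}


def spearman(sample_a, sample_b):
    """Spearman rank correlation via the no-ties formula."""
    shifts = rank_shifts(sample_a, sample_b)
    n = len(shifts)
    return 1 - 6 * sum(s * s for s in shifts.values()) / (n * (n * n - 1))


# Invented citation counts for two sampling runs over the same four domains.
run_1 = {"a.example": 40, "b.example": 31, "c.example": 22, "d.example": 9}
run_2 = {"a.example": 28, "b.example": 35, "c.example": 12, "d.example": 19}

print(rank_shifts(run_1, run_2))
print(spearman(run_1, run_2))
```

Repeating the comparison over every pair of runs gives a distribution of rank changes. That distribution is what separates instability confined to the tail from instability that also affects frequently cited domains.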
Bootstrap-based confidence intervals show wide uncertainty: many domain-level differences that look meaningful in single-run snapshots fall within measurement noise.
Bootstrap resampling applied to repeated-sample data (collected over nine days and under high-frequency sampling) to compute confidence intervals for citation shares and prevalence; many pairwise or between-domain differences were not statistically separable once CIs were considered.
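A percentile bootstrap for a single domain's citation share can be sketched as follows. The source does not specify its resampling scheme, so this version makes a simplifying assumption: individual citations are resampled with replacement as if they were independent draws. The domain names, counts, and the function `bootstrap_share_ci` are hypothetical.

```python
import random


def bootstrap_share_ci(observations, target, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the share of observations equal to target."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    n = len(observations)
    shares = []
    for _ in range(n_boot):
        resample = rng.choices(observations, k=n)  # sample with replacement
        shares.append(sum(1 for obs in resample if obs == target) / n)
    shares.sort()
    lower = shares[int((alpha / 2) * n_boot)]
    upper = shares[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented pool of 100 cited domains collected across repeated runs.
citations = ["a.example"] * 30 + ["b.example"] * 25 + ["c.example"] * 45

ci_a = bootstrap_share_ci(citations, "a.example")
ci_b = bootstrap_share_ci(citations, "b.example")
print("a.example share 0.30, 95% CI:", ci_a)
print("b.example share 0.25, 95% CI:", ci_b)
print("intervals overlap:", ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1])
```

With only 100 observations, a five-point gap between two shares sits inside overlapping intervals, which is the kind of difference the evidence describes as not statistically separable. If citations cluster within runs, resampling whole runs rather than single citations would be the more cautious choice.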
Single-run point estimates of citation share or prevalence are misleading; visibility metrics should be treated as estimators with uncertainty and reported with confidence intervals.
Comparison of single-run snapshots to distributions obtained from repeated sampling (daily and 10-minute interval regimes) and bootstrap resampling, showing large sample-to-sample variation and wide confidence intervals for domain-level shares and prevalence metrics.