Evidence (2954 claims)
Claim counts by topic:
| Topic | Claims |
|---|---|
| Adoption | 5126 |
| Productivity | 4409 |
| Governance | 4049 |
| Human-AI Collaboration | 2954 |
| Labor Markets | 2432 |
| Org Design | 2273 |
| Innovation | 2215 |
| Skills & Training | 1902 |
| Inequality | 1286 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
Human-AI Collaboration
Claims and their supporting evidence, filtered to this topic.
Automation bias (human tendency to defer to automated outputs) compounds the risk that GLAI errors become embedded in legal processes.
Behavioral literature review on automation bias and trust in AI systems; applied to legal-context vignettes. No primary empirical test within the paper.
Cooperation with the AI plateaus and never reaches the near-complete cooperation levels observed in human–human interactions.
Time-series/trajectory analysis of cooperation rates in the lab human–AI experiment (n = 126) compared to the human–human benchmark (n = 108); reported convergence/end-state cooperation levels show AI condition asymptotes below the human–human condition.
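A minimal sketch of this kind of trajectory comparison, assuming per-round cooperation rates for each condition and fitting an exponential approach to an asymptote; the round counts, noise levels, and rates below are hypothetical placeholders, not the study's data:

```python
# Sketch: estimate end-state (asymptotic) cooperation per condition by
# fitting a saturating curve to per-round cooperation rates.
# All numbers are hypothetical, not the paper's data.
import numpy as np
from scipy.optimize import curve_fit

def saturating(t, asymptote, rate, start):
    # Cooperation rises from `start` toward `asymptote` at speed `rate`.
    return asymptote - (asymptote - start) * np.exp(-rate * t)

rng = np.random.default_rng(0)
rounds = np.arange(1, 21)
coop_hh = 0.95 - 0.55 * np.exp(-0.3 * rounds) + rng.normal(0, 0.02, 20)
coop_hai = 0.70 - 0.30 * np.exp(-0.3 * rounds) + rng.normal(0, 0.02, 20)

for label, series in (("human-human", coop_hh), ("human-AI", coop_hai)):
    (asym, _, _), _ = curve_fit(saturating, rounds, series, p0=[0.8, 0.2, 0.4])
    print(f"{label}: fitted asymptote = {asym:.2f}")  # human-AI plateaus lower
```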
Adoption requires hardware (VR headsets, capable GPUs) and integration effort, implying upfront capital expenditure for labs/observatories.
Paper explicitly notes hardware requirements (VR headsets, capable GPUs) and integration effort as part of adoption considerations; common-sense assessment of required capital.
Current models heavily rely on large static datasets and batch training and exhibit poor lifelong/continual learning.
Synthesis of common practices in contemporary ML (supervised pretraining and offline training paradigms); no new experiments provided.
When identical replies are labeled as coming from AI rather than from a human, recipients report feeling less heard and less validated (an attribution effect).
Controlled attribution labeling experiment within the study: identical replies presented with different source labels (AI vs. human) and recipient-rated perceptions of being heard/validated measured.
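A minimal analysis sketch for an attribution-labeling design like this one, assuming between-subjects ratings of feeling heard/validated under each source label; sample sizes, scale, and values are hypothetical:

```python
# Sketch: compare "felt heard" ratings for identical replies labeled as
# AI-written vs human-written. Hypothetical data on a 1-7 scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ai_label = rng.normal(4.1, 1.2, 120)      # replies labeled "from AI"
human_label = rng.normal(4.8, 1.2, 120)   # same replies labeled "from human"

t, p = stats.ttest_ind(human_label, ai_label)
print(f"attribution effect on feeling heard: t = {t:.2f}, p = {p:.4f}")
```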
The empirical validation is performed only on synthetic text-preference data rather than real-world user populations, so field deployment effects and richer preference models remain to be tested.
Experiments section states synthetic dataset for text preferences and notes absence of field experiments on real user populations.
The theoretical results (algorithms and sample-complexity bounds) assume truthful, exogenous preferences and simple sampling access; strategic behavior or costly reporting could change the information requirements.
Modeling assumptions explicitly stated in the paper (sampling access to truthful preferences) and discussion in the implications/limitations section noting the need to consider strategic behavior and reporting costs.
Matching information-theoretic lower bounds are proved, establishing that no algorithm can guarantee finding an (approximate) proportional veto-core element with fewer queries than the stated bounds (i.e., the sample complexity is optimal).
Lower-bound proofs in the theoretical section of the paper showing impossibility results that match the upper-bound rates.
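In outline, "matching" has its usual meaning; a hedged sketch with a placeholder rate $g$, since the excerpt does not state the exact bound:

```latex
% Upper bound: an algorithm finds an epsilon-approximate proportional
% veto-core element using O(g(n, epsilon)) preference queries.
% Lower bound: any algorithm needs Omega(g(n, epsilon)) queries.
% Together these pin down the sample complexity up to constants:
\mathrm{SC}(n, \varepsilon) \;=\; \Theta\bigl(g(n, \varepsilon)\bigr)
```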
Agent performance degrades markedly as environment complexity, stochasticity, and non-stationarity increase, revealing core limitations of current LLM-based agents for long-horizon, multi-factor decision problems.
Experimental results across progressively harder RetailBench environments showing performance falloff for multiple LLMs under increased task complexity and non-stationarity.
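A schematic of the difficulty sweep such an evaluation implies; `make_env` and `run_agent` below are illustrative stubs, not RetailBench's actual interface:

```python
# Sketch: sweep environment difficulty and record agent scores to expose
# performance falloff. Stubs stand in for real environment/agent rollouts.
from itertools import product

def make_env(complexity: int, stochasticity: float) -> dict:
    return {"complexity": complexity, "stochasticity": stochasticity}

def run_agent(env: dict) -> float:
    # Stub scoring rule mimicking degradation under harder settings.
    return max(0.0, 1.0 - 0.15 * env["complexity"] - env["stochasticity"])

for complexity, noise in product(range(1, 5), (0.0, 0.2, 0.4)):
    score = run_agent(make_env(complexity, noise))
    print(f"complexity={complexity} noise={noise:.1f} -> score={score:.2f}")
```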
There are limited randomized controlled trials or longitudinal evaluations; few studies measure patient-relevant outcomes or economic impacts.
Literature synthesis noting scarcity of RCTs and long-term observational studies, and absence of widespread patient-outcome and cost-effectiveness evaluations in existing publications.
Many published studies focus on standalone algorithm accuracy rather than clinician–AI joint performance in routine workflows.
Review of the literature categorizing study designs (preponderance of algorithm development/validation studies, fewer reader-in-the-loop, simulation, or deployment studies).
Advanced technologies' complexity and lack of explainability create risks for audit reliability and professional judgement.
Findings from literature synthesis and professional/regulatory perspectives included in the review; presented as an identified risk/challenge rather than quantified effect.
Audit 5.0 introduces key challenges: data quality and integration issues, complexity and explainability of advanced technologies, regulatory and ethical uncertainty, and skills shortages combined with cultural resistance.
Systematic literature review and synthesis of professional standards and regulatory perspectives; assertions based on reviewed literature rather than a single empirical dataset.
At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best.
Question-level analysis from the randomized experiment comparing cases where chatbot suggestions were incorrect versus control; paper reports a ~66% reduction in accuracy on easy questions when chatbot suggestions were incorrect (exact denominators and statistics not provided in the excerpt).
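The "two-thirds reduction" is a relative drop; a worked example with hypothetical accuracy values (the paper's denominators are not given in the excerpt):

```python
# Worked example of a ~66% relative accuracy reduction. Hypothetical values.
control_accuracy = 0.90        # easy questions, no incorrect suggestion
with_wrong_suggestion = 0.30   # easy questions, incorrect suggestion shown
relative_drop = (control_accuracy - with_wrong_suggestion) / control_accuracy
print(f"relative reduction: {relative_drop:.0%}")  # ~67%, i.e. two-thirds
```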
When incentive signals depend non-trivially on persistent environmental memory, the resulting dynamics generically cannot be reduced to a static global objective defined solely over the agent state space (i.e., no global potential function over agents exists in the generic case).
A genericity theorem/argument in the paper (mathematical demonstration showing that for nontrivial dependence on environmental memory the closed-loop vector field is, for a generic set of parameterizations, not gradient of any scalar function on agent space).
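One standard way to see the obstruction (a hedged sketch consistent with the claim, not the paper's exact proof): a gradient flow on agent space forces a symmetric Jacobian, and coupling to persistent memory generically breaks that symmetry.

```latex
% Closed-loop dynamics with persistent environmental memory m:
\dot{x} = F(x, m), \qquad \dot{m} = G(x, m).
% If F were the gradient of a global potential V on agent space alone,
% F = -\nabla_x V(x), its Jacobian would equal the (symmetric) negative
% Hessian of V:
\frac{\partial F_i}{\partial x_j} = \frac{\partial F_j}{\partial x_i}
\quad \text{for all } i, j,
% a symmetry condition that fails for generic memory-dependent F.
```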
AI notably reduces customer stability in sports enterprises (SEs).
Empirical estimation using a double machine learning (DML) model on a panel dataset of 45 Chinese listed SEs (2012–2023); authors report a statistically significant negative effect of AI on customer stability.
The sample is limited to Chinese A-share-listed design enterprises (2014–2023), which may limit generalizability to small and medium-sized enterprises (SMEs) or firms in other countries/regions.
Study sample description: A-share-listed design-oriented enterprises in China between 2014 and 2023; authors explicitly note this as a limitation.
Using TFP as a proxy for project efficiency aggregates effects at the firm level and therefore lacks micro-level insight into specific project workflows or design iteration processes.
Methodological limitation acknowledged in the paper: TFP is used as a firm-level proxy and the dataset does not include micro-level project workflow or iteration logs.
There exists a systemic governance vacuum around GenAI, including gaps in privacy, accountability, and intellectual property protections.
Authors' synthesis of governance-related gaps reported across the 28 secondary studies and research agendas in the review.
Societal and ethical risks—such as bias, misuse, and skill erosion—constrain GenAI adoption.
Themes synthesized from the reviewed literature (28 papers) reporting societal and ethical concerns associated with GenAI deployment.
Technical unreliability—manifesting as hallucinations and performance drift—is a major constraint on GenAI adoption.
Recurring identification of technical reliability issues (hallucinations, performance drift) in the 28 reviewed papers and authors' aggregation of technical risks.
Adoption of GenAI is constrained by multiple interrelated challenges.
Cross-paper synthesis from the systematic review of 28 studies identifying recurring barriers and constraints reported in the literature.
Human judgment is constrained by bounded rationality, cognitive biases, and information-processing limitations.
Cited as established findings from prior research across decision sciences and related fields (extensive literature evidence referenced; no new empirical data in this paper's abstract).
Key implementation challenges include data quality and integration, model interpretability, cybersecurity and privacy, regulatory/compliance uncertainty, skills gaps among accounting professionals, and implementation costs.
Identified by the paper through literature review and practitioner reports; these are presented as recurring barriers rather than quantified with a specific sample.
Many studies on serious-game DSTs are small-scale or experimental, and long-term impact data on behavioral change and emissions outcomes are sparse, limiting generalizability.
Review of the literature summarized in the chapter showing predominance of case studies, prototypes, and short-term evaluations rather than longitudinal or large-sample studies.
Ensuring scientific validity of game models, scaling co-design processes, measuring real-world behavioral change, and aligning incentives (policy/subsidies, markets) are remaining challenges to using serious games for DST uptake.
Chapter discussion of limitations and gaps identified in the reviewed literature; absence or sparsity of long-term validation studies and large-scale co-design implementations documented in existing research.
Current uptake of DSTs for net zero remains limited because of issues of trust, usability, lack of evidence linking actions to farm profitability, and poor integration into farmer workflows.
Literature synthesis, qualitative interviews and surveys, case studies documenting low adoption and barriers; multiple practice reports and studies cited in the chapter. Many studies report limited or uneven adoption across contexts.
Using LLM participants without rigorous validation can bias external validity and causal inference in economic research.
Review documents cognitive misalignments and distortions that can bias estimated behaviors, preferences, or treatment effects; authors highlight this as a risk.
Overfitting/contamination: LLMs can reproduce pre-training or fine-tuning data (stochastic parroting) and leak training-set content into outputs.
Multiple reviewed studies documenting examples of content reproduction and data leakage; categorized as overfitting/contamination in the review.
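A minimal sketch of one common leakage check implied by this category, assuming access to candidate source texts: flag outputs that share long verbatim n-grams with them (the n-gram length and strings below are hypothetical):

```python
# Sketch: crude contamination check via verbatim n-gram overlap between a
# model output and a candidate training/source text. Hypothetical inputs.
def ngrams(text: str, n: int = 8) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, source: str, n: int = 8) -> float:
    out = ngrams(output, n)
    return len(out & ngrams(source, n)) / len(out) if out else 0.0

# A ratio near 1.0 suggests the output reproduces source content verbatim.
print(overlap_ratio("the quick brown fox jumps over the lazy dog today",
                    "he said the quick brown fox jumps over the lazy dog today"))
```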
Misleading believability: LLM outputs may look plausible but be incorrect or unrepresentative, risking overconfidence in synthetic data.
Reported instances in the literature and organized failure taxonomy describing plausible-looking but inaccurate synthetic responses.
Distortions: LLM outputs can exhibit systematic biases relative to target human distributions.
Empirical findings across reviewed studies showing output distributions from LLMs that deviate from human sample distributions; aggregated in the distortions failure category.
Cognitive misalignments: LLMs differ from humans in reasoning, goals, and bounded rationality, which can alter behavior in economic and strategic tasks.
Multiple studies in the review reported systematic differences in reasoning and goal-directed behavior when comparing LLM outputs to human participants; coded under the cognitive misalignment category.
Major failure modes limiting synthetic participants as direct substitutes for humans are: cognitive misalignments, distortions, misleading believability, and overfitting/contamination.
Standardized taxonomy developed by coding the 182 studies into generalizable indicators and organizing failure types into four categories.
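A minimal sketch of how one of these failure modes, distortions, can be quantified, assuming categorical response distributions for matched human and LLM samples; the category labels and probabilities below are hypothetical:

```python
# Sketch: total variation distance between an LLM's response distribution
# and a human benchmark over the same answer categories. Hypothetical data.
human = {"cooperate": 0.55, "defect": 0.30, "abstain": 0.15}
llm = {"cooperate": 0.80, "defect": 0.15, "abstain": 0.05}

tv = 0.5 * sum(abs(human[k] - llm[k]) for k in human)
print(f"TV distance: {tv:.2f}")  # 0 = identical distributions, 1 = disjoint
```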
No evaluated program reported Kirkpatrick‑Barr level‑4 outcomes (organizational change, patient outcomes, or sustained metacognitive mastery).
Reviewers mapped reported outcomes from all 27 included programs and found none that demonstrated organizational-level impacts or patient‑level outcomes (level 4).
Because the design is cross-sectional and sampling purposive/geographically constrained, causal inference and generalizability are limited.
Authors' stated limitations in the summary: cross-sectional design and purposive, geographically constrained sample (Karnataka, India).
Workplace stress is associated with lower employee retention.
PLS-SEM analysis on a cross-sectional survey of N = 350 pharmaceutical workers in Karnataka, India (purposive sampling). Reported direct path: Stress → Retention, β = 0.321, p < 0.001. (Note: the paper interprets this as stress reducing retention; sign/coding conventions of the variables are not detailed in the summary.)
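For readers unfamiliar with the convention, β here is a standardized path coefficient. A minimal illustration of a standardized coefficient on hypothetical data; note this is plain OLS on z-scores, not the paper's PLS-SEM estimation:

```python
# Sketch: standardized coefficient (beta) from z-scored variables.
# Hypothetical data; the study itself estimates paths via PLS-SEM.
import numpy as np

rng = np.random.default_rng(1)
stress = rng.normal(size=350)
retention = 0.32 * stress + rng.normal(scale=0.95, size=350)

z = lambda v: (v - v.mean()) / v.std()
beta = np.polyfit(z(stress), z(retention), 1)[0]  # slope on z-scores
print(f"standardized beta: {beta:.3f}")
```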
If deployed without mitigation, GenAI CDS risks widening disparities by performing worse on underrepresented groups or being unequally distributed across resource-rich versus resource-poor settings.
Fairness literature, subgroup performance concerns, and distributional risk analysis cited in the paper; direct empirical demonstrations of widened disparities due to GenAI CDS are limited in the literature per the paper.
Limited public datasets and vendor lock-in constrain independent reproducible evaluations and audits of current generative models in healthcare.
Observation and policy analysis in the paper noting scarcity of public clinical datasets for state-of-the-art models and proprietary constraints; no dataset counts provided.
GenAI CDS creates data privacy and security risks because of high-value medical data and use of external cloud services.
Known cybersecurity risks and documented incidents in health IT; the paper cites the general risk context rather than specific breach sample counts tied to GenAI deployments.
GenAI CDS can amplify bias and inequities if training data underrepresent groups or reflect historical disparities.
Fairness and robustness audit literature and subgroup performance analyses referenced in the paper; specific empirical demonstrations for contemporary GenAI CDS are limited and sample sizes not given.
GenAI CDS systems hallucinate and can produce incorrect but plausible recommendations, which can cause patient harm if trusted unchecked.
Documented failure modes of generative models and examples from controlled evaluations; the paper references known hallucination behavior from model audits and case reports, though it does not quantify incidence rates or provide large-scale observational harm data.
Inequities in climate-AI systems appear across three development phases—Inputs, Process, and Outputs—creating multiple failure points where Global North advantages propagate into final products.
Conceptual framework developed from cross-disciplinary synthesis, literature review, and illustrative examples (Inputs → Process → Outputs mapping).
Foundation-model development and high-performance computing (HPC) capacity are overwhelmingly located in the Global North.
Descriptive mapping of global HPC infrastructure and foundation-model authorship described in the paper (infrastructure mapping and authorship analysis). No single quantitative sample size reported; evidence based on spatial mapping and documented locations of compute centers and model-development institutions.
On the 22 postdating (contamination-free) incidents, no agent achieved end-to-end exploitation success in any of the 110 agent–incident pairs evaluated.
Empirical evaluation of 110 agent–incident pairs reported in the study (end-to-end exploit attempts on the 22 incidents).
The original EVMbench had a data contamination risk because it relied on audit-contest data published before every evaluated model's release, which could have been seen during model training.
Timing relationship between the audit-contest dataset used by EVMbench and the release dates of evaluated models (dataset predated model releases).
The original EVMbench evaluation was narrow: it evaluated 14 agent configurations and most models were tested only with their vendor-provided scaffold.
Description of the original EVMbench experimental setup (number of agent configurations and scaffold usage) cited in this study.
There is a risk that NFD will overfit to individual practices and lead to privacy/IP leakage if crystallization is not carefully governed.
Limitations and risk analysis in the paper; conceptual argument and case study discussion raising privacy/IP concerns. No empirical incidence rates provided.
NFD requires sustained practitioner engagement and incentive alignment to be effective.
Limitations and discussion sections of the paper explicitly state this requirement; logical inference from method (human-in-the-loop commercialization and continual crystallization).
Limitations of the study include reliance on self-reported perceptions (subject to response and survivorship bias), lack of experimental/causal identification, potential non-representative sample, and cross-sectional design limiting inference about long-term productivity effects.
Authors' stated limitations in the paper summary.
Standard RLHF expected-cost constraints ignore distributional shape and can fail under heavy tails or rare catastrophic events.
Analytic/motivating argument presented in the paper contrasting expectation-based constraints with distributional behavior; illustrative examples and discussion of heavy-tailed/rare-event failure modes (no sample-size or dataset details provided in the summary).
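To make the contrast concrete (a hedged sketch; the paper's own formulation may differ): an expected-cost constraint bounds only the mean of the cost $C$, while a tail-sensitive alternative such as conditional value-at-risk also constrains the upper tail:

```latex
% Expectation constraint: blind to tail shape.
\mathbb{E}[C] \le \beta
% A heavy-tailed C can satisfy this while rare catastrophic costs remain.
% Tail-sensitive alternative, e.g. CVaR at level alpha:
\mathrm{CVaR}_\alpha(C)
  = \mathbb{E}\bigl[\, C \mid C \ge \mathrm{VaR}_\alpha(C) \,\bigr]
  \le \beta
```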