Evidence (3062 claims)
- Adoption: 5227 claims
- Productivity: 4503 claims
- Governance: 4100 claims
- Human-AI Collaboration: 3062 claims
- Labor Markets: 2480 claims
- Innovation: 2320 claims
- Org Design: 2305 claims
- Skills & Training: 1920 claims
- Inequality: 1311 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 373 | 105 | 59 | 439 | 984 |
| Governance & Regulation | 366 | 172 | 115 | 55 | 718 |
| Research Productivity | 237 | 95 | 34 | 294 | 664 |
| Organizational Efficiency | 364 | 82 | 62 | 34 | 545 |
| Technology Adoption Rate | 293 | 118 | 66 | 30 | 511 |
| Firm Productivity | 274 | 33 | 68 | 10 | 390 |
| AI Safety & Ethics | 117 | 178 | 44 | 24 | 365 |
| Output Quality | 231 | 61 | 23 | 25 | 340 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 158 | 68 | 33 | 17 | 279 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 88 | 31 | 38 | 9 | 166 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 105 | 12 | 21 | 11 | 150 |
| Consumer Welfare | 68 | 29 | 35 | 7 | 139 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 71 | 10 | 29 | 6 | 116 |
| Worker Satisfaction | 46 | 38 | 12 | 9 | 105 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 11 | 16 | 94 |
| Task Completion Time | 76 | 5 | 4 | 2 | 87 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 16 | 9 | 5 | 48 |
| Job Displacement | 5 | 29 | 12 | — | 46 |
| Social Protection | 19 | 8 | 6 | 1 | 34 |
| Developer Productivity | 27 | 2 | 3 | 1 | 33 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 8 | 4 | 9 | — | 21 |
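The matrix is a plain cross-tabulation; a minimal sketch of how it could be reproduced from a claim-level export, assuming a hypothetical claims.csv with outcome and direction columns (file and column names are illustrative, not the source data):

```python
import pandas as pd

# Hypothetical claim-level export: one row per claim with its outcome
# category and coded direction of finding. File and column names are
# assumptions, not the source data.
claims = pd.read_csv("claims.csv")  # columns: outcome, direction

# Cross-tabulate outcome against direction, add totals, and sort
# categories by total claim count as in the matrix above.
matrix = pd.crosstab(claims["outcome"], claims["direction"],
                     margins=True, margins_name="Total")
matrix = matrix.drop(index="Total").sort_values("Total", ascending=False)

# Reorder columns to match the table (assumes these exact direction labels).
matrix = matrix[["Positive", "Negative", "Mixed", "Null", "Total"]]
print(matrix.to_markdown())  # requires the tabulate package
```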
Human-AI Collaboration
Large language models (LLMs) risk reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market.
Framed as an assertion supported by prior literature and used as motivation for the study; partially evaluated empirically in this paper via the GPT-5 experiment.
The inability of models to reliably self-author useful Skills implies that models typically cannot produce the procedural knowledge they would benefit from consuming.
Interpretation based on the empirical finding that self-generated Skills provided no average benefit; the conclusion about model-authored procedural content quality is inferred. The paper's claim is supported by the comparative experimental results, but the inference about broader capabilities is derived from those results rather than from a direct, separate measurement.
In some tasks, curated Skills worsened performance: 16 of 84 tasks showed negative deltas.
Per-task delta analysis reported in the paper: the authors report 16 tasks with negative deltas, where curated Skills reduced the pass rate. (Note: the paper elsewhere reports 86 tasks in the benchmark, while the negative-task count is reported as 16 of 84 in the paper's per-task summary.)
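A minimal sketch of this kind of per-task delta analysis, assuming hypothetical per-task pass rates with and without curated Skills (the data layout and values are illustrative, not the paper's):

```python
import pandas as pd

# Hypothetical per-task results: pass rate for each task with and
# without curated Skills. Structure and values are illustrative.
results = pd.DataFrame({
    "task": ["t1", "t2", "t3"],
    "pass_rate_baseline": [0.50, 0.80, 0.30],
    "pass_rate_skills":   [0.70, 0.75, 0.30],
})

# Per-task delta: positive means curated Skills helped on that task.
results["delta"] = results["pass_rate_skills"] - results["pass_rate_baseline"]

n_negative = (results["delta"] < 0).sum()
print(f"{n_negative} of {len(results)} tasks show negative deltas")
```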
Access to digital learning and credential portability could unevenly benefit those with connectivity or prior skills, creating distributional effects and digital divides that should be measured.
Conceptual risk analysis and distributional reasoning based on digital access differentials; no empirical subgroup analysis reported.
Corridor governance is fragmented, with uneven implementation capacity across sending and receiving actors.
Governance gap analysis and desk review of corridor institutional arrangements; qualitative identification of capacity and accountability shortfalls.
Current mandatory pre-departure training is typically delivered late, generically, and with weak assessment, limiting its capacity to change recruitment choices or support migrants after arrival.
Structured desk review of policy and program materials and corridor process mapping identifying timing, actors, and touchpoints; qualitative/administrative evidence rather than quantitative outcome measurement.
Platforms optimized for engagement can produce externalities that distort lived temporality (loss of presence and meaning) beyond standard attention‑capture harms.
Argument synthesizing platform literature and phenomenological concerns; no new empirical analysis of platform effects provided.
Contemporary transhumanist and neurotechnology developments (BCIs, neural digital twins, human–AI collaboration) have advanced technologically but lack a robust conceptual core focused on lived experience and temporality.
Survey and synthesis of existing literatures reported in the paper (conceptual review); no systematic empirical content analysis or coded sample size provided.
LLM-generated participants are particularly risky in strategic and game-theoretic settings because they may misrepresent incentives, dynamic strategic thinking, and bounded rationality.
Review highlights examples and theoretical concerns from multiple studies indicating misrepresentation of strategic behavior; grouped under risks for strategic settings.
The absence of level‑4 evidence (organizational/patient outcomes) limits the ability of health systems and payers to conduct cost‑benefit or return‑on‑investment analyses for upskilling investments in AI.
No included study reported level‑4 outcomes; the paper reasons that without organizational/patient outcome data, economic evaluation is hampered.
Because most programs were short, introductory, and assessed only short‑term learner outcomes, they likely produce modest increases in individual AI literacy but are insufficient to build advanced clinical AI competencies that would change clinical task allocation or productivity.
Synthesis combining program characteristics (short duration, introductory content, academic delivery) and outcome mapping to only Kirkpatrick levels 1–3 in the 27 studies; interpretation drawn in the paper.
Workplace stress is associated with reduced job performance.
PLS-SEM analysis on the same N = 350 sample. Reported direct path: Stress → Performance, β = 0.158, p < 0.001. (Note: the study interprets this as stress reducing performance; sign/coding conventions are not detailed in the summary.)
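For readers unfamiliar with standardized path coefficients: as a rough illustration only (plain OLS on standardized scores, not the paper's PLS-SEM estimation), the sketch below shows how a β of this kind is read as the expected standard-deviation change in performance per standard-deviation change in stress. All data are simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-ins for latent construct scores (illustrative only).
n = 350
stress = rng.normal(size=n)
performance = -0.3 * stress + rng.normal(size=n)  # sign depends on coding

# Standardize both sides so the slope is a standardized coefficient,
# comparable in scale to a PLS-SEM path coefficient (beta).
z = lambda x: (x - x.mean()) / x.std()
fit = sm.OLS(z(performance), sm.add_constant(z(stress))).fit()

print(f"standardized beta = {fit.params[1]:.3f}, p = {fit.pvalues[1]:.4f}")
```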
High upfront and maintenance costs create scale advantages for larger institutions or centralized providers, potentially concentrating market power among well-resourced curriculum developers.
Economic inference from cost structure described in paper; no market concentration empirical data provided.
Disadvantages and risks include significant resource investment, the complexity of aligning multiple standards, and a high demand for continuous updates and audits.
Paper's risks section (author assertion); no quantified cost or burden data.
Implementing this program requires substantial resources and ongoing governance.
Author assertions in disadvantages/risks section; no cost accounting or empirical costing data provided.
Proprietary models trained on large clinical datasets can create high entry barriers, concentrating market power among a few platform firms and increasing prices for hospitals.
Market-structure and platform economics analysis in the paper; empirical evidence of concentration in GenAI healthcare is limited and no firm-level market-share data are provided.
Liability and accountability gaps exist for AI-suggested errors: it is unclear whether vendors, hospitals, or clinicians are responsible for harms resulting from GenAI CDS recommendations.
Policy and legal analysis discussed in the paper; this is a structural/legal observation rather than an empirical finding and no case-law sample size is provided.
Current simulation practice is insufficiently integrated with enabling technologies (digital twins, data analytics, AI/ML) and with relevant government policy constraints.
Synthesis of literature and gap analysis in the paper; assertions are conceptual and not empirically tested within the paper.
Current simulation practice has limited strategic orientation, often focusing more on tactical and operational questions than on firm strategy.
Literature review and analysis in the paper highlighting the emphasis in existing studies on tactical/operational problems.
Current simulation practice lacks contextualization to firm‑ and industry‑specific realities.
Findings from the paper's literature review and critique sections; no new empirical measurement provided.
Current manufacturing and supply‑chain simulation practices are insufficiently contextualized, strategically focused, or integrated with modern technologies and policy considerations.
Literature review and critique of existing simulation practice presented in the paper; no original empirical data or case studies.
Personalization raises distributional concerns and risks of manipulation or biased treatment; regulators may need to set transparency, fairness, and data-use standards.
Policy analysis and normative recommendation based on known risks of personalization systems; not empirically demonstrated in robotic deployments here.
LLM-based personalization generates context-aware responses but often fails to model long-term preferences and fine-grained user/item relations needed for consistent, proactive personalization.
Conceptual critique based on surveyed limitations of LLM-based approaches; no new experimental data reported.
ANN analysis ranks need-for-human-interaction barriers as the most important predictor of GAICS adoption outcome.
ANN feature-importance analysis reported in the paper ranks predictors of the adoption outcome and identifies the need-for-human-interaction barrier as the top predictor; the paper's abstract does not detail the ANN implementation or sample characteristics.
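The paper does not describe its ANN sensitivity analysis, but a common way to rank predictor importance for a neural network is permutation importance, sketched here with scikit-learn on synthetic data (predictor names and data are hypothetical, not the study's):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-ins for survey predictors; names are hypothetical.
features = ["human_interaction_barrier", "perceived_usefulness", "trust", "cost"]
X = rng.normal(size=(300, len(features)))
y = (X[:, 0] * -1.5 + X[:, 1] + rng.normal(size=300) > 0).astype(int)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X, y)

# Permutation importance: drop in accuracy when a predictor is shuffled.
imp = permutation_importance(net, X, y, n_repeats=30, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```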
Students raised concerns about ChatGPT producing factual errors, the risk of overreliance that could reduce independent thinking, and functional constraints of free ChatGPT versions.
Qualitative analysis of open-ended student survey responses; concerns consistently reported across responses in the sample of 254 students.
Biased or unrepresentative AI outputs produce negative externalities, including maladaptation and inefficient investments in vulnerable regions.
Conceptual analysis and illustrative cases linking misleading model outputs to maladaptive decisions; the paper notes risks rather than providing quantified incidence or cost estimates.
Returns to scale in compute and data favor incumbents; without intervention this dynamic can entrench inequality in the global climate-information market.
Economic theory of returns to scale combined with observed compute concentration; no empirical elasticity or returns-to-scale estimates provided.
Concentration of compute and model development creates market power for Northern institutions and companies, likely leading to unequal pricing, control over standards, and capture of high-value climate services.
Descriptive mapping of concentration plus economic analysis of market structure and returns to scale; illustrative rather than quantitatively proven across markets.
Rapid AI adoption without a shift from model-centric to data- and equity-centric development risks producing systematically worse performance and misleading recommendations for the most climate-vulnerable, data-sparse regions.
Synthesis of domain-specific case studies (weather/climate, impact models, LLMs) and conceptual causal tracing demonstrating how infrastructure asymmetry can degrade outputs in vulnerable regions; evidence illustrative rather than causal-estimate based.
Large language models (LLMs) that rely on dominant, textualized climate knowledge tend to foreground Northern epistemologies and marginalize local or indigenous knowledge, reinforcing biases in climate narratives and recommendations.
Case studies and analysis of training-corpus composition and output examples illustrating the dominance of Northern textual sources and examples of sidelining local knowledge; no large-scale audit results provided.
In climate impact modelling, sparse and unrepresentative exposure and vulnerability data combined with inadequate validation generate high uncertainty and risk of misleading interventions and maladaptation in vulnerable locales.
Targeted case studies and literature synthesis showing gaps in exposure/vulnerability datasets and validation failures; argument is illustrated rather than quantified across all systems.
In weather and climate modelling, historically and spatially biased observational data produce systematic performance gaps in under-observed tropical and low-income regions, reducing forecast fidelity where adaptive capacity is lowest.
Comparative, domain-specific case studies and literature review documenting observational data sparsity and illustrative empirical performance gaps; no single cross-system statistical estimate provided.
The geographic concentration of compute and model development creates path dependence: model design, training datasets, and validation reflect Northern priorities and contexts.
Conceptual analysis supported by cross-disciplinary synthesis and illustrative case studies showing dataset selection, validation practices, and model design choices aligned with Northern contexts rather than global representativeness.
At the organizational scale, AI adoption is constrained and shaped by compliance requirements, formal policies, and prevailing norms.
Participants' accounts in workshops (n=15) noting compliance and policy considerations; thematic analysis classified these as organizational-level constraints.
Creators who systematize high-throughput AI workflows or control distribution channels may capture outsized returns, potentially increasing winner-take-most dynamics on platforms.
Theoretical implication extrapolated from observed high-throughput practices and monetization strategies in the 377 videos; not directly measured or quantified in the dataset.
Widespread unverifiable income claims and promotional framing create noisy signals about viable earnings, complicating entrants’ investment decisions and labor market expectations.
Analytical inference based on the documented prevalence of unverifiable earnings claims in the 377 videos and theory about market signaling; not quantitatively tested in the paper.
GenAI lowers the time and skill cost of producing many types of creative outputs, which can increase content supply and exert downward pressure on wages for routine creative tasks.
Inference drawn as an implication from observed practices (e.g., mass production workflows) in the 377 videos and existing literature; not directly measured in this study.
Creators and the community knowledge base document shifting norms around authorship and attribution: GenAI blurs who is considered the creator and complicates labor recognition and rights.
Coding captured explicit discussion and contested norms about authorship, attribution, and creator identity across the 377 videos.
Some creators recommend or describe synthetic engagement practices (e.g., automated posting, synthetic comments/engagement) as tactics to inflate visibility.
Thematic coding noted advice or descriptions of engagement-inflating tactics across videos in the 377-video corpus.
Creators surface and often employ practices that raise content misappropriation concerns (use of copyrighted or third-party material in synthetic outputs).
Instances and discussions captured in the 377-video sample where creators show or recommend synthesizing, transforming, or repurposing third‑party content.
Many videos advertise earnings or income claims that are unverifiable within the content, producing noisy market signals.
Qualitative observations from coding the 377 videos noting frequent asserted earnings without reproducible evidence or transparent accounting.
These methodological adaptations reduce but do not eliminate validity threats; they often increase complexity and cost while leaving unresolved issues of generalizability and time-dependence.
Practitioner accounts (n=16) describing limits/tradeoffs of adaptations; authors' synthesis concluding residual threats remain despite adaptations.
External validity is limited: results from a given trial may not generalize across model versions, populations, tasks, or to temporally distant deployments.
Interview-derived themes (16 practitioners) and authors' analytic mapping to external validity concerns; supported by examples of model/version dependence discussed in interviews.
Construct validity is threatened because commonly used outcome measures can misrepresent the constructs of interest when AI changes task structure or human strategies.
Practitioners' reports in semi-structured interviews (n=16) and authors' synthesis illustrating cases where metrics no longer capture intended constructs after AI introduction.
Common internal validity threats in uplift studies of frontier AI include violations of treatment fidelity and SUTVA (e.g., contamination, time-varying treatments).
The paper's validity-consequences section, based on thematic analysis of 16 interviews and mapping practitioner-reported problems to internal validity constructs.
Porous real-world settings cause spillovers and contamination across experimental arms, violating SUTVA and threatening internal validity.
Multiple practitioners (n=16) reported examples of spillovers and contamination during deployment-like studies; thematic analysis mapped these to SUTVA/treatment-fidelity concerns.
Shifting baselines (changes in tools, protocols, or knowledge during and across studies) complicate defining an appropriate control or status quo.
Interview data (16 practitioners) and thematic analysis identifying shifting baselines as a recurring challenge reported by participants.
Rapidly evolving models (nonstationarity) make any single trial a moving target, undermining the temporal stability of measured uplift.
Practitioner reports from semi-structured interviews (n=16) describing model updates and performance changes during/after trials; thematic coding indicating nonstationarity as a common concern.
Properties of frontier AI — rapid model evolution, shifting baselines, heterogeneous and changing users, and porous real-world settings — regularly strain internal, construct, and external validity of human uplift studies.
Recurring themes identified via qualitative analysis of 16 practitioner interviews; mapped to internal/construct/external validity dimensions in the paper's results.
Instability of agent rankings across configurations makes procurement and deployment decisions based on narrow benchmarks risky; firms should evaluate agents under their own scaffolds, datasets, and workflows before committing.
Empirical finding of ranking instability across models, scaffolds, and datasets; methodological recommendation derived from that instability.
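A minimal sketch of how a firm could quantify this instability before committing: rank agents under each of its own configurations and check pairwise rank agreement with Kendall's tau (agent scores and configuration names below are hypothetical):

```python
from itertools import combinations
from scipy.stats import kendalltau

# Hypothetical benchmark scores per agent under three local configurations
# (scaffold/dataset/workflow combinations); values are illustrative.
scores = {
    "config_a": {"agent1": 0.81, "agent2": 0.74, "agent3": 0.69},
    "config_b": {"agent1": 0.62, "agent2": 0.78, "agent3": 0.71},
    "config_c": {"agent1": 0.75, "agent2": 0.73, "agent3": 0.80},
}

agents = sorted(scores["config_a"])
for c1, c2 in combinations(scores, 2):
    tau, p = kendalltau([scores[c1][a] for a in agents],
                        [scores[c2][a] for a in agents])
    print(f"{c1} vs {c2}: tau = {tau:.2f}")
```

Low or negative tau across configurations indicates that a leaderboard from any single configuration is a weak guide to local performance.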