Evidence (3062 claims)
- Adoption: 5227 claims
- Productivity: 4503 claims
- Governance: 4100 claims
- Human-AI Collaboration: 3062 claims
- Labor Markets: 2480 claims
- Innovation: 2320 claims
- Org Design: 2305 claims
- Skills & Training: 1920 claims
- Inequality: 1311 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 373 | 105 | 59 | 439 | 984 |
| Governance & Regulation | 366 | 172 | 115 | 55 | 718 |
| Research Productivity | 237 | 95 | 34 | 294 | 664 |
| Organizational Efficiency | 364 | 82 | 62 | 34 | 545 |
| Technology Adoption Rate | 293 | 118 | 66 | 30 | 511 |
| Firm Productivity | 274 | 33 | 68 | 10 | 390 |
| AI Safety & Ethics | 117 | 178 | 44 | 24 | 365 |
| Output Quality | 231 | 61 | 23 | 25 | 340 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 158 | 68 | 33 | 17 | 279 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 88 | 31 | 38 | 9 | 166 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 105 | 12 | 21 | 11 | 150 |
| Consumer Welfare | 68 | 29 | 35 | 7 | 139 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 71 | 10 | 29 | 6 | 116 |
| Worker Satisfaction | 46 | 38 | 12 | 9 | 105 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 11 | 16 | 94 |
| Task Completion Time | 76 | 5 | 4 | 2 | 87 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 16 | 9 | 5 | 48 |
| Job Displacement | 5 | 29 | 12 | — | 46 |
| Social Protection | 19 | 8 | 6 | 1 | 34 |
| Developer Productivity | 27 | 2 | 3 | 1 | 33 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 8 | 4 | 9 | — | 21 |
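The matrix is a plain cross-tabulation; a minimal sketch of how it could be reproduced from a claim-level export, assuming a hypothetical claims.csv with outcome and direction columns (file and column names are illustrative, not the source data):

```python
import pandas as pd

# Hypothetical claim-level export: one row per claim with its outcome
# category and coded direction of finding. File and column names are
# assumptions, not the source data.
claims = pd.read_csv("claims.csv")  # columns: outcome, direction

# Cross-tabulate outcome against direction, add totals, and sort
# categories by total claim count as in the matrix above.
matrix = pd.crosstab(claims["outcome"], claims["direction"],
                     margins=True, margins_name="Total")
matrix = matrix.drop(index="Total").sort_values("Total", ascending=False)

# Reorder columns to match the table (assumes these exact direction labels).
matrix = matrix[["Positive", "Negative", "Mixed", "Null", "Total"]]
print(matrix.to_markdown())  # requires the tabulate package
```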
Human-AI Collaboration
Large language models (LLMs) risk reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market.
Framed as an assertion supported by prior literature and used as motivation for the study; partially evaluated empirically in this paper via the GPT-5 experiment.
The inability of models to reliably self-author useful Skills implies that models typically cannot produce the procedural knowledge they would benefit from consuming.
Interpretation based on the empirical finding that self-generated Skills provided no average benefit; the conclusion about model-authored procedural content quality is inferred. The paper's claim is supported by the comparative experimental results, but the inference about broader capabilities is derived from those results rather than from a direct, separate measurement.
In some tasks, curated Skills worsened performance: 16 of 84 tasks showed negative deltas.
Per-task delta analysis reported in the paper: the authors report 16 tasks with negative deltas, where curated Skills reduced the pass rate. (Note: the paper elsewhere reports 86 tasks in the benchmark, while the negative-task count is reported as 16 of 84 in the paper's per-task summary.)
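A minimal sketch of this kind of per-task delta analysis, assuming hypothetical per-task pass rates with and without curated Skills (the data layout and values are illustrative, not the paper's):

```python
import pandas as pd

# Hypothetical per-task results: pass rate for each task with and
# without curated Skills. Structure and values are illustrative.
results = pd.DataFrame({
    "task": ["t1", "t2", "t3"],
    "pass_rate_baseline": [0.50, 0.80, 0.30],
    "pass_rate_skills":   [0.70, 0.75, 0.30],
})

# Per-task delta: positive means curated Skills helped on that task.
results["delta"] = results["pass_rate_skills"] - results["pass_rate_baseline"]

n_negative = (results["delta"] < 0).sum()
print(f"{n_negative} of {len(results)} tasks show negative deltas")
```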
Access to digital learning and credential portability could unevenly benefit those with connectivity or prior skills, creating distributional effects and digital divides that should be measured.
Conceptual risk analysis and distributional reasoning based on digital access differentials; no empirical subgroup analysis reported.
Corridor governance is fragmented, with uneven implementation capacity across sending and receiving actors.
Governance gap analysis and desk review of corridor institutional arrangements; qualitative identification of capacity and accountability shortfalls.
Current mandatory pre-departure training is typically delivered late, generically, and with weak assessment, limiting its capacity to change recruitment choices or support migrants after arrival.
Structured desk review of policy and program materials and corridor process mapping identifying timing, actors, and touchpoints; qualitative/administrative evidence rather than quantitative outcome measurement.
Platforms optimized for engagement can produce externalities that distort lived temporality (loss of presence and meaning) beyond standard attention‑capture harms.
Argument synthesizing platform literature and phenomenological concerns; no new empirical analysis of platform effects provided.
Contemporary transhumanist and neurotechnology developments (BCIs, neural digital twins, human–AI collaboration) have advanced technologically but lack a robust conceptual core focused on lived experience and temporality.
Survey and synthesis of existing literatures reported in the paper (conceptual review); no systematic empirical content analysis or coded sample size provided.
LLM-generated participants are particularly risky in strategic and game-theoretic settings because they may misrepresent incentives, dynamic strategic thinking, and bounded rationality.
Review highlights examples and theoretical concerns from multiple studies indicating misrepresentation of strategic behavior; grouped under risks for strategic settings.
The absence of level‑4 evidence (organizational/patient outcomes) limits the ability of health systems and payers to conduct cost‑benefit or return‑on‑investment analyses for upskilling investments in AI.
No included study reported level‑4 outcomes; the paper reasons that without organizational/patient outcome data, economic evaluation is hampered.
Because most programs were short, introductory, and assessed only short‑term learner outcomes, they likely produce modest increases in individual AI literacy but are insufficient to build advanced clinical AI competencies that would change clinical task allocation or productivity.
Synthesis combining program characteristics (short duration, introductory content, academic delivery) and outcome mapping to only Kirkpatrick levels 1–3 in the 27 studies; interpretation drawn in the paper.
Workplace stress is associated with reduced job performance.
PLS-SEM analysis on the same N = 350 sample. Reported direct path: Stress → Performance, β = 0.158, p < 0.001. (Note: the study interprets this as stress reducing performance; sign/coding conventions are not detailed in the summary.)
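For readers unfamiliar with standardized path coefficients: as a rough illustration only (plain OLS on standardized scores, not the paper's PLS-SEM estimation), the sketch below shows how a β of this kind is read as the expected standard-deviation change in performance per standard-deviation change in stress. All data are simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-ins for latent construct scores (illustrative only).
n = 350
stress = rng.normal(size=n)
performance = -0.3 * stress + rng.normal(size=n)  # sign depends on coding

# Standardize both sides so the slope is a standardized coefficient,
# comparable in scale to a PLS-SEM path coefficient (beta).
z = lambda x: (x - x.mean()) / x.std()
fit = sm.OLS(z(performance), sm.add_constant(z(stress))).fit()

print(f"standardized beta = {fit.params[1]:.3f}, p = {fit.pvalues[1]:.4f}")
```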
High upfront and maintenance costs create scale advantages for larger institutions or centralized providers, potentially concentrating market power among well-resourced curriculum developers.
Economic inference from cost structure described in paper; no market concentration empirical data provided.
Disadvantages and risks include significant resource investment, the complexity of aligning multiple standards, and a high demand for continuous updates and audits.
Paper's risks section (author assertion); no quantified cost or burden data.
Implementing this program requires substantial resources and ongoing governance.
Author assertions in disadvantages/risks section; no cost accounting or empirical costing data provided.
Proprietary models trained on large clinical datasets can create high entry barriers, concentrating market power among a few platform firms and increasing prices for hospitals.
Market-structure and platform economics analysis in the paper; empirical evidence of concentration in GenAI healthcare is limited and no firm-level market-share data are provided.
Liability and accountability gaps exist for AI-suggested errors: it is unclear whether vendors, hospitals, or clinicians are responsible for harms resulting from GenAI CDS recommendations.
Policy and legal analysis discussed in the paper; this is a structural/legal observation rather than an empirical finding and no case-law sample size is provided.
Current simulation practice is insufficiently integrated with enabling technologies (digital twins, data analytics, AI/ML) and with relevant government policy constraints.
Synthesis of literature and gap analysis in the paper; assertions are conceptual and not empirically tested within the paper.
Current simulation practice has limited strategic orientation, often focusing more on tactical and operational questions than on firm strategy.
Literature review and analysis in the paper highlighting the emphasis in existing studies on tactical/operational problems.
Current simulation practice lacks contextualization to firm‑ and industry‑specific realities.
Findings from the paper's literature review and critique sections; no new empirical measurement provided.
Current manufacturing and supply‑chain simulation practices are insufficiently contextualized, strategically focused, or integrated with modern technologies and policy considerations.
Literature review and critique of existing simulation practice presented in the paper; no original empirical data or case studies.
Personalization raises distributional concerns and risks of manipulation or biased treatment; regulators may need to set transparency, fairness, and data-use standards.
Policy analysis and normative recommendation based on known risks of personalization systems; not empirically demonstrated in robotic deployments here.
LLM-based personalization generates context-aware responses but often fails to model long-term preferences and fine-grained user/item relations needed for consistent, proactive personalization.
Conceptual critique based on surveyed limitations of LLM-based approaches; no new experimental data reported.
ANN analysis ranks need-for-human-interaction barriers as the most important predictor of GAICS adoption outcome.
ANN feature-importance analysis reported in the paper ranks predictors of the adoption outcome and identifies the need-for-human-interaction barrier as the top predictor; the paper's abstract does not detail the ANN implementation or sample characteristics.
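The paper does not describe its ANN sensitivity analysis, but a common way to rank predictor importance for a neural network is permutation importance, sketched here with scikit-learn on synthetic data (predictor names and data are hypothetical, not the study's):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-ins for survey predictors; names are hypothetical.
features = ["human_interaction_barrier", "perceived_usefulness", "trust", "cost"]
X = rng.normal(size=(300, len(features)))
y = (X[:, 0] * -1.5 + X[:, 1] + rng.normal(size=300) > 0).astype(int)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X, y)

# Permutation importance: drop in accuracy when a predictor is shuffled.
imp = permutation_importance(net, X, y, n_repeats=30, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```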
Students raised concerns about ChatGPT producing factual errors, the risk of overreliance that could reduce independent thinking, and functional constraints of free ChatGPT versions.
Qualitative analysis of open-ended student survey responses; concerns consistently reported across responses in the sample of 254 students.
Biased or unrepresentative AI outputs produce negative externalities, including maladaptation and inefficient investments in vulnerable regions.
Conceptual analysis and illustrative cases linking misleading model outputs to maladaptive decisions; the paper notes risks rather than providing quantified incidence or cost estimates.
Returns to scale in compute and data favor incumbents; without intervention this dynamic can entrench inequality in the global climate-information market.
Economic theory of returns to scale combined with observed compute concentration; no empirical elasticity or returns-to-scale estimates provided.
Concentration of compute and model development creates market power for Northern institutions and companies, likely leading to unequal pricing, control over standards, and capture of high-value climate services.
Descriptive mapping of concentration plus economic analysis of market structure and returns to scale; illustrative rather than quantitatively proven across markets.
Rapid AI adoption without a shift from model-centric to data- and equity-centric development risks producing systematically worse performance and misleading recommendations for the most climate-vulnerable, data-sparse regions.
Synthesis of domain-specific case studies (weather/climate, impact models, LLMs) and conceptual causal tracing demonstrating how infrastructure asymmetry can degrade outputs in vulnerable regions; evidence illustrative rather than causal-estimate based.
Large language models (LLMs) that rely on dominant, textualized climate knowledge tend to foreground Northern epistemologies and marginalize local or indigenous knowledge, reinforcing biases in climate narratives and recommendations.
Case studies and analysis of training-corpus composition and output examples illustrating the dominance of Northern textual sources and examples of sidelining local knowledge; no large-scale audit results provided.
In climate impact modelling, sparse and unrepresentative exposure and vulnerability data combined with inadequate validation generate high uncertainty and risk of misleading interventions and maladaptation in vulnerable locales.
Targeted case studies and literature synthesis showing gaps in exposure/vulnerability datasets and validation failures; argument is illustrated rather than quantified across all systems.
In weather and climate modelling, historically and spatially biased observational data produce systematic performance gaps in under-observed tropical and low-income regions, reducing forecast fidelity where adaptive capacity is lowest.
Comparative, domain-specific case studies and literature review documenting observational data sparsity and illustrative empirical performance gaps; no single cross-system statistical estimate provided.
The geographic concentration of compute and model development creates path dependence: model design, training datasets, and validation reflect Northern priorities and contexts.
Conceptual analysis supported by cross-disciplinary synthesis and illustrative case studies showing dataset selection, validation practices, and model design choices aligned with Northern contexts rather than global representativeness.
At the organizational scale, AI adoption is constrained and shaped by compliance requirements, formal policies, and prevailing norms.
Participants' accounts in workshops (n=15) noting compliance and policy considerations; thematic analysis classified these as organizational-level constraints.
Creators who systematize high-throughput AI workflows or control distribution channels may capture outsized returns, potentially increasing winner-take-most dynamics on platforms.
Theoretical implication extrapolated from observed high-throughput practices and monetization strategies in the 377 videos; not directly measured or quantified in the dataset.
Widespread unverifiable income claims and promotional framing create noisy signals about viable earnings, complicating entrants’ investment decisions and labor market expectations.
Analytical inference based on the documented prevalence of unverifiable earnings claims in the 377 videos and theory about market signaling; not quantitatively tested in the paper.
GenAI lowers the time and skill cost of producing many types of creative outputs, which can increase content supply and exert downward pressure on wages for routine creative tasks.
Inference drawn as an implication from observed practices (e.g., mass production workflows) in the 377 videos and existing literature; not directly measured in this study.
Creators and the community knowledge base document shifting norms around authorship and attribution: GenAI blurs who is considered the creator and complicates labor recognition and rights.
Coding captured explicit discussion and contested norms about authorship, attribution, and creator identity across the 377 videos.
Some creators recommend or describe synthetic engagement practices (e.g., automated posting, synthetic comments/engagement) as tactics to inflate visibility.
Thematic coding noted advice or descriptions of engagement-inflating tactics across videos in the 377-video corpus.
Creators surface and often employ practices that raise content misappropriation concerns (use of copyrighted or third-party material in synthetic outputs).
Instances and discussions captured in the 377-video sample where creators show or recommend synthesizing, transforming, or repurposing third‑party content.
Many videos advertise earnings or income claims that are unverifiable within the content, producing noisy market signals.
Qualitative observations from coding the 377 videos noting frequent asserted earnings without reproducible evidence or transparent accounting.
These methodological adaptations reduce but do not eliminate validity threats; they often increase complexity and cost while leaving unresolved issues of generalizability and time-dependence.
Practitioner accounts (n=16) describing limits/tradeoffs of adaptations; authors' synthesis concluding residual threats remain despite adaptations.
External validity is limited: results from a given trial may not generalize across model versions, populations, tasks, or to temporally distant deployments.
Interview-derived themes (16 practitioners) and authors' analytic mapping to external validity concerns; supported by examples of model/version dependence discussed in interviews.
Construct validity is threatened because commonly used outcome measures can misrepresent the constructs of interest when AI changes task structure or human strategies.
Practitioners' reports in semi-structured interviews (n=16) and authors' synthesis illustrating cases where metrics no longer capture intended constructs after AI introduction.
Common internal validity threats in uplift studies of frontier AI include violations of treatment fidelity and SUTVA (e.g., contamination, time-varying treatments).
The paper's validity-consequences section, based on thematic analysis of 16 interviews and mapping practitioner-reported problems to internal validity constructs.
Porous real-world settings cause spillovers and contamination across experimental arms, violating SUTVA and threatening internal validity.
Multiple practitioners (n=16) reported examples of spillovers and contamination during deployment-like studies; thematic analysis mapped these to SUTVA/treatment-fidelity concerns.
Shifting baselines (changes in tools, protocols, or knowledge during and across studies) complicate defining an appropriate control or status quo.
Interview data (16 practitioners) and thematic analysis identifying shifting baselines as a recurring challenge reported by participants.
Rapidly evolving models (nonstationarity) make any single trial a moving target, undermining the temporal stability of measured uplift.
Practitioner reports from semi-structured interviews (n=16) describing model updates and performance changes during/after trials; thematic coding indicating nonstationarity as a common concern.
Properties of frontier AI — rapid model evolution, shifting baselines, heterogeneous and changing users, and porous real-world settings — regularly strain internal, construct, and external validity of human uplift studies.
Recurring themes identified via qualitative analysis of 16 practitioner interviews; mapped to internal/construct/external validity dimensions in the paper's results.
Instability of agent rankings across configurations makes procurement and deployment decisions based on narrow benchmarks risky; firms should evaluate agents under their own scaffolds, datasets, and workflows before committing.
Empirical finding of ranking instability across models, scaffolds, and datasets; methodological recommendation derived from that instability.
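A minimal sketch of how a firm could quantify this instability before committing: rank agents under each of its own configurations and check pairwise rank agreement with Kendall's tau (agent scores and configuration names below are hypothetical):

```python
from itertools import combinations
from scipy.stats import kendalltau

# Hypothetical benchmark scores per agent under three local configurations
# (scaffold/dataset/workflow combinations); values are illustrative.
scores = {
    "config_a": {"agent1": 0.81, "agent2": 0.74, "agent3": 0.69},
    "config_b": {"agent1": 0.62, "agent2": 0.78, "agent3": 0.71},
    "config_c": {"agent1": 0.75, "agent2": 0.73, "agent3": 0.80},
}

agents = sorted(scores["config_a"])
for c1, c2 in combinations(scores, 2):
    tau, p = kendalltau([scores[c1][a] for a in agents],
                        [scores[c2][a] for a in agents])
    print(f"{c1} vs {c2}: tau = {tau:.2f}")
```

Low or negative tau across configurations indicates that a leaderboard from any single configuration is a weak guide to local performance.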