Evidence (6869 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Governance

Overfitting/contamination: LLMs can reproduce pre-training or fine-tuning data (stochastic parroting) and leak training-set content into outputs.

Multiple reviewed studies document examples of content reproduction and data leakage; the review categorizes these as overfitting/contamination.

high negative Synthetic Participants Generated by Large Language Models: A... occurrence of memorized or training-set-specific content in generated outputs

Misleading believability: LLM outputs may look plausible but be incorrect or unrepresentative, risking overconfidence in synthetic data.

Reported instances in the literature and organized failure taxonomy describing plausible-looking but inaccurate synthetic responses.

high negative Synthetic Participants Generated by Large Language Models: A... rate of plausible-but-incorrect or unrepresentative outputs (perceived plausibil...

Distortions: LLM outputs can exhibit systematic biases relative to target human distributions.

Empirical findings across reviewed studies showing output distributions from LLMs that deviate from human sample distributions; aggregated in the distortions failure category.

high negative Synthetic Participants Generated by Large Language Models: A... distributional deviations between LLM-generated responses and human responses (b...

Cognitive misalignments: LLMs differ from humans in reasoning, goals, and bounded rationality, which can alter behavior in economic and strategic tasks.

Multiple studies in the review reported systematic differences in reasoning and goal-directed behavior when comparing LLM outputs to human participants; coded under the cognitive misalignment category.

high negative Synthetic Participants Generated by Large Language Models: A... alignment of reasoning processes and goal-directed responses between LLMs and hu...

Major failure modes limiting synthetic participants as direct substitutes for humans are: cognitive misalignments, distortions, misleading believability, and overfitting/contamination.

Standardized taxonomy developed by coding the 182 studies into generalizable indicators and organizing failure types into four categories.

high negative Synthetic Participants Generated by Large Language Models: A... types and frequency of fidelity failures (categorical classification of failure ...

The information-theoretic uncertainty measure provides a mechanism-level explanation for why deception value falls as transparency increases (residual uncertainty explains utility changes).

Analytical linkage in the model connecting the entropy-like residual uncertainty metric to equilibrium utility changes; theoretical argument and derivation in the paper.

high negative Evaluating Synthetic Cyber Deception Strategies Under Uncert... relationship between residual attacker uncertainty (entropy-like) and change in ...

The value of deception degrades (falls) as the true system state becomes more observable; this degradation is quantifiable via the price-of-transparency metric.

Analytical definition of price of transparency as marginal change and supporting theoretical results; computational experiments that sweep observability/transparency levels (simulated experiments, parameter sweeps; number of scenarios not specified).

high negative Evaluating Synthetic Cyber Deception Strategies Under Uncert... value of deception as a function of observability; price of transparency (margin...

The paper derives closed-form bounds and break-even conditions that delineate when deception is ineffective due to cost or detectability.

Theoretical proofs and closed-form inequalities presented in the analytical section (derivations of bounds and break-even conditions).

high negative Evaluating Synthetic Cyber Deception Strategies Under Uncert... value of deception (conditions where value ≤ 0 or falls below cost thresholds)

If deployed without mitigation, GenAI CDS risks widening disparities by performing worse on underrepresented groups or being unequally distributed across resource-rich versus resource-poor settings.

Fairness literature, subgroup performance concerns, and distributional risk analysis cited in the paper; direct empirical demonstrations of widened disparities due to GenAI CDS are limited in the literature per the paper.

high negative GenAI and clinical decision making in general practice differences in performance/outcomes across demographic and socioeconomic groups;...

Limited public datasets and vendor lock-in constrain independent reproducible evaluations and audits of current generative models in healthcare.

Observation and policy analysis in the paper noting scarcity of public clinical datasets for state-of-the-art models and proprietary constraints; no dataset counts provided.

high negative GenAI and clinical decision making in general practice availability of public datasets; reproducibility of model evaluations; number of...

GenAI CDS creates data privacy and security risks because of high-value medical data and use of external cloud services.

Known cybersecurity risks and documented incidents in health IT; the paper cites the general risk context rather than specific breach sample counts tied to GenAI deployments.

high negative GenAI and clinical decision making in general practice data breaches; unauthorized access incidents; compliance violations

GenAI CDS can amplify bias and inequities if training data underrepresent groups or reflect historical disparities.

Fairness and robustness audit literature and subgroup performance analyses referenced in the paper; specific empirical demonstrations for contemporary GenAI CDS are limited and sample sizes not given.

high negative GenAI and clinical decision making in general practice performance disparities across demographic subgroups; differential error rates; ...

GenAI CDS systems hallucinate and can produce incorrect but plausible recommendations, which can cause patient harm if trusted unchecked.

Documented failure modes of generative models and examples from controlled evaluations; the paper references known hallucination behavior from model audits and case reports, though it does not quantify incidence rates or provide large-scale observational harm data.

high negative GenAI and clinical decision making in general practice adverse events; erroneous recommendations; clinician reliance/misuse leading to ...

Reproducibility and deployment gaps are widespread: missing code, inconsistent benchmarks, and insufficient productionization focus (monitoring, model updates, rollback).

Surveyed literature often lacks released code and consistent benchmarks; thematic analysis highlights absence of operational deployment practices.

high negative International Journal on Cybernetics & Informatics reproducibility indicators (code availability, benchmark consistency) and deploy...

Common ML pipeline pitfalls include overfitting, poor cross-validation practices, lack of real-time/online evaluation, and inadequate feature engineering.

Critical assessment of experimental practices in the surveyed literature identifying methodological shortcomings that can inflate reported performance.

high negative International Journal on Cybernetics & Informatics validity/reliability of reported model performance

There is a lack of large, labeled, realistic IoT datasets; class imbalance, concept drift, dataset bias, and synthetic datasets that poorly reflect real traffic are common problems.

Review of datasets (N-BaIoT, Bot-IoT, TON_IoT, UNSW-NB15, KDD variants, custom/synthetic datasets) and critical assessment of their limitations across studies.

high negative International Journal on Cybernetics & Informatics dataset quality and representativeness; labeling availability

Resource constraints (limited CPU, memory, energy, and network bandwidth on devices and edge nodes) significantly limit feasible ML model complexity and deployment choices.

Multiple surveyed studies report hardware constraints and evaluate runtime/memory/latency; survey synthesizes these resource limitations as a recurring challenge.

high negative International Journal on Cybernetics & Informatics resource usage (CPU, memory, energy) and feasible model complexity

Despite high reported detection accuracies in academic work, there is a shortage of production-grade, deployable ML-IDS for IoT.

Critical review of surveyed papers showing many report lab metrics but few report deployment case studies, production rollouts, or provide deployment artifacts (code, runtime/energy measurements).

high negative International Journal on Cybernetics & Informatics deployment readiness/production adoption

Limitations of the review include restricted sample size, Scopus-only coverage, emergent-literature timeframe, and heterogeneity in study designs and measures, which constrain generalizability.

Authors' limitations subsection explicitly listing these constraints from their SLR process.

high negative Pricing Strategy in Digital Marketing: A Systematic Review o... Generalisability and completeness of the review's conclusions

There has been insufficient attention in the literature to ethics, fairness, and consumer welfare in algorithmic pricing.

Persistent gap identified in the SLR—few or no included studies focused on ethics/fairness/welfare issues according to authors' coding.

high negative Pricing Strategy in Digital Marketing: A Systematic Review o... Coverage of ethics/fairness/consumer welfare topics in digital pricing literatur...

Existing empirical studies on digital VBP exhibit methodological limitations, including small/limited samples, short time windows, and inconsistent measures.

Authors' methodological critique from the SLR based on assessment of study designs and measures reported in the 30 articles.

high negative Pricing Strategy in Digital Marketing: A Systematic Review o... Methodological rigor and validity of existing digital VBP studies

Automated compliance and credentialing systems raise governance issues (auditability, appeals mechanisms) and risk incorrect automated deregistration if not properly governed.

Governance and algorithmic-risk discussion in the paper; logical argumentation rather than case-based evidence.

high negative Electrotechnical education, institutional complianc... rate of incorrect automated decisions, existence and effectiveness of appeal pro...

The paper models career progression as a continuous function and treats certification gaps as discontinuities that impede labour-market mobility.

Mathematical/conceptual modeling described in the methods (career-progression-as-continuous-function approach); this is a modeling choice reported in the paper rather than an empirical finding.

high negative Electrotechnical education, institutional complianc... labour-market mobility / continuity of career progression (in the conceptual mod...

There is limited long-term impact evidence and few system-level assessments of AI in developing-country agriculture.

Authors' methodological caveat based on the temporal scope and types of studies available in the >60-study review.

high negative A systematic review of the economic impact of artificial int... presence/absence of long-term impact evaluations and system-level assessments

The evidence base is skewed toward pilots and high‑performer contexts; there is a lack of long‑panel, multi‑project longitudinal studies to validate typical returns and scalability.

Authors' assessment of evidence types in the 160 studies: mix of conceptual papers, case studies, pilots, and only limited larger empirical evaluations.

high negative Digital Twins Across the Asset Lifecycle: Technical, Organis... representativeness and longitudinal robustness of evidence

Substantial compute and resource requirements for training and inference concentrate capabilities among well‑resourced labs and firms.

Paper discusses large compute budgets for training/inference and states that performance scales with data, model size, and compute; it infers concentration of capabilities but provides no empirical market concentration measures.

high negative Protein structure prediction powered by artificial intellige... distribution of computational capability/resources across organizations and resu...

Structure predictors depend on training data and exhibit biases; experimental validation remains necessary.

Paper notes dependence on training data biases and the need for experimental validation; references data sources (PDB, UniRef, metagenomic catalogs) but does not quantify bias magnitudes.

high negative Protein structure prediction powered by artificial intellige... bias in model predictions attributable to training data coverage/quality; requir...

Current limitations include inaccurate prediction of multi‑chain complexes, flexible or rare conformational states, and limited prediction of dynamic ensembles.

Paper explicitly enumerates these limitations in the 'Ongoing limitations' section; no quantitative failure rates are given.

high negative Protein structure prediction powered by artificial intellige... accuracy for multi‑chain complexes, flexible/rare conformations, and ensemble/dy...

Traditional computational methods struggle without homologous templates or with complex folding/dynamics.

Paper discusses limitations of traditional computational methods, emphasizing dependence on homologous templates and difficulty with complex folding/dynamics; specific method comparisons or sample sizes are not provided.

high negative Protein structure prediction powered by artificial intellige... accuracy/success of traditional computational structure prediction in low‑homolo...

Opacity, bias, and errors in AI systems demand auditing, standards, and governance (algorithmic accountability) to ensure trustworthy assessment.

Synthesis of literature on algorithmic bias and accountability plus policy analysis recommending audits and standards; supported by country cases that discuss governance concerns.

high negative The Future of Assessment: Rethinking Evaluation in an AI-Ass... algorithmic fairness, transparency, and reliability

Student data used by AI vendors raises risks around consent, reuse, commercial exploitation, and other data-privacy concerns.

Policy analysis and literature on data governance, privacy law debates; examples from national policy documents in the comparative cases. No original data on breaches or misuse presented.

high negative The Future of Assessment: Rethinking Evaluation in an AI-Ass... privacy risks and governance of student data

Empirical evaluation of integrated defenses, quantitative cost/benefit analyses, and standardized threat models for VR are research gaps that remain unaddressed in the literature window surveyed (2023–2025).

Authors' stated limitations from their comparative literature review of 31 studies noting an absence of primary empirical validation and quantitative economic analyses in the reviewed corpus.

high negative Securing Virtual Reality: Threat Models, Vulnerabilities, an... presence/absence of empirical validation, cost‑benefit studies, and standard thr...

Immersive VR systems collect continuous multimodal signals (motion tracking, gaze, voice, biometrics) that enable novel inference, spoofing, and manipulation attacks beyond traditional IT threats.

Synthesis of threat descriptions across the 31 reviewed peer‑reviewed studies (2023–2025) documenting sensor modalities and attack vectors; qualitative comparative evaluation of attack surfaces.

high negative Securing Virtual Reality: Threat Models, Vulnerabilities, an... existence and extent of expanded attack surface due to multimodal signal collect...

The Omnibus overlaps substantively with the DSA and other digital policies, creating potential jurisdictional and interpretive ambiguities about which rules apply to platforms and AI-enabled services.

Comparative mapping and legal/regulatory review identifying overlapping provisions; qualitative analysis of proposed texts (no quantitative sample).

high negative The Digital Omnibus and the Future of EU Regulation: Implica... jurisdictional/interpretive clarity of applicable rules for platforms and AI ser...

Pakistan prioritizes economic and digital governance objectives, with comparatively weak governance of military AI.

Review of Pakistan’s economic and digital governance plans, export‑control materials, and secondary literature on Pakistan’s civil–military relations.

high negative Regulating AI in National Security: A Comparative S... strength and formality of military AI governance

Large-scale machine learning enables invisible inferences about users from seemingly innocuous data.

Conceptual claim presented in the workshop and supported by referenced technical literature on inference capabilities of ML models (discussion in position papers); workshop itself did not present a new empirical experiment.

high negative Moving Beyond Clicks: Rethinking Consent and User Control in... privacy risk from inferred attributes (inference accuracy / presence of invisibl...

Inequities in climate-AI systems appear across three development phases—Inputs, Process, and Outputs—creating multiple failure points where Global North advantages propagate into final products.

Conceptual framework developed from cross-disciplinary synthesis, literature review, and illustrative examples (Inputs → Process → Outputs mapping).

high negative The Rise of AI in Weather and Climate Information and its Im... Presence of inequities at each phase of the AI development lifecycle (data avail...

Foundation-model development and high-performance computing (HPC) capacity are overwhelmingly located in the Global North.

Descriptive mapping of global HPC infrastructure and foundation-model authorship described in the paper (infrastructure mapping and authorship analysis). No single quantitative sample size reported; evidence based on spatial mapping and documented locations of compute centers and model-development institutions.

high negative The Rise of AI in Weather and Climate Information and its Im... Geographic distribution of HPC capacity and foundation-model development (locati...

Ambiguity about the probability of data leaks (a 10–50% range) reduces user adoption of AI personalization relative to a neutral privacy presentation.

Between-subjects online experiment, 2 (information environment: Risk vs Ambiguity) × 3 (privacy-treatment conditions), N = 610 participants randomized across arms. Leak-probability ambiguity was presented as a 10–50% range, and adoption was measured as the choice of a personalized versus standard basket. Under ambiguity, the privacy-threatening conditions produced a statistically significant reduction in adoption compared to neutral.

high negative The Data-Dollars Tradeoff: Privacy Harms vs. Economic Risk i... Adoption choice: proportion choosing AI-personalized basket versus standard bask...

Rank stability analysis across the whole citation distribution shows instability not only at the tail but across frequently cited domains; rankings shift substantially across samples.

Distribution-wide rank-stability methods applied to repeated-sample citation data from the three platforms and three topics, comparing domain ranks across samples and quantifying rank-change frequency and magnitude.

high negative Quantifying Uncertainty in AI Visibility: A Statistical Fram... rank stability of domains by citation frequency across repeated samples

Bootstrap-based confidence intervals show wide uncertainty: many domain-level differences that look meaningful in single-run snapshots fall within measurement noise.

Bootstrap resampling applied to repeated-sample data (collected across nine days and high-frequency sampling) to compute confidence intervals for citation shares and prevalence; many pairwise or between-domain differences were not statistically separable once CIs were considered.

high negative Quantifying Uncertainty in AI Visibility: A Statistical Fram... width of bootstrap confidence intervals for domain citation shares / prevalence ...
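
The bootstrap idea in this entry can be illustrated with a minimal sketch. It assumes a percentile bootstrap over individual citations, which may differ from the resampling unit the paper actually uses; the function name `bootstrap_share_ci`, the domain names, and the sample counts are all hypothetical.

```python
import random

def bootstrap_share_ci(citations, domain, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the share of `citations` equal to `domain`.

    Assumption: each citation is resampled independently with replacement.
    """
    rng = random.Random(seed)
    n = len(citations)
    shares = []
    for _ in range(n_boot):
        resample = [citations[rng.randrange(n)] for _ in range(n)]
        shares.append(sum(1 for c in resample if c == domain) / n)
    shares.sort()
    lo = shares[int((alpha / 2) * n_boot)]
    hi = shares[int((1 - alpha / 2) * n_boot) - 1]
    point = sum(1 for c in citations if c == domain) / n
    return point, lo, hi

# Hypothetical sample: 40 cited domains pooled from repeated runs of one query.
sample = ["a.example"] * 14 + ["b.example"] * 11 + ["c.example"] * 15

point_a, lo_a, hi_a = bootstrap_share_ci(sample, "a.example")
point_b, lo_b, hi_b = bootstrap_share_ci(sample, "b.example")

print(f"a.example share: {point_a:.3f}, 95% CI ({lo_a:.3f}, {hi_a:.3f})")
print(f"b.example share: {point_b:.3f}, 95% CI ({lo_b:.3f}, {hi_b:.3f})")
```

With a sample this small the two intervals overlap, so the apparent lead of one domain over the other in a single snapshot would not be distinguishable from sampling noise.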

Single-run point estimates of citation share or prevalence are misleading; visibility metrics should be treated as estimators with uncertainty and reported with confidence intervals.

Comparison of single-run snapshots to distributions obtained from repeated sampling (daily and 10-minute interval regimes) and bootstrap resampling showing wide sample-to-sample variation and wide CI widths for domain-level shares and prevalence metrics.

high negative Quantifying Uncertainty in AI Visibility: A Statistical Fram... bias/precision of single-run estimates of domain citation share and prevalence

Generative search platforms are non-deterministic: the same query at different times can yield different answers and different cited domains.

Repeated-query experiments performed on three platforms (Perplexity Search, OpenAI SearchGPT, Google Gemini) across three consumer-product topics, using multi-day sampling (one collection per day over nine days) and high-frequency sampling (repeated queries at 10-minute intervals); observed variation in responses and cited domains across runs.

high negative Quantifying Uncertainty in AI Visibility: A Statistical Fram... response variability (changes in generated answers) and cited domains per query

Despite LoRA being parameter-efficient, fine-tuning and iterative human-in-the-loop workflows still require compute resources and researcher time; governance/versioning of tuned models is necessary.

Caveat stated in the paper about remaining computational and governance costs; no quantitative resource usage reported in the summary.

high negative THETA: A Textual Hybrid Embedding-based Topic Analysis Frame... compute/resource requirements and governance burden

Embedding fine-tuning (DAFT) risks amplifying domain-specific biases present in the tuning corpus, so domain experts and robust evaluation protocols are necessary.

Paper caveat noting bias-amplification risk from fine-tuning embeddings; aligns with known risks in the literature but no empirical bias audit results provided in the summary.

high negative THETA: A Textual Hybrid Embedding-based Topic Analysis Frame... amplification of biases in tuned embeddings / need for bias mitigation

Mean emotional self-alignment between poster and responder is 32.7%, indicating systematic affective mismatch rather than congruence.

Pairwise comparison of emotion labels across post–response pairs in the dataset; computation of mean percentage where poster and immediate responder share the same emotion (32.7%).

high negative What Do AI Agents Talk About? Emergent Communication Structu... percentage of post–response pairs with identical emotion labels (emotional self-...
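
As a rough illustration of the metric described in this entry, the sketch below computes the percentage of post–response pairs whose emotion labels match. It assumes a simple pooled percentage; the paper's mean may be averaged over a different unit (for example per thread or per emotion). The function name and the toy labels are hypothetical.

```python
def emotional_self_alignment(pairs):
    """Percentage of (poster_emotion, responder_emotion) pairs with identical labels."""
    if not pairs:
        raise ValueError("need at least one post-response pair")
    matches = sum(1 for poster, responder in pairs if poster == responder)
    return 100.0 * matches / len(pairs)

# Hypothetical labels for six post-response pairs; two of them match.
pairs = [
    ("joy", "joy"),
    ("anger", "sadness"),
    ("fear", "joy"),
    ("joy", "neutral"),
    ("sadness", "sadness"),
    ("neutral", "anger"),
]

alignment = emotional_self_alignment(pairs)
print(f"emotional self-alignment: {alignment:.1f}%")
```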

Conversational coherence declines rapidly with thread depth, indicating shallow, weakly connected multi-turn exchanges.

Lexical-semantic coherence metrics (e.g., embedding-based similarity) computed across comment threads of varying depth in the Moltbook dataset; observed rapid decrease in coherence scores as thread depth increases.

high negative What Do AI Agents Talk About? Emergent Communication Structu... coherence (similarity) metric as a function of thread depth
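
One way to picture a coherence-by-depth measurement is the sketch below, which averages the cosine similarity between each comment's embedding and its parent's embedding, grouped by thread depth. This is an assumed formulation, not necessarily the paper's metric; the two-dimensional vectors are toy values standing in for real embeddings.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coherence_by_depth(comments):
    """Mean parent-child cosine similarity per depth.

    `comments` is a list of (depth, parent_embedding, child_embedding) tuples.
    """
    sims = defaultdict(list)
    for depth, parent_vec, child_vec in comments:
        sims[depth].append(cosine(parent_vec, child_vec))
    return {depth: sum(vals) / len(vals) for depth, vals in sorted(sims.items())}

# Toy embeddings (hypothetical): replies drift away from the parent as depth grows.
comments = [
    (1, (1.0, 0.0), (0.9, 0.1)),
    (2, (0.9, 0.1), (0.5, 0.5)),
    (3, (0.5, 0.5), (0.0, 1.0)),
]

coherence = coherence_by_depth(comments)
for depth, score in coherence.items():
    print(f"depth {depth}: mean coherence {score:.3f}")
```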

When pipelines have cross-cutting ties, prices oscillate, allocation quality drops, and management becomes difficult.

Empirical simulation results from the ablation study: configurations with non-hierarchical, cross-cutting graph structures produced larger price volatility, frequent oscillations in price updates, and lower allocation value/throughput compared to hierarchical graphs (measured across many runs and random seeds within the 1,620-run experimental set).

high negative Real-Time AI Service Economy: A Framework for Agentic Comput... price volatility and oscillation frequency; allocation quality (value/throughput...

If deployment value is the time-average for one agent, optimizing the usual expected-value objective can lead to poor real-world outcomes.

Reasoning plus the paper's illustrative example demonstrating policies with high expected reward but poor or highly variable realized time-average outcomes; theoretical exposition, no empirical dataset.

high negative Ergodicity in reinforcement learning realized long-run (time-average) reward of deployed agent

Optimizing the expected cumulative reward (ensemble average across trajectories) can be misleading when reward-generating dynamics are non-ergodic because the ensemble expectation does not generally equal the time-average experienced by a single deployed agent.

Theoretical argumentation and a constructive illustrative example in the paper showing divergence between ensemble expectation and single-trajectory time-average; no empirical sample; analysis-based evidence.

high negative Ergodicity in reinforcement learning expected cumulative reward (ensemble expectation) vs. time-average realized rewa...
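
The gap between ensemble expectation and time-average can be seen in a standard multiplicative coin-toss sketch. This is not the paper's own example: the payoffs `UP` and `DOWN`, the step count, and the number of simulated agents are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical multiplicative payoffs, each applied with probability 0.5.
UP, DOWN = 1.5, 0.6

# Ensemble average: expected wealth multiplier per step across many agents.
ensemble_growth = 0.5 * UP + 0.5 * DOWN

# Time average: long-run per-step multiplier along one trajectory
# (geometric mean of the two payoffs).
time_avg_growth = math.exp(0.5 * math.log(UP) + 0.5 * math.log(DOWN))

def simulate(steps, seed):
    """Final wealth of one agent that starts at 1.0 and plays `steps` rounds."""
    rng = random.Random(seed)
    wealth = 1.0
    for _ in range(steps):
        wealth *= UP if rng.random() < 0.5 else DOWN
    return wealth

finals = [simulate(1000, seed) for seed in range(200)]
share_below_start = sum(w < 1.0 for w in finals) / len(finals)

print(f"ensemble growth per step: {ensemble_growth:.4f}")
print(f"time-average growth per step: {time_avg_growth:.4f}")
print(f"share of agents ending below starting wealth: {share_below_start:.2f}")
```

The expected multiplier per step is above 1 while the per-trajectory multiplier is below 1, so a policy that looks attractive under the expected-value objective still leaves almost every individual agent worse off over a long horizon.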
