Evidence (3062 claims)
- Adoption: 5227 claims
- Productivity: 4503 claims
- Governance: 4100 claims
- Human-AI Collaboration: 3062 claims
- Labor Markets: 2480 claims
- Innovation: 2320 claims
- Org Design: 2305 claims
- Skills & Training: 1920 claims
- Inequality: 1311 claims
Evidence Matrix
Claim counts by outcome category and direction of finding. Row totals can exceed the sum of the four listed directions, as some claims carry an unclassified direction.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 373 | 105 | 59 | 439 | 984 |
| Governance & Regulation | 366 | 172 | 115 | 55 | 718 |
| Research Productivity | 237 | 95 | 34 | 294 | 664 |
| Organizational Efficiency | 364 | 82 | 62 | 34 | 545 |
| Technology Adoption Rate | 293 | 118 | 66 | 30 | 511 |
| Firm Productivity | 274 | 33 | 68 | 10 | 390 |
| AI Safety & Ethics | 117 | 178 | 44 | 24 | 365 |
| Output Quality | 231 | 61 | 23 | 25 | 340 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 158 | 68 | 33 | 17 | 279 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 88 | 31 | 38 | 9 | 166 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 105 | 12 | 21 | 11 | 150 |
| Consumer Welfare | 68 | 29 | 35 | 7 | 139 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 71 | 10 | 29 | 6 | 116 |
| Worker Satisfaction | 46 | 38 | 12 | 9 | 105 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 11 | 16 | 94 |
| Task Completion Time | 76 | 5 | 4 | 2 | 87 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 16 | 9 | 5 | 48 |
| Job Displacement | 5 | 29 | 12 | — | 46 |
| Social Protection | 19 | 8 | 6 | 1 | 34 |
| Developer Productivity | 27 | 2 | 3 | 1 | 33 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 8 | 4 | 9 | — | 21 |
Filter: Human-AI Collaboration
The paper proposes a 'manufacturing operation tree'—an organizationally structured framework—to guide development of more realistic, validated, and industry‑relevant simulation models.
Conceptual/modeling output in the paper (diagram and explanation of the manufacturing operation tree); theoretical development rather than empirical testing.
Econometric and causal-inference tools (difference-in-differences, instrumental variables, randomized encouragement designs) are needed to estimate long-term effects of personalized robot interventions.
Recommended methodological agenda for AI economists in the paper; no applied causal studies presented.
Research and deployment will require new datasets: longitudinal multimodal interaction logs, user preference surveys, simulated user populations, and ethically annotated datasets for fairness and safety evaluation.
Data & Methods recommendations based on identified empirical needs; no dataset release or analysis in this paper.
Measuring welfare impact of personalized robots requires going beyond engagement to include non-market outcomes such as well-being, autonomy, and mental health.
Methodological recommendation in the implications and evaluation sections; no empirical measures provided.
A/B testing and longitudinal field studies are necessary for real-world validation of robot personalization, and metrics should include welfare-oriented outcomes (well-being, trust) in addition to engagement.
Recommended evaluation strategy drawing from HRI and RS experimental standards; no field trials reported in this work.
Prior to live trials, offline RS evaluation metrics (precision/recall, NDCG), counterfactual/off-policy estimators, and simulated users should be used to validate personalization policies.
Methodological recommendation based on RS evaluation practices; no empirical comparison with live trials in robots presented.
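The offline ranking metrics named in the claim above can be illustrated with a minimal NDCG computation. This is a generic sketch, not code from the paper; the graded relevance labels are invented for illustration.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: rel_i / log2(i + 2) for 0-indexed rank i.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    # Normalized DCG: DCG of the system ranking over DCG of the ideal ranking.
    ranked = ranked_relevances[:k]
    ideal = sorted(ranked_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of items in the order a personalization policy ranked them:
print(round(ndcg([3, 2, 3, 0, 1, 2]), 3))  # 0.961
```

A perfect ranking scores exactly 1.0, so the metric is comparable across sessions of different lengths.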
Contextual bandits and counterfactual/off-policy learning can enable safe exploration and off-policy evaluation when adapting robot interactions from logged data.
Methodological synthesis referencing contextual bandit and counterfactual learning techniques from RS and causal inference; no robotic implementation experiments reported.
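A standard estimator in the counterfactual/off-policy family referenced above is inverse propensity scoring (IPS). The sketch below is a textbook illustration, not the paper's method; the uniform logging policy and log schema are assumptions.

```python
def ips_estimate(logs, target_policy):
    # Inverse-propensity-scoring (IPS) estimate of a target policy's expected
    # reward, using interactions logged under a stochastic behavior policy.
    # logs: (context, action, reward, behavior_prob) tuples.
    total = 0.0
    for context, action, reward, behavior_prob in logs:
        weight = target_policy(context, action) / behavior_prob
        total += weight * reward
    return total / len(logs)

# Toy log: behavior policy chose actions uniformly over {0, 1} (prob 0.5);
# action 1 always yielded reward 1, action 0 reward 0.
logs = [(None, a, float(a == 1), 0.5) for a in [0, 1, 0, 1, 1, 0, 1, 0]]
always_one = lambda ctx, a: 1.0 if a == 1 else 0.0
print(ips_estimate(logs, always_one))  # 1.0
```

Because the estimator reweights logged rewards rather than requiring new interactions, it supports exactly the safe, offline evaluation of robot-interaction policies the claim describes.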
Sequence-aware recommenders (RNNs, Transformers, Markov/session-based models) are suitable for modeling session dynamics and short-term preference shifts in robot interactions.
Survey of sequence/temporal RS models and their typical use cases; conceptual recommendation only.
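The simplest member of the session-model family surveyed above is a first-order Markov transition model. The sketch below is illustrative; the interaction events (`greet`, `quiz`, `story`) are hypothetical robot-session states, not from the surveyed papers.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    # First-order Markov model over interaction events: P(next | current),
    # estimated from observed session sequences.
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    return {s: {t: c / sum(nxts.values()) for t, c in nxts.items()}
            for s, nxts in counts.items()}

def recommend_next(model, current, k=1):
    # Rank candidate next interactions given the current session state.
    ranked = sorted(model.get(current, {}).items(), key=lambda kv: -kv[1])
    return [event for event, _ in ranked[:k]]

sessions = [["greet", "quiz", "feedback"], ["greet", "quiz", "quiz"],
            ["greet", "story", "feedback"]]
model = fit_transitions(sessions)
print(recommend_next(model, "greet"))  # ['quiz']
```

RNN, Transformer, and session-based variants replace the transition table with a learned sequence encoder but serve the same role: predicting short-term preference shifts within a session.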
RS tooling covers long-term user profiles, short-term/session signals, context-awareness, multi-objective ranking, and evaluation methods suited for personalization at scale.
Review of recommender-systems methods and tooling in the literature; conceptual synthesis without empirical new data.
Recommender systems are specialized in representing, predicting, and ranking user preferences across time and contexts (e.g., collaborative filtering, content-based models, sequential/session models).
Established RS literature surveyed and cited as the basis for the claim; conceptual argument, no new experiments.
Creators explicitly name advertising, direct sales, affiliate marketing, and revenue-sharing models as common monetization channels for GenAI-enabled content.
Explicit references to these monetization channels appeared repeatedly across the 377 videos and were extracted during thematic coding.
Practitioners adopt methodological adaptations — including adaptive/longitudinal designs, versioning/documentation, stratification/moderation analyses, robustness checks, mixed methods, deployment-stage monitoring, and pre-analysis plans — to mitigate validity threats.
Reported mitigation strategies aggregated from the 16 semi-structured interviews and described in the paper's 'Practitioner solutions' section.
Agents detected up to 65% of vulnerabilities in some experimental settings.
Reported detection rate maxima from the study's experiments on certain model/scaffold/task combinations.
The authors constructed a contamination-free dataset of 22 real-world smart-contract security incidents that postdate every evaluated model's release.
Curation procedure described in the methods: 22 incidents selected to occur after all model release dates to prevent leakage.
This study expanded the evaluation matrix to 26 agent configurations spanning four model families and three scaffolding approaches.
Methods reported in this study specifying 26 agent configurations, four model families, and three scaffolds.
EVMbench (OpenAI, Paradigm, OtterSec) reported agents detecting up to 45.6% of vulnerabilities and achieving exploitation on 72.2% of a curated subset.
Reported metrics from the original EVMbench paper/benchmark (as summarized in this study).
Under NFD, agents are initialized with minimal scaffolding and grown through structured conversational interaction with domain practitioners, with the Knowledge Crystallization Cycle consolidating tacit dialogue into structured, reusable knowledge assets.
Architectural specification and operational formalism in the paper; supported by a detailed case study (iterative co-development with financial analysts, logged interaction transcripts and produced artifacts). Sample size for the case study is not specified.
Label changes across rounds concentrate on statements judged as ambiguous; ambiguity accounts for most of the observed changes.
Participants provided labeling rationale and self-reported uncertainty for each of the 30 statements per round; analyses showed higher change rates for statements with higher self-reported uncertainty/ambiguous wording.
The learned adaptive policy outperformed a fixed-wrench baseline by an average of 10.9% across five material setups.
Empirical evaluation: comparison between learned adaptive policy and a fixed-wrench policy on five different material setups; the paper reports an average improvement of ~10.9% (the exact performance metric formulation and per-setup statistics are not provided in the summary).
Integrating AI (notably ML and NLP) meaningfully automates routine software engineering tasks across requirements management, code generation, testing, and maintenance.
Systematic literature review of prior AI-for-SE work combined with an empirical survey of software engineering professionals reporting usage and examples of tool-supported automation; sample size for the survey not specified in the summary.
Coordination-Risk Cues—task-conditioned priors on disagreement/tie rates—capture coordination difficulty across tasks.
Method description: disagreement/tie rates computed per cluster from pairwise preference comparisons to form priors indicating coordination risk. Data source: Chatbot Arena pairwise comparisons; tie/disagreement rate computation described but numeric values not provided here.
Capability Profiles—task-conditioned win-rate maps—can be computed per cluster to summarize agent strengths.
Method description: win-rate maps derived by computing agent win rates conditional on task clusters from the Chatbot Arena pairwise comparisons. Implementation reported in paper; no numeric summary of win-rate differences provided here.
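Both constructs above, per-cluster win-rate maps and tie rates, can be sketched from pairwise preference records. The record schema and toy data below are assumptions for illustration, not the paper's actual format or data.

```python
from collections import defaultdict

def profile_by_cluster(comparisons):
    # comparisons: (cluster, model_a, model_b, outcome) records, where outcome
    # is "a", "b", or "tie". Returns per-cluster win-rate maps (capability
    # profiles) and tie rates (one ingredient of coordination-risk cues).
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))
    ties, totals = defaultdict(int), defaultdict(int)
    for cluster, a, b, outcome in comparisons:
        totals[cluster] += 1
        games[cluster][a] += 1
        games[cluster][b] += 1
        if outcome == "a":
            wins[cluster][a] += 1
        elif outcome == "b":
            wins[cluster][b] += 1
        else:
            ties[cluster] += 1
    win_rate = {c: {m: wins[c][m] / games[c][m] for m in games[c]} for c in games}
    tie_rate = {c: ties[c] / totals[c] for c in totals}
    return win_rate, tie_rate

data = [("math", "m1", "m2", "a"), ("math", "m1", "m2", "a"),
        ("math", "m1", "m2", "tie"), ("poetry", "m1", "m2", "tie"),
        ("poetry", "m1", "m2", "b")]
wr, tr = profile_by_cluster(data)
print(round(wr["math"]["m1"], 2), tr["poetry"])  # 0.67 0.5
```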
Semantic clustering on Chatbot Arena pairwise comparisons induces an interpretable task taxonomy (taxonomy induction).
Methodological claim: authors applied semantic clustering to tasks/queries from Chatbot Arena pairwise preference data to produce clusters described as interpretable. Data source: Chatbot Arena pairwise comparisons; specific clustering algorithm and hyperparameters not specified here.
A speculative WikiRAT instantiation on Wikipedia illustrates RATs' design and potential uses.
The paper presents WikiRAT as a speculative prototype/illustration; no large-scale deployment or user study of WikiRAT is reported.
RATs record sequences of interaction: traversal (what is read and in what order), association (links and connections the reader forms), and reflection (annotations, notes, time spent), producing inspectable, shareable trajectories.
Design specification within the paper and description of data types RATs would collect (ordered page/navigation logs, hyperlinks followed, time-on-page, annotations, saved excerpts, tags, notes). This is a definitional claim about the proposed system rather than empirical measurement.
Weighted-FSD provides a tunable knob to encode risk aversion/preferences by selecting quantile-weighting functions.
Theoretical correspondence between quantile weights and risk measures (SRMs) described in the paper; conceptual demonstration that different weightings produce different risk profiles.
Introducing quantile-weighted FSD (weighted-FSD) provably controls broad classes of Spectral Risk Measures (SRMs): improving weighted-FSD implies guaranteed improvements in the associated SRM.
Formal theoretical result/proof presented in the paper linking weighted quantile dominance to monotonic improvement in corresponding SRMs.
RAD operationalizes FSD by comparing the learned policy’s empirical rollout cost distribution to a reference policy’s distribution using Optimal Transport (OT) with entropic regularization and Sinkhorn iterations.
Methodological description in the paper: entropically regularized OT objective and Sinkhorn iterations used to compare empirical distributions and produce a differentiable loss.
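The entropically regularized OT comparison described above can be sketched with plain Sinkhorn iterations over two cost histograms on a shared support. The regularization strength, grid, and histograms below are illustrative choices, not the paper's RAD configuration.

```python
import numpy as np

def sinkhorn_cost(a, b, cost, eps=0.1, iters=200):
    # Entropically regularized OT between histograms a and b: Sinkhorn
    # iterations on the Gibbs kernel K = exp(-cost / eps), then the transport
    # cost <P, cost> of the resulting plan P.
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return float(np.sum(P * cost))

# Two empirical rollout-cost distributions binned onto a shared support grid.
support = np.linspace(0.0, 1.0, 5)
cost = np.abs(support[:, None] - support[None, :])  # ground cost |x - y|
a = np.array([0.4, 0.3, 0.2, 0.1, 0.0])  # reference policy: mostly low costs
b = np.array([0.0, 0.1, 0.2, 0.3, 0.4])  # learned policy: mass shifted higher
print(sinkhorn_cost(a, b, cost) > sinkhorn_cost(a, a, cost))  # True
```

The entropic term makes the objective smooth, which is what allows it to serve as a differentiable training loss.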
First-Order Stochastic Dominance (FSD) constraints compare whole cost distributions and directly constrain tails, offering stronger guarantees against high-cost (unsafe) outcomes than expected-cost constraints.
Theoretical property of FSD described in the paper; formal argument that FSD constrains the full distribution (CDF) rather than only its mean.
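An empirical FSD check, together with the quantile-weighted variant from the earlier claim, can be sketched by comparing empirical quantile functions. This is a simplified illustration under invented data, not the paper's formal construction.

```python
import numpy as np

def quantiles(samples, n=100):
    # Empirical quantile function evaluated on a uniform grid (inverse CDF).
    return np.quantile(np.asarray(samples), np.linspace(0.0, 1.0, n))

def fsd_dominates(costs_a, costs_b, n=100):
    # First-order stochastic dominance in cost: A dominates B when A's cost
    # quantile function lies at or below B's at every level, i.e. the whole
    # distribution, including the tail, is constrained, not just the mean.
    return bool(np.all(quantiles(costs_a, n) <= quantiles(costs_b, n)))

def weighted_fsd_gap(costs_a, costs_b, weights):
    # Quantile-weighted dominance margin: a weighting that emphasizes upper
    # quantiles encodes risk aversion (heavier penalty on tail costs);
    # positive values favor A under the chosen weighting.
    n = len(weights)
    return float(np.dot(weights, quantiles(costs_b, n) - quantiles(costs_a, n)))

rng = np.random.default_rng(0)
safe = rng.normal(1.0, 0.2, 5000)  # concentrated, lower rollout costs
risky = safe + 0.5                 # uniformly higher costs, by construction
print(fsd_dominates(safe, risky))  # True
```

Swapping the weight vector for one concentrated on the top quantiles recovers the tail-sensitive behavior the weighted-FSD claim describes.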
Policy recommendations include subsidizing complementary investments (data governance, training) rather than technology-only incentives; encouraging standards and interoperability; and funding evaluation studies to measure distributional effects and long-run productivity impacts.
Authors' policy section proposing these interventions based on case findings and broader policy implications.
The authors propose a conceptual optimization framework emphasizing three pillars: digital integration (tech stack & data), collaboration (processes & governance), and continuous improvement (metrics, feedback loops).
Paper presents a conceptual framework derived from cross-case findings; theoretical/conceptual contribution rather than empirical estimation.
Explanations must be tailored to stakeholders (clinicians, regulators, customers) and integrated into decision processes to be useful (human-centered design principle).
Thematic coding of design and HCI literature within the review; draws on empirical studies and design guidance recommending stakeholder-specific explanation formats and integration into decision workflows.
The forecasting model was deployed with a human-in-the-loop mechanism that triggers on critical forecast deviations.
Pilot description in the paper documenting human-in-the-loop escalation rules for critical forecast deviations during pilot deployment (single-case deployment evidence).
The framework explicitly targets SME-specific risks (data scarcity, limited skills/budgets, and change resistance) and proposes mitigations such as staged pilots, human-in-the-loop designs, and clear governance.
Design rationale and operational recommendations within the paper addressing SME constraints (conceptual; no large-N testing).
An MLOps layer is included to provide continuous integration/deployment, monitoring, retraining, and governance for sustainable model maintenance.
Framework/component specification in the paper describing an MLOps layer and its responsibilities (conceptual design).
The approach operationalizes AI adoption into seven sequential stages, each with specified deliverables, assigned roles, and gate/exit criteria.
Framework description in the paper enumerating seven sequential stages and documenting deliverables, role allocation, and gate criteria (conceptual / design artifact).
The paper proposes a practice-oriented, end-to-end algorithm for integrating AI into SME managerial decision loops grounded in CRISP-DM and extended with AI Canvas, an organizational digital-readiness assessment, and an MLOps layer.
Conceptual/framework development presented in the paper; synthesis of CRISP-DM, AI Canvas, a digital-readiness assessment, and an MLOps layer (no empirical sample required).
Standards and governance frameworks (for model auditability, security, and alignment) will become economic infrastructure influencing adoption costs and market trust.
Conceptual argument linking governance to adoption and trust, drawing on normative risk analysis; no empirical governance impact studies included.
Increasing AI autonomy magnifies ethical, safety, and value‑alignment concerns; robust human oversight and institutional governance are required.
Normative and risk analysis based on projected increases in system autonomy and illustrative failure modes; no formal safety audits included.
Actionable takeaway: organizations should measure inter-model similarity and response diversity as part of ROI and procurement analyses and factor in governance and role-redesign costs when estimating net returns to LLM deployment.
Explicit recommendation in the paper grounded in empirical analyses of output similarity and diversity metrics; presented as operational guidance rather than tested via field ROI studies.
The paper provides practical diagnostic tools and metrics (e.g., inter-model similarity, response entropy) for detecting and tracking AI homogenization in workflows.
Methodological section describing diagnostic framework and example metrics used in the empirical analyses (semantic similarity measures, entropy, distinct-n), intended for operational use.
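Two of the example metrics named above, distinct-n and response entropy, admit short reference implementations. The sketch below uses whitespace tokenization and invented toy responses; it is not the paper's implementation.

```python
import math
from collections import Counter

def distinct_n(texts, n=2):
    # Fraction of n-grams that are unique across a set of model responses;
    # lower values suggest more homogenized output.
    grams = []
    for text in texts:
        toks = text.lower().split()
        grams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

def token_entropy(texts):
    # Shannon entropy (bits) of the pooled unigram distribution; lower
    # entropy indicates responses drawn from a narrower vocabulary.
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

responses = ["the model suggests a phased rollout",
             "the model suggests a phased rollout",
             "consider piloting before full deployment"]
print(round(distinct_n(responses), 3), round(token_entropy(responses), 3))
```

Tracking these scores over time on a fixed query set is one way to operationalize the monitoring the claim recommends.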
Organizational responses to homogenization include leadership communication strategies, work redesign (contrarian roles, ensemble workflows, mandated diversity checks), and governance frameworks (auditing, procurement policies avoiding monoculture).
Prescriptive recommendations in the paper synthesizing empirical results with organizational-design principles; proposed interventions are not evaluated empirically in the paper but are presented as actionable responses.
The analysis dataset comprises approximately 26,000 real-world user queries paired with outputs from over 70 distinct language models spanning different providers, architectures, and scales.
Explicit data description in the paper: ≈26,000 queries and outputs from 70+ models (paper lists model sets and sampling procedures in methods section).
The paper proposes a research agenda prioritizing interoperable, ethical‑by‑design platforms; metrics to measure social equity impacts; and adaptation of global standards to local institutional capacities.
Explicit list of three prioritized research directions provided in the paper, derived from the systematic synthesis of the 103 items.
High‑income examples (e.g., Estonia, Singapore) demonstrate mature integration of digital/AI systems in e‑government, urban mobility, and e‑health.
Empirical case examples drawn from the reviewed literature and institutional reports cited in the review; specific country examples (Estonia, Singapore) repeatedly referenced as mature adopters.
Recommended research priorities for economists include measuring how adoption changes task mixes and wages, quantifying verification/remediation costs, estimating productivity gains net of security/IP costs, and studying market dynamics from centralized model providers.
Author recommendations based on identified gaps in the empirical literature synthesized by the paper.
Cognitive interlocks include concrete mechanisms such as policy-enforced gates, automated verification thresholds, role-based checks, and mandatory rebuttal workflows to force verification before outputs are trusted or deployed.
Design details and enumerated mechanisms within the Overton Framework as presented in the paper; no implementation case studies reported.
The Overton Framework is an architectural remedy that embeds 'cognitive interlocks' into development environments to enforce verification boundaries and restore system integrity.
Prescriptive architectural proposal described in the paper (design specification and principles); presented conceptually without empirical validation.
High‑frequency sensor and satellite data, processed with AI, improve precision in measuring yields, input use, and environmental externalities, enhancing the quality of economic impact evaluations and policy targeting.
Methodological and validation studies using high‑resolution satellite imagery and field sensors that show improved measurement accuracy versus traditional survey methods; referenced empirical demonstrations in the literature.
The paper proposes specific metrics and empirical follow-ups (e.g., generation-to-verification throughput ratios, defect accumulation rates, time-to-acceptance for machine-generated artifacts, incident rates attributable to unverified AI outputs) to validate the model.
Explicit recommendations and measurement proposals listed in the paper; no empirical implementation provided.