Evidence (3062 claims)
- Adoption: 5227 claims
- Productivity: 4503 claims
- Governance: 4100 claims
- Human-AI Collaboration: 3062 claims
- Labor Markets: 2480 claims
- Innovation: 2320 claims
- Org Design: 2305 claims
- Skills & Training: 1920 claims
- Inequality: 1311 claims
Evidence Matrix
Claim counts by outcome category and direction of finding. Row totals can exceed the sum of the four listed directions, as some claims carry an unclassified direction.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 373 | 105 | 59 | 439 | 984 |
| Governance & Regulation | 366 | 172 | 115 | 55 | 718 |
| Research Productivity | 237 | 95 | 34 | 294 | 664 |
| Organizational Efficiency | 364 | 82 | 62 | 34 | 545 |
| Technology Adoption Rate | 293 | 118 | 66 | 30 | 511 |
| Firm Productivity | 274 | 33 | 68 | 10 | 390 |
| AI Safety & Ethics | 117 | 178 | 44 | 24 | 365 |
| Output Quality | 231 | 61 | 23 | 25 | 340 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 158 | 68 | 33 | 17 | 279 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 88 | 31 | 38 | 9 | 166 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 105 | 12 | 21 | 11 | 150 |
| Consumer Welfare | 68 | 29 | 35 | 7 | 139 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 71 | 10 | 29 | 6 | 116 |
| Worker Satisfaction | 46 | 38 | 12 | 9 | 105 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 11 | 16 | 94 |
| Task Completion Time | 76 | 5 | 4 | 2 | 87 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 16 | 9 | 5 | 48 |
| Job Displacement | 5 | 29 | 12 | — | 46 |
| Social Protection | 19 | 8 | 6 | 1 | 34 |
| Developer Productivity | 27 | 2 | 3 | 1 | 33 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 8 | 4 | 9 | — | 21 |
Filter: Human-AI Collaboration
The paper proposes a 'manufacturing operation tree'—an organizationally structured framework—to guide development of more realistic, validated, and industry‑relevant simulation models.
Conceptual/modeling output in the paper (diagram and explanation of the manufacturing operation tree); theoretical development rather than empirical testing.
Econometric and causal-inference tools (difference-in-differences, instrumental variables, randomized encouragement designs) are needed to estimate long-term effects of personalized robot interventions.
Recommended methodological agenda for AI economists in the paper; no applied causal studies presented.
Research and deployment will require new datasets: longitudinal multimodal interaction logs, user preference surveys, simulated user populations, and ethically annotated datasets for fairness and safety evaluation.
Data & Methods recommendations based on identified empirical needs; no dataset release or analysis in this paper.
Measuring welfare impact of personalized robots requires going beyond engagement to include non-market outcomes such as well-being, autonomy, and mental health.
Methodological recommendation in the implications and evaluation sections; no empirical measures provided.
A/B testing and longitudinal field studies are necessary for real-world validation of robot personalization, and metrics should include welfare-oriented outcomes (well-being, trust) in addition to engagement.
Recommended evaluation strategy drawing from HRI and RS experimental standards; no field trials reported in this work.
Prior to live trials, offline RS evaluation metrics (precision/recall, NDCG), counterfactual/off-policy estimators, and simulated users should be used to validate personalization policies.
Methodological recommendation based on RS evaluation practices; no empirical comparison with live trials in robots presented.
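The offline ranking metrics named in the claim above can be illustrated with a minimal NDCG computation. This is a generic sketch, not code from the paper; the graded relevance labels are invented for illustration.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: rel_i / log2(i + 2) for 0-indexed rank i.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    # Normalized DCG: DCG of the system ranking over DCG of the ideal ranking.
    ranked = ranked_relevances[:k]
    ideal = sorted(ranked_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of items in the order a personalization policy ranked them:
print(round(ndcg([3, 2, 3, 0, 1, 2]), 3))  # 0.961
```

A perfect ranking scores exactly 1.0, so the metric is comparable across sessions of different lengths.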
Contextual bandits and counterfactual/off-policy learning can enable safe exploration and off-policy evaluation when adapting robot interactions from logged data.
Methodological synthesis referencing contextual bandit and counterfactual learning techniques from RS and causal inference; no robotic implementation experiments reported.
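A standard estimator in the counterfactual/off-policy family referenced above is inverse propensity scoring (IPS). The sketch below is a textbook illustration, not the paper's method; the uniform logging policy and log schema are assumptions.

```python
def ips_estimate(logs, target_policy):
    # Inverse-propensity-scoring (IPS) estimate of a target policy's expected
    # reward, using interactions logged under a stochastic behavior policy.
    # logs: (context, action, reward, behavior_prob) tuples.
    total = 0.0
    for context, action, reward, behavior_prob in logs:
        weight = target_policy(context, action) / behavior_prob
        total += weight * reward
    return total / len(logs)

# Toy log: behavior policy chose actions uniformly over {0, 1} (prob 0.5);
# action 1 always yielded reward 1, action 0 reward 0.
logs = [(None, a, float(a == 1), 0.5) for a in [0, 1, 0, 1, 1, 0, 1, 0]]
always_one = lambda ctx, a: 1.0 if a == 1 else 0.0
print(ips_estimate(logs, always_one))  # 1.0
```

Because the estimator reweights logged rewards rather than requiring new interactions, it supports exactly the safe, offline evaluation of robot-interaction policies the claim describes.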
Sequence-aware recommenders (RNNs, Transformers, Markov/session-based models) are suitable for modeling session dynamics and short-term preference shifts in robot interactions.
Survey of sequence/temporal RS models and their typical use cases; conceptual recommendation only.
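The simplest member of the session-model family surveyed above is a first-order Markov transition model. The sketch below is illustrative; the interaction events (`greet`, `quiz`, `story`) are hypothetical robot-session states, not from the surveyed papers.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    # First-order Markov model over interaction events: P(next | current),
    # estimated from observed session sequences.
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    return {s: {t: c / sum(nxts.values()) for t, c in nxts.items()}
            for s, nxts in counts.items()}

def recommend_next(model, current, k=1):
    # Rank candidate next interactions given the current session state.
    ranked = sorted(model.get(current, {}).items(), key=lambda kv: -kv[1])
    return [event for event, _ in ranked[:k]]

sessions = [["greet", "quiz", "feedback"], ["greet", "quiz", "quiz"],
            ["greet", "story", "feedback"]]
model = fit_transitions(sessions)
print(recommend_next(model, "greet"))  # ['quiz']
```

RNN, Transformer, and session-based variants replace the transition table with a learned sequence encoder but serve the same role: predicting short-term preference shifts within a session.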
RS tooling covers long-term user profiles, short-term/session signals, context-awareness, multi-objective ranking, and evaluation methods suited for personalization at scale.
Review of recommender-systems methods and tooling in the literature; conceptual synthesis without empirical new data.
Recommender systems are specialized in representing, predicting, and ranking user preferences across time and contexts (e.g., collaborative filtering, content-based models, sequential/session models).
Established RS literature surveyed and cited as the basis for the claim; conceptual argument, no new experiments.
Creators explicitly name advertising, direct sales, affiliate marketing, and revenue-sharing models as common monetization channels for GenAI-enabled content.
Explicit references to these monetization channels appeared repeatedly across the 377 videos and were extracted during thematic coding.
Practitioners adopt methodological adaptations — including adaptive/longitudinal designs, versioning/documentation, stratification/moderation analyses, robustness checks, mixed methods, deployment-stage monitoring, and pre-analysis plans — to mitigate validity threats.
Reported mitigation strategies aggregated from the 16 semi-structured interviews and described in the paper's 'Practitioner solutions' section.
Agents detected up to 65% of vulnerabilities in some experimental settings.
Reported detection rate maxima from the study's experiments on certain model/scaffold/task combinations.
The authors constructed a contamination-free dataset of 22 real-world smart-contract security incidents that postdate every evaluated model's release.
Curation procedure described in the methods: 22 incidents selected to occur after all model release dates to prevent leakage.
This study expanded the evaluation matrix to 26 agent configurations spanning four model families and three scaffolding approaches.
Methods reported in this study specifying 26 agent configurations, four model families, and three scaffolds.
EVMbench (OpenAI, Paradigm, OtterSec) reported agents detecting up to 45.6% of vulnerabilities and achieving exploitation on 72.2% of a curated subset.
Reported metrics from the original EVMbench paper/benchmark (as summarized in this study).
Under NFD, agents are initialized with minimal scaffolding and grown through structured conversational interaction with domain practitioners, with the Knowledge Crystallization Cycle consolidating tacit dialogue into structured, reusable knowledge assets.
Architectural specification and operational formalism in the paper; supported by a detailed case study (iterative co-development with financial analysts, logged interaction transcripts and produced artifacts). Sample size for the case study is not specified.
Label changes across rounds concentrate on statements judged as ambiguous; ambiguity accounts for most of the observed changes.
Participants provided labeling rationale and self-reported uncertainty for each of the 30 statements per round; analyses showed higher change rates for statements with higher self-reported uncertainty/ambiguous wording.
The learned adaptive policy outperformed a fixed-wrench baseline by an average of 10.9% across five material setups.
Empirical evaluation: comparison between learned adaptive policy and a fixed-wrench policy on five different material setups; the paper reports an average improvement of ~10.9% (the exact performance metric formulation and per-setup statistics are not provided in the summary).
Integrating AI (notably ML and NLP) meaningfully automates routine software engineering tasks across requirements management, code generation, testing, and maintenance.
Systematic literature review of prior AI-for-SE work combined with an empirical survey of software engineering professionals reporting usage and examples of tool-supported automation; sample size for the survey not specified in the summary.
Coordination-Risk Cues—task-conditioned priors on disagreement/tie rates—capture coordination difficulty across tasks.
Method description: disagreement/tie rates computed per cluster from pairwise preference comparisons to form priors indicating coordination risk. Data source: Chatbot Arena pairwise comparisons; tie/disagreement rate computation described but numeric values not provided here.
Capability Profiles—task-conditioned win-rate maps—can be computed per cluster to summarize agent strengths.
Method description: win-rate maps derived by computing agent win rates conditional on task clusters from the Chatbot Arena pairwise comparisons. Implementation reported in paper; no numeric summary of win-rate differences provided here.
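Both constructs above, per-cluster win-rate maps and tie rates, can be sketched from pairwise preference records. The record schema and toy data below are assumptions for illustration, not the paper's actual format or data.

```python
from collections import defaultdict

def profile_by_cluster(comparisons):
    # comparisons: (cluster, model_a, model_b, outcome) records, where outcome
    # is "a", "b", or "tie". Returns per-cluster win-rate maps (capability
    # profiles) and tie rates (one ingredient of coordination-risk cues).
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))
    ties, totals = defaultdict(int), defaultdict(int)
    for cluster, a, b, outcome in comparisons:
        totals[cluster] += 1
        games[cluster][a] += 1
        games[cluster][b] += 1
        if outcome == "a":
            wins[cluster][a] += 1
        elif outcome == "b":
            wins[cluster][b] += 1
        else:
            ties[cluster] += 1
    win_rate = {c: {m: wins[c][m] / games[c][m] for m in games[c]} for c in games}
    tie_rate = {c: ties[c] / totals[c] for c in totals}
    return win_rate, tie_rate

data = [("math", "m1", "m2", "a"), ("math", "m1", "m2", "a"),
        ("math", "m1", "m2", "tie"), ("poetry", "m1", "m2", "tie"),
        ("poetry", "m1", "m2", "b")]
wr, tr = profile_by_cluster(data)
print(round(wr["math"]["m1"], 2), tr["poetry"])  # 0.67 0.5
```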
Semantic clustering on Chatbot Arena pairwise comparisons induces an interpretable task taxonomy (taxonomy induction).
Methodological claim: authors applied semantic clustering to tasks/queries from Chatbot Arena pairwise preference data to produce clusters described as interpretable. Data source: Chatbot Arena pairwise comparisons; specific clustering algorithm and hyperparameters not specified here.
A speculative WikiRAT instantiation on Wikipedia illustrates RATs' design and potential uses.
The paper presents WikiRAT as a speculative prototype/illustration; no large-scale deployment or user study of WikiRAT is reported.
RATs record sequences of interaction: traversal (what is read and in what order), association (links and connections the reader forms), and reflection (annotations, notes, time spent), producing inspectable, shareable trajectories.
Design specification within the paper and description of data types RATs would collect (ordered page/navigation logs, hyperlinks followed, time-on-page, annotations, saved excerpts, tags, notes). This is a definitional claim about the proposed system rather than empirical measurement.
Weighted-FSD provides a tunable knob to encode risk aversion/preferences by selecting quantile-weighting functions.
Theoretical correspondence between quantile weights and risk measures (SRMs) described in the paper; conceptual demonstration that different weightings produce different risk profiles.
Introducing quantile-weighted FSD (weighted-FSD) provably controls broad classes of Spectral Risk Measures (SRMs): improving weighted-FSD implies guaranteed improvements in the associated SRM.
Formal theoretical result/proof presented in the paper linking weighted quantile dominance to monotonic improvement in corresponding SRMs.
RAD operationalizes FSD by comparing the learned policy’s empirical rollout cost distribution to a reference policy’s distribution using Optimal Transport (OT) with entropic regularization and Sinkhorn iterations.
Methodological description in the paper: entropically regularized OT objective and Sinkhorn iterations used to compare empirical distributions and produce a differentiable loss.
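The entropically regularized OT comparison described above can be sketched with plain Sinkhorn iterations over two cost histograms on a shared support. The regularization strength, grid, and histograms below are illustrative choices, not the paper's RAD configuration.

```python
import numpy as np

def sinkhorn_cost(a, b, cost, eps=0.1, iters=200):
    # Entropically regularized OT between histograms a and b: Sinkhorn
    # iterations on the Gibbs kernel K = exp(-cost / eps), then the transport
    # cost <P, cost> of the resulting plan P.
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return float(np.sum(P * cost))

# Two empirical rollout-cost distributions binned onto a shared support grid.
support = np.linspace(0.0, 1.0, 5)
cost = np.abs(support[:, None] - support[None, :])  # ground cost |x - y|
a = np.array([0.4, 0.3, 0.2, 0.1, 0.0])  # reference policy: mostly low costs
b = np.array([0.0, 0.1, 0.2, 0.3, 0.4])  # learned policy: mass shifted higher
print(sinkhorn_cost(a, b, cost) > sinkhorn_cost(a, a, cost))  # True
```

The entropic term makes the objective smooth, which is what allows it to serve as a differentiable training loss.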
First-Order Stochastic Dominance (FSD) constraints compare whole cost distributions and directly constrain tails, offering stronger guarantees against high-cost (unsafe) outcomes than expected-cost constraints.
Theoretical property of FSD described in the paper; formal argument that FSD constrains the full distribution (CDF) rather than only its mean.
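An empirical FSD check, together with the quantile-weighted variant from the earlier claim, can be sketched by comparing empirical quantile functions. This is a simplified illustration under invented data, not the paper's formal construction.

```python
import numpy as np

def quantiles(samples, n=100):
    # Empirical quantile function evaluated on a uniform grid (inverse CDF).
    return np.quantile(np.asarray(samples), np.linspace(0.0, 1.0, n))

def fsd_dominates(costs_a, costs_b, n=100):
    # First-order stochastic dominance in cost: A dominates B when A's cost
    # quantile function lies at or below B's at every level, i.e. the whole
    # distribution, including the tail, is constrained, not just the mean.
    return bool(np.all(quantiles(costs_a, n) <= quantiles(costs_b, n)))

def weighted_fsd_gap(costs_a, costs_b, weights):
    # Quantile-weighted dominance margin: a weighting that emphasizes upper
    # quantiles encodes risk aversion (heavier penalty on tail costs);
    # positive values favor A under the chosen weighting.
    n = len(weights)
    return float(np.dot(weights, quantiles(costs_b, n) - quantiles(costs_a, n)))

rng = np.random.default_rng(0)
safe = rng.normal(1.0, 0.2, 5000)  # concentrated, lower rollout costs
risky = safe + 0.5                 # uniformly higher costs, by construction
print(fsd_dominates(safe, risky))  # True
```

Swapping the weight vector for one concentrated on the top quantiles recovers the tail-sensitive behavior the weighted-FSD claim describes.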
Policy recommendations include subsidizing complementary investments (data governance, training) rather than technology-only incentives; encouraging standards and interoperability; and funding evaluation studies to measure distributional effects and long-run productivity impacts.
Authors' policy section proposing these interventions based on case findings and broader policy implications.
The authors propose a conceptual optimization framework emphasizing three pillars: digital integration (tech stack & data), collaboration (processes & governance), and continuous improvement (metrics, feedback loops).
Paper presents a conceptual framework derived from cross-case findings; theoretical/conceptual contribution rather than empirical estimation.
Explanations must be tailored to stakeholders (clinicians, regulators, customers) and integrated into decision processes to be useful (human-centered design principle).
Thematic coding of design and HCI literature within the review; draws on empirical studies and design guidance recommending stakeholder-specific explanation formats and integration into decision workflows.
The forecasting model was deployed with a human-in-the-loop mechanism that triggers on critical forecast deviations.
Pilot description in the paper documenting human-in-the-loop escalation rules for critical forecast deviations during pilot deployment (single-case deployment evidence).
The framework explicitly targets SME-specific risks (data scarcity, limited skills/budgets, and change resistance) and proposes mitigations such as staged pilots, human-in-the-loop designs, and clear governance.
Design rationale and operational recommendations within the paper addressing SME constraints (conceptual; no large-N testing).
An MLOps layer is included to provide continuous integration/deployment, monitoring, retraining, and governance for sustainable model maintenance.
Framework/component specification in the paper describing an MLOps layer and its responsibilities (conceptual design).
The approach operationalizes AI adoption into seven sequential stages, each with specified deliverables, assigned roles, and gate/exit criteria.
Framework description in the paper enumerating seven sequential stages and documenting deliverables, role allocation, and gate criteria (conceptual / design artifact).
The paper proposes a practice-oriented, end-to-end algorithm for integrating AI into SME managerial decision loops grounded in CRISP-DM and extended with AI Canvas, an organizational digital-readiness assessment, and an MLOps layer.
Conceptual/framework development presented in the paper; synthesis of CRISP-DM, AI Canvas, a digital-readiness assessment, and an MLOps layer (no empirical sample required).
Standards and governance frameworks (for model auditability, security, and alignment) will become economic infrastructure influencing adoption costs and market trust.
Conceptual argument linking governance to adoption and trust, drawing on normative risk analysis; no empirical governance impact studies included.
Increasing AI autonomy magnifies ethical, safety, and value‑alignment concerns; robust human oversight and institutional governance are required.
Normative and risk analysis based on projected increases in system autonomy and illustrative failure modes; no formal safety audits included.
Actionable takeaway: organizations should measure inter-model similarity and response diversity as part of ROI and procurement analyses and factor in governance and role-redesign costs when estimating net returns to LLM deployment.
Explicit recommendation in the paper grounded in empirical analyses of output similarity and diversity metrics; presented as operational guidance rather than tested via field ROI studies.
The paper provides practical diagnostic tools and metrics (e.g., inter-model similarity, response entropy) for detecting and tracking AI homogenization in workflows.
Methodological section describing diagnostic framework and example metrics used in the empirical analyses (semantic similarity measures, entropy, distinct-n), intended for operational use.
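Two of the example metrics named above, distinct-n and response entropy, admit short reference implementations. The sketch below uses whitespace tokenization and invented toy responses; it is not the paper's implementation.

```python
import math
from collections import Counter

def distinct_n(texts, n=2):
    # Fraction of n-grams that are unique across a set of model responses;
    # lower values suggest more homogenized output.
    grams = []
    for text in texts:
        toks = text.lower().split()
        grams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

def token_entropy(texts):
    # Shannon entropy (bits) of the pooled unigram distribution; lower
    # entropy indicates responses drawn from a narrower vocabulary.
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

responses = ["the model suggests a phased rollout",
             "the model suggests a phased rollout",
             "consider piloting before full deployment"]
print(round(distinct_n(responses), 3), round(token_entropy(responses), 3))
```

Tracking these scores over time on a fixed query set is one way to operationalize the monitoring the claim recommends.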
Organizational responses to homogenization include leadership communication strategies, work redesign (contrarian roles, ensemble workflows, mandated diversity checks), and governance frameworks (auditing, procurement policies avoiding monoculture).
Prescriptive recommendations in the paper synthesizing empirical results with organizational-design principles; proposed interventions are not evaluated empirically in the paper but are presented as actionable responses.
The analysis dataset comprises approximately 26,000 real-world user queries paired with outputs from over 70 distinct language models spanning different providers, architectures, and scales.
Explicit data description in the paper: ≈26,000 queries and outputs from 70+ models (paper lists model sets and sampling procedures in methods section).
The paper proposes a research agenda prioritizing interoperable, ethical‑by‑design platforms; metrics to measure social equity impacts; and adaptation of global standards to local institutional capacities.
Explicit list of three prioritized research directions provided in the paper, derived from the systematic synthesis of the 103 items.
High‑income examples (e.g., Estonia, Singapore) demonstrate mature integration of digital/AI systems in e‑government, urban mobility, and e‑health.
Empirical case examples drawn from the reviewed literature and institutional reports cited in the review; specific country examples (Estonia, Singapore) repeatedly referenced as mature adopters.
Recommended research priorities for economists include measuring how adoption changes task mixes and wages, quantifying verification/remediation costs, estimating productivity gains net of security/IP costs, and studying market dynamics from centralized model providers.
Author recommendations based on identified gaps in the empirical literature synthesized by the paper.
Cognitive interlocks include concrete mechanisms such as policy-enforced gates, automated verification thresholds, role-based checks, and mandatory rebuttal workflows to force verification before outputs are trusted or deployed.
Design details and enumerated mechanisms within the Overton Framework as presented in the paper; no implementation case studies reported.
The Overton Framework is an architectural remedy that embeds 'cognitive interlocks' into development environments to enforce verification boundaries and restore system integrity.
Prescriptive architectural proposal described in the paper (design specification and principles); presented conceptually without empirical validation.
High‑frequency sensor and satellite data, processed with AI, improve precision in measuring yields, input use, and environmental externalities, enhancing the quality of economic impact evaluations and policy targeting.
Methodological and validation studies using high‑resolution satellite imagery and field sensors that show improved measurement accuracy versus traditional survey methods; referenced empirical demonstrations in the literature.
The paper proposes specific metrics and empirical follow-ups (e.g., generation-to-verification throughput ratios, defect accumulation rates, time-to-acceptance for machine-generated artifacts, incident rates attributable to unverified AI outputs) to validate the model.
Explicit recommendations and measurement proposals listed in the paper; no empirical implementation provided.