Evidence (13827 claims)
Adoption: 8454 claims
Productivity: 7544 claims
Governance: 6789 claims
Human-AI Collaboration: 6327 claims
Org Design: 4126 claims
Innovation: 4058 claims
Labor Markets: 3520 claims
Skills & Training: 2924 claims
Inequality: 2057 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review.
Empirical result reported from the paper's benchmark and qualitative review of agent outputs (specific metrics, number of agents/tasks, and quantitative scores not provided in the excerpt).
We develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards.
Methodological contribution stated in paper; described taxonomy elements (Accuracy, Formula, Format) as part of the evaluation design.
We provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis.
Claim of contribution in the paper; refers to the authors' own evaluation study (details like number of tasks/agents not provided in the excerpt).
Frontier AI labs have developed agents that can construct entire spreadsheets from scratch.
Asserted in paper as background/context; no specific models, numbers, or experimental details provided in the excerpt.
LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions.
Framing statement in paper; no empirical data or sample size reported to support the trend claim within the excerpt.
Adoption under higher communicative standards and institutional norms can mitigate suboptimal collective equilibria by imposing social commitments on individual users.
Theoretical argument and model-based analysis proposing communicative and institutional interventions as mitigating mechanisms (conceptual and formal reasoning).
Individually stable strategies can be scaled to collective equilibria using three extrapolation principles: (a) non-communicative aggregation, (b) local social signaling, and (c) institutional norms setting.
Theoretical extrapolation/principled modeling presented in the paper (conceptual and formal extension from individual to collective level).
Canonical decision-theoretic strategies that account for adaptive user trajectories can be mapped so that agents transition between strategies based on interaction feedback to reach stable equilibria.
Analytical results from the decision-theoretic modeling in the paper showing adaptive trajectories and stable equilibria (theoretical model derivation).
The paper develops a decision- and game-theoretic approach to the human-AI delegation-verification dilemma.
Methodological contribution: construction of decision- and game-theoretic models described in the paper (modeling/theoretical development).
Emerging models of human-AI interaction predominantly advance the complementarity thesis variously dubbed human-AI collaboration and human-AI hybrid intelligence.
Literature characterization / conceptual review reported in the paper (no empirical sample or quantitative analysis cited).
These effects are linked to improvements in green innovation quality.
Authors report that the observed negative associations between AIO and carbon emission intensity are connected to measures of green innovation quality (suggesting a mediating mechanism) in their empirical analyses.
A six-phase, stepwise implementation framework (ABC-XYZ segmentation, forecast model selection, safety stock calibration, replenishment policy assignment, simulation-based parameter tuning, KPI governance) enables enterprises to achieve 9–16% reductions in inventory costs within existing WMS and ERP architectures.
Practical implications presented in the paper proposing a six-phase implementation framework and asserting expected inventory cost reductions of 9–16% when deployed within existing WMS/ERP.
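The first phase of the framework, ABC-XYZ segmentation, can be sketched as follows. This is a minimal illustration, not the paper's procedure: the function name `abc_xyz`, the cumulative-value cut-offs (80% and 95%), and the coefficient-of-variation cut-offs (0.5 and 1.0) are common conventions assumed here, and the SKU data are made up.

```python
from statistics import mean, pstdev

def abc_xyz(skus, abc_cuts=(0.80, 0.95), xyz_cuts=(0.5, 1.0)):
    """Label each SKU with an ABC class (value) and an XYZ class (variability).

    skus maps SKU id -> (unit_cost, list of per-period demand).
    Cut-offs are assumed conventions, not values from the paper.
    Assumes total consumption value is positive.
    """
    # Consumption value per SKU drives the ABC class.
    value = {s: cost * sum(demand) for s, (cost, demand) in skus.items()}
    total = sum(value.values())
    labels, running = {}, 0.0
    for s in sorted(value, key=value.get, reverse=True):
        # Classify by the cumulative share reached before this SKU,
        # so the highest-value SKU is always class A.
        before = running / total
        abc = "A" if before < abc_cuts[0] else "B" if before < abc_cuts[1] else "C"
        running += value[s]
        # Coefficient of variation of demand drives the XYZ class.
        demand = skus[s][1]
        avg = mean(demand)
        cv = pstdev(demand) / avg if avg > 0 else float("inf")
        xyz = "X" if cv <= xyz_cuts[0] else "Y" if cv <= xyz_cuts[1] else "Z"
        labels[s] = abc + xyz
    return labels

# Hypothetical SKUs: (unit cost, demand per period).
example_skus = {
    "S1": (10.0, [100, 102, 98, 100]),  # high value, stable demand
    "S2": (5.0, [10, 40, 5, 25]),       # medium value, variable demand
    "S3": (1.0, [0, 0, 50, 0]),         # low value, sporadic demand
}
print(abc_xyz(example_skus))
```

The resulting two-letter segment is what later phases (forecast model selection, safety stock calibration, replenishment policy assignment) would key on.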
Learning-based control methods deliver up to 16% cost reductions under complex network conditions but require substantial data and governance infrastructure.
Findings from included studies (narrative and/or quantitative results) reporting maximum observed reductions 'up to 16%' and qualitative synthesis noting data/governance requirements.
The cost reduction from multi-echelon coordination increases significantly with network complexity and lead-time variability.
Pre-specified moderator analyses reported in the paper showing effect size growth with network complexity and lead-time variability.
Multi-echelon coordination yields a pooled mean cost reduction of 11.4% (95% CI: 6.9–15.9%).
Random-effects meta-analysis pooling percentage cost-reduction effect sizes (reported pooled mean and 95% CI).
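For readers unfamiliar with how a pooled mean and its confidence interval are obtained, here is a minimal sketch of random-effects pooling. It assumes the DerSimonian-Laird estimator, which the excerpt does not confirm was used; the function name `pool_random_effects` and the input numbers are illustrative and are not the studies behind the 11.4% figure.

```python
import math

def pool_random_effects(effects, variances, z=1.96):
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    effects: per-study effect sizes (here, % cost reduction).
    variances: per-study sampling variances of those effects.
    Returns (pooled mean, CI lower, CI upper, tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    # Cochran's Q: dispersion of studies around the fixed-effect mean.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled - z * se, pooled + z * se, tau2

# Illustrative inputs only.
effects = [6.0, 12.0, 15.0, 9.0]
variances = [4.0, 9.0, 6.25, 2.25]
pooled, lo, hi, tau2 = pool_random_effects(effects, variances)
print(f"pooled = {pooled:.1f}% (95% CI {lo:.1f} to {hi:.1f}), tau^2 = {tau2:.2f}")
```

Compared with a fixed-effect pool, the added between-study variance `tau2` widens the interval when studies disagree, which is the usual reason to prefer random effects across heterogeneous supply networks.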
The advantage of distributional safety stock methods is largest for high-variability SKU segments.
Pre-specified subgroup and moderator analyses reported in the paper indicating greater pooled effects in high-variability SKU segments.
Distributional safety stock methods outperform classical normal approximations by a pooled mean of 9.3% (95% CI: 5.8–12.7%) at equivalent service levels.
Random-effects meta-analysis pooling percentage cost-reduction effect sizes (reported pooled mean and 95% CI).
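The contrast between a classical normal approximation and a distributional approach can be illustrated with a small sketch. The excerpt does not say which distributional methods the pooled studies used; an empirical nearest-rank quantile stands in here as one simple instance, and the function names and demand sample are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_safety_stock(demand, service_level):
    """Classical approach: z-score times the sample standard deviation."""
    z = NormalDist().inv_cdf(service_level)
    return z * stdev(demand)

def empirical_safety_stock(demand, service_level):
    """Distribution-based stand-in: nearest-rank quantile minus the mean."""
    ordered = sorted(demand)
    # Smallest observation that covers service_level of the sample.
    rank = math.ceil(service_level * len(ordered))
    quantile = ordered[max(rank - 1, 0)]
    return max(0, quantile - mean(ordered))

# Hypothetical lead-time demand with one large spike (skewed, high variability).
sample = [8, 9, 10, 10, 11, 12, 10, 9, 11, 40]
print(normal_safety_stock(sample, 0.95))
print(empirical_safety_stock(sample, 0.95))
```

On skewed samples like this one the two methods give visibly different buffers, which is consistent with the reported advantage being largest for high-variability SKU segments; on near-normal demand they tend to agree.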
Policy recommendation: The tax reduction rates applied to firms operating in the field of artificial intelligence technologies could be increased.
Policy implication drawn from the research findings (the positive relationship between R&D tax incentives and the number of AI patents); a recommendation, not a direct empirical test.
Policy recommendation: The state could provide performance-based and project-based support to increase the efficiency of R&D expenditure.
Applied policy measure proposed by the authors on the basis of the study's findings; a recommendation that has not been empirically tested.
Policy recommendation: States that value technological progress and innovation should encourage private-sector R&D investment through instruments such as subsidies and low-interest loans.
Policy recommendation based on the study's regression findings; not a direct empirical test but an implementation recommendation derived from the study's results.
The findings above indicate that private-sector R&D expenditure and R&D tax incentives are used efficiently.
Interpretive inference drawn from the study's regression results on the positive relationships (G8 + Türkiye, 2010-2020, random-effects regression).
As tax incentives applied to R&D increase, the number of artificial intelligence patents increases (positive relationship).
Same panel dataset and random-effects regression (G8 + Türkiye, 2010-2020); the coefficient of the tax incentives variable on the number of AI patents was found to be positive.
There is a positive relationship between private-sector R&D expenditure and the number of artificial intelligence (AI) patents.
Panel data analysis: G8 countries + Türkiye, years 2010-2020; random-effects regression model; country-year level data (9 countries × 11 years = 99 observations). Private-sector R&D expenditure is reported to show a statistically positive relationship with the number of AI patents.
Given the mixed outcomes (some improvements, some new lint/security issues), stronger tool-in-the-loop quality and security gating is motivated for AI-driven development workflows.
Interpretation/recommendation based on observed mix of improvements and introduced issues from the empirical results (PyQu, Pylint, Bandit analyses) and high merge rates.
73.5% of the analyzed PRs are merged (developer acceptance is high).
Empirical measurement of PR outcomes (merged vs. not merged) in the AIDev dataset of Python refactoring PRs.
Usability is the quality attribute that improves most frequently, improving in 36.5% of the studied changes.
PyQu-based before-and-after analysis of quality attributes on Python refactoring PRs from the AIDev dataset; reported frequency for the 'usability' attribute.
Agentic commits improve a quality attribute in 22.5% of the studied changes.
Empirical analysis of Python refactoring pull requests from the AIDev dataset using PyQu (an ML-based Python quality assessment tool) to compare quality attributes before and after each change.
The proposed taxonomy advances understanding and provides a structured framework for studying emerging human–algorithmic supervisory arrangements in organizations.
Authors' asserted contribution based on literature synthesis and their taxonomy derived from analysis of 14 real-world settings; intended to guide future research.
We demonstrate the taxonomy’s applicability through three ACoS examples.
Authors state they applied the taxonomy to three examples (case applications) to show applicability; abstract reports N=3 examples.
We identify two meta-dimensions, control collaboration and control enactment, and six dimensions that enable researchers to categorize and compare ACoS across organizations.
Taxonomy derived from the authors' analysis (14 real-world settings) and literature synthesis; specific dimensions enumerated in paper (as summarized in abstract).
Building on prior literature and an analysis of 14 real-world ACoS settings, we propose a taxonomy that conceptualizes the phenomenon.
Method stated in abstract: literature review plus qualitative/empirical analysis of 14 real-world ACoS settings; taxonomy presented as an output.
Organizations increasingly weave algorithmic systems into control processes.
Statement supported by prior literature review and the paper's motivating statements (no specific empirical trend data reported in abstract).
AI is a knowledge-intensive field that is particularly shaped by the flow of knowledge from scientific research to technological development.
Framing/background claim in the introduction describing the nature of AI and its dependence on science-to-technology knowledge flow.
The analysis covers AI-related patents filed from 2002 to 2021.
Paper states the temporal scope of the patent dataset analyzed (2002–2021).
Abstracts from patents and their cited scientific publications were extracted and BERTopic modelling was applied; topic labels were generated using generative AI.
Method description: data extraction of patent abstracts and cited scientific publication abstracts, application of BERTopic for topic modeling, and use of generative AI to create topic labels.
AI patents are classified into four categories using centrality measures derived from a CPC co-occurrence network.
Method section describing construction of a CPC (Cooperative Patent Classification) co-occurrence network and use of centrality measures to partition patents into four categories.
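A CPC co-occurrence network and one centrality measure can be sketched as below. This is an assumption-laden illustration: the excerpt does not say which centrality measures were used or how they map to the four categories, so only the network construction and degree centrality are shown. The function names and the patent list are hypothetical, and the CPC codes serve only as example inputs.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_network(patents):
    """Link two CPC codes whenever they appear on the same patent.

    patents: iterable of CPC code lists, one list per patent.
    Patents with a single code add no edges (and no nodes).
    """
    neighbors = defaultdict(set)
    for codes in patents:
        for a, b in combinations(sorted(set(codes)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

def degree_centrality(neighbors):
    """Share of the other codes that each code co-occurs with."""
    n = len(neighbors)
    if n < 2:
        return {code: 0.0 for code in neighbors}
    return {code: len(nb) / (n - 1) for code, nb in neighbors.items()}

# Hypothetical patents, each tagged with example CPC codes.
example_patents = [
    ["G06N3/08", "G06V10/82"],
    ["G06N3/08", "G06F18/24"],
    ["G06N3/08", "G06Q10/06"],
]
centrality = degree_centrality(cooccurrence_network(example_patents))
print(centrality)
```

A code that co-occurs with most others scores near 1 and sits at the core of the network, while codes that appear only alongside it score low; a classification into categories would then apply thresholds to scores like these.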
This study proposes a semantic science-technology exploration framework specifically designed for the AI domain, consisting of two stages: technology classification and semantic topic exploration.
Paper description of the proposed framework and its two-stage design (methodological contribution).
Current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution.
Qualitative and/or quantitative evaluation results in paper indicating strengths in spatial grounding, multimodal alignment, and coordinated action execution.
We develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding.
Methodological contribution described in paper: parser implementation that converts recordings and logs into structured GUI action trajectories.
The tasks involve dense multimodal interfaces and tightly coupled interaction sequences.
Task descriptions and dataset characteristics in paper stating tasks are complex, long-horizon, multimodal, and tightly coupled.
We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows.
Dataset construction reported in paper: curated expert demonstrations spanning 7 applications and 186 tasks (numbers provided in text).
We introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments.
Paper describes the creation of the Cutverse benchmark as a central contribution (design and implementation described in methods).
GUI agents have made significant progress in web navigation and basic operating system tasks.
Background claim stated in paper referencing prior work on GUI agents applied to web navigation and OS tasks (no specific experiments in this paper to support it).
We develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure.
Methodological contribution described in the paper: creation of a taxonomy to harmonize labels and claimed measurement targets across benchmarks (details and mapping provided in paper/tool).
We introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data.
Empirical contribution: the paper publishes the dataset and tool (links provided). Counts reported in the paper metadata (231 benchmarks, 139 model releases, 11 builders).
The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), enabling sustained operation beyond full-context limits.
Reported stress test / capability demonstration in paper: profile size stated as 14,000+ facts and 125k tokens stored and managed by the system.
The Dual Process system maintains 70-85% accuracy with 1-2 second latency while using 62% fewer tokens (45,434 vs 120,000+ limit) compared to full-context approaches.
Reported empirical results from the large-scale evaluation (1,440 queries / 15,000 messages) comparing Dual Process to full-context models; exact accuracy, latency, and token-count figures provided in the paper.
The Dual Process Memory Architecture decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message).
System design description and measured consolidation growth rate reported in the paper; empirical observation of growth rate stated.
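The decoupling described above can be sketched as a fixed-size window of recent messages plus a separately growing store of consolidated facts. This is a minimal illustration, not the paper's implementation: the class name `DualProcessMemory`, the `add` and `context` methods, and the idea that the caller supplies already-extracted facts are all assumptions. The helper `token_saving` simply rechecks the reported 62% figure from the two token counts quoted in the claim.

```python
from collections import deque

class DualProcessMemory:
    """Hypothetical sketch: constant episodic window + growing fact store."""

    def __init__(self, window=10):
        # Episodic side: only the last `window` messages are kept.
        self.recent = deque(maxlen=window)
        # Long-term side: consolidated facts accumulate over time.
        self.consolidated = []

    def add(self, message, facts=()):
        """Record a message and any facts distilled from it (supplied by caller)."""
        self.recent.append(message)
        self.consolidated.extend(facts)

    def context(self):
        """Prompt context: consolidated knowledge followed by recent messages."""
        return list(self.consolidated) + list(self.recent)

def token_saving(used, baseline):
    """Fraction of tokens saved relative to a baseline budget."""
    return 1 - used / baseline

memory = DualProcessMemory(window=10)
for i in range(25):
    # Every fifth message yields one consolidated fact in this toy run.
    memory.add(f"message {i}", facts=[f"fact {i}"] if i % 5 == 0 else [])

print(len(memory.recent), len(memory.consolidated))
print(round(token_saving(45_434, 120_000), 2))
```

Because the episodic window never exceeds its fixed size, context growth comes only from the consolidated store, which is what allows operation to continue past a full-context limit.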
Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.
Empirical experiments reported in the paper (on unspecified real-world and synthetic tabular datasets) comparing SPN to PFN-style tabular foundation models and classical tabular methods; the abstract claims consistent improvements but does not report sample sizes, dataset names, or quantitative effect sizes.
SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution.
Description of SPN's mechanism in the paper (methodological detail). Presented as the approach used to approximate strategic post-manipulation inputs and align predictions; no quantitative details or sample sizes in the abstract.