Evidence (7953 claims)

Claim counts by topic (topics overlap, so per-topic counts sum to more than the total):
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
LEAFE converts rich environment feedback into actionable corrective supervision rather than optimizing only final success signals, which drives performance gains.
Algorithmic description: LEAFE summarizes error messages/intermediate observations into experience items, backtracks to causal decision points, explores corrective branches, and distills corrected trajectories via supervised fine-tuning. Empirical comparisons show improved Pass@k relative to reward-only/outcome-driven baselines.
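The loop described above can be sketched roughly as follows. All function names, data shapes, and the dict-based observation format are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of a LEAFE-style correction loop: summarize feedback,
# backtrack to a causal decision point, explore corrective branches.

def summarize_feedback(observations):
    """Condense error messages / intermediate observations into experience items."""
    return [o for o in observations if o.get("error")]

def backtrack(trajectory, experience):
    """Locate the earliest decision step implicated by the experience items."""
    implicated = {e["step"] for e in experience}
    return min(implicated) if implicated else len(trajectory)

def explore_corrections(trajectory, branch_point, n_branches=3):
    """Re-roll alternative actions from the causal decision point onward."""
    prefix = trajectory[:branch_point]
    return [prefix + [f"alt_action_{i}"] for i in range(n_branches)]

def leafe_step(trajectory, observations):
    experience = summarize_feedback(observations)
    branch_point = backtrack(trajectory, experience)
    # Successful corrected branches would then be distilled via supervised
    # fine-tuning rather than rewarded only on final task success.
    return explore_corrections(trajectory, branch_point)

trajectory = ["a0", "a1", "a2"]
observations = [{"step": 1, "error": "tool call failed"}]
branches = leafe_step(trajectory, observations)
print(len(branches), branches[0])
```

The key contrast with outcome-only training is visible in the structure: the feedback is turned into a localized correction target rather than a scalar end-of-episode reward.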
Open dataset and code improve reproducibility and lower barriers for follow-up work on applied LLM tools and economic impact studies.
Release of SlideRL dataset (288 rollouts) and code repository; general statement about reproducibility benefits.
Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying a potentially high ROI for targeted fine-tuning versus full-model scaling.
Observed empirical gain of +33.1% for the tuned 7B over its untuned base, with performance reaching 91.2% of Claude Opus 4.6; the implication drawn is that tuning a small fraction of parameters can be more cost-effective than scaling model size.
The inverse-specification reward—where an LLM attempts to recover the original brief from generated slides—provides a holistic fidelity signal.
Reward design: inverse-specification component implemented and used as part of composite reward; claimed to measure fidelity via recovery accuracy.
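A minimal sketch of such an inverse-specification reward, assuming it is scored as token overlap (F1) between the original brief and a brief reconstructed from the generated slides. `recover_brief` is a hypothetical stand-in for the judge-LLM call; none of these names come from the paper:

```python
# Sketch: reward = similarity(original brief, brief recovered from slides).

def recover_brief(slides):
    # A real system would prompt an LLM with the rendered slides here.
    return " ".join(s["title"] for s in slides)

def token_f1(reference, candidate):
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def inverse_spec_reward(brief, slides):
    return token_f1(brief, recover_brief(slides))

slides = [{"title": "Quarterly revenue growth"}, {"title": "Regional breakdown"}]
print(inverse_spec_reward("revenue growth by region", slides))
```

The design intent is that a deck from which the brief cannot be recovered earns a low fidelity score, regardless of how polished the slides look.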
Performance on this agentic slide-generation task is driven more by instruction adherence and tool-use compliance than by raw model parameter count.
Cross-model comparison across six models on the 48-task benchmark, with analyses showing instruction adherence and tool-use compliance better predict agent performance than parameter count.
Adoption will shift labor demand toward expertise in deterministic capture/replay tooling, trace analytics, and integration automation.
Economic/organizational implication discussed in the summary; no employment-data analysis provided—stated as an expected change in skill demand.
The approach improves utilization and ROI of expensive emulation/simulation resources by enabling reuse of deterministic traces across platforms.
Implication drawn from being able to replay identical traces on both simulator and emulator; no direct financial ROI calculation or utilization metrics provided in the summary.
Using replay-driven validation markedly shortens integration and debug cycles for the demonstrated chiplet subsystem, enabling end-to-end system boot and workload execution within a single quarter.
Reported outcome for the ODIN SoC building block: authors state they were able to reach full system boot and run workloads within one quarter of integration using the methodology. (Single-case timeline reported; no control/comparison group or statistical analysis provided.)
Replay-driven validation made previously hard-to-reproduce interactions and bugs deterministic and repeatable at system level, enabling more focused and efficient debug.
Authors report that deterministic capture/replay converted non-deterministic protocol interactions and transient bugs into repeatable traces that could be inspected and debugged; examples include complex GPU workloads and protocol sequences reproduced end-to-end. (Qualitative/process-level evidence from the demonstrator; no numerical bug-count reduction provided.)
A replay-driven validation methodology using deterministic waveform capture and replay from a single design database enables reliable, repeatable system-level reproduction of complex GPU workloads and protocol sequences for tightly coupled CPU–GPU chiplet subsystems.
Applied to a demonstrator SoC building block (ODIN chiplet architecture) integrating a CPU subsystem, multiple Intel Xe GPU cores, and a configurable NoC; deterministic waveform capture during execution and deterministic replay of those waveforms across targets was performed; same design database used to manage captures, traces, and replay sessions. (No large-sample statistical evaluation reported; demonstration limited to the described system.)
Overall conclusion: forecast-then-execute (anticipatory trajectory reasoning) is an effective principle for building multimodal agents capable of reasoning, planning, and acting in complex environments.
Paper's Conclusion in the provided summary asserts this, based on the reported experimental comparisons and the two-stage TraceR1 framework.
The paper reports improvements in planning stability (consistency of multi-step plans), execution robustness (success under environment/tool variability), and generalization (out-of-distribution tasks and unseen tool/environment states).
Reported outcomes in the summary explicitly list these three improvement categories; the specific metrics and magnitudes are not provided in the summary.
Compared to reactive agents that optimize actions stepwise without trajectory anticipation, TraceR1 yields better multi-step planning and execution.
Baselines & comparisons described in the summary include reactive agents; the paper reports improvements of TraceR1 relative to these baselines across the benchmarks (no numeric values in the provided text).
Explicit anticipatory (trajectory-level) reasoning is a crucial design principle for reliable multi-step task performance in complex real-world environments.
Paper reports comparisons between anticipatory (trajectory-forecasting) agents and reactive / single-stage baselines, concluding the anticipatory design yields better multi-step reliability; exact experimental details and statistics not included in the provided summary.
TraceR1 materially improves planning coherence, execution robustness, and generalization in multimodal, tool-using agents versus reactive or single-stage baselines.
Reported evaluation across seven benchmarks (online and offline computer-use, multimodal tool-use reasoning) comparing TraceR1 to reactive agents and single-stage RL baselines; summary states 'substantial gains' though no numerical results are provided in the provided text.
The proposed algorithm's performance is robust to heterogeneous populations in the synthetic experiments (i.e., it continues to find core alternatives under varying degrees of population heterogeneity).
Empirical robustness checks reported in the experiments where population heterogeneity is varied and performance (core-attainment frequency) is evaluated.
The authors compare their sampling algorithm against classical social-choice rules and LLM-based heuristics and report superior core-attainment frequency for their method.
Experimental comparisons described in the paper between the proposed algorithm and baseline methods (classical social-choice rules, LLM-based heuristics) on the synthetic dataset; results summarized in the experiments section.
On a synthetic text-preference dataset, the proposed algorithm reliably finds alternatives that lie in the proportional veto core.
Empirical experiments reported in the paper using a synthetic dataset of text preferences; evaluation metric reported as frequency (proportion) of runs where the returned alternative is in the proportional veto core.
Temporal grounding (restricting models to contemporaneous information) should be adopted as a methodological best practice in economic research using LLMs to avoid leakage and produce more realistic assessments of model forecasting ability.
Study methodology and rationale emphasize temporal grounding; authors recommend it as best practice based on the observed benefits in reducing retrospective contamination.
Because the conflict unfolded after the training cutoffs of contemporary frontier LLMs, the dataset and analyses provide an archival, hindsight-free benchmark for studying model reasoning.
Case selection rationale: the 2026 Middle East conflict was deliberately chosen because it occurred after the training cutoffs of the evaluated frontier models; dataset preserves contemporaneous queries and model outputs.
Frontier large language models (LLMs) can reason about an unfolding geopolitical crisis using only contemporaneous public information, often demonstrating strategic realism (inferring underlying structural incentives beyond surface rhetoric).
Evaluation across 11 temporally defined nodes during the early 2026 Middle East conflict, using 42 node-specific verifiable questions and 5 exploratory prompts; outputs from contemporary frontier LLMs, constrained to contemporaneous information, were assessed via verifiability checks and qualitative coding for strategic reasoning.
BATQuant establishes new state-of-the-art results across multimodal benchmarks for MXFP4-aware PTQ under aggressive quantization.
Comparative benchmark results reported in the paper showing BATQuant outperforming prior PTQ methods on the described multimodal benchmarks (specific benchmark names and quantitative margins not provided in the summary).
Ablation analyses show that each BATQuant component (block-wise transforms, orthogonality relaxation, GPK decomposition, block-wise clipping) contributes to robustness and efficiency.
Reported ablation studies isolating components and measuring their individual impact on performance and overhead in the paper's experiments (exact effect sizes and per-component numbers not given in the summary).
Block-wise learnable clipping suppresses residual outliers locally and contributes to robustness under aggressive MXFP4 quantization.
Method description and ablation experiments in the paper showing incremental improvement when adding block-wise learnable clipping layers versus not using them; improvements measured on benchmark metrics post-quantization.
Global and Private Kronecker (GPK) decomposition compresses transform parameters, keeping storage and runtime overhead low compared to dense per-block transforms.
Algorithmic contribution described in the paper with reported comparisons (storage/runtime overhead) versus dense per-block transform parameterizations; supported by experimental/implementation measurements (specific memory/runtime numbers not provided in the summary).
Relaxing orthogonality constraints on transforms (i.e., using non-strictly-orthogonal transforms) improves distribution shaping and better fits activations to the limited MXFP quantization range.
Design rationale and ablation studies reported in the paper showing that removing strict orthogonality yields better quantization fit and improved task metrics versus enforced orthogonal transforms.
Aligning transforms to MXFP block granularity using block-wise affine transformations prevents cross-block outlier propagation and avoids the severe collapse seen with rotation-based integer quantization techniques.
Methodological design plus ablation/empirical results in the paper showing improved activation statistics and preserved model accuracy when using block-wise affine transforms aligned to MXFP blocks versus global rotations.
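The block-granularity ideas above can be illustrated with a toy example. This is a simplification, not BATQuant's method: a signed 4-bit grid with one shared scale per block stands in for the MXFP4 format, and a fixed clip value stands in for the learnable clipping. The point it demonstrates is that clipping a local outlier preserves accuracy for the rest of its block, since the block's shared scale no longer stretches to cover the outlier:

```python
import numpy as np

# Toy block-wise quantization with per-block clipping (illustrative only).
BLOCK = 32  # MXFP-style block size

def quantize_block(x, clip):
    xc = np.clip(x, -clip, clip)                 # per-block clipping
    scale = max(np.abs(xc).max(), 1e-12) / 7     # shared scale, 4-bit signed grid
    return np.round(xc / scale) * scale

def blockwise_quantize(x, clips):
    blocks = x.reshape(-1, BLOCK)
    out = np.stack([quantize_block(b, c) for b, c in zip(blocks, clips)])
    return out.reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=64)
x[3] = 50.0  # inject an outlier into the first block

err_noclip = np.abs(blockwise_quantize(x, [100.0, 100.0]) - x)
err_clip = np.abs(blockwise_quantize(x, [3.0, 100.0]) - x)

# Compare error on all values except the clipped outlier itself:
mask = np.ones(64, dtype=bool)
mask[3] = False
print(err_clip[mask].mean() < err_noclip[mask].mean())
```

Because blocks are quantized independently, the outlier's influence is also contained to its own block, mirroring the cross-block isolation argument made for block-aligned transforms.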
Standardized runtime governance frameworks could lower per-deployment compliance engineering costs and increase diffusion of agentic systems.
Theoretical argument that standardization reduces transaction/engineering costs; suggested market dynamics; no empirical implementation evidence.
A market will develop for third-party governance tools, auditors, and insurers providing policy evaluators, risk calibration, and certification services.
Economic argument and analogy to existing markets (governance-as-a-service, insurance); no empirical evidence presented.
Benchmarking time-sensitivity (via V-DyKnow) can inform procurement decisions: buyers should assess models on their ability to handle temporally sensitive information, not just static benchmarks.
Paper's recommendations and implications section arguing for procurement practices informed by V-DyKnow evaluations.
The authors provide an operational inventory and conversation-analysis tool (the 28-code instrument) that can be reused for monitoring and mitigation by researchers, firms, and regulators.
Paper includes the codebook and describes its application as a re-usable monitoring/analysis instrument; proposed adoption discussed in implications.
This is the first empirical, message-level study of verified chatbot-related psychological-harm cases (as opposed to speculative discussion).
Authors' positioning in paper; claim of novelty based on review of prior literature and their message-level, verified-case approach.
The authors synthesized complex three-port pixelated output combiners that extend efficiency over back-off using fully symmetrical device implementations.
Design novelty claimed in paper; the resulting three-port pixelated combiner layouts were included in the optimization output and used in prototypes, which employed fully symmetrical device implementations.
The CNN EM surrogate enables orders-of-magnitude faster evaluations than full-wave EM simulation, enabling global search of the discrete pixel design space.
Authors state the surrogate provides orders-of-magnitude speedups compared to full-wave EM, enabling global search; no quantitative speedup numbers or benchmarking details are provided in the provided summary.
A deep convolutional neural network (CNN) trained as an electromagnetic (EM) surrogate can predict S-parameters of pixelated passive networks quickly and with sufficient accuracy to be used inside an optimizer loop.
Paper reports development and use of a CNN surrogate mapping pixelated network layouts to S-parameters; the surrogate was embedded in the optimizer and used to evaluate candidate layouts during global search. (Note: exact training dataset size, architecture, and error metrics are not provided in the summary.)
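The surrogate-in-the-loop search pattern can be sketched as below. The trained CNN is replaced here by a stand-in scoring function so the loop is runnable; all names and the random-search strategy are illustrative assumptions, not the authors' optimizer:

```python
import numpy as np

rng = np.random.default_rng(1)

def surrogate_score(pixels):
    """Stand-in for the CNN surrogate: pixel layout -> scalar figure of merit.

    A trained CNN would return predicted S-parameters in milliseconds,
    replacing a full-wave EM simulation that takes orders of magnitude
    longer per candidate; that speed gap is what makes global search
    over the discrete design space feasible.
    """
    target = np.indices(pixels.shape).sum(axis=0) % 2  # arbitrary fixed pattern
    return np.abs(pixels - target).mean()              # lower is better

def global_search(shape=(8, 8), iters=500):
    best_layout, best_cost = None, np.inf
    for _ in range(iters):
        candidate = rng.integers(0, 2, size=shape)     # discrete pixel layout
        cost = surrogate_score(candidate)              # cheap surrogate call
        if cost < best_cost:
            best_layout, best_cost = candidate, cost
    return best_layout, best_cost

layout, cost = global_search()
print(cost < 0.5)
```

In practice the top surrogate-ranked candidates would still be verified with a small number of full-wave EM simulations before fabrication.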
Empirical evaluation shows the new quasi-Newton and trust-region methods outperform baseline sequential methods and prior parallel Newton variants across speed, memory use, stability, and convergence on the tested tasks.
Reported experiments comparing the proposed algorithms to sequential baselines and prior parallel Newton approaches on representative tasks (RNNs, MCMC); qualitative summary claims faster runtimes, lower memory, and improved stability.
Trust-region methods provide stability and improved convergence reliability across tested tasks.
Empirical comparisons and algorithmic analysis showing trust-region-enabled schemes had fewer divergences and more reliable convergence than prior parallel Newton variants in the evaluated workloads.
Quasi-Newton methods deliver faster runtimes and lower memory use in experiments on RNN inference/training and MCMC chains.
Empirical experiments comparing quasi-Newton implementations to full Newton and sequential baselines on representative tasks (explicit tasks listed: RNN inference/training and MCMC chains); reported qualitative outcomes indicate speed and memory advantages.
Trust-region variants substantially improve stability and robustness, addressing divergence issues of earlier parallel Newton implementations.
Presentation of trust-region schemes adapting step sizes within the parallel Newton framework; theoretical motivation and empirical results showing reduced divergence/failure rates compared to prior parallel Newton variants.
Quasi-Newton variants are more computationally efficient and memory friendly than full Newton.
Complexity and memory analyses in the thesis plus empirical comparisons on representative tasks (RNNs, MCMC) showing lower runtime and memory usage for quasi-Newton implementations versus full Newton.
A Parallel Newton framework, implemented with a parallel associative scan, provides a natural way to parallelize computations across sequence length.
Algorithmic design combining Newton updates with a parallel associative-scan reduction; implementation details and experiments demonstrating the mechanics of the parallel scan across time steps.
Parallel Newton methods can reliably and efficiently parallelize sequential dynamical systems (e.g., RNNs, MCMC) across sequence length when reframed as nonlinear equation solves.
Thesis presents a reformulation of sequence computation as a global nonlinear system, develops parallel Newton-style algorithms, and reports empirical experiments on representative tasks (RNN inference/training and MCMC chains) comparing runtime and convergence against sequential baselines and prior parallel Newton variants.
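The reformulation can be illustrated on a toy scalar recurrence. This sketch is not the thesis's implementation: the bidiagonal Newton solve below is written as a sequential loop for clarity, but it is exactly the kind of linear recurrence a parallel associative scan would evaluate in logarithmic depth:

```python
import numpy as np

# Recurrence x_t = tanh(a*x_{t-1} + b) recast as a global nonlinear
# system F(x) = 0 with F_t = x_t - f(x_{t-1}), solved by Newton steps.
a, b, x0, T = 0.9, 0.1, 0.0, 30

def f(x):
    return np.tanh(a * x + b)

def rollout():
    out, prev = np.empty(T), x0
    for t in range(T):
        prev = f(prev)
        out[t] = prev
    return out

def newton_solve(iters=T):
    x = np.zeros(T)
    for _ in range(iters):
        prev = np.concatenate(([x0], x[:-1]))
        r = x - f(prev)                  # residual F(x)
        fp = a * (1.0 - f(prev) ** 2)    # f'(x_{t-1}); Jacobian is bidiagonal
        d = np.empty(T)                  # solve J d = -r
        d[0] = -r[0]
        for t in range(1, T):            # linear recurrence: scan-parallelizable
            d[t] = -r[t] + fp[t] * d[t - 1]
        x = x + d
    return x

print(np.max(np.abs(newton_solve() - rollout())) < 1e-10)
```

The Jacobian of the stacked system is lower bidiagonal (ones on the diagonal, -f' on the subdiagonal), so each Newton step reduces to a first-order linear recurrence in d, which is associative and hence parallelizable across the sequence length.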
Adopting this approach shifts required skills and organizational roles away from lengthy parametric modeling toward data engineering, controller integration, and monitoring.
Authors' discussion of practical/organizational implications (qualitative); argument based on removal of model-building step and increased emphasis on data infrastructure and online operations.
DeePC outperforms baseline controllers (e.g., fixed-time and standard adaptive schemes) in the simulated experiments.
Comparative simulation experiments reported in the paper where DeePC-controlled signals achieve superior system-level metrics relative to baseline controllers.
The method was validated on a very large, high-fidelity microscopic closed-loop simulator of Zürich; the paper reports this as the largest such closed-loop urban-traffic simulation in the literature.
Authors' description of the experimental environment: city-scale microscopic simulator of Zürich with controller in the loop; explicit statement in the paper claiming it is the largest closed-loop urban-traffic simulation reported in the literature.
Regularization and the use of measured Hankel/data matrices make the method more robust to measurement noise and limited data.
Method description includes regularization terms in the DeePC optimization and use of Hankel matrices built from measured trajectories; simulation experiments show continued performance under noisy / limited-data conditions.
DeePC handles sparse or limited traffic measurements better than many machine-learning methods.
Claims in the paper supported by experiments and methodological notes: use of Hankel structures and regularization in DeePC to operate with limited/sparse sensing; comparative statements versus generic ML methods (qualitative and simulation evidence).
The DeePC-based approach avoids the expensive, time-consuming model-building step required by model-based control methods.
Methodological argument and demonstration that controller uses historical input–output trajectories directly rather than requiring separate parametric model identification; supported by simulation implementation that bypasses model identification.
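A minimal sketch of the data-driven ingredients, on an assumed toy first-order system (the names Up/Yp/Uf/Yf follow common DeePC notation, but this is not the paper's controller): a Hankel matrix is built from measured input-output data, a ridge-regularized least squares picks the DeePC variable g that matches a recent past window, and the implied future outputs are read off as Yf @ g, with no parametric model identified anywhere:

```python
import numpy as np

def hankel(w, L):
    """Stack all length-L windows of signal w as columns."""
    return np.column_stack([w[i:i + L] for i in range(len(w) - L + 1)])

# Toy system y[t+1] = 0.8*y[t] + u[t], "measured" historical data:
rng = np.random.default_rng(2)
u = rng.normal(size=60)
y = np.zeros(60)
for t in range(59):
    y[t + 1] = 0.8 * y[t] + u[t]

Tini, Tf = 4, 3                      # past and future horizons
L = Tini + Tf
Hu, Hy = hankel(u, L), hankel(y, L)
Up, Uf = Hu[:Tini], Hu[Tini:]
Yp, Yf = Hy[:Tini], Hy[Tini:]

u_ini, y_ini = u[-L:-Tf], y[-L:-Tf]  # most recent past window
u_f = u[-Tf:]                        # planned future inputs

A = np.vstack([Up, Yp, Uf])
b = np.concatenate([u_ini, y_ini, u_f])
lam = 1e-8                           # regularization (robustness to noise)
g = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
y_pred = Yf @ g                      # predicted future outputs

print(np.max(np.abs(y_pred - y[-Tf:])) < 1e-2)
```

In a full DeePC controller u_f would be a decision variable optimized against a cost on y_pred; here it is fixed so the sketch just verifies that the Hankel data alone reproduces the system's behavior.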
Legible decision modes and recorded contest pathways improve verifiability and lower information asymmetries, aiding regulators and platforms in monitoring and reducing litigation/reputational risk.
Analytic claim in the implications section; argued conceptually and tied to proposed logging/audit tools; no empirical validation.
The pattern can reduce costly misallocations caused by LLM unpredictability by constraining policy options, improving overall allocation efficiency in expectation.
Theoretical argument in the paper tying constrained policy space to reduced variability and misallocation risk; no empirical testing or quantitative model provided.