Evidence (13827 claims)
| Topic | Claims |
|---|---|
| Adoption | 8454 |
| Productivity | 7544 |
| Governance | 6789 |
| Human-AI Collaboration | 6327 |
| Org Design | 4126 |
| Innovation | 4058 |
| Labor Markets | 3520 |
| Skills & Training | 2924 |
| Inequality | 2057 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
We implement an Adversarial Multi-Agent Quality Control (QC) loop in which evaluator agents iteratively critique generated frames and prompt generators to refine outputs until a deterministic consensus is reached.
Method description of a multi-agent adversarial QC loop used in the pipeline; no experimental protocol, number of agents, or sample sizes provided in this sentence.
Our architecture integrates a retrieval-based 'Brand DNA' extraction module to parameterize generation according to established corporate identity guidelines.
Methodological description in paper indicating a retrieval-based module for extracting Brand DNA used to condition generation; no evaluation metrics or sample sizes provided in this statement.
We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production.
Paper describes the proposed system architecture (Genflow) as a methodological contribution; description of modules and pipeline provided but no external validation details in this sentence.
Recent advancements in generative video models demonstrate high visual fidelity.
Asserted in paper as a background observation about recent generative video models; no specific dataset, benchmark, or sample size reported.
Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) were performed to provide practical insights for production deployment.
Authors state they performed benchmark comparisons of multiple LLM backends (listed in abstract); specifics of metrics and sample sizes not given in abstract.
A comprehensive sustainability analysis shows that the hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing.
Authors report a sustainability analysis comparing hybrid AI+HITL approach to traditional manual processing (details not provided in abstract).
Prompt Fine Tuning with Feedback Inheritance (PFTFI) is a novel approach introduced in this work.
Authors explicitly introduce PFTFI as part of their approach (stated in abstract).
The system integrates five specialized agents—Classificator, Splitter, Parser, Extraction, and Validator—together with a Human-in-the-Loop mechanism and a Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach.
Authors' architectural description in the abstract specifying the five agents, HITL mechanism, and PFTFI approach.
MADP combines deep learning-based classification and parsing with large language model extraction while maintaining accuracy through selective human validation.
System description in paper asserting integration of DL classification/parsing, LLM extraction, and selective human validation; supported by system evaluations reported elsewhere in abstract.
Ablation evaluation on a stratified 100-document subset demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy.
Ablation evaluation on a stratified subset of 100 documents (5 documents per each of 20 supplier/document-type categories) reported by authors.
Only 3% of documents required non-AI fallback in the production deployment.
Production deployment on 955 real-world documents (stated in abstract).
Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate.
Reported production deployment on 955 real-world documents processed through January 2026 (stated in abstract).
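As a quick consistency check, the two production rates reported above can be converted into approximate document counts. This is plain arithmetic on the figures quoted from the abstract; rounding to whole documents is an assumption, since exact counts are not given here.

```python
# Approximate document counts implied by the reported production rates.
TOTAL_DOCS = 955          # documents in the production deployment (from the abstract)
AUTOMATION_RATE = 0.970   # full-pipeline automation rate (from the abstract)
FALLBACK_RATE = 0.03      # share requiring non-AI fallback (from the abstract)

# Rounding to whole documents is an assumption; exact counts are not reported.
automated = round(TOTAL_DOCS * AUTOMATION_RATE)
fallback = round(TOTAL_DOCS * FALLBACK_RATE)

print(f"~{automated} automated, ~{fallback} fallback, {automated + fallback} total")
```

Under that rounding the two rates account for all 955 documents, so the 97.0% and 3% figures are mutually consistent.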
Operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%.
Operational analysis reported by authors on a production use-case scenario involving 100,000 invoices per year (stated in abstract).
Trace-Prior RL adds bounded adaptation under capacity asymmetry.
Experiments contrasting Trace-Prior RL versus behavior cloning and reward-only approaches in settings with capacity asymmetry, showing Trace-Prior RL permits limited/adaptive deviation while preserving trace alignment.
Pure behavior cloning is nearly enough for symmetric imitation.
Empirical results in symmetric imitation settings (presumably in the two-hotel or bidding benchmarks) showing behavior cloning achieves close imitation without additional RL.
Trace-prior or corrected-history policies better preserve price or bid distributions.
Comparative experiments and ablations across the two-hotel benchmark and hidden-budget bidding task showing trace-prior and corrected-history policies retain price/bid distribution characteristics better than reward-only variants.
Revealing hidden state reduces label uncertainty.
Experiments (hidden-state ablations) in the compact hidden-budget bidding task and/or two-hotel benchmark where providing hidden state information to the learner reduced uncertainty in inferred labels.
A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem.
Reported evaluation: year-long pilot conducted across three clinical sites, total workflow runs = 8,728, reported completion rate = 97.08%; prototype lacked verified-core subsystem.
Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions.
Design description in the paper explaining swimlane use to delineate trust boundaries between system components and humans/external systems/AI.
At runtime a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit.
System architecture description in the paper describing runtime engine features (append-only log, enforcement, replay/retry, audit support).
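The append-only log with replay described above can be illustrated with a minimal sketch. The names `Event`, `EventLog`, and `run_step` are hypothetical and not taken from the paper; the sketch only shows the general idea that a recorded outcome is returned on replay instead of re-running the step.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Event:
    """One immutable log record (hypothetical shape)."""
    seq: int
    step: str
    outcome: Any


class EventLog:
    """Append-only log: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, step: str, outcome: Any) -> Event:
        event = Event(seq=len(self._events), step=step, outcome=outcome)
        self._events.append(event)
        return event

    def events(self) -> tuple[Event, ...]:
        # Return an immutable snapshot so callers cannot mutate the log.
        return tuple(self._events)


def run_step(log: EventLog, step: str, action: Callable[[], Any]) -> Any:
    """Run a step once; on replay, return the recorded outcome instead."""
    for event in log.events():
        if event.step == step:
            return event.outcome
    outcome = action()          # if this raises, nothing is logged, so a retry re-runs it
    log.append(step, outcome)
    return outcome


log = EventLog()
calls = []


def charge_card() -> str:
    calls.append("charged")
    return "receipt-001"


first = run_step(log, "charge_card", charge_card)
replayed = run_step(log, "charge_card", charge_card)
print(first, replayed, len(calls))
```

The second call returns the logged outcome without invoking `charge_card` again, and the log itself serves as the audit trail.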
At compile time GraphFlow restricts diagrams to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library.
Design/specification claim in the paper describing compile-time restrictions and proof-checked admission model (implementation/design detail).
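To make the contract vocabulary concrete, the sketch below models an automation with a precondition, a postcondition, and sequential composition. It is an assumption-laden illustration: the paper describes contracts that are proof-checked before admission, whereas this sketch only checks them at run time, which is a weaker substitute. All names (`Automation`, `execute`, `compose`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]


@dataclass(frozen=True)
class Automation:
    """A reusable step with a contract (hypothetical structure)."""
    name: str
    pre: Callable[[State], bool]
    post: Callable[[State], bool]
    run: Callable[[State], State]


def execute(automation: Automation, state: State) -> State:
    """Run an automation, enforcing its contract with runtime checks."""
    if not automation.pre(state):
        raise ValueError(f"{automation.name}: precondition failed")
    result = automation.run(state)
    if not automation.post(result):
        raise ValueError(f"{automation.name}: postcondition failed")
    return result


def compose(first: Automation, second: Automation) -> Automation:
    """Sequential composition.

    The composition obligation (first's postcondition must establish second's
    precondition) is only checked dynamically here, not proved.
    """
    def run(state: State) -> State:
        return execute(second, execute(first, state))

    return Automation(f"{first.name}>>{second.name}", first.pre, second.post, run)


validate = Automation(
    name="validate",
    pre=lambda s: "amount" in s,
    post=lambda s: "valid" in s,
    run=lambda s: {**s, "valid": s["amount"] > 0},
)
approve = Automation(
    name="approve",
    pre=lambda s: s.get("valid") is True,
    post=lambda s: s.get("approved") is True,
    run=lambda s: {**s, "approved": True},
)

pipeline = compose(validate, approve)
print(execute(pipeline, {"amount": 10}))
```

Passing `{"amount": -1}` raises at the `approve` precondition, which is the run-time analogue of an unmet composition obligation.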
GraphFlow treats workflow diagrams as the executable specification — a single artifact defining data scope, execution semantics, and monitoring — to address the gap between durable execution and semantic correctness.
System design description in the paper explaining GraphFlow's design philosophy and intended role of diagrams as executable specifications.
Existing workflow platforms provide durable execution and observability.
Author statement in background/motivation describing properties of existing workflow platforms.
Context engineering (programmatic state abstraction and clean task decomposition) is generally more cost-effective than deeper per-agent deliberation.
Cost-effectiveness measured as returns per token spent (RPTS) across configurations that vary context representation and deliberation; results from the 3,475-episode controlled study indicate context changes yielded larger returns per token than adding deliberation tools.
Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations.
Controlled empirical study in the CybORG CAGE-2 POMDP environment comparing context representations (raw observations vs. deterministic state-tracking layer with compressed history) across five model families, six models, and twelve configurations with token-level cost accounting (3,475 episodes).
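The comparison above rests on a returns-per-token metric. The exact normalization used in the study is not stated here, so the sketch below assumes one plausible form: gain in mean return over a baseline, divided by tokens spent. The configuration names and all numbers are hypothetical placeholders, not results from the study.

```python
def returns_per_token(mean_return: float, baseline_return: float, tokens_spent: int) -> float:
    """Gain in mean return over a baseline per token spent (assumed definition)."""
    if tokens_spent <= 0:
        raise ValueError("tokens_spent must be positive")
    return (mean_return - baseline_return) / tokens_spent


# Hypothetical numbers for illustration only.
BASELINE_RETURN = -100.0  # raw-observation configuration
configs = {
    "state_abstraction": {"mean_return": -60.0, "tokens": 20_000},
    "extra_deliberation": {"mean_return": -80.0, "tokens": 50_000},
}

rpts = {
    name: returns_per_token(cfg["mean_return"], BASELINE_RETURN, cfg["tokens"])
    for name, cfg in configs.items()
}
best = max(rpts, key=rpts.get)
print(rpts, best)
```

With these placeholder values the context-side change yields a larger gain for fewer tokens, which is the kind of ordering the claim describes.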
Companies that train workers outperform those that simply cut them.
Claim presented as one of the five lessons, based on historical analogy and emerging workplace evidence (chapter asserts firms that invest in training do better).
For AI datacenter design, the relevant planning objective is not installed megawatts, but deployable capacity over time.
Conclusion/recommendation drawn from the paper's modeling results and analysis (argument that installed MW is a poor planning metric compared to time-varying deployable capacity).
The framework combines projection models for GPU, compute, and storage deployments with operational factors grounded in production data from Microsoft Azure.
Method claim: framework integrates projection models and operational data from Microsoft Azure (production data grounding); stated in the paper's methods summary.
We develop a framework for evaluating datacenter power delivery designs using throughput, power, and cost metrics over realistic arrival, oversubscription, and decommissioning sequences.
Methodological claim describing the paper's core contribution: a simulation/evaluation framework combining throughput, power, and cost metrics with arrival/oversubscription/decommissioning sequences; based on the authors' implementation (details and data referenced in the paper).
Designs must remain efficient over long datacenter lifetimes and multiple hardware generations.
Normative/design recommendation motivated by long asset lifetimes and evolving hardware density; stated as a requirement in the paper.
Demand for AI accelerators is rapidly increasing rack power density, with projections approaching 1 MW per deployment by 2027.
Projection models for GPU deployments described in the paper (projection models combined with industry deployment assumptions); specific provenance referenced in the abstract but no sample size reported.
Countries need coordinated, long-term policy frameworks to adapt to AI-driven structural transformation; trade policy, industrial policy, and digital regulation should be addressed within an integrated strategy.
Conclusion and policy-recommendation section of the study; normative advice and an argument for the need for coordination; no empirical evidence or implementation examples are given.
For developing countries, early investment in digital infrastructure may open new windows of competitiveness.
Conceptual argument; policy orientation and strategic recommendation; no empirical test or quantitative evidence is presented.
As automation and smart manufacturing systems spread, comparative advantages based on cheap labor are expected to erode, and the return of production to advanced economies or allied countries (reshoring and friendshoring) is expected to gain momentum.
Conceptual analysis and logical inference about the expected technology→consumption/production mechanisms; the study presents no empirical test or quantitative data.
Higher sectoral digitalization potential strongly increased remote work: DiD estimate 40.74 percentage points (p < 0.001); remote work rose from 17.6% to 82.1% in highly digitalized sectors versus 1.3% to 6.6% in less digitalized sectors.
Difference-in-differences (DiD) analysis using the COVID-19 shock as quasi-natural experiment on quarterly panel data for 27 EU Member States (2018–2024), N = 36,685; reported DiD estimate = 40.74 percentage points, p < 0.001; descriptive pre/post shares reported for both groups.
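For orientation, the unadjusted difference-in-differences can be computed directly from the descriptive shares quoted in the claim. It comes out near 59.2 percentage points, well above the reported estimate of 40.74. The gap presumably reflects the regression specification (for example controls, fixed effects, or a different pre/post window); that explanation is an assumption, as the summary does not say.

```python
# Remote-work shares in percent, taken from the claim above.
high_pre, high_post = 17.6, 82.1   # highly digitalized sectors
low_pre, low_post = 1.3, 6.6       # less digitalized sectors

change_high = high_post - high_pre   # change in the highly digitalized group
change_low = low_post - low_pre      # change in the comparison group
raw_did = change_high - change_low   # unadjusted difference-in-differences

REPORTED_DID = 40.74                 # model-based estimate reported by the authors

print(f"raw DiD = {raw_did:.1f} pp, reported DiD = {REPORTED_DID} pp")
```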
Higher sectoral digitalization potential has a statistically significant positive effect on wages (hourly wages).
Difference-in-differences (DiD) analysis using the COVID-19 shock as quasi-natural experiment on the same quarterly panel (27 EU Member States, 2018–2024), N = 36,685; reported DiD coefficient = 0.52 €/hour, p < 0.001; authors state this corresponds to ≈4.6% increase in the wage gap between highly and less digitalized activities.
The study's findings offer actionable insights for managers and policymakers to leverage AI for sustainable organizational growth while safeguarding employee well-being.
Authors' concluding statement based on survey findings and analytical results.
Successful human–AI collaboration requires a human-centric approach that balances technological advancement with workforce development, ethical governance, and organizational support.
Study conclusion/recommendation based on survey findings (perceptions of opportunities and challenges) and analytical results (correlation/regression).
Human–AI collaboration reduces employees' routine workload.
Respondent perceptions collected via the structured questionnaire and analyzed with descriptive statistics and regression in SPSS.
AI-based systems support better decision-making by providing data-driven insights, allowing employees to focus on higher-level cognitive and strategic activities.
Survey responses (structured questionnaire) analyzed with SPSS (correlation and regression analyses) reporting perceived support for decision-making.
Human–AI collaboration significantly enhances workplace efficiency and productivity by reducing routine workload and improving accuracy and speed in task execution.
Primary data from employees in AI-enabled organizations collected via a structured questionnaire (5-point Likert); analyzed with SPSS using descriptive statistics and regression analysis.
Simulations calibrated to a real multifamily rental market confirm that supra-competitive outcomes arise robustly beyond the theoretical assumptions, including under finite horizons, heterogeneous products, and nonlinear logit demand.
Simulation experiments calibrated to a real multifamily rental market; simulations test finite-horizon settings, product heterogeneity, and nonlinear logit demand formulations.
Under symmetric exploration, prices can reach monopoly levels.
Theoretical result derived in the ODE analysis showing convergence to monopoly-level prices in symmetric exploration scenarios.
Supra-competitive prices arise when firms explore within similar price ranges on the same side of the Nash price.
Analytical characterization from the fluid-limit ordinary differential equation (ODE) analysis of the explore-then-exploit pipeline with misspecified monopoly-style estimation.
Simple algorithmic pricing systems can systematically produce collusive-like (supra-competitive) prices in multi-firm markets.
Theoretical model of multi-firm pricing with an explore-then-exploit pipeline and misspecified monopoly-style demand estimation; fluid-limit ODE analysis characterizing convergence; supporting simulations calibrated to a real multifamily rental market.
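The mechanism behind these pricing claims can be illustrated with a toy example. This is not the paper's fluid-limit ODE model. It is a single explore-then-exploit round for two symmetric firms with a hypothetical linear demand, zero cost, and a deterministic exploration grid. Each firm fits a monopoly-style demand curve that ignores the rival's price, then sets the price that maximizes revenue under that fitted curve.

```python
from itertools import product
from statistics import mean

# Hypothetical linear demand: q_i = A - B * p_i + C * p_j (zero marginal cost).
A, B, C = 10.0, 2.0, 1.0


def demand(own_price: float, rival_price: float) -> float:
    return A - B * own_price + C * rival_price


def fit_monopoly_model(own_prices, quantities):
    """OLS fit of the misspecified model q = alpha - beta * p (rival ignored)."""
    p_bar, q_bar = mean(own_prices), mean(quantities)
    cov = sum((p - p_bar) * (q - q_bar) for p, q in zip(own_prices, quantities))
    var = sum((p - p_bar) ** 2 for p in own_prices)
    beta = -cov / var
    alpha = q_bar + beta * p_bar
    return alpha, beta


def exploit_price(explore_grid):
    """Price chosen after both firms explore every pair of prices on the grid."""
    pairs = list(product(explore_grid, repeat=2))
    own = [p for p, _ in pairs]
    qty = [demand(p, r) for p, r in pairs]
    alpha, beta = fit_monopoly_model(own, qty)
    return alpha / (2 * beta)  # maximizes p * (alpha - beta * p)


nash_price = A / (2 * B - C)            # symmetric Nash price
monopoly_price = A / (2 * (B - C))      # joint-profit-maximizing price
learned_price = exploit_price([4.0, 5.0, 6.0])  # exploration above the Nash price

print(f"Nash {nash_price:.3f} < learned {learned_price:.3f} < monopoly {monopoly_price:.3f}")
```

In this toy setting the rival's average exploration price is absorbed into the fitted intercept, so exploring above the Nash price pushes the chosen price above Nash, while exploring on a grid below it (for example `[1.0, 2.0, 3.0]`) pushes the chosen price below Nash.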
Continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.
Concluding claim in abstract: 'Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary...'.
PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern.
Design statement in abstract describing scheduled daily runs to monitor behavioral drift.
PRISM diagnoses root causes of failures and surgically repairs the prompt, iterating until all tests pass.
Methodological description in abstract stating diagnosis and iterative repair loop until tests pass.
PRISM simulates full multi-turn conversations against a platform-faithful LLM environment and evaluates pass/fail using an LLM-as-judge.
Method/architecture claim in abstract describing simulation of multi-turn conversations and LLM-based judging.
PRISM automatically generates test cases from plain-language agent requirements.
Methodological description in abstract stating PRISM takes plain-language requirements and automatically generates test cases.
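The loop described across the PRISM claims above (generate tests, simulate, judge, repair, repeat) can be sketched generically. Everything here is a stand-in: `optimize_prompt`, `run_test`, and `repair` are hypothetical names, the simulation and LLM-as-judge are replaced by a simple substring check, and the `max_rounds` cap is an added assumption, since the source only says iteration continues until all tests pass.

```python
from typing import Callable, List, Tuple


def optimize_prompt(
    prompt: str,
    tests: List[str],
    run_test: Callable[[str, str], bool],
    repair: Callable[[str, List[str]], str],
    max_rounds: int = 5,  # safety cap; an assumption, not from the source
) -> Tuple[str, bool]:
    """Repair the prompt until every test passes or the round cap is hit."""
    for _ in range(max_rounds):
        failures = [t for t in tests if not run_test(prompt, t)]
        if not failures:
            return prompt, True
        prompt = repair(prompt, failures)
    return prompt, all(run_test(prompt, t) for t in tests)


# Stand-ins for the simulated conversation plus judge, and for the repair step.
def stub_run_test(prompt: str, requirement: str) -> bool:
    return requirement in prompt


def stub_repair(prompt: str, failures: List[str]) -> str:
    # Address one failure per round to mimic targeted, incremental repair.
    return f"{prompt} Always cover: {failures[0]}."


requirements = ["refund policy", "escalation to a human"]
final_prompt, passed = optimize_prompt(
    "You are a support agent.", requirements, stub_run_test, stub_repair
)
print(passed, final_prompt)
```

In a real system the two stubs would be replaced by a multi-turn simulation with an LLM judge and by a diagnosis-driven prompt edit; the control flow is the only part this sketch is meant to show.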