Evidence (13827 claims)
Adoption (8454 claims)
Productivity (7544 claims)
Governance (6789 claims)
Human-AI Collaboration (6327 claims)
Org Design (4126 claims)
Innovation (4058 claims)
Labor Markets (3520 claims)
Skills & Training (2924 claims)
Inequality (2057 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
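The matrix is easier to compare across outcomes as shares than as raw counts. The sketch below uses three rows copied from the table above and computes each direction's share of the listed total. The names `ROWS` and `direction_shares` are illustrative and not part of the source. The four direction columns do not always add up to the Total column (for Firm Productivity they sum to 598 against a listed 604), so the sketch reports the difference as an "unlisted" share, on the assumption that those claims carry no coded direction.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Error Rate": (69, 91, 10, 2, 172),
}


def direction_shares(counts):
    """Return each direction's share of the listed total, plus the unlisted remainder."""
    positive, negative, mixed, null, total = counts
    listed = positive + negative + mixed + null
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
        # Assumption: claims not covered by the four columns have no coded direction.
        "unlisted": (total - listed) / total,
    }


for outcome, counts in ROWS.items():
    shares = direction_shares(counts)
    print(
        f"{outcome}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative, {shares['unlisted']:.1%} unlisted"
    )
```

Dividing by the listed total rather than by the sum of the four columns keeps the shares comparable with the Total column as published.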
IDS jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies.
Description of the IDS method and architecture presented in the paper (system design and algorithmic loop).
This paper presents Inductive Deductive Synthesis (IDS), the first effective approach to addressing the gap between LLM coding agents and mechanized formal verification for distributed systems.
Statement of novelty supported by the empirical claim that IDS succeeds on all 7 benchmark specs while prior SOTA agents did not; methodological description of IDS as a joint, incremental synthesis and learning system.
IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.
Empirical benchmarking of IDS-produced implementations against published verified systems, with performance (runtime) comparisons reporting up to a 3x speedup.
IDS is 17% cheaper than SOTA agents.
Cost comparison reported in the paper between IDS and the evaluated SOTA coding agents across the same 7 specs, yielding a 17% cost reduction for IDS.
IDS is roughly 200x faster than expert effort.
Comparison in the paper between IDS runtime (hours) and the typical expert effort (described as months to years) required for mechanized formal verification of similar distributed-system specifications; reported multiplicative speedup (~200x).
IDS costs $106 per spec on average.
Reported monetary cost computed for IDS runs averaged across the 7 specs in the evaluation.
IDS succeeds on all 7 specs (7/7), taking about 6.8 hours per spec on average.
Empirical evaluation of IDS on the same suite of 7 distributed key-value-store specifications, with runtime (wall-clock) measured and averaged over the 7 specs.
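The IDS figures quoted above can be combined with simple arithmetic. The sketch below uses only the reported numbers ($106 and 6.8 hours per spec, 7 specs, 17% cheaper, roughly 200x faster). The implied baselines are back-of-the-envelope inferences, not values reported in the excerpt: they assume that "17% cheaper" means IDS costs 83% of what the SOTA agents cost, that the 200x factor applies to the 6.8-hour average, and that an expert works 40 hours per week.

```python
# Figures reported in the claims above.
SPECS = 7
COST_PER_SPEC_USD = 106
HOURS_PER_SPEC = 6.8
COST_REDUCTION = 0.17        # "17% cheaper than SOTA agents"
SPEEDUP_VS_EXPERT = 200      # "roughly 200x faster than expert effort"

# Assumption for illustration only: a 40-hour expert work week.
WORK_HOURS_PER_WEEK = 40

# Totals across the evaluation suite.
total_cost_usd = SPECS * COST_PER_SPEC_USD
total_hours = SPECS * HOURS_PER_SPEC

# Implied baselines (inferred, not reported in the excerpt).
implied_sota_cost_usd = COST_PER_SPEC_USD / (1 - COST_REDUCTION)
implied_expert_hours = HOURS_PER_SPEC * SPEEDUP_VS_EXPERT
implied_expert_weeks = implied_expert_hours / WORK_HOURS_PER_WEEK

print(f"Total IDS cost over {SPECS} specs: ${total_cost_usd}")
print(f"Total IDS wall-clock time: {total_hours:.1f} hours")
print(f"Implied SOTA agent cost per spec: ${implied_sota_cost_usd:.2f}")
print(f"Implied expert effort per spec: {implied_expert_hours:.0f} hours "
      f"(about {implied_expert_weeks:.0f} work weeks)")
```

Under these assumptions the implied expert effort is on the order of months per spec, which is consistent with the "months to years" range the comparison describes.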
The paper presents a comprehensive empirical study of key design choices — including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies — to identify configurations that are most effective in production settings.
Authors report an empirical study covering multiple design axes; details of experiments, datasets, and sample sizes are not included in the excerpt.
HARNESS-LM (HLM) is a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models: (1) train a high-performance reference ('teacher') retriever by fine-tuning a billion-parameter-scale SLM; (2) align query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; (3) apply a final contrastive refinement stage to optimize the student for retrieval performance.
Methodological description of the HLM training recipe and model sizes provided in the paper; supported by subsequent empirical evaluations reported in the paper.
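To make phases 2 and 3 of the recipe concrete, here is a minimal pure-Python sketch of the two kinds of objective it names: an L2 alignment loss between student and teacher query embeddings, and a contrastive loss over query-document similarities. The excerpt does not give the exact formulations, so the mean-squared-distance form, the InfoNCE-style loss with in-batch negatives, the temperature value, and all function names are assumptions chosen for illustration, not the paper's implementation.

```python
import math


def l2_alignment_loss(student_vecs, teacher_vecs):
    """Mean squared Euclidean distance between paired student and teacher embeddings."""
    total = 0.0
    for s, t in zip(student_vecs, teacher_vecs):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student_vecs)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """InfoNCE-style loss: document i is the positive for query i,
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, d) / temperature for d in doc_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(query_vecs)


# Toy 2-dimensional embeddings, made up for illustration.
teacher_queries = [[1.0, 0.0], [0.0, 1.0]]
student_queries = [[0.9, 0.1], [0.1, 0.9]]
documents = [[1.0, 0.0], [0.0, 1.0]]

print("alignment loss:", l2_alignment_loss(student_queries, teacher_queries))
print("contrastive loss:", contrastive_loss(student_queries, documents))
```

In this reading, the alignment loss pulls the student's query embeddings toward the teacher's embedding space, and the contrastive loss then tunes the student directly for ranking the matching document above the others.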
Online A/B testing on Bing Ads shows a +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model.
Live online A/B testing on Bing Ads comparing HLM deployment to the production ensemble using the 190M parameter model; exact experiment details not provided in excerpt.
Online A/B testing on Bing Ads shows a +0.6% Impression uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model.
Live online A/B testing on Bing Ads comparing HLM deployment to the production ensemble using the 190M parameter model; exact experiment details not provided in excerpt.
Online A/B testing on Bing Ads shows a +1% Revenue uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model.
Live online A/B testing on Bing Ads comparing HLM deployment to the production ensemble using the 190M parameter model; exact experiment duration, traffic allocation, and statistical significance not provided in excerpt.
HLM delivers up to 20x higher throughput on NVIDIA A100 GPUs.
Throughput benchmarking on NVIDIA A100 GPUs comparing HLM student encoder to baseline/reference encoders; exact workload and measurement details not provided in excerpt.
HLM delivers up to 27x lower online query-encoder latency on NVIDIA A100 GPUs.
Measured inference latency on NVIDIA A100 GPUs comparing HLM student encoder to baseline/reference encoders; exact measurement procedure and number of runs not provided in excerpt.
On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings.
Empirical evaluation on a real-world Bing Ads retrieval benchmark comparing HLM student retriever to a high-performance reference (teacher) retriever; exact benchmark dataset and number of test queries not reported in excerpt.
Accountability assets are complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party.
Conceptual definition and development in the paper; supported by illustrative domain examples but no empirical validation.
Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation.
Conceptual/theoretical argument in the paper; theory development and illustrative examples across domains (document processing, legal services, audit, clinical decision support, procurement). No empirical sample or quantitative test reported.
The paper's contribution is to clarify the trade-offs that infrastructure decisions often obscure, distinguish deliberate triad governance from default allocation by market power or regulatory inertia, and propose a Deliberate Triad Choice Framework for policymakers considering AI infrastructure decisions of significant scale.
Stated contributions in the abstract: conceptual clarification, normative distinction between deliberate governance and default allocation, and proposal of a policy framework (Deliberate Triad Choice Framework).
This article develops the AI Infrastructure Triad as a conceptual framework for analyzing three competing priorities in regional AI infrastructure governance: Progress, Sustainability, and Equity.
Theoretical/conceptual development presented in the paper; synthesis of prior work on economic, physical, and moral limits of AI development.
Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.
Argument supported by observed cases in the experiments where models with similar overall ranks differed on capability axes and jaggedness, implying additional diagnostic value.
Newer frontier-tier models score higher on average.
Aggregate results from the head-to-head tournament comparing nine models across sampled games (>36k matches).
We introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games.
Methodological contribution described in paper (jaggedness metric).
We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness).
Methodological description in paper introducing the capability-profile decomposition.
The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination.
Method claim about generator capability described in the paper.
We introduce GENSTRAT, which uses procedurally generated strategic environments to address the limitations of fixed benchmarks.
Methodological contribution described in paper: design and implementation of GENSTRAT.
Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings.
Introductory statement in the paper situating motivation; no empirical data reported in the abstract to quantify the increase.
FastKernels is released (code available) as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements.
Statement of code release with GitHub URL provided in abstract.
The FastKernels kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures.
Coverage analysis comparing FastKernels' kernels to HuggingFace Transformers architectures (numbers given in abstract: 409/425).
FastKernels is built around a minimal set of 46 representative architectures spanning 8 categories.
Design of the benchmark as described in the paper; explicit counts provided in abstract.
This paper provides new evidence on AI adoption from a non-US context by leveraging German firm-level data (ifo Business Survey).
Use of a large German business survey (ifo Business Survey) and analysis of AI adoption patterns among German firms.
AI is expected to have positive long-term productivity impacts for different sectors of the German economy.
Assessment of potential productivity impacts using firm-level survey responses about expected long-term benefits of AI (forward-looking/expectation-based analysis).
The increase in AI usage from 2023 to 2024 was particularly pronounced in manufacturing and services sectors.
Sectoral breakdown of ifo Business Survey firm-level data showing higher increases in reported AI usage for manufacturing and services compared with other sectors.
There was a significant increase in AI usage among German firms from 2023 to 2024.
Firm-level responses from the ifo Business Survey comparing reported AI usage in 2023 versus 2024 (cross-sectional/descriptive trend analysis).
We propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.
Authors' prescriptive recommendations derived from interview insights; recommendations are not empirically validated in the study.
We propose steps that AI companies can take to make the invisible work more visible.
Authors' normative recommendations based on synthesis of the qualitative interview findings; not empirically tested within the paper.
Some of these changes are positive, such as smoother collaboration between peers.
Interviewee accounts from the 24-participant qualitative study reporting perceived improvements in peer collaboration due to AI tools.
To support sustainable human–AI collaboration, the authors emphasize adopting a human-centered approach that prioritizes transparency, explainability, and user autonomy.
Authors' policy/research/practice recommendation grounded in the review synthesis of the interdisciplinary literature.
Well-designed AI systems have the potential to increase cognitive efficiency and job satisfaction.
Synthesis of findings across reviewed studies indicating positive associations between human-centered AI design and outcomes like cognitive efficiency and job satisfaction.
The successful integration of AI-driven EPM systems relies on the synergy between AI technologies and human judgment, allowing healthcare organizations to cultivate a more dynamic, innovative and responsive workforce.
Normative/concluding statement in the scoping review based on synthesis of included studies (n=29).
AI-driven EPM systems mark a significant advance in accessing real-time performance data and deliver considerable progress when used within appropriate guidelines.
Conclusion drawn in the paper from the scoping review of 29 empirical studies; phrased as an overall assessment.
Predictive analytics help manage high rates of burnout.
Reported in the scoping review as a benefit across included studies (n=29).
Predictive analytics optimize operations.
Stated as an operational benefit in the scoping review (29 studies).
Predictive analytics assist in assessing labor shortages.
Reported use-case in the scoping review synthesizing empirical studies (n=29).
Predictive analytics are vital in orchestrating healthcare organizations’ strategic and operational activities.
Claim derived from the scoping review's conclusions across included studies (n=29).
AI-powered EPM produces significant time savings for managers.
Reported as a benefit in the scoping review synthesis (29 studies); no numerical magnitude given in the excerpt.
AI-powered EPM helps identify potential leaders.
Summarized outcome across empirical studies in the scoping review (n=29).
AI-powered EPM heightens employee engagement.
Reported as an aggregated finding in the scoping review of 29 empirical studies.
AI-powered EPM increases the frequency of feedback to employees.
Stated as a benefit in the scoping review synthesis across included studies (n=29).
AI-powered EPM platforms result in considerable improvements in efficiency, including more frequent feedback, heightened employee engagement, identification of potential leaders, and significant time savings for managers.
Synthesis claim from the scoping review of 29 empirical studies; no quantitative effects reported in the excerpt.
The delivery of high-quality healthcare depends essentially on the effective functioning of personnel, who are the vital resource for maintaining reputation, fostering a culture of continuous improvement, and ensuring the overall effective operation of the healthcare sector.
Conceptual assertion in the paper supported by literature synthesis in the scoping review (29 studies).