Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
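
Because the categories differ widely in size, the counts above are easier to compare as row shares. The following is a minimal sketch, assuming the rows are available as tab-separated text; the helper names `parse_count` and `direction_shares` are illustrative and not part of the source. It divides by the sum of the four listed directions rather than by the Total column, since the printed totals do not always equal that sum (the Other row lists 749 + 195 + 97 + 889 = 1930 against a Total of 1979).

```python
DASH = "\u2014"  # placeholder used in the table for a missing count


def parse_count(cell: str) -> int:
    """Convert a table cell to an int, treating the dash placeholder as zero."""
    cell = cell.strip()
    return 0 if cell in {DASH, "-", ""} else int(cell)


def direction_shares(row: str) -> dict:
    """Return each direction's share of the four listed counts in one row."""
    fields = row.split("\t")
    counts = [parse_count(c) for c in fields[1:5]]
    listed = sum(counts)
    labels = ["positive", "negative", "mixed", "null"]
    shares = {
        label: (n / listed if listed else 0.0)
        for label, n in zip(labels, counts)
    }
    return {"outcome": fields[0], "listed": listed, **shares}


rows = [
    "Firm Productivity\t435\t55\t88\t20\t604",
    "Job Displacement\t12\t80\t20\t1\t113",
    "Labor Share of Income\t17\t17\t17\t" + DASH + "\t51",
]
for row in rows:
    r = direction_shares(row)
    print(f"{r['outcome']}: {r['positive']:.1%} positive, {r['negative']:.1%} negative")
```

Treating the dash as zero is a convenience for computing shares; it may instead mean that the value was not recorded.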

IDS jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies.

Description of the IDS method and architecture presented in the paper (system design and algorithmic loop).

high positive Inductive Deductive Synthesis: Enabling AI to Generate Forma... other

This paper presents the first effective approach to addressing the gap between LLM coding agents and mechanized formal verification for distributed systems (Inductive Deductive Synthesis, IDS).

Statement of novelty supported by the empirical claim that IDS succeeds on all 7 benchmark specs while prior SOTA agents did not; methodological description of IDS as a joint, incremental synthesis and learning system.

high positive Inductive Deductive Synthesis: Enabling AI to Generate Forma... research_productivity

IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

Empirical benchmarking of IDS-produced implementations against published verified systems, with performance (runtime) comparisons reporting up to a 3x speedup.

high positive Inductive Deductive Synthesis: Enabling AI to Generate Forma... organizational_efficiency

IDS is 17% cheaper than SOTA agents.

Cost comparison reported in the paper between IDS and the evaluated SOTA coding agents across the same 7 specs, yielding a 17% cost reduction for IDS.

high positive Inductive Deductive Synthesis: Enabling AI to Generate Forma... organizational_efficiency

IDS is roughly 200x faster than expert effort.

Comparison in the paper between IDS runtime (hours) and the typical expert effort (described as months to years) required for mechanized formal verification of similar distributed-system specifications; reported multiplicative speedup (~200x).

high positive Inductive Deductive Synthesis: Enabling AI to Generate Forma... task_completion_time

IDS costs $106 per spec on average.

Reported monetary cost computed for IDS runs averaged across the 7 specs in the evaluation.

high positive Inductive Deductive Synthesis: Enabling AI to Generate Forma... organizational_efficiency

IDS achieves 7/7 (succeeds on all 7 specs) in about 6.8 hours per spec on average.

Empirical evaluation of IDS on the same suite of 7 distributed key-value-store specifications, with runtime (wall-clock) measured and averaged over the 7 specs.

high positive Inductive Deductive Synthesis: Enabling AI to Generate Forma... task_completion_time
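
The per-spec averages in the IDS claims above can be turned into totals with simple arithmetic. The sketch below is a rough consistency check, not a figure from the paper: it assumes the 200x ratio can be applied to the 6.8-hour average (the excerpt does not say how the ratio was computed) and assumes a 40-hour work week.

```python
SPECS = 7                  # benchmark specs in the evaluation
HOURS_PER_SPEC = 6.8       # reported average wall-clock time per spec
COST_PER_SPEC_USD = 106    # reported average cost per spec
SPEEDUP_VS_EXPERT = 200    # reported approximate speedup over expert effort
HOURS_PER_WORK_WEEK = 40   # assumption, not from the source

total_hours = SPECS * HOURS_PER_SPEC
total_cost = SPECS * COST_PER_SPEC_USD
# Assumption: the speedup ratio applies to the per-spec average runtime.
implied_expert_hours = SPEEDUP_VS_EXPERT * HOURS_PER_SPEC
implied_expert_weeks = implied_expert_hours / HOURS_PER_WORK_WEEK

print(f"Total IDS runtime: {total_hours:.1f} hours")
print(f"Total IDS cost: ${total_cost}")
print(f"Implied expert effort per spec: {implied_expert_weeks:.0f} work weeks")
```

Under these assumptions the implied expert effort is about 34 work weeks per spec, which falls within the "months to years" range the evidence description gives.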

The paper presents a comprehensive empirical study of key design choices — including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies — to identify configurations that are most effective in production settings.

Authors report an empirical study covering multiple design axes; details of experiments, datasets, and sample sizes are not included in the excerpt.

high positive HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... design configuration effectiveness for production deployment

HARNESS-LM (HLM) is a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models: (1) train a high-performance reference ('teacher') retriever by fine-tuning a billion-parameter-scale SLM; (2) align query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; (3) apply a final contrastive refinement stage to optimize the student for retrieval performance.

Methodological description of the HLM training recipe and model sizes provided in the paper; supported by subsequent empirical evaluations reported in the paper.

high positive HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... model compression / knowledge transfer into compact retriever
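
The two student-side objectives in this recipe can be illustrated with a small sketch. The excerpt names only "an L2 objective" and "contrastive refinement", so the exact forms here are assumptions: a mean squared L2 distance for phase 2, an in-batch InfoNCE-style loss with a temperature of 0.05 for phase 3, and student and teacher embeddings of equal dimension. The function names are hypothetical.

```python
import math


def l2_alignment_loss(student, teacher):
    """Phase 2 (assumed form): mean squared L2 distance between
    student and teacher query embeddings."""
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student)


def info_nce_loss(queries, docs, temperature=0.05):
    """Phase 3 (assumed form): in-batch contrastive loss where
    docs[i] is the positive for queries[i] and the rest are negatives."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss += log_norm - logits[i]  # negative log-softmax of the positive
    return loss / len(queries)


# Toy data: two queries, two documents, 2-dimensional embeddings.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.9, 0.1], [0.1, 0.9]]
docs = [[1.0, 0.0], [0.0, 1.0]]

print("alignment loss:", l2_alignment_loss(student, teacher))
print("contrastive loss:", info_nce_loss(student, docs))
```

The alignment loss pulls each student query vector toward the teacher's vector for the same query, while the contrastive loss rewards the student for scoring the matching document above the other documents in the batch.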

Online A/B testing on Bing Ads shows that the deployed 190M-parameter model delivers a +0.4% click uplift over the current ensemble of retrievers running in production.

Live online A/B testing on Bing Ads comparing HLM deployment to the production ensemble using the 190M parameter model; exact experiment details not provided in excerpt.

high positive HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... Clicks

Online A/B testing on Bing Ads shows that the deployed 190M-parameter model delivers a +0.6% impression uplift over the current ensemble of retrievers running in production.

Live online A/B testing on Bing Ads comparing HLM deployment to the production ensemble using the 190M parameter model; exact experiment details not provided in excerpt.

high positive HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... Impressions

Online A/B testing on Bing Ads shows that the deployed 190M-parameter model delivers a +1% revenue uplift over the current ensemble of retrievers running in production.

Live online A/B testing on Bing Ads comparing HLM deployment to the production ensemble using the 190M parameter model; exact experiment duration, traffic allocation, and statistical significance not provided in excerpt.

high positive HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... Revenue

HLM delivers up to 20x higher throughput on NVIDIA A100 GPUs.

Throughput benchmarking on NVIDIA A100 GPUs comparing HLM student encoder to baseline/reference encoders; exact workload and measurement details not provided in excerpt.

high positive HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... inference throughput

HLM delivers up to 27x lower online query-encoder latency on NVIDIA A100 GPUs.

Measured inference latency on NVIDIA A100 GPUs comparing HLM student encoder to baseline/reference encoders; exact measurement procedure and number of runs not provided in excerpt.

high positive HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... online query-encoder latency

On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings.

Empirical evaluation on a real-world Bing Ads retrieval benchmark comparing HLM student retriever to a high-performance reference (teacher) retriever; exact benchmark dataset and number of test queries not reported in excerpt.

high positive HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... retriever precision

Accountability assets are complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party.

Conceptual definition and development in the paper; supported by illustrative domain examples but no empirical validation.

high positive Redrawing the AI Map: A Theory of Accountability Boundaries ... legitimacy/auditability/assignability of AI outputs (regulatory/compliance readi...

Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation.

Conceptual/theoretical argument in the paper; theory development and illustrative examples across domains (document processing, legal services, audit, clinical decision support, procurement). No empirical sample or quantitative test reported.

high positive Redrawing the AI Map: A Theory of Accountability Boundaries ... organizational disaggregation / modularization

The paper's contribution is to clarify the trade-offs that infrastructure decisions often obscure, distinguish deliberate triad governance from default allocation by market power or regulatory inertia, and propose a Deliberate Triad Choice Framework for policymakers considering AI infrastructure decisions of significant scale.

Stated contributions in the abstract: conceptual clarification, normative distinction between deliberate governance and default allocation, and proposal of a policy framework (Deliberate Triad Choice Framework).

high positive The AI Infrastructure Triad in Regional Governance: How Regi... availability and design of a policy framework (Deliberate Triad Choice Framework...

This article develops the AI Infrastructure Triad as a conceptual framework for analyzing three competing priorities in regional AI infrastructure governance: Progress, Sustainability, and Equity.

Theoretical/conceptual development presented in the paper; synthesis of prior work on economic, physical, and moral limits of AI development.

high positive The AI Infrastructure Triad in Regional Governance: How Regi... conceptual clarity of governance priorities (Progress, Sustainability, Equity)

Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

Argument supported by observed cases in the experiments where models with similar overall ranks differed on capability axes and jaggedness, implying additional diagnostic value.

high positive GENSTRAT: Toward a Science of Strategic Reasoning in Large L... diagnostic usefulness for deployment decisions

Newer frontier-tier models score higher on average.

Aggregate results from the head-to-head tournament comparing nine models across sampled games (>36k matches).

high positive GENSTRAT: Toward a Science of Strategic Reasoning in Large L... average model score / overall strength

We introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games.

Methodological contribution described in paper (jaggedness metric).

high positive GENSTRAT: Toward a Science of Strategic Reasoning in Large L... within-distribution smoothness / local volatility (jaggedness)

We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness).

Methodological description in paper introducing the capability-profile decomposition.

high positive GENSTRAT: Toward a Science of Strategic Reasoning in Large L... decomposed capability profile across six axes

The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination.

Method claim about generator capability described in the paper.

high positive GENSTRAT: Toward a Science of Strategic Reasoning in Large L... freshness/resistance-to-contamination of benchmarks

We introduce GENSTRAT, which uses procedurally generated strategic environments to address the limitations of fixed benchmarks.

Methodological contribution described in paper: design and implementation of GENSTRAT.

high positive GENSTRAT: Toward a Science of Strategic Reasoning in Large L... availability of procedurally generated strategic environments for evaluation

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings.

Introductory statement in the paper situating motivation; no empirical data reported in the abstract to quantify the increase.

high positive GENSTRAT: Toward a Science of Strategic Reasoning in Large L... deployment of LLMs as economic agents

FastKernels is released (code available) as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements.

Statement of code release with GitHub URL provided in abstract.

high positive FastKernels: Benchmarking GPU Kernel Generation in Productio... availability of code / reproducibility (release of benchmark and framework)

The FastKernels kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures.

Coverage analysis comparing FastKernels' kernels to HuggingFace Transformers architectures (numbers given in abstract: 409/425).

high positive FastKernels: Benchmarking GPU Kernel Generation in Productio... architecture coverage (proportion subsumed)
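
The coverage percentage follows directly from the two counts given in the claim. A minimal check, using only the numbers from the abstract:

```python
covered, total = 409, 425  # counts stated in the claim

share = covered / total
not_covered = total - covered

print(f"{share:.1%} covered, {not_covered} architectures not covered")
```

This reproduces the stated 96.2% and shows that 16 of the 425 architectures fall outside the kernel set.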

FastKernels is built around a minimal set of 46 representative architectures spanning 8 categories.

Design of the benchmark as described in the paper; explicit counts provided in abstract.

high positive FastKernels: Benchmarking GPU Kernel Generation in Productio... benchmark coverage (number of representative architectures and categories)

This paper provides new evidence on AI adoption from a non-US context by leveraging German firm-level data (ifo Business Survey).

Use of a large German business survey (ifo Business Survey) and analysis of AI adoption patterns among German firms.

high positive AI adoption among German firms Empirical evidence on AI adoption in Germany (contribution to literature)

AI is expected to have positive long-term productivity impacts for different sectors of the German economy.

Assessment of potential productivity impacts using firm-level survey responses about expected long-term benefits of AI (forward-looking/expectation-based analysis).

high positive AI adoption among German firms Expected firm-level productivity / anticipated long-term productivity benefits

The increase in AI usage from 2023 to 2024 was particularly pronounced in manufacturing and services sectors.

Sectoral breakdown of ifo Business Survey firm-level data showing higher increases in reported AI usage for manufacturing and services compared with other sectors.

high positive AI adoption among German firms AI usage / AI adoption rate by sector

There was a significant increase in AI usage among German firms from 2023 to 2024.

Firm-level responses from the ifo Business Survey comparing reported AI usage in 2023 versus 2024 (cross-sectional/descriptive trend analysis).

high positive AI adoption among German firms AI usage / AI adoption rate (reported by firms)

We propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.

Authors' prescriptive recommendations derived from interview insights; recommendations are not empirically validated in the study.

high positive Beyond the Org Chart: AI and the Transformation of Invisible... leadership and individual practices to preserve culture during AI adoption

We propose steps that AI companies can take to make the invisible work more visible.

Authors' normative recommendations based on synthesis of the qualitative interview findings; not empirically tested within the paper.

high positive Beyond the Org Chart: AI and the Transformation of Invisible... organizational practices to surface invisible work

Some of these changes are positive, such as smoother collaboration between peers.

Interviewee accounts from the 24-participant qualitative study reporting perceived improvements in peer collaboration due to AI tools.

high positive Beyond the Org Chart: AI and the Transformation of Invisible... peer collaboration / team coordination

To support sustainable human–AI collaboration, the authors emphasize adopting a human-centered approach that prioritizes transparency, explainability, and user autonomy.

Authors' policy/research/practice recommendation grounded in the review synthesis of the interdisciplinary literature.

high positive Yapay Zeka Sistemleri ve İnsan İşbirliğinin Psikolojik, Sosy... adoption of human-centered design practices (transparency, explainability, user ...

Well-designed AI systems have the potential to increase cognitive efficiency and job satisfaction.

Synthesis of findings across reviewed studies indicating positive associations between human-centered AI design and outcomes like cognitive efficiency and job satisfaction.

high positive Yapay Zeka Sistemleri ve İnsan İşbirliğinin Psikolojik, Sosy... cognitive efficiency (and job satisfaction, secondary)

The successful integration of AI-driven EPM systems relies on the synergy between AI technologies and human judgment, allowing healthcare organizations to cultivate a more dynamic, innovative and responsive workforce.

Normative/concluding statement in the scoping review based on synthesis of included studies (n=29).

high positive The influence of AI-Driven Employee Performance Management (... integration success conditional on human-AI synergy; workforce dynamism and resp...

AI-driven EPM systems mark a significant advance in access to real-time performance data and deliver considerable improvements when used within appropriate guidelines.

Conclusion drawn in the paper from the scoping review of 29 empirical studies; phrased as an overall assessment.

high positive The influence of AI-Driven Employee Performance Management (... availability/access to real-time performance data and improvement in HR processe...

Predictive analytics help manage high rates of burnout.

Reported in the scoping review as a benefit across included studies (n=29).

high positive The influence of AI-Driven Employee Performance Management (... burnout management / mitigation

Predictive analytics optimize operations.

Stated as an operational benefit in the scoping review (29 studies).

high positive The influence of AI-Driven Employee Performance Management (... operational optimization (scheduling, resource allocation, workflows)

Predictive analytics assist in assessing labor shortages.

Reported use-case in the scoping review synthesizing empirical studies (n=29).

high positive The influence of AI-Driven Employee Performance Management (... ability to assess/predict labor shortages

Predictive analytics are vital in orchestrating healthcare organizations’ strategic and operational activities.

Claim derived from the scoping review's conclusions across included studies (n=29).

high positive The influence of AI-Driven Employee Performance Management (... usefulness of predictive analytics for strategic/operational decision-making

AI-powered EPM produces significant time savings for managers.

Reported as a benefit in the scoping review synthesis (29 studies); no numerical magnitude given in the excerpt.

high positive The influence of AI-Driven Employee Performance Management (... manager time spent on EPM tasks / administrative burden

AI-powered EPM helps identify potential leaders.

Summarized outcome across empirical studies in the scoping review (n=29).

high positive The influence of AI-Driven Employee Performance Management (... identification of leadership potential / talent spotting

AI-powered EPM heightens employee engagement.

Reported as an aggregated finding in the scoping review of 29 empirical studies.

high positive The influence of AI-Driven Employee Performance Management (... employee engagement

AI-powered EPM increases the frequency of feedback to employees.

Stated as a benefit in the scoping review synthesis across included studies (n=29).

high positive The influence of AI-Driven Employee Performance Management (... feedback frequency

AI-powered EPM platforms result in considerable improvements in efficiency, including more frequent feedback, heightened employee engagement, identification of potential leaders, and significant time savings for managers.

Synthesis claim from the scoping review of 29 empirical studies; no quantitative effects reported in the excerpt.

high positive The influence of AI-Driven Employee Performance Management (... efficiency gains and specific HR outcomes (feedback frequency, engagement, leade...

The delivery of high-quality healthcare depends essentially on the effective functioning of personnel, who are the vital resource for maintaining reputation, fostering a culture of continuous improvement, and ensuring the overall effective operation of the healthcare sector.

Conceptual assertion in the paper supported by literature synthesis in the scoping review (29 studies).

high positive The influence of AI-Driven Employee Performance Management (... relationship between personnel functioning and healthcare quality/delivery
