Evidence (4793 claims)
Claims by category:
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Productivity (active filter)
To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces.
Method description: implementation of a mean-field mechanism within the simulator; paper asserts this design stabilizes sampling in high-dimensional decision spaces (method + reported simulation behavior).
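As a rough illustration of what a mean-field stabilization can look like (a minimal sketch, not the paper's implementation; the product-choice model, `beta`, and the damping factor are assumptions), each simulated customer responds to the population-average choice distribution rather than to every other agent individually:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_products, n_steps = 1_000, 5, 50
attractiveness = rng.normal(size=n_products)        # static product utilities (hypothetical)
mean_field = np.full(n_products, 1.0 / n_products)  # population-average choice distribution

beta = 2.0  # strength of the coupling to the mean field (assumed)
for _ in range(n_steps):
    # every agent sees the same aggregate signal, so one step costs O(n_products)
    # rather than O(n_agents^2) pairwise interactions
    utilities = attractiveness + beta * mean_field
    probs = np.exp(utilities) / np.exp(utilities).sum()
    choices = rng.choice(n_products, size=n_agents, p=probs)
    empirical = np.bincount(choices, minlength=n_products) / n_agents
    mean_field = 0.9 * mean_field + 0.1 * empirical  # damped update stabilizes the sampling dynamics

print(np.round(mean_field, 3))
```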
We introduce a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories.
Method description: post-training LLMs on heterogeneous transaction records across product categories to align preferences (methodological / training procedure described).
This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES) as a unified simulation framework applicable to cross-domain and cross-category scenarios.
Paper description: design and implementation of MALLES, presented as a unified framework leveraging large-scale LLM generalization for cross-domain/cross-category simulation (methodological contribution).
Human-AI systems should be designed under a cognitive sustainability constraint so that gains in hybrid performance do not come at the cost of degradation in human expertise.
Normative recommendation in the paper based on the conceptual/mathematical framework and the identified trade-off; presented as an argument rather than an empirically validated policy outcome in the excerpt.
Together, these quantities provide a low-dimensional metric space for evaluating whether human-AI systems achieve genuine synergistic performance and whether such performance is cognitively sustainable for the human component over time.
Claim about the utility of the defined metrics, supported within the paper by the conceptual/mathematical framework and the proposed metric definitions (theoretical demonstration rather than reported empirical validation in the excerpt).
The paper defines a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR).
Explicit listing of newly proposed operational metrics in the paper; this is a descriptive claim about the paper's content (theoretical definitions), no sample size or empirical estimation provided in the excerpt.
The paper introduces a conceptual and mathematical framework to distinguish cognitive amplification (AI improves hybrid human-AI performance while preserving human expertise) from cognitive delegation (reasoning is progressively outsourced to AI).
Explicit contribution claim in the paper (description of a conceptual and mathematical framework); evidence consists of the model and formal definitions presented in the paper (no external empirical validation reported in the excerpt).
Artificial intelligence generates positive spatial spillovers for UCEE (positive effects on neighboring regions).
Spatial Durbin model reported in the abstract indicating positive spillover coefficients for artificial intelligence.
The Global Malmquist–Luenberger (GML) index and its efficiency change (EC) and technological change (TC) components stay above 1, indicating sustained efficiency gains dominated by technological progress.
GML index and decomposition results reported in the abstract based on the panel data and GML computation.
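For reference, the standard multiplicative decomposition behind that reading (a textbook property of the GML index, not a result specific to this paper):

```latex
GML_t^{t+1} = EC_t^{t+1} \times TC_t^{t+1},
\qquad GML,\, EC,\, TC > 1 \;\Rightarrow\; \text{improvement}.
```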
Nationally, the average UCEE index rises from about 0.3 to above 0.7 over the sample period.
Computed UCEE index results from the Super-SBM model applied to the panel of 30 provinces (2013–2022) as reported in the abstract.
SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
Conceptual/positioning claim made by the authors about the intended shift in benchmarking perspective enabled by SOL-ExecBench.
To support robust evaluation of agentic optimizers, we provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis-based checks against common reward-hacking strategies.
Method/tool claim in paper describing the provided evaluation harness and its engineered controls (list of features included).
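A hedged sketch of what such benchmarking controls can look like in practice (the clock values, buffer size, and helper names below are illustrative assumptions, not SOL-ExecBench's actual harness):

```python
import subprocess
import torch

def lock_gpu_clocks(mhz: int = 1500) -> None:
    # pins SM clocks to reduce run-to-run noise (requires admin privileges)
    subprocess.run(["nvidia-smi", "-lgc", f"{mhz},{mhz}"], check=True)

def flush_l2_cache(buffer_mb: int = 256) -> None:
    # writing a buffer larger than L2 evicts previously cached benchmark data
    scratch = torch.empty(buffer_mb * 1024 * 1024, dtype=torch.uint8, device="cuda")
    scratch.zero_()
    torch.cuda.synchronize()

def time_kernel(fn, iters: int = 100) -> float:
    # flush once before the timed loop (a fuller harness would flush between iterations)
    flush_l2_cache()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```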
We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes.
Paper defines the SOL Score metric and states its interpretive meaning (fraction of gap closed between baseline and hardware SOL bound).
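One plausible reading of that definition, with kernel times in milliseconds (the exact formula is an assumption here, not quoted from the paper):

```python
def sol_score(baseline_ms: float, candidate_ms: float, sol_ms: float) -> float:
    """Fraction of the baseline-to-SOL gap closed by the candidate kernel.
    0.0 = no better than the scoring baseline, 1.0 = at the hardware bound.
    Illustrative formula only; the paper's exact definition may differ."""
    return (baseline_ms - candidate_ms) / (baseline_ms - sol_ms)

# e.g. baseline 2.0 ms, SOL bound 1.0 ms, candidate 1.25 ms -> score 0.75
```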
SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization.
Methodological claim: introduction of SOLAR pipeline to compute analytic hardware-grounded SOL bounds and use of those bounds as benchmark targets, as described in the paper.
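One common way to derive such an analytic bound is a roofline-style calculation; the sketch below is a generic illustration, not SOLAR's actual pipeline:

```python
def speed_of_light_ms(bytes_moved: float, flops: float,
                      peak_bw_gbps: float, peak_tflops: float) -> float:
    """Lower bound on kernel time: limited by either memory traffic or compute."""
    memory_bound_ms = bytes_moved / (peak_bw_gbps * 1e9) * 1e3
    compute_bound_ms = flops / (peak_tflops * 1e12) * 1e3
    return max(memory_bound_ms, compute_bound_ms)

# e.g. a kernel moving 1e9 bytes and doing 1e10 FLOPs on an 8000 GB/s, 2000 TFLOPS part
# -> max(0.125 ms, 0.005 ms) = 0.125 ms (memory bound)
```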
The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities.
Paper description of benchmark coverage (workload direction and data types; inclusion of kernels tied to Blackwell hardware features).
We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs.
Paper reports construction of the benchmark with counts: 235 CUDA kernel problems and 124 source models; descriptive dataset claim in the manuscript.
Given these findings, policymakers should favor 'strategic forbearance'—apply existing laws rather than create new regulations that could stifle innovation and diffusion of AI.
Authors' normative policy recommendation based on their interpretation of the reviewed empirical literature (risk–benefit assessment); this is a prescriptive conclusion rather than an empirical finding, so no sample size applies.
Generative AI lowers entry costs for startups, facilitating new firm entry and product development.
Cited empirical and descriptive evidence in the literature review indicating reduced development costs and faster product prototyping enabled by AI tools; the brief does not provide a pooled sample size or a single quantitative estimate.
Generative AI significantly boosts productivity in specific tasks like coding, writing, and customer service—often by 15% to 50%.
Synthesis/review of empirical literature through 2025 (multiple empirical studies of task-level impacts, including field and lab studies and observational analyses); the brief reports aggregated effect ranges but does not list a single pooled sample size.
The AgentDS benchmark datasets are open-sourced and available at https://huggingface.co/datasets/lainmn/AgentDS.
Paper includes link to the open-source datasets and the AgentDS website.
The strongest solutions arise from human-AI collaboration.
Analysis of competition results showing top-performing submissions employed human-AI collaborative approaches rather than AI-only baselines (results from 29 teams / 80 participants).
We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science.
Paper describes the creation of the AgentDS benchmark and an associated competition as the study's primary methodological contribution.
Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows.
Statement in the paper referencing recent developments in LLMs and AI agents; presented as motivation rather than validated empirically within the paper.
Data science plays a critical role in transforming complex data into actionable insights across numerous domains.
Background statement in the paper (no empirical test or dataset provided to support this claim).
End-to-end verified pipelines can produce provably correct code from informal specifications.
The paper surveys early research demonstrating pipelines that go from informal specifications to formally verified code; the provided text does not include experimental sample sizes or benchmarks.
AI-generated postconditions catch real-world bugs missed by prior methods.
Assertion based on early research surveyed by the paper, indicating empirical instances where AI-generated postconditions found bugs that other methods missed; no numeric details are provided in the excerpt.
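As a generic illustration of the idea (a hypothetical function and postcondition, not an example from the surveyed work): a postcondition over the output can flag a bug that a happy-path test misses.

```python
def merge_intervals(intervals):
    """Buggy merge: forgets to sort by start, so out-of-order inputs are mishandled."""
    merged = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def covers_all_inputs(inputs, outputs):
    # postcondition: every input interval must lie inside some merged interval
    return all(any(mlo <= lo and hi <= mhi for mlo, mhi in outputs)
               for lo, hi in inputs)

print(covers_all_inputs([(1, 3), (2, 5)], merge_intervals([(1, 3), (2, 5)])))  # True
print(covers_all_inputs([(4, 6), (1, 5)], merge_intervals([(4, 6), (1, 5)])))  # False: exposes the missing-sort bug
```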
Interactive test-driven formalization improves program correctness.
Paper surveys early research that reportedly demonstrates this effect (described as 'interactive test-driven formalization that improves program correctness'); the excerpt does not include specific study details or sample sizes.
The central bottleneck is validating specifications: since there is no oracle for specification correctness other than the user, we need semi-automated metrics that can assess specification quality with or without code, through lightweight user interaction and proxy artifacts such as tests.
Analytical claim and research agenda item in the paper; motivates need for new metrics and interaction designs. No empirical validation or sample size reported in the excerpt.
Intent formalization offers a tradeoff spectrum suitable to the reliability needs of different contexts: from lightweight tests that disambiguate likely misinterpretations, through full functional specifications for formal verification, to domain-specific languages from which correct code is synthesized automatically.
Conceptual framework proposed in the paper describing a spectrum of specification formality; presented as an argument rather than an empirical finding, with no sample sizes provided in the excerpt.
Intent formalization — translating informal user intent into checkable formal specifications — is the key challenge that will determine whether AI makes software more reliable or merely more abundant.
Normative argument presented by the authors as the central thesis of the paper; no empirical study or sample size cited in the provided text.
Agentic AI systems can now generate code with remarkable fluency.
Authoritative assertion in the paper based on contemporary observations of large code-generating models; no empirical sample size or benchmark numbers reported in the text provided.
The study draws policy implications for promoting high-quality development from the finding that innovation and the digital economy now play larger roles in growth.
Authors' discussion/conclusion drawing policy implications from empirical findings (declining capital elasticity, rising TFP and digital economy contribution).
Overall, China's growth model shifted over 2010–2022 from being investment-driven to being innovation-driven.
Synthesis of results: declining capital elasticity, rising TFP contribution, substantial share of digital economy in TFP, and regional patterns reported by the study.
The study's method is novel because it uses both migrant worker monitoring data and digital-economy proxy indicators, giving a more accurate picture of how labor quality and technological progress affect each other.
Author-reported methodological description: extended Cobb–Douglas approach combined with quality-adjusted labor measures derived from migrant worker monitoring data and proxy indicators for the digital economy.
Regional analysis shows that growth in coastal regions has been innovation-driven, with an estimated innovation coefficient of approximately 0.31.
Regional decomposition/estimation reported in the paper's analysis of coastal vs inland regions using the extended production function and digital/labour-quality measures.
The digital economy accounted for 40% of the observed increase in TFP (i.e., made up 40% of the TFP contribution).
Attribution within the growth decomposition from the extended production function, where digital economy indicators are included and their contribution to TFP is estimated.
The contribution rate of total factor productivity (TFP) rose from 18% to 26% between the earlier and later periods.
Decomposition of growth using the extended Cobb–Douglas production function for China over 2010–2022, reporting TFP contribution rates for the two periods.
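For context, the textbook growth-accounting identity underlying such decompositions (the paper's extended specification adds further terms not shown here):

```latex
Y_t = A_t K_t^{\alpha} L_t^{\beta}
\;\Rightarrow\;
g_Y = g_A + \alpha\, g_K + \beta\, g_L,
\qquad
\text{TFP contribution rate} = g_A / g_Y .
```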
TDAD (Test-Driven Agentic Development) combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change.
Description of the tool/methodology and its implementation (TDAD is presented as an open-source tool in the paper).
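A minimal sketch of the general approach (using Python's standard `ast` module; the file pattern, weighting, and helper names are illustrative assumptions, not TDAD's implementation): parse each test file, record which functions each test calls, and rank tests for a change by summed edge weight.

```python
import ast
from collections import defaultdict
from pathlib import Path

def build_code_test_graph(test_dir: str) -> dict:
    """Map each called function name to the tests that reference it; call counts serve as edge weights."""
    graph = defaultdict(lambda: defaultdict(int))
    for path in Path(test_dir).glob("test_*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
                for call in ast.walk(node):
                    if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                        graph[call.func.id][f"{path.name}::{node.name}"] += 1
    return graph

def impacted_tests(graph, changed_functions):
    """Rank tests by summed edge weight over the changed functions."""
    scores = defaultdict(int)
    for fn in changed_functions:
        for test, weight in graph.get(fn, {}).items():
            scores[test] += weight
    return sorted(scores, key=scores.get, reverse=True)
```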
PIER is an offline reinforcement learning framework that learns fuel‑efficient, safety‑aware routing policies from physics‑calibrated environments grounded in historical vessel tracking data and ocean reanalysis products, requiring no online simulator.
Methodological description of PIER in the paper: offline RL trained on environments constructed from AIS and reanalysis data; no online simulator used for policy learning (implementation details provided).
Bootstrap 95% confidence interval for PIER mean CO2 savings relative to great-circle routing is [2.9%, 15.7%].
Bootstrap analysis applied to the 2023 AIS validation results (840 episodes per method) producing the stated 95% CI for mean percent savings.
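A standard percentile bootstrap over per-episode savings looks roughly like this (illustrative sketch; the paper's resample count and exact procedure are not given in the excerpt):

```python
import numpy as np

def bootstrap_ci(per_episode_savings_pct: np.ndarray,
                 n_resamples: int = 10_000, alpha: float = 0.05,
                 seed: int = 0) -> tuple:
    """Percentile bootstrap CI for the mean of per-episode savings (e.g. 840 values)."""
    rng = np.random.default_rng(seed)
    n = len(per_episode_savings_pct)
    means = np.array([
        rng.choice(per_episode_savings_pct, size=n, replace=True).mean()
        for _ in range(n_resamples)
    ])
    return (np.percentile(means, 100 * alpha / 2),
            np.percentile(means, 100 * (1 - alpha / 2)))
```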
PIER reduces per‑voyage fuel consumption variance by a factor of 3.5 (p < 0.001).
Statistical comparison of per-voyage fuel variance between PIER and baseline routing on 840 episodes per method from 2023 AIS data; significance reported with p < 0.001.
On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy.
Benchmark evaluation reported in the paper using the LoCoMo benchmark with a reported overall accuracy of 74.8%.
Adversarial governance compliance was 100%.
Adversarial compliance testing reported in the paper (linked to the adversarial query experiments); reported compliance = 100%.
There was zero cross-entity leakage across 500 adversarial queries.
Adversarial testing reported in the paper: 500 adversarial queries used to test cross-entity leakage; result = zero leakage.
Progressive context delivery yielded a 50% token reduction.
Reported experimental result in the controlled experiments indicating token usage reduction from progressive delivery = 50%.
Governance routing precision was 92% in the experiments.
Reported experimental metric from the controlled experiments (N=250, five content types) showing governance routing precision = 92%.
The system achieved 99.6% fact recall (with complementary dual-modality coverage) in the controlled experiments.
Reported experimental result from the controlled experiments (N=250, five content types) as stated in the paper.
Immediate practical steps include improved documentation, stakeholder audits, and multi‑metric evaluation; medium‑term steps include standards for participatory evaluation and tooling for transparency and monitoring; long‑term steps include institutional governance, interoperable safety APIs, and public‑interest evaluation infrastructure.
Prescriptive roadmap in the paper based on conceptual analysis and prior literature; these are recommended policy/program milestones rather than empirically validated interventions.
Transparency (detailed documentation of data, objectives, evaluation processes, and deployment constraints; audit and contest mechanisms) is a necessary mechanism for accountable alignment.
Normative and practical argumentation supported by prior work on model cards, documentation standards, and auditing; no new audits are presented in the paper.
Pluralistic evaluation—using multiple, diverse evaluation criteria and stakeholder‑informed metrics rather than single aggregated alignment scores—will better capture the values and harms at stake.
Argumentative rationale and literature synthesis advocating multi‑metric evaluation approaches; examples from prior evaluation critiques are referenced rather than new empirical comparison.