Evidence (4793 claims)
Claims by category:
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Productivity (active filter)
To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces.
Method description: implementation of a mean-field mechanism within the simulator; paper asserts this design stabilizes sampling in high-dimensional decision spaces (method + reported simulation behavior).
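As a rough illustration of what a mean-field stabilization can look like (a minimal sketch, not the paper's implementation; the product-choice model, `beta`, and the damping factor are assumptions), each simulated customer responds to the population-average choice distribution rather than to every other agent individually:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_products, n_steps = 1_000, 5, 50
attractiveness = rng.normal(size=n_products)        # static product utilities (hypothetical)
mean_field = np.full(n_products, 1.0 / n_products)  # population-average choice distribution

beta = 2.0  # strength of the coupling to the mean field (assumed)
for _ in range(n_steps):
    # every agent sees the same aggregate signal, so one step costs O(n_products)
    # rather than O(n_agents^2) pairwise interactions
    utilities = attractiveness + beta * mean_field
    probs = np.exp(utilities) / np.exp(utilities).sum()
    choices = rng.choice(n_products, size=n_agents, p=probs)
    empirical = np.bincount(choices, minlength=n_products) / n_agents
    mean_field = 0.9 * mean_field + 0.1 * empirical  # damped update stabilizes the sampling dynamics

print(np.round(mean_field, 3))
```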
We introduce a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories.
Method description: post-training LLMs on heterogeneous transaction records across product categories to align preferences (methodological / training procedure described).
This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES) as a unified simulation framework applicable to cross-domain and cross-category scenarios.
Paper description: design and implementation of MALLES, presented as a unified framework leveraging large-scale LLM generalization for cross-domain/cross-category simulation (methodological contribution).
Human-AI systems should be designed under a cognitive sustainability constraint so that gains in hybrid performance do not come at the cost of degradation in human expertise.
Normative recommendation in the paper based on the conceptual/mathematical framework and the identified trade-off; presented as an argument rather than an empirically validated policy outcome in the excerpt.
Together, these quantities provide a low-dimensional metric space for evaluating whether human-AI systems achieve genuine synergistic performance and whether such performance is cognitively sustainable for the human component over time.
Claim about the utility of the defined metrics, supported within the paper by the conceptual/mathematical framework and the proposed metric definitions (theoretical demonstration rather than reported empirical validation in the excerpt).
The paper defines a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR).
Explicit listing of newly proposed operational metrics in the paper; this is a descriptive claim about the paper's content (theoretical definitions), no sample size or empirical estimation provided in the excerpt.
The paper introduces a conceptual and mathematical framework to distinguish cognitive amplification (AI improves hybrid human-AI performance while preserving human expertise) from cognitive delegation (reasoning is progressively outsourced to AI).
Explicit contribution claim in the paper (description of a conceptual and mathematical framework); evidence consists of the model and formal definitions presented in the paper (no external empirical validation reported in the excerpt).
Artificial intelligence generates positive spatial spillovers for UCEE (positive effects on neighboring regions).
Spatial Durbin model reported in the abstract indicating positive spillover coefficients for artificial intelligence.
The Global Malmquist–Luenberger (GML) index and its efficiency change (EC) and technological change (TC) components stay above 1, indicating sustained efficiency gains dominated by technological progress.
GML index and decomposition results reported in the abstract based on the panel data and GML computation.
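For reference, the standard multiplicative decomposition behind that reading (a textbook property of the GML index, not a result specific to this paper):

```latex
GML_t^{t+1} = EC_t^{t+1} \times TC_t^{t+1},
\qquad GML,\, EC,\, TC > 1 \;\Rightarrow\; \text{improvement}.
```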
Nationally, the average UCEE index rises from about 0.3 to above 0.7 over the sample period.
Computed UCEE index results from the Super-SBM model applied to the panel of 30 provinces (2013–2022) as reported in the abstract.
SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
Conceptual/positioning claim made by the authors about the intended shift in benchmarking perspective enabled by SOL-ExecBench.
To support robust evaluation of agentic optimizers, we provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis-based checks against common reward-hacking strategies.
Method/tool claim in paper describing the provided evaluation harness and its engineered controls (list of features included).
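A hedged sketch of what such benchmarking controls can look like in practice (the clock values, buffer size, and helper names below are illustrative assumptions, not SOL-ExecBench's actual harness):

```python
import subprocess
import torch

def lock_gpu_clocks(mhz: int = 1500) -> None:
    # pins SM clocks to reduce run-to-run noise (requires admin privileges)
    subprocess.run(["nvidia-smi", "-lgc", f"{mhz},{mhz}"], check=True)

def flush_l2_cache(buffer_mb: int = 256) -> None:
    # writing a buffer larger than L2 evicts previously cached benchmark data
    scratch = torch.empty(buffer_mb * 1024 * 1024, dtype=torch.uint8, device="cuda")
    scratch.zero_()
    torch.cuda.synchronize()

def time_kernel(fn, iters: int = 100) -> float:
    # flush once before the timed loop (a fuller harness would flush between iterations)
    flush_l2_cache()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```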
We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes.
Paper defines the SOL Score metric and states its interpretive meaning (fraction of gap closed between baseline and hardware SOL bound).
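One plausible reading of that definition, with kernel times in milliseconds (the exact formula is an assumption here, not quoted from the paper):

```python
def sol_score(baseline_ms: float, candidate_ms: float, sol_ms: float) -> float:
    """Fraction of the baseline-to-SOL gap closed by the candidate kernel.
    0.0 = no better than the scoring baseline, 1.0 = at the hardware bound.
    Illustrative formula only; the paper's exact definition may differ."""
    return (baseline_ms - candidate_ms) / (baseline_ms - sol_ms)

# e.g. baseline 2.0 ms, SOL bound 1.0 ms, candidate 1.25 ms -> score 0.75
```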
SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization.
Methodological claim: introduction of SOLAR pipeline to compute analytic hardware-grounded SOL bounds and use of those bounds as benchmark targets, as described in the paper.
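One common way to derive such an analytic bound is a roofline-style calculation; the sketch below is a generic illustration, not SOLAR's actual pipeline:

```python
def speed_of_light_ms(bytes_moved: float, flops: float,
                      peak_bw_gbps: float, peak_tflops: float) -> float:
    """Lower bound on kernel time: limited by either memory traffic or compute."""
    memory_bound_ms = bytes_moved / (peak_bw_gbps * 1e9) * 1e3
    compute_bound_ms = flops / (peak_tflops * 1e12) * 1e3
    return max(memory_bound_ms, compute_bound_ms)

# e.g. a kernel moving 1e9 bytes and doing 1e10 FLOPs on an 8000 GB/s, 2000 TFLOPS part
# -> max(0.125 ms, 0.005 ms) = 0.125 ms (memory bound)
```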
The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities.
Paper description of benchmark coverage (workload direction and data types; inclusion of kernels tied to Blackwell hardware features).
We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs.
Paper reports construction of the benchmark with counts: 235 CUDA kernel problems and 124 source models; descriptive dataset claim in the manuscript.
Given these findings, policymakers should favor 'strategic forbearance'—apply existing laws rather than create new regulations that could stifle innovation and diffusion of AI.
Authors' normative policy recommendation based on their interpretation of the reviewed empirical literature (risk–benefit assessment); this is a prescriptive conclusion rather than an empirical finding, so no sample size applies.
Generative AI lowers entry costs for startups, facilitating new firm entry and product development.
Cited empirical and descriptive evidence in the literature review indicating reduced development costs and faster product prototyping enabled by AI tools; the brief does not provide a pooled sample size or a single quantitative estimate.
Generative AI significantly boosts productivity in specific tasks like coding, writing, and customer service—often by 15% to 50%.
Synthesis/review of empirical literature through 2025 (multiple empirical studies of task-level impacts, including field and lab studies and observational analyses); the brief reports aggregated effect ranges but does not list a single pooled sample size.
The AgentDS benchmark datasets are open-sourced and available at https://huggingface.co/datasets/lainmn/AgentDS.
Paper includes link to the open-source datasets and the AgentDS website.
The strongest solutions arise from human-AI collaboration.
Analysis of competition results showing top-performing submissions employed human-AI collaborative approaches rather than AI-only baselines (results from 29 teams / 80 participants).
We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science.
Paper describes the creation of the AgentDS benchmark and an associated competition as the study's primary methodological contribution.
Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows.
Statement in the paper referencing recent developments in LLMs and AI agents; presented as motivation rather than validated empirically within the paper.
Data science plays a critical role in transforming complex data into actionable insights across numerous domains.
Background statement in the paper (no empirical test or dataset provided to support this claim).
End-to-end verified pipelines can produce provably correct code from informal specifications.
The paper surveys early research demonstrating pipelines that go from informal specifications to formally verified code; the provided text does not include experimental sample sizes or benchmarks.
AI-generated postconditions catch real-world bugs missed by prior methods.
Assertion based on early research surveyed by the paper, indicating empirical instances where AI-generated postconditions found bugs that other methods missed; no numeric details are provided in the excerpt.
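As a generic illustration of the idea (a hypothetical function and postcondition, not an example from the surveyed work): a postcondition over the output can flag a bug that a happy-path test misses.

```python
def merge_intervals(intervals):
    """Buggy merge: forgets to sort by start, so out-of-order inputs are mishandled."""
    merged = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def covers_all_inputs(inputs, outputs):
    # postcondition: every input interval must lie inside some merged interval
    return all(any(mlo <= lo and hi <= mhi for mlo, mhi in outputs)
               for lo, hi in inputs)

print(covers_all_inputs([(1, 3), (2, 5)], merge_intervals([(1, 3), (2, 5)])))  # True
print(covers_all_inputs([(4, 6), (1, 5)], merge_intervals([(4, 6), (1, 5)])))  # False: exposes the missing-sort bug
```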
Interactive test-driven formalization improves program correctness.
Paper surveys early research that reportedly demonstrates this effect (described as 'interactive test-driven formalization that improves program correctness'); the excerpt does not include specific study details or sample sizes.
The central bottleneck is validating specifications: since there is no oracle for specification correctness other than the user, we need semi-automated metrics that can assess specification quality with or without code, through lightweight user interaction and proxy artifacts such as tests.
Analytical claim and research agenda item in the paper; motivates need for new metrics and interaction designs. No empirical validation or sample size reported in the excerpt.
Intent formalization offers a tradeoff spectrum suitable to the reliability needs of different contexts: from lightweight tests that disambiguate likely misinterpretations, through full functional specifications for formal verification, to domain-specific languages from which correct code is synthesized automatically.
Conceptual framework proposed in the paper describing a spectrum of specification formality; presented as an argument rather than an empirical finding, with no sample sizes provided in the excerpt.
Intent formalization — translating informal user intent into checkable formal specifications — is the key challenge that will determine whether AI makes software more reliable or merely more abundant.
Normative argument presented by the authors as the central thesis of the paper; no empirical study or sample size cited in the provided text.
Agentic AI systems can now generate code with remarkable fluency.
Authoritative assertion in the paper based on contemporary observations of large code-generating models; no empirical sample size or benchmark numbers reported in the text provided.
The study draws policy implications for promoting high-quality development from the finding that innovation and the digital economy now play larger roles in growth.
Authors' discussion/conclusion drawing policy implications from empirical findings (declining capital elasticity, rising TFP and digital economy contribution).
Overall, China's growth model shifted over 2010–2022 from being investment-driven to being innovation-driven.
Synthesis of results: declining capital elasticity, rising TFP contribution, substantial share of digital economy in TFP, and regional patterns reported by the study.
The study's method is novel because it uses both migrant worker monitoring data and digital-economy proxy indicators, giving a more accurate picture of how labor quality and technological progress affect each other.
Author-reported methodological description: extended Cobb–Douglas approach combined with quality-adjusted labor measures derived from migrant worker monitoring data and proxy indicators for the digital economy.
Regional analysis shows that growth in coastal regions has been innovation-driven, with an estimated innovation coefficient of approximately 0.31.
Regional decomposition/estimation reported in the paper's analysis of coastal vs inland regions using the extended production function and digital/labour-quality measures.
The digital economy accounted for 40% of the observed increase in TFP (i.e., made up 40% of the TFP contribution).
Attribution within the growth decomposition from the extended production function, where digital economy indicators are included and their contribution to TFP is estimated.
The contribution rate of total factor productivity (TFP) rose from 18% to 26% between the earlier and later periods.
Decomposition of growth using the extended Cobb–Douglas production function for China over 2010–2022, reporting TFP contribution rates for the two periods.
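For context, the textbook growth-accounting identity underlying such decompositions (the paper's extended specification adds further terms not shown here):

```latex
Y_t = A_t K_t^{\alpha} L_t^{\beta}
\;\Rightarrow\;
g_Y = g_A + \alpha\, g_K + \beta\, g_L,
\qquad
\text{TFP contribution rate} = g_A / g_Y .
```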
TDAD (Test-Driven Agentic Development) combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change.
Description of the tool/methodology and its implementation (TDAD is presented as an open-source tool in the paper).
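A minimal sketch of the general approach (using Python's standard `ast` module; the file pattern, weighting, and helper names are illustrative assumptions, not TDAD's implementation): parse each test file, record which functions each test calls, and rank tests for a change by summed edge weight.

```python
import ast
from collections import defaultdict
from pathlib import Path

def build_code_test_graph(test_dir: str) -> dict:
    """Map each called function name to the tests that reference it; call counts serve as edge weights."""
    graph = defaultdict(lambda: defaultdict(int))
    for path in Path(test_dir).glob("test_*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
                for call in ast.walk(node):
                    if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                        graph[call.func.id][f"{path.name}::{node.name}"] += 1
    return graph

def impacted_tests(graph, changed_functions):
    """Rank tests by summed edge weight over the changed functions."""
    scores = defaultdict(int)
    for fn in changed_functions:
        for test, weight in graph.get(fn, {}).items():
            scores[test] += weight
    return sorted(scores, key=scores.get, reverse=True)
```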
PIER is an offline reinforcement learning framework that learns fuel‑efficient, safety‑aware routing policies from physics‑calibrated environments grounded in historical vessel tracking data and ocean reanalysis products, requiring no online simulator.
Methodological description of PIER in the paper: offline RL trained on environments constructed from AIS and reanalysis data; no online simulator used for policy learning (implementation details provided).
Bootstrap 95% confidence interval for PIER mean CO2 savings relative to great-circle routing is [2.9%, 15.7%].
Bootstrap analysis applied to the 2023 AIS validation results (840 episodes per method) producing the stated 95% CI for mean percent savings.
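A standard percentile bootstrap over per-episode savings looks roughly like this (illustrative sketch; the paper's resample count and exact procedure are not given in the excerpt):

```python
import numpy as np

def bootstrap_ci(per_episode_savings_pct: np.ndarray,
                 n_resamples: int = 10_000, alpha: float = 0.05,
                 seed: int = 0) -> tuple:
    """Percentile bootstrap CI for the mean of per-episode savings (e.g. 840 values)."""
    rng = np.random.default_rng(seed)
    n = len(per_episode_savings_pct)
    means = np.array([
        rng.choice(per_episode_savings_pct, size=n, replace=True).mean()
        for _ in range(n_resamples)
    ])
    return (np.percentile(means, 100 * alpha / 2),
            np.percentile(means, 100 * (1 - alpha / 2)))
```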
PIER reduces per‑voyage fuel consumption variance by a factor of 3.5 (p < 0.001).
Statistical comparison of per-voyage fuel variance between PIER and baseline routing on 840 episodes per method from 2023 AIS data; significance reported with p < 0.001.
On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy.
Benchmark evaluation reported in the paper using the LoCoMo benchmark with a reported overall accuracy of 74.8%.
Adversarial governance compliance was 100%.
Adversarial compliance testing reported in the paper (linked to the adversarial query experiments); reported compliance = 100%.
There was zero cross-entity leakage across 500 adversarial queries.
Adversarial testing reported in the paper: 500 adversarial queries used to test cross-entity leakage; result = zero leakage.
Progressive context delivery yielded a 50% token reduction.
Reported experimental result in the controlled experiments indicating token usage reduction from progressive delivery = 50%.
Governance routing precision was 92% in the experiments.
Reported experimental metric from the controlled experiments (N=250, five content types) showing governance routing precision = 92%.
The system achieved 99.6% fact recall (with complementary dual-modality coverage) in the controlled experiments.
Reported experimental result from the controlled experiments (N=250, five content types) as stated in the paper.
Immediate practical steps include improved documentation, stakeholder audits, and multi‑metric evaluation; medium‑term steps include standards for participatory evaluation and tooling for transparency and monitoring; long‑term steps include institutional governance, interoperable safety APIs, and public‑interest evaluation infrastructure.
Prescriptive roadmap in the paper based on conceptual analysis and prior literature; these are recommended policy/program milestones rather than empirically validated interventions.
Transparency (detailed documentation of data, objectives, evaluation processes, and deployment constraints; audit and contest mechanisms) is a necessary mechanism for accountable alignment.
Normative and practical argumentation supported by prior work on model cards, documentation standards, and auditing; no new audits are presented in the paper.
Pluralistic evaluation—using multiple, diverse evaluation criteria and stakeholder‑informed metrics rather than single aggregated alignment scores—will better capture the values and harms at stake.
Argumentative rationale and literature synthesis advocating multi‑metric evaluation approaches; examples from prior evaluation critiques are referenced rather than new empirical comparison.