Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

The paper defines a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR).

Explicit listing of newly proposed operational metrics in the paper. This is a descriptive claim about the paper's content (theoretical definitions); no sample size or empirical estimate is provided in the excerpt.

high positive Cognitive Amplification vs Cognitive Delegation in Human-AI ... operational metrics for human-AI cognitive interaction (CAI*, D, HRI, HCDR)

The paper introduces a conceptual and mathematical framework to distinguish cognitive amplification (AI improves hybrid human-AI performance while preserving human expertise) from cognitive delegation (reasoning is progressively outsourced to AI).

Explicit contribution claim in the paper (description of a conceptual and mathematical framework); evidence consists of the model and formal definitions presented in the paper (no external empirical validation reported in the excerpt).

high positive Cognitive Amplification vs Cognitive Delegation in Human-AI ... mode of human-AI interaction (amplification vs delegation)

Artificial intelligence generates positive spatial spillovers for UCEE (positive effects on neighboring regions).

Spatial Durbin model reported in the abstract indicating positive spillover coefficients for artificial intelligence.

high positive How artificial intelligence and environmental regulation inf... UCEE index (spatial spillover effect of AI)
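For orientation, a spatial Durbin model is conventionally written as below. This is the textbook form only; the excerpt does not give the paper's exact specification, so the symbols are assumptions: y is the vector of regional UCEE values, W a spatial weights matrix, and X the regressors (including the artificial intelligence measure).

```latex
y = \rho W y + X\beta + W X\theta + \varepsilon
```

In this form, spillovers to neighboring regions enter through the spatially lagged terms W y and W X, so a positive spillover finding corresponds to a positive indirect effect of the AI regressor.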

The Global Malmquist–Luenberger (GML) index and its efficiency change (EC) and technological change (TC) components stay above 1, indicating sustained efficiency gains dominated by technological progress.

GML index and decomposition results reported in the abstract based on the panel data and GML computation.

high positive How artificial intelligence and environmental regulation inf... GML index and its EC and TC components (measures of productivity/efficiency chan...
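The standard decomposition behind this claim can be stated in one line. The period indices t and t+1 are notation assumed here, not taken from the excerpt.

```latex
\mathrm{GML}_{t}^{t+1} = \mathrm{EC}_{t}^{t+1} \times \mathrm{TC}_{t}^{t+1}
```

A value above 1 for any of the three terms indicates improvement between periods; when TC exceeds EC, the overall gain is attributed mainly to technological progress.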

Nationally, the average UCEE index rises from about 0.3 to above 0.7 over the sample period.

Computed UCEE index results from the Super-SBM model applied to the panel of 30 provinces (2013–2022) as reported in the abstract.

high positive How artificial intelligence and environmental regulation inf... UCEE index (average, national)

Recent advances in large language models, tool-using agents, and financial machine learning are shifting financial automation from isolated prediction tasks to integrated decision systems that can perceive information, reason over objectives, and generate or execute actions.

Literature synthesis and conceptual statement in the paper's introduction describing recent technological advances and their effects on financial automation; no empirical sample size reported.

high positive AI Agents in Financial Markets: Architecture, Applications, ... shift in type of financial automation (from isolated prediction to integrated de...

SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.

Conceptual/positioning claim made by the authors about the intended shift in benchmarking perspective enabled by SOL-ExecBench.

high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... benchmarking_objective_shift_toward_hardware_efficiency

To support robust evaluation of agentic optimizers, we provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis-based checks against common reward-hacking strategies.

Method/tool claim in paper describing the provided evaluation harness and its engineered controls (list of features included).

high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... evaluation_robustness_and_integrity_of_benchmarking

We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes.

Paper defines the SOL Score metric and states its interpretive meaning (fraction of gap closed between baseline and hardware SOL bound).

high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... fraction_of_gap_closed_to_hardware_bound
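As a rough illustration of a "fraction of gap closed" metric, the sketch below assumes runtimes in milliseconds and a score that is linear in runtime. The function name `sol_score` and the linear form are assumptions for illustration; the excerpt does not give the paper's formula.

```python
def sol_score(baseline_ms: float, candidate_ms: float, sol_ms: float) -> float:
    """Fraction of the baseline-to-SOL runtime gap closed by a candidate kernel.

    Assumed form (not taken from the paper): 0.0 means no better than the
    scoring baseline, 1.0 means the candidate reaches the hardware SOL bound.
    """
    gap = baseline_ms - sol_ms
    if gap <= 0:
        raise ValueError("baseline must be slower than the SOL bound")
    return (baseline_ms - candidate_ms) / gap


# Hypothetical timings: baseline 10 ms, candidate 7 ms, SOL bound 4 ms.
print(sol_score(10.0, 7.0, 4.0))  # 0.5
```

Under this assumed form the target is fixed by the hardware bound, so a faster software baseline changes the starting point but not the end point of the scale.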

SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization.

Methodological claim: introduction of SOLAR pipeline to compute analytic hardware-grounded SOL bounds and use of those bounds as benchmark targets, as described in the paper.

high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... proximity_to_hardware_speed_of_light_bounds

The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities.

Paper description of benchmark coverage (workload direction and data types; inclusion of kernels tied to Blackwell hardware features).

high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... coverage_of_workloads_and_datatypes

We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs.

Paper reports construction of the benchmark with counts: 235 CUDA kernel problems and 124 source models; descriptive dataset claim in the manuscript.

high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... benchmark_problem_count_and_coverage

Given these findings, policymakers should favor 'strategic forbearance'—apply existing laws rather than create new regulations that could stifle innovation and diffusion of AI.

Authors' normative policy recommendation based on their interpretation of the reviewed empirical literature (risk–benefit assessment); this is a prescriptive conclusion rather than an empirical finding, so no sample size applies.

high positive AI, Productivity, and Labor Markets: A Review of the Empiric... regulatory approach to AI governance (strategy of forbearance vs. new regulation...

Generative AI lowers entry costs for startups, facilitating new firm entry and product development.

Cited empirical and descriptive evidence in the literature review indicating reduced development costs and faster product prototyping enabled by AI tools; the brief does not provide a pooled sample size or a single quantitative estimate.

high positive AI, Productivity, and Labor Markets: A Review of the Empiric... barriers to entry / startup costs and rate of new product development

Generative AI significantly boosts productivity in specific tasks like coding, writing, and customer service—often by 15% to 50%.

Synthesis/review of empirical literature through 2025 (multiple empirical studies of task-level impacts, including field and lab studies and observational analyses); the brief reports aggregate reported effect ranges but does not list a single pooled sample size.

high positive AI, Productivity, and Labor Markets: A Review of the Empiric... task-level productivity in coding, writing, and customer service

The study contributes to theory by empirically integrating technological, human, and institutional dimensions within a single architectural framework, moving beyond isolated analyses of digital credit.

Author-stated contribution based on combining measures of algorithmic credit systems, human capability, and institutional design and testing interactions in the same regression models.

high positive Architecting financial well-being in algorithmic credit syst... theoretical contribution / integrative framework

Moderation analysis reveals that higher levels of human capability and stronger institutional design amplify the positive effects of algorithmic credit systems and mitigate their adverse effects (i.e., they strengthen repayment and resilience effects and reduce financial stress).

Reported moderation analyses using interaction terms in the regression models on the 400-user cross-sectional sample; results described as significant moderation by human capability and institutional design.

high positive Architecting financial well-being in algorithmic credit syst... conditional effects on repayment behavior, financial resilience, and financial s...
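A moderation analysis with interaction terms is commonly specified as below. The symbols are assumptions for illustration: Y is a well-being outcome, A is algorithmic credit system use, H is human capability, and I is institutional design; the excerpt does not give the paper's exact model.

```latex
Y_i = \beta_0 + \beta_1 A_i + \beta_2 H_i + \beta_3 I_i
      + \beta_4 (A_i \times H_i) + \beta_5 (A_i \times I_i) + \varepsilon_i
```

In this form the marginal effect of A is \beta_1 + \beta_4 H_i + \beta_5 I_i, so significant interaction coefficients are what "moderation" refers to.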

Algorithmic credit systems are positively associated with financial resilience.

Regression analyses reported show a positive relationship between algorithmic credit system use and measures of financial resilience in the sample of 400 users.

high positive Architecting financial well-being in algorithmic credit syst... financial resilience

Algorithmic credit systems are positively associated with repayment behavior.

Multiple regression results reported in the study indicate a positive association between use of algorithmic credit systems and repayment behavior based on cross-sectional survey of 400 users.

high positive Architecting financial well-being in algorithmic credit syst... repayment behavior

Measurement reliability and validity were established through Cronbach's alpha and principal component analysis.

Paper states that Cronbach’s alpha and principal component analysis (PCA) were used to establish measurement reliability and validity.

high positive Architecting financial well-being in algorithmic credit syst... measurement reliability/validity
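For readers unfamiliar with the reliability statistic named here, a minimal sketch of Cronbach's alpha follows. The data layout (one list of respondent scores per item) and the function name are assumptions; the principal component analysis step is not illustrated.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha, where items[i][j] is respondent j's score on item i."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    # Sum of per-item sample variances.
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


# Hypothetical three-item scale answered by four respondents.
scale = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 3, 3, 5]]
print(round(cronbach_alpha(scale), 3))
```

Values closer to 1 indicate that the items vary together, which is the sense in which the statistic supports internal consistency.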

The study used a quantitative, explanatory, cross-sectional design and employed multiple regression and moderation analyses to assess relationships among algorithmic credit systems, human capability, institutional design, and financial-wellbeing outcomes.

Methods described explicitly: quantitative explanatory cross-sectional design; analytical methods named as multiple regression and moderation analyses.

high positive Architecting financial well-being in algorithmic credit syst... research design / analytic methods

Data were collected from 400 users of algorithmic and digitally mediated credit platforms.

Study reports a quantitative, explanatory, cross-sectional survey of users; sample size explicitly stated as 400.

high positive Architecting financial well-being in algorithmic credit syst... sample_size / data source

Institutional design (enforceable rules, auditable logs, human oversight on high-impact actions) is a precondition for safe delegation of real authority to LLM agents; systems should be stress-tested under governance-like constraints before assignment of real authority.

Policy recommendation derived from simulation findings that governance structure strongly influences corruption-related outcomes and that safeguards alone are not consistently sufficient; grounded in experiments and rubric-assessed outcomes across 28,112 transcript segments.

high positive I Can't Believe It's Corrupt: Evaluating Corruption in Multi... safety of delegation to LLM agents (compliance with rules, avoidance of abuse)

Among models operating below saturation, governance structure is a stronger driver of corruption-related outcomes than model identity.

Comparative analysis within the multi-agent governance simulations across different authority structures and model identities; outcomes aggregated and compared across regimes (based on the 28,112 transcript segments scored).

high positive I Can't Believe It's Corrupt: Evaluating Corruption in Multi... corruption-related outcomes / rule-breaking

Integrity in institutional AI should be treated as a pre-deployment requirement rather than a post-deployment assumption.

Argument and recommendation based on results from multi-agent governance simulations evaluating rule-breaking and abuse; conclusions drawn from aggregate outcomes across simulated regimes and interventions (see study of 28,112 transcript segments).

high positive I Can't Believe It's Corrupt: Evaluating Corruption in Multi... institutional integrity / safety of delegation to LLM agents

The AgentDS benchmark datasets are open-sourced and available at https://huggingface.co/datasets/lainmn/AgentDS.

Paper includes link to the open-source datasets and the AgentDS website.

high positive AgentDS Technical Report: Benchmarking the Future of Human-A... availability of datasets

The strongest solutions arise from human-AI collaboration.

Analysis of competition results showing top-performing submissions employed human-AI collaborative approaches rather than AI-only baselines (results from 29 teams / 80 participants).

high positive AgentDS Technical Report: Benchmarking the Future of Human-A... performance of human-AI collaborative solutions

We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science.

Paper describes the creation of the AgentDS benchmark and an associated competition as the study's primary methodological contribution.

high positive AgentDS Technical Report: Benchmarking the Future of Human-A... benchmark for evaluating AI agents and human-AI collaboration

Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows.

Statement in the paper referencing recent developments in LLMs and AI agents; presented as motivation rather than validated empirically within the paper.

high positive AgentDS Technical Report: Benchmarking the Future of Human-A... automation of data science workflow

Data science plays a critical role in transforming complex data into actionable insights across numerous domains.

Background statement in the paper (no empirical test or dataset provided to support this claim).

high positive AgentDS Technical Report: Benchmarking the Future of Human-A... transforming complex data into actionable insights

LLM-generated peer reviews assign scores that, on average, are a full point higher than human reviews.

Analysis of scores in the conference peer review dataset comparing LLM-generated vs human reviews; the excerpt states an average increase of one full point but does not include sample size or scale range.

high positive How LLMs Distort Our Written Language assigned review scores

About 21% of scientific peer reviews at a recent top AI conference were LLM-generated in the wild.

Analysis of peer reviews from a recent top AI conference reported in the paper; the excerpt reports the 21% figure but not the total number of reviews.

high positive How LLMs Distort Our Written Language share/proportion of peer reviews that were AI-generated

Even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning.

Experiment in which LLMs were given expert feedback and explicit instructions to perform only grammar edits; comparisons show significant semantic alteration despite constrained instructions; sample size not provided.

high positive How LLMs Distort Our Written Language semantic alteration of text despite constrained grammar-only prompt

Using a dataset of human-written essays (collected in 2021 before widespread LLM release), asking an LLM to revise essays based on human-written feedback induces large changes in the resulting content and meaning.

Controlled experiments applying LLM revision to a pre-LLM essay dataset and comparing pre- and post-revision content/semantics; dataset described as collected in 2021 but sample size not stated in the excerpt.

high positive How LLMs Distort Our Written Language magnitude of content and semantic changes after LLM revision

In a human user study, extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question.

Human user study reported in the paper; the excerpt gives the quantified result (nearly 70% increase) but does not report sample size here.

high positive How LLMs Distort Our Written Language proportion of essays judged as neutral in answering the topic question

LLMs consistently alter the intended meaning of human writing.

Experiments in which human-written essays were revised by LLMs (including prompts asking only for grammar edits) and comparison of pre- and post-LLM text semantics; exact sample sizes not stated in the excerpt.

high positive How LLMs Distort Our Written Language degree of semantic change / alteration of intended meaning

LLMs alter the voice and tone of human writing.

Reported results from a human user study and subsequent experiments comparing original human-written text to LLM-assisted/LLM-revised text; sample sizes not provided in the excerpt.

high positive How LLMs Distort Our Written Language change in voice and tone of writing

Large language models (LLMs) are used by over a billion people globally, most often to assist with writing.

Statement in paper (likely based on external usage statistics or surveys cited by authors); no sample size reported in the provided text.

high positive How LLMs Distort Our Written Language LLM adoption and primary use case (writing assistance)

The code and data used in the study are publicly available at the referenced repository.

Paper statement that code and data are publicly available at a repository (link provided in paper).

high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... availability of replication materials (code and data)

A sensitivity analysis over patrol radius, officer count, and citizen reporting probability reveals outcomes are most sensitive to officer deployment levels.

Reported sensitivity analysis across patrol radius, officer count, and reporting probability showing officer count as the most influential parameter in the simulation outcomes.

high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... sensitivity of bias/detection outcomes to simulation parameters (patrol radius, ...

Persistent Gini coefficients of 0.43 to 0.62 across all conditions indicate concentrated detection inequality.

Reported range of Gini coefficients from simulation experiments across conditions.

high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... Gini Coefficient (detection distribution inequality)
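To make the reported range easier to interpret, here is a minimal sketch of a Gini coefficient over per-area detection counts. Treating the units as areas with detection counts is an assumption; the excerpt does not say what the paper aggregates over.

```python
def gini(values: list[float]) -> float:
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0 or xs[0] < 0:
        raise ValueError("need non-negative values with a positive sum")
    # Rank-weighted sum over the ascending-sorted values, ranks 1..n.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical counts: all detections fall in one of four areas.
print(gini([0, 0, 0, 4]))  # 0.75
```

With n units the maximum is (n - 1) / n, so values of 0.43 to 0.62 sit well above an even spread but below full concentration.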

Experiments reveal extreme and year-variant bias in Baltimore's detected mode, with mean annual DIR up to 15,714 in 2019.

Reported experimental result from simulations on Baltimore data giving mean annual DIR up to 15,714 for 2019.

high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... Disparate Impact Ratio (DIR)

We compute four monthly bias metrics across 264 city-year-mode observations: the Disparate Impact Ratio (DIR), Demographic Parity Gap, Gini Coefficient, and a composite Bias Amplification Score.

Statement of metrics computed and the number of observations (264 city-year-mode observations) reported in the paper.

high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... monthly bias metrics (DIR, Demographic Parity Gap, Gini, Bias Amplification Scor...
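Two of the named metrics can be sketched from their common textbook definitions. The function names, the choice of a reference group, and the use of per-capita detection rates are all assumptions; the excerpt does not give the paper's exact formulas, which may differ.

```python
def detection_rate(detected: int, population: int) -> float:
    """Detections per resident for one group (assumed unit of comparison)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return detected / population


def disparate_impact_ratio(rate_group: float, rate_reference: float) -> float:
    """Ratio of one group's rate to the reference group's rate."""
    if rate_reference == 0:
        raise ValueError("reference rate must be non-zero")
    return rate_group / rate_reference


def demographic_parity_gap(rate_group: float, rate_reference: float) -> float:
    """Absolute difference between the two groups' rates."""
    return abs(rate_group - rate_reference)


# Hypothetical counts for two groups of 100 residents each.
group = detection_rate(50, 100)
reference = detection_rate(25, 100)
print(disparate_impact_ratio(group, reference))  # 2.0
print(demographic_parity_gap(group, reference))  # 0.25
```

A ratio is unbounded when the reference rate is small, while the gap is bounded by 1, which is one reason the two are often reported together.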

The study uses 145,000+ Part 1 crime records from Baltimore (2017–2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data.

Reported dataset sizes and data sources in the paper (crime records from Baltimore and Chicago; ACS demographic augmentation).

high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... data sample size / dataset composition

We present a reproducible simulation framework that couples a Generative Adversarial Network (GAN) with a Noisy OR patrol detection model to measure how racial bias propagates through the full enforcement pipeline from crime occurrence to police contact.

Description of methods in paper: coupling a GAN (CTGAN) for synthetic crime generation with a Noisy OR detection/patrol model; method-level claim rather than a numerical result.

high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... bias propagation through enforcement pipeline (simulation framework)
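The Noisy OR combination rule itself is simple to state in code. In the sketch below, treating each nearby officer and a citizen report as independent detection sources is an assumption about how such a model might be set up, not a description of the paper's implementation.

```python
from math import prod


def noisy_or(probabilities: list[float]) -> float:
    """P(at least one independent source detects) = 1 - prod(1 - p_i)."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical event with two sources, each detecting with probability 0.5.
print(noisy_or([0.5, 0.5]))  # 0.75
```

Because every added source can only raise the combined probability, areas with more patrol coverage yield more detections for the same underlying crime rate, which is the mechanism a bias-propagation study would need to isolate.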

Empirical simulations of five game scenarios (ranging from repeated prisoner's dilemma to stylized repeated marketing promotion games) validate the theoretical predictions: AI agents naturally exhibit the proposed reasoning patterns and attain stable equilibrium behaviors intrinsically.

Simulation experiments reported in the paper across five distinct game scenarios; these simulations are presented as empirical validation of the theoretical results.

high positive Reasonably reasoning AI agents can avoid game-theoretic fail... frequency/occurrence of stable equilibrium behaviors (Nash-like play) in simulat...

Relaxing the common-knowledge payoff assumption—allowing stage payoffs to be unknown and each agent to observe only its own privately realized stochastic payoffs—still yields the same on-path Nash convergence guarantee.

Theoretical extension/proof in the paper showing convergence results hold under private, stochastic stage payoffs (no common-knowledge of payoffs).

high positive Reasonably reasoning AI agents can avoid game-theoretic fail... on-path Nash convergence under private, stochastic payoffs

We prove that 'reasonably reasoning' agents—agents capable of forming beliefs about others' strategies from previous observation and learning to best respond to these beliefs—eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game.

Formal theoretical proof provided in the paper (mathematical analysis of agent belief-formation and best-response learning leading to on-path closeness to Nash equilibria).

high positive Reasonably reasoning AI agents can avoid game-theoretic fail... on-path proximity (weak closeness) to Nash equilibrium of the continuation game
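The idea of forming beliefs from past observations and best responding to them can be illustrated with classical fictitious play on a one-shot prisoner's dilemma stage game. This is a deliberately simpler stand-in, not the paper's construction: the payoff values, the uniform prior, and all names are assumptions, and it ignores the continuation-game reasoning the theorem concerns.

```python
# Row player's stage payoffs (hypothetical values): 0 = cooperate, 1 = defect.
# The game is symmetric, so both players use the same table.
PAYOFF = [[3, 0],
          [5, 1]]


def best_response(opponent_counts: list[int]) -> int:
    """Best reply to the empirical frequency of the opponent's past actions."""
    total = sum(opponent_counts)
    belief = [c / total for c in opponent_counts]
    expected = [sum(PAYOFF[a][b] * belief[b] for b in range(2)) for a in range(2)]
    return max(range(2), key=lambda a: expected[a])


def fictitious_play(rounds: int) -> list[tuple[int, int]]:
    """Both players repeatedly best respond to their observed action counts."""
    # counts[i][b]: times player i has seen the opponent play b (prior of 1 each).
    counts = [[1, 1], [1, 1]]
    history = []
    for _ in range(rounds):
        a0 = best_response(counts[0])
        a1 = best_response(counts[1])
        counts[0][a1] += 1
        counts[1][a0] += 1
        history.append((a0, a1))
    return history


print(fictitious_play(3))  # [(1, 1), (1, 1), (1, 1)]
```

Since defecting is dominant in the stage game, play settles on mutual defection, the stage-game Nash equilibrium, from the first round onward.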

Off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training.

Stated claim in the paper supported by a combination of theoretical results (formal proofs about convergence properties of 'reasonably reasoning' agents) and empirical simulations across five game scenarios (including repeated prisoner's dilemma and stylized repeated marketing promotion games).

high positive Reasonably reasoning AI agents can avoid game-theoretic fail... attainment of Nash-like play / strategic equilibrium (zero-shot)

End-to-end verified pipelines can produce provably correct code from informal specifications.

The paper surveys early research demonstrating pipelines that go from informal specifications to formally verified code; the provided text does not include experimental sample sizes or benchmarks.

high positive Intent Formalization: A Grand Challenge for Reliable Coding ... provable correctness of generated code
