Evidence (13870 claims)
Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 196 | 98 | 892 | 1984 |
| Governance & Regulation | 817 | 394 | 188 | 121 | 1544 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088 |
| Research Productivity | 411 | 123 | 56 | 332 | 933 |
| Output Quality | 467 | 178 | 59 | 47 | 751 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 167 | 122 | 24 | 496 |
| Task Allocation | 207 | 64 | 71 | 32 | 379 |
| Skill Acquisition | 165 | 59 | 60 | 17 | 301 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 52 | 107 | 13 | 279 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 150 | 48 | 26 | 3 | 227 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 63 | 20 | 12 | 184 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 93 | 21 | 13 | 19 | 148 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 17 | 7 | 3 | 59 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
The paper defines a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR).
Explicit listing of newly proposed operational metrics in the paper; this is a descriptive claim about the paper's content (theoretical definitions), and the excerpt provides no sample size or empirical estimate.
The paper introduces a conceptual and mathematical framework to distinguish cognitive amplification (AI improves hybrid human-AI performance while preserving human expertise) from cognitive delegation (reasoning is progressively outsourced to AI).
Explicit contribution claim in the paper (description of a conceptual and mathematical framework); evidence consists of the model and formal definitions presented in the paper (no external empirical validation reported in the excerpt).
Artificial intelligence generates positive spatial spillovers for UCEE (positive effects on neighboring regions).
Spatial Durbin model reported in the abstract indicating positive spillover coefficients for artificial intelligence.
The Global Malmquist–Luenberger (GML) index and its efficiency change (EC) and technological change (TC) components stay above 1, indicating sustained efficiency gains dominated by technological progress.
GML index and decomposition results reported in the abstract based on the panel data and GML computation.
Nationally, the average UCEE index rises from about 0.3 to above 0.7 over the sample period.
Computed UCEE index results from the Super-SBM model applied to the panel of 30 provinces (2013–2022) as reported in the abstract.
Recent advances in large language models, tool-using agents, and financial machine learning are shifting financial automation from isolated prediction tasks to integrated decision systems that can perceive information, reason over objectives, and generate or execute actions.
Literature synthesis and conceptual statement in the paper's introduction describing recent technological advances and their effects on financial automation; no empirical sample size reported.
SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
Conceptual/positioning claim made by the authors about the intended shift in benchmarking perspective enabled by SOL-ExecBench.
To support robust evaluation of agentic optimizers, we provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis-based checks against common reward-hacking strategies.
Method/tool claim in paper describing the provided evaluation harness and its engineered controls (list of features included).
We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes.
Paper defines the SOL Score metric and states its interpretive meaning (fraction of gap closed between baseline and hardware SOL bound).
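The excerpt gives the SOL Score's interpretation (fraction of the baseline-to-SOL gap closed) but not its formula. A minimal sketch, assuming the score is computed on runtimes as a linear fraction of that gap; the function name `sol_score` and its parameters are hypothetical, and the paper's actual definition may differ (for example in clipping or aggregation across problems):

```python
def sol_score(t_baseline: float, t_candidate: float, t_sol: float) -> float:
    """Fraction of the baseline-to-SOL runtime gap closed by a candidate kernel.

    Assumed form (not taken from the paper):
        (t_baseline - t_candidate) / (t_baseline - t_sol)
    0.0 means the candidate matches the scoring baseline,
    1.0 means it reaches the hardware Speed-of-Light bound.
    """
    gap = t_baseline - t_sol
    if gap <= 0:
        raise ValueError("baseline runtime must exceed the SOL bound")
    return (t_baseline - t_candidate) / gap


if __name__ == "__main__":
    # Hypothetical runtimes in milliseconds: baseline 10, candidate 6, SOL bound 2.
    print(sol_score(10.0, 6.0, 2.0))  # 0.5
```

Under this assumed form, a candidate slower than the baseline scores below zero, which is one reason a real implementation might clip the result.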
SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization.
Methodological claim: introduction of SOLAR pipeline to compute analytic hardware-grounded SOL bounds and use of those bounds as benchmark targets, as described in the paper.
The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities.
Paper description of benchmark coverage (workload direction and data types; inclusion of kernels tied to Blackwell hardware features).
We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs.
Paper reports construction of the benchmark with counts: 235 CUDA kernel problems and 124 source models; descriptive dataset claim in the manuscript.
Given these findings, policymakers should favor 'strategic forbearance': applying existing laws rather than creating new regulations that could stifle innovation and the diffusion of AI.
Authors' normative policy recommendation based on their interpretation of the reviewed empirical literature (risk–benefit assessment); this is a prescriptive conclusion rather than an empirical finding, so no sample size applies.
Generative AI lowers entry costs for startups, facilitating new firm entry and product development.
Cited empirical and descriptive evidence in the literature review indicating reduced development costs and faster product prototyping enabled by AI tools; the brief does not provide a pooled sample size or a single quantitative estimate.
Generative AI significantly boosts productivity in specific tasks like coding, writing, and customer service—often by 15% to 50%.
Synthesis/review of empirical literature through 2025 (multiple empirical studies of task-level impacts, including field and lab studies and observational analyses); the brief reports aggregate reported effect ranges but does not list a single pooled sample size.
The study contributes to theory by empirically integrating technological, human, and institutional dimensions within a single architectural framework, moving beyond isolated analyses of digital credit.
Author-stated contribution based on combining measures of algorithmic credit systems, human capability, and institutional design and testing interactions in the same regression models.
Moderation analysis reveals that higher levels of human capability and stronger institutional design amplify the positive effects of algorithmic credit systems and mitigate their adverse effects (i.e., they strengthen repayment and resilience effects and reduce financial stress).
Reported moderation analyses using interaction terms in the regression models on the 400-user cross-sectional sample; results described as significant moderation by human capability and institutional design.
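Moderation with interaction terms is conventionally specified by adding product terms to the regression. A sketch of that standard form, assuming an outcome $Y$ (for example repayment behavior), algorithmic credit system use $A$, human capability $H$, and institutional design $I$; the symbols and the absence of controls are assumptions, since the excerpt does not give the estimated equation:

```latex
Y_i = \beta_0 + \beta_1 A_i + \beta_2 H_i + \beta_3 I_i
      + \beta_4 (A_i \times H_i) + \beta_5 (A_i \times I_i) + \varepsilon_i

\frac{\partial Y_i}{\partial A_i} = \beta_1 + \beta_4 H_i + \beta_5 I_i
```

In this specification, the effect of algorithmic credit use on the outcome depends on the moderators: positive $\beta_4$ and $\beta_5$ for repayment or resilience would correspond to the reported amplification, and negative values for financial stress would correspond to the reported mitigation.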
Algorithmic credit systems are positively associated with financial resilience.
Regression analyses reported show a positive relationship between algorithmic credit system use and measures of financial resilience in the sample of 400 users.
Algorithmic credit systems are positively associated with repayment behavior.
Multiple regression results reported in the study indicate a positive association between use of algorithmic credit systems and repayment behavior based on cross-sectional survey of 400 users.
Measurement reliability and validity were established through Cronbach's alpha and principal component analysis.
Paper states that Cronbach’s alpha and principal component analysis (PCA) were used to establish measurement reliability and validity.
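For reference, Cronbach's alpha can be computed from item-level responses with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale). A minimal sketch using only the standard library; the data layout (one row per respondent, one column per item) and the example responses are assumptions for illustration, not the study's data:

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondent rows, one score per item.

    Uses sample variances (n - 1 denominator) for items and for the total score.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Variance of each respondent's summed score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical 3-item scale answered identically across items by 3 respondents.
    responses = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
    print(cronbach_alpha(responses))
```

Perfectly consistent items, as in the hypothetical data above, give an alpha of 1; weaker inter-item correlation lowers it.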
The study used a quantitative, explanatory, cross-sectional design and employed multiple regression and moderation analyses to assess relationships among algorithmic credit systems, human capability, institutional design, and financial-wellbeing outcomes.
Methods described explicitly: quantitative explanatory cross-sectional design; analytical methods named as multiple regression and moderation analyses.
Data were collected from 400 users of algorithmic and digitally mediated credit platforms.
Study reports a quantitative, explanatory, cross-sectional survey of users; sample size explicitly stated as 400.
Institutional design (enforceable rules, auditable logs, human oversight on high-impact actions) is a precondition for safe delegation of real authority to LLM agents; systems should be stress-tested under governance-like constraints before assignment of real authority.
Policy recommendation derived from simulation findings that governance structure strongly influences corruption-related outcomes and that safeguards alone are not consistently sufficient; grounded in experiments and rubric-assessed outcomes across 28,112 transcript segments.
Among models operating below saturation, governance structure is a stronger driver of corruption-related outcomes than model identity.
Comparative analysis within the multi-agent governance simulations across different authority structures and model identities; outcomes aggregated and compared across regimes (based on the 28,112 transcript segments scored).
Integrity in institutional AI should be treated as a pre-deployment requirement rather than a post-deployment assumption.
Argument and recommendation based on results from multi-agent governance simulations evaluating rule-breaking and abuse; conclusions drawn from aggregate outcomes across simulated regimes and interventions (see study of 28,112 transcript segments).
The AgentDS benchmark datasets are open-sourced and available at https://huggingface.co/datasets/lainmn/AgentDS.
Paper includes link to the open-source datasets and the AgentDS website.
The strongest solutions arise from human-AI collaboration.
Analysis of competition results showing top-performing submissions employed human-AI collaborative approaches rather than AI-only baselines (results from 29 teams / 80 participants).
We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science.
Paper describes the creation of the AgentDS benchmark and an associated competition as the study's primary methodological contribution.
Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have automated much of the data science workflow.
Statement in the paper referencing recent developments in LLMs and AI agents; presented as motivation rather than validated empirically within the paper.
Data science plays a critical role in transforming complex data into actionable insights across numerous domains.
Background statement in the paper (no empirical test or dataset provided to support this claim).
LLM-generated peer reviews assign scores that, on average, are a full point higher than human reviews.
Analysis of scores in the conference peer review dataset comparing LLM-generated vs human reviews; the excerpt states an average increase of one full point but does not include sample size or scale range.
About 21% of scientific peer reviews at a recent top AI conference were AI-generated (LLM-generated) in the wild.
Analysis of peer reviews from a recent top AI conference reported in the paper; the excerpt reports the 21% figure but not the total number of reviews.
Even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning.
Experiment in which LLMs were given expert feedback and explicit instructions to perform only grammar edits; comparisons show significant semantic alteration despite constrained instructions; sample size not provided.
Using a dataset of human-written essays (collected in 2021 before widespread LLM release), asking an LLM to revise essays based on human-written feedback induces large changes in the resulting content and meaning.
Controlled experiments applying LLM revision to a pre-LLM essay dataset and comparing pre- and post-revision content/semantics; dataset described as collected in 2021 but sample size not stated in the excerpt.
In a human user study, extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question.
Human user study reported in the paper; the excerpt gives the quantified result (nearly 70% increase) but does not report sample size here.
LLMs consistently alter the intended meaning of human writing.
Experiments in which human-written essays were revised by LLMs (including prompts asking only for grammar edits) and comparison of pre- and post-LLM text semantics; exact sample sizes not stated in the excerpt.
LLMs alter the voice and tone of human writing.
Reported results from a human user study and subsequent experiments comparing original human-written text to LLM-assisted/LLM-revised text; sample sizes not provided in the excerpt.
Large language models (LLMs) are used by over a billion people globally, most often to assist with writing.
Statement in paper (likely based on external usage statistics or surveys cited by authors); no sample size reported in the provided text.
The code and data used in the study are publicly available at the referenced repository.
Paper statement that code and data are publicly available at a repository (link provided in paper).
A sensitivity analysis over patrol radius, officer count, and citizen reporting probability reveals outcomes are most sensitive to officer deployment levels.
Reported sensitivity analysis across patrol radius, officer count, and reporting probability showing officer count as the most influential parameter in the simulation outcomes.
Persistent Gini coefficients of 0.43 to 0.62 across all conditions indicate concentrated detection inequality.
Reported range of Gini coefficients from simulation experiments across conditions.
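The excerpt does not say which quantity the Gini coefficient is computed over. A minimal sketch of the standard Gini calculation, assuming it is applied to non-negative per-area detection counts; the function name and the example values are hypothetical:

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated).

    Uses the sorted-rank formula:
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,  with ranks i = 1..n
    """
    xs = sorted(values)
    if any(x < 0 for x in xs):
        raise ValueError("values must be non-negative")
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    print(gini([1, 1, 1, 1]))  # 0.0: detections spread evenly
    print(gini([0, 0, 0, 1]))  # 0.75: all detections in one of four areas
```

With this formula the maximum for n areas is (n - 1)/n, so the reported 0.43 to 0.62 range would indicate substantial concentration under this assumed setup.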
Experiments reveal extreme and year-variant bias in Baltimore's detected mode, with mean annual DIR up to 15,714 in 2019.
Reported experimental result from simulations on Baltimore data giving mean annual DIR up to 15,714 for 2019.
We compute four monthly bias metrics across 264 city-year-mode observations: the Disparate Impact Ratio (DIR), Demographic Parity Gap, Gini Coefficient, and a composite Bias Amplification Score.
Statement of metrics computed and the number of observations (264 city-year-mode observations) reported in the paper.
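The excerpt names the metrics without defining them. A sketch of two of them under their common fairness-literature definitions, assuming each group has a detection rate (detections divided by population); the function names, the choice of reference group, and the example counts are assumptions, and the composite Bias Amplification Score is omitted because its construction is not described:

```python
def detection_rate(detections: int, population: int) -> float:
    """Detections per capita for one demographic group."""
    if population <= 0:
        raise ValueError("population must be positive")
    return detections / population


def disparate_impact_ratio(rate_group: float, rate_reference: float) -> float:
    """Ratio of a group's rate to the reference group's rate (1.0 = parity)."""
    if rate_reference == 0:
        return float("inf")
    return rate_group / rate_reference


def demographic_parity_gap(rate_group: float, rate_reference: float) -> float:
    """Absolute difference between the two rates (0.0 = parity)."""
    return abs(rate_group - rate_reference)


if __name__ == "__main__":
    # Hypothetical monthly counts for two groups of equal size.
    r_a = detection_rate(20, 100)
    r_b = detection_rate(10, 100)
    print(disparate_impact_ratio(r_a, r_b))  # 2.0
    print(demographic_parity_gap(r_a, r_b))
```

A ratio-based metric grows without bound as the reference rate approaches zero, which is consistent with very large DIR values being possible when one group is rarely detected.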
The study uses 145,000+ Part 1 crime records from Baltimore (2017–2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data.
Reported dataset sizes and data sources in the paper (crime records from Baltimore and Chicago; ACS demographic augmentation).
We present a reproducible simulation framework that couples a Generative Adversarial Network (GAN) with a Noisy OR patrol detection model to measure how racial bias propagates through the full enforcement pipeline from crime occurrence to police contact.
Description of methods in paper: coupling a GAN (CTGAN) for synthetic crime generation with a Noisy OR detection/patrol model; method-level claim rather than a numerical result.
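A Noisy-OR model combines independent detection chances: an event goes undetected only if every potential detector misses it. A minimal sketch of that combination rule; how the paper derives each per-officer or per-citizen probability (for example from patrol radius) is not given in the excerpt, so the inputs here are hypothetical:

```python
from math import prod


def noisy_or(probabilities):
    """Probability that at least one independent detector fires.

    P(detect) = 1 - product over i of (1 - p_i)
    """
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
    return 1.0 - prod(1.0 - p for p in probabilities)


if __name__ == "__main__":
    # Hypothetical: two nearby officers, each with a 0.5 chance of detecting a crime.
    print(noisy_or([0.5, 0.5]))  # 0.75
    # With no detectors in range, nothing is detected.
    print(noisy_or([]))  # 0.0
```

Because each additional detector multiplies the miss probability, detection rises with the number of officers in range, which fits the reported sensitivity to officer count.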
Empirical simulations of five game scenarios (ranging from repeated prisoner's dilemma to stylized repeated marketing promotion games) validate the theoretical predictions: AI agents naturally exhibit the proposed reasoning patterns and attain stable equilibrium behaviors intrinsically.
Simulation experiments reported in the paper across five distinct game scenarios; these simulations are presented as empirical validation of the theoretical results.
Relaxing the common-knowledge payoff assumption—allowing stage payoffs to be unknown and each agent to observe only its own privately realized stochastic payoffs—still yields the same on-path Nash convergence guarantee.
Theoretical extension/proof in the paper showing convergence results hold under private, stochastic stage payoffs (no common-knowledge of payoffs).
We prove that 'reasonably reasoning' agents—agents capable of forming beliefs about others' strategies from previous observation and learning to best respond to these beliefs—eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game.
Formal theoretical proof provided in the paper (mathematical analysis of agent belief-formation and best-response learning leading to on-path closeness to Nash equilibria).
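The idea of forming beliefs from observed play and best responding to them can be illustrated with fictitious play. This is a deliberately simplified sketch, not the paper's agent definition: beliefs here are empirical frequencies over the opponent's stage actions and the best response is myopic, whereas the paper's result concerns beliefs over strategies in the continuation game. The payoff values and function names are assumptions.

```python
def best_response(payoff, opp_counts):
    """Action maximizing expected payoff against the opponent's empirical frequencies.

    payoff[a][b] is the player's own payoff for action a against opponent action b.
    """
    n = len(opp_counts)
    total = sum(opp_counts)
    # Uniform prior before any observation.
    belief = [c / total for c in opp_counts] if total else [1 / n] * n
    expected = [sum(row[b] * belief[b] for b in range(n)) for row in payoff]
    return max(range(len(payoff)), key=lambda a: expected[a])


def fictitious_play(payoff_row, payoff_col, rounds):
    """Both players repeatedly best respond to the other's observed action counts."""
    counts_row = [0] * len(payoff_row)  # row player's past actions, seen by column
    counts_col = [0] * len(payoff_col)  # column player's past actions, seen by row
    history = []
    for _ in range(rounds):
        a = best_response(payoff_row, counts_col)
        b = best_response(payoff_col, counts_row)
        counts_row[a] += 1
        counts_col[b] += 1
        history.append((a, b))
    return history


if __name__ == "__main__":
    # Prisoner's dilemma stage game: action 0 = cooperate, 1 = defect.
    pd = [[3, 0], [5, 1]]
    print(fictitious_play(pd, pd, 5))  # both players defect every round
```

In this stage game defection is dominant, so play settles on mutual defection, the stage-game Nash equilibrium, from the first round.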
Off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training.
Stated claim in the paper supported by a combination of theoretical results (formal proofs about convergence properties of 'reasonably reasoning' agents) and empirical simulations across five game scenarios (including repeated prisoner's dilemma and stylized repeated marketing promotion games).
End-to-end verified pipelines can produce provably correct code from informal specifications.
The paper surveys early research demonstrating pipelines that go from informal specifications to formally verified code; the provided text does not include experimental sample sizes or benchmarks.