Evidence (13827 claims)
Adoption (8454 claims)
Productivity (7544 claims)
Governance (6789 claims)
Human-AI Collaboration (6327 claims)
Org Design (4126 claims)
Innovation (4058 claims)
Labor Markets (3520 claims)
Skills & Training (2924 claims)
Inequality (2057 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Switchcraft saves over $3,600 per million queries.
Cost-savings estimate reported in the paper, obtained by applying the measured 84% reduction to a million-query baseline.
Switchcraft reduces inference cost by 84%.
Empirical cost analysis reported in the paper comparing inference cost with and without Switchcraft.
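The two cost figures are linked by simple arithmetic. If the savings figure is the 84% reduction applied to the baseline cost of a million queries, then savings = 0.84 × baseline, so a saving above $3,600 implies a baseline above roughly $4,286 per million queries. The sketch below only back-solves that relation under this assumption; the baseline itself is not stated here, and the function names are illustrative.

```python
def savings_per_million(baseline_cost: float, reduction: float = 0.84) -> float:
    """Savings per million queries when a fractional cost reduction is applied to a baseline."""
    return baseline_cost * reduction


def implied_baseline(savings: float, reduction: float = 0.84) -> float:
    """Invert the relation above: the baseline cost that would yield a given saving."""
    return savings / reduction


# A saving of $3,600 at an 84% reduction implies a baseline of about $4,286 per million queries
print(round(implied_baseline(3600.0), 2))  # 4285.71
```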
Switchcraft's accuracy matches or exceeds the best individual model.
Empirical comparison reported in the paper between Switchcraft accuracy (82.9%) and accuracies of individual models (details summarized by authors).
Switchcraft achieves 82.9% accuracy.
Empirical evaluation results reported in the paper (accuracy metric measured on the evaluation framework).
Switchcraft operates inline, selecting the lowest-cost model subject to correctness.
Method description in the paper describing Switchcraft's operational design.
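A minimal sketch of "lowest-cost model subject to correctness" as a routing rule, assuming some predictor supplies an estimated probability that each candidate model will handle the query correctly. The `Model` type, the `p_correct` callback, the 0.8 threshold, and the fall-back to the most expensive model are all hypothetical choices for illustration, not Switchcraft's actual design.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_call: float  # hypothetical cost units


def route(query: str,
          models: Sequence[Model],
          p_correct: Callable[[str, Model], float],
          threshold: float = 0.8) -> Model:
    """Return the cheapest model whose predicted success probability meets the threshold.

    If no model clears the bar, fall back to the most expensive one (assumed strongest).
    """
    by_cost = sorted(models, key=lambda m: m.cost_per_call)
    for m in by_cost:
        if p_correct(query, m) >= threshold:
            return m
    return by_cost[-1]


# Usage with a toy lookup table standing in for a learned correctness predictor
models = [Model("large", 10.0), Model("small", 1.0), Model("medium", 3.0)]
scores = {"small": 0.55, "medium": 0.86, "large": 0.95}
choice = route("book a flight", models, lambda q, m: scores[m.name])
print(choice.name)  # medium
```

Checking candidates in ascending cost order means the first model that clears the threshold is, by construction, the cheapest acceptable one.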
We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling.
Authors' stated contribution / novelty claim in the paper (method description: Switchcraft).
Hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting LLM-generated errors may reinforce existing inequities in scientific recognition.
Analysis linking hallucinated citations to characteristics of the (intended or assigned) cited authors, including measures of prominence and inferred gender, showing over-representation of prominent and male scholars among hallucinated attributions.
Hallucinated references are especially pronounced among small and early-career author teams.
Analysis of hallucination prevalence by author-team characteristics (team size and author career stage) within the audited dataset.
Hallucinated references are especially pronounced in manuscripts with linguistic signatures of AI-assisted writing.
Classification of manuscripts by linguistic features (signatures) indicative of AI-assistance and comparison of hallucination prevalence between groups.
These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake.
Cross-field comparison within the audited dataset showing higher rates of non-existent references in fields identified as having rapid AI adoption.
We provide a conservative estimate of 146,932 hallucinated citations in 2025 alone.
Quantitative extrapolation/estimation from the audit of references in the dataset, producing an annualized (2025) conservative count.
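The excerpt does not state how the 2025 figure was extrapolated. One common way to make such a scale-up conservative is to multiply the total reference volume by the lower confidence bound of the audited hallucination rate rather than by its point estimate. The sketch below assumes a Wilson score interval for that bound; it is a swapped-in illustration, not necessarily the authors' procedure, and the inputs are invented placeholders rather than the paper's audit figures.

```python
import math


def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, (centre - margin) / denom)  # clamp tiny negative rounding error


def conservative_count(flagged: int, audited: int, total_refs: int) -> int:
    """Scale the lower-bound hallucination rate to a reference population, rounding down."""
    return math.floor(wilson_lower(flagged, audited) * total_refs)


# Invented placeholder inputs, not the paper's audit figures
audited, flagged, total = 10_000, 30, 5_000_000
point_estimate = flagged / audited * total
low = conservative_count(flagged, audited, total)
print(low, point_estimate)
```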
We find a sharp rise in non-existent references following widespread LLM adoption.
Temporal analysis of the audited references comparing prevalence of non-existent (hallucinated) citations before and after the period of widespread LLM adoption across the 111M-reference dataset.
The Analysis Contract framework generalizes across domains of vibe inference through domain-specific instantiation.
Theoretical claim and conceptual generalization proposed in the paper; no cross-domain empirical tests or case studies reported.
The Analysis Contract, a proposed pre-commitment framework, can adapt the logic of pre-analysis plans and the Causal Roadmap to the AI-assisted setting by imposing three conditions before a causal claim is made: a method-data contract, a data audit, and a pre-commitment statement defining what would count as a disconfirming result.
Proposed methodological/framework contribution in the paper; described and motivated conceptually, without empirical validation or implementation evidence.
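A minimal sketch of how the three conditions could be enforced as a gate before a causal claim is reported in an AI-assisted analysis pipeline. The class, field, and method names are hypothetical; the paper describes the framework conceptually and does not specify an implementation.

```python
from dataclasses import dataclass


@dataclass
class AnalysisContract:
    """Hypothetical record of the three conditions to satisfy before a causal claim."""
    method_data_contract: str = ""   # why this method suits these data
    data_audit_passed: bool = False  # whether the data audit was completed and passed
    disconfirming_result: str = ""   # pre-committed result that would count against the claim

    def missing(self) -> list[str]:
        """Names of the conditions that are not yet met."""
        gaps = []
        if not self.method_data_contract.strip():
            gaps.append("method-data contract")
        if not self.data_audit_passed:
            gaps.append("data audit")
        if not self.disconfirming_result.strip():
            gaps.append("pre-commitment statement")
        return gaps

    def may_state_causal_claim(self) -> bool:
        return not self.missing()


draft = AnalysisContract(method_data_contract="DID: treatment timing is exogenous to firm trends")
print(draft.missing())  # ['data audit', 'pre-commitment statement']
```

Making the disconfirming result a required field, rather than an optional note, mirrors the framework's point that the criterion must be fixed before results are seen.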
The paper extends the TOE (Technology-Organization-Environment) framework by identifying an optimal AI adoption range and empirically validating the homogenization trap.
Theoretical contribution claimed in the paper's discussion, linking the empirical inverted-U and homogenization findings back to the TOE framework.
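An inverted-U relation is commonly tested with a quadratic term, y = β0 + β1·x + β2·x² with β2 < 0, whose peak lies at x* = −β1 / (2·β2). Assuming the optimal adoption range is located this way (the excerpt does not say), the sketch computes the turning point from fitted coefficients; the coefficient values are made up.

```python
def turning_point(beta1: float, beta2: float) -> float:
    """Peak of y = b0 + b1*x + b2*x**2; the curve is an inverted U only when b2 < 0."""
    if beta2 >= 0:
        raise ValueError("beta2 must be negative for an inverted U")
    return -beta1 / (2 * beta2)


# Hypothetical coefficients from a quadratic regression of innovation on AI adoption
print(turning_point(0.6, -0.25))  # 1.2
```

In applied work the turning point is usually also checked to fall inside the observed range of x, since a peak outside the data would not support an inverted U.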
AI’s enabling effect on innovation is more sustainable in high-technology firms (relative to low-tech firms).
Heterogeneity analyses by firm technology intensity (high-tech vs. others) showing more sustained positive AI effects in high-tech firms.
AI’s enabling effect on innovation is more sustainable in non-state-owned firms (compared to state-owned firms).
Heterogeneity analyses by ownership type reported in the paper showing stronger/sustained positive AI–innovation effects for non-state-owned firms.
Firm absorptive capacity partially mediates the AI–innovation relationship.
Bootstrap mediation analysis performed on the sample indicating a partial mediation effect of absorptive capacity between AI and innovation.
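The excerpt does not give the model specification. As a hedged illustration of the general technique, the sketch estimates a simple X → M → Y mediation with the product-of-coefficients indirect effect a·b and a percentile bootstrap interval. Partial mediation corresponds to an interval for a·b that excludes zero while the direct effect of X on Y also remains nonzero. The variable roles (X = AI use, M = absorptive capacity, Y = innovation) and the simulated data are assumptions, and the paper's controls and fixed effects are omitted.

```python
import random
from statistics import fmean


def ols_slope(x, y):
    """Slope of y on x with an intercept."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def partial_slope(y, u, v):
    """Coefficient on u in the regression y ~ u + v (with an intercept)."""
    mu, mv, my = fmean(u), fmean(v), fmean(y)
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((a - mv) ** 2 for a in v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suy = sum((a - mu) * (b - my) for a, b in zip(u, y))
    svy = sum((a - mv) * (b - my) for a, b in zip(v, y))
    return (suy * svv - svy * suv) / (suu * svv - suv ** 2)


def indirect_effect(x, m, y):
    """a * b: effect of X on M times effect of M on Y holding X fixed."""
    return ols_slope(x, m) * partial_slope(y, m, x)


def bootstrap_ci(x, m, y, reps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the indirect effect, resampling observations."""
    rng = random.Random(seed)
    n = len(x)
    draws = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(indirect_effect([x[i] for i in idx],
                                     [m[i] for i in idx],
                                     [y[i] for i in idx]))
    draws.sort()
    return draws[int(alpha / 2 * reps)], draws[int((1 - alpha / 2) * reps) - 1]


# Simulated data: true indirect effect 0.5 * 0.4 = 0.2, true direct effect 0.3
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(300)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.4 * mi + 0.3 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

lo, hi = bootstrap_ci(x, m, y)
direct = partial_slope(y, x, m)
print(f"indirect 95% CI: [{lo:.3f}, {hi:.3f}], direct: {direct:.3f}")
```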
The positive effect of GGFs on digital–intelligent transformation is particularly strong for firms with robust dynamic capabilities.
Heterogeneity analysis reported in the paper comparing effects across firms with differing levels of dynamic capabilities, using the DID sample of Chinese A-share listed firms (2012–2024).
The positive effect of GGFs on digital–intelligent transformation is particularly strong for firms operating in high‑tech industries.
Heterogeneity analysis reported in the paper comparing effects across industries (high-tech vs. others), using the DID sample of Chinese A-share listed firms (2012–2024).
The positive effect of GGFs on digital–intelligent transformation is particularly strong in firms with high-quality internal controls.
Heterogeneity analysis reported in the paper comparing effects across firms with different internal control quality, using the DID sample of Chinese A-share listed firms (2012–2024).
GGFs promote firms’ digital–intelligent transformation by encouraging knowledge spillovers.
Mechanism analysis reported in the paper that identifies knowledge spillovers as a channel from GGFs to firm-level digital–intelligent transformation, using the DID framework on Chinese A-share listed firms (2012–2024).
GGFs promote firms’ digital–intelligent transformation by transmitting policy guidance.
Mechanism analysis reported in the paper indicating a pathway from GGFs to firm transformation via policy guidance channels, based on the DID sample of Chinese A-share listed firms (2012–2024).
GGFs promote firms’ digital–intelligent transformation by easing firms' financing constraints.
Mechanism analysis reported in the paper (mediation / pathway analysis tied to the DID framework) using the same sample of Chinese A-share listed firms (2012–2024).
Government-guided funds (GGFs) significantly promote firms’ digital–intelligent transformation.
Difference-in-differences (DID) analysis applied to Chinese A-share listed firms over 2012–2024, as reported in the paper's main empirical results.
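The excerpt does not detail the paper's specification (for example, treatment timing, fixed effects, or controls). The canonical two-group, two-period version below shows the core identification logic: the change in outcomes for firms exposed to GGFs minus the change for unexposed firms, which removes shocks common to both groups under the parallel-trends assumption. The outcome values are invented for illustration.

```python
from statistics import fmean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Canonical 2x2 difference-in-differences from group-by-period outcome samples."""
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change


# Hypothetical transformation-index values, not data from the paper
effect = did_estimate(
    treated_pre=[2.0, 2.2, 1.8],
    treated_post=[3.1, 3.3, 2.9],
    control_pre=[2.1, 1.9, 2.0],
    control_post=[2.5, 2.3, 2.4],
)
print(round(effect, 2))  # 0.7
```

Here treated firms rise by 1.1 and control firms by 0.4, so the 0.4 common trend is netted out and 0.7 is attributed to treatment.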
A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model rather than because of a stronger routing signal.
Empirical experiments across the five benchmarks show the pre-generation router outperforming the best cascade policy on four of five datasets; the analysis attributes the advantage primarily to avoided generation cost rather than to more accurate routing.
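A back-of-the-envelope cost model makes this attribution concrete. Assume, for illustration, a cascade that always generates with the cheap model first and then escalates a fraction f of queries, versus a pre-generation router that sends the same fraction f straight to the large model. Even with identical routing decisions, the router saves f·c_small per query minus its own overhead. The per-query costs are hypothetical, and verification cost in the cascade is ignored.

```python
def cascade_cost(c_small: float, c_large: float, escalate_frac: float) -> float:
    """Expected cost per query: the small model always runs, then a fraction is escalated."""
    return c_small + escalate_frac * c_large


def router_cost(c_small: float, c_large: float, large_frac: float, c_router: float = 0.0) -> float:
    """Expected cost per query when a router picks one model before any generation."""
    return c_router + (1 - large_frac) * c_small + large_frac * c_large


# Hypothetical per-query costs; both policies send the same 30% of queries to the large model
c_s, c_l, f = 1.0, 10.0, 0.3
print(round(cascade_cost(c_s, c_l, f), 2))        # 4.0
print(round(router_cost(c_s, c_l, f, 0.05), 2))   # 3.75
```

The 0.25 gap equals f·c_small − c_router, which arises without any improvement in which queries are sent where.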
Broader equity markets, proxied by the S&P 500, remain the dominant source of spillovers throughout the sample period.
Directional spillover results from the TVP-VAR indicating the S&P 500 has the largest and persistent net outward spillover contributions over the full sample.
AI-related equities initially act as net transmitters of shocks.
Directional spillover measures from the TVP-VAR showing AI equity group had positive net directional connectedness early in the sample.
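TVP-VAR connectedness measures are commonly built on the Diebold-Yilmaz decomposition: at each date, a variable is a net transmitter when the share of other variables' forecast-error variance it explains (TO) exceeds the share of its own variance explained by others (FROM). The sketch assumes the (generalized) variance decomposition matrix has already been estimated and omits the TVP-VAR step entirely; the 3×3 matrix is invented. Some studies also divide TO and FROM by the number of variables, which rescales the figures but does not change the sign of net connectedness.

```python
def net_connectedness(theta: list[list[float]]) -> list[float]:
    """Net directional connectedness (TO minus FROM), in percent, for each variable.

    theta[i][j] is the share of variable i's forecast-error variance attributed to
    shocks in variable j. Rows are first normalized to sum to 1.
    """
    t = []
    for row in theta:
        total = sum(row)
        t.append([v / total for v in row])
    n = len(t)
    from_others = [100 * sum(t[i][j] for j in range(n) if j != i) for i in range(n)]
    to_others = [100 * sum(t[j][i] for j in range(n) if j != i) for i in range(n)]
    return [to - frm for to, frm in zip(to_others, from_others)]


# Hypothetical decomposition for [S&P 500, AI equities, other asset]
theta = [
    [0.80, 0.10, 0.10],
    [0.30, 0.60, 0.10],
    [0.25, 0.05, 0.70],
]
net = net_connectedness(theta)
print([round(v, 1) for v in net])  # [35.0, -25.0, -10.0]
```

Net values always sum to zero across variables, because every unit of variance transmitted by one variable is received by another.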
The theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M-parameter GPT-2 model.
Empirical experiment reported in the paper: pretraining runs of a 124M-parameter GPT-2 model comparing SignSGD (or Muon) vs baseline SGD/variants; details (number of runs, datasets, seeds) are not provided in the abstract.
Extending the sign operator to matrices preserves the optimal scaling with dimensionality: we provide an equivalent optimal lower bound for the Muon optimizer in the matrix domain.
Theoretical extension of the analysis to matrix-valued problems and derivation of a matching optimal lower bound for the Muon optimizer, demonstrating preserved scaling.
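For a matrix G with singular value decomposition G = UΣVᵀ, the natural matrix analogue of the elementwise sign is UVᵀ, which maps every nonzero singular value to 1; Muon applies this orthogonalization to its momentum matrix. The sketch approximates UVᵀ with a plain cubic Newton-Schulz iteration in pure Python. It illustrates the operator only: Muon implementations commonly use a tuned higher-order polynomial, and the helper names here are hypothetical.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matrix_sign(g: Matrix, steps: int = 30) -> Matrix:
    """Approximate U @ V.T from the SVD g = U S V.T via cubic Newton-Schulz.

    Each singular value s follows s <- 1.5*s - 0.5*s**3, which converges to 1 for
    0 < s < sqrt(3). Dividing by the Frobenius norm puts all singular values in (0, 1].
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x


# A symmetric positive-definite matrix has U = V, so its matrix sign is the identity
print(matrix_sign([[3.0, 1.0], [1.0, 2.0]]))  # approximately [[1, 0], [0, 1]]
```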
SignSGD effectively reduces the complexity by a factor of d under sparse noise, where d is the problem dimension (comparison of SignSGD upper bound with SGD lower bound shows a factor-d improvement).
Theoretical comparison between the derived upper bound for SignSGD and the derived lower bound for SGD within the paper, under the separable/sparse noise model and specified smoothness assumptions.
Under this distinct problem geometry (ℓ1-stationarity, ℓ∞-smoothness, separable noise), we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD.
Theoretical derivation of both upper bounds (for SignSGD) and matching lower bounds (for the problem class) presented in the paper; proofs establishing tightness.
By analyzing sign-based optimizers under ℓ1-norm stationarity, ℓ∞-smoothness, and a separable noise model, we can better capture the coordinate-wise nature of signed updates and overcome the barrier that prevents sign-based methods from outperforming SGD in standard settings.
Theoretical analysis in the paper introducing these alternative geometric/assumption settings (ℓ1-stationarity, ℓ∞-smoothness, separable noise) and deriving results under these assumptions.
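The coordinate-wise character of signed updates is visible in the update rules themselves. SGD moves each coordinate in proportion to its gradient, x ← x − η·g, whereas SignSGD moves every coordinate by the same step, x ← x − η·sign(g), so gradient magnitudes are discarded and no single large or noisy coordinate dominates the step. The sketch is a minimal illustration of the two rules, not the paper's analysis or experimental code.

```python
def sign(v: float) -> int:
    """-1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sgd_step(x: list[float], grad: list[float], lr: float) -> list[float]:
    """Plain SGD: each coordinate moves in proportion to its gradient."""
    return [xi - lr * gi for xi, gi in zip(x, grad)]


def signsgd_step(x: list[float], grad: list[float], lr: float) -> list[float]:
    """SignSGD: each coordinate moves by exactly lr against the sign of its gradient."""
    return [xi - lr * sign(gi) for xi, gi in zip(x, grad)]


# Gradient coordinates spanning six orders of magnitude all move by the same amount
print(signsgd_step([0.0, 0.0, 0.0], [1000.0, 0.001, -2.0], 0.1))  # [-0.1, -0.1, 0.1]
```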
The results imply that early intervention in AI-driven economies is urgent if extreme inequality and the loss of redistribution options are to be avoided.
Synthesis and policy discussion in the paper based on the finite-time singularity, super-exponential divergence of wealth ratios, and the policy-irreversibility result.
Under mild conditions, the system exhibits a finite-time singularity where AI capability, AI capital, and financial capital diverge.
Analytical dynamical-systems analysis and proofs in the paper demonstrating finite-time blow-up (singularity) of A (AI capability), K_a (AI capital), and K_f (financial capital) for parameter ranges satisfying the stated mild conditions.
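The mechanism behind a finite-time singularity can be seen in a one-dimensional toy model with super-linear feedback, in which growth accelerates faster than exponentially. This illustrates the general phenomenon under assumed dynamics; it is not the paper's three-variable system in A, K_a, and K_f:

```latex
\begin{aligned}
\frac{dx}{dt} &= c\,x^{1+\varepsilon}, \qquad x(0) = x_0 > 0,\; c > 0,\; \varepsilon > 0,\\
\int_{x_0}^{x} u^{-(1+\varepsilon)}\,du = c\,t
&\;\;\Longrightarrow\;\; \frac{x_0^{-\varepsilon} - x^{-\varepsilon}}{\varepsilon} = c\,t,\\
x(t) &= \bigl(x_0^{-\varepsilon} - c\,\varepsilon\,t\bigr)^{-1/\varepsilon},\\
t^{*} &= \frac{1}{c\,\varepsilon\,x_0^{\varepsilon}} \quad \text{(the time at which } x(t) \to \infty\text{)}.
\end{aligned}
```

For ε = 0 the same equation gives ordinary exponential growth with no singularity, which is why super-linear feedback is the key condition.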
Users maintain a moderate level of trust in AI even when their decisions diverge from those of AI.
Reported descriptive/analytic finding from the experiment with 59 pre-service teachers indicating measured trust remained at a moderate level in inconsistent decision conditions.
The proportion of consistent decisions significantly moderates the impact of AI-assisted decision-making paradigms on users' confidence levels.
Moderation analysis reported in the study (N=59); authors indicate that proportion of consistent human-AI decisions significantly moderates the effect of AI-assisted decision-making paradigm on confidence.
Consistency between human and AI decisions significantly enhances task performance.
Within-subject consistency manipulation in the experimental sample of 59 pre-service teachers; authors report significant positive association between proportion of consistent decisions and measured task performance.
Consistency between human and AI decisions significantly enhances users' confidence.
Within-subject manipulation of human-AI consistency in the study (N=59); authors report a significant positive effect of consistency on users' confidence in their statistical models.
Consistency between human and AI decisions significantly enhances users' trust in AI.
Within-subject manipulation of human-AI consistency in the experiment with 59 pre-service teachers; authors report a significant positive effect of consistency on users' trust in AI, as measured and tested in their models.
When human-AI decision consistency is taken into account, AI-assisted decision-making paradigms influence task performance indirectly through a sequential psychological pathway involving users’ confidence and their trust in the AI.
In the same experimental sample (N=59), structural equation modeling showed a significant indirect (mediated) pathway from AI-assisted paradigms → users' confidence → trust in AI → task performance, with human-AI consistency considered as a moderator.
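In a serial mediation model of this kind, the indirect effect is the product of the path coefficients along the chain X → M1 → M2 → Y. Because the moderation finding concerns the paradigm → confidence link, one natural formalization lets human-AI consistency W shift that first path, a1 + a3·W, which makes the whole indirect effect conditional on W. The sketch assumes that first-stage moderation structure and uses made-up coefficients; it does not reproduce the paper's estimates.

```python
def serial_indirect(a1: float, d21: float, b2: float) -> float:
    """Indirect effect X -> M1 -> M2 -> Y as the product of the three path coefficients."""
    return a1 * d21 * b2


def conditional_serial_indirect(a1: float, a3: float, w: float, d21: float, b2: float) -> float:
    """Same effect when W moderates the X -> M1 path, so a1 becomes a1 + a3 * w."""
    return serial_indirect(a1 + a3 * w, d21, b2)


# Hypothetical coefficients: X = paradigm, M1 = confidence, M2 = trust,
# Y = task performance, W = share of consistent human-AI decisions
for w in (0.25, 0.5, 0.75):
    print(w, round(conditional_serial_indirect(0.2, 0.4, w, 0.5, 0.6), 3))
```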
Post-hoc SHAP attribution reveals that complaint recurrence and neighborhood-level statistics are stronger predictors of actionable violations than raw complaint volume.
Empirical claim based on post-hoc SHAP feature-attribution analysis applied to the paper's models; the excerpt reports a relative feature importance finding but provides no numeric effect sizes or sample counts.
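SHAP scores are Shapley values: each feature's attribution is its average marginal contribution to the prediction over all orderings of the features. The sketch computes exact Shapley values by brute-force enumeration for a toy scoring function. The feature names, the scoring function, and the single all-zeros baseline are hypothetical, and practical SHAP tooling uses faster model-specific estimators rather than this exponential enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley attributions of f(x) - f(baseline) by enumerating feature coalitions.

    Features outside a coalition are set to their baseline value. Cost grows as 2**n.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_score(z):
    """Hypothetical violation-risk score with one interaction term."""
    recurrence, neighborhood_rate, volume = z
    return (2.0 * recurrence + 1.5 * neighborhood_rate + 0.2 * volume
            + 0.5 * recurrence * neighborhood_rate)


phi = shapley_values(toy_score, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
print([round(v, 3) for v in phi])  # [2.25, 1.75, 0.2]
```

The interaction term's 0.5 is split evenly between the two features that share it, and the attributions sum to f(x) − f(baseline).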
We formalize each domain as a Markov Decision Process (MDP) in which equitable classification coverage is a first-class reward objective.
Methodological specification in the paper asserting each operational domain was modeled as an MDP with equity-aware reward structure. No further empirical details in the excerpt.
The proposed technique is designed to maximize throughput, minimize misclassification cost, and actively narrow historical equity gaps in service delivery.
Stated design objectives of the RL approach in the paper. No quantified outcomes or evaluation reported in the provided text.
Rather than replacing human classifiers, our agents act as intelligent intake routers that learn to assign incoming complaints to action categories: escalate, batch, defer, inspect now.
Descriptive claim of agent behavior and intended design; asserts agents perform routing decisions into four action categories. No empirical performance numbers provided in the excerpt.
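The excerpt names the objectives (throughput, misclassification cost, equity gaps) and the four routing actions, but not the reward formula. One minimal way to make equitable coverage a first-class term in an MDP is a weighted, scalarized per-step reward. Everything below, including the weights, the asymmetric cost table, and the max-minus-min equity gap, is a hypothetical sketch rather than the authors' specification.

```python
from enum import Enum


class Action(Enum):
    ESCALATE = "escalate"
    BATCH = "batch"
    DEFER = "defer"
    INSPECT_NOW = "inspect now"


URGENT = {Action.ESCALATE, Action.INSPECT_NOW}


def misclassification_cost(chosen: Action, correct: Action) -> float:
    """Hypothetical asymmetric cost: under-reacting to an urgent complaint costs most."""
    if chosen == correct:
        return 0.0
    if correct in URGENT and chosen not in URGENT:
        return 5.0
    return 1.0


def equity_gap(coverage_by_area: dict[str, float]) -> float:
    """Spread between best- and worst-served areas' coverage rates (0 means even coverage)."""
    rates = coverage_by_area.values()
    return max(rates) - min(rates)


def step_reward(decisions: list[tuple[Action, Action]],
                coverage_by_area: dict[str, float],
                w_throughput: float = 1.0,
                w_error: float = 1.0,
                w_equity: float = 10.0) -> float:
    """Complaints routed, minus misclassification cost, minus an equity-gap penalty.

    decisions holds (chosen, correct) action pairs for complaints handled this step.
    """
    throughput = len(decisions)
    error = sum(misclassification_cost(c, t) for c, t in decisions)
    return (w_throughput * throughput
            - w_error * error
            - w_equity * equity_gap(coverage_by_area))


r = step_reward([(Action.DEFER, Action.INSPECT_NOW), (Action.BATCH, Action.BATCH)],
                {"area_a": 0.6, "area_b": 0.9})
print(round(r, 2))  # -6.0
```

Because the equity gap enters the reward directly, a policy that raises throughput by neglecting one area is penalized at every step rather than only in a post-hoc audit.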
We develop an equity-centered reinforcement learning (RL) framework that augments call classification capacity across six New York City Department of Buildings operational domains (boiler safety, crane and derrick oversight, heat and hot water, housing complaint triage, scaffold safety, and Natural Area District protection).
Methodological development described in the paper; claimed application domain spans six named DOB operational areas. No evaluation metrics or sample sizes provided in the excerpt.
U.S. lawmakers and agencies have advanced standards, testing, and procurement oversight related to AI as the AGI race tightens.
Reported in the paper as a synthesis of recent policy and agency activity (standards, testing programs, procurement oversight); descriptive summary rather than a quantified empirical analysis (no sample size reported).
So far in 2026, agentic coding automation has advanced, with tools that enable end-to-end planning, coding, and debugging.
Asserted in the paper as an observed trend through 2026, based on examples of tooling and product announcements; presented descriptively without a stated empirical sample or controlled evaluation.
Milestones in 2025 also include early regulatory actions.
Reported in the paper's synthesis of 2025 events; based on review of policy developments and announcements rather than a quantitative evaluation (no sample size reported).
Milestones in 2025 highlight the broad adoption of multimodal and agentic AI.
Stated in the paper as part of a narrative synthesis of 2025 milestones; presented as an observational summary drawing on literature, industry reports and documented deployments rather than a systematic empirical study (no sample size or statistical analysis reported).