Evidence (6869 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Governance
Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs.
Empirical comparison in the paper between single-threshold scoring and tail-inclusive (continuous/unbounded) scoring on identical forecast outputs, showing sign reversal of the capability–accuracy relationship (numerical details not provided in excerpt).
A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect.
Within-family empirical comparisons using Llama-3.1 variants examining effects of model scale and post-training (fine-tuning) on forecasting calibration (details and sample sizes not provided in excerpt).
A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put.
Per-quantile decomposition analyses of model predictive distributions reported in the paper, showing quantile-specific changes (specific quantitative results not given in excerpt).
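One way to picture a per-quantile decomposition is to score each predicted quantile separately with the pinball (quantile) loss. The sketch below is illustrative only: the excerpt does not say which scoring rule the paper uses, and the two forecasts, the outcome value, and the function names are hypothetical.

```python
def pinball_loss(y_true, q_pred, tau):
    """Pinball (quantile) loss of one predicted quantile at level tau."""
    diff = y_true - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff


def per_quantile_loss(y_true, quantile_forecast):
    """Return {tau: loss} so each part of the distribution is scored on its own."""
    return {tau: pinball_loss(y_true, q, tau) for tau, q in quantile_forecast.items()}


# Hypothetical forecasts (assumption): identical lower tail, different upper tail.
conservative = {0.1: 80.0, 0.5: 100.0, 0.9: 130.0}
aggressive = {0.1: 80.0, 0.5: 110.0, 0.9: 300.0}
outcome = 95.0  # hypothetical realized value after growth levels off

loss_c = per_quantile_loss(outcome, conservative)
loss_a = per_quantile_loss(outcome, aggressive)

for tau in sorted(loss_c):
    print(f"tau={tau}: conservative={loss_c[tau]:.2f} aggressive={loss_a[tau]:.2f}")
```

In this toy case the two forecasts score the same at the 0.1 quantile and diverge at the 0.9 quantile, which is the kind of upper-tail concentration the claim describes.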
The pattern replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation.
Empirical replication reported on multiple real-world datasets (COVID-19, measles, housing markets, hyperinflation) presented in the paper (dataset sizes not provided in excerpt).
The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control.
Results on the authors' released simulated benchmark (ForecastBench-Sim) using synthetic SIR epidemic simulations and a matched linear-control experiment reported in the paper (specific number of simulations or runs not stated in excerpt).
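For readers unfamiliar with the setup, a synthetic SIR epidemic and a linear comparison series can be generated in a few lines. This is a minimal sketch, not the ForecastBench-Sim generator: the discrete-time update, the parameter values, and the definition of "matched" used here (same starting value and same average slope up to the epidemic peak) are all assumptions.

```python
def simulate_sir(beta, gamma, s0, i0, r0, steps):
    """Discrete-time SIR model; returns the infected count at each step."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    infected = [i]
    for _ in range(steps):
        new_infections = beta * s * i / n  # contacts between S and I
        new_recoveries = gamma * i         # constant recovery rate
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        infected.append(i)
    return infected


def linear_control(start, slope, steps):
    """Straight-line series used as a non-epidemic comparison (assumed design)."""
    return [start + slope * t for t in range(steps + 1)]


# Hypothetical parameters (assumption): basic reproduction number beta/gamma = 4.
epidemic = simulate_sir(beta=0.4, gamma=0.1, s0=990, i0=10, r0=0, steps=100)
peak = max(epidemic)
peak_step = epidemic.index(peak)

# "Matched" here means: same start, same average slope up to the peak.
control = linear_control(epidemic[0], (peak - epidemic[0]) / peak_step, steps=100)

print(f"peak of {peak:.1f} infected at step {peak_step}")
```

The epidemic series grows superlinearly, peaks, and then declines, while the control keeps rising at a constant rate, so a forecaster that extrapolates early growth is penalized only on the epidemic series.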
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change ... more capable models produce worse distributional forecasts.
Empirical experiments reported in the paper comparing LLMs of varying capability on forecasting tasks with superlinear growth and regime-change tail risk; uses distributional forecast evaluation across models (no sample size reported in excerpt).
Regulatory frameworks that address only downstream applications leave the upstream concentration of infrastructural power largely intact.
Policy analysis and theoretical critique of regulatory approaches; argument based on the distinction between upstream infrastructure and downstream applications (qualitative).
Authority in AI systems is exercised not through formal jurisdiction but through infrastructural chokepoints and dependency pathways that precede and condition law.
Genealogical and infrastructural analysis; theoretical argument emphasizing chokepoints and dependency relations (qualitative).
Digital colonialism is distinct from surveillance capitalism: AI extends historical patterns of dispossession and epistemic domination beyond the commodification of individual behavior by embedding extractive and classificatory logics within data architectures, models, and standards.
Conceptual distinction developed via literature review, political-theoretical argumentation, and genealogical analysis (qualitative).
Contemporary biometric and algorithmic systems show continuities with colonial identification infrastructures.
Genealogical analysis and engagement with decolonial scholarship tracing historical continuities (qualitative, no quantitative sample).
AI systems deployed for identification, classification, and governance are the domains where sovereignty is most visibly reconfigured.
Analytic focus and genealogical tracing within the paper; literature review and conceptual examples of identification/classification systems (no quantitative sample reported).
AI constitutes a historically continuous yet technologically novel form of colonial power, shifting sovereignty from territorial authority toward infrastructural and algorithmic control (termed "infrastructural sovereignty").
Theoretical argument and genealogical analysis drawing on political theory and decolonial scholarship; conceptual synthesis presented in the paper (no empirical sample size reported).
In the absence of communicative and institutional safeguards, individually adaptive delegation aggregates into a systemic collective action problem (modeled as a prisoner's dilemma), producing a sociotechnical lock-in that degrades shared epistemic standards.
Game-theoretic analysis in the paper demonstrating aggregation effects and mapping them to a prisoner's-dilemma–style collective action problem (theoretical modeling, no empirical sample).
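The prisoner's-dilemma structure can be made concrete with a two-player payoff table. The sketch below is an illustration under stated assumptions: the payoff numbers are hypothetical, and mapping "delegate" to defection and "verify" to cooperation is an interpretation of the claim, not the paper's model.

```python
# Payoffs as (row player, column player); higher is better. Hypothetical values.
# "verify"   = keep checking AI output (cooperate, an assumed mapping)
# "delegate" = offload judgement without checking (defect, an assumed mapping)
PAYOFFS = {
    ("verify", "verify"): (3, 3),
    ("verify", "delegate"): (0, 5),
    ("delegate", "verify"): (5, 0),
    ("delegate", "delegate"): (1, 1),
}
ACTIONS = ("verify", "delegate")


def best_response(opponent_action):
    """Action that maximizes the row player's payoff against a fixed opponent."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])


def is_prisoners_dilemma(payoffs):
    """Check the standard ordering: temptation > reward > punishment > sucker."""
    temptation = payoffs[("delegate", "verify")][0]
    reward = payoffs[("verify", "verify")][0]
    punishment = payoffs[("delegate", "delegate")][0]
    sucker = payoffs[("verify", "delegate")][0]
    return temptation > reward > punishment > sucker


for other in ACTIONS:
    print(f"if the other player chooses {other}, best response is {best_response(other)}")
print("prisoner's dilemma:", is_prisoners_dilemma(PAYOFFS))
```

Delegating is the best response to either choice, yet mutual delegation pays less than mutual verification, which is the individually rational but collectively worse outcome the claim refers to.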
The complementarity thesis is an over-simplification of the modalities of human-AI interaction and the possibility-space for both individual and collective action that human-AI interaction potentiates.
Theoretical argumentation and conceptual analysis presented in the paper (no empirical data reported).
There are findings indicating that public R&D expenditures are not used effectively and efficiently (because public R&D shows a negative relationship).
Inference based on random-effects regression results showing the negative relationship (G8 + Turkey, 2010-2020).
There is a negative relationship between economic growth and the number of artificial intelligence patents.
Panel regression (random effects) results are reported (G8 + Turkey, 2010-2020); the economic growth variable (probably the GNP growth rate) is reported to show a negative relationship with AI patent counts.
There is a negative relationship between public R&D expenditures and the number of artificial intelligence patents.
Results reported from a random-effects panel regression (G8 + Turkey, 2010-2020); the public R&D expenditure variable is reported to show a negative relationship with the number of AI patents.
Science-to-technology knowledge flow in AI has been insufficiently examined in a systematic and structural way.
Literature-gap claim in the paper motivating the study.
Highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation.
Synthesis of quantitative (coverage/reuse statistics) and qualitative analyses (narrative framing, taxonomy mapping) from the Benchmarking-Cultures-25 project; interpretive conclusion drawn by the authors.
Authors of many 'general knowledge application' benchmarks claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math).
Content analysis of the benchmarks in the dataset showing topical focus (counts/observations indicating predominance of STEM/math topics) versus broader claimed measurement scope.
Qualitative analysis shows many 'general knowledge application' benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI.
Qualitative content analysis of benchmark descriptions and builder narratives in the dataset; authors report themes where construct validity is downplayed and AGI progress is emphasized.
38.5% of highlighted benchmarks appear in just one release.
Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks); the paper reports the share (38.5%) of benchmarks that appear in only a single model release.
The evaluation landscape is fragmented with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder.
Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks). The paper reports the share (63.2%) based on counts of builders per highlighted benchmark.
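A share of this kind can be computed from a mapping of each benchmark to the set of builders (or releases) that highlight it. The sketch below uses toy data; the benchmark and builder names are hypothetical and the function is not taken from the Benchmarking-Cultures-25 project.

```python
def single_use_share(usage):
    """Fraction of benchmarks that appear with exactly one builder (or release).

    usage: dict mapping benchmark name -> set of builders (or releases).
    """
    if not usage:
        return 0.0
    single = sum(1 for users in usage.values() if len(users) == 1)
    return single / len(usage)


# Toy data (assumption): four benchmarks, three of them used by one builder only.
toy_usage = {
    "bench_a": {"builder_1"},
    "bench_b": {"builder_1", "builder_2"},
    "bench_c": {"builder_3"},
    "bench_d": {"builder_2"},
}

share = single_use_share(toy_usage)
print(f"{share:.1%} of benchmarks are used by a single builder")
```

Applied to 231 benchmarks, the reported 63.2% and 38.5% would correspond to roughly 146 and 89 benchmarks respectively; these counts are back-calculated here and are not stated in the excerpt.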
In strategic decision scenarios, individuals may modify their features after deployment, inducing a post-deployment distribution shift; this strategic manipulation creates a mismatch between the non-strategic prior learned during pretraining and the post-manipulation strategic prior, which leads to systematic prediction bias.
Conceptual/theoretical claim stated in the paper that strategic feature manipulation causes distribution shift and mismatch between learned prior and strategic prior; the abstract asserts this as a cause of systematic prediction bias. No empirical sample sizes given in the abstract.
Content filtering (blocking searches for Gaza War and Tulsa race massacre).
Documented cases of content filtering cited/synthesized in the paper (specific blocked search topics reported).
AI cataloguing failures (an F1 score of 26% for subject headings).
Empirical studies of AI accuracy in cataloguing synthesized by the paper (reported F1 score for subject heading assignment).
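For context, F1 is the harmonic mean of precision and recall, so a low value means the system both assigns many wrong headings and misses many correct ones. The counts below are hypothetical and chosen only to land on 0.26; the underlying studies' confusion counts are not given in the excerpt.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall; returns 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (assumption): 26 correct headings, 74 spurious, 74 missed.
f1 = f1_score(true_positives=26, false_positives=74, false_negatives=74)
print(f"F1 = {f1:.2f}")
```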
Twin agents dissolve that boundary, raising a class of trust calibration challenge these frameworks were not designed to handle.
Argument and design observations from the authors' ongoing project presented in the paper; conceptual claim explaining why existing frameworks may be insufficient for twin agents.
When a human colleague doubts a twin agent's output, they face three failure modes (a schema gap, an epistemic gap, and a model artifact) with no reliable attribution path between them.
Conceptual taxonomy derived from the authors' early design observations; presented as an identified set of failure modes in the paper (qualitative, no numeric sample reported in abstract).
Drawing on early design work in an ongoing project, we identify a trust calibration problem specific to this approach.
Based on the authors' early design work (qualitative/design research) described in the paper; no sample size or quantitative metrics reported in the abstract.
Challenges including algorithmic bias, data privacy concerns, high costs, and skill gaps persist across contexts.
Cross-study synthesis of barriers and challenges reported in the 21 included studies spanning multiple contexts.
SMEs face unique resource constraints yet lag in AI-HRM adoption.
Synthesis conclusion from the systematic review of 21 included studies (published 2019–2026) comparing adoption patterns and barriers for SMEs.
Greater automation can obscure rather than eliminate failure modes.
Analytical claim in paper arguing that increased automation hides failures; presented as an interpretive finding rather than a quantified experimental result in the excerpt.
End-to-end autonomous systems have not yet consistently reached major-venue acceptance standards.
Paper's statement based on review of acceptance/peer-review outcomes and standards as of April 2026; no numeric acceptance-rate data presented in the excerpt.
Research code lags far behind pattern-matching benchmarks.
Paper's evaluative claim from its experiments/coding analysis indicating code produced for research tasks is weaker than benchmark performance on pattern-matching tasks; excerpt contains no numerical comparison.
Generated ideas often degrade after implementation.
Paper statement about the gap between idea generation and implemented results reported in the Creation-phase analysis; no quantified follow-up study reported in the excerpt.
AI remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment.
Summary claim from the paper's end-to-end lifecycle analysis indicating limitations on novelty and experimental rigor; no numeric performance metrics provided in excerpt.
Frontier LLMs fail to judge novelty reliably.
Paper's claim from its Validation-phase analysis that models do not reliably assess novelty; excerpt contains no underlying experimental sample or validation metrics.
Frontier LLMs miss hidden errors.
Qualitative statement from paper indicating models fail to detect some latent or subtle errors in research artifacts; no numeric evaluation provided in excerpt.
Under scientific pressure, even frontier LLMs still fabricate results.
Reported observation in paper about model behavior under scientific-use conditions; no specific quantitative experiments or sample sizes given in the excerpt.
Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure.
Critical review of deployed system design choices and governance mechanisms; authors assert that attention and tooling focus on observable product-mention-level interventions while higher-tier influences lack measurement and disclosure frameworks.
These tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes.
Analytical claim supported by examples and discussion of system architectures (e.g., RAG, agentic pipelines) showing how interventions at different stages map to the taxonomy; no quantitative evaluation reported in excerpt.
Generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels.
Conceptual argument backed by analysis of how generative models produce outputs and how interventions can operate on latent variables of generation; illustrated via taxonomy in the paper rather than quantified empirical tests.
Empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users.
Reference to prior empirical studies (unspecified in the excerpt) showing user failure to detect embedded ads in LLM outputs; presented as an empirical finding rather than new experimental data in this paper.
Management shareholding and analyst attention amplify the debt-cost penalty faced by AI washing firms.
Heterogeneity/interaction analyses showing larger post-shock financing-cost increases for AI washing firms with higher management shareholding and greater analyst attention (descriptive of moderator effects; no sample sizes in abstract).
Difference-in-differences estimations reveal that AI washing firms experience a 12.5 basis point relative increase in debt financing cost afterward.
Difference-in-differences estimations comparing AI washing firms to others before and after the FYP shock; effect reported as 12.5 basis points increase in debt financing cost (sample size not stated in abstract).
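The logic of a difference-in-differences estimate can be shown with the basic two-group, two-period calculation. This is a sketch only: the paper's regression specification, controls, and data are not given in the abstract, and the debt-cost values below are hypothetical numbers chosen so that the toy result equals 12.5 basis points.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences: treated change minus control change."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical debt financing costs in basis points (assumption, not paper data).
ai_washing_pre = [500.0, 510.0]
ai_washing_post = [520.0, 530.0]
other_firms_pre = [480.0, 490.0]
other_firms_post = [487.5, 497.5]

effect_bp = did_estimate(
    ai_washing_pre, ai_washing_post, other_firms_pre, other_firms_post
)
print(f"relative increase: {effect_bp} basis points")  # 12.5 with these toy inputs
```

Subtracting the control group's change removes any cost movement common to all firms after the shock, so the remainder is attributed to AI washing status, provided the two groups would otherwise have followed parallel trends.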
Standard health system digital transformation policy, which typically addresses only the threshold failure through individual incentives, is predicted to systematically produce the partial adoption trap.
Model prediction contrasting full policy architecture vs. conventional policies that focus solely on individual incentives; analytical conclusion that such limited policies leave other failure modes unaddressed and therefore lead to stable partial adoption. Theoretical model; no empirical sample.
The barrier-lowering benefit of failed attempts is offset when trust erosion is rapid.
Model analysis combining cost-ratchet dynamics and trust erosion parameters; results showing interaction where fast trust erosion negates barrier reductions. Theoretical simulations/derivations; no empirical sample.
These failure modes are most severe precisely for the technologies with the greatest systemic value: the Value-Adoption Paradox.
Analytical result from the model showing failure-mode severity as a function of systemic value; theoretical identification of a paradox where higher systemic-value technologies face stronger coordination/trust/cultural barriers. Theoretical derivation; no empirical sample.
The basin of attraction of the partial adoption trap is enlarged by a cultural failure arising from negative coordination norms among doctors.
Model analysis including cultural coordination norms; theoretical demonstration that negative norms exacerbate partial adoption equilibria. Theoretical model; no empirical sample.
The basin of attraction of the partial adoption trap is enlarged by a trust failure arising from the organisation's inability to credibly commit to sharing productivity gains.
Model extension incorporating organisational commitment/transfer of gains; analytical results showing trust/commitment constraints increase stability of partial adoption. Theoretical model; no empirical sample.