Evidence (8570 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Adoption
Using a frontier model's system prompt to supply the procedure exposes proprietary procedures to third-party providers.
Author statement describing privacy/proprietary risk as a cost of the system-prompt approach (qualitative claim).
Using a frontier model's system prompt to supply the procedure requires a frontier model for every conversation.
Author statement describing operational/cost trade-offs associated with the system-prompt approach (qualitative claim).
Using a frontier model's system prompt to supply the procedure has costs: it consumes the context window.
Author statement referencing trade-offs identified alongside the Dennis et al. result; cost described qualitatively (context window consumption).
Emerging evidence indicates that algorithms often inherit and amplify the historical biases present in training data.
Literature claim in paper referencing 'emerging evidence' and empirical studies (2024–2026) — specific studies, methods, and sample sizes not included in excerpt.
Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs.
Empirical comparison in the paper between single-threshold scoring and tail-inclusive (continuous/unbounded) scoring on identical forecast outputs, showing sign reversal of the capability–accuracy relationship (numerical details not provided in excerpt).
A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect.
Within-family empirical comparisons using Llama-3.1 variants examining effects of model scale and post-training (fine-tuning) on forecasting calibration (details and sample sizes not provided in excerpt).
A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put.
Per-quantile decomposition analyses of model predictive distributions reported in the paper, showing quantile-specific changes (specific quantitative results not given in excerpt).
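One common way to run a per-quantile decomposition is to score each predicted quantile separately with the pinball (quantile) loss. The sketch below assumes that approach; the excerpt does not state which scoring rule the paper uses, and the function names and toy forecasts are hypothetical, chosen only for illustration.

```python
def pinball_loss(y_true, q_pred, tau):
    """Pinball (quantile) loss for one outcome at quantile level tau."""
    diff = y_true - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff


def per_quantile_loss(outcomes, forecasts, taus):
    """Mean pinball loss at each quantile level.

    forecasts[i][j] is the predicted taus[j] quantile for outcomes[i].
    """
    n = len(outcomes)
    return {
        tau: sum(pinball_loss(y, f[j], tau) for y, f in zip(outcomes, forecasts)) / n
        for j, tau in enumerate(taus)
    }


# Toy data (not from the paper): two forecasters that agree on the lower
# tail and the median but differ in how far the upper tail is pushed up.
taus = [0.1, 0.5, 0.9]
outcomes = [100.0, 100.0]
cautious = [[80.0, 100.0, 120.0], [80.0, 100.0, 120.0]]
aggressive = [[80.0, 100.0, 200.0], [80.0, 100.0, 200.0]]

loss_cautious = per_quantile_loss(outcomes, cautious, taus)
loss_aggressive = per_quantile_loss(outcomes, aggressive, taus)
```

In this toy case the two forecasters have identical loss at the 0.1 and 0.5 levels, and the whole difference sits at the 0.9 level. That is the kind of pattern a per-quantile breakdown is meant to expose.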
The pattern replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation.
Empirical replication reported on multiple real-world datasets (COVID-19, measles, housing markets, hyperinflation) presented in the paper (dataset sizes not provided in excerpt).
The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control.
Results on the authors' released simulated benchmark (ForecastBench-Sim) using synthetic SIR epidemic simulations and a matched linear-control experiment reported in the paper (specific number of simulations or runs not stated in excerpt).
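For readers unfamiliar with the setup, a synthetic SIR epidemic can be generated with a few lines of code. The following is a minimal discrete-time sketch, not the ForecastBench-Sim generator. The parameter values are arbitrary, and the `linear_control` helper is a hypothetical stand-in: the excerpt does not say how the paper builds its matched linear control, so a straight line through the first and last infected counts is assumed here.

```python
def simulate_sir(beta, gamma, s0, i0, r0, steps):
    """Deterministic discrete-time SIR model; returns a list of (S, I, R)."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population stays constant
    path = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        path.append((s, i, r))
    return path


def linear_control(series):
    """Hypothetical control: straight line with the same endpoints as series."""
    steps = len(series) - 1
    slope = (series[-1] - series[0]) / steps
    return [series[0] + slope * k for k in range(len(series))]


# Arbitrary illustrative parameters (assumptions, not taken from the paper).
path = simulate_sir(beta=0.4, gamma=0.1, s0=990, i0=10, r0=0, steps=8)
infected = [i for _, i, _ in path]
control = linear_control(infected)
```

Early in the outbreak the infected count grows faster with each step, so the SIR curve is convex and sits below the straight-line control between the shared endpoints. A forecaster extrapolating the two series faces superlinear growth in one case and constant growth in the other.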
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change ... more capable models produce worse distributional forecasts.
Empirical experiments reported in the paper comparing LLMs of varying capability on forecasting tasks with superlinear growth and regime-change tail risk; uses distributional forecast evaluation across models (no sample size reported in excerpt).
The lack of prediction stability and predictability can lead to advertiser-perceivable problems such as repeatability issues, cold start, and under-exploration.
Stated as an intuitive/motivational claim in the paper linking instability to advertiser-facing problems; no empirical quantification provided in the excerpt.
Traditional ads recommendation systems have primarily focused on optimizing for prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG).
Background/contextual claim about prior work and standard practice; stated in the paper as motivation (no empirical evidence provided in the excerpt).
AIO is negatively associated with the carbon emission intensity of upstream suppliers.
Authors report a negative association between firms' AIO and the carbon emission intensity of their upstream suppliers in the empirical results using Chinese listed firms (2010–2023).
AIO is negatively associated with the carbon emission intensity of industry peers.
Authors report a negative association between a firm's AIO and the carbon emission intensity of its industry peers based on their empirical analyses of Chinese listed companies over 2010–2023.
Stronger AIO is associated with lower carbon emission intensity within the focal firm.
Empirical association reported between firm-level AIO (measured via LLMs) and firm carbon emission intensity in the authors' analysis of Chinese listed firms (2010–2023); result described as a negative relationship.
Commercial demand drivers systematically distort finished-goods inventory targets and require integration with sales-and-operations planning for accurate calibration.
Narrative synthesis of studies addressing demand-driver effects on finished-goods targets and recommendations for S&OP integration.
Science-to-technology knowledge flow in AI has been insufficiently examined in a systematic and structural way.
Literature-gap claim in the paper motivating the study.
Highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation.
Synthesis of quantitative (coverage/reuse statistics) and qualitative analyses (narrative framing, taxonomy mapping) from the Benchmarking-Cultures-25 project; interpretive conclusion drawn by the authors.
Authors of many 'general knowledge application' benchmarks claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math).
Content analysis of the benchmarks in the dataset showing topical focus (counts/observations indicating predominance of STEM/math topics) versus broader claimed measurement scope.
Qualitative analysis shows many 'general knowledge application' benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI.
Qualitative content analysis of benchmark descriptions and builder narratives in the dataset; authors report themes where construct validity is downplayed and AGI progress is emphasized.
38.5% of highlighted benchmarks appear in just one release.
Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks); the paper reports the share (38.5%) of benchmarks that appear in only a single model release.
The evaluation landscape is fragmented with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder.
Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks). The paper reports the share (63.2%) based on counts of builders per highlighted benchmark.
Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback, resulting in inefficient exploration and elevated financial risk for advertising platforms.
Argument in the paper contrasting generative-model-based approaches with the authors' proposed solution (conceptual claim; no quantitative backing given in the excerpt).
Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies.
Statement in the paper summarizing limitations of prior RL-based bidding work (qualitative claim; no experimental details or sample size provided in the excerpt).
Early rule-based methods lacked adaptability.
Literature/contextual statement in the paper's introduction summarizing prior approaches to automated bidding (no empirical data or sample size reported).
Twin agents dissolve that boundary, raising a class of trust calibration challenges that these frameworks were not designed to handle.
Argument and design observations from the authors' ongoing project presented in the paper; conceptual claim explaining why existing frameworks may be insufficient for twin agents.
When a human colleague doubts a twin agent's output, they face three failure modes (a schema gap, an epistemic gap, and a model artifact) with no reliable attribution path between them.
Conceptual taxonomy derived from the authors' early design observations; presented as an identified set of failure modes in the paper (qualitative, no numeric sample reported in abstract).
Drawing on early design work in an ongoing project, we identify a trust calibration problem specific to this approach.
Based on the authors' early design work (qualitative/design research) described in the paper; no sample size or quantitative metrics reported in the abstract.
Major open challenges for responsible adoption include reliability, bias, privacy, automation bias, transparency, and evaluation.
Authors' identification of risks and open research challenges based on their review/analysis (conceptual synthesis).
Current AI support for code review remains fragmented, with tools focusing on isolated tasks such as reviewer recommendation, PR description generation, or comment suggestion rather than the end-to-end PR review workflow.
Authors' survey/overview of existing AI tooling for code review described in the paper (conceptual / review-based evidence). No quantitative counts provided in the abstract.
AI coding assistants expand the volume of code requiring review, turning code review into a growing bottleneck.
Authors' analytical claim linking increased code production from AI assistants to increased review workload; presented as an observed/trend claim in the paper rather than supported by a quantified study in the abstract.
Code review has evolved for decades, from informal peer checking to today's pull request (PR) workflows, yet it remains a largely manual, uneven, and cognitively demanding process.
Authors' literature review and historical synthesis of code review practices presented in the paper (conceptual / review-based evidence). No empirical sample or experiment reported in the abstract.
Challenges including algorithmic bias, data privacy concerns, high costs, and skill gaps persist across contexts.
Cross-study synthesis of barriers and challenges reported in the 21 included studies spanning multiple contexts.
SMEs face unique resource constraints yet lag in AI-HRM adoption.
Synthesis conclusion from the systematic review of 21 included studies (published 2019–2026) comparing adoption patterns and barriers for SMEs.
Greater automation can obscure rather than eliminate failure modes.
Analytical claim in paper arguing that increased automation hides failures; presented as an interpretive finding rather than a quantified experimental result in the excerpt.
End-to-end autonomous systems have not yet consistently reached major-venue acceptance standards.
Paper's statement based on review of acceptance/peer-review outcomes and standards as of April 2026; no numeric acceptance-rate data presented in the excerpt.
Research code lags far behind pattern-matching benchmarks.
Paper's evaluative claim from its experiments/coding analysis indicating code produced for research tasks is weaker than benchmark performance on pattern-matching tasks; excerpt contains no numerical comparison.
Generated ideas often degrade after implementation.
Paper statement about the gap between idea generation and implemented results reported in the Creation-phase analysis; no quantified follow-up study reported in the excerpt.
AI remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment.
Summary claim from the paper's end-to-end lifecycle analysis indicating limitations on novelty and experimental rigor; no numeric performance metrics provided in excerpt.
Frontier LLMs fail to judge novelty reliably.
Paper's claim from its Validation-phase analysis that models do not reliably assess novelty; excerpt contains no underlying experimental sample or validation metrics.
Frontier LLMs miss hidden errors.
Qualitative statement from paper indicating models fail to detect some latent or subtle errors in research artifacts; no numeric evaluation provided in excerpt.
Under scientific pressure, even frontier LLMs still fabricate results.
Reported observation in paper about model behavior under scientific-use conditions; no specific quantitative experiments or sample sizes given in the excerpt.
Diagnostics also reveal a small tail of extreme errors for the Random Forest model.
Model diagnostic analyses reported in the paper indicating error distribution and presence of extreme prediction errors (tail).
Unrestricted frontier-scale checkpoint synthesis remains open (i.e., not yet solved).
Authors' assessment in the abstract noting current limits; asserts that unrestricted synthesis at frontier/model-scale has not been achieved.
Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure.
Critical review of deployed system design choices and governance mechanisms; authors assert that attention and tooling focus on observable product-mention-level interventions while higher-tier influences lack measurement and disclosure frameworks.
These tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes.
Analytical claim supported by examples and discussion of system architectures (e.g., RAG, agentic pipelines) showing how interventions at different stages map to the taxonomy; no quantitative evaluation reported in excerpt.
Generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels.
Conceptual argument backed by analysis of how generative models produce outputs and how interventions can operate on latent variables of generation; illustrated via taxonomy in the paper rather than quantified empirical tests.
Empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users.
Reference to prior empirical studies (unspecified in the excerpt) showing user failure to detect embedded ads in LLM outputs; presented as an empirical finding rather than new experimental data in this paper.
Management shareholding and analyst attention amplify the debt-cost penalty faced by AI washing firms.
Heterogeneity/interaction analyses showing larger post-shock financing-cost increases for AI washing firms with higher management shareholding and greater analyst attention (descriptive of moderator effects; no sample sizes in abstract).
Difference-in-differences estimations reveal that AI washing firms experience a 12.5 basis point relative increase in debt financing cost after the FYP shock.
Difference-in-differences estimations comparing AI washing firms to others before and after the FYP shock; effect reported as 12.5 basis points increase in debt financing cost (sample size not stated in abstract).
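The logic of a difference-in-differences estimate can be shown with the basic two-group, two-period calculation. This is a simplified sketch only. The paper's actual specification (controls, fixed effects, standard errors) is not described in the excerpt, the `did_estimate` function is hypothetical, and the numbers below are toy values in basis points, not the paper's data.

```python
from statistics import mean


def did_estimate(rows):
    """2x2 difference-in-differences from (treated, post, outcome) rows."""
    rows = list(rows)  # allow several passes over the data

    def cell_mean(treated, post):
        return mean(y for t, p, y in rows if t == treated and p == post)

    # Change in the treated group minus change in the comparison group.
    treated_change = cell_mean(True, True) - cell_mean(True, False)
    control_change = cell_mean(False, True) - cell_mean(False, False)
    return treated_change - control_change


# Toy debt financing costs in basis points (illustrative values only).
toy_rows = [
    (True, False, 500.0), (True, False, 520.0),    # treated firms, before
    (True, True, 525.0), (True, True, 535.0),      # treated firms, after
    (False, False, 480.0), (False, False, 500.0),  # other firms, before
    (False, True, 490.0), (False, True, 510.0),    # other firms, after
]

effect_bp = did_estimate(toy_rows)
```

Here the treated group's cost rises by 20 basis points and the comparison group's by 10, so the estimate is a 10 basis point relative increase. The interpretation rests on the usual parallel-trends assumption that both groups would have moved together absent the shock.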