Evidence (8570 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Adoption
A-insensitivity acts as a cognitive barrier between beliefs and trust (i.e., it reduces the extent to which beliefs about forecast accuracy are translated into trust).
Interpretation based on experimental findings showing that higher a-insensitivity weakens the predictive relationship between beliefs about accuracy and expressed trust in analysts (derived from measures and analyses in the lab experiment; sample size not reported in abstract).
Decision-makers who are more a-insensitive are less likely to incorporate their beliefs about forecast accuracy into their trust judgments.
Experimental data where participants' a-insensitivity was measured and used to predict the extent to which their beliefs (optimism about accuracy) translate into trust for analysts (moderation/interaction analysis implied; sample size not reported in abstract).
There is a 'speedup illusion' where people have accurate forecasts of independent completion times but significantly underestimate AI-assisted times.
Empirical pattern reported in the abstract: comparison of predicted vs. actual times shows accurate independent forecasts but underestimation of AI-assisted completion times (preregistered study, N = 1237).
A conventional two-arm test understates the algorithmic channel by a factor of two.
Empirical comparison reported in the paper between the three-arm design estimates and conventional two-arm test estimates from the live campaign.
In the same campaign, the creative channel moves female impression share by -0.68 ppt.
Empirical result from the live Meta campaign reported in the paper; measured effect size (-0.68 percentage points).
Adjusting for the realized audience is biased because audience is a post-treatment mediator.
Causal inference argument in paper explaining why conditioning on realized audience induces bias (audience as post-treatment mediator).
Every two-arm test conflates the creative's effect with the algorithm's targeting response.
Theoretical/causal argument presented in the paper about confounding in standard two-arm experiments when algorithmic delivery is endogenous.
Simultaneously, there is a structural shortage of qualified personnel and a gap between the education system and the needs of the economy in Uzbekistan.
Synthesis of statistical data, industry reviews, and regulatory/legal document analysis presented in the paper (no primary survey/sample size reported).
As these systems scale, the bottleneck shifts away from raw model capability toward coordination.
Analytical/argumentative claim in the paper framing a shift in primary constraint; no empirical study or quantified benchmark reported.
AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up.
Statement in paper's motivation/background; no empirical method or sample size reported in the abstract.
A reported limitation is that at this privacy level the released valuations remain noise-dominated; the system's utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.
Authors' limitation/analysis section and experimental observations.
Static temporal knowledge-graph data marketplace designs suffer three coupled failures: (i) stale hybrid index shortcuts reduce recall as edges evolve, (ii) stationary Shapley pricing misattributes value after distribution shifts, and (iii) uncoordinated agents over-consume a shared differential-privacy budget.
Authors' problem statement / conceptual diagnosis presented in the paper (no numeric sample size reported).
Monotonic baselines collapse when extrapolating beyond the training regime (e.g., predicting a 12B model up to 307B tokens) whereas the Shannon Scaling Law remains predictive.
Empirical comparison on the held-out 12B extrapolation: authors report collapse/failure of monotonic baseline scaling laws in that regime contrasted with Shannon law's successful prediction (pooled R^2 reported).
This Shannon perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation.
Theoretical argument derived from the Shannon-Hartley based formulation plus supporting empirical examples claimed in the paper showing non-monotonic (U-shaped) loss/accuracy behavior when SNR is insufficient.
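For reference, the classical Shannon-Hartley theorem that this formulation is said to build on gives the capacity C of a channel with bandwidth B, signal power S, and noise power N. How the paper maps these quantities onto model size, data, and loss is not stated in this summary, so the block shows only the standard communication-theory form.

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right)
```

In this form capacity depends on the ratio S/N, so a larger B cannot make up for a signal-to-noise ratio that collapses.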
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute.
Author assertion based on literature/contextual observation and motivating examples (catastrophic overtraining, quantization-induced degradation) referenced in the paper; no specific numeric sample provided in the excerpt.
Commercial or dual-use AI models and semiconductors do not meet the security exception criteria under GATT Article XXI(b), so security interests should be interpreted narrowly.
Legal argument and interpretive analysis in the paper contending that the GATT Article XXI(b) security exception does not encompass routine commercial or dual-use AI models and semiconductors; doctrinal legal reasoning rather than empirical measurement.
Overusing export controls can complicate dispute resolution and hinder AI progress.
Normative and legal-political argument in the paper: overuse raises legal disputes (e.g., WTO litigation) and may slow cross-border AI development and diffusion (qualitative reasoning).
Overly strict or arbitrary controls may violate WTO obligations.
Legal analysis in the paper arguing that some export controls could conflict with WTO law (GATT) depending on scope and justification; interpretive legal reasoning cited.
The long-term effectiveness of export controls is questionable.
Paper's argumentative assessment drawing on historical examples and theoretical considerations (qualitative reasoning rather than quantitative causal inference).
China responded with export curbs on critical minerals and filed a WTO complaint against the U.S. under GATT.
Factual claim citing China's counter-measures (export curbs) and legal action (WTO complaint under GATT) as described in the paper.
Large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks but their deployment in high-throughput, latency-sensitive environments remains impractical.
Statement about model performance on public benchmarks (upper bounds) and practical deployment constraints (throughput and latency), asserted by authors; no numerical deployment analysis provided in excerpt.
Two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength.
Comparison of jaggedness/local volatility measures and overall scores from the tournament (top-three leaderboard).
Existing strategic-reasoning benchmarks evaluate models on fixed canonical games and may saturate as the frontier improves and fail to generalize to varied real-world strategic environments.
Conceptual critique stated in the paper's motivation/background; no empirical test reported in abstract.
Evaluating state-of-the-art kernel agents on FastKernels, the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×.
Empirical evaluation of multiple state-of-the-art kernel-generation agents on the FastKernels benchmark; aggregate speedup factors reported in abstract. The number of benchmark tasks is likely the FastKernels task set (46), though the abstract does not explicitly state the evaluation sample size for this measurement.
Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones.
Stated as motivating observation in the paper (conceptual/empirical critique of existing benchmark design and incentives). No numerical sample size given in the abstract.
Other changes are subtler and put typical career-growth opportunities at risk, such as receiving feedback from professional networks and developing leadership and mentorship.
Qualitative reports from interview participants (n=24) expressing concerns that AI-driven changes may reduce feedback, leadership development, and mentoring opportunities.
Notable challenges to AI implementation include concerns about algorithmic bias, privacy, transparency, job displacement, organizational culture, and issues related to ethical and legal oversight.
Synthesis of reported challenges across the 29 empirical studies included in the scoping review.
Zero-shot evaluation shows the best positive-query mask success rate at IoU@0.75 remains below 0.17.
Empirical evaluation reported in the paper: zero-shot tests across 26 model configurations with reported mask success rate at IoU@0.75.
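A mask success rate at IoU@0.75 can be read as the share of queries whose predicted mask overlaps the ground truth with intersection over union of at least 0.75. The sketch below is a minimal illustration under that reading. The names `mask_iou` and `success_rate`, the set-of-pixels mask representation, the inclusive threshold, and the handling of empty masks are all assumptions, not the paper's evaluation code.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two masks stored as sets of (row, col) pixels."""
    union = pred | truth
    if not union:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return len(pred & truth) / len(union)


def success_rate(pairs: List[Tuple[Mask, Mask]], threshold: float = 0.75) -> float:
    """Share of (prediction, ground truth) pairs whose IoU reaches the threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for pred, truth in pairs if mask_iou(pred, truth) >= threshold)
    return hits / len(pairs)


# Hypothetical toy data: one 2x2 ground-truth square and three predictions.
truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
pairs = [
    (truth, truth),                     # IoU = 4/4
    ({(0, 0), (0, 1), (1, 0)}, truth),  # IoU = 3/4
    ({(0, 0), (0, 1)}, truth),          # IoU = 2/4
]
rate = success_rate(pairs)  # two of the three predictions reach the threshold
```

A strict threshold like 0.75 penalizes small or irregular targets heavily, because missing a few pixels on a small mask moves the IoU a long way.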
Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35.
Empirical evaluation reported in the paper: zero-shot tests across 26 model configurations with reported Set-F1 metric.
Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image.
Problem characterization / motivation described in the paper (qualitative reasoning about dataset and task properties).
Technical bottlenecks (cross-border data compliance, algorithm interpretability) and ethical challenges (algorithmic bias, privacy infringement, cultural conflicts) are intertwined impediments to intelligent international marketing.
Synthesis of challenges identified across the reviewed literature (systematic review and content analysis, 2010–2025) as reported in the paper.
Traditional international marketing theories, constrained by static assumptions and linear logic, struggle to explain intelligent contexts.
Conclusion from the paper's systematic review and content analysis of core literature (2010–2025); no quantitative test or sample size reported in the summary.
Cost and the lack of an applicable use case are the most cited barriers to AI adoption, followed by a lack of expertise.
Survey question(s) on barriers to adoption in the Census Bureau survey in which respondents reported reasons for not adopting AI; ranking provided in the paper (cost, lack of use case, then expertise).
Intensity-weighted adoption is far lower than the 22.8 percent headline rate.
Survey-derived intensity-weighted measure of AI adoption constructed from the same Census Bureau survey (no numeric value reported in the excerpt).
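The gap between a headline rate and an intensity-weighted rate can be illustrated with a small sketch. The paper's weighting scheme is not given in the excerpt, so this assumes each plant reports an intensity between 0 and 1 (for example, the share of operations using AI) and takes the plain mean. The function names and the five-plant sample are hypothetical.

```python
from typing import List


def headline_rate(intensities: List[float]) -> float:
    """Share of plants reporting any AI use at all (intensity above zero)."""
    return sum(1 for x in intensities if x > 0) / len(intensities)


def intensity_weighted_rate(intensities: List[float]) -> float:
    """Mean intensity across all plants, counting non-adopters as zero."""
    return sum(intensities) / len(intensities)


# Hypothetical sample: three non-adopters, one light user, one heavier user.
plants = [0.0, 0.0, 0.0, 0.1, 0.5]

headline = headline_rate(plants)            # 2 of 5 plants use any AI
weighted = intensity_weighted_rate(plants)  # far lower, since use is shallow
```

The weighted figure can never exceed the headline figure under this definition, and the two coincide only if every adopter uses AI at full intensity.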
Only 22.8 percent of plants report any AI use as of 2021.
Direct descriptive estimate from the Census Bureau survey of manufacturing establishments; year reported as 2021.
ID-centric ranking models fail to generalize in livestreaming recommendation due to the short-lived nature of live rooms and poorly learned item IDs.
Authors' assertion linking the cold-start item ID problem to poor generalization of ID-centric rankers (motivating claim). No specific experimental metrics or sample sizes cited in the excerpt.
A live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state.
Authors' observational/operational claim about livestream characteristics stated in the paper (motivating problem statement). No sample size or quantitative backing provided in the excerpt.
The de-coring and skill-demand changes are concentrated among low entry-threshold, small firms.
Abstract statement reporting heterogeneity: concentration of observed patterns among firms characterized as small and with low entry thresholds.
Both displacement and augmentation exposure are associated with a de-coring pattern: a shallower and more dispersed skill portfolio with within-category importance diverging from share movements.
Empirical description in abstract that both forms of exposure correlate with changes in portfolio depth and dispersion, and with divergence between within-category importance and category shares.
Displacement exposure is negatively associated with the routine cognitive skill share.
Empirical result stated in abstract: negative association between displacement exposure and routine cognitive share, identified using within-firm variation and the constructed exposure measures.
In deployed settings, the effects of AI systems on human agency, creativity, and institutional well-being emerge over time, shaped by repeated interaction, reuse, and integration into real-world workflows, and these dynamics are rarely visible through pre-deployment evaluation or isolated prompt–response analysis.
Argumentative observation based on conceptual reasoning; no empirical data or sample size reported.
The most significant barriers to AI adoption reported by entrepreneurs are human-centred—talent scarcity, organisational resistance, and change management—rather than technology or cost alone.
Theme 'Barriers and the Adoption Journey' from thematic analysis of interviews (n=16); interviewees repeatedly cited human-centred barriers (talent scarcity, resistance, change management) over purely technical/cost barriers.
Because contracts are negotiated by legal departments alone, many apparent legal disputes are incentive misalignment problems that only scientists at the table can correctly diagnose.
Argumentative claim presented in the paper (normative/diagnostic); no empirical study or sample provided in the excerpt.
These failures occur not for scientific reasons but because academics must publish while companies must protect models trained on proprietary data, and no standard contract framework resolves this tension.
The paper presents this as the causal explanation (analytical/argumentative claim); no empirical testing or sample reported in the provided text.
Industry-academia ML collaborations routinely fail to launch.
Asserted in the paper as an empirical observation/statement; no empirical methods, data, or sample size reported in the provided text (argument/anecdote).
People exhibit self-estimate miscalibration: on average they believe they are using AI less than they actually are.
Three pre-registered user studies (combined N = 2691) comparing participants' self-reported AI use against observed/recorded AI use during tasks.
The measurement bias understates substitution effects more than it understates augmentation effects.
Analytical argument and empirical evidence showing directional bias from measurement error that causes estimated substitution (labor displacement) effects to be more severely understated than augmentation (complementarity) effects.
Reweighting platform-based exposure measures to Bureau of Labor Statistics workforce shares attenuates estimates by 42 to 93 percent.
Reweighting exercise where exposure scores built from platform logs are reweighted to match BLS workforce shares and resulting employment estimates are compared; reported attenuation range of 42–93%.
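The reweighting step itself can be sketched as a weighted average computed under two different sets of occupation weights. This shows only the mechanics of moving from platform-user shares to workforce shares, not the paper's employment regressions or its 42 to 93 percent range. All occupation names, exposure scores, and shares below are invented for illustration.

```python
from typing import Dict


def weighted_mean(values: Dict[str, float], weights: Dict[str, float]) -> float:
    """Average of per-occupation values under the given weights."""
    total = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total


# Hypothetical exposure scores and occupation shares (not from the paper).
exposure = {"software": 0.8, "retail": 0.2, "nursing": 0.1}
platform_share = {"software": 0.6, "retail": 0.3, "nursing": 0.1}
workforce_share = {"software": 0.1, "retail": 0.5, "nursing": 0.4}

platform_mean = weighted_mean(exposure, platform_share)
workforce_mean = weighted_mean(exposure, workforce_share)

# Relative drop in average exposure after reweighting to workforce shares.
relative_drop = 1 - workforce_mean / platform_mean
```

In this toy case the platform over-represents the most exposed occupation, so reweighting to workforce shares lowers the average exposure.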
Current regulatory frameworks—designed for human-intermediated payments—are ill-equipped to address the dynamic and decentralised nature of agent-led transactions.
Regulatory and legal analysis asserted in the abstract (argument that existing frameworks are mismatched to agent-led payments).
The article identifies and categorises a range of technical, legal and societal risks, including cybersecurity vulnerabilities, liability gaps, regulatory non-compliance, and potential economic disruption.
Risk identification and categorisation presented in the paper (qualitative analysis and case studies referenced in the abstract). No quantitative risk measurement reported in the abstract.