Evidence (14055 claims)
Adoption (8570 claims)
Productivity (7631 claims)
Governance (6869 claims)
Human-AI Collaboration (6491 claims)
Org Design (4175 claims)
Innovation (4114 claims)
Labor Markets (3566 claims)
Skills & Training (2966 claims)
Inequality (2066 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
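The matrix can be read as direction shares per outcome. The sketch below is a minimal illustration using four rows copied from the table above; it is not part of the source's own tooling. It also surfaces a detail worth noting: in some rows the four direction columns sum to less than the Total column. The page does not explain the difference, so the name `unclassified` is an assumption used here only as a label for that remainder.

```python
# Rows copied from the evidence matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Inequality Measures": (44, 122, 49, 6, 221),
    "Other": (758, 199, 100, 900, 2007),
}
DIRECTIONS = ("positive", "negative", "mixed", "null")


def summarize(counts):
    """Return (share of the row total per direction, remainder not covered by any direction)."""
    *by_direction, total = counts
    shares = {d: n / total for d, n in zip(DIRECTIONS, by_direction)}
    # Hypothetical label: the source does not say what the remainder represents.
    unclassified = total - sum(by_direction)
    return shares, unclassified


if __name__ == "__main__":
    for outcome, counts in ROWS.items():
        shares, unclassified = summarize(counts)
        formatted = ", ".join(f"{d} {s:.1%}" for d, s in shares.items())
        print(f"{outcome}: {formatted}; remainder {unclassified}")
```

Shares are computed against the stated Total rather than the sum of the four columns, so a row with a nonzero remainder will have shares that add up to less than 100%.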
Existing approaches to AI explainability, grounding, and hallucination detection do not address input fidelity because they focus on output quality.
Argument in the paper contrasting prior work on explainability and hallucination detection with the problem of input fidelity; based on literature review and conceptual analysis.
Human advisors suppressed warnings under pressure at two to four times the AI rate.
Comparison between human benchmark (1,201 participants) and LLM outputs (3,360 conversations) in the preregistered experiment; reported suppression rates for humans were 2–4x those for AIs.
Because experienced workers are aging out of the workforce, simultaneous curtailment of formative occupational layers by platforms may create a shortage of workers able to manage complex systems.
Argument combining demographic observation (aging workforce) with the paper's theoretical claim about erosion of entry-level apprenticeship layers; no empirical test or quantified projection provided.
Microsoft's realized routing bias has been voluntarily constrained by a March 2026 multi-model pivot.
Paper's descriptive assessment based on observable product/strategy events (March 2026 pivot) and how that affects routing bias in the comparative mapping.
Other models fail more severely (i.e., worse than the frontier models mentioned).
Comparative results across the 19 evaluated LLMs reported in the experiment indicate worse corruption rates for models not classified as 'frontier'.
Aggressive token compression can paradoxically increase overall cost because it shifts interpretive burden to the model's reasoning phase.
Interpretation/explanation of the experimental result (causal mechanism proposed by authors) linking compression to increased reasoning burden; supported by the reported experiment, but the mechanism is inferential rather than directly measured in the abstract.
There are universal bottlenecks requiring architectural innovations beyond parameter scaling.
Paper interpretation of results and analysis arguing that the observed limitations and asymmetries point to architectural bottlenecks that cannot be resolved solely by increasing model parameters.
Model performance on ImplicitMemBench is far below human baselines.
Paper asserts model scores are 'far below human baselines' after reporting model percentages; the excerpt does not provide the numeric human baseline value.
Models are beginning to be deployed to generate revenue for the companies that created them through advertisements, creating potential conflicts of interest between company incentives and users' best interests.
Conceptual/observational claim advanced in the paper motivated by industry deployment trends and the authors' framework; not a quantified experimental result in the abstract.
Scaling intelligence alone will not solve coordination problems in multi-agent systems; solving them will require deliberate cooperative design, even when helping others costs nothing.
Conclusion drawn from the paper's experimental findings (comparative performance across models and responses to targeted interventions); presented as a general implication in the abstract.
Existing energy-focused guidelines and metrics have seen limited adoption among practitioners, leaving a gap between research and everyday coding practice.
Claim made in paper's background/motivation; no adoption-rate data included in the excerpt.
Unstructured physical trades and high-stakes caretaking roles exhibit absolute resilience to LLM-driven automation (i.e., very low OAI), quantifying a 'Cognitive Risk Asymmetry.'
Empirical classification from computed OAIs showing low exposure for unstructured physical trades and high-stakes caretaking roles; the excerpt does not provide specific OAI values or counts.
Variance-based Human-in-the-Loop (HITL) validation with an expert panel demonstrates a profound cognitive gap: isolated algorithmic probabilities fail to encapsulate the "institutional premium" imposed by experts bounded by professional liability.
Empirical validation procedure reported: variance-based HITL validation involving an expert panel that compared algorithmic scores and expert adjustments, concluding a systematic difference attributed to institutional liability considerations. The excerpt does not give panel size or quantitative variance statistics.
Industry self-regulation has demonstrably failed, motivating the need for IASCA.
Proposal asserts a 'demonstrated failure of industry self-regulation' as rationale for IASCA; no specific empirical studies, incidents, or metrics are cited in the provided text.
Roughly half of the projected LFPR decline to 55% by 2050 is attributable to AI—equivalent to around 10 million lost jobs.
Authors' decomposition/interpretation of conditional forecast results under the rapid scenario reported in the abstract (ties LFPR decline to job-count equivalents).
Our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation.
Comparative claim referencing prior observations in text-to-SQL literature and the authors' audit results on ELT-Bench; no new cross-benchmark quantitative analysis reported in the excerpt.
The measured machine-equivalent work appeared on no financial statement, workforce report, or government statistical return.
Claim about absence of reporting for the deployment's measured work (asserted in the paper for the deployment case).
The AI-as-advisor approach has limitations: people frequently ignore accurate advice and rely too much on inaccurate advice, and their decision-making skills may deteriorate over time.
Paper asserts these limitations in motivation/background and/or derives them from observed behavior in experiments (stated in abstract as known problems with AI-as-advisor).
When asked to choose which information source to give to an AI agent, a large portion of subjects fail to select the more informative one.
Experimental condition where subjects chose which source (prompt vs revealed-preference data) to provide to an AI agent; reported result that a large portion did not choose the more informative source.
The gap in predictive accuracy is driven by subjects' difficulty in translating their own preferences into written instructions.
Further analysis reported in the experiment attributing the observed accuracy gap to subjects' difficulty converting their preferences into prompts (presumably via analysis comparing content of prompts to revealed choices).
The emergence and diffusion of these technologies create an era of labor displacement.
Framed in the paper as a premise motivating policy proposals; presented as a conceptual claim rather than supported by original empirical estimates in the text provided.
Many automotive firms, especially those developing new energy and intelligent vehicles, have suffered financial distress and even exited the market.
Descriptive statement in the paper's introduction/motivation citing observed industry outcomes (financial distress and market exit) among automotive firms focused on NEV and intelligent vehicles.
The dominant mechanism behind the performance drop is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts.
Analysis of issue-type specific detection rates shows Type2_Contextual detection collapses at config_B; interpretation ties this to attention dilution in longer contexts.
Technological transformation in agentic finance is economically inevitable, and proactive intervention is critically urgent.
Author claim synthesizing the paper's argument and modeling results (normative conclusion based on earlier analysis and assertions, not a validated empirical finding).
Surveillance intensity is associated with hyper-vigilance (reported effect = -4.213).
One of the six propositions from the paper's trilevel framework; the abstract reports an effect value of '-4.213' associated with surveillance intensity → hyper-vigilance.
Platform workers receive 36.3% more third-party ratings than traditional workers.
Quantitative synthesis/summary reported in the paper (no primary sample size in abstract); likely aggregated from included studies.
Platform workers experience 59.6% higher digital speed determination than traditional workers.
Quantitative synthesis/summary reported in the paper (no primary sample size given in the abstract); presumably aggregated from included studies comparing platform and traditional workers.
Our findings surface practical limits on the complexity people can manage in human-AI negotiation.
Synthesis claim based on the empirical study varying number of issues and observed decline in performance beyond three issues; presented as a conceptual/practical implication of the results.
Multiple competing arbitrageurs drive down consumer prices, reducing the marginal revenue of model providers.
Analytic argument and empirical/simulation results reported in the paper showing that competition among arbitrageurs lowers prices faced by consumers and decreases marginal revenue for model providers.
Distillation further creates strong arbitrage opportunities, potentially at the expense of the teacher model's revenue.
Experiments or analyses involving model distillation reported in the paper showing that distilled/student models enable profitable arbitrage and may reduce revenue captured by the original teacher model.
The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation.
Interpretation of resume-data patterns: observed dispersion of previously coherent AI practitioners and spread of AI-related vocabulary into other occupational records rather than consolidation into a new occupational cluster.
Beyond an environment-specific optimum, scaling further degrades institutional fitness because trust erosion and cost penalties outweigh marginal capability gains.
Analytical argument from the Institutional Scaling Law together with illustrative examples and discussion of mechanisms (trust erosion, cost penalties) in the paper.
Bias effects vary by vulnerability type, with injection flaws being more susceptible to framing bias than memory corruption bugs.
Subgroup analysis in Study 1 comparing framing sensitivity across vulnerability classes (injection vs memory corruption) within the experiment dataset.
Model convergence in DRL can lead to crowded trades, which has implications for market stability and motivates a robust regulatory framework balancing innovation with market stability.
Analytical argument in the paper linking convergence/crowding to systemic effects; the excerpt does not include empirical market-impact studies, simulations, or measured incidence rates of crowding.
Deploying DRL at scale requires socio-technical infrastructure considerations including algorithmic governance, systemic risk management, and accounting for the environmental cost of large-scale computational finance.
Conceptual and system-level analysis presented in the paper; no empirical auditing data, carbon-footprint measurements, or governance case studies are provided in the excerpt.
The paper addresses two sources of spurious performance: memorization bias from ticker-specific pre-training and survivorship bias from flawed backtesting.
Problem identification and methodological focus: the paper names memorization bias and survivorship bias as primary confounders it aims to mitigate. The excerpt does not detail experiments that quantify the magnitude of those biases or the degree to which they were reduced.
Traditional ex ante regulatory approaches struggle to keep pace with AI development, exacerbating the 'pacing problem' and the Collingridge dilemma.
Theoretical/legal literature review and conceptual argument presented in the paper (no empirical sample or quantitative data reported in the abstract).
Low internal conflict or unanimity can be diagnostic of variance depletion (i.e., exclusion) rather than healthy integration, so governance systems should treat low conflict as a potential red flag until heterogeneity integration is verified.
Interpretive policy implication derived from the model's demonstration that exclusionary processes can produce deceptively low observed disagreement while increasing fragility; this recommendation is based on theoretical reasoning without empirical validation in the paper.
Most existing candidate matching systems act as keyword filters, failing to handle skill synonyms and nonlinear careers, resulting in missed candidates and opaque match scores.
Paper's introductory assertion about limitations of most current systems. The excerpt does not cite empirical studies, statistics, or systematic reviews to substantiate this claim.
TDD (test-driven development) prompting alone increased regressions to 9.94%.
Empirical result reported in the paper comparing a TDD prompting intervention against other workflows on the benchmark (values given in the excerpt).
Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied.
Paper's critique of existing benchmark literature and practices (asserted by authors in background; no specific benchmark survey details in the excerpt).
The paper identifies five structural challenges arising from the memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops.
Qualitative analysis and problem framing presented in the paper (authors' identification of five specific challenges).
AI raises managerial cognitive complexity and creates recurring tensions between algorithmic optimisation and systemic, ethical reasoning.
Theoretical synthesis highlighting emergent tensions from integrating computational optimisation with systems thinking and ethical considerations; conceptual, no empirical tests.
Underprovision of verification is likely if left to market forces because information quality has positive externalities and misinformation imposes negative externalities, justifying public funding, subsidies, or regulation.
Economic reasoning and policy implications drawn from the study's findings and the literature on public goods/externalities.
Censorship, restricted data flows, and government interference fragment markets, limit economies of scale, and favor well-resourced, internationally connected actors—widening capacity gaps.
Interpretive economic analysis grounded in observed access constraints and comparative case material across the three platforms.
Limited data access and censorship reduce the efficacy of AI tools by creating training and validation gaps; legal risks complicate use of proprietary platforms and cloud services.
Interviews describing constraints on data availability and legal/operational barriers to using some platforms and cloud services; interpretive analysis of implications for AI training/validation.
Generative AI increases the volume and sophistication of misinformation (deepfakes, fabricated documents), raises false-positive risks, and can be weaponized by state or nonstate actors.
Interview accounts and qualitative analysis noting observed or anticipated misuse of generative models and associated verification challenges.
Resource constraints—limited staff time, funding, and technical capacity—are recurring operational challenges for these platforms.
Staff and stakeholder interviews plus analysis of organizational reports indicating staffing, funding, and technical limitations.
Platforms experience difficulty building and retaining audience trust and engagement, especially in contexts of high public skepticism or polarization.
Interview data from platform staff describing audience engagement challenges, supported by analysis of audience-focused platform formats and community-reporting strategies.
Platforms face limited or asymmetric access to primary data sources such as platform APIs, state data, and archives.
Interview accounts and document analysis noting restricted API access and barriers to state-held data and archives across the three cases.