Evidence (6,507 claims)

Claim counts by topic:

- Adoption: 7,395
- Productivity: 6,507
- Governance: 5,877
- Human-AI Collaboration: 5,157
- Innovation: 3,492
- Org Design: 3,470
- Labor Markets: 3,224
- Skills & Training: 2,608
- Inequality: 1,835
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
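The matrix above can be cross-checked or queried programmatically; a minimal Python sketch, using three rows copied from the table (the `MATRIX` structure and function name are illustrative, not part of any published dataset):

```python
# Minimal sketch: share of positive findings per outcome, from rows of the
# evidence matrix. Values are copied verbatim from the table above.

MATRIX = {
    # outcome: (positive, negative, mixed, null, total)
    "Task Completion Time": (134, 18, 6, 5, 163),
    "Error Rate": (64, 78, 8, 1, 151),
    "Job Displacement": (11, 71, 16, 1, 99),
}

def positive_share(outcome: str) -> float:
    """Share of claims with a positive direction of finding."""
    pos, *_rest, total = MATRIX[outcome]
    return pos / total

for name in MATRIX:
    print(f"{name}: {positive_share(name):.1%} positive")
```

Note that some rows of the full table (e.g. Other, Governance & Regulation) report totals larger than the sum of the four listed directions, so shares there should be computed against the stated totals.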
Productivity
Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment.
Descriptive claim stated in the paper's introduction/abstract; no empirical data, sample, or methods reported to substantiate this characterization within the text provided.
The two margins interact through a self-undermining feedback that can generate low-archive traps (multiple equilibria with low accumulated public archive).
Dynamic equilibrium analysis in the theoretical model showing interacting feedbacks and possible trap equilibria (model-derived result).
Resolution margin: the probability that posted queries are resolved declines because AI raises contributors' outside options, thinning the contributor pool and creating congestion on the platform.
Mechanism and comparative-static implication produced by the paper's theoretical model; no empirical sample provided in the excerpt.
Flow margin: the posted volume of knowledge-enhancing queries declines as AI resolves more problems privately before they reach the platform.
Mechanism derived in the theoretical model; stated as the flow-margin channel (no empirical quantification in the provided text).
AI reduces archive creation through two distinct margins: a flow margin and a resolution margin.
Analytical decomposition derived within the paper's theoretical model (mechanism claimed by the model).
Generative AI resolves user problems without leaving a public trace, so fewer discussions and solutions reach public platforms.
Stated as an empirical motivation in the paper; no empirical sample or quantified measurement reported in the provided text.
Green AI research has largely measured the footprint of models rather than the downstream workflows in which GenAI is a tool.
Literature review / mapping of recent Green AI literature reported in the paper; descriptive claim about the focus of the field (no sample size or numerical counts reported in the abstract).
Existing benchmarks differ from real usage in programming language distribution, prompt style and codebase structure.
Paper asserts mismatch between existing benchmarks and production usage as motivation for producing a production-derived benchmark (stated differences: language distribution, prompt style, codebase structure).
Replacing deterministic components with probabilistic workflows changes the failure mode: LLM pipelines may generate plausible but incorrect outputs that pass superficial checks and propagate into irreversible actions such as DOI minting and public release.
Conceptual argument supported by the paper's incident descriptions (e.g., a detected coordinate transformation error); the statement is presented as a general risk rationale.
Occupations whose AI-exposed steps are more dispersed across the production workflow (higher fragmentation) exhibit a substantially lower share of their steps actually executed by AI, conditional on AI exposure share.
Empirical regression analysis controlling for share of AI-exposed steps; uses dataset linking O*NET tasks, human AI exposure assessments, Anthropic Economic Index execution outcomes, and GPT-generated workflow orderings (details in Sections 5.1 and 7).
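The fragmentation measure referenced above can be sketched as a dispersion statistic over the workflow positions of AI-exposed steps; the index below is a hypothetical illustration of that idea, not the paper's actual construction:

```python
import statistics

def fragmentation_index(exposed_positions: list[int], n_steps: int) -> float:
    """Dispersion of AI-exposed steps across a workflow of n_steps,
    normalized to [0, 1]. A hypothetical stand-in for the paper's measure."""
    if len(exposed_positions) < 2:
        return 0.0
    spread = statistics.pstdev(exposed_positions)
    max_spread = (n_steps - 1) / 2  # pstdev is maximized at the two endpoints
    return spread / max_spread

# Clustered exposure (consecutive steps) vs dispersed exposure:
clustered = fragmentation_index([3, 4, 5], n_steps=10)
dispersed = fragmentation_index([0, 5, 9], n_steps=10)
print(clustered, dispersed)  # dispersed > clustered
```

Under this toy index, an occupation whose exposed steps sit far apart in the workflow scores higher, matching the qualitative notion of fragmentation in the claim.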
Treated firms' demand for external capital investment falls by just over $220,000 relative to the control group.
RCT with 515 firms; reported dollar-change in external investment demand between treated and control firms.
Despite faster growth, treated firms do not scale inputs proportionally: their demand for external capital investment falls by 39.5% relative to the control group.
RCT with 515 firms; firms reported external capital demand/investment requests; comparison of investment demand between treatment and control groups.
For the private business sector, if the set of automated tasks were frozen in 1950, 87% of TFP growth between 1950 and 2023 would have been eliminated.
Counterfactual growth-accounting exercise that freezes the set of automated tasks at 1950 while allowing capital, labor, and other productivity growth to follow historical rates (simulation based on calibrated accounting).
The sum of "other" TFP growth and average labor productivity growth ($\hat{Z}_t + \hat{\psi}_{\ell t}$) is small — for example, equal to -0.1% per year for the private business sector since 1950.
Growth-accounting decomposition for the private business sector since 1950 using BEA/BLS data in the task-based framework.
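The freeze-the-task-set counterfactual above can be illustrated with back-of-the-envelope arithmetic; a minimal sketch assuming an additive split of TFP growth into an automation component and a residual (the additive form, the 1%/yr total growth rate, and the direct mapping of the 87% eliminated share onto the growth rate are simplifying assumptions, not the paper's calibrated accounting):

```python
# Illustrative counterfactual: if the automation-driven part of TFP growth is
# removed, what cumulative TFP index survives relative to a 1950 baseline?
# Assumed decomposition: g_total = g_automation + g_other (a simplification).

YEARS = 2023 - 1950  # 73 years

def counterfactual_level(g_total: float, share_automation: float) -> float:
    """Cumulative TFP index under the frozen-1950 counterfactual,
    relative to a baseline level of 1.0."""
    g_other = g_total * (1.0 - share_automation)
    return (1.0 + g_other) ** YEARS

# Hypothetical 1%/yr total TFP growth; 87% attributed to task automation,
# leaving 0.13%/yr of growth in the counterfactual.
actual = (1.0 + 0.01) ** YEARS
frozen = counterfactual_level(0.01, 0.87)
print(f"actual index: {actual:.2f}, frozen-1950 index: {frozen:.2f}")
```

The point of the arithmetic is compounding: removing 87% of a growth rate removes far more than 87% of the cumulative level gain over seven decades.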
Under the rapid scenario, economists forecast the share of wealth held by the wealthiest 10% of households rising to 80.0% by 2050.
Conditional forecasts in Key Findings for the economist respondent group under the rapid AI scenario (2050 horizon).
Conditional on the rapid scenario, economists forecast the labor force participation rate falling from its current level of 62% to 55% by 2050.
Conditional forecasts in Key Findings for the economist respondent group under the rapid AI scenario (2050 horizon).
There are macroeconomic risks associated with AI-led unemployment.
Paper's macroeconomic analysis drawing on labor economics and technology adoption research; no quantitative estimates or sample sizes provided in the summary.
Managerial incentives drive premature workforce contraction during AI adoption.
Analytical claim grounded in labor economics and organizational behavior review; the summary indicates examination of managerial incentives but does not report primary empirical tests or sample sizes.
Premature workforce contraction in response to AI adoption foreshadows deeper structural challenges as AI systems mature.
Forward-looking claim based on synthesis of literature and theoretical projection; no empirical quantification or sample provided in the summary.
This pattern of premature workforce reductions reflects longstanding corporate short-termism rather than genuine technological displacement.
The paper's interpretation drawing on labor economics and organizational behavior literature; no empirical study or sample size reported in the summary.
Organizations face mounting pressure to demonstrate immediate returns on AI investments, often through workforce reductions that outpace actual automation capabilities.
Argument in paper citing accelerating AI adoption across sectors and observed managerial responses; no primary dataset or sample size reported in the text.
Applying the Auditor-Corrector methodology to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors — including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth — that penalize correct agent outputs.
Audit results on ELT-Bench identifying categories of benchmark errors (rigid scripts, ambiguous specs, incorrect ground truth) and attributing many failed transformation tasks to these errors; no numeric breakdown or sample count given in the excerpt.
On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility.
Reference to initial evaluation results on ELT-Bench showing low success rates for AI agents; the provided excerpt does not give numerical success rates or sample size.
The way we currently think about generative AI is fundamentally individual: this appears in how users interact with models, how models are built and benchmarked, and how commercial and research strategies using AI are defined.
Author's observational/descriptive claim supported by argumentative examples (mentions user interaction patterns, model design and benchmarking practices, and commercial/research strategies); no empirical sample or quantitative analysis reported in the excerpt.
Traditional questionnaires yielded slightly higher accuracy in risk assessment.
Result reported from the two experiments comparing traditional questionnaires to adaptive ARQuest versions; no numeric accuracy or sample size provided in the excerpt.
Insurers must blindly trust users' responses, increasing the chances of fraud.
Stated as a motivating problem in the paper; presented as logical/empirical concern rather than supported by a reported study within the paper.
Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences.
Descriptive claim in paper introduction arguing limitations of standard questionnaires; no experiment or sample size reported for this assertion.
A stylised inpatient capacity signalling example and minimal game-theoretic reasoning suggest that task optimisation alone is unlikely to change system outcomes when incentives are unchanged.
Theoretical analysis using a stylised inpatient capacity signalling example and game-theoretic reasoning presented in the paper (no empirical data/sample reported in the abstract).
Deployment of AI systems carries significant costs, including ongoing monitoring, and it is unclear whether optimism about a deus ex machina solution is well-placed.
Conceptual/argumentative claim made by the authors in the paper (no empirical study or sample size reported in the abstract).
Improvements in operational resilience (OR) effectively reduce corporate operational risk.
Further analysis reported in the paper linking higher OR to lower operational risk measures for firms in the sample.
AI promotes operational resilience by reducing management agency conflicts.
Mechanism (mediation) tests reported in the paper showing AI associated with reductions in measures of agency/management conflict, which in turn relate to OR improvements.
No regulatory framework requires disclosure of machine/AI labor output.
Author's assertion in the paper (policy claim; no legislative survey or quantification reported).
No index tracks machine labor output over time.
Author's assertion in the paper (stated lack of existing indices; no systematic review/sample reported).
This labor force is entirely invisible to the economic infrastructure humanity has built to measure work: no standardized unit of measurement exists.
Author's assertion/diagnosis in the paper (argumentative/observational, no empirical survey or sample reported).
Agent contributions are associated with more churn over time compared to human-authored code.
Longitudinal comparison between agent-generated and human-authored contributions reported in the paper (churn/survival estimates described; association between agent contributions and higher churn asserted).
Unbalanced or poorly governed adoption of Big Data and AI contributes to increased systemic risk, cybersecurity vulnerability, regulatory fragmentation and third-party dependence on BigTech platforms.
Argument based on qualitative literature review and synthesis of international empirical studies and comparative sector analysis; no single-sample empirical study in this paper.
Extreme automation (high AI intensity) causes employment decline.
Part of the U-shaped relationship reported by the paper's empirical results; described qualitatively in the abstract/summary.
Task orchestration is the most under-researched dimension among the five workplace-design components.
Finding from the PRISMA-guided systematic review of 120 papers, which mapped coverage across the five dimensions and identified task orchestration as having the least research attention.
Decision authority allocation emerges as the binding constraint for Society 5.0 transitions.
Result synthesized from the systematic review and theoretical analysis mapping the five workplace-design dimensions; stated as the binding constraint in the paper's findings.
The literature shows persistent gaps in empirical validation, standardized evaluation methods, and sector-specific comparative analyses of agentic AI in financial services.
Review-level assessment noting limited empirical studies, heterogeneous evaluation metrics, and few direct cross-sector comparisons up to mid-2024.
Significant implementation barriers persist, notably workforce transformation challenges, legacy system integration difficulties, and trust deficits.
Thematic synthesis across empirical and conceptual papers in the review reporting implementation barriers and change management issues.
Ethical concerns—including bias, lack of transparency, and regulatory compliance risks—remain critical for agentic AI in financial services and necessitate layered governance and human-AI collaboration.
Collation of ethical, legal, and governance issues reported across the reviewed multidisciplinary studies and normative discussions.
Insurance is comparatively underrepresented in the literature and in reported agentic AI deployments compared with banking and investment.
Review finding (counts/themes across included studies indicating fewer studies/applications in insurance relative to banking and investment).
A weak manager directing a weak worker achieves a 42% success rate, performing worse than the weak agent alone which achieves 44%.
Empirical comparison across the same 200 SWE-bench Lite instances and pipeline configurations, comparing weak-manager+weak-worker pipeline to weak single-agent baseline.
Task complexity shapes substitution: low-complexity tasks see high substitution, while high-complexity tasks favor limited partial automation.
Calibration of the model to O*NET tasks + expert survey + GPT-4o decompositions; implementation results reported for computer vision showing substitution varies with task complexity.
AI systems exhibit predictable but diminishing returns to data, compute, and model size (scaling-law experiments), implying the cost of higher accuracy is convex: good performance may be inexpensive, but near-perfect accuracy is disproportionately costly.
Scaling-law experiments estimating performance as a function of data, compute, and model size; described experimental estimation of production function.
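The convex-cost implication can be sketched under an assumed power-law relationship between error and compute (the functional form and constants below are illustrative, not estimates from the paper):

```python
# Assumed power law: error(C) = A * C**(-B). Inverting it gives the compute
# needed to reach a target error, and shows the cost of accuracy is convex:
# each 10x reduction in error multiplies required compute by 10**(1/B).

A, B = 1.0, 0.1  # illustrative constants, not fitted values

def compute_for_error(target_error: float) -> float:
    """Compute budget C such that A * C**(-B) equals target_error."""
    return (A / target_error) ** (1.0 / B)

# Going from 10% error to 1% error costs 10**(1/0.1) = 10**10 times more
# compute in this illustration.
ratio = compute_for_error(0.01) / compute_for_error(0.10)
print(f"compute multiplier for 10x lower error: {ratio:.3g}")
```

With a shallow exponent like B = 0.1, modest accuracy is cheap but each further decimal of accuracy is exponentially more expensive, which is the claim's "near-perfect accuracy is disproportionately costly" point.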
The common claim that generative AI simply amplifies the Dunning–Kruger effect is too coarse to capture the available evidence.
Paper's synthesis of heterogenous empirical findings from human–AI interaction, learning research, and model evaluation used to critique the uniform-amplification interpretation; no single empirical countertest reported.
LLM use degrades metacognitive accuracy and flattens the classic competence–confidence gradient across skill groups (i.e., reduces calibration and narrows differences in self-assessed confidence by skill level).
Synthesis of studies from human–AI interaction and learning research reported in the paper that document worsened calibration and a reduction in the competence–confidence gradient when users rely on LLM outputs; the paper does not report a single combined sample size.
The agent team topology exhibits higher operational fragility due to multi-author code generation.
Reported empirical observation from experiments comparing architectures, attributing increased fragility/errors to multi-author code generation in the agent team setup (stated qualitatively; no quantitative failure rates provided in the abstract).
Azar et al. (2023) show that monopsonistic employers have stronger incentives to automate and document that US commuting zones with higher labor market concentration experienced more robot adoption.
Citation reported in the paper summarizing Azar et al. (2023); empirical analysis across US commuting zones (no sample size provided here).