Evidence (4793 claims)
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding ("—" indicates no claims in that cell).
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Productivity
The broader cognitive automation potential is roughly five times larger than visible adoption and is geographically widespread (present across all states, not only coastal hubs).
Direct comparison of the two model-derived aggregates (11.7% vs 2.2%) and spatial analysis of the Iceberg Index across ~3,000 counties and all states in the simulation.
Broader cognitive automation potential across administrative, financial, and professional services amounts to 11.7% (~$1.2 trillion).
Iceberg Index computation summing the wage-value contributions of skills that current AI capabilities can perform; based on mapping of thousands of AI tools to ~32,000 skills and the simulated 151M-agent workforce across ~3,000 counties.
Visible AI adoption concentrated in computing/technology represents about 2.2% of U.S. wage value (~$211 billion).
Model-derived visible-adoption metric computed from mapped AI tool usage in technology/computing occupations, applied to the simulated 151M-worker population and national wage data to estimate percentage and dollar value.
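In spirit, both aggregates reduce to wage-value shares over a skill inventory. The toy sketch below uses made-up numbers purely for illustration; the actual computation maps thousands of AI tools to ~32,000 skills across the simulated 151M-agent workforce.

```python
# Toy skill inventory: (wage value in $bn, AI-capable?, visibly adopted?)
# All numbers here are illustrative, not the paper's data.
skills = [
    (300.0, True, True),    # computing/tech skills with visible AI adoption
    (500.0, True, False),   # admin/financial skills AI could perform today
    (400.0, True, False),   # professional-services skills AI could perform
    (800.0, False, False),  # skills beyond current AI capability
]

total = sum(w for w, _, _ in skills)
iceberg_share = sum(w for w, capable, _ in skills if capable) / total
visible_share = sum(w for w, _, adopted in skills if adopted) / total

print(f"potential: {iceberg_share:.1%}  visible: {visible_share:.1%}  "
      f"ratio: {iceberg_share / visible_share:.1f}x")
```

With the paper's estimates (11.7% vs 2.2%), the same ratio is roughly 5.3x, which is the "roughly five times larger" figure above.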
Reduced labor shares disproportionately harm lower- and middle-skill workers relative to higher-skill workers, increasing distributional inequality.
Micro and firm-case analyses linking K_T exposure to occupation- and skill-level wage/employment outcomes; regressions showing heterogeneous effects across skill groups; supporting evidence from sectoral studies.
The loss of labor share and payrolls materially undermines PAYG pension sustainability and payroll-tax revenue bases under realistic adoption trajectories.
Dynamic general equilibrium overlapping-generations model calibrated and simulated to incorporate substitution between labor and K_T and a PAYG pension sector; fiscal simulations show declining contributor bases and pressure on pension balances; sensitivity analyses across adoption speeds.
Wages for workers in K_T‑intensive firms/industries fall or grow more slowly relative to less-exposed counterparts, compressing wage contributions to income.
Panel regressions estimating wage outcomes conditional on K_T intensity measures, with controls and robustness specifications; supported by matched employer‑employee microdata in case studies and industry-level decompositions.
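The mechanism behind the labor-share and wage claims can be shown with a toy example, not the paper's calibrated OLG model: under a CES technology in which automation capital K_T and labor are gross substitutes, the labor share falls as K_T accumulates (parameter values here are assumptions chosen for illustration).

```python
# Toy CES economy: Y = (a*K_T**r + (1-a)*L**r)**(1/r), with r in (0,1)
# so K_T and labor are gross substitutes. Labor is paid its marginal product.
a, r, L = 0.4, 0.6, 1.0  # illustrative parameter values, not calibrated

def labor_share(K_T):
    Y = (a * K_T**r + (1 - a) * L**r) ** (1 / r)
    w = (1 - a) * L ** (r - 1) * Y ** (1 - r)  # marginal product dY/dL
    return w * L / Y

for K_T in (0.5, 1.0, 2.0, 4.0):
    print(f"K_T={K_T}: labor share {labor_share(K_T):.2f}")
```

As K_T grows, the wage bill's share of output declines monotonically, which is the channel the pension and payroll-tax simulations above build on.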
Significant implementation hurdles—chronic infrastructure gaps, weak data governance, severe digital skills shortages, high initial investment costs, and organizational inertia—create a 'pilot trap' that prevents successful AI pilots from scaling.
Qualitative findings from interviews/case studies in the mixed-methods research detailing recurring barriers to scaling AI projects in large enterprises and across the sector.
Strict oversight requirements for GLAI could raise fixed compliance costs (audit, certification, human-in-the-loop processes), benefiting incumbent firms, raising barriers to entry, and potentially reducing competition.
Regulatory economics argument drawing on compliance-cost logic and market structure effects; no empirical entry-cost analysis or case studies.
Perception of increased legal risk and regulatory uncertainty may slow adoption of GLAI and redirect investment toward safer subfields (verification tools, retrieval-augmented systems, formal-reasoning hybrids).
Economic reasoning and market-design argumentation based on risk/uncertainty dynamics; no econometric or survey data presented.
Divergent regulatory regimes (e.g., strict EU rules vs. looser regimes elsewhere) may produce regulatory arbitrage, influencing where GLAI companies locate, invest, and trade internationally.
Cross-jurisdictional regulatory analysis and economic inference about firm behavior under differential regulation; no firm-level relocation data provided.
The positive macroeconomic effects of AI are severely limited by structural issues, notably large petroleum import volumes and the fiscal burden of incomplete fuel subsidy reforms.
Integrated quantitative analysis showing that operational savings are outweighed by import volumes and subsidy fiscal costs; contextual fiscal data cited (fuel subsidy reform peak).
The authors identify concrete training gaps in current models: delegation, scoped execution, and mode switching are absent from current training data, and their absence limits splitting models into manager and worker roles.
Authors' diagnosis based on experimental outcomes and qualitative reasoning about model training distributions; recommendation for future training focus.
Interpretive, ad-hoc human-centered evaluation practices (e.g., “vibe checks”, team sense-making) are rational adaptations to LLM behavior rather than merely sloppy or inferior methodological choices.
Authors' interpretive argument based on interview evidence where practitioners explained why such practices persist and how they serve sense-making for unpredictable model behavior.
The possibility of strategic argument construction (gaming) motivates governance needs: standards for provenance, certification, and liability rules.
Policy recommendation based on anticipated incentive problems; no empirical governance evaluations.
Standard GDP statistics can mask AI-driven demand shortfalls; central banks and statistical agencies should therefore monitor labor-share–velocity links, distributional income measures, and consumption by income quantile in addition to headline GDP.
Theoretical Ghost GDP channel and calibration results showing divergence between measured GDP and consumption-relevant income; policy recommendation follows from those model results.
AI changes the nature of capital (digital/algorithmic assets) and complicates productivity accounting; researchers should decompose firm-level productivity gains into AI technology, complementary organizational capital, and human capital effects.
Theoretical proposal grounded in productivity accounting literature and conceptual discussion; no single decomposition empirical result presented.
Conventional productivity statistics and standard evaluation methods may undercount benefits from conversational initiation assistance; new survey and administrative measures might be needed.
Policy and measurement recommendation based on the conceptual model; no empirical measurement validation provided.
Policy and governance issues become salient: liability, IP, security, and certification of AI-generated code require new standards for provenance, testing, and accountability.
Argument based on practitioner-raised concerns about security, IP, and provenance in the Netlight study; authors recommend policy attention; no legal/regulatory analysis or empirical policy evaluation provided.
International shipping produces approximately 3% of global greenhouse gas emissions.
Contextual statement in the paper citing external estimates (specific source not provided in the excerpt).
Output quality saturates at approximately seven governed memories per entity.
Empirical analysis reported in the controlled experiments showing output quality vs. number of governed memories per entity, with saturation near seven memories.
The report provides scenario-based forecasts for HACCA emergence across near-, mid-, and long-term timelines, identifying capability thresholds to monitor.
Capability trajectory assessment combining trends in AI capabilities, automation of software tasks, computation availability, and diffusion dynamics; scenario and expert-judgment approach (qualitative forecasting).
An interpretable logistic-regression model, calibrated with isotonic regression, produces well-calibrated, individual-level attrition probabilities suitable for policy simulation.
Modeling pipeline: logistic regression for prediction, isotonic regression for calibration; authors report strong predictive performance and well-calibrated probabilities (specific performance metrics not included in the provided summary).
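The calibration step of such a pipeline can be sketched in a few lines. This is a minimal pool-adjacent-violators (PAV) implementation of isotonic calibration for illustration only; it is not the authors' pipeline, which also includes the logistic-regression predictor and reported performance metrics.

```python
def isotonic_calibrate(scores, labels):
    """Pool Adjacent Violators: fit a nondecreasing map from raw scores
    to calibrated probabilities, returned in the original sample order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block: [label_sum, count, member_indices]
    for i in order:
        blocks.append([labels[i], 1, [i]])
        # Pool while the previous block's mean exceeds the current one's
        # (cross-multiplied to avoid float division in the comparison).
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, n, idx = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] += idx
    out = [0.0] * len(scores)
    for s, n, idx in blocks:
        for i in idx:
            out[i] = s / n  # each member gets its block's mean label
    return out

print(isotonic_calibrate([0.1, 0.4, 0.2, 0.3], [0, 1, 1, 0]))
# → [0.0, 1.0, 0.5, 0.5]
```

The fitted values are nondecreasing in score order, which is what makes the resulting probabilities well calibrated for downstream policy simulation.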
A Sankey diagram of thematic evolution shows lexical convergence over time and indicates that a small set of authors has disproportionate influence in structuring the discourse.
Thematic evolution analysis visualized with a Sankey diagram; author influence inferred from performance trends (citations/publication counts) in the bibliometric data.
This paper is one of the first systematic reviews focused specifically on NLP in bank marketing, organizing findings along the customer journey and the marketing mix to provide a practical taxonomy.
Authors' stated novelty claim based on the scoped literature search (2014–2024) and topical focus; novelty inferred from the small number of prior papers identified at the intersection.
Productivity gains from AI may be under- or mis-measured if national accounts and tax systems do not adjust for AI-driven quality changes in services.
Analytic observation in the paper's measurement and externalities discussion; not empirically tested within the study.
ToM alignment matters less (i.e., misalignment has smaller effect) in settings with explicit coordination protocols, strong signaling, or standardized conventions.
Analyses and experiments described in the paper showing smaller performance differences between matched and mismatched ToM orders when explicit conventions or reliable signals are available; reported as part of robustness/conditional analyses.
Manipulating costs and benefits of observation versus action in experiments can probe the switching behavior driven by System M.
Proposed experimental manipulation; no empirical data presented.
Ablation studies disabling System M, or decoupling Systems A and B, would help test whether meta-control provides empirical benefits.
Suggested experimental design (ablation study) in the methods section; no results provided.
The authors will publicly release the benchmark, code, and pre-trained models.
Statement in the paper (release/availability section) announcing plans to publish benchmark, code, and pre-trained models.
The study is the first empirical investigation of human–AI assistance in a live CTF setting with a direct comparison to autonomous AI agents on the same fresh challenges.
Authors' positioning of their work as novel; methodology involved a live onsite CTF, instrumentation of human–AI interactions (41 participants), and direct benchmarking of four autonomous agents on the same fresh challenge set.
This is the first study to compare human–human and human–AI collaboration outcomes for temporary virtual tasks from employees’ perspective in an applied service-industry context.
Author-stated novelty claim in the paper (based on study design: online experiment with retail employees examining temporary, virtual teamwork).
Measuring AI's contribution to productivity and coordination effects will be challenging; new metrics (e.g., coordination time per task, error/rework rates attributable to communication lapses) are required.
Conceptual argument and recommended measurement agenda in the paper; no empirical testing of proposed metrics provided.
Many early-stage AI advances have not translated into higher Phase II/III success rates.
Synthesis of reported outcomes and failures from industry experience; no new systematic statistical analysis provided.
After roughly a decade of adoption in large biopharma, AI has not yet changed late-stage (Phase II/III) clinical success rates.
Qualitative assessment of industrywide experience and reported outcomes; statement based on narrative review rather than systematic, long-run quantitative analysis or causal estimates.
Three primary adoption archetypes in large pharma are (1) partnership-driven acceleration, (2) culture-centric transformation, and (3) production-first democratization.
Conceptual classification in the editorial derived from trends and illustrative examples rather than empirical survey or sampling; no quantitative validation provided.
AI adoption is not associated with significant changes in operating costs.
Analysis of operating costs in firm financials showing no significant post-adoption change for adopters relative to nonadopters.
The innovation effects of AI adoption are not concentrated among larger firms, financially unconstrained firms, or high-tech firms.
Heterogeneity tests across firm size, financial constraint status, and industry technology intensity showing no concentration of effects in these groups (as reported in the paper).
SWE-Skills-Bench is the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE).
Authors present a new benchmark designed to evaluate marginal utility of skills; benchmark pairs skills with repositories and requirement documents and is described as requirement-driven and focused on isolating marginal utility.
Collaborative ability is distinct from individual problem-solving ability.
Model-based estimates from the Bayesian IRT framework that separately parameterize collaborative ability and individual problem-solving ability, with results indicating they are separable constructs (analysis on n = 667 benchmark data).
A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation during online serving.
Method description in the paper explaining adaptive online serving and complexity-aware routing; evaluated in serving experiments.
AI has not yet significantly promoted university–industry collaborative R&D capabilities.
Mechanism analysis in the paper testing the university–industry collaborative R&D channel and reporting no statistically significant effect of AI adoption on that capability in the sample.
This study empirically tests a theoretically acknowledged but rarely tested relationship (AI adoption → performance conditional on structural constraints) in an emerging-economy setting.
Literature gap claim supported by the authors' review and execution of an empirical test using survey data from 280 Tunisian SMEs and PLS-SEM.
Institutional conditions do not exert a significant moderating influence on the relationship between AI adoption and firm performance in this sample.
PLS-SEM moderation tests on the 280 Tunisian SMEs found the institutional-environment moderator to be non-significant.
Key limitations in the literature include methodological heterogeneity, scarce safety data, and a focus on non-acute settings.
Authors' appraisal of the included studies as reported in the discussion section.
Unemployment does not exert a statistically significant effect on GDP growth in the estimated model.
Unemployment included among the macroeconomic determinants in the panel regressions but reported as statistically insignificant (no effect) in the provided summary; methods cited include OLS, FE, Difference and System GMM (sample details not included).
Previous studies have identified language barriers as impediments to labor-market engagement, but empirical evidence assessing both policy-driven reductions in those barriers and the relative efficacy of professional, AI-assisted, and hybrid translation methods is scarce.
Paper's literature review claim that existing literature documents language barriers but lacks comparative empirical evaluations of policy reductions and multiple translation models; asserted as motivation for current study.
Translation verified against existing performance implementations achieves throughput parity with MJX (1.04x) for HalfCheetah JAX.
Benchmarking HalfCheetah implemented in the translated backend versus MJX, reporting a 1.04x throughput ratio (approximate parity).
Levers such as raising taxes, reforming pensions, and boosting productivity interact with one another through feedback loops and time delays that are not yet well understood.
Literature and model motivation stated in the paper; the integrated model is built to capture such interactions and delays.
These efficiency and cost gains are achieved while maintaining accuracy parity with the matched hierarchical baseline.
Paper states accuracy parity was maintained in the empirical evaluation comparing the proposed framework to the matched hierarchical baseline on the 2,847-query testbed.
The short‑term effect of AI on labor‑intensive industries is weak.
Short‑run/dynamic subgroup analysis in the China 2003–2017 panel indicating minimal or weak immediate growth effects for labor‑intensive sectors.