Evidence (8625 claims)
Adoption
8625 claims
Productivity
7686 claims
Governance
6917 claims
Human-AI Collaboration
6574 claims
Org Design
4189 claims
Innovation
4131 claims
Labor Markets
3588 claims
Skills & Training
2985 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Adoption
Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.
Prescriptive conclusion derived from the observed cross-configuration heterogeneity in the paper's empirical results.
Framework identity accounts for more of the between-configuration variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%.
Variance decomposition / explained-variance analysis reported for 'mean turns' across configurations (reported percentages: 64% vs 10%).
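The excerpt does not say how the explained-variance shares were computed. As a minimal sketch, assuming a simple one-way eta-squared (between-group sum of squares divided by total sum of squares), the share attributable to each factor could be obtained as follows. The function name `explained_share`, the field names, and the toy values are all hypothetical and are not the paper's data.

```python
from collections import defaultdict
from statistics import mean


def explained_share(rows, factor, outcome="mean_turns"):
    """One-way eta-squared: between-group sum of squares over total sum of squares."""
    values = [row[outcome] for row in rows]
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)

    # Collect the outcome values observed at each level of the factor.
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[outcome])

    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values()
    )
    return ss_between / ss_total


# Hypothetical toy configurations: each pairs a framework with an LLM.
configs = [
    {"framework": "fw_a", "llm": "llm_x", "mean_turns": 10.0},
    {"framework": "fw_a", "llm": "llm_y", "mean_turns": 12.0},
    {"framework": "fw_b", "llm": "llm_x", "mean_turns": 20.0},
    {"framework": "fw_b", "llm": "llm_y", "mean_turns": 22.0},
]

framework_share = explained_share(configs, "framework")
llm_share = explained_share(configs, "llm")
print(f"framework: {framework_share:.2f}, llm: {llm_share:.2f}")
```

In this balanced toy design the two shares sum to one; with unbalanced or interacting factors they generally do not, which is consistent with the reported 64% and 10% leaving variance unexplained.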
The analysis separates framework effects from LLM effects by holding each layer fixed in turn and measures one behavior–outcome effect per configuration to examine agreement across configurations.
Methods description in the paper: experimental design holding LLM or framework fixed to disentangle effects.
This study analyzes 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework supplying tools and workflow.
Dataset and experimental design reported in the paper: 64,380 runs; 126 configurations; 43 frameworks.
The paper's contribution is an evaluation and benchmark paradigm (discipline stability / trace-based evaluation), not a new optimizer or a universal claim about MARL.
Author statement in the abstract/summary clarifying that the contribution is methodological (evaluation/benchmark) rather than a new optimizer or a universal claim about multi-agent RL.
These factors evolve over time, have inter-dependencies across multiple resource dimensions, and generally do not lend themselves to closed-form analysis.
Methodological observation motivating simulation/sequence-based evaluation; asserted in the paper's rationale.
Higher sectoral digitalization potential (telework feasibility and digital intensity) does not significantly affect aggregate employment levels.
Difference-in-differences (DiD) analysis using the COVID-19 shock as a quasi-natural experiment on a quarterly panel for 27 EU Member States (2018–2024), N = 36,685; reported DiD coefficient = 0.06, p ≈ 0.98.
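For readers unfamiliar with the estimator, the canonical two-group, two-period difference-in-differences can be sketched as below. This is only an illustration of the basic contrast, assuming a binary treatment indicator and a single pre/post split; the paper's quarterly panel specification is not given in the excerpt, and the function name `did_estimate` and the toy outcomes are hypothetical.

```python
from statistics import mean


def did_estimate(observations):
    """Canonical 2x2 DiD from (treated, post, outcome) tuples.

    Returns (treated post - treated pre) - (control post - control pre).
    """
    cells = {(t, p): [] for t in (False, True) for p in (False, True)}
    for treated, post, outcome in observations:
        cells[(treated, post)].append(outcome)
    m = {key: mean(vals) for key, vals in cells.items()}
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change


# Hypothetical toy outcomes (e.g. an employment index), not the paper's data.
observations = [
    (True, False, 50.0), (True, False, 52.0),    # high-digitalization, pre-shock
    (True, True, 53.0), (True, True, 55.0),      # high-digitalization, post-shock
    (False, False, 40.0), (False, False, 42.0),  # low-digitalization, pre-shock
    (False, True, 43.0), (False, True, 45.0),    # low-digitalization, post-shock
]

estimate = did_estimate(observations)
print(f"DiD estimate: {estimate:.2f}")
```

Both toy groups rise by the same amount, so the estimate is zero. That mirrors the shape of a null finding: levels differ between groups, but their changes do not.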
The convergence properties of the explore-then-exploit pricing pipeline can be characterized via a fluid-limit ordinary differential equation (ODE) analysis.
Analytical method used in the paper: fluid-limit ODE analysis applied to the multi-firm explore-then-exploit model to study convergence.
Firms following an explore-then-exploit pipeline randomize prices during an initial exploration phase, then estimate demand from their own historical data and set prices myopically thereafter; the estimation relies on a misspecified, monopoly-style model that omits competitors' prices.
Model specification and assumptions described in the paper (methodological setup).
The paper constructs estimators for the own-adoption, spillover, and total effects and an inference procedure that allows for spatial dependence.
Presentation of concrete estimators and an inference procedure in the paper; the inference approach explicitly accommodates spatial dependence (methodological contribution).
Spillover effects are learned from never-treated units and evaluated for treated cohorts under the exposure distribution they face.
Methodological procedure in the paper: estimation of spillover effects using never-treated units as the source of variation, then applying those estimates to treated cohorts based on their observed exposure distributions.
Identification uses a prespecified summary of spillover exposure and parallel trends comparisons among units with the same exposure at the baseline and target dates.
Identification strategy articulated in the paper: assumption of a prespecified exposure summary and use of parallel trends comparisons conditional on equal exposure profiles at baseline and event dates.
For each treated cohort and event time, the framework separates the effect of own adoption, the spillover effect generated by other adopters, and the total effect under the realized rollout.
Analytical decomposition provided in the paper that defines separate estimands for (i) own-adoption effect, (ii) spillover effect from other adopters, and (iii) total realized effect for cohorts and event times.
The paper develops a difference-in-differences framework for staggered policy adoption when units can be affected by other units' adoption.
Theoretical development in the paper: presentation of a DID framework that explicitly allows units to be affected by other units' adoption (methodological derivation and formal description).
IIQ is positioned as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation.
Explicit positioning statement in paper: authors state scope and limits of IIQ as deployment/usage metric rather than capability or causal productivity estimator (conceptual/positioning).
Sources were selected purposively through explicit inclusion and exclusion criteria tied to conceptual relevance, scholarly quality, and direct contribution to framework building; higher-order categories were retained only after iterative comparison across the four literature streams.
Author-reported sampling and analytic procedure for the integrative review.
Methodologically, the paper uses a structured integrative review combined with interpretive theory synthesis to connect literature on RegTech, sanctions compliance, institutional voids, supply chain governance, and algorithmic accountability.
Explicit methodological description in the paper (authors' stated approach).
Existing studies on regulatory technology mainly present it as a firm-level compliance tool, giving little attention to its role in shaping coordination across wider enterprise ecosystems in post-conflict and sanctions-affected settings.
Review finding based on purposive selection and comparison of literature on RegTech and related fields (method: structured integrative review and interpretive theory synthesis).
The study uses World Bank Enterprise Survey firm-level data from 2007 to 2024 and employs feasible generalized least squares (FGLS), robust ordinary least squares (OLS), and high-dimensional fixed effects (HDFE) linear regression techniques.
Direct methodological statement in the paper's abstract/summary. This is a descriptive factual claim about data and methods.
Neither survey nor transcript-based measures of participation equity improved under LLM facilitation (an "illusion of inclusion").
Quantitative survey measures and transcript-based analyses of participation equity (e.g., measures of turn-taking, speaking/typing share) showed no improvement in equity metrics for facilitated conditions compared to controls across the experiments.
Across both studies, LLM facilitation did not significantly improve group consensus.
Experimental comparison across the two studies (total N=879) measuring agreement/consensus metrics for groups randomized to LLM facilitation versus other facilitators or no facilitation; reported null effect on consensus.
Study 2 (N=675) compares facilitator strategies against a no-facilitation baseline.
Study 2 comprised N=675 participants (groups of three) randomized to different LLM facilitation strategies and a no-facilitation control.
Study 1 (N=204) compares three frontier LLMs as facilitators.
Study 1 comprised N=204 participants (groups of three) randomized to facilitator conditions comparing three frontier language models.
We present two empirical studies (N=879) of real-time, text-based group deliberation in an incentive-compatible charity allocation task with real financial stakes ($7,200 USD).
Two online experiments involving real-time, text-based group deliberation. Total participants N=879 in groups of three; total monetary stakes for the charity allocation task equal $7,200 USD.
We re-recruited 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset to evaluate personalised and non-personalised language models in blinded multi-turn conversations (large-scale within-subject experiment).
Study methodology reported in paper: within-subject experiment, re-recruitment of 530 participants from 52 countries, blinded multi-turn conversations comparing models.
Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people.
Authors' literature and field observation stated in introduction; contextual claim about common practice in academic evaluations (no numeric experiment reported for this claim).
Using an agent-based simulation of a multi-SKU convenience store environment, the study evaluates deployment efficiency, inventory responsiveness, and managerial cognitive reallocation.
Methodological claim: the paper reports an agent-based simulation experiment in a multi-SKU convenience store context; details such as number of simulations, parameter settings, or statistical results are not provided in the excerpt.
The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts.
Statement in paper presenting a characterization of current AI agent design; conceptual/observational claim with no empirical data or sample reported.
Persistent data gaps—especially concerning worker-level outcomes, informal labor, and non-Anglophone markets—warrant urgent research investment.
Authors' assessment based on scope of included studies and acknowledged limitations in observation windows and geographic/labor-form coverage.
Following PRISMA 2020 guidelines, we systematically searched six academic databases (Scopus, Web of Science, EconLit, SSRN, IEEE Xplore, Google Scholar) for empirical studies documenting observed—not predicted—labor market changes since 2020; from 1,847 initial records, 94 studies meeting inclusion criteria were retained for qualitative synthesis and 42 for quantitative data extraction.
Methods: systematic literature search following PRISMA 2020 across six named databases; initial records = 1,847; retained = 94 for qualitative synthesis, 42 for quantitative extraction.
We thematically analysed twelve semi-structured interviews with SME owners and managers conducted in early 2025 using Atlas.ti, yielding 19 codes grouped into six categories.
Methods statement in the paper describing qualitative sample and analysis procedures.
We examine the interplay between AI adoption, social capital formation, workforce dynamics, and sustainable development in Eastern Macedonia and Thrace (EMT), one of the EU's least developed regions.
Study context and scope as stated in the paper; empirical work conducted in EMT.
Research has concentrated on advanced urban economies, leaving underexplored the implications of AI for peripheral small and medium-sized enterprises (SMEs) operating under weak human capital, thin digital infrastructure, and constrained social capital.
Statement in the paper contrasting existing research focus (advanced urban economies) with a lack of attention to peripheral SMEs; no empirical sample size for this bibliographic claim reported in the excerpt.
Once functional deployment and operational investment are controlled for, worker-task use is not associated with employment declines.
Multivariate regression results reported in the paper using BTOS AI supplement data showing the coefficient on worker-task use becomes statistically indistinguishable from zero after controlling for functional deployment and operational investment; exact model details and sample size not provided in excerpt.
This study conducts an empirical analysis using data on industrial robots from the International Federation of Robotics (IFR) and panel data from 14 sub-sectors of China's manufacturing industry.
Statement in paper describing data and methods: use of IFR robot data combined with panel data covering 14 manufacturing sub-sectors (panel regression framework implied).
Return forecasts are translated into long–short portfolios to assess economic performance.
Stated evaluation approach: conversion of predicted returns into long–short portfolios for economic/performance assessment.
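The excerpt does not describe how the long-short portfolios are built. A common construction, shown here as a sketch under stated assumptions, ranks assets by forecast return, buys the top fraction and sells the bottom fraction with equal weights, and reports the spread in realized returns. The function name `long_short_return`, the 20% cutoff, and the toy numbers are hypothetical.

```python
def long_short_return(forecasts, realized, fraction=0.2):
    """Equal-weighted long-short spread for one period.

    Longs the top `fraction` of assets by forecast and shorts the bottom
    `fraction`; returns the mean realized long return minus the mean
    realized short return.
    """
    n = len(forecasts)
    k = max(1, int(n * fraction))  # assets per leg, at least one
    order = sorted(range(n), key=lambda i: forecasts[i])
    short_leg = order[:k]   # lowest forecasts
    long_leg = order[-k:]   # highest forecasts
    long_ret = sum(realized[i] for i in long_leg) / k
    short_ret = sum(realized[i] for i in short_leg) / k
    return long_ret - short_ret


# Hypothetical one-period forecasts and realized returns for five assets.
forecasts = [0.03, -0.01, 0.02, -0.02, 0.00]
realized = [0.04, -0.02, 0.01, -0.03, 0.00]

spread = long_short_return(forecasts, realized)
print(f"long-short return: {spread:.4f}")
```

Evaluating the spread rather than forecast error alone is what turns a statistical result into an economic one: a model can have low error and still rank assets poorly.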
The analysis is based on 30 market, liquidity, valuation, profitability, technical and risk factors and compares linear models, tree-based machine learning and deep learning architectures (including GRU, LSTM and Transformer) within a rolling-window forecasting framework.
Description of empirical design: use of 30 factor variables and explicit listing of model families (linear, tree-based, GRU, LSTM, Transformer) and use of a rolling-window forecasting setup.
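A rolling-window setup refits each model on a fixed-length trailing window and forecasts the period or periods that follow, then slides forward. The sketch below generates the index ranges for such a scheme. The window lengths are placeholders, since the excerpt does not report the ones used, and `rolling_windows` is a hypothetical helper name.

```python
def rolling_windows(n_periods, train_size, test_size=1, step=1):
    """Yield (train_range, test_range) index pairs for rolling-window forecasting."""
    start = 0
    while start + train_size + test_size <= n_periods:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Hypothetical setup: 6 periods, train on 3, forecast the next 1.
windows = list(rolling_windows(n_periods=6, train_size=3))
for train, test in windows:
    print(f"train on periods {list(train)}, forecast periods {list(test)}")
```

Because every test range lies strictly after its training range, no future information leaks into the fitted model, which is the main reason to prefer this scheme over a random train/test split for return data.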
We introduce the weighted evaluation index (WEI), a finance-specific performance metric that integrates prediction accuracy with market adaptability.
Methodological contribution stated in the paper: introduction of a new performance metric called WEI described as integrating accuracy and market adaptability.
We introduce the Diff-RMSE method for nonlinear factor identification.
Methodological contribution stated in the paper: introduction of a new method named 'Diff-RMSE' for identifying nonlinear factors.
The study uses A-share market data from 2013 to 2024 with equity and firm-characteristic data available from databases such as RESSET and CSMAR for more than 5,000 listed firms.
Empirical dataset description in the paper: time period 2013–2024, sources named (RESSET, CSMAR), and statement 'more than 5,000 listed firms'.
Future research should test these findings across different institutional contexts, particularly European economies.
Paper's stated limitations and suggestions for future research.
The analysis employs fixed-effects models, U-tests, bootstrap mediation, and patent text similarity analysis.
Methods statement listing econometric and text-analytic techniques used in the paper.
The study uses a sample of 25,204 firm-year observations from Chinese A-share manufacturing companies (2010–2023).
Paper statement of sample and period; descriptive sample construction (firm-year observations = 25,204).
The empirical analysis is based on Chinese A-share listed firms observed from 2012 to 2024 and uses a difference-in-differences (DID) identification strategy.
Study description in the paper's methods/abstract specifying sample period (2012–2024), population (Chinese A-share listed firms), and methodology (DID).
We validate the framework empirically on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers.
Empirical experiments reported in the paper using five named datasets and eight models from five providers (experimental evaluation / benchmarking).
For k-model cascades, first-order conditions imply a single shadow price that equalizes marginal quality-per-cost across stage boundaries.
Analytical derivation of first-order conditions for k-stage cascades within the decision-theoretic constrained-optimization framework presented in the paper.
Given a pool of k models, the frontier achievable by deterministic two-model threshold cascades is the pointwise envelope over all C(k, 2) = k(k - 1)/2 pairwise cascades, with switching points where the optimal pair changes.
Theoretical characterization/derivation in the paper (mathematical result about deterministic two-model threshold cascades and combinatorial envelope over pairwise cascades).
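The envelope construction can be illustrated numerically. The sketch below enumerates every pair from a pool of three models, evaluates each pair's cost-quality frontier at a grid of budgets, and takes the pointwise maximum. The frontier shapes (capped linear curves), the model names, and all numbers are hypothetical stand-ins, since the paper's actual pairwise frontiers are not given in the excerpt.

```python
from itertools import combinations


def capped_linear(base, slope, cap):
    """Hypothetical pairwise frontier: quality rises linearly with budget up to a cap."""
    return lambda budget: min(cap, base + slope * budget)


def envelope(pair_frontiers, budgets):
    """Pointwise upper envelope: the best pair and its quality at each budget."""
    result = []
    for b in budgets:
        best = max(pair_frontiers, key=lambda pair: pair_frontiers[pair](b))
        result.append((b, best, pair_frontiers[best](b)))
    return result


models = ["small", "mid", "large"]
pairs = list(combinations(models, 2))  # all k(k - 1)/2 pairs for k = 3

# Assumed frontier parameters for each pair (illustrative only).
params = {
    ("small", "mid"): (0.50, 0.10, 0.70),
    ("small", "large"): (0.45, 0.08, 0.90),
    ("mid", "large"): (0.32, 0.10, 0.92),
}
pair_frontiers = {pair: capped_linear(*params[pair]) for pair in pairs}

frontier = envelope(pair_frontiers, budgets=range(1, 8))
best_pairs = [pair for _, pair, _ in frontier]

# Switching points: budgets at which the optimal pair differs from the previous one.
switches = [
    frontier[i][0] for i in range(1, len(frontier))
    if best_pairs[i] != best_pairs[i - 1]
]
print(best_pairs)
print("optimal pair changes at budgets:", switches)
```

With these assumed curves, the cheapest pair is optimal at low budgets and is overtaken by pairs containing the large model as the budget grows, so the envelope is stitched together from segments of different pairwise frontiers.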
Reciprocal shadow prices link the budget-constrained and quality-constrained formulations of the cascade optimization.
Analytical derivation in the decision-theoretic framework using constrained optimization and duality presented in the paper.
For a two-model cascade, the cost-quality frontier is piecewise concave on decreasing-benefit regions of the confidence support.
Theoretical development in a decision-theoretic framework using constrained optimization and duality; proven properties for the two-model case reported in the paper (analytical result).
In the U.S., no single 'AI Act' has passed (as of 2026).
Stated in the paper as a factual legal/policy status; this is verifiable via legislative records and is presented without an underlying sample (paper cites status as of 2026).