Evidence (14055 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
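The matrix can be read arithmetically. The following is a minimal sketch using three rows copied from the table above. It computes the share of positive findings and checks whether the four direction columns add up to the reported Total. The function name `summarize` and the dictionary layout are illustrative assumptions, not part of the source.

```python
# Rows copied from the table above:
# (positive, negative, mixed, null, reported total)
rows = {
    "Other": (758, 199, 100, 900, 2007),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def summarize(counts):
    """Return the sum of direction counts, its gap to the reported total,
    and the positive share among the direction counts."""
    positive, negative, mixed, null, reported_total = counts
    directional_sum = positive + negative + mixed + null
    return {
        "directional_sum": directional_sum,
        "gap_to_total": reported_total - directional_sum,
        "positive_share": positive / directional_sum,
    }


summary = {name: summarize(counts) for name, counts in rows.items()}
for name, stats in summary.items():
    print(name, stats["directional_sum"], stats["gap_to_total"],
          round(stats["positive_share"], 3))
```

For some rows the direction columns sum to less than the reported Total (a gap of 50 for Other and 6 for Firm Productivity), while others match exactly (Job Displacement). The source does not explain the gap, so shares here are computed over the direction counts only.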
This study analyzes 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework supplying tools and workflow.
Dataset and experimental design reported in the paper: 64,380 runs; 126 configurations; 43 frameworks.
The paper's contribution is an evaluation and benchmark paradigm (discipline stability / trace-based evaluation), not a new optimizer or a universal claim about MARL.
Author statement in the abstract/summary clarifying that the contribution is methodological (evaluation/benchmark) and that the paper neither proposes a new optimizer nor makes universal claims about multi-agent RL.
The formal semantics and proof-checked admission model are specified and under active development, with evaluation of the verified core reserved for future work.
Author statement in the paper on the current development status, noting that evaluation of the verified core is deferred to future work.
Reward is non-positive in the CybORG CAGE-2 environment, so all configurations operate in a failure-mitigation mode.
Environment specification reported in the paper (CybORG CAGE-2 modeled as a POMDP with non-positive reward structure).
The evaluation spanned five model families, six models, and twelve configurations, totaling 3,475 episodes with token-level cost accounting.
Methods description in the paper reporting the experimental design and sample counts.
Skills can be mapped into three categories: those AI is absorbing, those needed to work alongside AI today, and those that make humans irreplaceable tomorrow.
Conceptual taxonomy offered in the chapter, based on labour market data and workplace evidence; presented as an analytical framework rather than a quantified finding.
Fear and hype about technological transitions are temporary.
One of five lessons drawn from historical analogy and labour market history as presented in the chapter.
Virtually every job is being touched by AI.
Stated in chapter summary; claimed on the basis of labour market data and emerging workplace evidence (no numeric sample given in excerpt).
Only 9% of jobs are fully automatable.
Reported directly in chapter; based on labour market data (specific data source and sample size not stated in the excerpt).
AI automates tasks, not jobs.
Conceptual argument in chapter drawing on labour market data and historical analogy; presented as a framing claim rather than a specific empirical estimate.
These factors evolve over time, have inter-dependencies across multiple resource dimensions, and generally do not lend themselves to closed-form analysis.
Methodological observation motivating simulation/sequence-based evaluation; asserted in the paper's rationale.
Higher sectoral digitalization potential (telework feasibility and digital intensity) does not significantly affect aggregate employment levels.
Difference-in-differences (DiD) analysis using the COVID-19 shock as a quasi-natural experiment on a quarterly panel for 27 EU Member States (2018–2024), N = 36,685; reported DiD coefficient = 0.06, p ≈ 0.98.
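As a reminder of what a DiD coefficient measures, here is a minimal two-group, two-period sketch. The study itself estimates the effect on a quarterly panel, which this simplification does not reproduce. The group labels and all numbers below are hypothetical and chosen only to illustrate a null result.

```python
# Hypothetical group means of an employment index (illustrative only;
# these are NOT values from the study).
means = {
    ("high_digital", "pre"): 100.0,
    ("high_digital", "post"): 97.0,
    ("low_digital", "pre"): 80.0,
    ("low_digital", "post"): 77.0,
}


def did_estimate(group_means, treated, control):
    """Canonical 2x2 difference-in-differences:
    (treated post - treated pre) - (control post - control pre)."""
    change_treated = group_means[(treated, "post")] - group_means[(treated, "pre")]
    change_control = group_means[(control, "post")] - group_means[(control, "pre")]
    return change_treated - change_control


effect = did_estimate(means, "high_digital", "low_digital")
print(effect)
```

In this toy example both groups fall by the same amount after the shock, so the estimate is zero: the shock moved employment, but not differentially by digitalization potential.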
The study used a structured questionnaire (five-point Likert) administered to employees in AI-enabled organizations across various sectors and analyzed the data using SPSS (descriptive statistics, reliability analysis, correlation analysis, regression analysis).
Methods section summary provided in the paper (survey instrument description and analytical techniques).
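The source does not say which reliability statistic was computed in SPSS. Assuming it was Cronbach's alpha, a common choice for Likert scales, the calculation can be sketched as follows. The response matrix is hypothetical.

```python
from statistics import variance

# Hypothetical five-point Likert responses: one row per respondent,
# one column per questionnaire item (NOT data from the study).
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
]


def cronbach_alpha(rows):
    """Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)),
    using sample variances."""
    k = len(rows[0])
    items = list(zip(*rows))                      # transpose to per-item columns
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Values closer to 1 indicate that the items move together across respondents.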
The convergence properties of the explore-then-exploit pricing pipeline can be characterized via a fluid-limit ordinary differential equation (ODE) analysis.
Analytical method used in the paper: fluid-limit ODE analysis applied to the multi-firm explore-then-exploit model to study convergence.
Firms following an explore-then-exploit pipeline randomize prices during an initial exploration phase, then estimate demand from their own historical data and set prices myopically thereafter; the estimation relies on a misspecified, monopoly-style model that omits competitors' prices.
Model specification and assumptions described in the paper (methodological setup).
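The explore-then-exploit setup described above can be sketched for a single firm facing one rival. This is a minimal illustration under stated assumptions, not the paper's model: the linear demand form, the parameter values `A`, `B`, `C`, the uniform exploration range, and zero marginal cost are all hypothetical.

```python
import random

random.seed(0)

# Hypothetical true demand: d_i = A - B * p_i + C * p_j (rival price matters).
A, B, C = 10.0, 2.0, 0.5


def demand(own_price, rival_price):
    return A - B * own_price + C * rival_price


def ols_line(xs, ys):
    """Simple least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Exploration phase: both firms randomize prices independently.
periods = 2000
own_prices = [random.uniform(1.0, 3.0) for _ in range(periods)]
rival_prices = [random.uniform(1.0, 3.0) for _ in range(periods)]
sales = [demand(p, q) for p, q in zip(own_prices, rival_prices)]

# Misspecified, monopoly-style estimation: d = alpha - beta * p,
# fitted on the firm's own history only (rival prices are omitted).
intercept, slope = ols_line(own_prices, sales)
alpha_hat, beta_hat = intercept, -slope

# Exploitation: myopic price maximizing p * (alpha_hat - beta_hat * p),
# assuming zero marginal cost.
myopic_price = alpha_hat / (2 * beta_hat)
print(round(alpha_hat, 2), round(beta_hat, 2), round(myopic_price, 2))
```

Because the rival's price is omitted, its average effect is absorbed into the estimated intercept. The sketch covers only the first exploitation step; the paper's convergence analysis concerns what happens as all firms keep pricing this way.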
We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform.
Statement in the abstract: evaluation across 35 agents over a three-week deployment on the Yellow.ai V3 platform (an empirical deployment is described).
A four-dimensional Flexibility Index is developed to assess reallocation authority, forecast cycles, AI integration, and transparency.
Methods section: construction of an index with four dimensions (reallocation authority, forecast cycles, AI integration, transparency).
The analysis draws on Form 10-K filings from Microsoft, Johnson & Johnson, Procter & Gamble, and ExxonMobil (2019–2023), alongside public sector data from the Open Budget Survey 2023, the OECD Budget Practices Database, and U.S. GAO oversight reports.
Methods/data section listing data sources and firm sample (four named firms, 2019–2023) and public datasets.
The study investigates the non-linear impact of AI on economic growth in 19 G20 countries (2005–2023) using the Generalized Method of Moments (GMM) with both linear and quadratic models.
Methodological description provided in the paper: panel dataset covering 19 G20 countries over 2005–2023 and estimation via GMM with linear and quadratic specifications.
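To make the non-linear specification concrete, a quadratic model of this kind can be written as below. The notation is an illustrative assumption, not the paper's: $g_{it}$ is growth in country $i$ and year $t$, $\mathrm{AI}_{it}$ is the AI measure, and $X_{it}$ is a vector of controls. The paper's GMM specification may include further terms, such as lagged growth.

```latex
% Illustrative quadratic specification (symbols are assumptions, not the paper's notation)
\begin{align}
g_{it} &= \beta_0 + \beta_1 \mathrm{AI}_{it} + \beta_2 \mathrm{AI}_{it}^{2} + \gamma^{\top} X_{it} + \varepsilon_{it} \\
\frac{\partial g_{it}}{\partial \mathrm{AI}_{it}} &= \beta_1 + 2\beta_2 \mathrm{AI}_{it} \\
\mathrm{AI}^{*} &= -\frac{\beta_1}{2\beta_2} \quad (\beta_2 \neq 0)
\end{align}
```

The linear model is the special case $\beta_2 = 0$. With $\beta_2 \neq 0$, the marginal effect of AI changes sign at the turning point $\mathrm{AI}^{*}$.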
The paper constructs estimators for the own-adoption, spillover, and total effects and an inference procedure that allows for spatial dependence.
Presentation of concrete estimators and an inference procedure in the paper; the inference approach explicitly accommodates spatial dependence (methodological contribution).
Spillover effects are learned from never-treated units and evaluated for treated cohorts under the exposure distribution they face.
Methodological procedure in the paper: estimation of spillover effects using never-treated units as the source of variation, then applying those estimates to treated cohorts based on their observed exposure distributions.
Identification uses a prespecified summary of spillover exposure and parallel trends comparisons among units with the same exposure at the baseline and target dates.
Identification strategy articulated in the paper: assumption of a prespecified exposure summary and use of parallel trends comparisons conditional on equal exposure profiles at baseline and event dates.
For each treated cohort and event time, the framework separates the effect of own adoption, the spillover effect generated by other adopters, and the total effect under the realized rollout.
Analytical decomposition provided in the paper that defines separate estimands for (i) own-adoption effect, (ii) spillover effect from other adopters, and (iii) total realized effect for cohorts and event times.
The paper develops a difference-in-differences framework for staggered policy adoption when units can be affected by other units' adoption.
Theoretical development in the paper: presentation of a DiD framework that explicitly allows units to be affected by other units' adoption (methodological derivation and formal description).
IIQ is positioned as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation.
Explicit positioning statement in the paper: the authors state the scope and limits of IIQ as a deployment/usage metric rather than a capability measure or a causal productivity estimator (conceptual/positioning).
Sources were selected purposively through explicit inclusion and exclusion criteria tied to conceptual relevance, scholarly quality, and direct contribution to framework building; higher-order categories were retained only after iterative comparison across the four literature streams.
Author-reported sampling and analytic procedure for the integrative review.
Methodologically, the paper uses a structured integrative review combined with interpretive theory synthesis to connect literature on RegTech, sanctions compliance, institutional voids, supply chain governance, and algorithmic accountability.
Explicit methodological description in the paper (authors' stated approach).
Existing studies on regulatory technology mainly present it as a firm-level compliance tool, giving little attention to its role in shaping coordination across wider enterprise ecosystems in post-conflict and sanctions-affected settings.
Review finding based on purposive selection and comparison of literature on RegTech and related fields (method: structured integrative review and interpretive theory synthesis).
The study uses World Bank Enterprise Survey firm-level data from 2007 to 2024 and employs feasible generalized least squares (FGLS), robust ordinary least squares (OLS), and high-dimensional fixed effects (HDFE) linear regression techniques.
Direct methodological statement in the paper's abstract/summary. This is a descriptive factual claim about data and methods.
AI deployment has limited effects on retrial rates.
Same randomized field experiment; retrial rates (repeat customer contacts) were measured and reported as showing limited/no substantive change under AI deployment.
The findings are based on India-focused samples.
Paper explicitly notes the sample/context is India-focused.
PRIF was developed and validated using a mixed-method design: interviews with 30 risk advisors, case studies, and analysis of 30 forensic reports, with validation via thematic coding, risk metrics, and Delphi panel refinement.
Reported methods in the paper: mixed-method design including 30 risk advisor interviews and analysis of 30 forensic reports; validation methods named (thematic coding, risk metrics, Delphi panel).
Five structural characteristics define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring.
Theoretical specification and definition of five characteristics grounded in social science, philosophy, and humanitarian practice; no empirical prevalence or measurement reported.
The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required).
Statement in paper framing prevailing discourse; conceptual observation rather than empirical test (literature critique). No sample size reported.
Including the 2020–2021 COVID-19 lockdowns makes it possible to use the pandemic to separate structural inequalities from transient market shocks.
Design choice: use of data spanning 2016–2021, including the pandemic lockdown period, to separate persistent structural disparities from short-term shock effects.
Neither survey nor transcript-based measures of participation equity improved under LLM facilitation (an "illusion of inclusion").
Quantitative survey measures and transcript-based analyses of participation equity (e.g., measures of turn-taking, speaking/typing share) showed no improvement in equity metrics for facilitated conditions compared to controls across the experiments.
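A transcript-based participation measure of the kind mentioned above can be sketched as follows. The word-count typing share mirrors the "speaking/typing share" named in the description. Summarizing inequality with a Gini coefficient is an assumption made here for illustration, not a metric the source attributes to the paper. The transcript is hypothetical.

```python
from collections import Counter

# Hypothetical transcript for one group of three: (participant, message).
transcript = [
    ("A", "I think we should fund the food bank first"),
    ("B", "agreed"),
    ("A", "it has the clearest local impact and low overhead"),
    ("C", "what about the shelter"),
    ("A", "we could split the rest between the two"),
]


def typing_shares(messages):
    """Share of all typed words contributed by each participant."""
    words = Counter()
    for speaker, text in messages:
        words[speaker] += len(text.split())
    total = sum(words.values())
    return {speaker: count / total for speaker, count in words.items()}


def gini(values):
    """Gini coefficient: 0 means perfectly equal shares,
    (n - 1) / n means one participant contributed everything."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


shares = typing_shares(transcript)
inequality = gini(shares.values())
print(shares, round(inequality, 3))
```

Comparing such a per-group statistic between facilitated and control groups is one way an equity claim like the one above could be checked.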
Across both studies, LLM facilitation did not significantly improve group consensus.
Experimental comparison across the two studies (total N=879) measuring agreement/consensus metrics for groups randomized to LLM facilitation versus other facilitators or no facilitation; reported null effect on consensus.
Study 2 (N=675) compares facilitator strategies against a no-facilitation baseline.
Study 2 comprised N=675 participants (groups of three) randomized to different LLM facilitation strategies and a no-facilitation control.
Study 1 (N=204) compares three frontier LLMs as facilitators.
Study 1 comprised N=204 participants (groups of three) randomized to facilitator conditions comparing three frontier language models.
We present two empirical studies (N=879) of real-time, text-based group deliberation in an incentive-compatible charity allocation task with real financial stakes ($7,200 USD).
Two online experiments involving real-time, text-based group deliberation. Total participants N=879 in groups of three; total monetary stakes for the charity allocation task equal $7,200 USD.
The study used a qualitative interpretivist research design drawing on semistructured interviews with 28 managers and professionals from 12 organizations across technology, finance and knowledge-intensive service sectors in Europe and Asia, using thematic and interpretive analysis supported by organizational document review.
Methodology statement from the paper (explicit description of sample, sectors, regions and analytic approach).
AI should be conceptualized as a co-evolving organizational capability rather than a deterministic technology.
Argument developed from interpretive analysis of interview data (n=28), literature engagement and organizational document review.
The study develops an emergent framework of AI–human co-adaptation comprising three interrelated dimensions: technological alignment, cognitive calibration and ethical anchoring.
Framework derived from thematic/interpretive analysis of interview data (n=28) and supporting organizational documents.
The paper introduces the concept of 'augmented work agency' as a multi-level, interpretive form of human agency in algorithmically mediated environments.
Conceptual development within the paper grounded in literature review and qualitative interview data (28 participants) and organizational document review.
This study used a three-wave lagged survey design with 381 valid matched employees from knowledge-intensive firms in China.
Methods statement in paper reporting study design and sample composition: three-wave lagged survey and 381 valid matched employee responses from knowledge-intensive Chinese firms.
The overall impact of prompt design on readability remains limited.
Reported results from prompt-dimension experiments indicating that while some prompt elements influence readability, the aggregate effect size of prompt engineering on overall readability was limited.
Current LLMs produce code with overall readability comparable to human-written code.
Comparison of readability scores (from the paper's readability model) between LLM-generated code and human-written code across 5,869 scenarios; reported summary conclusion that overall readability is comparable.
The analysis proceeded through within-case coding and cross-case pattern matching across five dimensions: intelligence source, AI mechanism, decision domain, economic implication, and boundary condition.
Method section describing coding and analytical procedures applied to the archival corpus across the four cases.
The empirical corpus comprises annual reports, 10-K filings, earnings releases, and official corporate materials published mainly between 2024 and 2026, complemented by recent peer-reviewed literature.
Paper's data description listing document types and time window for archival evidence; number of documents not enumerated.
The study adopts a qualitative comparative multiple-case design using four theoretically sampled cases: Walmart, Unilever, Sprinklr, and DoubleVerify.
Methodological statement in the paper describing case selection and study design.