Evidence (13870 claims)
Adoption
8467 claims
Productivity
7558 claims
Governance
6805 claims
Human-AI Collaboration
6363 claims
Org Design
4132 claims
Innovation
4065 claims
Labor Markets
3526 claims
Skills & Training
2945 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 196 | 98 | 892 | 1984 |
| Governance & Regulation | 817 | 394 | 188 | 121 | 1544 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088 |
| Research Productivity | 411 | 123 | 56 | 332 | 933 |
| Output Quality | 467 | 178 | 59 | 47 | 751 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 167 | 122 | 24 | 496 |
| Task Allocation | 207 | 64 | 71 | 32 | 379 |
| Skill Acquisition | 165 | 59 | 60 | 17 | 301 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 52 | 107 | 13 | 279 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 150 | 48 | 26 | 3 | 227 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 63 | 20 | 12 | 184 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 93 | 21 | 13 | 19 | 148 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 17 | 7 | 3 | 59 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
The study sample comprises 21,428 firm-year observations from Chinese A-share listed manufacturing companies over 2010–2022.
Data description provided in the paper's abstract/introduction specifying the sample frame and time period.
We find little evidence of crashing waves (in contrast to recent work by METR).
Analysis of the >3,000 tasks and >17,000 evaluations, which reportedly shows no abrupt, concentrated surges in AI capability on small sets of tasks.
The evaluation is based on more than 17,000 evaluations by workers from these jobs.
Reported sample of >17,000 human evaluations of model outputs.
We test for these effects in preliminary evidence from an ongoing evaluation of AI capabilities across over 3,000 broad-based tasks derived from the U.S. Department of Labor O*NET categorization that are text-based and thus LLM-addressable.
Empirical study design reporting an ongoing evaluation covering >3,000 text-based tasks mapped from O*NET.
Green innovation does not yet significantly reduce carbon inequality.
Empirical results from the provincial panel analysis (2003–2021) showing that measures of green innovation are not associated with a statistically significant reduction in carbon inequality.
This paper employs a staggered difference-in-differences (DID) model using data from Chinese A-share listed manufacturing companies from 2012 to 2023 and uses the National Artificial Intelligence Innovative Application Pioneer Zone (AIIAPZ) policy as a quasi-natural experiment.
Staggered DID empirical design; sample described as Chinese A-share listed manufacturing firms, 2012–2023; AIIAPZ policy used as treatment assignment (quasi-natural experiment).
The paper characterizes the symmetric Nash equilibrium in a preemption game of competing frontier-AI firms.
Analytic game-theoretic model and equilibrium derivations presented in the paper (formal characterization/propositions).
The study uses panel data from listed manufacturing firms in China and employs a quasi-natural experiment approach.
Statement in the abstract describing data source (panel of listed manufacturing firms in China) and empirical strategy (quasi-natural experiment).
Big data analytics and blockchain technologies show no significant correlations with exports to specific destinations (multivariate probit result).
Multivariate probit model of destination-specific export decisions showing non-significant coefficients for big data analytics and blockchain across destinations (sample size not reported).
Adopting blockchain technologies does not have a statistically significant effect on a firm's likelihood of exporting (probit model result).
Probit regression analysis showing a non-significant coefficient for blockchain adoption (sample size not reported).
Adopting big data analytics does not have a statistically significant effect on a firm's likelihood of exporting (probit model result).
Probit regression analysis showing a non-significant coefficient for big data analytics adoption (sample size not reported).
We introduce the Agentic Task Exposure (ATE) score, a composite measure computed algorithmically from O*NET task data using calibrated adoption parameters (not a regression estimate), incorporating AI capability scores, workflow coverage factors, and logistic adoption velocity.
Methodological description in the paper; algorithmic construction from O*NET task data with specified calibrated adoption parameters and components (AI capability scores, workflow coverage, logistic adoption).
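To make the idea of a composite built from capability scores, workflow coverage factors, and a logistic adoption curve concrete, here is a minimal sketch. It is not the paper's formula, which is not given in this listing: the multiplicative form, the averaging over tasks, and the names `Task`, `logistic_adoption`, `exposure_score`, `midpoint`, and `velocity` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    capability: float  # hypothetical AI capability score in [0, 1]
    coverage: float    # hypothetical workflow coverage factor in [0, 1]


def logistic_adoption(t: float, midpoint: float, velocity: float) -> float:
    """Standard logistic curve: share adopted at time t (assumed functional form)."""
    return 1.0 / (1.0 + math.exp(-velocity * (t - midpoint)))


def exposure_score(tasks: list[Task], t: float,
                   midpoint: float = 5.0, velocity: float = 1.0) -> float:
    """Illustrative composite: mean of capability * coverage, scaled by adoption.

    The aggregation rule and default parameters are assumptions for
    illustration only, not the calibrated values used in the paper.
    """
    if not tasks:
        return 0.0
    adoption = logistic_adoption(t, midpoint, velocity)
    mean_task_exposure = sum(task.capability * task.coverage for task in tasks) / len(tasks)
    return adoption * mean_task_exposure


if __name__ == "__main__":
    occupation = [Task(capability=0.8, coverage=0.5), Task(capability=0.4, coverage=1.0)]
    for year in (0, 5, 10):
        print(year, round(exposure_score(occupation, t=year), 3))
```

Because every input is a fixed score or a calibrated parameter, a score of this kind is computed deterministically rather than estimated, which matches the listing's note that ATE is not a regression estimate.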
Code authoring and review are only a small part of the larger software engineering process; the resulting code must also be maintained and updated over time.
Conceptual/argumentative claim presented in the paper to motivate longitudinal analysis (not presented as an empirical estimate from the dataset).
We offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code.
Longitudinal analysis reported in the paper comparing survival and churn for agent-generated and human-authored code over time using the dataset (paper states these estimates were produced).
We compare five popular coding agents, including OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin, examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews.
Comparative analysis across agents using the constructed dataset of ~110,000 PRs (paper states these five agents were compared on metrics like merge frequency, edited file types, and interaction signals).
We construct a novel dataset of approximately 110,000 open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code.
Descriptive dataset construction reported in the paper (stated sample size ~110,000 PRs including commits, comments, reviews, issues, file changes; representing millions of lines of code).
The paper extends classical (Solow) and endogenous (Romer) growth models to incorporate TAI, producing a dynamic framework for analyzing AI-driven structural change.
Methodological claim: the authors explicitly state they build on Solow (1956) and Romer (1990) to develop an integrated dynamic model that incorporates TAI; the evidence is the model extension and formalization described in the paper.
The study uses dynamic fixed-effects and dynamic panel threshold regression techniques on a panel of 23 developed and developing countries from 2002 to 2023.
Methodological statement in the abstract specifying the estimation techniques and the dataset: a panel of 23 countries over 2002–2023.
This study uses semi-structured interviews with 10 practitioners to examine perceptions of collaborating with human versus AI teammates.
Methods statement in the paper: semi-structured interviews; sample size explicitly reported as 10 practitioners.
The study is based on a qualitative analysis of recent academic literature, comparative analysis of sector-specific applications of Big Data technologies, and synthesis of empirical findings from international studies using a systemic and structural analysis approach.
Methodological statement within the paper describing data sources and analytic approach; not an empirical claim about outcomes.
The research documents a transition in the literature (2013–2025) from early 'risk-of-automation' evaluations toward task-based and firm-level econometric models.
Literature review/synthesis across the 2013–2025 body of research as described in the paper.
Society 5.0 and Industry 5.0 call for human-centric technology integration, but the concept lacks an operational definition that can be measured, optimized, or evaluated at the firm level.
Motivating claim grounded in literature gap analysis presented in the paper (argument that normative frameworks lack formal, operational metrics at firm level).
We propose the Workplace Augmentation Design Index (WADI), a 36-item theory-grounded instrument for diagnosing human-centricity at the firm level.
Instrument design/proposal presented in the paper (36 items mapped to the five workplace-design dimensions); no validation sample reported in the abstract.
We conducted a PRISMA-guided systematic review of 120 papers (screened from 6,096 records) to map the evidence base for each workplace-design dimension.
Systematic literature review using PRISMA protocol; final sample = 120 papers; initial records screened = 6,096.
Existing models of human-AI complementarity treat the augmentation function phi(D) as exogenous and thus ignore that two firms with identical technology investments can achieve radically different augmentation outcomes depending on workplace organization.
Argument based on literature review of prior models (the paper contrasts its approach with existing complementarity models). No new empirical sample reported for this specific claim.
The widening effect of AI adoption on the electricity output growth gap diminishes over time and becomes statistically insignificant after approximately three years.
Temporal (dynamic) empirical analysis / event-study-style estimation tracing the AI adoption effect over multiple years post-adoption; statistical significance reported to fade by year ~3. Sample size / exact time windows not provided in the summary.
The review employed a systematic analysis of multidisciplinary studies (qualitative, quantitative, and bibliometric) focused on agentic AI technologies in financial domains, covering literature published up to mid-2024.
Stated methodology of the paper (systematic review description).
A subset of four datasets included settings in which the AI provided explanations of its decision.
Paper states that four of the datasets involved AI explanations (explicitly stated in abstract).
The study compared HCT to the AI-as-advisor approach using 10 datasets from various domains, including medical diagnostics and misinformation discernment.
Paper reports an empirical comparison across 10 datasets spanning multiple domains (explicitly stated in abstract).
The hybrid confirmation tree (HCT) elicits a human judgment and an AI judgment independently; if they agree that decision is accepted, and if they disagree a second human breaks the tie.
Description of the HCT method in the paper (procedural/design specification).
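The HCT procedure described above can be sketched as a short decision rule. This is a minimal illustration, not the authors' implementation: the function name `hybrid_confirmation_tree`, the use of a callable for the second human (so that the judgment is only elicited on disagreement), and the example labels are assumptions.

```python
from typing import Callable, TypeVar

Label = TypeVar("Label")


def hybrid_confirmation_tree(
    first_human: Label,
    ai: Label,
    ask_second_human: Callable[[], Label],
) -> Label:
    """Return the final decision under the HCT rule.

    The first human and the AI judge independently. If they agree, that
    decision is accepted. If they disagree, a second human is consulted
    and that judgment breaks the tie.
    """
    if first_human == ai:
        return first_human
    # Disagreement: only now is the second human's judgment needed.
    return ask_second_human()


if __name__ == "__main__":
    # Agreement: the second human is never consulted.
    print(hybrid_confirmation_tree("true", "true", lambda: "false"))
    # Disagreement: the second human decides.
    print(hybrid_confirmation_tree("true", "false", lambda: "false"))
```

Passing the second human as a callable reflects one practical property of the design: the extra human judgment is only spent on cases where the first human and the AI disagree.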
The cross-sectional, self-reported survey design prevents strong causal claims about the effect of algorithms or selective exposure on polarization.
Authors explicitly note methodological limitations: cross-sectional survey of N = 450, reliance on self-reported consumption, and lack of platform log or longitudinal/experimental data.
The study adopted a positivist philosophy and a descriptive-correlational design.
Methods section statement in the paper describing the research philosophy and study design.
Data were collected from innovation-focused executives across 39 licensed Kenyan commercial banks.
Paper statement specifying sample source: 'Using data from innovation-focused executives across 39 licensed banks.'
Technological innovation was assessed via adoption of new systems, integration of digital channels, and use of Artificial Intelligence and data analytics.
Measurement description provided in the paper listing the components used to operationalize technological innovation.
Competitiveness in the study was measured through market share, return on equity and customer satisfaction.
Measurement description provided in the paper describing dependent variable operationalization (explicit list of three indicators).
The research method is normative legal research using statutory, conceptual, and comparative approaches, supported by a literature analysis of SINTA-indexed national journals and reputable international journals.
Method statement set out clearly in the paper's abstract/methodology.
The study assesses the adequacy of the legal protection available to workers affected by termination of employment (Pemutusan Hubungan Kerja, PHK) resulting from AI adoption.
Statement of the research objective and analytical approach (normative, comparative), supported by a literature review of selected journals.
This study aims to analyze how the Job Creation Law (Undang-Undang Cipta Kerja) and its implementing regulations classify and justify termination of employment (PHK) resulting from AI adoption.
Statement of the research objective in the methodology/introduction section; statutory approach within normative legal research.
The user study had N=50 participants.
Reported user study sample size (N=50) used to evaluate AI-assisted intent expansion in ecologically valid settings.
Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient.
Controlled comparison between three structured frameworks (5W3H, CO-STAR, RISEN) across the evaluated outputs, with no meaningful differences reported between them.
The study evaluated 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks) using an independent judge (DeepSeek-V3).
Reported experimental design and evaluation: 3 languages, 6 conditions, 3 models, 3 domains, 20 tasks; judged by DeepSeek-V3.
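The reported total follows from a full factorial crossing of the five factors. A small sketch that enumerates the design cells and confirms the count; the factor sizes come from the claim above, while the dictionary keys and the use of integer indices in place of the actual languages, conditions, models, domains, and tasks are placeholders.

```python
from itertools import product

# Factor sizes as reported; the level identities are not listed here,
# so integer indices stand in for them.
design = {
    "language": 3,
    "condition": 6,
    "model": 3,
    "domain": 3,
    "task": 20,
}

# One tuple per (language, condition, model, domain, task) combination.
cells = list(product(*(range(n) for n in design.values())))
total_outputs = len(cells)

if __name__ == "__main__":
    print(total_outputs)  # 3240
```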
The paper frames the LLM-politician relationship through principal-agent theory and bounded rationality, conceptualizing the legislator as a principal delegating advisory tasks to a boundedly rational agent under structural information asymmetry.
Explicit theoretical framing described in the introduction or theory section of the paper.
Model outputs were evaluated using a dual framework combining LLM-as-Judge semantic scoring and programmatic text similarity metrics.
Paper describes the evaluation methodology: semantic scoring via LLM-as-Judge plus programmatic text similarity measures applied to model-generated rationales vs official memoranda.
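A dual evaluation of this kind can be sketched as one function that pairs a judge score with programmatic similarity metrics. The listing does not name the paper's specific metrics or judge prompt, so everything here is an assumption: `token_jaccard` and `char_similarity` are stand-in similarity measures, and `judge` is a hypothetical callable that would wrap an LLM call in practice.

```python
from difflib import SequenceMatcher
from typing import Callable


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens (stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def dual_evaluate(
    generated: str,
    reference: str,
    judge: Callable[[str, str], float],
) -> dict[str, float]:
    """Combine a semantic judge score with programmatic text similarity.

    `judge` is a placeholder for an LLM-as-Judge call; it receives the
    generated rationale and the official memorandum and returns a score.
    """
    return {
        "judge_score": judge(generated, reference),
        "token_jaccard": token_jaccard(generated, reference),
        "char_similarity": char_similarity(generated, reference),
    }


if __name__ == "__main__":
    # A constant stub stands in for the LLM judge in this example.
    stub_judge = lambda generated, reference: 0.5
    print(dual_evaluate("the law amends tax rules", "the law amends pension rules", stub_judge))
```

Keeping the two kinds of score side by side, rather than averaging them, makes it visible when a rationale is semantically close to the memorandum but worded differently, or the reverse.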
Six LLMs were evaluated: GPT-5-mini, GPT-5-chat (OpenAI), Claude Haiku 4.5 (Anthropic), and Llama 4 Maverick, Llama 3.3 70B, Llama 3.1 8B (Meta).
Paper explicitly lists the six evaluated models spanning three provider families and multiple capability tiers.
The study uses a dataset of 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive).
Explicit statement in the paper describing the dataset composition: 15 Romanian Senate law proposals each paired with its official explanatory memorandum.
We implement a rigorously controlled execution-based testbed featuring Git worktree isolation and explicit global memory to evaluate agent coordination frameworks.
Methodological description in the paper indicating the testbed design choices (Git worktree isolation, explicit global memory) used to ensure controlled, reproducible execution of agent-generated code.
We benchmark a single-agent baseline against two multi-agent paradigms: a subagent architecture (parallel exploration with post-hoc consolidation) and an agent team architecture (experts with pre-execution handoffs) using a rigorously controlled, execution-based testbed.
Description of experimental setup in the paper: an execution-based testbed with Git worktree isolation and explicit global memory; experiments explicitly compare single-agent, subagent, and agent-team architectures under fixed computational time budgets.
Data construction: The authors treat Wikipedia technology pages as distinct technologies and trace them across patents and job postings from 1976 to 2007, using technical bigrams to identify technologies in texts.
Description of dataset construction building on Kalyani et al. (2025) in Section 2; methodological description of linking Wikipedia pages, patent text, and job postings.
Proposition 1: With a constant pace of technology creation (m(b)=m), the model admits a unique balanced growth path (BGP) along which real wages and output grow at rate g and the skill premium remains constant and independent of m.
Analytical result (proposition) proved in the paper's model appendix under model assumptions.
The modal technology in the top 1% densest locations (e.g., New York, San Francisco) is 34 years old, while the modal technology in the bottom 50% lowest-density locations is 48 years old, indicating sizable diffusion gaps.
Empirical measurement from the text-based technology dataset tracking vintage of technologies across locations; reported modal ages by location density percentile.