Evidence (14055 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
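The matrix is easier to compare across outcomes when each direction is expressed as a share of the row total. The sketch below does this for four rows copied from the table; the function name `direction_shares` and the choice of rows are illustrative assumptions, not part of the source. Shares are computed against the Total column, which in some rows exceeds the sum of the four direction columns (for example, Firm Productivity sums to 600 against a total of 606), so the shares need not add up to 1.

```python
# Selected rows copied from the evidence matrix:
# (positive, negative, mixed, null, total)
rows = {
    "Firm Productivity": (435, 57, 88, 20, 606),
    "AI Safety & Ethics": (218, 277, 65, 33, 599),
    "Inequality Measures": (44, 122, 49, 6, 221),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's count as a fraction of the row total.

    Hypothetical helper for illustration; the denominator is the
    table's Total column, not the sum of the four direction counts.
    """
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
    }


shares = {name: direction_shares(*counts) for name, counts in rows.items()}

for name, s in shares.items():
    print(f"{name}: {s['positive']:.1%} positive, {s['negative']:.1%} negative")
```

Read this way, the direction of a row depends on what the outcome measures: a "negative" finding for Job Displacement or Inequality Measures is not comparable to a "negative" finding for Firm Productivity without checking how direction was coded.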
Total (aggregate) unemployment is statistically insignificant in explaining sustainable development, indicating aggregate measures mask critical distributional differences across skill groups.
ARDL estimation results reported in the paper showing an insignificant coefficient for total unemployment; discussion emphasizing distributional masking.
This study constructs a comprehensive evaluation system of urban ecological resilience from three dimensions: potential, elasticity, and stability.
Methodological description in the paper: authors state they constructed a composite resilience evaluation system composed of three specified dimensions for prefecture-level cities.
AI-assisted feedback does not reduce time per character (i.e., it does not increase time cost per unit of feedback).
Time-per-character was measured in the randomized field experiment; authors report no reduction (no increase in time per character) associated with the AI-assisted drafts. Student-level/completion-level data from the experiment (n=88); 11 TAs.
AI-assisted feedback does not negatively affect student usefulness ratings.
Measured student ratings of usefulness in the randomized field experiment; authors report no negative effect of the treatment on these ratings (no significant decrease reported). Student-level sample n=88; 11 TAs.
Two-stage field experiments in healthcare prescription messaging encompassed 693,139 patient visits in total.
Paper statement of total sample size across Stage 1 and Stage 2.
Stage 2 (Tool-Augmented Agentic AI) autonomously extracted principles from Stage 1 data and generated 17 new message variants tested on 248,448 patient visits.
Study design and reported results from Stage 2 of the two-stage field experiment described in the paper.
Stage 1 (Human + Chatbot) produced 13 message variants and was tested on 444,691 patient visits.
Study design details reported in the paper describing the two-stage field experiment.
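The three sample-size claims for the two-stage prescription messaging experiment can be checked against one another. The sketch below uses only the figures stated in the claims; the variable names are illustrative.

```python
# Figures as stated in the three claims above.
stage1 = {"variants": 13, "visits": 444_691}   # Human + Chatbot
stage2 = {"variants": 17, "visits": 248_448}   # Tool-Augmented Agentic AI
reported_total_visits = 693_139

# The per-stage visit counts should add up to the reported total.
total_visits = stage1["visits"] + stage2["visits"]
totals_agree = total_visits == reported_total_visits

# Derived quantities, not reported in the claims themselves.
total_variants = stage1["variants"] + stage2["variants"]
stage1_share = stage1["visits"] / total_visits

print(f"Stage totals agree with reported total: {totals_agree}")
print(f"Message variants across both stages: {total_variants}")
print(f"Share of visits in Stage 1: {stage1_share:.1%}")
```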
The empirical analysis is based on panel data of new energy vehicle firms in the Yangtze River Delta from 2001 to 2023.
Dataset description provided in the paper's abstract/introduction indicating the time span and regional coverage.
R&D expenditure does not constitute a significant mediating channel between artificial intelligence and firms' new quality productive forces.
Mediation analysis using the panel data and constructed indicators; reported nonsignificant mediation effect of R&D expenditure (no sample size or statistics reported in excerpt).
The system was evaluated on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications.
Experimental evaluation reported in the paper using the OMH-Polyglot benchmark.
Explicit commercial content (product placement) shows no engagement premium (−3.8%, not significant).
Analysis comparing videos labeled for explicit commercial content (product placement) to others; reported percent difference and non-significance.
We conducted a multimodal AI audit of 5,051 videos across 79 kidfluencer channels using weak supervision (LLM-based classification of titles and GPT-4 Vision analysis of thumbnails and descriptions across six literature-grounded dimensions) to assign a probabilistic exploitation score to each video.
Described dataset and methods in paper: multimodal automated pipeline combining weak supervision labeling functions (LLM classifiers on titles, GPT-4 Vision on thumbnails/descriptions) applied to 5,051 videos from 79 channels.
The study developed a manufacturing value chain resilience (MVCR) index system based on three dimensions: Readiness, Response, and Recovery, using the CSMAR database.
Methodological description: construction of MVCR index using CSMAR microdata and a three-dimension framework (Readiness, Response, Recovery).
The study constructed indices of industrial robot application at the enterprise-industry-year level by matching industry-level industrial robot data published by the IFR with microdata from Chinese A-share listed companies.
Methodological description in the paper: matching IFR industry-level industrial robot data to microdata from Chinese A-share listed firms to build enterprise-industry-year robot-application indices.
The study uses listed companies in China's manufacturing industry from 2010 to 2023 as the research sample.
Authors explicitly state the empirical sample: listed manufacturing firms in China covering 2010–2023.
The positive relationship between BDTA and CEE remains robust after a series of robustness tests and endogeneity tests.
Authors state they conducted robustness checks and endogeneity tests (unspecified in the summary) and report that the main regression results remain robust.
Brain privacy has both personal and social attributes; its protection therefore implicates individual interests and technological development.
Normative/legal argumentation and conceptual analysis presented in the paper (no empirical data reported).
The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range.
Protocol described in paper explicitly states two runs with different injection SNR scalings (one 'unrealistically loud', one physically motivated).
Both agents received identical written specifications and identical compute resources.
Methodological statement in paper specifying that both agents were given the same written spec and the same shared computing infrastructure.
The pipeline comprised power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D.
Protocol description in paper; matched filter recovery included 100 injected signals (explicitly stated).
We compared two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention.
Experimental design described in paper: two named agents were given identical written specifications and identical compute resources and executed the full pipeline autonomously.
Greater frontier-level compute does not consistently translate to better performance.
Empirical observation in the paper's findings: increasing compute capacity at the Pareto frontier did not uniformly improve task performance across evaluated tasks.
New York City’s Local Law 144 mandates annual bias audits to increase transparency.
Statement of law/policy in paper (factual claim about NYC Local Law 144); legal requirement as described in the text.
The fairness of AI-enabled hiring systems remains uncertain.
Statement in paper (background/interpretive claim); no direct empirical measure provided in the excerpt.
The study employs a comparative mixed-methods approach (comparative institutional analysis) of leading financial systems in China, the United States, and the United Kingdom (2022–2025), integrating secondary quantitative indicators with qualitative documentary evidence.
Direct methodological statement in the abstract describing the study design and data sources.
The distinction matters: debt is a stock of design and governance liability, while the tax is a flow of operating cost that arises because stochastic agents act through tools and workflows.
Conceptual argument in the paper articulating difference between two defined concepts (Agentic Technical Debt vs Stochastic Tax); no empirical demonstration.
Stochastic Tax is the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds.
Paper provides a formal definition / conceptual framing of 'Stochastic Tax'; stated as an operational concept (no empirical quantification provided).
Agentic Technical Debt is the accumulated liability created when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed.
Paper provides a formal definition / conceptual framing of 'Agentic Technical Debt'; presented as a definitional contribution rather than an empirically measured quantity.
Agentic AI systems reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback.
Descriptive/definitional statement in the paper; presented as characteristics of agentic systems rather than supported by empirical measurement.
Agentic AI systems are increasingly being explored as production infrastructure.
Stated as an observation in the paper's introduction/abstract; no empirical data, sample, or formal measurement provided (conceptual/observational claim).
The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage.
Stated audit design and sample counts in paper (method section describing factorial design and coverage of model/prompt cells).
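The stated run count can be reproduced from the design description. The sketch below assumes each cell is fully crossed (every persona with every covered prompt, 10 repetitions each); the two OpenAI cell labels are placeholders, since the claim does not name those configurations.

```python
personas = 10
reps = 10  # N=10 repetitions per persona/prompt/configuration

# Prompts covered in each model-configuration cell, per the claim.
# The OpenAI cell names are hypothetical placeholders.
prompt_coverage = {
    "openai_cell_a": 8,
    "openai_cell_b": 8,
    "anthropic_sonnet_4.6_low": 4,
}

# Runs per cell under a fully crossed design (an assumption).
runs_per_cell = {
    cell: personas * prompts * reps
    for cell, prompts in prompt_coverage.items()
}
total_runs = sum(runs_per_cell.values())

# Size of the full design space if every cell had 8-prompt coverage.
full_design = personas * 8 * len(prompt_coverage) * reps

print(runs_per_cell)
print(f"Total runs: {total_runs} of {full_design} in the full design")
```

Under these assumptions the partial coverage of the Anthropic cell accounts for the gap between the 2,400-run full design and the 2,000 runs sampled.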
The paper evaluates the proposed architecture using the outcome metric 'time-to-insight'.
Methodological statement in the paper listing evaluation metrics.
The paper evaluates the proposed architecture using the outcome metric 'time-to-find'.
Methodological statement in the paper listing evaluation metrics.
The paper evaluates the proposed architecture using the outcome metric 'data product adoption'.
Methodological statement in the paper listing evaluation metrics.
In the first acquisition the acquirer pursued a disruptive 'rip-and-replace' strategy for the target’s proprietary ERP system.
Empirical observation from the paper's comparative case study of two consecutive acquisitions of the same digital target (qualitative case evidence).
We identify four archetypes (data orchestrators, aggregators, niche specialists, and cloud orchestrators).
Paper states it develops a taxonomy and explicitly lists four archetypes; based on the taxonomy development and conceptual classification reported in the paper (no sample size or quantitative empirical test reported in abstract).
This paper contributes a large-scale empirical dataset involving 57,954 essays from 10,195 students across 120 schools over two years.
The paper explicitly states the dataset size and coverage in the abstract: 57,954 essays, 10,195 students, 120 schools, two-year period.
We leverage logo design job posts before and after the launch of an early-stage platform-embedded logo-AI tool on the online labour market EPWK, using a difference-in-differences design and a new large language model-based skill extraction and embedding framework.
Paper's described empirical design and methods: dataset of logo design job posts on EPWK around the logo-AI tool launch; difference-in-differences analytic approach; LLM-based skill extraction and embedding pipeline. No sample size provided in the abstract.
Existing research mainly examines general-purpose GenAI, such as ChatGPT, and focuses on aggregate outcomes, including falling demand and compressed prices in easily automated tasks, while revealing little about the demand for work skills and the role of platform-embedded GenAI.
Paper's literature review / background statement summarizing prior empirical work on general-purpose GenAI (e.g., studies documenting falling demand and price compression in automatable tasks). No sample size reported in this statement.
We distill our findings into a meta-design and four design principles (DPs), grounded in kernel theories, for systems where human contextual intelligence and algorithmic recognition must coexist.
Design contribution presented in the paper (meta-design artifact and four DPs derived from the study).
We developed a collaborative forecasting system that leverages semantic processing using large language models (LLMs) to solve the 'cold-start' problem for novel menu items while preserving human agency via override mechanisms.
Description of system design and implementation produced during the ADR project (practice-driven abductive approach).
This paper reports on a 9-month action design research (ADR) project at a German financial services firm.
Explicit methodological description in the paper (study duration and organizational context).
We examined how different degrees of embodiment affect team performance and conversational dynamics in a real-life escape room; teams were composed of either three humans or two humans and an artificial agent (a Box, an Avatar, or a hyper-realistic humanoid).
Experimental field study reported in the paper: a real-life escape room experiment comparing team compositions (3 humans vs. 2 humans + agent of three embodiment types). Sample size not reported in the provided text.
To the best of the authors' knowledge, no prior study has examined the psychological mechanism through which algorithmic management shapes employee voice and silence behaviour outside of gig economy and platform work contexts.
Author claim based on literature review (stated gap in existing research).
The empirical strategy uses panel local projections to estimate the dynamic effects of AI adoption.
Methodological statement in the paper: application of panel local projections to panel data of industries/establishments over 2017-2025.
AI adoption is measured using the share of establishment-level job postings that explicitly require AI-related skills across 13 industries over 2017-2025.
Study design / data description: share of establishment-level job postings requiring AI skills; coverage across 13 industries for years 2017-2025.
Estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference.
Analysis reported in the paper examining the relationship between message volume and estimation accuracy; described as a weak dependency.
This paper uses the Difference-in-Differences method for empirical research.
Methodological statement in the excerpt explicitly naming the DiD approach.
Regression models and moderation analyses were performed in R to examine associations between governance exposure, AI maturity, and adaptation intensity.
Methods statement: 'Regression models and moderation analyses were performed in R (R Computing, Austria) to examine associations between governance exposure, AI maturity, and adaptation intensity.'
Path-specific composite indices for bifurcation, modularity, ethical signaling, and compartmentalization were quantified using validated scales.
Methods description in the paper: 'Path-specific composite indices ... were quantified using validated scales.'