Evidence (13870 claims)
Adoption (8467 claims)
Productivity (7558 claims)
Governance (6805 claims)
Human-AI Collaboration (6363 claims)
Org Design (4132 claims)
Innovation (4065 claims)
Labor Markets (3526 claims)
Skills & Training (2945 claims)
Inequality (2066 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 196 | 98 | 892 | 1984 |
| Governance & Regulation | 817 | 394 | 188 | 121 | 1544 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088 |
| Research Productivity | 411 | 123 | 56 | 332 | 933 |
| Output Quality | 467 | 178 | 59 | 47 | 751 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 167 | 122 | 24 | 496 |
| Task Allocation | 207 | 64 | 71 | 32 | 379 |
| Skill Acquisition | 165 | 59 | 60 | 17 | 301 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 52 | 107 | 13 | 279 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 150 | 48 | 26 | 3 | 227 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 63 | 20 | 12 | 184 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 93 | 21 | 13 | 19 | 148 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 17 | 7 | 3 | 59 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
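Raw counts are hard to compare across outcomes of very different size, so it can help to convert each row to shares. The sketch below does this for three rows copied from the table. One caveat: the four direction columns do not always add up to the Total column (for Other they sum to 1,935 against a listed Total of 1,984), so dividing by the sum of the four direction columns, as done here, is an assumption about how a share should be defined, not something the page specifies. The names `ROWS` and `direction_shares` are illustrative.

```python
# Three rows copied from the table: (positive, negative, mixed, null).
ROWS = {
    "Firm Productivity": (435, 55, 88, 20),
    "Job Displacement": (12, 80, 20, 1),
    "Error Rate": (69, 92, 10, 2),
}
DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Convert one row of direction counts into fractions of that row's sum."""
    total = sum(counts)
    if total == 0:
        raise ValueError("row has no directional claims")
    return {d: c / total for d, c in zip(DIRECTIONS, counts)}


shares = {outcome: direction_shares(c) for outcome, c in ROWS.items()}
for outcome, s in shares.items():
    print(outcome, {d: round(v, 3) for d, v in s.items()})
```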
Industrial robot penetration is used as a proxy measure for AI adoption in Chinese provinces.
Paper explicitly states industrial robot penetration was used as the proxy for AI adoption in the empirical analysis.
The study uses panel data on 31 Chinese provinces for the period 2000–2022 and employs panel threshold regression models with ageing and AI adoption as threshold variables.
Paper description: panel data from 31 provinces (2000–2022); use of panel threshold regression models; threshold variables specified as ageing and AI adoption (industrial robot penetration).
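The core idea of a threshold regression is that the slope on a regressor switches when a threshold variable crosses an estimated cut-off. The sketch below illustrates only that idea, under strong simplifying assumptions that are not from the paper: a single threshold variable `q`, a single regressor `x`, no intercept, and synthetic data. A full panel threshold model would also remove province fixed effects, include controls, and test the threshold's significance; none of that is shown, and the function names are hypothetical.

```python
def _slope_and_ssr(points):
    """Least-squares slope through the origin and its sum of squared residuals."""
    sxx = sum(x * x for x, _ in points)
    if sxx == 0:
        return 0.0, sum(y * y for _, y in points)
    b = sum(x * y for x, y in points) / sxx
    return b, sum((y - b * x) ** 2 for x, y in points)


def fit_single_threshold(data, min_obs=2):
    """Grid-search the cut-off gamma that minimizes total SSR.

    data: list of (q, x, y) tuples, where q is the threshold variable.
    Returns (gamma, slope_low, slope_high, ssr), or None if no split is feasible.
    """
    best = None
    for gamma in sorted({q for q, _, _ in data}):
        low = [(x, y) for q, x, y in data if q <= gamma]
        high = [(x, y) for q, x, y in data if q > gamma]
        if len(low) < min_obs or len(high) < min_obs:
            continue  # require a minimum number of observations per regime
        b_low, ssr_low = _slope_and_ssr(low)
        b_high, ssr_high = _slope_and_ssr(high)
        ssr = ssr_low + ssr_high
        if best is None or ssr < best[3]:
            best = (gamma, b_low, b_high, ssr)
    return best


# Synthetic data: slope 1 when q <= 0.4, slope 3 above it.
qs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
data = [(q, x, (1.0 if q <= 0.4 else 3.0) * x) for x, q in enumerate(qs, start=1)]
gamma, slope_low, slope_high, ssr = fit_single_threshold(data)
print(gamma, slope_low, slope_high)
```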
Specification and implementation are available at https://github.com/chelof100/acp-framework-en
Repository URL provided in the specification text; points to the stated implementation and documentation artifacts.
The specification defines more than 62 verifiable requirements and 12 prohibited behaviors.
Quantitative claims stated in the specification about requirement and prohibited-behavior counts.
The v1.13 release includes an OpenAPI 3.1.0 specification for all HTTP endpoints.
Specification/repository statement indicating an OpenAPI 3.1.0 specification is provided for HTTP endpoints.
The v1.13 release includes 51 signed conformance test vectors (Ed25519 + SHA-256).
Repository/specification statement listing 51 signed conformance test vectors and the signature/hash algorithms used.
The v1.13 release includes a Go reference implementation of 22 packages covering all L1-L4 capabilities.
Repository statement describing a Go reference implementation comprising 22 packages and coverage claim for L1-L4.
The v1.13 specification comprises 36 technical documents organized into five conformance levels (L1-L5).
Explicit quantitative statement in the specification/repository describing document count and organization.
The experiment compared three prompt conditions: (A) simple prompts, (B) raw PPS JSON, and (C) natural-language-rendered PPS.
Method description of the three prompt conditions used in the controlled experiment.
The study used three specific LLMs: DeepSeek-V3, Qwen-Max, and Kimi.
Method section listing the three models evaluated in the experiment.
We ran a controlled three-condition study across 60 tasks in three domains (business, technical, and travel), three large language models (DeepSeek-V3, Qwen-Max, and Kimi), and three prompt conditions, collecting 540 AI-generated outputs evaluated by an LLM judge.
Authors report an experimental study design: 60 tasks × 3 models × 3 prompt conditions = 540 outputs, with outputs evaluated by an LLM judge (methodological description in the paper).
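The fully crossed design described above (60 tasks × 3 models × 3 prompt conditions) can be enumerated directly. In this minimal sketch the task identifiers and the condition labels are placeholders, since the excerpt does not list the tasks or say how they are split across the three domains; only the model names and the counts come from the claim.

```python
from itertools import product

# Hypothetical task identifiers; the paper reports 60 tasks but the excerpt
# does not list them.
TASKS = [f"task_{i:02d}" for i in range(1, 61)]
MODELS = ["DeepSeek-V3", "Qwen-Max", "Kimi"]
# Placeholder labels following the A/B/C naming in the method description.
CONDITIONS = ["A_simple_prompt", "B_raw_pps_json", "C_natural_language_pps"]


def build_design(tasks, models, conditions):
    """Return one (task, model, condition) cell per generated output."""
    return list(product(tasks, models, conditions))


design = build_design(TASKS, MODELS, CONDITIONS)
print(len(design))  # 540
```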
The paper presents a formal evolutionary taxonomy of generative AI spanning five eras (1943–present) and analyzes frontier lab dynamics, sovereign AI emergence, and post-training alignment evolution from RLHF through GRPO.
Conceptual taxonomy and historical/organizational analysis provided in the paper. No empirical sample size reported in the excerpt.
The framework extends the Sustainability Index of Han et al. (2025) from hardware-level analysis to ecosystem-level analysis.
Conceptual / methodological extension claimed by the authors referencing Han et al. (2025). No empirical sample size reported in the excerpt.
Classical scaling laws model AI performance as monotonically improving with model size.
Statement about prior literature / modeling assumptions (classical scaling laws). No empirical sample size reported in the excerpt.
Existing financial question answering benchmarks primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals.
Literature/background claim made in the paper motivating the new benchmark; authors contrast prior benchmarks' focus on balance sheet data with the lack of market/trading-signal evaluation.
Retrieval provides limited benefit for trading-signal reasoning.
Experimental comparison reported in the paper showing that retrieval-augmentation had little impact on performance for trading-signal-focused questions.
To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment.
Methodological claim in the paper describing the QA and annotation pipeline; the paper reports using these components as part of their reliability framework.
The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning.
Direct description of the benchmark's taxonomy in the paper; the authors specify these three categories as the organizational structure for the 1,400 questions.
FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window.
Statement in the paper describing the benchmark construction and scope; the paper reports the benchmark size (1,400 questions) and the dataset grounding (NASDAQ-100 over ten years).
The paper derives formal conditions under which the inversion (smaller, orchestrated models outperforming frontier models) holds.
Mathematical derivations and stated sufficient/necessary conditions presented in the paper.
We develop the Institutional Fitness Manifold, a mathematical framework that evaluates AI systems along four dimensions: capability, institutional trust, affordability, and sovereign compliance.
Theoretical/model development presented in the paper (formal definition of the manifold and its four dimensions).
There have been five eras of AI development since 1943, and within the current Generative AI Era there are four distinct epochs, each initiated by a discontinuous event.
Descriptive/historical classification within the paper (counts of eras and epochs; named initiating events such as the transformer and the 'DeepSeek Moment').
The study uses panel data for 30 Chinese provinces from 2013–2022 to measure urban circular economy efficiency (UCEE) with a Super-SBM model including undesirable outputs, track dynamics via the Global Malmquist–Luenberger index, and estimate spatial effects with a spatial Durbin model.
Methodological description in the abstract: explicit statement of data (30 provinces, 2013–2022) and the three methods used (Super-SBM with undesirable outputs, GML index, spatial Durbin model).
Despite fears of mass unemployment, aggregate labor-market data through 2025 show limited labor-market disruption from generative AI.
Review of aggregate employment and labor-market studies and macro-level data through 2025 cited in the brief; methods include analyses of employment statistics and macro labor indicators (no single sample size reported).
We scored rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112 transcript segments from multi-agent governance simulations.
Reported methodology: multi-agent governance simulations with agents in formal governmental roles, outcomes evaluated by an independent rubric-based judge; explicit sample count of 28,112 transcript segments.
We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines.
Empirical study design described in the paper: open competition with reported counts of teams and participants (29 teams, 80 participants); comparison between participant submissions and AI-only baselines.
AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking.
Descriptive dataset/benchmark specification in the paper stating task count and industry coverage.
Open research challenges that define the research agenda include scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, and designing human-AI specification interactions.
Authors' explicit enumeration of open problems and a proposed multi-disciplinary research agenda; presented as expert opinion rather than empirical finding.
The interaction between selection and recourse generates a closed-loop dynamical system linking candidate selection and strategic recourse.
Formalization in the paper showing feedback dynamics between selection outcomes and candidate adjustments (modeling/result claim).
This setting produces endogenous selection, in which both the decision rule and the selection threshold are determined by the population's current feature state.
Derived implication of the framework and model dynamics described in the paper (theoretical consequence of the model).
The success benchmark evolves endogenously as many candidates adjust simultaneously.
Analytical property of the proposed model: simultaneous adjustments by candidates change the effective benchmark (theoretical result asserted by authors).
The study proposes a framework that models recourse as a strategic interaction among candidates under a risk-based selection rule.
The paper introduces a formal/modeling framework (methodological contribution described by the authors).
Actionable recourse studies whether individuals can modify feasible features to overturn unfavorable outcomes produced by AI-assisted decision-support systems.
Definition and framing stated by the authors in the paper's introduction/background (conceptual claim).
A GNN graph is constructed from reasoning embeddings and trading decisions are made using a PPO-DSR policy.
Method description: the paper reports embedding agents' reasoning, building a graph neural network (GNN) from those embeddings, and using a PPO-DSR reinforcement learning policy to trade. Specific GNN/PPO-DSR hyperparameters and architecture are not provided in the excerpt.
Four LLM agents output scores along with reasoning.
Method description: the paper states that four LLM agents produce numeric scores and associated textual reasoning. The number of agents is explicitly given as four; no further architecture or model-family details included in the excerpt.
BlindTrade blindfolds its agents by anonymizing all identifiers, including tickers and company names.
Methodological description in the paper: the system design explicitly replaces tickers and company names with anonymized identifiers. Implementation details and examples not provided in the excerpt.
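Since the excerpt gives no implementation details, the following is only one plausible way such blindfolding could work, not BlindTrade's actual code. It maps each ticker and company name to a neutral alias and substitutes whole-word matches in the text. The alias format `ASSET_001`, the function names, and the example companies are all assumptions made for illustration.

```python
import re


def build_alias_map(companies):
    """companies: list of (ticker, company_name) pairs -> {surface form: alias}."""
    alias_map = {}
    for i, (ticker, name) in enumerate(companies, start=1):
        alias = f"ASSET_{i:03d}"  # hypothetical alias format
        alias_map[ticker] = alias
        alias_map[name] = alias
    return alias_map


def blindfold(text, alias_map):
    """Replace every whole-word occurrence of a known identifier with its alias."""
    if not alias_map:
        return text
    # Longest forms first, so a full company name wins over a shorter prefix.
    names = sorted(alias_map, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(n) + r"(?!\w)" for n in names)
    )
    return pattern.sub(lambda m: alias_map[m.group(0)], text)


# Illustrative input only; not taken from the paper.
aliases = build_alias_map([("AAPL", "Apple Inc."), ("MSFT", "Microsoft Corp.")])
masked = blindfold("AAPL rose 2% after Apple Inc. beat estimates; MSFT was flat.", aliases)
print(masked)
```

Mapping the ticker and the company name to the same alias keeps references to one firm consistent after masking.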
Data ethics, as a central pillar of digital ethics, emphasizes the responsible use and protection of personal information.
Conceptual/definitional statement in the paper situating data ethics within digital ethics and highlighting protection of personal information as a core concern.
Big data usage is proxied by keyword frequency in firms' annual reports.
Operationalization described in the paper: frequency/count of big-data-related keywords in annual reports used as the proxy for firms' big data application.
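A keyword-frequency proxy of this kind can be sketched as follows. The keyword list is hypothetical, because the paper's dictionary is not given in the excerpt, and the whole-phrase matching assumes space-delimited English text; annual reports of Chinese A-share firms would need a Chinese keyword list and plain substring or tokenizer-based matching instead.

```python
import re
from collections import Counter

# Hypothetical keyword list; the paper's actual dictionary is not in the excerpt.
KEYWORDS = ["big data", "data mining", "cloud computing"]


def keyword_frequency(report_text, keywords=KEYWORDS):
    """Count case-insensitive, whole-phrase occurrences of each keyword."""
    text = report_text.lower()
    counts = Counter()
    for kw in keywords:
        pattern = r"(?<!\w)" + re.escape(kw.lower()) + r"(?!\w)"
        counts[kw] = len(re.findall(pattern, text))
    return counts


def big_data_proxy(report_text, keywords=KEYWORDS):
    """Total keyword hits in one report, used here as the firm-year proxy."""
    return sum(keyword_frequency(report_text, keywords).values())


sample = "We invested in Big Data platforms. Big data and cloud computing drive our strategy."
print(keyword_frequency(sample))
print(big_data_proxy(sample))  # 3
```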
The empirical analysis uses a fixed-effects regression approach to measure the impact of big data application on firm value.
Methodological statement in the paper specifying fixed-effects regression as the primary econometric approach.
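To make the fixed-effects idea concrete, the sketch below implements the within (demeaning) estimator for a single regressor. It assumes firm-level fixed effects and no controls or year effects, which the excerpt does not confirm, and the data are synthetic. Demeaning each firm's observations removes any time-invariant firm characteristic before the slope is estimated.

```python
from collections import defaultdict


def within_estimator(rows):
    """Fixed-effects slope of y on x.

    rows: iterable of (firm_id, x, y). Each firm's x and y are demeaned by the
    firm's own averages, then a pooled least-squares slope is computed.
    """
    by_firm = defaultdict(list)
    for firm, x, y in rows:
        by_firm[firm].append((x, y))
    sxy = sxx = 0.0
    for obs in by_firm.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0:
        raise ValueError("no within-firm variation in x")
    return sxy / sxx


# Synthetic panel: both firms share slope 2 but have different intercepts.
panel = [
    ("A", 1, 12), ("A", 2, 14), ("A", 3, 16),   # y = 10 + 2x
    ("B", 1, -3), ("B", 2, -1), ("B", 4, 3),    # y = -5 + 2x
]
slope = within_estimator(panel)
print(round(slope, 6))
```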
The study analyzes panel data covering Chinese A-share listed companies from 2007 to 2021.
Description of dataset in the paper: panel of Chinese A-share listed companies spanning the years 2007–2021 (sample period stated).
The analysis extends the dynamic taxation setup of Slavik and Yazici (2014).
Methodological claim: the model and solution approach build on and modify the framework from Slavik and Yazici (2014) (reference to prior theoretical framework rather than empirical data).
We characterize the optimal tax policy in an economy with human manual and cognitive labor, physical capital, and artificial intelligence (AI).
Theoretical/analytical work: the paper develops and analyzes a dynamic general-equilibrium model that includes manual and cognitive human labor, physical capital, and AI. (No empirical sample; model-based characterization.)
Self-concordance did not mediate the AI-over-questionnaire effect on goal progress.
Preplanned mediation model reported in the paper found no evidence that self-concordance mediated the AI vs questionnaire effect on goal progress; reported as non-significant in the preregistered analysis.
Compared with the matched written-reflection questionnaire, the AI did not significantly improve overall goal progress.
Preplanned comparison within the preregistered RCT; reported non-significant difference between AI and written-reflection condition on overall goal progress at two-week follow-up (no significant p-value reported in the summary).
We conducted a preregistered three-arm randomized controlled trial (RCT) comparing an AI career coach ('Leon,' powered by Claude Sonnet), a matched structured written questionnaire, and a no-support control.
Preregistered RCT reported in the paper; three arms as described; total sample size N = 517; participants randomized to AI coach, written-reflection questionnaire, or no-support control; outcomes assessed at two-week follow-up.
All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
Provision of a public GitHub repository URL in the paper.
Evaluation was performed on SWE-bench Verified with two local models: Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances.
Experimental setup reported in the paper specifying benchmark (SWE-bench Verified) and model-instance counts.
Controlled experiments were run with N = 250 across five content types to validate the mechanisms.
Experimental methods reported in the paper: controlled experiments with specified sample size and content-type breakdown.
The dependent variable is the Market Opportunity Index, which combines indicators of innovation activity, the share of firms with new products, and the share of opportunity-oriented entrepreneurs.
Paper provides the construction/definition of the dependent variable (components listed in the excerpt).
The model includes lags of the dependent variable to account for inertia in the development of entrepreneurial opportunities, and the stability of the estimated impact of cognitive tools was tested.
Paper states the model specification included lagged dependent variables and that stability tests for the impact of cognitive tools were performed (no further details on lag length or test statistics in the excerpt).
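Including lags of the dependent variable means each observation is paired with the same unit's earlier values. The sketch below shows one way to build such lagged rows. The lag length, the unit labels, and the index values are assumptions, since the excerpt reports none of them, and the code assumes consecutive periods with no gaps.

```python
def add_lags(panel, n_lags=1):
    """Pair each observation with its unit's previous n_lags values.

    panel: {unit: [(period, value), ...]}
    Returns a list of (unit, period, value, [lag_1, ..., lag_n]); the first
    n_lags periods of each unit are dropped because they have no full history.
    """
    rows = []
    for unit, series in panel.items():
        ordered = sorted(series)  # sort by period
        for i in range(n_lags, len(ordered)):
            period, value = ordered[i]
            lags = [ordered[i - k][1] for k in range(1, n_lags + 1)]
            rows.append((unit, period, value, lags))
    return rows


# Hypothetical index values for two units.
index_panel = {
    "R1": [(2019, 0.40), (2020, 0.45), (2021, 0.50)],
    "R2": [(2020, 0.30), (2021, 0.35)],
}
lagged = add_lags(index_panel, n_lags=1)
for row in lagged:
    print(row)
```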