Evidence (7448 claims)

- Adoption: 5267 claims
- Productivity: 4560 claims
- Governance: 4137 claims
- Human-AI Collaboration: 3103 claims
- Labor Markets: 2506 claims
- Innovation: 2354 claims
- Org Design: 2340 claims
- Skills & Training: 1945 claims
- Inequality: 1322 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
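A matrix of this shape can be tabulated from a flat list of per-claim records. The sketch below is illustrative only: the field names (`outcome`, `direction`) and the pandas-based approach are assumptions, not the pipeline actually used to produce the table above.

```python
# Illustrative sketch: tabulating an evidence matrix from per-claim records.
# Field names ("outcome", "direction") and the pandas approach are assumptions,
# not the source's actual data pipeline.
import pandas as pd

claims = pd.DataFrame([
    {"outcome": "Firm Productivity", "direction": "Positive"},
    {"outcome": "Firm Productivity", "direction": "Mixed"},
    {"outcome": "Error Rate", "direction": "Negative"},
    # ... one row per extracted claim
])

matrix = (
    claims.pivot_table(index="outcome", columns="direction",
                       aggfunc="size", fill_value=0)
    .reindex(columns=["Positive", "Negative", "Mixed", "Null"], fill_value=0)
)
matrix["Total"] = matrix.sum(axis=1)
print(matrix.sort_values("Total", ascending=False))
```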
The v1.13 release includes a Go reference implementation of 22 packages covering all L1-L4 capabilities.
Repository statement describing a Go reference implementation comprising 22 packages, together with a coverage claim for L1-L4 capabilities.
The v1.13 specification comprises 36 technical documents organized into five conformance levels (L1-L5).
Explicit quantitative statement in the specification/repository describing document count and organization.
The experiment compared three prompt conditions: (A) simple prompts, (B) raw PPS JSON, and (C) natural-language-rendered PPS.
Method description of the three prompt conditions used in the controlled experiment.
The study used three specific LLMs: DeepSeek-V3, Qwen-Max, and Kimi.
Method section listing the three models evaluated in the experiment.
We ran a controlled three-condition study across 60 tasks in three domains (business, technical, and travel), three large language models (DeepSeek-V3, Qwen-Max, and Kimi), and three prompt conditions, collecting 540 AI-generated outputs evaluated by an LLM judge.
Authors report an experimental study design: 60 tasks × 3 models × 3 prompt conditions = 540 outputs, with outputs evaluated by an LLM judge (methodological description in the paper).
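The 540-output figure follows from fully crossing the three factors; a minimal sketch of the factorial design (task and condition identifiers below are placeholders, not the study's own labels):

```python
# Minimal sketch of the full factorial design: 60 tasks x 3 models x 3 prompt
# conditions = 540 generations, each evaluated once by an LLM judge.
# Task and condition identifiers are placeholders, not the study's labels.
from itertools import product

tasks = [f"task_{i:02d}" for i in range(60)]        # 60 tasks across 3 domains
models = ["DeepSeek-V3", "Qwen-Max", "Kimi"]
conditions = ["A_simple_prompt", "B_raw_PPS_json", "C_nl_rendered_PPS"]

runs = list(product(tasks, models, conditions))
assert len(runs) == 60 * 3 * 3 == 540
```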
The paper presents a formal evolutionary taxonomy of generative AI spanning five eras (1943–present) and analyzes frontier lab dynamics, sovereign AI emergence, and post-training alignment evolution from RLHF through GRPO.
Conceptual taxonomy and historical/organizational analysis provided in the paper. No empirical sample size reported in the excerpt.
The framework extends the Sustainability Index of Han et al. (2025) from hardware-level analysis to ecosystem-level analysis.
Conceptual / methodological extension claimed by the authors referencing Han et al. (2025). No empirical sample size reported in the excerpt.
Classical scaling laws model AI performance as monotonically improving with model size.
Statement about prior literature / modeling assumptions (classical scaling laws). No empirical sample size reported in the excerpt.
Existing financial question answering benchmarks primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals.
Literature/background claim made in the paper motivating the new benchmark; authors contrast prior benchmarks' focus on balance sheet data with the lack of market/trading-signal evaluation.
Retrieval provides limited benefit for trading-signal reasoning.
Experimental comparison reported in the paper showing that retrieval-augmentation had little impact on performance for trading-signal-focused questions.
To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment.
Methodological claim in the paper describing the QA and annotation pipeline; the paper reports using these components as part of their reliability framework.
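A hedged skeleton of how such a calibration-then-scaling pipeline could be composed; each helper below is a placeholder for a stage named in the claim, not the authors' implementation:

```python
# Hypothetical skeleton of a calibration-then-scaling QA pipeline.
# Every helper is a placeholder for a stage named in the claim above;
# none of this is the authors' actual code.
from typing import Callable

def calibrate_then_scale(
    seed_questions: list[str],
    generate_candidates: Callable[[str], list[dict]],   # multi-model response generation
    self_filter: Callable[[list[dict]], list[dict]],     # intra-model self-filtering
    audit_numbers: Callable[[dict], bool],               # numerical auditing
    judge_aligned: Callable[[dict], bool],               # human-LLM judge alignment check
) -> list[dict]:
    accepted = []
    for question in seed_questions:                      # expert seed questions
        candidates = self_filter(generate_candidates(question))
        accepted.extend(
            c for c in candidates if audit_numbers(c) and judge_aligned(c)
        )
    return accepted
```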
The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning.
Direct description of the benchmark's taxonomy in the paper; the authors specify these three categories as the organizational structure for the 1,400 questions.
FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window.
Statement in the paper describing the benchmark construction and scope; the paper reports the benchmark size (1,400 questions) and the dataset grounding (NASDAQ-100 over ten years).
The paper derives formal conditions under which the inversion (smaller, orchestrated models outperforming frontier models) holds.
Mathematical derivations and stated sufficient/necessary conditions presented in the paper.
We develop the Institutional Fitness Manifold, a mathematical framework that evaluates AI systems along four dimensions: capability, institutional trust, affordability, and sovereign compliance.
Theoretical/model development presented in the paper (formal definition of the manifold and its four dimensions).
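The excerpt names the four axes but not how they are combined; the sketch below is purely illustrative (the equal-weight linear scalarization is an assumption, not the paper's Institutional Fitness Manifold definition):

```python
# Illustrative only: a point in the four-dimensional space named in the claim
# (capability, institutional trust, affordability, sovereign compliance).
# The weighted-sum scalarization is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class SystemProfile:
    capability: float             # each dimension normalized to [0, 1]
    institutional_trust: float
    affordability: float
    sovereign_compliance: float

def fitness(p: SystemProfile, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    dims = (p.capability, p.institutional_trust,
            p.affordability, p.sovereign_compliance)
    return sum(w * d for w, d in zip(weights, dims))

print(fitness(SystemProfile(0.9, 0.6, 0.4, 0.8)))
```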
There have been five eras of AI development since 1943, and within the current Generative AI Era there are four distinct epochs, each initiated by a discontinuous event.
Descriptive/historical classification within the paper (counts of eras and epochs; named initiating events such as the transformer and the 'DeepSeek Moment').
The study uses panel data for 30 Chinese provinces from 2013–2022 to measure urban circular economy efficiency (UCEE) with a Super-SBM model including undesirable outputs, track dynamics via the Global Malmquist–Luenberger index, and estimate spatial effects with a spatial Durbin model.
Methodological description in the abstract: explicit statement of data (30 provinces, 2013–2022) and the three methods used (Super-SBM with undesirable outputs, GML index, spatial Durbin model).
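For reference, the general form of a panel spatial Durbin model consistent with this description (the paper's exact covariates and spatial weights matrix are not given in the excerpt) is

$$
y_{it} = \rho \sum_{j} w_{ij}\, y_{jt} + \mathbf{x}_{it}'\boldsymbol{\beta} + \sum_{j} w_{ij}\, \mathbf{x}_{jt}'\boldsymbol{\theta} + \mu_i + \lambda_t + \varepsilon_{it},
$$

where $w_{ij}$ are spatial weights, $\rho$ is the spatial autoregressive coefficient, and $\boldsymbol{\theta}$ captures spillovers from neighboring provinces' covariates.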
Despite fears of mass unemployment, aggregate labor-market data through 2025 show limited labor-market disruption from generative AI.
Review of aggregate employment and labor-market studies and macro-level data through 2025 cited in the brief; methods include analyses of employment statistics and macro labor indicators (no single sample size reported).
We scored rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112 transcript segments from multi-agent governance simulations.
Reported methodology: multi-agent governance simulations with agents in formal governmental roles, outcomes evaluated by an independent rubric-based judge; explicit sample count of 28,112 transcript segments.
We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines.
Empirical study design described in the paper: open competition with reported counts of teams and participants (29 teams, 80 participants); comparison between participant submissions and AI-only baselines.
AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking.
Descriptive dataset/benchmark specification in the paper stating task count and industry coverage.
Open research challenges that define the research agenda include scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, and designing human-AI specification interactions.
Authors' explicit enumeration of open problems and a proposed multi-disciplinary research agenda; presented as expert opinion rather than empirical finding.
The interaction between selection and recourse generates a closed-loop dynamical system linking candidate selection and strategic recourse.
Formalization in the paper showing feedback dynamics between selection outcomes and candidate adjustments (modeling/result claim).
This setting produces endogenous selection, in which both the decision rule and the selection threshold are determined by the population's current feature state.
Derived implication of the framework and model dynamics described in the paper (theoretical consequence of the model).
The success benchmark evolves endogenously as many candidates adjust simultaneously.
Analytical property of the proposed model: simultaneous adjustments by candidates change the effective benchmark (theoretical result asserted by authors).
The study proposes a framework that models recourse as a strategic interaction among candidates under a risk-based selection rule.
The paper introduces a formal/modeling framework (methodological contribution described by the authors).
Actionable recourse studies whether individuals can modify feasible features to overturn unfavorable outcomes produced by AI-assisted decision-support systems.
Definition and framing stated by the authors in the paper's introduction/background (conceptual claim).
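A toy simulation of the kind of closed loop described above (one-dimensional feature, quantile-based selection, partial best-response adjustment); all functional forms and parameters are illustrative assumptions, not the paper's model:

```python
# Toy illustration of a selection-recourse feedback loop: rejected candidates
# move toward the current acceptance threshold, and the threshold is recomputed
# from the shifted population each round, so the benchmark evolves endogenously.
# Functional forms and parameters are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(0.0, 1.0, size=1_000)   # candidates' current feature/score
acceptance_rate = 0.2                        # selector admits the top 20%
effort = 0.3                                 # fraction of the gap closed per round

for round_ in range(5):
    threshold = np.quantile(scores, 1 - acceptance_rate)        # endogenous benchmark
    rejected = scores < threshold
    scores[rejected] += effort * (threshold - scores[rejected])  # strategic recourse
    print(f"round {round_}: threshold = {threshold:.3f}")
```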
A graph is constructed from the agents' reasoning embeddings and processed with a graph neural network (GNN); trading decisions are made using a PPO-DSR policy.
Method description: the paper reports embedding agents' reasoning, building a graph neural network (GNN) from those embeddings, and using a PPO-DSR reinforcement learning policy to trade. Specific GNN/PPO-DSR hyperparameters and architecture are not provided in the excerpt.
Four LLM agents output scores along with reasoning.
Method description: the paper states that four LLM agents produce numeric scores and associated textual reasoning. The number of agents is explicitly given as four; no further architecture or model-family details included in the excerpt.
BlindTrade blindfolds its agents by anonymizing all identifiers, including tickers and company names.
Methodological description in the paper: the system design explicitly replaces tickers and company names with anonymized identifiers. Implementation details and examples not provided in the excerpt.
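A toy sketch of the data flow described across the preceding claims: per-agent reasoning embeddings, a graph built over them, and a pooled representation passed to a decision head. The cosine-similarity edges, single message-passing step, and linear head are assumptions for illustration; the paper's GNN architecture and PPO-DSR training loop are not specified in the excerpt.

```python
# Toy sketch: four agents' reasoning embeddings -> graph -> pooled representation
# -> decision score. Cosine-similarity edges, one message-passing step, and the
# linear head are illustrative assumptions, not the paper's architecture.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 16))                 # one reasoning embedding per agent

unit = X / np.linalg.norm(X, axis=1, keepdims=True)
A = unit @ unit.T                            # cosine-similarity adjacency
A_hat = A / A.sum(axis=1, keepdims=True)     # row-normalize

W = rng.normal(size=(16, 8))
H = np.tanh(A_hat @ X @ W)                   # one graph message-passing layer
graph_repr = H.mean(axis=0)                  # mean-pool node states

w_out = rng.normal(size=8)
decision_score = float(graph_repr @ w_out)   # a policy head trained with PPO-DSR
print(decision_score)                        # would map this to buy/hold/sell
```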
Data ethics, as a central pillar of digital ethics, emphasizes the responsible use and protection of personal information.
Conceptual/definitional statement in the paper situating data ethics within digital ethics and highlighting protection of personal information as a core concern.
Big data usage is proxied by keyword frequency in firms' annual reports.
Operationalization described in the paper: frequency/count of big-data-related keywords in annual reports used as the proxy for firms' big data application.
The empirical analysis uses a fixed-effects regression approach to measure the impact of big data application on firm value.
Methodological statement in the paper specifying fixed-effects regression as the primary econometric approach.
The study analyzes panel data covering Chinese A-share listed companies from 2007 to 2021.
Description of dataset in the paper: panel of Chinese A-share listed companies spanning the years 2007–2021 (sample period stated).
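A generic two-way fixed-effects specification consistent with this description (variable names and controls are illustrative; the paper's exact specification is not given in the excerpt):

$$
\text{FirmValue}_{it} = \alpha_i + \lambda_t + \beta\,\text{BigData}_{it} + \boldsymbol{\gamma}'\mathbf{X}_{it} + \varepsilon_{it},
$$

where $\text{BigData}_{it}$ is the annual-report keyword-frequency proxy, $\alpha_i$ and $\lambda_t$ are firm and year fixed effects, and $\mathbf{X}_{it}$ collects firm-level controls.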
The analysis extends the dynamic taxation setup of Slavik and Yazici (2014).
Methodological claim: the model and solution approach build on and modify the framework from Slavik and Yazici (2014) (reference to prior theoretical framework rather than empirical data).
We characterize the optimal tax policy in an economy with human manual and cognitive labor, physical capital, and artificial intelligence (AI).
Theoretical/analytical work: the paper develops and analyzes a dynamic general-equilibrium model that includes manual and cognitive human labor, physical capital, and AI. (No empirical sample; model-based characterization.)
Self-concordance did not mediate the AI-over-questionnaire effect on goal progress.
A preplanned mediation model reported in the paper found no evidence that self-concordance mediated the AI vs. questionnaire effect on goal progress; the effect was non-significant in the preregistered analysis.
Compared with the matched written-reflection questionnaire, the AI did not significantly improve overall goal progress.
Preplanned comparison within the preregistered RCT; reported non-significant difference between AI and written-reflection condition on overall goal progress at two-week follow-up (no significant p-value reported in the summary).
We conducted a preregistered three-arm randomized controlled trial (RCT) comparing an AI career coach ('Leon,' powered by Claude Sonnet), a matched structured written questionnaire, and a no-support control.
Preregistered RCT reported in the paper; three arms as described; total sample size N = 517; participants randomized to AI coach, written-reflection questionnaire, or no-support control; outcomes assessed at two-week follow-up.
All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
Provision of a public GitHub repository URL in the paper.
Evaluation was performed on SWE-bench Verified with two local models: Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances.
Experimental setup reported in the paper specifying benchmark (SWE-bench Verified) and model-instance counts.
Controlled experiments were run with N = 250 across five content types to validate the mechanisms.
Experimental methods reported in the paper: controlled experiments with specified sample size and content-type breakdown.
The dependent variable is the Market Opportunity Index, which is a combination of indicators of innovation activity, the share of firms with new products, and the share of opportunity-oriented entrepreneurs.
Paper provides the construction/definition of the dependent variable (components listed in the excerpt).
The model included lags of the dependent variable to account for inertia in the development of entrepreneurial opportunities, and the stability of the impact of cognitive tools was tested.
Paper states the model specification included lagged dependent variables and that stability tests for the impact of cognitive tools were performed (no further details on lag length or test statistics in the excerpt).
The methodological foundation of the study was panel econometric modelling, which made it possible to account for cross-country differences observed over time and for the dynamics of domestic indicators.
Description of methods in the paper: use of panel econometric modelling on an international panel over the 2020–2024 period (sample size not specified in the excerpt).
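A generic dynamic panel specification matching that description (the index name is abbreviated; lag length and covariates are illustrative assumptions, not the paper's exact model):

$$
\text{MOI}_{it} = \alpha + \rho\,\text{MOI}_{i,t-1} + \boldsymbol{\beta}'\mathbf{Z}_{it} + \mu_i + \varepsilon_{it},
$$

where the lagged term captures inertia in the development of entrepreneurial opportunities, $\mathbf{Z}_{it}$ includes the cognitive-tool indicators, and $\mu_i$ are country effects.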
The field study used a 44-item questionnaire with 45 participants to measure comprehension, reported behavior change/adoption, and perceptions of volunteer legitimacy.
Methodological description provided in the paper: instrument and sample sizes explicitly reported.
Research agenda: empirical microdata on managerial time use, task-level automation, performance outcomes, and wage impacts are needed to quantify substitution versus complementarity and to evaluate human-in-the-loop designs' effects on firm performance and distributional outcomes.
Explicit methodological recommendation within the paper; identifies gaps due to the paper's conceptual (non-empirical) approach.
No original quantitative dataset or controlled evaluation is reported in this paper.
Methodological description in the paper stating reliance on prior literature, conceptual analysis, and prescriptive recommendations; paper does not present new experiments.
The paper is a position/normative paper (not an empirical study) that uses conceptual analysis, literature synthesis, and prescriptive roadmapping rather than new quantitative experiments or datasets.
Explicit methodological statement in the paper summarizing genre and methods used; absence of reported original data or controlled evaluations.
There is a need for longitudinal and cross‑country empirical research to measure how hybrid work and AI tools affect promotion rates, network centrality, productivity, privacy harms, trust, and long‑term career trajectories.
Statement of research gaps derived from the paper's methodological approach (conceptual synthesis and secondary case studies) and absence of longitudinal/cross‑cultural primary data.