Evidence (2954 claims)
Adoption
5126 claims
Productivity
4409 claims
Governance
4049 claims
Human-AI Collaboration
2954 claims
Labor Markets
2432 claims
Org Design
2273 claims
Innovation
2215 claims
Skills & Training
1902 claims
Inequality
1286 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
Human Ai Collab
Remove filter
The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning.
Direct description of the benchmark's taxonomy in the paper; the authors specify these three categories as the organizational structure for the 1,400 questions.
FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window.
Statement in the paper describing the benchmark construction and scope; the paper reports the benchmark size (1,400 questions) and the dataset grounding (NASDAQ-100 over ten years).
Despite fears of mass unemployment, aggregate labor-market data through 2025 show limited labor-market disruption from generative AI.
Review of aggregate employment and labor-market studies and macro-level data through 2025 cited in the brief; methods include analyses of employment statistics and macro labor indicators (no single sample size reported).
We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines.
Empirical study design described in the paper: open competition with reported counts of teams and participants (29 teams, 80 participants); comparison between participant submissions and AI-only baselines.
AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking.
Descriptive dataset/benchmark specification in the paper stating task count and industry coverage.
Open research challenges that define the research agenda include scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, and designing human-AI specification interactions.
Authors' explicit enumeration of open problems and a proposed multi-disciplinary research agenda; presented as expert opinion rather than empirical finding.
The interaction between selection and recourse generates a closed-loop dynamical system linking candidate selection and strategic recourse.
Formalization in the paper showing feedback dynamics between selection outcomes and candidate adjustments (modeling/result claim).
This setting produces endogenous selection, in which both the decision rule and the selection threshold are determined by the population's current feature state.
Derived implication of the framework and model dynamics described in the paper (theoretical consequence of the model).
The success benchmark evolves endogenously as many candidates adjust simultaneously.
Analytical property of the proposed model: simultaneous adjustments by candidates change the effective benchmark (theoretical result asserted by authors).
The study proposes a framework that models recourse as a strategic interaction among candidates under a risk-based selection rule.
The paper introduces a formal/modeling framework (methodological contribution described by the authors).
Actionable recourse studies whether individuals can modify feasible features to overturn unfavorable outcomes produced by AI-assisted decision-support systems.
Definition and framing stated by the authors in the paper's introduction/background (conceptual claim).
Self-concordance did not mediate the AI-over-questionnaire effect on goal progress.
Preplanned mediation model reported in the paper found no evidence that self-concordance mediated the AI vs questionnaire effect on goal progress; reported as non-significant in the preregistered analysis.
Compared with the matched written-reflection questionnaire, the AI did not significantly improve overall goal progress.
Preplanned comparison within the preregistered RCT; reported non-significant difference between AI and written-reflection condition on overall goal progress at two-week follow-up (no significant p-value reported in the summary).
We conducted a preregistered three-arm randomized controlled trial (RCT) comparing an AI career coach ('Leon,' powered by Claude Sonnet), a matched structured written questionnaire, and a no-support control.
Preregistered RCT reported in the paper; three arms as described; total sample size N = 517; participants randomized to AI coach, written-reflection questionnaire, or no-support control; outcomes assessed at two-week follow-up.
All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
Provision of a public GitHub repository URL in the paper.
Evaluation was performed on SWE-bench Verified with two local models: Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances.
Experimental setup reported in the paper specifying benchmark (SWE-bench Verified) and model-instance counts.
Controlled experiments were run with N = 250 across five content types to validate the mechanisms.
Experimental methods reported in the paper: controlled experiments with specified sample size and content-type breakdown.
Research agenda: empirical microdata on managerial time use, task-level automation, performance outcomes, and wage impacts are needed to quantify substitution versus complementarity and to evaluate human-in-the-loop designs' effects on firm performance and distributional outcomes.
Explicit methodological recommendation within the paper; identifies gaps due to the paper's conceptual (non-empirical) approach.
There is a need for longitudinal and cross‑country empirical research to measure how hybrid work and AI tools affect promotion rates, network centrality, productivity, privacy harms, trust, and long‑term career trajectories.
Statement of research gaps derived from the paper's methodological approach (conceptual synthesis and secondary case studies) and absence of longitudinal/cross‑cultural primary data.
Practical recommendations for firms and policymakers include investing in training for AI curation/evaluation/coordination, experimenting with decentralised decision rights and governance safeguards, and monitoring competitive dynamics related to model/platform providers.
Policy and practitioner takeaways explicitly presented in the discussion/implications sections, deriving from the conceptual framework and mapped literature.
The paper recommends a research agenda for AI economists: causal microeconometric studies (DiD, IVs, RCTs), structural models with hybrid human–AI agents, measurement work on GenAI use, distributional analysis and policy evaluation.
Explicit recommendations listed in the implications and research agenda sections; logical follow‑on from bibliometric findings about gaps in causal and measurement evidence.
Bibliometric mapping profiles the intellectual structure and evolution of the field but does not establish causal effects of GenAI on organisational outcomes.
Methodological limitation explicitly stated in the paper; bibliometric approach (co‑word, citation, thematic mapping) is descriptive and historical in scope.
Co‑word and thematic analyses reveal six coherent conceptual clusters that bridge technical AI topics (e.g., LLMs, GANs) with managerial themes (e.g., autonomy, coordination, decision‑making).
Thematic mapping and co‑word network analysis performed on the 212‑paper corpus; identification of six clusters reported in results.
Bibliometric and conceptual tools (VOSviewer, Bibliometrix) were used to identify performance trends, co‑word structures, thematic maps, and conceptual evolution in the GenAI–organisation literature.
Methods section: use of VOSviewer for network visualization and Bibliometrix for bibliometric statistics, co‑word analysis, thematic mapping and Sankey thematic evolution.
The study analysed a corpus of 212 Scopus‑indexed publications covering 2018–2025 to map emergent literature on Generative AI and organisational change.
Bibliometric dataset constructed from Scopus; sample size = 212 peer‑reviewed articles; time window 2018–2025; analyses performed with Bibliometrix and VOSviewer.
Outcomes reported are primarily self-reported psychological measures rather than objective productivity metrics.
Paper reports measurement instruments focused on self-reported self-efficacy, psychological ownership, meaningfulness, and enjoyment/satisfaction; no primary objective productivity metrics reported.
The experiment was pre-registered, used occupation-specific writing tasks, and employed a between-subjects design with three conditions (No-AI, Passive AI, Active collaboration).
Study design reported in the paper: pre-registration statement, N = 269, between-subjects assignment to three conditions using occupation-specific writing tasks.
Active, collaborative AI use preserves perceived meaningfulness of work at levels comparable to independent work and does not produce the lasting psychological costs seen with passive use.
Pre-registered experiment (N = 269) with post-manipulation and post-return measures; Active-collaboration condition matched No-AI on meaningfulness and showed no persistent declines after returning to manual tasks.
Active, collaborative AI use preserves psychological ownership of outputs at levels comparable to independent work.
Pre-registered experiment (N = 269); Active-collaboration condition reported ownership levels similar to No-AI condition on self-report scales.
Active, collaborative AI use (human drafts first, then uses AI to refine) preserves self-efficacy at levels comparable to independent (no-AI) work.
Pre-registered experiment (N = 269) comparing Active-collaboration and No-AI conditions; no statistically meaningful differences in self-efficacy between them (self-reported measures).
The authors propose research priorities for economists: quantify productivity gains from closing the actionability gap; estimate firm-level heterogeneity in evaluation capability and its effect on adoption; and model investment trade-offs between building evaluation-to-action pipelines versus accepting reduced LLM performance.
Paper's concluding recommendations for future research directions (explicitly listed by the authors).
The paper produces as primary outcomes a taxonomy of ten evaluation practices, the articulation of the results-actionability gap, and recommended strategies observed among successful teams.
Authors report these as the main outcomes of their thematic analysis and syntheses from the 19 interviews.
The study method consisted of semi-structured qualitative interviews with 19 practitioners across multiple industries and roles, analyzed via thematic coding.
Explicit methods section of the paper stating sample size (n=19), participant diversity, interview approach, and coding/analysis procedure.
The methodology is normative-philosophical argumentation supplemented by interdisciplinary synthesis (phenomenology, deconstruction, OOO, STS/material turn); this is not an empirical causal study and contains no quantitative datasets.
Author-declared methods and limits: statement that the intervention is theory-driven and qualitative; absence of quantitative analysis reported.
The paper’s empirical grounding consists of illustrative case studies and vignettes from healthcare robotics, autonomous vehicles, and algorithmic governance used to demonstrate distributed agency and responsibility.
Author-stated methodology: qualitative vignettes/case illustrations across three domains; no reported sample sizes or systematic data collection.
The experiment used NYSE TAQ transaction and quote data for SPY covering 2015–2024 and tested six pre-specified hypotheses about market-quality trends.
Data and methods section specifying dataset (NYSE TAQ SPY, 2015–2024), the number of pre-specified hypotheses (six), and experimental protocol with 150 autonomous agents.
Agents' methodological choices and resulting effect estimates were systematically recorded and used to quantify dispersion and measure switching across stages.
Study design description: recorded agents' methodological choices (measure selection, estimation procedures), resulting estimates, and tracked switching and dispersion metrics (IQR) across the three-stage protocol applied to SPY TAQ data (2015–2024) with 150 agents.
AI peer review (agents exchanging written critiques) produced minimal reduction in dispersion of estimates.
Three-stage protocol: after stage 1 (independent analyses) and stage 2 (AI peer review), measured dispersion (e.g., IQR) across agents showed little change following the peer-review stage across the six hypotheses and agent pool (n=150).
The work is qualitative and exploratory — presenting naturalistic phenomena rather than causal empirical estimates, and is intended to be hypothesis-generating rather than definitive.
Methodology explicitly stated: naturalistic, qualitative daily observations over one month across multiple platforms; comparative observational documentation without experimental manipulation or causal identification.
CoMAI is a modular, four-agent interview-assessment framework coordinated by a centralized finite-state machine.
System design and implementation described in the paper: a pipeline of four specialized agents (question generation, security/validation, scoring by rubric, summarization/reporting) with a centralized finite-state machine enforcing workflow and information flow constraints.
Field experiments (A/B testing) and willingness-to-pay experiments are necessary to quantify monetary benefits, adoption curves, and optimal pricing for alignment capabilities.
Paper explicitly recommends these empirical approaches in the recommendations for economists and product teams; this is a methodological recommendation rather than an empirical finding.
Recommended evaluation directions include automatic metrics (embedding similarity, task success, turn counts), human evaluation (satisfaction, perceived collaboration), and A/B testing in deployed settings (latency, compute, retention).
Paper's explicit evaluation proposals and recommended metrics listed in the Data & Methods and Evaluation Directions sections; these are prescriptive recommendations rather than executed experiments.
The paper focuses on architecture and conceptual arguments rather than reporting large-scale empirical datasets or results.
Data & Methods section and overall document framing emphasize architecture description and proposed evaluations; explicitly notes absence of large-scale empirical results in the provided summary.
Alignment verification can be implemented using semantic embeddings (cosine similarity) or learned classifiers with threshold-based decision branching.
Paper describes these as recommended implementation approaches for the alignment verification component; no empirical benchmark comparing methods is reported.
Temporal decay in the retrieval component can be modeled with functions such as exponential decay and a tunable half-life parameter applied to dialogue-turn embeddings.
Methodological description in the paper specifying temporal decay modeling options (exponential decay example) and tunable parameters; descriptive claim about intended implementation (no empirical comparison of decay functions provided).
Research agenda items for economists include: quantifying willingness-to-pay for verifiable reasoning, studying labor-market impacts for validators, designing contracts/mechanisms to incentivize truthful argument provision, and evaluating regulatory interventions.
Paper's stated research and policy agenda; prescriptive rather than empirical.
Evaluation currently lacks metrics and benchmarks for argument quality, fidelity, contestability, and human trust; developing these is necessary.
Paper notes the gap and proposes evaluation metrics and experimental designs; no new benchmarks introduced.
Methodology is primarily conceptual and normative: the paper synthesizes policy texts, safety standards, and crisis-management literature and relies on illustrative mappings and thought experiments rather than new empirical field data.
Authors' methodological description in the Data & Methods section (explicit statement about sources and use of thought experiments).
The paper defines and specifies four oversight modes (spanning near-full autonomy to strict human control) and provides criteria for selecting modes based on task complexity, risk level, and consequence severity.
Conceptual taxonomy developed in the paper; mapping exercises and triage framework (risk–complexity–consequence) presented as illustrative mappings (no empirical testing).
Sample sizes reported: human–AI experiment n = 126; human–human benchmark n = 108.
Study's Data & Methods section reporting sample sizes for the human–AI experiment (n = 126) and citing the human–human benchmark (Dvorak & Fehrler 2024, n = 108).