Evidence (2954 claims)
- Adoption: 5126 claims
- Productivity: 4409 claims
- Governance: 4049 claims
- Human-AI Collaboration: 2954 claims
- Labor Markets: 2432 claims
- Org Design: 2273 claims
- Innovation: 2215 claims
- Skills & Training: 1902 claims
- Inequality: 1286 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
Human-AI Collaboration
Experimental design: subjects played an indefinitely repeated Prisoner’s Dilemma in supergames, with two between-subjects treatments varying chat timing (chat only before the first round of each supergame vs. chat before every round); the AI partner was GPT-5.2.
Methods description of the lab experiment reported in the paper: indefinitely repeated PD in supergames, two chat-frequency between-subjects treatments, AI implemented as GPT-5.2; human–AI sample n = 126.
Allowing repeated pre-play communication (chat before every round) has no detectable effect on cooperation rates when the partner is an AI.
Between-subjects manipulation within the human–AI experiment comparing the chat-before-first-round and chat-before-every-round treatments (human–AI n = 126 total); a statistical comparison of cooperation rates across the two chat-frequency treatments showed no detectable difference.
Initial cooperation rates against the AI (GPT-5.2) are high and comparable to initial cooperation in human–human pairs.
Laboratory experiment with human subjects playing an indefinitely repeated Prisoner’s Dilemma against an AI chatbot (GPT-5.2); human–AI sample n = 126; human–human benchmark taken from Dvorak & Fehrler (2024) with n = 108; comparison of initial-round / early-round cooperation rates across conditions.
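For illustration, a minimal sketch of the kind of treatment comparison described above, framed as a two-proportion z-test on cooperation rates, is shown below; the counts are hypothetical placeholders, not figures from the paper.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical cooperation counts (NOT the paper's data): cooperators and subjects per treatment.
coop = np.array([40, 43])   # chat-before-first-round vs. chat-before-every-round
n = np.array([63, 63])      # 126 human-AI subjects split evenly (an assumption)

rates = coop / n                              # cooperation rate per treatment
p_pool = coop.sum() / n.sum()                 # pooled rate under the null of equal rates
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = (rates[0] - rates[1]) / se
p_value = 2 * norm.sf(abs(z))                 # two-sided p-value

print(f"rates = {rates}, z = {z:.2f}, p = {p_value:.3f}")
```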
Evaluation metrics for the architecture should include sample efficiency, generalization across tasks, robustness to distribution shift, autonomy (fraction of learning decisions made internally), transfer speed, lifelong retention, and safety/constraint adherence.
Explicit recommendations for evaluation metrics in the paper.
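Two of those proposed metrics lend themselves to a concrete formulation; the sketch below computes them from hypothetical run logs (field names and log structure are illustrative assumptions, not the paper's API).

```python
def autonomy_fraction(decisions):
    """Fraction of learning decisions made internally rather than externally specified."""
    return sum(1 for d in decisions if d["made_internally"]) / len(decisions)

def lifelong_retention(scores_before, scores_after):
    """Mean fraction of earlier-task performance retained after learning later tasks."""
    return sum(a / b for a, b in zip(scores_after, scores_before)) / len(scores_before)

# Placeholder data, for illustration only
decisions = [{"made_internally": True}, {"made_internally": False}, {"made_internally": True}]
print(autonomy_fraction(decisions))                   # ≈ 0.67
print(lifelong_retention([0.80, 0.90], [0.72, 0.88])) # ≈ 0.94
```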
This paper is a conceptual/theoretical architecture proposal rather than an empirical study; empirical validation is left to the experiments the authors suggest.
Explicit statement in the paper about nature of contribution.
Results are from role-play contexts and short-term interventions; economic estimates of benefit require validation in field settings, across diverse populations, and with different LLMs.
Authors' caveats and limitations stated in the paper noting external validity concerns and the experimental context (role-play, short-term follow-up).
Outcome measures included alignment to the normative taxonomy (coding/automated), recipient-rated perceptions of being heard/validated, and blinded empathy judgments.
Methods section description listing primary and secondary outcomes used in the trial and evaluations.
A data-driven taxonomy was derived that maps common idiomatic empathic moves (e.g., validation, perspective-taking, emotional labeling, offers of support) used in naturalistic support conversations.
Textual analysis of the collected corpus (33,938 messages) produced an operational taxonomy of idiomatic empathic expressions used in the role-play dialogues.
The Lend an Ear platform collected a large conversational corpus: 33,938 messages across 2,904 conversations with 968 participants.
Dataset description reported in the paper specifying counts of participants, conversations, and messages used to build and analyze communication patterns.
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.
Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.
The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.
Reward design section enumerating each component and how they contribute to the composite reward used in RL training.
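A minimal sketch of such a composite reward is given below, assuming each component has already been scored in [0, 1]; the component names mirror the description above, but the weights and scoring functions are hypothetical, not the paper's implementation.

```python
def composite_reward(component_scores, weights=None):
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    w = weights or {
        "structure": 0.20,     # structural validation
        "render": 0.20,        # render quality assessment
        "aesthetic": 0.20,     # LLM-based aesthetic scoring
        "content": 0.30,       # factuality, coverage, coherence
        "inverse_spec": 0.10,  # inverse-specification reward
    }
    return sum(w[k] * component_scores[k] for k in w)

example = {"structure": 1.0, "render": 0.8, "aesthetic": 0.7, "content": 0.9, "inverse_spec": 0.6}
print(composite_reward(example))  # ≈ 0.83
```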
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.
Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.
Code for the environment and experiments is released at the specified GitHub repository.
Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.
Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.
Evaluation was conducted on 48 diverse business briefs across six models.
Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.
Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.
Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.
Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient update covering 0.5% of model parameters.
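As a rough sketch of what a parameter-efficient setup at that scale can look like, the snippet below attaches LoRA adapters via the `peft` library and reports the trainable fraction; LoRA and the specific target modules are assumptions here (the claim states only that about 0.5% of Qwen2.5-Coder-7B's parameters were trained with GRPO), and the GRPO update itself is omitted.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Adapter-based setup (LoRA is an assumed mechanism, not confirmed by the paper).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B")
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(base, lora)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # target is on the order of 0.5%
```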
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.
Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.
Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.
Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; proposed contest metrics and evaluation agenda but no empirical data.
The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
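The sketch below illustrates how the three metric families named above can be computed from per-episode logs; field names and the example numbers are assumptions, not RetailBench's actual schema.

```python
import numpy as np

def operational_stability(profits, catastrophic_flags):
    """Lower profit variance and fewer catastrophic failures indicate greater stability."""
    return {"profit_variance": float(np.var(profits)),
            "catastrophic_failure_rate": float(np.mean(catastrophic_flags))}

def efficiency(profits, units_fulfilled, units_demanded):
    return {"mean_profit": float(np.mean(profits)),
            "fulfillment_rate": float(np.sum(units_fulfilled) / np.sum(units_demanded))}

def degradation(scores_by_difficulty):
    """Relative score drop from the easiest to the hardest environment tier."""
    return 1.0 - scores_by_difficulty[-1] / scores_by_difficulty[0]

print(degradation([0.82, 0.74, 0.61]))  # ≈ 0.26 drop across tiers
```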
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
RetailBench environments are progressively more challenging in order to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
BenchPreS defines two complementary metrics—Misapplication Rate (MR) and Appropriate Application Rate (AAR)—to quantify over‑application and correct personalization, respectively.
Methodological contribution described in the paper: explicit definitions of MR as fraction of inappropriate applications and AAR as fraction of appropriate applications, used to score model behavior.
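Read plainly, the two rates can be computed as below; the denominators (cases where personalization would, or would not, be appropriate) are my assumption about the normalization, since the paper's exact definitions are not quoted here.

```python
def misapplication_rate(records):
    """MR: fraction of inappropriate cases in which personalization was nevertheless applied."""
    inappropriate = [r for r in records if not r["appropriate"]]
    return sum(r["applied"] for r in inappropriate) / len(inappropriate)

def appropriate_application_rate(records):
    """AAR: fraction of appropriate cases in which personalization was indeed applied."""
    appropriate = [r for r in records if r["appropriate"]]
    return sum(r["applied"] for r in appropriate) / len(appropriate)

records = [  # toy examples, not benchmark data
    {"applied": True,  "appropriate": True},
    {"applied": True,  "appropriate": False},   # over-application
    {"applied": False, "appropriate": True},    # missed personalization
    {"applied": False, "appropriate": False},
]
print(misapplication_rate(records), appropriate_application_rate(records))  # 0.5 0.5
```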
Key empirical metrics introduced and used are: AI adoption rate (sector-level intensity), a skill-shift index, hybrid job share, and employment levels/net changes by sector.
Methods description listing the constructed metrics used in the simulated dataset and subsequent analyses (definitions and calculation procedures provided in the paper).
The study's main limitations include reliance on a simulated dataset rather than exhaustive administrative microdata, literature limited to selected publishers/years, and correlational (not causal) identification of some effects.
Authors' explicitly stated limitations in the paper's methods and discussion sections describing data choices (simulated dataset, selected publishers 2020–2024) and the observational/correlational nature of several analyses.
The authors explicitly note limitations: the study focuses on prediction (not causation), results are sensitive to data quality, workforce records may contain biases, and practical constraints like privacy and deployment complexity limit direct operational adoption.
Limitations section described by the authors listing prediction-versus-causation distinction, sensitivity to data quality, potential biases, privacy concerns, and deployment complexity.
The study used a reproducible modeling pipeline (data cleaning, feature engineering, model training and tuning, systematic evaluation) applied to several freely available workforce datasets to enable replication.
Methods section describes a reproducible workflow including preprocessing steps, engineered features, hyperparameter tuning for each model class, cross-validation, and use of publicly available datasets.
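A minimal sketch of such a pipeline, using scikit-learn conventions, is shown below; the features, model class, and search grid are placeholders rather than the paper's actual configuration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature names; the paper's engineered features are not reproduced here.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["tenure_years", "training_hours"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector", "role"]),
])
pipe = Pipeline([("pre", pre), ("clf", GradientBoostingClassifier(random_state=0))])

# Cross-validated hyperparameter search for reproducible tuning
search = GridSearchCV(pipe, {"clf__n_estimators": [100, 300], "clf__max_depth": [2, 3]}, cv=5)
# search.fit(X, y)  # X, y drawn from a publicly available workforce dataset
```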
This work is conceptual/theoretical and reports no original empirical dataset; it explicitly calls for mixed-methods empirical validation (case studies, field experiments, longitudinal studies), measurement development, and multi-level data collection.
Explicit methodological statement in the paper describing its nature as a theoretical synthesis and listing empirical needs; no empirical sample provided.
Four autonomous agents were benchmarked on the same fresh CTF challenge set alongside human teams.
Benchmarking experiment described in the study: four autonomous AI agents evaluated on the identical fresh challenge set used in the live onsite CTF.
Data and methods: the study used an online experiment with 861 online-retail employees performing short-duration, virtual, task-focused collaborations; analyses focused on direct effects, moderation (emotion and partner type), mediation (service empathy), and moderated-mediation.
Methods description in the paper specifying design, sample size (n = 861), task context (temporary virtual teamwork), and analytic approach (hypothesis tests including moderation and mediation analyses).
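For orientation, a mediation/moderation sketch in the spirit of that analytic approach is shown below, run on synthetic placeholder data; the column names, effect sizes, and OLS estimator are assumptions, not the paper's specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 861  # matches the reported sample size; the data themselves are synthetic
df = pd.DataFrame({
    "partner_ai": rng.integers(0, 2, n),   # 0 = human partner, 1 = AI partner
    "emotion": rng.normal(size=n),         # hypothetical moderator
})
df["empathy"] = 0.3 * df["partner_ai"] + rng.normal(size=n)    # mediator (service empathy)
df["proficiency"] = 0.5 * df["empathy"] + rng.normal(size=n)   # outcome

direct    = smf.ols("proficiency ~ partner_ai", data=df).fit()            # direct effect
path_a    = smf.ols("empathy ~ partner_ai", data=df).fit()                # partner -> mediator
path_b    = smf.ols("proficiency ~ partner_ai + empathy", data=df).fit()  # mediator -> outcome
moderated = smf.ols("proficiency ~ partner_ai * emotion", data=df).fit()  # moderation term
print(path_b.params)
```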
Teamwork partner type (human vs AI) has no direct, significant effect on collaboration proficiency for temporary virtual tasks.
Online experiment with employees in the online-retail industry (n = 861). Hypothesis testing showed no significant main effect of partner type on the outcome variable 'collaboration proficiency' in the reported analyses.
Empirical strategy: the main identification strategy uses panel regressions with quadratic AI specification and interaction terms, controlling for firm covariates, employing fixed effects and robustness checks (alternative measures, sub-samples).
Methods section description: panel regressions including AI and AI^2, interactions for moderators, controls, fixed effects, and robustness analyses reported in the paper.
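A generic form of that specification (notation mine; the paper's exact controls and moderators are not reproduced) is:

```latex
y_{it} = \beta_1\,\mathrm{AI}_{it} + \beta_2\,\mathrm{AI}_{it}^{2}
       + \gamma\,(\mathrm{AI}_{it} \times M_{it})
       + X_{it}'\delta + \mu_i + \lambda_t + \varepsilon_{it}
```

where y_{it} is the firm outcome, M_{it} a moderator, X_{it} the firm covariates, and μ_i and λ_t firm and year fixed effects; robustness checks re-estimate the model with alternative measures and sub-samples.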
Data/sample claim: the empirical analysis uses a panel of 2,575 Chinese listed firms observed from 2013 to 2023.
Paper-stated sample description (panel dataset covering 2013–2023, N = 2,575 firms).
The paper recommends an empirical research agenda including field experiments comparing teams with and without AI mediation, structural models of labor supply and wages under reduced language frictions, microdata analysis of adopters, and measurement studies for coordination costs and mediated-action reliability.
Explicit recommendations and research agenda stated in the paper; this is a descriptive claim about the paper's content rather than an empirical finding.
The paper's primary approach is conceptual/theoretical development and agenda-setting; it does not report large-scale empirical or experimental data.
Explicit methods statement in the paper: synthesis, illustrative examples, framework development; absence of reported empirical sample or experiments.
The study's empirical base consists of 40 semi-structured interviews with cross-industry project practitioners in the UK, analyzed using thematic qualitative methods.
Stated data and methods in the paper: sample size (40), interview method, cross-industry sampling, and thematic analysis.