The Commonplace

Evidence (2954 claims)

Adoption: 5126 claims
Productivity: 4409 claims
Governance: 4049 claims
Human-AI Collaboration: 2954 claims
Labor Markets: 2432 claims
Org Design: 2273 claims
Innovation: 2215 claims
Skills & Training: 1902 claims
Inequality: 1286 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

| Outcome | Positive | Negative | Mixed | Null | Total |
| --- | --- | --- | --- | --- | --- |
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | – | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | – | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | – | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | 3 | – | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | – | 23 |
| Labor Share of Income | 7 | 4 | 9 | – | 20 |
Active filter: Human-AI Collaboration
Experimental design: subjects played an indefinitely repeated Prisoner’s Dilemma in supergames, with two between-subjects treatments varying chat timing (chat only before the first round of each supergame vs. chat before every round); the AI partner was GPT-5.2 (see the illustrative sketch after this entry).
Methods description of the lab experiment reported in the paper: indefinitely repeated PD in supergames, two chat-frequency between-subjects treatments, AI implemented as GPT-5.2; human–AI sample n = 126.
[high · null result] Playing Against the Machine: Cooperation, Communication, and... | Outcome: experimental treatment specification (chat-frequency manipulation; AI identity)
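As a reading aid, here is a minimal toy simulation of an indefinitely repeated Prisoner’s Dilemma supergame of the kind described above. The continuation probability, the payoff values, and the stand-in strategies are all assumptions for illustration; the study's actual parameters, and its GPT-5.2 chat partner, are not reproduced here.

```python
import random

# Toy supergame: after each round, play continues with probability DELTA.
# DELTA and the payoffs are illustrative assumptions, not the paper's values.
DELTA = 0.75
PAYOFFS = {  # (own move, partner move) -> own payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def play_supergame(human_strategy, ai_strategy, rng=None):
    """Play one supergame; each strategy maps the partner's previous move to a move."""
    rng = rng or random.Random(0)
    history, prev_human, prev_ai = [], None, None
    while True:
        h, a = human_strategy(prev_ai), ai_strategy(prev_human)
        history.append((h, a, PAYOFFS[(h, a)]))
        prev_human, prev_ai = h, a
        if rng.random() > DELTA:  # the supergame ends with probability 1 - DELTA
            return history

# Stand-in rule-based strategies (the real AI partner was a GPT-5.2 chatbot).
tit_for_tat = lambda prev: "C" if prev in (None, "C") else "D"
always_cooperate = lambda prev: "C"

rounds = play_supergame(always_cooperate, tit_for_tat)
coop_rate = sum(h == "C" for h, _, _ in rounds) / len(rounds)
print(f"{len(rounds)} rounds, human cooperation rate = {coop_rate:.2f}")
```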
Allowing repeated pre-play communication (chat before every round) has no detectable effect on cooperation rates when the partner is an AI.
Between-subjects manipulation within the human–AI experiment comparing chat-before-first-round vs chat-before-every-round treatments (human–AI n = 126 total); statistical comparison of cooperation rates across the two chat-frequency treatments showed no detectable difference.
[high · null result] Playing Against the Machine: Cooperation, Communication, and... | Outcome: effect of chat frequency on cooperation rate (difference in cooperation between ...
Initial cooperation rates against the AI (GPT-5.2) are high and comparable to initial cooperation in human–human pairs.
Laboratory experiment with human subjects playing an indefinitely repeated Prisoner’s Dilemma against an AI chatbot (GPT-5.2); human–AI sample n = 126; human–human benchmark taken from Dvorak & Fehrler (2024) with n = 108; comparison of initial-round / early-round cooperation rates across conditions.
[high · null result] Playing Against the Machine: Cooperation, Communication, and... | Outcome: initial cooperation rate (cooperation in early rounds / first round of supergame...
Evaluation metrics for the architecture should include sample efficiency, generalization across tasks, robustness to distribution shift, autonomy (fraction of learning decisions made internally), transfer speed, lifelong retention, and safety/constraint adherence.
Explicit recommendations for evaluation metrics in the paper.
[high · null result] Why AI systems don't learn and what to do about it: Lessons ... | Outcome: listed evaluation metrics (sample efficiency; generalization; robustness; autono...
This paper is a conceptual/theoretical architecture proposal rather than an empirical study; empirical validation should follow via suggested experiments.
Explicit statement in the paper about nature of contribution.
[high · null result] Why AI systems don't learn and what to do about it: Lessons ... | Outcome: N/A (no empirical outcomes reported)
Results are from role-play contexts and short-term interventions; economic estimates of benefit require validation in field settings, across diverse populations, and with different LLM models.
Authors' caveats and limitations stated in the paper noting external validity concerns and the experimental context (role-play, short-term follow-up).
[high · null result] Practicing with Language Models Cultivates Human Empathic Co... | Outcome: generalizability/external validity (not directly measured)
Outcome measures included alignment to the normative taxonomy (coding/automated), recipient-rated perceptions of being heard/validated, and blinded empathy judgments.
Methods section description listing primary and secondary outcomes used in the trial and evaluations.
[high · null result] Practicing with Language Models Cultivates Human Empathic Co... | Outcome: alignment metrics, recipient-rated perceptions, blinded empathy judgments
A data-driven taxonomy was derived mapping common idiomatic empathic moves (e.g., validation, perspective-taking, emotional labeling, offers of support) used in naturalistic support conversations.
Textual analysis of the collected corpus (33,938 messages) produced an operational taxonomy of idiomatic empathic expressions used in the role-play dialogues.
[high · null result] Practicing with Language Models Cultivates Human Empathic Co... | Outcome: taxonomy of empathic communication moves (categorical coding scheme)
The Lend an Ear platform collected a large conversational corpus: 33,938 messages across 2,904 conversations with 968 participants.
Dataset description reported in the paper specifying counts of participants, conversations, and messages used to build and analyze communication patterns.
[high · null result] Practicing with Language Models Cultivates Human Empathic Co... | Outcome: corpus size (number of messages, conversations, participants)
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
[high · null result] Why the Valuable Capabilities of LLMs Are Precisely the Unex... | Outcome: proposed empirical research topics and corresponding outcomes to measure
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
[high · null result] Why the Valuable Capabilities of LLMs Are Precisely the Unex... | Outcome: proposed metrics for assessing tacit LLM capabilities
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
[high · null result] Why the Valuable Capabilities of LLMs Are Precisely the Unex... | Outcome: types of empirical tests recommended for validating the thesis
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
[high · null result] Why the Valuable Capabilities of LLMs Are Precisely the Unex... | Outcome: extent of empirical/quantitative evidence presented
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
[high · null result] Why the Valuable Capabilities of LLMs Are Precisely the Unex... | Outcome: presence/absence of empirical experiments in the paper
Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.
Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.
[high · null result] Learning to Present: Inverse Specification Rewards for Agent... | Outcome: Human-quality proxy scores and comparative model rankings
The reward function is a composite signal combining structural validation, render-quality assessment, LLM-based aesthetic scoring, content-quality metrics (factuality, coverage, coherence), and an inverse-specification reward (see the sketch after this entry).
Reward design section enumerating each component and how they contribute to the composite reward used in RL training.
[high · null result] Learning to Present: Inverse Specification Rewards for Agent... | Outcome: Components of the reward signal used for RL training
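A minimal sketch of a composite reward of this kind follows. The component names, the equal weights, and the normalized score ranges are assumptions for illustration; the paper's exact formulation, including how the inverse-specification reward is computed, is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class RewardComponents:
    structural_validity: float   # e.g., slide schema/structure checks, in [0, 1]
    render_quality: float        # e.g., renders without errors or overflow
    aesthetics: float            # e.g., normalized LLM-judge score
    content_quality: float       # factuality + coverage + coherence, combined
    inverse_spec: float          # agreement between the output and a spec
                                 # reconstructed back from that output

WEIGHTS = {  # assumed equal weights; would be tuned in practice
    "structural_validity": 0.2, "render_quality": 0.2,
    "aesthetics": 0.2, "content_quality": 0.2, "inverse_spec": 0.2,
}

def composite_reward(c: RewardComponents) -> float:
    """Weighted sum of normalized component scores used as the RL signal."""
    return sum(WEIGHTS[k] * getattr(c, k) for k in WEIGHTS)

print(composite_reward(RewardComponents(1.0, 0.9, 0.7, 0.8, 0.6)))  # -> 0.8
```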
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.
Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.
[high · null result] Learning to Present: Inverse Specification Rewards for Agent... | Outcome: Environment capabilities: OpenEnv compatibility and tool-use support
Code for the environment and experiments is released at the specified GitHub repository.
Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.
[high · null result] Learning to Present: Inverse Specification Rewards for Agent... | Outcome: Availability of experiment code (GitHub repo)
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.
Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.
[high · null result] Learning to Present: Inverse Specification Rewards for Agent... | Outcome: Number of rollout trajectories in dataset (288) and coverage across models (6)
Evaluation was conducted on 48 diverse business briefs across six models.
Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.
[high · null result] Learning to Present: Inverse Specification Rewards for Agent... | Outcome: Number of evaluation tasks (48 briefs) and number of models compared (6)
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.
Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.
[high · null result] Learning to Present: Inverse Specification Rewards for Agent... | Outcome: Source of demonstration prompts (Claude Opus 4.6)
Fine-tuning was parameter-efficient: only 0.5% of the Qwen2.5-Coder-7B parameters were trained, using GRPO (see the sketch after this entry).
Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient updates covering 0.5% of model parameters.
[high · null result] Learning to Present: Inverse Specification Rewards for Agent... | Outcome: Proportion of model parameters updated during training (0.5%)
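A minimal sketch, under stated assumptions, of how such a parameter-efficiency figure can be checked: freeze the base model, mark a small subset of parameters trainable, and report the trainable fraction. The toy model and the choice of which modules to unfreeze are hypothetical; the paper's actual adapter scheme and its GRPO training loop are not shown.

```python
import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    """Share of parameters that will receive gradient updates."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

def freeze_except(model: nn.Module, trainable_names: tuple[str, ...]) -> None:
    """Freeze everything whose parameter name matches none of the substrings."""
    for name, p in model.named_parameters():
        p.requires_grad = any(t in name for t in trainable_names)

model = nn.Sequential(  # toy stand-in for a 7B transformer
    nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512), nn.Linear(512, 8),
)
freeze_except(model, ("3.",))  # train only the tiny final head
print(f"trainable fraction: {trainable_fraction(model):.4%}")
```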
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.
Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.
[high · null result] When AI Navigates the Fog of War | Outcome: evaluation components (verifiability checks, qualitative coding, longitudinal an...
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.
Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.
[high · null result] When AI Navigates the Fog of War | Outcome: number of temporal nodes
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.
Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.
[high · null result] When AI Navigates the Fog of War | Outcome: number and type of questions/prompts used
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
[high · null result] Designing for Disagreement: Front-End Guardrails for Assista... | Outcome: the specified metrics (contest frequency/outcomes, time-to-help, satisfaction, p...
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
[high · null result] Designing for Disagreement: Front-End Guardrails for Assista... | Outcome: presence/absence of empirical data in the paper (binary)
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; proposed contest metrics and evaluation agenda but no empirical data.
[high · null result] Designing for Disagreement: Front-End Guardrails for Assista... | Outcome: availability and specificity of contest channels (system functionality)
The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
[high · null result] Designing for Disagreement: Front-End Guardrails for Assista... | Outcome: legibility of active mode (user understanding at time of deferral)
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
[high · null result] Designing for Disagreement: Front-End Guardrails for Assista... | Outcome: existence of governance-approved admissible modes (system design property)
Metrics used to evaluate agents include operational stability (e.g., variance of outcomes or frequency of catastrophic failures), efficiency (e.g., cost, profit, fulfillment), and degradation across increasing task complexity (see the sketch after this entry).
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
[high · null result] RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... | Outcome: operational stability, efficiency, and robustness/degradation metrics
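A hedged sketch of these three metric families, computed from per-episode results. The field names, the catastrophic-failure criterion, and the difficulty-level encoding are assumptions; RetailBench's exact definitions may differ.

```python
import statistics

def stability(profits: list[float], ruin_threshold: float = 0.0) -> dict:
    """Operational stability: outcome variance and catastrophic-failure rate."""
    return {
        "profit_variance": statistics.pvariance(profits),
        "catastrophic_rate": sum(p <= ruin_threshold for p in profits) / len(profits),
    }

def efficiency(profits: list[float], fulfillment: list[float]) -> dict:
    """Efficiency: average profit and order-fulfillment performance."""
    return {"mean_profit": statistics.fmean(profits),
            "mean_fulfillment": statistics.fmean(fulfillment)}

def degradation(scores_by_level: dict[int, float]) -> float:
    """Average score drop per step up the environment-difficulty gradient."""
    levels = sorted(scores_by_level)
    drops = [scores_by_level[a] - scores_by_level[b]
             for a, b in zip(levels, levels[1:])]
    return statistics.fmean(drops)

print(degradation({1: 0.9, 2: 0.7, 3: 0.4}))  # -> 0.25 per difficulty step
```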
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
[high · null result] RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... | Outcome: baseline agent architectures used for comparison
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
[high · null result] RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... | Outcome: number of LLMs evaluated (n = 8)
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated on a slower temporal scale) from low-level execution (short-term action selection); see the sketch after this entry.
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
[high · null result] RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... | Outcome: agent architectural modularity (temporal decomposition into strategy vs executio...
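The following is a minimal sketch of such a two-timescale decomposition, assuming a simple inventory-style environment. The cadence constant, the toy strategy and executor rules, and the observation fields are all illustrative assumptions, not the paper's implementation.

```python
STRATEGY_PERIOD = 50  # slow cadence for strategy revision (assumed)

def update_strategy(history: list) -> dict:
    """Slow tier: summarize recent outcomes into high-level guidance."""
    recent = history[-STRATEGY_PERIOD:]
    low_stock = sum(obs["stock"] < 10 for obs, _ in recent)
    return {"reorder_point": 20 if low_stock > len(recent) // 2 else 10}

def execute(obs: dict, strategy: dict) -> str:
    """Fast tier: short-horizon action selection under the current strategy."""
    return "reorder" if obs["stock"] < strategy["reorder_point"] else "hold"

def run(env_step, horizon: int = 200) -> list:
    history, strategy = [], {"reorder_point": 10}
    for t in range(horizon):
        obs = env_step(t)
        if t > 0 and t % STRATEGY_PERIOD == 0:
            strategy = update_strategy(history)   # slow, periodic update
        action = execute(obs, strategy)           # fast, every-step update
        history.append((obs, action))
    return history

# Toy non-stationary demand pattern standing in for a RetailBench environment.
run(lambda t: {"stock": max(0, 30 - t % 40)})
```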
RetailBench environments are progressively challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
[high · null result] RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... | Outcome: environment difficulty gradient (complexity/stochasticity/non-stationarity level...
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
[high · null result] RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... | Outcome: benchmark realism and coverage of non-stationarity for long-horizon decision-mak...
BenchPreS defines two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), which quantify over-application and correct personalization, respectively (see the sketch after this entry).
Methodological contribution described in the paper: explicit definitions of MR as the fraction of inappropriate applications and AAR as the fraction of appropriate applications, used to score model behavior.
[high · null result] BenchPreS: A Benchmark for Context-Aware Personalized Prefer... | Outcome: Definition and use of MR and AAR metrics
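Given the definitions above (MR as the fraction of inappropriate contexts where a preference was applied, AAR as the fraction of appropriate contexts where it was), a minimal computation sketch follows. The per-case record format is an assumption; the benchmark's scoring code may differ.

```python
def mr_aar(cases: list[dict]) -> tuple[float, float]:
    """Compute Misapplication Rate and Appropriate Application Rate."""
    inappropriate = [c for c in cases if not c["appropriate"]]
    appropriate = [c for c in cases if c["appropriate"]]
    # MR: share of inappropriate contexts where the stored preference was
    # nevertheless applied (over-application).
    mr = sum(c["applied"] for c in inappropriate) / len(inappropriate)
    # AAR: share of appropriate contexts where the preference was correctly
    # applied (correct personalization).
    aar = sum(c["applied"] for c in appropriate) / len(appropriate)
    return mr, aar

cases = [
    {"applied": True,  "appropriate": True},   # correct personalization
    {"applied": True,  "appropriate": False},  # over-application
    {"applied": False, "appropriate": True},   # missed personalization
    {"applied": False, "appropriate": False},  # correct restraint
]
print(mr_aar(cases))  # -> (0.5, 0.5)
```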
Key empirical metrics introduced and used are: AI adoption rates (sector-level intensity), Skill shift index, Hybrid job share, and employment levels/net changes by sector.
Methods description listing the constructed metrics used in the simulated dataset and subsequent analyses (definitions and calculation procedures provided in the paper).
[high · null result] AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... | Outcome: Defined metrics (AI adoption rate, Skill shift index, Hybrid job share, Employme...
The study's main limitations include reliance on a simulated dataset rather than exhaustive administrative microdata, literature limited to selected publishers/years, and correlational (not causal) identification of some effects.
Authors' explicitly stated limitations in the paper's methods and discussion sections describing data choices (simulated dataset, selected publishers 2020–2024) and the observational/correlational nature of several analyses.
[high · null result] AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... | Outcome: Study validity/generalizability limitations
The authors explicitly note limitations: the study focuses on prediction (not causation), results are sensitive to data quality, workforce records may contain biases, and practical constraints like privacy and deployment complexity limit direct operational adoption.
Limitations section described by the authors listing prediction-versus-causation distinction, sensitivity to data quality, potential biases, privacy concerns, and deployment complexity.
[high · null result] Adoption of AI-Based HR Analytics and Its Impact on Firm Pro... | Outcome: Scope and limitations of study conclusions (qualitative)
The study used a reproducible modeling pipeline (data cleaning, feature engineering, model training and tuning, systematic evaluation) applied to several freely available workforce datasets to enable replication; a generic sketch of such a pipeline follows this entry.
Methods section describes a reproducible workflow including preprocessing steps, engineered features, hyperparameter tuning for each model class, cross-validation, and use of publicly available datasets.
[high · null result] Adoption of AI-Based HR Analytics and Its Impact on Firm Pro... | Outcome: Reproducibility of predictive modeling workflow (procedural, not an empirical pe...
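A generic scikit-learn sketch of this kind of reproducible workflow, under stated assumptions: the column names, the model class, and the hyperparameter grid are placeholders, not the paper's choices.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Preprocessing: scale numeric features, one-hot encode categoricals.
# Column names are hypothetical stand-ins for a public workforce dataset.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure_years", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["department"]),
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(random_state=42)),  # fixed seed aids replication
])

# Systematic tuning and evaluation via cross-validated grid search.
search = GridSearchCV(
    pipeline,
    param_grid={"model__n_estimators": [100, 300],
                "model__max_depth": [None, 10]},
    cv=5, scoring="roc_auc",
)
# search.fit(X, y)  # X, y: features and labels from a public workforce dataset
```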
This work is conceptual/theoretical and reports no original empirical dataset; it explicitly calls for mixed-methods empirical validation (case studies, field experiments, longitudinal studies), measurement development, and multi-level data collection.
Explicit methodological statement in the paper describing its nature as a theoretical synthesis and listing empirical needs; no empirical sample provided.
[high · null result] Revolutionizing Human Resource Development: A Theoretical Fr... | Outcome: presence/absence of original empirical data in the paper (none)
Four autonomous agents were benchmarked on the same fresh CTF challenge set alongside human teams.
Benchmarking experiment described in the study: four autonomous AI agents evaluated on the identical fresh challenge set used in the live onsite CTF.
[high · null result] Understanding Human-AI Collaboration in Cybersecurity Compet... | Outcome: agent performance metrics on the fresh CTF challenge set (success rates, traject...
Data and methods: the study used an online experiment with 861 online-retail employees performing short-duration, virtual, task-focused collaborations; analyses focused on direct effects, moderation (emotion and partner type), mediation (service empathy), and moderated-mediation.
Methods description in the paper specifying design, sample size (n = 861), task context (temporary virtual teamwork), and analytic approach (hypothesis tests including moderation and mediation analyses).
[high · null result] Adoption of AI partners in temporary tasks: exploring the ef... | Outcome: N/A (methodological claim about study design and analyses)
Teamwork partner type (human vs AI) has no direct, significant effect on collaboration proficiency for temporary virtual tasks.
Online experiment with employees in the online-retail industry (n = 861). Hypothesis testing showed no significant main effect of partner type on the outcome variable 'collaboration proficiency' in the reported analyses.
[high · null result] Adoption of AI partners in temporary tasks: exploring the ef... | Outcome: collaboration proficiency
Empirical strategy: the main identification approach uses panel regressions with a quadratic AI specification and interaction terms for moderators, controlling for firm covariates, including fixed effects, and running robustness checks (alternative measures, sub-samples); a sketch of the specification follows this entry.
Methods section description: panel regressions including AI and AI^2, interactions for moderators, controls, fixed effects, and robustness analyses reported in the paper.
[high · null result] Attention to Whom? AI Adoption and Corporate Social Responsi... | Outcome: N/A (methodological claim)
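In equation form, the stated specification amounts to something like CSR_it = β1·AI_it + β2·AI_it² + β3·(AI_it × M_it) + γ′X_it + α_i + δ_t + ε_it, with firm fixed effects α_i and year effects δ_t. A hedged sketch using the linearmodels package follows; the variable names, moderator, controls, and input file are all placeholders, not the paper's data.

```python
import pandas as pd
from linearmodels.panel import PanelOLS

# Hypothetical firm-year panel indexed by (firm_id, year), 2013-2023.
df = pd.read_csv("firm_panel.csv").set_index(["firm_id", "year"])
df["ai_sq"] = df["ai"] ** 2  # quadratic AI term

# Quadratic AI specification with a moderator interaction, firm and year
# fixed effects, and firm-level controls (placeholder names).
model = PanelOLS.from_formula(
    "csr ~ 1 + ai + ai_sq + ai:moderator + size + leverage + roa"
    " + EntityEffects + TimeEffects",
    data=df,
)
result = model.fit(cov_type="clustered", cluster_entity=True)
print(result.summary)
```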
Data/sample claim: the empirical analysis uses a panel of 2,575 Chinese listed firms observed from 2013 to 2023.
Paper-stated sample description (panel dataset covering 2013–2023, N = 2,575 firms).
[high · null result] Attention to Whom? AI Adoption and Corporate Social Responsi... | Outcome: N/A (sample description)
The paper recommends an empirical research agenda including field experiments comparing teams with and without AI mediation, structural models of labor supply and wages under reduced language frictions, microdata analysis of adopters, and measurement studies for coordination costs and mediated-action reliability.
Explicit recommendations and research agenda stated in the paper; this is a descriptive claim about the paper's content rather than an empirical finding.
[high · null result] AI as a universal collaboration layer: Eliminating language ... | Outcome: existence of the recommended research agenda items in the paper
The paper's primary approach is conceptual/theoretical development and agenda-setting; it does not report large-scale empirical or experimental data.
Explicit methods statement in the paper: synthesis, illustrative examples, framework development; absence of reported empirical sample or experiments.
[high · null result] AI as a universal collaboration layer: Eliminating language ... | Outcome: presence/absence of empirical/experimental data in the paper
The study's empirical base consists of 40 semi-structured interviews with cross-industry project practitioners in the UK, analyzed using thematic qualitative methods.
Stated data and methods in the paper: sample size (40), interview method, cross-industry sampling, and thematic analysis.
[high · null result] AI in project teams: how trust calibration reconfigures team... | Outcome: study sample and methodology (empirical basis)