The Commonplace

Evidence (2954 claims)

Adoption: 5126 claims
Productivity: 4409 claims
Governance: 4049 claims
Human-AI Collaboration: 2954 claims
Labor Markets: 2432 claims
Org Design: 2273 claims
Innovation: 2215 claims
Skills & Training: 1902 claims
Inequality: 1286 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 369 105 58 432 972
Governance & Regulation 365 171 113 54 713
Research Productivity 229 95 33 294 655
Organizational Efficiency 354 82 58 34 531
Technology Adoption Rate 277 115 63 27 486
Firm Productivity 273 33 68 10 389
AI Safety & Ethics 112 177 43 24 358
Output Quality 228 61 23 25 337
Market Structure 105 118 81 14 323
Decision Quality 154 68 33 17 275
Employment Level 68 32 74 8 184
Fiscal & Macroeconomic 74 52 32 21 183
Skill Acquisition 85 31 38 9 163
Firm Revenue 96 30 22 148
Innovation Output 100 11 20 11 143
Consumer Welfare 66 29 35 7 137
Regulatory Compliance 51 61 13 3 128
Inequality Measures 24 66 31 4 125
Task Allocation 64 6 28 6 104
Error Rate 42 47 6 95
Training Effectiveness 55 12 10 16 93
Worker Satisfaction 42 32 11 6 91
Task Completion Time 71 5 3 1 80
Wages & Compensation 38 13 19 4 74
Team Performance 41 8 15 7 72
Hiring & Recruitment 39 4 6 3 52
Automation Exposure 17 15 9 5 46
Job Displacement 5 28 12 45
Social Protection 18 8 6 1 33
Developer Productivity 25 1 2 1 29
Worker Turnover 10 12 3 25
Creative Output 15 5 3 1 24
Skill Obsolescence 3 18 2 23
Labor Share of Income 7 4 9 20
Active filter: Human-AI Collaboration
The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning.
Direct description of the benchmark's taxonomy in the paper; the authors specify these three categories as the organizational structure for the 1,400 questions.
Confidence: high · Direction: null result · Source: FinTradeBench: A Financial Reasoning Benchmark for LLMs · Outcome: benchmark organization / task taxonomy
FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window.
Statement in the paper describing the benchmark construction and scope; the paper reports the benchmark size (1,400 questions) and the dataset grounding (NASDAQ-100 over ten years).
Confidence: high · Direction: null result · Source: FinTradeBench: A Financial Reasoning Benchmark for LLMs · Outcome: benchmark size and scope (number of questions; data grounding)
Despite fears of mass unemployment, aggregate data through 2025 show limited labor-market disruption from generative AI.
Review of aggregate employment and labor-market studies and macro-level data through 2025 cited in the brief; methods include analyses of employment statistics and macro labor indicators (no single sample size reported).
Confidence: high · Direction: null result · Source: AI, Productivity, and Labor Markets: A Review of the Empiric... · Outcome: aggregate employment / labor-market disruption
We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines.
Empirical study design described in the paper: open competition with reported counts of teams and participants (29 teams, 80 participants); comparison between participant submissions and AI-only baselines.
Confidence: high · Direction: null result · Source: AgentDS Technical Report: Benchmarking the Future of Human-A... · Outcome: competition participation enabling comparison
AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking.
Descriptive dataset/benchmark specification in the paper stating task count and industry coverage.
Confidence: high · Direction: null result · Source: AgentDS Technical Report: Benchmarking the Future of Human-A... · Outcome: number of challenges and industry coverage
Open research challenges that define the research agenda include scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, and designing human-AI specification interactions.
Authors' explicit enumeration of open problems and a proposed multi-disciplinary research agenda; presented as expert opinion rather than empirical finding.
Confidence: high · Direction: null result · Source: Intent Formalization: A Grand Challenge for Reliable Coding ... · Outcome: progress on research questions (research agenda advancement)
The interaction between selection and recourse generates a closed-loop dynamical system linking candidate selection and strategic recourse.
Formalization in the paper showing feedback dynamics between selection outcomes and candidate adjustments (modeling/result claim).
Confidence: high · Direction: null result · Source: Actionable Recourse in Competitive Environments: A Dynamic G... · Outcome: closed-loop dynamics between selection and recourse (system state over time)
This setting produces endogenous selection, in which both the decision rule and the selection threshold are determined by the population's current feature state.
Derived implication of the framework and model dynamics described in the paper (theoretical consequence of the model).
Confidence: high · Direction: null result · Source: Actionable Recourse in Competitive Environments: A Dynamic G... · Outcome: dependence of decision rule and threshold on population feature distribution
The success benchmark evolves endogenously as many candidates adjust simultaneously.
Analytical property of the proposed model: simultaneous adjustments by candidates change the effective benchmark (theoretical result asserted by authors).
Confidence: high · Direction: null result · Source: Actionable Recourse in Competitive Environments: A Dynamic G... · Outcome: endogenous evolution of the selection benchmark/threshold
The study proposes a framework that models recourse as a strategic interaction among candidates under a risk-based selection rule.
The paper introduces a formal/modeling framework (methodological contribution described by the authors).
Confidence: high · Direction: null result · Source: Actionable Recourse in Competitive Environments: A Dynamic G... · Outcome: structure of the formal model (strategic interactions under a risk-based rule)
Actionable recourse studies whether individuals can modify feasible features to overturn unfavorable outcomes produced by AI-assisted decision-support systems.
Definition and framing stated by the authors in the paper's introduction/background (conceptual claim).
Confidence: high · Direction: null result · Source: Actionable Recourse in Competitive Environments: A Dynamic G... · Outcome: ability of individuals to change features to reverse AI-produced outcomes (quali...
Self-concordance did not mediate the AI-over-questionnaire effect on goal progress.
Preplanned mediation model reported in the paper found no evidence that self-concordance mediated the AI vs questionnaire effect on goal progress; reported as non-significant in the preregistered analysis.
Confidence: high · Direction: null result · Source: AI-Assisted Goal Setting Improves Goal Progress Through Soci... · Outcome: goal progress (mediator tested: self-concordance, self-report)
Compared with the matched written-reflection questionnaire, the AI did not significantly improve overall goal progress.
Preplanned comparison within the preregistered RCT; the difference between the AI and written-reflection conditions on overall goal progress at two-week follow-up was non-significant (exact p-value not reported in the summary).
Confidence: high · Direction: null result · Source: AI-Assisted Goal Setting Improves Goal Progress Through Soci... · Outcome: goal progress (self-reported goal progress at two-week follow-up)
We conducted a preregistered three-arm randomized controlled trial (RCT) comparing an AI career coach ('Leon,' powered by Claude Sonnet), a matched structured written questionnaire, and a no-support control.
Preregistered RCT reported in the paper; three arms as described; total sample size N = 517; participants randomized to AI coach, written-reflection questionnaire, or no-support control; outcomes assessed at two-week follow-up.
Confidence: high · Direction: null result · Source: AI-Assisted Goal Setting Improves Goal Progress Through Soci... · Outcome: trial design / allocation and follow-up measurement of goal-related outcomes at ...
All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
Provision of a public GitHub repository URL in the paper.
Confidence: high · Direction: null result · Source: TDAD: Test-Driven Agentic Development - Reducing Code Regres... · Outcome: availability of code, data, and logs (public repository)
Evaluation was performed on SWE-bench Verified with two local models: Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances.
Experimental setup reported in the paper specifying benchmark (SWE-bench Verified) and model-instance counts.
Confidence: high · Direction: null result · Source: TDAD: Test-Driven Agentic Development - Reducing Code Regres... · Outcome: evaluation sample size / benchmark coverage (number of instances per model)
Controlled experiments were run with N = 250 across five content types to validate the mechanisms.
Experimental methods reported in the paper: controlled experiments with specified sample size and content-type breakdown.
Confidence: high · Direction: null result · Source: Governed Memory: A Production Architecture for Multi-Agent W... · Outcome: experimental sample size and content-type breadth (N=250, 5 content types)
Research agenda: empirical microdata on managerial time use, task-level automation, performance outcomes, and wage impacts are needed to quantify substitution versus complementarity and to evaluate human-in-the-loop designs' effects on firm performance and distributional outcomes.
Explicit methodological recommendation within the paper; identifies gaps due to the paper's conceptual (non-empirical) approach.
Confidence: high · Direction: null result · Source: Comparative analysis of strategic vs. computational thinking... · Outcome: availability and use of microdata on managerial tasks, automation, firm performa...
There is a need for longitudinal and cross‑country empirical research to measure how hybrid work and AI tools affect promotion rates, network centrality, productivity, privacy harms, trust, and long‑term career trajectories.
Statement of research gaps derived from the paper's methodological approach (conceptual synthesis and secondary case studies) and absence of longitudinal/cross‑cultural primary data.
Confidence: high · Direction: null result · Source: The Sociology of Remote Work and Organisational Culture: How... · Outcome: research gap existence (need for longitudinal and cross‑country empirical studie...
Practical recommendations for firms and policymakers include investing in training for AI curation/evaluation/coordination, experimenting with decentralised decision rights and governance safeguards, and monitoring competitive dynamics related to model/platform providers.
Policy and practitioner takeaways explicitly presented in the discussion/implications sections, deriving from the conceptual framework and mapped literature.
Confidence: high · Direction: null result · Source: Generative AI and the algorithmic workplace: a bibliometric ... · Outcome: recommended organisational and policy actions
The paper recommends a research agenda for AI economists: causal microeconometric studies (DiD, IVs, RCTs), structural models with hybrid human–AI agents, measurement work on GenAI use, distributional analysis and policy evaluation.
Explicit recommendations listed in the implications and research agenda sections; logical follow‑on from bibliometric findings about gaps in causal and measurement evidence.
Confidence: high · Direction: null result · Source: Generative AI and the algorithmic workplace: a bibliometric ... · Outcome: recommended methodological directions for future empirical and theoretical resea...
Bibliometric mapping profiles the intellectual structure and evolution of the field but does not establish causal effects of GenAI on organisational outcomes.
Methodological limitation explicitly stated in the paper; bibliometric approach (co‑word, citation, thematic mapping) is descriptive and historical in scope.
Confidence: high · Direction: null result · Source: Generative AI and the algorithmic workplace: a bibliometric ... · Outcome: methodological limitation (inability to infer causality from bibliometric mappin...
Co‑word and thematic analyses reveal six coherent conceptual clusters that bridge technical AI topics (e.g., LLMs, GANs) with managerial themes (e.g., autonomy, coordination, decision‑making).
Thematic mapping and co‑word network analysis performed on the 212‑paper corpus; identification of six clusters reported in results.
Confidence: high · Direction: null result · Source: Generative AI and the algorithmic workplace: a bibliometric ... · Outcome: number and thematic composition of conceptual clusters (six clusters linking tec...
Bibliometric and conceptual tools (VOSviewer, Bibliometrix) were used to identify performance trends, co‑word structures, thematic maps, and conceptual evolution in the GenAI–organisation literature.
Methods section: use of VOSviewer for network visualization and Bibliometrix for bibliometric statistics, co‑word analysis, thematic mapping and Sankey thematic evolution.
Confidence: high · Direction: null result · Source: Generative AI and the algorithmic workplace: a bibliometric ... · Outcome: types of bibliometric analyses applied (performance trends, co‑word structures, ...
The study analysed a corpus of 212 Scopus‑indexed publications covering 2018–2025 to map emergent literature on Generative AI and organisational change.
Bibliometric dataset constructed from Scopus; sample size = 212 peer‑reviewed articles; time window 2018–2025; analyses performed with Bibliometrix and VOSviewer.
Confidence: high · Direction: null result · Source: Generative AI and the algorithmic workplace: a bibliometric ... · Outcome: size and timeframe of bibliometric corpus (number of publications, 2018–2025)
Outcomes reported are primarily self-reported psychological measures rather than objective productivity metrics.
Paper reports measurement instruments focused on self-reported self-efficacy, psychological ownership, meaningfulness, and enjoyment/satisfaction; no primary objective productivity metrics reported.
Confidence: high · Direction: null result · Source: Relying on AI at work reduces self-efficacy, ownership, and ... · Outcome: measurement type (self-reported psychological outcomes)
The experiment was pre-registered, used occupation-specific writing tasks, and employed a between-subjects design with three conditions (No-AI, Passive AI, Active collaboration).
Study design reported in the paper: pre-registration statement, N = 269, between-subjects assignment to three conditions using occupation-specific writing tasks.
Confidence: high · Direction: null result · Source: Relying on AI at work reduces self-efficacy, ownership, and ... · Outcome: n/a (methodological claim)
Active, collaborative AI use preserves perceived meaningfulness of work at levels comparable to independent work and does not produce the lasting psychological costs seen with passive use.
Pre-registered experiment (N = 269) with post-manipulation and post-return measures; Active-collaboration condition matched No-AI on meaningfulness and showed no persistent declines after returning to manual tasks.
Confidence: high · Direction: null result · Source: Relying on AI at work reduces self-efficacy, ownership, and ... · Outcome: perceived meaningfulness of work (including post-return)
Active, collaborative AI use preserves psychological ownership of outputs at levels comparable to independent work.
Pre-registered experiment (N = 269); Active-collaboration condition reported ownership levels similar to No-AI condition on self-report scales.
Confidence: high · Direction: null result · Source: Relying on AI at work reduces self-efficacy, ownership, and ... · Outcome: psychological ownership of outputs
Active, collaborative AI use (human drafts first, then uses AI to refine) preserves self-efficacy at levels comparable to independent (no-AI) work.
Pre-registered experiment (N = 269) comparing Active-collaboration and No-AI conditions; no statistically meaningful differences in self-efficacy between them (self-reported measures).
Confidence: high · Direction: null result · Source: Relying on AI at work reduces self-efficacy, ownership, and ... · Outcome: self-efficacy (confidence to complete tasks without AI)
The authors propose research priorities for economists: quantify productivity gains from closing the actionability gap; estimate firm-level heterogeneity in evaluation capability and its effect on adoption; and model investment trade-offs between building evaluation-to-action pipelines versus accepting reduced LLM performance.
Paper's concluding recommendations for future research directions (explicitly listed by the authors).
Confidence: high · Direction: null result · Source: Results-Actionability Gap: Understanding How Practitioners E... · Outcome: recommended research agenda topics
The paper produces as primary outcomes a taxonomy of ten evaluation practices, the articulation of the results-actionability gap, and recommended strategies observed among successful teams.
Authors report these as the main outcomes of their thematic analysis and syntheses from the 19 interviews.
Confidence: high · Direction: null result · Source: Results-Actionability Gap: Understanding How Practitioners E... · Outcome: reported study outputs (taxonomy, articulated gap, recommended strategies)
The study method consisted of semi-structured qualitative interviews with 19 practitioners across multiple industries and roles, analyzed via thematic coding.
Explicit methods section of the paper stating sample size (n=19), participant diversity, interview approach, and coding/analysis procedure.
Confidence: high · Direction: null result · Source: Results-Actionability Gap: Understanding How Practitioners E... · Outcome: study design and sample size
The methodology is normative-philosophical argumentation supplemented by interdisciplinary synthesis (phenomenology, deconstruction, OOO, STS/material turn); this is not an empirical causal study and contains no quantitative datasets.
Author-declared methods and limits: statement that the intervention is theory-driven and qualitative; absence of quantitative analysis reported.
Confidence: high · Direction: null result · Source: Examining ethical challenges in human–robot interaction usin... · Outcome: study type and presence/absence of quantitative data (methodological)
The paper’s empirical grounding consists of illustrative case studies and vignettes from healthcare robotics, autonomous vehicles, and algorithmic governance used to demonstrate distributed agency and responsibility.
Author-stated methodology: qualitative vignettes/case illustrations across three domains; no reported sample sizes or systematic data collection.
Confidence: high · Direction: null result · Source: Examining ethical challenges in human–robot interaction usin... · Outcome: use of illustrative case material (methodological/descriptive)
The experiment used NYSE TAQ transaction and quote data for SPY covering 2015–2024 and tested six pre-specified hypotheses about market-quality trends.
Data and methods section specifying dataset (NYSE TAQ SPY, 2015–2024), the number of pre-specified hypotheses (six), and experimental protocol with 150 autonomous agents.
Confidence: high · Direction: null result · Source: Nonstandard Errors in AI Agents · Outcome: dataset and experimental design variables (data coverage, number of hypotheses t...
Agents' methodological choices and resulting effect estimates were systematically recorded and used to quantify dispersion and measure switching across stages.
Study design description: recorded agents' methodological choices (measure selection, estimation procedures), resulting estimates, and tracked switching and dispersion metrics (IQR) across the three-stage protocol applied to SPY TAQ data (2015–2024) with 150 agents.
Confidence: high · Direction: null result · Source: Nonstandard Errors in AI Agents · Outcome: recorded methodological choices (categorical), effect estimates (continuous), di...
AI peer review (agents exchanging written critiques) produced minimal reduction in dispersion of estimates.
Three-stage protocol: after stage 1 (independent analyses) and stage 2 (AI peer review), measured dispersion (e.g., IQR) across agents showed little change following the peer-review stage across the six hypotheses and agent pool (n=150).
Confidence: high · Direction: null result · Source: Nonstandard Errors in AI Agents · Outcome: change in dispersion (IQR) of estimates between independent-analysis stage and p...
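The dispersion metric named above, the interquartile range (IQR) of agent effect estimates, can be sketched as follows. This is a minimal illustration, not the paper's code, and the estimate values are hypothetical placeholders rather than results from the SPY TAQ experiment.

```python
import statistics

def iqr(estimates):
    """Interquartile range: spread between the 25th and 75th percentiles."""
    q1, _, q3 = statistics.quantiles(estimates, n=4)
    return q3 - q1

# Hypothetical effect estimates from an agent pool, before and after
# the peer-review stage (illustrative numbers only).
stage1 = [0.8, 1.1, 0.4, 1.9, 0.7, 1.3, 0.9, 1.5]
stage2 = [0.8, 1.0, 0.5, 1.8, 0.7, 1.2, 0.9, 1.4]

# A small positive value here would mirror the "minimal reduction" finding.
reduction = iqr(stage1) - iqr(stage2)
```

Comparing the IQR across protocol stages in this way gives a single dispersion number per hypothesis, so switching and convergence can be tracked over the three stages.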
The work is qualitative and exploratory — presenting naturalistic phenomena rather than causal empirical estimates, and is intended to be hypothesis-generating rather than definitive.
Methodology explicitly stated: naturalistic, qualitative daily observations over one month across multiple platforms; comparative observational documentation without experimental manipulation or causal identification.
Confidence: high · Direction: null result · Source: When Openclaw Agents Learn from Each Other: Insights from Em... · Outcome: nature of evidence (qualitative/exploratory vs. causal inference)
CoMAI is a modular, four-agent interview-assessment framework coordinated by a centralized finite-state machine.
System design and implementation described in the paper: a pipeline of four specialized agents (question generation, security/validation, scoring by rubric, summarization/reporting) with a centralized finite-state machine enforcing workflow and information flow constraints.
Confidence: high · Direction: null result · Source: CoMAI: A Collaborative Multi-Agent Framework for Robust and ... · Outcome: system architecture (agent decomposition and FSM coordination)
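The coordination pattern described, specialized agents sequenced by a centralized finite-state machine, can be sketched as below. The stage names, transition table, and handler interface are assumptions for illustration, not CoMAI's actual implementation.

```python
from enum import Enum, auto

class Stage(Enum):
    QUESTION = auto()    # question-generation agent
    VALIDATE = auto()    # security/validation agent
    SCORE = auto()       # rubric-scoring agent
    SUMMARIZE = auto()   # summarization/reporting agent
    DONE = auto()

# Allowed transitions for the central coordinator; the FSM enforces
# workflow order and, implicitly, which agent sees which data.
TRANSITIONS = {
    Stage.QUESTION: Stage.VALIDATE,
    Stage.VALIDATE: Stage.SCORE,
    Stage.SCORE: Stage.SUMMARIZE,
    Stage.SUMMARIZE: Stage.DONE,
}

def run(handlers, payload):
    """Drive the pipeline: each stage's handler transforms the payload."""
    state = Stage.QUESTION
    while state is not Stage.DONE:
        payload = handlers[state](payload)
        state = TRANSITIONS[state]
    return payload
```

Because every transition goes through the table, an agent cannot be invoked out of order, which is the information-flow constraint the centralized FSM is meant to enforce.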
Field experiments (A/B testing) and willingness-to-pay experiments are necessary to quantify monetary benefits, adoption curves, and optimal pricing for alignment capabilities.
Paper explicitly recommends these empirical approaches in the recommendations for economists and product teams; this is a methodological recommendation rather than an empirical finding.
Confidence: high · Direction: null result · Source: A Context Alignment Pre-processor for Enhancing the Coherenc... · Outcome: adoption rates, willingness-to-pay, retention, task completion differences acros...
Recommended evaluation directions include automatic metrics (embedding similarity, task success, turn counts), human evaluation (satisfaction, perceived collaboration), and A/B testing in deployed settings (latency, compute, retention).
Paper's explicit evaluation proposals and recommended metrics listed in the Data & Methods and Evaluation Directions sections; these are prescriptive recommendations rather than executed experiments.
Confidence: high · Direction: null result · Source: A Context Alignment Pre-processor for Enhancing the Coherenc... · Outcome: specified evaluation metrics (task success rate, turn counts, retention, latency...
The paper focuses on architecture and conceptual arguments rather than reporting large-scale empirical datasets or results.
Data & Methods section and overall document framing emphasize architecture description and proposed evaluations; explicitly notes absence of large-scale empirical results in the provided summary.
Confidence: high · Direction: null result · Source: A Context Alignment Pre-processor for Enhancing the Coherenc... · Outcome: presence/absence of large-scale empirical evaluation
Alignment verification can be implemented using semantic embeddings (cosine similarity) or learned classifiers with threshold-based decision branching.
Paper describes these as recommended implementation approaches for the alignment verification component; no empirical benchmark comparing methods is reported.
Confidence: high · Direction: null result · Source: A Context Alignment Pre-processor for Enhancing the Coherenc... · Outcome: similarity scores, classifier accuracy, false positive/negative rates for drift ...
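The embedding-based branch of the verification approach can be sketched with a plain cosine-similarity check. The `aligned` helper and the 0.75 threshold are illustrative assumptions, not values from the paper, which reports no benchmark comparing the two methods.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def aligned(reply_vec, context_vec, threshold=0.75):
    """Threshold-based decision branching: above the (tunable) threshold
    the reply passes through; below it, it is flagged for realignment."""
    return cosine(reply_vec, context_vec) >= threshold
```

A learned classifier would replace `cosine` with a model score but keep the same threshold-and-branch structure.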
Temporal decay in the retrieval component can be modeled with functions such as exponential decay and a tunable half-life parameter applied to dialogue-turn embeddings.
Methodological description in the paper specifying temporal decay modeling options (exponential decay example) and tunable parameters; descriptive claim about intended implementation (no empirical comparison of decay functions provided).
Confidence: high · Direction: null result · Source: A Context Alignment Pre-processor for Enhancing the Coherenc... · Outcome: decay parameter values / impact of decay function on retrieval weighting
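A half-life-parameterized exponential decay of the kind described might look like the following sketch; the default `half_life` of 8 turns is an arbitrary illustrative choice, since the paper does not report empirical parameter values.

```python
def decay_weight(turns_ago, half_life=8.0):
    """Exponential decay: a turn `half_life` steps back gets weight 0.5."""
    return 0.5 ** (turns_ago / half_life)

def weighted_scores(similarities, ages, half_life=8.0):
    """Down-weight each stored turn embedding's retrieval score by recency."""
    return [s * decay_weight(a, half_life) for s, a in zip(similarities, ages)]
```

Tuning `half_life` trades off recency bias against long-range context: a small value makes retrieval favor the last few turns, a large value approaches uniform weighting.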
Research agenda items for economists include: quantifying willingness-to-pay for verifiable reasoning, studying labor-market impacts for validators, designing contracts/mechanisms to incentivize truthful argument provision, and evaluating regulatory interventions.
Paper's stated research and policy agenda; prescriptive rather than empirical.
Confidence: high · Direction: null result · Source: Argumentative Human-AI Decision-Making: Toward AI Agents Tha... · Outcome: existence and prioritization of empirical research on WTP, labor impacts, mechan...
Evaluation currently lacks metrics and benchmarks for argument quality, fidelity, contestability, and human trust; developing these is necessary.
Paper notes the gap and proposes evaluation metrics and experimental designs; no new benchmarks introduced.
Confidence: high · Direction: null result · Source: Argumentative Human-AI Decision-Making: Toward AI Agents Tha... · Outcome: availability and maturity of evaluation metrics and benchmarks
Methodology is primarily conceptual and normative: the paper synthesizes policy texts, safety standards, and crisis-management literature and relies on illustrative mappings and thought experiments rather than new empirical field data.
Authors' methodological description in the Data & Methods section (explicit statement about sources and use of thought experiments).
Confidence: high · Direction: null result · Source: Resilience Meets Autonomy: Governing Embodied AI in Critical... · Outcome: methodological characterization (use of conceptual synthesis vs. empirical data ...
The paper defines and specifies four oversight modes (spanning near-full autonomy to strict human control) and provides criteria for selecting modes based on task complexity, risk level, and consequence severity.
Conceptual taxonomy developed in the paper; mapping exercises and triage framework (risk–complexity–consequence) presented as illustrative mappings (no empirical testing).
Confidence: high · Direction: null result · Source: Resilience Meets Autonomy: Governing Embodied AI in Critical... · Outcome: existence and specification of four oversight modes and their mapping criteria (...
Sample sizes reported: human–AI experiment n = 126; human–human benchmark n = 108.
Study's Data & Methods section reporting sample sizes for the human–AI experiment (n = 126) and citing the human–human benchmark (Dvorak & Fehrler 2024, n = 108).