The Commonplace

Evidence (7448 claims)

Adoption: 5267 claims
Productivity: 4560 claims
Governance: 4137 claims
Human-AI Collaboration: 3103 claims
Labor Markets: 2506 claims
Innovation: 2354 claims
Org Design: 2340 claims
Skills & Training: 1945 claims
Inequality: 1322 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 378 106 59 455 1007
Governance & Regulation 379 176 116 58 739
Research Productivity 240 96 34 294 668
Organizational Efficiency 370 82 63 35 553
Technology Adoption Rate 296 118 66 29 513
Firm Productivity 277 34 68 10 394
AI Safety & Ethics 117 177 44 24 364
Output Quality 244 61 23 26 354
Market Structure 107 123 85 14 334
Decision Quality 168 74 37 19 301
Fiscal & Macroeconomic 75 52 32 21 187
Employment Level 70 32 74 8 186
Skill Acquisition 89 32 39 9 169
Firm Revenue 96 34 22 152
Innovation Output 106 12 21 11 151
Consumer Welfare 70 30 37 7 144
Regulatory Compliance 52 61 13 3 129
Inequality Measures 24 68 31 4 127
Task Allocation 75 11 29 6 121
Training Effectiveness 55 12 12 16 96
Error Rate 42 48 6 96
Worker Satisfaction 45 32 11 6 94
Task Completion Time 78 5 4 2 89
Wages & Compensation 46 13 19 5 83
Team Performance 44 9 15 7 76
Hiring & Recruitment 39 4 6 3 52
Automation Exposure 18 17 9 5 50
Job Displacement 5 31 12 48
Social Protection 21 10 6 2 39
Developer Productivity 29 3 3 1 36
Worker Turnover 10 12 3 25
Skill Obsolescence 3 19 2 24
Creative Output 15 5 3 1 24
Labor Share of Income 10 4 9 23
The v1.13 release includes a Go reference implementation of 22 packages covering all L1-L4 capabilities.
Repository statement describing a Go reference implementation of 22 packages and claiming coverage of L1-L4.
high null result Agent Control Protocol: Admission Control for Agent Actions number of Go packages in the reference implementation and claimed coverage of co...
The v1.13 specification comprises 36 technical documents organized into five conformance levels (L1-L5).
Explicit quantitative statement in the specification/repository describing document count and organization.
high null result Agent Control Protocol: Admission Control for Agent Actions number of technical documents and conformance-level organization
The experiment compared three prompt conditions: (A) simple prompts, (B) raw PPS JSON, and (C) natural-language-rendered PPS.
Method description of the three prompt conditions used in the controlled experiment.
The study used three specific LLMs: DeepSeek-V3, Qwen-Max, and Kimi.
Method section listing the three models evaluated in the experiment.
We ran a controlled three-condition study across 60 tasks in three domains (business, technical, and travel), three large language models (DeepSeek-V3, Qwen-Max, and Kimi), and three prompt conditions, collecting 540 AI-generated outputs evaluated by an LLM judge.
Authors report an experimental study design: 60 tasks × 3 models × 3 prompt conditions = 540 outputs, with outputs evaluated by an LLM judge (methodological description in the paper).
high null result Evaluating 5W3H Structured Prompting for Intent Alignment in... experimental_data_collection (AI outputs evaluated by LLM judge)
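The reported output count follows from a full factorial design, which can be checked with a short sketch. The task identifiers below are hypothetical placeholders: the entry gives only the total of 60 tasks and the three domains, not the per-domain split. The model and condition labels are taken from the claims above.

```python
from itertools import product

# Hypothetical task IDs: only the total (60) is reported, not the task names
# or how the tasks divide across business, technical, and travel.
tasks = [f"task_{i:02d}" for i in range(1, 61)]
models = ["DeepSeek-V3", "Qwen-Max", "Kimi"]
conditions = [
    "A: simple prompt",
    "B: raw PPS JSON",
    "C: natural-language-rendered PPS",
]

# Every task is run on every model under every prompt condition.
design = list(product(tasks, models, conditions))
print(len(design))  # 540
```

Each tuple in `design` corresponds to one AI-generated output that would be passed to the LLM judge.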
The paper presents a formal evolutionary taxonomy of generative AI spanning five eras (1943–present) and analyzes frontier lab dynamics, sovereign AI emergence, and post-training alignment evolution from RLHF through GRPO.
Conceptual taxonomy and historical/organizational analysis provided in the paper. No empirical sample size reported in the excerpt.
high null result The Institutional Scaling Law: Non-Monotonic Fitness, Capabi... evolutionary taxonomy and contextual analysis of generative AI eras and dynamics
The framework extends the Sustainability Index of Han et al. (2025) from hardware-level analysis to ecosystem-level analysis.
Conceptual / methodological extension claimed by the authors referencing Han et al. (2025). No empirical sample size reported in the excerpt.
high null result The Institutional Scaling Law: Non-Monotonic Fitness, Capabi... scope/level of the Sustainability Index (hardware-level → ecosystem-level)
Classical scaling laws model AI performance as monotonically improving with model size.
Statement about prior literature / modeling assumptions (classical scaling laws). No empirical sample size reported in the excerpt.
high null result The Institutional Scaling Law: Non-Monotonic Fitness, Capabi... AI performance as a function of model size
Existing financial question answering benchmarks primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals.
Literature/background claim made in the paper motivating the new benchmark; authors contrast prior benchmarks' focus on balance sheet data with the lack of market/trading-signal evaluation.
high null result FinTradeBench: A Financial Reasoning Benchmark for LLMs scope of existing financial QA benchmarks (focus on balance sheet data vs. tradi...
Retrieval provides limited benefit for trading-signal reasoning.
Experimental comparison reported in the paper showing that retrieval-augmentation had little impact on performance for trading-signal-focused questions.
high null result FinTradeBench: A Financial Reasoning Benchmark for LLMs change in performance on trading-signal-focused questions with retrieval
To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment.
Methodological claim in the paper describing the QA and annotation pipeline; the paper reports using these components as part of their reliability framework.
high null result FinTradeBench: A Financial Reasoning Benchmark for LLMs benchmark annotation and validation procedure
The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning.
Direct description of the benchmark's taxonomy in the paper; the authors specify these three categories as the organizational structure for the 1,400 questions.
high null result FinTradeBench: A Financial Reasoning Benchmark for LLMs benchmark organization / task taxonomy
FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window.
Statement in the paper describing the benchmark construction and scope; the paper reports the benchmark size (1,400 questions) and the dataset grounding (NASDAQ-100 over ten years).
high null result FinTradeBench: A Financial Reasoning Benchmark for LLMs benchmark size and scope (number of questions; data grounding)
The paper derives formal conditions under which the inversion (smaller, orchestrated models outperforming frontier models) holds.
Mathematical derivations and stated sufficient/necessary conditions presented in the paper.
high null result Punctuated Equilibria in Artificial Intelligence: The Instit... parameter conditions for comparative performance inversion
We develop the Institutional Fitness Manifold, a mathematical framework that evaluates AI systems along four dimensions: capability, institutional trust, affordability, and sovereign compliance.
Theoretical/model development presented in the paper (formal definition of the manifold and its four dimensions).
high null result Punctuated Equilibria in Artificial Intelligence: The Instit... institutional fitness evaluated across four dimensions
There have been five eras of AI development since 1943, and within the current Generative AI Era there are four distinct epochs, each initiated by a discontinuous event.
Descriptive/historical classification within the paper (counts of eras and epochs; named initiating events such as the transformer and the 'DeepSeek Moment').
high null result Punctuated Equilibria in Artificial Intelligence: The Instit... count and classification of historical AI eras/epochs
The study uses panel data for 30 Chinese provinces from 2013–2022 to measure urban circular economy efficiency (UCEE) with a Super-SBM model including undesirable outputs, track dynamics via the Global Malmquist–Luenberger index, and estimate spatial effects with a spatial Durbin model.
Methodological description in the abstract: explicit statement of data (30 provinces, 2013–2022) and the three methods used (Super-SBM with undesirable outputs, GML index, spatial Durbin model).
high null result How artificial intelligence and environmental regulation inf... use of Super-SBM measurement, GML dynamics, and spatial Durbin estimation (metho...
Despite fears of mass unemployment, aggregate labor-market data through 2025 show limited labor-market disruption from generative AI.
Review of aggregate employment and labor-market studies and macro-level data through 2025 cited in the brief; methods include analyses of employment statistics and macro labor indicators (no single sample size reported).
high null result AI, Productivity, and Labor Markets: A Review of the Empiric... aggregate employment / labor-market disruption
We scored rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112 transcript segments from multi-agent governance simulations.
Reported methodology: multi-agent governance simulations with agents in formal governmental roles, outcomes evaluated by an independent rubric-based judge; explicit sample count of 28,112 transcript segments.
high null result I Can't Believe It's Corrupt: Evaluating Corruption in Multi... rule-breaking and abuse outcomes (as assessed by rubric-based judge)
We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines.
Empirical study design described in the paper: open competition with reported counts of teams and participants (29 teams, 80 participants); comparison between participant submissions and AI-only baselines.
high null result AgentDS Technical Report: Benchmarking the Future of Human-A... competition participation enabling comparison
AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking.
Descriptive dataset/benchmark specification in the paper stating task count and industry coverage.
high null result AgentDS Technical Report: Benchmarking the Future of Human-A... number of challenges and industry coverage
Open research challenges that define the research agenda include scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, and designing human-AI specification interactions.
Authors' explicit enumeration of open problems and a proposed multi-disciplinary research agenda; presented as expert opinion rather than empirical finding.
high null result Intent Formalization: A Grand Challenge for Reliable Coding ... progress on research questions (research agenda advancement)
The interaction between selection and recourse generates a closed-loop dynamical system linking candidate selection and strategic recourse.
Formalization in the paper showing feedback dynamics between selection outcomes and candidate adjustments (modeling/result claim).
high null result Actionable Recourse in Competitive Environments: A Dynamic G... closed-loop dynamics between selection and recourse (system state over time)
This setting produces endogenous selection, in which both the decision rule and the selection threshold are determined by the population's current feature state.
Derived implication of the framework and model dynamics described in the paper (theoretical consequence of the model).
high null result Actionable Recourse in Competitive Environments: A Dynamic G... dependence of decision rule and threshold on population feature distribution
The success benchmark evolves endogenously as many candidates adjust simultaneously.
Analytical property of the proposed model: simultaneous adjustments by candidates change the effective benchmark (theoretical result asserted by authors).
high null result Actionable Recourse in Competitive Environments: A Dynamic G... endogenous evolution of the selection benchmark/threshold
The study proposes a framework that models recourse as a strategic interaction among candidates under a risk-based selection rule.
The paper introduces a formal/modeling framework (methodological contribution described by the authors).
high null result Actionable Recourse in Competitive Environments: A Dynamic G... structure of the formal model (strategic interactions under a risk-based rule)
Actionable recourse studies whether individuals can modify feasible features to overturn unfavorable outcomes produced by AI-assisted decision-support systems.
Definition and framing stated by the authors in the paper's introduction/background (conceptual claim).
high null result Actionable Recourse in Competitive Environments: A Dynamic G... ability of individuals to change features to reverse AI-produced outcomes (quali...
A graph neural network (GNN) is built from the agents' reasoning embeddings, and trading decisions are made using a PPO-DSR policy.
Method description: the paper reports embedding agents' reasoning, building a graph neural network (GNN) from those embeddings, and using a PPO-DSR reinforcement learning policy to trade. Specific GNN/PPO-DSR hyperparameters and architecture are not provided in the excerpt.
high null result Can Blindfolded LLMs Still Trade? An Anonymization-First Fra... use of GNN on reasoning embeddings and use of PPO-DSR policy to produce trading ...
Four LLM agents output scores along with reasoning.
Method description: the paper states that four LLM agents produce numeric scores and associated textual reasoning. The number of agents is explicitly given as four; no further architecture or model-family details included in the excerpt.
high null result Can Blindfolded LLMs Still Trade? An Anonymization-First Fra... agent outputs: numeric scores and textual reasoning
BlindTrade anonymizes tickers and company names (blindfolding agents by anonymizing all identifiers).
Methodological description in the paper: the system design explicitly replaces tickers and company names with anonymized identifiers. Implementation details and examples not provided in the excerpt.
high null result Can Blindfolded LLMs Still Trade? An Anonymization-First Fra... presence/absence of identifier anonymization (anonymization applied to input dat...
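As an illustration of what identifier anonymization can look like, here is a minimal sketch. It is not BlindTrade's implementation, which the excerpt does not describe. The function name `build_blindfold`, the `ASSET_001` placeholder format, and the sample tickers and company names are all assumptions made for this example.

```python
import re

def build_blindfold(entities):
    """Return a function that replaces known identifiers with placeholders.

    entities maps an arbitrary key to that entity's aliases (ticker, company
    name). This structure is hypothetical and not taken from the paper.
    """
    alias_to_token = {}
    for index, aliases in enumerate(entities.values(), start=1):
        token = f"ASSET_{index:03d}"  # the same placeholder for every alias
        for alias in aliases:
            alias_to_token[alias.lower()] = token

    # Try longer aliases first so a full company name wins over a shorter
    # alias that happens to be a prefix of it.
    ordered = sorted(alias_to_token, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in ordered) + r")\b",
        re.IGNORECASE,
    )

    def blindfold(text):
        return pattern.sub(lambda m: alias_to_token[m.group(0).lower()], text)

    return blindfold

# Fictional identifiers, used only to exercise the sketch.
blind = build_blindfold({
    "x": ["XMPL", "Example Corp"],
    "y": ["DEMO", "Demo Industries"],
})
print(blind("Example Corp (XMPL) beat DEMO."))
```

Mapping a ticker and its company name to the same placeholder keeps references consistent within a document while removing the identifying strings.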
Data ethics, as a central pillar of digital ethics, emphasizes the responsible use and protection of personal information.
Conceptual/definitional statement in the paper situating data ethics within digital ethics and highlighting protection of personal information as a core concern.
Big data usage is proxied by keyword frequency in firms' annual reports.
Operationalization described in the paper: frequency/count of big-data-related keywords in annual reports used as the proxy for firms' big data application.
high null result How Big Data Enhances Firm Value Under Data Privacy Regulati... big data usage (proxy)
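A keyword-frequency proxy of this kind can be sketched as follows. The keyword list is a hypothetical stand-in, since the entry does not enumerate the terms the authors counted. The original reports are presumably in Chinese, where matching on word boundaries would not apply, so this English-language version is illustrative only.

```python
import re

# Hypothetical keyword list: the terms actually used are not given in the entry.
KEYWORDS = ["big data", "data mining", "cloud computing"]

def keyword_frequency(report_text, keywords=KEYWORDS):
    """Count whole-phrase, case-insensitive keyword occurrences in a report."""
    text = report_text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(keyword) + r"\b", text))
        for keyword in keywords
    )

sample = "Big data and data mining. Big data again."
print(keyword_frequency(sample))  # 3
```

In a firm-year panel, this count would be computed once per annual report and could then enter a regression as the big data usage variable.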
The empirical analysis uses a fixed-effects regression approach to measure the impact of big data application on firm value.
Methodological statement in the paper specifying fixed-effects regression as the primary econometric approach.
The study analyzes panel data covering Chinese A-share listed companies from 2007 to 2021.
Description of dataset in the paper: panel of Chinese A-share listed companies spanning the years 2007–2021 (sample period stated).
The analysis extends the dynamic taxation setup of Slavik and Yazici (2014).
Methodological claim: the model and solution approach build on and modify the framework from Slavik and Yazici (2014) (reference to prior theoretical framework rather than empirical data).
high null result Workers' Incentives and the Optimal Taxation of AI scope and structure of the theoretical model (extension of the referenced dynami...
We characterize the optimal tax policy in an economy with human manual and cognitive labor, physical capital, and artificial intelligence (AI).
Theoretical/analytical work: the paper develops and analyzes a dynamic general-equilibrium model that includes manual and cognitive human labor, physical capital, and AI. (No empirical sample; model-based characterization.)
high null result Workers' Incentives and the Optimal Taxation of AI form and properties of the optimal tax policy in the specified theoretical econo...
Self-concordance did not mediate the AI-over-questionnaire effect on goal progress.
Preplanned mediation model reported in the paper found no evidence that self-concordance mediated the AI vs questionnaire effect on goal progress; reported as non-significant in the preregistered analysis.
high null result AI-Assisted Goal Setting Improves Goal Progress Through Soci... goal progress (mediator tested: self-concordance, self-report)
Compared with the matched written-reflection questionnaire, the AI did not significantly improve overall goal progress.
Preplanned comparison within the preregistered RCT; reported non-significant difference between AI and written-reflection condition on overall goal progress at two-week follow-up (no significant p-value reported in the summary).
high null result AI-Assisted Goal Setting Improves Goal Progress Through Soci... goal progress (self-reported goal progress at two-week follow-up)
We conducted a preregistered three-arm randomized controlled trial (RCT) comparing an AI career coach ('Leon,' powered by Claude Sonnet), a matched structured written questionnaire, and a no-support control.
Preregistered RCT reported in the paper; three arms as described; total sample size N = 517; participants randomized to AI coach, written-reflection questionnaire, or no-support control; outcomes assessed at two-week follow-up.
high null result AI-Assisted Goal Setting Improves Goal Progress Through Soci... trial design / allocation and follow-up measurement of goal-related outcomes at ...
All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
Provision of a public GitHub repository URL in the paper.
high null result TDAD: Test-Driven Agentic Development - Reducing Code Regres... availability of code, data, and logs (public repository)
Evaluation was performed on SWE-bench Verified with two local models: Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances.
Experimental setup reported in the paper specifying benchmark (SWE-bench Verified) and model-instance counts.
high null result TDAD: Test-Driven Agentic Development - Reducing Code Regres... evaluation sample size / benchmark coverage (number of instances per model)
Controlled experiments were run with N = 250 across five content types to validate the mechanisms.
Experimental methods reported in the paper: controlled experiments with specified sample size and content-type breakdown.
high null result Governed Memory: A Production Architecture for Multi-Agent W... experimental sample size and content-type breadth (N=250, 5 content types)
The dependent variable is the Market Opportunity Index, which is a combination of indicators of innovation activity, the share of firms with new products, and the share of opportunity-oriented entrepreneurs.
Paper provides the construction/definition of the dependent variable (components listed in the excerpt).
high null result Innovative Cognitive Tools for Studying Market Opportunities... Market Opportunity Index (composite of innovation activity, share of firms with ...
The model includes lags of the dependent variable to account for inertia in the development of entrepreneurial opportunities, and the stability of the impact of cognitive tools was tested.
Paper states the model specification included lagged dependent variables and that stability tests for the impact of cognitive tools were performed (no further details on lag length or test statistics in the excerpt).
high null result Innovative Cognitive Tools for Studying Market Opportunities... Market Opportunity Index (lagged dynamics)
The study's methodological foundation is panel econometric modelling, which accounts for international differences observed over time and for the dynamics of domestic indicators.
Description of methods in the paper: use of panel econometric modelling on an international panel over the 2020–2024 period (sample size not specified in the excerpt).
high null result Innovative Cognitive Tools for Studying Market Opportunities... dynamics of the Market Opportunity Index across countries and over time
The field study used a 44-item questionnaire with 45 participants to measure comprehension, reported behavior change/adoption, and perceptions of volunteer legitimacy.
Methodological description provided in the paper: instrument and sample sizes explicitly reported.
high null result From Linguistic Hybridity to Development Sovereignty: Pidgin... study design details (instrument and sample size)
Research agenda: empirical microdata on managerial time use, task-level automation, performance outcomes, and wage impacts are needed to quantify substitution versus complementarity and to evaluate human-in-the-loop designs' effects on firm performance and distributional outcomes.
Explicit methodological recommendation within the paper; identifies gaps due to the paper's conceptual (non-empirical) approach.
high null result Comparative analysis of strategic vs. computational thinking... availability and use of microdata on managerial tasks, automation, firm performa...
No original quantitative dataset or controlled evaluation is reported in this paper.
Methodological description in the paper stating reliance on prior literature, conceptual analysis, and prescriptive recommendations; paper does not present new experiments.
high null result LLM Alignment should go beyond Harmlessness–Helpfulness and ... existence of original empirical data or controlled experiments in the paper
The paper is a position/normative paper (not an empirical study) that uses conceptual analysis, literature synthesis, and prescriptive roadmapping rather than new quantitative experiments or datasets.
Explicit methodological statement in the paper summarizing genre and methods used; absence of reported original data or controlled evaluations.
high null result LLM Alignment should go beyond Harmlessness–Helpfulness and ... presence or absence of original empirical data / controlled evaluation in the pa...
There is a need for longitudinal and cross‑country empirical research to measure how hybrid work and AI tools affect promotion rates, network centrality, productivity, privacy harms, trust, and long‑term career trajectories.
Statement of research gaps derived from the paper's methodological approach (conceptual synthesis and secondary case studies) and absence of longitudinal/cross‑cultural primary data.
high null result The Sociology of Remote Work and Organisational Culture: How... research gap existence (need for longitudinal and cross‑country empirical studie...