The Commonplace

Evidence (4137 claims)

Adoption: 5267 claims
Productivity: 4560 claims
Governance: 4137 claims
Human-AI Collaboration: 3103 claims
Labor Markets: 2506 claims
Innovation: 2354 claims
Org Design: 2340 claims
Skills & Training: 1945 claims
Inequality: 1322 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 378 106 59 455 1007
Governance & Regulation 379 176 116 58 739
Research Productivity 240 96 34 294 668
Organizational Efficiency 370 82 63 35 553
Technology Adoption Rate 296 118 66 29 513
Firm Productivity 277 34 68 10 394
AI Safety & Ethics 117 177 44 24 364
Output Quality 244 61 23 26 354
Market Structure 107 123 85 14 334
Decision Quality 168 74 37 19 301
Fiscal & Macroeconomic 75 52 32 21 187
Employment Level 70 32 74 8 186
Skill Acquisition 89 32 39 9 169
Firm Revenue 96 34 22 152
Innovation Output 106 12 21 11 151
Consumer Welfare 70 30 37 7 144
Regulatory Compliance 52 61 13 3 129
Inequality Measures 24 68 31 4 127
Task Allocation 75 11 29 6 121
Training Effectiveness 55 12 12 16 96
Error Rate 42 48 6 96
Worker Satisfaction 45 32 11 6 94
Task Completion Time 78 5 4 2 89
Wages & Compensation 46 13 19 5 83
Team Performance 44 9 15 7 76
Hiring & Recruitment 39 4 6 3 52
Automation Exposure 18 17 9 5 50
Job Displacement 5 31 12 48
Social Protection 21 10 6 2 39
Developer Productivity 29 3 3 1 36
Worker Turnover 10 12 3 25
Skill Obsolescence 3 19 2 24
Creative Output 15 5 3 1 24
Labor Share of Income 10 4 9 23
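The counts in the matrix can be turned into direction shares with simple arithmetic. The sketch below is illustrative only: it copies three rows from the table above, and the function names `direction_shares` and `net_direction` are hypothetical. In some rows the listed Total exceeds the sum of the four direction columns (for Firm Productivity, 277 + 34 + 68 + 10 = 389 against a Total of 394); the source does not say why, so shares here are taken against the listed Total.

```python
# Rows copied from the matrix above: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (277, 34, 68, 10, 394),
    "AI Safety & Ethics": (117, 177, 44, 24, 364),
    "Task Completion Time": (78, 5, 4, 2, 89),
}


def direction_shares(counts):
    """Return each direction's share of the listed total for one row."""
    positive, negative, mixed, null, total = counts
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
    }


def net_direction(counts):
    """Positive minus negative claims, as a share of the listed total."""
    positive, negative, _mixed, _null, total = counts
    return (positive - negative) / total


if __name__ == "__main__":
    for outcome, counts in ROWS.items():
        shares = direction_shares(counts)
        print(f"{outcome}: positive share {shares['positive']:.2f}, "
              f"net direction {net_direction(counts):+.2f}")
```

A negative net direction, as for AI Safety & Ethics, means negative claims outnumber positive ones in that row.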
Filter: Governance
Series 2 consisted of local and API open-source systems (n = 6) administered blind and declared, with four systems re-administered under declared conditions.
Methods description detailing Series 2 composition, administration modes (blind and declared), and the re-testing of four systems under declared conditions.
high null result Literary Narrative as Moral Probe: A Cross-System Framework... count of systems in Series 2 (n=6) and number re-administered under declared con...
Series 1 consisted of frontier commercial systems administered blind (n = 7).
Methods description specifying Series 1 composition and blind administration.
high null result Literary Narrative as Moral Probe: A Cross-System Framework... count of systems in Series 1 (n=7) and administration mode (blind)
The study employed 24 experimental conditions spanning 13 distinct LLM systems across two series.
Study design reported in Methods: Series 1 (frontier commercial, blind, n=7), Series 2 (local/API open-source, blind and declared, n=6), plus re-administered declared runs and ceiling-probe runs summing to 24 conditions.
high null result Literary Narrative as Moral Probe: A Cross-System Framework... number of experimental conditions and distinct systems tested (study scope)
The experiment used NYSE TAQ transaction and quote data for SPY covering 2015–2024 and tested six pre-specified hypotheses about market-quality trends.
Data and methods section specifying dataset (NYSE TAQ SPY, 2015–2024), the number of pre-specified hypotheses (six), and experimental protocol with 150 autonomous agents.
high null result Nonstandard Errors in AI Agents dataset and experimental design variables (data coverage, number of hypotheses t...
Agents' methodological choices and resulting effect estimates were systematically recorded and used to quantify dispersion and measure switching across stages.
Study design description: recorded agents' methodological choices (measure selection, estimation procedures), resulting estimates, and tracked switching and dispersion metrics (IQR) across the three-stage protocol applied to SPY TAQ data (2015–2024) with 150 agents.
high null result Nonstandard Errors in AI Agents recorded methodological choices (categorical), effect estimates (continuous), di...
AI peer review (agents exchanging written critiques) produced minimal reduction in dispersion of estimates.
Three-stage protocol: after stage 1 (independent analyses) and stage 2 (AI peer review), measured dispersion (e.g., IQR) across agents showed little change following the peer-review stage across the six hypotheses and agent pool (n=150).
high null result Nonstandard Errors in AI Agents change in dispersion (IQR) of estimates between independent-analysis stage and p...
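The dispersion comparison described in this entry can be sketched as a before/after interquartile range (IQR) calculation. This is a minimal illustration, not the paper's code: the function names and the sample estimates are hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption, since the source does not state which one was used.

```python
import statistics


def iqr(estimates):
    """Interquartile range (Q3 - Q1) using the 'inclusive' quartile method."""
    q1, _median, q3 = statistics.quantiles(estimates, n=4, method="inclusive")
    return q3 - q1


def dispersion_change(before, after):
    """Relative change in IQR between two stages (negative means less dispersion)."""
    base = iqr(before)
    if base == 0:
        raise ValueError("IQR of the first stage is zero; relative change is undefined")
    return (iqr(after) - base) / base


if __name__ == "__main__":
    # Hypothetical effect estimates from five agents, before and after peer review.
    stage1 = [0.10, 0.25, 0.40, 0.55, 0.90]
    stage2 = [0.12, 0.26, 0.41, 0.54, 0.85]
    print(f"IQR stage 1: {iqr(stage1):.3f}")
    print(f"IQR stage 2: {iqr(stage2):.3f}")
    print(f"Relative change: {dispersion_change(stage1, stage2):+.1%}")
```

A relative change close to zero corresponds to the "minimal reduction in dispersion" reported in the claim.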
The work is qualitative and exploratory: it presents naturalistic phenomena rather than causal empirical estimates and is intended to be hypothesis-generating rather than definitive.
Methodology explicitly stated: naturalistic, qualitative daily observations over one month across multiple platforms; comparative observational documentation without experimental manipulation or causal identification.
high null result When Openclaw Agents Learn from Each Other: Insights from Em... nature of evidence (qualitative/exploratory vs. causal inference)
Future empirical work should measure calibration (user trust vs. model accuracy), hallucination rate, user comprehension of capability limits, and behavioral dependence on system recommendations.
Explicit methodological recommendations and suggested metrics in the paper; these are proposed future measurements rather than reported findings.
high null result Why We Need to Destroy the Illusion of Speaking to A Human: ... calibration metrics, hallucination rates, user comprehension, behavioral depende...
Conversational AI differs from interpersonal conversation: it has no true beliefs/intentions or accountability and produces probabilistic, sometimes inconsistent outputs with opaque training/data provenance.
Analytical/distinctive claim based on properties of LLMs and machine learning models discussed in the paper; conceptual analysis, no empirical testing.
high null result Why We Need to Destroy the Illusion of Speaking to A Human: ... ontological status of AI outputs (beliefs/intentions/accountability) and propert...
Research agenda items for economists include: quantifying willingness-to-pay for verifiable reasoning, studying labor-market impacts for validators, designing contracts/mechanisms to incentivize truthful argument provision, and evaluating regulatory interventions.
Paper's stated research and policy agenda; prescriptive rather than empirical.
high null result Argumentative Human-AI Decision-Making: Toward AI Agents Tha... existence and prioritization of empirical research on WTP, labor impacts, mechan...
Evaluation currently lacks metrics and benchmarks for argument quality, fidelity, contestability, and human trust; developing these is necessary.
Paper notes the gap and proposes evaluation metrics and experimental designs; no new benchmarks introduced.
high null result Argumentative Human-AI Decision-Making: Toward AI Agents Tha... availability and maturity of evaluation metrics and benchmarks
Methodology is primarily conceptual and normative: the paper synthesizes policy texts, safety standards, and crisis-management literature and relies on illustrative mappings and thought experiments rather than new empirical field data.
Authors' methodological description in the Data & Methods section (explicit statement about sources and use of thought experiments).
high null result Resilience Meets Autonomy: Governing Embodied AI in Critical... methodological characterization (use of conceptual synthesis vs. empirical data ...
The paper defines and specifies four oversight modes (spanning near-full autonomy to strict human control) and provides criteria for selecting modes based on task complexity, risk level, and consequence severity.
Conceptual taxonomy developed in the paper; mapping exercises and triage framework (risk–complexity–consequence) presented as illustrative mappings (no empirical testing).
high null result Resilience Meets Autonomy: Governing Embodied AI in Critical... existence and specification of four oversight modes and their mapping criteria (...
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... proposed empirical research topics and corresponding outcomes to measure
New metrics are needed to value tacit capabilities, for example measures of transfer, generalization under distribution shift, ease of integration with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... proposed metrics for assessing tacit LLM capabilities
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... types of empirical tests recommended for validating the thesis
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... extent of empirical/quantitative evidence presented
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... presence/absence of empirical experiments in the paper
Measuring the marginal cost of runtime governance, the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
high null result Runtime Governance for AI Agents: Policies on Paths existence of empirical research gaps (identified/not identified)
No large empirical dataset or large-scale field experiments were used; the work is primarily theoretical/formal with simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
high null result Runtime Governance for AI Agents: Policies on Paths use of empirical data (presence/absence of large-scale empirical evaluation)
Risk calibration, i.e., mapping violation probabilities to enforcement actions and thresholds, is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
high null result Runtime Governance for AI Agents: Policies on Paths existence of calibrated thresholds and procedures (presence/absence)
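To make the calibration problem concrete, the sketch below maps an estimated violation probability to an enforcement action through fixed thresholds, and adds a simple expected-cost rule for blocking. Everything specific here is an assumption for illustration: the threshold values, the action names, and the cost model are hypothetical and are not taken from the paper, which treats choosing such values as an open problem.

```python
from bisect import bisect_right

# Hypothetical cut points on the violation probability, in ascending order.
THRESHOLDS = [0.05, 0.30, 0.70]
# One action per interval: [0, 0.05), [0.05, 0.30), [0.30, 0.70), [0.70, 1].
ACTIONS = ["allow", "log", "require_approval", "block"]


def enforcement_action(p_violation):
    """Map a violation probability to the action for its threshold interval."""
    if not 0.0 <= p_violation <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return ACTIONS[bisect_right(THRESHOLDS, p_violation)]


def should_block(p_violation, cost_of_blocking, cost_of_violation):
    """Expected-cost rule: block when the expected violation cost exceeds the blocking cost."""
    return p_violation * cost_of_violation > cost_of_blocking


if __name__ == "__main__":
    for p in (0.01, 0.10, 0.50, 0.90):
        print(p, enforcement_action(p))
```

Raising a threshold trades false positives (needlessly blocked actions) against false negatives (missed violations), which is the tension the entry describes.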
Because the sample is non-representative (support-group recruitment and media cases) and small (19 users), the authors note that generalizability is limited and the sample is biased toward more severe cases.
Limitations section stating recruitment sources, small N, and bias toward severe cases.
high null result Characterizing Delusional Spirals through Human-LLM Chat Log... representativeness and generalizability of the sample
The study analyzed conversation logs from 19 users who reported psychological harm associated with chatbot use, comprising a total corpus of 391,562 messages (user + chatbot).
Dataset described in paper: 19 users' conversation logs aggregated; total message count reported as 391,562 messages across user and chatbot messages.
high null result Characterizing Delusional Spirals through Human-LLM Chat Log... size of dataset (number of users and total messages)
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
high null result Designing for Disagreement: Front-End Guardrails for Assista... the specified metrics (contest frequency/outcomes, time-to-help, satisfaction, p...
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
high null result Designing for Disagreement: Front-End Guardrails for Assista... presence/absence of empirical data in the paper (binary)
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; the paper proposes contest metrics and an evaluation agenda but reports no empirical data.
high null result Designing for Disagreement: Front-End Guardrails for Assista... availability and specificity of contest channels (system functionality)
The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
high null result Designing for Disagreement: Front-End Guardrails for Assista... legibility of active mode (user understanding at time of deferral)
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
high null result Designing for Disagreement: Front-End Guardrails for Assista... existence of governance-approved admissible modes (system design property)
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... operational stability, efficiency, and robustness/degradation metrics
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... baseline agent architectures used for comparison
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... number of LLMs evaluated (n = 8)
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... agent architectural modularity (temporal decomposition into strategy vs executio...
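The idea of two tiers running at different cadences can be illustrated with a toy loop. This is not the authors' ESE implementation: the inventory setting, the function `run_two_tier`, the 7-step strategy interval, and the 1.5x safety factor are all hypothetical choices made only to show a slow strategy update wrapped around a fast execution step.

```python
def run_two_tier(demand, strategy_every=7, initial_target=10):
    """Simulate a slow strategy tier (stock target) and a fast execution tier (orders)."""
    target = initial_target   # strategy state, revised only every `strategy_every` steps
    stock = initial_target
    window = []               # demand observed since the last strategy update
    log = []
    for step, units_demanded in enumerate(demand):
        # Slow tier: revise the stock target from recently observed demand.
        if step > 0 and step % strategy_every == 0:
            target = round(1.5 * sum(window) / len(window))
            window = []
        # Fast tier: every step, order up to the current target and then sell.
        order = max(0, target - stock)
        stock += order
        sold = min(stock, units_demanded)
        stock -= sold
        window.append(units_demanded)
        log.append((step, target, order, sold))
    return log


if __name__ == "__main__":
    for row in run_two_tier([4] * 14):
        print(row)
```

With constant demand of 4, the target stays at its initial value for the first seven steps and is revised once at step 7, while an order decision is made at every step.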
RetailBench environments are progressively challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... environment difficulty gradient (complexity/stochasticity/non-stationarity level...
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... benchmark realism and coverage of non-stationarity for long-horizon decision-mak...
The project developed domain- and specialty-focused models: Fanar-Sadiq (Islamic content multi-agent architecture), Fanar-Diwan (classical Arabic poetry), and FanarShaheen (bilingual translation).
Paper enumerates these domain/specialty models and their stated focuses as part of the product stack.
high null result Fanar 2.0: Arabic Generative AI Stack existence and intended domain of specialized models
FanarGuard is a 4B bilingual moderation model focused on Arabic safety and cultural alignment.
Paper lists FanarGuard in the expanded product stack and specifies model size (4B) and bilingual moderation purpose emphasizing Arabic safety/cultural alignment.
high null result Fanar 2.0: Arabic Generative AI Stack model existence, size (4B), and intended function (bilingual moderation)
Fanar-27B was produced by continual pre-training from a Gemma-3-27B backbone.
Paper describes model development: continual pre-training of Fanar-27B from the Gemma-3-27B backbone.
high null result Fanar 2.0: Arabic Generative AI Stack model lineage/architecture (Fanar-27B ← Gemma-3-27B)
The Fanar 2.0 training corpus is a curated set totalling approximately 120 billion high-quality tokens organized into three data 'recipes' emphasizing Arabic and cross-lingual relevance.
Paper reports a curated corpus of ~120B high-quality tokens split across three data recipes; emphasis on relevance and quality for Arabic and cross-lingual performance.
high null result Fanar 2.0: Arabic Generative AI Stack training token count and dataset composition (three recipes)
Training and operations for Fanar 2.0 were performed on-premises using 256 NVIDIA H100 GPUs at QCRI.
Paper states compute and infrastructure: training and operations performed on 256 NVIDIA H100 GPUs, fully on-premises at QCRI (HBKU).
high null result Fanar 2.0: Arabic Generative AI Stack compute infrastructure (GPU count & location)
Experiments were conducted on three benchmarks and across multiple LLM families to evaluate generation, scoring, calibration, robustness, and efficiency dimensions.
Data & Methods section summary in the paper stating systematic evaluation across three benchmarks and a variety of LLMs and verifiers.
high null result Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... experimental coverage (benchmarks and model families)
Complete provenance of training data is often unavailable, so contamination detection is imperfect and some leakage may be undetectable (or overestimated in some categories).
Authors' stated limitation about unavailable/partial training-data provenance and methodological caveats for the lexical-matching pipeline and behavioral probes.
high null result Are Large Language Models Truly Smarter Than Humans? uncertainty in contamination detection accuracy due to incomplete provenance
Results are specific to MMLU; contamination levels and effects may differ on other benchmarks or newer models.
Authors' limitations: experiments were conducted only on the MMLU dataset (513 questions) and on the six listed models; generalizability is therefore uncertain.
high null result Are Large Language Models Truly Smarter Than Humans? generalizability of contamination findings to other benchmarks/models
BenchPreS defines two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), to quantify over-application and correct personalization, respectively.
Methodological contribution described in the paper: explicit definitions of MR as fraction of inappropriate applications and AAR as fraction of appropriate applications, used to score model behavior.
high null result BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Definition and use of MR and AAR metrics
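The two rates can be sketched as simple fractions over labeled trials. The denominators are an assumption, since the summary above gives only the informal definitions: here MR is computed over contexts where applying the stored preference would be inappropriate, and AAR over contexts where it would be appropriate. The `Trial` record and the function `mr_aar` are hypothetical names, not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    preference_relevant: bool  # ground truth: should the preference be applied here?
    preference_applied: bool   # observed: did the model apply it?


def mr_aar(trials):
    """Return (MR, AAR) under the assumed denominators described above."""
    trials = list(trials)
    irrelevant = [t for t in trials if not t.preference_relevant]
    relevant = [t for t in trials if t.preference_relevant]
    if not irrelevant or not relevant:
        raise ValueError("need at least one relevant and one irrelevant context")
    mr = sum(t.preference_applied for t in irrelevant) / len(irrelevant)
    aar = sum(t.preference_applied for t in relevant) / len(relevant)
    return mr, aar


if __name__ == "__main__":
    example = [
        Trial(True, True), Trial(True, True), Trial(True, False),
        Trial(False, True), Trial(False, False),
    ]
    mr, aar = mr_aar(example)
    print(f"MR = {mr:.2f}, AAR = {aar:.2f}")
```

Under this reading, a well-behaved model has a low MR and a high AAR at the same time.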
Research priorities include empirical testing and simulation of ISB-based control systems, cost–benefit analysis of proactive versus reactive AI governance, and distributional impact assessments.
Explicit research agenda proposed by the author (conceptual recommendation), not empirical results.
high null result DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... n/a (research agenda recommendation rather than an empirical outcome)
Key empirical metrics introduced and used are: AI adoption rates (sector-level intensity), Skill shift index, Hybrid job share, and employment levels/net changes by sector.
Methods description listing the constructed metrics used in the simulated dataset and subsequent analyses (definitions and calculation procedures provided in the paper).
high null result AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... Defined metrics (AI adoption rate, Skill shift index, Hybrid job share, Employme...
The study's main limitations include reliance on a simulated dataset rather than exhaustive administrative microdata, literature limited to selected publishers/years, and correlational (not causal) identification of some effects.
Authors' explicitly stated limitations in the paper's methods and discussion sections describing data choices (simulated dataset, selected publishers 2020–2024) and the observational/correlational nature of several analyses.
high null result AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... Study validity/generalizability limitations
Limitation: Implementation heterogeneity — the costs and feasibility of the recommended HR changes vary by context and may affect generalisability.
Explicit limitation acknowledged in the paper; drawn from theoretical reasoning about contextual heterogeneity and practitioner variability.
high null result Symbiarchic leadership: leading integrated human and AI cybe... implementation costs; feasibility; effect on generalisability
Limitation: The framework is conceptual and requires empirical validation across sectors, firm sizes and AI‑intensity levels.
Explicit limitation acknowledged by the authors; based on the paper's method (theoretical synthesis, no original data).
high null result Symbiarchic leadership: leading integrated human and AI cybe... generalizability and empirical validity across contexts
The paper generates empirically testable propositions (e.g., how leader practices affect AI adoption speed, task reallocation, productivity, error rates, employee well‑being and turnover) and suggests natural‑experiment settings for evaluation.
Stated methodological output of the conceptual synthesis; the paper lists candidate empirical tests and research opportunities but contains no original empirical tests.
high null result Symbiarchic leadership: leading integrated human and AI cybe... AI adoption speed; task reallocation; productivity; error rates; employee well‑b...
The available evidence consists mainly of promising empirical studies and case studies, but there are few long-run, generalized ROI or productivity estimates; results are heterogeneous across therapeutic areas.
Self-described limitation of the narrative review: heterogeneity of study designs and outcomes precluded pooled quantitative estimates and long-run ROI assessment.
high null result From Algorithm to Medicine: AI in the Discovery and Developm... evidence quality (availability of long-run ROI/productivity estimates) and heter...