The Commonplace

Evidence (8625 claims)

Adoption (8625 claims)
Productivity (7686 claims)
Governance (6917 claims)
Human-AI Collaboration (6574 claims)
Org Design (4189 claims)
Innovation (4131 claims)
Labor Markets (3588 claims)
Skills & Training (2985 claims)
Inequality (2066 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                    Positive Negative    Mixed     Null    Total
Other                           761      200      101      904     2020
Governance & Regulation         829      400      191      122     1566
Organizational Efficiency       784      193      125       84     1197
Technology Adoption Rate        637      236      124       97     1103
Research Productivity           431      131       58      340      972
Output Quality                  481      183       59       47      770
Decision Quality                332      177       82       49      647
Firm Productivity               439       57       88       20      610
AI Safety & Ethics              218      279       66       33      602
Market Structure                181      170      123       24      503
Task Allocation                 214       64       72       33      388
Skill Acquisition               174       62       62       17      315
Innovation Output               204       27       45       18      295
Employment Level                105       54      108       13      282
Fiscal & Macroeconomic          132       69       43       26      277
Consumer Welfare                117       63       42       11      233
Firm Revenue                    154       48       26        3      231
Task Completion Time            173       31        8       12      225
Inequality Measures              44      123       50        6      223
Worker Satisfaction              89       65       22       12      188
Error Rate                       71       92       10        2      175
Regulatory Compliance            77       69       14        5      165
Automation Exposure              58       56       26       13      156
Training Effectiveness           96       21       14       19      152
Wages & Compensation             77       37       25        6      145
Team Performance                 86       17       27       10      141
Developer Productivity           95       17       14        6      133
Job Displacement                 12       81       21        1      115
Hiring & Recruitment             52        7        8        3       70
Creative Output                  32       20        8        3       64
Skill Obsolescence                5       47        6        1       59
Social Protection                28       16        8        2       54
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Filter: Adoption
The paper distinguishes physical electricity transmission from digital relocation of electricity-consuming computation.
Conceptual/analytic distinction explicitly stated as a contribution in the paper.
high null result AI Inference as Relocatable Electricity Demand: A Latency-Co... conceptual differentiation between transmission of electrons and relocation of c...
We develop an energy-geography framework for geo-distributed AI inference that models a three-layer architecture of clients, service nodes, and compute nodes, and formulates inference placement as a constrained optimization problem over electricity prices, marginal carbon intensity, power usage effectiveness, compute capacity, network latency, and migration frictions.
Methodological contribution described in the paper: formulation of a modeling/optimization framework and specification of variables considered.
high null result AI Inference as Relocatable Electricity Demand: A Latency-Co... inference placement feasibility and optimization across energy and latency dimen...
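As a rough illustration of the placement problem this claim describes, the sketch below picks, for a single request, the cheapest compute node that meets a latency bound and still has capacity. The cost formula, the field names, and the carbon price are assumptions made for illustration only; the paper's own formulation, including migration frictions, is not reproduced in this excerpt.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    price_per_kwh: float      # electricity price (hypothetical units: $/kWh)
    carbon_kg_per_kwh: float  # marginal carbon intensity
    pue: float                # power usage effectiveness (>= 1.0)
    free_capacity: int        # requests the node can still accept
    latency_ms: float         # network latency from the client

def place_request(nodes, energy_kwh, max_latency_ms, carbon_price=0.05):
    """Return the feasible node with the lowest energy-plus-carbon cost, or None."""
    feasible = [n for n in nodes
                if n.latency_ms <= max_latency_ms and n.free_capacity > 0]
    if not feasible:
        return None

    def cost(n):
        # Facility energy = IT energy * PUE; carbon is priced at an assumed rate.
        return energy_kwh * n.pue * (n.price_per_kwh + carbon_price * n.carbon_kg_per_kwh)

    return min(feasible, key=cost)

# Hypothetical nodes: one close but expensive, one distant and cheap, one full.
nodes = [
    Node("near", 0.20, 0.40, 1.2, 10, 15.0),
    Node("far_cheap", 0.05, 0.10, 1.1, 10, 80.0),
    Node("far_full", 0.01, 0.05, 1.1, 0, 60.0),
]
```

With a loose latency bound the request moves to the cheaper distant node; tightening the bound keeps it local, which is the trade-off the claim refers to.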
Inference workloads can sometimes be executed away from the user-facing service location, provided that latency, state locality, capacity, and regulatory constraints remain acceptable.
Conceptual claim and modeling premise stated in the paper; used as an assumption motivating the relocation/placement model rather than an empirical finding.
high null result AI Inference as Relocatable Electricity Demand: A Latency-Co... feasibility of relocating inference workload execution given constraints (latenc...
The paper traces near-term evolutionary trajectories for digital proto-life through three narratives: Lamarck (self-modifying coding agents), Remora (resource-seeking companion chatbots), and Mycelium (DAO-LLC trading bots).
Methodological statement in the abstract: exploratory scenario method with three specified narrative scenarios; descriptive rather than empirical.
high null result Digital Darwinism: steering the evolution of artificial life... narrative scenarios produced (Lamarck, Remora, Mycelium)
Long-running agents accumulated thousands of sequential decisions; continuously active agents reached 6,000+ prompt-state-action cycles.
Agent activity traces showing sequential decision counts per agent (trace-level telemetry).
high null result Operating-Layer Controls for Onchain Language-Model Agents U... number of prompt-state-action cycles per agent
The system consumed roughly 70B inference tokens across the deployment.
API/inference telemetry reporting total token usage.
high null result Operating-Layer Controls for Onchain Language-Model Agents U... inference token consumption
More than 5,000 ETH was deployed by agents during the experiment.
Accounting of ETH held/deployed by agent-controlled vaults during deployment.
Agents executed about $20M in trading volume over the deployment.
Aggregate trading-volume accounting from the bounded onchain market during deployment.
The deployment produced roughly 300K onchain actions.
Onchain transaction logs aggregated over the deployment.
high null result Operating-Layer Controls for Onchain Language-Model Agents U... onchain actions (transactions executed)
The system produced 7.5M agent invocations during the deployment.
System invocation logs reporting total agent calls across the deployment.
high null result Operating-Layer Controls for Onchain Language-Model Agents U... agent invocations (usage)
DX Terminal Pro was deployed for 21 days with 3,505 user-funded agents trading real ETH in a bounded onchain market.
Deployment logs and system telemetry from a 21-day field deployment reporting the number of user-funded agents.
high null result Operating-Layer Controls for Onchain Language-Model Agents U... number of active agents
Cross-stage correlations are very weak: parsing->retrieval r = 0.14, parsing->generation r = 0.17, retrieval->generation r = 0.02.
Reported Pearson (or Spearman) correlation coefficients between stage-level metrics in the benchmark; exact correlation method not specified in excerpt.
high null result Benchmarking Complex Multimodal Document Processing Pipeline... correlation between stage-level quality metrics
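For readers unfamiliar with how such stage-to-stage coefficients are obtained, here is a minimal sketch of a Pearson correlation between two lists of per-document stage scores. The excerpt does not say whether the paper used Pearson or Spearman, so the choice of Pearson here is an assumption, and the function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near zero, like the retrieval->generation figure above, means one stage's score says little about the next stage's score in a linear sense.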
We evaluate SecMate in a controlled study with 144 participants and 711 conversations.
Reported experimental study sample and conversation counts in the paper.
high null result SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... study sample size and conversation count
Given the limited sample size, the results should be interpreted as exploratory.
Authors explicitly note limited sample size (20 decks) and label findings exploratory.
high null result Algorithmic personalities and the myth of neutrality: financ... interpretation caveat regarding sample size
Reliability (stability across repeated runs) varies substantially across models, with ICC values ranging from 0.240 to 0.930.
Paper reports intraclass correlation coefficient (ICC) analysis of model output reliability across runs, giving a range of ICC values from 0.240 to 0.930.
high null result Algorithmic personalities and the myth of neutrality: financ... output reliability (ICC)
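To make the reliability figure concrete, the sketch below computes a one-way random-effects intraclass correlation, often written ICC(1,1), from a table with one row per rated item and one column per repeated run. Which ICC form the paper used is not stated in this excerpt, so this variant and the function name are assumptions.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    ratings: list of rows, one per item, each holding k repeated scores.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 items with the same number (>= 2) of runs")
    row_means = [sum(row) / k for row in ratings]
    grand_mean = sum(row_means) / n
    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for x in row) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denom
```

Values close to 1 mean repeated runs reproduce the same item-level scores; values near 0 mean run-to-run noise is as large as the differences between items.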
To account for stochastic variation in outputs, each model pair was evaluated five times under identical conditions.
Paper states that each model pair was run five times under identical conditions to distinguish one-off variation from persistent tendencies.
high null result Algorithmic personalities and the myth of neutrality: financ... number of repeated runs (methodological)
Each model evaluated 20 real startup pitch decks spanning multiple industries and funding stages.
Paper reports a controlled simulation design in which each model assessed 20 real pitch decks (sample of 20 decks).
high null result Algorithmic personalities and the myth of neutrality: financ... number of pitch decks evaluated (methodological)
The study used three leading models—GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2.
Explicit statement in the paper describing the experimental subjects: three named LLMs were evaluated.
high null result Algorithmic personalities and the myth of neutrality: financ... models evaluated (methodological)
The paper develops a typology of enterprise applications by their sensitivity to AI-induced shifts in make-or-buy economics.
Paper's stated contribution (conceptual typology based on analysis of application categories and AI sensitivity).
high null result The Buy-or-Build Decision, Revisited: How Agentic AI Changes... classification (typology) of enterprise applications by sensitivity to AI
This paper adopts a conceptual research approach, combining transaction cost economics and the resource-based view with an assessment of current AI capabilities, to systematically re-evaluate the factors underlying the make-or-buy decision.
Paper's stated methodology and theoretical framing (methodological claim about the paper itself).
high null result The Buy-or-Build Decision, Revisited: How Agentic AI Changes... methodological approach to studying make-or-buy decisions
Empirically, the decomposition eliminates evidence of speculation in the 2020-2025 AI rally.
Empirical application of the proposed decomposition and bubble test to asset price data covering the 2020–2025 period associated with the AI rally (data analysis reported in the paper).
high null result General-Purpose Technology and Speculative Bubble Detection presence (or absence) of speculative bubble evidence in the 2020–2025 AI rally
At this stage, AI adoption in Israel does not result in widespread layoffs; its primary impact lies in restructuring the labor market through a slowdown in recruitment, changes in job composition, and the emergence of new AI-related roles.
Empirical claim reported in the paper; the excerpt does not specify datasets, time periods, or sample sizes supporting this observation.
high null result Artificial Intelligence in Israel, Trends, Developments, and... employment changes attributable to AI adoption (layoffs, recruitment rates, job ...
Overall, robot exposure is only weakly related to job-quality outcomes once controls and fixed effects are included.
Individual-level data from the European Working Conditions Telephone Survey (EWCTS) 2021 merged with country–industry robot exposure measures from International Federation of Robotics (IFR) statistics; weighted logistic regression models including individual and job controls and country and industry fixed effects.
high null result Gendered Effects of Robotisation on Job Quality job-quality outcomes (aggregate across dimensions)
Semantic search maintained comparable inter-rater agreement while reducing chart abstraction time.
Clinical utility evaluation reports that inter-rater agreement was comparable between semantic-search-assisted abstraction and clinician-performed chart review.
The authors optimized embedding model and chunking strategy using a physician-authored benchmark dataset.
Methods: experiment described as optimization of embedding model and chunking using a physician-authored benchmark dataset.
high null result Health System Scale Semantic Search Across Unstructured Clin... model_and_chunking_configuration
The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework.
Methods description of system architecture and governance provided in the paper.
high null result Health System Scale Semantic Search Across Unstructured Clin... system_architecture / governance_compliance
We deployed a semantic search system indexing 166 million clinical notes (484 million vectors) from 1.68 million patients.
Paper reports a production deployment at a large children's hospital and gives exact index counts: 166 million clinical notes, 484 million vectors, 1.68 million patients.
high null result Health System Scale Semantic Search Across Unstructured Clin... number_of_notes_indexed / index_size
Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap.
Methodological description in the paper indicating the authors performed a systematic sensitivity analysis across environmental parameters (resource scarcity and temporal dominance) to measure performance differences between training modalities.
The report's central conclusion is that foundational research on AI identity is needed.
Authors' stated conclusion of the paper.
high null result AI Identity: Standards, Gaps, and Research Directions for AI... priority recommendation for future research
We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment.
Conceptual definition presented by the authors (conceptual/terminological contribution rather than empirical evidence).
high null result AI Identity: Standards, Gaps, and Research Directions for AI... conceptualization of AI agent identity
We develop a formal model in which institutions choose the scale of automation, the degree of codification, and safeguards on iterative use.
Methodological statement: the paper presents a formal/theoretical model specifying institutional choice variables (model description rather than empirical result).
high null result AI Governance under Political Turnover: The Alignment Surfac... institutional choices regarding automation scale, codification, and safeguards (...
On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated (ρ_s=0.25).
Spearman rank correlation between composite rankings and benchmark-only rankings on an 11-agent subset that has published SWE-bench scores; reported correlation.
high null result AgentPulse: A Continuous Multi-Signal Framework for Evaluati... rank correlation between composite ranking and benchmark-only ranking
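A Spearman coefficient such as the one reported here compares two rankings of the same agents. The sketch below uses the textbook no-ties formula, ρ = 1 − 6Σd² / (n(n² − 1)); the function names are hypothetical and the no-ties restriction is a simplification, not something the paper states.

```python
def ranks(values):
    """Rank values from 1 (largest) to n; ties are rejected in this sketch."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch assumes no tied values")
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rho(a, b):
    """Spearman rank correlation between two score lists without ties."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    # Sum of squared rank differences across the paired items.
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

With only 11 agents, swapping a few positions changes the coefficient noticeably, so a low value on such a small subset should be read with that in mind.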
We document the performance of a market-based scaffolding with these LLMs.
Empirical documentation reported in the paper describing how a market-based scaffolding performs when using the six LLMs on the 93 tasks.
high null result MarketBench: Evaluating AI Agents as Market Participants performance metrics of a market-based scaffolding using LLM self-reports
We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration.
Empirical setup described in the paper: evaluation uses a 93-task subset of SWE-bench Lite and six recent LLMs.
high null result MarketBench: Evaluating AI Agents as Market Participants experimental dataset size and model set used for demonstration
We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities.
Paper contribution claim: introduction of a benchmark named MarketBench described in the paper.
high null result MarketBench: Evaluating AI Agents as Market Participants existence of the MarketBench benchmark
In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so.
Conceptual claim / design requirement motivating the benchmark; stated as part of the paper's framing rather than an empirical result.
high null result MarketBench: Evaluating AI Agents as Market Participants informativeness/calibration of self-reported ability and cost signals
We instrument ITAS, a four-agent tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers (Standard PayGo, Priority PayGo, and Provisioned Throughput) and eleven concurrency levels up to 50 simultaneous users, producing over 3,000 requests drawn from a live graduate STEM deployment.
Methods statement in paper describing experimental setup: four-agent ITAS built on Gemini 2.5 Flash and Google Vertex AI; three throughput tiers; eleven concurrency levels up to 50; over 3,000 requests from a live graduate STEM deployment.
high null result Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... instrumented request sample (number of requests and concurrency levels)
We compare LLM-guided bidding against truthful and heuristic strategies using the Vickrey-Clarke-Groves (VCG) mechanism as a benchmark for incentive-compatible, dominant-strategy truthfulness.
Methodological claim describing the comparative experimental design: simulations use VCG as benchmark and include comparisons to truthful and heuristic bidding strategies. No sample size or detailed experimental parameters are provided in the excerpt.
high null result Strategic Bidding in 6G Spectrum Auctions with Large Languag... comparative performance of bidding strategies
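As background on the benchmark named in this claim, the sketch below implements VCG for the simplest relevant case: k identical items and bidders who each want one. In that case every winner pays the highest losing bid. This setting, the function name, and the example bids are assumptions for illustration; the paper's repeated, budget-constrained spectrum auction is more involved and is not reproduced here.

```python
def vcg_unit_demand(bids, k):
    """VCG outcome for k identical items with unit-demand bidders.

    bids: dict mapping bidder name -> bid.
    Returns a dict mapping each winner -> payment.
    """
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winners = ranked[:k]
    # Each winner displaces the highest losing bidder, so that bid is the
    # externality it imposes; with no losers the payment is zero.
    price = ranked[k][1] if len(ranked) > k else 0.0
    return {name: price for name, _ in winners}

# Hypothetical truthful bids from four bidders.
bids = {"a": 10, "b": 7, "c": 4, "d": 1}
```

Because a winner's payment does not depend on its own bid, shading the bid can only lose the item, never lower the price, which is why truthful bidding is a dominant strategy in this setting.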
When the theoretical assumptions guaranteeing truthfulness hold, LLM bidders recover near-equilibrium outcomes consistent with VCG predictions.
Simulation experiments comparing LLM-guided bidding to the VCG benchmark and to truthful/heuristic strategies under conditions where VCG assumptions are satisfied. The paper reports that LLM outcomes were close to the VCG-predicted equilibrium. No numeric sample size or quantitative effect sizes reported in the provided text.
high null result Strategic Bidding in 6G Spectrum Auctions with Large Languag... equilibrium outcomes / allocation and utility relative to VCG benchmark
We investigate the use of Large Language Models (LLMs) as bidding agents in repeated 6G spectrum auctions with budget constraints in vehicular networks.
Descriptive statement of the study design: the paper reports simulation/experimental evaluation where each user equipment (UE) is modeled as a rational player in repeated spectrum auctions; comparison against truthful and heuristic strategies under Vickrey-Clarke-Groves (VCG) benchmark. No numeric sample size reported in the provided text.
high null result Strategic Bidding in 6G Spectrum Auctions with Large Languag... use of LLMs as bidding agents (methodological evaluation)
The welfare consequences of genAI can be organized by a two-dimensional taxonomy: the strength of the incentive to perform the task without AI, and the severity of model collapse.
Analytical organization derived from the theoretical model presented in the paper (conceptual taxonomy based on model parameters; no empirical sample reported in abstract).
high null result Generative artificial intelligence reduces social welfare th... social welfare outcomes as a function of incentive strength and model collapse s...
We develop a parsimonious model of behavior in collaborative interactions in which individuals can either exert human effort, rely on genAI, or refrain from work altogether.
Methodological claim: authors present a formal theoretical model with the specified choice set (model description in paper; no empirical sample reported in abstract).
high null result Generative artificial intelligence reduces social welfare th... choice among effort modalities (human effort, genAI reliance, abstention)
Predictive performance exhibits saturation beyond a certain context length.
Experiments varying the context (input) length in foundation models and observing changes in forecasting performance; reported saturation effect in analyses.
high null result FETS Benchmark: Foundation Models Outperform Dataset-specifi... change in forecast accuracy as context length increases
Task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend.
Analysis comparing human expert difficulty ratings to measured token costs for tasks in SWE-bench Verified; weak alignment reported in the paper between ratings and token consumption.
high null result How Do AI Agents Spend Your Money? Analyzing and Predicting ... correspondence/alignment between human-rated task difficulty and measured token ...
Higher token usage does not translate into higher accuracy; accuracy often peaks at intermediate cost and saturates at higher costs.
Comparison of accuracy (task success) versus total token usage across runs/trajectories in the agentic coding experiments on SWE-bench Verified; reported observed relationship (peak at intermediate costs and saturation thereafter).
high null result How Do AI Agents Spend Your Money? Analyzing and Predicting ... task accuracy as a function of token usage
The study is based on a repeated cross-sectional survey of licensed employees at a non-university research institution.
Author statement in the abstract: repeated cross-sectional survey of licensed employees of the research institution under study; methodological description in the abstract.
high null result Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... study design / data basis (repeated cross-sectional survey)
The main findings hold across multiple robustness checks.
Paper reports multiple unspecified robustness checks applied to the fixed-effects regression analyses on the panel of publicly listed Chinese firms (2012–2023).
high null result Following the Herd or the Bellwether: Peer Effects in Firms’... robustness of reported peer effect findings
We use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows.
Methodological contribution described in the paper: a unified amortized computational framework applied to eight Shapley variants, evaluated under latency constraints typical of operational workflows.
high null result Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... ability to isolate semantic differences among Shapley variants under low-latency...
No formulation improved objective analyst performance.
Controlled/empirical experiment reported in the paper evaluating eight Shapley variants with professional analysts in the fraud-detection environment; performance measured over 3,735 case reviews.
high null result Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... objective analyst performance (e.g., accuracy on case reviews)
Standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility.
Empirical comparison in the paper between quantitative metrics (sparsity, faithfulness) and human-judged clarity/decision-utility across the datasets and analyst reviews; based on the authors' large-scale evaluation.
high null result Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... correlation/alignment between quantitative explanation metrics (sparsity, faithf...