The Commonplace

Evidence (13661 claims)

Adoption: 8339 claims
Productivity: 7479 claims
Governance: 6715 claims
Human-AI Collaboration: 6267 claims
Org Design: 4098 claims
Innovation: 3987 claims
Labor Markets: 3488 claims
Skills & Training: 2888 claims
Inequality: 2016 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                     Positive Negative    Mixed     Null    Total
Other                            740      192       95      871     1945
Governance & Regulation          796      388      185      119     1512
Organizational Efficiency        765      186      123       82     1166
Technology Adoption Rate         610      227      121       95     1061
Research Productivity            409      121       56      331      928
Output Quality                   464      174       58       47      743
Decision Quality                 318      173       75       42      615
Firm Productivity                432       55       88       20      601
AI Safety & Ethics               214      273       65       33      589
Market Structure                 175      165      120       24      489
Task Allocation                  206       64       70       31      376
Skill Acquisition                161       57       57       16      291
Innovation Output                201       27       41       18      288
Fiscal & Macroeconomic           130       69       43       26      275
Employment Level                 104       50      105       13      274
Consumer Welfare                 116       62       42       11      231
Firm Revenue                     149       45       26        3      223
Inequality Measures               43      120       49        6      218
Task Completion Time             164       29        8       12      214
Worker Satisfaction               89       60       20       12      181
Error Rate                        69       89        9        2      169
Regulatory Compliance             74       67       14        4      159
Training Effectiveness            91       19       13       19      144
Wages & Compensation              77       33       25        6      141
Team Performance                  86       17       27        9      140
Automation Exposure               49       50       22       12      136
Developer Productivity            91       17       14        5      128
Job Displacement                  12       80       19        1      112
Hiring & Recruitment              51        7        8        3       69
Creative Output                   31       16        7        2       57
Skill Obsolescence                 5       43        6        1       55
Social Protection                 27       16        8        2       53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
In a user study where 12 participants created slide decks, MindTrellis outperformed retrieval-only baselines in knowledge organization and cognitive load, as measured by expert ratings of content coverage and structural quality.
Controlled user study reported in the paper: N = 12 participants performing slide-deck creation tasks; outcomes assessed via expert ratings of content coverage and structural quality (comparison to retrieval-only baseline).
high positive MindTrellis: Co-Creating Knowledge Structures with AI throug... knowledge organization and cognitive load (operationalized via expert ratings of...
MindTrellis is an interactive visual system where users and AI collaboratively build a dynamic knowledge graph; users can query the graph for document-grounded information and contribute by introducing new concepts, modifying relationships, and reorganizing the hierarchy.
System design and implementation described in the paper (feature description and demonstration).
high positive MindTrellis: Co-Creating Knowledge Structures with AI throug... system capability to support collaborative construction and manipulation of a dy...
The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0.
Statement of data and code release policy in the paper.
high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... availability/license of framework and data
AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking.
Conceptual and empirical argument in the paper supported by the analyses described (correlations with external adoption proxies and divergence from benchmark-only rankings).
high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... presence of deployment/adoption signals not captured by standard benchmarks
The Benchmark+Sentiment sub-composite correlates with VS Code installs (ρ_s=0.44, p<0.05), reported as illustrative given that only 11 of 35 agents have non-zero installs.
Spearman correlation between Benchmark+Sentiment sub-composite and VS Code installs on the 35-agent sample, with a caveat that installs are non-zero for only 11 agents; reported correlation and p-value.
high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... VS Code installs (IDE install counts as adoption proxy)
The Benchmark+Sentiment sub-composite predicts Stack Overflow question volume (ρ_s=0.49, p<0.01) in the circularity-controlled test (n=35).
Circularity-controlled Spearman correlation between Benchmark+Sentiment sub-composite and Stack Overflow question volume on 35 agents; reported correlation and p-value.
high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... Stack Overflow question volume (external adoption/engagement proxy)
A circularity-controlled test (n=35) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars (ρ_s=0.52, p<0.01).
Circularity-controlled correlation test (Spearman) between Benchmark+Sentiment sub-composite and GitHub stars on a 35-agent sample; reported Spearman correlation and p-value.
high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... GitHub stars (external adoption proxy)
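The three correlations above are Spearman rank correlations (ρ_s). As a minimal sketch of how such a coefficient is computed, the code below ranks both variables, gives tied values their average rank, and takes the Pearson correlation of the ranks. The scores and star counts are made-up placeholders, not data from the paper, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical example: six agents, sub-composite score vs. an adoption proxy.
composite_scores = [0.82, 0.40, 0.65, 0.91, 0.30, 0.55]
adoption_proxy = [12000, 900, 900, 30000, 0, 4000]
rho = spearman_rho(composite_scores, adoption_proxy)
print(f"rho_s = {rho:.2f}")
```

Tie handling matters when many observations share a value, as with the VS Code installs, where only 11 of 35 agents are non-zero.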
We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards.
Methodological description in the paper; reported sample of 50 agents and use of 18 signals from enumerated sources.
high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... AgentPulse composite and factor scores (Benchmark Performance, Adoption Signals,...
A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration.
Follow-up experimental intervention reported in the paper: augmenting model context with prior capability information and measuring calibration change.
high positive MarketBench: Evaluating AI Agents as Market Participants change in calibration of predicted success probability and token usage after add...
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly.
Conceptual/theoretical argument presented in the paper (no empirical test reported in the excerpt).
high positive MarketBench: Evaluating AI Agents as Market Participants suitability of markets for coordinating AI agents (theoretical promise)
These results provide concrete tier-selection guidance across deployment scales from a single seminar to a university-wide rollout.
Concluding claim in paper based on empirical latency and cost comparisons across throughput tiers and concurrency up to 50 users from a live deployment.
high positive Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... tier-selection guidance for deployment scale decision-making
Provisioned Throughput, expensive under continuous provisioning, becomes cost-competitive for institutions that can predict and concentrate their traffic toward high utilization.
Cost modeling and comparison across provisioning modes in the paper showing trade-offs conditional on utilization; based on the instrumented tiers and assumed usage patterns.
high positive Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... cost competitiveness (cost per unit of usage vs utilization)
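The utilization trade-off can be illustrated with a break-even calculation. This is a minimal sketch, assuming a flat hourly price for provisioned capacity and a flat per-million-token price for pay-per-token; every price and capacity below is a hypothetical placeholder, not a figure from the paper.

```python
def breakeven_utilization(provisioned_cost_per_hour,
                          capacity_mtok_per_hour,
                          paygo_price_per_mtok):
    """Utilization (fraction of capacity) above which provisioned is cheaper.

    A result above 1.0 means provisioned capacity never wins under
    these assumed prices.
    """
    paygo_cost_at_full_load = capacity_mtok_per_hour * paygo_price_per_mtok
    return provisioned_cost_per_hour / paygo_cost_at_full_load


def provisioned_cost_per_mtok(provisioned_cost_per_hour,
                              capacity_mtok_per_hour,
                              utilization):
    """Effective price per million tokens actually used at a given utilization."""
    return provisioned_cost_per_hour / (capacity_mtok_per_hour * utilization)


# Hypothetical numbers: 20 currency units/hour for 10M tokens/hour of capacity,
# versus 4 currency units per million tokens on pay-per-token.
threshold = breakeven_utilization(20.0, 10.0, 4.0)
print(threshold)                                   # 0.5
print(provisioned_cost_per_mtok(20.0, 10.0, 0.25))  # 8.0
```

Under these assumed prices, provisioned capacity costs twice the pay-per-token rate at 25% utilization and only matches it at 50%, which is why concentrating traffic matters.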
Cost analysis places both pay-per-token tiers well below the price of a STEM textbook per student per semester under a worst-case usage ceiling.
Cost analysis presented in the paper comparing per-student per-semester costs under a stated worst-case usage ceiling to the price of a STEM textbook; exact numeric assumptions not provided in the excerpt.
high positive Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... cost per student per semester
Priority PayGo maintains flat sub-4-second response times across the full load range.
Empirical latency measurements from the ITAS instrumentation described in the paper (over 3,000 requests across concurrency levels up to 50 and three throughput tiers).
Multi-agent LLM tutoring systems improve response quality through agent specialization.
Statement in paper describing design rationale; no quantitative quality comparison or metrics provided in the excerpt.
LLMs reveal their ability to approximate adaptive equilibria beyond static mechanism design.
Interpretive claim based on simulation results showing LLM bidders adapt and achieve favorable outcomes when mechanism assumptions fail (e.g., static budgets). This is an inferential claim drawn from comparative experiments described in the paper; no numerical quantification provided in the excerpt.
high positive Strategic Bidding in 6G Spectrum Auctions with Large Languag... ability to approximate adaptive equilibria (strategic adaptation capability)
When theoretical assumptions break—such as under static budget constraints—LLMs sustain longer participation and achieve higher utilities.
Reported simulation results under scenarios violating VCG truthfulness assumptions (example: static budget constraints). The paper states LLM bidders maintained participation for longer and obtained higher utility than truthful and heuristic strategies in these scenarios. No numeric sample sizes or quantified effect sizes provided in the abstract.
high positive Strategic Bidding in 6G Spectrum Auctions with Large Languag... participation duration and agent utility
Unlike heuristics, LLMs leverage historical outcomes and prompt-based reasoning to adapt their bidding behavior dynamically.
Method description and reported behavioral difference: the paper states LLM bidders incorporate prior auction outcomes and prompt engineering to inform bids, contrasted with static heuristic strategies. Based on simulation implementation rather than field deployment; no sample size provided.
high positive Strategic Bidding in 6G Spectrum Auctions with Large Languag... bidding behavior adaptability (dynamic adaptation using history and prompts)
Generative artificial intelligence (genAI) is rapidly reshaping how knowledge and culture are produced and consumed.
Author's descriptive statement based on observed changes in production/consumption patterns (no empirical sample reported in paper abstract).
high positive Generative artificial intelligence reduces social welfare th... production and consumption of knowledge and culture
The paper provides a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories.
Methodological contribution described in the paper (framework/overview component).
high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... existence of a structured overview framework
The FETS benchmark collects and analyzes 54 datasets across 9 data categories guided by typical stakeholder interests.
Descriptive claim about the benchmark dataset compilation reported in the paper (dataset count and category count).
high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... breadth of dataset coverage (count and categories)
Foundation models show improved performance at higher aggregation levels such as national load, district heating, and power grid data.
Subset analyses within the benchmark comparing performance across aggregation levels (e.g., national load, district heating, power grid) showing better results for aggregated data.
high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... forecast accuracy stratified by aggregation level
Foundation models outperform classical machine learning approaches despite the latter having seen the full historic target data during training.
Benchmark setup where classical ML baselines had access to full historic target data during training while foundation models were pretrained/generalized; empirical comparison across the dataset collection.
high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... forecasting accuracy when classical ML had access to full historic targets
Covariate-informed foundation models achieve the strongest performance.
Benchmark experiments that compare foundation model variants, including those that incorporate covariates, across the collected datasets; reported as a comparative finding in the benchmark results.
high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... predictive performance of covariate-informed vs non-covariate models
Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories.
Empirical benchmark comparing foundation models vs classical dataset-specific ML approaches across multiple forecasting settings and data categories; reported analysis uses the collection of datasets assembled for the FETS benchmark.
high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... predictive performance of time series forecasts (forecast accuracy/output qualit...
This paper presents the first systematic study of token consumption patterns in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified and evaluating models' ability to predict their own token costs before task execution.
Stated scope and methodology in the paper: dataset is SWE-bench Verified, eight frontier LLMs were analyzed, and experiments included model self-prediction evaluation.
high positive How Do AI Agents Spend Your Money? Analyzing and Predicting ... scope of study (presence of systematic analysis and self-prediction evaluation)
Technological capability (AI) and board diversity are complementary in strengthening corporate governance and fiscal discipline in developing economies.
Synthesis/interpretation of empirical results (main effects and interaction effects) from panel regressions and robustness analyses on 1,586 firms across 2009–2023.
high positive AI-Enabled Governance: Board Gender Diversity and Corporate ... corporate governance effectiveness and fiscal discipline (proxied by tax complia...
The main findings are robust to alternative tax avoidance measures, alternative BGD specifications, heterogeneity analyses, and selection-bias corrections (Heckman, propensity score matching, and instrumental-variable 2SLS approaches).
Reported robustness checks in the paper applying multiple alternative variable specifications and methods for selection-bias correction on the primary sample.
high positive AI-Enabled Governance: Board Gender Diversity and Corporate ... stability/robustness of the association between BGD (and its interaction with AI...
AI capability significantly strengthens the relationship between BGD and effective tax rates; firms with higher AI adoption exhibit a stronger governance effect of gender-diverse boards on tax compliance.
Interaction models estimated on the same balanced panel (1,586 firms, 2009–2023) using lagged AI capability specification; estimated with firm FE and dynamic two-step System GMM, with reported statistically significant interaction effects.
high positive AI-Enabled Governance: Board Gender Diversity and Corporate ... effective tax rate / tax compliance
Board gender diversity (BGD) is positively associated with effective tax rates, implying lower levels of corporate tax avoidance.
Empirical analysis using a balanced panel of 1,586 non-financial firms from developing economies over 2009–2023; firm fixed effects models and dynamic two-step System GMM estimations used to address unobserved heterogeneity, endogeneity, and persistence of corporate tax behavior.
high positive AI-Enabled Governance: Board Gender Diversity and Corporate ... effective tax rate (ETR) / level of corporate tax avoidance
The deployment reduced variability in solder-joint quality and cycle time.
Abstract statement that variability in solder-joint quality and cycle time was reduced during the deployment (no quantitative variability metrics provided in the abstract).
high positive Learning-augmented robotic automation for real-world manufac... variability of solder-joint quality; variability of cycle time
The system maintained near-human takt time.
Abstract claim comparing the system's cycle/takt time to human performance during the deployment (no numeric takt-time comparison provided in the abstract).
high positive Learning-augmented robotic automation for real-world manufac... takt time (cycle time) relative to human workers
The system achieved a 99.4% pass rate on product-level quality-control tests.
Reported QC pass rate from the production run in the abstract (presumably based on the produced motors).
high positive Learning-augmented robotic automation for real-world manufac... product-level quality-control pass rate
The system operated without physical fencing.
Abstract statement that the run occurred "without physical fencing" (implying operation around people without traditional fences).
high positive Learning-augmented robotic automation for real-world manufac... use of physical fences for safety (absent)
The system produced 108 motors.
Count of products produced during the continuous run reported in the abstract.
high positive Learning-augmented robotic automation for real-world manufac... number of motors produced during the run
The system operated continuously for 5 h 10 min.
Reported continuous operation duration from the production run described in the abstract.
high positive Learning-augmented robotic automation for real-world manufac... continuous operational time without interruption
The system required less than 20 min of real-world data per task.
Reported training data requirement for the deployed tasks in the authors' field experiment (abstract statement).
high positive Learning-augmented robotic automation for real-world manufac... amount of real-world training data per task
With less than 20 min of real-world data per task, the system operated continuously for 5 h 10 min, producing 108 motors without physical fencing and achieving a 99.4% pass rate on product-level quality-control tests.
Single field deployment / production run reported in the paper; numbers reported in the abstract (training data time, continuous operation duration, number of motors produced, fencing status, QC pass rate).
high positive Learning-augmented robotic automation for real-world manufac... training data required; continuous operational duration; production quantity; pr...
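The reported run length and output allow a rough throughput check. The sketch below is simple arithmetic on the abstract's numbers, not the paper's own takt-time measurement, and it assumes production was spread evenly over the whole run.

```python
run_minutes = 5 * 60 + 10   # 5 h 10 min, as reported
motors_produced = 108       # as reported
pass_rate = 0.994           # as reported

# Average time per motor and hourly output over the run.
seconds_per_motor = run_minutes * 60 / motors_produced
motors_per_hour = motors_produced / (run_minutes / 60)

print(f"{seconds_per_motor:.1f} s per motor")   # 172.2 s per motor
print(f"{motors_per_hour:.1f} motors per hour")  # 20.9 motors per hour

# 99.4% of 108 is not a whole number of motors, so the pass rate may be
# computed over individual QC tests rather than over motors; the abstract
# does not state the denominator.
implied_passing_motors = pass_rate * motors_produced
print(f"{implied_passing_motors:.2f}")           # 107.35
```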
We deployed the system on an electric-motor production line to automate deformable cable insertion and soldering under real manufacturing constraints, a step previously performed manually by human workers.
Field deployment on an actual electric-motor production line described by the authors (deployment + task specification).
high positive Learning-augmented robotic automation for real-world manufac... automation of previously manual deformable cable insertion and soldering tasks
We present Learning-Augmented Robotic Automation, a hybrid system that integrates learned task controllers and a neural 3D safety monitor into conventional industrial workflows.
Description of the system developed by the authors (system design/development reported in the paper).
high positive Learning-augmented robotic automation for real-world manufac... integration of learned controllers and 3D safety monitoring
Self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.
Synthesis of theoretical framing (Markov model and diagnostic inequality) and empirical results across multiple models/datasets showing thresholds and promptability of EIR.
high positive When Does LLM Self-Correction Help? A Control-Theoretic Mark... policy/recommendation about when to enable iterative self-correction to improve ...
A 'verify-first' prompt ablation on GPT-4o-mini reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4).
A prompt-ablation experiment reported for GPT-4o-mini showing EIR dropping from 2% to 0% and the observed accuracy change flipping from -6.2 percentage points to +0.2 percentage points; statistical significance assessed with a paired McNemar test (p < 10^-4).
high positive When Does LLM Self-Correction Help? A Control-Theoretic Mark... EIR and accuracy change from self-correction after prompt modification
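A paired McNemar test compares two conditions on the same items using only the discordant pairs, that is, items answered correctly under exactly one condition. The following is a minimal sketch of the exact (binomial) form. The discordant counts are hypothetical because the excerpt does not give the paper's contingency table, and the excerpt also does not say whether the exact or the chi-square variant was used.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items correct only under condition A
    c: items correct only under condition B
    Under the null hypothesis, each discordant pair falls on either
    side with probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 3 items flipped one way, 25 the other.
print(mcnemar_exact(3, 25))
```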
In this framework, EIR functions as a stability margin and prompting functions as lightweight controller design.
Conceptual framing in the paper (cybernetic feedback loop where the same language model is controller and plant), supported by associated experiments showing prompt changes affect EIR and outcomes.
high positive When Does LLM Self-Correction Help? A Control-Theoretic Mark... stability of iterative refinement (EIR) and resulting accuracy
Iterate only when ECR/EIR > Acc/(1 - Acc).
The paper frames self-correction as a two-state Markov model over {Correct, Incorrect} and derives this deployment diagnostic analytically from that model.
high positive When Does LLM Self-Correction Help? A Control-Theoretic Mark... whether iterative self-correction is expected to improve accuracy
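The inequality follows from a one-step accuracy update in a two-state chain. The sketch below assumes ECR is the per-round probability that an incorrect answer becomes correct and EIR the per-round probability that a correct answer becomes incorrect; the excerpt does not expand these acronyms, so treat that reading as an assumption. Under it, the next-round accuracy is Acc(1 - EIR) + (1 - Acc)ECR, which exceeds Acc exactly when ECR/EIR > Acc/(1 - Acc). The function names and example values are hypothetical.

```python
def next_accuracy(acc, ecr, eir):
    """Accuracy after one self-correction round in the two-state model."""
    return acc * (1 - eir) + (1 - acc) * ecr


def should_iterate(acc, ecr, eir):
    """Diagnostic ECR/EIR > Acc/(1 - Acc), written in cross-multiplied form.

    Cross-multiplying avoids division by zero when eir == 0 or acc == 1.
    """
    return ecr * (1 - acc) > eir * acc


# Hypothetical values: 80% accurate model, 10% correction rate.
print(should_iterate(0.8, 0.10, 0.02), next_accuracy(0.8, 0.10, 0.02))
print(should_iterate(0.8, 0.10, 0.05), next_accuracy(0.8, 0.10, 0.05))
```

With a baseline accuracy of 0.8, the threshold Acc/(1 - Acc) is 4, so in this example a correction-to-introduction ratio of 5 helps and a ratio of 2 hurts.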
In the research context, context-specific training and support measures are decisive for the success of the Copilot rollout.
Authors' conclusion drawn from the findings on how research staff's assessments developed over time and on differing perceptions of benefit (stated in the abstract).
high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... importance of training and support measures for success/adoption
The study shows that Microsoft 365 Copilot enables efficiency gains particularly in administrative work.
Self-reported assessments by employees (specifically administrative staff) in the repeated cross-sectional survey; the authors derive practical relevance for administrative work from these (abstract).
high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... perceived efficiency gains in administrative work
The findings underscore the importance of context-specific rollout, role-based training, and governance for sustained acceptance of generative AI in organizations.
Authors' interpretation/conclusion based on the survey results, observed differences between roles, and developments over time (as stated in the abstract).
high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... recommended implementation measures (context adaptation, training, governance) to...
Copilot adds the most value for clearly structured, text-based tasks.
Survey results on the estimated benefit for typical knowledge-work activities, as summarized in the abstract (preferred task types: structured, text-based tasks).
high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... perceived benefit by task type (text-based, structured tasks)
Microsoft 365 Copilot is predominantly perceived as user-friendly and technically reliable.
Self-reported ratings of user-friendliness and technical reliability in the repeated cross-sectional survey (stated in the abstract).
high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... perceived user-friendliness and technical reliability
Research staff develop more positive assessments over time, particularly regarding productivity and reduced workload through Copilot.
Quasi-longitudinal observation across the repeated cross-sectional surveys; the change over time in research staff's self-assessments is described in the abstract.
high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... perceived productivity and reduced workload (self-assessment over time...