The Commonplace

Evidence (6574 claims)

Adoption: 8625 claims
Productivity: 7686 claims
Governance: 6917 claims
Human-AI Collaboration: 6574 claims
Org Design: 4189 claims
Innovation: 4131 claims
Labor Markets: 3588 claims
Skills & Training: 2985 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 761 200 101 904 2020
Governance & Regulation 829 400 191 122 1566
Organizational Efficiency 784 193 125 84 1197
Technology Adoption Rate 637 236 124 97 1103
Research Productivity 431 131 58 340 972
Output Quality 481 183 59 47 770
Decision Quality 332 177 82 49 647
Firm Productivity 439 57 88 20 610
AI Safety & Ethics 218 279 66 33 602
Market Structure 181 170 123 24 503
Task Allocation 214 64 72 33 388
Skill Acquisition 174 62 62 17 315
Innovation Output 204 27 45 18 295
Employment Level 105 54 108 13 282
Fiscal & Macroeconomic 132 69 43 26 277
Consumer Welfare 117 63 42 11 233
Firm Revenue 154 48 26 3 231
Task Completion Time 173 31 8 12 225
Inequality Measures 44 123 50 6 223
Worker Satisfaction 89 65 22 12 188
Error Rate 71 92 10 2 175
Regulatory Compliance 77 69 14 5 165
Automation Exposure 58 56 26 13 156
Training Effectiveness 96 21 14 19 152
Wages & Compensation 77 37 25 6 145
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 81 21 1 115
Hiring & Recruitment 52 7 8 3 70
Creative Output 32 20 8 3 64
Skill Obsolescence 5 47 6 1 59
Social Protection 28 16 8 2 54
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
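
As a reading aid, the sketch below computes each direction's share of a row's listed total for three rows copied from the matrix above. The `ROWS` dictionary and the `direction_shares` helper are illustrative names, not part of the site. Shares are taken against the Total column as printed, which for some rows (for example Firm Productivity) is larger than the sum of the four listed directions.

```python
# Rows copied from the matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (439, 57, 88, 20, 610),
    "Job Displacement": (12, 81, 21, 1, 115),
    "Skill Obsolescence": (5, 47, 6, 1, 59),
}
DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(row):
    """Return each direction's count divided by the row's printed total."""
    *counts, total = row
    return {name: count / total for name, count in zip(DIRECTIONS, counts)}


for outcome, row in ROWS.items():
    shares = direction_shares(row)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{outcome}: {summary}")
```

Dividing by the printed total rather than by the sum of the four columns keeps the shares comparable with the site's own totals, at the cost that they may not add up to 100% for every row.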
Filter: Human-AI Collaboration
On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation.
Empirical evaluation on the ACE2-Spike assay within the ProteinGym benchmark; reported relative improvement in Spearman correlation versus prior state-of-the-art.
high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... Spearman correlation on ACE2-Spike binding fitness prediction
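
For readers unfamiliar with the metric named in this claim, here is a minimal sketch of Spearman rank correlation: rank both lists (averaging ranks for ties), then take the Pearson correlation of the ranks. The function names and the toy scores are hypothetical and are not drawn from the paper or from ProteinGym.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical predicted vs. measured fitness scores for four variants.
predicted = [0.1, 0.4, 0.35, 0.8]
measured = [1.0, 2.0, 3.0, 4.0]
print(spearman(predicted, measured))  # approximately 0.8
```

Because the metric depends only on rank order, a model can score well without predicting the measured values on the right scale.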
On GPT training optimization, AutoScientists continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements).
Empirical comparison of discovered/accepted improvements during GPT training optimization; counts of accepted improvements for AutoScientists (7) versus single-agent approach (0).
high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... count of accepted improvements discovered
On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch.
Empirical training-time comparison between AutoScientists and Autoresearch on GPT training optimization tasks; reported speedup multiplier to reach a validation bits-per-byte target.
high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... time-to-target (validation bits-per-byte)
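
One way to read the time-to-target metric in this claim: for each system, find the first point on its validation curve at or below the target, then take the ratio of those times. The sketch below illustrates that reading only. The curves, the target value, and the `time_to_target` helper are invented for illustration and do not reproduce the paper's measurements or its exact protocol.

```python
def time_to_target(curve, target):
    """Return the first elapsed time at which the metric is <= target, else None."""
    for elapsed, bits_per_byte in curve:
        if bits_per_byte <= target:
            return elapsed
    return None


# Hypothetical (elapsed_hours, validation bits-per-byte) curves; lower is better.
baseline_curve = [(1, 1.20), (2, 1.10), (3, 1.04), (4, 0.99)]
candidate_curve = [(1, 1.12), (2, 0.98), (3, 0.95)]

target = 1.00
baseline_time = time_to_target(baseline_curve, target)
candidate_time = time_to_target(candidate_curve, target)
speedup = baseline_time / candidate_time
print(f"speedup: {speedup:.1f}x")
```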
On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%.
Empirical evaluation on the BioML-Bench benchmark (24 tasks); reported mean leaderboard percentile and comparative improvement versus the strongest baseline agent.
high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... leaderboard percentile across benchmark tasks
Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction.
Empirical comparisons reported in paper across multiple benchmark suites and tasks (BioML-Bench, GPT training optimization experiments, ProteinGym).
high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... overall performance across multiple benchmarks
AutoScientists is a decentralized team of AI agents that interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration.
System design and implementation described in the paper (architecture and agent protocols); qualitative description of agent behaviors and coordination mechanisms; demonstrated in experiments.
high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... agent coordination and information sharing (qualitative description)
We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.
Statement of the paper's contributions and contents (methodological description of what the paper includes).
high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... capability to study reliability, failure modes, and readiness of LLM agents
By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation.
Stated objective/claim in the paper about the benchmark's purpose and what it measures (conceptual/goal-oriented statement).
high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... reliability of LLM agents in performing optimization work (beyond text generatio...
OR-Space defines an Explain task mode, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts.
Definition of the Explain task mode provided in the paper (design/specification).
high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... ability to generate grounded explanations using workspace evidence
OR-Space defines a Revise task mode, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic.
Definition of the Revise task mode in the benchmark design (descriptive claim in the paper).
high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... ability to revise models while preserving prior logic
OR-Space defines a Build task mode, one of its three task modes, where agents construct solver-ready optimization models from heterogeneous artifacts.
Definition of one of the benchmark's task modes as described in the paper (method/design description).
high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... ability to construct solver-ready models
Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files.
Design specification of OR-Space provided in the paper (descriptive claim about benchmark instance structure).
high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... complexity and composition of benchmark instances
We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation.
Paper presents and names a new benchmark (methodological contribution described directly in the text).
high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... capability of benchmarks to evaluate OR agents across lifecycle tasks
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling.
Statement in the paper asserting an observed trend; likely based on literature/context motivating the work (no empirical sample or quantitative citation provided in the excerpt).
high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... LLM agent adoption in OR workflows
A recommended organizational design for the AI era is the 'resonance protocol enterprise' in which structures are temporary crystallizations, AI governance protects adaptive openness, and legitimacy derives from sustaining recursive renewal.
Normative/proposal in the paper outlining a new organizational design paradigm; presented as conceptual design without empirical pilot or evaluation.
high positive The Lantern in the Vault: AI, Crisis, and the Ontology of Or... organizational design aimed at sustaining adaptive renewal and legitimacy under ...
Digital transformation initially enhanced some organizations' adaptability by making information flows more fluid and expanding relational connectivity.
Theoretical claim supported by qualitative interpretation of digital transformation phenomena; no systematic measurement or reported sample.
high positive The Lantern in the Vault: AI, Crisis, and the Ontology of Or... organizational adaptability associated with digital transformation practices
Organizations capable of rapid relational reconfiguration, customer reconnection, and generative experimentation often proved more resilient during the pandemic.
Illustrative/theoretical interpretation of pandemic cases offered in the paper; no quantified sample or formal empirical evidence reported.
high positive The Lantern in the Vault: AI, Crisis, and the Ontology of Or... organizational resilience as a function of relational reconfiguration and experi...
Although AI creates obstacles, it also has the potential to be an important tool for creating innovative opportunities and continued growth if managed with sound practices.
Concluding statement in the paper's abstract presenting a normative/conditional conclusion based on the paper's evaluation and synthesis of evidence (no primary quantified results provided in the supplied text).
high positive Impact of Artificial Intelligence on Employment and Society innovation opportunities and continued economic/organizational growth under soun...
AI leads to the creation of new jobs.
The paper explicitly states it examines the creation of new jobs as a ramification of AI (abstract); claim presented qualitatively without reported sample sizes or quantified effect in the provided text.
high positive Impact of Artificial Intelligence on Employment and Society creation of new jobs / net employment effects
GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.
Architectural description in the paper; claim about knowledge base acting as ground truth and enabling capability compounding (design-level claim). No quantitative evaluation given in the abstract.
high positive GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesi... accumulation/compounding of capabilities across runs (longitudinal improvement o...
GENESIS is an agentic AI framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base.
System design / implementation claim presented in the paper (description of proposed framework). The abstract does not report empirical evaluation metrics or sample size.
high positive GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesi... ability to produce solutions validated by over-the-air experiments (end-to-end R...
Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes.
Paper's stated comparison/claim (likely based on prior reports or authors' experience); no experimental details or sample size provided in the abstract.
high positive GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesi... time to complete R&D/software engineering tasks
Operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems.
Author's forward-looking argument / conjecture about the potential future impact and adoption of operational reasoning paradigms; presented as an argument rather than demonstrated empirically in the excerpt.
high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... future adoption / foundational role of operational reasoning paradigms
The paper presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems.
Author statement about the paper's content and demonstration (explicitly claims an architecture and an example walkthrough); evidence is the paper's own descriptive content.
high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... presence of architecture and example demonstration in the paper
The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction into a unified reasoning lifecycle.
Author claim about the architecture and components of ReasonOps; presented as a proposed integrated lifecycle in the paper (no empirical evaluation reported in excerpt).
high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... integration of multiple reasoning and assurance components
ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task.
Author description of the ReasonOps paradigm and its operational stance (conceptual framework described in paper).
high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... operationalization of reasoning processes (monitoring, verification, reliability...
This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems.
Declarative claim about the paper's contribution (introduction of a named paradigm); supported by the paper itself (architectural description and example claimed).
high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... existence/introduction of an operational paradigm (ReasonOps)
Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress toward machine-assisted formal reasoning.
Author statement citing multiple research directions (theorem proving, autoformalization, symbolic reasoning, tool-augmented LMs); no specific empirical results or quantitative studies provided in excerpt.
high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... progress toward machine-assisted formal reasoning
Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents.
Author assertion in paper's introduction; conceptual argument referencing recent developments in LLMs (no empirical study or sample size reported in text excerpt).
high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... capability of LLMs to perform reasoning
Agentic Technical Debt and Stochastic Tax are related but distinct: debt can amplify the tax.
Theoretical relationship asserted in the structural model; the note states debt can amplify the recurring Stochastic Tax and provides model expressions and discussion (and illustrative simulation) to substantiate the relationship.
high positive Modeling Agentic Technical Debt and Stochastic Tax: A Standa... impact of accumulated Agentic Technical Debt on the magnitude of Stochastic Tax ...
Combining both levers yields a 502% improvement on single-cell RNA denoising over the initial baseline.
Reported experimental result in the paper comparing SIA to the initial baseline on the single-cell RNA denoising task (denoising metric unspecified in abstract).
high positive SIA: Self Improving AI with Harness & Weight Updates denoising performance for single-cell RNA data
Combining both levers yields a 91.9% runtime reduction on GPU kernels over the initial baseline.
Reported experimental result in the paper comparing SIA to the initial baseline on the low-level GPU kernel optimisation task (runtime measured).
high positive SIA: Self Improving AI with Harness & Weight Updates runtime for GPU kernels
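
A runtime reduction converts to a speedup factor as 1 / (1 - reduction). The short sketch below applies that identity to the 91.9% figure in this claim. The helper name is illustrative, and the conversion assumes the reduction is measured relative to the baseline runtime, as the claim states.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional runtime reduction (0 <= r < 1) into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


factor = speedup_from_reduction(0.919)
print(f"{factor:.1f}x")  # about 12.3x
```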
Combining both levers yields a 56.6% gain on LawBench (Chinese legal charge classification) over the initial baseline.
Reported experimental result in the paper comparing SIA to the initial baseline on the LawBench task.
high positive SIA: Self Improving AI with Harness & Weight Updates task performance on LawBench (unspecified metric in abstract)
Combining both levers (harness updates and weight updates) outperforms scaffold iteration alone on all three benchmarks.
Empirical comparison reported in the paper: experiments across the three domains comparing SIA (combined harness+weight updates) against scaffold-iteration-only baseline.
high positive SIA: Self Improving AI with Harness & Weight Updates overall task performance relative to scaffold-only baseline
There exists a data supply chain that runs from individual translators through language service providers (LSPs) and platforms to model developers.
Mapping and descriptive analysis of industry supply chains and intermediary roles provided in the paper; conceptual and empirical examples of flows of translation data from translators to model developers. No numerical sample reported.
high positive Translators as Invisible Teachers of AI: Copyright, Translat... structure and flow of translation data across actors
Article 30-4 of the Japanese Copyright Act legitimates a mode of use the paper terms 'appropriation without consumption'—i.e., mining works for statistical features rather than reading or experiencing them.
Textual/legal analysis of Article 30-4 of the Japanese Copyright Act and its interpretation; comparative legal reading presented in the paper. No numerical sample reported.
high positive Translators as Invisible Teachers of AI: Copyright, Translat... legal legitimation of non-experiential mining of copyrighted works
The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of translation data (TM/parallel corpora).
Historical and technical literature review linking MT/NLP methodological advances to the availability and use of parallel corpora and TM; comparative analysis of model development histories described in the paper. No numerical sample reported.
high positive Translators as Invisible Teachers of AI: Copyright, Translat... dependence of major MT/LLM advances on accumulated translation data
Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation.
Conceptual argument and literature review of machine translation practice (discussion of TM/parallel corpora as supervised training data); examples and descriptive evidence from MT research and industry practice presented in the paper. No numerical sample reported.
high positive Translators as Invisible Teachers of AI: Copyright, Translat... value of translation data as supervised training inputs for MT
EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.
Methodological claim that training is performed offline using recorded agent-to-agent interaction data rather than online interactions; described as part of framework benefits.
high positive EmoDistill: Offline Emotion Skill Distillation for Language ... training approach (offline learning) and its cost-avoidance benefit
Transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments.
Reported transfer experiments in which EmoDistill-trained policies were evaluated on different negotiation domains, with unseen counterparties, and in tournaments between trained agents; results reportedly show generalization. (Exact metrics and sample sizes not provided in the excerpt.)
high positive EmoDistill: Offline Emotion Skill Distillation for Language ... generalization of policy performance (utility) across domains and opponents
Ablations show that emotion conditioning is essential.
Ablation experiments reported in the paper removing or altering emotion conditioning, which reportedly degrade performance relative to the full EmoDistill model. (No numeric results provided in the excerpt.)
high positive EmoDistill: Offline Emotion Skill Distillation for Language ... performance/utility difference when emotion conditioning is removed
Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection.
Empirical evaluation across four negotiation domains comparing EmoDistill-trained SLM policies to vanilla SLM/LLM baselines and an ablated IQL-only emotion selector. (Paper reports comparative utility results, but exact sample sizes and numeric effect sizes are not provided in the excerpt.)
high positive EmoDistill: Offline Emotion Skill Distillation for Language ... utility (negotiation reward/outcome)
EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns which emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns how to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO).
Description of model architecture and training approach: IQL used as selector; LoRA-based policy trained with SFT and JPO for expression. (Design/implementation claim from methods section.)
high positive EmoDistill: Offline Emotion Skill Distillation for Language ... ability to select and express emotion (method decomposition)
We introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents.
Methodological contribution described in the paper: design and presentation of the EmoDistill framework (decomposition, training pipeline). This is a description of a proposed method rather than an empirical result.
high positive EmoDistill: Offline Emotion Skill Distillation for Language ... method/framework existence and capability to distill emotional negotiation skill...
Hybrid Fusion significantly accelerated the recovery of smaller Slow AI teams (+6.9% at N=4).
Reported intervention result: Hybrid Fusion produced a +6.9% acceleration in recovery for smaller Slow AI teams, reported at N=4.
high positive The Timing Dependencies of Trust: Speed, Accuracy, and cBCI ... team recovery acceleration (performance improvement) after Hybrid Fusion
Integrating these isolated veridical signals via Hybrid Fusion successfully rescued the Fast AI team (+7.6% at N=8).
Reported intervention result: application of Hybrid Fusion integration produced a +7.6% improvement in Fast AI team performance, reported at N=8.
high positive The Timing Dependencies of Trust: Speed, Accuracy, and cBCI ... team performance improvement after Hybrid Fusion
The Riemannian Oracle adapted to task states by heavily restricting temporal windows (< 0.8s) to intercept fast reflexive compliance and widening windows (> 1.2s) to capture delayed cognitive conflict.
Reported algorithmic behavior of the 2D Adaptive Riemannian Oracle in response to measured spatial covariance: window sizes described as <0.8s for fast states and >1.2s for slow states.
high positive The Timing Dependencies of Trust: Speed, Accuracy, and cBCI ... temporal gating/window size of the Riemannian Oracle
In the Slow AI condition, behavioural teams (N=8) eventually recovered to 100.0%.
Reported team performance metric for behavioural teams in Slow AI condition with N=8; team performance reported to reach 100.0%.
high positive The Timing Dependencies of Trust: Speed, Accuracy, and cBCI ... team accuracy/recovery over time
We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
Authors' stated aim/goal for the benchmark (normative/aspirational statement in the paper).
high positive JobBench: Aligning Agent Work With Human Will intended shift in community priorities / framing of labour-market effects (repla...
Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work.
Design description of task packaging in JobBench (benchmark construction/methodological detail).
high positive JobBench: Aligning Agent Work With Human Will realism of task inputs (heterogeneous reference files; information clutter)