The Commonplace

Evidence (13661 claims)

Adoption: 8339 claims
Productivity: 7479 claims
Governance: 6715 claims
Human-AI Collaboration: 6267 claims
Org Design: 4098 claims
Innovation: 3987 claims
Labor Markets: 3488 claims
Skills & Training: 2888 claims
Inequality: 2016 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                   Positive  Negative  Mixed  Null  Total
Other                          740       192     95   871   1945
Governance & Regulation        796       388    185   119   1512
Organizational Efficiency      765       186    123    82   1166
Technology Adoption Rate       610       227    121    95   1061
Research Productivity          409       121     56   331    928
Output Quality                 464       174     58    47    743
Decision Quality               318       173     75    42    615
Firm Productivity              432        55     88    20    601
AI Safety & Ethics             214       273     65    33    589
Market Structure               175       165    120    24    489
Task Allocation                206        64     70    31    376
Skill Acquisition              161        57     57    16    291
Innovation Output              201        27     41    18    288
Fiscal & Macroeconomic         130        69     43    26    275
Employment Level               104        50    105    13    274
Consumer Welfare               116        62     42    11    231
Firm Revenue                   149        45     26     3    223
Inequality Measures             43       120     49     6    218
Task Completion Time           164        29      8    12    214
Worker Satisfaction             89        60     20    12    181
Error Rate                      69        89      9     2    169
Regulatory Compliance           74        67     14     4    159
Training Effectiveness          91        19     13    19    144
Wages & Compensation            77        33     25     6    141
Team Performance                86        17     27     9    140
Automation Exposure             49        50     22    12    136
Developer Productivity          91        17     14     5    128
Job Displacement                12        80     19     1    112
Hiring & Recruitment            51         7      8     3     69
Creative Output                 31        16      7     2     57
Skill Obsolescence               5        43      6     1     55
Social Protection               27        16      8     2     53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
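To read a matrix row programmatically, one option is to split from the right, since outcome names contain spaces while the counts do not. The sketch below is illustrative only: `parse_row` and `positive_share` are hypothetical helpers, and using the Total column as the denominator is an assumption (in several rows the four direction counts do not add up to Total).

```python
def parse_row(line: str) -> dict:
    """Split one matrix row into an outcome name and five integer counts."""
    # Outcome names contain spaces, so split from the right: the last five
    # whitespace-separated fields are Positive, Negative, Mixed, Null, Total.
    name, *counts = line.rsplit(None, 5)
    positive, negative, mixed, null, total = (int(c) for c in counts)
    return {
        "outcome": name,
        "positive": positive,
        "negative": negative,
        "mixed": mixed,
        "null": null,
        "total": total,
    }


def positive_share(row: dict) -> float:
    """Share of an outcome's claims that report a positive finding."""
    # Assumption: Total is the denominator, not the sum of the four directions.
    return row["positive"] / row["total"]


row = parse_row("Firm Productivity 432 55 88 20 601")
print(f"{row['outcome']}: {positive_share(row):.1%} of claims positive")
```

Rows that carry fewer than five counts raise `ValueError` here instead of being guessed at, because the row alone does not say which column is empty.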
This paper presents a taxonomy of seven failure modes unique to production agentic systems.
Author contribution: taxonomy presented in the paper (count = seven failure modes).
high positive Evaluating Agentic AI in the Wild: Failure Modes, Drift Patt... cataloging of distinct failure modes in production agentic systems
These findings provide insights for designing flexible yet reliable constraint-based workflows.
Synthesis and discussion of study results and technical evaluation in paper's conclusion.
high positive U-Define: Designing User Workflows for Hard and Soft Constra... design guidance for constraint-based workflows
User-defined constraint types improve user satisfaction.
Reported user study measures showing higher satisfaction for participants using U-Define compared to baselines (no sample size or numeric effects provided).
high positive U-Define: Designing User Workflows for Hard and Soft Constra... user satisfaction (self-reported)
User-defined constraint types improve performance.
Reported results from user studies and/or technical evaluation indicating better task performance when users can set hard/soft constraint types (no numeric effect size or sample size in excerpt).
high positive U-Define: Designing User Workflows for Hard and Soft Constra... performance (task success / quality of generated plans)
User-defined constraint types improve perceived usefulness.
Results from the reported user studies comparing U-Define (user-defined constraint types) to baselines; based on participant responses and measures of perceived usefulness (sample sizes/details not provided in excerpt).
high positive U-Define: Designing User Workflows for Hard and Soft Constra... perceived usefulness (user-reported)
U-Define verifies hard constraints using formal model checking and verifies soft constraints using an LLM-as-judge evaluation.
Description of the complementary verification methods employed in the U-Define system (technical design/implementation).
high positive U-Define: Designing User Workflows for Hard and Soft Constra... verification of constraint types (hard via model checking, soft via LLM evaluati...
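The hard/soft split can be pictured with a minimal sketch. This is not U-Define's implementation: a plain deterministic predicate stands in for formal model checking, a scoring function stands in for the LLM-as-judge, and every name here (`Constraint`, `verify`, `soft_threshold`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Constraint:
    text: str                          # the user's natural-language wording
    hard: bool                         # True = must hold, False = preference
    evaluate: Callable[[dict], float]  # returns a score in [0, 1]


def verify(plan: dict, constraints: list[Constraint],
           soft_threshold: float = 0.5) -> dict:
    """Reject a plan on any hard violation; only warn on weak soft scores."""
    violations, warnings = [], []
    for c in constraints:
        score = c.evaluate(plan)
        if c.hard:
            # Stand-in for formal verification: anything short of 1.0 fails.
            if score < 1.0:
                violations.append(c.text)
        elif score < soft_threshold:
            # Stand-in for a judge score: low scores are flagged, not fatal.
            warnings.append(c.text)
    return {"accepted": not violations,
            "violations": violations,
            "warnings": warnings}


# Hypothetical plan and constraints, for illustration only.
plan = {"cost": 120, "start_hour": 7}
constraints = [
    Constraint("Total cost must not exceed 100", True,
               lambda p: 1.0 if p["cost"] <= 100 else 0.0),
    Constraint("Prefer starting after 9:00", False,
               lambda p: 1.0 if p["start_hour"] >= 9 else 0.2),
]
result = verify(plan, constraints)
print(result)
```

The design point the sketch tries to capture is that a hard violation blocks the plan outright, while a poorly satisfied preference is surfaced without blocking it.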
We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility.
System implementation and description in paper (design and implementation of U-Define).
high positive U-Define: Designing User Workflows for Hard and Soft Constra... ability to specify constraints (natural-language input and categorization into h...
KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.
Theoretical claim about economic and cumulative effects of adopting KOs; no cost-benefit analysis, pilot results, or quantitative evidence reported in the paper.
high positive Reliable AI Needs to Externalize Implicit Knowledge: A Human... cost-effectiveness of verification and cumulative improvement in AI reliability
We propose Knowledge Objects (KOs) — structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse.
Proposed solution described in the paper; conceptual design and intended properties presented, without reported deployments, trials, or empirical evaluation.
high positive Reliable AI Needs to Externalize Implicit Knowledge: A Human... externalization and human verifiability of implicit knowledge via KOs
Evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, provides added value compared to focusing on benchmark performance only.
Argument/interpretation in the paper based on the study's multi-turn human-in-the-loop evaluation showing differences between objective performance gains and participant perceptions.
high positive Seeking Information with RAG-Assistants: Does Model Size Mat... evaluation methodology value (usability, satisfaction, accuracy)
Hybrid systems (human + RAG assistant) are beneficial in information-seeking scenarios.
Conclusion drawn from the experiment showing human-AI collaboration outperforms model-only baselines across model sizes in a realistic multi-turn information-seeking task with N=112 participants.
high positive Seeking Information with RAG-Assistants: Does Model Size Mat... task performance in information-seeking
The performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size.
Reported results from the experimental comparison across conditions and three model sizes (3B, 8B, 70B) with N=112 participants; paper states the performance gain is significant across sizes (no numeric effect sizes or p-values provided in the excerpt).
high positive Seeking Information with RAG-Assistants: Does Model Size Mat... task accuracy / performance
The framework addresses AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.
Paper lists and provides guidance on AI-specific methodological issues (model versioning, interaction dynamics, contamination/spillover, equity). This is a descriptive claim about topics the framework covers, not an empirical evaluation of solutions.
high positive Principles and Guidelines for Randomized Controlled Trials i... coverage of AI-specific methodological challenges in evaluation guidelines
The framework implements a graded transparency and repeatability framework.
Paper extends the TOP-guideline-derived transparency principle into a graded scheme for transparency and repeatability; described as an operational feature of the proposed framework.
high positive Principles and Guidelines for Randomized Controlled Trials i... graded transparency and repeatability practices for AI RCTs
The framework integrates heterogeneity analysis and practical significance assessment.
Paper reports inclusion of guidance on analyzing heterogeneous treatment effects and assessing practical significance; presented as part of the guidelines rather than tested across datasets.
high positive Principles and Guidelines for Randomized Controlled Trials i... inclusion of heterogeneity and practical significance analysis in evaluation pra...
The framework formalizes causal inference through RCT methodology for AI contexts.
Paper states adoption of randomized controlled trial methods and causal inference framing for AI impact evaluation; described as a methodological proposition rather than a validated application.
high positive Principles and Guidelines for Randomized Controlled Trials i... use of RCTs to support causal inference in AI evaluations
Our framework extends prior work by centering evaluation on human performance rather than model output alone.
Paper claims a conceptual shift: focus on human performance metrics; supported by argumentative rationale and literature references rather than empirical demonstration.
high positive Principles and Guidelines for Randomized Controlled Trials i... focus of evaluation metrics (human performance vs. model output)
The principles and guidelines serve three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms.
Paper's stated intended uses/positioning of the framework; presented as roles in the discussion/positioning section rather than empirically validated roles.
high positive Principles and Guidelines for Randomized Controlled Trials i... utility of the framework in planning, evaluating, and standard-setting
We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases.
Paper reports a concrete output: 33 guidelines derived from the five principles, with each guideline presented as requirement + rationale + implementation instructions + evidence base (documented in paper content).
high positive Principles and Guidelines for Randomized Controlled Trials i... availability of operational guidelines for AI RCTs
The paper adopts the four-validity framework of Shadish et al. (2002) and extends it with a fifth principle on transparency, repeatability, and verification, adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025).
Explicit methodological choice described in the paper: adoption of the Shadish et al. four-validity framework and addition of a transparency/repeatability principle based on the TOP Guidelines; documented in the text as a design decision.
high positive Principles and Guidelines for Randomized Controlled Trials i... methodological framework / validity criteria
The framework draws on established experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology.
Paper reports literature review and cross-disciplinary synthesis as the methodological foundation for the framework (references to those disciplines). No empirical cross-disciplinary experiment reported.
high positive Principles and Guidelines for Randomized Controlled Trials i... methodological comprehensiveness / interdisciplinary grounding
This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies).
Paper's stated contribution: development of a conceptual framework integrating RCT design principles for AI evaluation. Based on literature synthesis and methodological argumentation rather than empirical testing.
high positive Principles and Guidelines for Randomized Controlled Trials i... standardization of AI evaluation RCTs / evaluation methodology
The paper introduces a Specification Governance Model (SGM), grounded in Transaction Cost Economics, and provides a practical governance decision guide.
Conceptual/modeling contribution described in the paper: SGM grounded in TCE with an applied decision guide (theoretical plus prescriptive).
high positive The Productivity-Reliability Paradox: Specification-Driven G... governance decision-making for specification practices
The paper proposes the AI-Augmented Methodology Taxonomy (AAMT), classifying six methodologies under three AI integration tiers.
Conceptual contribution: taxonomy introduced and described in the paper (six methodologies, three tiers).
high positive The Productivity-Reliability Paradox: Specification-Driven G... existence and classification of methodologies (taxonomic contribution)
Telemetry across 10,000+ developers shows a 98% increase in pull requests.
Observational telemetry data aggregated across >10,000 developers reported in the paper; metric reported is percent increase in pull request count.
high positive The Productivity-Reliability Paradox: Specification-Driven G... number of pull requests (pull_request_count)
Controlled studies report 20-56% productivity gains on well-scoped tasks.
Aggregate of multiple controlled experimental studies cited in the paper (2022–2026); reported as observed productivity improvements on well-scoped tasks in those studies. Specific study-level sample sizes not reported in the claim text.
Practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration can be articulated, and calibrated beliefs plus utility-aware policies can improve agentic AI orchestration (illustrated via concrete examples and design patterns).
Paper provides articulated properties, examples, and design patterns but no empirical validation; claims of improvement are illustrated conceptually.
high positive Position: agentic AI orchestration should be Bayes-consisten... improvement in agentic AI orchestration from calibrated beliefs and utility-awar...
Coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily in the LLM agent parameters.
Central prescriptive claim of the position paper; supported by conceptual argumentation and illustrative examples rather than empirical tests.
high positive Position: agentic AI orchestration should be Bayes-consisten... coherence of decision-making in agentic systems as a function of orchestration-l...
Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions
Argumentative/theoretical claim in the position paper; illustrated with conceptual examples and design patterns rather than empirical evaluation.
high positive Position: agentic AI orchestration should be Bayes-consisten... decision quality of agentic control via belief maintenance and updating
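The belief-update-act loop described in this claim can be illustrated with a small sketch. It is not taken from the paper: it assumes each tool's success is a Bernoulli outcome with a Beta prior, and that an action's utility is its expected reward minus a fixed cost. The tool names, costs, and reward are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolBelief:
    """Beta(alpha, beta) belief over a tool's unknown success probability."""
    alpha: float = 1.0  # prior pseudo-count of successes
    beta: float = 1.0   # prior pseudo-count of failures

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, success: bool) -> None:
        # Conjugate update: one observed outcome adds one pseudo-count.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0


def choose_tool(beliefs: dict, reward: float, costs: dict) -> str:
    """Pick the tool with the highest expected utility under current beliefs."""
    return max(beliefs, key=lambda t: beliefs[t].mean() * reward - costs[t])


# Hypothetical orchestration choice between a cheap and an expensive tool.
beliefs = {"search": ToolBelief(), "expert": ToolBelief()}
costs = {"search": 0.1, "expert": 0.4}

first_choice = choose_tool(beliefs, reward=1.0, costs=costs)

# Four observed failures lower the belief in the cheap tool.
for _ in range(4):
    beliefs["search"].update(success=False)

later_choice = choose_tool(beliefs, reward=1.0, costs=costs)
print(first_choice, later_choice)
```

Note that the belief state lives in the orchestrator, not in any model's parameters, which matches the level at which the position paper places the Bayesian requirement.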
Many high-value deployments rely on decisions under uncertainty (for example, which tool to call, which expert to consult, or how many resources to invest).
Stated as a motivating observation in the paper; no quantitative data or sample provided.
high positive Position: agentic AI orchestration should be Bayes-consisten... prevalence of decision-under-uncertainty requirements in high-value deployments
LLMs excel at predictive tasks and complex reasoning tasks.
Asserted in the paper's opening motivation; no empirical evaluation or sample reported in the paper itself.
high positive Position: agentic AI orchestration should be Bayes-consisten... LLM performance on predictive and reasoning tasks
The platform was used to support compound AI use cases at Salesforce, specifically Agentforce (autonomous AI agents) and ApexGuru (AI-powered code analysis).
Paper states the deployment was developed at Salesforce and lists Agentforce and ApexGuru as supported use cases; this is an implementation/adoption claim rather than a quantitative result.
high positive Scalable Inference Architectures for Compound AI Systems: A ... support/adoption by named applications (Agentforce, ApexGuru)
The architecture enables compound AI systems to: (a) scale model invocations in parallel, (b) handle bursty multi-agent workloads, and (c) support rapid model iteration — capabilities essential for operationalizing agentic AI at enterprise scale.
Paper provides case studies (Agentforce, ApexGuru) and operational lessons from production deployment to support these functional claims; the provided text does not include numerical benchmarks for each capability individually nor sample sizes.
high positive Scalable Inference Architectures for Compound AI Systems: A ... scalability of model invocations, ability to handle bursty workloads, support fo...
The modular, platform-agnostic inference architecture integrates serverless execution, dynamic autoscaling, and MLOps pipelines to deliver consistent low-latency inference across multi-component agent workflows.
System design and production deployment description in the paper; claim supported by implementation details and reported production performance (qualitative and operational evidence), but no detailed experimental protocol or sample sizes are given in the provided text.
high positive Scalable Inference Architectures for Compound AI Systems: A ... consistency of low-latency inference (multi-component agent workflows)
The platform delivered 30 to 40% cost savings relative to prior static deployments.
Reported production cost comparisons between the new modular inference architecture and prior static deployments (paper states "30 to 40% cost savings"); the provided text does not include details on cost components, time period, or sample size.
high positive Scalable Inference Architectures for Compound AI Systems: A ... infrastructure / inference cost
The deployment produced up to 3.9x throughput improvement compared to prior static deployments.
Reported production results comparing throughput of the modular inference architecture to prior static deployments (statement in the paper: "up to 3.9x throughput improvement"); no sample size or confidence intervals are given in the provided text.
The production deployment achieved over 50% reduction in tail latency (P95) compared to prior static deployments.
Reported production results comparing the modular inference architecture to prior static deployments (production measurements of P95 tail latency); paper states this was observed in production but does not report sample size or detailed statistical tests in the provided text.
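For readers unfamiliar with how a tail-latency reduction is computed, the sketch below shows one common approach: take the P95 of each latency sample and compare them as a relative reduction. The nearest-rank percentile definition and the latency values are assumptions chosen for illustration; the source does not describe its measurement method or publish raw data.

```python
import math


def percentile_nearest_rank(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of the samples are less than or equal to it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need a non-empty sample and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from 'before' to 'after' (0.5 means a 50% reduction)."""
    return (before - after) / before


# Hypothetical latency samples in milliseconds, not figures from the paper.
static_ms = [100 + 10 * i for i in range(20)]   # 100 ms to 290 ms
modular_ms = [50 + 5 * i for i in range(20)]    # 50 ms to 145 ms

p95_before = percentile_nearest_rank(static_ms, 95)
p95_after = percentile_nearest_rank(modular_ms, 95)
reduction = relative_reduction(p95_before, p95_after)
print(f"P95 {p95_before} ms -> {p95_after} ms, reduction {reduction:.0%}")
```

The same `relative_reduction` helper applies to a cost comparison, while a throughput gain such as "3.9x" is a ratio of after to before.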
We release the benchmark, harness, sweep configurations, and full run corpus.
Statement of artifact release in the paper; verifiable by checking the project's repository or supplementary materials.
high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... availability of released materials (benchmark and run corpus)
These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control.
Synthesis/recommendation drawn from the empirical results on AgentFloor showing where small/mid models suffice and where frontier models have advantage; prescriptive claim rather than a direct empirical measurement.
high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... recommended task routing strategy for agentic systems (model assignment to task ...
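The routing principle can be sketched as a simple tier cutoff. This assumes each task is labeled with a tier on a six-tier ladder like the one AgentFloor describes. The cutoff tier and the model labels are hypothetical, since the source text does not state where the boundary falls.

```python
SMALL_MODEL = "small-open-weight"   # hypothetical label
FRONTIER_MODEL = "frontier"         # hypothetical label


def route(task_tier: int, frontier_from_tier: int = 5, top_tier: int = 6) -> str:
    """Send routine tiers to a small model and the top tiers to a frontier model.

    frontier_from_tier is an assumed cutoff, not a value reported in the paper.
    """
    if not 1 <= task_tier <= top_tier:
        raise ValueError(f"tier must be between 1 and {top_tier}")
    return FRONTIER_MODEL if task_tier >= frontier_from_tier else SMALL_MODEL


# Map every tier on a six-tier ladder to a model class.
assignments = {tier: route(tier) for tier in range(1, 7)}
for tier, model in assignments.items():
    print(tier, model)
```

In practice the cutoff would be set from measured per-tier reliability, not fixed in advance.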
The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability.
Performance breakdown by capability tier on AgentFloor showing frontier (GPT-5) advantage on long-horizon planning/constraint-tracking tasks; both model groups have low absolute reliability on these tasks according to reported results.
high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... performance on long-horizon planning tasks (ability to sustain coordination and ...
We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs.
Empirical evaluation reported in the paper: 16 open-weight models spanning specified parameter sizes, inclusion of GPT-5, and a total of 16,542 scored runs (reported counts).
high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... evaluation runs (model-by-task performance across 16,542 scored runs)
We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints.
Paper describes the design of the benchmark: deterministic, 30 tasks, organized into six tiers covering specified capabilities. This is a descriptive claim about the artifact introduced in the work.
high positive AgentFloor: How Far Up the tool use Ladder Can Small Open-We... benchmark construction (30 tasks, six-tier capability ladder)
The paper proposes five forms of online and offline issuance of RSDM, providing a prototype for creating a globally recognized modern honest money.
Authors' stated contribution in the paper (enumeration of five issuance forms and provision of a prototype); the excerpt explicitly refers to 'five forms'.
high positive RSDM: The Consensus Honest Money in the AI Era number_of_issuance_forms_proposed_and_provision_of_a_prototype
RSDM is an innovative version of Jiaozi (a deposit receipt for base metal coin that emerged in Sichuan, China, about a thousand years ago).
Comparative/analogical claim by the authors linking the proposed design to a historical instrument; no empirical analysis provided in the excerpt.
high positive RSDM: The Consensus Honest Money in the AI Era similarity_between_RSDM_and_historical_Jiaozi
Redeemable Self-Decaying/Devaluing Money (RSDM) is a tokenized commodity money whose essential innovation is to cover the storage fee of metal coins through self-devaluation of the metal weight recorded on the coins' deposit certificate (warehouse receipt).
Design/specification proposed in the paper (conceptual mechanism); no empirical evaluation or sample size reported in the excerpt.
high positive RSDM: The Consensus Honest Money in the AI Era design_feature_RSDM_self-devaluation_to_cover_storage_fee
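One way to picture the self-devaluing mechanism is a recorded weight that shrinks at the storage-fee rate. The sketch below assumes a constant annual rate applied with compounding. The excerpt does not specify the actual decay schedule, so the function and its parameters are hypothetical.

```python
def recorded_weight(initial_grams: float, annual_fee_rate: float,
                    years: float) -> float:
    """Metal weight still redeemable against a receipt after `years`.

    Assumption: the storage fee is charged by shrinking the recorded weight
    at a constant compounding annual rate.
    """
    if not 0 <= annual_fee_rate < 1:
        raise ValueError("annual_fee_rate must be in [0, 1)")
    if years < 0:
        raise ValueError("years must be non-negative")
    return initial_grams * (1 - annual_fee_rate) ** years


# Hypothetical receipt: 100 g of metal, 1% annual storage fee, held two years.
remaining = recorded_weight(100.0, 0.01, 2)
fee_paid_in_metal = 100.0 - remaining
print(f"{remaining:.2f} g redeemable, {fee_paid_in_metal:.2f} g retained as fee")
```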
When AI acts as an agent for cross-border capital pooling and cross-cyclical asset allocation, it needs sound money that can resist fiat-currency depreciation and store long-term value.
Theoretical argument in the paper about functional requirements of AI agents managing cross-border capital; no empirical sample reported in the excerpt.
high positive RSDM: The Consensus Honest Money in the AI Era need_for_sound_money_by_AI_agents_in_cross-border_capital_allocation
In the AI world, however, the medium of exchange tends to be a globally recognized currency.
Author's theoretical assertion / forward-looking claim in the paper; no empirical data or sample provided in the excerpt.
high positive RSDM: The Consensus Honest Money in the AI Era likelihood_of_global_currency_becoming_medium_of_exchange_for_AI
TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.
Author statement about the intended use of the benchmark and transparency practices (publication of provenance and limitations).
high positive Token Arena: A Continuous Benchmark Unifying Energy and Cogn... positioning of TokenArena as a methodological framework with published provenanc...
We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0.
Author statement of artifact release (license explicitly CC BY 4.0).
high positive Token Arena: A Continuous Benchmark Unifying Energy and Cogn... availability of TokenArena artifacts and leaderboard under CC BY 4.0
We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference).
Methodological contribution described by the authors; framework specification and composite metrics defined in the paper.
high positive Token Arena: A Continuous Benchmark Unifying Energy and Cogn... five core axes (output speed, time to first token, workload-blended price, effec...
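The two cost-style composites can be read as a resource total divided by the number of correct answers. The sketch below works under that reading, which is an assumption: the excerpt names the composites but does not give their formulas, and the run figures are hypothetical.

```python
import math


def per_correct(total: float, n_correct: int) -> float:
    """Resource spent per correct answer; infinite if nothing was correct."""
    return total / n_correct if n_correct > 0 else math.inf


# Hypothetical endpoint run: 200 prompts, 150 answered correctly.
n_correct = 150
modeled_energy_joules = 3000.0   # assumed modeled energy for the whole run
total_price_usd = 0.60           # assumed workload-blended price for the run

joules_per_correct = per_correct(modeled_energy_joules, n_correct)
dollars_per_correct = per_correct(total_price_usd, n_correct)
print(f"{joules_per_correct:.1f} J and ${dollars_per_correct:.4f} per correct answer")
```

Dividing by correct answers instead of all answers means an endpoint that is cheap but often wrong scores worse than its raw price suggests.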