The Commonplace

Evidence (13827 claims)

Adoption: 8454 claims
Productivity: 7544 claims
Governance: 6789 claims
Human-AI Collaboration: 6327 claims
Org Design: 4126 claims
Innovation: 4058 claims
Labor Markets: 3520 claims
Skills & Training: 2924 claims
Inequality: 2057 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 749 195 97 889 1979
Governance & Regulation 815 391 188 121 1539
Organizational Efficiency 771 189 124 83 1177
Technology Adoption Rate 624 233 123 96 1084
Research Productivity 410 121 56 331 929
Output Quality 466 177 59 47 749
Decision Quality 320 174 75 42 618
Firm Productivity 435 55 88 20 604
AI Safety & Ethics 214 276 65 33 593
Market Structure 178 166 122 24 495
Task Allocation 206 64 70 31 376
Skill Acquisition 165 57 60 17 299
Innovation Output 201 27 41 18 288
Employment Level 105 51 107 13 278
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 116 63 42 11 232
Firm Revenue 149 46 26 3 224
Inequality Measures 44 122 49 6 221
Task Completion Time 169 29 8 12 219
Worker Satisfaction 89 61 20 12 182
Error Rate 69 91 10 2 172
Regulatory Compliance 76 68 14 5 163
Training Effectiveness 92 19 13 19 145
Wages & Compensation 77 36 25 6 144
Automation Exposure 51 54 22 12 142
Team Performance 86 17 27 9 140
Developer Productivity 94 17 14 6 132
Job Displacement 12 80 20 1 113
Hiring & Recruitment 51 7 8 3 69
Skill Obsolescence 5 45 6 1 57
Creative Output 31 16 7 2 57
Social Protection 27 16 8 2 53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
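As a rough illustration of how the matrix can be read, the sketch below computes each direction's share of a row and picks the dominant direction. The three rows are copied from the table above; the function names are illustrative. Note that a row's stated Total does not always equal the sum of its four direction columns (Firm Productivity lists 604 but the four columns sum to 598), so dividing by the stated Total is an assumption of this sketch, not something the page specifies.

```python
# Claim counts copied from three rows of the matrix above:
# (positive, negative, mixed, null, stated total)
ROWS = {
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Error Rate": (69, 91, 10, 2, 172),
    "Job Displacement": (12, 80, 20, 1, 113),
}

DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the row's stated Total."""
    *by_direction, total = counts
    return {name: n / total for name, n in zip(DIRECTIONS, by_direction)}


def dominant_direction(counts):
    """Return the direction with the largest share in a row."""
    shares = direction_shares(counts)
    return max(shares, key=shares.get)


for outcome, counts in ROWS.items():
    top = dominant_direction(counts)
    print(outcome, top, round(direction_shares(counts)[top], 3))
```

Read this way, Firm Productivity is dominated by positive findings, while Error Rate and Job Displacement are dominated by negative ones.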
We curated real evidence images together with their associated review and product metadata, and identified genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation.
Data curation pipeline combining multimodal large language model (MLLM) filtering and human annotation as described in the methods.
high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... label quality (genuine damaged vs undamaged) via MLLM-assisted filtering and hum...
FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios.
Dataset construction procedure described in the paper specifying source domains (e-commerce, food delivery, travel services).
high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... coverage of real-world domains in dataset
We introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence.
Methodological contribution described in the paper: design and release of a benchmark dataset (FraudBench).
high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... availability of a benchmark dataset for claim-conditioned fraudulent evidence de...
A digital twin analytics platform validation shows that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.
Validation/demonstration reported in the paper using a digital twin analytics platform; platform demonstration claimed to eliminate tool-call hallucination and enable cross-domain configurability via configuration only.
high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... tool-call hallucination elimination and cross-domain configurability without app...
In the same controlled experiment, ontology-grounded parameters reduced domain-identifier hallucination to 0%.
Same controlled experiment (six industry configurations, 72 tool invocations with Qwen3-32B) reported in the paper; ontology-grounded parameter condition produced 0% hallucination.
high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... hallucination rate for domain identifiers (ontology-grounded condition)
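One way to picture ontology-grounded parameters is a check that rejects any identifier the ontology does not contain before a tool runs. The sketch below is a minimal illustration under that assumption; the `ONTOLOGY` contents, `UnknownIdentifier`, and `validate_tool_call` are hypothetical names and are not taken from the paper.

```python
# Hypothetical ontology configuration: the identifiers a tool may accept,
# grouped by entity type. The entries are invented placeholders.
ONTOLOGY = {
    "machine": {"press_01", "lathe_02"},
    "sensor": {"temp_a", "vibration_b"},
}


class UnknownIdentifier(ValueError):
    """Raised when a tool call names something the ontology lacks."""


def validate_tool_call(params, ontology=ONTOLOGY):
    """Check every entity_type -> identifier pair against the ontology."""
    for entity_type, identifier in params.items():
        allowed = ontology.get(entity_type)
        if allowed is None:
            raise UnknownIdentifier(f"unknown entity type: {entity_type}")
        if identifier not in allowed:
            raise UnknownIdentifier(f"{identifier!r} is not a known {entity_type}")
    return params
```

With a gate like this, a hallucinated identifier such as `"press_99"` fails at validation time instead of reaching the tool.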
The architecture is formalized as a three-operation interface contract — resolve, contextualize, annotate — with invariants enforced by an AIOps orchestration layer.
Design specification and formalization presented in the paper (architectural description).
high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... existence of a three-operation interface contract and invariant enforcement
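The three operations named above could be expressed as a typed interface. The sketch below is only a guess at the shape of such a contract: the `Entity` fields, the method signatures, and the toy `InMemoryTool` are all assumptions made for illustration, and the AIOps orchestration layer that enforces invariants is not modeled.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Entity:
    identifier: str
    context: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)


class OntologyTool(Protocol):
    """Hypothetical three-operation contract: resolve, contextualize, annotate."""

    def resolve(self, name: str) -> Entity: ...
    def contextualize(self, entity: Entity) -> Entity: ...
    def annotate(self, entity: Entity, note: str) -> Entity: ...


class InMemoryTool:
    """Toy implementation backed by a dict; not the paper's system."""

    def __init__(self, catalog):
        self._catalog = catalog  # name -> (identifier, context dict)

    def resolve(self, name):
        identifier, _ = self._catalog[name]  # KeyError for unknown names
        return Entity(identifier)

    def contextualize(self, entity):
        # Attach the stored context for the entity's identifier, if any.
        for identifier, context in self._catalog.values():
            if identifier == entity.identifier:
                entity.context = dict(context)
        return entity

    def annotate(self, entity, note):
        entity.annotations.append(note)
        return entity
```

A caller would chain the operations in order, for example `tool.annotate(tool.contextualize(tool.resolve("main press")), "checked")`.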
Embedding manufacturing ontology directly into the AI tool layer as a typed relational configuration enforces semantic constraints at runtime and closes the semantic training gap.
Proposed system architecture described and argued in the paper; validated via demonstrations and experiments described later in the paper.
high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... enforcement of semantic constraints at runtime / closure of semantic gap
This budget-split approach is responsive to the needs of real-world, resource-constrained advertisers committed to equitable distribution of public service outreach via online advertising.
Authors' normative/qualitative conclusion based on the implemented intervention and its practical suitability for government advertisers; no empirical quantification provided in excerpt.
high positive Into the Unknown: Accounting for Missing Demographic Data wh... practical suitability / responsiveness of the intervention for resource-constrai...
The budget split intervention is a valuable approach to addressing ad delivery skew without excluding unknown users.
Authors' empirical finding from the collaboration/intervention (paper reports results from implemented intervention; specific metrics, sample size, and quantitative results are not provided in the excerpt).
high positive Into the Unknown: Accounting for Missing Demographic Data wh... reduction of gender-based ad delivery skew while maintaining inclusion of unknow...
In the absence of platform-provided solutions to skewed ad delivery, advertisers can counteract skew by targeting demographic groups directly.
Descriptive claim about common advertiser strategies; motivated by platform capability gaps (no experimental/sample details in excerpt).
high positive Into the Unknown: Accounting for Missing Demographic Data wh... ability of advertisers to mitigate ad delivery skew via direct demographic targe...
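The excerpts above do not say how the budget split is computed, so the following is only a sketch of one plausible reading: a campaign budget divided across audience segments, with users of unknown demographics kept as their own funded segment. The segment names, the percentages, and `split_budget` are assumptions for illustration.

```python
def split_budget(total_cents, percent_by_segment):
    """Split a budget (in cents) across segments by whole-number percentages."""
    if sum(percent_by_segment.values()) != 100:
        raise ValueError("percentages must sum to 100")
    allocation = {
        segment: total_cents * pct // 100
        for segment, pct in percent_by_segment.items()
    }
    # Integer division can leave a few cents over; give them to the first
    # segment so the allocation always sums to the full budget.
    remainder = total_cents - sum(allocation.values())
    first = next(iter(allocation))
    allocation[first] += remainder
    return allocation


# Hypothetical campaign: the "unknown" segment is funded, not excluded.
plan = split_budget(100_001, {"women": 40, "men": 40, "unknown": 20})
print(plan)
```

Working in integer cents avoids floating-point rounding when the amounts are later summed or reported.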
Sustainable progress requires collaborative integration of humans and machines, rather than replacement.
Normative conclusion/recommendation stated in the paper based on study findings (argument for augmented intelligence over replacement).
high positive Augmented Intelligence: Resolving the AI integration-obsoles... approach to AI-human integration
This research presents the innovative Marketing Intelligence Operations (MIO) Framework and a practical AI Adoption Readiness Scorecard, enabling leaders to manage the operational balance between transformative efficiency improvements and human capital vulnerability.
Paper states that it introduces a new framework and a practical scorecard as deliverables of the research (descriptive claim about the paper's contributions).
high positive Augmented Intelligence: Resolving the AI integration-obsoles... AI adoption readiness / operational management capability
AI-integrated Marketing Intelligence Operations (MIO) quantitatively improves campaign Return on Investment (ROI) by 47%.
Reported as an empirical result from the paper's mixed-methods study (the paper states use of audits, surveys, and NLP analysis to evaluate MIO outcomes).
high positive Augmented Intelligence: Resolving the AI integration-obsoles... campaign Return on Investment (ROI)
Deploying LegalCheck in the Municipality of Amsterdam demonstrated substantial efficiency gains, improved legal consistency, and positive user acceptance.
Summary claim based on the real-world deployment outcomes described in the paper (timing improvements, consistency/factual accuracy statements, and reported positive reception by professionals); specific quantitative metrics and sample sizes are not fully reported in the excerpt.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... efficiency (time), legal consistency, user acceptance
The system produced explainable outputs based on actual regulations and prior cases, providing citations/explainability that support legal reasoning.
Paper describes retrieval from curated legal knowledge bases and generation of outputs grounded in regulations and prior cases during the Amsterdam deployment; presented as a feature of the system and supported by expert review.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... explainability / traceability of generated legal reasoning to source regulations...
LegalCheck uses a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) with curated legal knowledge bases and controlled prompting to retrieve relevant laws and precedents and incorporate case-specific details into coherent drafts.
System architecture and methodology described in the paper (design/implementation claim).
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... n/a (system design / method description)
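The combination of retrieval and case-specific context described above can be sketched in a few lines. This is a toy under stated assumptions: retrieval is plain word overlap instead of whatever retriever LegalCheck uses, the knowledge-base passages are invented placeholders, and `retrieve` and `build_prompt` are hypothetical names.

```python
def retrieve(query, knowledge_base, k=2):
    """Rank passages by word overlap with the query (toy retrieval step)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(case_details, passages):
    """Combine retrieved passages (RAG) with case-specific details (CAG)."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Use only the sources below and cite them by number.\n"
        f"Sources:\n{sources}\n"
        f"Case details:\n{case_details}\n"
        "Draft an advice letter."
    )


# Invented placeholder passages standing in for a curated legal knowledge base.
KB = [
    "placeholder rule about parking permit objections",
    "placeholder rule about noise complaints",
    "placeholder rule about permit fees",
]

passages = retrieve("parking permit objection", KB, k=2)
prompt = build_prompt("Resident objects to a refused parking permit.", passages)
```

Numbering the sources in the prompt is one simple way to keep a generated draft traceable to the passages it drew on.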
Legal professionals found that the system ensured a consistent application of legal standards without replacing human judgment.
Reported qualitative feedback from professionals in the Municipality of Amsterdam deployment and the system design that includes an expert-in-the-loop review; no formal measurement of 'replacement' was reported.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... consistency in application of legal standards and preservation of human oversigh...
Legal professionals found that the system reduced their workload.
Reported user feedback from legal professionals during the Municipality of Amsterdam deployment; qualitative statements that professionals experienced workload reduction (no numeric workload metrics or sample size reported).
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... perceived workload of legal professionals
The system's output captured the vast majority of required legal reasoning—often 80% to 100% of essential content.
Reported coverage statistic from the deployment/evaluation described in the paper (phrased as 'often 80% to 100% of essential content'); exact evaluation method, sample size, and measurement protocol are not provided in the excerpt.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... proportion of essential legal reasoning/content captured in generated drafts
LegalCheck maintained high legal consistency and factual accuracy when generating draft letters.
Evaluation during real-world deployment with expert-in-the-loop review and feedback from legal professionals in the Municipality of Amsterdam; claims of high consistency and factual accuracy are reported but no formal numeric accuracy metric or sample size is provided in the text.
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... legal consistency and factual accuracy of generated letters
LegalCheck produced near-final advice letters in minutes rather than hours.
Reported results from a real-world deployment within the Municipality of Amsterdam; system logs / timing comparisons between human drafting time (hours) and LegalCheck-assisted drafting time (minutes) are described in the paper (no explicit numeric sample size reported).
high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... time to produce advice/objection response letters
We outline a research program for the runtime systems that foundation-model software agents will require.
Paper claims to present a forward-looking research agenda or program (stated in abstract); this is a conceptual contribution rather than an empirical finding.
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... research directions needed for runtime systems for foundation-model software age...
Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, while higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports.
Empirical application described in the abstract: framework applied to a controlled validation task showing systematic variation in episode-package evidence structure across harness levels. The abstract does not report sample size or statistical measures.
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... evidence structure of episode packages produced (types of artifacts: final patch...
We propose a trace-based evaluation protocol that converts each agent run into an auditable episode package.
Methodological proposal described in the abstract proposing a trace-based protocol and an auditable episode package format; no quantitative evaluation details provided in the abstract.
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... auditability of agent runs (availability of trace-based episode packages)
We operationalize the harness through a four-level ladder (H0–H3) that progressively exposes runtime support to the agent.
Design contribution described in the paper (abstract) introducing a four-level ladder (H0–H3) as an operationalization of the harness concept.
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... degree of runtime support exposed to an agent across harness levels
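The harness ladder and episode package described above might be modeled as a level enum plus a record of the artifacts a run produced. The abstract names the artifact types but not which level introduces each one, so the `ARTIFACTS_ADDED` mapping below is a hypothetical assignment, as are the class and field names.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class HarnessLevel(IntEnum):
    H0 = 0
    H1 = 1
    H2 = 2
    H3 = 3


# Hypothetical: which artifacts each level adds on top of the levels below it.
ARTIFACTS_ADDED = {
    HarnessLevel.H0: ["final_patch"],
    HarnessLevel.H1: ["reproduction_log"],
    HarnessLevel.H2: ["failure_attribution", "requirement_checks"],
    HarnessLevel.H3: ["verification_report"],
}


@dataclass
class EpisodePackage:
    run_id: str
    level: HarnessLevel
    artifacts: dict = field(default_factory=dict)  # artifact name -> content

    def expected_artifacts(self):
        """List artifact names expected at this level and all lower levels."""
        names = []
        for lvl in HarnessLevel:
            if lvl <= self.level:
                names.extend(ARTIFACTS_ADDED[lvl])
        return names

    def missing_artifacts(self):
        """Return expected artifacts that the run did not record."""
        return [n for n in self.expected_artifacts() if n not in self.artifacts]
```

An auditor could then flag any run whose `missing_artifacts()` list is non-empty for its declared level.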
Foundation models have transformed automated code generation.
Statement in paper's abstract referring to broad impact of foundation models on automated code generation; likely supported by citations and literature overview within the paper (no sample size or quantitative study reported in the abstract).
high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... ability of foundation models to generate code (automation of coding tasks)
Authorship preservation should be a design priority for AI tools deployed in identity-relevant, behavior-dependent tasks.
Authors' recommendation based on experimental results showing negative motivational and behavioral consequences of delegating authorship to LLMs despite improved objective goal quality.
high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... design recommendation (no empirical outcome measured)
Mediation analyses identified psychological ownership as the mechanism: it mediated the authorship effect on every downstream motivational and behavioral outcome, while objective goal quality did not.
Mediation analyses reported in the preregistered experiment (authors tested psychological ownership and objective goal quality as mediators of authorship effects on multiple downstream outcomes); preregistered N = 470.
high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... mediating effect of psychological ownership on authorship => motivational and be...
At two-week follow-up, 72.8% of self-authored participants had acted on two or more of their goals, compared to 46.6% in the LLM condition.
Behavioral follow-up measure collected two weeks after the intervention in the preregistered experiment; percentages reported in the paper/abstract. (Follow-up completion N not specified in the abstract.)
high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... proportion of participants who acted on two or more goals within two weeks (beha...
LLM-generated goals scored higher on SMART criteria (specificity, measurability, achievability, relevance, and time-boundedness).
Preregistered randomized experiment comparing self-authored vs LLM-authored goals derived from a personal reflection; reported effect size d = 2.26; total preregistered N = 470.
high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... SMART criteria score (objective goal quality)
As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior.
Intervention experiments applying PGLA to model decoding on IMAVB; reported consistent improvements in the models' tendency to reject misleading premises after logit adjustment guided by probes.
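The excerpt does not give the PGLA formula, so the sketch below shows only the general idea of a probe-guided logit adjustment under a strong simplifying assumption: a probe produces a mismatch score, and that score, scaled by a strength `alpha`, is added to the logit of a rejection option before decoding. The token names, `alpha`, and both functions are hypothetical.

```python
def probe_guided_logit_adjustment(logits, probe_score, reject_token, alpha=2.0):
    """Return a copy of logits with alpha * probe_score added to reject_token."""
    adjusted = dict(logits)
    adjusted[reject_token] = adjusted.get(reject_token, 0.0) + alpha * probe_score
    return adjusted


def greedy_pick(logits):
    """Pick the highest-scoring option, as greedy decoding would."""
    return max(logits, key=logits.get)


# Invented numbers: the model slightly prefers answering over rejecting.
logits = {"answer": 2.0, "reject_premise": 1.0}

# A confident mismatch probe tips the choice toward rejection.
strong = probe_guided_logit_adjustment(logits, probe_score=0.9, reject_token="reject_premise")
# A weak probe signal leaves the original choice in place.
weak = probe_guided_logit_adjustment(logits, probe_score=0.1, reject_token="reject_premise")
```

In this toy setting the adjustment changes the greedy choice only when the probe signal is strong enough to outweigh the original logit gap.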
We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension.
Description of new benchmark introduced in paper: 500 clips, 2x2 design (vision vs audio × standard vs misleading premises); used to measure conflict detection independently of standard multimodal QA.
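The 2x2 design can be enumerated directly, and scoring each cell separately is what allows performance on misleading premises to be read apart from ordinary comprehension. The sketch below assumes a simple record format of `(modality, premise, correct)`; that format and the function name are illustrative, not the benchmark's actual schema.

```python
from itertools import product

MODALITIES = ("vision", "audio")
PREMISES = ("standard", "misleading")

# The four cells of the 2x2 design.
CELLS = list(product(MODALITIES, PREMISES))


def accuracy_by_cell(records):
    """Group (modality, premise, correct) records into per-cell accuracy."""
    totals = {cell: [0, 0] for cell in CELLS}  # cell -> [correct, seen]
    for modality, premise, correct in records:
        totals[(modality, premise)][0] += int(correct)
        totals[(modality, premise)][1] += 1
    # Cells with no records report None instead of dividing by zero.
    return {cell: (c / n if n else None) for cell, (c, n) in totals.items()}
```

Comparing a model's accuracy in the two misleading cells against the two standard cells then gives a per-modality view of conflict detection.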
The Agent-First paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
Conceptual argument and mapping presented in the paper asserting interoperability/orthogonality with transport-layer standards (e.g., MCP).
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... compatibility_with_transport_layer_standards
Agent-First APIs improve autonomous error recovery by 5.8x (compared to optimized CRUD baselines).
Reported comparative experiments on 50 real operational tasks measuring autonomous error recovery capability.
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... autonomous_error_recovery
Agent-First APIs reduce required human interventions by 72.7% (compared to optimized CRUD baselines).
Same set of comparative experiments on 50 real operational tasks reported in the paper.
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... required_human_interventions
Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve an 88% end-to-end task success rate versus 64% for optimized CRUD baselines (a 37.5% relative improvement, or 24 percentage points).
Empirical comparative experiments reported in the paper on 50 real operational tasks, comparing Agent-First APIs to optimized CRUD baselines.
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... end-to-end_task_success_rate
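The 88% versus 64% comparison above can be expressed as either an absolute or a relative gain, and the two differ substantially. The short sketch below works through the arithmetic; the function names are illustrative.

```python
def absolute_gain(new_rate, base_rate):
    """Difference between two rates, in the same units as the inputs."""
    return new_rate - base_rate


def relative_gain(new_rate, base_rate):
    """Improvement expressed as a fraction of the baseline rate."""
    return (new_rate - base_rate) / base_rate


# Success rates reported above: 88% for Agent-First, 64% for the CRUD baseline.
print(absolute_gain(0.88, 0.64))  # about 0.24, i.e. 24 percentage points
print(relative_gain(0.88, 0.64))  # about 0.375, i.e. 37.5% over the baseline
```

The reported 37.5% figure matches the relative reading, since 24 points is 37.5% of the 64% baseline.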
The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains.
Reported production implementation and deployment statistics (platform with 85 registered tools spanning 6 business domains).
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... deployment_of_paradigm_on_production_SaaS_platform
We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation.
Design and specification presented in the paper (proposed architecture and components).
high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... proposed_API_paradigm_and_components
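The six verbs and the contents of the Normalized Tool Contract listed above suggest a data shape along the following lines. The field names, types, and the `pick_next_action` policy are assumptions made for illustration; the paper's actual schema is not given in the excerpt, and the governance pipeline is not modeled here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verb(Enum):
    """The six phases named in the Six-Verb Semantic Protocol."""

    SEARCH = "search"
    RESOLVE = "resolve"
    PREVIEW = "preview"
    EXECUTE = "execute"
    VERIFY = "verify"
    RECOVER = "recover"


@dataclass
class NormalizedToolContract:
    """Hypothetical response envelope carrying decision-support metadata."""

    verb: Verb
    result: dict
    confidence: float
    evidence_chain: list = field(default_factory=list)
    suggested_next_actions: list = field(default_factory=list)


def pick_next_action(contract, min_confidence=0.8):
    """Toy policy: follow the first suggestion, or recover on low confidence."""
    if contract.confidence < min_confidence:
        return Verb.RECOVER
    if contract.suggested_next_actions:
        return contract.suggested_next_actions[0]
    return None
```

An agent loop could call `pick_next_action` after every tool response, which is one way the suggested next actions and confidence scores could steer autonomous recovery.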
LLM assistance can produce more correct and functional code than participant-generated solutions.
Comparative analysis of generated solutions reported in the paper (no sample size for solutions is explicitly stated in the abstract). The paper states that LLM-assisted solutions were more correct and functional.
high positive "Like Taking the Path of Least Resistance": Exploring the Im... correctness and functionality of generated code
Qualitative analysis of participants' interactions and interviews revealed four different human-LLM collaboration modes supporting various problem-solving strategies.
Qualitative analysis of interaction logs and retrospective interviews from the study participants (N=20) reported in the paper; identification of four collaboration modes described.
high positive "Like Taking the Path of Least Resistance": Exploring the Im... types of collaboration modes
We conducted a within-subject study followed by retrospective interviews with programmers (N=20).
Stated methods in the paper: within-subject experimental design plus retrospective interviews; sample size explicitly given as N=20.
Organizations classified as 'Proactive Integrators' can reduce the risk of obsolescence by up to 53%.
Subgroup finding reported in the study (reduction estimate for organizations labeled 'Proactive Integrators'); specific subgroup sample not provided in abstract.
high positive The AI-engineering imperative - Navigating synergy and obsol... reduction in risk of skills obsolescence
AI-assisted engineering teams can achieve a 24% increase in productivity.
Empirical finding reported by the study, derived from the mixed-methods analysis (survey of 320 orgs, Delphi with 40 experts, and case studies of 5 industries as described in abstract).
high positive The AI-engineering imperative - Navigating synergy and obsol... increase in productivity of AI-assisted engineering teams
Entities that strategically implement AI can enhance their innovation cycles by up to 30%.
Statement in paper (presented as a forecast/estimate; no specific study or sample detailed in abstract).
high positive The AI-engineering imperative - Navigating synergy and obsol... improvement in innovation cycle speed/efficiency
Frontier directions include differentiable token budgets and dynamic markets to lay the theoretical foundation for scalable next-generation agent systems.
Paper's conclusion/recommendations based on surveyed literature and identified gaps; presented as proposed future research directions rather than empirically validated findings.
high positive Token Economics for LLM Agents: A Dual-View Study from Compu... proposal of differentiable token budgets and dynamic markets as key research fro...
Security: Internalizing adversarial threats as endogenous economic constraints.
Authors argue for modeling adversarial threats within the economic/tokens framework as endogenous constraints; conceptual/theoretical claim from the survey.
high positive Token Economics for LLM Agents: A Dual-View Study from Compu... treatment of adversarial threats as endogenous constraints in token economics mo...
Macro-level (Agent Ecosystems): Addressing congestion externalities and pricing via mechanism design.
Paper posits mechanism-design approaches to tackle congestion externalities and pricing in agent ecosystems; conceptual proposal based on economic theory and literature synthesis.
high positive Token Economics for LLM Agents: A Dual-View Study from Compu... mitigation of congestion externalities and improved pricing in agent ecosystems
Meso-level (Multi-Agent Systems): Minimizing collaboration friction using transaction cost and principal-agent theories.
Authors propose applying transaction-cost and principal-agent frameworks to multi-agent token interactions; presented as a theoretical taxonomy/synthesis without reported empirical sample.
high positive Token Economics for LLM Agents: A Dual-View Study from Compu... reduction of collaboration friction in multi-agent systems through economic-theo...
Micro-level (Single Agent): Optimizing budget-constrained factor substitution via neoclassical firm theory.
The paper asserts a micro-level taxonomy using neoclassical firm theory to model single-agent token-budget optimization; presented as conceptual/theoretical mapping rather than empirical test.
high positive Token Economics for LLM Agents: A Dual-View Study from Compu... ability to optimize budget-constrained factor substitution at single-agent level
We conceptualize tokens as production factors, exchange mediums, and units of account.
Paper provides a conceptual taxonomy framing tokens in three economic roles; based on theoretical argumentation and literature synthesis.
high positive Token Economics for LLM Agents: A Dual-View Study from Compu... conceptual framing of tokens into three economic roles