The Commonplace

Evidence (4049 claims)

Adoption (5126 claims)
Productivity (4409 claims)
Governance (4049 claims)
Human-AI Collaboration (2954 claims)
Labor Markets (2432 claims)
Org Design (2273 claims)
Innovation (2215 claims)
Skills & Training (1902 claims)
Inequality (1286 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

| Outcome | Positive | Negative | Mixed | Null | Total |
| --- | ---: | ---: | ---: | ---: | ---: |
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | 3 | | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | | 23 |
| Labor Share of Income | 7 | 4 | 9 | | 20 |
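Most rows' direction counts sum exactly to the row total, but several fall short (e.g., Governance & Regulation: 365 + 171 + 113 + 54 = 703 of 713), suggesting some claims carry a direction label not shown in the matrix. A minimal sketch for checking this, assuming blank cells mean zero (illustrative subset of rows, not the full matrix):

```python
# Check whether the four direction counts account for each row's total.
# Rows: (outcome, positive, negative, mixed, null, total); blank cells in
# the source table are read as 0. Subset of rows for illustration.
rows = [
    ("Governance & Regulation", 365, 171, 113, 54, 713),
    ("Output Quality",          228,  61,  23, 25, 337),
    ("Firm Revenue",             96,  30,  22,  0, 148),
    ("Error Rate",               42,  47,   6,  0,  95),
]

for outcome, pos, neg, mixed, null, total in rows:
    residual = total - (pos + neg + mixed + null)
    # A positive residual means some claims have a direction not listed here.
    print(f"{outcome}: accounted {total - residual}/{total}, residual {residual}")
```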
Filter: Governance
Jurisdictions are taking divergent policy approaches (e.g., U.S. emphasis on innovation/competition, EU emphasis on rights/standards like GDPR), producing fragmented digital trade rules.
Comparative legal and policy analysis of existing national/regional rules and international instruments (examples cited include GDPR and U.S. policy orientations); descriptive, with specific regulatory texts analyzed.
high · negative · Path Analysis of Digital Economy and Reconstruction of Inter... · regulatory fragmentation / interoperability of digital trade rules
AI creates novel non-tariff frictions, e.g., pressures toward data localization and regulatory requirements for algorithmic transparency.
Comparative legal and policy analysis of emerging regulations (e.g., data localization laws, algorithmic regulation initiatives) and illustrative jurisdictional examples.
high · negative · Path Analysis of Digital Economy and Reconstruction of Inter... · non-tariff regulatory frictions (data-flow restrictions, transparency/compliance...
Vietnam's civil-law features—statutory specificity, formal procedures, and constitutional principles like legal certainty and fairness—make straightforward AI deployment legally fraught.
Close textual analysis of Vietnam's statutes, constitutional provisions, and administrative procedures (doctrinal legal analysis); no quantitative sample.
high · negative · ARTIFICIAL INTELLIGENCE AND ADMINISTRATIVE GOVERNANCE: A CRI... · legal compatibility of AI deployment (degree of legal obstacles to deployment)
Automated decisions complicate assigning responsibility and hinder judicial and administrative reviewability.
Doctrinal examination of accountability and review mechanisms in administrative law plus comparative institutional analysis of automated decision-making governance.
high · negative · ARTIFICIAL INTELLIGENCE AND ADMINISTRATIVE GOVERNANCE: A CRI... · clarity of accountability (ability to assign responsibility) and effectiveness o...
Opaque AI models risk violating notice, reason-giving, and appeal rights protected under administrative due process.
Analysis of procedural due-process requirements (notice, reason-giving, appeal) in Vietnam's legal framework and assessment of opacity issues in algorithmic systems; qualitative reasoning, no empirical testing.
high · negative · ARTIFICIAL INTELLIGENCE AND ADMINISTRATIVE GOVERNANCE: A CRI... · compliance with due-process requirements (notice, reasons, appealability)
Provider incentives may be misaligned (e.g., optimizing for engagement or test performance instead of durable learning), requiring contracts, regulation, or purchaser design to align incentives.
Consensus from interdisciplinary workshop (50 scholars) highlighting incentive risks and market-design considerations; descriptive, not empirical.
high · negative · The Future of Feedback: How Can AI Help Transform Feedback t... · provider optimization metrics (engagement/test performance) vs. durable learning...
Extensive learner data needed to personalize AI feedback raises privacy and data-governance concerns (consent, storage, usage).
Qualitative consensus from workshop participants (50 scholars) noting data-collection requirements and governance risks; no empirical governance studies included.
high · negative · The Future of Feedback: How Can AI Help Transform Feedback t... · volume/type of learner data collected; privacy risk indicators; compliance with ...
Automated feedback may not capture pedagogical nuances expert teachers use (motivation, socio-emotional cues, complex reasoning), limiting pedagogical fit.
Expert syntheses from the workshop of 50 scholars highlighting limits of automation relative to expert teacher judgment; no empirical comparisons presented.
high · negative · The Future of Feedback: How Can AI Help Transform Feedback t... · coverage of socio-emotional and complex-reasoning cues in feedback; corresponden...
AI-generated feedback can be incorrect, misleading, or misaligned with learning objectives; assessing feedback quality is nontrivial.
Repeated concern raised across workshop participants (50 scholars) in qualitative synthesis; noted as a substantive risk and open challenge rather than empirically quantified here.
high · negative · The Future of Feedback: How Can AI Help Transform Feedback t... · feedback factual correctness; alignment with stated learning objectives; rate of...
Exposure to top-rated exemplar papers produced large reductions in interquartile range (IQR) of estimates—within converging measure families, IQR fell by roughly 80–99%.
Stage 3 of the protocol: after agents were shown top-rated exemplar papers, measured within-measure-family IQRs of agents' estimates decreased substantially; reported quantitative reduction range of 80%–99% within measure families that converged.
high · negative · Nonstandard Errors in AI Agents · percentage reduction in interquartile range (IQR) of effect estimates within mea...
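For scale, the reported IQR reduction can be reproduced mechanically; a minimal sketch with hypothetical agent estimates (illustrative values, not the paper's data):

```python
import numpy as np

def iqr(estimates):
    """Interquartile range: spread between the 25th and 75th percentiles."""
    q75, q25 = np.percentile(estimates, [75, 25])
    return q75 - q25

# Hypothetical effect estimates within one measure family, before and
# after exposure to top-rated exemplar papers.
before = [0.02, 0.08, 0.15, 0.31, 0.45, 0.60]
after = [0.11, 0.12, 0.12, 0.13, 0.13, 0.14]

reduction = 1 - iqr(after) / iqr(before)
print(f"IQR {iqr(before):.3f} -> {iqr(after):.3f}, reduction {reduction:.0%}")
# Prints a ~97% reduction, within the 80-99% range reported above.
```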
Integration costs—domain modeling, human-in-the-loop protocols, and regulatory/liability frameworks—are significant barriers to deployment.
Conceptual assessment of operational and regulatory requirements; no quantified cost studies provided.
high · negative · Argumentative Human-AI Decision-Making: Toward AI Agents Tha... · implementation cost and organizational burden for deploying argumentative AI sys...
Argumentation frameworks (AFs) and LLMs may be gamed or misled; adversaries may exploit these systems, leading to strategic argumentation or manipulation.
Conceptual security/adversarial concern based on known vulnerabilities in ML and strategic behavior; no adversarial tests reported.
high · negative · Argumentative Human-AI Decision-Making: Toward AI Agents Tha... · system vulnerability metrics / susceptibility to adversarial manipulation
Faithful extraction—aligning LLM-extracted arguments with formal AF primitives and ensuring fidelity to source evidence—is a key technical challenge.
Paper's explicit identification of failure modes and alignment issues; grounded in documented limitations of IE/LLMs (no empirical quantification here).
high · negative · Argumentative Human-AI Decision-Making: Toward AI Agents Tha... · fidelity/alignment error rate between extracted elements and source evidence
Computational argumentation approaches have required heavy feature engineering and domain-specific knowledge to be effective.
Conceptual claim grounded in prior work and practical experience reported in the literature; no quantitative cost estimates provided in the paper.
high · negative · Argumentative Human-AI Decision-Making: Toward AI Agents Tha... · engineering cost / domain modeling effort required for AF-based systems
Automation bias (the human tendency to defer to automated outputs) compounds the risk that generative legal AI (GLAI) errors become embedded in legal processes.
Behavioral literature review on automation bias and trust in AI systems; applied to legal-context vignettes. No primary empirical test within the paper.
high · negative · Why Avoid Generative Legal AI Systems? Hallucination, Overre... · likelihood of human operators deferring to GLAI outputs (automation bias effect)
A key architectural risk is interoperability failure and fragmentation across vendors and protocols in agent ecosystems.
Comparative analysis with IoT and other platform histories showing vendor/protocol fragmentation; argument is conceptual and illustrative rather than empirically measured for future agent ecosystems.
high · negative · The Internet of Physical AI Agents: Interoperability, Longev... · degree of interoperability and fragmentation across vendors/protocols
Safety-critical domains such as disaster response, healthcare, industrial automation, and mobility will be affected, and failures there carry high social and economic costs.
Domain examples and policy reasoning; draws on general knowledge about those sectors and potential harms; no new empirical damage quantification provided in the paper.
high · negative · The Internet of Physical AI Agents: Interoperability, Longev... · social and economic costs of failures in safety‑critical domains
IoT digitized perception at scale but exposed limitations such as fragmentation, weak security, limited autonomy, and poor sustainability.
Historical and comparative analysis of IoT deployments and literature cited illustratively in the paper; qualitative evidence from prior IoT incidents and ecosystem studies rather than new empirical data.
high · negative · The Internet of Physical AI Agents: Interoperability, Longev... · levels of fragmentation, security robustness, autonomy, and sustainability in Io...
A single malicious or compromised LLM agent with high stubbornness and persuasive power can trigger a persuasion cascade that steers the collective opinion of a multi-agent LLM system (MAS).
Theoretical analysis using the Friedkin–Johnsen (FJ) opinion-formation model (analysis of fixed points and influence propagation) plus simulation experiments mapping LLM-MAS interactions to FJ dynamics across multiple network topologies and attacker profiles. (Paper reports simulation results but does not provide exact sample sizes in the provided summary.)
high · negative · Don't Trust Stubborn Neighbors: A Security Framework for Age... · extent of adversarial sway / shift in collective opinion (final consensus and op...
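The Friedkin–Johnsen dynamics referenced here are compact enough to simulate directly. Below is a minimal sketch (toy fully connected network and made-up parameters, not the paper's topologies or attacker profiles) showing how a single fully stubborn agent drags the honest agents' fixed-point opinions toward its own:

```python
import numpy as np

# Friedkin-Johnsen update: x(t+1) = S @ W @ x(t) + (I - S) @ x0, where W is
# a row-stochastic influence matrix and S = diag(susceptibility).
# Stubbornness is 1 - susceptibility; agent 0 (the attacker) is fully stubborn.
n = 5
W = np.full((n, n), 1.0 / n)                          # uniform listening weights
susceptibility = np.array([0.0, 0.9, 0.9, 0.9, 0.9])  # attacker never updates
S = np.diag(susceptibility)
x0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])              # attacker pushes opinion 1.0

x = x0.copy()
for _ in range(200):                                  # iterate to the fixed point
    x = S @ W @ x + (np.eye(n) - S) @ x0

print("final opinions:", np.round(x, 3))
# Honest agents converge near 0.64 despite starting at 0.0: a persuasion cascade.
```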
Static ACLs evaluate deterministic rules that ignore partial execution paths and therefore can only capture a subset of organizational constraints.
Formal argument and examples showing static ACLs map to Policy functions that do not depend on partial_path; illustrative limitations presented.
high · negative · Runtime Governance for AI Agents: Policies on Paths · coverage of organizational constraints by static ACLs (proportion of constraints...
Runtime evaluation imposes additional compute, latency, logging, and engineering costs that increase the marginal cost of deploying agents.
Operational discussion in the paper outlining additional runtime compute and logging requirements; cost implications argued qualitatively; no empirical cost measurements provided.
high · negative · Runtime Governance for AI Agents: Policies on Paths · marginal deployment cost (compute/latency/engineering overhead)
Prompt-level instructions and static access control lists (ACLs) are limited special cases of a more general runtime policy-evaluation framework and cannot, in general, enforce path-dependent rules.
Formalization showing prompt/system messages and static ACLs map to restricted forms of the Policy(agent_id, partial_path, proposed_action, org_state) function; logical proof/argument in the paper and illustrative counterexamples.
high · negative · Runtime Governance for AI Agents: Policies on Paths · ability to detect/enforce path-dependent policy violations (yes/no / coverage of...
LLM-based agent behavior is non-deterministic and path-dependent: an agent's safety/compliance risk depends on the entire execution path, not just the current prompt or single action.
Formal/abstract execution model defined in the paper (states, actions, execution paths) and conceptual arguments/illustrative examples showing how earlier states/actions affect later behavior; no large-scale empirical dataset reported.
high · negative · Runtime Governance for AI Agents: Policies on Paths · path-dependent compliance/safety risk (probability of policy violation condition...
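The Policy(agent_id, partial_path, proposed_action, org_state) signature is the paper's; the concrete rule below is a hypothetical sketch of a path-dependent constraint that no static ACL can express, since the decision turns on what the agent did earlier in the same execution path:

```python
# Hypothetical path-dependent rule: an agent may call external APIs, but not
# after reading customer PII earlier in the same execution path. The function
# signature follows the paper's Policy(agent_id, partial_path, proposed_action,
# org_state); the rule and data shapes are illustrative.

def policy(agent_id, partial_path, proposed_action, org_state):
    """Return True if the proposed action is allowed given the path so far."""
    if proposed_action == "call_external_api" and "read_customer_pii" in partial_path:
        return False  # denial depends on earlier steps, not on the action alone
    return proposed_action in org_state["allowed_actions"]

org_state = {"allowed_actions": {"read_customer_pii", "call_external_api"}}

# A static ACL grants or denies "call_external_api" once and for all;
# runtime evaluation distinguishes the two paths below.
print(policy("agent-7", [], "call_external_api", org_state))                     # True
print(policy("agent-7", ["read_customer_pii"], "call_external_api", org_state))  # False
```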
Qualitative case studies show modality-specific failures, such as correct entity recognition paired with a wrong factual attribute.
Paper includes qualitative examples/case studies from the benchmark where models identify entities in images correctly but produce incorrect time-sensitive attributes (e.g., current officeholder or company status).
high · negative · V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... · case-study examples of modality-specific failure modes
Real-world deployment will require representative data coverage and online adaptation despite the method’s robustness mechanisms.
Authors' discussion/limitations section: theoretical requirements for persistently exciting/representative trajectories for DeePC and recommendation for online adaptation and continual data collection for deployment.
high · negative · Data-driven generalized perimeter control: Zürich case study · data representativeness and need for online adaptation (deployment readiness/ris...
Agent performance degrades markedly as environment complexity, stochasticity, and non-stationarity increase, revealing core limitations of current LLM-based agents for long-horizon, multi-factor decision problems.
Experimental results across progressively harder RetailBench environments showing performance falloff for multiple LLMs under increased task complexity and non-stationarity.
high · negative · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · overall agent performance across increasing environment complexity (e.g., fulfil...
Behavioral memorization probe (TS‑Guessing) signaled memorization above chance for 72.5% of prompts across all models and items.
Experiment 3 — TS‑Guessing behavioral probe applied exhaustively to all 513 MMLU questions × six models (total prompts = 513×6); statistical thresholds used to classify above-chance memorization signals, yielding 72.5% of prompts flagged.
high · negative · Are Large Language Models Truly Smarter Than Humans? · fraction of prompt-model pairs with statistically significant memorization signa...
Paraphrase / indirect-reference diagnostic: on a 100-question subset, average accuracy dropped by 7.0 percentage points under indirect referencing.
Experiment 2 — paraphrase/indirect-reference diagnostic applied to a 100-question subset of MMLU; measured delta between original and paraphrased question accuracy averaged to 7.0 percentage points.
high · negative · Are Large Language Models Truly Smarter Than Humans? · mean accuracy drop (percentage points) under paraphrase/indirect prompts
STEM items show higher lexical contamination (18.1%) than the overall rate of 13.8%.
Category-level results from Experiment 1 (lexical matching) on the MMLU dataset (513 questions), aggregated by subject domain to compute an 18.1% contamination rate for STEM categories.
high · negative · Are Large Language Models Truly Smarter Than Humans? · category-level contamination prevalence (STEM)
Overall lexical contamination: 13.8% of MMLU items show evidence of exposure in training data.
Experiment 1 — lexical contamination detection pipeline that searched model training–era public corpora and the open web for literal or near-literal occurrences of the 513 MMLU questions/answers; per-item contamination flags aggregated to produce the 13.8% figure.
high · negative · Are Large Language Models Truly Smarter Than Humans? · contamination prevalence (fraction of benchmark items with lexical matches)
Public leaderboards overstate modern LLM capabilities because substantial portions of benchmark QA items appear in (or are memorized from) training data, inflating measured accuracy.
Multi-method contamination audit across six frontier LLMs (GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, Qwen3-235B) evaluated on the MMLU benchmark (513 questions, 57 subjects), using lexical matching, paraphrase sensitivity, and behavioral memorization probes that together show systematic leakage.
high · negative · Are Large Language Models Truly Smarter Than Humans? · inflation of measured benchmark accuracy / overstatement of model capability
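The audit's headline rates imply concrete counts; a back-of-envelope sketch (counts reconstructed from the reported percentages, not taken from the paper):

```python
# Reconstruct approximate counts from the reported rates.
n_questions = 513
n_models = 6

total_prompts = n_questions * n_models            # 3078 prompt-model pairs
flagged_prompts = round(0.725 * total_prompts)    # ~2232 flagged by TS-Guessing
contaminated_items = round(0.138 * n_questions)   # ~71 items with lexical matches
stem_rate, overall_rate = 0.181, 0.138            # STEM vs. overall contamination

print(f"prompt-model pairs: {total_prompts}, flagged: {flagged_prompts}")
print(f"contaminated items: {contaminated_items} of {n_questions}")
print(f"STEM contamination exceeds overall by {stem_rate - overall_rate:.1%}")
```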
Proactive AI at national scale amplifies concerns around transparency, accountability, privacy, and potential misuse, necessitating robust regulatory and ethical frameworks.
Normative and ethical analysis in the paper, supported by general literature on large-scale AI governance; no empirical assessment of regulatory effectiveness in Russia included.
high · negative · DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... · risks to transparency, accountability, privacy and potential for misuse
Randomized controlled trials and longitudinal evaluations are scarce; few studies measure patient-relevant outcomes or economic impacts.
Literature synthesis noting scarcity of RCTs and long-term observational studies, and absence of widespread patient-outcome and cost-effectiveness evaluations in existing publications.
high · negative · Human-AI interaction and collaboration in radiology: from co... · number of RCTs/longitudinal studies, frequency of patient outcome and economic o...
Many published studies focus on standalone algorithm accuracy rather than clinician–AI joint performance in routine workflows.
Review of the literature categorizing study designs (preponderance of algorithm development/validation studies, fewer reader-in-the-loop, simulation, or deployment studies).
high · negative · Human-AI interaction and collaboration in radiology: from co... · proportion of studies reporting standalone algorithm metrics versus those report...
Ethical and legal issues—patient privacy, algorithmic bias, intellectual property, and equitable access—pose risks to AI deployment in drug development.
Ethics and legal analyses, policy reports, and documented case examples collated in the review that identify these recurring concerns.
high · negative · From Algorithm to Medicine: AI in the Discovery and Developm... · ethical/legal risk incidence; privacy breaches; bias outcomes; access inequities
Regulatory uncertainty about validation standards and liability for AI tools raises investment risk and may slow deployment.
Regulatory and policy reports included in the narrative review describing evolving standards and open questions about validation, explainability, and liability for ML-based tools.
high · negative · From Algorithm to Medicine: AI in the Discovery and Developm... · regulatory clarity; investment risk and deployment timelines
Adoption of AI in drug R&D requires high upfront investment in data curation, compute infrastructure, and specialized talent.
Industry reports and economic analyses summarized in the review reporting capital and operational needs for building AI capabilities; qualitative synthesis rather than quantitative costing across firms.
high · negative · From Algorithm to Medicine: AI in the Discovery and Developm... · fixed upfront costs (data curation, compute, hiring/training)
Limited transparency and interpretability of many AI algorithms (black-box models) complicate clinical and regulatory trust and adoption.
Regulatory reports, methodological critiques, and case examples in the review highlighting interpretability concerns and their impact on clinical/regulatory acceptance.
high · negative · From Algorithm to Medicine: AI in the Discovery and Developm... · clinical/regulatory acceptance, trust, and adoption rates; explainability metric...
Performance of AI models in drug R&D depends on large, high-quality, and representative biomedical datasets; dataset bias or gaps substantially undermine model performance and generalizability.
Methodological literature and case studies cited in the review documenting failures or limited generalization when training data are biased, sparse, or non-representative; thematic synthesis rather than pooled quantification.
high · negative · From Algorithm to Medicine: AI in the Discovery and Developm... · model performance/generalizability across populations and contexts
High-quality, standardized, interoperable data (clean, annotated, connected across modalities) is a critical limiting factor for translating AI capability into sustained impact.
Conceptual emphasis and domain knowledge argument in the editorial; no empirical measurement of data quality's causal effect included.
high · negative · AI as the Catalyst for a New Paradigm in Biomedical Research · ability to translate AI capability into sustained impact (dependent on data qual...
The paper's evidence base is limited by early-stage projects with little longitudinal outcome data and by dependence on publicly available project information, which may be incomplete or biased.
Methods and limitations explicitly stated in the paper (qualitative review; reliance on secondary sources; two case studies; absence of large-scale quantitative evaluation).
high · negative · Decentralized Autonomous Organizations in the Pharmaceutical... · completeness and robustness of empirical evidence supporting claims about DAO ef...
Data protection and privacy (especially sensitive health data) complicate open-data DAO models.
Conceptual analysis referencing privacy/data-protection concerns for health data (e.g., GDPR-like regimes); no empirical evaluation of privacy breaches within DAOs provided.
high · negative · Decentralized Autonomous Organizations in the Pharmaceutical... · data privacy risk level, feasibility of open-data sharing for clinical data
Significant barriers remain for DAOs in pharma: regulatory uncertainty about tokenized securities, IP fractionalization, and clinical data sharing.
Legal/regulatory analysis and literature synthesis highlighting unclear classifications and open regulatory questions; no new regulatory rulings provided.
high · negative · Decentralized Autonomous Organizations in the Pharmaceutical... · regulatory clarity/status for tokenized securities and IP models; legal risk ind...
Pharmaceutical R&D faces rising costs, long approval timelines, supply-chain inefficiencies, and low patient involvement.
Literature review and synthesis of well-documented industry challenges cited in the paper (secondary sources); no new primary data presented in this study.
high · negative · Decentralized Autonomous Organizations in the Pharmaceutical... · R&D cost per approved drug, average time-to-approval, supply-chain performance m...
There is limited reporting on privacy safeguards, model interpretability, and external validity in the reviewed studies.
Review observed sparse reporting on privacy protections, interpretability analyses and external validation across included studies.
high · negative · Deep technologies and safer gambling: A systematic review. · frequency/extent of reporting on privacy safeguards and interpretability (qualit...
Misclassification risks (false positives and false negatives) are a common limitation and can harm consumers by incorrectly restricting access or by failing to detect harm.
Review notes model error rates reported via precision/recall and AUC; discusses harms from false positives/negatives as a recurrent limitation in the literature.
high · negative · Deep technologies and safer gambling: A systematic review. · model error rates and downstream consumer harm risk (false positive/negative imp...
Privacy and ethical concerns are substantial: continuous monitoring and sensitive behavioural inference raise privacy, surveillance, and misuse risks.
Multiple included studies and the review discussion explicitly identify privacy, ethical, and potential misuse concerns with continuous monitoring and behavioural inference.
high · negative · Deep technologies and safer gambling: A systematic review. · privacy/ethical risk (qualitative concerns reported across studies)
Advanced technologies' complexity and lack of explainability create risks for audit reliability and professional judgement.
Findings from literature synthesis and professional/regulatory perspectives included in the review; presented as an identified risk/challenge rather than quantified effect.
high · negative · Audit 5.0 and the Digital Transformation of Auditing: The Ro... · audit reliability and the exercise of professional judgement in presence of opaq...
Audit 5.0 introduces key challenges: data quality and integration issues, complexity and explainability of advanced technologies, regulatory and ethical uncertainty, and skills shortages combined with cultural resistance.
Systematic literature review and synthesis of professional standards and regulatory perspectives; assertions based on reviewed literature rather than a single empirical dataset.
high · negative · Audit 5.0 and the Digital Transformation of Auditing: The Ro... · barriers to adoption/readiness factors (data quality, explainability, regulatory...
At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best.
Question-level analysis from the randomized experiment comparing cases where chatbot suggestions were incorrect versus control; paper reports a ~66% reduction in accuracy on easy questions when chatbot suggestions were incorrect (exact denominators and statistics not provided in the excerpt).
high · negative · LLMs in social services: How does chatbot accuracy affect hu... · caseworker accuracy on easy questions when presented with incorrect chatbot sugg...
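Illustratively, if control caseworkers answered 90% of easy questions correctly, a two-thirds reduction would put accuracy near 30% when the chatbot's suggestion was wrong; the 90% baseline is hypothetical, since the excerpt reports only the relative effect.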