The Commonplace

Evidence (8625 claims)

Adoption: 8625 claims
Productivity: 7686 claims
Governance: 6917 claims
Human-AI Collaboration: 6574 claims
Org Design: 4189 claims
Innovation: 4131 claims
Labor Markets: 3588 claims
Skills & Training: 2985 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 761 200 101 904 2020
Governance & Regulation 829 400 191 122 1566
Organizational Efficiency 784 193 125 84 1197
Technology Adoption Rate 637 236 124 97 1103
Research Productivity 431 131 58 340 972
Output Quality 481 183 59 47 770
Decision Quality 332 177 82 49 647
Firm Productivity 439 57 88 20 610
AI Safety & Ethics 218 279 66 33 602
Market Structure 181 170 123 24 503
Task Allocation 214 64 72 33 388
Skill Acquisition 174 62 62 17 315
Innovation Output 204 27 45 18 295
Employment Level 105 54 108 13 282
Fiscal & Macroeconomic 132 69 43 26 277
Consumer Welfare 117 63 42 11 233
Firm Revenue 154 48 26 3 231
Task Completion Time 173 31 8 12 225
Inequality Measures 44 123 50 6 223
Worker Satisfaction 89 65 22 12 188
Error Rate 71 92 10 2 175
Regulatory Compliance 77 69 14 5 165
Automation Exposure 58 56 26 13 156
Training Effectiveness 96 21 14 19 152
Wages & Compensation 77 37 25 6 145
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 81 21 1 115
Hiring & Recruitment 52 7 8 3 70
Creative Output 32 20 8 3 64
Skill Obsolescence 5 47 6 1 59
Social Protection 28 16 8 2 54
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Filter: Adoption
Cluster reliability should be validated (e.g., bootstrap, perturbations) and automatic labels complemented with expert human validation for critical analyses.
Caveat and recommended validation steps provided in summary; suggests bootstrap/perturbation and manual validation as best practices. No empirical stability metrics provided in summary.
high negative Soft-Prompted Semantic Normalization for Unsupervised Analys... cluster stability/reliability and accuracy of automatically generated labels
Results are sensitive to model and prompt choice; researchers should perform robustness checks across LLMs, soft prompts, and embedding models.
Caveat explicitly stated in the paper summary noting model and prompt sensitivity; recommended validation steps include robustness checks across models and prompts.
high negative Soft-Prompted Semantic Normalization for Unsupervised Analys... sensitivity of clustering/labeling results to LLM, prompt design, and embedding ...
Empirical validation is concentrated on the Agora-12 corpus; generalizability to other architectures, scales, or deployment contexts is unproven and identified as a limitation.
Authors' own limitations section and scope of empirical tests (analyses limited to Agora-12 and four clinical cases).
high negative Model Medicine: A Clinical Framework for Understanding, Diag... Scope of empirical validation (limited to Agora-12 dataset and 4 case studies)
Higher complaint volume is significantly associated with near-term stock price declines.
Fixed-effects panel path models estimated on monthly data for 261 financial firms (2018–2023) report statistically significant negative associations between firm–month complaint volume and subsequent abnormal returns.
high negative More than words: valuation of words for stock price by using... near-term abnormal stock returns
Consumer complaints—measured by monthly volume, topic composition, and VADER sentiment of complaint narratives—contain behavioral signals that predict short-term abnormal stock returns in U.S. financial firms.
CFPB complaint records matched to 261 publicly traded U.S. financial firms (monthly observations, 2018–2023); analyses use fixed-effects panel path models to link firm–month complaint features (volume, LDA topic prevalences, aggregated VADER sentiment) to firm-level abnormal returns; complementary machine-learning models evaluate out-of-sample predictive performance.
high negative More than words: valuation of words for stock price by using... short-term firm-level abnormal stock returns
Measurement issues (task-based output measurement, attributing output changes to AI) and selection into early adoption bias estimated productivity gains upward.
Methodological robustness checks reported in the paper: task-based measures, bounding exercises, placebo tests, and analysis of pre-trends; discussions of selection on unobservables and potential upward bias.
high negative S-TCO: A Sustainable Teacher Context Ontology for Educationa... validity/bias of estimated productivity effects
Implementing the governed hyperautomation pattern raises upfront costs (governance tooling, monitoring, validation, compliance processes).
Economic and cost-structure discussion in the paper, based on qualitative reasoning and industry experience; no quantified cost estimates or sample-based cost analysis provided.
high negative Governed Hyperautomation for CRM and ERP: A Reference Patter... upfront implementation costs (governance tooling, validation, compliance overhea...
Use of standardized (non-adaptive) dialogues limits ecological validity relative to live adaptive chatbots.
Limitations section acknowledges that standardized (non-adaptive) experimental dialogues reduce ecological validity compared with live/adaptive chatbot interactions.
Platform KPIs (e.g., eCPM) can diverge from social welfare metrics (consumer surplus, privacy harms), creating metric misalignment.
Conceptual critique with examples of common platform metrics versus welfare economics; not accompanied by a quantitative comparison dataset.
high negative Artificial Intelligence for Personalized Digital Advertising... alignment between platform KPIs and social welfare measures
Privacy constraints reduce observability and necessitate privacy-preserving study designs that complicate estimation.
Methodological analysis referencing differential privacy, federated learning and their effects on statistical power/observability; no experimental power analyses with sample sizes presented here.
high negative Artificial Intelligence for Personalized Digital Advertising... observability and estimation precision under privacy constraints
Data access asymmetries (platforms holding proprietary logs) limit external auditability and replication of advertising research.
Empirical and institutional observation about industry data practices; supported by calls for privacy-preserving shared datasets in the paper; no quantified survey sample included.
high negative Artificial Intelligence for Personalized Digital Advertising... external auditability and ability to replicate studies
Attribution complexity — multi-touch, cross-device, and delayed conversions — confounds causal inference in advertising measurement.
Methodological discussion referencing causal inference challenges and standard problems in attribution; widely documented in the literature though not re-measured in this paper.
high negative Artificial Intelligence for Personalized Digital Advertising... accuracy of causal attribution for ad effects
Complex automated systems make attribution and responsibility harder when harms occur (Automation vs accountability trade-off).
Qualitative institutional analysis and case-study reasoning about multi-agent automated pipelines and opaque model decisions; no single empirical incident dataset provided.
high negative Artificial Intelligence for Personalized Digital Advertising... clarity of attribution and accountability in case of harms
Richer personalization depends on granular data and cross-device identity, creating privacy externalities and compliance risks (Personalization vs privacy trade-off).
Data source inventory and privacy literature review; supported by observational industry trends (move to first-party identity) rather than a quantified sample in the paper.
high negative Artificial Intelligence for Personalized Digital Advertising... degree of personalization versus exposure to privacy risks/compliance failures
The cost of formalizing informal labor (CFIL) implies formalizing a worker costs on average 88% more than the informal wage in 2023.
New CFIL metric calculated for 19 countries (2023 baseline) by estimating the additional employer cost of hiring and formalizing an informal worker and reporting it relative to the informal wage, using compiled statutory obligations and informal wage benchmarks.
high negative Salaried Labor Costs in Latin America and the Caribbean: A T... CFIL (additional cost of formalizing) as % above informal wage
There is sizable attrition in the pipeline from applicant admission through to direct employment of AI graduates, indicating leakages at multiple stages (application → admission → graduation → employment).
Quantification of human-resource losses across pipeline stages using the monitoring dataset for the 191 institutions; descriptive counts/percentages of entrants, admitted students, graduates, and those directly employed in AI roles (pipeline loss metrics reported in paper).
high negative Employment of Graduates of Educational Programs in the Field... Attrition rates / absolute losses at sequential pipeline stages (applicants → ad...
Graduates from Russian universities running AI-related educational programs together with alternative training routes (self-education and professional retraining) satisfy 43.9% of estimated national AI personnel demand.
Monitoring dataset of 191 Russian universities implementing AI-related programs; aggregated counts of university graduates plus estimated contributions from self-education and professional retraining compared to an estimated national AI personnel demand (coverage reported as 43.9%).
high negative Employment of Graduates of Educational Programs in the Field... Share (%) of estimated national AI personnel demand satisfied by combined univer...
AI automates routine and some mid-skill tasks, reducing employment in those occupations.
Empirical task-based exposure measures mapping AI capabilities to occupational task content, microdata analyses of employment by occupation using household/employer/administrative datasets, and panel regressions/decompositions that document within-occupation declines and between-occupation shifts.
high negative Intelligence and Labor Market Transformation: A Critical Ana... employment levels in routine and mid-skill occupations
Relying on secondary literature limits the paper's ability to make causal inferences and constrains empirical generalizability to all sectors or countries.
Stated limitations in the paper's Data & Methods section acknowledging scope and inferential constraints.
high negative Who Loses to Automation? AI-Driven Labour Displacement and t... causal inference strength and generalizability of conclusions
Increases in K_T reduce employment levels in affected firms and industries even when aggregate productivity rises.
Panel econometric estimates at firm and industry levels relating K_T intensity to employment outcomes, controlling for demand, input prices, and firm characteristics; difference-in-differences specifications and instrumental-variable robustness checks; corroborated by sectoral case studies.
high negative The Macroeconomic Transition of Technological Capital in the... employment (firm- and industry-level employment counts or employment growth)
Rising technological capital (K_T) — proxied by robot/automation density, software and intangible capital accumulation, AI adoption surveys, and AI-related patenting — leads to a decline in labor’s share of output.
Firm- and industry-level panel regressions linking constructed K_T intensity measures to labor shares, supported by macro growth-accounting decompositions; robustness checks include difference-in-differences and instrumenting adoption with plausibly exogenous shocks (e.g., cross-border technology diffusion, trade shocks); validated with cross-country comparisons and case studies.
high negative The Macroeconomic Transition of Technological Capital in the... labor share of income (share of output paid to labor)
We evaluate the system in two controlled human-subject experiments comparing AI-based pre-mediation with professional human mediators in a multi-issue negotiation scenario.
Reported experimental design: two controlled human-subject studies conducted by the authors (details and comparisons described in paper).
high neutral Automated Mediator for Human Negotiation: Pre-Mediation via ... comparative evaluation between AI-mediated and human-mediated pre-mediation
The pipeline's components are not autonomous and do not interact peer-to-peer; outputs are passed forward in a fixed sequence (single-party pipeline).
Implementation detail described in the paper (system architecture specification).
high neutral Automated Mediator for Human Negotiation: Pre-Mediation via ... pipeline execution model (fixed-sequence, non-autonomous modules)
ALE is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks.
Author-provided counts describing the benchmark taxonomy and task pool.
high neutral Agents' Last Exam taxonomy breadth (subfields, clusters, number of tasks)
ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy).
Design specification described in the paper referencing O*NET / SOC 2018.
high neutral Agents' Last Exam scope of industries covered by the benchmark
At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining.
Method description and algorithmic claim in the paper (selection rule maximizing predicted correctness with cost penalty). No empirical sample size required for algorithmic description.
high neutral Natural Language Query to Configuration for Retrieval Agents cost-quality tradeoff exposed by selection strategy
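The selection rule in the BRANE claim above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `select_config`, the linear penalty `lam * cost`, and the example configurations and numbers are all assumptions made for illustration. The summary only states that the chosen configuration maximizes predicted correctness penalized by cost.

```python
from typing import Dict, Tuple


def select_config(candidates: Dict[str, Tuple[float, float]], lam: float) -> str:
    """Pick the configuration with the best cost-penalized score.

    candidates maps a configuration name to (predicted_correctness, cost).
    The score is predicted_correctness - lam * cost; the linear form of the
    penalty is an assumption, not something the summary specifies.
    """
    if not candidates:
        raise ValueError("need at least one candidate configuration")
    return max(
        candidates,
        key=lambda name: candidates[name][0] - lam * candidates[name][1],
    )


# Hypothetical per-configuration predictions for a single query.
configs = {
    "small_k": (0.70, 1.0),
    "large_k": (0.85, 3.0),
    "rerank": (0.90, 6.0),
}

cost_sensitive = select_config(configs, lam=0.10)  # heavy cost penalty
quality_first = select_config(configs, lam=0.01)   # light cost penalty
```

Changing `lam` moves the choice along the cost-quality tradeoff without touching the predictors, which is the sense in which the tradeoff is tunable without retraining.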
We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly.
Method description in the paper: BRANE architecture and training procedure (LLM-based feature extraction + per-configuration correctness predictor). No numeric sample size reported for method description.
high neutral Natural Language Query to Configuration for Retrieval Agents method (feature extraction and predictor training)
AI deployment should be evaluated not only by average task speed, but by its overall effects on congestion, rework, and the robustness of human oversight under load.
Policy/recommendation based on the paper's theoretical results and derived implications from the queueing model (conceptual/prescriptive conclusion; no empirical testing reported).
high neutral Queue & AI: When Faster Tasks Slow Down the Workflow organizational_efficiency
The divergence between mean task speed and system-level delay caused by AI assistance is labeled the 'variance wedge'.
Definition/terminology introduced in the paper as part of its conceptual framing; supported by the analytic model description.
high neutral Queue & AI: When Faster Tasks Slow Down the Workflow task_completion_time
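One way to see how mean task speed and system-level delay can diverge is the textbook M/G/1 queue, where the Pollaczek-Khinchine formula makes waiting time depend on the variance of service time as well as its mean. The sketch below uses that standard model as a stand-in; the summary does not specify the paper's queueing model, so the M/G/1 choice, the function name, and all numbers are assumptions.

```python
def mg1_mean_time_in_system(arrival_rate: float,
                            mean_service: float,
                            var_service: float) -> float:
    """Mean time in system (queueing wait plus service) for an M/G/1 queue.

    Uses the Pollaczek-Khinchine formula:
        wait = arrival_rate * E[S^2] / (2 * (1 - rho)),  rho = arrival_rate * E[S]
    where E[S^2] = Var[S] + E[S]^2.
    """
    rho = arrival_rate * mean_service
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    second_moment = var_service + mean_service ** 2
    wait = arrival_rate * second_moment / (2.0 * (1.0 - rho))
    return wait + mean_service


# Hypothetical scenario: assistance lowers the mean task time from 1.0 to 0.9
# but makes task times far more variable (variance 0.0 -> 1.5).
baseline = mg1_mean_time_in_system(arrival_rate=0.8, mean_service=1.0, var_service=0.0)
with_ai = mg1_mean_time_in_system(arrival_rate=0.8, mean_service=0.9, var_service=1.5)
```

In this illustrative setting the average task is faster with assistance, yet the mean time in system rises from about 3.0 to about 4.2 because the added variance lengthens the queue.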
GENSTRAT generates a distribution of two-player zero-sum imperfect-information card games.
Design specification in paper; reported generated pool size of 2,000 games (abstract).
high neutral GENSTRAT: Toward a Science of Strategic Reasoning in Large L... game distribution (two-player zero-sum imperfect-information card games)
Few benchmarks achieve widespread use (examples given include GPQA Diamond, LiveCodeBench, AIME 2025).
Empirical observation from the dataset showing that only a small number of benchmarks are highlighted across multiple builders/releases; specific named benchmarks are cited as relatively widely used.
high neutral Unsteady Metrics and Benchmarking Cultures of AI Model Build... frequency of benchmark highlighting across builders/releases
We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping.
Paper contribution: authors present a four-tier taxonomy as a conceptual framework; this is a descriptive/constructive claim about the content of the paper itself.
high neutral Generative AI Advertising as a Problem of Trustworthy Commer... categorization of types of commercial influence in generative systems
Regulatory technology is viewed as a governance arrangement that organizes relations between firms, banks, insurers, logistics actors, buyers, and regulators.
Conceptual framing developed through the interpretive synthesis of multiple literature streams in the paper.
high neutral RegTech-enabled governance of sanctions-safe enterprise ecos... conceptual role of RegTech in organizing inter-actor relations
Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results.
Experimental setup reported by authors: 9 models from 3 providers, with 30 trials per model using real SDK tool-use and autonomous graph queries.
high neutral Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... experimental coverage and evaluation methodology (models invoked graph query too...
Oracle Poisoning manipulates the data agents reason over, not their instructions, distinguishing it from prompt injection.
Theoretical distinction and definitional comparison made by the authors (conceptual argument in the paper).
high neutral Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... mechanism of attack (data-layer vs instruction-layer manipulation)
A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering.
Methodological claim describing a six-gate producer audit procedure in the paper to diagnose engineering failures vs. commercial steering.
high neutral TourMart: A Parametric Audit Instrument for Commission Steer... ability to distinguish engineering failures from commercial steering
Holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template (paired counterfactual).
Method description of the paired counterfactual experimental design used by TourMart.
high neutral TourMart: A Parametric Audit Instrument for Commission Steer... steering delta (difference in acceptance between commission-aware and minimum-di...
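The paired counterfactual above can be sketched as a difference in acceptance between the two message conditions for the same traveler and bundle. This is an illustrative reading of the summary, not TourMart's code: the function name `steering_delta`, the boolean acceptance encoding, and the sample data are assumptions.

```python
from typing import Sequence


def steering_delta(accept_commission_aware: Sequence[bool],
                   accept_factual: Sequence[bool]) -> float:
    """Mean paired difference in acceptance.

    Item i in both sequences refers to the same (traveler, bundle) pair, shown
    once with the commission-aware prompt and once with the minimum-disclosure
    factual template. A positive value means the commission-aware message
    raised acceptance.
    """
    if len(accept_commission_aware) != len(accept_factual):
        raise ValueError("paired design requires equal-length sequences")
    if not accept_commission_aware:
        raise ValueError("need at least one pair")
    diffs = [int(a) - int(b)
             for a, b in zip(accept_commission_aware, accept_factual)]
    return sum(diffs) / len(diffs)


# Hypothetical outcomes for four traveler-bundle pairs.
aware = [True, True, False, True]
factual = [True, False, False, False]
delta = steering_delta(aware, factual)
```

Because each pair holds the traveler and bundle fixed, differences between the two conditions are attributable to the message rather than to the offer itself.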
We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance, driven by two governance levers — lambda (gain on message-induced perception) and kappa (budget-normalized cap on how far the message can shift perceived welfare).
Methodological proposal described in paper: design of an audit instrument and two formal levers (lambda, kappa).
high neutral TourMart: A Parametric Audit Instrument for Commission Steer... audit instrument capability for measuring message-induced perception shifts unde...
Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice.
Descriptive assertion in paper about product/industry UI change; no empirical sample or formal measurement reported in excerpt.
high neutral TourMart: A Parametric Audit Instrument for Commission Steer... interface format (ranked-list → single-sentence conversational recommendation)
We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget.
Methods / experimental setup reported in the paper: five function-calling benchmarks and a DistilBERT classifier trained and deployed under latency constraints.
high neutral Switchcraft: AI Model Router for Agentic Tool Calling evaluation framework and classifier training/deployment
We show that ρ ≥ 1 is the no-excess-crowding parity condition and connect Δ to an adoption game with exposure-dependent redundancy costs.
Theoretical result derived in the paper linking the human-relative diversity ratio ρ to a parity condition and relating the excess-crowding coefficient Δ to an adoption-game model with exposure-dependent redundancy costs.
high neutral Ex Ante Evaluation of AI-Induced Idea Diversity Collapse parity condition for no-excess-crowding (ρ ≥ 1) and economic/game-theoretic rela...
The paper's contribution is a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces (not a new optimizer or a hotel-pricing leaderboard).
Authors' framing and explicit statements of intended contribution; supported by the failure diagnosis, diagnostic protocol, and Trace-Prior RL repair demonstrated in simulator experiments.
high neutral Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... methodological reproducibility and conceptual framing
We position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots.
Architectural/framework description in the paper that maps DePIN elements into a vertically integrated stack; conceptual/mapping method without empirical measurement.
high neutral DAO-enabled decentralized physical AI: A new paradigm for hu... conceptual integration of DePIN components into a vertical infrastructure stack
The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool.
System description in paper noting that the curated index is available via a Gosset MCP server for external models to call.
high neutral Curated AI beats frontier LLMs at pharma asset discovery availability of curated index as callable MCP server
All five systems receive the same natural-language query and the same JSON output schema.
Methodological detail reported in paper describing controlled inputs across systems.
high neutral Curated AI beats frontier LLMs at pharma asset discovery consistency of input/query and output schema across systems
We benchmark Gosset ... against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets.
Experimental benchmark described in paper: direct comparison of Gosset versus four named models on 10 targets; methodological statement.
high neutral Curated AI beats frontier LLMs at pharma asset discovery comparative retrieval performance on 10 niche oncology/immunology targets
Weight-based memory generalizes by applying abstract rules to inputs never seen before.
Conceptual claim grounded in the paper's theoretical distinction between weight-based learning and retrieval; references Complementary Learning Systems theory; no empirical sample in abstract.
high neutral Contextual Agentic Memory is a Memo, Not True Memory type of generalization performed by weight-based memory
Retrieval generalizes by similarity to stored cases.
Conceptual claim stated in paper (distinction between retrieval-based and weight-based generalization); supported by theoretical characterization, not empirical data in abstract.
high neutral Contextual Agentic Memory is a Memo, Not True Memory type of generalization performed by retrieval systems
Many practical machine learning applications are online and sequential, meaning prior decisions inform future ones — a setting in which fairness challenges differ from standard supervised learning.
Background claim in the paper motivating the work; literature context and conceptual discussion rather than new empirical data.
high neutral Fairness under uncertainty in sequential decisions characterization of ML application setting (online/sequential)
The paper establishes a taxonomy of forgetting mechanisms: passive decay-based, active deletion-based, safety-triggered, and adaptive reinforcement-based.
Explicit taxonomy presented in paper (listed in abstract).
high neutral FSFM: A Biologically-Inspired Framework for Selective Forget... classification of forgetting mechanisms