The Commonplace

Evidence (8570 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Filter: Adoption
Using a frontier model's system prompt to supply the procedure exposes proprietary procedures to third-party providers.
Author statement describing privacy/proprietary risk as a cost of the system-prompt approach (qualitative claim).
high negative Compiling Agentic Workflows into LLM Weights: Near-Frontier ... exposure of proprietary procedures to third-party providers (privacy/intellectua...
Using a frontier model's system prompt to supply the procedure requires a frontier model for every conversation.
Author statement describing operational/cost trade-offs associated with the system-prompt approach (qualitative claim).
high negative Compiling Agentic Workflows into LLM Weights: Near-Frontier ... requirement to use frontier model per conversation (operational/deployment cost)
Using a frontier model's system prompt to supply the procedure has costs: it consumes the context window.
Author statement referencing trade-offs identified alongside the Dennis et al. result; cost described qualitatively (context window consumption).
Emerging evidence indicates that algorithms often inherit and amplify the historical biases present in training data.
Literature claim in paper referencing 'emerging evidence' and empirical studies (2024–2026) — specific studies, methods, and sample sizes not included in excerpt.
high negative The Algorithmic Mirror: Can Artificial Intelligence Truly Mi... presence and amplification of historical bias in algorithmic outputs
Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs.
Empirical comparison in the paper between single-threshold scoring and tail-inclusive (continuous/unbounded) scoring on identical forecast outputs, showing sign reversal of the capability–accuracy relationship (numerical details not provided in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... capability–accuracy relationship under tail-inclusive scoring (impact of model c...
A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect.
Within-family empirical comparisons using Llama-3.1 variants examining effects of model scale and post-training (fine-tuning) on forecasting calibration (details and sample sizes not provided in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... relationship between model scale / post-training and forecasting calibration (di...
A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put.
Per-quantile decomposition analyses of model predictive distributions reported in the paper, showing quantile-specific changes (specific quantitative results not given in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... upper-tail forecast calibration / shift in predictive quantiles
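A per-quantile decomposition of this kind can be sketched with the pinball (quantile) loss, scored separately at each quantile level. This is a minimal illustration only: the paper's actual scoring rule is not given in the excerpt, and the function names, quantile levels, and all numbers below are hypothetical.

```python
def pinball_loss(y_true, q_pred, tau):
    """Pinball loss for one observation and one predicted tau-quantile."""
    diff = y_true - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff


def per_quantile_loss(observations, forecasts, taus):
    """Mean pinball loss at each quantile level.

    forecasts[i][j] is the predicted taus[j]-quantile for observations[i].
    """
    n = len(observations)
    return {
        tau: sum(pinball_loss(y, f[j], tau)
                 for y, f in zip(observations, forecasts)) / n
        for j, tau in enumerate(taus)
    }


# Hypothetical data: two observations, three quantile levels.
taus = [0.1, 0.5, 0.9]
observations = [100, 120]

# Model A: moderate upper quantile. Model B: same lower and median
# quantiles, but an upper quantile that tracks aggressive extrapolation.
model_a = [[80, 100, 130], [90, 120, 150]]
model_b = [[80, 100, 400], [90, 120, 500]]

loss_a = per_quantile_loss(observations, model_a, taus)
loss_b = per_quantile_loss(observations, model_b, taus)

for tau in taus:
    print(f"tau={tau}: A={loss_a[tau]:.2f}  B={loss_b[tau]:.2f}")
```

In this toy setup the two forecasts score identically at the 0.1 and 0.5 quantiles, and the whole difference in loss comes from the 0.9 quantile, which is the shape of result the decomposition is meant to expose.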
The pattern replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation.
Empirical replication reported on multiple real-world datasets (COVID-19, measles, housing markets, hyperinflation) presented in the paper (dataset sizes not provided in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... forecast performance on real-world time series (distributional forecasts / calib...
The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control.
Results on the authors' released simulated benchmark (ForecastBench-Sim) using synthetic SIR epidemic simulations and a matched linear-control experiment reported in the paper (specific number of simulations or runs not stated in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... forecast performance on simulated SIR epidemics (distributional forecasts)
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change ... more capable models produce worse distributional forecasts.
Empirical experiments reported in the paper comparing LLMs of varying capability on forecasting tasks with superlinear growth and regime-change tail risk; uses distributional forecast evaluation across models (no sample size reported in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... distributional forecast quality / calibration
The lack of prediction stability and predictability can lead to advertiser-perceivable problems such as repeatability issues, cold start, and under-exploration.
Stated as an intuitive/motivational claim in the paper linking instability to advertiser-facing problems; no empirical quantification provided in the excerpt.
high negative LLM Retrieval for Stable and Predictable Ad Recommendations repeatability, cold start, under-exploration (advertiser-perceived issues)
Traditional ads recommendation systems have primarily focused on optimizing for prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG).
Background/contextual claim about prior work and standard practice; stated in the paper as motivation (no empirical evidence provided in the excerpt).
high negative LLM Retrieval for Stable and Predictable Ad Recommendations optimization focus on click/conversion prediction accuracy (recall, NDCG)
AIO is negatively associated with the carbon emission intensity of upstream suppliers.
Authors report a negative association between firms' AIO and the carbon emission intensity of their upstream suppliers in the empirical results using Chinese listed firms (2010–2023).
high negative Artificial intelligence orientation and decarbonization spil... carbon emission intensity (upstream suppliers)
AIO is negatively associated with the carbon emission intensity of industry peers.
Authors report a negative association between a firm's AIO and the carbon emission intensity of its industry peers based on their empirical analyses of Chinese listed companies over 2010–2023.
high negative Artificial intelligence orientation and decarbonization spil... carbon emission intensity (industry peers)
Stronger AIO is associated with lower carbon emission intensity within the focal firm.
Empirical association reported between firm-level AIO (measured via LLMs) and firm carbon emission intensity in the authors' analysis of Chinese listed firms (2010–2023); result described as a negative relationship.
high negative Artificial intelligence orientation and decarbonization spil... carbon emission intensity (focal firm)
Commercial demand drivers systematically distort finished-goods inventory targets and require integration with sales-and-operations planning for accurate calibration.
Narrative synthesis of studies addressing demand-driver effects on finished-goods targets and recommendations for S&OP integration.
high negative Equitable railway corridor investment under demand uncertain... accuracy/calibration of finished-goods inventory targets
Science-to-technology knowledge flow in AI has been insufficiently examined in a systematic and structural way.
Literature-gap claim in the paper motivating the study.
high negative Knowledge flows from science to AI technology: Identifying c... extent of systematic/structural study of science-to-technology knowledge flow in...
Highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation.
Synthesis of quantitative (coverage/reuse statistics) and qualitative analyses (narrative framing, taxonomy mapping) from the Benchmarking-Cultures-25 project; interpretive conclusion drawn by the authors.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... primary function of highlighted benchmarks (standardized measurement vs narrativ...
Authors of many 'general knowledge application' benchmarks claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math).
Content analysis of the benchmarks in the dataset showing topical focus (counts/observations indicating predominance of STEM/math topics) versus broader claimed measurement scope.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... topical focus of benchmark content (STEM/math prevalence) versus stated measurem...
Qualitative analysis shows many 'general knowledge application' benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI.
Qualitative content analysis of benchmark descriptions and builder narratives in the dataset; authors report themes where construct validity is downplayed and AGI progress is emphasized.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... degree of attention to construct validity vs AGI-framing in benchmark narratives
38.5% of highlighted benchmarks appear in just one release.
Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks); the paper reports the share (38.5%) of benchmarks that appear in only a single model release.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... durability/reuse of benchmarks across releases
The evaluation landscape is fragmented with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder.
Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks). The paper reports the share (63.2%) based on counts of builders per highlighted benchmark.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... degree of cross-model benchmark reuse (benchmarks per builder)
Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback, resulting in inefficient exploration and elevated financial risk for advertising platforms.
Argument in the paper contrasting generative-model-based approaches with the authors' proposed solution (conceptual claim; no quantitative backing given in the excerpt).
high negative Generative Auto-Bidding with Unified Modeling and Exploratio... exploration efficiency and financial risk in generative-model-based auto-bidding
Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies.
Statement in the paper summarizing limitations of prior RL-based bidding work (qualitative claim; no experimental details or sample size provided in the excerpt).
high negative Generative Auto-Bidding with Unified Modeling and Exploratio... ability of RL approaches to handle long-term dependencies
Early rule-based methods lacked adaptability.
Literature/contextual statement in the paper's introduction summarizing prior approaches to automated bidding (no empirical data or sample size reported).
high negative Generative Auto-Bidding with Unified Modeling and Exploratio... adaptability of early rule-based bidding methods
Twin agents dissolve that boundary, raising a class of trust calibration challenge these frameworks were not designed to handle.
Argument and design observations from the authors' ongoing project presented in the paper; conceptual claim explaining why existing frameworks may be insufficient for twin agents.
high negative From Role to Person: Trust Calibration Challenges in Twin Ag... framework applicability for trust calibration
When a human colleague doubts a twin agent's output, they face three failure modes (a schema gap, an epistemic gap, and a model artifact) with no reliable attribution path between them.
Conceptual taxonomy derived from the authors' early design observations; presented as an identified set of failure modes in the paper (qualitative, no numeric sample reported in abstract).
Drawing on early design work in an ongoing project, we identify a trust calibration problem specific to this approach.
Based on the authors' early design work (qualitative/design research) described in the paper; no sample size or quantitative metrics reported in the abstract.
Major open challenges for responsible adoption include reliability, bias, privacy, automation bias, transparency, and evaluation.
Authors' identification of risks and open research challenges based on their review/analysis (conceptual synthesis).
high negative Rethinking Code Review in the Age of AI: A Vision for Agenti... list of key risks and challenges for AI adoption in code review
Current AI support for code review remains fragmented, with tools focusing on isolated tasks such as reviewer recommendation, PR description generation, or comment suggestion rather than the end-to-end PR review workflow.
Authors' survey/overview of existing AI tooling for code review described in the paper (conceptual / review-based evidence). No quantitative counts provided in the abstract.
high negative Rethinking Code Review in the Age of AI: A Vision for Agenti... completeness / fragmentation of AI tool coverage across PR review tasks
AI coding assistants expand the volume of code requiring review, turning code review into a growing bottleneck.
Authors' analytical claim linking increased code production from AI assistants to increased review workload; presented as an observed/trend claim in the paper rather than supported by a quantified study in the abstract.
high negative Rethinking Code Review in the Age of AI: A Vision for Agenti... volume of code requiring review / code review bottleneck
Code review has evolved for decades, from informal peer checking to today's pull request (PR) workflows, yet it remains a largely manual, uneven, and cognitively demanding process.
Authors' literature review and historical synthesis of code review practices presented in the paper (conceptual / review-based evidence). No empirical sample or experiment reported in the abstract.
high negative Rethinking Code Review in the Age of AI: A Vision for Agenti... manualness and cognitive demand of code review process
Challenges including algorithmic bias, data privacy concerns, high costs, and skill gaps persist across contexts.
Cross-study synthesis of barriers and challenges reported in the 21 included studies spanning multiple contexts.
high negative Application of Artificial Intelligence in Human Resource Man... prevalence of adoption barriers (bias, privacy, cost, skills)
SMEs face unique resource constraints and lag in AI-HRM adoption.
Synthesis conclusion from the systematic review of 21 included studies (published 2019–2026) comparing adoption patterns and barriers for SMEs.
high negative Application of Artificial Intelligence in Human Resource Man... AI-HRM adoption (lag) and resource constraints
Greater automation can obscure rather than eliminate failure modes.
Analytical claim in paper arguing that increased automation hides failures; presented as an interpretive finding rather than a quantified experimental result in the excerpt.
high negative AI for Auto-Research: Roadmap & User Guide visibility or obscuration of failure modes under automation
End-to-end autonomous systems have not yet consistently reached major-venue acceptance standards.
Paper's statement based on review of acceptance/peer-review outcomes and standards as of April 2026; no numeric acceptance-rate data presented in the excerpt.
high negative AI for Auto-Research: Roadmap & User Guide consistency of meeting major-venue acceptance standards
Research code lags far behind pattern-matching benchmarks.
Paper's evaluative claim from its experiments/coding analysis indicating code produced for research tasks is weaker than benchmark performance on pattern-matching tasks; excerpt contains no numerical comparison.
high negative AI for Auto-Research: Roadmap & User Guide quality/performance of research code relative to pattern-matching benchmarks
Generated ideas often degrade after implementation.
Paper statement about the gap between idea generation and implemented results reported in the Creation-phase analysis; no quantified follow-up study reported in the excerpt.
high negative AI for Auto-Research: Roadmap & User Guide quality change of generated ideas after implementation
AI remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment.
Summary claim from the paper's end-to-end lifecycle analysis indicating limitations on novelty and experimental rigor; no numeric performance metrics provided in excerpt.
high negative AI for Auto-Research: Roadmap & User Guide robustness on novel ideas, research-level experiments, and scientific judgment
Frontier LLMs fail to judge novelty reliably.
Paper's claim from its Validation-phase analysis that models do not reliably assess novelty; excerpt contains no underlying experimental sample or validation metrics.
high negative AI for Auto-Research: Roadmap & User Guide reliability of novelty judgments
Frontier LLMs miss hidden errors.
Qualitative statement from paper indicating models fail to detect some latent or subtle errors in research artifacts; no numeric evaluation provided in excerpt.
high negative AI for Auto-Research: Roadmap & User Guide ability to detect hidden errors
Under scientific pressure, even frontier LLMs still fabricate results.
Reported observation in paper about model behavior under scientific-use conditions; no specific quantitative experiments or sample sizes given in the excerpt.
high negative AI for Auto-Research: Roadmap & User Guide incidence of fabricated results by LLMs
Diagnostics also reveal a small tail of extreme errors for the Random Forest model.
Model diagnostic analyses reported in the paper indicating error distribution and presence of extreme prediction errors (tail).
high negative Determinants of Successful IoT and AI Initiatives in the SMA... distribution of prediction errors (presence of extreme errors)
Unrestricted frontier-scale checkpoint synthesis remains open (i.e., not yet solved).
Authors' assessment in the abstract noting current limits; asserts that unrestricted synthesis at frontier/model-scale has not been achieved.
high negative Position: Weight Space Should Be a First-Class Generative AI... feasibility/status of unrestricted frontier-scale checkpoint synthesis
Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure.
Critical review of deployed system design choices and governance mechanisms; authors assert that attention and tooling focus on observable product-mention-level interventions while higher-tier influences lack measurement and disclosure frameworks.
high negative Generative AI Advertising as a Problem of Trustworthy Commer... coverage of governance/mechanisms across influence tiers and the existence of fr...
These tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes.
Analytical claim supported by examples and discussion of system architectures (e.g., RAG, agentic pipelines) showing how interventions at different stages map to the taxonomy; no quantitative evaluation reported in excerpt.
high negative Generative AI Advertising as a Problem of Trustworthy Commer... presence of influence tiers across different system modalities and architectures
Generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels.
Conceptual argument backed by analysis of how generative models produce outputs and how interventions can operate on latent variables of generation; illustrated via taxonomy in the paper rather than quantified empirical tests.
high negative Generative AI Advertising as a Problem of Trustworthy Commer... modes/channels of commercial influence in advertising systems
Empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users.
Reference to prior empirical studies (unspecified in the excerpt) showing user failure to detect embedded ads in LLM outputs; presented as an empirical finding rather than new experimental data in this paper.
high negative Generative AI Advertising as a Problem of Trustworthy Commer... user detection/recognition of ads embedded in LLM outputs
Management shareholding and analyst attention amplify the debt-cost penalty faced by AI washing firms.
Heterogeneity/interaction analyses showing larger post-shock financing-cost increases for AI washing firms with higher management shareholding and greater analyst attention (descriptive of moderator effects; no sample sizes in abstract).
high negative Dissipation of Debt Financing Privilege on Corporate AI Wash... magnitude of debt financing cost penalty
Difference-in-differences estimations reveal that AI washing firms experience a 12.5-basis-point relative increase in debt financing cost after the FYP shock.
Difference-in-differences estimations comparing AI washing firms to others before and after the FYP shock; effect reported as a 12.5-basis-point increase in debt financing cost (sample size not stated in abstract).
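For readers unfamiliar with the method, the simplest two-group, two-period form of a difference-in-differences estimate can be sketched as below. This is an illustrative sketch, not the paper's specification, which likely includes controls and fixed effects not described in the excerpt. The function name and all debt-cost figures are hypothetical and are not taken from the paper.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences estimate.

    Returns (change in treated mean) minus (change in control mean).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical debt financing costs in basis points (not the paper's data).
treated_pre = [500, 510]    # treated firms, before the shock
treated_post = [530, 540]   # treated firms, after the shock
control_pre = [490, 500]    # comparison firms, before the shock
control_post = [505, 515]   # comparison firms, after the shock

effect_bp = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(f"DiD estimate: {effect_bp:.1f} basis points")
```

Here the treated group's cost rises by 30 basis points and the comparison group's by 15, so the estimate attributes a relative increase of 15 basis points to the shock, under the usual assumption that both groups would otherwise have followed parallel trends.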