The Commonplace

Evidence (6869 claims)

Adoption
8570 claims
Productivity
7631 claims
Governance
6869 claims
Human-AI Collaboration
6491 claims
Org Design
4175 claims
Innovation
4114 claims
Labor Markets
3566 claims
Skills & Training
2966 claims
Inequality
2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs.
Empirical comparison in the paper between single-threshold scoring and tail-inclusive (continuous/unbounded) scoring on identical forecast outputs, showing sign reversal of the capability–accuracy relationship (numerical details not provided in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... capability–accuracy relationship under tail-inclusive scoring (impact of model c...
A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect.
Within-family empirical comparisons using Llama-3.1 variants examining effects of model scale and post-training (fine-tuning) on forecasting calibration (details and sample sizes not provided in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... relationship between model scale / post-training and forecasting calibration (di...
A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put.
Per-quantile decomposition analyses of model predictive distributions reported in the paper, showing quantile-specific changes (specific quantitative results not given in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... upper-tail forecast calibration / shift in predictive quantiles
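One common way to decompose a distributional forecast by quantile is the pinball (quantile) loss, sketched below. The excerpt does not state which scoring rule the paper uses, so the choice of pinball loss, the function names, and all numbers here are assumptions for illustration only. The two forecasts share a lower quantile and differ mainly in how far the upper quantile is pushed up.

```python
def pinball_loss(forecast: float, outcome: float, tau: float) -> float:
    """Pinball loss of a single quantile forecast at level tau."""
    diff = outcome - forecast
    return max(tau * diff, (tau - 1.0) * diff)


def per_quantile_losses(quantile_forecasts: dict, outcome: float) -> dict:
    """Loss at each quantile level, so tail contributions can be compared."""
    return {tau: pinball_loss(q, outcome, tau) for tau, q in quantile_forecasts.items()}


# Hypothetical values, not taken from the paper.
outcome = 100.0
restrained = {0.1: 80.0, 0.5: 100.0, 0.9: 130.0}
aggressive = {0.1: 80.0, 0.5: 105.0, 0.9: 400.0}  # upper tail shifted far upward

restrained_losses = per_quantile_losses(restrained, outcome)
aggressive_losses = per_quantile_losses(aggressive, outcome)

for tau in sorted(restrained):
    print(tau, restrained_losses[tau], aggressive_losses[tau])
```

In this toy case the lower-quantile loss is identical for both forecasts, and nearly all of the extra loss in the second forecast sits at the 0.9 quantile.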
The pattern replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation.
Empirical replication reported on multiple real-world datasets (COVID-19, measles, housing markets, hyperinflation) presented in the paper (dataset sizes not provided in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... forecast performance on real-world time series (distributional forecasts / calib...
The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control.
Results on the authors' released simulated benchmark (ForecastBench-Sim) using synthetic SIR epidemic simulations and a matched linear-control experiment reported in the paper (specific number of simulations or runs not stated in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... forecast performance on simulated SIR epidemics (distributional forecasts)
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change ... more capable models produce worse distributional forecasts.
Empirical experiments reported in the paper comparing LLMs of varying capability on forecasting tasks with superlinear growth and regime-change tail risk; uses distributional forecast evaluation across models (no sample size reported in excerpt).
high negative Is Capability a Liability? More Capable Language Models Make... distributional forecast quality / calibration
Regulatory frameworks that address only downstream applications leave the upstream concentration of infrastructural power largely intact.
Policy analysis and theoretical critique of regulatory approaches; argument based on the distinction between upstream infrastructure and downstream applications (qualitative).
high negative Digital colonialism, techno-sovereignty, and infrastructural... effectiveness of downstream-focused regulation in reducing upstream infrastructu...
Authority in AI systems is exercised not through formal jurisdiction but through infrastructural chokepoints and dependency pathways that precede and condition law.
Genealogical and infrastructural analysis; theoretical argument emphasizing chokepoints and dependency relations (qualitative).
high negative Digital colonialism, techno-sovereignty, and infrastructural... mechanisms of authority in AI systems (infrastructural chokepoints vs formal leg...
Digital colonialism is distinct from surveillance capitalism: AI extends historical patterns of dispossession and epistemic domination beyond the commodification of individual behavior by embedding extractive and classificatory logics within data architectures, models, and standards.
Conceptual distinction developed via literature review, political-theoretical argumentation, and genealogical analysis (qualitative).
high negative Digital colonialism, techno-sovereignty, and infrastructural... degree to which AI architectures embed extractive/classificatory logics and repr...
Contemporary biometric and algorithmic systems show continuities with colonial identification infrastructures.
Genealogical analysis and engagement with decolonial scholarship tracing historical continuities (qualitative, no quantitative sample).
high negative Digital colonialism, techno-sovereignty, and infrastructural... continuity between colonial identification infrastructures and contemporary biom...
AI systems deployed for identification, classification, and governance are the domains where sovereignty is most visibly reconfigured.
Analytic focus and genealogical tracing within the paper; literature review and conceptual examples of identification/classification systems (no quantitative sample reported).
high negative Digital colonialism, techno-sovereignty, and infrastructural... degree of sovereignty reconfiguration in identification/classification/governanc...
AI constitutes a historically continuous yet technologically novel form of colonial power, shifting sovereignty from territorial authority toward infrastructural and algorithmic control (termed "infrastructural sovereignty").
Theoretical argument and genealogical analysis drawing on political theory and decolonial scholarship; conceptual synthesis presented in the paper (no empirical sample size reported).
high negative Digital colonialism, techno-sovereignty, and infrastructural... configuration of sovereignty (territorial vs infrastructural/algorithmic)
In the absence of communicative and institutional safeguards, individually adaptive delegation aggregates into a systemic collective action problem (modeled as a prisoner's dilemma), producing a sociotechnical lock-in that degrades shared epistemic standards.
Game-theoretic analysis in the paper demonstrating aggregation effects and mapping them to a prisoner's-dilemma–style collective action problem (theoretical modeling, no empirical sample).
high negative The Human-AI Delegation Dilemma: Individual Strategies, Coll... degradation of shared epistemic standards / sociotechnical lock-in
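The prisoner's-dilemma structure named in this claim can be checked mechanically on a small payoff table. The sketch below is not the paper's model: the action names ("verify", "delegate") and every payoff value are hypothetical, chosen only to show the two defining conditions, namely that one action strictly dominates while mutual choice of it leaves both players worse off.

```python
# Hypothetical symmetric payoffs: (row player, column player).
PAYOFFS = {
    ("verify", "verify"): (3, 3),
    ("verify", "delegate"): (0, 5),
    ("delegate", "verify"): (5, 0),
    ("delegate", "delegate"): (1, 1),
}
ACTIONS = ("verify", "delegate")


def is_dominant(action: str) -> bool:
    """True if `action` strictly beats the alternative for the row player
    against every opponent move (the game is symmetric, so one check suffices)."""
    other = next(a for a in ACTIONS if a != action)
    return all(
        PAYOFFS[(action, opp)][0] > PAYOFFS[(other, opp)][0] for opp in ACTIONS
    )


def is_prisoners_dilemma() -> bool:
    """Delegation dominates individually, yet mutual delegation is worse
    for both players than mutual verification."""
    mutual_verify = PAYOFFS[("verify", "verify")]
    mutual_delegate = PAYOFFS[("delegate", "delegate")]
    worse_for_both = all(v > d for v, d in zip(mutual_verify, mutual_delegate))
    return is_dominant("delegate") and worse_for_both


print(is_prisoners_dilemma())
```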
The complementarity thesis is an over-simplification of the modalities of human-AI interaction and the possibility-space for both individual and collective action that human-AI interaction potentiates.
Theoretical argumentation and conceptual analysis presented in the paper (no empirical data reported).
high negative The Human-AI Delegation Dilemma: Individual Strategies, Coll... adequacy of the complementarity thesis to capture human-AI interaction modalitie...
There are findings indicating that public R&D expenditures are not being used effectively and efficiently (because public R&D shows a negative relationship).
Inference based on random-effects regression results showing the negative relationship (G8 + Türkiye, 2010-2020).
high negative AR-GE HARCAMALARININ VE VERGİ TEŞVİKLERİNİN YAPAY ZEKAYA ETK... effectiveness/efficiency (interpretive inference, not directly measured)
There is a negative relationship between economic growth and the number of artificial intelligence patents.
Panel regression (random effects) results (G8 + Türkiye, 2010-2020) are reported; the economic growth variable (probably the GNP growth rate) is reported to show a negative relationship with AI patent counts.
high negative AR-GE HARCAMALARININ VE VERGİ TEŞVİKLERİNİN YAPAY ZEKAYA ETK... AI patent counts (number of artificial intelligence patents)
There is a negative relationship between public R&D expenditures and the number of artificial intelligence patents.
Results reported from a random-effects panel regression (G8 + Türkiye, 2010-2020); the public R&D expenditure variable is reported to show a negative relationship with AI patent counts.
high negative AR-GE HARCAMALARININ VE VERGİ TEŞVİKLERİNİN YAPAY ZEKAYA ETK... AI patent counts (number of artificial intelligence patents)
Science-to-technology knowledge flow in AI has been insufficiently examined in a systematic and structural way.
Literature-gap claim in the paper motivating the study.
high negative Knowledge flows from science to AI technology: Identifying c... extent of systematic/structural study of science-to-technology knowledge flow in...
Highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation.
Synthesis of quantitative (coverage/reuse statistics) and qualitative analyses (narrative framing, taxonomy mapping) from the Benchmarking-Cultures-25 project; interpretive conclusion drawn by the authors.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... primary function of highlighted benchmarks (standardized measurement vs narrativ...
Authors of many 'general knowledge application' benchmarks claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math).
Content analysis of the benchmarks in the dataset showing topical focus (counts/observations indicating predominance of STEM/math topics) versus broader claimed measurement scope.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... topical focus of benchmark content (STEM/math prevalence) versus stated measurem...
Qualitative analysis shows many 'general knowledge application' benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI.
Qualitative content analysis of benchmark descriptions and builder narratives in the dataset; authors report themes where construct validity is downplayed and AGI progress is emphasized.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... degree of attention to construct validity vs AGI-framing in benchmark narratives
38.5% of highlighted benchmarks appear in just one release.
Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks); the paper reports the share (38.5%) of benchmarks that appear in only a single model release.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... durability/reuse of benchmarks across releases
The evaluation landscape is fragmented with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder.
Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks). The paper reports the share (63.2%) based on counts of builders per highlighted benchmark.
high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... degree of cross-model benchmark reuse (benchmarks per builder)
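For a sense of scale, the two reported shares can be converted into approximate benchmark counts. This is a back-of-the-envelope sketch that assumes both percentages use the full 231-benchmark dataset as their denominator; the excerpt does not confirm this, and the function name is illustrative.

```python
# Assumption: both shares are computed over all 231 benchmarks in the dataset.
TOTAL_BENCHMARKS = 231


def implied_count(share_percent: float, total: int = TOTAL_BENCHMARKS) -> int:
    """Nearest whole number of benchmarks implied by a reported percentage."""
    return round(share_percent / 100 * total)


single_release = implied_count(38.5)  # benchmarks appearing in just one release
single_builder = implied_count(63.2)  # benchmarks used by a single builder

print(single_release, single_builder)

# Round-trip check: the implied counts reproduce the reported shares to one decimal.
print(round(single_release / TOTAL_BENCHMARKS * 100, 1))
print(round(single_builder / TOTAL_BENCHMARKS * 100, 1))
```

Under that assumption, the shares correspond to roughly 89 and 146 benchmarks respectively.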
In strategic decision scenarios, individuals may modify their features after deployment, inducing a post-deployment distribution shift; this strategic manipulation creates a mismatch between the non-strategic prior learned during pretraining and the post-manipulation strategic prior, which leads to systematic prediction bias.
Conceptual/theoretical claim stated in the paper that strategic feature manipulation causes distribution shift and mismatch between learned prior and strategic prior; the abstract asserts this as a cause of systematic prediction bias. No empirical sample sizes given in the abstract.
high negative When Tabular Foundation Models Meet Strategic Tabular Data: ... prediction bias (systematic)
Content filtering (blocking searches for Gaza War and Tulsa race massacre).
Documented cases of content filtering cited/synthesized in the paper (specific blocked search topics reported).
high negative Operating the franchise: vendor consolidation, algorithmic m... blocking of specific search queries / restriction of information access
AI cataloguing failures (26% F1 score for subject headings).
Empirical studies of AI performance in cataloguing synthesized by the paper (reported F1 score for subject heading assignment).
high negative Operating the franchise: vendor consolidation, algorithmic m... F1 score of AI subject heading assignment
Twin agents dissolve that boundary, raising a class of trust calibration challenge these frameworks were not designed to handle.
Argument and design observations from the authors' ongoing project presented in the paper; conceptual claim explaining why existing frameworks may be insufficient for twin agents.
high negative From Role to Person: Trust Calibration Challenges in Twin Ag... framework_applicability_for_trust_calibration
When a human colleague doubts a twin agent's output, they face three failure modes (a schema gap, an epistemic gap, and a model artifact) with no reliable attribution path between them.
Conceptual taxonomy derived from the authors' early design observations; presented as an identified set of failure modes in the paper (qualitative, no numeric sample reported in abstract).
Drawing on early design work in an ongoing project, we identify a trust calibration problem specific to this approach.
Based on the authors' early design work (qualitative/design research) described in the paper; no sample size or quantitative metrics reported in the abstract.
Challenges including algorithmic bias, data privacy concerns, high costs, and skill gaps persist across contexts.
Cross-study synthesis of barriers and challenges reported in the 21 included studies spanning multiple contexts.
high negative Application of Artificial Intelligence in Human Resource Man... prevalence of adoption barriers (bias, privacy, cost, skills)
SMEs face unique resource constraints yet lag in AI-HRM adoption.
Synthesis conclusion from the systematic review of 21 included studies (published 2019–2026) comparing adoption patterns and barriers for SMEs.
high negative Application of Artificial Intelligence in Human Resource Man... AI-HRM adoption (lag) and resource constraints
Greater automation can obscure rather than eliminate failure modes.
Analytical claim in paper arguing that increased automation hides failures; presented as an interpretive finding rather than a quantified experimental result in the excerpt.
high negative AI for Auto-Research: Roadmap & User Guide visibility or obscuration of failure modes under automation
End-to-end autonomous systems have not yet consistently reached major-venue acceptance standards.
Paper's statement based on review of acceptance/peer-review outcomes and standards as of April 2026; no numeric acceptance-rate data presented in the excerpt.
high negative AI for Auto-Research: Roadmap & User Guide consistency of meeting major-venue acceptance standards
Research code lags far behind pattern-matching benchmarks.
Paper's evaluative claim from its experiments/coding analysis indicating code produced for research tasks is weaker than benchmark performance on pattern-matching tasks; excerpt contains no numerical comparison.
high negative AI for Auto-Research: Roadmap & User Guide quality/performance of research code relative to pattern-matching benchmarks
Generated ideas often degrade after implementation.
Paper statement about the gap between idea generation and implemented results reported in the Creation-phase analysis; no quantified follow-up study reported in the excerpt.
high negative AI for Auto-Research: Roadmap & User Guide quality change of generated ideas after implementation
AI remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment.
Summary claim from the paper's end-to-end lifecycle analysis indicating limitations on novelty and experimental rigor; no numeric performance metrics provided in excerpt.
high negative AI for Auto-Research: Roadmap & User Guide robustness on novel ideas, research-level experiments, and scientific judgment
Frontier LLMs fail to judge novelty reliably.
Paper's claim from its Validation-phase analysis that models do not reliably assess novelty; excerpt contains no underlying experimental sample or validation metrics.
high negative AI for Auto-Research: Roadmap & User Guide reliability of novelty judgments
Frontier LLMs miss hidden errors.
Qualitative statement from paper indicating models fail to detect some latent or subtle errors in research artifacts; no numeric evaluation provided in excerpt.
high negative AI for Auto-Research: Roadmap & User Guide ability to detect hidden errors
Under scientific pressure, even frontier LLMs still fabricate results.
Reported observation in paper about model behavior under scientific-use conditions; no specific quantitative experiments or sample sizes given in the excerpt.
high negative AI for Auto-Research: Roadmap & User Guide incidence of fabricated results by LLMs
Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure.
Critical review of deployed system design choices and governance mechanisms; authors assert that attention and tooling focus on observable product-mention-level interventions while higher-tier influences lack measurement and disclosure frameworks.
high negative Generative AI Advertising as a Problem of Trustworthy Commer... coverage of governance/mechanisms across influence tiers and the existence of fr...
These tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes.
Analytical claim supported by examples and discussion of system architectures (e.g., RAG, agentic pipelines) showing how interventions at different stages map to the taxonomy; no quantitative evaluation reported in excerpt.
high negative Generative AI Advertising as a Problem of Trustworthy Commer... presence of influence tiers across different system modalities and architectures
Generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels.
Conceptual argument backed by analysis of how generative models produce outputs and how interventions can operate on latent variables of generation; illustrated via taxonomy in the paper rather than quantified empirical tests.
high negative Generative AI Advertising as a Problem of Trustworthy Commer... modes/channels of commercial influence in advertising systems
Empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users.
Reference to prior empirical studies (unspecified in the excerpt) showing user failure to detect embedded ads in LLM outputs; presented as an empirical finding rather than new experimental data in this paper.
high negative Generative AI Advertising as a Problem of Trustworthy Commer... user detection/recognition of ads embedded in LLM outputs
Management shareholding and analyst attention amplify the debt-cost penalty faced by AI washing firms.
Heterogeneity/interaction analyses showing larger post-shock financing-cost increases for AI washing firms with higher management shareholding and greater analyst attention (descriptive of moderator effects; no sample sizes in abstract).
high negative Dissipation of Debt Financing Privilege on Corporate AI Wash... magnitude of debt financing cost penalty
Difference-in-differences estimations reveal that AI washing firms experience a 12.5 basis point relative increase in debt financing cost afterward.
Difference-in-differences estimations comparing AI washing firms to others before and after the FYP shock; effect reported as 12.5 basis points increase in debt financing cost (sample size not stated in abstract).
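The basic difference-in-differences logic behind an estimate like this can be sketched on group means. The figures below are made up and chosen only so that the arithmetic lands on the same magnitude as the reported effect; they are not the paper's data, and the paper's actual specification (controls, fixed effects, sample) is not given in the excerpt.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post) -> float:
    """Two-by-two difference-in-differences on group means:
    (treated change over time) minus (control change over time)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical debt financing costs in basis points (illustration only).
treated_pre = [500.0, 520.0]    # AI washing firms, before the shock
treated_post = [530.0, 540.0]   # AI washing firms, after the shock
control_pre = [480.0, 500.0]    # other firms, before the shock
control_post = [495.0, 510.0]   # other firms, after the shock

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

Subtracting the control group's change removes any cost movement common to all firms, leaving the relative increase attributed to the treated group.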
Standard health system digital transformation policy, which typically addresses only the threshold failure through individual incentives, is predicted to systematically produce the partial adoption trap.
Model prediction contrasting full policy architecture vs. conventional policies that focus solely on individual incentives; analytical conclusion that such limited policies leave other failure modes unaddressed and therefore lead to stable partial adoption. Theoretical model; no empirical sample.
high negative The partial adoption trap: Coordination failure, trust, and ... policy-induced equilibrium (partial adoption trap likelihood) under conventional...
The barrier-lowering benefit of failed attempts is offset when trust erosion is rapid.
Model analysis combining cost-ratchet dynamics and trust erosion parameters; results showing interaction where fast trust erosion negates barrier reductions. Theoretical simulations/derivations; no empirical sample.
high negative The partial adoption trap: Coordination failure, trust, and ... net effect on adoption barriers given interplay of cost ratchet and trust erosio...
These failure modes are most severe precisely for the technologies with the greatest systemic value: the Value-Adoption Paradox.
Analytical result from the model showing failure-mode severity as a function of systemic value; theoretical identification of a paradox where higher systemic-value technologies face stronger coordination/trust/cultural barriers. Theoretical derivation; no empirical sample.
high negative The partial adoption trap: Coordination failure, trust, and ... relationship between systemic value of technology and severity of adoption failu...
The basin of attraction of the partial adoption trap is enlarged by a cultural failure arising from negative coordination norms among doctors.
Model analysis including cultural coordination norms; theoretical demonstration that negative norms exacerbate partial adoption equilibria. Theoretical model; no empirical sample.
high negative The partial adoption trap: Coordination failure, trust, and ... size of basin of attraction for partial adoption (effect of cultural/coordinatio...
The basin of attraction of the partial adoption trap is enlarged by a trust failure arising from the organisation's inability to credibly commit to sharing productivity gains.
Model extension incorporating organisational commitment/transfer of gains; analytical results showing trust/commitment constraints increase stability of partial adoption. Theoretical model; no empirical sample.
high negative The partial adoption trap: Coordination failure, trust, and ... size of basin of attraction for partial adoption (effect of trust/commitment con...