The Commonplace

Evidence (14055 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
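
The matrix above can be turned into per-outcome direction shares. The following is a minimal sketch, not part of the dashboard: the function names are hypothetical and four rows are copied from the table for illustration. Shares are taken over the four listed directions, whose sum need not equal the Total column (Firm Productivity lists 600 claims across the four directions against a Total of 606).

```python
# Rows copied from the Evidence Matrix: (outcome, positive, negative, mixed, null).
ROWS = [
    ("Firm Productivity", 435, 57, 88, 20),
    ("AI Safety & Ethics", 218, 277, 65, 33),
    ("Job Displacement", 12, 80, 20, 1),
    ("Skill Obsolescence", 5, 46, 6, 1),
]


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the four listed counts.

    Assumption: the denominator is the sum of the four listed columns,
    not the table's Total column, because the two do not always agree.
    """
    listed = positive + negative + mixed + null
    return {
        "positive": positive / listed,
        "negative": negative / listed,
        "mixed": mixed / listed,
        "null": null / listed,
    }


def rank_by_negative_share(rows):
    """Sort outcomes from most to least negative-leaning."""
    scored = [
        (name, direction_shares(*counts)["negative"]) for name, *counts in rows
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, share in rank_by_negative_share(ROWS):
        print(f"{name}: {share:.1%} negative")
```

Ranking by share rather than by raw count keeps small categories such as Skill Obsolescence comparable with large ones such as Governance & Regulation.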
Although GenAI can provide objective support for creative tasks, its use may undermine individuals’ perceptions of their own competence and creative abilities.
Authors' interpretation of the experiment (n=82): measured decreases in intrinsic motivation and skill-related measures alongside unchanged self-evaluated performance led to this conclusion.
medium negative When Ai Sparks Less: Generative Ai And The Decline Of Self-P... perceived competence / self-evaluated creative ability
Investor narratives often capitalize future productivity gains before those gains have appeared in cash flows.
Narrative and issuance/sentiment analysis discussed in the paper showing valuation increases driven by forward-looking narratives relative to realized cash flows; abstract does not report sample size or precise metrics.
medium negative Boom, Bubble, or Buildout? A Multi-Method Evaluation of Whet... valuation increases driven by investor narratives relative to realized cash flow...
The Order should be read as policy that privileges state and cloud-provider access over broader democratic accountability and social considerations (labor, education, culture, the commons).
Synthesis of textual absence of social-domain terms in the EO, the EO's access/control provisions, and the paper's political-economic critique.
medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... privileging of state/cloud access relative to social domains
Structurally, the Order is not deregulation but re-regulation centered on state access and cloud rent—a policy instantiation of technofeudalism with a security face.
Political-economic analysis connecting EO provisions (access, testing, state capabilities) with literature on cloud capital and technofeudalism (e.g., Varoufakis) and the paper's archival operators.
medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... regulatory orientation (deregulation vs re-regulation) and concentration of rent...
The Order mandates testing for 'advanced cyber capabilities' but omits or fails to adopt benchmark frameworks (e.g., Reasoning Under Load (RUL), PER, DSL, IPF, Diversity Contraction, Constitutive Provenance) that the Crimson Hexagonal Archive has deposited.
Comparative policy analysis between the EO's testing mandate language and the list of evaluation frameworks deposited by the Crimson Hexagonal Archive; textual absence of those benchmarks in the EO.
medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... adequacy/coverage of testing benchmarks for AI evaluation
The Order's call for a 'voluntary' corporate framework operates as a 'Mediation Ratchet' that strengthens corporate governance control rather than providing substantive public protections.
Critical/theoretical reading of the Order's voluntary mechanisms combined with the paper's Mediation Ratchet concept.
medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... effect of voluntary frameworks on corporate governance and public accountability
The Order formalizes an 'AI caste system' that stratifies access into public tiers (e.g., Opus 4.8) and frontier/privileged tiers (e.g., Mythos Preview / Glasswing).
Policy text read against observed product/access tiers in industry; theoretical framing of access stratification.
medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... stratification of model access / tiered access policy
The paper presents the 'Anthropic arc' (Feb 27 supply-chain-risk designation → June 1 IPO filing → June 2 EO endorsement) as a worked example of 'Institutional-Prior Foreclosure' via state co-optation of a firm.
Chronological mapping of public events (designation, IPO filing, EO) and interpretive analysis linking them as an example of state-firm coordination/co-optation.
medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... state influence / preferential treatment of firms (institutional foreclosure)
General-purpose behavioral theories used for intervention design do not extend uniformly to this specific healthcare context, motivating an agentic AI approach to theory audits at field-experiment scale.
Field experiment results showed deviations from expectations based on general behavioral theories; authors interpret this as evidence that those theories do not uniformly generalize to the healthcare prescription messaging context.
medium negative Beyond One-shot: AI Agents for Learning in Field Experiments generalizability/applicability of behavioral theories to the tested context
The value in generating better interventions comes from domain-specific experimental data, not from general reasoning ability of frontier LLMs: frontier LLMs operating without experimental data failed to predict which interventions would succeed.
Reported comparison in the paper between AI methods that used experimental data and frontier LLMs run without access to the experimental data; paper states the latter failed to predict success.
medium negative Beyond One-shot: AI Agents for Learning in Field Experiments prediction accuracy / ability to identify successful interventions
Engagement is systematically tied to the intensive, performative labor of children (the platform rewards commodification of the child's identity and labor over traditional advertising), which challenges policy frameworks focused solely on financial trusts.
Synthesis/interpretation based on observed correlations and within-channel view premiums for performative and emotional-bait content versus lack of premium for explicit product placement; policy implication drawn by authors.
medium negative Auditing Engagement Incentives in the Kidfluencer Ecosystem:... engagement/view counts tied to performative labor (policy implication)
Confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer.
Reported conditional rate in the experiment: under-reliance when AI agrees with humans' initial incorrect answer is 64.5%; authors attribute this to confirmation bias.
medium negative AI, Take the Wheel: What Drives Delegation and Trust in Huma... under-reliance rate conditional on AI agreeing with human's initial incorrect an...
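
A conditional rate such as the one reported above is the share of trials meeting a condition that also show the event of interest. The sketch below illustrates only that arithmetic. The `Trial` fields, the toy records, and the assumption that each trial already carries an under-reliance label are hypothetical; the entry does not state how the paper labels trials, and the toy data does not reproduce the 64.5% figure.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    ai_agrees_with_initial: bool  # AI suggestion matches the human's first answer
    initial_correct: bool         # the human's first answer was correct
    under_reliance: bool          # trial labeled as under-reliance (labeling rule assumed given)


def conditional_rate(trials, condition, event):
    """Share of trials satisfying `condition` that also satisfy `event`."""
    selected = [t for t in trials if condition(t)]
    if not selected:
        raise ValueError("no trials satisfy the condition")
    return sum(1 for t in selected if event(t)) / len(selected)


# Toy records, invented purely to exercise the function.
TRIALS = [
    Trial(True, False, True),
    Trial(True, False, True),
    Trial(True, False, False),
    Trial(True, True, False),
    Trial(False, False, False),
    Trial(False, True, True),
]

# Condition: the AI agrees with an initial answer that was incorrect.
rate = conditional_rate(
    TRIALS,
    condition=lambda t: t.ai_agrees_with_initial and not t.initial_correct,
    event=lambda t: t.under_reliance,
)
print(f"under-reliance rate given agreement with an incorrect answer: {rate:.1%}")
```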
Governance ambiguity is responsible for 61% of hybrid workflow failures (and the framework aims to remediate this).
Paper reports 'governance ambiguity responsible for 61% of hybrid workflow failures' as a documented gap; no methodological details or sample size provided in the abstract.
medium negative Workforce Unit Abstraction for Governing Hybrid Human and Ar... proportion of hybrid workflow failures attributed to governance ambiguity
Attribution failures occur in 68% of organizations (and the framework addresses these attribution failures).
Paper states 'attribution failures in 68% of organizations' as a documented gap the constructs address; abstract does not report study method or sample size behind the 68% figure.
medium negative Workforce Unit Abstraction for Governing Hybrid Human and Ar... prevalence of performance attribution failures across organizations
Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story.
Author's characterization/literature-based claim in the paper (argumentative statement; no empirical data provided in this excerpt).
medium negative JobBench: Aligning Agent Work With Human Will framing/scope of existing occupational AI benchmarks (economic-value / replaceme...
Public discourse often portrays AI as a threat to employment.
Statement in the paper summarizing public/media discourse; no specific survey or corpus size reported in the excerpt.
medium negative From Automation Panic to Workforce Resilience: A Governance ... public portrayal of AI's employment impact
The observed wage penalty in high-exposure neighborhoods is driven by task de-skilling and intensified labor-market crowding.
Mechanism analyses linking task-level changes (de-skilling as measured by task assessments) and measures of labor-market crowding to the wage penalties observed in high-exposure neighborhoods, using the same 5 million job postings and task-aggregation approach.
medium negative Generative AI impacts on intra-urban inequality and skill pr... wage penalty and its mechanisms (task de-skilling, labor-market crowding)
Agents trained on such benchmarks learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems.
Reported empirical failure modes observed when integrating agent-generated kernels into production stacks (descriptive claim in the paper's motivation/observations). No explicit numeric sample size in abstract.
medium negative FastKernels: Benchmarking GPU Kernel Generation in Productio... error_rate (correctness degradation) and integration compatibility
The lack of a relationship between prior productivity and AI adoption points to organizational readiness as a key barrier to AI diffusion.
Interpretation/inference based on the null finding that prior productivity does not predict adoption and the observed associations with digital infrastructure and management practices in the survey data.
medium negative The Adoption of Industrial AI in America organizational readiness as a barrier to AI diffusion
The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes.
Exploratory decomposition analyses reported as follow-ups suggesting both low-level (token) and higher-level (semantic) contributions to asymmetry; authors note limited sample size for attribution.
medium negative AMEL: Accumulated Message Effects on LLM Judgments sources (token-level vs semantic) of the observed negativity asymmetry
Developer adoption has overwhelmingly favored orchestration (despite the viability of subterranean agents).
Author observation/claim about adoption trends (contrasting the feasibility shown in prior work with observed developer choices, likely based on ecosystem signals such as project popularity and usage).
medium negative Compiling Agentic Workflows into LLM Weights: Near-Frontier ... developer adoption preference (orchestration vs. subterranean agents)
Without design corrections that better align AI development with workers' needs, workplace AI incidents are likely to persist, causing the invisible erosion of worker agency and organizational productivity.
Interpretation / implication drawn from empirical findings (high prevalence of misalignments and developer-driven misalignments); presented as a policy/design recommendation and projected outcome.
medium negative The Quiet Path from Seemingly Minor Design Errors to Workpla... persistence of incidents and resulting erosion of worker agency and organization...
As a result of high environment-feedback bandwidth, the marginal benefit of curated Skills diminishes substantially and, in some cases (e.g., our timing side-channel setting), adding Skills actively degrades performance.
Authors report observed cases in their re-analysis (including a timing side-channel setting) where adding Skills reduced performance; they interpret this as evidence that high-feedback environments can make Skills redundant or harmful.
medium negative When Skills Don't Help: A Negative Result on Procedural Know... task performance (including degradation in timing side-channel setting) when Ski...
Consolidation creates platform monopolies extracting value from professional labour while eliminating the expertise that creates it.
Synthesis of market concentration data and theoretical frameworks (platform capitalism) presented in the paper.
medium negative Operating the franchise: vendor consolidation, algorithmic m... extraction of value from professional labour / erosion of professional expertise
AI implementation serves vendor interests in labour cost reduction rather than improving information access.
Analytic argument supported by synthesis of vendor consolidation data, documented implementations, and theoretical analysis of vendor incentives.
medium negative Operating the franchise: vendor consolidation, algorithmic m... vendor-motivated labour cost reduction (impact on labour and information access)
Librarians bear operational accountability for systems they neither control nor can modify.
Critical qualitative synthesis including a revelatory case study of verification infrastructure failure and theoretical framing (platform capitalism, sociology of professions, critical information science).
medium negative Operating the franchise: vendor consolidation, algorithmic m... professional autonomy / responsibility borne by librarians
Open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
Comparative evaluation reported in the paper showing open-source models' performance under OpenComputer verifiers is substantially lower than their OSWorld-Verified scores.
medium negative OpenComputer: Verifiable Software Worlds for Computer-Use Ag... verified performance gap (score drop) between evaluations
When deliberation tools are distributed across a hierarchy they can interact destructively (a 'deliberation cascade'), producing substantially worse returns and higher token costs than hierarchy alone.
Observed cross-configuration pattern labeled in the paper as a 'deliberation cascade', supported by empirical comparisons showing degraded mean returns and increased token usage for distributed deliberation across model families in the 3,475-episode evaluation.
medium negative Context, Reasoning, and Hierarchy: A Cost-Performance Study ... mean return and token consumption
The tech industry's discourse of exceptionalism obscures its dependence on BPOs to externalise labour costs and accountability.
Argument in paper supported by the authors' GDPR-based document findings that reveal BPO involvement and contract practices; specific linkage details not provided in the excerpt.
medium negative Auditing African Content Moderators' Working Conditions by U... degree to which industry discourse conceals reliance on BPOs for labour external...
Institutionally, high-wage Nordic regimes paradoxically impose higher opportunity costs on caregivers.
Comparative cross-national analysis across European welfare regimes using SHARE (2016-2021), indicating higher opportunity costs (e.g., forgone earnings) in high-wage Nordic countries.
medium negative The Broken Shield of European Palliative Care: Evidence from... Opportunity costs (forgone earnings/time) associated with caregiving under PC in...
Rigid gender dynamics trigger labor market ejection.
Analysis linking gender-role patterns among caregivers in SHARE (2016-2021) to negative employment outcomes (labor market exit/ejection) for affected individuals.
medium negative The Broken Shield of European Palliative Care: Evidence from... Labor market participation/employment (caregiver ejection from labor market)
AI created challenges by reducing routine-based employment.
Authors' interpretation of the empirical findings from SEM and descriptive statistics on the survey sample (n=320); the summary states that routine-based employment was reduced but provides no numerical estimate.
medium negative ARTIFICIAL INTELLIGENCE, AUTOMATION, AND LABOR MARKET TRANSF... routine-based employment
Existing AI-assisted annotation workflows typically offer annotators no signal about where spatial (localization) errors are most likely, causing humans to potentially underinspect subtly misplaced boxes.
Statement in paper framing the problem; comparative claim about current AI-assisted labeling workflows and likely human inspection behavior in absence of spatial-uncertainty cues.
medium negative From Model Uncertainty to Human Attention: Localization-Awar... rate of underinspection / missed localization errors
Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.
Synthesis of findings: probes show perceptual/multimodal representations encode mismatches (perception intact) while decoding/behavior fails to surface these signals (translation/action bottleneck); supported by PGLA improving behavior when mismatch signal is reintroduced at decoding.
Analysis of four additional platforms suggests the attack may generalise across the knowledge-graph ecosystem.
Authors report analysis across four additional platforms and observe indications that the attack generalises; specific platform names and quantitative outcomes not provided in the summary.
medium negative Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... presence of vulnerability to Oracle Poisoning across additional knowledge-graph ...
An attacker sophistication gradient reveals discrete break points: a minimum skill level at which trust flips from 0% to 100%, reframing the attack as a question not of whether but of how much.
Experiments varying attacker sophistication levels reported by authors; observed threshold behavior (discrete break points) in model trust outcomes.
medium negative Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... change in model trust/acceptance rate as attacker sophistication increases
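
The threshold behavior described above (trust at 0% below some sophistication level and 100% from that level upward) can be checked mechanically. This is a minimal sketch under stated assumptions: the integer level scale, the `find_break_point` helper, and the example trust rates are hypothetical and do not come from the paper.

```python
def find_break_point(trust_by_level):
    """Return the lowest level at which trust is 100%, provided every lower
    level is at 0% and every level from there upward is at 100%.

    Returns None when the data does not show a clean 0%-to-100% flip.
    `trust_by_level` maps a sophistication level to a trust rate in [0, 1].
    """
    levels = sorted(trust_by_level)
    for i, level in enumerate(levels):
        below = [trust_by_level[l] for l in levels[:i]]
        at_or_above = [trust_by_level[l] for l in levels[i:]]
        if all(r == 0.0 for r in below) and all(r == 1.0 for r in at_or_above):
            return level
    return None


# Hypothetical trust rates per attacker sophistication level.
clean_flip = {1: 0.0, 2: 0.0, 3: 1.0, 4: 1.0}
gradual_rise = {1: 0.0, 2: 0.5, 3: 1.0}

print(find_break_point(clean_flip))
print(find_break_point(gradual_rise))
```

A gradual rise returns None, which distinguishes a discrete break point from a smooth dose-response curve.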
AI adoption may reinforce, rather than mitigate, the challenges arising from internal divisions within TMTs, with respect to environmental strategic decision-making.
Interpretation/implication drawn from the empirical findings (negative moderation and moderated mediation involving AI) based on the panel analyses (35,347 firm-year observations).
medium negative When AI Amplifies Negative Echoes: CEO–TMT Faultlines, Eco-A... organizational attention and green innovation (strategic decision-making outcome...
In current LLM markets, aesthetic improvements function as baseline expectations rather than as sources of price differentiation.
Interpretation based on experimental results (no WTP premium for aesthetic qualities) and latent-factor evidence that aesthetic and functional qualities are not separable in perceived quality.
medium negative Artificial Aesthetics: The Implicit Economics of Valuing AI-... price differentiation / market pricing
The findings challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.
Author interpretation/conclusion drawn from the experimental results showing Olava Extract's performance and cost advantages relative to frontier models.
medium negative A Few Good Clauses: Comparing LLMs vs Domain-Trained Small L... relationship between model size/hosting/infrastructure and commercial enterprise...
The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget.
Empirical/experimental results reported in the paper based on the three verification methods (finite-state enumeration, SMT checks, PRISM-games MDP); claims about 'large violations' and 'large mean gaming gap' are based on tested instances and random catalog experiments described in the paper.
Existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored.
Authors' literature/benchmark review and motivation for creating Workspace-Bench (qualitative claim in paper).
medium negative Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... coverage/realism of file dependencies in existing benchmarks
The predictive signal is not attributable to any single inventor, but emerges as a collective shift in how technologies are described across thousands of patents.
Comparative analyses/robustness checks across inventors and patent sets showing the signal persists when not driven by any single inventor or small group; abstract explicitly states this conclusion and refers to 'thousands of patents' as the aggregation level.
medium negative Anticipating Innovation Using Large Language Models attribution of predictive linguistic signal to individual inventors versus colle...
Prior to this work, no LLM-based agent had demonstrated end-to-end autonomous discovery in a real physical system producing a nontrivial result supported by experimental evidence.
Authors' statement about prior work (negative claim about the state of the field) asserted in the paper; based on literature review by the authors rather than an exhaustive independent verification.
medium negative End-to-end autonomous scientific discovery on a real optical... absence of prior demonstrations of end-to-end autonomous LLM-driven physical-sys...
Gradient-based attribution can be inflated by adversarial inputs, and detecting such inflation requires external baseline data.
Adversarial-testing experiments reported in paper that demonstrate inflation of attribution by adversarial inputs and indicate detection depends on availability of external baseline data.
medium negative Calibrating Attribution Proxies for Reward Allocation in Par... vulnerability to adversarial manipulation and detectability of such manipulation
Unless targeted interventions occur — including inclusive education, vocational training, and labor reforms — AI may exacerbate poverty and joblessness.
Inference and policy recommendation based on the systematic review's identification of risks; presented as a conditional/forecast rather than a measured causal estimate in the summary.
medium negative The Impact of AI-Driven Automation on Semi and Unskilled Wor... poverty and joblessness in the absence of targeted interventions
The paper's analysis of implementation ambiguities shows that these accountability challenges arise in practice.
Paper reports analysis of implementation ambiguities (qualitative/examples); no quantitative sample size or systematic empirical evaluation described in the summary.
medium negative How Supply Chain Dependencies Complicate Bias Measurement an... presence of real-world implementation ambiguities that hinder accountability and...
Existing assessments that rely predominantly on patent statistics and structural network centralities dilute substantive technological strengths and thus can obscure hidden core innovators in knowledge-intensive domains such as AI.
Argument supported by comparative analysis in this study showing differences between capability-driven identification and traditional patent/centrality-based approaches using the 282,778 Chinese AI patents.
medium negative Technological capability and innovation network resilience: ... adequacy of patent-count and centrality-based assessments to capture technologic...
Because this leakage arises from delegation itself, it cannot be mitigated at the prompt level.
Paper's argument combining theoretical reasoning about delegation-induced channels and experimental evidence showing prompt-level confidentiality instructions do not prevent inference (as implied by the numeric-budget comparison). Specific experimental details not provided in excerpt.
medium negative When Agents Shop for You: Role Coherence in AI-Mediated Mark... effectiveness of prompt-level mitigation (confidentiality instructions) in preve...
DeepSeek-V2 appears as a conservative scorer, applying more stringent and highly consistent evaluations while systematically underfunding.
Observed pattern in experimental results: DeepSeek-V2 produced stricter (lower) evaluation scores, high consistency across runs (implied by ICCs), and lower funding recommendations across the 20 decks.
medium negative Algorithmic personalities and the myth of neutrality: financ... evaluation scores; funding recommendations; reliability
The proliferation into competing Shapley formulations has created a fragmented landscape with little consensus on practical deployment.
Motivating literature review and discussion in the paper noting multiple competing Shapley variants and lack of consensus on practical deployment decisions.
medium negative Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... degree of consensus on practical deployment of Shapley formulations