Evidence (14055 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
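
The matrix is easier to compare across outcomes as shares than as raw counts. The sketch below is a minimal illustration, not part of the source page: it copies four rows from the table and computes each direction's share of the counted claims. In several rows the Total is larger than the sum of the four direction columns. The page does not say why, so the hypothetical helper `unassigned` only reports the difference. The names `ROWS`, `direction_shares`, and `unassigned` are assumptions introduced for this example.

```python
# Rows copied from the matrix above:
# (outcome, positive, negative, mixed, null, total)
ROWS = [
    ("Firm Productivity", 435, 57, 88, 20, 606),
    ("AI Safety & Ethics", 218, 277, 65, 33, 599),
    ("Developer Productivity", 95, 17, 14, 6, 133),
    ("Job Displacement", 12, 80, 20, 1, 113),
]

DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the summed direction counts."""
    counts = (positive, negative, mixed, null)
    counted = sum(counts)
    return {name: value / counted for name, value in zip(DIRECTIONS, counts)}


def unassigned(positive, negative, mixed, null, total):
    """Return how many claims in Total are not covered by the four columns."""
    return total - (positive + negative + mixed + null)


if __name__ == "__main__":
    for outcome, pos, neg, mixed, null, total in ROWS:
        shares = direction_shares(pos, neg, mixed, null)
        gap = unassigned(pos, neg, mixed, null, total)
        print(
            f"{outcome}: positive {shares['positive']:.1%}, "
            f"negative {shares['negative']:.1%}, unassigned {gap}"
        )
```

Shares are taken over the four direction columns rather than over Total, so they sum to one even in rows where the two differ.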

Although GenAI can provide objective support for creative tasks, its use may undermine individuals’ perceptions of their own competence and creative abilities.

Authors' interpretation of the experiment (n=82): measured decreases in intrinsic motivation and skill-related measures alongside unchanged self-evaluated performance led to this conclusion.

medium negative When Ai Sparks Less: Generative Ai And The Decline Of Self-P... perceived competence / self-evaluated creative ability

Investor narratives often capitalize future productivity gains before those gains have appeared in cash flows.

Narrative and issuance/sentiment analysis discussed in the paper showing valuation increases driven by forward-looking narratives relative to realized cash flows; abstract does not report sample size or precise metrics.

medium negative Boom, Bubble, or Buildout? A Multi-Method Evaluation of Whet... valuation increases driven by investor narratives relative to realized cash flow...

The Order should be read as policy that privileges state and cloud-provider access over broader democratic accountability and social considerations (labor, education, culture, the commons).

Synthesis of textual absence of social-domain terms in the EO, the EO's access/control provisions, and the paper's political-economic critique.

medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... privileging of state/cloud access relative to social domains

Structurally, the Order is not deregulation but re-regulation centered on state access and cloud rent—a policy instantiation of technofeudalism with a security face.

Political-economic analysis connecting EO provisions (access, testing, state capabilities) with literature on cloud capital and technofeudalism (e.g., Varoufakis) and the paper's archival operators.

medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... regulatory orientation (deregulation vs re-regulation) and concentration of rent...

The Order mandates testing for 'advanced cyber capabilities' but omits or fails to adopt benchmark frameworks (e.g., Reasoning Under Load (RUL), PER, DSL, IPF, Diversity Contraction, Constitutive Provenance) that the Crimson Hexagonal Archive has deposited.

Comparative policy analysis between the EO's testing mandate language and the list of evaluation frameworks deposited by the Crimson Hexagonal Archive; textual absence of those benchmarks in the EO.

medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... adequacy/coverage of testing benchmarks for AI evaluation

The Order's call for a 'voluntary' corporate framework operates as a 'Mediation Ratchet' that strengthens corporate governance control rather than providing substantive public protections.

Critical/theoretical reading of the Order's voluntary mechanisms combined with the paper's Mediation Ratchet concept.

medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... effect of voluntary frameworks on corporate governance and public accountability

The Order formalizes an 'AI caste system' that stratifies access into public tiers (e.g., Opus 4.8) and frontier/privileged tiers (e.g., Mythos Preview / Glasswing).

Policy text read against observed product/access tiers in industry; theoretical framing of access stratification.

medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... stratification of model access / tiered access policy

The paper presents the 'Anthropic arc' (Feb 27 supply-chain-risk designation → June 1 IPO filing → June 2 EO endorsement) as a worked example of 'Institutional-Prior Foreclosure' via state co-optation of a firm.

Chronological mapping of public events (designation, IPO filing, EO) and interpretive analysis linking them as an example of state-firm coordination/co-optation.

medium negative The Security Frame Is a Selection Kernel: Trump's AI Executi... state influence / preferential treatment of firms (institutional foreclosure)

General-purpose behavioral theories used for intervention design do not extend uniformly to this specific healthcare context, motivating an agentic AI approach to theory audits at field-experiment scale.

Field experiment results showed deviations from expectations based on general behavioral theories; authors interpret this as evidence that those theories do not uniformly generalize to the healthcare prescription messaging context.

medium negative Beyond One-shot: AI Agents for Learning in Field Experiments generalizability/applicability of behavioral theories to the tested context

The value in generating better interventions comes from domain-specific experimental data, not from general reasoning ability of frontier LLMs: frontier LLMs operating without experimental data failed to predict which interventions would succeed.

Reported comparison in the paper between AI methods that used experimental data and frontier LLMs run without access to the experimental data; paper states the latter failed to predict success.

medium negative Beyond One-shot: AI Agents for Learning in Field Experiments prediction accuracy / ability to identify successful interventions

Engagement is systematically tied to the intensive, performative labor of children (the platform rewards commodification of the child's identity and labor over traditional advertising), which challenges policy frameworks focused solely on financial trusts.

Synthesis/interpretation based on observed correlations and within-channel view premiums for performative and emotional-bait content versus lack of premium for explicit product placement; policy implication drawn by authors.

medium negative Auditing Engagement Incentives in the Kidfluencer Ecosystem:... engagement/view counts tied to performative labor (policy implication)

Confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer.

Reported conditional rate in the experiment: under-reliance when AI agrees with humans' initial incorrect answer is 64.5%; authors attribute this to confirmation bias.

medium negative AI, Take the Wheel: What Drives Delegation and Trust in Huma... under-reliance rate conditional on AI agreeing with human's initial incorrect an...

Governance ambiguity is responsible for 61% of hybrid workflow failures (and the framework aims to remediate this).

Paper reports 'governance ambiguity responsible for 61% of hybrid workflow failures' as a documented gap; no methodological details or sample size provided in the abstract.

medium negative Workforce Unit Abstraction for Governing Hybrid Human and Ar... proportion of hybrid workflow failures attributed to governance ambiguity

Attribution failures occur in 68% of organizations (and the framework addresses these attribution failures).

Paper states 'attribution failures in 68% of organizations' as a documented gap the constructs address; abstract does not report study method or sample size behind the 68% figure.

medium negative Workforce Unit Abstraction for Governing Hybrid Human and Ar... prevalence of performance attribution failures across organizations

Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story.

Author's characterization/literature-based claim in the paper (argumentative statement; no empirical data provided in this excerpt).

medium negative JobBench: Aligning Agent Work With Human Will framing/scope of existing occupational AI benchmarks (economic-value / replaceme...

Public discourse often portrays AI as a threat to employment.

Statement in the paper summarizing public/media discourse; no specific survey or corpus size reported in the excerpt.

medium negative From Automation Panic to Workforce Resilience: A Governance ... public portrayal of AI's employment impact

The observed wage penalty in high-exposure neighborhoods is driven by task de-skilling and intensified labor-market crowding.

Mechanism analyses linking task-level changes (de-skilling as measured by task assessments) and measures of labor-market crowding to the wage penalties observed in high-exposure neighborhoods, using the same 5 million job postings and task-aggregation approach.

medium negative Generative AI impacts on intra-urban inequality and skill pr... wage penalty and its mechanisms (task de-skilling, labor-market crowding)

Agents trained on such benchmarks learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems.

Reported empirical failure modes observed when integrating agent-generated kernels into production stacks (descriptive claim in the paper's motivation/observations). No explicit numeric sample size in abstract.

medium negative FastKernels: Benchmarking GPU Kernel Generation in Productio... error_rate (correctness degradation) and integration compatibility

The lack of a relationship between prior productivity and AI adoption points to organizational readiness as a key barrier to AI diffusion.

Interpretation/inference based on the null finding that prior productivity does not predict adoption and the observed associations with digital infrastructure and management practices in the survey data.

medium negative The Adoption of Industrial AI in America organizational readiness as a barrier to AI diffusion

The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes.

Exploratory decomposition analyses reported as follow-ups suggesting both low-level (token) and higher-level (semantic) contributions to asymmetry; authors note limited sample size for attribution.

medium negative AMEL: Accumulated Message Effects on LLM Judgments sources (token-level vs semantic) of the observed negativity asymmetry

Developer adoption has overwhelmingly favored orchestration (despite the viability of subterranean agents).

Author observation/claim about adoption trends (contrasting the feasibility shown in prior work with observed developer choices, likely based on ecosystem signals such as project popularity and usage).

medium negative Compiling Agentic Workflows into LLM Weights: Near-Frontier ... developer adoption preference (orchestration vs. subterranean agents)

Without design corrections that better align AI development with workers' needs, workplace AI incidents are likely to persist, causing the invisible erosion of worker agency and organizational productivity.

Interpretation / implication drawn from empirical findings (high prevalence of misalignments and developer-driven misalignments); presented as a policy/design recommendation and projected outcome.

medium negative The Quiet Path from Seemingly Minor Design Errors to Workpla... persistence of incidents and resulting erosion of worker agency and organization...

As a result (of high environment-feedback bandwidth), the marginal benefit of curated Skills diminishes substantially and, in some cases (e.g., our timing side-channel setting), actively degrades performance.

Authors report observed cases in their re-analysis (including a timing side-channel setting) where adding Skills reduced performance; they interpret this as evidence that high-feedback environments can make Skills redundant or harmful.

medium negative When Skills Don't Help: A Negative Result on Procedural Know... task performance (including degradation in timing side-channel setting) when Ski...

Consolidation creates platform monopolies extracting value from professional labour while eliminating the expertise that creates it.

Synthesis of market concentration data and theoretical frameworks (platform capitalism) presented in the paper.

medium negative Operating the franchise: vendor consolidation, algorithmic m... extraction of value from professional labour / erosion of professional expertise

AI implementation serves vendor interests in labour cost reduction rather than improving information access.

Analytic argument supported by synthesis of vendor consolidation data, documented implementations, and theoretical analysis of vendor incentives.

medium negative Operating the franchise: vendor consolidation, algorithmic m... vendor-motivated labour cost reduction (impact on labour and information access)

Librarians bear operational accountability for systems they neither control nor can modify.

Critical qualitative synthesis including a revelatory case study of verification infrastructure failure and theoretical framing (platform capitalism, sociology of professions, critical information science).

medium negative Operating the franchise: vendor consolidation, algorithmic m... professional autonomy / responsibility borne by librarians

Open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.

Comparative evaluation reported in the paper showing open-source models' performance under OpenComputer verifiers is substantially lower than their OSWorld-Verified scores.

medium negative OpenComputer: Verifiable Software Worlds for Computer-Use Ag... verified performance gap (score drop) between evaluations

When deliberation tools are distributed across a hierarchy they can interact destructively (a 'deliberation cascade'), producing substantially worse returns and higher token costs than hierarchy alone.

Observed cross-configuration pattern labeled in the paper as a 'deliberation cascade', supported by empirical comparisons showing degraded mean returns and increased token usage for distributed deliberation across model families in the 3,475-episode evaluation.

medium negative Context, Reasoning, and Hierarchy: A Cost-Performance Study ... mean return and token consumption

The tech industry's discourse of exceptionalism obscures its dependence on BPOs to externalise labour costs and accountability.

Argument in paper supported by the authors' GDPR-based document findings that reveal BPO involvement and contract practices; specific linkage details not provided in the excerpt.

medium negative Auditing African Content Moderators' Working Conditions by U... degree to which industry discourse conceals reliance on BPOs for labour external...

Institutionally, high-wage Nordic regimes paradoxically impose opportunity costs.

Comparative cross-national analysis across European welfare regimes using SHARE (2016-2021), indicating higher opportunity costs (e.g., forgone earnings) in high-wage Nordic countries.

medium negative The Broken Shield of European Palliative Care: Evidence from... Opportunity costs (forgone earnings/time) associated with caregiving under PC in...

Rigid gender dynamics trigger labor market ejection.

Analysis linking gender-role patterns among caregivers in SHARE (2016-2021) to negative employment outcomes (labor market exit/ejection) for affected individuals.

medium negative The Broken Shield of European Palliative Care: Evidence from... Labor market participation/employment (caregiver ejection from labor market)

AI created challenges by reducing routine-based employment.

Authors' interpretation of the empirical findings from SEM and descriptive statistics on the survey sample (n=320); the summary states routine-based employment was reduced but no numerical estimate provided in the summary.

medium negative ARTIFICIAL INTELLIGENCE, AUTOMATION, AND LABOR MARKET TRANSF... routine-based employment

Existing AI-assisted annotation workflows typically offer annotators no signal about where spatial (localization) errors are most likely, causing humans to potentially underinspect subtly misplaced boxes.

Statement in paper framing the problem; comparative claim about current AI-assisted labeling workflows and likely human inspection behavior in absence of spatial-uncertainty cues.

medium negative From Model Uncertainty to Human Attention: Localization-Awar... rate of underinspection / missed localization errors

Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

Synthesis of findings: probes show perceptual/multimodal representations encode mismatches (perception intact) while decoding/behavior fails to surface these signals (translation/action bottleneck); supported by PGLA improving behavior when mismatch signal is reintroduced at decoding.

medium negative Senses Wide Shut: A Representation-Action Gap in Omnimodal L... decision_quality

Analysis of four additional platforms suggests the attack may generalise across the knowledge-graph ecosystem.

Authors report analysis across four additional platforms and observe indications that the attack generalises; specific platform names and quantitative outcomes not provided in the summary.

medium negative Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... presence of vulnerability to Oracle Poisoning across additional knowledge-graph ...

An attacker sophistication gradient reveals discrete break points, a minimum skill at which trust flips from 0% to 100%, reframing the attack as a question not of whether but of how much.

Experiments varying attacker sophistication levels reported by authors; observed threshold behavior (discrete break points) in model trust outcomes.

medium negative Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... change in model trust/acceptance rate as attacker sophistication increases

AI adoption may reinforce, rather than mitigate, the challenges that internal divisions within TMTs pose for environmental strategic decision-making.

Interpretation/implication drawn from the empirical findings (negative moderation and moderated mediation involving AI) based on the panel analyses (35,347 firm-year observations).

medium negative When AI Amplifies Negative Echoes: CEO–TMT Faultlines, Eco-A... organizational attention and green innovation (strategic decision-making outcome...

In current LLM markets, aesthetic improvements function as baseline expectations rather than as sources of price differentiation.

Interpretation based on experimental results (no WTP premium for aesthetic qualities) and latent-factor evidence that aesthetic and functional qualities are not separable in perceived quality.

medium negative Artificial Aesthetics: The Implicit Economics of Valuing AI-... price differentiation / market pricing

The findings challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.

Author interpretation/conclusion drawn from the experimental results showing Olava Extract's performance and cost advantages relative to frontier models.

medium negative A Few Good Clauses: Comparing LLMs vs Domain-Trained Small L... relationship between model size/hosting/infrastructure and commercial enterprise...

The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget.

Empirical/experimental results reported in the paper based on the three verification methods (finite-state enumeration, SMT checks, PRISM-games MDP); claims about 'large violations' and 'large mean gaming gap' are based on tested instances and random catalog experiments described in the paper.

medium negative Gaming the Metric, Not the Harm: Certifying Safety Audits ag... regulatory_compliance

Existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored.

Authors' literature/benchmark review and motivation for creating Workspace-Bench (qualitative claim in paper).

medium negative Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... coverage/realism of file dependencies in existing benchmarks

The predictive signal is not attributable to any single inventor, but emerges as a collective shift in how technologies are described across thousands of patents.

Comparative analyses/robustness checks across inventors and patent sets showing the signal persists when not driven by any single inventor or small group; abstract explicitly states this conclusion and refers to 'thousands of patents' as the aggregation level.

medium negative Anticipating Innovation Using Large Language Models attribution of predictive linguistic signal to individual inventors versus colle...

Prior to this work, no LLM-based agent had demonstrated end-to-end autonomous discovery in a real physical system producing a nontrivial result supported by experimental evidence.

Authors' statement about prior work (negative claim about the state of the field) asserted in the paper; based on literature review by the authors rather than an exhaustive independent verification.

medium negative End-to-end autonomous scientific discovery on a real optical... absence of prior demonstrations of end-to-end autonomous LLM-driven physical-sys...

Gradient-based attribution can be inflated by adversarial inputs, and detecting such inflation requires external baseline data.

Adversarial-testing experiments reported in paper that demonstrate inflation of attribution by adversarial inputs and indicate detection depends on availability of external baseline data.

medium negative Calibrating Attribution Proxies for Reward Allocation in Par... vulnerability to adversarial manipulation and detectability of such manipulation

Unless targeted interventions occur — including inclusive education, vocational training, and labor reforms — AI may exacerbate poverty and joblessness.

Inference and policy recommendation based on the systematic review's identification of risks; presented as a conditional/forecast rather than a measured causal estimate in the summary.

medium negative The Impact of AI-Driven Automation on Semi and Unskilled Wor... poverty and joblessness in the absence of targeted interventions

Analysis of implementation ambiguities shows that these ambiguities hinder accountability in practice.

Paper reports analysis of implementation ambiguities (qualitative/examples); no quantitative sample size or systematic empirical evaluation described in the summary.

medium negative How Supply Chain Dependencies Complicate Bias Measurement an... presence of real-world implementation ambiguities that hinder accountability and...

Existing assessments that rely predominantly on patent statistics and structural network centralities dilute substantive technological strengths and thus can obscure hidden core innovators in knowledge-intensive domains such as AI.

Argument supported by comparative analysis in this study showing differences between capability-driven identification and traditional patent/centrality-based approaches using the 282,778 Chinese AI patents.

medium negative Technological capability and innovation network resilience: ... adequacy of patent-count and centrality-based assessments to capture technologic...

Because this leakage arises from delegation itself, it cannot be mitigated at the prompt level.

Paper's argument combining theoretical reasoning about delegation-induced channels and experimental evidence showing prompt-level confidentiality instructions do not prevent inference (as implied by the numeric-budget comparison). Specific experimental details not provided in excerpt.

medium negative When Agents Shop for You: Role Coherence in AI-Mediated Mark... effectiveness of prompt-level mitigation (confidentiality instructions) in preve...

DeepSeek-V2 appears as a conservative scorer, applying more stringent and highly consistent evaluations while systematically underfunding.

Observed pattern in experimental results: DeepSeek-V2 produced stricter (lower) evaluation scores, high consistency across runs (implied by ICCs), and lower funding recommendations across the 20 decks.

medium negative Algorithmic personalities and the myth of neutrality: financ... evaluation scores; funding recommendations; reliability

The proliferation into competing Shapley formulations has created a fragmented landscape with little consensus on practical deployment.

Motivating literature review and discussion in the paper noting multiple competing Shapley variants and lack of consensus on practical deployment decisions.

medium negative Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... degree of consensus on practical deployment of Shapley formulations
