Evidence (14055 claims)
| Topic | Claims |
|---|---|
| Adoption | 8570 |
| Productivity | 7631 |
| Governance | 6869 |
| Human-AI Collaboration | 6491 |
| Org Design | 4175 |
| Innovation | 4114 |
| Labor Markets | 3566 |
| Skills & Training | 2966 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
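The matrix above can be read as simple arithmetic: each direction's share of a row, and the gap between a row's listed Total and the sum of its four direction columns. The following is a minimal Python sketch using four rows copied from the table. The function name `direction_shares` and the term "residual" are illustrative assumptions, not part of the page, and the page does not say what accounts for any residual.

```python
# Rows copied from the Evidence Matrix:
# (positive, negative, mixed, null, listed_total)
ROWS = {
    "Governance & Regulation": (826, 400, 191, 122, 1563),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "AI Safety & Ethics": (218, 277, 65, 33, 599),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(positive, negative, mixed, null, listed_total):
    """Return (shares, residual) for one matrix row.

    shares: each direction's fraction of the four shown columns.
    residual: listed total minus the sum of the shown columns
              (hypothetical label; the page does not explain this gap).
    """
    shown = positive + negative + mixed + null
    shares = {
        "positive": positive / shown,
        "negative": negative / shown,
        "mixed": mixed / shown,
        "null": null / shown,
    }
    return shares, listed_total - shown


for name, row in ROWS.items():
    shares, residual = direction_shares(*row)
    print(
        f"{name}: positive {shares['positive']:.1%}, "
        f"negative {shares['negative']:.1%}, residual {residual}"
    )
```

Under these assumptions, Firm Productivity comes out at a 72.5% positive share of the shown claims with a residual of 6, while Job Displacement is about 70.8% negative with no residual. Shares are computed over the four shown columns rather than the listed Total, so they always sum to one.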
Although GenAI can provide objective support for creative tasks, its use may undermine individuals’ perceptions of their own competence and creative abilities.
Authors' interpretation of the experiment (n=82): measured decreases in intrinsic motivation and skill-related measures alongside unchanged self-evaluated performance led to this conclusion.
Investor narratives often capitalize future productivity gains before those gains have appeared in cash flows.
Narrative and issuance/sentiment analysis discussed in the paper showing valuation increases driven by forward-looking narratives relative to realized cash flows; abstract does not report sample size or precise metrics.
The Order should be read as policy that privileges state and cloud-provider access over broader democratic accountability and social considerations (labor, education, culture, the commons).
Synthesis of textual absence of social-domain terms in the EO, the EO's access/control provisions, and the paper's political-economic critique.
Structurally, the Order is not deregulation but re-regulation centered on state access and cloud rent—a policy instantiation of technofeudalism with a security face.
Political-economic analysis connecting EO provisions (access, testing, state capabilities) with literature on cloud capital and technofeudalism (e.g., Varoufakis) and the paper's archival operators.
The Order mandates testing for 'advanced cyber capabilities' but omits or fails to adopt benchmark frameworks (e.g., Reasoning Under Load (RUL), PER, DSL, IPF, Diversity Contraction, Constitutive Provenance) that the Crimson Hexagonal Archive has deposited.
Comparative policy analysis between the EO's testing mandate language and the list of evaluation frameworks deposited by the Crimson Hexagonal Archive; textual absence of those benchmarks in the EO.
The Order's call for a 'voluntary' corporate framework operates as a 'Mediation Ratchet' that strengthens corporate governance control rather than providing substantive public protections.
Critical/theoretical reading of the Order's voluntary mechanisms combined with the paper's Mediation Ratchet concept.
The Order formalizes an 'AI caste system' that stratifies access into public tiers (e.g., Opus 4.8) and frontier/privileged tiers (e.g., Mythos Preview / Glasswing).
Policy text read against observed product/access tiers in industry; theoretical framing of access stratification.
The paper presents the 'Anthropic arc' (Feb 27 supply-chain-risk designation → June 1 IPO filing → June 2 EO endorsement) as a worked example of 'Institutional-Prior Foreclosure' via state co-optation of a firm.
Chronological mapping of public events (designation, IPO filing, EO) and interpretive analysis linking them as an example of state-firm coordination/co-optation.
General-purpose behavioral theories used for intervention design do not extend uniformly to this specific healthcare context, motivating an agentic AI approach to theory audits at field-experiment scale.
Field experiment results showed deviations from expectations based on general behavioral theories; authors interpret this as evidence that those theories do not uniformly generalize to the healthcare prescription messaging context.
The value in generating better interventions comes from domain-specific experimental data, not from general reasoning ability of frontier LLMs: frontier LLMs operating without experimental data failed to predict which interventions would succeed.
Reported comparison in the paper between AI methods that used experimental data and frontier LLMs run without access to the experimental data; paper states the latter failed to predict success.
Engagement is systematically tied to the intensive, performative labor of children (the platform rewards commodification of the child's identity and labor over traditional advertising), which challenges policy frameworks focused solely on financial trusts.
Synthesis/interpretation based on observed correlations and within-channel view premiums for performative and emotional-bait content versus lack of premium for explicit product placement; policy implication drawn by authors.
Confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer.
Reported conditional rate in the experiment: under-reliance when AI agrees with humans' initial incorrect answer is 64.5%; authors attribute this to confirmation bias.
Governance ambiguity is responsible for 61% of hybrid workflow failures (and the framework aims to remediate this).
Paper reports 'governance ambiguity responsible for 61% of hybrid workflow failures' as a documented gap; no methodological details or sample size provided in the abstract.
Attribution failures occur in 68% of organizations (and the framework addresses these attribution failures).
Paper states 'attribution failures in 68% of organizations' as a documented gap the constructs address; abstract does not report study method or sample size behind the 68% figure.
Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story.
Author's characterization/literature-based claim in the paper (argumentative statement; no empirical data provided in this excerpt).
Public discourse often portrays AI as a threat to employment.
Statement in the paper summarizing public/media discourse; no specific survey or corpus size reported in the excerpt.
The observed wage penalty in high-exposure neighborhoods is driven by task de-skilling and intensified labor-market crowding.
Mechanism analyses linking task-level changes (de-skilling as measured by task assessments) and measures of labor-market crowding to the wage penalties observed in high-exposure neighborhoods, using the same 5 million job postings and task-aggregation approach.
Agents trained on such benchmarks learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems.
Reported empirical failure modes observed when integrating agent-generated kernels into production stacks (descriptive claim in the paper's motivation/observations). No explicit numeric sample size in abstract.
The lack of a relationship between prior productivity and AI adoption points to organizational readiness as a key barrier to AI diffusion.
Interpretation/inference based on the null finding that prior productivity does not predict adoption and the observed associations with digital infrastructure and management practices in the survey data.
The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes.
Exploratory decomposition analyses reported as follow-ups suggesting both low-level (token) and higher-level (semantic) contributions to asymmetry; authors note limited sample size for attribution.
Developer adoption has overwhelmingly favored orchestration (despite the viability of subterranean agents).
Author observation/claim about adoption trends (contrasting the feasibility shown in prior work with observed developer choices, likely based on ecosystem signals such as project popularity and usage).
Without design corrections that better align AI development with workers' needs, workplace AI incidents are likely to persist, causing the invisible erosion of worker agency and organizational productivity.
Interpretation / implication drawn from empirical findings (high prevalence of misalignments and developer-driven misalignments); presented as a policy/design recommendation and projected outcome.
As a result (of high environment-feedback bandwidth), the marginal benefit of curated Skills diminishes substantially and, in some cases (e.g., our timing side-channel setting), actively degrades performance.
Authors report observed cases in their re-analysis (including a timing side-channel setting) where adding Skills reduced performance; they interpret this as evidence that high-feedback environments can make Skills redundant or harmful.
Consolidation creates platform monopolies extracting value from professional labour while eliminating the expertise that creates it.
Synthesis of market concentration data and theoretical frameworks (platform capitalism) presented in the paper.
AI implementation serves vendor interests in labour cost reduction rather than improving information access.
Analytic argument supported by synthesis of vendor consolidation data, documented implementations, and theoretical analysis of vendor incentives.
Librarians bear operational accountability for systems they neither control nor can modify.
Critical qualitative synthesis including a revelatory case study of verification infrastructure failure and theoretical framing (platform capitalism, sociology of professions, critical information science).
Open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
Comparative evaluation reported in the paper showing open-source models' performance under OpenComputer verifiers is substantially lower than their OSWorld-Verified scores.
When deliberation tools are distributed across a hierarchy they can interact destructively (a 'deliberation cascade'), producing substantially worse returns and higher token costs than hierarchy alone.
Observed cross-configuration pattern labeled in the paper as a 'deliberation cascade', supported by empirical comparisons showing degraded mean returns and increased token usage for distributed deliberation across model families in the 3,475-episode evaluation.
The tech industry's discourse of exceptionalism obscures its dependence on BPOs to externalise labour costs and accountability.
Argument in paper supported by the authors' GDPR-based document findings that reveal BPO involvement and contract practices; specific linkage details not provided in the excerpt.
Institutionally, high-wage Nordic regimes paradoxically impose higher opportunity costs (e.g., foregone earnings).
Comparative cross-national analysis across European welfare regimes using SHARE (2016-2021), indicating higher opportunity costs (e.g., foregone earnings) in high-wage Nordic countries.
Rigid gender dynamics trigger labor market ejection.
Analysis linking gender-role patterns among caregivers in SHARE (2016-2021) to negative employment outcomes (labor market exit/ejection) for affected individuals.
AI created challenges by reducing routine-based employment.
Authors' interpretation of the empirical findings from SEM and descriptive statistics on the survey sample (n=320); the summary states that routine-based employment was reduced but provides no numerical estimate.
Existing AI-assisted annotation workflows typically offer annotators no signal about where spatial (localization) errors are most likely, so humans may under-inspect subtly misplaced boxes.
Statement in paper framing the problem; comparative claim about current AI-assisted labeling workflows and likely human inspection behavior in absence of spatial-uncertainty cues.
Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.
Synthesis of findings: probes show perceptual/multimodal representations encode mismatches (perception intact) while decoding/behavior fails to surface these signals (translation/action bottleneck); supported by PGLA improving behavior when mismatch signal is reintroduced at decoding.
Analysis of four additional platforms suggests the attack may generalise across the knowledge-graph ecosystem.
Authors report analysis across four additional platforms and observe indications that the attack generalises; specific platform names and quantitative outcomes not provided in the summary.
An attacker sophistication gradient reveals discrete break points, a minimum skill at which trust flips from 0% to 100%, reframing the attack as a question not of whether but of how much.
Experiments varying attacker sophistication levels reported by authors; observed threshold behavior (discrete break points) in model trust outcomes.
AI adoption may reinforce, rather than mitigate, the challenges arising from internal divisions within TMTs, with respect to environmental strategic decision-making.
Interpretation/implication drawn from the empirical findings (negative moderation and moderated mediation involving AI) based on the panel analyses (35,347 firm-year observations).
In current LLM markets, aesthetic improvements function as baseline expectations rather than as sources of price differentiation.
Interpretation based on experimental results (no WTP premium for aesthetic qualities) and latent-factor evidence that aesthetic and functional qualities are not separable in perceived quality.
The findings challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.
Author interpretation/conclusion drawn from the experimental results showing Olava Extract's performance and cost advantages relative to frontier models.
The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget.
Empirical/experimental results reported in the paper based on the three verification methods (finite-state enumeration, SMT checks, PRISM-games MDP); claims about 'large violations' and 'large mean gaming gap' are based on tested instances and random catalog experiments described in the paper.
Existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored.
Authors' literature/benchmark review and motivation for creating Workspace-Bench (qualitative claim in paper).
The predictive signal is not attributable to any single inventor, but emerges as a collective shift in how technologies are described across thousands of patents.
Comparative analyses/robustness checks across inventors and patent sets showing the signal persists when not driven by any single inventor or small group; abstract explicitly states this conclusion and refers to 'thousands of patents' as the aggregation level.
Prior to this work, no LLM-based agent had demonstrated end-to-end autonomous discovery in a real physical system producing a nontrivial result supported by experimental evidence.
Authors' statement about prior work (negative claim about the state of the field) asserted in the paper; based on literature review by the authors rather than an exhaustive independent verification.
Gradient-based attribution can be inflated by adversarial inputs, and detecting such inflation requires external baseline data.
Adversarial-testing experiments reported in paper that demonstrate inflation of attribution by adversarial inputs and indicate detection depends on availability of external baseline data.
Unless targeted interventions occur — including inclusive education, vocational training, and labor reforms — AI may exacerbate poverty and joblessness.
Inference and policy recommendation based on the systematic review's identification of risks; presented as a conditional/forecast rather than a measured causal estimate in the summary.
Analysis of implementation ambiguities reveals these challenges in practice.
Paper reports analysis of implementation ambiguities (qualitative/examples); no quantitative sample size or systematic empirical evaluation described in the summary.
Existing assessments that rely predominantly on patent statistics and structural network centralities dilute substantive technological strengths and thus can obscure hidden core innovators in knowledge-intensive domains such as AI.
Argument supported by comparative analysis in this study showing differences between capability-driven identification and traditional patent/centrality-based approaches using the 282,778 Chinese AI patents.
Because this leakage arises from delegation itself, it cannot be mitigated at the prompt level.
Paper's argument combining theoretical reasoning about delegation-induced channels and experimental evidence showing prompt-level confidentiality instructions do not prevent inference (as implied by the numeric-budget comparison). Specific experimental details not provided in excerpt.
DeepSeek-V2 appears as a conservative scorer, applying more stringent and highly consistent evaluations while systematically recommending lower funding.
Observed pattern in experimental results: DeepSeek-V2 produced stricter (lower) evaluation scores, high consistency across runs (implied by ICCs), and lower funding recommendations across the 20 decks.
The proliferation into competing Shapley formulations has created a fragmented landscape with little consensus on practical deployment.
Motivating literature review and discussion in the paper noting multiple competing Shapley variants and lack of consensus on practical deployment decisions.