Evidence (11633 claims)
Adoption
7395 claims
Productivity
6507 claims
Governance
5877 claims
Human-AI Collaboration
5157 claims
Innovation
3492 claims
Org Design
3470 claims
Labor Markets
3224 claims
Skills & Training
2608 claims
Inequality
1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
An audit-aware OffAuditDrift strategy that exploits Stackelberg commitment defeats both (Periodic-with-floor and history-conditioned suspicion-escalation) auditor extensions.
Construction of the OffAuditDrift auditee strategy in the paper and simulation/theoretical demonstration that it can evade both proposed auditor policies by exploiting auditor commitment.
We identify a structural feature of any noise-aware static-auditor design: a cover regime in which coverage gaps and granularity gaps cannot be closed simultaneously (formalized as Observation 1).
Theoretical observation/proposition in the paper (Observation 1) derived from the formal model of continuous auditing under noise-aware static auditing rules.
Regulated systems can delay outcome reporting, drift their reports within plausible noise envelopes, exploit longitudinal sample attrition, and cherry-pick among ambiguous metric definitions.
Specification and enumeration of auditee strategies in the paper (Delay, Drift, Cherry-pick, Attrition, OffAuditDrift); conceptual examples and inclusion in simulator.
Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work.
Conceptual and theoretical argument in the paper, motivated by regulatory context; formalization of continuous auditing as a multi-round interaction (T-round Stackelberg game).
The reform reduces industrial wastewater discharge, which improves agricultural production conditions (mechanism linking the reform to higher grain yield).
Mechanism analysis in the paper reporting reductions in industrial wastewater discharge following the reform (mediation channel analysis).
A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
Empirical comparison in simulator experiments indicating that optimizing for exact action accuracy (matching individual actions) can harm higher-level trace distribution alignment; observed in the studies contrasting deterministic copying/value-based approaches with Trace-Prior RL.
Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior.
Empirical observation in simulator experiments comparing deterministic value-based RL and deterministic copying agents to other approaches; observed collapsed/shortcut pricing behaviors when uncertainty is unresolved.
This failure is a Goodhart-style failure under partial observability: Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices.
Theoretical diagnosis supported by simulator setup and observed ambiguity in agent-visible states mapping to multiple competitor prices; derived from the two-hotel simulator design where key competitor variables are hidden from Hotel A.
GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1.
Model-level observation from the ASR analysis within the experiment (paper reports GPT-4.1 had perfect TSR and HF1 but failed trajectory-level fidelity).
Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly.
Empirical evaluation reported in the paper: HMASP tested across 18 LLMs and 90,000 task instances; analysis via ASR showing checkpoint-skipping behavior for 10 models and correct enforcement for 8 models.
From an information-theoretic perspective, this transition corresponds to an emergent information bottleneck in the human-AI loop, where entropy reduction reflects loss of diversity and support under closed-loop feedback rather than beneficial compression.
Theoretical / information-theoretic analysis in the paper linking observed dynamics to entropy reduction and information bottleneck concepts.
Through a simple simulation, we demonstrate that increasing reliance on AI can induce a transition toward a low-diversity, suboptimal equilibrium.
Computational simulation reported in the paper (described as a 'simple simulation'); no sample size or experimental dataset reported in the provided text.
Tabular data does not have a foundation model that understands it natively; every approach to tabular AI today (from gradient-boosted trees to the latest tabular foundation models) requires a preprocessing pipeline before any model can consume the data.
Paper's survey/positioning statement asserting the current state of tabular AI approaches and their reliance on preprocessing pipelines (no specific empirical dataset given).
With strong exposure of low-wealth, high-MPC households and concentrated ownership, privately chosen automation can be excessive even though it raises high-skilled labor income.
Theoretical welfare/comparison analyses in the model with heterogeneous households (differing in wealth and marginal propensities to consume) and ownership concentration; shows private incentives lead to automation choices that are suboptimal from a social perspective under these parameter constellations.
Automation reduces paid human labor.
Model comparative statics in the same equilibrium framework showing substitution away from paid human labor as firms choose automation; result reported in the paper's static benchmark and general-equilibrium analysis.
DePAI entails risks including security, centralization, incentive failure, legal exposure, and the crowding-out of intrinsic motivation, requiring value-sensitive design and continuously adaptive governance.
Risk analysis and conceptual argument in the paper identifying possible failure modes and recommended design/governance responses; no empirical incidence data provided.
Experimental results show that current agents remain far from reliable workspace learning.
Authors' interpretation based on the reported agent performance (< best agent 68.7% vs human 80.7%, average 47.4%).
The average performance across evaluated agents is only 47.4%.
Reported mean performance across agents in the experiments (authors' aggregated result).
The best-performing agent reaches only 68.7% on the benchmark.
Experimental results reported by the authors (evaluation across tasks/rubrics).
These industry visions have implications for human experts, whose professional lives may be transformed and revalued by the expert-annotation industry.
Synthesis and interpretation of themes from public statements by five data-annotation firms and CEOs; authors draw implications for professionals based on observed framings and industry positioning.
Human expertise is viewed by the industry as an extractable resource whose value can be judged relative to AI expertise.
The paper's thematic analysis of public-facing statements from five annotation firms/CEOs showing language that frames human expertise as a resource to be extracted and monetized for AI training.
The industry envisions AI expertise as cheap, meaning that it can offer a better return on investment than human expertise.
Interpretive coding of statements from five data-annotation firms and their CEOs on social media and podcasts indicating that AI-based expertise is framed as lower-cost and higher-ROI relative to human experts.
These dynamics may produce an asymmetric barbell-shaped structure of value capture in advanced economies: high-volume synthetic production controlled by owners of AI infrastructure at one pole, and scarce, high-status human labor valued for verified human presence at the other.
Conceptual projection and economic argument in the paper (no empirical decomposition, distributional statistics, or sample reported in the excerpt).
AI compresses the value of standardized middle-tier labor by making good-enough synthetic substitutes scalable at low marginal cost, hollowing out the middle of the skill distribution currently categorized by knowledge work.
Conceptual/theoretical argument presented in the paper (no reported empirical sample, statistical analysis, or quantified experiment in the excerpt).
AI development may reduce firms' labor income share.
Further analysis reported in the paper linking firm-level AI development to reductions in the labor income share within firms.
AI increases the firm-level skill premium by substituting for low-skilled labor.
Mechanism analysis reported in the paper (firm-level regressions investigating labor composition / substitution effects following AI development).
The cultural and technical misalignment of the data center and electric power sectors makes coordination difficult.
Analytic claim in the paper describing differing design principles, operational philosophies, and economic incentives as sources of misalignment; presented as conceptual analysis without empirical measurement in the excerpt.
A single hyperscale training campus can draw power comparable to a mid-sized city, driven by one tightly synchronized job whose demand swings by hundreds of megawatts in seconds.
Concrete illustrative assertion in the paper about facility-level power draw and rapid demand swings; no numeric source, dataset, or case-study details provided in the excerpt.
AI training data centers break that assumption (load diversity).
Argumentative claim in the paper asserting that characteristics of AI training workloads violate the load-diversity assumption; no quantitative study included in the excerpt.
WIOA is not well-equipped to support large-scale, cross-industry labor transitions.
Low observed incidence of cross-industry occupational transitions and limited shifts into less automation-exposed occupations in the WIOA data (2017-2023) lead authors to conclude the program is poorly suited for large-scale cross-industry reallocation.
A substantial portion of WIOA participants simply return to their prior field after program participation.
Descriptive and outcome analyses on the WIOA participation records (2017-2023) showing many participants re-enter the same occupation/industry rather than transitioning to different occupations.
WIOA rarely shifts workers into less automation-exposed work.
Analysis of WIOA administrative records (2017-2023) using a newly introduced 'Retrainability Index' that decomposes outcomes into post-intervention wage recovery and shifts in routine task intensity (RTI). The paper reports low incidence of downward RTI (movement into less automation-exposed occupations) among participants.
Mechanism tests indicate innovation stagnation in mature firms with redundant AI is a pathway that limits productivity gains (i.e., AI can be associated with stagnant innovation in mature firms).
Mechanism analysis reported in the paper showing signs of reduced innovation-related gains or stagnation in mature, advanced firms using AI (interpreted as redundant AI leading to limited incremental innovation).
AI integration creates challenges such as workforce displacement that must be addressed.
Authors raise workforce displacement as a challenge/consideration in the paper's discussion; this appears as a qualitative claim rather than an empirically quantified result in the supplied text.
AI integration creates challenges such as algorithmic bias that must be addressed.
Authors identify algorithmic bias as a notable challenge in the discussion/conclusion; presented qualitatively rather than as an estimated empirical outcome in the supplied text.
Responsible AI research typically focuses on examining the use and impacts of deployed AI systems, and there is currently limited visibility into the pre-deployment decisions to pursue building such systems.
Argument and literature framing presented in the paper based on a scoping review of academic literature, civil society resources, and grey literature.
This concentration can diffuse responsibility and raise the probability of irreversible system-level loss even when local per-action error rates remain low.
Theoretical result/argument from the model linking concentrated decision-energy to increased systemic risk despite low local error rates.
Efficiency pressure, path dependence, scale feedback, and weak boundary constraints concentrate decision-energy in the most efficient node.
Derived from the paper's formal model and argumentation about system dynamics (efficiency and feedback mechanisms); theoretical rather than empirical evidence.
Declining deployment friction changes the safety problem at its root: safety is not only local output correctness or preference alignment, but the control of irreversibility under rising decision density.
Main theoretical argument of the paper; supported by conceptual framing and a formal model that introduces decision-density considerations.
Recent AI systems compress the distance between capability growth and capability deployment.
Conceptual and descriptive claim in the paper's introduction; supported by theoretical argumentation and illustrative examples rather than empirical measurement.
Creative and interpersonal roles (musicians, physicians, natural sciences managers) show the reverse (i.e., they score low on RL feasibility but high on general AI exposure).
Empirical comparison between the RL Feasibility Index and existing AI-exposure measures, with named creative/interpersonal occupations showing opposite rankings.
Existing indices measure the overlap between AI capabilities and occupational tasks rather than which tasks AI systems can learn to perform, and as a result misclassify occupations where the gap between present capability and learnability is large.
Conceptual critique and comparison of existing AI-exposure indices vs. the authors' proposed learnability-focused approach (paper text argument and empirical comparisons implied later).
A full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding.
Experimental intervention with full transparency of information between agents; authors report that even with full information exchange, dyads fail to reach optimal coordination, pointing to interactive grounding processes as the bottleneck.
The oracle baseline establishes that the coordination gap is not attributable to individual reasoning limitations.
Experimental baseline (oracle) in which individual reasoning is isolated and shown to be sufficient for identifying optimal allocations; details/sizes not given in the abstract.
Failures in referential binding occur, where agents lose track of commitments across turns.
Reported failure mode from multi-turn experiments: referential binding breakdowns leading to loss of commitments.
Agents rely on perfunctory fairness (equal resource splits) over reward-maximizing coordination.
Empirical observation from negotiation experiments where agents prefer equal splits rather than allocations that maximize joint reward, as reported in the paper.
Accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable.
Observed failure mode in multi-turn negotiation experiments: agents anchor on initial proposals and fail to revise, as reported by the authors.
Coordination degrades when shared interaction history is absent.
Experimental comparison of settings with and without shared interaction history (ablation showing worse coordination when history is removed).
While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models.
Experimental results comparing single-agent (isolated) performance and paired-agent (dyad) negotiation performance across multiple LLMs (open- and closed-source); specific sample sizes not reported in the abstract.
Current multi-agent LLM benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns.
Literature/benchmark survey claim by the authors (asserted in the paper; no numeric summary provided here).