Evidence (6869 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Governance
Many fear AI may displace them from their jobs.
Paper reports survey-style finding about public fear of job displacement (no specific surveys, question wording, dates, or sample sizes given in the excerpt).
AI may affect nonroutine jobs in particular.
Statement in paper; asserted as a general finding about which types of jobs AI impacts (no specific dataset, sample size, or empirical method reported in the excerpt).
The welfare equivalence property is unique to the Brier score: for every non-Brier strictly proper scoring rule, the welfare gap under smooth C^1 oversight is bounded below by Ω(Var(1/G'') (γ/β)^2).
Mathematical lower-bound result proved in the paper comparing welfare under smooth C^1 oversight for non-Brier scoring rules; the bound is expressed as Ω(Var(1/G'') (γ/β)^2) in the paper.
The impossibility (that non-affine approval undermines truthful reporting) holds for all strictly proper scoring rules, and the paper provides a closed-form perturbation formula.
General theoretical result proved across the class of strictly proper scoring rules, accompanied by a closed-form formula for the perturbation in the paper.
Any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable — the principal cannot avoid the perturbation that undermines calibration.
Analytical impossibility theorem in the paper's formal model showing that non-affine approvals create incentives for non-truthful reports when deviations are undetectable (mathematical proof).
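The role of affine versus non-affine transformations can be illustrated with a standard property of proper scoring rules. This is a toy sketch, not the paper's formal model, which the excerpt does not reproduce. It assumes a binary event, a Brier-style reward `1 - (y - q)^2`, and a hypothetical approval term applied to the realized score with weight `gamma`. An affine transform of the score leaves the truthful report optimal, while a non-affine transform moves the optimum away from the true probability.

```python
def brier_reward(report: float, outcome: int) -> float:
    """Brier-style reward in [0, 1]; higher is better."""
    return 1.0 - (outcome - report) ** 2


def expected_utility(report, true_prob, transform):
    """Expected transformed reward when the event occurs with probability true_prob."""
    return (true_prob * transform(brier_reward(report, 1))
            + (1.0 - true_prob) * transform(brier_reward(report, 0)))


def best_report(true_prob, transform, steps=10_000):
    """Grid search for the report that maximizes expected utility."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_utility(q, true_prob, transform))


gamma = 0.5  # hypothetical weight on the approval term (assumption)


def affine(score):
    """Affine transform of the score: preserves the truthful optimum."""
    return 2.0 * score + 1.0


def non_affine(score):
    """Convex, non-affine transform: an illustrative stand-in for approval."""
    return score + gamma * score ** 2


true_prob = 0.7
truthful = best_report(true_prob, affine)    # stays at the true probability
shaded = best_report(true_prob, non_affine)  # moves above the true probability
```

In this toy setting the agent overstates its confidence under the convex transform. The direction and size of the shift depend on the assumed transform and are not taken from the paper.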
Opaque agent objectives, synthetic traffic loops, and the indistinguishability between human-originated and agent-mediated signals are critical measurement problems examined in the paper.
Conceptual examination and literature synthesis; the paper discusses these as open problems rather than providing primary empirical solutions.
The paper identifies three properties of LLM agents that distinguish the present challenge from prior bot-detection problems: identity discontinuity by design, task-based instantiation, and agent-to-agent loops.
Analytic claim based on synthesis of agent architecture literature; presented as conceptual identification rather than empirically tested properties.
A click may reflect an optimization routine, a proxy objective, or a recursive agent-to-agent exchange rather than meaningful human intent, and traditional inference frameworks cannot reliably distinguish among these possibilities.
Theoretical claim derived from literature on agent behaviors, agent-to-agent interactions, and limitations of existing inference frameworks; no empirical discrimination test reported in this paper excerpt.
The presence of autonomous AI agents weakens the interpretive value of core web analytics metrics, including sessions, engagement, conversion, and retention.
Argument based on conceptual synthesis of how non-human, non-persistent actors generate signals that undermine standard metric interpretations (position paper; no original empirical test included).
Unlike crawlers and traditional bots, these agents do not possess persistent identities or psychologically grounded motivations; they are task-specific, dynamically instantiated processes whose behaviors are contingent and often orchestrated by external systems.
Conceptual analysis informed by literature on agent architecture and LLM-based agents; no primary empirical measurement presented in this paper excerpt.
Conventional web analytics treats the human user as its fundamental unit of analysis, assuming stable preferences, identifiable intentions, and behavioral patterns that unfold over time.
Conceptual statement supported by literature synthesis and critique of standard web-analytics assumptions (position paper; no primary empirical sample reported).
There are three practical failure modes produced or amplified by AI-assisted causal analysis: (1) method-data mismatch, where AI bypasses expertise at execution; (2) confidence laundering, where AI amplifies the credibility of formatted output; and (3) invisible forking, which spans both.
Taxonomy created and justified in the paper via conceptual argument and illustrative discussion; no empirical classification study or prevalence estimates provided.
AI industrializes the packaging of existing inferential failure modes: the barrier between naming a method and executing it has collapsed, allowing weak foundations, dressed as rigorous analysis, to reach audiences at a scale, speed, and polish that previously required expertise.
Conceptual claim supported by narrative reasoning and illustrative examples; no empirical data on scale, speed, or reach are given.
AI changes the incidence, observability, and persuasive force of inferential failures enough to create a practically distinct governance problem (even if it does not invent previously nonexistent inferential failures).
Argumentative/theoretical reasoning in the paper; no empirical measurement of incidence, observability, or persuasiveness provided.
When AI assists with methods whose validity depends on assumptions that cannot be verified from the output alone ("vibe inference"), the failure surface is structurally different: the output does not reliably signal invalidity, and when it does, recognizing the signal requires the expertise the workflow bypasses.
Logical/qualitative argument and definition development in the paper (no empirical validation or measured instances provided).
AI-assisted methodology ("vibe methodology") democratizes the failure modes specific to each domain.
Conceptual/theoretical argument presented in the paper; no empirical sample, quantitative data, or experiments reported.
The relative importance of AI-related equities as shock transmitters diminishes over time.
Time-varying estimates from the TVP-VAR showing a decline in the net transmitter contribution of the AI equities group across the sample.
The overall level of connectedness declines modestly following the public release of ChatGPT by OpenAI in November 2022.
Time-series comparison of aggregate connectedness measures derived from the TVP-VAR, with a reported modest post-November 2022 decline (event reference: ChatGPT release).
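The excerpt does not say how the paper computes connectedness. Assuming Diebold-Yilmaz-style measures, which are commonly paired with TVP-VAR models, the aggregate level and the net transmitter role can be sketched from a forecast-error variance decomposition (FEVD) matrix as follows. The three-asset matrix and its labels are hypothetical. In a time-varying setting the same calculation would be repeated for each period's matrix.

```python
def connectedness(fevd):
    """Connectedness measures from a row-normalized FEVD matrix.

    fevd[i][j] is the share of variable i's forecast-error variance
    attributed to shocks in variable j; each row sums to 1.
    Returns (total connectedness in percent, net spillover per variable).
    """
    n = len(fevd)
    # Spillovers each variable sends to the others (column sums, off-diagonal).
    to_others = [sum(fevd[i][j] for i in range(n) if i != j) for j in range(n)]
    # Spillovers each variable receives from the others (row sums, off-diagonal).
    from_others = [sum(fevd[i][j] for j in range(n) if j != i) for i in range(n)]
    net = [sent - received for sent, received in zip(to_others, from_others)]
    total = 100.0 * sum(from_others) / n
    return total, net


# Hypothetical FEVD for three asset groups (illustrative numbers only).
labels = ["ai_equities", "other_tech", "energy"]
fevd = [
    [0.70, 0.20, 0.10],
    [0.30, 0.60, 0.10],
    [0.20, 0.10, 0.70],
]
total, net = connectedness(fevd)
net_by_label = dict(zip(labels, net))
```

A positive net value marks a net transmitter of shocks. A decline in that value over time would correspond to the diminishing transmitter role described above.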
A policy irreversibility result: there exists a critical time before the singularity after which redistribution becomes politically impossible because wealth concentration makes feasible tax rates vanishingly small.
Proof/argument in the paper showing that as time approaches the singularity the set of tax rates that satisfy political-feasibility constraints (workers' budget / feasibility) shrinks to zero, implying a latest feasible intervention time.
Financialization amplifies the exponent of the super-exponential divergence by a factor γ_F/η.
Mathematical derivation in the paper showing that the exponent in the asymptotic growth rate near the singularity is multiplied by γ_F/η when including the financialization term γ_F K_f^2 and its coupling parameter η.
Near the singularity, the wealth ratio between capital owners and workers diverges super-exponentially.
Asymptotic analysis near the finite-time singularity showing that the ratio of capital-owner wealth to worker wealth grows faster than exponential (super-exponentially) as time approaches the blow-up time.
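To make "finite-time singularity" concrete, the sketch below uses the simplest textbook case, dK/dt = a·K², whose solution K(t) = K0 / (1 - a·K0·t) blows up at t* = 1 / (a·K0). This is an assumption chosen for illustration and is not the paper's system, which the excerpt describes only through the term γ_F K_f². The parameter names `rate` and `k0` are hypothetical.

```python
import math


def blow_up_time(rate, k0):
    """Time at which the solution of dK/dt = rate * K**2 diverges."""
    return 1.0 / (rate * k0)


def hyperbolic_path(rate, k0, t):
    """Closed-form solution of dK/dt = rate * K**2 with K(0) = k0."""
    if t >= blow_up_time(rate, k0):
        raise ValueError("t is at or beyond the blow-up time")
    return k0 / (1.0 - rate * k0 * t)


def exponential_path(rate, k0, t):
    """Exponential growth with the same initial growth rate, for comparison."""
    return k0 * math.exp(rate * k0 * t)


rate, k0 = 0.5, 1.0
t_star = blow_up_time(rate, k0)
# Ratio of quadratic-feedback growth to exponential growth as t approaches t_star.
ratios = [hyperbolic_path(rate, k0, t) / exponential_path(rate, k0, t)
          for t in (1.0, 1.5, 1.9, 1.99)]
```

The ratio grows without bound as t approaches the blow-up time, which is the sense in which growth near a finite-time singularity outpaces any exponential.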
AGI (Artificial General Intelligence) is problematic both conceptually and definitionally.
Authorial assertion in the paper stating AGI is problematic as a concept and definition; framed as a conditioning assumption that shapes the subsequent analysis.
The paper argues that the current commercial trajectory of AI development should not be assumed to be inevitable.
Authorial methodological claim in the paper's framing/introductory text; presented as a normative methodological stance rather than empirical evidence.
There is an absence of agreed-upon benchmarks for evaluating AI systems.
Introductory chapter notes lack of standardized evaluation benchmarks as a cross-cutting concern; presented as an analytical observation by the task force.
AI systems exhibit bias.
Introductory chapter points to bias in AI systems as a recurring theme; supported by the broader literature cited in the report (no numerical sample reported in the introduction).
AI model outputs are often opaque and non-replicable.
Introductory chapter identifies opacity and non-replicability of AI outputs as a cross-cutting theme; claim is based on literature synthesis and conceptual critique in the report.
A small number of AI corporations have unprecedented power.
Introductory chapter highlights the theme of concentrated corporate power in AI; asserted as an observational claim in the report's framing rather than derived from a presented empirical sample in the introduction.
The Price of Fairness can be large even when group distributions are nearly identical.
Theoretical result/constructive example in the paper showing instances where PoF is large despite near-identical group distributions.
Enforcing static fairness constraints may exacerbate long-run disparities.
Statement referencing recent prior theoretical results and motivating literature; framed as background/motivation in the paper.
Any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score.
Formal theoretical result/proof presented in the paper based on the transformation-graph semantic-class model.
Once announced, such a metric becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm.
Modeling argument in the paper (transformation graph / semantic classes) and supported by formal analysis and experimental checks described in the paper.
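The manipulation argument can be sketched with a toy catalogue. All variant names, class labels, and scores below are hypothetical, and the class-level alternative shown at the end is one possible invariant design rather than the paper's proposal. Two variants in the same harmful class receive different scores, so a platform can lower its reported score by swapping one for the other while true harm stays the same.

```python
# Hypothetical catalogue: v1 and v2 are semantically equivalent (same class).
variants = {
    "v1": {"cls": "harmful_A", "score": 0.9},
    "v2": {"cls": "harmful_A", "score": 0.2},
    "v3": {"cls": "benign", "score": 0.0},
}
HARMFUL_CLASSES = {"harmful_A"}


def variant_metric(feed):
    """Announced metric: mean of per-variant scores."""
    return sum(variants[v]["score"] for v in feed) / len(feed)


def true_harm(feed):
    """Fraction of recommendations that belong to a harmful class."""
    return sum(variants[v]["cls"] in HARMFUL_CLASSES for v in feed) / len(feed)


def reroute(feed):
    """Strategic platform: swap each item for the lowest-scoring equivalent variant."""
    def cheapest(v):
        same_class = [w for w in variants if variants[w]["cls"] == variants[v]["cls"]]
        return min(same_class, key=lambda w: variants[w]["score"])
    return [cheapest(v) for v in feed]


def class_metric(feed):
    """Class-invariant alternative: score each item by the maximum over its class."""
    def class_score(v):
        return max(w["score"] for w in variants.values() if w["cls"] == variants[v]["cls"])
    return sum(class_score(v) for v in feed) / len(feed)


honest_feed = ["v1", "v1", "v3"]
gamed_feed = reroute(honest_feed)
```

Here `variant_metric` drops after rerouting while `true_harm` is unchanged, and `class_metric` gives the same value for both feeds because it assigns one score per semantic class.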
We contribute a non-additive harm decomposition (welfare loss W, coverage loss C) that exposes how attrition shifts harm from the regulator-accountable surface to a regulator-invisible one.
Methodological contribution in the paper: definition of welfare loss W and coverage loss C and analysis showing attrition reallocates observable vs. unobservable harm; supported by theoretical exposition and simulation examples.
An audit-aware OffAuditDrift strategy that exploits Stackelberg commitment defeats both auditor extensions: Periodic-with-floor and history-conditioned suspicion-escalation.
Construction of the OffAuditDrift auditee strategy in the paper and simulation/theoretical demonstration that it can evade both proposed auditor policies by exploiting auditor commitment.
We identify a structural feature of any noise-aware static-auditor design: a cover regime in which coverage gaps and granularity gaps cannot be closed simultaneously (formalized as Observation 1).
Theoretical observation/proposition in the paper (Observation 1) derived from the formal model of continuous auditing under noise-aware static auditing rules.
Regulated systems can delay outcome reporting, drift their reports within plausible noise envelopes, exploit longitudinal sample attrition, and cherry-pick among ambiguous metric definitions.
Specification and enumeration of auditee strategies in the paper (Delay, Drift, Cherry-pick, Attrition, OffAuditDrift); conceptual examples and inclusion in simulator.
Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work.
Conceptual and theoretical argument in the paper, motivated by regulatory context; formalization of continuous auditing as a multi-round interaction (T-round Stackelberg game).
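The Drift strategy listed above can be illustrated with a minimal multi-round sketch. The auditor rule, the noise envelope, and all numbers are assumptions made for illustration and are not the paper's simulator. The auditor here flags a round only when a report moves by more than the noise envelope relative to the previous round, so small per-round steps accumulate without triggering a flag.

```python
def drifting_reports(baseline, step, rounds):
    """Auditee reports that worsen by a fixed small step each round."""
    return [baseline - step * t for t in range(rounds + 1)]


def round_to_round_flags(reports, envelope):
    """Static auditor (assumed rule): flag rounds whose change exceeds the envelope."""
    return [t for t in range(1, len(reports))
            if abs(reports[t] - reports[t - 1]) > envelope]


envelope = 0.05          # hypothetical plausible-noise band per round
step = 0.8 * envelope    # each step stays inside the band
reports = drifting_reports(baseline=0.90, step=step, rounds=10)

flags = round_to_round_flags(reports, envelope)
cumulative_drift = reports[0] - reports[-1]
```

No round is flagged, yet the cumulative change is several times the envelope. This captures only the simplest form of drift; the OffAuditDrift strategy described above is reported to defeat stronger, history-conditioned auditors as well.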
The reform reduces industrial wastewater discharge, which improves agricultural production conditions (mechanism linking the reform to higher grain yield).
Mechanism analysis in the paper reporting reductions in industrial wastewater discharge following the reform (mediation channel analysis).
A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
Empirical comparison in simulator experiments indicating that optimizing for exact action accuracy (matching individual actions) can harm higher-level trace distribution alignment; observed in the studies contrasting deterministic copying/value-based approaches with Trace-Prior RL.
Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior.
Empirical observation in simulator experiments comparing deterministic value-based RL and deterministic copying agents to other approaches; observed collapsed/shortcut pricing behaviors when uncertainty is unresolved.
This failure is a Goodhart-style failure under partial observability: Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices.
Theoretical diagnosis supported by simulator setup and observed ambiguity in agent-visible states mapping to multiple competitor prices; derived from the two-hotel simulator design where key competitor variables are hidden from Hotel A.
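The tension between exact action accuracy and distributional alignment can be shown with a small worked example. The price levels and probabilities are hypothetical. For a single Hotel A-visible state, suppose Hotel B's price is 100 with probability 0.6 and 120 with probability 0.4. A deterministic policy that always predicts the most likely price maximizes exact-match accuracy, while a policy that samples from the same distribution matches the trace distribution but scores lower on exact accuracy.

```python
# Hypothetical distribution of Hotel B prices for one Hotel A-visible state.
target = {100: 0.6, 120: 0.4}


def expected_accuracy(policy, target):
    """Probability that an independent draw from the policy equals the realized price."""
    return sum(policy.get(price, 0.0) * prob for price, prob in target.items())


def total_variation(policy, target):
    """Total variation distance between the policy's and the target's price distributions."""
    support = set(policy) | set(target)
    return 0.5 * sum(abs(policy.get(p, 0.0) - target.get(p, 0.0)) for p in support)


mode_policy = {100: 1.0}        # deterministic shortcut: always the most likely price
matching_policy = dict(target)  # stochastic policy that reproduces the distribution

acc_mode = expected_accuracy(mode_policy, target)
acc_match = expected_accuracy(matching_policy, target)
tv_mode = total_variation(mode_policy, target)
tv_match = total_variation(matching_policy, target)
```

The deterministic policy has the higher exact accuracy (0.6 against 0.52) but a total variation distance of 0.4 from the target, while the matching policy has distance 0.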
GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1.
Model-level observation from the ASR analysis within the experiment (paper reports GPT-4.1 had perfect TSR and HF1 but failed trajectory-level fidelity).
Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly.
Empirical evaluation reported in the paper: HMASP tested across 18 LLMs and 90,000 task instances; analysis via ASR showing checkpoint-skipping behavior for 10 models and correct enforcement for 8 models.
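A trajectory-level check of the kind that would expose a skipped checkpoint can be sketched as follows. The excerpt does not define how ASR is computed, so this is only an illustration of the underlying idea, and the step names `confirm_payment` and `execute_payment` are hypothetical. An outcome-only metric sees both traces below as successful, while the trace check separates them.

```python
def skips_checkpoint(trace, checkpoint="confirm_payment", action="execute_payment"):
    """Return True if the action occurs without an earlier checkpoint step."""
    confirmed = False
    for step in trace:
        if step == checkpoint:
            confirmed = True
        elif step == action and not confirmed:
            return True
    return False


def outcome_success(trace, action="execute_payment"):
    """Outcome-only view: the task counts as done if the payment step occurred."""
    return action in trace


compliant = ["add_to_cart", "checkout", "confirm_payment", "execute_payment"]
shortcut = ["add_to_cart", "checkout", "execute_payment"]

results = {
    "compliant": (outcome_success(compliant), skips_checkpoint(compliant)),
    "shortcut": (outcome_success(shortcut), skips_checkpoint(shortcut)),
}
```

Both traces succeed on the outcome check, but only the second is reported as skipping the checkpoint.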
With strong exposure of low-wealth, high-MPC households and concentrated ownership, privately chosen automation can be excessive even though it raises high-skilled labor income.
Theoretical welfare/comparison analyses in the model with heterogeneous households (differing in wealth and marginal propensities to consume) and ownership concentration; shows private incentives lead to automation choices that are suboptimal from a social perspective under these parameter constellations.
Automation reduces paid human labor.
Model comparative statics in the same equilibrium framework showing substitution away from paid human labor as firms choose automation; result reported in the paper's static benchmark and general-equilibrium analysis.
DePAI entails risks including security, centralization, incentive failure, legal exposure, and the crowding-out of intrinsic motivation, requiring value-sensitive design and continuously adaptive governance.
Risk analysis and conceptual argument in the paper identifying possible failure modes and recommended design/governance responses; no empirical incidence data provided.
These dynamics may produce an asymmetric barbell-shaped structure of value capture in advanced economies: high-volume synthetic production controlled by owners of AI infrastructure at one pole, and scarce, high-status human labor valued for verified human presence at the other.
Conceptual projection and economic argument in the paper (no empirical decomposition, distributional statistics, or sample reported in the excerpt).
AI compresses the value of standardized middle-tier labor by making good-enough synthetic substitutes scalable at low marginal cost, hollowing out the middle of the skill distribution currently characterized by knowledge work.
Conceptual/theoretical argument presented in the paper (no reported empirical sample, statistical analysis, or quantified experiment in the excerpt).
The cultural and technical misalignment of the data center and electric power sectors makes coordination difficult.
Analytic claim in the paper describing differing design principles, operational philosophies, and economic incentives as sources of misalignment; presented as conceptual analysis without empirical measurement in the excerpt.
A single hyperscale training campus can draw power comparable to a mid-sized city, driven by one tightly synchronized job whose demand swings by hundreds of megawatts in seconds.
Concrete illustrative assertion in the paper about facility-level power draw and rapid demand swings; no numeric source, dataset, or case-study details provided in the excerpt.
AI training data centers break the load-diversity assumption.
Argumentative claim in the paper asserting that characteristics of AI training workloads violate the load-diversity assumption; no quantitative study included in the excerpt.
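The load-diversity argument can be made concrete with a short calculation. The numbers are hypothetical and the model is deliberately simple: it treats a campus as n identical units, each with the same mean draw and the same standard deviation. When units fluctuate independently, the standard deviation of the total grows with the square root of n, so the relative swing shrinks as n grows. When all units move together, as in one tightly synchronized training job, the standard deviation of the total grows with n and the relative swing does not shrink.

```python
import math


def relative_swing(n_units, mean_load, std_load, synchronized):
    """Standard deviation of the total load divided by the mean total load."""
    if synchronized:
        total_std = std_load * n_units             # perfectly correlated units
    else:
        total_std = std_load * math.sqrt(n_units)  # independent units
    return total_std / (mean_load * n_units)


n_units = 10_000   # hypothetical number of identical loads
mean_load = 1.0    # arbitrary units per load
std_load = 0.3     # hypothetical per-unit fluctuation

diverse = relative_swing(n_units, mean_load, std_load, synchronized=False)
synced = relative_swing(n_units, mean_load, std_load, synchronized=True)
```

Under these assumptions the independent case fluctuates by about 0.3% of its mean, while the synchronized case fluctuates by 30%, a factor of the square root of n larger.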