Evidence (8570 claims)
Adoption
8570 claims
Productivity
7631 claims
Governance
6869 claims
Human-AI Collaboration
6491 claims
Org Design
4175 claims
Innovation
4114 claims
Labor Markets
3566 claims
Skills & Training
2966 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Adoption
Remove filter
The paper identifies three properties of LLM agents that distinguish the present challenge from prior bot-detection problems: identity discontinuity by design, task-based instantiation, and agent-to-agent loops.
Analytic claim based on synthesis of agent architecture literature; presented as conceptual identification rather than empirically tested properties.
A click may reflect an optimization routine, a proxy objective, or a recursive agent-to-agent exchange rather than meaningful human intent, and traditional inference frameworks cannot reliably distinguish among these possibilities.
Theoretical claim derived from literature on agent behaviors, agent-to-agent interactions, and limitations of existing inference frameworks; no empirical discrimination test reported in this paper excerpt.
The presence of autonomous AI agents weakens the interpretive value of core web analytics metrics, including sessions, engagement, conversion, and retention.
Argument based on conceptual synthesis of how non-human, non-persistent actors generate signals that undermine standard metric interpretations (position paper; no original empirical test included).
Unlike crawlers and traditional bots, these agents do not possess persistent identities or psychologically grounded motivations; they are task-specific, dynamically instantiated processes whose behaviors are contingent and often orchestrated by external systems.
Conceptual analysis informed by literature on agent architecture and LLM-based agents; no primary empirical measurement presented in this paper excerpt.
Conventional web analytics treats the human user as its fundamental unit of analysis, assuming stable preferences, identifiable intentions, and behavioral patterns that unfold over time.
Conceptual statement supported by literature synthesis and critique of standard web-analytics assumptions (position paper; no primary empirical sample reported).
Consequently, generated artifacts may exhibit brittle behavior and limited deployability.
Paper asserts that lack of production awareness leads to brittle artifacts and limited deployability; no quantitative measures or sample sizes provided in the abstract.
AI-assisted development tools often lack awareness of architectural constraints, infrastructure dependencies, and organizational standards required in production environments.
Asserted observation in the paper arguing limitations of general-purpose AI code generation when targeting production-ready systems; no empirical sample size or methodological details provided in the excerpt.
Nominally cheaper models can incur higher total cost due to token-intensive reasoning.
Cost and token usage analysis reported in the paper showing cheaper-per-token models may generate more tokens and thus higher total cost in practice.
Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets.
Stated as background/motivation in the paper (conceptual claim; no empirical sample size reported).
Patent text similarity analysis confirms a 'homogenization trap' (AI-associated increases in patent-text similarity).
Text-similarity analysis of patent documents reported in the paper showing increased patent similarity associated with AI use.
Industry concentration negatively moderates the AI–innovation relationship.
Moderation analysis/interacted fixed-effects models indicating that higher industry concentration weakens the AI→innovation effect.
Cascade performance is limited primarily by structural cost (they pay the cheap model before any escalation decision), rather than by a shortage of intermediate stages.
Synthesis of theoretical insights and empirical results reported in the paper (theoretical analysis of structural costs + empirical comparisons showing limited benefit from additional stages).
Optimized subsequence cascades do not deliver practically meaningful held-out gains over the pairwise envelope.
Empirical evaluation on the five benchmarks comparing optimized subsequence cascades to the pairwise envelope; reported lack of practically meaningful held-out improvement.
Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope.
Empirical comparison across the reported benchmarks and models showing that full fixed chains achieve worse cost-quality tradeoffs than the pairwise envelope (experimental results described in the paper).
Municipal 311 call centers and complaint intake systems face a structural mismatch between incoming volume and classification capacity that produces a bottleneck and differential service quality that follows income and racial lines.
Stated in the paper's introduction; cites prior work (Liu 2024 SLA) as support for the differential service-quality / demographic claim. No sample size or quantitative result reported in the excerpt.
Organizational resistance to technological change hinders AI adoption in logistics operations.
Qualitative synthesis of 31 reviewed publications identifying organizational and cultural barriers to AI uptake.
Data security concerns are a key barrier to adopting AI in global supply chains.
Synthesis of themes from 31 scholarly sources in the structured literature review highlighting data/security-related implementation issues.
High initial investment costs are a significant barrier to AI implementation in logistics.
Synthesis of literature (31 sources) reporting implementation challenges and barriers identified across studies.
AGI (Artificial General Intelligence) is problematic both conceptually and definitionally.
Authorial assertion in the paper stating AGI is problematic as a concept and definition; framed as a conditioning assumption that shapes the subsequent analysis.
The paper argues we should avoid assuming the inevitability of the current situation relating to AI (i.e., the current commercial AI development trajectory is not inevitable).
Authorial methodological claim in the paper's framing/introductory text; presented as a normative methodological stance rather than empirical evidence.
There is an absence of agreed-upon benchmarks for evaluating AI systems.
Introductory chapter notes lack of standardized evaluation benchmarks as a cross-cutting concern; presented as an analytical observation by the task force.
AI systems exhibit bias.
Introductory chapter points to bias in AI systems as a recurring theme; supported by the broader literature cited in the report (no numerical sample reported in the introduction).
AI model outputs are often opaque and non-replicable.
Introductory chapter identifies opacity and non-replicability of AI outputs as a cross-cutting theme; claim is based on literature synthesis and conceptual critique in the report.
A small number of AI corporations have unprecedented power.
Introductory chapter highlights the theme of concentrated corporate power in AI; asserted as an observational claim in the report's framing rather than derived from a presented empirical sample in the introduction.
Across short stories, marketing slogans, and alternative-uses tasks, three frontier LLMs fall below parity across crowding kernels.
Empirical experiments reported in the paper evaluating three frontier large language models on three task domains (short stories, marketing slogans, alternative-uses) and finding ρ < 1 (below parity) across crowding kernels. The abstract specifies three models but does not report the number of generated samples per model or other sample-size details.
This creates an evaluation blind spot, as AI can improve individual outputs while increasing population-level crowding.
Theoretical/ conceptual claim in the paper arguing that improvements at the individual-output level can still increase similarity (crowding) at the population level; no empirical numbers given in the abstract.
Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones.
Conceptual argument presented in the paper's introduction motivating a population-level perspective on creative outputs (no empirical sample size reported).
Any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score.
Formal theoretical result/proof presented in the paper based on the transformation-graph semantic-class model.
Once announced, such a metric becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm.
Modeling argument in the paper (transformation graph / semantic classes) and supported by formal analysis and experimental checks described in the paper.
A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
Empirical comparison in simulator experiments indicating that optimizing for exact action accuracy (matching individual actions) can harm higher-level trace distribution alignment; observed in the studies contrasting deterministic copying/value-based approaches with Trace-Prior RL.
Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior.
Empirical observation in simulator experiments comparing deterministic value-based RL and deterministic copying agents to other approaches; observed collapsed/shortcut pricing behaviors when uncertainty is unresolved.
This failure is a Goodhart-style failure under partial observability: Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices.
Theoretical diagnosis supported by simulator setup and observed ambiguity in agent-visible states mapping to multiple competitor prices; derived from the two-hotel simulator design where key competitor variables are hidden from Hotel A.
GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1.
Model-level observation from the ASR analysis within the experiment (paper reports GPT-4.1 had perfect TSR and HF1 but failed trajectory-level fidelity).
Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly.
Empirical evaluation reported in the paper: HMASP tested across 18 LLMs and 90,000 task instances; analysis via ASR showing checkpoint-skipping behavior for 10 models and correct enforcement for 8 models.
DePAI entails risks including security, centralization, incentive failure, legal exposure, and the crowding-out of intrinsic motivation, requiring value-sensitive design and continuously adaptive governance.
Risk analysis and conceptual argument in the paper identifying possible failure modes and recommended design/governance responses; no empirical incidence data provided.
WIOA is not well-equipped to support large-scale, cross-industry labor transitions.
Low observed incidence of cross-industry occupational transitions and limited shifts into less automation-exposed occupations in the WIOA data (2017-2023) lead authors to conclude the program is poorly suited for large-scale cross-industry reallocation.
A substantial portion of WIOA participants simply return to their prior field after program participation.
Descriptive and outcome analyses on the WIOA participation records (2017-2023) showing many participants re-enter the same occupation/industry rather than transitioning to different occupations.
WIOA rarely shifts workers into less automation-exposed work.
Analysis of WIOA administrative records (2017-2023) using a newly introduced 'Retrainability Index' that decomposes outcomes into post-intervention wage recovery and shifts in routine task intensity (RTI). The paper reports low incidence of downward RTI (movement into less automation-exposed occupations) among participants.
Mechanism tests indicate innovation stagnation in mature firms with redundant AI is a pathway that limits productivity gains (i.e., AI can be associated with stagnant innovation in mature firms).
Mechanism analysis reported in the paper showing signs of reduced innovation-related gains or stagnation in mature, advanced firms using AI (interpreted as redundant AI leading to limited incremental innovation).
AI integration creates challenges such as workforce displacement that must be addressed.
Authors raise workforce displacement as a challenge/consideration in the paper's discussion; this appears as a qualitative claim rather than an empirically quantified result in the supplied text.
AI integration creates challenges such as algorithmic bias that must be addressed.
Authors identify algorithmic bias as a notable challenge in the discussion/conclusion; presented qualitatively rather than as an estimated empirical outcome in the supplied text.
Responsible AI research typically focuses on examining the use and impacts of deployed AI systems, and there is currently limited visibility into the pre-deployment decisions to pursue building such systems.
Argument and literature framing presented in the paper based on a scoping review of academic literature, civil society resources, and grey literature.
This concentration can diffuse responsibility and raise the probability of irreversible system-level loss even when local per-action error rates remain low.
Theoretical result/argument from the model linking concentrated decision-energy to increased systemic risk despite low local error rates.
Efficiency pressure, path dependence, scale feedback, and weak boundary constraints concentrate decision-energy in the most efficient node.
Derived from the paper's formal model and argumentation about system dynamics (efficiency and feedback mechanisms); theoretical rather than empirical evidence.
Declining deployment friction changes the safety problem at its root: safety is not only local output correctness or preference alignment, but the control of irreversibility under rising decision density.
Main theoretical argument of the paper; supported by conceptual framing and a formal model that introduces decision-density considerations.
Recent AI systems compress the distance between capability growth and capability deployment.
Conceptual and descriptive claim in the paper's introduction; supported by theoretical argumentation and illustrative examples rather than empirical measurement.
Creative and interpersonal roles (musicians, physicians, natural sciences managers) show the reverse (i.e., they score low on RL feasibility but high on general AI exposure).
Empirical comparison between the RL Feasibility Index and existing AI-exposure measures, with named creative/interpersonal occupations showing opposite rankings.
Existing indices measure the overlap between AI capabilities and occupational tasks rather than which tasks AI systems can learn to perform, and as a result misclassify occupations where the gap between present capability and learnability is large.
Conceptual critique and comparison of existing AI-exposure indices vs. the authors' proposed learnability-focused approach (paper text argument and empirical comparisons implied later).
Of these four, integration capacity is the least developed for scientific institutions and the most binding: no improvement in AI tooling can buy it.
Normative/diagnostic claim in the paper about relative scarcity and irreducibility of integration capacity; no empirical measures or sample provided in the excerpt.
Four complements then become scarce and load-bearing for AI-augmented science: verified signal, legitimacy, authentic provenance, and integration capacity (the community's tolerance for delegated cognition).
Theoretical framework proposed by the paper; list of four complements presented as an argument without empirical quantification in the excerpt.