Evidence (5157 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Human-AI Collaboration
Increasing the complexity of the information structure has a significant and negative impact on information aggregation, suggesting AI agents may suffer from the same limitations as humans when reasoning about others.
Experimental manipulation of information-structure complexity in the controlled trading experiment; measured change in aggregation performance (log error of last price) as complexity increases.
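The excerpt does not spell out the log-error construction; a minimal sketch of one plausible reading (absolute log deviation of the last traded price from the true value, averaged within each complexity level) follows. All records and values are illustrative, not the experiment's data.

```python
import numpy as np

def log_error(last_price: float, true_value: float) -> float:
    """Absolute log-scale error between the market's last price and the true value."""
    return abs(np.log(last_price) - np.log(true_value))

# Hypothetical per-market records: (complexity level, last price, true value).
markets = [(1, 102.0, 100.0), (1, 97.0, 100.0),
           (3, 140.0, 100.0), (3, 65.0, 100.0)]

by_complexity = {}
for level, price, truth in markets:
    by_complexity.setdefault(level, []).append(log_error(price, truth))

for level, errs in sorted(by_complexity.items()):
    print(f"complexity={level}: mean log error = {np.mean(errs):.3f}")
```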
Users push back against agent outputs (through corrections, failure reports, and interruptions) in 44% of all turns.
Turn-level coding of user behavior in the SWE-chat dataset: proportion of conversational turns containing correction/complaint/interrupt signals, computed across >63,000 user prompts and sessions.
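A minimal sketch of turn-level pushback coding, assuming a keyword/regex-style flagging scheme; the actual SWE-chat coding rules are not given in the excerpt.

```python
import re

# Illustrative pushback signals; these patterns are assumptions for the sketch.
PUSHBACK_PATTERNS = [
    re.compile(r"\bthat'?s (wrong|incorrect|not right)\b", re.IGNORECASE),
    re.compile(r"\b(stop|wait|undo that)\b", re.IGNORECASE),
    re.compile(r"\b(doesn'?t work|still fails|you broke)\b", re.IGNORECASE),
]

def is_pushback(turn: str) -> bool:
    """Flag a user turn containing a correction, failure report, or interrupt."""
    return any(p.search(turn) for p in PUSHBACK_PATTERNS)

turns = ["Stop, that's wrong.", "Looks good, thanks!", "It still fails on test 3."]
rate = sum(is_pushback(t) for t in turns) / len(turns)
print(f"pushback rate: {rate:.0%}")  # the paper reports 44% across >63,000 prompts
```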
Agent-written code introduces more security vulnerabilities than code authored by humans.
Comparative analysis of security vulnerabilities attributed to agent-authored code versus human-authored code within the SWE-chat dataset (method details not specified in excerpt).
Just 44% of all agent-produced code survives into user commits.
Empirical measurement of code provenance and survival within the SWE-chat dataset: proportion of agent-produced code that becomes part of subsequent user commits across sessions.
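One way such a survival rate might be computed, assuming line-level provenance matching; the paper's actual tracking granularity is not specified in the excerpt.

```python
def survival_rate(agent_lines: set[str], committed_lines: set[str]) -> float:
    """Fraction of agent-authored lines that persist into the user's commit.
    Line-level matching is an assumption of this sketch; the paper may track
    provenance differently."""
    if not agent_lines:
        return 0.0
    return len(agent_lines & committed_lines) / len(agent_lines)

agent = {"def add(a, b):", "    return a + b", "print(add(1, 2))"}
committed = {"def add(a, b):", "    return a + b"}  # user dropped the print
print(f"survival: {survival_rate(agent, committed):.0%}")
```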
Despite rapidly improving capabilities, coding agents remain inefficient in natural settings.
Authors' summary claim supported by dataset-derived metrics such as agent code survival rate (44%) and user pushback (44% of turns); observational analysis of SWE-chat.
Regulated deployment imposes four load-bearing systems properties — deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale — and stateful architectures violate them by construction.
Conceptual/architectural argument presented in the paper (theoretical analysis), not an empirical measurement in the abstract.
Evaluation of four leading AI platforms shows that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient.
Empirical evaluation described in paper: four AI platforms tested on benchmark; reported average accuracy of 15% for RAG-based approaches on cases with insufficient information.
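A scoring sketch under the assumption that, on insufficient-information cases, the correct behavior is to abstain (flag the insufficiency) rather than answer; the benchmark's exact rubric is not given in the excerpt.

```python
# Each case records whether the retrieved context sufficed and whether the
# model abstained. Accuracy on insufficient cases = abstention rate there.
cases = [
    {"sufficient": False, "model_abstained": False},
    {"sufficient": False, "model_abstained": True},
    {"sufficient": False, "model_abstained": False},
    {"sufficient": True,  "model_abstained": False},
]
insufficient = [c for c in cases if not c["sufficient"]]
accuracy = sum(c["model_abstained"] for c in insufficient) / len(insufficient)
print(f"accuracy on insufficient-information cases: {accuracy:.0%}")  # 33% here
```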
Unemployment insurance adjudication has seen rapid integration of AI systems, and the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually.
Contextual/introductory claim in paper; references to domain-scale impact and bottleneck; no specific numeric study sample provided in excerpt.
A well-known limitation of AI systems is presumptuousness: the tendency of AI systems to provide confident answers when information may be lacking.
Statement in paper framing the problem; general literature/contextual claim (no specific experiment cited in the excerpt).
Brevity, semantic isolation and rhetorical register independently predict representational outcome (i.e., which submissions are included/excluded in summaries).
Statistical/semantic analysis (presumably regression or causal inference) reported in the paper linking textual features—brevity, semantic isolation, rhetorical register—to representational outcomes.
Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI, with exclusion rates of 33%–88% in such clusters.
Cluster/semantic analysis reported in the paper showing higher exclusion rates for clusters labeled as dissent/scepticism/critique.
In topic B, 15.3% of participants are effectively excluded by the official summary.
Empirical measurement reported in the paper quantifying participants 'effectively excluded' when comparing source submissions to official summary coverage.
In topic A, 16.9% of participants are effectively excluded by the official summary.
Empirical measurement reported in the paper quantifying participants 'effectively excluded' when comparing source submissions to official summary coverage.
The official government summary underperforms a random-participant baseline for topic B (coverage degradation of -8.0%).
Empirical comparison in the paper between official government summary and a random-participant baseline using the n=5,253 consultation responses.
The official government summary underperforms a random-participant baseline for topic A (coverage degradation of -9.1%).
Empirical comparison in the paper between official government summary and a random-participant baseline using the n=5,253 consultation responses.
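A minimal sketch of how coverage and "effective exclusion" could be operationalized with embeddings, assuming cosine similarity against the summary and a fixed exclusion threshold; all inputs below are synthetic placeholders, not the consultation data.

```python
import numpy as np

def coverage_stats(summary_vec, submission_vecs, threshold=0.5):
    """Cosine similarity of each participant's submission to a summary;
    participants below the threshold count as 'effectively excluded'.
    The embeddings and the 0.5 threshold are assumptions of this sketch."""
    sims = submission_vecs @ summary_vec / (
        np.linalg.norm(submission_vecs, axis=1) * np.linalg.norm(summary_vec))
    return sims.mean(), (sims < threshold).mean()

rng = np.random.default_rng(0)
subs = rng.normal(size=(5253, 64))   # one placeholder embedding per response
official = rng.normal(size=64)       # placeholder embedding of official summary

# Random-participant baseline: "summarize" by picking one submission at random.
baseline = subs[rng.integers(len(subs))]

cov_official, excl_official = coverage_stats(official, subs)
cov_baseline, _ = coverage_stats(baseline, subs)
print(f"coverage degradation vs baseline: {cov_official - cov_baseline:+.3f}")
print(f"effectively excluded (official): {excl_official:.1%}")
```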
LLMs endorsed fraudulent investments at 0% across all models tested.
Preregistered experiment across seven leading LLMs producing 3,360 AI advisory conversations; reported 0% endorsement of objectively fraudulent opportunities.
Endorsement reversal occurred in fewer than 3 in 1,000 observations.
Observed incidence reported from the preregistered experiment (3,360 AI advisory conversations); statement in paper reporting incidence <3/1,000.
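For scale, an incidence below 3 in 1,000 across the 3,360 conversations corresponds to roughly ten or fewer reversal events in the whole experiment.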
The policy and research challenge posed by platform-mediated automation is not merely job quantity (technological unemployment) but institutional continuity — how societies reproduce practical competence when platforms optimize for efficiency rather than formation.
Normative and conceptual claim developed through literature synthesis (institutional economics, platform governance, workforce development); presented as an analytical reframing rather than an empirically tested hypothesis.
Entry-level roles have historically functioned as apprenticeships in which workers acquire tacit knowledge and critical judgment; if platforms curtail these formative occupational layers, organizations may lack future workers capable of exercising contextual reasoning required to manage complex systems.
Institutional economics and workforce development literature cited in the paper; conceptual synthesis without original empirical measurement reported.
Platform-mediated automation risks hollowing out labor structures from both directions: eroding repetitive, junior roles from below and automating supervisory coordination functions from above.
Theoretical argument synthesizing institutional economics and platform literature; articulated as a conceptual risk rather than demonstrated with original empirical data.
Algorithmic systems are displacing routine tasks across both low-wage entry-level work and middle-management functions.
Stated in paper's argumentation; supported by a literature-based review drawing on platform governance literature and recent research on AI-enhanced automation (no original empirical sample or quantitative study reported).
A gender gap in generative AI adoption persists, concentrated in the most exposed occupations.
Stratified/descriptive and regression analyses of the 2024 EWCS showing gender differences in self-reported generative AI adoption, with the gap largest among occupations with highest exposure; sample >36,600 workers across 35 countries.
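A sketch of the kind of interaction model such an analysis might use, run on synthetic data; the variable names, effect sizes, and logit specification are assumptions for illustration, not the EWCS analysis itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data shaped like the note's description: adoption declines for
# women mainly where occupational AI exposure is high (assumed effect sizes).
rng = np.random.default_rng(0)
n = 5_000
female = rng.integers(0, 2, n)
exposure = rng.uniform(0, 1, n)  # occupational AI exposure score
logit_p = 0.5 + 0.8 * exposure - 0.2 * female - 1.2 * female * exposure
adopted = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

df = pd.DataFrame({"adopted": adopted, "female": female, "exposure": exposure})
model = smf.logit("adopted ~ female * exposure", data=df).fit(disp=0)
print(model.params)  # a negative female:exposure term = gap widens with exposure
```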
The infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it.
Authoritative claim in paper framing the research gap; presented as observational/argumentative (no empirical audit reported).
Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user.
Statement in paper's introduction/positioning; conceptual survey-style claim (no empirical study or systematic benchmark reported).
Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations.
Paper asserts that existing/standard benchmarks do not adequately isolate parsing and computation-orchestration abilities, motivating the new benchmark.
As multimodal AI achieves human-parity understanding of speech and gesture, [the keyboard's] necessity dissolves.
Theoretical claim supported by multidisciplinary review (history, neuroscience, technology, organizational studies); no quantified empirical test reported.
General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs.
Conceptual/argumentative claim stated in the paper's motivation; no empirical test reported in the abstract.
There was a nonsignificant absolute retest performance reduction in the AI condition, and a larger retest performance decrement in the AI condition than in the control condition (i.e., retention decreased more after using Copilot).
Comparison of retest (one-week) performance across conditions reported in results; authors report a nonsignificant reduction and larger decrement for the AI/Copilot condition (n=22).
Current operational approaches typically involve scattered testing tools, resulting in partial coverage and errors that surface only after deployment.
Authors' characterization of industry practice and limitations (assertion in paper; no empirical sample size reported in abstract).
Network change validation remains a critical yet predominantly manual, time-consuming, and error-prone process in modern network operations.
Statement in paper framing the problem; based on authors' characterization of current operational practice (no empirical sample size reported in abstract).
Thick subjectivist theories of meaning in life and meaningful work—those theories that emphasize that meaning-conferring activities are historically formed—enable us to appreciate how some losses cannot be made up, even if there are in principle ample alternative sources of meaning to be found elsewhere.
Theoretical claim about the explanatory power of 'thick subjectivist' normative theories; argued via conceptual philosophical analysis in the paper (no empirical testing reported).
Even if there are rich non-work sources of meaning, this does not entail that there is not a significant and multi-faceted loss of meaning, one that cannot be compensated for or offset elsewhere.
Normative/philosophical argument presented in the paper (conceptual reasoning rather than empirical measurement; no sample size).
The argument that non-work goods can replace work-derived meaning fails to consider the embeddedness and thickness of meaning in human lives.
Philosophical/theoretical critique based on conceptual analysis (author's argument invoking the notions of embeddedness and thickness of meaning; no empirical study reported).
The paper identifies governance challenges such as accountability gaps, digital sovereignty risks, ethical pluralism, and strategic weaponization arising from embedding AI in diplomatic practice.
Conceptual and normative analysis section of the paper outlining risks and governance challenges; illustrated by examples and argumentation.
Thin training coverage fosters anxiety about substitution and slows diffusion of AI tools.
Reported associations from surveys of mid-level managers and technical staff, interviews, and document analysis across cases; thematic coding identified links between limited training, worker anxiety, and slower diffusion. (Sample size not reported.)
Upstream textile SMEs frequently exhibit constrained supply chain resilience owing to persistent information latency and structural dependence on downstream orders.
Background/contextual claim stated in paper (motivation for study); no specific quantitative test reported in abstract.
Platforms can exploit workers' uncertainty about the cost of labor to effectively suppress wages.
Interpretation / implication drawn from the theoretical model and the result that a platform can achieve coverage while paying only O(log(M)/M) fraction of total labor cost under assumptions about workers' cost estimates.
There exists a simple pricing strategy for the platform that covers all M tasks with wait time O(M) while paying only an O(log(M)/M) fraction of the total cost of labor.
Theoretical result from the paper's posted-price procurement model under stated assumptions on workers' estimated costs; formal analysis/proof showing existence of such a pricing strategy for general M (no empirical sample).
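To see why an O(log(M)/M) cost fraction is striking, a quick numeric illustration follows; the constant factor is set to 1 here, which is an assumption about the hidden constants in the bound.

```python
import math

# The platform's share of total labor cost under the stated asymptotic bound,
# with an assumed constant factor of 1, shrinks rapidly as M grows.
for M in (10, 100, 1_000, 10_000):
    fraction = math.log(M) / M
    print(f"M={M:>6}: platform pays ~{fraction:.4f} of total labor cost")
```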
Because the technical threshold for this transition is already crossed at modest engineering effort, the window for protective frameworks covering disclosure, consent, compensation and deployment restriction is the present, while deployment remains optional rather than infrastructural.
Authors' normative claim based on their implementation (distillation and deployment) and interpretation that modest engineering sufficed; used to argue policy urgency for disclosure/consent/compensation frameworks.
We term this the Relic condition: when publication systems make stable reasoning architectures legible, extractable and cheaply deployable, the public record of intellectual labor becomes raw material for its own functional replacement.
Conceptual framing introduced by the authors as an interpretation of the observed results and their implications; not an empirical measurement but a named condition/argument.
Agency in software engineering is primarily constrained by organizational policies rather than individual preferences.
Authors' synthesis of qualitative results across the ACTA/Delphi and task/review phases indicating organizational policy factors were cited as primary constraints.
Existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions, and do not capture forecasting over continuous quantities.
Literature/benchmark critique asserted in the paper (argument that current benchmarks focus on simple judgmental formats and miss continuous numerical forecasting capabilities).
Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
Empirical observations from QuantSightBench evaluation showing model calibration performance as a function of magnitude (paper statement noting sharp degradation and overconfidence at extremes).
The top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all fall at least 10 percentage points short of the 90% coverage target.
Reported empirical coverage percentages from evaluation on QuantSightBench for the listed models (paper provides these percentage values).
None of the 11 evaluated frontier and open-weight models achieves the 90% coverage target.
Empirical evaluation on the newly introduced QuantSightBench benchmark across 11 frontier and open-weight models; models were assessed on empirical coverage of prediction intervals versus a 90% target (paper statement).
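Empirical coverage of a stated 90% interval is simple to compute; a minimal sketch with hypothetical intervals (not benchmark data) follows.

```python
import numpy as np

def empirical_coverage(lows, highs, truths):
    """Fraction of realized values falling inside the stated 90% intervals."""
    lows, highs, truths = map(np.asarray, (lows, highs, truths))
    return np.mean((lows <= truths) & (truths <= highs))

# Hypothetical elicited intervals for three quantities (illustrative only).
cov = empirical_coverage([10, 0.5, 1e6], [20, 0.9, 5e6], [18, 0.95, 2e6])
print(f"empirical coverage: {cov:.1%} vs 90% target")  # 66.7% here: overconfident
```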
The study identified significant implementation challenges including algorithmic bias, digital divide concerns, data privacy risks, and low technology readiness among HR teams in Tier 2 cities.
Synthesis of qualitative case study findings from 4 organizations plus survey responses (N=150) reporting barriers and risks encountered during adoption.
Current attack policies do not saturate LinuxArena (human-crafted attacks evade monitors at substantially higher rates than model-generated attacks, indicating headroom for attackers).
Empirical observation comparing human-crafted attacks (LaStraj) and elicited model-generated attacks; authors interpret higher human evasion rates as evidence that current automated attack policies have not saturated the challenge posed by LinuxArena.
LaStraj is a dataset of human-crafted attack trajectories that evade monitors at substantially higher rates than any model-generated attacks we elicited.
Authors release LaStraj and report empirical comparisons showing human-crafted trajectories evade monitors at higher rates than the model-generated attacks they tested (exact evasion rates and sample sizes not provided in the excerpt).
Against a GPT-5-nano trusted monitor at a 1% step-wise false positive rate, Claude Opus 4.6 achieves roughly a 23% undetected sabotage success rate.
Empirical sabotage evaluation reported by the authors: a trusted monitor (GPT-5-nano) is operated at a specified step-wise false positive rate, and the attacking model's (Claude Opus 4.6) undetected success rate is reported. (Sample size / number of evaluated runs not provided in the excerpt.)
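A sketch of the control-evaluation arithmetic implied here: calibrate the monitor's step-level threshold to a 1% false positive rate on benign steps, then count attacks whose every step stays under it. The scores below are synthetic, and this setup is an assumption, not the paper's exact procedure.

```python
import numpy as np

def threshold_at_fpr(benign_scores, fpr=0.01):
    """Step-level suspicion threshold calibrated to the target false
    positive rate on benign steps."""
    return np.quantile(benign_scores, 1 - fpr)

def undetected_rate(attack_trajectories, thr):
    """An attack counts as undetected only if every step stays below threshold."""
    return np.mean([traj.max() < thr for traj in attack_trajectories])

rng = np.random.default_rng(1)
benign = rng.normal(0.0, 1.0, 10_000)                     # benign-step scores
attacks = [rng.normal(0.8, 1.0, 20) for _ in range(500)]  # per-step attack scores
thr = threshold_at_fpr(benign)
print(f"threshold={thr:.2f}, undetected={undetected_rate(attacks, thr):.1%}")
```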
Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
Qualitative and quantitative analysis of errors observed across the DELEGATE-52 experiments (19 LLMs) showing sparse, high-severity, and silently introduced errors that accumulate over long workflows.
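How sparse, silent errors compound over long interaction can be made concrete with a one-line probability model; the per-edit error rate below is an illustrative assumption, not a rate measured in the DELEGATE-52 experiments.

```python
# If each delegated edit silently introduces an error with probability p,
# the chance the document carries at least one silent error after T edits
# compounds as 1 - (1 - p)**T.
p = 0.02  # assumed illustrative per-edit silent-error rate
for T in (1, 10, 50, 100):
    print(f"T={T:>3}: P(at least one silent error) = {1 - (1 - p) ** T:.1%}")
```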