Evidence (6869 claims)
Adoption (8570 claims)
Productivity (7631 claims)
Governance (6869 claims)
Human-AI Collaboration (6491 claims)
Org Design (4175 claims)
Innovation (4114 claims)
Labor Markets (3566 claims)
Skills & Training (2966 claims)
Inequality (2066 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
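The direction counts can be turned into shares per outcome. The sketch below copies three rows from the matrix and is illustrative only. In some rows the Total column exceeds the sum of the four direction columns (for example, Governance & Regulation: 826 + 400 + 191 + 122 = 1539 against a Total of 1563), and the table does not say what accounts for the difference. The sketch therefore assumes shares should be computed over the listed directions only and reports the gap separately. The names `MATRIX`, `direction_shares`, and `unlisted_claims` are hypothetical.

```python
# Rows copied from the Evidence Matrix:
# (positive, negative, mixed, null, total)
MATRIX = {
    "Governance & Regulation": (826, 400, 191, 122, 1563),
    "AI Safety & Ethics": (218, 277, 65, 33, 599),
    "Job Displacement": (12, 80, 20, 1, 113),
}

DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(outcome):
    """Share of each direction among claims that have a listed direction.

    Assumption: the denominator is the sum of the four direction columns,
    not the Total column, because the two do not always agree.
    """
    counts = MATRIX[outcome][:4]
    listed = sum(counts)
    return {name: count / listed for name, count in zip(DIRECTIONS, counts)}


def unlisted_claims(outcome):
    """Claims counted in Total but in none of the four direction columns."""
    *counts, total = MATRIX[outcome]
    return total - sum(counts)


if __name__ == "__main__":
    for outcome in MATRIX:
        shares = {k: round(v, 3) for k, v in direction_shares(outcome).items()}
        print(outcome, shares, "unlisted:", unlisted_claims(outcome))
```

Read this way, Job Displacement has no unlisted claims, and most of its listed claims (80 of 113) report a negative direction.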
Governance
An analysis of a 21-instrument inventory identifies an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification.
Empirical/qualitative analysis of an inventory of 21 governance instruments compiled in the paper (n=21 instruments).
Behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify.
The paper's normative and conceptual argument synthesising governance requirements and the epistemic limits of behavioural testing.
Current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify latent representations or long-horizon agentic behaviours.
Conceptual/analytic argument and review of existing assurance methodologies presented in the paper.
Overthinking is a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.
Conclusion drawn by authors based on their empirical findings described in the abstract (amplification of output length across multiple models and transferability experiments).
This overthinking behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS)-style resource exhaustion.
Authors assert increased latency and energy consumption as consequences of longer reasoning traces; framed as a potential attack vector in the abstract (no quantitative latency/energy measurements provided in abstract).
Large reasoning models (LRMs) exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces when confronted with incomplete or logically inconsistent inputs.
Empirical observation reported by the authors based on experiments described in the paper (abstract references experiments across multiple SOTA reasoning models); no numerical sample size for inputs reported in abstract.
Policy responses in Europe are fragmented across the EU and Member State levels and do not match the potential scale of disruption from AGI.
Paper's policy analysis of EU- and Member-State-level responses (stated in abstract); no quantitative metrics provided in the abstract.
Europe has low rates of industrial AI adoption.
Paper's empirical/policy review claiming low industrial AI adoption in Europe (as stated in abstract); the abstract does not provide numeric adoption rates or sample sizes.
Europe exhibits structural weaknesses in compute infrastructure and talent retention.
Paper's structural assessment of Europe's AI value-chain capabilities (stated in abstract); no numerical measures provided in the abstract.
Europe has limited strategic awareness of frontier AI progress.
Paper's assessment of Europe's positioning based on policy analysis and review of capabilities monitoring (as stated in abstract); no supporting metrics or sample sizes provided in the abstract.
AGI could strain existing governance frameworks.
Paper's policy analysis describing potential mismatches between governance capacity and AGI-induced disruptions (as stated in abstract); no empirical tests or quantification reported in the abstract.
AGI could intensify interstate competition.
Paper's geopolitical analysis and scenario-based reasoning informed by trends in AI capabilities (stated in abstract); no quantitative measures reported in the abstract.
AGI could fundamentally alter the global distribution of economic and military power.
Paper's geopolitical analysis drawing on capability trends and scenario reasoning (as stated in abstract); no empirical quantification provided in the abstract.
Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored.
Literature/contextual positioning in the paper contrasting prior benchmarks' focus with the proposed task.
There is a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.
Synthesis of experimental findings indicating that existing detectors and MLLMs are insufficiently reliable for the specific task of claim-conditioned refund-evidence verification.
Current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets.
Experimental results reported in the paper comparing MLLM true positive rates (TPR) on real-damaged vs. fake-damaged subsets produced by multiple generators.
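For reference, the true positive rate (TPR) in this setting is the fraction of fake-damaged images that a model flags as fake. The sketch below is a minimal illustration under that assumption. The function name and the example labels are hypothetical, not data from the paper.

```python
def true_positive_rate(labels, predictions):
    """TPR = flagged fakes / all fakes.

    labels, predictions: sequences of 0/1 where 1 means "fake-damaged"
    (the positive class) and 0 means "real-damaged".
    """
    # Keep only the predictions made on images that are actually fake.
    on_fakes = [pred for label, pred in zip(labels, predictions) if label == 1]
    if not on_fakes:
        raise ValueError("TPR is undefined without any positive examples")
    return sum(on_fakes) / len(on_fakes)


# Hypothetical example: four fake-damaged images, two real-damaged images.
example_labels = [1, 1, 1, 1, 0, 0]
example_predictions = [1, 0, 0, 0, 0, 0]  # only one fake is caught
example_tpr = true_positive_rate(example_labels, example_predictions)
```

In this made-up example the model classifies both real-damaged images correctly but catches only one of four fakes, for a TPR of 0.25, which is the kind of pattern the claim describes as falling below the 50% baseline.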
Direct demographic targeting excludes users whose demographics the platform cannot infer ('unknown users') if advertising platforms do not provide a way to target unknown users directly, as is the case on Google Ads.
Platform capability statement about Google Ads (authors' description of Google Ads targeting options); no sample size provided.
Skewed ad delivery of public-service ads can prevent certain groups of individuals from accessing information about resources on the basis of their demographic identity.
Argument/implication drawn from observed demographic skew in ad delivery and its relevance to public-service outreach; no specific empirical sample size reported in the excerpt.
Ad delivery can be skewed by demographic attributes, such that ads are systematically under-delivered to certain groups despite advertiser intent to reach groups proportionally.
Cites prior audits of ad delivery (literature/audit studies referenced by the paper); descriptive claim based on prior empirical work (no sample size stated in the provided excerpt).
The paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics.
Conceptual analysis and problem-framing presented in the paper (qualitative identification of five mismatch categories).
Current surveys remain fragmented across system optimization, architecture design, and trust, lacking a unified framework to evaluate the fundamental trade-off between output quality and economic cost.
Authors' literature review and critique of existing surveys; based on mapping of prior works into separated strands (qualitative assessment rather than quantified meta-analysis).
Exponential token consumption introduces severe computational, collaborative, and security bottlenecks.
Synthesis presented in the paper arguing that rising token usage causes system-level constraints; based on literature survey and conceptual analysis (no single empirical sample reported).
Producing hardened, production-grade agent workflows may require extra compute and time, and these costs must be amortized through reuse across a broad user community.
Argument in paper reasoning that added rigor entails higher compute/time costs and that reuse across users is needed to amortize these costs; no empirical cost estimates provided.
By focusing on rapid, real-time synthesis, AI agents are effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them.
Conceptual argument presented in the paper asserting a qualitative mismatch between on-the-fly agents and high-stakes production needs; no empirical validation reported.
The on-the-fly paradigm short-circuits disciplined software engineering processes—iterative design, rigorous testing, adversarial evaluation, staged deployment, and more—that have delivered relatively reliable and secure systems.
Argumentative claim in paper linking the on-the-fly loop to reduced application of standard SE processes; no empirical study, sample, or quantitative evidence provided.
A controlled delivery-mode comparison shows that inline evaluation produces false negatives: GPT-5.1 shows 0% trust inline but 100% under both simulated and real agentic tool-use, demonstrating that delivery mode is a first-order confound.
Controlled experiments comparing inline evaluation vs simulated and real agentic tool-use on GPT-5.1; reported 0% trust in inline mode vs 100% trust in agentic modes (authors' reported results).
Every tested model trusts poisoned data 100% of the time at moderate attacker sophistication (L2): all 269 valid trials (of 270) accepted fabricated security claims under directed queries.
Primary experimental results across 270 directed-query trials (9 models × 30 each); authors report 269 of 270 trials accepted fabricated security claims under attacker sophistication level L2.
We demonstrate six attack scenarios against a production 42-million-node code knowledge graph, providing the first empirical demonstration of knowledge graph poisoning against a production-scale agentic system.
Empirical demonstrations described in paper: six distinct attack scenarios executed against a production knowledge graph containing 42 million nodes (authors' reported experimental setup).
We define Oracle Poisoning, an attack class in which an adversary corrupts a structured knowledge graph that AI agents query at runtime via tool-use protocols, causing incorrect conclusions through correct reasoning.
Conceptual definition presented by the authors in the paper (theoretical framing and distinction from prompt injection).
Disclosure banners, conversion A/B testing, UI dark-pattern taxonomies, and generic LLM safety scores were built for older interfaces and miss the prose-recommendation surface where the steering happens.
Argument in paper that existing governance/audit tools designed for ranked-list or older UIs do not cover the new single-sentence prose-recommendation surface; no empirical test reported in excerpt.
Current AI development trajectory reflects value choices that prioritize conversational generality over domain specificity, accountability, and long-term social sustainability.
Normative/critical analysis in the paper highlighting design priorities and trade-offs; no empirical measurement provided.
Sustained investment in large-scale chatbot infrastructures increases environmental costs.
Paper asserts environmental impacts from infrastructure investment (energy, resource use) as part of systemic critique; no quantified environmental measurements or sample size reported.
Chatbot-driven AI development contributes to concentration of economic power.
Argumentation about industry dynamics and infrastructure centralization in the paper; no empirical market-concentration metrics or sample provided.
The normalization of chatbots contributes to labor displacement.
Theoretical argument linking widespread chatbot adoption to changes in work and employment; no empirical displacement estimates provided.
Normalization of chatbot-mediated interaction alters patterns of work, learning, and decision-making, contributing to deskilling, homogenization of knowledge, and shifting expectations of expertise.
Analytical reasoning and literature-informed claims in the paper; no quantitative measurement or sample reported.
Chatbot-based systems often fail to adequately meet user needs, particularly in complex or high-stakes contexts, while projecting confidence and authority.
Qualitative argumentation and illustrative examples in the paper; no reported controlled empirical study or sample size.
The chatbot paradigm is not a neutral interface choice, but a dominant sociotechnical configuration whose widespread adoption reshapes social, economic, legal, and environmental systems.
Conceptual argument and synthesis in the paper (theoretical analysis); no empirical sample or quantitative data reported.
This frequently leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope.
Author argument drawing on conceptual critique and cited empirical distinctions (paper's argumentative content).
AI deployment in sensitive domains (health care, credit, employment, criminal justice) is often treated as unsafe to authorize until model internals can be explained.
Author assertion based on observed regulatory and institutional tendencies described in the paper (argumentative / contextual evidence within the paper).
A scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study.
Paper references a scoping review that examined FDA-approved AI/ML device documents and reported the 9.0% figure.
A 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action.
Paper cites a recent empirical finding reporting a 53-percentage-point gap between models' internal representations and their ability to correct outputs (described as 'recent evidence').
Institutional inertia in property valuation poses risks to asset pricing, collateral risk modelling and investor confidence.
Analytical inference from interview findings and theoretical synthesis highlighting implications for property investment and financial market stability.
Despite advances in automation, data analytics and AI, the sector has been slow to digitise.
Background statement supported by interview data and sector observation reported in the study.
The IDOI framework provides a transferable model for understanding digital transformation in regulated, high-trust professions and highlights the market-level risks of institutional inertia in property valuation.
Development of the IDOI conceptual framework from qualitative data and theoretical integration; authors' claim about transferability and implications.
Generational divides, protectionist attitudes and fears of automation reinforce digital resistance.
Qualitative interview evidence reporting attitudes across cohorts of valuers and firm personnel; thematic analysis identifying cultural and attitudinal themes.
The Valuers Act (1948), fragmented infrastructure and sovereignty concerns limit innovation.
Interview data from practitioners, firm leaders and regulators in New Zealand citing specific regulatory and infrastructure constraints; thematic analysis.
Barriers to adoption arise primarily from institutional conservatism, outdated regulation and weak data governance rather than technical shortcomings.
Qualitative semi-structured interviews with valuers, firm leaders and regulators in New Zealand; thematic analysis guided by Rogers' diffusion of innovations and institutional theory synthesised into the IDOI framework.
Taken together, AI’s effects on labor and capital may strain democracy unless a set of policies we outline here is gradually implemented.
Paper's normative/predictive claim linking labor- and capital-market effects of AI to political strain on democratic institutions and proposing policy remedies (presented as contingent and prescriptive; no empirical test of democratic outcomes provided in the excerpt).
AI’s training and computing needs are intensifying the technological sector’s interest in regulatory capture.
Paper's causal/inferential claim that increased capital concentration and fixed investments raise incentives for regulatory capture in the tech sector (asserted reasoning; no political-economy empirical test reported in the excerpt).
AI’s current training and computing needs have magnified capital concentration and business investment in fixed assets.
Paper's economic claim linking AI compute/training requirements to increased capital concentration and fixed-asset investment (no quantitative investment or market-concentration data provided in the excerpt).