Evidence (13827 claims)
| Topic | Claims |
|---|---|
| Adoption | 8454 |
| Productivity | 7544 |
| Governance | 6789 |
| Human-AI Collaboration | 6327 |
| Org Design | 4126 |
| Innovation | 4058 |
| Labor Markets | 3520 |
| Skills & Training | 2924 |
| Inequality | 2057 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Privacy-by-design architectures, secure data interoperability, and compliance automation contribute to trust, institutional legitimacy, and long-term adoption of digital health solutions.
Synthesis of literature on privacy engineering, interoperability standards, and compliance technologies presented in the review (literature review; inferred causal linkages discussed).
The framework gives particular attention to algorithmic transparency, risk management, regulatory alignment, and lifecycle oversight of AI-enabled health systems operating under evolving privacy regulations (e.g., data protection laws and cross-border data governance standards).
Descriptive emphasis within the proposed framework, based on cited literatures in regulatory alignment and algorithmic governance (literature synthesis / conceptual emphasis).
This review develops a comprehensive conceptual framework that integrates AI governance principles, data privacy compliance mechanisms, and financially sustainable operational models within digital health ecosystems.
The paper's primary contribution is a proposed conceptual framework derived from synthesizing interdisciplinary literatures (conceptual framework produced by authors based on literature review).
The rapid expansion of digital health technologies driven by artificial intelligence has transformed healthcare delivery, clinical decision-making, and health data management.
Narrative synthesis in the review paper drawing on interdisciplinary literature in health informatics, clinical AI studies, and health data management (literature review / conceptual synthesis).
We present a simulation study analyzing the social benefits of applying ARS to agentic transactions.
Simulation study reported in the paper (study exists; abstract does not report simulation parameters, sample size, or quantitative results).
This shifts trust from an implicit expectation about model behavior to an explicit, measurable, and enforceable product guarantee.
Conceptual claim about the expected effect of adopting ARS (argument presented by authors; no empirical substantiation in the abstract).
Under ARS, users receive predefined and contractually enforceable compensation in cases of execution failure, misalignment, or unintended outcomes.
Functional guarantee described as part of ARS design (contractual/payment mechanism described; no empirical testing detailed in the abstract).
ARS integrates risk assessment, underwriting, and compensation into a single transaction framework that protects users when interacting with agents.
Design description of ARS in the paper (architectural/design claim; no empirical validation reported in the abstract).
We propose a complementary framework based on risk management: the Agentic Risk Standard (ARS), a payment settlement standard for AI-mediated transactions.
Framework proposal described in the paper (design/proposal; implementation referenced).
Security evaluation across 135 test cases demonstrates 87.5% accuracy on static code safety analysis with zero false positives.
Security evaluation in the paper across 135 test cases; the paper reports both the accuracy and the false-positive rate.
Security evaluation across 135 test cases demonstrates 96.7% accuracy on prompt injection detection.
Security evaluation in the paper across 135 test cases; the paper reports the accuracy metric.
On document intelligence (DocILE), Code Factory achieves the highest line item recognition accuracy (LIR: 80.4%).
Empirical evaluation reported on DocILE dataset of 5,680 invoices; LIR metric reported at 80.4% and described as the highest among compared variants.
Compiled AI reduces token consumption by 57x at 1,000 transactions.
Empirical token-consumption comparison reported in paper (scaling example at 1,000 transactions).
Compiled AI breaks even with runtime inference at approximately 17 transactions.
Cost/efficiency comparison reported in evaluation (function-calling context); break-even point stated in paper.
On function-calling, compiled AI achieves 96% task completion with zero execution tokens.
Empirical evaluation on the BFCL function-calling tasks (reported n=400).
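The break-even point and the 57x reduction reported above are consistent with a simple amortization model, sketched below. It assumes a one-time generation cost and, following the function-calling entry, zero execution tokens per compiled transaction. The token counts (17,500 one-time, 1,000 per runtime call) and the function names are hypothetical values chosen for illustration, not figures from the paper.

```python
def break_even_transactions(one_time_tokens: float,
                            runtime_tokens_per_txn: float,
                            compiled_tokens_per_txn: float = 0.0) -> float:
    """Number of transactions at which the one-time cost is recovered."""
    saving = runtime_tokens_per_txn - compiled_tokens_per_txn
    if saving <= 0:
        raise ValueError("compiled execution must be cheaper per transaction")
    return one_time_tokens / saving


def reduction_factor(n_txn: int,
                     one_time_tokens: float,
                     runtime_tokens_per_txn: float,
                     compiled_tokens_per_txn: float = 0.0) -> float:
    """Ratio of runtime-inference tokens to compiled tokens after n_txn transactions."""
    runtime_total = n_txn * runtime_tokens_per_txn
    compiled_total = one_time_tokens + n_txn * compiled_tokens_per_txn
    return runtime_total / compiled_total


if __name__ == "__main__":
    # Hypothetical inputs, NOT taken from the paper.
    ONE_TIME = 17_500   # tokens spent once to generate and validate the code
    PER_CALL = 1_000    # tokens spent per transaction with runtime inference
    print(break_even_transactions(ONE_TIME, PER_CALL))        # 17.5
    print(round(reduction_factor(1_000, ONE_TIME, PER_CALL), 1))  # 57.1
```

Under these assumptions the reduction factor at n transactions is simply n divided by the break-even point, which is why a break-even near 17 and a 57x saving at 1,000 transactions fit together.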
We introduce a system architecture for constrained LLM-based code generation, a four-stage generation-and-validation pipeline that converts probabilistic model output into production-ready code artifacts, and an evaluation framework measuring operational metrics including token amortization, determinism, reliability, security, and cost.
Paper states these three contributions as part of the authors' work (descriptive claim about methods and artifacts presented).
By constraining generation to narrow business-logic functions embedded in validated templates, compiled AI trades runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure.
Conceptual/systems claim made in paper describing design trade-offs of the compiled AI paradigm (no single empirical test cited in the excerpt).
Experimental evidence confirms that AI tools raise worker productivity.
Statement in paper referencing experimental studies (no specific study, method, or sample size reported in the excerpt).
A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects.
Paper describes an interception layer in the evaluation infrastructure that prevents actual final submissions on production sites.
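A minimal sketch of how such an interception layer might behave, assuming final submissions can be recognized by HTTP method and a URL marker. The `Request` and `SubmissionInterceptor` names, the marker strings, and the matching rule are all hypothetical; the excerpt does not describe ClawBench's actual mechanism.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Request:
    method: str
    url: str
    body: str = ""


@dataclass
class SubmissionInterceptor:
    # URL fragments that identify a final, side-effecting submission (assumed).
    final_url_markers: Tuple[str, ...]
    captured: List[Request] = field(default_factory=list)

    def is_final_submission(self, req: Request) -> bool:
        writes = req.method.upper() in {"POST", "PUT", "PATCH"}
        return writes and any(m in req.url for m in self.final_url_markers)

    def handle(self, req: Request) -> bool:
        """Return True if the request is forwarded, False if it is blocked."""
        if self.is_final_submission(req):
            self.captured.append(req)  # keep the payload for later scoring
            return False
        return True
```

Only the request matching the marker is held back; every other request, including intermediate writes, passes through, so the agent still interacts with the live site up to the last step.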
Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction.
Methodological description in the paper: evaluation occurs on live (production) websites rather than offline static sandboxes; supported by reported coverage of 144 live platforms.
The tasks in ClawBench require demanding capabilities beyond existing benchmarks, such as extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and completing write-heavy operations like filling many detailed forms correctly.
Paper description of task types and the capabilities they require; based on the design and composition of the 153 tasks.
ClawBench spans 144 live platforms across 15 categories.
Paper explicitly reports coverage across 144 production websites and 15 task categories (dataset description).
ClawBench is an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work.
Paper states the benchmark comprises 153 tasks (dataset description).
When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics.
Normative conclusion in the abstract based on the paper's proposed framework and discussion; presented as an overall benefit but not supported by empirical outcomes or quantified gains in the excerpt.
For estimation problems—automating the measurement of economic concepts for downstream analysis—valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates.
Methodological claim in the abstract advocating use of a small validation sample together with LLM outputs to achieve consistent/precise estimates; no empirical demonstration or sample-size specification provided in the excerpt.
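One common way to combine LLM outputs with a small validation sample is a bias-corrected mean: average the LLM labels over the full sample, then add the average labeling error measured on the validation subset. The sketch below illustrates that general idea only; it is not claimed to be the paper's estimator, it assumes the validation sample is drawn at random, and the function name and toy data are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def corrected_mean(llm_all: Sequence[float],
                   llm_val: Sequence[float],
                   true_val: Sequence[float]) -> float:
    """LLM-based mean over the full sample, corrected by validation-sample bias."""
    if not llm_val or len(llm_val) != len(true_val):
        raise ValueError("validation labels must be non-empty and aligned")
    # Average error of the LLM labels where ground truth is available.
    bias = fmean([t - p for t, p in zip(true_val, llm_val)])
    return fmean(llm_all) + bias


if __name__ == "__main__":
    # Toy data, purely illustrative.
    llm_all = [1, 1, 0, 1, 0, 1, 1, 0]  # LLM labels on the full sample
    llm_val = [1, 1, 0, 1]              # LLM labels on the validation subset
    true_val = [1, 0, 0, 1]             # human labels on the same subset
    print(corrected_mean(llm_all, llm_val, true_val))  # 0.375
```

The raw LLM mean here is 0.625; the validation subset shows the LLM overstating the label by 0.25 on average, so the corrected estimate is lower.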
The paper provides an econometric framework for realizing the potential of LLMs in two empirical uses: prediction problems and estimation problems.
Claim of contribution in the abstract describing a methodological framework (the excerpt reports the existence of the framework but does not detail empirical validation or sample sizes).
Researchers can now revisit old questions and tackle novel ones with rich data using LLMs.
Asserted in the paper's abstract as a consequence of LLM-enabled large-scale text analysis; no empirical demonstration or quantified case described in the excerpt.
Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost.
Stated as an assertion in the paper's abstract/summary; based on the authors' framing of LLM capabilities (no empirical sample, experiment, or quantified result provided in the excerpt).
There is an urgent need for targeted workforce planning, investment in human capital, and collaboration between industry, government, and educational institutions to manage AI-driven labour market transformations.
Policy conclusion drawn from the paper's theoretical framing (SBTC, Human Capital Theory) and the empirical patterns identified in secondary data and official reports (2020–2024).
Comparative insights from the United Kingdom show that more systematic AI adoption and structured training programs mitigate workforce displacement.
Cross-country comparison using secondary data and official reports (2020–2024) highlighting the UK's more systematic AI adoption and structured training, which the paper presents as reducing displacement risk.
AI adoption is increasing demand for new competencies.
Secondary sources and official reports (2020–2024) cited in the paper document emerging skill requirements and employer demand for new competencies.
AI adoption is driving growth in high-wage occupations.
Analysis of secondary data and official reports (2020–2024) reporting expansion of high-wage occupational categories in India.
AI adoption disproportionately benefits high-skilled workers.
The paper cites theoretical frameworks (Skill Biased Technological Change and Human Capital Theory) and analyses of secondary data and official reports from 2020–2024 showing relative gains for high-skill occupations.
All data, code, and model responses are open-sourced.
Statement in the paper asserting that data, code, and model outputs are publicly released.
78.7% of observed AI interactions are augmentation, not automation.
Empirical classification of AI interactions (from cross-referenced Anthropic Economic Index interactions/tasks) reported as a percentage in the paper.
The study cross-references the SAFI benchmark with real-world AI adoption data from the Anthropic Economic Index covering 756 occupations and 17,998 tasks.
Data linkage described in the paper: use of Anthropic Economic Index as real-world AI adoption dataset (numbers reported in text).
The benchmark covers 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy.
Reported dataset construction in the paper: 263 tasks mapped to 35 O*NET skills.
We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash) across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate).
Empirical benchmark executed by the authors: 263 text-based tasks mapped to 35 O*NET skills and run on 4 LLMs, for 1,052 total model calls with a reported 0% failure rate.
The paper argues for a fundamental decoupling of semantic intent from human-readable representation.
Conceptual/design claim made by the authors as a recommended shift in representation strategy for agentic consumers; presented as argumentation rather than empirically tested in abstract.
We extend the semantic density principle to propose rehabilitation of classical anti-patterns and introduce the program skeleton concept for agentic code navigation.
Design/position claims and proposed constructs presented in the paper (program skeleton concept and re-evaluation of anti-patterns) without empirical validation reported in abstract.
Aggressive compression reduced input tokens by 17%.
Reported numeric result from the controlled experiment comparing compressed logs to other conditions; sample size not specified in abstract.
We propose a key design principle: semantic density optimization, eliminating tokens that carry zero information while preserving tokens that carry high semantic value.
Proposal/design principle presented in the paper; theoretical justification provided and (per paper) subsequently validated by experiment.
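As a rough illustration of the principle applied to logs, the sketch below drops lines that carry no information (blank lines, decorative separators, immediate repeats) and collapses padding whitespace, while leaving message content untouched. The rules, the function names, and the whitespace-based token proxy are assumptions for illustration; the excerpt does not specify the paper's compression procedure or tokenizer.

```python
import re
from typing import Iterable, List

# Lines made only of whitespace or decoration characters (assumed to be zero-information).
SEPARATOR = re.compile(r"^[\s\-=*_#]*$")


def compress_log(lines: Iterable[str]) -> List[str]:
    """Remove separator lines and consecutive duplicates; normalize whitespace."""
    out: List[str] = []
    prev = None
    for line in lines:
        normalized = " ".join(line.split())
        if SEPARATOR.match(normalized):
            continue
        if normalized == prev:
            continue
        out.append(normalized)
        prev = normalized
    return out


def token_count(lines: Iterable[str]) -> int:
    """Crude token proxy: whitespace-separated words."""
    return sum(len(line.split()) for line in lines)
```

For example, `compress_log(["=====", "INFO   start job", "INFO   start job", "", "ERROR disk full on /dev/sda1", "-----"])` keeps only the two informative lines, reducing the proxy token count from 13 to 8.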
These empirical findings provide a reference for governments worldwide seeking to optimise artificial intelligence policies for low-carbon urban development.
Paper conclusion interpreting results as policy-relevant and generalisable lessons for governments; based on observed positive association between NAIDPZ and urban GEE.
The impact of the NAIDPZ policy on urban GEE is positively moderated by government attention and public environmental attention.
Reported moderation analysis showing interaction effects between the treatment indicator and measures of government attention and public environmental attention within the DiD framework.
The composite NAIDPZ policy effect increases GEE mainly through promoting green technological innovation and optimising industrial structure.
Mechanism analysis reported in the paper (channel/mediation tests) showing that indicators of green technological innovation and industrial structure optimisation account for much of the policy effect on GEE.
The policy effect on GEE is stronger in inland cities, central-region cities, and non-resource-based cities.
Reported heterogeneity/subgroup analysis within the staggered DiD framework comparing effects across geographic regions (inland vs. others, central vs. others) and city types (non-resource-based vs. resource-based) in the 267-city sample.
The NAIDPZ policy significantly improves urban green economic efficiency (GEE).
Estimated treatment effect from staggered DiD on the 267-city panel (2007–2023) with reported statistical significance and multiple robustness checks mentioned.
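For orientation, a staggered DiD of this kind is conventionally written as a two-way fixed-effects regression. The form below is the standard textbook specification, given as an assumption; the excerpt does not state the paper's exact equation, and the control vector $X_{it}$ is a placeholder.

```latex
\mathrm{GEE}_{it} = \alpha + \beta\,\mathrm{NAIDPZ}_{it} + \gamma^{\top} X_{it} + \mu_i + \lambda_t + \varepsilon_{it}
```

Here $i$ indexes the 267 cities and $t$ the years 2007 to 2023; $\mathrm{NAIDPZ}_{it}$ equals 1 once city $i$ is covered by the policy and 0 otherwise; $\mu_i$ and $\lambda_t$ are city and year fixed effects; and $\beta$ is the treatment effect the claim refers to.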
ImplicitMemBench reframes evaluation from 'what agents recall' to 'what they automatically enact'.
Paper framing statement positioning the benchmark's conceptual contribution as shifting evaluation focus to implicit, automatic behavior rather than explicit recall.
Top performers were DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%).
Paper lists top model names with reported overall percentage scores from the benchmark evaluation.
The benchmark's 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring.
Paper states the suite size (300 items) and describes a unified Learning/Priming-Interfere-Test protocol scored on first attempts.
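First-attempt scoring can be sketched as follows, assuming each item records an ordered list of attempt outcomes and that only the first one counts. The data layout and function name are hypothetical, and treating an item with no attempt as incorrect is an assumption, not something the excerpt specifies.

```python
from typing import List, Sequence


def first_attempt_accuracy(attempts_per_item: Sequence[List[bool]]) -> float:
    """Share of items whose first attempt was correct; later attempts are ignored."""
    if not attempts_per_item:
        raise ValueError("no items to score")
    correct = sum(1 for attempts in attempts_per_item if attempts and attempts[0])
    return correct / len(attempts_per_item)


if __name__ == "__main__":
    # Four toy items: a retry that succeeds (item 2) does not earn credit.
    results = [[True], [False, True], [True, False], []]
    print(first_attempt_accuracy(results))  # 0.5
```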