Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

The emergence of AI agents—systems where large language models serve as the primary reasoning engine, dynamically generating and discarding code as an instrumental resource—constitutes a fundamental restructuring of the software paradigm rather than an incremental improvement.

Argument based on first-principles analysis of complexity scaling and conceptual comparison between traditional software and agentic systems (theoretical analysis presented in the paper).

high positive The End of Software Engineering: How AI Agents Are Fundament... nature of the software development paradigm (static-code-centric vs LLM-driven a...

ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

Author-stated intent and high-level goal of the benchmark.

high positive Agents' Last Exam alignment of benchmark evaluation with GDP-relevant impact (economic impact of A...

ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded.

Design and maintenance policy described by the authors.

high positive Agents' Last Exam continuous expansion of benchmark task pool

ALE was developed in collaboration with 250+ industry experts.

Author statement specifying collaborator count.

high positive Agents' Last Exam number of industry experts involved in development

This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes.

Description of benchmark introduced by the authors (design claim).

high positive Agents' Last Exam AI agent performance on long-horizon real-world tasks (verifiable outcomes / tas...

Recent AI systems have achieved strong results on a wide range of benchmarks.

Statement in paper (background/context); refers to existing benchmark results in the literature (no specific benchmarks or datasets named in this excerpt).

high positive Agents' Last Exam performance on existing AI benchmarks

The open-source implementation includes audit trails and confidence scoring, providing a replicable foundation for LLM-based actuarial variable extraction in property-casualty insurance.

Authors state the released implementation is open-source and includes audit trail and confidence scoring features; presented as part of the contribution.

high positive Leveraging LLMs for Unstructured Claims Data Analysis availability of auditability and confidence scoring in the implementation

Integration with chain ladder reserving demonstrates practical actuarial value: severity-segmented analysis reduced reserve estimation error from 6.5% to 4.0%.

Applied the extracted severity segmentation to chain ladder reserving in an integration experiment; reported reserve estimation error decreased from 6.5% to 4.0%. Sample size/portfolio details not stated in the claim.

high positive Leveraging LLMs for Unstructured Claims Data Analysis reserve estimation error
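As a rough illustration of the reserving step named in this claim, the sketch below runs a basic volume-weighted chain ladder on a cumulative loss triangle and sums reserves across severity segments. The triangles, the segment names, and the idea that "severity-segmented" means one triangle per segment are assumptions made for illustration only; the paper's actual data and implementation are not given in this excerpt.

```python
def chain_ladder_reserve(triangle):
    """Basic chain ladder on a cumulative triangle.

    triangle: list of rows, one per origin year, oldest first;
    row i holds the cumulative losses observed so far (n - i values).
    Returns (development_factors, total_reserve).
    """
    n = len(triangle)
    factors = []
    for j in range(n - 1):
        # Use only origin years observed at both development ages j and j + 1.
        rows = [r for r in triangle if len(r) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    reserve = 0.0
    for row in triangle:
        ultimate = row[-1]
        # Apply the remaining development factors to the latest observed value.
        for f in factors[len(row) - 1:]:
            ultimate *= f
        reserve += ultimate - row[-1]
    return factors, reserve


def segmented_reserve(triangles_by_segment):
    """Sum chain ladder reserves computed separately per segment (assumed design)."""
    return sum(chain_ladder_reserve(t)[1] for t in triangles_by_segment.values())


# Hypothetical triangles, not taken from the paper.
low_severity = [[100, 150, 165], [110, 165], [120]]
high_severity = [[200, 400, 500], [220, 440], [240]]
total = segmented_reserve({"low": low_severity, "high": high_severity})

# Relative size of the reported improvement (6.5% error down to 4.0%).
relative_reduction = (6.5 - 4.0) / 6.5
```

The reported change from 6.5% to 4.0% is a drop of 2.5 percentage points, which the last line expresses as a share of the original error.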

We validate 14 core variables using two independent clinical expert reviewers scoring 20 synthetic claims on a five-point Likert rubric, achieving mean scores above 4.0 and a weighted kappa of 0.53.

Validation experiment: two independent clinical expert reviewers scored 20 synthetic claims on a 5-point Likert scale for 14 core variables; reported metrics are mean Likert scores (>4.0) and weighted kappa = 0.53.

high positive Leveraging LLMs for Unstructured Claims Data Analysis quality/accuracy/agreement of extracted variables (Likert scores and inter-rater...
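For readers unfamiliar with the agreement statistic reported here, the following is a minimal sketch of Cohen's weighted kappa for two raters on a five-point ordinal scale. The excerpt does not say whether linear or quadratic weights were used, so the `power` parameter is an assumption, and the ratings in the usage lines are invented.

```python
def weighted_kappa(rater1, rater2, categories=5, power=1):
    """Cohen's weighted kappa for two raters on a 1..categories ordinal scale.

    power=1 uses linear disagreement weights, power=2 quadratic weights.
    """
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("need two equally long, non-empty rating lists")
    n, k = len(rater1), categories
    # Observed joint proportions of (rater1 score, rater2 score).
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        obs[a - 1][b - 1] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    d_obs = d_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight in [0, 1]
            d_obs += w * obs[i][j]
            d_exp += w * row[i] * col[j]  # disagreement expected by chance
    if d_exp == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - d_obs / d_exp


# Invented ratings for illustration only.
reviewer_a = [5, 4, 4, 3, 5, 4]
reviewer_b = [4, 4, 5, 3, 5, 3]
kappa = weighted_kappa(reviewer_a, reviewer_b)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.53 sits between the two.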

A modular four-script Python pipeline processes synthetic FHIR-based claims data and real claims documents, extracting 36 actuarial variables across reserving, ratemaking, and claims management categories.

Authors report implementation of a four-script Python pipeline applied to synthetic FHIR-based claims and real documents, with 36 target variables defined.

high positive Leveraging LLMs for Unstructured Claims Data Analysis number of actuarial variables extractable by the pipeline

We present a proof-of-concept framework using large language models (LLMs) to extract structured actuarial variables from unstructured claims data.

Authors implemented a prototype framework described in the paper (implementation details and pipeline described).

high positive Leveraging LLMs for Unstructured Claims Data Analysis ability to extract structured actuarial variables from unstructured text

Understanding the evolution of LLM-augmented search is critical for organizations seeking to maintain brand relevance in an AI-augmented information landscape.

Prescriptive concluding claim in paper; based on the authors' synthesis of observed trends and conceptual analysis rather than empirical validation in the provided excerpt.

high positive SEARCH ENGINE OPTIMIZATION: HOW LLM-GENERATED SUMMARIES ARE ... organizational ability to maintain brand relevance

I have developed LLMbench, a research instrument for the comparative close reading of LLM outputs that visualises token probability distributions, entropy curves, and cross-model divergence.

Description of a tool/method developed by the author (LLMbench); claim about the tool's features as stated in the abstract; no implementation details or evaluation sample sizes provided in the abstract.

high positive Prompt anxiety and the algorithmic politics of uncertainty tool capabilities (visualisation of token probabilities, entropy, cross-model di...
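The quantities this claim mentions can be illustrated with a short sketch: Shannon entropy of a next-token probability distribution, and a divergence between two models' distributions over the same candidate tokens. The use of Jensen-Shannon divergence for the cross-model comparison is an assumption; the abstract does not state which measure LLMbench uses, and the probabilities below are invented.

```python
import math


def entropy_bits(probs):
    """Shannon entropy (in bits) of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def js_divergence_bits(p, q):
    """Jensen-Shannon divergence (in bits) between two distributions
    defined over the same ordered set of candidate tokens."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence, skipping zero-probability terms of x.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented distributions over the same four candidate tokens.
model_a = [0.70, 0.20, 0.05, 0.05]
model_b = [0.25, 0.25, 0.25, 0.25]

# One entropy value per generated position gives an entropy curve.
entropy_curve = [entropy_bits(dist) for dist in (model_a, model_b)]
divergence = js_divergence_bits(model_a, model_b)
```

Higher entropy marks positions where a model is less certain about the next token; the divergence is 0 for identical distributions and at most 1 bit.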

Public examples referenced include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case.

The paper cites specific public incidents and a legal case as examples supporting its discussion.

high positive From Control Boundary to Insurance Claim: Reconstructing AI-... use of real-world examples and adjudicated case to illustrate AI reconstruction ...

The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction.

Author-stated contributions in the paper; descriptive of the paper's goals and deliverables.

high positive From Control Boundary to Insurance Claim: Reconstructing AI-... conceptual/operational contributions delivered by the paper (definition, operati...

The paper introduces CER, a use-case-level diagnostic for AI residual risk transfer: C (control boundary) asks whether the system had an enforceable operating envelope; E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts; R (insurance response) asks whether the reconstructed loss is insured (coverage available and placed, and proof needed to support claim recovery).

Framework introduction and operationalization described in the paper; presented as the paper's primary methodological contribution.

high positive From Control Boundary to Insurance Claim: Reconstructing AI-... diagnostic ability to evaluate residual risk transfer via control boundaries, ev...

The paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning.

Scope statement in the paper listing specific failure modes; descriptive rather than empirical.

high positive From Control Boundary to Insurance Claim: Reconstructing AI-... coverage of AI-caused loss modes (identification of failure types relevant to re...

The relevant question for such losses is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery.

Conceptual framing provided in the paper; presented as the diagnostic/analytic focus rather than backed by empirical data in the excerpt.

high positive From Control Boundary to Insurance Claim: Reconstructing AI-... completeness of reconstruction (allowed actions, actual actions) needed to estab...

AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts.

Argument presented in the paper as a conceptual/theoretical claim about the nature of AI-system-caused losses; no empirical sample or quantitative study reported in the excerpt.

high positive From Control Boundary to Insurance Claim: Reconstructing AI-... need for state reconstruction (vs. event-only reconstruction) to support insuran...

The future of agentic-AI insurance lies not in a single monoline product but in a layered ecosystem of complementary coverages supported by improved governance, transparency, telemetry, and regulatory clarity.

Analytic conclusion/recommendation based on the paper's risk taxonomy, actuarial framework, and parallels to cyber insurance; forward-looking synthesis rather than empirical causal evidence.

high positive Insurance of Agentic AI recommended market design for agentic-AI insurance (layered ecosystem vs single ...

A coordinated insurance architecture integrating cyber, technology errors and omissions, product liability, performance-warranty, and affirmative AI-liability coverages with explicit allocation mechanisms and dedicated AI aggregates is proposed.

Design proposal in the paper detailing a layered insurance architecture combining multiple coverages and allocation mechanisms; conceptual design not empirically tested.

high positive Insurance of Agentic AI proposed coordinated insurance architecture for agentic AI

The paper proposes an actuarial framework based on exposure assessment, scenario analysis, dependency mapping, and accumulation-risk management, drawing parallels to the evolution of cyber insurance.

Proposed actuarial approach described in the paper, invoking methods like scenario analysis and dependency mapping and analogizing to cyber insurance development; methodological proposal without empirical validation.

high positive Insurance of Agentic AI actuarial framework components for agentic-AI insurance

The paper develops a framework for understanding underwriting, pricing, reinsurance, and product-design implications for agentic-AI insurance.

Methodological contribution stated in the paper: proposed actuarial/underwriting framework (exposure assessment, scenario analysis, dependency mapping, accumulation-risk management); conceptual development rather than empirical validation.

high positive Insurance of Agentic AI framework for underwriting/pricing/reinsurance/product design

Large-scale online experiments demonstrate consistent relative improvements in device cold-start engagement.

Reported results from large-scale online experiments in Tubi production (no numerical effect sizes or sample sizes provided in excerpt).

high positive Bridging the Semantic-Collaborative Gap: An Asymmetric Graph... device cold-start engagement

Large-scale online experiments demonstrate consistent relative improvements in impression acquisition.

Reported results from large-scale online experiments in Tubi production (no numerical effect sizes or sample sizes provided in excerpt).

high positive Bridging the Semantic-Collaborative Gap: An Asymmetric Graph... impression acquisition (number/rate of impressions for content)

Large-scale online experiments demonstrate consistent relative improvements in promotion speed.

Reported results from large-scale online experiments in Tubi production (no numerical effect sizes or sample sizes provided in excerpt).

high positive Bridging the Semantic-Collaborative Gap: An Asymmetric Graph... promotion speed (how quickly new content is promoted)

Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement.

Reported results from large-scale online experiments in Tubi production (no numerical effect sizes or sample sizes provided in excerpt).

high positive Bridging the Semantic-Collaborative Gap: An Asymmetric Graph... content cold-start engagement

After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors.

Functional claim based on model training and retrieval behavior described in paper (mechanistic claim; supported by described architecture and training procedure).

high positive Bridging the Semantic-Collaborative Gap: An Asymmetric Graph... ability to generate embeddings for new content and enable implicit graph complet...

The RHS content tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space.

Design choice and intended representational effect described in paper (architectural constraints and claimed representational consequence).

high positive Bridging the Semantic-Collaborative Gap: An Asymmetric Graph... content encoder representation (mapping intrinsic features into CF-aware embeddi...

The method is accessible to public entities under budget constraints because it used free AI models.

Author reports that the deployments used free AI models rather than paid services and were implemented within the budgets of the two public units.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... cost/accessibility of method (use of free AI models)

The method operates within protocols designed to comply with international and national data-protection law and with the principles of public administration.

Author statement that the method used protocols designed for legal compliance; paper reports no detected incidents and claims protocol adherence.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... legal and administrative compliance of method

The analysis is consistent with the hypothesis that the method is portable across agencies with distinct mandates.

Observed positive outcomes in two distinct public-sector units (SES/CONT and UCI/SEDET) after applying the same methodology; author frames this as consistency with portability hypothesis.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... method portability across agencies

UCI/SEDET analyzed cases totaling USD 104.3 million in financial volume during the period examined.

Aggregate monetary total of cases analyzed reported from SEI-GDF official indicators in the paper.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... monetary volume of cases analyzed (USD)

UCI/SEDET issued 288 formal recommendations to public managers during the examined period.

Count of formal recommendations reported from SEI-GDF official indicators as presented in the paper.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... number of formal recommendations issued

UCI/SEDET recorded a 92% increase in technical-report production during the period examined.

Quantitative production figures from SEI-GDF official indicators reported in the paper for UCI/SEDET.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... technical-report production (output volume)

Official indicators from SEI-GDF recorded a 50% fall in average processing time at UCI/SEDET during the period examined.

Quantitative before–after measurement from SEI-GDF official indicators for UCI/SEDET as reported in the paper.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... average processing time

Official indicators from the Electronic Information System of the Federal District Government (SEI-GDF) recorded an 18.2% fall in average processing time at SES/CONT during the period examined.

Quantitative before–after measurement from SEI-GDF official indicators for SES/CONT as reported in the paper.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... average processing time

The method was applied in two distinct units: the Sectoral Internal Control Office of the Federal District Department of Health (SES/CONT) throughout 2024, and the Internal Control Unit of the Federal District Department of Economic Development, Labor and Income (UCI/SEDET) throughout 2025.

Paper reports implementation timelines and unit names; described as auditable cases.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... application of the method (implementation occurrence)

The author developed a four-layer structured pedagogical methodology for teaching generative-AI use in the public sector.

Author description of the methodology in the paper; applied in two case units.

high positive The Main Barrier to AI Adoption in the Public Sector is Lack... existence and description of a four-layer pedagogical method

Digital learning platforms and AI-based training tools are increasingly used as central mechanisms to support continuous skill acquisition and professional growth.

Synthesis of prior studies and thematic literature discussed in the editorial (Bankins et al., 2024a; other cited works).

high positive Guest editorial: STARA (smart technology, AI, robotics and a... use of digital learning/AI training tools to support skill acquisition and profe...

Adoption of STARA increases the need to upskill and reskill workers across skill levels, with even high-skilled workers expected to integrate new digital competencies into their professional trajectories.

Literature synthesis and cited empirical/conceptual studies (e.g. Hani et al., 2025; Ibrahim and Abiddin, 2024; Singh and Chandra, 2026; Tariq, 2026).

high positive Guest editorial: STARA (smart technology, AI, robotics and a... demand for upskilling/reskilling and digital competency acquisition

The Talent pillar exerts a significant positive effect on tourism’s GDP share with a one-year lag.

Lagged specification (one-year lag) in fixed-effects panel models on 33 countries (2017–2023); reported coefficient β = 0.183, p = 0.025.

high positive Which dimensions of AI development shape tourism’s direct co... tourism’s direct GDP share

The Policy and Governance pillar is a significant positive driver of tourism’s GDP share.

Pillar decomposition with fixed-effects estimation on panel data (33 countries, 2017–2023); reported coefficient β = 0.353, p = 0.037; result robust to alternative SE and two-way fixed effects.

high positive Which dimensions of AI development shape tourism’s direct co... tourism’s direct GDP share

The AI-related R&D pillar is a significant positive driver of tourism’s GDP share.

Pillar decomposition using fixed-effects models on the same 33-country panel (2017–2023); reported coefficient β = 1.811, p = 0.005; effect robust to alternative standard errors and two-way fixed effects.

high positive Which dimensions of AI development shape tourism’s direct co... tourism’s direct GDP share
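The three tourism claims above rest on fixed-effects panel estimates, one of them with a one-year lag. The sketch below shows the within (country-demeaned) estimator for a single regressor with an optional lag. It is a simplified illustration: the function name, the single-regressor setup, and the toy panel are assumptions, and the paper's models (multiple pillars, controls, two-way effects, robust standard errors) are not reproduced.

```python
from collections import defaultdict


def within_beta(panel, lag=0):
    """One-way fixed-effects slope of y on x lagged by `lag` years.

    panel: iterable of (country, year, x, y) tuples.
    Observations whose lagged x is unavailable are dropped.
    """
    x_at = {(c, t): x for c, t, x, _ in panel}
    groups = defaultdict(list)
    for c, t, _, y in panel:
        if (c, t - lag) in x_at:
            groups[c].append((x_at[(c, t - lag)], y))
    num = den = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        # Demeaning within each country removes the country fixed effect.
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    return num / den


# Toy panel built so that y_t = country effect + 0.5 * x_{t-1}.
toy_panel = [
    ("A", 2017, 1.0, 0.0), ("A", 2018, 2.0, 10.5),
    ("A", 2019, 4.0, 11.0), ("A", 2020, 7.0, 12.0),
    ("B", 2017, 3.0, 0.0), ("B", 2018, 1.0, -2.5),
    ("B", 2019, 2.0, -3.5), ("B", 2020, 5.0, -3.0),
]
beta_lagged = within_beta(toy_panel, lag=1)
```

With `lag=1`, each country's first year has no earlier value of x and drops out, which is why lagged specifications use fewer observations than contemporaneous ones.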

Journalists and editors exercise bounded and situational agency through local adaptation, self-training, and development of ethical guardrails that institutionalise responsible AI use.

Based on in-depth interviews with newsroom staff (journalists, editors, technical personnel) at Al-Masry Al-Youm; qualitative accounts of local practices such as self-training and the creation of internal ethical rules. Sample size not reported in the excerpt.

high positive Platformisation, Power, and AI Governance in the Newsroom: I... local adaptation, skill development, and internal governance practices

The synthesized mixed-objective program retains most of the profit-oriented baseline's funds.

Reported comparison in simulation between the synthesized program and a profit-oriented baseline showing the synthesized program keeps most of the baseline funds while reducing gaming behaviors.

high positive Healthcare Mechanisms from Policy-as-Code Search under Strat... funds retained relative to profit-oriented baseline

The synthesized mixed-objective program halves rejection.

Results from the LLM-guided evolutionary search experiment reported in the paper: the synthesized program reduces rejection by half in the simulation.

high positive Healthcare Mechanisms from Policy-as-Code Search under Strat... patient rejection rate

LLM-guided evolutionary code search synthesizes an inspectable mixed-objective program that eliminates up-coding.

Experiment using LLM-guided evolutionary search over the rule-program space within Medi-Sim; the synthesized program reportedly eliminates up-coding behavior in the simulation.

high positive Healthcare Mechanisms from Policy-as-Code Search under Strat... incidence of up-coding under the synthesized program

A single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection.

Targeted simulation experiment in Medi-Sim where an audit intervention closes the coding channel; reported effect is >2x increase in low-complexity patient selection.

high positive Healthcare Mechanisms from Policy-as-Code Search under Strat... rate or incidence of low-complexity patient selection after closing coding chann...

An incentive sweep recovers classical health-economics findings as adjacent regimes: up-coding and low-complexity-patient selection under profit pressure.

Simulation experiments (an 'incentive sweep') run in Medi-Sim showing regimes with up-coding and selection of low-complexity patients when profit incentives are increased.

high positive Healthcare Mechanisms from Policy-as-Code Search under Strat... incidence of up-coding and selection of low-complexity patients under profit pre...
