Large language models can reliably extract key actuarial variables from unstructured claims text in a proof-of-concept, earning high expert scores and moderate inter-rater agreement; integrating these extractions into chain-ladder reserving cut reserve-estimation error from 6.5% to 4.0%.

Leveraging LLMs for Unstructured Claims Data Analysis

Robert D. Lieberthal, Richard Tran, Vietbao Phan, Jawand Singh, Elizabeth Sottung · June 04, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

A two-stage LLM-based pipeline can extract actuarial variables from unstructured claims text with expert-rated accuracy (mean scores >4.0, weighted kappa 0.53) and, when used to segment severity in chain-ladder reserving, reduced reserve estimation error from 6.5% to 4.0% in a proof-of-concept.

Actuaries rely primarily on structured numerical data for reserving and ratemaking, while valuable predictive information in unstructured text including medical records, adjuster notes, and call transcripts remains largely unused. Manual processing of these documents is time-consuming, inconsistent across reviewers, and unscalable. We present a proof-of-concept framework using large language models (LLMs) to extract structured actuarial variables from unstructured claims data. We implement a two-stage processing architecture separating document-level extraction (Stage 1) from claim-level synthesis (Stage 2). A modular four-script Python pipeline processes synthetic FHIR-based claims data and real claims documents, extracting 36 actuarial variables across reserving, ratemaking, and claims management categories. We validate 14 core variables using two independent clinical expert reviewers scoring 20 synthetic claims on a five-point Likert rubric, achieving mean scores above 4.0 and a weighted kappa of 0.53. Integration with chain ladder reserving demonstrates practical actuarial value: severity-segmented analysis reduced reserve estimation error from 6.5% to 4.0%. The open-source implementation includes audit trails and confidence scoring, providing a replicable foundation for LLM-based actuarial variable extraction in property-casualty insurance.

Summary

Main Finding

LLMs can be used to convert unstructured claims documents (medical notes, adjuster notes, transcripts) into standardized actuarial variables at scale. A prototype two-stage LLM pipeline applied to synthetic claims data extracted variables from a 36-variable taxonomy (14 core variables validated) with high expert-rated accuracy (mean scores above 4.0 on a five-point scale) and produced demonstrable actuarial value: integrating LLM-derived classifications into severity-segmented chain-ladder reserving reduced ultimate estimate error from 6.5% to 4.0% (a 2.5 percentage-point improvement).

Key Points

Goal: bridge rich narrative claims information and quantitative actuarial workflows by automatically extracting structured predictors from unstructured text using LLMs.
Architecture: two-stage processing
- Stage 1: document‑level extraction (per document)
- Stage 2: claim‑level synthesis (aggregate per claim, maintain claim IDs and timestamps)
Prototype pipeline: four Python scripts, JSON output schemas for integration with actuarial systems.
Data: synthetic claim ecosystems generated from Synthea FHIR bundles plus LLM‑generated nonmedical documents (adjuster notes, call transcripts, settlement docs) to create consistent, testable claims with known ground truth.
Variable taxonomy: 36 actuarial variables defined (categorical, numeric, Boolean). Validation focused on 14 core variables including claim/injury severity, medical complexity, treatment type, expected development, ultimate cost, litigation risk, settlement likelihood, management complexity.
Validation:
- Two independent clinician reviewers used a taxonomy + Likert rubric (1–5) to score extractions.
- Extraction accuracy across variable categories averaged at least 4/5.
- Confidence scores were well calibrated; interrater reliability was moderate (weighted kappa of 0.53).
Hallucination mitigation and quality controls:
- Controlled vocabularies / categorical constraints (with “unknown” option)
- Mandatory rationale fields citing document evidence
- Multilevel confidence scoring (document/variable/overall)
- Actuarial consistency checks (mathematical relationships)
Actuarial integration: LLM‑extracted classifications used for severity‑segmented chain‑ladder analysis and produced measurable reserve estimation improvement.
Artefacts released: demonstration code, JSON schemas, and synthetic test datasets to enable reproducibility and extension.
Limitations noted: reliance on synthetic data for development, remaining risks of coherent-but-incorrect LLM outputs, need for production-grade calibration, OCR and provider documentation variability when moving to real data, and regulatory/governance considerations.
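
The quality controls listed under "Hallucination mitigation and quality controls" can be pictured as a small validation pass over each extracted variable. The sketch below is illustrative only: the vocabulary, the field names, and the simplified paid-plus-outstanding consistency rule are assumptions made for this example, not the paper's schema or checks.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary; the paper's actual category labels may differ.
SEVERITY_LEVELS = {"minor", "moderate", "severe", "catastrophic", "unknown"}


@dataclass
class Extraction:
    variable: str
    value: str
    rationale: str     # mandatory text citing document evidence
    confidence: float  # variable-level confidence in [0, 1]


def validate_extraction(item: Extraction, allowed: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if item.value not in allowed:
        problems.append(f"{item.variable}: '{item.value}' is outside the controlled vocabulary")
    if not item.rationale.strip():
        problems.append(f"{item.variable}: rationale is missing")
    if not 0.0 <= item.confidence <= 1.0:
        problems.append(f"{item.variable}: confidence {item.confidence} is outside [0, 1]")
    return problems


def check_cost_consistency(paid: float, outstanding: float, ultimate: float,
                           tolerance: float = 0.01) -> bool:
    """Simplified example rule (assumed): ultimate should equal paid plus outstanding."""
    return abs((paid + outstanding) - ultimate) <= tolerance * max(ultimate, 1.0)


good = Extraction("injury_severity", "moderate", "ED note documents a closed wrist fracture.", 0.85)
bad = Extraction("injury_severity", "pretty bad", "", 1.2)
print(validate_extraction(good, SEVERITY_LEVELS))
print(validate_extraction(bad, SEVERITY_LEVELS))
```

Returning a list of problems rather than raising on the first one keeps every failed check visible in an audit trail.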

Data & Methods

Synthetic data generation:
- Synthea configured to produce FHIR R4 patient bundles representing realistic clinical trajectories for injury claims (e.g., workers’ compensation scenarios).
- Script 1 (FHIR Bundle Processor): transforms structured FHIR elements into narrative clinical notes using an LLM to emulate real medical text.
- Script 2 (Document Augmentation Engine): generates nonmedical documents (adjuster notes, phone transcripts, claimant statements, settlement notes) from an LLM-created master profile of roughly 800 or more characters, which keeps facts consistent across documents (same claim numbers, dates, clinical facts).
Pipeline & schema:
- Modular four-script Python implementation producing standardized JSON outputs keyed by claim ID and document timestamps to permit time-aware augmentation of actuarial datasets.
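
One way to read "keyed by claim ID and document timestamps" is that each extraction record carries both keys, so a reserving dataset can be augmented using only the documents available as of a given valuation date. A minimal sketch follows; the field names, claim IDs, and values are hypothetical, and the released schemas may be structured differently.

```python
import json
from datetime import datetime

# Hypothetical extraction records; field names are illustrative only.
records_json = """
[
  {"claim_id": "CLM-001", "document_timestamp": "2024-01-15T09:30:00", "injury_severity": "moderate"},
  {"claim_id": "CLM-001", "document_timestamp": "2024-06-02T14:00:00", "injury_severity": "severe"},
  {"claim_id": "CLM-002", "document_timestamp": "2024-03-10T11:15:00", "injury_severity": "minor"}
]
"""


def latest_as_of(records: list[dict], valuation_date: datetime) -> dict[str, dict]:
    """For each claim, keep the most recent record dated on or before the valuation date."""
    latest: dict[str, dict] = {}
    for rec in records:
        ts = datetime.fromisoformat(rec["document_timestamp"])
        if ts > valuation_date:
            continue  # document not yet available at this valuation date
        current = latest.get(rec["claim_id"])
        if current is None or ts > datetime.fromisoformat(current["document_timestamp"]):
            latest[rec["claim_id"]] = rec
    return latest


records = json.loads(records_json)
snapshot = latest_as_of(records, datetime(2024, 3, 31))
print({claim: rec["injury_severity"] for claim, rec in snapshot.items()})
```

Filtering by timestamp before joining avoids leaking later information into an earlier valuation, which matters when the extracted variables feed development triangles.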
Extraction approach:
- Two-stage LLM workflow: per-document extraction then claim-level aggregation/synthesis.
- 36-variable taxonomy defined; prototype focuses on 14 high-value variables.
- Outputs include categorical values, numeric fields, and mandatory textual rationale for auditability.
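
The two-stage flow can be sketched with plain data structures. Here the Stage 1 LLM call is replaced by a keyword stub, and the Stage 2 rule (keep the most confident value that is not "unknown") is an assumption for illustration; the source describes both stages as part of an LLM workflow, so this shows only the shape of the data passing between them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DocumentExtraction:
    """Stage 1 output: one variable extracted from one document."""
    claim_id: str
    document_id: str
    variable: str
    value: str
    confidence: float
    rationale: str


def extract_from_document(claim_id: str, document_id: str, text: str) -> list[DocumentExtraction]:
    """Stand-in for the Stage 1 LLM call (hypothetical keyword rule, not a real model)."""
    value = "high" if "attorney" in text.lower() else "unknown"
    confidence = 0.9 if value == "high" else 0.3
    return [DocumentExtraction(claim_id, document_id, "litigation_risk", value,
                               confidence, rationale=text[:80])]


def synthesize_claim(extractions: list[DocumentExtraction]) -> dict[str, dict[str, str]]:
    """Stage 2 stand-in: per claim and variable, keep the most confident known value."""
    best: dict[tuple[str, str], DocumentExtraction] = {}
    for ext in extractions:
        key = (ext.claim_id, ext.variable)
        known = ext.value != "unknown"
        incumbent = best.get(key)
        if incumbent is None:
            best[key] = ext
        elif known and (incumbent.value == "unknown" or ext.confidence > incumbent.confidence):
            best[key] = ext
    claims: dict[str, dict[str, str]] = defaultdict(dict)
    for (claim_id, variable), ext in best.items():
        claims[claim_id][variable] = ext.value
    return dict(claims)


documents = [
    ("CLM-001", "adjuster_note_1", "Claimant reports steady recovery."),
    ("CLM-001", "call_transcript_1", "Claimant states an attorney has been retained."),
]
stage1 = [ext for claim_id, doc_id, text in documents
          for ext in extract_from_document(claim_id, doc_id, text)]
print(synthesize_claim(stage1))
```

Keeping Stage 1 records separate from the claim-level result preserves the per-document rationale, so any synthesized value can be traced back to its source document.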
Validation:
- Ground truth is known due to synthetic generation.
- Two independent clinician experts reviewed extracted variables and rated correctness using a Likert 1–5 rubric linked to taxonomy.
- Performance metrics: average accuracy ≥4/5; confidence calibration and moderate interrater reliability (weighted kappa 0.53) reported.
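
For readers who want to see how a weighted kappa is obtained from two reviewers' Likert scores, here is a minimal pure-Python sketch. The source does not say whether linear or quadratic weights were used, so the weighting is left as a parameter, and the ratings are invented for illustration (they are not the study's data). On the commonly cited Landis and Koch scale, values from 0.41 to 0.60 are labeled "moderate".

```python
def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5), weights="linear"):
    """Cohen's weighted kappa for two raters scoring the same items on an ordinal scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions and each rater's marginal proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d  # any other value means quadratic

    observed_dis = sum(disagreement(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected_dis = sum(disagreement(i, j) * marg_a[i] * marg_b[j]
                       for i in range(k) for j in range(k))
    if expected_dis == 0:
        return 1.0  # both raters used one identical category throughout
    return 1.0 - observed_dis / expected_dis


# Invented ratings for eight extractions; not data from the paper.
reviewer_1 = [5, 4, 4, 3, 5, 4, 2, 5]
reviewer_2 = [5, 4, 3, 3, 4, 4, 3, 5]
kappa_demo = weighted_kappa(reviewer_1, reviewer_2)
print(round(kappa_demo, 3))
```

Weighted kappa suits ordinal rubrics because a 4-versus-5 disagreement is penalized less than a 1-versus-5 disagreement, which unweighted kappa would treat identically.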
Actuarial testing:
- Severity-segmented reserve triangles built using LLM classifications feeding into chain‑ladder reserving.
- Comparison to baseline showed reduction of ultimate estimate error from 6.5% to 4.0%.
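
To make the reserving step concrete, the sketch below runs a basic volume-weighted chain-ladder on one cumulative triangle per severity segment and compares the summed segment ultimates with a single aggregate triangle. The triangles are made-up numbers and the segment labels are hypothetical; the paper's data, triangle dimensions, and error metric are not reproduced here.

```python
def development_factors(triangle):
    """Volume-weighted age-to-age factors from a cumulative triangle (list of rows)."""
    n_periods = len(triangle[0])
    factors = []
    for j in range(n_periods - 1):
        rows = [r for r in triangle if len(r) > j + 1]  # rows observed at both ages
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors


def chain_ladder_ultimates(triangle):
    """Project each origin period's latest cumulative value to ultimate."""
    factors = development_factors(triangle)
    ultimates = []
    for row in triangle:
        value = row[-1]
        for f in factors[len(row) - 1:]:
            value *= f
        ultimates.append(value)
    return ultimates


# Illustrative cumulative paid triangles (oldest origin year first); not the paper's data.
# The latest origin year has relatively more high-severity claims.
triangles_by_severity = {
    "low":  [[100.0, 120.0, 126.0], [110.0, 132.0], [105.0]],
    "high": [[200.0, 400.0, 600.0], [220.0, 440.0], [300.0]],
}

segmented_total = sum(sum(chain_ladder_ultimates(tri))
                      for tri in triangles_by_severity.values())

# Baseline: add the segments cell by cell and run chain-ladder once.
aggregate = [[sum(vals) for vals in zip(*rows)]
             for rows in zip(*triangles_by_severity.values())]
baseline_total = sum(chain_ladder_ultimates(aggregate))

print("segmented ultimate:", round(segmented_total, 1))
print("aggregate ultimate:", round(baseline_total, 1))
```

In this toy example the two totals differ because the severity mix shifts in the latest origin year while the segments develop at different speeds; when the mix is stable across origin years, segmenting changes little.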

Implications for AI Economics

Improved reserving and capital accuracy:
- More granular, text‑driven segmentation can reduce reserve estimation error and therefore lower capital misallocation and pricing uncertainty.
- Example: a 2.5 percentage‑point reduction in ultimate estimate error suggests measurable balance‑sheet and solvency implications for insurers using LLM‑derived features.
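
The size of the improvement can be stated in two ways, and the difference matters when quoting it. A quick check of the arithmetic, using only the two error figures reported above:

```python
baseline_error = 6.5   # percent, baseline chain-ladder (as reported)
segmented_error = 4.0  # percent, severity-segmented chain-ladder (as reported)

absolute_reduction = baseline_error - segmented_error           # percentage points
relative_reduction = absolute_reduction / baseline_error * 100  # percent of baseline error

print(f"{absolute_reduction:.1f} percentage points")
print(f"{relative_reduction:.1f}% relative reduction")
```

That is a 2.5 percentage-point drop, or roughly a 38% relative reduction in estimation error.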
Better risk pricing and segmentation:
- Extracted indicators (medical complexity, litigation risk, causation type, provider quality, etc.) enable finer risk classification, potentially improving rate adequacy and experience rating.
Operational efficiency and scalability:
- Automation reduces reliance on manual reviewers, enabling broader coverage of documents, faster triage (identifying high‑complexity or litigation‑prone claims earlier), and potential cost savings in claims handling.
Product and portfolio management effects:
- Earlier detection of claim trajectories supports targeted interventions, settlement strategies, and reinsurance/retention decisions informed by richer text-derived signals.
Labor and task-shift considerations:
- Potential reduction in routine manual review jobs but increased demand for model validators, domain experts for auditing, and governance roles.
Model risk, governance, and regulatory constraints:
- Deployments must address LLM hallucination risk, explainability, audit trails, calibration, fairness, and compliance (e.g., NAIC guidance). Mandatory rationale fields and confidence scores help but are not sufficient — ongoing empirical validation and monitoring are required.
Research and market effects:
- Open synthetic datasets and reproducible code lower barriers to adoption and benchmarking, accelerating innovation but also raising competitive pressure to adopt text-enabled actuarial analytics.
Practical next steps before production adoption:
- Phased validation on real claims with human-in-the-loop auditing.
- Calibrate confidence scores against empirical error rates; implement anomaly detection and automated escalation.
- Integrate outputs into actuarial software and governance frameworks to meet regulatory and professional standards.
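
As one possible shape for the calibration step, the sketch below bins extractions by stated confidence and compares each bin's mean confidence with its observed accuracy, summarizing the gap as an expected calibration error (ECE). The bin count and the sample data are assumptions for illustration, not results from the study.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins; return (rows, ece)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("Need equal-length, non-empty inputs.")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the top bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    rows, ece, total = [], 0.0, len(confidences)
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        rows.append({"bin": idx, "n": len(members),
                     "mean_confidence": mean_conf, "accuracy": accuracy})
        # Weight each bin's gap by its share of all predictions.
        ece += len(members) / total * abs(mean_conf - accuracy)
    return rows, ece


# Made-up example: stated confidence and whether a reviewer judged the extraction correct.
stated = [0.95, 0.90, 0.85, 0.70, 0.65, 0.40, 0.35, 0.30]
judged_correct = [True, True, True, True, False, False, True, False]
rows, ece = calibration_table(stated, judged_correct)
for row in rows:
    print(row)
print("ECE:", round(ece, 3))
```

Bins where accuracy falls well below mean confidence are natural candidates for the automated escalation mentioned above.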

Summary takeaway: This study demonstrates a practical, reproducible approach showing that LLMs can reliably extract high‑value actuarial predictors from unstructured claims texts and deliver tangible improvements in reserving accuracy — but production adoption requires robust calibration, governance, and staged validation on real-world claims.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Provides empirical validation showing promising performance (mean expert scores >4.0, weighted kappa 0.53) and a practical downstream improvement in reserve estimates (6.5% → 4.0%); however evidence is based on a small validation sample (20 synthetic claims), limited details on real-world dataset scale, and no randomized or quasi-experimental comparison, so external validity and robustness are uncertain.
Methods Rigor: medium. The study uses a clear two-stage modular pipeline, independent expert reviewers, Likert scoring, inter-rater agreement metrics, audit trails, and a downstream actuarial application (chain ladder); but the validation sample is very small, largely synthetic, reviewer roles and potential biases are under-specified, statistical evaluation is limited, and details on model training/prompts, hyperparameters, and held-out test sets are sparse.
Sample: Pipeline applied to synthetic FHIR-based claims data and real claims documents; 36 actuarial variables targeted across reserving, ratemaking, and claims management; formal validation performed on 14 core variables using two independent clinical expert reviewers scoring 20 synthetic claims on a 5-point Likert rubric; reserve-impact demonstration (severity-segmented chain-ladder) reported but dataset size for that analysis is not fully specified.
Themes: productivity, human_ai_collab, adoption, org_design
Generalizability:
- Validation uses a small (n=20) synthetic-claims sample; may not reflect real-world claim heterogeneity
- Limited description of real claims data scale and provenance restricts transferability to other insurers or jurisdictions
- Performance may vary by language, document format, and claim complexity not covered in synthetic set
- Results depend on specific LLM, prompts, and pipeline choices that may change over time
- Clinical expert reviewers may not reflect actuarial reviewer population or operational conditions

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Actuaries rely primarily on structured numerical data for reserving and ratemaking, while valuable predictive information in unstructured text including medical records, adjuster notes, and call transcripts remains largely unused.	Adoption Rate	negative	high	use of unstructured text in actuarial processes	0.09
Manual processing of these documents is time-consuming, inconsistent across reviewers, and unscalable.	Organizational Efficiency	negative	high	effort, consistency, and scalability of manual document processing	0.09
We present a proof-of-concept framework using large language models (LLMs) to extract structured actuarial variables from unstructured claims data.	Organizational Efficiency	positive	high	ability to extract structured actuarial variables from unstructured text	0.18
We implement a two-stage processing architecture separating document-level extraction (Stage 1) from claim-level synthesis (Stage 2).	Other	null_result	high	system architecture (document-level vs claim-level processing)	0.3
A modular four-script Python pipeline processes synthetic FHIR-based claims data and real claims documents, extracting 36 actuarial variables across reserving, ratemaking, and claims management categories.	Other	positive	high	number of actuarial variables extractable by the pipeline	36 variables extracted 0.3
We validate 14 core variables using two independent clinical expert reviewers scoring 20 synthetic claims on a five-point Likert rubric, achieving mean scores above 4.0 and a weighted kappa of 0.53.	Output Quality	positive	high	quality/accuracy/agreement of extracted variables (Likert scores and inter-rater agreement)	n=20 mean scores above 4.0; weighted kappa = 0.53 0.18
Integration with chain ladder reserving demonstrates practical actuarial value: severity-segmented analysis reduced reserve estimation error from 6.5% to 4.0%.	Firm Productivity	positive	high	reserve estimation error	reduced from 6.5% to 4.0% 0.18
The open-source implementation includes audit trails and confidence scoring, providing a replicable foundation for LLM-based actuarial variable extraction in property-casualty insurance.	Adoption Rate	positive	high	availability of auditability and confidence scoring in the implementation	0.18