Adaptive, GPT-powered life-insurance questionnaires cut the number of questions and win user preference, but conventional forms still slightly outperform them on risk-assessment accuracy in two field experiments; further development could close the accuracy gap and streamline underwriting.
Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. Moreover, insurers must blindly trust users' responses, increasing the chances of fraud. The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. A life insurance system integrated into an industry partner mobile app was tested in two experiments. While traditional questionnaires yielded slightly higher accuracy in risk assessment, adaptive versions powered by GPT models required fewer questions and were preferred by users for their more fluid and engaging experience. ARQuest shows great potential to improve user satisfaction and streamline insurance processes. With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry.
Summary
Main Finding
ARQuest, an adaptive underwriting framework that combines LLMs (via RAG), alternative data (EHRs, social media images, geographic indicators), and dynamic questioning, can substantially reduce the number of questions applicants must answer and improve user experience. In the present implementation, however, it yields somewhat lower risk-score accuracy than a conventional fixed questionnaire. Users strongly prefer the adaptive flow, and GPT-4.1 outperforms GPT-3.5 in speed and predictive quality; the remaining gaps (notably family-history coverage) explain most of the accuracy shortfall.
Key Points
- ARQuest architecture:
  - Four modules: User profiling (ingest basic + external data), Response forecasting (LLM predicts answers and confidence), Dynamic questioning (iterative factor selection), Risk assessment (monotonic additive scoring and mismatch detection).
  - Uses Retrieval-Augmented Generation (RAG) to feed contextualized external insights to the LLM and limit hallucinations.
- Data sources and feature engineering:
  - Geographic health indicators (Atlas of Healthy Municipalities), labeled by k-means clusters (e.g., "very high").
  - Synthetic EHRs from Synthea and Instagram images captioned via BLIP to create semantically rich inputs.
  - Synthetic population of 85 users plus a small real-user pilot (n=10).
- Experimental comparison:
  - Three conditions: a traditional static questionnaire (30 questions across 3 domains) as the baseline, and dynamic ARQuest flows using GPT-3.5 Turbo and GPT-4.1.
  - Evaluation metrics: number of questions asked, task time, MAE and Pearson correlation against a synthetic “true” risk score, and user experience feedback.
- Results summary:
  - Dynamic flows required roughly half the number of questions (GPT-3.5 even fewer than GPT-4.1).
  - Traditional questionnaires had lower MAE and higher correlation with the true risk scores (traditional outperformed dynamic by ~10–30% on risk accuracy).
  - GPT-4.1 performed better than GPT-3.5: higher accuracy, faster prediction and factor selection.
  - User preferences strongly favored the dynamic flow (70% preferred it); participants found it more engaging and personalized.
- Limitations and risks identified:
  - Lack of family-history questions in dynamic flows accounted for a major portion of the accuracy loss.
  - Small real-world sample, reliance on synthetic users, privacy and compliance concerns (GDPR/AIA), potential bias in model-driven decisions, and LLM hallucinations remain important constraints.
  - Scoring used a deterministic monotonic additive model for interpretability (not a production probabilistic black box).
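
The response-forecasting and dynamic-questioning modules described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Forecast` record, the function names, and the 0.8 confidence threshold are all assumptions made for illustration. The idea is that factors the LLM predicts with high confidence are skipped, the rest are asked (least confident first), and a confident prediction that disagrees with the user's answer is flagged as a mismatch.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    """One LLM prediction for a questionnaire factor (hypothetical schema)."""
    factor: str        # e.g. "smoker"
    predicted: str     # answer predicted by the LLM
    confidence: float  # 0.0 to 1.0, as reported by the LLM


def select_questions(forecasts, threshold=0.8):
    """Return the factors that still need to be asked, least confident first.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    pending = [f for f in forecasts if f.confidence < threshold]
    return [f.factor for f in sorted(pending, key=lambda f: f.confidence)]


def find_mismatches(forecasts, answers, threshold=0.8):
    """Flag factors where a confident prediction disagrees with the user's answer."""
    return [
        f.factor
        for f in forecasts
        if f.factor in answers
        and f.confidence >= threshold
        and answers[f.factor] != f.predicted
    ]


# Illustrative data only; none of these values come from the study.
forecasts = [
    Forecast("smoker", "no", 0.95),
    Forecast("exercise", "weekly", 0.55),
    Forecast("alcohol", "none", 0.30),
]
to_ask = select_questions(forecasts)
flags = find_mismatches(forecasts, {"smoker": "yes", "alcohol": "none"})
print(to_ask)
print(flags)
```

In this toy run only the two low-confidence factors are asked, which is how an adaptive flow can shorten the questionnaire; the "smoker" answer is flagged because it contradicts a high-confidence prediction.
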
Data & Methods
- Implementation:
  - Mobile app integration with optional user sharing of EHR, fitness, and Instagram data; Azure-hosted GPT models for protected processing.
  - Pre-computed ground-truth answers for each synthetic user to simulate questionnaire filling and compute “true” risk.
- Data:
  - Synthetic users (n=85) constructed from Synthea EHR profiles, Portuguese municipal health indicators, occupation-based step estimates, and image captions sampled from a Kaggle Instagram dataset.
  - Real-user pilot (n=10) recruited to assess UX and perceptions.
- Modelling & pipelines:
  - BLIP used for image captioning (tuned hyperparameters).
  - K-means labeling for municipality indicator buckets.
  - RAG pipeline to retrieve and present external insights to LLM prompts (LLM asked to output JSON with predicted answers, confidences, and explanations).
  - Two LLMs tested: GPT-3.5 Turbo and GPT-4.1.
  - Risk scoring: monotonic additive model with extra penalties for risky combinations.
- Evaluation:
  - Metrics: number of questions asked, completion time, MAE and Pearson correlation vs. synthetic true risk, user feedback survey on clarity, engagement, and preference.
  - Comparative analysis across traditional vs. dynamic (GPT-3.5, GPT-4.1).
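
As a rough illustration of the scoring and evaluation steps listed above, the sketch below implements a monotonic additive risk score with extra penalties for risky combinations, plus the two accuracy metrics (MAE and Pearson correlation). The factor names, weights, and penalty values are hypothetical and chosen only for the example; the paper's actual scoring table is not reproduced here.

```python
import math

# Hypothetical non-negative weights: adding a risk factor can never lower the score,
# which is what makes the additive model monotonic.
WEIGHTS = {"smoker": 3.0, "sedentary": 1.5, "high_bmi": 2.0}

# Hypothetical extra penalties applied when all factors in a combination are present.
COMBO_PENALTIES = {frozenset({"smoker", "high_bmi"}): 1.0}


def risk_score(factors):
    """Score a dict of factor name -> bool with additive weights plus combo penalties."""
    active = {name for name, present in factors.items() if present}
    score = sum(WEIGHTS[name] for name in active if name in WEIGHTS)
    score += sum(p for combo, p in COMBO_PENALTIES.items() if combo <= active)
    return score


def mae(predicted, true):
    """Mean absolute error between predicted and reference risk scores."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Illustrative comparison of scores from a questionnaire flow against reference scores.
example = risk_score({"smoker": True, "high_bmi": True, "sedentary": False})
print(example)
print(mae([1.0, 2.0, 3.0], [1.0, 3.0, 5.0]))
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Keeping every weight and penalty non-negative is one simple way to guarantee monotonicity, and it keeps each contribution to the score directly inspectable, which matches the interpretability motivation stated in the summary.
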
Implications for AI Economics
- Productivity and cost structure:
  - Reduced question counts and faster flows can lower underwriting time and operational costs per application (labor and processing savings).
  - Fewer in-person or manual follow-ups could reduce acquisition and servicing costs, increasing insurer throughput.
- Conversion, demand, and consumer surplus:
  - Better UX and lower friction should raise conversion rates and consumer willingness to apply, potentially expanding the insured pool and consumer surplus.
  - However, willingness to share alternative data (EHR, social media) is heterogeneous; firms that can credibly protect privacy may capture a premium (first-mover advantage).
- Pricing, risk selection, and profitability:
  - More personalized data and adaptive questioning can enable finer-grained risk stratification (better price discrimination), boosting actuarial efficiency if accuracy improves.
  - Current accuracy shortfalls (vs. traditional forms) imply short-term pricing risk; missing features (e.g., family history) can cause under- or over-pricing.
  - If refined, ARQuest may reduce asymmetric information problems (adverse selection) and improve loss ratios; conversely, model errors could generate new selection distortions.
- Fraud and moral hazard:
  - Pre-filled predictions and mismatch detection provide a mechanism to flag potential misreporting, reducing fraud costs. But richer data may also create incentives to game the inputs or to withhold/share selectively.
- Market structure and competition:
  - Firms adopting superior adaptive-underwriting tech may gain competitive advantage (faster applications, lower cost-to-serve), pressuring incumbents to invest in AI capabilities or partner with data providers.
  - A new data ecosystem may emerge where insurers pay for curated alternative data feeds (geographic, wearables, social media analytics), shifting marginal costs.
- Regulatory and compliance costs:
  - Use of external behavioral and health data invites regulatory scrutiny (privacy, fairness, explainability). Compliance and auditability will impose additional costs and potentially limit permissible features, altering the economic return on adopting ARQuest-like systems.
  - Transparency requirements (explainability, contestability) may favor hybrid models with interpretable scoring components (as in this paper).
- Distributional and welfare considerations:
  - Potential efficiency gains could reduce premiums for well-profiled low-risk individuals, while imperfect models risk systematic bias against minorities and vulnerable groups, creating welfare trade-offs and possible public-policy intervention.
- Research and investment priorities:
  - Economically valuable next steps include: (a) improving family-history capture and model calibration, (b) large-scale field trials measuring conversion, lapse, claim frequency, and profitability impacts, (c) formal cost–benefit and privacy–value analyses, and (d) mechanisms to ensure fairness, auditability, and consumer consent.
- Reinsurance and capital:
  - More granular risk signals could change reinsurance pricing and capital allocation; reinsurers and regulators will need to understand model risk and systemic correlations introduced by shared data sources.
Suggested follow-ups for economists and insurers
- Run A/B field trials measuring conversion, claim rates, and loss ratios under ARQuest vs. traditional processes.
- Quantify willingness-to-share alternative data and price elasticity of demand for faster, personalized underwriting.
- Conduct distributional audits for bias and unintended exclusionary effects; estimate compliance cost implications.
- Model strategic behavior by applicants (selective disclosure, gaming) and design audit/penalty structures to mitigate moral hazard.
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. (Decision Quality) | negative | high | ability of standardized questionnaires to capture individual differences | 0.24 |
| Insurers must blindly trust users' responses, increasing the chances of fraud. (Organizational Efficiency) | negative | high | fraud risk from self-reported responses | 0.24 |
| The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. (Organizational Efficiency) | positive | high | personalization and adaptiveness of questionnaires | 0.08 |
| Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. (Organizational Efficiency) | positive | high | ability to extract user insights and guide follow-up questions | 0.8 |
| A life insurance system integrated into an industry partner mobile app was tested in two experiments. (Research Productivity) | neutral | high | experimental evaluation of system in partner app | 0.8 |
| Traditional questionnaires yielded slightly higher accuracy in risk assessment. (Decision Quality) | negative | high | risk assessment accuracy | 0.48 |
| Adaptive versions powered by GPT models required fewer questions. (Task Completion Time) | positive | high | number of questions required (survey length / task completion effort) | 0.48 |
| Adaptive versions were preferred by users for their more fluid and engaging experience. (Consumer Welfare) | positive | high | user preference / perceived fluidity and engagement | 0.48 |
| ARQuest shows great potential to improve user satisfaction and streamline insurance processes. (Consumer Welfare) | positive | high | user satisfaction and process streamlining | 0.08 |
| With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry. (Decision Quality) | positive | high | risk assessment accuracy and industry innovation | 0.08 |