The Commonplace

Adaptive, GPT-powered life-insurance questionnaires cut the number of questions and win user preference, but conventional forms still slightly outperform them on risk-assessment accuracy in two field experiments; further development could close the accuracy gap and streamline underwriting.

AI in Insurance: Adaptive Questionnaires for Improved Risk Profiling
Diogo Silva, João Teixeira, Bruno Lima · April 02, 2026
arXiv · quasi-experimental · medium evidence · relevance 7/10
In two in-app experiments, LLM-powered adaptive questionnaires required fewer questions and were preferred by users but were slightly less accurate at risk assessment than traditional standardized questionnaires.

Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. Moreover, insurers must blindly trust users' responses, increasing the chances of fraud. The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. A life insurance system integrated into an industry partner mobile app was tested in two experiments. While traditional questionnaires yielded slightly higher accuracy in risk assessment, adaptive versions powered by GPT models required fewer questions and were preferred by users for their more fluid and engaging experience. ARQuest shows great potential to improve user satisfaction and streamline insurance processes. With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry.

Summary

Main Finding

ARQuest, an adaptive underwriting framework that combines LLMs (via RAG), alternative data (EHRs, social media images, geographic indicators), and dynamic questioning, can substantially reduce the number of questions applicants must answer and improve user experience. In its present implementation, however, it yields somewhat lower risk-score accuracy than a conventional fixed questionnaire. Users strongly prefer the adaptive flow, and GPT-4.1 outperforms GPT-3.5 in speed and predictive quality; the remaining gaps (notably family-history coverage) explain most of the accuracy shortfall.

Key Points

  • ARQuest architecture:
    • Four modules: User profiling (ingest basic + external data), Response forecasting (LLM predicts answers and confidence), Dynamic questioning (iterative factor selection), Risk assessment (monotonic additive scoring and mismatch detection).
    • Uses Retrieval-Augmented Generation (RAG) to feed contextualized external insights to the LLM and limit hallucinations.
  • Data sources and feature engineering:
    • Geographic health indicators (Atlas of Healthy Municipalities), labeled by k-means clusters (e.g., "very high").
    • Synthetic EHRs from Synthea and Instagram images captioned via BLIP to create semantically rich inputs.
    • Synthetic population of 85 users plus a small real-user pilot (n=10).
  • Experimental comparison:
    • Three conditions: a traditional static questionnaire (30 questions across 3 domains) as the baseline vs. dynamic ARQuest flows using GPT-3.5 Turbo and GPT-4.1.
    • Evaluation metrics: number of questions asked, task time, MAE and Pearson correlation against a synthetic “true” risk score, and user experience feedback.
  • Results summary:
    • Dynamic flows required roughly half the number of questions (GPT-3.5 even fewer than GPT-4.1).
    • Traditional questionnaires had lower MAE and higher correlation with the true risk scores (traditional outperformed dynamic by ~10–30% on risk accuracy).
    • GPT-4.1 performed better than GPT-3.5: higher accuracy, faster prediction and factor selection.
    • User preferences strongly favored dynamic flow (70% preferred it); participants found it more engaging and personalized.
  • Limitations and risks identified:
    • Lack of family-history questions in dynamic flows accounted for a major portion of accuracy loss.
    • Small real-world sample, reliance on synthetic users, privacy and compliance concerns (GDPR/AIA), potential bias in model-driven decisions, and LLM hallucinations remain important constraints.
    • Scoring used a deterministic monotonic additive model for interpretability (not a production probabilistic black box).
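
The way the response-forecasting, dynamic-questioning, and risk-assessment modules fit together can be illustrated with a minimal Python sketch. This is not the paper's implementation. The factor names, weights, penalty values, the 0.8 confidence threshold, and all function names are hypothetical; the summary states only that the LLM predicts answers with confidences, that questions are selected dynamically, and that scoring is monotonic and additive with mismatch detection.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    """Hypothetical LLM forecast for one questionnaire factor."""
    answer: int        # predicted risk level for the factor (0 = lowest)
    confidence: float  # model's self-reported confidence in [0, 1]


# Hypothetical per-factor weights; non-negative so the score is monotonic.
WEIGHTS = {"smoking": 3.0, "bmi": 2.0, "exercise": 1.0}
# Hypothetical extra penalty when two risky factors co-occur.
COMBO_PENALTIES = {("smoking", "bmi"): 1.5}


def select_questions(forecasts, threshold=0.8):
    """Return the factors whose forecast is too uncertain to pre-fill."""
    return [f for f, fc in forecasts.items() if fc.confidence < threshold]


def risk_score(answers):
    """Monotonic additive score: weighted sum plus combination penalties."""
    score = sum(WEIGHTS[f] * level for f, level in answers.items())
    for (a, b), penalty in COMBO_PENALTIES.items():
        if answers.get(a, 0) > 0 and answers.get(b, 0) > 0:
            score += penalty
    return score


def mismatches(forecasts, answers, threshold=0.8):
    """Flag factors where a confident forecast disagrees with the user's answer."""
    return [
        f for f, level in answers.items()
        if f in forecasts
        and forecasts[f].confidence >= threshold
        and forecasts[f].answer != level
    ]
```

Because every weight and penalty is non-negative, raising any answer's risk level can never lower the score, which is the property that keeps an additive model of this kind easy to audit. A flagged mismatch would be a prompt for follow-up rather than proof of misreporting, since the forecast itself may be wrong.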

Data & Methods

  • Implementation:
    • Mobile app integration with optional user sharing of EHR, fitness, and Instagram data; Azure-hosted GPT models for protected processing.
    • Pre-computed ground-truth answers for each synthetic user to simulate questionnaire filling and compute “true” risk.
  • Data:
    • Synthetic users (n=85) constructed from Synthea EHR profiles, Portuguese municipal health indicators, occupation-based step estimates, and image captions sampled from a Kaggle Instagram dataset.
    • Real-user pilot (n=10) recruited to assess UX and perceptions.
  • Modelling & pipelines:
    • BLIP used for image captioning (tuned hyperparameters).
    • K-means labeling for municipality indicator buckets.
    • RAG pipeline to retrieve and present external insights to LLM prompts (LLM asked to output JSON with predicted answers, confidences, and explanations).
    • Two LLMs tested: GPT-3.5 Turbo and GPT-4.1.
    • Risk scoring: monotonic additive model with extra penalties for risky combinations.
  • Evaluation:
    • Metrics: number of questions asked, completion time, MAE and Pearson correlation vs. synthetic true risk, user feedback survey on clarity, engagement, and preference.
    • Comparative analysis across traditional vs. dynamic (GPT-3.5, GPT-4.1).
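
For reference, the two accuracy metrics named above can be computed as in the following sketch. It uses the standard definitions of mean absolute error and the Pearson correlation coefficient; the function names are illustrative, and nothing here reproduces the paper's data.

```python
import math


def mean_absolute_error(true_scores, predicted_scores):
    """Average absolute gap between true and predicted risk scores."""
    if len(true_scores) != len(predicted_scores) or not true_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_scores, predicted_scores))
    return total / len(true_scores)


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

The two metrics answer different questions: MAE measures how far the predicted scores sit from the synthetic "true" scores in absolute terms, while Pearson correlation measures whether applicants are ranked and spaced consistently, so a flow can do well on one and poorly on the other.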

Implications for AI Economics

  • Productivity and cost structure:
    • Reduced question counts and faster flows can lower underwriting time and operational costs per application (labor and processing savings).
    • Fewer in-person or manual follow-ups could reduce acquisition and servicing costs, increasing insurer throughput.
  • Conversion, demand, and consumer surplus:
    • Better UX and lower friction should raise conversion rates and consumer willingness to apply—potentially expanding the insured pool and consumer surplus.
    • However, willingness to share alternative data (EHR, social media) is heterogeneous; firms that can credibly protect privacy may capture a premium (first-mover advantage).
  • Pricing, risk selection, and profitability:
    • More personalized data and adaptive questioning can enable finer-grained risk stratification (better price discrimination), boosting actuarial efficiency if accuracy improves.
    • Current accuracy shortfalls (vs. traditional forms) imply short-term pricing risk; missing features (e.g., family history) can cause under- or over-pricing.
    • If refined, ARQuest may reduce asymmetric information problems (adverse selection) and improve loss ratios; conversely, model errors could generate new selection distortions.
  • Fraud and moral hazard:
    • Pre-filled predictions and mismatch detection provide a mechanism to flag potential misreporting, reducing fraud costs. But richer data may also create incentives to game the inputs or to withhold/share selectively.
  • Market structure and competition:
    • Firms adopting superior adaptive-underwriting tech may gain competitive advantage (faster applications, lower cost-to-serve), pressuring incumbents to invest in AI capabilities or partner with data providers.
    • A new data ecosystem may emerge where insurers pay for curated alternative data feeds (geographic, wearables, social media analytics), shifting marginal costs.
  • Regulatory and compliance costs:
    • Use of external behavioral and health data invites regulatory scrutiny (privacy, fairness, explainability). Compliance and auditability will impose additional costs and potentially limit permissible features, altering the economic return on adopting ARQuest-like systems.
    • Transparency requirements (explainability, contestability) may favor hybrid models with interpretable scoring components (as in this paper).
  • Distributional and welfare considerations:
    • Potential efficiency gains could reduce premiums for well-profiled low-risk individuals, while imperfect models risk systematic bias against minorities and vulnerable groups, creating welfare trade-offs and possible public-policy intervention.
  • Research and investment priorities:
    • Economically valuable next steps include: (a) improving family-history capture and model calibration, (b) large-scale field trials measuring conversion, lapse, claim frequency, and profitability impacts, (c) formal cost–benefit and privacy–value analyses, and (d) mechanisms to ensure fairness, auditability, and consumer consent.
  • Reinsurance and capital:
    • More granular risk signals could change reinsurance pricing and capital allocation; reinsurers and regulators will need to understand model risk and systemic correlations introduced by shared data sources.

Suggested follow-ups for economists and insurers

  • Run A/B field trials measuring conversion, claim rates, and loss ratios under ARQuest vs. traditional processes.
  • Quantify willingness-to-share alternative data and price elasticity of demand for faster, personalized underwriting.
  • Conduct distributional audits for bias and unintended exclusionary effects; estimate compliance cost implications.
  • Model strategic behavior by applicants (selective disclosure, gaming) and design audit/penalty structures to mitigate moral hazard.


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The work reports real-world experiments inside a commercial mobile app, which provides stronger external validity than lab studies; however, key details are missing or unclear (sample size, randomization or balance checks, how ground-truth risk was measured, and robustness checks), and the scope is limited to a single partner and product (life insurance), reducing confidence in broad causal generalization.

Methods Rigor: medium. The study uses a practical deployment and compares arms empirically, and it leverages multiple data sources (social-media images, geodata, RAG). But the description lacks important methodological details: whether assignment was randomized and concealed, sample sizes and statistical power, how accuracy of risk assessment was validated against objective outcomes, pre-registration or corrections for multiple testing, model training/overfitting controls, and privacy/ethics safeguards. Methodological transparency and robustness testing therefore appear limited.

Sample: Users of an industry partner's mobile app applying for life insurance; two in-app experiments compared traditional standardized questionnaires to adaptive LLM-driven questionnaires that incorporate alternative data (social-media image analysis, geographic data categorization) and Retrieval-Augmented Generation; exact sample sizes, demographic breakdown, geographic coverage and experiment dates are not reported.

Themes: org_design, human_ai_collab, adoption, productivity

Identification: Between-subject comparison (A/B-style experiments) within an industry partner's mobile app comparing traditional standardized questionnaires to adaptive LLM-powered questionnaires; causal claims rest on experimental assignment to each questionnaire arm in two in-app experiments.

Generalizability:

  • Single-industry-partner sample: may not generalize to other insurers, markets, or countries.
  • Life insurance only: findings may not transfer to other insurance lines or non-insurance products.
  • Mobile app users: likely skews toward smartphone-savvy and self-selecting applicants.
  • Unclear demographic/market coverage: regulatory and privacy norms differ across jurisdictions.
  • Results depend on the specific LLM (GPT) and data sources used; model upgrades or different data could change outcomes.
  • Risk-assessment ground truth unclear: accuracy comparisons may not reflect long-term claim outcomes.

Claims (10)

| Claim | Direction | Confidence | Outcome | Details |
| --- | --- | --- | --- | --- |
| Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. [Decision Quality] | negative | high | ability of standardized questionnaires to capture individual differences | 0.24 |
| Insurers must blindly trust users' responses, increasing the chances of fraud. [Organizational Efficiency] | negative | high | fraud risk from self-reported responses | 0.24 |
| The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. [Organizational Efficiency] | positive | high | personalization and adaptiveness of questionnaires | 0.08 |
| Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. [Organizational Efficiency] | positive | high | ability to extract user insights and guide follow-up questions | 0.8 |
| A life insurance system integrated into an industry partner mobile app was tested in two experiments. [Research Productivity] | neutral | high | experimental evaluation of system in partner app | 0.8 |
| Traditional questionnaires yielded slightly higher accuracy in risk assessment. [Decision Quality] | negative | high | risk assessment accuracy | 0.48 |
| Adaptive versions powered by GPT models required fewer questions. [Task Completion Time] | positive | high | number of questions required (survey length / task completion effort) | 0.48 |
| Adaptive versions were preferred by users for their more fluid and engaging experience. [Consumer Welfare] | positive | high | user preference / perceived fluidity and engagement | 0.48 |
| ARQuest shows great potential to improve user satisfaction and streamline insurance processes. [Consumer Welfare] | positive | high | user satisfaction and process streamlining | 0.08 |
| With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry. [Decision Quality] | positive | high | risk assessment accuracy and industry innovation | 0.08 |
