The Commonplace

A Human–AI team can deliver near-expert online diagnoses while cutting required doctor involvement to roughly one-tenth; hierarchical reinforcement learning allocates human attention sparingly to preserve accuracy and lower labor per consultation.

Hierarchical Reinforcement Learning Based Human-AI Online Diagnosis
Jiaqi Liu, Xuehan Zhao, Xin Zhang, Zhiwen Yu, Bin Guo · Fetched March 18, 2026 · IEEE Transactions on Mobile Computing
Source: Semantic Scholar · Paper type: descriptive · Evidence: medium · Relevance: 7/10 · DOI · Source
A hierarchical Human–AI Diagnostic Team (HADT) using turn-level assignment and masked hierarchical RL attains near-expert diagnostic accuracy (up to 89.4%) while reducing required doctor involvement to about 10.9% of interactions on benchmark datasets and a clinical-interface validation.

Online medical consultation is an important mobile service worldwide: patients can consult doctors conveniently by phone, anytime and anywhere. However, expert-level online consultations are expensive due to the shortage of medical professionals, while AI models are unreliable because they carry unpredictable risks. We therefore introduce human-machine collaboration into online medical consultation, focusing on symptom inquiry as the basis for disease diagnosis. Two key issues arise: 1) how to design an intelligent assignment strategy that determines which doctors or models participate in each turn, and 2) how to design an effective execution strategy that improves the machine's inquiry ability over a large space of candidate symptoms. To address these issues, we propose the Human-AI Diagnostic Team (HADT) framework based on hierarchical reinforcement learning, which aims to achieve high accuracy at a low labor cost. HADT has two layers. The upper layer is responsible for assignment: a module called the master makes intelligent human-machine assignments through masked reinforcement learning with reward shaping. The lower layer is responsible for execution and consists of a doctor and a proposed module called the machine, which asks about symptoms effectively through masked hierarchical reinforcement learning with bottom-up training.
Experiments on public datasets show that HADT can achieve up to 89.4% accuracy with only 10.9% human effort, as confirmed by real clinical doctors using the designed interface on mobile devices.

Summary

Main Finding

The Human-AI Diagnostic Team (HADT) framework — a two-layer hierarchical reinforcement learning system combining an assignment “master” and an execution “machine” plus human doctors — can deliver near-expert-level online symptom inquiry and diagnosis while using very little human labor. On public medical-consultation datasets and in clinical-interface validation with real doctors, HADT reached up to 89.4% diagnostic accuracy while requiring only 10.9% human effort.
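To make the two headline metrics concrete, here is a minimal sketch of how diagnostic accuracy and the human-effort fraction could be computed from per-consultation logs. The record format is a hypothetical illustration, not the paper's actual data schema.

```python
# Illustrative only: computes diagnostic accuracy and the fraction of
# turns handled by a human doctor. The dict layout is an assumption.

def evaluate(consultations):
    """Each consultation: {'correct': bool, 'turns': ['human' | 'machine', ...]}."""
    total = len(consultations)
    accuracy = sum(c['correct'] for c in consultations) / total
    all_turns = [t for c in consultations for t in c['turns']]
    human_effort = all_turns.count('human') / len(all_turns)
    return accuracy, human_effort

logs = [
    {'correct': True,  'turns': ['machine', 'machine', 'human', 'machine']},
    {'correct': True,  'turns': ['machine', 'machine', 'machine', 'machine']},
    {'correct': False, 'turns': ['machine', 'human', 'machine', 'machine']},
]
acc, effort = evaluate(logs)
# acc = 2/3 (2 of 3 diagnoses correct), effort = 2/12 (2 of 12 turns human)
```

Under this reading, the paper's 10.9% figure corresponds to `effort`: the share of interaction turns that required a human doctor.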

Key Points

  • Problem: expert online medical consultation is costly due to scarce medical professionals; pure AI is unreliable. Symptom inquiry is the critical front-line task for accurate diagnosis.
  • Two central design questions addressed:
    • Assignment: which actor (human doctor or AI module) should act at each turn?
    • Execution: how should the machine conduct symptom inquiry effectively from a large symptom space?
  • Architecture:
    • Upper layer (“master”): learns turn-by-turn human–machine assignment using masked reinforcement learning with reward shaping to balance accuracy and human cost.
    • Lower layer: execution team composed of a doctor and a “machine” module. The machine uses masked hierarchical reinforcement learning with bottom-up training to ask informative symptom questions.
  • Training innovations: masked RL to constrain/guide action spaces and reward shaping to trade off diagnostic accuracy vs human labor; bottom-up training for the machine execution module to improve question selection over many possible symptoms.
  • Empirical validation: experiments on public datasets show strong accuracy/human-effort tradeoffs (89.4% accuracy at 10.9% human effort). The system and interface were also tested with real clinical doctors on mobile devices to confirm practical viability.
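The action-masking idea in the points above can be sketched as follows. This is a generic illustration of masked policy sampling, not the paper's implementation; the symptom setup and the "mask out already-asked symptoms" rule are invented for the example.

```python
import math
import random

def masked_softmax_sample(logits, mask, rng=random):
    """Sample an action index, restricted to entries where mask[i] is True.

    Masked-out actions receive probability zero, shrinking the effective
    exploration space -- the core idea behind masked RL over a large
    symptom/action space.
    """
    exps = [math.exp(l) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    # Numerical fallback: return the last valid (unmasked) action.
    return max(i for i, m in enumerate(mask) if m)

# Hypothetical example: 5 candidate symptom questions, 2 already asked.
logits = [0.2, 1.5, -0.3, 0.7, 0.1]
asked = {0, 3}                                    # symptoms already inquired
mask = [i not in asked for i in range(len(logits))]
action = masked_softmax_sample(logits, mask)
assert action not in asked                        # masked actions never chosen
```

The same mechanism can serve the upper layer (masking infeasible human/machine assignments) and the lower layer (masking symptoms already asked or irrelevant to the current context).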

Data & Methods

  • Data: experiments use publicly available online medical-consultation datasets for symptom-inquiry/diagnosis tasks (dataset names and sizes are not given in this summary).
  • Methodology:
    • Hierarchical reinforcement learning with two layers: master (assignment) and worker/execution (doctor + machine).
    • Masked RL techniques restrict or mask actions to relevant subsets (reducing exploration over huge symptom/action spaces).
    • Reward shaping applied at the assignment layer to incorporate penalties for human involvement and incentives for diagnostic accuracy.
    • Bottom-up training in the execution module to progressively learn effective symptom-question policies.
  • Evaluation:
    • Primary metrics: diagnostic accuracy and human effort (proportion of turns or time requiring human doctors).
    • Comparative evaluation against (implied) baselines of fully human, fully automated, and simpler assignment strategies (paper reports superior trade-offs).
    • Clinical validation: physicians used the designed mobile interface to confirm usability and real-world performance.
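The reward-shaping trade-off described above might look like the following minimal sketch. The function shape, coefficient names, and values are assumptions for illustration, not taken from the paper.

```python
def shaped_reward(diagnosis_correct, human_turns, *,
                  accuracy_bonus=1.0, human_penalty=0.25):
    """Per-episode reward for the assignment layer (illustrative).

    Rewards a correct diagnosis and penalises each turn that required a
    human doctor; a larger human_penalty pushes the assignment policy
    toward machine-handled turns.
    """
    r = accuracy_bonus if diagnosis_correct else -accuracy_bonus
    return r - human_penalty * human_turns

# A correct diagnosis that needed 2 human turns vs one that needed none:
# shaped_reward(True, 2)  -> 0.5
# shaped_reward(True, 0)  -> 1.0
# shaped_reward(False, 0) -> -1.0
```

Tuning `human_penalty` traces out the accuracy/human-effort frontier that the experiments report (e.g. 89.4% accuracy at 10.9% human effort).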

Implications for AI Economics

  • Labor-cost reduction: HADT demonstrates a concrete way to substitute expensive human diagnostic labor with AI assistance while preserving high accuracy — lowering marginal cost per consultation.
  • Allocation efficiency: intelligent turn-level assignment can reduce costly human attention to only the high-value moments, improving overall system productivity.
  • Pricing and market expansion: lower per-consultation costs could enable broader access to medical advice, new pricing models (tiered or dynamic pricing based on human involvement), and expanded demand in underserved regions.
  • Incentive/design considerations: reward-shaping and assignment rules embody implicit incentive structures; similar mechanisms could be used to design payment and staffing contracts that balance quality and cost.
  • Workforce effects: partial substitution may reduce routine diagnostic workload, shifting clinicians toward oversight, complex cases, and supervision — raising questions about retraining, job design, and labor market transitions.
  • Reliability and regulation: economic gains depend on robustness and trust. Regulators and payers will require clinical validation, safety guarantees, and clear liability frameworks for human-AI shared decision-making.
  • Research opportunities: formal cost–benefit analyses, mechanism design for optimal assignment pricing, heterogeneity across specialties, generalizability to other high-skill services, and long-run effects on supply of medical professionals.


Assessment

Paper Type: descriptive
Evidence Strength: medium — The paper provides strong system-level evidence that a hierarchical Human–AI team can achieve high diagnostic accuracy with low human involvement on public consultation datasets and in an interface test with real doctors; however, it does not present randomized or field-level causal evidence about economic outcomes (e.g., realized cost savings, adoption, or labor-market impacts), clinical trial–scale validation, or long-run deployment results, and dataset/validation sample details are limited.
Methods Rigor: medium — The technical approach is sophisticated (two-layer hierarchical RL, masked RL, reward shaping, bottom-up training) with comparative baselines and a clinical-interface validation, but the summary lacks detail on dataset composition and size, baseline implementations, statistical significance or robustness checks, and the clinical validation appears limited in scope (no randomized controlled or large-scale field trial reported).
Sample: Training and evaluation on publicly available online medical-consultation datasets for symptom-inquiry/diagnosis dialogues (dataset names and sample sizes not specified in the summary); additional usability/validation tests with practicing physicians using the mobile clinical interface (details on number of doctors, case mix, and sampling not provided).
Themes: productivity, human_ai_collab, labor_markets, adoption
Generalizability:
  • Public consultation datasets may not represent real-world patient heterogeneity (demographics, comorbidities, language, severity).
  • Diagnostic domains and specialties tested may be narrow; results may not generalize across all medical fields or complex/rare conditions.
  • Clinical validation sample size and selection bias (convenience sample of doctors) may limit external validity.
  • Interaction modality (mobile app) may differ from telephone or in-person workflows, affecting performance and human effort.
  • Ground-truth labels in datasets may be noisy or simulated, affecting measured accuracy.
  • Regulatory, payment, and liability environments differ across jurisdictions, affecting deployability and economic impact.
  • Long-run behavioral changes (clinician supervision, gaming, trust dynamics) and system adaptation are not evaluated.
  • Computational/resource requirements and integration costs may limit deployment in low-resource settings.

Claims (13)

Each claim lists its outcome category, direction, confidence (with weight), the measured outcome, and a one-line detail.

  • HADT reached up to 89.4% diagnostic accuracy while requiring only 10.9% human effort.
    Outcome: Output Quality · Direction: positive · Confidence: medium (weight 0.11)
    Measures: diagnostic accuracy; human effort (proportion of turns/time requiring human doctors)
    Details: 89.4% diagnostic accuracy; 10.9% human effort required
  • The Human-AI Diagnostic Team (HADT) framework can deliver near-expert-level online symptom inquiry and diagnosis while using very little human labor.
    Outcome: Output Quality · Direction: positive · Confidence: medium (weight 0.11)
    Measures: quality of symptom inquiry / diagnostic performance (compared to expert-level)
    Details: near-expert-level performance with low human labor
  • A two-layer hierarchical reinforcement learning system—an assignment 'master' and an execution 'machine' (plus human doctors)—effectively balances accuracy and human cost.
    Outcome: Task Allocation · Direction: positive · Confidence: medium (weight 0.11)
    Measures: trade-off between diagnostic accuracy and human effort
    Details: hierarchical RL balances accuracy and human cost
  • The upper layer ('master') learns turn-by-turn human–machine assignment using masked reinforcement learning with reward shaping to balance accuracy and human cost.
    Outcome: Task Allocation · Direction: positive · Confidence: high (weight 0.18)
    Measures: assignment policy performance; human effort allocation; diagnostic accuracy under assignment policy
    Details: assignment-layer policy learned to trade off accuracy and human cost
  • The execution machine uses masked hierarchical reinforcement learning with bottom-up training to ask informative symptom questions from a large symptom space.
    Outcome: Output Quality · Direction: positive · Confidence: medium (weight 0.11)
    Measures: quality/informativeness of symptom questions; downstream diagnostic accuracy
    Details: execution module asks informative symptom questions, improving downstream accuracy
  • Masked reinforcement learning techniques constrain or mask action spaces, reducing exploration over huge symptom/action spaces.
    Outcome: Other · Direction: positive · Confidence: high (weight 0.18)
    Measures: action-space reduction / sample efficiency / learning stability (as applied to the symptom-action space)
    Details: masked RL constrains the action space, improving sample efficiency/stability
  • Reward shaping at the assignment layer enables an explicit trade-off between diagnostic accuracy and human labor by incorporating penalties for human involvement.
    Outcome: Task Allocation · Direction: positive · Confidence: high (weight 0.18)
    Measures: diagnostic accuracy vs human effort (as controlled by reward shaping)
    Details: reward shaping encodes penalties for human involvement to trade off accuracy/human effort
  • On public datasets HADT achieves superior accuracy/human-effort trade-offs compared to baselines (fully human, fully automated, and simpler assignment strategies).
    Outcome: Output Quality · Direction: positive · Confidence: medium (weight 0.11)
    Measures: diagnostic accuracy and human effort relative to baseline methods
    Details: superior accuracy/human-effort trade-offs vs baselines on public datasets
  • Clinical-interface validation with real physicians on mobile devices confirmed the practical viability and usability of the HADT system and interface.
    Outcome: Adoption Rate · Direction: positive · Confidence: medium (weight 0.11)
    Measures: practical viability / usability in clinical-interface testing (physician interaction)
    Details: clinical-interface validation confirmed practical viability/usability (physician tests)
  • HADT demonstrates a concrete way to substitute expensive human diagnostic labor with AI assistance while preserving high accuracy, implying reductions in marginal cost per consultation.
    Outcome: Firm Productivity · Direction: positive · Confidence: low (weight 0.05)
    Measures: implied marginal cost per consultation (not directly measured)
    Details: implied reductions in marginal cost per consultation (partial substitution of human labor)
  • Intelligent turn-level assignment can reduce costly human attention to only high-value moments, improving overall system productivity.
    Outcome: Organizational Efficiency · Direction: positive · Confidence: low (weight 0.05)
    Measures: distribution of human attention / system productivity (conceptual, not directly measured)
    Details: intelligent turn-level assignment reserves human attention for high-value moments, improving productivity (conceptual)
  • Partial substitution of routine diagnostic work by HADT may shift clinicians toward oversight, complex cases, and supervision, raising workforce and retraining considerations.
    Outcome: Skill Acquisition · Direction: mixed · Confidence: speculative (weight 0.02)
    Measures: clinician workload composition / need for retraining (speculative)
    Details: partial substitution shifts clinicians toward oversight/complex cases, implying retraining needs
  • Regulators and payers will require clinical validation, safety guarantees, and clear liability frameworks for human–AI shared decision-making before widescale deployment.
    Outcome: Regulatory Compliance · Direction: null_result · Confidence: speculative (weight 0.02)
    Measures: regulatory requirements / safety validation (anticipated, not measured)
    Details: anticipation that regulators/payers will require validation, safety guarantees, and liability frameworks

Notes