A Human–AI team can deliver near-expert online diagnoses while cutting required doctor involvement to roughly one-tenth; hierarchical reinforcement learning allocates human attention sparingly to preserve accuracy and lower labor per consultation.
Online medical consultation is one of the most important mobile services worldwide, allowing patients to consult doctors conveniently from their phones anytime and anywhere. However, expert-level online consultations are expensive due to the shortage of medical professionals, while purely AI-driven models are unreliable because they carry unpredictable risks. We therefore introduce human-machine collaboration into online medical consultation and focus on symptom inquiry, the basis of disease diagnosis. Two key issues arise: 1) how to design an intelligent assignment strategy that determines which doctors or models participate in each turn, and 2) how to design an effective execution strategy that improves the machine's ability to inquire over a large space of symptoms. To address these issues, we propose the Human-AI Diagnostic Team (HADT) framework based on hierarchical reinforcement learning, which aims to achieve high accuracy at low labor cost. Specifically, HADT has two layers. The upper layer is responsible for assignment: a module called the master makes intelligent human-machine assignments through masked reinforcement learning with reward shaping. The lower layer is responsible for execution and consists of a doctor and a module called the machine, which can effectively ask about symptoms through masked hierarchical reinforcement learning with bottom-up training.
Experiments on public datasets show that HADT can achieve up to 89.4% accuracy with only 10.9% human effort, as confirmed by real clinical doctors using the designed mobile-device interface.
Summary
Main Finding
The Human-AI Diagnostic Team (HADT) framework — a two-layer hierarchical reinforcement learning system combining an assignment “master” and an execution “machine” plus human doctors — can deliver near-expert-level online symptom inquiry and diagnosis while using very little human labor. On public medical-consultation datasets and in clinical-interface validation with real doctors, HADT reached up to 89.4% diagnostic accuracy while requiring only 10.9% human effort.
Key Points
- Problem: expert online medical consultation is costly due to scarce medical professionals; pure AI is unreliable. Symptom inquiry is the critical front-line task for accurate diagnosis.
- Two central design questions addressed:
  - Which actor (human doctor or AI module) should act at each turn? (assignment)
  - How should the machine execute symptom inquiry effectively from a large symptom space? (execution)
- Architecture:
  - Upper layer (“master”): learns turn-by-turn human–machine assignment using masked reinforcement learning with reward shaping to balance accuracy and human cost.
  - Lower layer: execution team composed of a doctor and a “machine” module. The machine uses masked hierarchical reinforcement learning with bottom-up training to ask informative symptom questions.
- Training innovations: masked RL to constrain and guide action spaces, reward shaping to trade off diagnostic accuracy against human labor, and bottom-up training for the machine execution module to improve question selection over many possible symptoms.
- Empirical validation: experiments on public datasets show strong accuracy/human-effort trade-offs (89.4% accuracy at 10.9% human effort). The system and interface were also tested with real clinical doctors on mobile devices to confirm practical viability.
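The turn-level collaboration described above can be sketched as a simple episode loop: an upper-layer master picks the actor for each turn, and the chosen doctor or machine asks the next question. This is a minimal illustration under our own assumptions — `run_consultation`, the toy policies, and the effort ratio are illustrative names, not the paper's implementation:

```python
def run_consultation(master_policy, doctor, machine, max_turns=10):
    """One consultation episode: the upper-layer master picks an actor each
    turn; the chosen doctor or machine asks the next symptom question."""
    state, human_turns = [], 0
    for _ in range(max_turns):
        actor = master_policy(state)        # "doctor" or "machine"
        if actor == "doctor":
            question = doctor(state)
            human_turns += 1
        else:
            question = machine(state)
        state.append(question)              # record the inquiry so far
    return state, human_turns / max_turns   # dialogue + human-effort ratio

# Toy policies: the master delegates to the machine except on the first turn,
# so only 1 of 10 turns requires a doctor (10% human effort).
master = lambda s: "doctor" if not s else "machine"
_, effort = run_consultation(master, lambda s: "q_doc", lambda s: "q_ai")
print(effort)   # -> 0.1
```

The design point is that human effort is an explicit, measurable output of the episode, which is what lets the assignment layer be trained against it.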
Data & Methods
- Data: experiments were run on publicly available online medical-consultation datasets for symptom-inquiry and diagnosis tasks.
- Methodology:
  - Hierarchical reinforcement learning with two layers: a master (assignment) and an execution team (doctor + machine).
  - Masked RL techniques restrict actions to relevant subsets, reducing exploration over the huge symptom/action space.
  - Reward shaping at the assignment layer incorporates penalties for human involvement and incentives for diagnostic accuracy.
  - Bottom-up training in the execution module progressively learns effective symptom-question policies.
- Evaluation:
  - Primary metrics: diagnostic accuracy and human effort (proportion of turns or time requiring human doctors).
  - Comparative evaluation against (implied) baselines: fully human, fully automated, and simpler assignment strategies; the paper reports superior trade-offs.
  - Clinical validation: physicians used the designed mobile interface to confirm usability and real-world performance.
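The masking and reward-shaping ideas above can be sketched concretely. This is a hedged illustration, not the paper's implementation: `masked_action`, `shaped_reward`, and the trade-off weight `lam` are our own illustrative names and choices.

```python
import numpy as np

def masked_action(q_values, asked):
    """Greedy symptom selection; already-asked symptoms are masked out
    so exploration never wastes turns on invalid actions."""
    q = q_values.astype(float)
    q[asked] = -np.inf               # masked actions can never be chosen
    return int(np.argmax(q))

def shaped_reward(correct, human_turns, total_turns, lam=0.5):
    """Terminal reward: diagnosis bonus minus a penalty proportional to the
    fraction of turns that required a human doctor (lam sets the trade-off)."""
    return (1.0 if correct else -1.0) - lam * human_turns / total_turns

# Symptom 1 was already asked, so the next-best symptom (index 3) is chosen;
# a correct diagnosis with 2 of 10 human turns earns 1.0 - 0.5 * 0.2 = 0.9.
print(masked_action(np.array([0.2, 0.9, 0.5, 0.7]), [1]))  # -> 3
print(shaped_reward(True, 2, 10))                          # -> 0.9
```

Raising `lam` would push the learned assignment policy toward fewer human turns at some cost in accuracy, which is exactly the trade-off the evaluation metrics measure.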
Implications for AI Economics
- Labor-cost reduction: HADT demonstrates a concrete way to substitute expensive human diagnostic labor with AI assistance while preserving high accuracy — lowering marginal cost per consultation.
- Allocation efficiency: intelligent turn-level assignment can reduce costly human attention to only the high-value moments, improving overall system productivity.
- Pricing and market expansion: lower per-consultation costs could enable broader access to medical advice, new pricing models (tiered or dynamic pricing based on human involvement), and expanded demand in underserved regions.
- Incentive/design considerations: reward-shaping and assignment rules embody implicit incentive structures; similar mechanisms could be used to design payment and staffing contracts that balance quality and cost.
- Workforce effects: partial substitution may reduce routine diagnostic workload, shifting clinicians toward oversight, complex cases, and supervision — raising questions about retraining, job design, and labor market transitions.
- Reliability and regulation: economic gains depend on robustness and trust. Regulators and payers will require clinical validation, safety guarantees, and clear liability frameworks for human-AI shared decision-making.
- Research opportunities: formal cost–benefit analyses, mechanism design for optimal assignment pricing, heterogeneity across specialties, generalizability to other high-skill services, and long-run effects on supply of medical professionals.
Assessment
Claims (13)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| HADT reached up to 89.4% diagnostic accuracy while requiring only 10.9% human effort. | Output Quality | positive | medium (0.11) | diagnostic accuracy; human effort (proportion of turns/time requiring human doctors) | 89.4% diagnostic accuracy; 10.9% human effort required |
| The Human-AI Diagnostic Team (HADT) framework can deliver near-expert-level online symptom inquiry and diagnosis while using very little human labor. | Output Quality | positive | medium (0.11) | quality of symptom inquiry / diagnostic performance (compared to expert-level) | near-expert-level performance with low human labor |
| A two-layer hierarchical reinforcement learning system (an assignment 'master' and an execution 'machine', plus human doctors) effectively balances accuracy and human cost. | Task Allocation | positive | medium (0.11) | trade-off between diagnostic accuracy and human effort | hierarchical RL balances accuracy and human cost |
| The upper layer ('master') learns turn-by-turn human–machine assignment using masked reinforcement learning with reward shaping to balance accuracy and human cost. | Task Allocation | positive | high (0.18) | assignment policy performance; human effort allocation; diagnostic accuracy under assignment policy | assignment-layer policy learned to trade off accuracy and human cost |
| The execution machine uses masked hierarchical reinforcement learning with bottom-up training to ask informative symptom questions from a large symptom space. | Output Quality | positive | medium (0.11) | quality/informativeness of symptom questions; downstream diagnostic accuracy | execution module asks informative symptom questions improving downstream accuracy |
| Masked reinforcement learning techniques constrain or mask action spaces, reducing exploration over huge symptom/action spaces. | Other | positive | high (0.18) | action-space reduction / sample efficiency / learning stability (as applied to symptom-action space) | masked RL constrains action space, improving sample efficiency/stability |
| Reward shaping at the assignment layer enables an explicit trade-off between diagnostic accuracy and human labor by incorporating penalties for human involvement. | Task Allocation | positive | high (0.18) | diagnostic accuracy vs human effort (as controlled by reward shaping) | reward shaping encodes penalties for human involvement to trade off accuracy/human effort |
| On public datasets HADT achieves superior accuracy/human-effort trade-offs compared to baselines (fully human, fully automated, and simpler assignment strategies). | Output Quality | positive | medium (0.11) | diagnostic accuracy and human effort relative to baseline methods | superior accuracy/human-effort trade-offs vs baselines on public datasets |
| Clinical-interface validation with real physicians on mobile devices confirmed the practical viability and usability of the HADT system and interface. | Adoption Rate | positive | medium (0.11) | practical viability / usability in clinical-interface testing (physician interaction) | clinical-interface validation confirmed practical viability/usability (physician tests) |
| HADT demonstrates a concrete way to substitute expensive human diagnostic labor with AI assistance while preserving high accuracy, implying reductions in marginal cost per consultation. | Firm Productivity | positive | low (0.05) | implied marginal cost per consultation (not directly measured) | implied reductions in marginal cost per consultation (partial substitution of human labor) |
| Intelligent turn-level assignment can reduce costly human attention to only high-value moments, improving overall system productivity. | Organizational Efficiency | positive | low (0.05) | distribution of human attention / system productivity (conceptual, not directly measured) | intelligent turn-level assignment reduces human attention to high-value moments, improving productivity (conceptual) |
| Partial substitution of routine diagnostic work by HADT may shift clinicians toward oversight, complex cases, and supervision, raising workforce and retraining considerations. | Skill Acquisition | mixed | speculative (0.02) | clinician workload composition / need for retraining (speculative) | partial substitution shifts clinicians toward oversight/complex cases, implying retraining needs |
| Regulators and payers will require clinical validation, safety guarantees, and clear liability frameworks for human–AI shared decision-making before widescale deployment. | Regulatory Compliance | null_result | speculative (0.02) | regulatory requirements / safety validation (anticipated, not measured) | anticipation that regulators/payers will require validation, safety guarantees, liability frameworks |