Simulated hospitals game payment incentives: profit pressure produces up-coding and patient selection, and auditing one channel simply shifts the problem elsewhere; using LLM-guided program synthesis to design interpretable rules can suppress gaming and preserve most of the baseline revenues.

Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

Zihan Wang, Xiang Xu, Hongyuan Zha, Wenhao Li · May 29, 2026 · ArXiv.org

OpenAlex · theoretical · low evidence · 7/10 relevance · Full text usable (extracted full text) · Source PDF

Structured author observations

Linked only from stored provider relations; the raw author line above is never matched by name.

OpenAlex

Latest observation: July 23, 2026

Wang, Zihan provider ID
Xu, Xiang provider ID
Zha, Hongyuan provider ID
Li, Wenhao provider ID

Semantic Scholar

Latest observation: July 23, 2026

Zihan Wang provider ID
Xiangjun Xu provider ID
Hongyuan Zha provider ID
Wenhao Li provider ID

In a multi-agent hospital simulator, incentive changes and audits induce strategic behaviors—up-coding and selection of low-complexity patients—and an LLM-guided synthesis of inspectable rule programs can largely eliminate up-coding and cut rejections while retaining most profit.

Citation observations

Cumulative provider counts captured on specific dates; providers are never combined.

0 cumulative citations

OpenAlex · Observed July 22, 2026

0 cumulative citations

Semantic Scholar · Observed July 22, 2026

Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.

Summary

Main Finding

Medi-Sim, a closed-loop Identify–Produce–Settle simulator paired with a typed “policy-as-code” administrative DSL, shows that common healthcare mechanism failures (up-coding, cream-skimming, Goodhart drift) are not independent pathologies but adjacent regimes of a single incentive phase space, and that administrative fixes (audits, bonuses, KPI steering) typically induce pressure migration across provider-response channels. Constraining administrator policies to inspectable programs lets LLM-guided program synthesis (AlphaEvolve-style evolutionary search with LLM code edits) produce an auditable mixed-objective rule that eliminates up-coding, halves rejection, and preserves most of the profit-oriented baseline's funds, while per-channel diagnostics expose whether distortions reappear in other channels.

Key Points

Closed-loop modeling matters: Medi-Sim evaluates administrator rules by the equilibrium they induce (the leader commits to a rule; providers follow with bounded-rational best responses across five channels: coding, selection, delay, effort, triage). This reveals strategic-response shifts that are invisible to benchmarks that hold provider behavior fixed.
Five explicit provider channels map to classical health-econ margins and are instrumented for per-channel diagnostics, enabling identification of pressure migration when one channel is constrained.
Incentive phase diagram (L1 sweep over profit α and quality β):
- Weak incentives → access rationing.
- High profit pressure → up-coding and low-complexity selection (example metrics: up-coding ≈ 0.226; high-CMI rejection gap ≈ 0.182).
- High quality pressure → more effort but solvency stress.
- Intermediate/balanced region hides the worst Goodhart-style effects: the measured KPI becomes anti-correlated with true health (corr ≈ −0.659), and high-CMI patients experience a much higher delay share than low-CMI patients (0.290 vs 0.010).
Administrative lever effects (L2):
- Audits reduce coding but shift pressure to selection (closing coding can more than double low-complexity selection).
- Expanding bonus pools or KPI-steering can worsen proxy misalignment or increase waiting through flex-capacity steering.
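The audit result above suggests a simple per-channel check: after a lever is tightened, compare every channel's distortion before and after, not only the targeted one. The sketch below is illustrative only. The function name, the dictionary layout, and all numbers are made up for this example and are not taken from Medi-Sim.

```python
CHANNELS = ("coding", "selection", "delay", "effort", "triage")

def pressure_migration(before, after, target, tol=0.0):
    """List channels whose distortion rose while the targeted channel's fell."""
    if after[target] >= before[target]:
        return []  # the lever did not bite, so nothing can have migrated
    return [c for c in CHANNELS
            if c != target and after[c] - before[c] > tol]

# Made-up per-channel distortion rates, before and after tightening audits.
before = {"coding": 0.20, "selection": 0.10, "delay": 0.05,
          "effort": 0.02, "triage": 0.01}
after = {"coding": 0.00, "selection": 0.25, "delay": 0.05,
         "effort": 0.02, "triage": 0.01}

migrated = pressure_migration(before, after, target="coding")
print(migrated)
```

A single-metric report on the coding channel alone would show a clean win here; the per-channel comparison is what flags the rise in selection.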
Policy-as-code search (L3):
- Administrators’ rules are typed, assignment-only programs over approved levers (α, β, capacities, KPI weights, etc.), with audit intensity fixed host-side, to meet auditability/regulability constraints.
- AlphaEvolve evolutionary search with LLM-guided code edits (semantic program mutations) finds inspectable rules under a mixed objective (funds + reputation) that eliminate up-coding, halve rejection, and retain most funds from the profit-oriented baseline.
- Ablations: a diverse warm-start library and LLM-guided refinement are jointly necessary; black-box neural controllers are excluded by design because of auditability requirements.
Contributions claimed:
- Framing mechanism design as auditable program synthesis for LLMs.
- Medi-Sim: an Identify–Produce–Settle simulator with five provider-response channels and channel-level diagnostics.
- Demonstration that pressure migration is a structural benchmark phenomenon addressable (partly) by LLM-guided program search within an auditable DSL.

Data & Methods

Simulator (Medi-Sim):
- Finite-horizon stochastic Stackelberg game over T periods (experiments: T = 200; reported means over 30 random seeds per cell).
- Patient arrivals: DRG-style Poisson batches; patients have true group g*, normalized CMI ∈ [0,1], urgency, tolerance.
- Identify–Produce–Settle loop: coding (reported group), triage/selection/delay/effort/resource allocation, settlement (reimbursement, KPI, audits, bonuses).
- Coding wedge: separation between true clinical complexity and coded complexity determining reimbursement.
- Measurement wedge: KPI is a weighted aggregate of true health, waiting, rejection, and cost; misalignment produces Goodhart-style gaming.
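The measurement wedge can be illustrated with a toy calculation. The sketch below assumes a linear KPI in which health enters positively and waiting, rejection, and cost enter negatively; that sign convention, the weight values, and the four data points are all assumptions made for illustration, not Medi-Sim's actual KPI formula or outputs.

```python
from math import sqrt

def kpi(health, wait, reject, cost, w):
    """Measured KPI as a weighted aggregate; the signs are an assumption."""
    return (w["health"] * health - w["wait"] * wait
            - w["reject"] * reject - w["cost"] * cost)

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy periods in which better true health also costs more to deliver.
health = [0.9, 0.7, 0.5, 0.3]
cost = [0.8, 0.6, 0.4, 0.2]
wait = [0.1] * 4
reject = [0.0] * 4

def kpi_health_corr(w):
    scores = [kpi(h, wt, r, c, w)
              for h, wt, r, c in zip(health, wait, reject, cost)]
    return pearson(scores, health)

cost_heavy = {"health": 0.2, "wait": 0.1, "reject": 0.1, "cost": 0.8}
health_heavy = {"health": 0.8, "wait": 0.1, "reject": 0.1, "cost": 0.2}
print(kpi_health_corr(cost_heavy), kpi_health_corr(health_heavy))
```

With these toy numbers the cost-heavy weighting makes the KPI fall as true health rises, while the health-heavy weighting keeps the two aligned. The same outcomes can therefore score well or badly depending only on the KPI weights.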
Provider response model ΠP:
- Bounded-rationality, interpretable parametric response class (not LLM agents). Each care team’s action decomposes into five channels (coding, selection, delay, effort, triage).
- Closed-form behavioral rules: response intensity driven by local gradients of a utility U = α(Rev − Cost) + β Health + θ Bonus − fatigue penalty.
- This class preserves per-channel identifiability and diagnostic interpretability.
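The utility and the gradient-driven response can be sketched in a few lines. This is a minimal illustration, not the paper's closed-form rules: the `Outcome` fields, the toy outcome model for the coding channel, the finite-difference gradient, and the step size are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    revenue: float
    cost: float
    health: float
    bonus: float
    fatigue: float

def utility(o, alpha, beta, theta):
    """U = alpha*(Rev - Cost) + beta*Health + theta*Bonus - fatigue penalty."""
    return (alpha * (o.revenue - o.cost) + beta * o.health
            + theta * o.bonus - o.fatigue)

def toy_outcome(upcode):
    # Hypothetical: up-coding raises revenue linearly and fatigue quadratically.
    return Outcome(revenue=1.0 + upcode, cost=0.6, health=0.8,
                   bonus=0.0, fatigue=0.5 * upcode ** 2)

def local_gradient(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def respond(x, alpha, beta, theta, step=0.5):
    """One bounded step along the local utility gradient, clipped to [0, 1]."""
    grad = local_gradient(
        lambda v: utility(toy_outcome(v), alpha, beta, theta), x)
    return min(1.0, max(0.0, x + step * grad))

profit_driven = respond(0.0, alpha=0.8, beta=0.2, theta=0.0)
quality_driven = respond(0.0, alpha=0.0, beta=1.0, theta=0.0)
print(profit_driven, quality_driven)
```

In this toy model the up-coding intensity moves only when α is positive, because up-coding affects utility solely through revenue. Keeping one such scalar per channel is one way to retain the per-channel identifiability the response class is meant to provide.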
Administrator policy class ΠA:
- Typed, assignment-only DSL exposing only approved levers (α, β, capacities, bonus pool & κ, KPI weights, optional KPI-steering switch). Audit schedule, penalties, and settlement logic are fixed host-side.
- Programs are auditable line-by-line (no black-box neural controllers).
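A typed, assignment-only policy language can be checked mechanically. The sketch below uses Python's standard `ast` module to accept only statements of the form `lever = literal` over a whitelist. The lever names and their types are hypothetical stand-ins; the paper's actual DSL primitives are not reproduced here.

```python
import ast

# Hypothetical lever names and types, chosen for illustration only.
LEVER_TYPES = {"alpha": float, "beta": float, "bonus_pool": float,
               "kappa": float, "kpi_steering": bool}

def _matches(value, expected):
    if expected is bool:
        return isinstance(value, bool)
    # Numeric levers accept int or float literals, but not True/False.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def validate_policy(source):
    """Return the lever settings, or raise ValueError on any other construct."""
    settings = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign) or len(node.targets) != 1:
            raise ValueError(f"line {node.lineno}: only simple assignments are allowed")
        target, value = node.targets[0], node.value
        if not isinstance(target, ast.Name) or target.id not in LEVER_TYPES:
            raise ValueError(f"line {node.lineno}: target is not an approved lever")
        if (not isinstance(value, ast.Constant)
                or not _matches(value.value, LEVER_TYPES[target.id])):
            raise ValueError(f"line {node.lineno}: value must be a literal of the lever's type")
        settings[target.id] = value.value
    return settings

policy = "alpha = 0.3\nbeta = 0.7\nkpi_steering = False"
print(validate_policy(policy))
```

Because the checker never executes the program, a candidate that assigns to an unapproved name such as `audit_intensity`, or that contains calls or imports, is rejected before any rollout is run.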
Search & optimization:
- AlphaEvolve-style evolutionary search (OpenEvolve) with LLM-guided code-edit proposals for semantic mutations in the policy program.
- Fitness = empirical estimate of the Stackelberg objective (social objective ∈ {welfare, profit, mixed}) minus a safety penalty V (aggregating diagnostics such as unsafe waiting, up-coding, rejection, and insolvency) and minus a seed-variance regularizer.
- Candidate programs are syntax/type-checked and evaluated on short and full stochastic rollouts.
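The fitness described above can be sketched as a mean objective over seeds with two subtractive terms. The equal-weight sum for V, the use of population variance as the regularizer, and the two weight parameters are assumptions made for this example; the source does not give the exact functional form.

```python
from statistics import mean, pvariance

SAFETY_KEYS = ("unsafe_wait", "upcoding", "rejection", "insolvency")

def safety_violation(diag):
    """V: unweighted sum of safety diagnostics (equal weights are an assumption)."""
    return sum(diag[k] for k in SAFETY_KEYS)

def fitness(objectives, diagnostics, safety_weight=1.0, variance_weight=0.1):
    """Mean objective over seeds, minus safety penalty, minus seed-variance term."""
    j_hat = mean(objectives)
    v_hat = mean(safety_violation(d) for d in diagnostics)
    return (j_hat - safety_weight * v_hat
            - variance_weight * pvariance(objectives))

# Two made-up seeds for one candidate program.
objectives = [1.0, 3.0]
diagnostics = [
    {"unsafe_wait": 0.1, "upcoding": 0.2, "rejection": 0.2, "insolvency": 0.0},
    {"unsafe_wait": 0.1, "upcoding": 0.2, "rejection": 0.2, "insolvency": 0.0},
]
score = fitness(objectives, diagnostics)
print(score)
```

The variance term penalizes candidates whose objective swings widely across seeds, so a program that does well only on lucky rollouts ranks below a steadier one with the same mean.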
Experiments:
- L1: 11×11 grid over α, β ∈ [0,1], 30 seeds per cell; maps incentive phase diagram.
- L2: lever sweeps (audit intensity, bonus pool, KPI steering) to trace pressure migration.
- L3: code search starting from a diverse warm-start library; ablations test the necessity of the warm start and of the LLM edits.
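The L1 design (an 11×11 grid over α and β with 30 seeds per cell) has the shape sketched below. `toy_rollout` is a hypothetical stand-in for a simulator rollout and has no connection to Medi-Sim's dynamics; only the grid and seed counts come from the description above.

```python
import random
from statistics import mean

def toy_rollout(alpha, beta, seed):
    """Hypothetical stand-in for one rollout; returns a toy up-coding rate."""
    rng = random.Random(seed)
    base = 0.3 * max(0.0, alpha - beta)  # made-up response to profit pressure
    return min(1.0, max(0.0, base + rng.gauss(0.0, 0.01)))

GRID = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
SEEDS = range(30)

def sweep(rollout=toy_rollout):
    """Mean diagnostic per (alpha, beta) cell, averaged over all seeds."""
    return {(a, b): mean(rollout(a, b, s) for s in SEEDS)
            for a in GRID for b in GRID}

phase = sweep()
print(len(phase), phase[(1.0, 0.0)] > phase[(0.0, 1.0)])
```

Swapping `toy_rollout` for a real simulator call and recording one dictionary per diagnostic would give the per-channel maps from which regime boundaries are read off.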
Representative quantitative signals reported in the paper:
- Up-coding rate ≈ 0.226 under profit pressure.
- High-CMI rejection gap ≈ 0.182.
- High-CMI delay share ≈ 0.290 vs low-CMI 0.010 in the balanced interior.
- Measured KPI vs true health correlation ≈ −0.659 in the problematic interior.

Implications for AI Economics

Necessity of strategic-response modeling: Evaluating mechanisms (pricing, KPIs, audits) without modeling providers’ strategic channels systematically underestimates or misidentifies harms—benchmarks should model endogenous behavioral adaptation, not treat providers as static.
Multichannel diagnostic tooling: Single-metric evaluation (e.g., up-coding only) is insufficient. AI-economics frameworks should instrument per-channel diagnostics to detect pressure migration when one distortion channel is constrained.
Policy-as-code + LLMs as a practical design pattern: In regulated domains where auditability and fixed interfaces are required, LLMs can act as semantic code-editors over typed administrative DSLs to find interpretable rules that balance objectives while revealing side-channel effects. This is distinct from using LLMs as black-box controllers and aligns with regulatory needs (line-by-line inspection).
Goodhart and pressure migration are structural: Interventions that suppress one gaming channel (audits on coding) can reallocate incentives to others (selection, delay). Mechanism design must consider the full space of provider actions and include safety-penalized objectives or multi-channel constraints.
Methodological lesson for applied AI-economics work: combine tractable, interpretable agent-response families (for identifiability and diagnostics) with stochastic closed-loop simulation and constrained program search to generate auditable policies; use LLMs to guide semantic edits but preserve interpretability.
Cautions and next steps:
- Medi-Sim uses parameterized bounded-rational providers (not human-in-the-loop or LLM providers); external validity requires empirical calibration to real provider behavior.
- Audit, penalty, and settlement rules were host-fixed; real-world implementation would require careful policy choices and legal/regulatory vetting.
- While LLM-guided synthesis reduced certain distortions in-simulator, continual monitoring is needed to ensure distortions do not reappear as providers adapt or the environment changes.

Assessment

Paper Type: theoretical
Evidence Strength: low. Findings rely entirely on a purpose-built simulator with stylized bounded-rational provider agents and LLM-guided policy search rather than real-world data or randomized interventions; internal counterfactuals are informative, but external validity depends on strong modeling assumptions about provider behavior, incentives, and LLM capabilities.
Methods Rigor: medium. The study uses a structured, multi-channel simulator, systematic incentive sweeps, and automated search over an interpretable rule-program space, which gives clear internal comparisons; however, rigor is limited by the absence of empirical validation, potential sensitivity to simulator specification and LLM prompts, and robustness checks that are unclear from the material reported here.
Sample: No human or observational dataset. Experiments are conducted in Medi-Sim, a multi-agent simulated hospital environment with five strategic provider channels (coding, selection, delay, effort, triage); typed, inspectable rule programs are executed to represent hospital mechanisms; treatments include incentive parameter sweeps and audit lever manipulations; LLM-guided evolutionary code search explores alternative rule programs.
Themes: governance, org_design
Identification: Controlled counterfactual experiments inside a multi-agent simulator (Medi-Sim): the authors vary incentive parameters and audit levers and compare equilibrium outcomes across runs; they also perform LLM-guided evolutionary search over the rule-program space to evaluate alternative mechanism designs.
Generalizability:
- Simulation-only results may not map to real hospitals or health systems.
- Provider behavior is modeled with stylized bounded-rational agents, which may differ from actual clinicians, administrators, and payers.
- Simplified channelization (five channels) omits many institutional and regulatory details.
- Results are likely sensitive to simulator parametrization and reward specifications.
- Findings may not generalize across different insurance systems, countries, or clinical contexts.

Claims (8)

Claim	Direction	Outcome	Confidence & Evidence	Details
Existing healthcare AI benchmarks hold this [strategic provider] response fixed and so cannot evaluate mechanisms by the equilibrium they produce. Governance And Regulation	negative	ability of benchmarks to evaluate mechanisms by equilibrium response	Reading fidelity high Study strength speculative	not reported 0.02
We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). Other	positive	existence and design of Medi-Sim simulator (methodological/tool contribution)	Reading fidelity high Study strength medium	not reported 0.12
An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure. Task Allocation	positive	incidence of up-coding and selection of low-complexity patients under profit pressure	Reading fidelity high Study strength medium	not reported 0.12
An incentive sweep reveals Goodhart-style drift where measured performance becomes anti-correlated with true outcomes. Decision Quality	negative	correlation between measured performance metric and true patient outcomes	Reading fidelity high Study strength medium	measured performance becomes anti-correlated with true outcomes 0.12
A single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. Task Allocation	positive	rate or incidence of low-complexity patient selection after closing coding channel	Reading fidelity high Study strength medium	more than doubles low-complexity selection 0.12
LLM-guided evolutionary code search synthesizes an inspectable mixed-objective program that eliminates up-coding. Task Allocation	positive	incidence of up-coding under the synthesized program	Reading fidelity high Study strength medium	eliminates up-coding 0.12
The synthesized mixed-objective program halves rejection. Task Allocation	positive	patient rejection rate	Reading fidelity high Study strength medium	halves rejection 0.12
The synthesized mixed-objective program retains most of the profit-oriented baseline's funds. Firm Revenue	positive	funds retained relative to profit-oriented baseline	Reading fidelity high Study strength medium	retains most of the profit-oriented baseline's funds 0.12