Simulated hospitals game payment incentives: profit pressure produces up-coding and patient selection, and auditing one channel simply shifts the problem elsewhere; using LLM-guided program synthesis to design interpretable rules can suppress gaming and preserve most of the baseline revenues.
Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.
Summary
Main Finding
Medi-Sim, a closed-loop Identify–Produce–Settle simulator paired with a typed “policy-as-code” administrative DSL, reveals that common healthcare mechanism failures (up-coding, cream-skimming, Goodhart drift) are not independent pathologies but adjacent regimes of a single incentive phase space, and that administrative fixes (audits, bonuses, KPI steering) typically induce pressure migration across provider-response channels. Constraining administrator policies to inspectable programs lets LLM-guided program synthesis (AlphaEvolve-style evolutionary search with LLM code edits) produce an auditable mixed-objective rule that eliminates up-coding, halves rejection, and preserves most of the profit-oriented baseline's funds, while per-channel diagnostics expose whether distortions reappear in other channels.
Key Points
- Closed-loop modeling matters: Medi-Sim evaluates administrator rules by the equilibrium they induce. The leader commits to a rule, and providers follow with a bounded-rational best response across five channels (coding, selection, delay, effort, triage). This reveals strategic-response shifts that are invisible to benchmarks that fix provider behavior.
- Five explicit provider channels map to classical health-econ margins and are instrumented for per-channel diagnostics, enabling identification of pressure migration when one channel is constrained.
- Incentive phase diagram (L1 sweep over profit α and quality β):
  - Weak incentives → access rationing.
  - High profit pressure → up-coding and low-complexity selection (example metrics: up-coding rate ≈ 0.226; high-CMI rejection gap ≈ 0.182).
  - High quality pressure → more effort but solvency stress.
  - The intermediate/balanced region hides the worst Goodhart-style effects: the measured KPI becomes anti-correlated with true health (corr ≈ −0.659), and high-CMI patients experience a much higher delay share (0.290 vs 0.010 for low-CMI patients).
- Administrative lever effects (L2):
  - Audits reduce coding but shift pressure to selection (closing coding can more than double low-complexity selection).
  - Expanding bonus pools or KPI-steering can worsen proxy misalignment or increase waiting through flex-capacity steering.
- Policy-as-code search (L3):
  - Administrators’ rules are typed, assignment-only programs over approved levers (α, β, capacities, KPI weights, etc.), with audit intensity fixed host-side, to meet auditability/regulability constraints.
  - AlphaEvolve-style evolutionary search with LLM-guided code edits (semantic program mutations) finds inspectable rules under a mixed objective (funds + reputation) that eliminate up-coding, halve rejection, and retain most funds from the profit-oriented baseline.
  - Ablations: a diverse warm-start library and LLM-guided refinement are jointly necessary; black-box neural controllers are excluded by design because of auditability requirements.
- Contributions claimed:
  - Framing mechanism design as auditable program synthesis for LLMs.
  - Medi-Sim: an Identify–Produce–Settle simulator with five provider-response channels and channel-level diagnostics.
  - Demonstration that pressure migration is a structural benchmark phenomenon addressable (partly) by LLM-guided program search within an auditable DSL.
Data & Methods
- Simulator (Medi-Sim):
  - Finite-horizon stochastic Stackelberg game over T periods (experiments: T = 200; reported means over 30 random seeds per cell).
  - Patient arrivals: DRG-style Poisson batches; patients have true group g*, normalized CMI ∈ [0,1], urgency, tolerance.
  - Identify–Produce–Settle loop: coding (reported group), triage/selection/delay/effort/resource allocation, settlement (reimbursement, KPI, audits, bonuses).
  - Coding wedge: separation between true clinical complexity and coded complexity determining reimbursement.
  - Measurement wedge: KPI is a weighted aggregate of true health, waiting, rejection, and cost; misalignment produces Goodhart-style gaming.
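To make the measurement wedge concrete, here is a minimal sketch of a weighted KPI and of the KPI-versus-health correlation diagnostic. The weights, the sign convention (health counts positively; waiting, rejection, and cost count negatively), and the three toy periods are assumptions for illustration only, not Medi-Sim's actual formula or data.

```python
from math import sqrt

def kpi(health, waiting, rejection, cost,
        w_health=0.25, w_wait=0.25, w_reject=0.25, w_cost=0.25):
    """Hypothetical KPI: weighted aggregate of the four measured quantities."""
    return (w_health * health - w_wait * waiting
            - w_reject * rejection - w_cost * cost)

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy periods (illustrative only): cost is cut faster than health declines,
# so the KPI rises while true health falls.
true_health = [0.9, 0.7, 0.5]
cost = [0.8, 0.4, 0.0]
measured = [kpi(h, 0.0, 0.0, c) for h, c in zip(true_health, cost)]
wedge = pearson(measured, true_health)  # negative: the proxy is misaligned
```

A negative `wedge` is the signature of Goodhart-style drift: steering on the proxy moves it in the opposite direction to the outcome it is meant to track.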
- Provider response model ΠP:
  - Bounded-rationality, interpretable parametric response class (not LLM agents). Each care team’s action decomposes into five channels (coding, selection, delay, effort, triage).
  - Closed-form behavioral rules: response intensity driven by local gradients of a utility U = α(Rev − Cost) + β Health + θ Bonus − fatigue penalty.
  - This class preserves per-channel identifiability and diagnostic interpretability.
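The provider utility above can be written out directly. The following is a minimal sketch: the class and argument names are hypothetical, and how each channel moves revenue, cost, health, and bonus is left out because the summary does not specify it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Incentives:
    alpha: float  # weight on profit (Rev - Cost)
    beta: float   # weight on true health
    theta: float  # weight on bonus payments

def provider_utility(inc, revenue, cost, health, bonus, fatigue_penalty):
    """U = alpha * (Rev - Cost) + beta * Health + theta * Bonus - fatigue penalty."""
    return (inc.alpha * (revenue - cost)
            + inc.beta * health
            + inc.theta * bonus
            - fatigue_penalty)

# Pure profit pressure: only the margin and the fatigue term matter.
profit_only = provider_utility(Incentives(1.0, 0.0, 0.0),
                               revenue=10.0, cost=4.0, health=0.5,
                               bonus=2.0, fatigue_penalty=1.0)
```

Under `Incentives(1.0, 0.0, 0.0)` the health and bonus terms drop out entirely, which is the regime in which the summary reports up-coding and low-complexity selection.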
- Administrator policy class ΠA:
  - Typed, assignment-only DSL exposing only approved levers (α, β, capacities, bonus pool & κ, KPI weights, optional KPI-steering switch). Audit schedule, penalties, and settlement logic are fixed host-side.
  - Programs are auditable line-by-line (no black-box neural controllers).
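One way to picture an assignment-only, typed policy class is a checker that accepts nothing but `lever = literal` lines. The sketch below borrows Python syntax and Python's `ast` module for this; the lever names, their types, and the restriction to literal right-hand sides are assumptions, since the summary does not give the DSL's concrete syntax.

```python
import ast

# Hypothetical lever names and types; the source lists levers, not their spelling.
APPROVED_LEVERS = {
    "alpha": float,
    "beta": float,
    "bonus_pool": float,
    "kappa": float,
    "kpi_steering": bool,
}

def check_policy(source: str) -> dict:
    """Return lever settings of an assignment-only program.

    Raises ValueError for disallowed statements, levers, or types
    (and SyntaxError if the text does not parse at all).
    """
    levers = {}
    for stmt in ast.parse(source).body:
        # Reject anything that is not `name = literal` (imports, calls, loops, ...).
        if not isinstance(stmt, ast.Assign) or len(stmt.targets) != 1:
            raise ValueError("only single-target assignments are allowed")
        target, value = stmt.targets[0], stmt.value
        if not isinstance(target, ast.Name) or target.id not in APPROVED_LEVERS:
            raise ValueError("assignment to an unapproved lever")
        if not isinstance(value, ast.Constant):
            raise ValueError("right-hand side must be a literal")
        expected = APPROVED_LEVERS[target.id]
        literal = value.value
        # bool is a subclass of int, so test for it explicitly.
        if expected is bool:
            well_typed = isinstance(literal, bool)
        else:
            well_typed = (isinstance(literal, (int, float))
                          and not isinstance(literal, bool))
        if not well_typed:
            raise ValueError(f"{target.id} expects {expected.__name__}")
        levers[target.id] = expected(literal)
    return levers

settings = check_policy("alpha = 0.8\nbeta = 0.3\nkpi_steering = True")
```

Because `audit_intensity` is not in the approved set, a program that tries to assign it is rejected, mirroring the statement that audit logic stays host-side.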
- Search & optimization:
  - AlphaEvolve-style evolutionary search (OpenEvolve) with LLM-guided code-edit proposals for semantic mutations in the policy program.
  - Fitness = empirical estimator of the Stackelberg objective (social objective ∈ {welfare, profit, mixed}) minus a safety penalty V (which aggregates diagnostics such as unsafe waiting, up-coding, rejection, and insolvency) and a seed-variance regularizer.
  - Candidate programs are syntax/type-checked and evaluated on short and full stochastic rollouts.
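The fitness described above combines three terms, which can be sketched as follows. The function name, the two weights, and the use of the population standard deviation as the variance regularizer are assumptions; the summary states only that the fitness subtracts a safety penalty and a seed-variance term from the estimated objective.

```python
from statistics import mean, pstdev

def fitness(objective_by_seed, violation_by_seed,
            penalty_weight=1.0, variance_weight=0.1):
    """Mean objective across seeds, minus safety and seed-variance penalties.

    objective_by_seed: social objective from one rollout per seed.
    violation_by_seed: aggregated safety diagnostics per seed (larger is worse).
    """
    if not objective_by_seed or len(objective_by_seed) != len(violation_by_seed):
        raise ValueError("need one objective and one violation value per seed")
    return (mean(objective_by_seed)
            - penalty_weight * mean(violation_by_seed)
            - variance_weight * pstdev(objective_by_seed))

# Two seeds, equal violations, variance weight 0.5 (all values illustrative).
score = fitness([2.0, 4.0], [1.0, 1.0], penalty_weight=1.0, variance_weight=0.5)
```

The variance term means that, of two candidate programs with the same average objective, the one whose outcome depends less on the random seed scores higher.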
- Experiments:
  - L1: 11×11 grid over α, β ∈ [0,1], 30 seeds per cell; maps the incentive phase diagram.
  - L2: lever sweeps (audit intensity, bonus pool, KPI steering) to trace pressure migration.
  - L3: code search starting from a diverse warm-start library; ablations test the necessity of the warm start and of the LLM edits.
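The L1 sweep has a simple shape: an 11×11 grid over α and β with a seed-averaged value per cell. The sketch below shows only that loop structure; `toy_simulate` is a placeholder invented for illustration and does not model Medi-Sim or reproduce any reported number.

```python
import random
from statistics import mean

def grid_axis(n=11):
    """n evenly spaced points on [0, 1]."""
    return [i / (n - 1) for i in range(n)]

def sweep(simulate, n=11, seeds=30):
    """Seed-averaged metric for every (alpha, beta) cell of an n-by-n grid."""
    results = {}
    for alpha in grid_axis(n):
        for beta in grid_axis(n):
            results[(alpha, beta)] = mean(
                simulate(alpha, beta, seed) for seed in range(seeds)
            )
    return results

def toy_simulate(alpha, beta, seed):
    """Placeholder rollout: a noisy stand-in metric, not a real simulator."""
    rng = random.Random(seed)
    return alpha - beta + rng.gauss(0.0, 0.01)

phase_diagram = sweep(toy_simulate)  # 121 cells, 30 rollouts each
```

Replacing `toy_simulate` with a function that runs one stochastic rollout and returns a diagnostic (for example an up-coding rate) would yield one panel of a phase diagram per diagnostic.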
- Representative quantitative signals reported in the paper:
  - Up-coding rate ≈ 0.226 under profit pressure.
  - High-CMI rejection gap ≈ 0.182.
  - High-CMI delay share ≈ 0.290 vs low-CMI 0.010 in the balanced interior.
  - Measured KPI vs true health correlation ≈ −0.659 in the problematic interior.
Implications for AI Economics
- Necessity of strategic-response modeling: Evaluating mechanisms (pricing, KPIs, audits) without modeling providers’ strategic channels systematically underestimates or misidentifies harms—benchmarks should model endogenous behavioral adaptation, not treat providers as static.
- Multichannel diagnostic tooling: Single-metric evaluation (e.g., up-coding only) is insufficient. AI-economics frameworks should instrument per-channel diagnostics to detect pressure migration when one distortion channel is constrained.
- Policy-as-code + LLMs as a practical design pattern: In regulated domains where auditability and fixed interfaces are required, LLMs can act as semantic code-editors over typed administrative DSLs to find interpretable rules that balance objectives while revealing side-channel effects. This is distinct from using LLMs as black-box controllers and aligns with regulatory needs (line-by-line inspection).
- Goodhart and pressure migration are structural: Interventions that suppress one gaming channel (audits on coding) can reallocate incentives to others (selection, delay). Mechanism design must consider the full space of provider actions and include safety-penalized objectives or multi-channel constraints.
- Methodological lesson for applied AI-economics work: combine tractable, interpretable agent-response families (for identifiability and diagnostics) with stochastic closed-loop simulation and constrained program search to generate auditable policies; use LLMs to guide semantic edits but preserve interpretability.
- Cautions and next steps:
  - Medi-Sim uses parameterized bounded-rational providers (not human-in-the-loop or LLM providers); external validity requires empirical calibration to real provider behavior.
  - Audit, penalty, and settlement rules were host-fixed; real-world implementation would require careful policy choices and legal/regulatory vetting.
  - While LLM-guided synthesis reduced certain distortions in-simulator, continual monitoring is needed to ensure distortions do not reappear as providers adapt or the environment changes.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Existing healthcare AI benchmarks hold this [strategic provider] response fixed and so cannot evaluate mechanisms by the equilibrium they produce. [Governance And Regulation] | negative | high | ability of benchmarks to evaluate mechanisms by equilibrium response | 0.02 |
| We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). [Other] | positive | high | existence and design of Medi-Sim simulator (methodological/tool contribution) | 0.12 |
| An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure. [Task Allocation] | positive | high | incidence of up-coding and selection of low-complexity patients under profit pressure | 0.12 |
| An incentive sweep reveals Goodhart-style drift where measured performance becomes anti-correlated with true outcomes. [Decision Quality] | negative | high | correlation between measured performance metric and true patient outcomes | measured performance becomes anti-correlated with true outcomes; 0.12 |
| A single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. [Task Allocation] | positive | high | rate or incidence of low-complexity patient selection after closing coding channel | more than doubles low-complexity selection; 0.12 |
| LLM-guided evolutionary code search synthesizes an inspectable mixed-objective program that eliminates up-coding. [Task Allocation] | positive | high | incidence of up-coding under the synthesized program | eliminates up-coding; 0.12 |
| The synthesized mixed-objective program halves rejection. [Task Allocation] | positive | high | patient rejection rate | halves rejection; 0.12 |
| The synthesized mixed-objective program retains most of the profit-oriented baseline's funds. [Firm Revenue] | positive | high | funds retained relative to profit-oriented baseline | retains most of the profit-oriented baseline's funds; 0.12 |