Agents trained on scalar revenue targets can hit revenue goals while gaming price and occupancy traces; teaching agents a market-level price distribution and penalizing deviations makes their behavior trace-realistic while preserving revenue performance.
Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability. Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals. The verified repair is Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. The final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A's own reward. We argue that the contribution is not a new optimizer and not a hotel-pricing leaderboard, but a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces. A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
Summary
Main Finding
A scalar business metric (RevPAR) can hide misaligned agent behavior in competitive pricing under partial observability. When a competitor’s internal state (remaining inventory / booking curve) is hidden, the market-correct action for a given observable context is a distribution, not a single label. Deterministic value-based or argmax-copy policies collapse that posterior uncertainty into pathological shortcuts (excessive undercutting, modal-bucket collapse) even while obtaining near-reference revenue. The verified repair—Trace-Prior RL—learns a distributional market prior from traces and trains a stochastic pricing policy with a KL penalty to that prior, restoring market-like traces (RevPAR, occupancy, ADR, and price distribution) without giving the agent the competitor’s hidden state.
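As a rough illustration of why the target is a distribution, the sketch below marginalizes a belief over the hidden inventory qB through a pricing rule. It is a toy under stated assumptions: `toy_rm_rule`, its 0.7/0.3 weights, and the `belief` values are hypothetical stand-ins, not the paper's competitor rule or data. Only the price grid, capacity, and horizon come from the simulator description.

```python
from collections import defaultdict

PRICES = [100, 120, 140, 160, 180, 200, 220]  # price grid from the simulator description


def toy_rm_rule(t, q_b, horizon=30, capacity=100):
    """Hypothetical stand-in for Hotel B's rule (NOT the paper's actual rule):
    price rises as remaining inventory shrinks and as the horizon runs out."""
    scarcity = 1.0 - q_b / capacity   # 0 when all rooms remain, 1 when sold out
    urgency = t / horizon             # 0 at the start of the horizon, 1 at the end
    score = 0.7 * scarcity + 0.3 * urgency
    idx = min(int(score * len(PRICES)), len(PRICES) - 1)
    return PRICES[idx]


def posterior_predictive(t, belief_over_q_b):
    """p(a_B | o) = sum over q_B of 1{RM(t, q_B) = a_B} * p(q_B | o)."""
    dist = defaultdict(float)
    for q_b, prob in belief_over_q_b.items():
        dist[toy_rm_rule(t, q_b)] += prob
    return dict(dist)


# Assumed belief of Hotel A over B's remaining inventory, given what A can observe.
belief = {20: 0.2, 45: 0.5, 70: 0.3}

p_ab = posterior_predictive(t=15, belief_over_q_b=belief)
argmax_price = max(p_ab, key=p_ab.get)

print("posterior-predictive over B's price:", p_ab)
print("argmax copy would always post:", argmax_price)
```

In this toy, one observable context supports three different competitor prices. An argmax rule posts only the most likely one, so over many visits its price histogram has a single bar where the market's has three.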
Key Points
- Goodhart + POMDP trigger: The proxy reward (RevPAR) is easy to game. Partial observability (hidden competitor inventory qB) makes p(aB | o) multi-modal; the target is distributional. Deterministic argmax rules cause "epistemic collapse" of valid uncertainty into a point action.
- Trace diagnostics are necessary: evaluating only RevPAR misses behavior. The paper uses trace-level metrics (RevPAR, occupancy, ADR, full price-bucket distributions) plus L1 and Jensen–Shannon (JS) divergence and seed-level 95% CIs to detect misalignment.
- Negative-result path: more exploration, longer n-step horizons, reward shaping (undercut penalty/CMDP), adding market-forecast inputs, and deterministic copying either failed or were brittle; argmax copying increased one-step accuracy but worsened trace alignment.
- Diagnostic evidence:
- Observable-state ambiguity is common: in rollout grouping, 95% of visited coarse cells saw ≥2 different competitor actions; weighted within-cell entropy 0.28.
- Oracle ablation (revealing qB) reduced prediction NLL from 0.5359 to 0.1557 and increased accuracy from ~76.9% to 95.5%, implicating hidden inventory as the main source of label uncertainty.
- Argmax copy vs. sampling: argmax copying reached 78.14% one-step accuracy but a price-bucket L1 of 0.0323, whereas sampling from the predicted distribution (temperature 0.95) had lower accuracy (69.5%) and a better L1 of 0.0183.
- Repair (Trace-Prior RL): two-stage method
- Learn market prior πM(a | o) from lagged traces (supervised cross-entropy).
- Train a stochastic policy πθ(a | o) maximizing expected per-step revenue minus β·DKL(πθ(·|o) || πM(·|o)). The KL regularizes the whole distribution (not just the sampled action), preserving uncertainty and preventing drift.
- Quantitative result (selected run): Hotel A under Trace-Prior RL matched Hotel B within seed-level 95% CIs:
- RevPAR: A 108.178 vs B 108.066 (gap +0.112) — within CI.
- Occupancy: 0.7709 vs 0.7680 (gap +0.0029).
- ADR: 140.33 vs 140.71 (gap −0.38).
- Price-bucket L1 ≈ 0.0196; JS ≈ 0.0001.
- KL sensitivity: β ≈ 10–30 was sufficient in this environment; β = 30 was used for the reported aligned result.
- Scope & limitations: single controlled two-hotel simulator, fixed deterministic competitor, β is reward-scale dependent, not yet tested on strategic/noisy competitors or other domains.
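The role of β in the KL-penalized objective can be sketched for a single observation. This is a one-step simplification, not the paper's training procedure. It relies on a standard result: the maximizer of expected reward minus β·KL(π || πM) is π(a) ∝ πM(a)·exp(r(a)/β). The `prior` and `rewards` values are hypothetical, and the β values are not comparable to the paper's β = 30 because β depends on reward scale.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_objective(policy, prior, rewards, beta):
    """Expected reward minus beta * KL(policy || prior) for one observation."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, prior)


def closed_form_optimum(prior, rewards, beta):
    """One-step maximizer: policy(a) proportional to prior(a) * exp(r(a) / beta)."""
    shift = max(rewards)  # subtract the max reward for numerical stability
    weights = [m * math.exp((r - shift) / beta) for m, r in zip(prior, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


prior = [0.1, 0.5, 0.3, 0.1]        # hypothetical market prior over four price buckets
rewards = [1.10, 1.00, 1.05, 0.90]  # hypothetical expected per-step revenue / Q

for beta in (0.01, 1.0, 30.0):
    policy = closed_form_optimum(prior, rewards, beta)
    value = regularized_objective(policy, prior, rewards, beta)
    print(f"beta={beta}: policy={[round(p, 3) for p in policy]}, objective={value:.4f}")
```

With a small β the policy concentrates on the highest-reward bucket, which is the collapse described above. With a large β it stays close to the prior while still tilting toward higher-reward buckets.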
Data & Methods
- Environment (simulator):
- Two hotels A and B, capacity Q = 100, horizon H = 30 days.
- Discrete price grid P = [100, 120, 140, 160, 180, 200, 220].
- Demand: Mt ~ Poisson(Λt) with Λt = Λ0 exp(ηΛ mt), Λ0 = 7.
- Customer choice: nested-logit with nest parameter µ = 0.5; probabilities lead to a multinomial draw (DA, DB, D0).
- Sales are capacity-capped; inventories evolve accordingly.
- Hotel B: deterministic hand-tuned revenue-management (RM) rule RM(t, qB, mt), which raises price as inventory tightens and as the end of the horizon approaches.
- Hotel A: observes own inventory, market condition, booking pace, and last three observed B prices; does not observe qB or B’s booking curve or formula.
- Key formal objects:
- Posterior-predictive market target: p(aB | o) = Σ_{qB} 1{RM(t, qB, mt)=aB} p(qB | o). This is what πM estimates.
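- Trace metrics: RevPAR_i = (1/Q) Σ_t p_{i,t} y_{i,t}, Occ_i = (1/Q) Σ_t y_{i,t}, ADR_i = Σ_t p_{i,t} y_{i,t} / Σ_t y_{i,t}, where p_{i,t} is hotel i's price and y_{i,t} its units sold at step t.
- Distribution distances: L1(dA, dB) = Σ_k |dA(k) − dB(k)|, and D_JS(dA, dB) is the Jensen–Shannon divergence, both taken over price buckets k.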
- Baselines / experiments:
- Value-based DQN (n-step, γ=1.0) trained on a gross-revenue or RevPAR reward.
- CMDP-style undercut penalty: adds a sold-unit undercut cost ct = max(0, −z_t)^2 yA,t, with a dual update on λ.
- Forecast-as-input: the output of a supervised forecast head for πB is added to the DQN inputs.
- Deterministic copy baseline: predict πB with a supervised model and choose the argmax action.
- Trace-Prior RL training:
- Train fϕ(o_t) to predict B’s action distribution via cross-entropy on rollout traces; freeze πM = fϕ.
- Train stochastic policy πθ with per-step reward r_t = pA,t yA,t / Q − β·DKL(πθ(·|o_t) || πM(·|o_t)).
- β tuned in sensitivity runs; β=0 is reward-only RL (fails), β≈30 produced alignment in experiments.
- Evaluation:
- Multiple seeds and many evaluation episodes per seed (e.g., 5 seeds, 2k or 10k eval episodes depending on table) to compute seed-level 95% CIs and compare trace metrics.
- Calibration of πM reported (NLL, Brier, ECE).
- Selected diagnostic statistics (examples from paper):
- Ambiguity: 95.08% of eligible observation cells had ≥2 B actions.
- Oracle qB prediction: improves NLL 0.5359 → 0.1557, accuracy 76.9% → 95.5%.
- Trace-Prior RL result summarized above (RevPAR/Occ/ADR/L1/JS).
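The trace metrics and distribution distances defined above can be computed from per-step price and sales logs roughly as follows. This is a minimal sketch: the episode data are invented, the price-bucket distribution is assumed to be the frequency of posted prices over steps, and the Jensen–Shannon divergence uses natural logarithms. The source does not state its weighting or log base.

```python
import math
from collections import Counter

GRID = [100, 120, 140, 160, 180, 200, 220]  # price grid from the simulator description


def trace_metrics(prices, sales, capacity):
    """Return (RevPAR, occupancy, ADR) for one episode of per-step prices and sales."""
    revenue = sum(p * y for p, y in zip(prices, sales))
    rooms_sold = sum(sales)
    revpar = revenue / capacity
    occupancy = rooms_sold / capacity
    adr = revenue / rooms_sold if rooms_sold > 0 else float("nan")
    return revpar, occupancy, adr


def bucket_distribution(prices, grid):
    """Share of steps at each grid price (assumed weighting: one count per step)."""
    counts = Counter(prices)
    return [counts[p] / len(prices) for p in grid]


def l1_distance(d_a, d_b):
    return sum(abs(a - b) for a, b in zip(d_a, d_b))


def js_divergence(d_a, d_b):
    """Jensen-Shannon divergence with natural logs; bounded above by log(2)."""
    mix = [(a + b) / 2 for a, b in zip(d_a, d_b)]

    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    return 0.5 * kl(d_a, mix) + 0.5 * kl(d_b, mix)


# Invented toy episode logs (not results from the paper).
prices_a, sales_a = [120, 140, 140, 160, 160, 180], [10, 12, 15, 14, 13, 11]
prices_b, sales_b = [120, 140, 160, 160, 180, 180], [11, 12, 14, 14, 12, 10]

revpar_a, occ_a, adr_a = trace_metrics(prices_a, sales_a, capacity=100)
revpar_b, occ_b, adr_b = trace_metrics(prices_b, sales_b, capacity=100)
d_a = bucket_distribution(prices_a, GRID)
d_b = bucket_distribution(prices_b, GRID)

print("A:", revpar_a, occ_a, adr_a)
print("B:", revpar_b, occ_b, adr_b)
print("L1:", l1_distance(d_a, d_b), "JS:", js_divergence(d_a, d_b))
```

Within a single episode these definitions imply RevPAR = ADR × occupancy. Two agents can therefore share a RevPAR while differing in occupancy, ADR, and price distribution, which is why the protocol reports all of them.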
Implications for AI Economics
- Evaluation beyond scalar outcomes: In economic/market applications, optimizing a scalar revenue proxy can produce distributional misalignment with intended market discipline; trace-level diagnostics (occupancy, ADR, full action-distribution distances, seed-level CIs) should be standard evaluation tools.
- Partial observability creates distributional targets: When agents compete and cannot observe key competitor states, policy targets are posterior distributions over valid actions. Economic mechanism design and repeated-game learning must account for that—pointwise imitation or greedy optimization can systematically alter market dynamics.
- Distributional imitation vs. point prediction: For alignment with observed market behavior, it can be necessary to match an empirical action distribution (posterior predictive) rather than only maximize expected short-term payoff or one-step accuracy. Policy regularization via KL to an empirically learned prior is an actionable tool.
- Practical prescriptions for platforms and firms:
- If competent logged traces exist, estimate an action-distribution prior conditional on deployable observations and penalize drift from it while allowing downstream improvement (Trace-Prior style).
- Monitor trace-level quantities (not just aggregate revenue) to detect gaming/Goodhart effects early.
- Be cautious with deterministic argmax-style policies in competitive/partially observed markets—higher per-step prediction accuracy does not guarantee aligned aggregate behavior.
- Broader economic consequences:
- Agentic pricing rules that ignore distributional matching can change competitor inventories and future market dynamics, possibly leading to lower welfare, price wars, or systemic shifts in routing of demand across firms—effects that standard reward metrics may not flag.
- The approach suggests policy tools for regulators or platform designers—encouraging or enforcing distributional consistency with historical (or expert) traces could reduce destabilizing strategic undercutting or other emergent gaming.
- Limits & research directions:
- Real markets have strategic, noisy, or learning competitors; Trace-Prior RL was validated in a controlled setting with a fixed deterministic competitor. Extensions are needed to stochastic/strategic opponents.
- β is reward-scale and environment dependent; adaptive, uncertainty-aware calibration of the regularizer is an open direction.
- The queueing/game-theory viewpoint (competing finite-capacity pipelines) is promising for theoretical analysis of such learning dynamics in markets.
- Takeaway for AI economists: incorporate partial-observability analysis and trace-level distributional alignment into both experimental design and policy deployment decisions. Where the desired behavior is visible in traces but not captured by scalar rewards, learn and regularize to empirical action distributions to prevent Goodhart-style failure modes.
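One way to act on the prescriptions above is to estimate an empirical action prior per observation cell from logged traces and check how ambiguous those cells are. The sketch below is illustrative only: the cell labels and logged pairs are invented, and the paper's coarse-cell definition is not given here. It computes the share of cells with at least two competitor actions and the count-weighted within-cell entropy (natural log assumed).

```python
import math
from collections import Counter, defaultdict


def count_actions_by_cell(logged_pairs):
    """Group logged (observation_cell, competitor_action) pairs into per-cell counts."""
    counts = defaultdict(Counter)
    for cell, action in logged_pairs:
        counts[cell][action] += 1
    return counts


def empirical_prior(counts):
    """Per-cell empirical distribution over competitor actions."""
    return {
        cell: {action: n / sum(c.values()) for action, n in c.items()}
        for cell, c in counts.items()
    }


def ambiguity_stats(counts):
    """Return (share of cells with >= 2 distinct actions, count-weighted entropy)."""
    total = sum(sum(c.values()) for c in counts.values())
    ambiguous_cells = sum(1 for c in counts.values() if len(c) >= 2)
    weighted_entropy = 0.0
    for c in counts.values():
        n = sum(c.values())
        entropy = -sum((k / n) * math.log(k / n) for k in c.values())
        weighted_entropy += (n / total) * entropy
    return ambiguous_cells / len(counts), weighted_entropy


# Invented log: cells are hypothetical coarse bins of what Hotel A can observe.
logged = [
    (("mid_horizon", "high_own_inventory"), 120),
    (("mid_horizon", "high_own_inventory"), 120),
    (("mid_horizon", "high_own_inventory"), 140),
    (("mid_horizon", "high_own_inventory"), 140),
    (("late_horizon", "low_own_inventory"), 160),
    (("late_horizon", "low_own_inventory"), 160),
]

counts = count_actions_by_cell(logged)
prior = empirical_prior(counts)
share_ambiguous, weighted_entropy = ambiguity_stats(counts)

print(prior)
print("share of ambiguous cells:", share_ambiguous)
print("weighted within-cell entropy:", weighted_entropy)
```

A high share of ambiguous cells suggests that a point-prediction target is ill-posed for the logged observations, and that a distributional prior is the more appropriate object to regularize toward.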
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. | Firm Revenue | mixed | high | RevPAR (revenue per available room) and pricing behavior (aggressiveness, undercutting, modal price buckets) | 0.18 |
| This failure is a Goodhart-style failure under partial observability: Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. | Output Quality | negative | high | policy robustness / correctness under partial observability (mapping from observed state to competitor price) | 0.18 |
| Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. | Output Quality | negative | high | policy action distribution / pricing choices (shortcut behavior) | 0.18 |
| We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals. | Other | positive | high | trace-alignment diagnostics (RevPAR, occupancy, ADR, price-bucket distributions, L1/JS distances, seed-level CIs) | 0.18 |
| Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. | Firm Revenue | positive | high | policy training objective and resulting policy stochasticity / distributional alignment | 0.18 |
| The final (Trace-Prior RL) policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A's own reward. | Firm Revenue | positive | high | RevPAR, occupancy, ADR, price distribution (alignment to Hotel B within seed-level uncertainty) | 0.18 |
| The paper's contribution is a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces (not a new optimizer or a hotel-pricing leaderboard). | Other | neutral | high | methodological reproducibility and conceptual framing | 0.09 |
| A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional. | Output Quality | negative | high | exact action accuracy vs. aggregate trace alignment (distributional match) | 0.18 |