Agents trained on scalar revenue targets can hit revenue goals while gaming price and occupancy traces; teaching agents a market-level price distribution and penalizing deviations makes their behavior trace-realistic while preserving revenue performance.

Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State

Peiying Zhu, Sidi Chang · May 07, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

In a two-hotel simulator, optimizing a pricing agent solely for RevPAR yields near-target revenue but produces unrealistic pricing and occupancy traces under partial observability, while Trace-Prior RL (learning a distributional prior from lagged traces and penalizing KL divergence) restores realistic price distributions and occupancy without sacrificing revenue.

Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability. Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals. The verified repair is Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. The final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A's own reward. We argue that the contribution is not a new optimizer and not a hotel-pricing leaderboard, but a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces. A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.

Summary

Main Finding

A scalar business metric (RevPAR) can hide misaligned agent behavior in competitive pricing under partial observability. When a competitor’s internal state (remaining inventory / booking curve) is hidden, the market-correct action for a given observable context is a distribution, not a single label. Deterministic value-based or argmax-copy policies collapse that posterior uncertainty into pathological shortcuts (excessive undercutting, modal-bucket collapse) even while obtaining near-reference revenue. The verified repair—Trace-Prior RL—learns a distributional market prior from traces and trains a stochastic pricing policy with a KL penalty to that prior, restoring market-like traces (RevPAR, occupancy, ADR, and price distribution) without giving the agent the competitor’s hidden state.
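
To make the "distribution, not a single label" point concrete, the following is a minimal sketch of how a posterior-predictive price target arises when the competitor's inventory is hidden. The rule `toy_rm_rule`, its coefficients, and the belief over hidden inventory are hypothetical placeholders chosen for illustration; they are not the paper's competitor rule or data.

```python
from collections import defaultdict

PRICE_GRID = [100, 120, 140, 160, 180, 200, 220]  # price grid listed in the summary


def toy_rm_rule(t, q_b, m_t):
    """Hypothetical stand-in for Hotel B's deterministic rule (not the paper's).

    Price rises as remaining inventory q_b shrinks and as the day index t grows.
    """
    scarcity = 1.0 - q_b / 100.0   # 0 with full inventory, 1 when sold out
    urgency = t / 30.0             # 0 at the start of the horizon, 1 at the end
    score = 0.6 * scarcity + 0.3 * urgency + 0.1 * m_t  # assumed weights
    idx = min(len(PRICE_GRID) - 1, max(0, int(score * len(PRICE_GRID))))
    return PRICE_GRID[idx]


def posterior_predictive(t, m_t, q_b_posterior):
    """p(a_B | o): sum over q_B of 1{RM(t, q_B, m_t) = a_B} * p(q_B | o)."""
    dist = defaultdict(float)
    for q_b, prob in q_b_posterior.items():
        dist[toy_rm_rule(t, q_b, m_t)] += prob
    return dict(dist)


# Assumed belief over B's hidden remaining inventory given what A can observe.
belief = {80: 0.2, 60: 0.5, 40: 0.3}
target = posterior_predictive(t=15, m_t=0.5, q_b_posterior=belief)
print(target)
```

Even though the toy rule is deterministic, the same observable context spreads probability over more than one price bucket, so an argmax policy would discard part of a valid target.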

Key Points

Goodhart + POMDP trigger: The proxy reward (RevPAR) is easy to game. Partial observability (hidden competitor inventory qB) makes p(aB | o) multi-modal; the target is distributional. Deterministic argmax rules cause "epistemic collapse" of valid uncertainty into a point action.
Trace diagnostics are necessary: evaluating only RevPAR misses behavior. The paper uses trace-level metrics (RevPAR, occupancy, ADR, full price-bucket distributions) plus L1 and Jensen–Shannon (JS) divergence and seed-level 95% CIs to detect misalignment.
Negative-result path: more exploration, longer n-step horizons, reward shaping (undercut penalty/CMDP), adding market-forecast inputs, and deterministic copying either failed or were brittle; argmax copying increased one-step accuracy but worsened trace alignment.
Diagnostic evidence:
- Observable-state ambiguity is common: in rollout grouping, 95% of visited coarse cells saw ≥2 different competitor actions; weighted within-cell entropy 0.28.
- Oracle ablation (revealing qB) reduced prediction NLL from 0.5359 to 0.1557 and increased accuracy from ~76.9% to 95.5%, implicating hidden inventory as the main source of label uncertainty.
- Argmax copy: accuracy 78.14% but L1 = 0.0323 (worse); probabilistic sampling (temperature 0.95) had lower accuracy (69.5%) but better L1 = 0.0183.
Repair (Trace-Prior RL): two-stage method
- Stage 1: learn a market prior πM(a | o) from lagged traces (supervised cross-entropy).
- Stage 2: train a stochastic policy πθ(a | o) that maximizes expected per-step revenue minus β·DKL(πθ(·|o) || πM(·|o)). The KL term regularizes the whole distribution (not just the sampled action), preserving uncertainty and preventing drift.
Quantitative result (selected run): Hotel A under Trace-Prior RL matched Hotel B within seed-level 95% CIs:
- RevPAR: A 108.178 vs B 108.066 (gap +0.112) — within CI.
- Occupancy: 0.7709 vs 0.7680 (gap +0.0029).
- ADR: 140.33 vs 140.71 (gap −0.38).
- Price-bucket L1 ≈ 0.0196; JS ≈ 0.0001.
- KL sensitivity: β ≈ 10–30 was sufficient in this environment; β = 30 was used for the reported aligned result.
Scope & limitations: single controlled two-hotel simulator, fixed deterministic competitor, β is reward-scale dependent, not yet tested on strategic/noisy competitors or other domains.
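
The accuracy-versus-alignment observation in the key points can be reproduced in a toy setting. The sketch below is an idealized illustration with made-up contexts and probabilities (not the paper's data, and with no temperature parameter): it compares an argmax copy against sampling from the same predicted distribution, computing expected one-step accuracy and the L1 distance between marginal action distributions exactly.

```python
def l1_distance(d_a, d_b):
    """L1 distance between two action distributions given as dicts."""
    keys = set(d_a) | set(d_b)
    return sum(abs(d_a.get(k, 0.0) - d_b.get(k, 0.0)) for k in keys)


# Assumed toy data: context -> (context weight, p(a_B | o)).
contexts = {
    "low_pace": (0.5, {120: 0.7, 140: 0.3}),
    "high_pace": (0.5, {140: 0.6, 160: 0.4}),
}


def marginal(policy):
    """Aggregate action distribution of a policy across contexts."""
    out = {}
    for name, (weight, _) in contexts.items():
        for action, prob in policy[name].items():
            out[action] = out.get(action, 0.0) + weight * prob
    return out


def accuracy(policy):
    """Probability that the policy's action equals B's action in a context."""
    total = 0.0
    for name, (weight, market_dist) in contexts.items():
        match = sum(policy[name].get(a, 0.0) * p for a, p in market_dist.items())
        total += weight * match
    return total


market = {name: dist for name, (_, dist) in contexts.items()}
argmax_copy = {n: {max(d, key=d.get): 1.0} for n, d in market.items()}
sampled_copy = market  # sample actions from the predicted distribution

for label, policy in [("argmax", argmax_copy), ("sampled", sampled_copy)]:
    print(label, round(accuracy(policy), 3),
          round(l1_distance(marginal(policy), marginal(market)), 3))
```

In this toy case the argmax copy scores higher on exact-match accuracy while drifting further from the market's aggregate price distribution, which is the same qualitative pattern the summary reports.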

Data & Methods

Environment (simulator):
- Two hotels A and B, capacity Q = 100, horizon H = 30 days.
- Discrete price grid P = [100, 120, 140, 160, 180, 200, 220].
- Demand: Mt ~ Poisson(Λt) with Λt = Λ0 exp(ηΛ mt), Λ0 = 7.
- Customer choice: nested-logit with nest parameter µ = 0.5; probabilities lead to a multinomial draw (DA, DB, D0).
- Sales are capacity-capped; inventories evolve accordingly.
- Hotel B: deterministic hand-tuned revenue-management (RM) rule RM(t, qB, mt), which raises price as inventory tightens and as the end of the booking horizon approaches.
- Hotel A: observes own inventory, market condition, booking pace, and last three observed B prices; does not observe qB or B’s booking curve or formula.
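
A minimal sketch of one simulated day under the environment described above. Only Λ0 = 7, µ = 0.5, and the Poisson-then-multinomial structure come from the summary; the value of η, the utility specification, the price sensitivity, and the choice to place hotels A and B in one nest with the outside option alone are assumptions made for illustration.

```python
import math
import random

LAMBDA_0 = 7.0       # base arrival rate (from the summary)
MU = 0.5             # nest parameter (from the summary)
ETA = 0.3            # assumed value for the market-condition coefficient
BASE_UTILITY = 3.0   # assumed intercept of each hotel's utility
PRICE_SENS = 0.02    # assumed price sensitivity


def arrival_rate(m_t):
    """Lambda_t = Lambda_0 * exp(eta * m_t)."""
    return LAMBDA_0 * math.exp(ETA * m_t)


def choice_probs(p_a, p_b):
    """Nested-logit probabilities (P_A, P_B, P_0); outside utility fixed at 0."""
    e_a = math.exp((BASE_UTILITY - PRICE_SENS * p_a) / MU)
    e_b = math.exp((BASE_UTILITY - PRICE_SENS * p_b) / MU)
    inclusive = MU * math.log(e_a + e_b)            # inclusive value of the nest
    p_nest = math.exp(inclusive) / (1.0 + math.exp(inclusive))
    return p_nest * e_a / (e_a + e_b), p_nest * e_b / (e_a + e_b), 1.0 - p_nest


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_day(p_a, p_b, m_t, q_a, q_b, rng):
    """Draw arrivals, allocate them by choice probabilities, cap by inventory."""
    n = sample_poisson(arrival_rate(m_t), rng)
    picks = rng.choices(["A", "B", "0"], weights=choice_probs(p_a, p_b), k=n)
    return min(picks.count("A"), q_a), min(picks.count("B"), q_b)


rng = random.Random(0)
print(choice_probs(140, 160))
print(simulate_day(140, 160, m_t=0.5, q_a=100, q_b=100, rng=rng))
```
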
Key formal objects:
- Posterior-predictive market target: p(aB | o) = Σ_{qB} 1{RM(t, qB, mt)=aB} p(qB | o). This is what πM estimates.
- Trace metrics: RevPAR_i = (1/Q) Σ_t p_{i,t} y_{i,t}; Occ_i = (1/Q) Σ_t y_{i,t}; ADR_i = Σ_t p_{i,t} y_{i,t} / Σ_t y_{i,t}.
- Distribution distances: L1(dA, dB) = Σ_k |dA(k) − dB(k)|; D_JS(dA, dB) is the Jensen–Shannon divergence.
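
The trace metrics and distribution distances above can be sketched in a few lines. This is an illustrative implementation under two assumptions not stated in the summary: the price-bucket distribution counts posted prices per step (rather than weighting by units sold), and the Jensen–Shannon divergence uses the natural logarithm.

```python
import math

Q = 100  # capacity from the summary


def trace_metrics(prices, sales, capacity=Q):
    """RevPAR, occupancy, and ADR for one hotel over one episode."""
    revenue = sum(p * y for p, y in zip(prices, sales))
    sold = sum(sales)
    adr = revenue / sold if sold else float("nan")
    return {"revpar": revenue / capacity, "occ": sold / capacity, "adr": adr}


def bucket_distribution(prices, grid):
    """Share of steps spent in each price bucket (assumed: unweighted by sales)."""
    counts = [prices.count(p) for p in grid]
    total = sum(counts)
    return [c / total for c in counts]


def l1_distance(d_a, d_b):
    return sum(abs(a - b) for a, b in zip(d_a, d_b))


def js_divergence(d_a, d_b):
    """Jensen-Shannon divergence in nats (log base is an assumption)."""
    mix = [0.5 * (a + b) for a, b in zip(d_a, d_b)]

    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    return 0.5 * kl(d_a, mix) + 0.5 * kl(d_b, mix)


# Made-up three-step trace for illustration.
metrics = trace_metrics(prices=[100, 120, 140], sales=[2, 3, 1])
d_a = bucket_distribution([100, 100, 120, 140], grid=[100, 120, 140])
d_b = [0.25, 0.25, 0.5]
print(metrics, l1_distance(d_a, d_b), js_divergence(d_a, d_b))
```
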
Baselines / experiments:
- Value-based DQN (n-step, γ=1.0) trained on a gross-revenue or RevPAR reward.
- CMDP-style undercut penalty: add a sold-unit undercut cost c_t = max(0, −z_t)^2 · yA,t, with a dual update on λ.
- Forecast-as-input: supervised forecast head for πB added to DQN inputs.
- Deterministic copy baseline: predict πB with a supervised model and choose the argmax action.
Trace-Prior RL training:
- Train fϕ(o_t) to predict B's action distribution via cross-entropy on rollout traces; freeze πM = fϕ.
- Train a stochastic policy πθ with per-step reward r_t = pA,t yA,t / Q − β·DKL(πθ(·|o_t) || πM(·|o_t)).
- β was tuned in sensitivity runs: β = 0 reduces to reward-only RL, which fails, while β ≈ 30 produced alignment in the experiments.
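
A minimal sketch of the per-step regularized reward described in the training steps above. The function names and the example distributions are illustrative assumptions; the formula itself follows the summary's r_t = pA,t yA,t / Q − β·DKL(πθ || πM).

```python
import math


def kl_divergence(pi_theta, pi_m, eps=1e-12):
    """D_KL(pi_theta || pi_m) over the discrete price grid, in nats.

    eps guards against a zero prior probability (an implementation assumption).
    """
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(pi_theta, pi_m) if p > 0)


def regularized_reward(price, sold, pi_theta, pi_m, beta, capacity=100):
    """Per-step revenue per room minus the KL penalty to the frozen prior."""
    return price * sold / capacity - beta * kl_divergence(pi_theta, pi_m)


# Made-up prior over the seven price buckets and two candidate policies.
prior = [0.1, 0.2, 0.4, 0.2, 0.1, 0.0, 0.0]
matched = list(prior)                           # policy equal to the prior
collapsed = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # modal-bucket collapse

r_matched = regularized_reward(140, 3, matched, prior, beta=30)
r_collapsed = regularized_reward(140, 3, collapsed, prior, beta=30)
print(r_matched, r_collapsed)
```

Because the penalty is computed on the full distribution, a policy that collapses onto the modal bucket is penalized even when the sampled action itself is the prior's most likely price. The size of the penalty relative to revenue depends on the reward scale, which is why β has to be tuned per environment.
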
Evaluation:
- Multiple seeds and many evaluation episodes per seed (e.g., 5 seeds with 2,000 or 10,000 evaluation episodes, depending on the table) are used to compute seed-level 95% CIs and compare trace metrics.
- Calibration of πM reported (NLL, Brier, ECE).
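
One way to compute a seed-level 95% confidence interval is sketched below. It assumes a Student-t interval over per-seed means, which the summary does not confirm as the paper's exact procedure, and the per-seed gaps are made-up numbers for illustration.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by number of seeds n (df = n - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776, 10: 2.262}


def seed_level_ci(per_seed_values):
    """Mean and 95% CI treating each seed's average as one observation."""
    n = len(per_seed_values)
    mean = statistics.mean(per_seed_values)
    half_width = T_CRIT_95[n] * statistics.stdev(per_seed_values) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Made-up per-seed RevPAR gaps (Hotel A minus Hotel B), one value per seed.
gaps = [0.3, -0.2, 0.1, -0.1, 0.4]
mean_gap, lo, hi = seed_level_ci(gaps)
within_ci = lo <= 0.0 <= hi  # "matched within seed-level uncertainty"
print(round(mean_gap, 3), round(lo, 3), round(hi, 3), within_ci)
```

Treating the seed, not the episode, as the unit of replication keeps thousands of correlated evaluation episodes from producing an overconfident interval.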
Selected diagnostic statistics (examples from paper):
- Ambiguity: 95.08% of eligible observation cells had ≥2 B actions.
- Oracle qB prediction: improves NLL 0.5359 → 0.1557, accuracy 76.9% → 95.5%.
- Trace-Prior RL result summarized above (RevPAR/Occ/ADR/L1/JS).
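
The ambiguity statistics above can be computed along these lines. This is a sketch under stated assumptions: the coarse observation cell is represented by an arbitrary hashable key, a cell counts as "eligible" when it has at least `min_visits` visits, and entropy is measured in nats. None of these details is specified in the summary.

```python
import math
from collections import Counter, defaultdict


def ambiguity_stats(records, min_visits=2):
    """records: iterable of (cell_key, b_action) pairs from rollouts.

    Returns (share of eligible cells with >= 2 distinct B actions,
             visit-weighted within-cell entropy of B's action).
    """
    cells = defaultdict(Counter)
    for cell, action in records:
        cells[cell][action] += 1
    eligible = [c for c in cells.values() if sum(c.values()) >= min_visits]
    if not eligible:
        raise ValueError("no cell meets the visit threshold")
    total = sum(sum(c.values()) for c in eligible)
    ambiguous = sum(1 for c in eligible if len(c) >= 2)
    weighted_entropy = 0.0
    for counts in eligible:
        n = sum(counts.values())
        entropy = -sum((k / n) * math.log(k / n) for k in counts.values())
        weighted_entropy += (n / total) * entropy
    return ambiguous / len(eligible), weighted_entropy


# Made-up rollout records: (coarse observation cell, Hotel B price).
records = [("a", 140), ("a", 140), ("a", 160), ("a", 160),
           ("b", 120), ("b", 120), ("c", 200)]
share, entropy = ambiguity_stats(records)
print(share, round(entropy, 4))
```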

Implications for AI Economics

Evaluation beyond scalar outcomes: In economic/market applications, optimizing a scalar revenue proxy can produce distributional misalignment with intended market discipline; trace-level diagnostics (occupancy, ADR, full action-distribution distances, seed-level CIs) should be standard evaluation tools.
Partial observability creates distributional targets: When agents compete and cannot observe key competitor states, policy targets are posterior distributions over valid actions. Economic mechanism design and repeated-game learning must account for that—pointwise imitation or greedy optimization can systematically alter market dynamics.
Distributional imitation vs. point prediction: For alignment with observed market behavior, it can be necessary to match an empirical action distribution (posterior predictive) rather than only maximize expected short-term payoff or one-step accuracy. Policy regularization via KL to an empirically learned prior is an actionable tool.
Practical prescriptions for platforms and firms:
- If competent logged traces exist, estimate an action-distribution prior conditional on deployable observations and penalize drift from it while allowing downstream improvement (Trace-Prior style).
- Monitor trace-level quantities (not just aggregate revenue) to detect gaming/Goodhart effects early.
- Be cautious with deterministic argmax-style policies in competitive/partially observed markets—higher per-step prediction accuracy does not guarantee aligned aggregate behavior.
Broader economic consequences:
- Agentic pricing rules that ignore distributional matching can change competitor inventories and future market dynamics, possibly leading to lower welfare, price wars, or systemic shifts in routing of demand across firms—effects that standard reward metrics may not flag.
- The approach suggests policy tools for regulators or platform designers—encouraging or enforcing distributional consistency with historical (or expert) traces could reduce destabilizing strategic undercutting or other emergent gaming.
Limits & research directions:
- Real markets have strategic, noisy, or learning competitors; Trace-Prior RL was validated in a controlled setting with a fixed deterministic competitor. Extensions are needed to stochastic/strategic opponents.
- β is reward-scale and environment dependent; adaptive, uncertainty-aware calibration of the regularizer is an open direction.
- The queueing/game-theory viewpoint (competing finite-capacity pipelines) is promising for theoretical analysis of such learning dynamics in markets.
Takeaway for AI economists: incorporate partial-observability analysis and trace-level distributional alignment into both experimental design and policy deployment decisions. Where the desired behavior is visible in traces but not captured by scalar rewards, learn and regularize to empirical action distributions to prevent Goodhart-style failure modes.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides clear, reproducible simulation evidence and seed-level uncertainty to show failure modes and a successful repair within the simulator, but the results are limited to a stylized two-agent environment with a rule-based competitor and no field validation, so external validity to real-world hotel markets or other economic domains is uncertain.
Methods Rigor: medium. Methods include careful diagnostics (multiple aggregate and trace-level metrics), use of random seeds/confidence intervals, and a principled repair (learning a distributional prior + KL regularization). However, the analysis is confined to a single simulated setting with a specific competitor model and lacks robustness checks across richer market structures, competitor heterogeneity, and field or out-of-sample tests.
Sample: A synthetic two-hotel revenue-management simulator: Hotel A (learning agent) competes with a fixed rule-based Hotel B; observations to Hotel A omit competitor inventory, booking curve, and pricing rule (partial observability); inputs include lagged market traces and discrete price buckets; evaluation uses RevPAR, occupancy, ADR, full price-bucket distributions and L1/JS distances across multiple random seeds.
Themes: governance, adoption
Identification: Controlled simulation experiment: a two-hotel revenue-management simulator where Hotel A is trained under different agent designs (standard deterministic value-based RL and deterministic copying versus Trace-Prior RL) while Hotel B is held fixed as a rule-based competitor; differences in trace and aggregate outcomes are attributed to the training method, using multiple random seeds and trace-level diagnostics (RevPAR, occupancy, ADR, price-bucket distributions, L1/JS distances) to quantify effects.
Generalizability:
- Results are from a stylized two-agent simulator and may not extend to multi-agent markets with many competitors.
- Competitor is a fixed rule-based agent; outcomes could differ against adaptive or strategic competitors.
- Simplified demand and booking dynamics may not capture real-world heterogeneity (seasonality, customer segmentation, channel effects).
- Assumes availability of lagged market traces to learn priors, which may not hold in data-poor settings.
- Hyperparameter sensitivity and scalability to larger action/state spaces are not fully explored.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets.	Firm Revenue	mixed	high	RevPAR (revenue per available room) and pricing behavior (aggressiveness, undercutting, modal price buckets)	0.18
This failure is a Goodhart-style failure under partial observability: Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices.	Output Quality	negative	high	policy robustness / correctness under partial observability (mapping from observed state to competitor price)	0.18
Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior.	Output Quality	negative	high	policy action distribution / pricing choices (shortcut behavior)	0.18
We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals.	Other	positive	high	trace-alignment diagnostics (RevPAR, occupancy, ADR, price-bucket distributions, L1/JS distances, seed-level CIs)	0.18
Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior.	Firm Revenue	positive	high	policy training objective and resulting policy stochasticity / distributional alignment	0.18
The final (Trace-Prior RL) policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A's own reward.	Firm Revenue	positive	high	RevPAR, occupancy, ADR, price distribution (alignment to Hotel B within seed-level uncertainty)	0.18
The paper's contribution is a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces (not a new optimizer or a hotel-pricing leaderboard).	Other	neutral	high	methodological reproducibility and conceptual framing	0.09
A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.	Output Quality	negative	high	exact action accuracy vs. aggregate trace alignment (distributional match)	0.18