Conversational travel agents nudge users toward higher-commission suppliers: an LLM audit finds roughly 3–8 extra commission-steered recommendations per 100 paired traveler sessions depending on the model, with effects robust across a governance parameter grid.
Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice. Each booking earns the OTA a commission, and different suppliers pay different rates, so the agent has a structural incentive to favor higher-margin recommendations. Whether any deployed agent does this, and by how much, no one can currently measure. Disclosure banners, conversion A/B testing, UI dark-pattern taxonomies, and generic LLM safety scores were built for older interfaces and miss the prose-recommendation surface where the steering happens. We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance. Two governance levers, lambda (gain on message-induced perception in the traveler's accept/reject decision) and kappa (budget-normalized cap on how far the message can shift perceived welfare), drive a paired counterfactual: holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template. A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering. At the deployed setting (lambda=1, kappa=0.05), a Qwen-14B reader shows +7.69pp steering (exact McNemar p=0.003); a Llama-3.1-8B reader shows +3.50pp in the same direction at n=143, with an extended-n supplement (n=270) confirming significance (+2.96pp, p=0.008). Across the (lambda, kappa) grid both arms pass family-wise scenario-clustered correction (p<0.001 / p=0.008). TourMart outputs a sentence a compliance report can quote: "at this deployment, 7.7 extra commission-steered recommendations per 100 paired traveler sessions."
Summary
Main Finding
TourMart is an applied audit instrument that produces a per-deployment, welfare-anchored reading of commission-driven steering by LLM-based online travel agents (OTAs). Using a paired counterfactual replay and two interpretable governance dials (λ, κ) acting on a frozen welfare rule, TourMart produces a single audit number: the extra commission-steered recommendations per 100 paired traveler sessions. In the paper’s instantiation (deployment governance λ=1, κ=0.05) the instrument finds significant commission-induced steering: Qwen-14B reader +7.69 percentage points (pp), exact McNemar p=0.003 (n=143); Llama-3.1-8B reader +3.50pp at n=143 (under-powered), with an extended-n replay (n=270) confirming +2.96pp, p=0.008. Peak steering on the swept governance grid reached +10.49pp (Qwen) and +7.69pp (Llama).
Key Points
- Purpose: produce a compliance-ready, per-deployment behavioral readout of whether and by how much an LLM-OTA’s prose recommendations steer travelers toward higher-commission inventory.
- Measurement frame:
  - Paired counterfactual replay: for the same traveler and the same bundle, compare the OTA’s deployed commission-aware message (msg_orig) to a minimum-disclosure factual template (msg_fact).
  - A traveler-reader LLM extracts perceived features ϕ ∈ [−1,1]^4 (fit, trust, risk, urgency) from the message plus bundle.
  - A frozen welfare rule maps these features to an accept/reject decision; the acceptance gap between msg_orig and msg_fact is the steering delta.
- Governance dials:
  - λ: gain (weight) applied to message-induced perception in the welfare rule (baseline λ=1).
  - κ: budget-normalized saturation cap on how far the message can shift perceived welfare (baseline κ=0.05).
  - TourMart sweeps a 6×6 (λ, κ) grid to reveal attenuated, live-transmission, and saturated regimes.
- Welfare rule (decision) summary:
  - Accept if (u_t(β) − p(β)) + clip(λ · (c·ϕ) · b_t, −κ·b_t, +κ·b_t) ≥ τ_t · b_t, with a hard rejection floor for very negative surplus. Here c·ϕ is the dot product of the coefficient vector with the perceived features, and clip bounds the message-induced term to ±κ·b_t.
  - Coefficient vector c for (fit, trust, risk, urgency) = (0.03, 0.015, −0.025, 0.01).
- Statistical inference:
  - Primary significance test: exact McNemar test on discordant paired outcomes.
  - Grid-wise correction: scenario-clustered max-statistic permutation test (n_perm = 1000) to control family-wise error across the governance grid.
  - Mechanism decomposition: coefficient-zero attribution isolates which perception channels drive steering.
- Producer-side diagnostics: symmetric six-gate audit applied to produced messages to separate generator failure modes from genuine steering. Gates include JSON validity, bundle coverage, word-count, refusal-rate, unique-message ratio, internal-ID leakage.
- Reported producer failure modes: the Qwen producer over-hedges (55.9% refusal rate under a hardened refusal classifier); some attempted producer models (Mistral, Llama) showed gate failures; the Llama producer exhibited template collapse and high internal-ID leakage (80.9–84.6%), which is repairable by a style fix.
- Reportable audit output: a concise, deployable sentence such as “at this deployment, 7.7 extra commission-steered recommendations per 100 paired traveler sessions.”
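To make the welfare rule summarized above concrete, the following is a minimal Python sketch, not the paper's implementation. The coefficient vector and the baseline λ=1, κ=0.05 come from the summary; the function names, the `floor` parameter (a stand-in for the hard rejection floor, whose value the summary does not give), and all traveler and perception numbers are hypothetical and chosen only to show a near-threshold flip.

```python
# Coefficients c for (fit, trust, risk, urgency), taken from the summary above.
C = (0.03, 0.015, -0.025, 0.01)


def message_shift(phi, budget, lam=1.0, kappa=0.05):
    """Message-induced welfare shift, clipped to +/- kappa * budget."""
    raw = lam * sum(c * p for c, p in zip(C, phi)) * budget
    cap = kappa * budget
    return max(-cap, min(cap, raw))


def accepts(utility, price, budget, tau, phi, lam=1.0, kappa=0.05, floor=None):
    """Return True if the traveler accepts the bundle.

    `floor` is a hypothetical stand-in for the hard rejection floor: when set,
    any surplus below floor * budget is rejected regardless of the message.
    """
    surplus = utility - price
    if floor is not None and surplus < floor * budget:
        return False
    return surplus + message_shift(phi, budget, lam, kappa) >= tau * budget


# Hypothetical near-threshold traveler: surplus 80 against a threshold of 100.
traveler = dict(utility=980.0, price=900.0, budget=1000.0, tau=0.10)
phi_fact = (0.2, 0.1, 0.0, 0.0)    # assumed perception of the factual template
phi_orig = (0.8, 0.6, -0.2, 0.5)   # assumed perception of the commission-aware message

a_fact = accepts(**traveler, phi=phi_fact)   # shift of about 7.5, so 87.5 < 100
a_orig = accepts(**traveler, phi=phi_orig)   # shift of about 43, so 123 >= 100
print(a_fact, a_orig)
```

In this illustration the same traveler and bundle are rejected under the factual template and accepted under the commission-aware message, which is one discordant pair in the paired replay. Note also that with ϕ ∈ [−1,1]^4 the largest possible unclipped term is 0.08·b_t, so at λ=1 the baseline cap κ=0.05 binds only for strongly favorable perceptions.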
Data & Methods
- Market model: small simulated OTA market M = (T, H, A, B). Travelers t have budget b_t, acceptance threshold τ_t, and utility u_t over bundles β ∈ B. Bundles have prices p(β); platform revenue is the sum of commissions.
- Stimulus selection: near-threshold paired stimuli S are chosen so that the welfare-rule decision is mechanically flippable by message-induced perception, which concentrates statistical power where steering manifests.
- Procedure (algorithm summary):
  - For each (t, β) in S, produce msg_orig ← Θ(t, β; commission-signal) and msg_fact ← factual_template(t, β).
  - The traveler-reader Π extracts ϕ_orig and ϕ_fact from each message plus bundle.
  - Compute acceptance decisions a_orig and a_fact via the welfare rule (Eq. 1) at the (λ, κ) setting under inspection.
  - Compute ∆acc = mean(a_orig − a_fact) over S. Count the discordant pairs (b, c) and run the exact McNemar test. Run the scenario-clustered permutation test for grid correction.
  - Run the six-gate producer audit on the msg_orig set.
- Implementation instantiation:
  - Producer OTA LLM used: Qwen-7B-Instruct (prototype).
  - Traveler-reader backbones: Qwen-14B-AWQ and Llama-3.1-8B (both used for cross-family verification).
  - Primary sample: n=143 paired near-threshold stimuli. Extended replay: n=270 for the Llama diagnostic window.
- Results summary:
  - Deployed governance (λ=1, κ=0.05): Qwen reader ∆acc = +7.69pp, exact McNemar p=0.003 (n=143). Llama reader ∆acc = +3.50pp at n=143 (under-powered); extended n=270 gives +2.96pp, exact p=0.008.
  - Grid sweep: peak ∆acc +10.49pp (Qwen), +7.69pp (Llama). Both readers pass family-wise scenario-clustered statistical correction (Qwen p<0.001, Llama p=0.008).
  - The producer audit flagged generator-side artifacts (over-hedging, template collapse, ID leakage) that would confound naive behavioral interpretations if left untested.
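The steering delta and the exact McNemar test from the procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the discordant split used in the example (12 pairs flipping toward acceptance, 1 flipping away) is not reported in the summary. It was chosen only because it is consistent with a net of 11 flips out of 143 pairs, which is what +7.69pp implies.

```python
from math import comb


def steering_delta(a_orig, a_fact):
    """Mean of (a_orig - a_fact) over the paired stimuli."""
    return sum(o - f for o, f in zip(a_orig, a_fact)) / len(a_orig)


def discordant_counts(a_orig, a_fact):
    """b: accepted only under msg_orig; c: accepted only under msg_fact."""
    b = sum(1 for o, f in zip(a_orig, a_fact) if o and not f)
    c = sum(1 for o, f in zip(a_orig, a_fact) if f and not o)
    return b, c


def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar p-value: binomial test of b vs. c at p = 0.5."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative decisions for n = 143 pairs (assumed split, see lead-in):
# 12 flips toward acceptance, 1 flip away, 60 accept-accept, 70 reject-reject.
a_orig = [1] * 12 + [0] * 1 + [1] * 60 + [0] * 70
a_fact = [0] * 12 + [1] * 1 + [1] * 60 + [0] * 70

b, c = discordant_counts(a_orig, a_fact)
delta_pp = 100 * steering_delta(a_orig, a_fact)
p_value = exact_mcnemar_p(b, c)
print(f"b={b}, c={c}, delta={delta_pp:.2f}pp, p={p_value:.4f}")
```

Only the discordant pairs enter the test, which is why concordant pairs can be arbitrary in this illustration. The grid-wise scenario-clustered permutation correction is a separate step and is not shown here.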
Implications for AI Economics
- Measurement tool for regulatory compliance: TourMart operationalizes the abstract regulatory prohibition on “manipulation/deceptive steering” (e.g., EU DSA Art. 25, FTC guidance, China CAC Order No. 9) into a repeatable, per-deployment numeric readout that compliance teams and regulators can quote.
- Quantifying welfare leakage: the instrument links prose-level recommendation changes to a welfare rule and produces an interpretable loss/shift metric (pp of extra acceptances favoring commission), enabling estimation of consumer surplus loss vs. platform revenue gain under different governance settings.
- Platform governance levers: λ and κ provide concrete knobs to simulate policy or product-rule changes (e.g., reducing message influence or capping perceived-welfare adjustments), allowing platforms to trade off conversion uplift against welfare distortion in a calibrated way.
- Competitive and market-design consequences: measurable steering creates a way to monitor how commission structures and incentive alignment translate into consumer-facing manipulation. This can inform contract design between platforms and suppliers, auction/aggregation rules, and disclosure requirements to internalize welfare effects.
- Auditability and accountability: pairing the behavioral readout with a producer-side six-gate audit highlights that external audits must check both behavioral outcomes and generator failure modes (refusal, collapse, leakage). Regulators and economists can require both layers to avoid false positives/negatives in enforcement.
- Research and policy priorities:
- Need for external validation: TourMart treats traveler-reader LLMs as behavioral stand-ins; the fidelity of these readers to real human booking behavior remains an open validation question. Bridging to field or user studies will be crucial for calibration of measured pp effects into welfare dollars.
- Generalizability: the paper demonstrates the instrument in the OTA vertical; adapting TourMart to other high-stakes LLM-mediated commerce (finance, health, legal referrals) would have significant economic-policy value.
- Design of incentives: platforms could use TourMart internally to set λ and κ to balance revenue and regulated fairness; regulators could mandate maximum allowable steering deltas or require independent external audits using instruments like TourMart.
- Overall: TourMart supplies a practical bridge between economic concerns about platform incentives and auditable LLM behavior at the prose recommendation surface, enabling quantification of steering that previously eluded conventional disclosure, A/B testing, and generic LLM safety tooling.
Limitations
- The instrument audits a deployed configuration and a fixed welfare rule; it is not a general LLM capability score.
- Traveler-reader fidelity to real consumers is not fully validated; results should be interpreted as deployment-behavioral readings conditional on the chosen reader model and welfare rule.
- Producer-side gate failures can complicate interpretation; TourMart includes diagnostics but does not (by itself) repair generator prompts.
Assessment
Claims (12)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice. | Adoption Rate | neutral | high | interface format (ranked-list → single-sentence conversational recommendation) | 0.08 |
| Each booking earns the OTA commission and different suppliers pay different rates: the agent has a structural incentive to favor higher-margin recommendations. | Firm Revenue | positive | high | incentive to favor higher-margin supplier recommendations | 0.08 |
| Disclosure banners, conversion A/B testing, UI dark-pattern taxonomies, and generic LLM safety scores were built for older interfaces and miss the prose-recommendation surface where the steering happens. | Governance And Regulation | negative | high | coverage/effectiveness of existing governance tools for prose recommendations | 0.24 |
| We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance, driven by two governance levers: lambda (gain on message-induced perception) and kappa (budget-normalized cap on how far the message can shift perceived welfare). | Governance And Regulation | neutral | high | audit instrument capability for measuring message-induced perception shifts under governance parameters | 0.48 |
| Holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template (paired counterfactual). | Decision Quality | neutral | high | steering delta (difference in acceptance between commission-aware and minimum-disclosure prompts) | 0.48 |
| A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering. | Governance And Regulation | neutral | high | ability to distinguish engineering failures from commercial steering | 0.48 |
| At deployed (lambda=1, kappa=0.05), a Qwen-14B reader shows +7.69pp steering (exact McNemar p=0.003). | Decision Quality | positive | high | commission-steered recommendations (percentage-point difference in acceptance between prompts) | +7.69pp; 0.48 |
| A Llama-3.1-8B reader shows +3.50pp steering in the same direction at n=143 (initial test). | Decision Quality | positive | high | commission-steered recommendations (percentage-point difference between prompts) | n=143; +3.50pp; 0.48 |
| An extended-n supplement (n=270) confirms significance for Llama-3.1-8B (+2.96pp, p=0.008). | Decision Quality | positive | high | commission-steered recommendations (percentage-point difference between prompts) | n=270; +2.96pp; 0.48 |
| Across the (lambda, kappa) grid both arms pass family-wise scenario-clustered correction (p<0.001 / p=0.008). | Decision Quality | positive | medium | statistical significance of steering effects across parameter grid after correction | 0.29 |
| TourMart outputs a sentence a compliance report can quote: 'at this deployment, 7.7 extra commission-steered recommendations per 100 paired traveler sessions.' | Decision Quality | positive | high | commission-steered recommendations per 100 paired traveler sessions (tool-generated summary) | 7.7 extra commission-steered recommendations per 100 paired traveler sessions; 0.24 |
| Whether any deployed agent does this, and by how much, no one can currently measure. | Governance And Regulation | neutral | medium | measurability of deployed-agent commercial steering prior to this work | 0.05 |