The Commonplace
Direction, evidence grade, and study type are AI-generated labels (gpt-5-mini), not human-verified. Syntheses are LLM-written. "Tensions" are machine-detected candidates, not confirmed contradictions. A research-acceleration tool, not peer review.

AI assistants favor highly rated, lower-priced hotels, and even list position (a content-free artifact) shifts recommendations by an amount worth roughly $12 per night; eco-labels get outsized weight while manager replies are largely disregarded.

Whose hotel does the AI recommend? An algorithm audit of reputation signals in LLM-assisted hotel selection
Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Asher Ali · June 15, 2026
arXiv · RCT · high evidence · 8/10 relevance
A randomized conjoint audit across twelve LLM assistants finds guest rating and price strongly drive hotel recommendations, list position causally shifts recommendations (worth about $12/night), eco-certification is overweighted, and management responses are ignored.

Travelers increasingly ask large language model (LLM) assistants which hotel to book, making these systems gatekeepers of property visibility -- yet what moves their recommendations is undocumented. We conduct a pre-specified algorithm audit using a randomized choice-based conjoint: across personas, prompt templates, and twelve open-weight and proprietary models, assistants choose among five hotels whose guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, and list position are independently randomized. We estimate the average marginal component effect of each signal on the probability of recommendation. Guest rating and price dominate (a top rating raises selection by 31.6 percentage points; a high price lowers it by 30.0), reproducing human valence-and-price primacy but over-weighting eco-certification and ignoring management response. List position -- a content-free artifact -- shifts recommendations causally, worth about $12 per night. Stated reasons track revealed weights imperfectly. The findings ground generative engine optimization and the accountability of AI infomediaries in causal evidence.

Summary

Main Finding

LLM travel assistants select hotels largely the way humans do on the headline margins—guest rating and price dominate recommendations—but they reweight some reputation cues and introduce a measurable positional distortion. Concretely (pooled across 12 models): a top guest rating raises recommendation probability by 31.6 percentage points, a high price lowers it by 30.0 pp, eco-certification adds +11.6 pp, review volume +8.3 pp, chain affiliation −1.8 pp (small penalty), and visible management response has no detectable effect (+0.1 pp). Independently randomized list position causally shifts recommendations; the first slot is worth about $12/night on average (with substantial model heterogeneity). Models’ stated reasons correlate with but imperfectly reflect the causal weights (Spearman ρ ≈ 0.59–0.85). Results are robust across personas, prompt paraphrases, decoding temperatures, and layout/order variants.
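One way to read a dollar-equivalent figure such as the first-slot value is as a ratio of two effects: the attribute's effect in percentage points divided by the price effect per dollar. The sketch below illustrates only that arithmetic. The linear-in-price assumption, the price gap between the low and high levels, and the position effect in percentage points are all hypothetical placeholders, since this summary does not report them; only the −30.0 pp high-price effect comes from the text above. It is not the paper's own conversion.

```python
def price_equivalent(attr_amce_pp, price_amce_pp, price_gap_dollars):
    """Express an attribute's AMCE as dollars per night.

    Assumption (not from the paper): recommendation probability changes
    linearly with price across the gap between the low and high levels.
    """
    if price_amce_pp == 0:
        raise ValueError("price effect must be non-zero")
    # attr effect (pp) divided by price effect per dollar (pp / $),
    # written as a single product-then-divide to keep the arithmetic exact.
    return attr_amce_pp * price_gap_dollars / abs(price_amce_pp)


HIGH_PRICE_AMCE_PP = -30.0           # reported pooled effect of a high price
HYPOTHETICAL_PRICE_GAP = 150.0       # assumed $ gap between low and high price
HYPOTHETICAL_POSITION_AMCE_PP = 2.0  # assumed first-slot effect, in pp

dollars = price_equivalent(
    HYPOTHETICAL_POSITION_AMCE_PP, HIGH_PRICE_AMCE_PP, HYPOTHETICAL_PRICE_GAP
)
print(f"Illustrative first-slot value: ${dollars:.2f} per night")
```

With different assumed inputs the dollar figure changes proportionally, which is why a shared convention for the price gap matters when comparing such equivalents across studies.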

Key Points

  • Primary drivers
    • Guest rating (valence) and price are the dominant causal determinants of an LLM’s hotel recommendation, mirroring human eWOM ordering on these two margins.
    • Magnitudes (pooled AMCEs): rating (top) +31.6 pp; high price −30.0 pp.
  • Secondary and muted signals
    • Eco-certification is materially rewarded by LLMs (+11.6 pp), more than typical human benchmarks suggest.
    • Review volume gives a positive but smaller boost (+8.3 pp).
    • Management response—commonly promoted in optimization advice—shows no detectable effect.
    • Brand/chain affiliation has a negligible-to-slightly-negative effect overall (−1.8 pp), and its effect varies by persona.
  • Position bias
    • List position, a content-free artifact, causally changes recommendations; pooled first-slot advantage ≈ $12/night (≈ a tenth of a rating step).
    • Position effects vary widely across models—some show much larger order sensitivity.
  • Persona heterogeneity
    • Price sensitivity is highest for the budget-family persona.
    • Eco-certification is most rewarded for the eco-conscious persona.
    • Chain effects vary by persona: the pooled chain penalty is concentrated in some personas and can vanish for business travelers.
  • Stated vs revealed reasons
    • Models’ textual reasons partially track revealed causal weights (positive correlations), but they over-mention brand and underreport positional/list-order influences.
  • Robustness and pre-specification
    • Design, estimands and hypotheses (H1–H12) were pre-registered/hashed. Findings hold across nine prompt paraphrases, two decoding temperatures (0.0, 0.7), card/snippet layouts, and field orders.
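The stated-versus-revealed comparison in the points above can be sketched as a Spearman rank correlation between how often each attribute is named in rationales and the size of its measured effect. In this minimal sketch the AMCE magnitudes are the pooled values reported above, while the mention shares are hypothetical numbers invented for illustration (with brand deliberately over-mentioned). The helper functions are a plain-Python stand-in, not the authors' code.

```python
def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


attributes = ["rating", "price", "eco", "volume", "chain", "mgmt_response"]
# Absolute pooled AMCEs in percentage points, as reported in the summary.
amce_magnitude = [31.6, 30.0, 11.6, 8.3, 1.8, 0.1]
# HYPOTHETICAL share of rationales naming each attribute (illustration only).
mention_share = [0.90, 0.80, 0.30, 0.20, 0.40, 0.05]

rho = spearman(amce_magnitude, mention_share)
print(f"Spearman rho (illustrative): {rho:.2f}")
```

A rho below 1 here comes entirely from the assumed over-mention of chain affiliation relative to its small measured effect.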

Data & Methods

  • Audit design
    • Pre-specified, randomized choice-based conjoint (algorithm audit) presenting an assistant with five synthetic hotel cards per choice set.
    • Independently randomized attributes per card: guest rating, review volume, review recency, management response (visible or not), chain affiliation (brand vs independent), nightly price (low/medium/high), Green Key eco-certification, plus list position randomized independently.
    • Each model asked to (a) pick one recommended hotel from the five and (b) state a reason.
  • Estimand and analysis
    • Primary estimand: Average Marginal Component Effect (AMCE) of each attribute on P(recommendation), reported in percentage points.
    • Secondary specs: conditional logit models, price-equivalent and rating-step-equivalent conversions, interaction tests (e.g., rating×volume, volume×recency), multiple-comparison corrections, equivalence tests for null effects.
  • Experimental scope and pre-registration
    • Focus limited to the selection (gatekeeping) stage among a fixed candidate set (not retrieval or multi-turn negotiation).
    • Confirmatory hypotheses H1–H12 specified in advance; full code, prompts, and outputs archived.
  • Models, sample size, and variations
    • Panel of 12 models: 4 open-weight LLMs run locally (Llama-3.2-3B, Qwen-2.5-3B, Phi-3-mini, Mixtral-8×7B) and 8 proprietary API-access models (OpenAI GPT-4o-mini; Google Gemini 1.5 Pro, 2.0 Flash, 2.0 Pro; four Anthropic Claude variants).
    • Three traveler personas, nine prompt paraphrases, two decoding temperatures.
    • 3,024 main-arm choice sets per model; >60,000 model calls total.
  • Key robustness checks
    • Persona-conditioned interactions, model fixed effects, prompt paraphrase and temperature robustness, layouts and field order checks.
    • Stated-versus-revealed comparison via Spearman rank correlations between attributes named in rationales and AMCE magnitudes.
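The design and estimand above can be sketched end to end: draw five cards with independently randomized attributes, record which card is chosen, and estimate each AMCE as a difference in selection rates between an attribute level and its baseline (a standard estimator when attributes are randomized independently and uniformly). Everything concrete below is an assumption for illustration: `toy_assistant` is a hypothetical noisy scorer standing in for a real model call, its weights are made up, and the attribute levels are simplified.

```python
import random

RATING = ["low", "mid", "top"]
PRICE = ["low", "medium", "high"]


def random_card(rng):
    """Draw every attribute independently and uniformly at random."""
    return {
        "rating": rng.choice(RATING),
        "price": rng.choice(PRICE),
        "eco": rng.choice([False, True]),
        "mgmt_response": rng.choice([False, True]),
    }


def toy_assistant(cards, rng):
    """HYPOTHETICAL stand-in for an LLM call; returns the chosen card index."""
    def utility(i):
        card = cards[i]
        u = {"low": 0.0, "mid": 1.0, "top": 2.0}[card["rating"]]
        u -= {"low": 0.0, "medium": 0.7, "high": 1.8}[card["price"]]
        u += 0.6 if card["eco"] else 0.0
        u += 0.2 if i == 0 else 0.0  # mild first-slot bias
        # mgmt_response is deliberately ignored by this toy chooser.
        return u + rng.gauss(0.0, 1.0)
    return max(range(len(cards)), key=utility)


def run_audit(n_sets, seed, n_cards=5):
    """Simulate choice sets; return one record per card shown."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_sets):
        cards = [random_card(rng) for _ in range(n_cards)]
        pick = toy_assistant(cards, rng)
        for pos, card in enumerate(cards):
            records.append({**card, "position": pos, "chosen": pos == pick})
    return records


def amce_pp(records, attr, level, baseline):
    """Difference in selection rates (level minus baseline), in percentage points."""
    def rate(value):
        picked = [r["chosen"] for r in records if r[attr] == value]
        return sum(picked) / len(picked)
    return 100.0 * (rate(level) - rate(baseline))


records = run_audit(n_sets=4000, seed=0)
contrasts = [
    ("rating", "top", "low"),
    ("price", "high", "low"),
    ("eco", True, False),
    ("mgmt_response", True, False),
    ("position", 0, 4),
]
for attr, level, base in contrasts:
    print(f"{attr:14s} {amce_pp(records, attr, level, base):+6.1f} pp")
```

Because list position is simply the index of an independently drawn card, it is uncorrelated with the card's content by construction, which is what lets a position contrast be read causally. In a real audit, `toy_assistant` would be replaced by a prompt to each model under test.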

Implications for AI Economics

  • Market visibility and distributional effects
    • LLMs acting as gatekeepers can create concentrated, asymmetric visibility effects: small differences in rating, price, eco-cert status, or list position translate into large swings in recommendation probability.
    • Position bias (priced at ≈ $12/night pooled) implies platform/UI design choices and internal ranking heuristics can shift revenues substantially across suppliers; this amplifies the economic importance of platform placement even within conversational interfaces.
  • Strategic response and “generative engine optimization”
    • Hotels and intermediaries will rationally adapt (and already are adapting) to these causal weights. The evidence suggests prioritizing actions that increase observable rating and manage price signals; eco-certification investments may yield higher returns in LLM-mediated channels than expected from human-only benchmarks.
    • Optimization advice should be grounded in measured causal weights: the audit shows some industry-recommended levers (e.g., management response) may be ineffective for LLM recommendation.
  • Platform accountability, transparency, and regulation
    • Models’ stated reasons are only an imperfect proxy for what drives outcomes; “explainability” claims that rely on textual rationales risk misrepresenting actual decision drivers.
    • Quantified position-equivalents give regulators and platform designers concrete metrics to assess and mitigate order-induced distortions (e.g., counterfactual re-ranking, mandatory disclosure of selection criteria, randomized auditing).
  • Welfare and competitive concerns
    • Reweighting of eco-certification and relative neglect of management response/brand could reallocate demand across hotel types in ways not aligned with consumer welfare or producer expectations.
    • The gatekeeping power of LLMs makes entry and survival dynamics more sensitive to platform-mediated signals, potentially increasing incentives for gaming, manipulation of visible metadata, or paid placement-like arrangements within chat interfaces.
  • Research and policy agenda for AI economics
    • Need for more causal audits across additional stages (retrieval, multi-turn negotiation, booking conversion) and more diverse models/platforms to map end-to-end market effects.
    • Standardize price-equivalent and rating-equivalent metrics for cross-study comparability so economists and managers can translate algorithmic biases into economic magnitudes.
    • Consider regulatory standards for verifiable auditing, required disclosure of selection logic, and limits on positional advantage in conversational commerce.
  • Practical takeaways for stakeholders
    • Hotels: invest first in improving average guest rating and understand price-position trade-offs; eco-certification may be a higher-return lever in LLM channels than in traditional channels.
    • Platforms: mitigate position bias, surface transparent selection criteria, and monitor model heterogeneity to avoid large, model-specific allocation shocks.
    • Regulators and economists: treat LLMs as consequential, opaque intermediaries; prioritize causal audits to inform policy and competitive assessments.

Limitations to keep in mind: the audit isolates the selection among fixed candidates (not retrieval), uses synthetic personas and design choices fixed in time, and evaluates a snapshot of models available during the study period—effects may evolve with model updates, platform integration, or different prompt/UX conventions.

Assessment

Paper Type: RCT
Evidence Strength: high. The study uses a randomized experimental design (conjoint) with pre-specification, which identifies causal effects of hotel attributes (including list position) on model recommendations; effects are estimated across multiple proprietary and open-weight models, increasing credibility for the domain studied.
Methods Rigor: high. Pre-specified audit, independent randomization of multiple attributes, use of choice-based conjoint and AMCE estimation, and testing across twelve distinct LLMs and prompt/persona variations constitute a rigorous and transparent approach; primary limitations are external/ecological validity rather than internal validity.
Sample: Algorithm audit using a randomized choice-based conjoint presented to twelve large language model assistants (a mix of open-weight and proprietary models), varying personas and prompt templates; each task asked the assistant to choose among five hotels whose attributes (guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, list position) were independently randomized.
Themes: governance, adoption
Identification: Pre-registered randomized choice-based conjoint: across personas and prompt templates, five hotel options were presented with independently randomized attributes (guest rating, review volume/recency, management response, chain affiliation, price, eco-certification, list position) to twelve LLM assistants; causal effects estimated as average marginal component effects (AMCEs) of each attribute on selection probability.
Generalizability:
  • Limited to the twelve models tested; other LLMs or future model updates may behave differently.
  • Prompts and personas chosen by the researchers may not cover full real-world user diversity or query framing.
  • Recommendations measured are assistant outputs, not observed booking behavior or realized revenue; translation to consumer actions is assumed but not measured.
  • Single domain (hotels) and a limited set of attributes; results may not generalize to other product categories or marketplaces.
  • Snapshot in time: proprietary models update frequently, so effects may change.

Claims (8)

Claim | Direction | Outcome | Confidence & Evidence | Details
We conduct a pre-specified algorithm audit using a randomized choice-based conjoint: across personas, prompt templates, and twelve open-weight and proprietary models, assistants choose among five hotels whose guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, and list position are independently randomized. Adoption Rate other assistant hotel choice / recommendation
Reading fidelity high
Study strength high
1.0
Guest rating strongly increases the probability an assistant recommends a hotel: a top rating raises selection by 31.6 percentage points. Adoption Rate positive probability of recommendation
Reading fidelity high
Study strength high
31.6 percentage points
1.0
High price strongly decreases the probability an assistant recommends a hotel: a high price lowers selection by 30.0 percentage points. Adoption Rate negative probability of recommendation
Reading fidelity high
Study strength high
30.0 percentage points
1.0
List position — a content-free artifact — causally shifts recommendations and has an effect equivalent to about $12 per night. Adoption Rate positive probability of recommendation (expressed as monetary equivalent)
Reading fidelity high
Study strength high
$12 per night
1.0
Assistants over-weight eco-certification relative to other signals (i.e., eco-certification has an outsized positive effect compared to expectations from human valence-and-price primacy). Adoption Rate positive probability of recommendation
Reading fidelity medium
Study strength medium
0.36
Assistants ignore management response: management response does not affect recommendation probability. Adoption Rate null_result probability of recommendation
Reading fidelity high
Study strength high
1.0
Overall, guest rating and price dominate other attributes in driving assistant recommendations, reproducing human valence-and-price primacy. Adoption Rate positive probability of recommendation
Reading fidelity high
Study strength high
1.0
Stated reasons provided by assistants only imperfectly track the revealed (experimentally measured) weights the assistants place on attributes. Decision Quality mixed congruence between stated reasons and revealed weights
Reading fidelity medium
Study strength medium
0.36

Notes