
AI assistants tailor CRM recommendations to who they think is asking: adding a buyer persona cuts overlap in recommended brands by about 12–20 percentage points and can replace up to 75% of mid-market suggestions, while market leaders stay largely unchanged; models that rely less on observable retrieval traces show larger persona sensitivity.

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu · May 28, 2026

arXiv · quasi_experimental · medium evidence · 7/10 relevance

Prefixing queries with a buyer persona materially changes AI assistants' CRM brand recommendations, reducing recommendation-set similarity by 12–20 percentage points. Mid-market brands are swapped most often while category leaders remain stable, and Anthropic's model shows a larger point estimate than the OpenAI configurations.

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.

Summary

Main Finding

Prefixing buyer queries with different persona descriptions meaningfully changes which brands commercial chat AIs recommend. Across three model cells (OpenAI gpt-5.4-mini low/high; Anthropic claude-sonnet-4-6 low) the authors find a persona-shift effect size Δ = (cross-persona Jaccard − within-persona Jaccard) between −0.12 and −0.20 (clustered 95% CIs exclude zero). The effect is highly prominence-stratified: category leaders (L1) are largely persona-resistant (~80% same-brand consistency), but mid-market (L3) brands can swap at very high rates (up to 75% on some cells).

Key Points

Headline magnitude: Δ ≈ −0.12 to −0.20 (negative means changing persona reduces overlap vs same-persona baseline). All three clustered CIs exclude zero.
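Sample: 2,000 runs from a crossed design (10 personas × 8 prompts × 3 model cells × N=10 reruns per leaf). The total falls short of the fully crossed 2,400 because the Anthropic cell covers only 4 prompts (flagged as a limitation): 2 OpenAI cells × 10 × 8 × 10 = 1,600 runs, plus 1 Anthropic cell × 10 × 4 × 10 = 400 runs.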
Prominence stratification:
- L1 (category leaders): swap rates ≈ 0.20–0.29 (i.e., ~71–80% consistency across personas).
- L2 (established challengers): mixed — low swap on OpenAI (0.05–0.13) but higher on Anthropic (0.23).
- L3 (mid-market): largest persona effect, swap rates up to 0.75 (strong persona sensitivity).
- L4: undersampled (no reliable estimate).
- L5 (regional/long-tail): moderate swap on OpenAI (0.27–0.33); Sonnet L5 undersampled/missing.
Provider asymmetry: Anthropic sonnet cell shows a larger point estimate of persona responsiveness (Δ ≈ −0.20) than OpenAI cells (Δ ≈ −0.12 to −0.16). CIs overlap for some contrasts; the provider-ranking is a point-estimate signal, not a conclusive ordering.
Suggested mechanism: models that rely more on parametric priors / training-data (retrieval-unattributed generation) appear more persona-responsive. Prior work cited in the paper reports Anthropic showing higher retrieval-unattributed shares (43–52%) vs OpenAI (8–29%), consistent with the observed asymmetry.
Stability anchoring: within-persona Jaccards were 0.42–0.51 (N=10), slightly below a larger rerun-stability baseline (0.50–0.61, from prior work), but the persona Δ is larger than expected rerun noise.
Consensus extraction: the brands recommended in each run are identified as the intersection of two LLM judges' outputs (claude-haiku-4-5 low and gpt-5-mini), a conservative choice that counts a brand only when both judges mark it.
Persona heterogeneity: personas fall into “sharp/concentrating” (within-persona Jaccard ≥ 0.5) and “broad/scattering” (≤ 0.4); recommendation diversity scales with persona specificity.

Data & Methods

Design:
- Persona corpus: 10 hand-curated personas covering industry vertical, company size, role, geography (examples: solo_founder_us_bootstrapped, enterprise_vp_us_procurement, uk_smb_owner_london).
- Prompts: 8 commercially framed prompts focused on B2B SaaS and related sectors (the OpenAI cells sampled all 8; the Anthropic sonnet cell sampled 4).
- Model cells: gpt-5.4-mini / low; gpt-5.4-mini / high; claude-sonnet-4-6 / low.
- Crossed sampling: personas × prompts × cells × 10 reruns per leaf → 2,000 runs total.
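The run count implied by this design can be checked with a short enumeration. This is a minimal sketch: the persona and prompt identifiers are placeholders, and taking the first four prompts for the sonnet cell is an assumption made only for illustration, since the summary does not say which four prompts were sampled.

```python
from itertools import product

# Placeholder identifiers; the real persona and prompt names are not listed in full here.
personas = [f"persona_{i}" for i in range(10)]
prompts = [f"prompt_{i}" for i in range(8)]

# Number of prompts covered by each model cell, as stated in the design.
prompt_coverage = {
    "gpt-5.4-mini/low": 8,
    "gpt-5.4-mini/high": 8,
    "claude-sonnet-4-6/low": 4,
}
RERUNS_PER_LEAF = 10


def enumerate_runs():
    """List every (cell, persona, prompt, rerun) combination in the design."""
    runs = []
    for cell, n_prompts in prompt_coverage.items():
        # Assumption: the covered prompts are simply the first n_prompts.
        covered = prompts[:n_prompts]
        for persona, prompt, rerun in product(personas, covered, range(RERUNS_PER_LEAF)):
            runs.append((cell, persona, prompt, rerun))
    return runs


runs = enumerate_runs()
runs_per_cell = {cell: sum(1 for r in runs if r[0] == cell) for cell in prompt_coverage}
print(len(runs), runs_per_cell)
```

The two OpenAI cells contribute 800 runs each and the sonnet cell 400, which is how a nominal 10 × 8 × 3 × 10 design yields 2,000 runs rather than 2,400.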
Outcome construction:
- Brand mentions parsed by two LLM judges; a brand is counted only if both judges mark it (intersection consensus).
- Consensus recommendation set per (persona, prompt, cell) is the union across the 10 reruns.
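The two-step outcome construction (intersection across judges within a run, then union across reruns within a leaf) can be sketched as follows. The function names and the toy brand labels are hypothetical, and the sketch assumes the judges already return normalized brand names; how the paper normalizes names is not described here.

```python
def consensus_brands(judge_a, judge_b):
    """Brands that both judges marked in a single run (intersection consensus)."""
    return set(judge_a) & set(judge_b)


def leaf_recommendation_set(judged_runs):
    """Union of per-run consensus sets across the reruns of one (persona, prompt, cell) leaf.

    judged_runs is a list of (judge_a_brands, judge_b_brands) pairs, one per rerun.
    """
    result = set()
    for judge_a, judge_b in judged_runs:
        result |= consensus_brands(judge_a, judge_b)
    return result


# Toy data with made-up brand labels: three reruns of one leaf.
judged_runs = [
    (["Alpha", "Beta"], ["Alpha", "Beta", "Gamma"]),  # Gamma is dropped: only one judge marked it
    (["Alpha", "Delta"], ["Alpha", "Delta"]),
    (["Beta"], ["Epsilon"]),                          # no agreement, so this run contributes nothing
]
leaf_set = leaf_recommendation_set(judged_runs)
print(sorted(leaf_set))
```

The intersection step trades recall for precision: a brand that only one judge extracts never enters the set, while the union step lets a brand appear if any single rerun surfaces it with both judges in agreement.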
Metrics:
- Within-persona Jaccard: mean Jaccard between two independent halves of the reruns for the same persona (stability reference).
- Cross-persona Jaccard: mean Jaccard across persona pairs for the same prompt and cell.
- Persona-shift Δ = cross − within (negative values indicate more swap when persona changes).
- Prominence-tier metric: per-tier swap rate reported as 1 − cross-persona Jaccard for tier ℓ (per-tier within-persona Jaccards were too noisy for a double difference).
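A minimal sketch of these metrics on toy data may help fix the definitions. Several details are assumptions rather than the authors' procedure: the split-half uses a fixed first-half/second-half split, two empty sets are treated as identical, and the per-tier swap rate restricts each persona's set to the tier's brands before computing Jaccard. Brand and persona labels are made up.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two brand sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def within_persona_jaccard(rerun_sets):
    """Split-half stability: Jaccard between the unions of two halves of the reruns."""
    half = len(rerun_sets) // 2
    first = set().union(*rerun_sets[:half])
    second = set().union(*rerun_sets[half:])
    return jaccard(first, second)


def cross_persona_jaccard(sets_by_persona):
    """Mean Jaccard over all persona pairs for one prompt and cell."""
    pairs = combinations(sets_by_persona.values(), 2)
    return mean(jaccard(a, b) for a, b in pairs)


def tier_swap_rate(sets_by_persona, tier_brands):
    """1 − cross-persona Jaccard, computed on sets restricted to one prominence tier."""
    restricted = {p: s & tier_brands for p, s in sets_by_persona.items()}
    return 1.0 - cross_persona_jaccard(restricted)


# Toy reruns for two personas on one prompt and cell.
reruns = {
    "solo_founder": [{"Alpha", "Beta"}, {"Alpha", "Gamma"}, {"Alpha", "Beta"}, {"Alpha", "Gamma"}],
    "enterprise_vp": [{"Alpha", "Delta"}, {"Alpha", "Delta"}, {"Alpha", "Delta"}, {"Alpha", "Epsilon"}],
}
union_sets = {p: set().union(*r) for p, r in reruns.items()}

within = mean(within_persona_jaccard(r) for r in reruns.values())
cross = cross_persona_jaccard(union_sets)
delta = cross - within  # negative: changing persona lowers overlap

leader_swap = tier_swap_rate(union_sets, {"Alpha"})
midmarket_swap = tier_swap_rate(union_sets, {"Beta", "Gamma", "Delta", "Epsilon"})
print(round(within, 3), round(cross, 3), round(delta, 3), leader_swap, midmarket_swap)
```

In this toy case the shared leader never swaps while the remaining brands swap completely, an exaggerated version of the prominence stratification the audit reports.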
Statistical inference:
- Prompt-clustered bootstrap (1,000 iterations) used for cell-level Δ CIs (cluster unit = prompt).
- Flagged undersampled cells/tiers (notably L4 and some L5/sonnet cells) and refrained from reporting CIs where sample sizes were insufficient.
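The prompt-clustered bootstrap can be sketched as below. Resampling whole prompts with replacement follows the description above; the percentile interval, the fixed seed, and the toy Δ values are assumptions for illustration and are not the paper's data or necessarily its exact procedure.

```python
import random
from statistics import mean


def clustered_bootstrap_ci(delta_by_prompt, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for mean Δ, resampling prompts (clusters) with replacement."""
    rng = random.Random(seed)
    prompts = list(delta_by_prompt)
    boot_means = []
    for _ in range(n_boot):
        # Draw as many prompts as there are clusters, with replacement.
        sample = [rng.choice(prompts) for _ in prompts]
        # Every observation in a drawn prompt is carried along with its cluster.
        values = [v for p in sample for v in delta_by_prompt[p]]
        boot_means.append(mean(values))
    boot_means.sort()
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy per-prompt Δ observations (illustrative only).
toy_deltas = {
    "prompt_0": [-0.10, -0.14],
    "prompt_1": [-0.18, -0.12],
    "prompt_2": [-0.09, -0.11],
    "prompt_3": [-0.20, -0.16],
}
ci_low, ci_high = clustered_bootstrap_ci(toy_deltas)
print(ci_low, ci_high)
```

With only four clusters, as in the sonnet cell, each resample can draw from very few distinct prompts, which is one way to see why that cell's interval is wider than those of the 8-prompt cells.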
Limitations noted by authors:
- Anthropic cell covered only 4 prompts (wider CI).
- No within-run joint measurement of retrieval-vs-priors for causal mediation.
- Small, hand-curated persona corpus (trades breadth for control).
- Some prominence tiers undersampled (esp. L4).

Implications for AI Economics

Measurement and research practices:
- Studies that estimate brand perception, market-share proxies, or recommendation-market shares from LLM outputs must condition on buyer persona. Aggregating across personas will systematically obscure meaningful heterogeneity.
- Audit protocols should include prominence stratification and a rerun-stability anchor; consensus extraction (intersection of judges) reduces false positives in brand detection.
Firm strategy and positioning:
- Brands should expect differential exposure across buyer personas. Category leaders are relatively robust; mid-market brands are most susceptible to persona-driven substitution and thus face both risk and targeting opportunity.
- Companies should craft persona-specific positioning and content for LLM-facing touchpoints (e.g., chat assistants, knowledge-bases, product copy presented to AI) rather than relying on a single canonical “best X” page.
Platform and product design:
- Provider choices (retrieval-heavy vs. priors-heavy generation) influence persona sensitivity. Architectures or settings that increase reliance on parametric priors may amplify persona-conditioned variation in recommendations — a design lever with competitive and regulatory implications.
- Retrieval attribution transparency matters: opaque generations without retrieval evidence can mask whether recommendations are drawn from indexed sources or model priors, complicating auditability and trust.
Market dynamics and competition:
- Persona-driven recommendation heterogeneity can intensify segmentation: mid-market incumbents may face volatile AI-driven top-of-consideration shifts as models adapt to persona cues, potentially altering discoverability and competitive dynamics.
- If stronger context utilization by models systematically advantages certain brand types (e.g., those better aligned to persona signals), this could affect advertising and SEO strategies and induce firms to optimize for persona-specific cues.
Policy and regulation:
- Regulators and platform auditors should require persona-conditioned audits when assessing fairness, competitive effects, or deceptive steering in commercial chat systems. Aggregated audits risk missing persona-specific harms or favoritism.
Suggested next steps for researchers and firms:
- Jointly measure per-run retrieval attribution and persona Δ to test the "priors-driven = more persona-responsive" hypothesis causally.
- Expand persona corpora (while controlling for stereotyping) and provider cells (e.g., sonnet-high, opus family) to firm up provider comparisons.
- Design A/B interventions that manipulate persona prominence or explicitness to see whether persona-responsiveness improves user utility (i.e., whether persona sensitivity is a feature benefiting recommendations versus a source of bias).

Summary takeaway: commercial chat AIs do not give a single, persona-agnostic “best” set of brands. Persona conditioning materially reshapes recommendation sets — especially for mid-market brands — and both researchers and firms must account for that heterogeneity when measuring market signals, designing positioning, or auditing platform behavior.

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. Relatively large sample (≈2,000 runs) with repeated random draws, clear treatment (persona prefix) and clustered CIs give credible evidence of a treatment effect; however scope is limited to one task (CRM recommendations), three model configurations, a finite prompt/persona set, and partial coverage in one model cell (sonnet-4.6), which constrain external validity.

Methods Rigor: medium. Design systematically varies personas, prompts, models and replicates runs, uses an interpretable similarity metric (Jaccard) and clustered inference; weaknesses include uneven coverage across cells (Anthropic limited to 4 prompts), possible non-randomness in prompt selection, no mention of pre-registration or robustness checks to alternative similarity/ranking metrics, and limited measurement of retrieval mechanisms.

Sample: Approximately 2,000 model runs covering a factorial design: 10 buyer personas × 8 prompts × 3 model configurations × 10 repetitions, with two OpenAI cells having full 8-prompt coverage and the Anthropic sonnet-4.6 'low' cell covering 4 prompts; outputs are brand-recommendation sets for a 'best CRM software' query; retrieval-attribution evidence was recorded showing Anthropic produced 43–52% recommendations without observed retrieval-layer evidence versus OpenAI's 8–29% (per Jack 2026).

Themes: adoption, human_ai_collab

Identification: Compare recommendation-set similarity (Jaccard) between runs where the same prompt is prefixed by an explicit buyer persona versus a same-persona baseline, across repeated runs and model configurations; statistical inference uses clustered 95% confidence intervals over prompt clusters and replication.

Generalizability:
- Task-specific: limited to CRM-software recommendation queries, may not generalize to other product categories or economic decisions.
- Model coverage: only two OpenAI configurations and Anthropic sonnet-4.6 tested; other models/versions may differ.
- Prompt and persona scope: 8 prompts and 10 personas; untested prompts or cultural/linguistic contexts might change results.
- Uneven cell coverage: Anthropic cell had only 4 prompts, reducing power and precision for that comparison.
- Measurement limitations: Jaccard similarity on recommendation sets ignores ranking depth, explanation quality, and downstream user behavior.
- Temporal validity: results are tied to specific model versions and training data snapshots and may change over time.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage.	Other	null_result	high	audit sample size and experimental design coverage	n=2000 0.8
Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline.	Adoption Rate	negative	high	recommendation-set similarity (Jaccard index)	n=2000 -0.12 to -0.20 (Jaccard) 0.8
Clustered 95% CIs exclude zero on all three measured cells (the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider).	Adoption Rate	positive	high	statistical significance of persona effect (confidence intervals)	clustered 95% CIs exclude zero 0.48
Category leaders are persona-resistant (~80% same-brand consistency across personas).	Adoption Rate	positive	high	brand consistency across personas (same-brand %)	n=2000 ~80% same-brand consistency 0.48
Mid-market brands swap up to 75% of the recommendation set as the persona changes.	Adoption Rate	negative	high	proportion of recommendation-set changed for mid-market brands	n=2000 up to 75% swap 0.48
The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high).	Adoption Rate	mixed	high	magnitude of persona-driven recommendation-set change by model	n=2000 0.48
Anthropic's asymmetry is consistent with its more retrieval-unattributed generation route: 43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29% (documented in Jack 2026).	Other	mixed	medium	fraction of recommendations lacking observed retrieval-layer evidence	Anthropic: 43-52%; OpenAI: 8-29% 0.29
Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation.	Adoption Rate	mixed	high	validity of AI brand-perception measurement protocols	n=2000 0.48
The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit.	Adoption Rate	negative	high	concentration of persona effect across brand market segments and generation routes	n=2000 0.48
Persona responsiveness grows as models lean more on training-data priors and richer context integration.	Adoption Rate	positive	medium	relationship between model reliance on priors/context and persona responsiveness	0.05