AI assistants tailor CRM recommendations to who they think is asking: adding a buyer persona cuts overlap in recommended brands by about 12–20 percentage points and can replace up to 75% of mid-market suggestions, while market leaders stay largely unchanged; models that rely less on observable retrieval traces show larger persona sensitivity.
The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
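The run count follows from the unequal prompt coverage rather than from the full crossed grid. A minimal arithmetic sketch, using the cell labels from this summary as dictionary keys (the variable names are illustrative, not the authors'):

```python
# Design constants as stated in the abstract.
PERSONAS = 10
REPS = 10  # reruns per (persona, prompt, cell) leaf

# Prompts actually sampled per model cell: full coverage on the two
# OpenAI cells, 4 of 8 prompts on the Anthropic cell.
PROMPTS_SAMPLED = {
    "gpt-5.4-mini/low": 8,
    "gpt-5.4-mini/high": 8,
    "claude-sonnet-4-6/low": 4,
}

runs_per_cell = {cell: PERSONAS * n * REPS for cell, n in PROMPTS_SAMPLED.items()}
total_runs = sum(runs_per_cell.values())

# For comparison: the size of the grid if every cell covered all 8 prompts.
full_grid = PERSONAS * 8 * len(PROMPTS_SAMPLED) * REPS

print(runs_per_cell)
print(total_runs, full_grid)
```

The gap between the two totals is the 4 prompts not sampled on the Anthropic cell.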
Summary
Main Finding
Prefixing buyer queries with different persona descriptions meaningfully changes which brands commercial chat AIs recommend. Across three model cells (OpenAI gpt-5.4-mini low/high; Anthropic claude-sonnet-4-6 low) the authors find a persona-shift effect size Δ = (cross-persona Jaccard − within-persona Jaccard) between −0.12 and −0.20 (clustered 95% CIs exclude zero). The effect is highly prominence-stratified: category leaders (L1) are largely persona-resistant (~80% same-brand consistency), but mid-market (L3) brands can swap at very high rates (up to 75% on some cells).
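The definition Δ = cross-persona Jaccard − within-persona Jaccard can be sketched for a single (prompt, cell) pair as follows. This is one reading of the metric definitions in this summary, not the authors' code: the within-persona score compares the brand sets of two halves of a persona's reruns, and the cross-persona score compares the union sets of persona pairs. The personas and brand labels are hypothetical placeholders.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity of two brand sets (two empty sets count as identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def within_persona(runs):
    """Similarity between the brand sets of two halves of one persona's reruns."""
    half = len(runs) // 2
    first = set().union(*runs[:half])
    second = set().union(*runs[half:])
    return jaccard(first, second)


def persona_shift(runs_by_persona):
    """Delta = mean cross-persona Jaccard - mean within-persona Jaccard."""
    within = mean(within_persona(runs) for runs in runs_by_persona.values())
    # Consensus set per persona: union of brands across its reruns.
    consensus = {p: set().union(*runs) for p, runs in runs_by_persona.items()}
    cross = mean(
        jaccard(consensus[p], consensus[q]) for p, q in combinations(consensus, 2)
    )
    return cross - within


# Hypothetical toy data: each run is the set of brands recommended in that run.
toy_runs = {
    "solo_founder": [{"A", "B"}, {"A", "C"}, {"A", "B"}, {"A", "C"}],
    "enterprise_vp": [{"A", "D"}, {"A", "D"}, {"A", "D"}, {"A", "E"}],
}
delta = persona_shift(toy_runs)
print(round(delta, 3))
```

A negative value means the recommendation sets overlap less across personas than across reruns of the same persona, which is the direction reported above.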
Key Points
- Headline magnitude: Δ ≈ −0.12 to −0.20 (negative means changing persona reduces overlap vs same-persona baseline). All three clustered CIs exclude zero.
- Sample: 2,000 runs from a crossed design (10 personas × 8 prompts × 3 model cells × N=10 reruns per leaf). The full cross would be 2,400 runs; the Anthropic cell covers only 4 of the 8 prompts (800 + 800 + 400 runs), which is flagged as a limitation.
- Prominence stratification:
  - L1 (category leaders): swap rates ≈ 0.20–0.29 (i.e., ~71–80% consistency across personas).
  - L2 (established challengers): mixed, with low swap on OpenAI (0.05–0.13) but higher on Anthropic (0.23).
  - L3 (mid-market): largest persona effect, swap rates up to 0.75 (strong persona sensitivity).
  - L4: undersampled (no reliable estimate).
  - L5 (regional/long-tail): moderate swap on OpenAI (0.27–0.33); Sonnet L5 undersampled/missing.
- Provider asymmetry: the Anthropic sonnet cell shows a larger point estimate of persona responsiveness (Δ ≈ −0.20) than the OpenAI cells (Δ ≈ −0.12 to −0.16). CIs overlap for some contrasts, so the provider ranking is a point-estimate signal, not a conclusive ordering.
- Suggested mechanism: models that rely more on parametric priors / training-data (retrieval-unattributed generation) appear more persona-responsive. Prior work cited in the paper reports Anthropic showing higher retrieval-unattributed shares (43–52%) vs OpenAI (8–29%), consistent with the observed asymmetry.
- Stability anchoring: within-persona Jaccards were 0.42–0.51 (N=10), slightly below a larger rerun-stability baseline (0.50–0.61, from prior work), but the persona Δ is larger than expected rerun noise.
- Consensus extraction: the recommended brands in each run are the intersection of the outputs of two LLM judges (claude-haiku-4-5 low and gpt-5-mini), a conservative rule for forming consensus recommendation sets.
- Persona heterogeneity: personas fall into “sharp/concentrating” (within-persona Jaccard ≥ 0.5) and “broad/scattering” (≤ 0.4); recommendation diversity scales with persona specificity.
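The per-tier swap rates listed above are defined in this summary as 1 − cross-persona Jaccard within a prominence tier. A minimal sketch of that calculation, assuming a brand-to-tier mapping is available (the mapping, persona names, and brand labels below are hypothetical). Skipping persona pairs with no brands in the tier, and returning `None` when no pair qualifies, is an assumption of this sketch that loosely mirrors the "undersampled" flag.

```python
from itertools import combinations
from statistics import mean


def tier_swap_rate(consensus_by_persona, tier_of, tier):
    """1 - mean cross-persona Jaccard, restricted to brands in one tier.

    Returns None when no persona pair has any brand in the tier.
    """
    scores = []
    for p, q in combinations(consensus_by_persona, 2):
        a = {b for b in consensus_by_persona[p] if tier_of.get(b) == tier}
        c = {b for b in consensus_by_persona[q] if tier_of.get(b) == tier}
        if not a and not c:
            continue  # nothing to compare for this pair in this tier
        scores.append(len(a & c) / len(a | c))
    if not scores:
        return None
    return 1.0 - mean(scores)


# Hypothetical tier assignment and per-persona consensus sets.
toy_tier_of = {"A": "L1", "B": "L3", "C": "L3", "D": "L3", "E": "L3"}
toy_consensus = {
    "solo_founder": {"A", "B", "C"},
    "enterprise_vp": {"A", "D", "E"},
    "uk_smb_owner": {"A", "B", "D"},
}

for tier in ("L1", "L3", "L4"):
    print(tier, tier_swap_rate(toy_consensus, toy_tier_of, tier))
```

In the toy data the single L1 brand appears for every persona, so its swap rate is zero, while the L3 brands rotate with the persona.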
Data & Methods
- Design:
  - Persona corpus: 10 hand-curated personas covering industry vertical, company size, role, and geography (examples: solo_founder_us_bootstrapped, enterprise_vp_us_procurement, uk_smb_owner_london).
  - Prompts: 8 commercially framed prompts focused on B2B SaaS and related sectors (the OpenAI cells sampled all 8; the Anthropic sonnet cell sampled 4).
  - Model cells: gpt-5.4-mini / low; gpt-5.4-mini / high; claude-sonnet-4-6 / low.
  - Crossed sampling: personas × prompts × cells × 10 reruns per leaf → 2,000 runs total.
- Outcome construction:
  - Brand mentions are parsed by two LLM judges; a brand is counted only if both judges mark it (intersection consensus).
  - The consensus recommendation set per (persona, prompt, cell) is the union across the 10 reruns.
- Metrics:
  - Within-persona Jaccard: mean Jaccard between two independent halves of the reruns for the same persona (stability reference).
  - Cross-persona Jaccard: mean Jaccard across persona pairs for the same prompt and cell.
  - Persona-shift Δ = cross − within (negative values indicate more swap when the persona changes).
  - Prominence-tier metric: the swap rate for tier ℓ is reported as 1 − cross-persona Jaccard within tier ℓ (per-tier within-persona Jaccards were too noisy for a double difference).
- Statistical inference:
  - A prompt-clustered bootstrap (1,000 iterations) is used for cell-level Δ CIs (cluster unit = prompt).
  - Undersampled cells and tiers (notably L4 and some L5/sonnet cells) are flagged, and no CIs are reported where sample sizes were insufficient.
- Limitations noted by authors:
  - The Anthropic cell covered only 4 prompts (wider CI).
  - No within-run joint measurement of retrieval vs. priors for causal mediation.
  - Small, hand-curated persona corpus (trades breadth for control).
  - Some prominence tiers undersampled (esp. L4).
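Two of the steps above are simple enough to sketch: the intersection consensus over two judges, and the prompt-clustered bootstrap for a cell-level Δ interval. This is a sketch under stated assumptions, not the authors' implementation: it assumes one Δ value per prompt has already been computed, resamples prompts with replacement, and uses a percentile interval (the summary does not say which interval type was used). The function names and toy values are hypothetical.

```python
import random
from statistics import mean


def consensus_brands(judge_a, judge_b):
    """Keep a brand only if both judges marked it (intersection consensus)."""
    return set(judge_a) & set(judge_b)


def clustered_bootstrap_ci(delta_by_prompt, n_iter=1000, alpha=0.05, seed=0):
    """Percentile CI for the mean delta, resampling whole prompts as clusters."""
    rng = random.Random(seed)
    prompts = list(delta_by_prompt)
    stats = []
    for _ in range(n_iter):
        # Draw as many prompts as were observed, with replacement.
        sample = rng.choices(prompts, k=len(prompts))
        stats.append(mean(delta_by_prompt[p] for p in sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lo, hi


# Hypothetical per-prompt deltas for one model cell with 4 prompt clusters.
toy_deltas = {"p1": -0.10, "p2": -0.15, "p3": -0.20, "p4": -0.12}
ci_low, ci_high = clustered_bootstrap_ci(toy_deltas)
print(ci_low, ci_high)
print(consensus_brands(["A", "B"], ["B", "C"]))
```

With only 4 clusters, each resample can draw from very few distinct prompts, which is one way to see why the sonnet cell's interval is wider than the 8-prompt cells'.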
Implications for AI Economics
- Measurement and research practices:
  - Studies that estimate brand perception, market-share proxies, or recommendation-market shares from LLM outputs must condition on buyer persona. Aggregating across personas will systematically obscure meaningful heterogeneity.
  - Audit protocols should include prominence stratification and a rerun-stability anchor; consensus extraction (intersection of judges) reduces false positives in brand detection.
- Firm strategy and positioning:
  - Brands should expect differential exposure across buyer personas. Category leaders are relatively robust; mid-market brands are most susceptible to persona-driven substitution and thus face both risk and targeting opportunity.
  - Companies should craft persona-specific positioning and content for LLM-facing touchpoints (e.g., chat assistants, knowledge-bases, product copy presented to AI) rather than relying on a single canonical “best X” page.
- Platform and product design:
  - Provider choices (retrieval-heavy vs. priors-heavy generation) influence persona sensitivity. Architectures or settings that increase reliance on parametric priors may amplify persona-conditioned variation in recommendations, which makes this a design lever with competitive and regulatory implications.
  - Retrieval attribution transparency matters: opaque generations without retrieval evidence can mask whether recommendations are drawn from indexed sources or model priors, complicating auditability and trust.
- Market dynamics and competition:
  - Persona-driven recommendation heterogeneity can intensify segmentation: mid-market incumbents may face volatile AI-driven top-of-consideration shifts as models adapt to persona cues, potentially altering discoverability and competitive dynamics.
  - If stronger context utilization by models systematically advantages certain brand types (e.g., those better aligned to persona signals), this could affect advertising and SEO strategies and induce firms to optimize for persona-specific cues.
- Policy and regulation:
  - Regulators and platform auditors should require persona-conditioned audits when assessing fairness, competitive effects, or deceptive steering in commercial chat systems. Aggregated audits risk missing persona-specific harms or favoritism.
- Suggested next steps for researchers and firms:
  - Jointly measure per-run retrieval attribution and persona Δ to test the "priors-driven = more persona-responsive" hypothesis causally.
  - Expand persona corpora (while controlling for stereotyping) and provider cells (e.g., sonnet-high, opus family) to firm up provider comparisons.
  - Design A/B interventions that manipulate persona prominence or explicitness to see whether persona-responsiveness improves user utility (i.e., whether persona sensitivity is a feature benefiting recommendations versus a source of bias).
Summary takeaway: commercial chat AIs do not give a single, persona-agnostic “best” set of brands. Persona conditioning materially reshapes recommendation sets — especially for mid-market brands — and both researchers and firms must account for that heterogeneity when measuring market signals, designing positioning, or auditing platform behavior.
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. [Other] | null_result | high | audit sample size and experimental design coverage | n=2000; 0.8 |
| Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline. [Adoption Rate] | negative | high | recommendation-set similarity (Jaccard index) | n=2000; -0.12 to -0.20 (Jaccard); 0.8 |
| Clustered 95% CIs exclude zero on all three measured cells (the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). [Adoption Rate] | positive | high | statistical significance of persona effect (confidence intervals) | clustered 95% CIs exclude zero; 0.48 |
| Category leaders are persona-resistant (~80% same-brand consistency across personas). [Adoption Rate] | positive | high | brand consistency across personas (same-brand %) | n=2000; ~80% same-brand consistency; 0.48 |
| Mid-market brands swap up to 75% of the recommendation set as the persona changes. [Adoption Rate] | negative | high | proportion of recommendation-set changed for mid-market brands | n=2000; up to 75% swap; 0.48 |
| The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high). [Adoption Rate] | mixed | high | magnitude of persona-driven recommendation-set change by model | n=2000; 0.48 |
| Anthropic's asymmetry is consistent with its more retrieval-unattributed generation route: 43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29% (documented in Jack 2026). [Other] | mixed | medium | fraction of recommendations lacking observed retrieval-layer evidence | Anthropic: 43-52%; OpenAI: 8-29%; 0.29 |
| Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. [Adoption Rate] | mixed | high | validity of AI brand-perception measurement protocols | n=2000; 0.48 |
| The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit. [Adoption Rate] | negative | high | concentration of persona effect across brand market segments and generation routes | n=2000; 0.48 |
| Persona responsiveness grows as models lean more on training-data priors and richer context integration. [Adoption Rate] | positive | medium | relationship between model reliance on priors/context and persona responsiveness | 0.05 |