Leading AI chat tools get broadly similar satisfaction scores despite huge differences in resources and benchmarks; most users run multiple services and switch freely, implying specialization sustains competition rather than winner-take-all consolidation.

Beyond Benchmarks: How Users Evaluate AI Chat Assistants

Moiz Sadiq Awan, Muhammad Haris Noor, Muhammad Salman Munaf · March 26, 2026

arXiv · descriptive · low evidence · relevance 7/10

A cross-platform survey of 388 active AI chat users finds similar satisfaction across top models, widespread multi-platform use with negligible switching costs, and platform-specific adoption reasons, while hallucination and content filtering are the main frustrations.

Automated benchmarks dominate the evaluation of large language models, yet no systematic study has compared user satisfaction, adoption motivations, and frustrations across competing platforms using a consistent instrument. We address this gap with a cross-platform survey of 388 active AI chat users, comparing satisfaction, adoption drivers, use case performance, and qualitative frustrations across seven major platforms: ChatGPT, Claude, Gemini, DeepSeek, Grok, Mistral, and Llama. Three broad findings emerge. First, the top three platforms (Claude, ChatGPT, and DeepSeek) receive statistically indistinguishable satisfaction ratings despite vast differences in funding, team size, and benchmark performance. Second, users treat these tools as interchangeable utilities rather than sticky ecosystems: over 80% use two or more platforms, and switching costs are negligible. Third, each platform attracts users for different reasons: ChatGPT for its interface, Claude for answer quality, DeepSeek through word-of-mouth, and Grok for its content policy, suggesting that specialization, not generalist dominance, sustains competition. Hallucination and content filtering remain the most common frustrations across all platforms. These findings offer an early empirical baseline for a market that benchmarks alone cannot characterize, and point toward competitive plurality rather than winner-take-all consolidation among engaged users.

Summary

Main Finding

Active, engaged AI chat users treat assistants as largely interchangeable utilities: multi‑platform use is pervasive, switching costs are low, and user satisfaction is driven as much by interface, policy, and product features as by raw benchmark performance. Despite large resource differences, the top three platforms (Claude, ChatGPT, DeepSeek) receive statistically indistinguishable satisfaction ratings, and competition is sustained by platform specialization rather than winner‑take‑all consolidation.

Key Points

Sample and reach
- Survey collected late 2025; total responses = 388; primary analytic (second‑wave) sample N = 237; complete demographic subsample n = 171; qualitative responses n = 329.
- Respondents skew toward technology professionals and students; 79.5% report daily use; respondents from 37 countries (largest shares: North America 38.2%, South Asia 36.4%).
Multi‑homing and switching
- Mean number of platforms used = 2.83 (median = 3); 82.4% use two or more platforms; 52.9% use three or more.
- ChatGPT is the most commonly used (66.7% of respondents) and acts as the default starting point for exploration; most switchers to other primary tools report coming from ChatGPT (e.g., 78.6% of those who switched to Claude came from ChatGPT).
- Low switching‑in rate for ChatGPT (8.6%) vs higher switching‑in for newer entrants (e.g., Mistral 42.1%), consistent with first‑mover anchoring.
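The multi‑homing figures above are simple descriptive statistics over a per‑respondent count of platforms used. A minimal sketch of that arithmetic follows; the `platform_counts` values and the `multi_homing_summary` helper are hypothetical and illustrative only, not the survey's data or the authors' code.

```python
from statistics import mean, median

# Hypothetical toy data: number of platforms each respondent reports using.
# These values are illustrative only and are NOT the survey's responses.
platform_counts = [1, 2, 3, 3, 4, 2, 1, 5, 3, 2]


def multi_homing_summary(counts):
    """Return the mean, the median, and the shares using >=2 and >=3 platforms."""
    n = len(counts)
    return {
        "mean": mean(counts),
        "median": median(counts),
        "share_2_plus": sum(c >= 2 for c in counts) / n,
        "share_3_plus": sum(c >= 3 for c in counts) / n,
    }


summary = multi_homing_summary(platform_counts)
print(summary)
```

Applied to the real responses, the same four quantities would correspond to the reported mean, median, and the two multi‑platform shares.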
Satisfaction and heterogeneity
- The top three platforms (Claude, ChatGPT, DeepSeek) show statistically indistinguishable overall satisfaction despite large differences in funding, team size, and benchmark performance.
- Platform choice clusters around different strengths: ChatGPT (interface/usability), Claude (answer quality), DeepSeek (word‑of‑mouth growth), Grok (content policy/permissiveness).
- First‑mover anchoring: respondents who adopted ChatGPT as their first AI chat tool rated satisfaction 1.34 points higher (on the 5‑point scale) than those who arrived from a competitor.
Persistent frustrations
- Thematic analysis of open responses identifies hallucinations (factual errors) and content moderation/filtering (policy limits) as the two most common user frustrations, representing a tradeoff between accuracy and permissiveness.
Other findings
- Adoption drivers and use‑case fit vary by platform (nine adoption drivers and six use‑case ratings were collected).
- Multi‑platform use sometimes signals ongoing search for better fit (e.g., negative correlation between number of platforms used and satisfaction for Claude users).

Data & Methods

Survey design
- Instrument delivered via Qualtrics with four sections: demographics/usage; model selection (checklist of 7 platforms); per‑model evaluation blocks (within‑subject design); open‑ended questions.
- Per‑model block items included: overall satisfaction (5‑point Likert), nine adoption‑driver importance ratings, six use‑case performance ratings, subscription plan, reaction to a hypothetical 25% price increase, whether the platform was the respondent's first AI tool, tenure, and switching history.
Sampling and data collection
- Convenience sampling through technology‑focused online communities (primarily Reddit) and professional networks.
- Two waves: the first used a five‑model checklist with a free‑text "Other" option (n≈151); DeepSeek and Mistral were added mid‑survey for the second wave (final analytic N = 237). The authors report the instrument change and performed robustness checks.
Cleaning and quality checks
- Excluded responses completed in under 3 minutes, responses with contradictory model selections, and straight‑lined responses.
- Robustness checks: compared full vs partial completers and early vs late respondents; ChatGPT satisfaction stable across these splits.
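The three exclusion rules can be sketched as a simple filter. The summary does not say how "contradictory" or "straight‑lining" were operationalized, so the definitions below are assumptions, as are the `Response` fields and all helper names; only the 3‑minute threshold comes from the text.

```python
from dataclasses import dataclass

MIN_DURATION_SECONDS = 3 * 60  # the 3-minute threshold stated in the text


@dataclass
class Response:
    # All field names are hypothetical, chosen for illustration.
    duration_seconds: float
    selected_models: set   # models ticked in the checklist
    evaluated_models: set  # models with an answered evaluation block
    likert_answers: list   # every Likert rating given in the response


def is_contradictory(r):
    # Assumed definition: a model was evaluated without being selected.
    return not r.evaluated_models <= r.selected_models


def is_straight_lined(r, min_items=5):
    # Assumed definition: the same rating on every Likert item.
    return len(r.likert_answers) >= min_items and len(set(r.likert_answers)) == 1


def keep(r):
    return (
        r.duration_seconds >= MIN_DURATION_SECONDS
        and not is_contradictory(r)
        and not is_straight_lined(r)
    )


# Toy responses, illustrative only.
responses = [
    Response(420, {"ChatGPT", "Claude"}, {"ChatGPT", "Claude"}, [4, 5, 3, 4, 4, 2]),
    Response(95, {"ChatGPT"}, {"ChatGPT"}, [4, 3, 5, 4, 2, 3]),         # too fast
    Response(600, {"Gemini"}, {"Gemini", "Grok"}, [3, 4, 2, 5, 4, 3]),  # contradictory
    Response(480, {"DeepSeek"}, {"DeepSeek"}, [3, 3, 3, 3, 3, 3]),      # straight-lined
]
cleaned = [r for r in responses if keep(r)]
print(len(cleaned))
```

Keeping each rule in its own predicate makes it easy to report how many responses each criterion removed.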
Analytical approach
- Non‑parametric tests used for ordinal Likert data: Kruskal‑Wallis H tests for group differences; Mann‑Whitney U for pairwise comparisons with Bonferroni correction (adjusted α = 0.0024 for 21 pairwise tests).
- Effect sizes: ε² for Kruskal‑Wallis, Cohen’s d for pairwise, Cramér’s V for chi‑square.
- Internal consistency: Cronbach’s α reported for use‑case (≈0.79–0.85) and adoption‑driver scales (≈0.75–0.80).
- Open replies analyzed with inductive, keyword‑assisted thematic coding.
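The Bonferroni figure above can be checked with a few lines of arithmetic: seven platforms give 21 unordered pairs, and dividing a family‑wise level by 21 yields the adjusted threshold. This sketch assumes a family‑wise α of 0.05, which the summary does not state explicitly but which is consistent with the reported 0.0024; the `is_significant` helper is hypothetical.

```python
from itertools import combinations
from math import comb

platforms = ["ChatGPT", "Claude", "Gemini", "DeepSeek", "Grok", "Mistral", "Llama"]
FAMILY_ALPHA = 0.05  # assumed family-wise level; not stated in the summary

# Every unordered pair of platforms gets one pairwise test.
pairs = list(combinations(platforms, 2))
n_tests = len(pairs)

# Bonferroni: divide the family-wise level by the number of tests.
adjusted_alpha = FAMILY_ALPHA / n_tests


def is_significant(p_value, alpha=adjusted_alpha):
    """Compare a pairwise p-value against the Bonferroni-adjusted threshold."""
    return p_value < alpha


print(n_tests, round(adjusted_alpha, 4))
```

Under this threshold a pairwise p‑value of 0.01, which would pass an uncorrected 0.05 test, is not treated as significant.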
Limitations noted by authors
- Convenience, tech‑oriented sample (not representative of general population).
- Mid‑survey instrument change (DeepSeek, Mistral added) — analyses involving those platforms should be interpreted cautiously.
- Small per‑platform subsamples for some competitors reduce power for certain comparisons.

Implications for AI Economics

Market structure and competition
- Low switching costs and prevalent multi‑homing argue against rapid winner‑take‑all consolidation among engaged users; competition is likely to remain pluralistic, with suppliers competing on differentiated features and domain fit.
- First‑mover anchoring (higher satisfaction among users who adopted ChatGPT first) suggests incumbency and default status provide durable advantages even when multi‑homing is common — incumbency effects coexist with low churn.
Valuation and investment signals
- Resource and benchmark advantages do not map directly to higher user satisfaction among power users. Investors and firms should account for product features (UI/UX, policy stance, integrations, niche performance) in addition to raw model capability.
- Niche or specialized entrants can sustain viable positions by targeting specific adoption drivers (e.g., policy permissiveness, answer style, vertical integrations).
Product strategy and prioritization
- Product teams should prioritize features that drive everyday satisfaction (interface, reliability, content policy alignment, domain suitability) rather than optimizing only for automated benchmarks.
- Addressing hallucinations and designing transparent, calibrated content‑filtering tradeoffs are high‑value engineering targets because they are primary user pain points.
Policy and market monitoring
- Regulators and antitrust analysts should incorporate multi‑homing rates, user satisfaction parity, and feature differentiation into assessments of market power — traffic or compute spend alone may overstate dominance.
- Consumer welfare analyses should consider heterogeneity in user priorities (quality vs permissiveness vs cost) and the role of default/anchoring effects.
Research and evaluation
- Automated benchmarks remain necessary but insufficient; platform‑level, user‑centric evaluation protocols (covering interface, policy, multi‑turn reliability, and ecosystem features) are required to capture real‑world competitive dynamics.
- Future empirical work should use more representative sampling and larger per‑platform samples to quantify heterogeneity across broader user populations and task domains.

Summary: Among active AI chat users, competition is sustained by specialization and product attributes rather than absolute benchmark superiority. Multi‑homing and low switching costs produce a pluralistic market, but incumbency effects and user experience factors (interface, policy, hallucination mitigation) strongly shape adoption and satisfaction.

Assessment

Paper Type: descriptive
Evidence Strength: low. Cross-sectional self-reported survey (n=388) provides descriptive comparisons but no causal identification; subject to selection, reporting, and survivorship biases and likely small per-platform subgroup sizes, so findings are suggestive rather than definitive about broader user populations.
Methods Rigor: medium. Study uses a consistent instrument across seven platforms and reports statistical tests (e.g., comparing satisfaction), which supports internal consistency; however, there is no probabilistic sampling frame, limited information on respondent recruitment or demographic controls, potential confounders are not addressed, and subgroup sample sizes are likely small.
Sample: Cross-sectional online survey of 388 self-selected active AI chat users reporting use of one or more of seven platforms (ChatGPT, Claude, Gemini, DeepSeek, Grok, Mistral, Llama); recruitment and respondent demographics are not reported and per-platform sample sizes are not disclosed.
Themes: adoption, innovation
Generalizability
- Self-selected sample of active users creates selection bias (not representative of general population or casual users).
- Likely skewed toward tech-savvy or English-speaking respondents.
- Small overall and per-platform sample sizes limit precision and subgroup inference.
- Findings reflect a point-in-time market and may not hold as models, interfaces, and policies evolve.
- Does not capture enterprise, API, or developer usage patterns.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
We conducted a cross-platform survey of 388 active AI chat users comparing satisfaction, adoption drivers, use case performance, and qualitative frustrations across seven major platforms: ChatGPT, Claude, Gemini, DeepSeek, Grok, Mistral, and Llama.	Adoption Rate	positive	high	survey sample and platform coverage	n=388 0.3
The top three platforms (Claude, ChatGPT, and DeepSeek) receive statistically indistinguishable satisfaction ratings despite vast differences in funding, team size, and benchmark performance.	Consumer Welfare	null_result	high	user satisfaction ratings	n=388 0.18
Over 80% of users use two or more platforms (i.e., multi-platform usage is common).	Adoption Rate	positive	high	number/proportion of users using multiple platforms	n=388 over 80% 0.18
Switching costs between platforms are negligible (users treat these tools as interchangeable utilities rather than sticky ecosystems).	Adoption Rate	positive	medium	perceived switching costs / platform stickiness	n=388 0.11
ChatGPT attracts users primarily for its interface.	Adoption Rate	positive	high	reported adoption reason for ChatGPT (interface)	n=388 0.18
Claude attracts users primarily for answer quality.	Adoption Rate	positive	high	reported adoption reason for Claude (answer quality)	n=388 0.18
DeepSeek attracts users primarily through word-of-mouth.	Adoption Rate	positive	high	reported adoption reason for DeepSeek (word-of-mouth)	n=388 0.18
Grok attracts users primarily for its content policy.	Adoption Rate	positive	high	reported adoption reason for Grok (content policy)	n=388 0.18
Hallucination and content filtering are the most common frustrations reported across all platforms.	Consumer Welfare	negative	high	reported frustrations (hallucination and content filtering)	n=388 0.18
These findings provide an early empirical baseline and point toward competitive plurality rather than winner-take-all consolidation among engaged users.	Market Structure	positive	medium	market structure (likelihood of plurality vs winner-take-all)	n=388 0.02
Automated benchmarks dominate the evaluation of large language models, yet no systematic study has compared user satisfaction, adoption motivations, and frustrations across competing platforms using a consistent instrument.	Other	neutral	medium	state of the evaluation literature (dominance of automated benchmarks and lack of cross-platform user surveys)	0.05