
A paradox of AI fluency
Christopher Potts, Moritz Sudhof · April 28, 2026 · ArXiv.org
Source: OpenAlex · Type: correlational · Evidence strength: medium · Relevance: 8/10 · Source PDF
Users fluent with AI engage more actively and take on more complex tasks, producing more visible failures but higher recoveries and greater success on hard tasks, while novices more often produce invisible failures that appear successful but miss the goal.

How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices -- but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. Our code and data are available at https://github.com/bigspinai/bigspin-fluency-outcomes

Summary

Main Finding

Fluency with AI strongly shapes outcomes. High-fluency users tackle substantially more complex tasks and interact in an augmentative, iterative way, which produces both more visible failures and greater recovery/success on hard tasks. Low-fluency users take simpler tasks and are more delegative/passive, producing fewer visible failures but more invisible failures that leave conversations appearing successful while missing the mark. This creates a “paradox of AI fluency”: experts fail more often (in raw counts) but fail in ways that are detectable and recoverable, whereas novices suffer quieter, harder-to-detect failures.

Key Points

  • Dataset & scope

    • Sample: 1,000 English transcripts per month from May 2023–July 2025 from WildChat-4.8M → 27K sampled, 26,958 annotated; main analyses use a “Standard” subset (excludes two exogenous subsets) with 20,969 cases.
    • Models represented in the underlying service range from GPT-3.5-turbo to GPT-4.1-mini.
    • Code/data: https://github.com/bigspinai/bigspin-fluency-outcomes
  • Quantitative highlights

    • Task complexity gap: mean complexity (1–5 scale) is 3.1 for highest-fluency users vs 1.5 for lowest — a 1.6-point gap.
    • Interactional mode: 93% of high-fluency transcripts are augmentative vs <1% for the lowest-fluency group.
    • Failure rates: 64% of high-fluency transcripts contain ≥1 failure indicator vs 24% for lowest-fluency.
    • Failure visibility: 59% of high-fluency failures are visible vs 12% for the lowest-fluency group.
    • Fluent users are more likely to achieve partial recoveries and to succeed on complex tasks (a minimal aggregation sketch follows this list).
  • Behavioral patterns

    • High-fluency signatures: iterative refinement, context provision/injection, critical output evaluation, decomposition, goal/format specification.
    • Low-fluency signatures: passive acceptance, prompt flailing, vague delegation, context gaps not addressed.
  • Data quirks

    • “Midjourney” (≈4,335 transcripts) and “Blockman” (≈200K in the full dataset; 1,654 in the sample) are large, specific usage subsets that distort the fluency and complexity distributions. The report focuses on the Standard subset with these removed; the authors discuss their relevance separately.
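
The quantitative highlights above are group-level aggregates over the annotated transcripts. As a rough illustration only (not the authors' released analysis code), the sketch below shows how such per-fluency-group statistics could be computed; the file name annotations.jsonl and the column names fluency, task_complexity, interaction_style, has_failure, and failure_visibility are assumed placeholders.

```python
import pandas as pd

# Hypothetical export of the ~27K annotated transcripts; the actual release
# may use different file and column names.
df = pd.read_json("annotations.jsonl", lines=True)

# Mean task complexity (1-5 scale) per overall fluency level
# (the paper reports roughly 3.1 for the highest group vs 1.5 for the lowest).
complexity_by_fluency = df.groupby("fluency")["task_complexity"].mean()

# Share of transcripts annotated as augmentative, per fluency level.
augmentative_share = (
    df.assign(augmentative=df["interaction_style"].eq("augmentative"))
      .groupby("fluency")["augmentative"]
      .mean()
)

# Share of transcripts with at least one failure indicator, per fluency level.
failure_rate = df.groupby("fluency")["has_failure"].mean()

# Among failed transcripts, the share whose failures are visible rather than invisible.
visible_share = (
    df[df["has_failure"]]
      .assign(visible=lambda d: d["failure_visibility"].eq("visible"))
      .groupby("fluency")["visible"]
      .mean()
)

print(complexity_by_fluency, augmentative_share, failure_rate, visible_share, sep="\n\n")
```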

Data & Methods

  • Data source: WildChat-4.8M non-toxic subset (≈3.2M conversations, April 2023–Aug 2025). Sampled 1k/month → annotated sample ~27K.
  • Annotation pipeline (a structural code sketch follows this list)
    • Fluency annotations: protocol inspired by Anthropic (2026). Each transcript received: one-sentence summary, interaction style (augmentative | delegative | other), lists of fluency (17 categories) and anti-fluency (7 categories) behaviors with strength/evidence, and an overall fluency assessment (high|moderate|low|minimal) plus rationale. Annotated by Sonnet 4.5 (claude-sonnet-4-5-20250929).
    • Task complexity: 1–5 confidence/complexity scale and sublabels (cognitive complexity, domain expertise, novelty, etc.), annotated by Sonnet 4.5.
    • Failure-mode annotation: two-stage LLM pipeline following Potts & Sudhof (2026). Stage 1: three LLMs (Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.4) annotate quality signals. Stage 2: Claude Opus 4.6 infers basic failure mode (visible | invisible | mixed | none) and, where applicable, assigns invisible-failure archetypes (e.g., confidence trap, silent mismatch, drift, death spiral, contradiction unravel, walkaway, partial recovery, mystery failure).
  • Subset strategy: analyses mostly on “Standard” subset to avoid distortions from Midjourney/Blockman events; authors report where those subsets materially change results.
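
As a structural sketch of the pipeline described above: the dataclasses capture the shape of one transcript's fluency and failure-mode annotations, and the two-stage failure step takes the Stage-1 annotator models and the Stage-2 judge as injected callables. All class, field, and parameter names are illustrative assumptions; the released pipeline on GitHub may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional

# Label sets taken from the annotation scheme described above.
Fluency = Literal["high", "moderate", "low", "minimal"]
Style = Literal["augmentative", "delegative", "other"]
FailureMode = Literal["visible", "invisible", "mixed", "none"]

@dataclass
class FluencyAnnotation:
    summary: str                        # one-sentence transcript summary
    interaction_style: Style
    fluency_behaviors: list[str]        # drawn from the 17 fluency categories
    anti_fluency_behaviors: list[str]   # drawn from the 7 anti-fluency categories
    overall_fluency: Fluency
    rationale: str

@dataclass
class FailureAnnotation:
    quality_signals: list[dict]         # Stage 1: one set of signals per annotator model
    failure_mode: FailureMode           # Stage 2: adjudicated basic failure mode
    invisible_archetype: Optional[str]  # e.g. "confidence trap", "silent mismatch"

def annotate_failures(
    transcript: str,
    stage1_annotators: list[Callable[[str], dict]],  # one callable per Stage-1 LLM
    judge: Callable[[str, list[dict]], dict],        # Stage-2 adjudicator LLM
) -> FailureAnnotation:
    """Two-stage failure-mode annotation: several models emit quality signals,
    then a single judge model infers the failure mode from those signals."""
    signals = [annotate(transcript) for annotate in stage1_annotators]
    verdict = judge(transcript, signals)
    return FailureAnnotation(
        quality_signals=signals,
        failure_mode=verdict["failure_mode"],
        invisible_archetype=verdict.get("invisible_archetype"),
    )
```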

Implications for AI Economics

  • Measurement of AI impact must account for user fluency

    • Productivity or welfare gains estimated from AI tools depend on who is using them and how. Studies ignoring user fluency risk over- or under-estimating benefits (novices may appear successful while suffering hidden errors; experts extract more value from complex tasks).
    • Evaluation metrics should distinguish visible vs invisible failures; aggregate satisfaction or completion rates can mask invisible mismatches that matter economically (misinformation, bad decisions).
  • Distributional effects and inequality

    • Returns to “AI fluency” are likely concentrated: fluent users capture more complex-task gains and can recover from errors, while novices capture less value and may be more exposed to undetected harms or productivity losses. This suggests potential widening of inequality in realized gains from AI unless fluency is more broadly distributed.
  • Product and platform design trade-offs

    • Frictionless, one-click experiences that minimize visible failure may increase short-term user satisfaction but enable invisible failures and passive delegation. Designing for active engagement (e.g., scaffolding iterative refinement, prompting affordances, verifiability tools) may reduce immediate satisfaction but increase long-run successful use and societal benefit.
    • Platforms should instrument invisible-failure signals and provide interfaces that surface uncertainty or gaps; platform-level metrics should incorporate these signals to better align incentives (a minimal metric sketch appears at the end of this section).
  • Labor market and training

    • Employers and educators should prioritize training in augmentative behaviors (iterative refinement, goal clarification, critical evaluation) because these skills materially affect the ability to use AI as a productivity tool, not just as automation.
    • Hiring and compensation models that assume uniform AI gains across workers will be mistaken; firms may obtain larger returns from investing in workers’ AI fluency.
  • Policy and regulation

    • Regulators and auditors should be aware of invisible failure archetypes: compliance and safety checks should not rely solely on surface indicators of success.
    • Public policy aimed at equitable AI adoption should fund accessibility and fluency programs; otherwise, market dynamics may favor already-skilled workers.

Overall, the report reframes “AI success” as a function of both model capability and user behavior. For economic analysis, that means shifting from model-centered impact estimates to interaction-aware assessments that model heterogeneity in user fluency, task complexity, and observable vs invisible failure modes.
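
To make the measurement point concrete, here is a minimal sketch of how a naive completion rate can look healthy while an invisible-failure-adjusted success rate does not; the field names ended_ok and failure_mode are hypothetical instrumentation labels, not the paper's schema.

```python
def success_metrics(transcripts: list[dict]) -> dict[str, float]:
    """Contrast a naive completion rate with a rate that discounts invisible failures."""
    n = len(transcripts)
    # Naive completion: the conversation ended without a visible complaint or abandonment.
    completion_rate = sum(t["ended_ok"] for t in transcripts) / n
    # Adjusted: additionally exclude conversations flagged as invisible (or mixed) failures,
    # i.e. ones that look finished but miss the user's actual goal.
    adjusted_rate = sum(
        t["ended_ok"] and t["failure_mode"] not in ("invisible", "mixed")
        for t in transcripts
    ) / n
    return {"completion_rate": completion_rate, "invisible_adjusted_rate": adjusted_rate}

# Three conversations, one of which is an invisible failure that "looks" complete.
example = [
    {"ended_ok": True,  "failure_mode": "none"},
    {"ended_ok": True,  "failure_mode": "invisible"},
    {"ended_ok": False, "failure_mode": "visible"},
]
print(success_metrics(example))  # completion ≈ 0.67, invisible-adjusted ≈ 0.33
```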

Assessment

  • Paper Type: correlational
  • Evidence Strength: medium — Large, richly annotated observational dataset (27K transcripts) provides strong descriptive evidence of associations between user fluency and interaction patterns/outcomes, but there is no experimental or quasi-experimental identification; selection on unobservables (user motivation, prior domain knowledge, task selection) and potential annotation subjectivity limit causal claims.
  • Methods Rigor: medium — The study uses a large sample and detailed annotations, which supports internal validity for descriptive patterns; however, the absence of randomized variation and the limited information (in the abstract) about coder reliability, control variables, and robustness checks reduce methodological rigor relative to causal inference standards.
  • Sample: 27,000 conversational transcripts sampled from the WildChat-4.8M corpus, each richly annotated for user 'fluency' level, task complexity, interactional mode (iterative/collaborative vs. passive), failure type (visible vs. invisible), recoveries, and outcome; user demographics and how users were sampled or assigned to fluency categories are not specified in the abstract; code and data are publicly available on GitHub.
  • Themes: human_ai_collab, skills_training
  • Generalizability:
    • Single platform/model (WildChat-4.8M) — findings may not generalize to other LLMs or chat interfaces
    • Unknown user population and selection — likely self-selected/active users, limiting representativeness
    • Language and cultural bias if transcripts are predominantly English or from specific regions
    • Task-domain dependence — results may differ for domain-specific vs. general-purpose tasks
    • Annotation subjectivity and potential coder variation may affect replicability
    • Time- and version-specific: model behavior and user practices evolve rapidly

Claims (9)

  • Fluent users take on more complex tasks than novices. (Task Allocation)
    Direction: positive · Confidence: high · Outcome: task complexity · Details: n=27,000; 0.3
  • Fluent users adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. (Organizational Efficiency)
    Direction: mixed · Confidence: high · Outcome: interactional mode / engagement style · Details: n=27,000; 0.3
  • Fluent users experience more failures than novices. (Error Rate)
    Direction: negative · Confidence: high · Outcome: failure rate (errors / failed turns) · Details: n=27,000; 0.3
  • Fluent users' failures tend to be visible (a direct consequence of their engagement). (Error Rate)
    Direction: positive · Confidence: high · Outcome: visibility of failures (visible vs. invisible failures) · Details: n=27,000; 0.3
  • Fluent users' failures are more likely to lead to partial recovery. (Error Rate)
    Direction: positive · Confidence: high · Outcome: partial recovery rate after failures · Details: n=27,000; 0.3
  • Fluent users' failures occur alongside greater success on complex tasks. (Output Quality)
    Direction: mixed · Confidence: high · Outcome: success on complex tasks · Details: n=27,000; 0.3
  • Novices more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. (Error Rate)
    Direction: negative · Confidence: high · Outcome: invisible failure rate (apparent success but incorrect outcome) · Details: n=27,000; 0.3
  • Individuals should adopt a stance of active engagement rather than passive acceptance. (Skill Acquisition)
    Direction: positive · Confidence: high · Outcome: recommended user behavior (active engagement) · Details: 0.05
  • AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. (Adoption Rate)
    Direction: positive · Confidence: high · Outcome: product design recommendation (encouraging deep engagement) · Details: 0.05
