The Commonplace

Generative search engines give inconsistent, heavy-tailed citation visibility: identical queries return varying domains, and citation counts follow a power law, so single-run snapshots overstate precision. Repeated sampling and confidence intervals reveal that much of the apparent concentration is sampling noise.

Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement
Ronald Sielinski · March 09, 2026
Source: arXiv · Paper type: descriptive · Evidence: medium · Relevance: 8/10
Generative search engines produce stochastic, heavy-tailed citation patterns such that single-run estimates of domain visibility are misleading and should be reported with uncertainty derived from repeated sampling and bootstrap methods.

AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses and cite different sources. Despite this stochastic behavior, current approaches to measuring domain visibility in generative search typically rely on single-run point estimates of citation share and prevalence, implicitly treating them as fixed values. This paper argues that citation visibility metrics should be treated as sample estimators of an underlying response distribution rather than fixed values. We conduct an empirical study of citation variability across three generative search platforms--Perplexity Search, OpenAI SearchGPT, and Google Gemini--using repeated sampling across three consumer product topics. Two sampling regimes are employed: daily collections over nine days and high-frequency sampling at ten-minute intervals. We show that citation distributions follow a power-law form and exhibit substantial variability across repeated samples. Bootstrap confidence intervals reveal that many apparent differences between domains fall within the noise floor of the measurement process. Distribution-wide rank stability analysis further demonstrates that citation rankings are unstable across samples, not only among top-ranked domains but throughout the frequently cited domain set. These findings demonstrate that single-run visibility metrics provide a misleadingly precise picture of domain performance in generative search. We argue that citation visibility must be reported with uncertainty estimates and provide practical guidance for sample sizes required to achieve interpretable confidence intervals.

Summary

Main Finding

Citation visibility measured from generative search engines is stochastic and heavy-tailed: identical queries produce varying responses and cited domains across runs, and citation counts follow a power-law distribution with substantial sample-to-sample variability. Single-run point estimates of citation share or prevalence are therefore misleading; visibility metrics should be treated as estimators with uncertainty and reported with confidence intervals. Many apparent inter-domain differences vanish once measurement uncertainty is accounted for.

Key Points

  • Generative search platforms are non-deterministic: the same query at different times can yield different answers and different cited domains.
  • The authors analyzed three platforms (Perplexity Search, OpenAI SearchGPT, Google Gemini) on three consumer-product topics.
  • Two sampling regimes were used: daily collections across nine days and high-frequency sampling at 10-minute intervals.
  • Citation counts across repeated samples follow a power-law (heavy-tailed) distribution — a few domains are cited often, many are cited rarely.
  • Bootstrap-based confidence intervals show wide uncertainty: many domain-level differences that look meaningful in single-run snapshots fall within the measurement noise.
  • Rank stability analysis across the whole citation distribution shows instability not only among top-ranked domains but throughout the frequently cited domain set; rankings shift substantially across samples.
  • Practical conclusion: single-run visibility metrics give a falsely precise view of domain performance; uncertainty must be quantified and reported.
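
The rank-stability point above can be illustrated with a small sketch. This summary does not state which stability statistic the paper uses, so the mean top-k Jaccard overlap below is a stand-in assumption, and the function names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_k_domains(responses, k):
    """Return the set of the k most-cited domains in one sample.

    `responses` is a list of responses, each a list of cited domains.
    """
    counts = Counter(domain for response in responses for domain in response)
    # Sort by count (descending), then by name so ties break deterministically.
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    return set(ranked[:k])


def mean_top_k_overlap(samples, k):
    """Mean Jaccard overlap of the top-k sets over all pairs of samples.

    1.0 means every sample has the same top-k set; values near 0 mean the
    top-k set changes almost completely from sample to sample.
    """
    tops = [top_k_domains(sample, k) for sample in samples]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two samples to compare")
    total = 0.0
    for a, b in pairs:
        union = a | b
        # Two empty top-k sets are treated as identical.
        total += len(a & b) / len(union) if union else 1.0
    return total / len(pairs)


# Hypothetical data: two collection runs for the same query.
day1 = [["a.com", "b.com"], ["a.com", "c.com"]]
day2 = [["a.com", "c.com"], ["c.com", "d.com"]]
overlap = mean_top_k_overlap([day1, day2], k=2)
```

Computing the overlap for several values of k shows whether instability is confined to the head of the ranking or extends further down the frequently cited set.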

Data & Methods

  • Platforms: Perplexity Search, OpenAI SearchGPT, Google Gemini.
  • Topics: three consumer-product topics (paper focuses on consumer-product verticals; topics not specified here).
  • Sampling designs:
    • Multi-day sampling: one collection per day over nine days to capture day-to-day variation and medium-term drift.
    • High-frequency sampling: repeated queries at ten-minute intervals to capture short-term stochasticity.
  • Analyses:
    • Empirical distributional analysis showed citation counts follow a power-law form.
    • Bootstrap resampling was used to compute confidence intervals for citation shares and prevalence metrics.
    • Distribution-wide rank stability methods (e.g., comparing ranks across samples) quantified how often domain orderings change over repeated samples.
  • Main methodological point: treat observed citation shares as sampled statistics from an underlying response distribution and quantify uncertainty via repeated sampling and resampling methods.
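
A minimal sketch of the bootstrap step described above, assuming each collected response is stored as a list of cited domains. The percentile bootstrap over responses is one common choice and may differ from the paper's exact procedure; the function names, parameters, and sample data are hypothetical.

```python
import random


def citation_share(responses, domain):
    """Share of all citations across `responses` that point to `domain`."""
    total = sum(len(r) for r in responses)
    if total == 0:
        return 0.0
    return sum(r.count(domain) for r in responses) / total


def prevalence(responses, domain):
    """Fraction of responses that cite `domain` at least once."""
    if not responses:
        return 0.0
    return sum(domain in r for r in responses) / len(responses)


def bootstrap_ci(responses, domain, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric`, resampling whole responses."""
    rng = random.Random(seed)
    n = len(responses)
    stats = sorted(
        metric([responses[rng.randrange(n)] for _ in range(n)], domain)
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical data: four repeated runs of the same query.
runs = [
    ["a.com", "b.com"],
    ["a.com"],
    ["b.com", "c.com"],
    ["a.com", "c.com", "d.com"],
]
point = citation_share(runs, "a.com")            # 3 of 8 citations
interval = bootstrap_ci(runs, "a.com", citation_share)
```

Resampling whole responses, rather than individual citations, keeps citations that appear together in one answer together in each replicate. With only a handful of runs the interval will be very wide, which is the point the paper makes about single-run estimates.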

Implications for AI Economics

  • Measurement & inference: Studies that estimate domain visibility, market share, or referral importance from single runs risk incorrect conclusions. Researchers should treat visibility metrics as stochastic estimators and report confidence intervals computed from repeated samples or bootstrap methods.
  • Platform competition & market power: Quantifying platform-mediated visibility (which drives traffic and economic value) requires accounting for stochasticity. Policy assessments (e.g., antitrust, platform gatekeeping) that rely on visibility snapshots may misestimate the extent of concentration or favoritism unless uncertainty is incorporated.
  • Business decisions & analytics: Publishers and advertisers using generative search visibility metrics for strategy or ROI estimation should collect repeated samples and report uncertainty; otherwise decisions may respond to sampling noise.
  • Methodological guidance for practitioners and researchers:
    • Use repeated sampling (both high-frequency to capture short-term noise and multi-day sampling to capture temporal drift).
    • Compute bootstrap confidence intervals for citation shares and prevalence.
    • Run rank-stability analyses to understand how robust top-N or share-based conclusions are.
    • Determine required sample size empirically: run a pilot to estimate variance, then choose the number of samples needed to achieve the desired confidence-interval width. The paper provides practical guidance on sample sizes conditioned on observed variability.
  • Broader implication: The heavy-tailed (power-law) nature of citation distributions implies long tails and high variance, meaning that achieving tight uncertainty bounds can require substantially more sampling than intuition from thin-tailed settings would suggest.
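
The pilot-then-size step above can be sketched for the simplest case, prevalence, using the standard normal-approximation formula for a proportion. This is a textbook calculation that assumes independent runs; it is not the paper's own sample-size guidance, which this summary does not reproduce. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def runs_needed_for_prevalence(pilot_prevalence, half_width, confidence=0.95):
    """Approximate number of independent runs needed so that a confidence
    interval for prevalence has roughly the requested half-width.

    Uses n = z^2 * p * (1 - p) / h^2, where p comes from a pilot sample.
    """
    if not 0.0 <= pilot_prevalence <= 1.0:
        raise ValueError("pilot_prevalence must be in [0, 1]")
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = pilot_prevalence
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


# A domain cited in 30% of pilot responses, target interval of +/- 5 points.
n_runs = runs_needed_for_prevalence(0.30, 0.05)
```

Citation share is a ratio over a heavy-tailed count distribution, so this formula is likely to understate the runs needed for share estimates. For those, a safer route is to bootstrap the pilot data at increasing sample sizes and watch how the interval width shrinks.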

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper uses repeated sampling across three major generative search platforms, two complementary sampling regimes (daily and high-frequency), and bootstrap resampling and rank-stability analyses to document stochasticity and heavy tails in citation outputs; these methods convincingly show measurement variability within the studied settings. However, the scope is limited (three consumer-product topics, three platforms, a finite time window) and the study cannot observe or control platform internals (personalization, geolocation, model updates), limiting how strongly the findings generalize beyond the sample.

Methods Rigor: medium. The design demonstrates sound empirical practice for measurement studies: systematic repeated sampling, high-frequency and multi-day regimes, distributional fitting (power-law), bootstrap confidence intervals, and rank-stability analyses. Rigor is reduced by a narrow topical focus, modest platform coverage, likely unreported controls for personalization/geography/agent-state, and the observational nature of platform outputs, which prevents experiment-like controls.

Sample: Repeated query-response samples from three generative search platforms (Perplexity Search, OpenAI SearchGPT, Google Gemini) on three consumer-product topics; sampling included one collection per day over nine days and high-frequency collections every 10 minutes, recording cited domains and citation counts per response.

Themes: governance, adoption

Generalizability:

  • Only three platforms were studied; results may differ on other generative search systems or older/newer model versions.
  • Only three consumer-product topics were used; findings may not hold for other verticals (news, academic, technical queries).
  • Finite time window and limited number of days; platform updates or seasonality could change variability patterns.
  • Potential unobserved personalization/geolocation/language effects; the study likely does not exhaustively control for user state.
  • Differences between API and web UI behavior, or query phrasing variations, may affect citation patterns and were not fully explored.

Claims (9)

Each claim is listed with its tag, direction, confidence, outcome, and details value.

  • Generative search platforms are non-deterministic: the same query at different times can yield different answers and different cited domains. [Other]
    • Direction: negative
    • Confidence: high
    • Outcome: response variability (changes in generated answers) and cited domains per query
    • Details: 0.18
  • Citation counts across repeated samples follow a power-law (heavy-tailed) distribution: a few domains are cited often while many domains are cited rarely. [Other]
    • Direction: mixed
    • Confidence: high
    • Outcome: distribution of citation counts per domain (frequency of domain citations)
    • Details: 0.18
  • Single-run point estimates of citation share or prevalence are misleading; visibility metrics should be treated as estimators with uncertainty and reported with confidence intervals. [Other]
    • Direction: negative
    • Confidence: high
    • Outcome: bias/precision of single-run estimates of domain citation share and prevalence
    • Details: 0.18
  • Bootstrap-based confidence intervals show wide uncertainty: many domain-level differences that look meaningful in single-run snapshots fall within measurement noise. [Other]
    • Direction: negative
    • Confidence: high
    • Outcome: width of bootstrap confidence intervals for domain citation shares / prevalence and statistical separability of domain differences
    • Details: 0.18
  • Rank stability analysis across the whole citation distribution shows instability not only among top-ranked domains but throughout the frequently cited domain set; rankings shift substantially across samples. [Other]
    • Direction: negative
    • Confidence: high
    • Outcome: rank stability of domains by citation frequency across repeated samples
    • Details: 0.18
  • Many apparent inter-domain differences vanish once measurement uncertainty is accounted for. [Other]
    • Direction: null_result
    • Confidence: high
    • Outcome: statistical significance of inter-domain differences in citation share / prevalence after accounting for sampling variability
    • Details: 0.18
  • The heavy-tailed nature of citation distributions implies long tails and high variance, meaning achieving tight uncertainty bounds can require substantially more sampling than would be expected under thin-tailed assumptions. [Other]
    • Direction: negative
    • Confidence: medium
    • Outcome: required sample size (number of repeated queries) to achieve target confidence-interval width for citation-share estimates
    • Details: 0.11
  • Practical measurement guidance: researchers and practitioners should use repeated sampling (high-frequency and multi-day), compute bootstrap confidence intervals for citation shares and prevalence, run rank-stability analyses, and determine required sample size empirically via pilots. [Other]
    • Direction: positive
    • Confidence: high
    • Outcome: robustness and reliability of visibility metrics (as improved by recommended measurement practices)
    • Details: 0.18
  • Platform-mediated visibility measures used in policy assessments, business analytics, and research (e.g., estimating market share, referral importance, or favoritism) are at risk of misestimation if measurement stochasticity is not incorporated. [Market Structure]
    • Direction: negative
    • Confidence: medium
    • Outcome: accuracy of downstream inferences (market share, referral importance, favoritism) based on single-run visibility snapshots
    • Details: 0.11

Notes