The Commonplace

Generative search engines give inconsistent, heavy-tailed citation visibility: identical queries return varying domains, and citation counts follow a power law. Single-run snapshots therefore overstate precision; repeated sampling and confidence intervals reveal that much apparent concentration is sampling noise.

Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement
Ronald Sielinski · March 09, 2026
arXiv · descriptive · medium evidence · 8/10 relevance
Generative search engines produce stochastic, heavy-tailed citation patterns such that single-run estimates of domain visibility are misleading and should be reported with uncertainty derived from repeated sampling and bootstrap methods.

AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses and cite different sources. Despite this stochastic behavior, current approaches to measuring domain visibility in generative search typically rely on single-run point estimates of citation share and prevalence, implicitly treating them as fixed values. This paper argues that citation visibility metrics should be treated as sample estimators of an underlying response distribution rather than fixed values. We conduct an empirical study of citation variability across three generative search platforms--Perplexity Search, OpenAI SearchGPT, and Google Gemini--using repeated sampling across three consumer product topics. Two sampling regimes are employed: daily collections over nine days and high-frequency sampling at ten-minute intervals. We show that citation distributions follow a power-law form and exhibit substantial variability across repeated samples. Bootstrap confidence intervals reveal that many apparent differences between domains fall within the noise floor of the measurement process. Distribution-wide rank stability analysis further demonstrates that citation rankings are unstable across samples, not only among top-ranked domains but throughout the frequently cited domain set. These findings demonstrate that single-run visibility metrics provide a misleadingly precise picture of domain performance in generative search. We argue that citation visibility must be reported with uncertainty estimates and provide practical guidance for sample sizes required to achieve interpretable confidence intervals.

Summary

Main Finding

Citation-based visibility metrics for generative search (citation count, share, prevalence) are noisy estimates of an underlying stochastic response distribution. Single-run point estimates commonly used in practice are misleadingly precise: repeated sampling and bootstrap confidence intervals reveal substantial variability (often comparable to observed differences between domains), and citation rankings are unstable across samples. Measurement must therefore report uncertainty and use repeated sampling to support interpretable inference.

Key Points

  • Generative search engines (Perplexity Search, OpenAI SearchGPT, Google Gemini) are non-deterministic: identical queries can yield different responses and different cited sources.
  • Visibility metrics are sample statistics (estimators), not fixed platform facts. Two distinct variability sources matter:
    • System-level stochasticity: the engine’s inherent randomness (and retrieval variability).
    • Measurement uncertainty: sampling error from finite query batches.
  • Empirical results:
    • Three topics studied: bird feeders, running gear, multivitamins for adults; 200 queries per topic generated via an LLM across ten query types.
    • Two sampling regimes: daily collections over nine days (≈200 responses/sample) and high-frequency 10-minute interval sampling (running gear; 25 samples).
    • Citation volumes differ widely across platforms (median citations/response: Gemini ≈36–40, Perplexity ≈20–22, SearchGPT ≈5–7).
    • Citation count and citation share distributions follow a power law, with a few heavily cited domains at the head and a long tail of rarely cited ones; the variability structure differs between head and tail.
    • Citation count and citation share are nearly perfectly correlated within platform (Spearman ρ ≈ 0.994–0.997).
    • Bootstrap 95% confidence intervals for citation share are commonly wide; for SearchGPT with N≈200, CI widths on share of ~5–7 percentage points are typical, making many apparent inter-domain differences statistically indistinguishable.
    • Distribution-wide rank stability analysis (weighted Spearman) shows that rankings are unstable not only at the top but across the whole set of frequently cited domains.
    • High-frequency sampling and content-checksum controls indicate most observed variability is due to engine behavior rather than changing source content.
  • Additional measurement concerns:
    • Within-sample non-stationarity (e.g., query-ordering effects) can violate exchangeability assumptions and complicate CI convergence and interpretation.
    • Generating queries with an LLM introduces potential bias in the query distribution.
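
The bootstrap step behind the interval widths quoted above can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes each response is stored as a list of cited domains, it resamples whole responses with replacement (the resampling unit is an assumption), and it uses a simple percentile interval. The function names and the toy sample are hypothetical.

```python
import random

def citation_share(responses, domain):
    """Fraction of all citations in the sample that point at `domain`."""
    total = sum(len(cited) for cited in responses)
    hits = sum(cited.count(domain) for cited in responses)
    return hits / total if total else 0.0

def citation_prevalence(responses, domain):
    """Fraction of responses that cite `domain` at least once."""
    return sum(domain in cited for cited in responses) / len(responses)

def bootstrap_share_ci(responses, domain, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for citation share (resamples whole responses)."""
    rng = random.Random(seed)
    n = len(responses)
    stats = sorted(
        citation_share(rng.choices(responses, k=n), domain)
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy sample of 200 responses (made-up data, for illustration only).
toy_sample = (
    [["a.com", "b.com"]] * 60
    + [["b.com", "c.com"]] * 80
    + [["a.com"]] * 60
)

point = citation_share(toy_sample, "a.com")
low, high = bootstrap_share_ci(toy_sample, "a.com")
print(f"share={point:.3f}  95% CI=({low:.3f}, {high:.3f})")
```

Two domains whose intervals overlap heavily should not be ranked against each other on the basis of a single run.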

Data & Methods

  • Platforms: Perplexity Search, OpenAI SearchGPT, Google Gemini.
  • Topics: three consumer-product topics representing different market structures (bird feeders, running gear, multivitamins for adults).
  • Query set: 200 queries per topic generated by prompting an LLM to produce queries across ten pre-defined categories; repetition of common queries allowed.
  • Sampling regimes:
    • Daily: submit full 200-query set once per day for nine consecutive days (≈9 samples/topic/platform).
    • High-frequency: submit queries at 10-minute intervals (running gear), yielding 25 samples per platform to isolate system-level stochasticity.
  • Metrics:
    • Citation count c(d, S); citation share ŝ_S(d) = c(d, S) / C(S), where C(S) is the total number of citations in sample S; and citation prevalence p(d, S), the fraction of responses in S that cite d.
  • Analyses:
    • Repeated-sample empirical distributions of citation metrics.
    • Bootstrap (resampling) to produce 95% confidence intervals for citation share/prevalence.
    • Rank-stability analysis using weighted Spearman rank correlation across samples.
    • Content-checksum validation to detect whether cited pages changed between samples.
    • Diagnostics for within-sample non-stationarity and CI convergence with sample size.
  • Key numeric observations:
    • Typical sample: N ≈ 200 queries → common CI widths on citation share of ~5–7 percentage points (SearchGPT example).
    • Platform citation density differences: Gemini >> Perplexity >> SearchGPT in citations per response.
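
The rank-stability analysis listed above can be illustrated with one possible weighted Spearman statistic. The summary does not state the paper's weighting scheme, so the choice here is an assumption: each domain is weighted by its combined citation count across the two samples, which makes disagreements among heavily cited domains count for more. The function names and sample counts are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values in descending order (1 = largest); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def weighted_spearman(counts_a, counts_b):
    """Weighted Pearson correlation of the two rank vectors.

    Weights are each domain's pooled citation count (an assumed scheme).
    Domains missing from one sample are treated as having zero citations.
    """
    domains = sorted(set(counts_a) | set(counts_b))
    a = [counts_a.get(d, 0) for d in domains]
    b = [counts_b.get(d, 0) for d in domains]
    total = sum(a) + sum(b)
    if total == 0:
        return float("nan")
    w = [(x + y) / total for x, y in zip(a, b)]  # weights sum to 1
    ra, rb = average_ranks(a), average_ranks(b)
    ma = sum(wi * r for wi, r in zip(w, ra))
    mb = sum(wi * r for wi, r in zip(w, rb))
    cov = sum(wi * (x - ma) * (y - mb) for wi, x, y in zip(w, ra, rb))
    va = sum(wi * (x - ma) ** 2 for wi, x in zip(w, ra))
    vb = sum(wi * (y - mb) ** 2 for wi, y in zip(w, rb))
    if va == 0 or vb == 0:
        return float("nan")
    return cov / sqrt(va * vb)

# Made-up citation counts for two samples of the same query set.
day1 = {"a.com": 40, "b.com": 25, "c.com": 10, "d.com": 2}
day2 = {"a.com": 22, "b.com": 31, "c.com": 12, "e.com": 3}
print(weighted_spearman(day1, day2))
```

Values near 1 indicate that the two samples order the domains the same way; computing the statistic for every pair of samples gives a distribution that shows how stable the rankings are.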

Implications for AI Economics

  • Visibility and market inference:
    • Metrics of AI-driven visibility (citation share/prevalence) used for estimating market exposure, competitive positioning, or market concentration are uncertain; treating point estimates as ground truth can produce misleading economic conclusions.
    • Comparisons or rankings of domains based on single-run samples can falsely guide resource allocation (e.g., SEO/content spend) because many observed differences lie within measurement noise.
  • Evaluation of interventions and ROI:
    • Reported improvements from content or optimization interventions must exceed the measurement noise floor (commonly several percentage points for citation share with N≈200) and should be validated via repeated sampling and statistical testing (bootstrap CIs, pre/post sampling designs).
    • A/B tests or intervention studies on generative search visibility should be designed with sufficient sample size and replication to detect economically meaningful effects given the platform’s stochasticity.
  • Modeling attention and traffic:
    • Predictive models of traffic or attention that condition on citation presence/position must incorporate uncertainty in citation assignments; stochastic citation selection implies probabilistic exposure estimates rather than deterministic attribution.
  • Policy and competition analysis:
    • Analyses of platform influence, gatekeeping, or concentration that use citation-based evidence should include uncertainty quantification; confidence intervals may change conclusions about dominance or market power.
  • Practical measurement guidance:
    • Always report uncertainty (bootstrap CIs) with visibility metrics and avoid over-interpreting single-run point estimates.
    • Use repeated sampling and, when feasible, high-frequency sampling to separate system-level stochasticity from content-change effects.
    • Increase sample sizes when aiming to detect differences smaller than the observed noise floor (many observed effects <5–7 percentage points are indistinguishable at N≈200).
    • Check for within-sample non-stationarity and consider randomized query ordering or stratified sampling to improve exchangeability.
  • Research & operational recommendations:
    • Benchmarks and frameworks for AI visibility/attribution should mandate variance reporting (not just point estimates).
    • Economic models and platform audits should treat citation outcomes as probabilistic draws from a response distribution and propagate measurement uncertainty through downstream inference.
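
One way to act on the sample-size guidance above is to estimate, from a pilot batch, how the confidence interval width shrinks as the number of responses grows. The sketch below is an approximation under stated assumptions: it draws bootstrap resamples of size n from the pilot pool (responses stored as lists of cited domains) and treats the resulting percentile interval width as a rough guide to the width expected at that n. The helper names and the pilot data are hypothetical.

```python
import random

def share(responses, domain):
    """Fraction of all citations in `responses` that point at `domain`."""
    total = sum(len(r) for r in responses)
    return sum(r.count(domain) for r in responses) / total if total else 0.0

def ci_width_at(pilot, domain, n, n_boot=1000, alpha=0.05, seed=0):
    """Approximate 95% CI width for citation share at sample size n.

    Resamples n responses with replacement from the pilot pool, so the
    result is only as representative as the pilot itself.
    """
    rng = random.Random(seed)
    stats = sorted(share(rng.choices(pilot, k=n), domain) for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return hi - lo

# Made-up pilot batch of 200 responses, for illustration only.
pilot = (
    [["a.com", "b.com"]] * 60
    + [["b.com", "c.com"]] * 80
    + [["a.com"]] * 60
)

widths = {n: ci_width_at(pilot, "a.com", n) for n in (50, 200, 800)}
for n, width in widths.items():
    print(f"N={n:4d}  approx. CI width = {100 * width:.1f} percentage points")
```

Because the width falls roughly with the square root of N, halving the interval takes about four times as many responses, and heavy-tailed citation counts may require more than that.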

Limitations (as noted or implied)

  • Preprint status: results not yet peer-reviewed.
  • Limited topics (three) and potential bias from LLM-generated query sets.
  • Platforms and generation behavior evolve over time; conclusions are contingent on the systems and sampling windows studied.

Overall, this paper reframes AI visibility as a statistical estimation task and provides empirical evidence and practical recommendations showing that uncertainty quantification (repeated sampling + bootstrap CIs) is essential for valid economic inference and decision-making in generative search contexts.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper uses repeated sampling across three major generative search platforms, two complementary sampling regimes (daily and high-frequency), and bootstrap/resampling and rank-stability analyses to document stochasticity and heavy tails in citation outputs; these methods convincingly show measurement variability within the studied settings. However, the scope is limited (three consumer-product topics, three platforms, a finite time window) and the study cannot observe or control platform internals (personalization, geolocation, model updates), limiting how strongly the findings generalize beyond the sample.

Methods Rigor: medium. The design demonstrates sound empirical practice for measurement studies: systematic repeated sampling, high-frequency and multi-day regimes, distributional fitting (power law), bootstrap confidence intervals, and rank-stability analyses. Rigor is reduced by a narrow topical focus, modest platform coverage, likely unreported controls for personalization/geography/agent-state, and the observational nature of platform outputs, which prevents experiment-like controls.

Sample: Repeated query-response samples from three generative search platforms (Perplexity Search, OpenAI SearchGPT, Google Gemini) on three consumer-product topics; sampling included one collection per day over nine days and high-frequency collections every 10 minutes, recording cited domains and citation counts per response.

Themes: governance, adoption

Generalizability:

  • Only three platforms were studied; results may differ on other generative search systems or older/newer model versions.
  • Only three consumer-product topics were used; findings may not hold for other verticals (news, academic, technical queries).
  • Finite time window and limited number of days; platform updates or seasonality could change variability patterns.
  • Potential unobserved personalization/geolocation/language effects; the study likely does not exhaustively control for user state.
  • Differences between API and web UI behavior, or query phrasing variations, may affect citation patterns and were not fully explored.

Claims (9)

Each claim is listed with its topic tag in brackets, followed by Direction, Confidence, Outcome, and Details.

  • Generative search platforms are non-deterministic: the same query at different times can yield different answers and different cited domains. [Other]
    • Direction: negative · Confidence: high · Details: 0.18
    • Outcome: response variability (changes in generated answers) and cited domains per query
  • Citation counts across repeated samples follow a power-law (heavy-tailed) distribution: a few domains are cited often while many domains are cited rarely. [Other]
    • Direction: mixed · Confidence: high · Details: 0.18
    • Outcome: distribution of citation counts per domain (frequency of domain citations)
  • Single-run point estimates of citation share or prevalence are misleading; visibility metrics should be treated as estimators with uncertainty and reported with confidence intervals. [Other]
    • Direction: negative · Confidence: high · Details: 0.18
    • Outcome: bias/precision of single-run estimates of domain citation share and prevalence
  • Bootstrap-based confidence intervals show wide uncertainty: many domain-level differences that look meaningful in single-run snapshots fall within measurement noise. [Other]
    • Direction: negative · Confidence: high · Details: 0.18
    • Outcome: width of bootstrap confidence intervals for domain citation shares / prevalence and statistical separability of domain differences
  • Rank stability analysis across the whole citation distribution shows instability not only at the tail but across frequently cited domains; rankings shift substantially across samples. [Other]
    • Direction: negative · Confidence: high · Details: 0.18
    • Outcome: rank stability of domains by citation frequency across repeated samples
  • Many apparent inter-domain differences vanish once measurement uncertainty is accounted for. [Other]
    • Direction: null_result · Confidence: high · Details: 0.18
    • Outcome: statistical significance of inter-domain differences in citation share / prevalence after accounting for sampling variability
  • The heavy-tailed nature of citation distributions implies long tails and high variance, meaning achieving tight uncertainty bounds can require substantially more sampling than would be expected under thin-tailed assumptions. [Other]
    • Direction: negative · Confidence: medium · Details: 0.11
    • Outcome: required sample size (number of repeated queries) to achieve target confidence-interval width for citation-share estimates
  • Practical measurement guidance: researchers and practitioners should use repeated sampling (high-frequency and multi-day), compute bootstrap confidence intervals for citation shares and prevalence, run rank-stability analyses, and determine required sample size empirically via pilots. [Other]
    • Direction: positive · Confidence: high · Details: 0.18
    • Outcome: robustness and reliability of visibility metrics (as improved by recommended measurement practices)
  • Platform-mediated visibility measures used in policy assessments, business analytics, and research (e.g., estimating market share, referral importance, or favoritism) are at risk of misestimation if measurement stochasticity is not incorporated. [Market Structure]
    • Direction: negative · Confidence: medium · Details: 0.11
    • Outcome: accuracy of downstream inferences (market share, referral importance, favoritism) based on single-run visibility snapshots

Notes