Large language models sharpen individual outputs but shrink the pool of distinct ideas: across stories, slogans, and alternative-use prompts, three frontier LLMs generate systematically less diverse idea sets than matched human samples, implying higher redundancy costs, though targeted generation protocols can reduce that crowding.

Ex Ante Evaluation of AI-Induced Idea Diversity Collapse

Nafis Saami Azad, Raiyan Abdul Baten · May 07, 2026

arXiv · quasi_experimental · medium evidence · 7/10 relevance

Using a congestible-resource framework, the paper shows that three frontier LLMs produce less population-level creative diversity than matched human baselines (ρ<1), quantifying excess crowding with an identifiable coefficient Δ and demonstrating that protocol design can mitigate diversity collapse.

Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones. This creates an evaluation blind spot, as AI can improve individual outputs while increasing population-level crowding. We introduce a human-relative framework for benchmarking AI-induced human diversity collapse without requiring human-AI interaction data, providing an ex ante protocol to estimate crowding risk from model-only generations and matched unaided human baselines. By modeling ideas as congestible resources, we show that source-level crowding is identifiable from within-distribution comparisons, yielding an excess-crowding coefficient $Δ$ and a human-relative diversity ratio $ρ$. We show that $ρ\ge1$ is the no-excess-crowding parity condition and connect $Δ$ to an adoption game with exposure-dependent redundancy costs. Across short stories, marketing slogans, and alternative-uses tasks, three frontier LLMs fall below parity across crowding kernels. Estimates stabilize with feasible model-only sample sizes. Importantly, generation-protocol variants show that crowding can be reduced through targeted design, making diversity collapse an actionable, development-time evaluation target for population-aware creative AI.

Summary

Main Finding

The paper introduces an ex ante, human-relative framework to quantify AI-induced human diversity collapse from model-only and matched unaided human samples. It defines interpretable source-level metrics—an excess-crowding coefficient (∆) and a human-relative diversity ratio (ρ)—that are identifiable from within-distribution comparisons. Theoretical results link these source metrics to a population adoption (congestion) game: if ρ < 1 a model imposes an adoption-dependent externality (higher redundancy cost) on creators; if ρ ≥ 1 the model introduces no excess crowding. Empirically, across three creative tasks (short stories, alternative-uses, marketing slogans) and three frontier LLMs (GPT-5.4, Claude Sonnet 4.5, Gemini 2.5 Flash), neutral model-conditions fall below parity (ρ < 1), indicating positive excess crowding vs. matched human baselines. The paper also shows estimates stabilize with feasible model-only sample sizes and that prompt/protocol interventions (temperature, persona mixtures) can reduce crowding.

Key Points

Conceptual framing
- Ideas-as-congestible-resources: inspiration sources act like shared resources whose repeated use reduces downstream distinctiveness.
- Human-relative benchmark: contextualize model crowding against task-matched unaided human crowding to avoid conflating task-constrained convergence with model effects.
Source-level metrics (identifiable from samples)
- $\kappa_{H,k} = \mathbb{E}_{h,h' \sim H_k}[K_k(h,h')]$ (human pairwise crowding)
- $\kappa_{A,m,k} = \mathbb{E}_{a,a' \sim A_{m,k}}[K_k(a,a')]$ (model pairwise crowding)
- $\Delta_{m,k} = \max\{0,\ \kappa_{A,m,k} - \kappa_{H,k}\}$ (excess-crowding coefficient)
- $\rho_{m,k} = (1 - \kappa_{A,m,k}) / (1 - \kappa_{H,k})$ (human-relative diversity ratio)
- Parity condition: $\Delta_{m,k} = 0 \iff \rho_{m,k} \ge 1$ (no excess crowding)
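The metrics above can be sketched in a few lines. This is a minimal illustration, assuming the kernel matrices have already been computed; the function names and the toy matrices are hypothetical and are not taken from the paper.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def pairwise_crowding(K: Matrix) -> float:
    """Mean off-diagonal entry of a kernel matrix with values in [0, 1]."""
    n = len(K)
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    total = sum(K[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def excess_crowding(kappa_a: float, kappa_h: float) -> float:
    """Excess-crowding coefficient: max{0, kappa_A - kappa_H}."""
    return max(0.0, kappa_a - kappa_h)


def diversity_ratio(kappa_a: float, kappa_h: float) -> float:
    """Human-relative diversity ratio: (1 - kappa_A) / (1 - kappa_H)."""
    if kappa_h >= 1.0:
        raise ValueError("ratio is undefined when human crowding equals 1")
    return (1.0 - kappa_a) / (1.0 - kappa_h)


# Toy kernel matrices (made-up numbers, for illustration only).
K_human = [[1.0, 0.2, 0.4], [0.2, 1.0, 0.3], [0.4, 0.3, 1.0]]
K_model = [[1.0, 0.8, 0.7], [0.8, 1.0, 0.9], [0.7, 0.9, 1.0]]

kappa_h = pairwise_crowding(K_human)
kappa_a = pairwise_crowding(K_model)
delta = excess_crowding(kappa_a, kappa_h)
rho = diversity_ratio(kappa_a, kappa_h)
print(f"kappa_H={kappa_h:.3f} kappa_A={kappa_a:.3f} delta={delta:.3f} rho={rho:.3f}")
```

In this toy case the model sample is more crowded than the human sample, so the coefficient is positive and the ratio falls below 1, which is the below-parity pattern the summary describes.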
Links to economic/adoption theory
- Redundancy cost (exposure-dependent): $C_{m,k}(X_{-i}) = \gamma_k \left(1 - \exp\{-X_{-i}\,\Delta_{m,k}\}\right)$, where $\gamma_k$ is the value of distinctiveness and $X_{-i}$ is the number of other adopters.
- Critical-benefit adoption threshold: a creator adopts iff the private AI benefit $B_{i,m,k}$ exceeds the redundancy cost; thus a lower ρ increases the benefit required for rational adoption.
- Mass-adoption limit: if ρ < 1, the excess-crowding cost can approach the full distinctiveness penalty $\gamma_k$ as adoption grows; if ρ ≥ 1, no excess penalty arises at any exposure.
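The cost expression and adoption rule above can be made concrete with a short sketch. The function names and the parameter values in the loop are hypothetical, chosen only to show how the cost saturates toward the distinctiveness value as exposure grows.

```python
import math


def redundancy_cost(gamma: float, exposure: float, delta: float) -> float:
    """Exposure-dependent redundancy cost: gamma * (1 - exp(-exposure * delta))."""
    return gamma * (1.0 - math.exp(-exposure * delta))


def adopts(benefit: float, gamma: float, exposure: float, delta: float) -> bool:
    """A creator adopts iff the private benefit exceeds the redundancy cost."""
    return benefit > redundancy_cost(gamma, exposure, delta)


# Illustrative values only: gamma = 1.0, delta = 0.3, private benefit = 0.5.
for exposure in (0, 1, 5, 50):
    cost = redundancy_cost(1.0, exposure, 0.3)
    print(exposure, round(cost, 4), adopts(0.5, 1.0, exposure, 0.3))
```

With a zero coefficient the cost is zero at every exposure, which matches the parity case; with a positive coefficient the cost rises monotonically in exposure and is bounded above by the distinctiveness value.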
Empirical findings
- Neutral prompting (T = 1), 50 model-only draws per task-condition: all nine model×task combinations had ρ < 1 under the primary semantic kernel; bootstrap CIs for ρ were below 1.
- Example: GPT-5.4 in the slogan condition showed a large deficit ($\hat{\rho} \approx 0.179$).
- Rarefaction diagnostics: pairwise crowding estimates stabilize with feasible model-only sample sizes, supporting practical development-time evaluation.
- Task-specific kernels (plot-synopsis, concept-bucket, lexical-template) corroborate results across representational levels.
- Generation-protocol variants (temperature sweeps, persona-mixture prompting) can move model-conditions toward parity; crowding is not immutable to prompting/design.

Data & Methods

Tasks and human baselines
- Short stories: 3 compact-fiction prompts from WritingPrompts; 87 human authors (one story each).
- Alternative Uses Task (AUT): socialmuse dataset; 109 human contributors generating 3,047 unaided ideas across five objects (primary uses excluded).
- Smartphone slogans: IRB-approved study with 95 contributors producing 659 slogans (650 unique).
- Each prompt/object/slogan context treated as a task condition k; estimates are computed within-condition and then equally aggregated across conditions.
Models & generation protocols
- Models: GPT-5.4, Claude Sonnet 4.5, Gemini 2.5 Flash.
- Main protocol: neutral prompting, temperature T = 1.0, 50 independent model-only generations per condition.
- Deployment variants: temperature sweeps and a persona-mixture protocol ($2^5$-persona grid over binary levels of the Big Five dimensions).
Crowding kernels
- Primary kernel (semantic): Ksem(x,y) = (1 + cos(embed(x), embed(y))) / 2, mapping cosine similarity to [0,1].
- Task-specific kernels: plot-synopsis similarity for stories, concept-bucket co-membership for AUT, lexical-template overlap for slogans.
- Same kernel applied to both human and model samples for comparability.
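The primary semantic kernel can be sketched as follows. This is a minimal illustration that assumes embeddings are already available as plain lists of floats; the summary does not name the embedding model, so none is assumed here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors, clipped to [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return max(-1.0, min(1.0, dot / (norm_u * norm_v)))


def semantic_kernel(u: Vector, v: Vector) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1] via (1 + cos) / 2."""
    return (1.0 + cosine(u, v)) / 2.0


def kernel_matrix(embeddings: Sequence[Vector]) -> List[List[float]]:
    """Full pairwise kernel matrix for one sample of embedded ideas."""
    return [[semantic_kernel(u, v) for v in embeddings] for u in embeddings]


# Toy 2-d "embeddings": orthogonal vectors map to 0.5, opposite vectors to 0.
toy = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]]
K = kernel_matrix(toy)
```

Applying the same function to the human and the model sample keeps the two crowding estimates on a common scale, in line with the comparability point above.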
Estimation procedure
- Matched-sample bootstrap: for each condition, draw $b_{m,k} = \min(n_{H,k}, n_{A,m,k})$ human units and model generations with replacement; compute mean off-diagonal pairwise K values to estimate $\kappa_{H,k}$ and $\kappa_{A,m,k}$.
- Compute ∆ and ρ per condition; aggregate equally across conditions in a task family; use percentile bootstrap intervals for uncertainty.
- Participant-aware sampling when humans contributed multiple responses (sample participant, then one response) to avoid domination by prolific contributors.
- Rarefaction curves used to assess finite-sample stability.
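The matched-sample bootstrap for a single condition might look like the sketch below. It is an assumption-laden illustration rather than the authors' code: it takes precomputed kernel matrices, omits participant-aware sampling and cross-condition aggregation, and uses simple order-statistic percentiles. All names and defaults (`bootstrap_rho`, `n_boot`, `alpha`, `seed`) are hypothetical.

```python
import random
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def mean_off_diagonal(K: Matrix, idx: Sequence[int]) -> float:
    """Mean of K[i][j] over ordered pairs of distinct positions in the resample.

    Because draws are with replacement, the same item can occupy two
    positions; such pairs contribute K(x, x) to the mean.
    """
    b = len(idx)
    total = sum(K[idx[s]][idx[t]] for s in range(b) for t in range(b) if s != t)
    return total / (b * (b - 1))


def bootstrap_rho(K_h: Matrix, K_a: Matrix, n_boot: int = 1000,
                  alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for the diversity ratio in one condition."""
    rng = random.Random(seed)
    b = min(len(K_h), len(K_a))  # matched sample size
    if b < 2:
        raise ValueError("need at least two items per source")
    rhos = []
    for _ in range(n_boot):
        ih = [rng.randrange(len(K_h)) for _ in range(b)]
        ia = [rng.randrange(len(K_a)) for _ in range(b)]
        kappa_h = mean_off_diagonal(K_h, ih)
        kappa_a = mean_off_diagonal(K_a, ia)
        if kappa_h < 1.0:  # skip degenerate resamples where the ratio is undefined
            rhos.append((1.0 - kappa_a) / (1.0 - kappa_h))
    if not rhos:
        raise ValueError("every resample was degenerate")
    rhos.sort()
    lower = rhos[int((alpha / 2) * (len(rhos) - 1))]
    upper = rhos[int((1 - alpha / 2) * (len(rhos) - 1))]
    return sum(rhos) / len(rhos), lower, upper
```

A condition would be read as below parity when the upper end of the interval sits under 1. A rarefaction check could reuse the same function on progressively larger subsets of the model sample to see where the estimate settles.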
Theoretical results and assumptions
- Independent-exposure and mean-field approximations used to derive adoption-cost expressions; crowding kernel bounded in [0,1].
- Decision-theoretic interpretation requires estimates of $\gamma_k$, adoption probability $p$, and population size $N$ to translate $\Delta$ into expected costs.
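To illustrate how the three context parameters might enter, here is a worked expression under one explicit assumption that the summary does not confirm: each of the other $N-1$ creators adopts independently with probability $p$, so that $X_{-i} \sim \mathrm{Binomial}(N-1, p)$. The paper's own derivation may differ in detail.

```latex
% Assumption (not stated in the summary): X_{-i} ~ Binomial(N-1, p).
% Uses the binomial moment generating function E[e^{tX}] = (1 - p + p e^{t})^{N-1}
% evaluated at t = -\Delta_{m,k}.
\begin{align*}
\mathbb{E}\!\left[C_{m,k}(X_{-i})\right]
  &= \gamma_k \left(1 - \mathbb{E}\!\left[e^{-X_{-i}\,\Delta_{m,k}}\right]\right) \\
  &= \gamma_k \left(1 - \left(1 - p + p\,e^{-\Delta_{m,k}}\right)^{N-1}\right).
\end{align*}
% Mean-field version: replace X_{-i} by its mean (N-1)p.
\begin{equation*}
C^{\mathrm{MF}}_{m,k} = \gamma_k \left(1 - e^{-(N-1)\,p\,\Delta_{m,k}}\right)
\end{equation*}
```

Both expressions are zero when $\Delta_{m,k} = 0$ and approach $\gamma_k$ as $N$ grows with $p$ and $\Delta_{m,k}$ held fixed and positive. Because $1 - e^{-x\Delta}$ is concave in $x$, Jensen's inequality implies the mean-field value is an upper bound on the exact expectation under this assumption.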

Implications for AI Economics

Externalities and adoption dynamics
- Shared use of generative models can produce negative externalities (excess crowding) that reduce private returns to distinctiveness and alter aggregate welfare.
- The framework cleanly separates a model-intrinsic crowding parameter (∆ or ρ) from population context (N, adoption prevalence p) and value of distinctiveness (γ), enabling modular welfare and adoption analyses.
- In markets where distinctiveness is valuable (high γ) and adoption rates are high, models with ρ < 1 raise the private benefit threshold for adoption and can produce large aggregate redundancy costs.
Policy, platform, and firm strategy applications
- Ex ante auditing: developers and platforms can estimate ∆ and ρ from model-only samples before deployment to audit crowding risk and compare model-conditions.
- Product design & mitigation: generation-protocol choices (e.g., higher temperature, persona mixtures, diversity-promoting decoding) are actionable levers to reduce crowding and move toward parity.
- Pricing and market design: knowledge of crowding externalities could inform subscription pricing, feature segmentation, or differentiation (e.g., offering diversified-generation modes as a premium privacy/uniqueness feature).
- Regulation and standards: ρ and ∆ provide candidate metrics for assessing population-level cultural/creative impacts and could inform guidelines around the deployment of ideation tools in domains where distinctiveness matters (journalism, marketing, patent ideation).
Research and measurement implications
- Practicality: the method requires only model-only and matched human-only samples and stabilizes with modest sample sizes, making it feasible for routine use in model development cycles.
- Decision support: combining source-level ρ estimates with market parameters (p, N, γ) yields quantitative predictions of expected redundancy costs and critical benefit thresholds for adoption—useful for forecasting adoption and welfare impacts.
Limitations and caveats relevant to economic interpretation
- Kernel dependence: results depend on the choice of crowding kernel; different representational levels may yield different quantitative conclusions—careful kernel selection is essential for domain-relevant policy.
- Behavioral assumptions: the adoption game uses independent-exposure and mean-field approximations; real-world strategic behavior, network structure, and feedback loops (e.g., personalization, model updates) can complicate dynamics.
- Mapping to realized human outputs: source-level excess crowding is necessary but not sufficient for realized human-AI diversity collapse—users may selectively use or transform model outputs; empirical human–AI interaction data remains necessary to validate realized effects in deployment contexts.
- Value of distinctiveness (γ) and beliefs about others’ adoption (p) are context-dependent and may be hard to estimate; welfare conclusions require domain-specific calibration.

Overall, the paper provides a usable, theoretically grounded ex ante tool for measuring model-driven crowding risk and links that measurement to adoption incentives and aggregate externalities—offering developers, platforms, and policymakers a concrete pathway to audit and mitigate population-level harms from creative AI.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper provides a clear theoretical identification argument and applies it empirically across multiple creative tasks and three frontier LLMs, with robustness checks on sample size and generation protocols; however, it does not observe real-world adoption or downstream economic outcomes and rests on structural assumptions (crowding kernels, representativeness of human baselines) that limit causal claims about economic impact.
Methods Rigor: high. The authors derive formal identifiability results, define transparent summary statistics (Δ and ρ), test across multiple tasks and models, report stability with feasible sampling, and explore protocol/prompt variants to probe mechanisms; remaining concerns are explicit assumptions about kernels and external validity rather than weaknesses in implementation or inference.
Sample: Model-only generations from three frontier large language models across three creative tasks (short stories, marketing slogans, alternative-uses), with matched unaided human baseline samples; multiple generation-protocol variants and crowding-kernel specifications used to test robustness; sample sizes reported sufficient for estimator stability.
Themes: innovation adoption
Identification: Models ideas as congestible resources and compares model-only generation distributions to matched unaided human baselines; within-distribution contrasts identify an excess-crowding coefficient (Δ) and a human-relative diversity ratio (ρ) under structural assumptions about crowding kernels, giving an ex ante, model-only estimand for population-level crowding without requiring observed human-AI interaction.
Generalizability
- Tasks limited to three creative domains (short stories, slogans, alternative-uses) and may not generalize to technical or domain-specific idea markets.
- Only three frontier LLMs evaluated; results may differ for smaller or future models.
- Model-only generation comparison does not capture real-world adoption dynamics, market incentives, or human-AI interactive workflows.
- Identifiability relies on assumed crowding kernels and on matched human baselines being representative of population creativity.
- Cultural, language, and domain-specific variation not extensively explored.

Claims (8)

Claim	Direction	Confidence	Outcome	Details
Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones. Creativity	negative	high	loss of value due to similarity (population-level creative value)	0.08
This creates an evaluation blind spot, as AI can improve individual outputs while increasing population-level crowding. Creativity	negative	high	population-level crowding (diversity collapse)	0.08
We introduce a human-relative framework for benchmarking AI-induced human diversity collapse without requiring human-AI interaction data, providing an ex ante protocol to estimate crowding risk from model-only generations and matched unaided human baselines. Creativity	positive	high	ability to benchmark AI-induced diversity collapse (method performance)	0.48
By modeling ideas as congestible resources, we show that source-level crowding is identifiable from within-distribution comparisons, yielding an excess-crowding coefficient Δ and a human-relative diversity ratio ρ. Creativity	positive	high	identifiability of source-level crowding; definition of Δ and ρ	0.48
We show that ρ ≥ 1 is the no-excess-crowding parity condition and connect Δ to an adoption game with exposure-dependent redundancy costs. Creativity	neutral	high	parity condition for no-excess-crowding (ρ ≥ 1) and economic/game-theoretic relation of Δ	0.48
Across short stories, marketing slogans, and alternative-uses tasks, three frontier LLMs fall below parity across crowding kernels. Creativity	negative	high	human-relative diversity ratio (ρ) indicating excess crowding	n=3 0.48
Estimates stabilize with feasible model-only sample sizes. Creativity	positive	medium	stability/convergence of crowding estimates as model-only sample size increases	0.14
Generation-protocol variants show that crowding can be reduced through targeted design, making diversity collapse an actionable, development-time evaluation target for population-aware creative AI. Creativity	positive	medium	change in crowding (Δ or ρ) under generation-protocol variants	0.29