Incentives shape how humans co-create with generative AI

Generative AI is quickly becoming an integral part of people's everyday workflows. Early evidence has shown that while generative AI can increase individual-level productivity, it does so at the cost of collective diversity, potentially narrowing the set of ideas and perspectives produced. Our research stands in contrast to this concern: through a pre-registered randomized controlled trial, we show that incentives mediate AI's homogenizing force in a creative writing task where participants can use AI interactively. Participants rewarded for originality relative to peers produce collectively more diverse writing than those rewarded for quality alone. This divergence is driven not by abandoning AI, but by how participants use it: those incentivized for originality incorporate fewer AI suggestions verbatim, relying on the model more selectively for brainstorming, proofreading, and targeted edits. Our results reveal that the effects of generative AI depend not only on the technology itself, but also on the behavioral strategies and incentive structures surrounding its use.

Summary

Main Finding

In a pre-registered randomized trial (n = 200), generative AI (GPT5-Mini) tends to homogenize creative outputs when it produces full drafts, but human co-creation substantially offsets that homogenizing force. Crucially, incentive structures matter: rewarding originality (vs. technical quality) increases the diversity of final stories when people have access to AI. This increase arises not from abandoning the tool but from using it differently: participants incentivized for originality incorporate fewer AI suggestions verbatim and use the model more selectively (brainstorming, proofreading, targeted edits). However, even with originality incentives, AI-assisted outputs remain less diverse than purely human-written stories.

Key Points

Experimental design: pre-registered 2 × 2 between-subjects study crossing tool access (AI vs. SELF) with incentive type (Originality, O, vs. Technical quality, T), with 50 participants per cell.
Task: write a 250–350-word short story in a 25-minute session (5 minutes of brainstorming, 20 minutes of writing), with rich interaction logs (editor snapshots every 5 seconds and full AI transcripts).
AI usage: participants in the AI condition were required to use the tool at least once; 91% requested at least one full draft, and the mean number of conversation turns was 3.57.
Time use: AI users spent slightly less time on average (18.6 vs. 20.6 minutes), a modest difference.
Diversity measures: a pre-specified suite of metrics (cosine similarities from multiple embedding models, style embeddings, byte-level embeddings, and n-gram/token-based metrics including compression ratio, BLEU, ROUGE-L, and n-gram diversity score). The participant-level outcome is the leave-one-out average similarity of each story to the other stories in its cell.
Main empirical patterns:
- AI first drafts are highly similar to each other (homogenizing).
- Human editing and iterative prompting substantially increase diversity relative to raw AI drafts; final stories move closer to the human-only distribution.
- Incentivizing originality increases diversity among AI-assisted writers compared to those incentivized for technical quality.
- The mechanism: originality incentives reduce direct adoption/anchoring on AI drafts (fewer verbatim incorporations); users rely more selectively on the model.
- Conditional on adopting AI suggestions, participants incentivized for originality spent more time prompting but did not achieve additional diversity gains.
- Participants with higher prior AI experience tended to rely more on AI suggestions and produced more homogeneous stories.
Statistical approach: pre-registered hypotheses tested with leave-one-out similarity measures and Welch’s t-tests across multiple similarity/diversity metrics. Results are consistent across metrics and robust to multiple-hypothesis correction (reported as a correlated set of views on latent diversity).
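The participant-level outcome and the Welch test described above can be sketched as follows. This is a minimal illustration, assuming each story has already been mapped to an embedding vector (plain Python lists here; the study's embedding models are not reproduced), and every function name is hypothetical. Higher leave-one-out similarity means a more homogeneous cell, so comparing AI stories (x) against SELF stories (y) would give a positive t under H1.

```python
import math
from statistics import mean, variance


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def leave_one_out_similarity(embeddings):
    """Hypothetical helper: for each story in a cell, the average cosine
    similarity to every *other* story in the same cell (higher = less diverse)."""
    n = len(embeddings)
    scores = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    return scores


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.
    Does not assume equal variances; assumes at least one group has
    non-zero sample variance."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


if __name__ == "__main__":
    # Toy 2-D "embeddings": one tightly clustered cell, one spread-out cell.
    clustered = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
    spread = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    t, df = welch_t(leave_one_out_similarity(clustered),
                    leave_one_out_similarity(spread))
    print(f"t = {t:.2f}, df = {df:.2f}")
```

For a p-value, the statistic is compared with a t distribution with `df` degrees of freedom; `scipy.stats.ttest_ind(x, y, equal_var=False)` runs the same Welch test in a single call.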

Data & Methods

Sample: 200 Prolific participants (50 per experimental cell). Attention controls: full-screen-only interface, inactivity checks, minimum time thresholds, post-hoc filtering for low-effort sessions.
Treatments:
- AI vs. SELF: AI condition had an integrated LLM assistant (GPT5-Mini); SELF wrote without the tool.
- Incentives: Originality (bonus for the top 25% in originality; $2–7) vs. Technical Quality (rubric-based grade bonuses; no cap; A = $2.50, B = $1).
Interaction logging: editor snapshots every 5 seconds, full AI conversation transcripts, and manual flagging of valid drafts produced by the AI.
Classification of AI uses (from transcripts): editing requests (38%), drafting (30%), creative consulting (14%), mechanical assistance (3%), others (15%).
Diversity measurement: ensemble of similarity/diversity metrics (embedding cosine similarities from several models, style embeddings, byte-level embeddings, compression ratio, BLEU, ROUGE-L, n-gram diversity). Outcome is average within-cell similarity (leave-one-out). Hypotheses:
- H1: AI reduces diversity vs. SELF.
- H2: Originality incentive increases diversity vs. Technical-quality incentive.
Inferential tests: two-sample difference-in-means (Welch’s t-test) for each metric; interpret pattern across correlated metrics rather than a single p-value. Pre-registration and analysis plan published on OSF.
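Two of the token-based metrics listed above can be computed with the standard library alone. The sketch below assumes common conventions that this summary does not spell out: compression ratio as the raw size over the zlib-compressed size of the concatenated stories (higher means more redundancy, hence less diversity), and n-gram diversity as the sum over n = 1 to 4 of unique n-grams divided by total n-grams, using naive whitespace tokenization. The study's exact definitions and tokenization may differ, and the function names are hypothetical.

```python
import zlib


def compression_ratio(texts):
    """Hypothetical helper: raw byte length of the concatenated texts divided
    by their zlib-compressed length. Higher = more redundant = less diverse."""
    raw = " ".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))


def ngram_diversity(texts, max_n=4):
    """Hypothetical helper: sum over n = 1..max_n of (unique n-grams / total
    n-grams) pooled across all texts. Higher = more lexically diverse."""
    tokens_per_text = [t.lower().split() for t in texts]
    score = 0.0
    for n in range(1, max_n + 1):
        # Collect every contiguous n-gram from every text (no cross-text n-grams).
        grams = [
            tuple(toks[i:i + n])
            for toks in tokens_per_text
            for i in range(len(toks) - n + 1)
        ]
        if grams:
            score += len(set(grams)) / len(grams)
    return score
```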

Implications for AI Economics

Incentives shape aggregate outcomes from human-AI co-creation. Market or organizational incentive structures (bonuses, grading, promotion criteria) materially influence whether AI increases productivity at the cost of reduced collective diversity.
Product differentiation and cultural variety depend not only on model capabilities but on how agents are rewarded and how they strategically use AI. Firms/platforms that reward distinctiveness (e.g., editorial or commissioning incentives emphasizing novelty) can partially mitigate AI-induced homogenization.
Platform/UI design matters as a policy lever: defaults, templates, or recommendation systems that encourage selective use, iterative editing, or explicit originality signals can change adoption behavior and thus the distribution of outputs.
Labor-market implications: more experienced AI users may increase homogeneity if their standard practice is heavy reliance on model outputs. This could compress observable stylistic differentiation across workers, affecting reputational signaling, wage dispersion, and returns to creative skill.
Trade-offs: rewarding originality increases diversity but may involve higher effort, slower productivity, or lower measured “technical quality” under some rubrics. Policymakers and firms should weigh the social value of diversity (innovation externalities, cultural variety) against short-term productivity gains from standardized AI assistance.
Regulation & platform policy: interventions that change incentives (e.g., disclosure rules, copyright and attribution regimes, reward structures for human-authored originality) could influence the macro-level cultural and economic impacts of generative AI.
Research gaps for policy and practice: need field experiments with stronger/real-world incentives, longer-term studies of learning and habituation, and exploration of how product markets respond to shifts in diversity (consumer surplus, innovation rates, competition).

Limitations noted by authors: controlled Prolific sample and lab-like payoffs (bonuses) are not perfect proxies for real-world stakes; short-term task frame (one session) does not reveal long-run dynamics; choice of AI model and similarity metrics influences measured effects. The study is pre-registered and transparent about exploratory analyses.

Assessment

Paper type: RCT
Evidence strength: high. Random assignment and pre-registration provide strong causal identification of incentive effects on behavior and collective output; the study also observes intermediate mechanism variables (how AI suggestions were used), which strengthens the causal story. The main limitations are scope (a single task and setting) and uncertain sample representativeness.
Methods rigor: high. An RCT with pre-registration, direct behavioral measurement of AI use, and analysis of textual outputs indicate a rigorous design and implementation; however, external validity and the measurement choices for 'diversity' remain concerns.
Sample: A pre-registered RCT sample of human participants completing an interactive creative-writing task, randomly assigned to AI vs. SELF and to incentive conditions (originality vs. quality); textual outputs and in-task AI interactions (suggestions accepted verbatim, edits, brainstorming use) were recorded. (Per the Data & Methods section: 200 Prolific participants, 50 per cell, with a GPT5-Mini assistant; demographics are not reported in this summary.)
Themes: human_ai_collab, org_design
Identification: Pre-registered randomized controlled trial. Participants were randomly assigned to different incentive conditions (reward for originality relative to peers vs. reward for quality), and causal effects are identified by comparing outcomes across these randomized arms; mediation is explored using observed in-task AI interactions (e.g., rate of verbatim incorporation, type of edits).
Generalizability:
- The single-task setting (creative writing) may not generalize to other tasks or industries.
- Short-term experimental incentives differ from real-world compensation and organizational incentives.
- The participant pool is an online Prolific sample, so demographic representativeness is unclear.
- Results may depend on the particular AI model, interface, or prompt design used.
- Metrics of 'collective diversity' in written text may not map directly to firm-level productivity or innovation outcomes.
- If the sample's cultural and language context is narrow, results may not generalize to other populations.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Early evidence has shown that generative AI can increase individual-level productivity.	Organizational Efficiency	positive	high	individual-level productivity	0.6
Early evidence suggests generative AI increases productivity but does so at the cost of collective diversity, potentially narrowing the set of ideas and perspectives produced.	Creativity	negative	high	collective diversity of produced ideas/perspectives	0.6
Through a pre-registered randomized controlled trial, we show that incentives mediate AI's homogenizing force in a creative writing task where participants can use AI interactively.	Creativity	mixed	high	extent to which incentives alter AI's homogenizing effect (mediating effect)	0.6
Participants rewarded for originality relative to peers produce collectively more diverse writing than those rewarded for quality alone.	Creativity	positive	high	collective diversity of writing	0.6
The divergence in collective outputs is not driven by participants abandoning AI, but by how participants use it.	Task Allocation	null_result	high	continued use of AI (vs. abandonment)	0.6
Participants incentivized for originality incorporate fewer AI suggestions verbatim.	Task Allocation	negative	high	rate of verbatim incorporation of AI suggestions	0.6
Those incentivized for originality rely on the model more selectively for brainstorming, proofreading, and targeted edits.	Task Allocation	positive	high	types of tasks for which AI is used (brainstorming, proofreading, targeted edits)	0.6
The effects of generative AI depend not only on the technology itself, but also on the behavioral strategies and incentive structures surrounding its use.	Organizational Efficiency	mixed	high	impact of incentives and strategies on AI outcomes	0.6

Tweak incentives, not the model: paying writers for originality rather than quality curbs generative AI's tendency to make outputs more alike, because originality payoffs prompt users to lean on the model for brainstorming and targeted edits instead of copying suggestions verbatim. AI-assisted stories nonetheless remain less diverse than human-only ones.