
Feedback interdependencies in recommendation systems can be made both stable and fair: under a multi-agent bandit model the induced cooperative game has a non-empty core and is convex for identical creators so the Shapley value lies in the core; for heterogeneous creators a simple regret-based payout rule guarantees core membership and satisfies most Shapley axioms.

Creator Incentives in Recommender Systems: A Cooperative Game-Theoretic Approach for Stable and Fair Collaboration in Multi-Agent Bandits

Ramakrishnan Krishnamurthy, Arpit Agarwal, Lakshminarayanan Subramanian, Maximilian Nickel · April 09, 2026

Source: arXiv · Type: theoretical · Evidence: n/a · Relevance: 7/10

Modeling creators' interaction through a multi-agent stochastic linear bandit yields a transferable-utility cooperative game with a non-empty core (convex for homogeneous agents under mild algorithmic conditions), and a simple regret-based payout rule provides core membership and approximates Shapley fairness in empirical illustrations.

User interactions in online recommendation platforms create interdependencies among content creators: feedback on one creator's content influences the system's learning and, in turn, the exposure of other creators' content. To analyze incentives in such settings, we model collaboration as a multi-agent stochastic linear bandit problem with a transferable utility (TU) cooperative game formulation, where a coalition's value equals the negative sum of its members' cumulative regrets. We show that, for identical (homogeneous) agents with fixed action sets, the induced TU game is convex under mild algorithmic conditions, implying a non-empty core that contains the Shapley value and ensures both stability and fairness. For heterogeneous agents, the game still admits a non-empty core, though convexity and Shapley value core-membership are no longer guaranteed. To address this, we propose a simple regret-based payout rule that satisfies three out of the four Shapley axioms and also lies in the core. Experiments on the MovieLens-100k dataset illustrate when the empirical payout aligns with -- and diverges from -- Shapley fairness across different settings and algorithms.

Summary

Main Finding

Modeling creators as agents in a multi-agent linear bandit, and defining a transferable-utility (TU) coalition game with coalition value v(C) = −(sum of members’ cumulative regrets), the paper shows:
- For homogeneous agents with a shared, fixed action set and a mild (natural) assumption on single-agent regret dynamics, the induced TU game is convex (Theorem 1). Hence the core is non-empty and contains the Shapley value → full collaboration (grand coalition) is stable and the Shapley allocation is fair and stable (Corollary 1).
- For heterogeneous agents (different action sets), the game still admits a non-empty core under reasonable algorithmic assumptions, but convexity and Shapley-in-core need not hold. To address fairness/stability trade-offs, the authors propose a simple regret-based payout rule that (under their assumptions) satisfies three of the four Shapley axioms and lies in the core (Theorem 3).
- Empirical simulations on MovieLens-100k illustrate regimes where empirical payouts align with Shapley fairness and regimes where they diverge, depending on heterogeneity and algorithm choice.

Key Points

Model
- Agents play a common linear bandit instance over T rounds; at each round t, agent a chooses an action x_{a,t} from its action set X_{a,t} and observes the reward y_{a,t} = ⟨θ*, x_{a,t}⟩ + noise.
- Agents can form disjoint coalitions; within a coalition members fully share their action–reward histories and run a coalition-level multi-agent bandit algorithm.
- Coalition value v(C) is defined as the negative sum of expected pseudo-regrets of coalition members (higher v = lower total regret).
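The model above can be made concrete with a minimal sketch. It uses the realized pseudo-regret of a given sequence of plays, whereas the paper's coalition value takes an expectation; the helper names (`dot`, `observe_reward`, `pseudo_regret`, `coalition_value`) and the Gaussian noise are illustrative assumptions, not taken from the paper.

```python
import random


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def observe_reward(theta_star, action, noise_std, rng):
    """Noisy linear reward y = <theta*, x> + noise (Gaussian noise is an assumption)."""
    return dot(theta_star, action) + rng.gauss(0.0, noise_std)


def pseudo_regret(theta_star, action_set, action):
    """Gap between the best mean reward in the action set and the chosen action's mean."""
    best_mean = max(dot(theta_star, x) for x in action_set)
    return best_mean - dot(theta_star, action)


def coalition_value(theta_star, plays):
    """v(C) = -(sum of members' cumulative pseudo-regrets).

    `plays` maps each agent to a list of (action_set, chosen_action) pairs,
    one pair per round. This layout is hypothetical.
    """
    total = 0.0
    for rounds in plays.values():
        for action_set, action in rounds:
            total += pseudo_regret(theta_star, action_set, action)
    return -total
```

A higher (less negative) `coalition_value` corresponds to lower total regret, matching the sign convention of v(C).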
Fixed-action (homogeneous) setting
- Algorithm Mul: a meta-algorithm that runs a single-agent bandit algorithm Sin on a pooled buffer of rewards collected when coalition agents play actions suggested by Sin. All agents in a coalition simultaneously play the same action when Sin requests a sample for that action.
- Assumption 1 (on Sin): expected cumulative regret R(t) has (1) strict concavity in time (discrete second derivative negative) and (2) a lower bound preventing faster-than-logarithmic convergence (discrete second derivative bounded below by −c t^{-2+ε}). These are mild and consistent with many bandit guarantees/empirical curves.
- Lemma 1: coalition regret under Mul for coalition of size m behaves like single-agent regret run for mT rounds up to an additive O(mK) term (K = |action set|). This reduction establishes that coalition marginal returns are increasing in coalition size.
- Theorem 1: under Assumption 1 and large T, the collaboration game is convex (supermodular value function).
- Corollary: convexity ⇒ non-empty core and Shapley value ∈ core ⇒ stability + fairness.
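The chain from Lemma 1 to the corollary can be checked numerically on a small example. The sketch below assumes that a coalition of size m has total regret exactly R(mT), dropping the additive O(mK) term, and uses R(t) = sqrt(t) as a stand-in concave regret curve; both choices are illustrative assumptions, and the brute-force checks are only practical for a handful of players.

```python
from itertools import combinations
from math import factorial, sqrt


def make_symmetric_game(horizon, regret):
    """v(C) = -regret(|C| * horizon); idealized reading of Lemma 1 (O(mK) term dropped)."""
    def v(coalition):
        return -regret(len(coalition) * horizon)
    return v


def all_coalitions(players):
    """Yield every subset of `players` as a frozenset, including the empty set."""
    for size in range(len(players) + 1):
        for members in combinations(players, size):
            yield frozenset(members)


def is_convex(players, v, tol=1e-9):
    """Supermodularity: v(S | T) + v(S & T) >= v(S) + v(T) for all S, T."""
    subsets = list(all_coalitions(players))
    return all(v(s | t) + v(s & t) >= v(s) + v(t) - tol
               for s in subsets for t in subsets)


def shapley(players, v):
    """Exact Shapley value by enumerating coalitions (exponential in the number of players)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for c in all_coalitions(others):
            weight = factorial(len(c)) * factorial(n - len(c) - 1) / factorial(n)
            total += weight * (v(c | {i}) - v(c))
        phi[i] = total
    return phi


def in_core(players, v, payoff, tol=1e-9):
    """Efficient, and no coalition can get more on its own than it is paid."""
    if abs(sum(payoff.values()) - v(frozenset(players))) > tol:
        return False
    return all(sum(payoff[i] for i in c) >= v(c) - tol
               for c in all_coalitions(players))


players = (0, 1, 2, 3)
game = make_symmetric_game(horizon=100, regret=sqrt)
phi = shapley(players, game)
print(is_convex(players, game), in_core(players, game, phi))
```

Because the agents are identical in this toy game, the Shapley value is simply an equal split of v(N); the check confirms that this split also lies in the core.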
Heterogeneous-action setting
- Assumption 2: algorithms use anonymized pools of other agents’ (action, reward) samples (no identity info), and an agent’s action choice depends on its own past, the anonymized multiset of others’ samples, and its present action set.
- Theorem 2 (informal): under reasonable algorithmic behavior, the TU game has a non-empty core, so stable grand coalition allocations exist, but the game need not be convex and the Shapley value may be outside the core.
- Remedy: a simple regret-based payout rule is proposed and proven (Theorem 3) to satisfy three Shapley axioms (efficiency, symmetry, null-player — see note below) while belonging to the core; the one Shapley axiom typically violated is additivity (common for practical allocation rules).
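That a game can have a non-empty core while its Shapley value falls outside it is easy to see on a textbook-style three-player game. The toy values below are hypothetical and unrelated to bandit regrets; they only illustrate the phenomenon Theorem 2 allows for, not the paper's construction or its payout rule.

```python
from fractions import Fraction
from itertools import combinations, permutations

PLAYERS = (1, 2, 3)

# Hypothetical toy game: player 1 is needed by both profitable pairs.
TOY_VALUES = {
    frozenset(): 0,
    frozenset({1}): 0, frozenset({2}): 0, frozenset({3}): 0,
    frozenset({1, 2}): 1, frozenset({1, 3}): 1, frozenset({2, 3}): 0,
    frozenset({1, 2, 3}): 1,
}


def v(coalition):
    return TOY_VALUES[frozenset(coalition)]


def shapley(players, value):
    """Average marginal contribution over all join orders (exact, using fractions)."""
    orders = list(permutations(players))
    phi = {i: Fraction(0) for i in players}
    for order in orders:
        seen = set()
        for i in order:
            phi[i] += value(seen | {i}) - value(seen)
            seen.add(i)
    return {i: total / len(orders) for i, total in phi.items()}


def blocking_coalitions(players, value, payoff):
    """Coalitions that are paid strictly less than they could earn alone."""
    return [c for size in range(1, len(players) + 1)
            for c in combinations(players, size)
            if sum(payoff[i] for i in c) < value(c)]


def in_core(players, value, payoff):
    efficient = sum(payoff.values()) == value(players)
    return efficient and not blocking_coalitions(players, value, payoff)


phi = shapley(PLAYERS, v)
print(phi, in_core(PLAYERS, v, phi))          # Shapley value and its core status
print(in_core(PLAYERS, v, {1: 1, 2: 0, 3: 0}))  # a core allocation exists
```

Here the game is not convex (v({1,2}) + v({1,3}) exceeds v({1,2,3}) + v({1})), the core contains the allocation (1, 0, 0), and the Shapley value is blocked by the pairs {1,2} and {1,3}.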
Empirical validation
- Simulations derived from MovieLens-100k demonstrate practical behavior: when agents are more homogeneous the empirical payout aligns well with Shapley; with heterogeneity and certain algorithms the regret-based payout and empirical allocations diverge from Shapley, illustrating the relevance of the proposed payout rule.

Data & Methods

Formalism
- Multi-agent stochastic linear bandit with parameter θ* ∈ R^d. Agents have action sets X_{a,t} ⊂ R^d; rewards are linear plus sub-Gaussian noise.
- Coalitions form before play begins; within a coalition, members fully share action–reward pairs (subject to anonymity in the heterogeneous analysis).
- Value function v_{Alg,I,T}(C) := −∑_{a∈C} R^C_a(Alg, I, T), where R^C_a is the expected pseudo-regret of agent a under coalition C using algorithm Alg on instance I over horizon T.
Algorithms analyzed
- Mul (multi-agent meta-algorithm) for fixed-action instances: uses single-agent Sin as black-box with a shared reward buffer; Sin’s sampling drives which action the coalition plays next and the buffer is filled by all coalition agents’ plays of that action.
- General multi-agent algorithms for heterogeneous action sets are characterized only via behavioral assumptions (Assumption 2), not a specific algorithm.
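One plausible reading of Mul can be sketched as follows. This is an interpretation of the description above, not the paper's pseudocode: it assumes a finite-armed instance (K arms with fixed means), uses UCB1 as a stand-in for the black-box Sin, and assumes Sin is fed one buffered reward per request, with all coalition members playing the requested arm only when that arm's buffer is empty. The names `UCB1` and `run_mul` are hypothetical.

```python
import math
import random
from collections import deque


class UCB1:
    """Stand-in single-agent algorithm Sin (the paper treats Sin as a black box)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0  # number of rewards consumed so far

    def select(self):
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm  # try every arm once first
        return max(
            range(len(self.counts)),
            key=lambda k: self.sums[k] / self.counts[k]
            + math.sqrt(2.0 * math.log(self.t) / self.counts[k]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.t += 1


def run_mul(sin, arm_means, n_agents, horizon, noise_std, rng):
    """Run the coalition for `horizon` rounds; return its total pseudo-regret."""
    buffers = [deque() for _ in arm_means]  # shared reward buffer, one queue per arm
    best_mean = max(arm_means)
    total_regret = 0.0
    rounds_played = 0
    while rounds_played < horizon:
        arm = sin.select()
        if not buffers[arm]:
            # Buffer empty: every coalition member plays this arm in the same round.
            for _ in range(n_agents):
                buffers[arm].append(arm_means[arm] + rng.gauss(0.0, noise_std))
            total_regret += n_agents * (best_mean - arm_means[arm])
            rounds_played += 1
        # Sin consumes one buffered reward per request.
        sin.update(arm, buffers[arm].popleft())
    return total_regret
```

Under this reading, Sin consumes at most m·T rewards over T coalition rounds, and at most m − 1 rewards per arm are left unused in the buffer, which is consistent with the O(mK) additive term in Lemma 1.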
Theoretical tools
- Cooperative game theory: convexity (supermodularity), core non-emptiness, Shapley value properties.
- Regret analysis: relate coalition regret to single-agent regret scaled by coalition size (mT), control additive discrepancies, and use monotonicity/concavity properties of R(t) to show supermodularity.
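The way concavity of R(t) feeds into supermodularity can be sketched heuristically. This is not the paper's proof: it assumes v(C) ≈ −R(|C|·T) as suggested by Lemma 1, drops the additive O(mK) term, takes R(0) = 0, and introduces the shorthand s = |S| and s' = |S'|.

```latex
% Heuristic sketch under the assumptions stated above.
% Convexity (supermodularity): for all S \subseteq S' \subseteq N \setminus \{i\},
\[
  v(S \cup \{i\}) - v(S) \;\le\; v(S' \cup \{i\}) - v(S').
\]
% With v(C) \approx -R(|C|\,T), the marginal contribution of agent i to S is
\[
  v(S \cup \{i\}) - v(S) \;\approx\; -\bigl[\, R((s+1)T) - R(sT) \,\bigr].
\]
% Concavity of R means increments over a window of length T shrink as the
% window starts later, so for s \le s':
\[
  R((s+1)T) - R(sT) \;\ge\; R((s'+1)T) - R(s'T).
\]
% Negating both sides recovers the convexity inequality:
\[
  -\bigl[\, R((s+1)T) - R(sT) \,\bigr] \;\le\; -\bigl[\, R((s'+1)T) - R(s'T) \,\bigr].
\]
```

Controlling the dropped O(mK) discrepancy is where the lower bound in Assumption 1 and the large-T condition of Theorem 1 would be expected to enter.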
Empirical setup
- Construct bandit problem instances from MovieLens-100k (details in paper): agents and action sets derived from item/user features; run multiple algorithms and compute coalition regrets and payout allocations; compare the proposed payout to the Shapley allocation and to empirical outcomes (plots/metrics reported in the paper).

Implications for AI Economics

Incentives and revenue-sharing on recommender platforms
- Formal link between collaborative learning (shared user feedback) and creators’ payoffs: value of collaboration is captured by joint regret reductions, so revenue-sharing based on marginal regret reductions is principled.
- For homogeneous creator populations (similar content/audiences), platforms can design payout schemes that are both stable (core) and fair (Shapley)—encouraging full cooperation and honest data sharing.
- For heterogeneous creators, naive Shapley-based splits may not be stable; the regret-based payout proposed gives a practically implementable, core-stable alternative that preserves most Shapley fairness properties.
Platform design and policy
- Platforms can operationalize “learning-contribution” payments by estimating each creator’s marginal contribution to collective learning (via regret reductions or approximations) and using core-compatible payout rules to avoid destabilizing incentives.
- Because exact Shapley computation is exponential, platforms will need scalable approximations; the paper’s regret-based rule is a simpler alternative with provable core-membership under model assumptions.
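One standard scalable approximation is Monte Carlo sampling over random join orders. The sketch below is a generic technique, not something the paper proposes; the function name `monte_carlo_shapley` and the toy value function in the usage lines are illustrative assumptions.

```python
import random
from math import sqrt


def monte_carlo_shapley(players, v, n_samples, rng):
    """Estimate Shapley values by averaging marginal contributions over random orders.

    Each sample costs len(players) evaluations of v, versus exponentially many
    coalitions for the exact computation.
    """
    totals = {i: 0.0 for i in players}
    order = list(players)
    for _ in range(n_samples):
        rng.shuffle(order)
        coalition = frozenset()
        previous = v(coalition)
        for i in order:
            coalition = coalition | {i}
            current = v(coalition)
            totals[i] += current - previous  # marginal contribution of i in this order
            previous = current
    return {i: total / n_samples for i, total in totals.items()}


# Toy value function (assumption): v(C) = -sqrt(|C| * 100), i.e. regret sqrt(t) with T = 100.
def toy_value(coalition):
    return -sqrt(len(coalition) * 100)


estimate = monte_carlo_shapley(range(4), toy_value, n_samples=2000, rng=random.Random(0))
print(estimate)
```

Every sampled order distributes exactly v(N), so the estimate is efficient up to floating-point error by construction; only the split across players carries sampling noise.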
Strategic behavior and mechanism design
- The model abstracts away creator content choice (treats content as given). In practice creators may change content strategically if payouts depend on measured contributions — suggesting further mechanism design to make allocations incentive-compatible and robust to manipulation.
Limitations and open directions relevant to AI economics
- Theoretical assumptions: the linear-bandit structure, the anonymized-data-consumption assumption, and the specific regret-shape assumption are simplifications; real-world recommender dynamics (nonlinear models, non-stationary users, constrained visibility) could alter coalition values.
- The model treats the platform as a neutral aggregator; real platforms control recommendation policies and revenue flows. Implementing TU-style payouts requires platform buy-in or regulation.
- Future work: incorporate strategic content choice by creators, consider non-transferable utility or partial observability, extend to nonlinear/high-capacity models, and design computationally efficient, manipulation-resistant payout mechanisms.

Note on Shapley axioms: the paper shows their regret-based payout satisfies three of the four classic Shapley axioms (efficiency, symmetry, null-player). The usual axiom that is harder to preserve in pragmatic schemes is additivity (linearity across games), which the proposed rule does not generally satisfy.
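For reference, the four axioms named in this note have the following textbook statements for an allocation rule φ on a game (N, v). The notation is the standard one from cooperative game theory and may differ from the paper's.

```latex
% Standard statements of the four Shapley axioms; notation is not taken from the paper.
\begin{align*}
  &\textbf{Efficiency:}  && \sum_{i \in N} \varphi_i(v) = v(N). \\
  &\textbf{Symmetry:}    && \text{if } v(S \cup \{i\}) = v(S \cup \{j\})
                            \text{ for all } S \subseteq N \setminus \{i, j\},
                            \text{ then } \varphi_i(v) = \varphi_j(v). \\
  &\textbf{Null player:} && \text{if } v(S \cup \{i\}) = v(S)
                            \text{ for all } S \subseteq N \setminus \{i\},
                            \text{ then } \varphi_i(v) = 0. \\
  &\textbf{Additivity:}  && \varphi_i(v + w) = \varphi_i(v) + \varphi_i(w)
                            \text{ for any two games } v, w \text{ on } N.
\end{align*}
```

The Shapley value is the unique rule satisfying all four, so any rule that differs from it must give up at least one of them.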

Assessment

Paper Type: theoretical
Evidence Strength: n/a. The paper provides formal theoretical results (game-theoretic proofs) about properties of the induced cooperative game and supplements these with illustrative experiments on MovieLens-100k; it does not present causal empirical identification of AI's economic impacts.
Methods Rigor: high. Rigorous analytical derivations and proofs establish core and convexity results under clearly stated model and algorithmic assumptions; experiments and simulations illustrate behaviors, though they are limited in scope and primarily demonstrative.
Sample: Theoretical analysis of a multi-agent stochastic linear bandit model with transferable utility; empirical illustrations use simulations and experiments on the MovieLens-100k dataset (≈100k ratings, standard recommender dataset) to compare regret-based payouts and Shapley allocations across algorithms and settings.
Themes: governance, org_design
Generalizability:
- Relies on stochastic linear bandit assumptions (linearity, stationarity, bounded noise) that may not hold in real-world recommender systems.
- Analysis assumes fixed action sets and specified algorithmic conditions; real platforms have non-stationary item pools, evolving content, and complex feedback loops.
- Experimental validation is limited to MovieLens-100k and simulations, which may not represent scale, heterogeneity, or strategic behavior in commercial platforms.
- Transferable-utility cooperative game formulation abstracts away many institutional constraints (budget, legal, platform policy) that affect real payout mechanisms.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
User interactions in online recommendation platforms create interdependencies among content creators: feedback on one creator's content influences the system's learning and, in turn, the exposure of other creators' content.	Other	mixed	high	interdependencies in content exposure induced by user feedback	0.12
Collaboration among content creators can be modeled as a multi-agent stochastic linear bandit problem with a transferable utility (TU) cooperative game formulation, where a coalition's value equals the negative sum of its members' cumulative regrets.	Other	null_result	high	coalition value defined as negative sum of members' cumulative regrets	0.12
For identical (homogeneous) agents with fixed action sets, the induced TU game is convex under mild algorithmic conditions.	Organizational Efficiency	positive	high	convexity of the induced transferable-utility cooperative game	0.2
Convexity (in the homogeneous-agent case) implies a non-empty core that contains the Shapley value and ensures both stability and fairness of payout allocations.	Organizational Efficiency	positive	high	non-emptiness of the core; membership of the Shapley value in the core; stability/fairness of allocations	0.2
For heterogeneous agents the cooperative game still admits a non-empty core, though convexity and Shapley value core-membership are no longer guaranteed.	Organizational Efficiency	mixed	high	core non-emptiness; lack of guaranteed convexity and Shapley membership	0.2
A simple regret-based payout rule is proposed that satisfies three out of the four Shapley axioms and also lies in the core.	Organizational Efficiency	positive	high	axiomatic compliance (3/4 Shapley axioms) and core-membership of the payout rule	0.12
Experiments on the MovieLens-100k dataset illustrate when the empirical payout aligns with, and diverges from, Shapley fairness across different settings and algorithms.	Organizational Efficiency	mixed	high	alignment/divergence between empirical payouts and Shapley-value fairness	n=100000 0.12