A Nash-bargaining guided multi-agent RL system raises simulated peer-to-peer EV energy-trading welfare by roughly 62% and trading volume by 63% versus a double-auction baseline, while producing markedly fairer allocations; results are promising but confined to synthetic simulations without real-world validation.

Incentive-Aligned Vehicle-to-Vehicle Energy Trading via Nash-Integrated Multi-Agent Reinforcement Learning

Yujin Lin, Yue Yang, Hao Wang · May 21, 2026

Source: arXiv · Paper type: descriptive · Evidence: medium · Relevance: 7/10

Nash-MADDPG, which embeds Nash Bargaining into multi-agent deep RL, substantially increases simulated V2V trading social welfare (≈61.6%), trading volume (≈62.9%), and fairness (Jain index +40.1%) relative to a Double Auction baseline over a 30-day continuous simulation.

Vehicle-to-vehicle (V2V) energy trading enables decentralized peer-to-peer energy exchange among electric vehicles (EVs), reducing grid dependency while monetizing surplus capacity. However, coordinating self-interested EV agents with diverse charging needs and uncertain arrival-departure schedules remains challenging. Existing approaches either require centralized optimization with computational limitations or lack fairness guarantees. This paper integrates the Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient (Nash-MADDPG) for incentive-aligned V2V energy trading. Nash bargaining determines efficient bilateral pricing, while Nash-guided price proximity rewards align agent learning toward bargaining-optimal strategies. Evaluation over 30-day continuous operation demonstrates a 61.6% improvement in social welfare and a 62.9% improvement in trading volume over Double Auction, along with superior fairness, including a 40.1% improvement in Jain's index. Testing across 6-100 agents over a 30-day horizon with continuous vehicle turnover confirms scalability across population size and empirically stable pricing near the Nash Bargaining benchmark.

Summary

Main Finding

Integrating the Nash Bargaining Solution (NBS) as both a market-clearing rule and a training-time reward signal within a multi-agent RL framework (Nash-MADDPG) yields substantially higher social welfare, trading volume, fairness, match rates, and training stability for vehicle-to-vehicle (V2V) energy trading under continuous agent turnover than auction or pure-learning baselines. Empirically: ≈61.6% higher social welfare, ≈62.9% higher traded volume, and ≈40.1% improvement in Jain’s fairness index versus a double-auction baseline across population sizes (6–100 agents) in a 30-day simulated horizon.

Key Points

Problem: decentralized V2V energy trading with self-interested EVs, private valuations, and continuous arrival/departure — challenging for fairness, efficiency, and scalability.
Core idea: bi-level design
- Upper level: Nash Bargaining market clearing
  - Bilateral bargaining price = midpoint (p* = 0.5·(buyer bid + seller ask)).
  - Quantities allocated by maximizing log-transformed Nash social welfare (concave formulation solved by SLSQP) subject to capacity and individual-rationality constraints.
- Lower level: Multi-agent RL (MADDPG, CTDE) with Nash-guided rewards
  - Reward = executed trade utility + counterfactual credit-assignment (COMA-like approximation) + a training-only quadratic price-proximity penalty that biases submitted prices toward Nash prices.
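The upper-level clearing step can be illustrated with a minimal sketch. It assumes a simplified setup that the summary does not specify: one seller with a fixed capacity facing several buyers, utilities linear in traded quantity, and every bid above the ask. The function name `nash_clear` and all numeric inputs are hypothetical; only the midpoint price rule, the log-transformed Nash welfare objective, and the use of SLSQP come from the description above.

```python
import numpy as np
from scipy.optimize import minimize


def nash_clear(bids, ask, demands, capacity, eps=1e-9):
    """Clear one seller against several buyers (illustrative setup, not the paper's code)."""
    bids = np.asarray(bids, dtype=float)
    demands = np.asarray(demands, dtype=float)
    if np.any(bids <= ask):
        raise ValueError("every bid must exceed the ask (individual rationality)")

    # Bilateral bargaining price: midpoint of buyer bid and seller ask.
    prices = 0.5 * (bids + ask)
    buyer_margin = bids - prices    # buyer surplus per kWh
    seller_margin = prices - ask    # seller surplus per kWh

    def neg_log_nash_welfare(q):
        buyer_utils = buyer_margin * q
        seller_util = float(np.dot(seller_margin, q))
        # Sum of log utilities equals the log of the Nash product; eps avoids log(0).
        return -(np.sum(np.log(buyer_utils + eps)) + np.log(seller_util + eps))

    # Start strictly inside the feasible region.
    q0 = 0.5 * np.minimum(demands, capacity / len(bids))
    result = minimize(
        neg_log_nash_welfare,
        q0,
        method="SLSQP",
        bounds=[(0.0, d) for d in demands],  # per-buyer demand caps
        constraints=[{"type": "ineq", "fun": lambda q: capacity - np.sum(q)}],
    )
    return prices, result.x


prices, quantities = nash_clear(
    bids=[0.30, 0.26], ask=0.18, demands=[10.0, 10.0], capacity=12.0
)
```

In this toy instance the seller's capacity is scarce, so the log objective splits it between the two buyers rather than giving everything to the highest bidder, which is the fairness effect attributed to Nash welfare maximization.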
Guarantees and trade-offs
- NBS provides Pareto-efficiency and individual rationality when bids/asks reflect true valuations; full incentive compatibility cannot be achieved together with individual rationality, budget balance, and ex-post efficiency (Myerson–Satterthwaite), so the method trades strict IC for efficiency plus fairness guidance.
- Price-proximity shaping steers learned policies toward value-consistent bidding, reducing the gap to the NBS guarantees empirically.
Architecture and learning
- Shared actor and critic networks (role encoded) for generalization across population sizes; CTDE (centralized critic during training, decentralized actors at execution).
- Training uses Ornstein–Uhlenbeck exploration, target networks, soft updates; actors output price and quantity (bounded by role-based masks).
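The training ingredients named above (Ornstein–Uhlenbeck exploration, soft target updates, bounded actions) can be sketched with the standard library only. This is a generic illustration of those techniques, not the authors' implementation: the values of `theta`, `sigma`, and `tau` are assumptions, and parameters are represented as plain lists of floats instead of network weights.

```python
import math
import random


class OUNoise:
    """Temporally correlated exploration noise (discretized OU process)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=0):
        # theta, sigma, mu, dt are assumed values; the summary does not report them.
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.rng = random.Random(seed)
        self.state = mu

    def sample(self):
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.state += drift + diffusion
        return self.state


def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]


def bounded_action(raw, noise, low, high):
    """Add exploration noise, then clip to the role-dependent range [low, high]."""
    return min(high, max(low, raw + noise))


noise = OUNoise(seed=1)
noisy_price = bounded_action(raw=0.20, noise=noise.sample(), low=0.0, high=0.30)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```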
Empirical outcomes
- Metrics improved: social welfare, traded volume, Gini (lower), Jain’s index (higher), match rate (0.92 vs 0.78 double auction).
- Stable training convergence and lower variability across population sizes (coefficient of variation of social welfare ≈17.6% vs ≈45.6% for double auction).
- Clearing prices empirically clustered near Nash benchmark (mean ≈0.208 AUD/kWh, std 0.031), about 25% below grid price.

Data & Methods

Environment
- Simulated parking-facility V2V market with 30-min timesteps; horizon T_max of 1 day (16 steps) or 30 days (480 steps).
- EV model: Tesla Model 3 parameters (75 kWh), initial SoC and target SoC drawn from uniform distributions; arrival process Poisson (λ calibrated to target N); parking durations U(4,12) timesteps.
- Roles: buyer if below target, seller if above target + buffer, neutral otherwise. Urgency and valuations modeled (urgency increases as departure approaches; buyer willingness-to-pay and seller reservation price include battery, grid arbitrage, degradation terms).
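The role rule and the urgency idea can be sketched as follows. The role logic follows the description above; the buffer value of 0.1, the linear urgency ramp, and the reading of U(4,12) as integer timesteps are all assumptions made for illustration, since the summary does not give those details.

```python
import random


def assign_role(soc, target_soc, buffer=0.1):
    """Buyer below target, seller above target + buffer, neutral otherwise."""
    if soc < target_soc:
        return "buyer"
    if soc > target_soc + buffer:  # buffer size is a hypothetical value
        return "seller"
    return "neutral"


def urgency(t, arrival, departure):
    """Assumed linear ramp from 0 at arrival to 1 at departure."""
    if departure <= arrival:
        return 1.0
    return min(1.0, max(0.0, (t - arrival) / (departure - arrival)))


def sample_parking_duration(rng):
    """Parking duration in timesteps, assuming U(4,12) means integers 4..12."""
    return rng.randint(4, 12)


rng = random.Random(0)
arrival = 3
departure = arrival + sample_parking_duration(rng)
role = assign_role(soc=0.40, target_soc=0.80)
```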
Algorithm: Nash-MADDPG
- MADDPG with centralized critics Q_i(s, a_1, ..., a_N) and decentralized actors μ_i(s_i).
- Reward components:
  - r_base = realized utility from Nash-cleared trades.
  - δ (credit assignment) = r_collective(a) − r_collective(a_{−i}, default_action), COMA-like O(N) approximation; r_collective is weighted sum of social welfare, fairness (Jain), and match rate.
  - r_price (training only) = −κ · |p_i − p_Nash|^2 for matched agents; zero if unmatched.
  - Penalty for negative utility to enforce individual rationality.
- Optimization details: log-Nash-welfare optimization for quantities (convex after log transform), SLSQP solver per timestep for clearing; training for 2000 episodes, γ=0.95, replay buffer 1e5, batch size 256, learning rates actor=1e-4, critic=1e-3.
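A minimal sketch of how the reward components listed above might be combined. The summary does not state how the terms are weighted or how large the individual-rationality penalty is, so the unweighted sum and the `ir_penalty` value are assumptions; `collective_fn` stands in for r_collective and is passed in by the caller.

```python
def price_proximity(price, nash_price, kappa, matched):
    """r_price = -kappa * |p_i - p_Nash|^2 for matched agents, zero otherwise."""
    return -kappa * (price - nash_price) ** 2 if matched else 0.0


def counterfactual_credit(collective_fn, actions, i, default_action):
    """delta_i = r_collective(a) - r_collective(a with agent i's action replaced)."""
    counterfactual = list(actions)
    counterfactual[i] = default_action
    return collective_fn(actions) - collective_fn(counterfactual)


def shaped_reward(r_base, delta, r_price, training, ir_penalty=1.0):
    """Assumed unweighted sum; r_price is applied only during training."""
    reward = r_base + delta
    if training:
        reward += r_price
    if r_base < 0.0:  # negative realized utility violates individual rationality
        reward -= ir_penalty  # hypothetical penalty magnitude
    return reward


r_price = price_proximity(price=0.25, nash_price=0.20, kappa=10.0, matched=True)
delta = counterfactual_credit(sum, [1.0, 2.0, 3.0], i=1, default_action=0.0)
train_reward = shaped_reward(r_base=0.5, delta=delta, r_price=r_price, training=True)
eval_reward = shaped_reward(r_base=0.5, delta=delta, r_price=r_price, training=False)
```

One call to `collective_fn` per agent on top of the shared baseline evaluation is what keeps the counterfactual term at O(N) evaluations per step.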
Baselines
- Learning Only: MADDPG without Nash clearing or fairness-shaped rewards.
- Greedy Average: midpoint pricing but greedy quantity allocation (no Nash welfare maximization).
- Double Auction: sorted bids/asks, marginal midpoint clearing price.
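For comparison, the Double Auction baseline can be sketched as below. This reads "marginal midpoint clearing price" as the midpoint of the last bid/ask pair that can still trade after sorting, which is one common convention and an assumption about the paper's exact rule; the function name and example prices are hypothetical.

```python
def double_auction(bids, asks):
    """Return (number of matched pairs, uniform clearing price or None)."""
    sorted_bids = sorted(bids, reverse=True)  # highest willingness to pay first
    sorted_asks = sorted(asks)                # lowest reservation price first
    k = 0
    # Advance while the k-th highest bid still covers the k-th lowest ask.
    while k < min(len(sorted_bids), len(sorted_asks)) and sorted_bids[k] >= sorted_asks[k]:
        k += 1
    if k == 0:
        return 0, None
    # Midpoint of the marginal (last feasible) pair sets the price for all trades.
    price = 0.5 * (sorted_bids[k - 1] + sorted_asks[k - 1])
    return k, price


matched_pairs, clearing_price = double_auction(
    bids=[0.20, 0.30, 0.25], asks=[0.28, 0.18, 0.22]
)
```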
Evaluation
- Testing across agent counts N ∈ {6,10,15,20,30,50,75,100} without retraining.
- Primary metrics: Social Welfare (AUD), traded volume (kWh), Gini index, Jain’s fairness index, match rate (fraction of active participants matched).
- Aggregation: medians and IQR across 3 seeds × 8 population sizes; time-series and long-horizon (30-day) analyses for turnover robustness.
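The fairness and matching metrics have standard closed forms, sketched here on per-agent utility lists. The conventions for the all-zero case are assumptions, and the example inputs are made up for illustration.

```python
def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly equal."""
    n = len(values)
    sum_sq = sum(v * v for v in values)
    if n == 0 or sum_sq == 0.0:
        return 1.0  # assumed convention when nobody receives anything
    return sum(values) ** 2 / (n * sum_sq)


def gini_index(values):
    """Gini coefficient for non-negative values; 0.0 means perfectly equal."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0.0:
        return 0.0  # assumed convention for an empty or all-zero allocation
    ranked = sorted(values)
    weighted = sum((i + 1) * v for i, v in enumerate(ranked))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def match_rate(matched, active):
    """Fraction of active participants that were matched."""
    return matched / active if active else 0.0


equal = [1.0, 1.0, 1.0, 1.0]
skewed = [0.0, 0.0, 0.0, 1.0]
```

With these definitions a lower Gini and a higher Jain index both indicate a more even split, matching the direction of improvement reported above.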

Implications for AI Economics

Demonstrates a productive hybrid of normative mechanism design and learning:
- Embedding a game-theoretic solution (NBS) into MARL (both as clearing rule and reward shaping) meaningfully improves collective outcomes and stabilizes learning, illustrating that mechanism-aware RL can help overcome inefficient emergent equilibria in decentralized markets.
Practical template for decentralized P2P markets:
- The two-level approach — explicit, axiomatic clearing combined with incentive-aligned learning — is applicable to other bilateral exchange domains (spectrum sharing, compute/resource markets, P2P energy between households).
Trade-offs and economic constraints:
- The method accepts the Myerson–Satterthwaite impossibility: it cannot achieve full incentive compatibility while retaining budget-balance and ex-post efficiency. Instead, it reduces strategic misreporting empirically via price-proximity shaping — a pragmatic compromise in dynamic, private-valuation settings.
Policy and deployment considerations:
- Improved fairness and match rates imply higher voluntary participation—important for adoption of P2P energy platforms—yet deployment must address transaction costs, regulatory constraints, settlement trust, and potential strategic manipulation not fully ruled out by the learning penalty.
Research directions for AI economics
- Analyze incentive-compatibility gaps formally: bound the residual gains from strategic misreporting under learned policies.
- Replace coordinator-centric clearing (SLSQP per timestep) with decentralized/approximate solvers to reduce computational/centralization costs while retaining welfare guarantees.
- Extend to heterogeneous bargaining power, multi-attribute trades (time-flexible energy, V2G), and integration with wholesale/grid prices and network constraints.
- Study robustness to adversarial or colluding agents, and to model mismatch when simulated valuation models diverge from real user behavior.
Broader message: mechanism features can and should be encoded into ML training objectives when designing decentralized markets — combining axiomatic economic solutions with adaptive learning produces more efficient, fair, and stable market outcomes than either approach alone.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Results are based on simulation experiments (30-day continuous operation, 6–100 agents) that show large relative gains versus a Double Auction baseline and report multiple metrics (social welfare, trading volume, fairness). However, evidence is limited to synthetic environments, a single main baseline, and no real-world deployment or out-of-sample validation, so external validity and causal claims about real markets remain untested.
Methods Rigor: medium. The paper integrates Nash Bargaining into MADDPG, proposes a Nash-guided reward shaping, and evaluates performance over a reasonably long simulated horizon and across different population sizes; it reports several quantitative metrics and fairness measures. Missing or unclear elements reduce rigor: limited description of environment parameterization, reliance on a single main baseline (Double Auction) rather than multiple competitive mechanisms, no statistical uncertainty reporting or sensitivity analyses across key assumptions (arrival/departure stochasticity, pricing frictions, communication failures), and no real-world or field validation.
Sample: Simulation of V2V energy trading with heterogeneous EV agents having diverse charging needs and stochastic arrival-departure schedules; evaluated over a continuous 30-day horizon with vehicle turnover, across populations from 6 to 100 agents; main comparisons against a Double Auction baseline; reported outcomes include social welfare, trading volume, and Jain's fairness index.
Themes: innovation, governance, adoption
Generalizability:
- Simulation-only evaluation; results may not transfer to physical EVs, real grid constraints, or market frictions.
- Synthetic agent behavior and demand profiles may not reflect real driver preferences or strategic manipulation.
- Only one primary baseline (Double Auction); robustness to alternative market designs untested.
- Scalability beyond 100 agents and performance under network/communication failures or adversarial agents unknown.
- Regulatory, transaction-cost, and infrastructure constraints (metering, settlement latency) not modeled.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Nash-MADDPG improves social welfare by 61.6% over Double Auction in evaluation over 30-day continuous operation.	Consumer Welfare	positive	high	social welfare	n=30 61.6% improvement 0.18
Nash-MADDPG yields a 62.9% improvement in trading volume over Double Auction.	Market Structure	positive	high	trading volume	n=30 62.9% improvement 0.18
Nash-MADDPG achieves superior fairness, showing a 40.1% improvement in Jain's index.	Inequality	positive	high	fairness (Jain's index)	n=30 40.1% improvement 0.18
The paper integrates Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient, creating Nash-MADDPG, where Nash bargaining determines efficient bilateral pricing.	Market Structure	positive	high	bilateral pricing efficiency (algorithmic pricing)	0.03
Nash-guided price proximity rewards align agent learning toward bargaining-optimal strategies.	Decision Quality	positive	high	alignment of learned strategies to bargaining-optimal strategies	0.18
Testing across 6–100 agents over a 30-day horizon confirms scalability across population size.	Other	positive	high	scalability across population size (algorithm performance across agent counts)	n=30 0.18
Empirically stable pricing near the Nash Bargaining benchmark is observed in testing.	Market Structure	positive	high	pricing proximity/stability relative to Nash Bargaining benchmark	n=30 0.18