A Nash-bargaining guided multi-agent RL system raises simulated peer-to-peer EV energy-trading welfare by roughly 62% and trading volume by 63% versus a double-auction baseline, while producing markedly fairer allocations; results are promising but confined to synthetic simulations without real-world validation.
Vehicle-to-vehicle (V2V) energy trading enables decentralized peer-to-peer energy exchange among electric vehicles (EVs), reducing grid dependency while monetizing surplus capacity. However, coordinating self-interested EV agents with diverse charging needs and uncertain arrival-departure schedules remains challenging. Existing approaches either require centralized optimization with computational limitations or lack fairness guarantees. This paper integrates the Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient (MADDPG), yielding Nash-MADDPG, for incentive-aligned V2V energy trading. Nash bargaining determines efficient bilateral pricing, while Nash-guided price-proximity rewards align agent learning toward bargaining-optimal strategies. Evaluation over 30 days of continuous operation shows a 61.6% improvement in social welfare and a 62.9% improvement in trading volume over Double Auction, along with better fairness, including a 40.1% improvement in Jain's index. Testing with 6–100 agents over a 30-day horizon with continuous vehicle turnover confirms scalability across population size and empirically stable pricing near the Nash Bargaining benchmark.
Summary
Main Finding
Integrating the Nash Bargaining Solution (NBS) as both a market-clearing rule and a training-time reward signal within a multi-agent RL framework (Nash-MADDPG) yields substantially higher social welfare, trading volume, fairness, match rates, and training stability for vehicle-to-vehicle (V2V) energy trading under continuous agent turnover than auction or pure-learning baselines. Empirically: ≈61.6% higher social welfare, ≈62.9% higher traded volume, and ≈40.1% improvement in Jain’s fairness index versus a double-auction baseline across population sizes (6–100 agents) in a 30-day simulated horizon.
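For readers unfamiliar with the fairness metrics cited above, the following is a minimal sketch of Jain's index and the Gini coefficient using their standard textbook definitions. The summary does not state which per-agent quantity the paper feeds into them, so treating `x` as a list of non-negative per-agent utilities is an assumption, as are the function names and the conventions chosen for empty or all-zero inputs.

```python
from typing import Sequence


def jain_index(x: Sequence[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2). Equals 1.0 when all shares are equal."""
    n = len(x)
    total = sum(x)
    sum_sq = sum(v * v for v in x)
    if n == 0 or sum_sq == 0:
        return 1.0  # convention assumed here: empty or all-zero allocations count as equal
    return total * total / (n * sum_sq)


def gini_index(x: Sequence[float]) -> float:
    """Gini coefficient via the mean absolute difference. Equals 0.0 when all shares are equal."""
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0:
        return 0.0  # same convention as above
    abs_diff = sum(abs(a - b) for a in x for b in x)
    return abs_diff / (2 * n * total)


# Hypothetical per-agent utilities: one equal allocation and one fully concentrated one.
equal = [2.0, 2.0, 2.0, 2.0]
skewed = [0.0, 0.0, 0.0, 8.0]
jain_equal, jain_skewed = jain_index(equal), jain_index(skewed)
gini_equal, gini_skewed = gini_index(equal), gini_index(skewed)
```

With four agents, the concentrated allocation gives a Jain's index of 1/n = 0.25 and a Gini coefficient of 0.75, while the equal allocation gives 1.0 and 0.0. A higher Jain's index and a lower Gini coefficient both indicate a more even split.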
Key Points
- Problem: decentralized V2V energy trading with self-interested EVs, private valuations, and continuous arrival/departure, which makes fairness, efficiency, and scalability challenging.
- Core idea: bi-level design
  - Upper level: Nash Bargaining market clearing
    - Bilateral bargaining price = midpoint (p* = 0.5·(buyer bid + seller ask)).
    - Quantities allocated by maximizing log-transformed Nash social welfare (concave formulation solved by SLSQP) subject to capacity and individual-rationality constraints.
  - Lower level: Multi-agent RL (MADDPG, CTDE) with Nash-guided rewards
    - Reward = executed trade utility + counterfactual credit-assignment (COMA-like approximation) + a training-only quadratic price-proximity penalty that biases submitted prices toward Nash prices.
- Guarantees and trade-offs
  - NBS provides Pareto efficiency and individual rationality when bids/asks reflect true valuations. By the Myerson–Satterthwaite theorem, incentive compatibility cannot be achieved together with ex-post efficiency, individual rationality, and budget balance, so the method trades strict IC for efficiency plus fairness guidance.
  - Price-proximity shaping steers learned policies toward value-consistent bidding, empirically reducing the gap to the NBS guarantees.
- Architecture and learning
  - Shared actor and critic networks (role encoded) for generalization across population sizes; CTDE (centralized critic during training, decentralized actors at execution).
  - Training uses Ornstein–Uhlenbeck exploration, target networks, soft updates; actors output price and quantity (bounded by role-based masks).
- Empirical outcomes
  - Metrics improved: social welfare, traded volume, Gini (lower), Jain’s index (higher), match rate (0.92 vs 0.78 for double auction).
  - Stable training convergence and lower variability across population sizes (CV for SW ≈17.6% vs ≈45.6% for double auction).
  - Clearing prices empirically clustered near the Nash benchmark (mean ≈0.208 AUD/kWh, std 0.031), about 25% below grid price.
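The upper-level pricing rule in the Key Points can be illustrated with a small sketch. It assumes linear per-kWh utilities and a disagreement payoff of zero for both sides. Under those assumptions the Nash product of buyer and seller gains is maximized at the midpoint of bid and ask. The function names and the numeric bid and ask are hypothetical, and the paper's SLSQP-based quantity allocation is not reproduced here.

```python
from typing import Optional


def nash_price(bid: float, ask: float) -> Optional[float]:
    """Midpoint bargaining price, or None when bid < ask (no individually rational trade)."""
    if bid < ask:
        return None
    return 0.5 * (bid + ask)


def nash_product(price: float, bid: float, ask: float, qty: float = 1.0) -> float:
    """Product of buyer and seller gains, each measured against a zero disagreement payoff."""
    buyer_gain = (bid - price) * qty
    seller_gain = (price - ask) * qty
    return buyer_gain * seller_gain


def best_price_on_grid(bid: float, ask: float, steps: int = 1000) -> float:
    """Brute-force check: the price on a uniform grid over [ask, bid] with the largest Nash product."""
    grid = [ask + (bid - ask) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda p: nash_product(p, bid, ask))


# Hypothetical bid and ask in AUD/kWh.
bid, ask = 0.25, 0.17
p_star = nash_price(bid, ask)
p_grid = best_price_on_grid(bid, ask)
```

Here `p_star` and `p_grid` agree up to floating-point error, which is a quick numerical confirmation that the midpoint maximizes the Nash product in this simplified setting. When the bid is below the ask, `nash_price` returns `None`, reflecting the individual-rationality constraint.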
Data & Methods
- Environment
  - Simulated parking-facility V2V market, 30-min timesteps, T_max covering 1-day (16 steps) and 30-day (480 steps) horizons.
  - EV model: Tesla Model 3 parameters (75 kWh), initial SoC and target SoC drawn from uniform distributions; arrival process Poisson (λ calibrated to target N); parking durations U(4,12) timesteps.
  - Roles: buyer if below target, seller if above target + buffer, neutral otherwise. Urgency and valuations are modeled: urgency increases as departure approaches, and buyer willingness-to-pay and seller reservation price include battery, grid arbitrage, and degradation terms.
- Algorithm: Nash-MADDPG
  - MADDPG with centralized critics Q_i(s, a_1, ..., a_N) and decentralized actors μ_i(s_i).
  - Reward components:
    - r_base = realized utility from Nash-cleared trades.
    - δ (credit assignment) = r_collective(a) − r_collective(a_{−i}, default_action), a COMA-like O(N) approximation; r_collective is a weighted sum of social welfare, fairness (Jain), and match rate.
    - r_price (training only) = −κ · |p_i − p_Nash|^2 for matched agents; zero if unmatched.
    - Penalty for negative utility to enforce individual rationality.
  - Optimization details: log-Nash-welfare optimization for quantities (a concave maximization, hence a convex program, after the log transform), SLSQP solver per timestep for clearing; training for 2000 episodes, γ=0.95, replay buffer 1e5, batch size 256, learning rates actor=1e-4, critic=1e-3.
- Baselines
  - Learning Only: MADDPG without Nash clearing or fairness-shaped rewards.
  - Greedy Average: midpoint pricing but greedy quantity allocation (no Nash welfare maximization).
  - Double Auction: sorted bids/asks, marginal midpoint clearing price.
- Evaluation
  - Testing across agent counts N ∈ {6,10,15,20,30,50,75,100} without retraining.
  - Primary metrics: Social Welfare (AUD), traded volume (kWh), Gini index, Jain’s fairness index, match rate (fraction of active participants matched).
  - Aggregation: medians and IQR across 3 seeds × 8 population sizes; time-series and long-horizon (30-day) analyses for turnover robustness.
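The reward components listed under Algorithm can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the unweighted sum of components, the `(price, quantity)` action encoding, the value of `kappa`, the fixed `ir_penalty`, and the toy collective reward are all hypothetical. Only the functional forms of δ and r_price come from the summary.

```python
from typing import Callable, Dict, Optional, Tuple

Action = Tuple[float, float]  # (price, quantity): a hypothetical action encoding
JointAction = Dict[str, Action]


def counterfactual_credit(
    r_collective: Callable[[JointAction], float],
    joint_action: JointAction,
    agent: str,
    default_action: Action,
) -> float:
    """delta_i = r_collective(a) - r_collective(a with agent i's action set to a default)."""
    replaced = dict(joint_action)
    replaced[agent] = default_action
    return r_collective(joint_action) - r_collective(replaced)


def price_penalty(p_i: float, p_nash: Optional[float], kappa: float) -> float:
    """-kappa * (p_i - p_nash)^2 for matched agents; 0.0 when unmatched (p_nash is None)."""
    if p_nash is None:
        return 0.0
    return -kappa * (p_i - p_nash) ** 2


def shaped_reward(
    r_base: float,
    delta: float,
    r_price: float,
    ir_penalty: float = 1.0,  # hypothetical magnitude; the summary gives no value
    training: bool = True,
) -> float:
    """Sum the components; the price term is applied only during training."""
    reward = r_base + delta
    if training:
        reward += r_price
    if r_base < 0:  # negative realized utility violates individual rationality
        reward -= ir_penalty
    return reward


def toy_collective(joint: JointAction) -> float:
    """Stand-in for r_collective: total offered quantity (not the paper's weighted sum)."""
    return sum(qty for _, qty in joint.values())


joint = {"ev_a": (0.25, 3.0), "ev_b": (0.22, 1.0)}
delta_a = counterfactual_credit(toy_collective, joint, "ev_a", (0.0, 0.0))
r_price_a = price_penalty(0.25, 0.20, kappa=10.0)
reward_a = shaped_reward(1.0, delta_a, r_price_a)
```

Each agent's credit needs one extra evaluation of the collective reward with that agent's action swapped out, which is consistent with the O(N) cost noted above. Passing `training=False` drops the price-proximity term, matching its description as a training-only signal.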
Implications for AI Economics
- Demonstrates a productive hybrid of normative mechanism design and learning:
  - Embedding a game-theoretic solution (NBS) into MARL (both as clearing rule and reward shaping) meaningfully improves collective outcomes and stabilizes learning, illustrating that mechanism-aware RL can help overcome inefficient emergent equilibria in decentralized markets.
- Practical template for decentralized P2P markets:
  - The two-level approach, explicit axiomatic clearing combined with incentive-aligned learning, is applicable to other bilateral exchange domains such as spectrum sharing, compute/resource markets, and P2P energy between households.
- Trade-offs and economic constraints:
  - The method accepts the Myerson–Satterthwaite impossibility: it cannot achieve full incentive compatibility while retaining individual rationality, budget balance, and ex-post efficiency. Instead, it reduces strategic misreporting empirically via price-proximity shaping, a pragmatic compromise in dynamic, private-valuation settings.
- Policy and deployment considerations:
  - Improved fairness and match rates suggest higher voluntary participation, which matters for adoption of P2P energy platforms. Deployment must still address transaction costs, regulatory constraints, settlement trust, and potential strategic manipulation not fully ruled out by the learning penalty.
- Research directions for AI economics
  - Analyze incentive-compatibility gaps formally: bound the residual gains from strategic misreporting under learned policies.
  - Replace coordinator-centric clearing (SLSQP per timestep) with decentralized/approximate solvers to reduce computational/centralization costs while retaining welfare guarantees.
  - Extend to heterogeneous bargaining power, multi-attribute trades (time-flexible energy, V2G), and integration with wholesale/grid prices and network constraints.
  - Study robustness to adversarial or colluding agents, and to model mismatch when simulated valuation models diverge from real user behavior.
- Broader message: mechanism features can and should be encoded into ML training objectives when designing decentralized markets. In these simulations, combining axiomatic economic solutions with adaptive learning produces more efficient, fair, and stable market outcomes than either approach alone.
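The incentive-compatibility gap discussed above can be made concrete with a short worked example. It assumes a single buyer-seller pair, linear utility, a fixed matched quantity q, and a fixed seller ask a; the symbols v (the buyer's true value) and b (the submitted bid) are introduced here for illustration. In the actual mechanism, quantities also depend on bids through the Nash-welfare allocation, so this is only a first-order picture.

```latex
% Buyer utility under midpoint pricing p^* = (b + a)/2, for a bid b with a \le b:
u_b(b) = \left( v - \frac{b + a}{2} \right) q

% Gain from shading the bid below the true value, while the trade still clears:
u_b(b) - u_b(v) = \frac{v - b}{2}\, q > 0 \qquad \text{for } a \le b < v
```

Under these assumptions, bidding below the true value raises the buyer's utility as long as the bid stays at or above the ask, so truthful bidding is not a dominant strategy. The risk of losing the match when b falls below a is what limits shading, and the price-proximity penalty adds a further training-time pull toward the Nash price.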
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Nash-MADDPG improves social welfare by 61.6% over Double Auction in evaluation over 30-day continuous operation. [Consumer Welfare] | positive | high | social welfare | n=30; 61.6% improvement; 0.18 |
| Nash-MADDPG yields a 62.9% improvement in trading volume over Double Auction. [Market Structure] | positive | high | trading volume | n=30; 62.9% improvement; 0.18 |
| Nash-MADDPG achieves superior fairness, showing a 40.1% improvement in Jain's index. [Inequality] | positive | high | fairness (Jain's index) | n=30; 40.1% improvement; 0.18 |
| The paper integrates Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient, creating Nash-MADDPG, where Nash bargaining determines efficient bilateral pricing. [Market Structure] | positive | high | bilateral pricing efficiency (algorithmic pricing) | 0.03 |
| Nash-guided price proximity rewards align agent learning toward bargaining-optimal strategies. [Decision Quality] | positive | high | alignment of learned strategies to bargaining-optimal strategies | 0.18 |
| Testing across 6–100 agents over a 30-day horizon confirms scalability across population size. [Other] | positive | high | scalability across population size (algorithm performance across agent counts) | n=30; 0.18 |
| Empirically stable pricing near the Nash Bargaining benchmark is observed in testing. [Market Structure] | positive | high | pricing proximity/stability relative to Nash Bargaining benchmark | n=30; 0.18 |