Teaching language models when and how to show emotion improves their bargaining payoffs in simulation. The EmoDistill pipeline — an IQL-based emotion selector plus LoRA-tuned expression policies — outperforms vanilla models and ablated variants across four negotiation domains, though results stem from agent-only simulations and await human validation.
Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns which emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns how to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.
Summary
Main Finding
EmoDistill shows that emotion is a strategic action channel in adversarial LLM-to-LLM negotiation, not just a surface style. Using an offline pipeline that (1) learns which emotion to express (IQL selector) and (2) how to express it (LoRA-SFT + Judge Policy Optimization), a distilled 7B student model outperforms stronger vanilla LLM/SLM baselines and IQL-only emotion-selection agents across four high-stakes negotiation domains. The method learns from offline LLM-vs-LLM rollouts (no online rollouts required) and generalizes across domains and unseen counterparties.
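The selection step can be illustrated with a minimal sketch, assuming the selector takes the form π(e|s) ∝ exp(β·(Q−V)) given under Data & Methods. The function names, the toy Q and V values, and the default β are hypothetical; the learned IQL networks are not reproduced here, and plain NumPy arrays stand in for their outputs.

```python
import numpy as np

def selector_probs(q_values, v_value, beta=1.0):
    """Softmax over beta * (Q(s, e) - V(s)) for one dialogue state s.

    q_values: one Q estimate per candidate emotion (hypothetical inputs).
    v_value:  scalar state-value estimate V(s).
    """
    logits = beta * (np.asarray(q_values, dtype=float) - v_value)
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_emotion(labels, q_values, v_value, beta=1.0, rng=None):
    """Draw one emotion label from the selector distribution."""
    if rng is None:
        rng = np.random.default_rng()
    probs = selector_probs(q_values, v_value, beta)
    return labels[rng.choice(len(labels), p=probs)]

if __name__ == "__main__":
    # Three labels taken from the GoEmotions vocabulary; Q and V are made up.
    labels = ["neutral", "anger", "gratitude"]
    print(sample_emotion(labels, [0.2, 0.9, 0.4], v_value=0.5, beta=3.0))
```

A larger β concentrates probability on the emotion with the highest estimated advantage, while β near zero approaches uniform sampling over the label set.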
Key Points
- Emotions systematically shift negotiation outcomes. Single-emotion prompts (GoEmotions labels) change per-turn judge rewards versus neutral prompts, motivating treating emotion as an action.
- EmoDistill decomposes emotional strategy into:
  - Emotion selection: Implicit Q-Learning (IQL) trained on trajectory returns (objective reward R(τ)).
  - Emotional expression: LoRA-adapted SLM initialized by supervised fine-tuning (SFT) on high-quality demonstrations, then refined with Judge Policy Optimization (JPO) that uses per-turn LLM-judge advantages.
- Two complementary training signals:
  - Per-turn subjective judge score (Qwen3.5-Plus), normalized within each scenario to an advantage A_t.
  - Outcome-shaped objective return R(τ), which rewards moves that close the bargaining gap and adds a terminal agreement/breakdown bonus.
- Stage-wise use of signals:
  - IQL uses the objective R(τ) (Bellman-propagated) to learn emotion sequences that actually alter bargaining outcomes.
  - LoRA-SFT uses a hybrid filter q_hyb = r_t + 0.5 R(τ) (top 25% retained) to produce clean demonstrations.
  - JPO uses scenario-normalized turn-level advantages (A_t) with an asymmetric weighting to refine utterance-level policy via a clipped surrogate + KL anchor.
- Empirical results (held-out tests on 4 domains: CRAD, Disaster Rescue, Hospital Scheduling, Student Sleep Scheduling):
  - Full pipeline (IQL+SFT+JPO) achieves highest utility in most settings (example: CRAD utility 72.2 vs vanilla LLM 5.0 and vanilla SLM 8.8).
  - Ablations show both components matter: IQL-only helps, but combining selection + expression refinement yields larger gains.
  - Signal ablation: turn-level judge advantages are most effective for JPO; hybrid filters help SFT.
  - Emotion-conditioning is important: emotion-free adapters perform worse in utility and robustness; SFT without emotion conditioning attains some success but lower overall utility than emotion-conditional models.
- Method is offline: reuses costly LLM-vs-LLM rollouts, avoiding expensive and unstable online RL (e.g., PPO) in multi-turn negotiation.
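The two training signals listed above can be sketched as follows, assuming A_t is a per-scenario z-score of the judge score and that the hybrid filter keeps the top 25% of turns ranked by q_hyb = r_t + 0.5 R(τ). Whether the filter ranks turns globally or per scenario, and whether it uses raw or normalized r_t, is not stated in this summary; the sketch assumes a global ranking on raw scores. All function names and the small epsilon guard are hypothetical.

```python
import numpy as np

def scenario_advantages(rewards, scenario_ids, eps=1e-8):
    """A_t = (r_t - mu_scen) / sigma_scen, computed separately per scenario."""
    rewards = np.asarray(rewards, dtype=float)
    scenario_ids = np.asarray(scenario_ids)
    adv = np.zeros_like(rewards)
    for sid in np.unique(scenario_ids):
        mask = scenario_ids == sid
        mu = rewards[mask].mean()
        sigma = rewards[mask].std()
        adv[mask] = (rewards[mask] - mu) / (sigma + eps)  # eps guards sigma == 0
    return adv

def hybrid_filter(turn_rewards, traj_returns, keep_frac=0.25, lam=0.5):
    """Indices of the top keep_frac turns by q_hyb = r_t + lam * R(tau).

    traj_returns holds, for each turn, the return of the trajectory that
    turn belongs to (assumption: already broadcast to turn level).
    """
    q_hyb = np.asarray(turn_rewards, dtype=float) + lam * np.asarray(traj_returns, dtype=float)
    k = max(1, int(np.ceil(keep_frac * len(q_hyb))))
    top = np.argsort(-q_hyb, kind="stable")[:k]  # highest q_hyb first
    return np.sort(top)  # return in original turn order

if __name__ == "__main__":
    adv = scenario_advantages([1, 2, 3, 10, 20, 30], [0, 0, 0, 1, 1, 1])
    print(adv)
    print(hybrid_filter([0.1, 0.9, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4], [0.0] * 8))
```

Normalizing within a scenario keeps an easy scenario with uniformly high judge scores from dominating a hard one, which is presumably why the advantage is defined per scenario rather than over the whole dataset.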
Data & Methods
- Domains: 4 negotiation domains from EmoMAS — Credit Recovery (CRAD), Disaster Rescue, Hospital Surgery Scheduling, Student Sleep Scheduling. Each domain: 100 scenarios (80 training, 20 held-out).
- Offline sweep: for each of 80 training scenarios, M = 100 random rollouts sampling emotions from a 28-label GoEmotions vocabulary → 8,000 trajectories per domain.
- Trajectory format per focal-agent turn z_t = (s_t, e_t, u_t, r_t, s_{t+1}):
  - s_t: dialogue state; e_t: selected emotion label; u_t: focal utterance; r_t: per-turn judge score; s_{t+1}: next state.
- Judges and models:
  - LLM sweep and judge: Qwen3.5-Plus used both to generate rollouts and to score each focal turn via a rubric that rewards anchoring, concrete proposals, leverage, and penalizes vagueness/capitulation.
  - Student SLM: Qwen2.5-7B-Instruct with LoRA adapters.
- Reward design:
  - Per-turn subjective reward r_t, normalized to advantage A_t = (r_t − μ_scen)/σ_scen.
  - Outcome-shaped return R(τ) = Σ_t w(t) [Δ_ctp,t − Δ_foc,t] + R_term(τ), where Δ_ctp,t and Δ_foc,t measure counterparty and focal concessions at turn t, w(t) is a time-decay weight, and R_term gives +2 for agreement and −2 for breakdown.
- Training pipeline:
  - IQL selector: trains Q(s,e) and V(s) on the offline dataset D using outcome-shaped returns; selector π_ϕ(e|s) ∝ exp(β·(Q−V)).
  - LoRA-SFT: select the top 25% of turns by q_hyb = r_t + 0.5 R(τ); train the adapter with token-level cross-entropy to generate u_t conditioned on (s_t, e_t).
  - JPO: freeze the SFT adapter as π_ref, then perform offline clipped-surrogate optimization with importance ratios ρ_t and an asymmetric advantage Ã_t = A_t if A_t > 0 else κ·A_t (κ ∈ [0,1]), plus Kullback–Leibler (K3) anchoring to π_ref.
- Evaluation metrics: Success rate, Outcomes (average normalized savings for successful episodes), Utility (average including 0 for failures), negotiation rounds. Held-out evaluation against vanilla LLM counterparties and ablations.
- Code: the authors provide a repository (linked from the paper).
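The JPO step in the training pipeline can be sketched as a per-batch loss, under several assumptions that this summary does not confirm: the importance ratio ρ_t is taken against the frozen reference policy π_ref, log-probabilities are aggregated per turn, and "K3" is read as the k3 estimator of the KL divergence, (r − 1) − log r with r = π_ref/π_θ. The function name and the defaults for κ, the clip range, and the KL coefficient are hypothetical, and NumPy is used only to show the arithmetic; a real implementation would compute this in an autodiff framework.

```python
import numpy as np

def jpo_loss(logp_new, logp_ref, adv, kappa=0.5, clip_eps=0.2, kl_coef=0.1):
    """Clipped surrogate with asymmetric advantages plus a k3 KL anchor.

    logp_new: per-turn log-prob of the utterance under the policy being trained.
    logp_ref: per-turn log-prob under the frozen SFT reference policy.
    adv:      scenario-normalized turn-level advantages A_t.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    adv = np.asarray(adv, dtype=float)

    # Asymmetric advantage: negative advantages are down-weighted by kappa.
    adv_tilde = np.where(adv > 0, adv, kappa * adv)

    # Importance ratio rho_t (assumed here to be pi_theta / pi_ref).
    rho = np.exp(logp_new - logp_ref)
    unclipped = rho * adv_tilde
    clipped = np.clip(rho, 1.0 - clip_eps, 1.0 + clip_eps) * adv_tilde
    surrogate = np.minimum(unclipped, clipped)

    # k3 estimator: (r - 1) - log r with r = pi_ref / pi_theta; always >= 0.
    log_r = logp_ref - logp_new
    k3 = np.exp(log_r) - 1.0 - log_r

    # Minimize the negative surrogate plus the KL penalty.
    return float(np.mean(-surrogate + kl_coef * k3))

if __name__ == "__main__":
    print(jpo_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0]))
```

When the trained policy equals the reference, the ratio is 1 and the KL term vanishes, so the loss reduces to the negative mean of the asymmetric advantages. With κ < 1 the update leans more on reinforcing well-judged turns than on suppressing poorly judged ones.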
Implications for AI Economics
- Emotion as an actionable instrument changes bargaining power models:
  - Traditional mechanism design and bargaining theory typically model actions as prices, offers, or information revelation. This work shows affective signaling (expressed emotion) should be modeled as a deliberate strategic action channel that influences opponent behavior and payoffs.
- Strategic externalities and market manipulation risk:
  - Emotion-conditioned agents can be optimized to extract more favorable terms by leveraging affect; deployed at scale (customer service, debt collection, automated negotiation platforms), this could produce systematic transfers and potentially exploitative outcomes. Regulators and platform designers should consider affective signaling as a manipulable lever with welfare implications.
- Alignment trade-offs vs robustness:
  - Alignment methods that make models accommodating/polite can be exploited in adversarial settings. AI economics must weigh user-aligned behavior against vulnerability to emotional steering, especially when no human is in the loop.
- Auditing and transparency:
  - Because emotion labeling and expression materially change outcomes, audits of deployed negotiating agents should include tests across emotional prompts and agent behaviors. Platforms may need to require disclosure or constraints on affective strategies (e.g., limiting intentional emotional manipulation).
- Mechanism and market design:
  - Designers of automated marketplaces and negotiation protocols might need to (a) treat emotion channels as part of the mechanism (e.g., disallow or standardize affective messaging), (b) incorporate robustness criteria in agent certification, or (c) design counterparty agents that are emotion-robust or emotion-aware.
- Cost and efficiency considerations:
  - Offline distillation (EmoDistill) demonstrates an economical training path: reusing offline LLM-vs-LLM rollouts and judge signals reduces the need for costly online RL. From an economic standpoint, this lowers training cost for specialized negotiation agents while still enabling powerful, strategic behaviors, which increases both accessibility and the potential for misuse.
- Research and policy directions:
  - Incorporate human-in-the-loop evaluation of welfare impacts of emotion-optimized agents.
  - Develop defenses and regulation (e.g., constraints on emotional manipulation, mandatory disclosure).
  - Extend economic models of bargaining to include affective action spaces and analyze equilibrium consequences (efficiency, surplus distribution, welfare).
  - Consider incentive-compatible audits and certification schemes to ensure fairness and prevent exploitative deployment.
Summary takeaway: EmoDistill provides a practical, offline method to teach LLM agents both which emotions to use and how to express them as effective bargaining moves. For AI economics, this means affective signaling is a controllable strategic instrument with measurable effects on bargaining outcomes, raising important concerns and opportunities for mechanism design, regulation, and cost-effective agent development.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. [Decision Quality] | negative | high | negotiation outcomes / agent utility (shifted toward counterparty interests) | 0.12 |
| We introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents. [Other] | positive | high | method/framework existence and capability to distill emotional negotiation skills | 0.02 |
| EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns which emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns how to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). [Other] | positive | high | ability to select and express emotion (method decomposition) | 0.06 |
| Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. [Decision Quality] | positive | high | utility (negotiation reward/outcome) | 0.12 |
| Ablations show that emotion conditioning is essential. [Decision Quality] | positive | high | performance/utility difference when emotion conditioning is removed | 0.12 |
| Transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. [Decision Quality] | positive | high | generalization of policy performance (utility) across domains and opponents | 0.12 |
| EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training. [Training Effectiveness] | positive | high | training approach (offline learning) and its cost-avoidance benefit | 0.06 |
| Emotion is a strategic action channel rather than a surface style. [Decision Quality] | mixed | high | role of emotion in strategy (impact on negotiation outcomes) | 0.12 |