Teaching language models when and how to show emotion improves their bargaining payoffs in simulation. The EmoDistill pipeline — an IQL-based emotion selector plus LoRA-tuned expression policies — outperforms vanilla models and ablated variants across four negotiation domains, though results stem from agent-only simulations and await human validation.

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

Yunbo Long, Haolang Zhao, Lukas Beckenbauer, Liming Xu, Alexandra Brintrup · May 26, 2026

Source: arXiv · Paper type: other · Evidence strength: medium · Relevance: 7/10

EmoDistill trains LLM agents to select and express emotions strategically, and doing so raises negotiated utility across four simulated high-stakes bargaining domains relative to untuned baselines and ablated variants.

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns which emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns how to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.

Summary

Main Finding

EmoDistill shows that emotion is a strategic action channel in adversarial LLM-to-LLM negotiation, not just a surface style. Using an offline pipeline that (1) learns which emotion to express (IQL selector) and (2) how to express it (LoRA-SFT + Judge Policy Optimization), a distilled 7B student model outperforms stronger vanilla LLM/SLM baselines and IQL-only emotion-selection agents across four high-stakes negotiation domains. The method learns from offline LLM-vs-LLM rollouts (no online rollouts required) and generalizes across domains and unseen counterparties.

Key Points

Emotions systematically shift negotiation outcomes. Single-emotion prompts (GoEmotions labels) change per-turn judge rewards versus neutral prompts, motivating treating emotion as an action.
EmoDistill decomposes emotional strategy into:
- Emotion selection: Implicit Q-Learning (IQL) trained on trajectory returns (objective reward R(τ)).
- Emotional expression: LoRA-adapted SLM initialized by supervised fine-tuning (SFT) on high-quality demonstrations, then refined with Judge Policy Optimization (JPO) that uses per-turn LLM-judge advantages.
Two complementary training signals:
- Per-turn subjective judge score r_t (Qwen3.5-Plus), normalized within scenario to an advantage A_t.
- Outcome-shaped objective return R(τ) that rewards moves that close the bargaining gap plus terminal agreement/breakdown bonus.
Stage-wise use of signals:
- IQL uses the objective R(τ) (Bellman-propagated) to learn emotion sequences that actually alter bargaining outcomes.
- LoRA-SFT uses a hybrid filter q_hyb = r_t + 0.5 R(τ) (top 25% retained) to produce clean demonstrations.
- JPO uses scenario-normalized turn-level advantages (A_t) with an asymmetric weighting to refine the utterance-level policy via a clipped surrogate plus a KL anchor.
Empirical results (held-out tests on 4 domains: CRAD, Disaster Rescue, Hospital Surgery Scheduling, Student Sleep Scheduling):
- Full pipeline (IQL+SFT+JPO) achieves highest utility in most settings (example: CRAD utility 72.2 vs vanilla LLM 5.0 and vanilla SLM 8.8).
- Ablations show both components matter: IQL-only helps, but combining selection + expression refinement yields larger gains.
- Signal ablation: turn-level judge advantages are most effective for JPO; hybrid filters help SFT.
- Emotion-conditioning is important: emotion-free adapters perform worse in utility and robustness; SFT without emotion conditioning attains some success but lower overall utility than emotion-conditional models.
Method is offline: reuses costly LLM-vs-LLM rollouts, avoiding expensive and unstable online RL (e.g., PPO) in multi-turn negotiation.

Data & Methods

Domains: 4 negotiation domains from EmoMAS — Credit Recovery (CRAD), Disaster Rescue, Hospital Surgery Scheduling, Student Sleep Scheduling. Each domain: 100 scenarios (80 training, 20 held-out).
Offline sweep: for each of 80 training scenarios, M = 100 random rollouts sampling emotions from a 28-label GoEmotions vocabulary → 8,000 trajectories per domain.
Trajectory format per focal-agent turn z_t = (s_t, e_t, u_t, r_t, s_{t+1}):
- s_t: dialogue state; e_t: selected emotion label; u_t: focal utterance; r_t: per-turn judge score; s_{t+1}: next state.
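The sweep size and the per-turn record above can be sketched in a few lines of Python. This is an illustration only: the `Turn` class, its field types, the helper names, and the four-label `EMOTIONS` subset are assumptions made here (the source uses a 28-label GoEmotions vocabulary and does not specify how states are stored).

```python
from dataclasses import dataclass
import random

# Hypothetical subset for illustration; the source samples from 28 GoEmotions labels.
EMOTIONS = ["anger", "gratitude", "optimism", "neutral"]


@dataclass(frozen=True)
class Turn:
    """One focal-agent turn z_t = (s_t, e_t, u_t, r_t, s_{t+1})."""
    state: str          # s_t: dialogue state (string form is an assumption)
    emotion: str        # e_t: selected emotion label
    utterance: str      # u_t: focal utterance
    judge_score: float  # r_t: per-turn judge score
    next_state: str     # s_{t+1}: next dialogue state


def sweep_size(n_scenarios: int = 80, rollouts_per_scenario: int = 100) -> int:
    """Number of trajectories in one domain's offline sweep."""
    return n_scenarios * rollouts_per_scenario


def sample_emotion(rng: random.Random) -> str:
    """Uniform random emotion, mirroring the random sweep described above."""
    return rng.choice(EMOTIONS)
```

With the defaults, `sweep_size()` reproduces the 8,000 trajectories per domain stated above (80 scenarios × 100 rollouts).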
Judges and models:
- LLM sweep and judge: Qwen3.5-Plus is used both to generate rollouts and to score each focal turn via a rubric that rewards anchoring, concrete proposals, and leverage, and penalizes vagueness and capitulation.
- Student SLM: Qwen2.5-7B-Instruct with LoRA adapters.
Reward design:
- Per-turn subjective reward r_t, normalized to advantage A_t = (r_t − μ_scen)/σ_scen.
- Outcome-shaped return R(τ) = Σ_t w(t) [Δ_ctp,t − Δ_foc,t] + R_term(τ), where Δ_ctp,t and Δ_foc,t measure the counterparty's and the focal agent's concessions at turn t, w(t) is a time-decay weight, and R_term gives +2 for agreement and −2 for breakdown.
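A minimal sketch of the two reward signals, under stated assumptions: the source does not give the form of w(t), so an exponential decay `decay ** t` is assumed here; the use of the population standard deviation and the small `eps` guard are also assumptions, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def scenario_advantages(scores, eps=1e-8):
    """A_t = (r_t - mu_scen) / sigma_scen over all judge scores from one scenario.

    Population std and the eps guard against zero variance are assumptions.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(r - mu) / (sigma + eps) for r in scores]


def outcome_return(ctp_concessions, foc_concessions, agreed, decay=0.9):
    """R(tau) = sum_t w(t) * (delta_ctp_t - delta_foc_t) + R_term.

    w(t) = decay ** t is an assumed time-decay form; R_term is +2 on
    agreement and -2 on breakdown, as stated in the text.
    """
    shaped = sum(
        (decay ** t) * (c - f)
        for t, (c, f) in enumerate(zip(ctp_concessions, foc_concessions))
    )
    return shaped + (2.0 if agreed else -2.0)
```

The return grows when the counterparty concedes more than the focal agent, and early turns count more than late ones whenever `decay < 1`.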
Training pipeline:
- IQL selector: trains Q(s,e) and V(s) on D using outcome-shaped returns; selector π_ϕ(e|s) ∝ exp(β·(Q−V)).
- LoRA-SFT: select top 25% turns by q_hyb = r_t + 0.5 R(τ); train adapter with token cross-entropy to generate u_t conditioned on (s_t, e_t).
- JPO: freeze the SFT adapter as π_ref, then perform offline clipped surrogate optimization with importance ratios ρ_t and an asymmetric advantage Ã_t = A_t if A_t > 0 else κ·A_t (κ ∈ [0,1]), plus Kullback–Leibler (K3) anchoring to π_ref.
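The three stages above can be sketched per state and per turn in plain Python. This is a hedged illustration, not the authors' implementation. Assumptions: the importance ratio is taken relative to the frozen reference π_ref, the KL anchor uses the common k3 estimator (r − 1 − log r with r = π_ref/π_θ), and all default hyperparameters (`beta`, `kappa`, `clip_eps`, `kl_coef`) are placeholders. Function names are hypothetical.

```python
import math


def selector_probs(q_values, v, beta=1.0):
    """pi(e|s) proportional to exp(beta * (Q(s, e) - V(s))) for one state."""
    logits = {e: beta * (q - v) for e, q in q_values.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {e: math.exp(l - top) for e, l in logits.items()}
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}


def hybrid_filter(turns, keep_frac=0.25, mix=0.5):
    """Keep the top fraction of (r_t, R_tau) pairs by q_hyb = r_t + mix * R_tau."""
    ranked = sorted(turns, key=lambda t: t[0] + mix * t[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_frac))
    return ranked[:n_keep]


def jpo_turn_loss(logp_new, logp_ref, advantage,
                  kappa=0.5, clip_eps=0.2, kl_coef=0.1):
    """Per-turn loss: negative clipped surrogate plus a k3 KL penalty."""
    rho = math.exp(logp_new - logp_ref)  # importance ratio (assumed vs. pi_ref)
    # Asymmetric advantage: negative advantages are scaled down by kappa.
    shaped = advantage if advantage > 0 else kappa * advantage
    clipped = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(rho * shaped, clipped * shaped)
    log_r = logp_ref - logp_new
    k3 = math.exp(log_r) - log_r - 1.0  # non-negative KL estimate
    return -surrogate + kl_coef * k3
```

In practice `logp_new` and `logp_ref` would be summed token log-probabilities of the utterance u_t under the trainable and frozen adapters; here they are plain floats so the arithmetic is easy to check by hand.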
Evaluation metrics: Success rate, Outcomes (average normalized savings over successful episodes), Utility (average normalized savings over all episodes, with failures counted as 0), and negotiation rounds. Held-out evaluation uses vanilla LLM counterparties and includes ablations.
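The relationship between these metrics can be made concrete with a short sketch. The episode tuple layout and the `summarize` name are assumptions made for illustration, and savings are shown on a 0 to 1 scale although the reported numbers may use a different scale.

```python
def summarize(episodes):
    """Aggregate (success, normalized_savings, rounds) tuples into the four metrics."""
    n = len(episodes)
    wins = [savings for ok, savings, _ in episodes if ok]
    return {
        "success_rate": len(wins) / n,
        # Outcomes: average savings over successful episodes only.
        "outcomes": sum(wins) / len(wins) if wins else 0.0,
        # Utility: failures contribute 0, so the sum is divided by all episodes.
        "utility": sum(wins) / n,
        "rounds": sum(r for _, _, r in episodes) / n,
    }
```

Under these definitions Utility equals Success rate times Outcomes, which is why an agent that closes rarely but on good terms can still score a low Utility.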
Code: authors provide repository (paper link).

Implications for AI Economics

Emotion as an actionable instrument changes bargaining power models:
- Traditional mechanism design and bargaining theory typically model actions as prices, offers, or information revelation. This work shows affective signaling (expressed emotion) should be modeled as a deliberate strategic action channel that influences opponent behavior and payoffs.
Strategic externalities and market manipulation risk:
- Emotion-conditioned agents can be optimized to extract more favorable terms by leveraging affect; deployed at scale (customer service, debt collection, automated negotiation platforms), this could produce systematic transfers and potentially exploitative outcomes. Regulators and platform designers should consider affective signaling as a manipulable lever with welfare implications.
Alignment trade-offs vs robustness:
- Alignment methods that make models accommodating/polite can be exploited in adversarial settings. AI economics must weigh user-aligned behavior against vulnerability to emotional steering, especially when no human is in the loop.
Auditing and transparency:
- Because emotion labeling and expression materially change outcomes, audits of deployed negotiating agents should include tests across emotional prompts and agent behaviors. Platforms may need to require disclosure or constraints on affective strategies (e.g., limiting intentional emotional manipulation).
Mechanism and market design:
- Designers of automated marketplaces and negotiation protocols might need to (a) treat emotion channels as part of the mechanism (e.g., disallow or standardize affective messaging), (b) incorporate robustness criteria in agent certification, or (c) design counterparty agents that are emotion-robust or emotion-aware.
Cost and efficiency considerations:
- Offline distillation (EmoDistill) demonstrates an economical training path: reusing offline LLM-vs-LLM rollouts and judge signals reduces the need for costly online RL. From an economic standpoint, this lowers training cost for specialized negotiation agents while still enabling powerful, strategic behaviors — increasing both accessibility and potential for misuse.
Research and policy directions:
- Incorporate human-in-the-loop evaluation of welfare impacts of emotion-optimized agents.
- Develop defenses and regulation (e.g., constraints on emotional manipulation, mandatory disclosure).
- Extend economic models of bargaining to include affective action spaces and analyze equilibrium consequences (efficiency, surplus distribution, welfare).
- Consider incentive-compatible audits and certification schemes to ensure fairness and prevent exploitative deployment.

Summary takeaway: EmoDistill provides a practical, offline method to teach LLM agents both which emotions to use and how to express them as effective bargaining moves. For AI economics, this means affective signaling is a controllable strategic instrument with measurable effects on bargaining outcomes, raising important concerns and opportunities for mechanism design, regulation, and cost-effective agent development.

Assessment

Paper Type: other
Evidence Strength: medium. The paper provides convincing causal evidence that emotional prompting and distilled emotion policies change outcomes in the simulated negotiation environment (good internal validity, randomized/controlled manipulations and ablations). However, evidence is limited to offline, agent-to-agent interactions with synthetic or model-generated counterparts, so external validity to human negotiations, real-world markets, and diverse cultural contexts is unclear.
Methods Rigor: medium. The methodology uses modern, appropriate ML techniques (IQL for selection, LoRA-based SFT and Judge Policy Optimization for expression), includes ablations, cross-domain transfer and tournament evaluations, and compares to sensible baselines; nevertheless, it relies on simulated data without human-subject validation, and the paper likely leaves open sensitivity to hyperparameters, dataset construction choices (emotion labeling), and possible evaluation biases from simulated counterparts.
Sample: An offline dataset of agent-to-agent negotiation dialogues across four emotion-sensitive, high-stakes simulated negotiation domains; emotional labels/inductions derived from the GoEmotions taxonomy and used for affective prompting; models include small language models (SLMs) and larger LLM baselines, with an IQL-trained emotion selector and a LoRA-fine-tuned policy trained via supervised fine-tuning and Judge Policy Optimization; evaluations measure negotiated utility, ablations (no emotion conditioning, IQL-only), transfer across domains and counterparties, and trained-vs-trained tournaments; no human-in-the-loop or field data reported.
Themes: human_ai_collab, org_design
Identification: Controlled within-simulation experiments: the authors manipulate emotion via GoEmotions-based prompts and an IQL-trained emotion selector, comparing outcomes across conditions (emotionally prompted/selected vs neutral/vanilla policies), performing ablations (remove emotion conditioning, remove IQL selection) and transfer/tournament tests to attribute differences in negotiated utility to the emotion channel; identification is internal to agent-to-agent simulations rather than coming from exogenous real-world variation.
Generalizability:
- Results are from agent-to-agent simulations and may not generalize to human negotiators or mixed human-AI interactions.
- GoEmotions taxonomy and prompt-based emotion induction may not capture the range or cultural nuance of real emotional expression.
- Limited to four specific negotiation domains; performance in other economic contexts or higher-stakes real-world bargaining is uncertain.
- Model-scale and architecture dependence: findings may not transfer across different LLM scales or non-LoRA fine-tuning regimes.
- Offline learning ignores online adaptation, strategic behavior, and long-run dynamics that occur with repeated human opponents.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style.	Decision Quality	negative	high	negotiation outcomes / agent utility (shifted toward counterparty interests)	0.12
We introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents.	Other	positive	high	method/framework existence and capability to distill emotional negotiation skills	0.02
EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns which emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns how to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO).	Other	positive	high	ability to select and express emotion (method decomposition)	0.06
Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection.	Decision Quality	positive	high	utility (negotiation reward/outcome)	0.12
Ablations show that emotion conditioning is essential.	Decision Quality	positive	high	performance/utility difference when emotion conditioning is removed	0.12
Transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments.	Decision Quality	positive	high	generalization of policy performance (utility) across domains and opponents	0.12
EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.	Training Effectiveness	positive	high	training approach (offline learning) and its cost-avoidance benefit	0.06
Emotion is a strategic action channel rather than a surface style.	Decision Quality	mixed	high	role of emotion in strategy (impact on negotiation outcomes)	0.12