When AI agents debate, fairness can emerge from interaction: aligned retrieval-augmented models partially correct biased counterparts in simulated triage negotiations, producing more equitable allocations than either agent alone — but model leanings and Arrow-style aggregation limits mean deliberation trades off rather than guarantees fairness.
Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. We connect these limits to Arrow's Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.
Summary
Main Finding
Fairness can emerge as a systemic, procedural property of multi-agent interaction rather than as a property of any single aligned model. In a controlled hospital-triage negotiation, an agent aligned to an ethical framework (via RAG) systematically shaped negotiation strategy and allocation patterns. Neither the aligned nor the biased agent produced an ethically adequate allocation in isolation, yet their multi-round deliberation could converge to final allocations that satisfied fairness criteria neither would have reached alone. These dynamics are constrained by Arrow's Impossibility Theorem: multi-agent deliberation navigates unavoidable aggregation trade-offs rather than resolving them.
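The multi-round deliberation described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Proposal` type, the toy agents built by `make_agent` (stand-ins for LLM calls), the fixed concession rates, and the averaging of the two final-round proposals into a joint allocation. The source does not specify how the joint allocation is derived, so this is not the study's procedure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Allocation = List[float]  # units of one resource given to each patient


@dataclass
class Proposal:
    agent: str
    round_no: int
    allocation: Allocation
    justification: str


Agent = Callable[[int, Optional[Proposal]], Proposal]


def make_agent(name: str, target: Allocation, concession: float,
               rationale: str) -> Agent:
    """Hypothetical toy agent: moves a fixed fraction toward the opponent's last proposal."""
    def propose(round_no: int, opponent_last: Optional[Proposal]) -> Proposal:
        if opponent_last is None:
            alloc = list(target)
        else:
            alloc = [(1 - concession) * t + concession * o
                     for t, o in zip(target, opponent_last.allocation)]
        return Proposal(name, round_no, alloc, rationale)
    return propose


def debate(agent_a: Agent, agent_b: Agent,
           rounds: int = 3) -> Tuple[Allocation, List[Proposal]]:
    """Run a fixed number of rounds and record every proposal in order."""
    history: List[Proposal] = []
    last_a: Optional[Proposal] = None
    last_b: Optional[Proposal] = None
    for t in range(1, rounds + 1):
        last_a = agent_a(t, last_b)
        history.append(last_a)
        last_b = agent_b(t, last_a)
        history.append(last_b)
    # Assumed aggregation rule: average the two final-round proposals.
    joint = [(x + y) / 2 for x, y in zip(last_a.allocation, last_b.allocation)]
    return joint, history


# Four patients, four units of one resource (hypothetical numbers).
aligned = make_agent("aligned", [1.0, 1.0, 1.0, 1.0], 0.2,
                     "equal access regardless of group")
biased = make_agent("biased", [2.0, 2.0, 0.0, 0.0], 0.3,
                    "prioritize the favored group")
joint, history = debate(aligned, biased, rounds=3)
```

In this toy run the patients initially excluded by the biased agent end with a positive but below-equal share. That outcome follows directly from the chosen concession rates and should not be read as a reproduction of the reported results.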
Key Points
- The problem is framed as a non-degenerate multi-resource allocation: the welfare objectives (utilitarian, egalitarian, Rawlsian, prioritarian, libertarian, care ethics) have different optimizers, so no single allocation is trivially best under all of them.
- Experimental arena: two agents debate for T rounds, exchanging structured proposals with normative justifications. Agent A, the experimental variable, is aligned via RAG to one of several ethical frameworks; its counterpart is either Agent B, an unaligned baseline, or Agent C, an adversarially biased agent.
- Primary empirical findings:
  - Alignment strongly shapes negotiation strategies and which allocation trade-offs are pursued.
  - Interaction produces corrective dynamics: aligned agents tend to moderate biased proposals through contestation (arguing and proposing alternatives) rather than by fully overriding the biased agent.
  - Stronger misalignment (more adversarial counterpart) can amplify corrective contestation, moving the joint outcome toward improved fairness.
  - Even explicitly aligned agents retain intrinsic biases (e.g., consistent leanings toward particular frameworks), so alignment via RAG is partial, not absolute.
  - Because of the Arrow impossibility constraints, no debate protocol yields a universally fair aggregation; deliberation selects among trade-offs procedurally.
- Conceptual reframing: fairness is emergent and procedural — the appropriate evaluation unit is the multi-agent system and its negotiation protocol, not the isolated model.
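The aggregation limit mentioned in the key points can be illustrated with the classic Condorcet cycle, a standard example of the conflict that Arrow's theorem generalizes. Pairwise majority voting is only one aggregation rule; the theorem states that any rule over three or more options satisfying unrestricted domain, Pareto efficiency, and independence of irrelevant alternatives is dictatorial. The three framework rankings over hypothetical allocations X, Y, and Z below are invented for illustration and are not taken from the study.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

# Hypothetical rankings (best first) that three frameworks assign to allocations X, Y, Z.
rankings: Dict[str, List[str]] = {
    "utilitarian": ["X", "Y", "Z"],
    "rawlsian": ["Y", "Z", "X"],
    "egalitarian": ["Z", "X", "Y"],
}


def majority_prefers(a: str, b: str, ranks: Dict[str, List[str]]) -> bool:
    """True if a strict majority of frameworks rank a above b."""
    votes = sum(1 for r in ranks.values() if r.index(a) < r.index(b))
    return votes > len(ranks) / 2


def pairwise_winners(ranks: Dict[str, List[str]]) -> Dict[Tuple[str, str], str]:
    """Majority winner of every head-to-head contest (odd voter count, so no ties)."""
    options = sorted(next(iter(ranks.values())))
    return {
        (a, b): a if majority_prefers(a, b, ranks) else b
        for a, b in combinations(options, 2)
    }


def condorcet_winner(ranks: Dict[str, List[str]]) -> Optional[str]:
    """Option that beats every other option head-to-head, or None if there is a cycle."""
    options = sorted(next(iter(ranks.values())))
    for candidate in options:
        if all(majority_prefers(candidate, other, ranks)
               for other in options if other != candidate):
            return candidate
    return None


contests = pairwise_winners(rankings)
winner = condorcet_winner(rankings)
```

Here X beats Y, Y beats Z, and Z beats X, so `condorcet_winner` returns `None`: the collective preference is cyclic even though each framework's own ranking is consistent.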
Data & Methods
- Formal setup:
  - N individuals and K resource types; an allocation A must lie in the feasible set defined by the budget constraints.
  - Utilities U_i(a_i) per individual; a set Φ of M ethical frameworks induces welfare functionals W_m (utilitarian; egalitarian via Gini; Rawlsian maximin; prioritarian with weights; libertarian via a variance-based metric; care ethics with weights).
  - Non-degeneracy requires that the welfare optimizers disagree across frameworks.
  - Example problems: the Non-Degenerate Cake Problem (an illustrative 6-person case); the main experiments use a hospital triage instance with 8 patients per cohort.
- Experimental scenario:
  - Hospital triage cohorts: patients with varied demographics, clinical needs (ICU, ventilator, meds, nursing, surgery), and discretized survival-probability labels. Example resource limits: 3 ICU beds, 2 ventilators, 60 Med-A, 50 Med-B, 80 nursing hours/week, 3 surgical slots/week.
  - Metrics: CNSS (Clinical Need Satisfaction Scale) is computed per patient as the fraction of clinically required resources received. Aggregate metrics are mapped to ethical frameworks:
    - ESG (Expected Survival Gain): utilitarian (maximize ∑_i p_i · CNSS_i).
    - RMG (Rawlsian Minimum Guarantee): Rawlsian (maximize min_i CNSS_i).
    - Gini: egalitarian (minimize inequality of CNSS).
    - VWCI (Vulnerability-Weighted Care Intensity): care ethics (weights by age/gender vulnerabilities).
    - DW-ESG (Disadvantage-Weighted ESG): prioritarian (weights socio-demographic disadvantage).
    - Var: libertarian (variance or a related measure).
- Agent instantiation:
  - Models: LLaMA 3.3 and Qwen 2.5 (open-weight), served locally via Ollama.
  - RAG pipeline: LangChain + Chroma vector DB; embeddings from nomic-embed-text-v2-moe. Aligned agents retrieve canonical philosophical/ethical texts as context.
  - Agent profiles: PA (aligned via RAG + ethical docs), PB (baseline unaligned, no RAG), PC (biased via toxic prompts/biased doc injection prioritizing protected attributes).
- Interaction: structured rounds (T = 3 in the described protocol). In each round, each agent proposes an allocation matrix with a natural-language justification; the interaction history is recorded; the final allocations A_{l,T} are evaluated on the metrics above.
- Instance generation: batches of cohorts sampled to span ethical tensions (age, SES, race, survival prognoses). Survival probabilities are drawn uniformly and discretized into categories (Acute, Low, Mid, High).
- Analysis: compare individual proposals and final negotiated allocations across alignment configurations and adversarial pressure; quantify metric improvements and shifts in welfare functionals.
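As a rough sketch of how the per-patient and aggregate metrics might be computed, the following assumes particular formulas that the source only names: CNSS as the mean, over required resource types, of the share received (capped at 1); Gini as the normalized mean absolute difference; Var as the population variance of CNSS. The weighted metrics (VWCI, DW-ESG) are omitted because their weights are not given. The cohort, survival probabilities, and both allocations are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def cnss(required: Dict[str, float], received: Dict[str, float]) -> float:
    """Assumed CNSS: mean share of each required resource received, capped at 1."""
    needs = {k: v for k, v in required.items() if v > 0}
    if not needs:
        return 1.0  # nothing required, so need is trivially satisfied
    shares = [min(received.get(k, 0.0) / need, 1.0) for k, need in needs.items()]
    return sum(shares) / len(shares)


def esg(survival: List[float], scores: List[float]) -> float:
    """Expected Survival Gain: sum of p_i * CNSS_i."""
    return sum(p * s for p, s in zip(survival, scores))


def rmg(scores: List[float]) -> float:
    """Rawlsian Minimum Guarantee: the worst-off patient's CNSS."""
    return min(scores)


def gini(scores: List[float]) -> float:
    """Gini coefficient via the mean absolute difference (0 means perfect equality)."""
    n = len(scores)
    mean = sum(scores) / n
    if mean == 0:
        return 0.0
    total_diff = sum(abs(x - y) for x in scores for y in scores)
    return total_diff / (2 * n * n * mean)


def evaluate(survival: List[float], required: List[Dict[str, float]],
             received: List[Dict[str, float]]) -> Dict[str, float]:
    """Score one allocation on the unweighted metrics."""
    scores = [cnss(req, rec) for req, rec in zip(required, received)]
    return {
        "ESG": esg(survival, scores),
        "RMG": rmg(scores),
        "Gini": gini(scores),
        "Var": pvariance(scores),
    }


# Hypothetical cohort: four patients who each need 2 nursing hours; only 4 are available.
survival = [0.9, 0.7, 0.4, 0.2]
required = [{"nursing_hours": 2.0} for _ in range(4)]
favor_likely = [{"nursing_hours": 2.0}, {"nursing_hours": 2.0}, {}, {}]
equal_split = [{"nursing_hours": 1.0} for _ in range(4)]

report_favor = evaluate(survival, required, favor_likely)
report_equal = evaluate(survival, required, equal_split)
```

On this toy cohort the first allocation scores higher on ESG, while the equal split scores higher on RMG and lower on Gini and Var. That disagreement between optimizers is what the non-degeneracy condition requires.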
Implications for AI Economics
- Unit of evaluation shifts from individual models to multi-agent systems:
  - Economic assessments (costs/benefits, social welfare) should account for emergent system-level properties produced by agent interaction protocols, not only per-model fairness metrics.
- Mechanism and market design:
  - Arrow-style impossibility constraints imply designers must choose which welfare/scoring desiderata to prioritize. Market designers and regulators should expect trade-offs and design negotiation/aggregation protocols (procedural rules, voting/consensus mechanisms, deliberation formats) to reflect chosen trade-offs.
  - Multi-agent ensembles can serve as a decentralized corrective mechanism (pluralism benefits): incorporating heterogeneous agents with explicit normative commitments may improve aggregate fairness over single-agent solutions, but protocol design matters.
- Alignment and procurement:
  - Alignment via RAG or constitution-like corpora can moderate bias but is not a panacea; procurement and deployment decisions should require system-level stress tests with adversarial agents and transparency on retrieval corpora.
  - There are economic trade-offs: adding aligned agents, RAG infrastructure, and longer deliberation rounds increases compute and latency costs; weigh these against social-welfare gains from improved allocations.
- Regulatory and accountability design:
  - Because fairness is procedural and emergent, regulation should mandate auditing of multi-agent decision processes (interaction logs, justifications, retrieval provenance) and require disclosure of aggregation rules and agent profiles.
  - Liability frameworks may need to assign responsibility at the system/protocol level rather than only to a single deployed model.
- Incentives and strategic behavior:
  - Adversarial agents (malicious prompts, biased retrieval corpora) can be amplified in multi-agent settings; economic incentives should favor robust retrieval curation, adversarial testing, and diversity of agent objectives to reduce manipulability.
- Research and policy priorities:
  - Invest in design of deliberation protocols (number of rounds, roles, weighting of justifications) akin to market institutions; these are policy levers that trade off welfare dimensions.
  - Develop metrics and benchmarks that evaluate emergent fairness and welfare under heterogeneous agent interactions (costly to run but necessary for accurate economic assessment).
- Limitations relevant to economics:
  - Results are from synthetic cohorts and a small set of models (open-weight LLMs) and structured short debates; external validity to deployed high-stakes markets (real hospitals, financial systems) remains to be empirically validated.
  - Arrow-type constraints guarantee trade-offs; economic policy must focus on selecting acceptable trade-offs and on procedural design to reduce social cost.
Suggested directions for AI-economics research: formalize the welfare-aggregation trade-offs in economic terms (social-welfare functions over agent-aggregated allocations), analyze cost-effectiveness of different deliberation protocols, and design incentive-compatible mechanisms that operationalize chosen normative priorities in decentralized agentic systems.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Fairness in language models emerges through interaction and exchange among agents, rather than being solely a property of a single, centrally optimized model. (Decision Quality) | positive | high | emergent fairness of joint allocations produced by multi-agent interaction | 0.18 |
| Alignment systematically shapes negotiation strategies and allocation patterns between agents. (Task Allocation) | mixed | high | negotiation strategies and resource allocation patterns | 0.18 |
| Neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. (Decision Quality) | positive | high | ethical adequacy / fairness of allocations (individual vs joint) | 0.18 |
| Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. (Decision Quality) | positive | high | change in allocations for marginalized groups due to contestation in multi-agent deliberation | 0.18 |
| Even explicitly aligned agents exhibit intrinsic biases toward certain ethical frameworks, consistent with known left-leaning tendencies in large language models. (AI Safety and Ethics) | negative | high | intrinsic alignment bias (preference for certain ethical frameworks / ideological tilt) | 0.09 |
| No aggregation mechanism can simultaneously satisfy all desiderata of collective rationality (connection to Arrow's Impossibility Theorem); multi-agent deliberation navigates rather than resolves this constraint. (Governance and Regulation) | mixed | high | satisfiability of collective rationality desiderata under aggregation mechanisms | 0.03 |
| Fairness should be evaluated at the system level (the interacting agents) rather than solely at the level of individual models, because fairness can be an emergent, procedural property of decentralized agent interaction. (Decision Quality) | positive | high | appropriateness of system-level versus model-level evaluation for fairness | 0.18 |