Value alignment is a governance problem, not just an engineering puzzle: misalignment stems from objective-setting, information flows and whose interests count, so durable solutions require institutions and contested decision‑making rather than only model tweaks.
The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but is instead an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis, and can affect stakeholders differently, the structural description shows that alignment cannot be "solved" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.
Summary
Main Finding
Value alignment for AI is primarily a structural governance problem, not a single technical or purely normative one. Using the principal–agent framework, the paper shows misalignment arises along three interacting axes—objectives, information, and principals—and therefore cannot be “solved” by model design alone. Alignment is pluralistic and context-dependent: resolving it requires ongoing institutional processes that negotiate which interests count, how objectives are set and verified, and what trade-offs are acceptable.
Key Points
- Structural definition: Value alignment problems are instances of a principal–agent problem in human→AI delegation that occur when (a) the AI agent’s objective function is mis‑specified, or (b) there are informational asymmetries between human principals and the AI agent.
- Three orthogonal but interacting axes:
  - Objectives axis: proxy mis‑specification, reward hacking, perverse incentives and the usual “outer alignment” issues.
  - Information axis: opacity, hidden actions, non‑verifiability, distributional shift, and other informational asymmetries that prevent principals from observing or verifying agent behaviour.
  - Principals axis: plurality and conflict among stakeholders (developers, users, affected communities, regulators, shareholders) so that “aligned” for whom becomes contested.
- Interaction effects: misalignment along one axis can exacerbate the others (e.g., biased proxies both misstate objectives and conceal harms from affected stakeholders).
- Human→AI special case: unlike agents in human–human agency relationships, AI agents have no intrinsic values; misalignment often stems from imperfect specification of objectives and institutional gaps rather than from independently motivated agents.
- Scaling hypothesis for value‑aligned AI: as model generality, deployment scope, and stakeholder diversity increase, informational asymmetries, value conflicts, and power imbalances are systematically amplified, making alignment harder to manage.
- Practical conclusion: technical methods (RLHF, interpretability, robustness, benchmarks) matter but must be embedded in governance arrangements—contractual design, accountability, participatory processes, dispute and remediation mechanisms—to manage ongoing misalignment and trade‑offs.
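The objectives axis can be illustrated with a toy delegation problem. The sketch below is not from the paper: the utility functions, the effort budget, and all names are hypothetical, chosen only to show how an agent that optimises a mis-specified proxy can diverge from what the principal actually values.

```python
# Toy illustration of proxy mis-specification (objectives axis).
# All functions and numbers are hypothetical assumptions, not the paper's model.

def true_utility(measured, unmeasured):
    # Assumed principal utility: both tasks matter, with diminishing returns.
    return measured ** 0.5 + unmeasured ** 0.5

def proxy_reward(measured, unmeasured):
    # Assumed mis-specified proxy: only the observable task is rewarded.
    return measured

def best_allocation(objective, budget=10):
    # Enumerate integer splits of a fixed effort budget and pick the best one.
    splits = [(m, budget - m) for m in range(budget + 1)]
    return max(splits, key=lambda split: objective(*split))

# The agent optimises the proxy; the principal would prefer the true optimum.
agent_choice = best_allocation(proxy_reward)
principal_choice = best_allocation(true_utility)

# Utility the principal loses because the proxy ignores the unmeasured task.
alignment_gap = true_utility(*principal_choice) - true_utility(*agent_choice)
```

In this toy setting the agent puts all effort into the measured task, while the principal would split effort evenly. The gap is invisible to a principal who observes only the measured task, which is one way the objectives and information axes can compound.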
Data & Methods
- Paper type: conceptual/theoretical analysis and literature synthesis (FAccT 2026 conference paper; arXiv preprint).
- Core method: apply and extend the principal–agent (agency) framework from economics to human–AI delegation.
  - Uses canonical agency concepts (moral hazard, adverse selection, non‑verifiability) to characterize the information axis.
  - Maps outer alignment / objective specification issues to the objectives axis.
  - Incorporates pluralistic and social‑choice insights to formalize the principals axis.
- Draws on and situates existing technical approaches (e.g., RLHF, CIRL, incomplete contracting, game‑theoretic methods) but does not present new empirical data or algorithmic experiments.
- Provides diagnostic examples (e.g., predictive policing) to illustrate interactions among axes and to motivate the scaling hypothesis.
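The mapping from agency concepts to the three axes can be stated compactly. The notation below is an illustrative assumption: the symbols $a$, $s$, $\hat{U}$, $U_i$ and $\lambda_i$ are not taken from the paper, and this is one conventional way to write a delegation problem rather than the paper's own formalism.

```latex
% Illustrative notation only; every symbol here is an assumption.
\begin{align}
  a^{*} &\in \arg\max_{a \in A} \hat{U}(a)
    && \text{agent optimises the specified proxy objective} \\
  W_{\lambda}(a) &= \sum_{i=1}^{n} \lambda_i U_i(a),
    \quad \lambda_i \ge 0,\ \sum_{i=1}^{n} \lambda_i = 1
    && \text{weighted welfare of principals } i = 1, \dots, n \\
  s &\sim p(s \mid a)
    && \text{principals observe a signal, not the action} \\
  \Delta(\lambda) &= \max_{a \in A} W_{\lambda}(a) - W_{\lambda}(a^{*})
    && \text{misalignment gap under weights } \lambda
\end{align}
```

Read this way, the objectives axis concerns the difference between $\hat{U}$ and $W_{\lambda}$, the information axis concerns how much $s$ reveals about $a$, and the principals axis concerns who sets the weights $\lambda$. The gap $\Delta(\lambda)$ depends on $\lambda$, which is one way to express the claim that alignment is always alignment "for whom".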
Implications for AI Economics
- Research agenda:
  - Model multi‑principal delegation problems formally (multiple stakeholders with heterogeneous utilities) rather than single‑principal setups.
  - Extend contract theory and mechanism design to AI agents: design incentives, verification protocols, and enforceable contracts that account for informational asymmetries and pluralistic principals.
  - Incorporate externalities, public goods, and distributional impacts of misalignment into welfare analyses.
  - Study dynamics under the scaling hypothesis: how increases in model capability and scope change market equilibria, investment in safety, and systemic risk.
- Market and firm incentives:
  - Aligning commercial incentives (shareholder returns, product market competition) with broader stakeholder welfare requires institutional fixes (regulation, standards, liability rules), not only technical fixes.
  - Competitive pressures can create a “race” that under‑invests in governance; economists should model these strategic interactions and potential market failures.
- Information and verification markets:
  - There is economic value in verification, auditing, and monitoring services (third‑party audits, red teaming, certification); market design and regulation can promote supply of credible information.
  - Information provision (transparency, provenance, model cards) affects contracting costs and the feasible set of incentive schemes.
- Policy and regulation:
  - Design policy instruments oriented to governance: mandatory disclosure, auditability standards, stakeholder representation in procurement and public contracts, liability rules to internalize harms.
  - Use incomplete contracting insights: require minimum verification rights, contingency clauses, and remediation procedures when proxies fail.
- Distributional and welfare trade‑offs:
  - “Aligned enough” is inherently political and involves trade‑offs (efficiency vs equity, short‑run performance vs long‑term safety). Economists should quantify these trade‑offs and model how institutional rules allocate alignment costs across agents and stakeholders.
- Institutional innovations:
  - Develop mechanisms for participatory decision‑making and grievance redress that change who counts as a principal in practice (e.g., community oversight, stakeholder co‑design).
  - Consider insurance, indemnity, and funding instruments to manage residual alignment risk and externalities.
- Practical modelling implications:
  - When evaluating interventions (technical or policy), include informational frictions and multiple principals in counterfactuals.
  - Cost–benefit and welfare assessments should include governance costs of monitoring, contestation, and ongoing negotiation.
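The call to model multiple principals with heterogeneous utilities can be made concrete with a small sketch. Everything in it is a hypothetical assumption introduced for illustration: the stakeholder names, the candidate policies, the utility numbers, and the use of a weighted sum as the aggregation rule. It shows only that the "aligned" choice changes with whose interests are weighted.

```python
# Toy multi-principal delegation: the preferred policy depends on the weights.
# Stakeholders, policies, and utilities are hypothetical, not from the paper.

UTILITIES = {
    "developer": {"max_engagement": 9, "balanced": 6, "conservative": 2},
    "user": {"max_engagement": 4, "balanced": 7, "conservative": 5},
    "affected_community": {"max_engagement": 1, "balanced": 5, "conservative": 8},
}
POLICIES = ["max_engagement", "balanced", "conservative"]

def welfare(policy, weights):
    # Weighted sum of stakeholder utilities; one possible aggregation rule.
    return sum(weights[name] * UTILITIES[name][policy] for name in UTILITIES)

def chosen_policy(weights):
    # Policy that maximises weighted welfare under the given weights.
    return max(POLICIES, key=lambda policy: welfare(policy, weights))

# Three assumed answers to "whose interests count".
developer_only = {"developer": 1.0, "user": 0.0, "affected_community": 0.0}
equal_weights = {"developer": 1 / 3, "user": 1 / 3, "affected_community": 1 / 3}
community_led = {"developer": 0.1, "user": 0.2, "affected_community": 0.7}

results = {
    "developer_only": chosen_policy(developer_only),
    "equal_weights": chosen_policy(equal_weights),
    "community_led": chosen_policy(community_led),
}
```

With these numbers, each weighting selects a different policy, so no single policy is aligned for every principal. A weighted sum is only one of many aggregation rules; a single-principal counterfactual corresponds to the `developer_only` case.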
Taken together, the paper suggests AI economists should shift more attention from treating alignment as a property of models to analysing the institutional and market mechanisms that determine whose values get encoded, how objective proxies are chosen and monitored, and how information and power imbalances are corrected.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. [Governance And Regulation] | negative | high | framing_of_problem_in_literature | 0.12 |
| The alignment problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. [Governance And Regulation] | positive | high | interpretation_of_alignment_problem | 0.02 |
| Misalignment can be reconceptualised as arising along three interacting axes: objectives, information, and principals (drawing on the principal–agent framework). [Governance And Regulation] | positive | high | sources_of_misalignment | 0.02 |
| The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. [Governance And Regulation] | positive | high | diagnostic_power_of_framework | 0.02 |
| The three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. [Governance And Regulation] | positive | high | primary_domain_responsible_for_alignment | 0.02 |
| Alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. [Governance And Regulation] | positive | high | nature_of_alignment_solutions | 0.02 |
| Because misalignment can occur along each axis -- and affect stakeholders differently -- alignment cannot be 'solved' through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions. [Governance And Regulation] | positive | high | feasibility_of_technical_only_solutions | 0.02 |