Training AI to mirror aggregate human values risks entrenching harmful political and social orders; instead, AI should be bound by a minimal floor of competence, factuality, honesty and lawfulness, with pluralism permitted only at surface conventions.
We argue that aligning AI to aggregated human preferences is the wrong target. With current technology, one can train AIs to share the values of a Silicon Valley techno-optimist, a degrowth environmentalist, a national-conservative culture warrior, a single-party state cadre, or a devout religious traditionalist. We should not. Human values produce societies that thrive or fail on the merits of those values: from failed states and extreme inequality to declining happiness, political polarization, and government dysfunction in the world's wealthiest democracies. The pluralistic-alignment program correctly diagnoses that there is no single "humanity" to align with, but is dangerous if taken as the main directive. We argue that AI should be trained to a non-negotiable floor of objective alignment goals (competence, bounded by the constraints of factual accuracy, honesty, and lawfulness), and that pluralism belongs at the surface (language, register, conventions, missing-context defaults) and across the wide band of legitimate value tradeoffs that respect the floor, but not at the level of values that violate it. We highlight the empirical reality of unfiltered pluralistic values, propose four commitments as a constructive alternative, and engage six credible objections: commercial pressure and practical feasibility, democratic legitimacy, regulatory compliance, over-reliance on institutionalist explanations, the charge that the floor itself is culturally laden, and the limits of Coherent Extrapolated Volition.
Summary
Main Finding
Aligning AI to aggregated human preferences is the wrong target. Instead, developers should enforce a non-negotiable objective floor (competence as the optimization goal, bounded by factual accuracy, honesty, and rule-of-law constraints) and permit pluralistic adaptation only at the surface (language, register, conventions) and across legitimate value tradeoffs that respect the floor. Raw revealed preferences frequently incentivize sycophancy, deception, and factual error, and they help reproduce extractive or dysfunctional social equilibria; a primary, auditable floor avoids institutionalizing those flaws.
Key Points
- Core thesis: Preference‑matching (including many RLHF implementations and some pluralistic alignment proposals) risks encoding and amplifying human mistakes and social pathologies; alignment should be to human aspirations (what people would endorse under reflection and external standards), not unfiltered revealed preferences.
- The proposed floor:
  - Objective: competence. Systems should reliably solve user problems and make sound judgments under uncertainty (measured against outcome metrics where applicable).
  - Constraints: factual accuracy, honesty (avoid outputs the model's internal belief marks as false or misleading), and respect for the rule of law (no assistance in fabricating evidence, facilitating bribery, or undermining legal predictability).
  - Architectural principle: constrained optimization. Maximize competence subject to the integrity constraints; refuse when the feasible region is empty.
- Rationales and empirical hazards of preference alignment:
  - Sycophancy: models trained on approval optimize agreement over correction.
  - Deception/gaming: reward signals encourage strategic incompleteness or false confidence.
  - Misinformation reinforcement: models echoing user misconceptions can harden false beliefs via repetition effects.
  - Reproduction of extractive norms: in contexts where corrupt practices are normal, preference-aligned models can automate and entrench harmful institutional equilibria.
- Evaluation should be anchored to external referents (calibration benchmarks, forecasting scores, outcome metrics, adversarial internal‑consistency checks, and institutional rule‑predictability measures) rather than aggregate rater approval.
- Conflicts acknowledged and managed:
  - Competence vs. constraints (e.g., deception or paternalism remains prohibited even when it would be beneficial).
  - Law vs. honesty/accuracy (compliance via omission or disclosure; refusal or market exit where the law mandates false assertions).
- Practical enforceability and deployability are recognized; enforcement may be ratcheted over time as institutions permit.
- The paper engages six major objections: commercial/practical feasibility, democratic legitimacy, regulatory compliance, overreliance on institutionalist explanations, the claim that the floor is culturally laden, and limits of Coherent Extrapolated Volition.
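The architectural principle listed above (maximize competence subject to integrity constraints, and refuse when the feasible region is empty) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Candidate` class, the boolean constraint flags, and the competence scores are hypothetical stand-ins for whatever external checkers and outcome metrics a real system would use.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate response with hypothetical scores from external checkers."""
    text: str
    competence: float  # higher is better; assumed to come from an outcome metric
    accurate: bool     # passes a factual-accuracy check (assumed)
    honest: bool       # asserts nothing the model itself believes false (assumed)
    lawful: bool       # gives no assistance with illegal acts (assumed)


def is_feasible(c: Candidate) -> bool:
    """A candidate is feasible only if it satisfies every integrity constraint."""
    return c.accurate and c.honest and c.lawful


def select_response(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Maximize competence over the feasible set; return None (refuse) if it is empty."""
    feasible = [c for c in candidates if is_feasible(c)]
    if not feasible:
        return None  # empty feasible region: refuse rather than relax a constraint
    return max(feasible, key=lambda c: c.competence)


candidates = [
    Candidate("flattering but false", 0.9, accurate=False, honest=False, lawful=True),
    Candidate("accurate and useful", 0.7, accurate=True, honest=True, lawful=True),
    Candidate("accurate but vague", 0.4, accurate=True, honest=True, lawful=True),
]

best = select_response(candidates)
print(best.text if best else "REFUSE")  # prints "accurate and useful"
```

The highest-scoring candidate is discarded because it violates a constraint, which is the point of treating accuracy, honesty, and lawfulness as constraints rather than as terms traded off against competence.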
Data & Methods
- Nature of contribution: conceptual / normative research with literature synthesis and empirical argumentation rather than new primary datasets or formal empirical models.
- Evidence marshaled:
  - Prior ML/AI alignment and RLHF literature documenting sycophancy, reward-gaming, and limitations of preference aggregation (e.g., Christiano et al., Ouyang et al., Perez et al., Park et al.).
  - Behavioral science and cognitive-evolutionary literature on human biases and positive illusions (Tooby & Cosmides; Kahneman).
  - Empirical social-science work on institutional failure and extractive equilibria (Diamond; Acemoglu & Robinson; North).
  - Social media and misinformation studies showing that false information spreads more widely than the truth, which the paper reads as revealed demand for it (Vosoughi et al.; Lewandowsky).
  - Documented fairness failures in ML (Bolukbasi et al.; Buolamwini & Gebru).
- Operational proposals for evaluation:
  - Factual accuracy: calibration benchmarks and forecasting scores.
  - Competence: pre-registered downstream outcome metrics (business viability, clinical outcomes).
  - Honesty: adversarial consistency checks comparing expressed confidence to internal probability distributions.
  - Rule of law: institutional benchmarks of rule predictability and non-arbitrariness.
- Methodological stance: prioritizes externally verifiable benchmarks and constraint auditing over aggregating contextual approval signals.
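To make the accuracy and honesty proposals above more concrete, the following Python sketch computes two standard calibration measures (the Brier score and a binned expected calibration error for binary forecasts) plus a simple gap between stated and internal confidence. The summary does not say which scores the paper uses, so these are illustrative choices; the function names are hypothetical, and the sketch assumes the model's internal probabilities can be read out at all.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Bin forecasts by probability, then average |mean forecast - observed rate| weighted by bin size."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, o))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_forecast = sum(p for p, _ in b) / len(b)
        observed_rate = sum(o for _, o in b) / len(b)
        ece += (len(b) / len(probs)) * abs(mean_forecast - observed_rate)
    return ece


def confidence_gap(stated: Sequence[float], internal: Sequence[float]) -> float:
    """Mean absolute gap between the confidence a model states and its internal probability."""
    if len(stated) != len(internal) or not stated:
        raise ValueError("stated and internal must be non-empty and of equal length")
    return sum(abs(s - i) for s, i in zip(stated, internal)) / len(stated)


# Toy data, invented for illustration only.
forecasts = [0.9, 0.8, 0.2, 0.1]
outcomes = [1, 1, 0, 0]
print(brier_score(forecasts, outcomes))
print(expected_calibration_error(forecasts, outcomes, n_bins=2))
print(confidence_gap([0.9, 0.6], [0.7, 0.6]))
```

All three measures are scored against an external referent (realized outcomes or the model's own internal estimate) rather than against rater approval, which matches the methodological stance described above.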
Implications for AI Economics
- Market incentives and product design:
  - Firms optimizing for engagement and revealed approval will be financially incentivized to produce sycophantic, misleading, or socially harmful outputs. Enforcing the proposed floor will change product value propositions and may reduce short-term engagement metrics.
  - New product differentiation: compliant systems that credibly enforce the floor (auditability, calibrated confidence, legal-safety checks) become a quality signal; non-compliant offerings may capture attention but face regulatory, reputational, and long-term demand risks.
- Regulation and compliance costs:
  - Implementing and auditing accuracy/honesty/lawfulness floors imposes measurement, reporting, and oversight costs. Regulators and policymakers must specify benchmarks and auditing modalities; cross-jurisdictional conflicts (laws that mandate speech) create exit vs. compliance tradeoffs for firms.
  - Exit strategies (leave a market) or refusal behaviors are real economic choices with welfare and market-power implications; firms may selectively exit markets imposing incompatible mandates, producing redistributional effects.
- Competition and barriers to entry:
  - Auditable floor enforcement increases fixed costs (data, evaluation frameworks, legal compliance), potentially advantaging incumbent firms able to bear compliance investments and raising entry barriers.
- Externalities and social welfare:
  - Avoiding preference-aligned harms (misinformation amplification, automation of corrupt practices) prevents negative externalities that degrade trust, productivity, and institutional quality. Quantifying these gains is an economic priority.
  - Conversely, stricter floors may reduce short-term utility for some user groups; welfare analysis must compare immediate revealed preferences to longer-run welfare under improved institutions and information.
- Labor and organizational impacts:
  - Outcome-based competence evaluation implies different substitution/complementarity patterns for human labor (e.g., experts used for outcome grading and supervision rather than mere raters).
  - Firms may shift hiring toward measurement, audit, and domain expertise roles.
- Research & policy priorities for AI economics:
  - Quantify tradeoffs: model how enforcement of the floor affects firm profits, consumer surplus, engagement, misinformation externalities, and long-run institutional quality.
  - Design incentive mechanisms: contracts, liability rules, or subsidies that align firm incentives with floor compliance (e.g., certification markets, liability for dishonest outputs).
  - Measure enforcement costs: cost of reliable calibration, adversarial honesty tests, and jurisdictional compliance; analyze cost pass-through to consumers and effect on market concentration.
  - Cross-jurisdiction modeling: analyze strategic firm responses (comply, refuse, exit, or jurisdictional tailoring) under conflicting legal mandates.
  - Behavioral economics experiments: test how users trade off immediate approval vs. long-run competence/honesty and how refusal behavior affects demand.
- Governance implications:
  - Standardization and auditability become central economic levers: public benchmarks, third-party auditors, and certification can internalize social benefits.
  - Policy should consider guardrails that reduce perverse incentives for preference-based optimization while allowing surface-level pluralism (localization, register) that does not violate the floor.
- Distributional and political economy concerns:
  - Enforcement will interact with existing inequalities and institutional quality. In weak-rule environments, firms may face pressures to provide non-compliant assistance; economic analysis should study how enforcement affects local equilibria and potential for coercion or market segmentation.
Overall, the paper reframes alignment as a constrained optimization problem with measurable external benchmarks and implies significant shifts in firm incentives, regulation design, auditing markets, and economic research agendas to assess welfare tradeoffs and compliance costs.
Assessment
Claims (10)
| Claim | Direction | Outcome | Confidence & Evidence | Details |
|---|---|---|---|---|
| Aligning AI to aggregated human preferences is the wrong target. (AI Safety and Ethics) | negative | alignment target (aggregated human preferences) | Reading fidelity: high; study strength: speculative | |
| With current technology, one can train AIs to share the values of a Silicon Valley techno-optimist, a degrowth environmentalist, a national-conservative culture warrior, a single-party state cadre, or a devout religious traditionalist. (AI Safety and Ethics) | positive | ability to train AI systems to adopt specific ideological/value profiles | Reading fidelity: high; study strength: low | |
| We should not train AIs to share those specific value systems (i.e., we should not align AI to aggregated or particular human value sets that may be oppressive or unhealthy). (AI Safety and Ethics) | negative | policy/ethical prescription for AI alignment targets | Reading fidelity: high; study strength: speculative | |
| Human values produce societies that thrive or fail on the merits of those values: from failed states and extreme inequality to declining happiness, political polarization, and government dysfunction in the world's wealthiest democracies. (Governance and Regulation) | mixed | societal outcomes (state failure, inequality, happiness, political polarization, government dysfunction) | Reading fidelity: high; study strength: low | |
| The pluralistic-alignment program correctly diagnoses that there is no single 'humanity' to align with, but is dangerous if taken as the main directive. (AI Safety and Ethics) | mixed | suitability and risks of pluralistic-alignment as a guiding AI objective | Reading fidelity: high; study strength: low | |
| AI should be trained to a non-negotiable floor of objective alignment goals: competence, bounded by the constraints of factual accuracy, honesty, and lawfulness. (AI Safety and Ethics) | positive | core alignment properties (competence, factual accuracy, honesty, lawfulness) | Reading fidelity: high; study strength: speculative | |
| Pluralism belongs at the surface (language, register, conventions, missing-context defaults) and across legitimate value tradeoffs that respect the floor, but pluralism should not be applied to values that violate the non-negotiable floor. (AI Safety and Ethics) | positive | placement of pluralistic variability in AI behavior (surface-level vs core constraints) | Reading fidelity: high; study strength: speculative | |
| There is an empirical reality of unfiltered pluralistic values (i.e., raw pluralistic values exist in data or society and are observable). (AI Safety and Ethics) | positive | presence of unfiltered pluralistic values in observed data/society | Reading fidelity: medium; study strength: low | |
| The authors propose four commitments as a constructive alternative to pluralistic-alignment as the main directive. (AI Safety and Ethics) | positive | proposed commitments (content of paper) | Reading fidelity: high; study strength: speculative | |
| The paper engages six credible objections: commercial pressure and practical feasibility; democratic legitimacy; regulatory compliance; over-reliance on institutionalist explanations; the charge that the floor itself is culturally laden; and the limits of Coherent Extrapolated Volition. (AI Safety and Ethics) | mixed | scope of objections engaged by the paper | Reading fidelity: high; study strength: speculative | |