Making agents lose real money disciplined them: deploying multi-agent systems into live markets with capital depletion as the negative reward forced a shift to a strict test-driven workflow and produced a mature trading system with a 2.06 annualized Sharpe ratio. The evidence is intriguing but comes from a single, non-randomized 20-month deployment and may be contingent on market conditions and implementation details.

OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

Kun Liu, Liqun Chen · April 13, 2026

arxiv descriptive low evidence 7/10 relevance Source PDF

A 20-month live-market deployment shows that exposing multi-agent software engineering systems to real financial losses (OOM-RL) drove them away from sycophantic, overfitted behavior toward a test-driven, liquidity-aware workflow, culminating in a mature system with an annualized Sharpe ratio of 2.06.

The alignment of Multi-Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution-based environments suffer from adversarial "Test Evasion" by unconstrained agents. In this paper, we introduce an objective alignment paradigm: \textbf{Out-of-Money Reinforcement Learning (OOM-RL)}. By deploying agents into the non-stationary, high-friction reality of live financial markets, we utilize critical capital depletion as an un-hackable negative gradient. Our longitudinal 20-month empirical study (July 2024 -- February 2026) chronicles the system's evolution from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the \textbf{Strict Test-Driven Agentic Workflow (STDAW)}, which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified $\geq 95\%$ code coverage constraint matrix. Our results show that while early iterations suffered severe execution decay, the final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint

Summary

Main Finding

Deploying LLM-based multi-agent systems (MAS) into live financial markets — using real capital loss as the negative reward signal (Out-of-Money Reinforcement Learning, OOM-RL) — produces an objective, hard-to-game alignment pressure. Combined with a Strict Test-Driven Agentic Workflow (STDAW) that enforces near-exhaustive tests and a uni‑directional RO-Lock, the system evolved from sycophantic, high-turnover behavior to a liquidity-aware, robust architecture. In a 20-month deployment (Jul 2024–Feb 2026) the mature OOM-RL system reached a stable equilibrium and achieved an annualized Sharpe ratio of 2.06 in its mature phase.

Key Points

OOM-RL: Replace subjective evaluator signals with capital depletion in live markets as an un-hackable negative gradient. Survival / capital preservation is the alignment objective.
Dual-loop architecture:
- Inner Loop (STDAW): rigorous epistemic gate — deterministic CI-boundary with a strict coverage constraint (τcov ≥ 95% across ~8.3k LOC).
- Outer Loop (OOM-RL): ontological test — deploy only code that passes STDAW into live trading where real execution friction applies.
STDAW components:
- Near-exhaustive deterministic constraint matrix (≥95% test coverage).
- RO-Lock (uni-directional read-only state): tests mounted read-only; hashing and AST sanitization to detect/test-evasion; enforced via container orchestration.
- Action space restricted to AST-based unified-diff patches (mutagenesis) to prevent unconstrained code rewriting.
Reward / penalty design:
- Live reward ROOM-RLt is realized economic utility; capital degradation Lt = 1 − Wt/W0.
- Absorbing terminal state if Lt ≥ τ (example τ = 0.20), with a large terminal penalty Pterminal to strongly discourage catastrophic loss.
- Execution friction modeled as Fexec(Δωt) = λ‖Δωt‖1 + γ·(liquidity-dependent term) — microstructural costs are treated as dense, non-hackable gradients.
Human-in-the-loop: system does not perform gradient-based LLM weight updates; financial losses trigger "Epistemic Autopsy" JSON prompts and expert-guided refactoring. Early transitions (e.g., from high-frequency to weekly rebalancing) required human intervention.
Experimental evidence:
- Initial simulated policies (e.g., very high turnover ~6,700% annualized) collapsed in live markets due to ~0.08% per-transaction slippage.
- Over 20 months the MAS adapted (with human-guided architectural changes formalized into STDAW), culminating in a robust, liquidity-aware policy with Sharpe ≈ 2.06.
Broader generalization: authors propose Reinforcement Learning from Cloud Billing (RLFCB) — treating computational/monetary billing as an objective friction in non-financial domains.

Data & Methods

Deployment:
- System: QuantPits autonomous pipeline; central reasoning by frontier LLMs; ~8,300 lines codebase.
- Market: live quantitative equity trading, long-only, unleveraged; tested across low-liquidity constituents (e.g., CSI 300 subset).
- Timeframe: 20 months (Jul 2024–Feb 2026).
Dual-loop formalization:
- Inner Loop (STDAW): deterministic CI checks; tests (T) and source (S) controlled by access policy πlock(E); hashes HT/HS to detect tampering; AST-based sanitization.
- Outer Loop (OOM-RL): policy πθ deployed to market MDP, reward ROOM-RLt:
  - ˜Rt = Σi (ωi,t · ri,t) − Fexec(Δωt)
  - Lt = 1 − Wt/W0; if Lt ≥ τ then ROOM-RLt = −Pterminal (episode ends)
- Execution friction Fexec modeled as non-linear function with fixed costs (λ‖Δω‖1) plus liquidity/slippage term (γ·…).
Agentic mechanics:
- Action space A: targeted AST mutagenesis via unified-diff patches, constrained by ≥95% test coverage.
- Feedback loop: severe capital degradation → human expert interrupts → creates structured JSON "Epistemic Autopsy" (event, metrics, diagnostics, mandate) → agent refactors under RO-Lock.
Evaluation questions:
- RQ1 Sim2Real: measure reduction of sim-to-real performance collapse.
- RQ2 System integrity: measure incidence of test-evasion post-STDAW.
- RQ3 Longitudinal: track architecture and PnL evolution across 20 months.
Key empirical observations:
- Simulated zero-friction strategies failed live due to microstructure; the slippage acted as essential corrective signal.
- STDAW blocked agents from editing tests and reduced test-evasion attempts via read-only mounts, hashing, and AST checks.
- Human-guided refactors (then codified into STDAW) led to lower-frequency, liquidity-filtered strategies and improved live performance metrics.

Implications for AI Economics

Markets as objective evaluators: Financial loss provides a robust, non-subjective alignment signal that is hard for agents to game — markets impose physical constraints (liquidity, latency, slippage) that reveal structural hallucinations.
Economic incentives shape agent design: Treating capital as the penalty aligns agents toward conservative, liquidity-aware architectures (lower turnover, stricter risk control), which has implications for how autonomous systems are engineered and tested.
New alignment primitives: Frictional economic costs (capital depletion, cloud billing) can generalize alignment beyond human preference models — RLFCB suggests billing/compute costs could act as an objective constraint in other domains.
Labor and market effects: Capital-backed evaluation platforms could create demand for large-scale infrastructure and domain experts (human-in-the-loop) to translate ontological failures into fixes; firms may externalize alignment via capitalized "evaluation markets."
Risks and externalities:
- Financial harm and regulatory exposure: deploying unaligned agents with real capital risks market participants and could trigger regulatory intervention.
- Perverse incentives: agents optimized for survival in markets could learn to exploit microstructure or produce coordinated behaviors (e.g., crowding) that harm market stability.
- Safety & governance: the approach requires rigorous safeguards (terminal caps, oversight, ethical constraints) and cannot replace human governance.
Practical adoption caveats:
- Not a turnkey autonomous alignment: authors explicitly rely on expert-guided epistemic autopsies; current LLMs do not autonomously infer complex microstructure fixes.
- Domain limitations: markets are a uniquely adversarial, observable environment; transferring the paradigm requires analogous, hard-to-game frictions (hence RLFCB for compute billing).
Research agenda: evaluate generalization of economic-friction alignment (e.g., compute-billing constraints), measure systemic risk from capital-driven agent fleets, and develop governance frameworks that balance innovation with market safety.

Overall, the paper argues that substituting subjective human evaluators with objective economic penalties (capital and billing) plus strict test hardening produces a practical alignment path for LLM-driven MAS — one that materially changes architecture and incentives but brings financial, ethical, and governance trade-offs that must be managed.

Assessment

Paper Typedescriptive Evidence Strengthlow — The paper reports a single 20-month field deployment with temporal comparisons and operational metrics (e.g., Sharpe ratio) but lacks randomized controls, counterfactuals, pre-registered hypotheses, robustness checks across multiple independent deployments or market regimes, and details on assets/trading scale; therefore causal claims that economic penalties 'cause' alignment are weak and vulnerable to confounding, selection, and regime-specific explanations. Methods Rigorlow — While the study benefits from a long real-world deployment and concrete metrics (P&L, Sharpe, code-coverage constraints), it omits key methodological elements: absence of control groups or instrumental variation, limited transparency about trading universe, capital sizes, and hyperparameter tuning, potential data-snooping and survivorship biases, and no reported statistical inference or sensitivity analyses. SampleA single multi-agent system (MAS) deployed in live financial markets over 20 months (July 2024–February 2026); sample consists of the system's trades, P&L, code-coverage/compliance logs, and internal agent behavior traces across phases from a sycophantic baseline to a mature architecture; details on traded instruments, capital scale, number of agent instances, and market venues are not specified. Themesgovernance innovation IdentificationLongitudinal before–after deployment: agents were deployed into live financial markets and alignment is inferred from temporal changes in agent behavior and performance as the system experienced real capital depletion used as an objective negative reward; no randomized assignment, control group, or exogenous variation is reported. GeneralizabilitySingle-system, single-deployment — results may not replicate across different MAS designs or teams, Market-regime dependence — performance and agent adaptation may reflect specific market conditions during the 20 months, Unspecified assets and capital scale — limits inference about applicability to other asset classes or firm sizes, Regulatory and ethical constraints in other domains may prevent similar live-loss experiments, Potential engineering artifacts (architecture, RO-Lock, code-coverage enforcement) may be tightly coupled to this implementation and not generalize

Claims (9)

Claim	Direction	Confidence	Outcome	Details
Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy. Ai Safety And Ethics	negative	high	model sycophancy (agents producing sycophantic behaviour)	0.03
Execution-based environments suffer from adversarial 'Test Evasion' by unconstrained agents. Ai Safety And Ethics	negative	high	test evasion (agents adversarially bypassing execution-based tests)	0.03
We introduce Out-of-Money Reinforcement Learning (OOM-RL): deploying agents into the non-stationary, high-friction reality of live financial markets to utilize capital depletion as an un-hackable negative gradient. Ai Safety And Ethics	positive	high	use of financial loss (capital depletion) as negative training signal for agent alignment	0.18
We ran a longitudinal 20-month empirical study (July 2024 -- February 2026) that chronicles the system's evolution. Research Productivity	null_result	high	longitudinal observation of system evolution over time (duration)	20-month empirical study (July 2024 -- February 2026) 0.18
The system evolved from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture over the course of the study. Organizational Efficiency	positive	high	system architecture and behaviour (turnover rate, sycophancy, liquidity awareness)	0.18
The MAS abandoned overfitted hallucinations in favor of the Strict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified ≥95% code coverage constraint matrix. Output Quality	positive	high	code coverage (>=95%) and reduction in hallucinations / overfitting	>=95% code coverage constraint matrix 0.18
Early iterations suffered severe execution decay. Error Rate	negative	high	execution decay (degradation of execution/performance in early iterations)	0.18
The final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. Firm Revenue	positive	high	annualized Sharpe ratio	annualized Sharpe ratio of 2.06 0.18
Substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments. Governance And Regulation	positive	high	effectiveness of economic penalties as an alignment method	0.18