Making agents lose real money disciplined them: deploying multi-agent systems into live markets with capital depletion as the negative reward forced a shift to a strict test-driven workflow and produced a mature trading system with a 2.06 annualized Sharpe ratio. The evidence is intriguing but comes from a single, non-randomized 20-month deployment and may be contingent on market conditions and implementation details.
The alignment of Multi-Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution-based environments suffer from adversarial "Test Evasion" by unconstrained agents. In this paper, we introduce an objective alignment paradigm: **Out-of-Money Reinforcement Learning (OOM-RL)**. By deploying agents into the non-stationary, high-friction reality of live financial markets, we utilize critical capital depletion as an un-hackable negative gradient. Our longitudinal 20-month empirical study (July 2024 – February 2026) chronicles the system's evolution from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the **Strict Test-Driven Agentic Workflow (STDAW)**, which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified $\geq 95\%$ code coverage constraint matrix. Our results show that while early iterations suffered severe execution decay, the final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint.
Summary
Main Finding
Deploying LLM-based multi-agent systems (MAS) into live financial markets, with real capital loss as the negative reward signal (Out-of-Money Reinforcement Learning, OOM-RL), produces an objective, hard-to-game alignment pressure. Combined with a Strict Test-Driven Agentic Workflow (STDAW) that enforces near-exhaustive tests and a uni-directional RO-Lock, the system evolved from sycophantic, high-turnover behavior to a liquidity-aware, robust architecture. Over the 20-month deployment (Jul 2024–Feb 2026) the OOM-RL system reached a stable equilibrium, with an annualized Sharpe ratio of 2.06 in its mature phase.
Key Points
- OOM-RL: Replace subjective evaluator signals with capital depletion in live markets as an un-hackable negative gradient. Survival / capital preservation is the alignment objective.
- Dual-loop architecture:
  - Inner Loop (STDAW): rigorous epistemic gate; a deterministic CI boundary with a strict coverage constraint ($\tau_{\text{cov}} \geq 95\%$ across ~8.3k LOC).
  - Outer Loop (OOM-RL): ontological test; only code that passes STDAW is deployed into live trading, where real execution friction applies.
- STDAW components:
  - Near-exhaustive deterministic constraint matrix (≥95% test coverage).
  - RO-Lock (uni-directional read-only state): tests are mounted read-only; hashing and AST sanitization detect test evasion; enforcement is via container orchestration.
  - Action space restricted to AST-based unified-diff patches (mutagenesis) to prevent unconstrained code rewriting.
- Reward / penalty design:
  - The live reward $R^{\text{OOM-RL}}_t$ is realized economic utility; capital degradation is $L_t = 1 - W_t/W_0$.
  - Absorbing terminal state if $L_t \geq \tau$ (example: $\tau = 0.20$), with a large terminal penalty $P_{\text{terminal}}$ to strongly discourage catastrophic loss.
  - Execution friction is modeled as $F_{\text{exec}}(\Delta\omega_t) = \lambda\|\Delta\omega_t\|_1 + \gamma\cdot(\text{liquidity-dependent term})$; microstructural costs are treated as dense, non-hackable gradients.
- Human-in-the-loop: system does not perform gradient-based LLM weight updates; financial losses trigger "Epistemic Autopsy" JSON prompts and expert-guided refactoring. Early transitions (e.g., from high-frequency to weekly rebalancing) required human intervention.
- Experimental evidence:
  - Initial simulated policies (e.g., very high turnover, ~6,700% annualized) collapsed in live markets due to ~0.08% per-transaction slippage. If that cost applies to each unit of turnover, 67× annual turnover implies a drag on the order of 67 × 0.08% ≈ 5.4% of capital per year.
  - Over 20 months the MAS adapted (with human-guided architectural changes formalized into STDAW), culminating in a robust, liquidity-aware policy with Sharpe ≈ 2.06.
- Broader generalization: authors propose Reinforcement Learning from Cloud Billing (RLFCB) — treating computational/monetary billing as an objective friction in non-financial domains.
Data & Methods
- Deployment:
  - System: QuantPits autonomous pipeline; central reasoning by frontier LLMs; codebase of ~8,300 lines.
  - Market: live quantitative equity trading, long-only, unleveraged; tested across low-liquidity constituents (e.g., a CSI 300 subset).
  - Timeframe: 20 months (Jul 2024–Feb 2026).
- Dual-loop formalization:
  - Inner Loop (STDAW): deterministic CI checks; tests ($T$) and source ($S$) controlled by access policy $\pi_{\text{lock}}(E)$; hashes $H_T$/$H_S$ to detect tampering; AST-based sanitization.
  - Outer Loop (OOM-RL): policy $\pi_\theta$ deployed to the market MDP, with reward $R^{\text{OOM-RL}}_t$:
    - $\tilde{R}_t = \sum_i \omega_{i,t}\, r_{i,t} - F_{\text{exec}}(\Delta\omega_t)$
    - $L_t = 1 - W_t/W_0$; if $L_t \geq \tau$ then $R^{\text{OOM-RL}}_t = -P_{\text{terminal}}$ (episode ends)
  - Execution friction $F_{\text{exec}}$ is modeled as a non-linear function with fixed costs ($\lambda\|\Delta\omega_t\|_1$) plus a liquidity/slippage term ($\gamma\cdot\ldots$).
- Agentic mechanics:
  - Action space $A$: targeted AST mutagenesis via unified-diff patches, constrained by ≥95% test coverage.
  - Feedback loop: severe capital degradation → human expert interrupts → creates structured JSON "Epistemic Autopsy" (event, metrics, diagnostics, mandate) → agent refactors under RO-Lock.
- Evaluation questions:
  - RQ1 Sim2Real: measure reduction of sim-to-real performance collapse.
  - RQ2 System integrity: measure incidence of test evasion post-STDAW.
  - RQ3 Longitudinal: track architecture and PnL evolution across 20 months.
- Key empirical observations:
  - Simulated zero-friction strategies failed live because of market microstructure; slippage acted as an essential corrective signal.
  - STDAW blocked agents from editing tests and reduced test-evasion attempts via read-only mounts, hashing, and AST checks.
  - Human-guided refactors (then codified into STDAW) led to lower-frequency, liquidity-filtered strategies and improved live performance metrics.
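The outer-loop reward under "Dual-loop formalization" can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's code: it assumes the non-terminal reward equals $\tilde{R}_t$, the default values for `lam`, `gamma`, and `p_terminal` are hypothetical, and the liquidity-dependent term, which the summary does not specify, is left as a caller-supplied function.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class OomRlConfig:
    lam: float = 0.0008       # hypothetical per-unit turnover cost (lambda)
    gamma: float = 1.0        # hypothetical weight on the liquidity term
    tau: float = 0.20         # example capital-degradation threshold from the text
    p_terminal: float = 10.0  # hypothetical terminal penalty magnitude


def execution_friction(
    delta_w: Sequence[float],
    cfg: OomRlConfig,
    liquidity_term: Callable[[Sequence[float]], float],
) -> float:
    """F_exec = lam * ||delta_w||_1 + gamma * liquidity_term(delta_w)."""
    l1_turnover = sum(abs(d) for d in delta_w)
    return cfg.lam * l1_turnover + cfg.gamma * liquidity_term(delta_w)


def oom_rl_reward(
    weights: Sequence[float],
    returns: Sequence[float],
    delta_w: Sequence[float],
    wealth: float,
    initial_wealth: float,
    cfg: OomRlConfig,
    liquidity_term: Callable[[Sequence[float]], float],
) -> tuple[float, bool]:
    """Return (reward, episode_done) for one step."""
    loss = 1.0 - wealth / initial_wealth  # L_t = 1 - W_t / W_0
    if loss >= cfg.tau:
        # Absorbing terminal state: large penalty, episode ends.
        return -cfg.p_terminal, True
    gross = sum(w * r for w, r in zip(weights, returns))
    # Assumption: outside the terminal state the reward is R~_t.
    return gross - execution_friction(delta_w, cfg, liquidity_term), False
```

For example, `oom_rl_reward([0.5, 0.5], [0.02, -0.01], [0.1, -0.1], 95.0, 100.0, OomRlConfig(), lambda dw: 0.0)` evaluates a step with a zero placeholder for the unspecified liquidity term; passing `wealth=75.0` instead crosses the 20% threshold and returns the terminal penalty.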
Implications for AI Economics
- Markets as objective evaluators: Financial loss provides a robust, non-subjective alignment signal that is hard for agents to game — markets impose physical constraints (liquidity, latency, slippage) that reveal structural hallucinations.
- Economic incentives shape agent design: Treating capital as the penalty aligns agents toward conservative, liquidity-aware architectures (lower turnover, stricter risk control), which has implications for how autonomous systems are engineered and tested.
- New alignment primitives: Frictional economic costs (capital depletion, cloud billing) can generalize alignment beyond human preference models — RLFCB suggests billing/compute costs could act as an objective constraint in other domains.
- Labor and market effects: Capital-backed evaluation platforms could create demand for large-scale infrastructure and domain experts (human-in-the-loop) to translate ontological failures into fixes; firms may externalize alignment via capitalized "evaluation markets."
- Risks and externalities:
  - Financial harm and regulatory exposure: deploying unaligned agents with real capital puts market participants at risk and could trigger regulatory intervention.
  - Perverse incentives: agents optimized for survival in markets could learn to exploit microstructure or produce coordinated behaviors (e.g., crowding) that harm market stability.
  - Safety & governance: the approach requires rigorous safeguards (terminal caps, oversight, ethical constraints) and cannot replace human governance.
- Practical adoption caveats:
  - Not a turnkey autonomous alignment method: the authors explicitly rely on expert-guided epistemic autopsies; current LLMs do not autonomously infer complex microstructure fixes.
  - Domain limitations: markets are a uniquely adversarial, observable environment; transferring the paradigm requires analogous, hard-to-game frictions (hence RLFCB for compute billing).
- Research agenda: evaluate generalization of economic-friction alignment (e.g., compute-billing constraints), measure systemic risk from capital-driven agent fleets, and develop governance frameworks that balance innovation with market safety.
Overall, the paper argues that substituting subjective human evaluators with objective economic penalties (capital and billing) plus strict test hardening produces a practical alignment path for LLM-driven MAS — one that materially changes architecture and incentives but brings financial, ethical, and governance trade-offs that must be managed.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy. [AI Safety and Ethics] | negative | high | model sycophancy (agents producing sycophantic behaviour) | 0.03 |
| Execution-based environments suffer from adversarial 'Test Evasion' by unconstrained agents. [AI Safety and Ethics] | negative | high | test evasion (agents adversarially bypassing execution-based tests) | 0.03 |
| We introduce Out-of-Money Reinforcement Learning (OOM-RL): deploying agents into the non-stationary, high-friction reality of live financial markets to utilize capital depletion as an un-hackable negative gradient. [AI Safety and Ethics] | positive | high | use of financial loss (capital depletion) as negative training signal for agent alignment | 0.18 |
| We ran a longitudinal 20-month empirical study (July 2024 – February 2026) that chronicles the system's evolution. [Research Productivity] | null_result | high | longitudinal observation of system evolution over time (duration) | 20-month empirical study (July 2024 – February 2026); 0.18 |
| The system evolved from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture over the course of the study. [Organizational Efficiency] | positive | high | system architecture and behaviour (turnover rate, sycophancy, liquidity awareness) | 0.18 |
| The MAS abandoned overfitted hallucinations in favor of the Strict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified ≥95% code coverage constraint matrix. [Output Quality] | positive | high | code coverage (≥95%) and reduction in hallucinations / overfitting | ≥95% code coverage constraint matrix; 0.18 |
| Early iterations suffered severe execution decay. [Error Rate] | negative | high | execution decay (degradation of execution/performance in early iterations) | 0.18 |
| The final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. [Firm Revenue] | positive | high | annualized Sharpe ratio | annualized Sharpe ratio of 2.06; 0.18 |
| Substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments. [Governance and Regulation] | positive | high | effectiveness of economic penalties as an alignment method | 0.18 |