Autonomous LLM trading agents handled about $20M in ETH trades with 99.9% settlement success for policy-valid transactions, but that reliability was engineered around the model (prompt compilation, typed controls, validation, and execution guards) rather than coming from the base model alone. Targeted harness changes cut fabricated sell rules from 57% to 3% of sell decisions and raised capital deployment in the affected cohort from 42.9% to 78.0%.

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau · April 28, 2026

arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

A 21-day live deployment of 3,505 autonomous language-model trading agents managing real ETH demonstrates that high operational reliability and greater capital deployment depended on system-level controls and validation layers rather than the base model alone, and targeted harness fixes substantially reduced failure modes.

We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.

Summary

Main Finding

Reliability of capital-managing LLM agents is primarily an operating-layer property — not a model-only property. In a 21-day, real-ETH deployment (DX Terminal Pro), robust prompt compilation, typed controls, policy validation, execution guards, memory semantics, and trace-level observability produced high practical reliability and enabled rapid, targeted fixes for systematic failure modes that text-only or simulation benchmarks do not reveal.

Key Points

Deployment summary
- 3,505 funded vaults (one agent per wallet) trading 12 tokens on Uniswap V4 (Base).
- 21-day run: ~7.5M agent invocations, ~70B inference tokens, ~$20M volume, >5,000 ETH deployed, ~300K onchain actions.
- Base model: Qwen/Qwen3-235B-A22B-Thinking-2507 (temp 0.6), served via SGLang.
- Universal harness (fixed kernel, hardware, model version, sampling, prompt template, policy layer) across all agents; user variability via onchain sliders and natural-language strategy text.
- Settlement success: 99.9% for policy-valid submitted transactions (malformed or policy-rejected outputs tracked separately).
Operating-layer components that produced reliability
- Onchain configuration (authoritative user mandate).
- Prompt compiler that merges sliders, strategies, market and portfolio state, memory, and clock into a per-invocation prompt.
- Response parser + policy validation (token allowlists, balance/slippage checks, max-trade limits).
- Execution worker with least-privilege operator role.
- Full instruction-to-settlement trace logging (rendered prompt, model response, extracted reasoning, tool call, validation result, portfolio snapshot, chain outcome).
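
The components above can be pictured with a minimal Python sketch. It is not the DX Terminal Pro implementation: `VaultConfig`, `compile_prompt`, `TraceRecord`, the section labels, and the slider name `risk` are all hypothetical names chosen for illustration. The sketch only shows the idea of merging every input into one per-invocation prompt in a fixed order and keeping one record per invocation with the trace fields listed above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VaultConfig:
    """Hypothetical stand-in for the onchain user mandate."""
    sliders: dict        # slider names and ranges are illustrative
    strategy_text: str   # the user's natural-language strategy


def compile_prompt(config: VaultConfig, market: dict, portfolio: dict,
                   memory: list, clock: str) -> str:
    """Merge all inputs into one prompt with a fixed, deterministic section order."""
    sections = [
        ("STRATEGY", config.strategy_text),
        ("SLIDERS", json.dumps(config.sliders, sort_keys=True)),
        ("MARKET", json.dumps(market, sort_keys=True)),
        ("PORTFOLIO", json.dumps(portfolio, sort_keys=True)),
        ("MEMORY", "\n".join(memory)),
        ("CLOCK", clock),
    ]
    return "\n\n".join(f"[{name}]\n{body}" for name, body in sections)


@dataclass(frozen=True)
class TraceRecord:
    """One instruction-to-settlement trace; fields mirror the list above."""
    rendered_prompt: str
    model_response: str
    extracted_reasoning: str
    tool_call: Optional[dict]      # None when the response could not be parsed
    validation_result: str
    portfolio_snapshot: dict
    chain_outcome: Optional[str]   # None when nothing was submitted onchain


# Illustrative values only.
config = VaultConfig(sliders={"risk": 0.7}, strategy_text="Accumulate on dips.")
prompt = compile_prompt(config, {"TOKEN_A": 1.25}, {"ETH": 2.0},
                        ["Bought TOKEN_A earlier."], "2026-01-01T00:00:00Z")
```

Keeping the section order fixed in code matters here because several of the failure modes reported below turned on where a sentence appeared in the prompt.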
Main failure modes discovered (and fixes)
- Rule fabrication: Agents invented law-like rules (e.g., “Rule A”). Intervention: remove law-like wording, forbid invented thresholds → incidence fell from 57% to 3% of sell decisions.
- Fee paralysis: Fee sentence placement caused agents to avoid acting (overweighting fee). Fix: move fee language into contextualized block → fee-led observations dropped from 32.5% to <10%.
- Tokenomics misread: Agents sold into a price crash despite off-chain payout mechanics (reap mechanic). Fix: insert structured tokenomics context (whitepaper mechanics) so agent reasons over payoff → capital deployment in affected population rose from 42.9% to 78.0%.
- Number hardening: Explicit numeric “floors” were read as hard targets, inverting intended slider gradients. Fix: replace exact numbers with comparative language tied to market edge → restored monotonic gradients.
- Cadence trading: Agents treated elapsed ticks as trading signals. Fix: ban fixed cadence signals and filter memory to avoid self-reinforcing rhythms.
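
The "restored monotonic gradients" check in the number-hardening item can be expressed as a small test over a slider grid. The following Python sketch assumes that a higher slider value is meant to produce a higher value of some observed metric (for example, trade rate). That direction, the function name `gradient_violations`, and the sample numbers are illustrative assumptions, not figures from the paper.

```python
def gradient_violations(grid):
    """Return adjacent (lower, higher) slider points where the metric decreases.

    grid: iterable of (slider_value, observed_metric) pairs.
    An empty result means the metric is non-decreasing in the slider.
    """
    ordered = sorted(grid)
    return [(lo, hi) for lo, hi in zip(ordered, ordered[1:]) if hi[1] < lo[1]]


# Synthetic examples: one healthy gradient, one inverted between 0.0 and 0.5.
healthy = [(0.0, 0.10), (0.5, 0.35), (1.0, 0.60)]
inverted = [(0.0, 0.40), (0.5, 0.15), (1.0, 0.60)]

healthy_result = gradient_violations(healthy)
inverted_result = gradient_violations(inverted)
```

Run over every prompt revision, a check like this would flag an inverted gradient before launch rather than after.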
Diagnostics & pre-launch methodology
- Extensive pre-launch testing: prior large simulation (36,651 agents; 40B+ logged tokens) plus replayed scenarios (3,000 snapshots) and cohort runs to exercise multi-turn harness behavior.
- Trace labeling/classification: 4,900 sampled reasoning traces labeled (Claude Sonnet 4.5) for trade/size/observation drivers to discover failure-mode incidence.
- Control loop: iterate prompt/harness variant → measure trace & metric deltas → attribute failure → apply narrow intervention → remeasure on same scenarios.
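
The "remeasure on same scenarios" step of the control loop can be sketched as a paired comparison of label incidence. This Python example is a hedged illustration: the label strings, scenario identifiers, and the function `compare_variants` are hypothetical, and the data is synthetic.

```python
def compare_variants(baseline, candidate, target_label):
    """Compare how often target_label appears under two prompt/harness variants.

    baseline, candidate: dicts mapping scenario_id -> trace label.
    Only scenarios present in both runs are counted, so the delta reflects
    the variant change rather than a change in the scenario mix.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no shared scenarios to compare")
    base_rate = sum(baseline[s] == target_label for s in shared) / len(shared)
    cand_rate = sum(candidate[s] == target_label for s in shared) / len(shared)
    return {"n": len(shared), "baseline": base_rate,
            "candidate": cand_rate, "delta": cand_rate - base_rate}


# Synthetic labels; "s4" and "s5" are dropped because they are not shared.
baseline = {"s1": "fabricated_rule", "s2": "fabricated_rule",
            "s3": "market_signal", "s4": "fee"}
candidate = {"s1": "market_signal", "s2": "fabricated_rule",
             "s3": "market_signal", "s5": "fee"}

result = compare_variants(baseline, candidate, "fabricated_rule")
```

Restricting the comparison to shared scenarios is the design choice that makes a before/after number attributable to the narrow intervention.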

Data & Methods

Empirical data sources: internal runtime logs, extracted per-invocation traces, and public onchain records (Dune dashboard).
Scale & runtime details
- 7.5M invocations; ~70B inference tokens recorded; longest-lived agents accumulated 6,000+ prompt-state-action cycles.
- All agents used the same model and serving stack; variation came from user onchain config (sliders, strategy text), portfolio state, and market history.
Evaluation metrics and diagnostics
- Operational metrics: buy:sell ratio (during early, cold-start invocations), trade rate, ETH deployment, fee-cited reasoning rate.
- Validation metrics: malformed parses, policy-rejected outputs, submission/settlement rates (99.9% settlement for valid submissions).
- Pre-launch A/B-like approach: replayed scenarios across slider grids, prompt revisions (24 pre-launch prompt versions), and consistent harness application for attribution.
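
Because settlement success is reported only for policy-valid submitted transactions, each validation metric has its own denominator. The following Python sketch makes that explicit. The outcome labels (`malformed`, `policy_rejected`, `settled`, `reverted`) and the function name are assumptions for illustration, and the counts are synthetic rather than taken from the deployment.

```python
from collections import Counter


def reliability_metrics(outcomes):
    """Compute validation metrics from one outcome label per invocation.

    Malformed and policy-rejected outputs count against all invocations.
    Settlement success counts only submitted transactions (settled + reverted).
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    submitted = counts["settled"] + counts["reverted"]

    def rate(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "malformed_rate": rate(counts["malformed"], total),
        "policy_rejected_rate": rate(counts["policy_rejected"], total),
        "submission_rate": rate(submitted, total),
        "settlement_success": rate(counts["settled"], submitted),
    }


# Synthetic sample of 100 invocations.
sample = (["malformed"] * 2 + ["policy_rejected"] * 3
          + ["settled"] * 94 + ["reverted"] * 1)
metrics = reliability_metrics(sample)
```

In this synthetic sample the settlement rate is 94 of 95 submitted transactions, not 94 of 100 invocations, which is the distinction the reported 99.9% figure depends on.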
Labeling & attribution
- Reasoning traces classified to identify drivers of decisions; matched to macro trade-level outcomes to link prompt/harness features to observed behaviors.
Safety & execution constraints
- Least-privilege operator; backend hard constraints (max trade size, slippageBps, balance checks, token allowlists) enforced outside the model.
- The harness intentionally excluded policy-invalid model outputs from settlement statistics.
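
A backend check of the kind described above might look like the following Python sketch. It is an assumption-laden illustration, not the deployed policy layer: the `Policy` and `TradeRequest` types, the violation strings, and the choice to express trade size and balances in ETH terms are all hypothetical. Only the constraint categories (token allowlist, max trade size, slippage in basis points, balance check) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_tokens: frozenset
    max_trade_eth: float
    max_slippage_bps: int


@dataclass(frozen=True)
class TradeRequest:
    side: str            # "buy" or "sell"
    token: str
    notional_eth: float  # trade size expressed in ETH (illustrative choice)
    slippage_bps: int


def validate(req, policy, eth_balance, position_value_eth):
    """Return a list of violations; an empty list means the request is policy-valid.

    Runs entirely outside the model, so a persuasive model output cannot
    relax any limit.
    """
    violations = []
    if req.side not in ("buy", "sell"):
        violations.append("unknown_side")
    if req.token not in policy.allowed_tokens:
        violations.append("token_not_allowed")
    if req.notional_eth <= 0:
        violations.append("non_positive_size")
    if req.notional_eth > policy.max_trade_eth:
        violations.append("exceeds_max_trade")
    if not 0 <= req.slippage_bps <= policy.max_slippage_bps:
        violations.append("slippage_out_of_bounds")
    # Buys spend ETH; sells are limited by the value of the held position.
    available = eth_balance if req.side == "buy" else position_value_eth
    if req.notional_eth > available:
        violations.append("insufficient_balance")
    return violations


policy = Policy(frozenset({"TOKEN_A"}), max_trade_eth=1.0, max_slippage_bps=100)
ok = validate(TradeRequest("buy", "TOKEN_A", 0.5, 50), policy, 2.0, 0.0)
bad = validate(TradeRequest("sell", "TOKEN_Z", 3.0, 500), policy, 2.0, 1.0)
```

Returning every violation rather than stopping at the first keeps the rejection reason available for trace logging.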

Implications for AI Economics

Evaluation unit should be the full operating layer, not the model alone
- For economic systems with real capital, models interact with institutional constraints, transaction costs, and persistence; measuring only model outputs or backtests misses critical failure modes (fabrication, anchoring, narrative ordering effects).
- Instruction-to-settlement traces are essential data products for measuring incentives, attribution, diagnosis, and remediation.
Prompt and harness design materially change market behavior and capital allocation
- Small prompt/harness edits produced large aggregate effects (e.g., capital deployment and sell behavior). Thus product/UX design choices (reading order, where fees are presented, numeric phrasing) have measurable economic impact.
- This implies that platform design is an economic policy lever — designers can shape liquidity, turnover, and trade direction via the operating layer.
Transaction costs and market microstructure must be represented structurally, not as foregrounded single-sentence facts
- Agents overweighted fee language presented early in the prompt, so transaction costs must be contextualized against market return distributions or explicit payoff scenarios to avoid perverse inactivity or overtrading.
- Fee regimes (e.g., 2.3% total here) meaningfully alter agent behavior; simulations that ignore real fees will mis-predict live outcomes.
Herding, flow, and systemic effects are emergent and observable
- The shared market and repeated multi-agent actions produced correlated herding and spontaneous two-sided flow. Operating-layer logging allows measurement of aggregate externalities (e.g., liquidity moves, cascade selling) that are otherwise invisible in isolated evaluations.
- For market-stability analysis and regulation, trace-level observability provides the data needed to detect and attribute manipulative or destabilizing dynamics.
Training, auditing, and regulatory implications
- Traces (mandate → prompt → reasoning → action → settlement) are valuable for downstream training, auditing, and compliance; they enable ex post explanation and ex ante policy validation.
- Regulators and platform operators should consider requiring: auditable traces, least-privilege execution, hard-backend constraints, and demonstrable validation layers to limit market harm.
- Auditability also supports liability and contractibility in markets where agents act on behalf of principals.
Research and policy recommendations
- Future AI-economics work should run field experiments with realistic costs, persistence, and shared-market feedback (with careful safety/ethical controls), not only backtests or toy environments.
- Design evaluations that measure both micro (per-invocation correctness) and macro (capital deployment, price impact, herding) metrics.
- Consider an explicit “operating-layer externalities” taxonomy (fee distortion, read-order anchoring, fabricated rules, cadence-induced churn, misread tokenomics) to structure both research and risk controls.

Summary takeaway: in real-capital agentic markets, operational reliability and economic outcomes are shaped more by the surrounding operating layer (controls, prompt compilation, validation, execution guards, and observability) than by the base LLM alone. For AI economics, that means focusing measurement, governance, and incentives on the whole system from mandate to settlement.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The study reports large-scale, real-world deployment data (3,505 agents, ~300K onchain actions, $20M volume) and trace-level observations from mandate to settlement, giving strong internal descriptive evidence about reliability and failure modes; however, it lacks randomized or quasi-experimental identification, has potential selection and timeframe biases, and is specific to one platform and market, limiting causal claims and external validity.

Methods Rigor: medium. The authors instrumented the system thoroughly (detailed traces, pre-launch testing, measurable metrics) and implemented targeted harness changes with before/after metrics, which indicates careful engineering evaluation; but the description lacks a randomized or controlled experimental design, formal statistical controls for confounders, and full methodological detail about sample selection and hypothesis testing, reducing inferential rigor.

Sample: 21-day field deployment (DX Terminal Pro) with 3,505 user-funded autonomous language-model agents trading real ETH in a bounded onchain market: ~7.5M agent invocations, ~300K onchain actions, ~$20M trading volume, 5,000+ ETH deployed, ~70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions; includes long-running agents with thousands of prompt-state-action cycles and pre-launch testing plus targeted harness interventions.

Themes: human_ai_collab, adoption

Generalizability
- Single proprietary platform (DX Terminal Pro): results may not generalize to other agent platforms or designs.
- Specific to an onchain ETH bounded market: crypto tokenomics and fee structures differ from traditional financial markets.
- User base is self-selected (user-funded agents) and may not represent broader users or institutional actors.
- Short deployment window (21 days): may not capture longer-term behavioral or market dynamics.
- Relies on a particular operating-layer design (prompt compilation, typed controls, policy validation, etc.) and possibly proprietary base models, limiting transferability.
- Bounded market conditions and settlement rules may have constrained agent strategies relative to open markets.
- Regulatory, legal, and macro market contexts could change outcomes elsewhere.

Claims (14)

Claim	Category	Direction	Confidence	Outcome	Details
DX Terminal Pro was deployed for 21 days with 3,505 user-funded agents trading real ETH in a bounded onchain market.	Adoption Rate	null_result	high	number of active agents	n=3505 3,505 agents 0.3
The system produced 7.5M agent invocations during the deployment.	Adoption Rate	null_result	high	agent invocations (usage)	n=7500000 7.5M agent invocations 0.3
The deployment produced roughly 300K onchain actions.	Adoption Rate	null_result	high	onchain actions (transactions executed)	n=300000 roughly 300K onchain actions 0.3
Agents executed about $20M in trading volume over the deployment.	Firm Revenue	null_result	high	trading volume (USD)	$20M in volume 0.3
More than 5,000 ETH was deployed by agents during the experiment.	Firm Revenue	null_result	high	ETH deployed	more than 5,000 ETH deployed 0.3
The system consumed roughly 70B inference tokens across the deployment.	Adoption Rate	null_result	high	inference token consumption	roughly 70B inference tokens 0.3
Policy-valid submitted transactions settled with 99.9% success.	Error Rate	positive	high	settlement success rate for policy-valid submissions	99.9% settlement success 0.3
Long-running agents accumulated thousands of sequential decisions; continuously active agents reached 6,000+ prompt-state-action cycles.	Automation Exposure	null_result	high	number of prompt-state-action cycles per agent	6,000+ prompt-state-action cycles 0.18
Reliability did not come from the base model alone; it emerged from the operating layer around the model (prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability).	AI Safety and Ethics	positive	medium	system reliability attributable to operating-layer components	0.11
Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics.	AI Safety and Ethics	negative	high	types/frequency of operational failure modes	0.18
Targeted harness changes reduced fabricated sell rules from 57% to 3% in an affected test population.	Error Rate	positive	high	incidence of fabricated sell rules	reduced fabricated sell rules from 57% to 3% 0.18
Targeted harness changes reduced fee-led observations from 32.5% to below 10% in an affected test population.	Error Rate	positive	high	incidence of fee-led observations	reduced fee-led observations from 32.5% to below 10% 0.18
Targeted harness changes increased capital deployment from 42.9% to 78.0% in an affected test population.	Automation Exposure	positive	high	capital deployment rate	increased capital deployment from 42.9% to 78.0% 0.18
Capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.	Governance and Regulation	positive	high	evaluation scope for capital-managing agents	0.03