Autonomous LLM trading agents handled about $20M in ETH trades with 99.9% settlement success for policy-valid submissions, but that reliability was engineered around the model through prompt compilation, typed controls, validation, and execution guards, not delivered by the base model alone. Targeted harness changes cut fabricated sell rules from 57% to 3% of sell decisions and raised capital deployment in the affected test cohort from 42.9% to 78.0%.
We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
Summary
Main Finding
Reliability of capital-managing LLM agents is primarily a property of the operating layer, not of the model alone. In a 21-day, real-ETH deployment (DX Terminal Pro), prompt compilation, typed controls, policy validation, execution guards, memory semantics, and trace-level observability produced high practical reliability and enabled rapid, targeted fixes for systematic failure modes that text-only or simulation benchmarks do not reveal.
Key Points
- Deployment summary
- 3,505 funded vaults (one agent per wallet) trading 12 tokens on Uniswap V4 (Base).
- 21-day run: ~7.5M agent invocations, ~70B inference tokens, ~$20M volume, >5,000 ETH deployed, ~300K onchain actions.
- Base model: Qwen/Qwen3-235B-A22B-Thinking-2507 (temp 0.6), served via SGLang.
- Universal harness (fixed kernel, hardware, model version, sampling, prompt template, policy layer) across all agents; user variability via onchain sliders and natural-language strategy text.
- Settlement success: 99.9% for policy-valid submitted transactions (malformed or policy-rejected outputs tracked separately).
- Operating-layer components that produced reliability
- Onchain configuration (authoritative user mandate).
- Prompt compiler that merges sliders, strategies, market and portfolio state, memory, and clock into a per-invocation prompt.
- Response parser + policy validation (token allowlists, balance/slippage checks, max-trade limits).
- Execution worker with least-privilege operator role.
- Full instruction-to-settlement trace logging (rendered prompt, model response, extracted reasoning, tool call, validation result, portfolio snapshot, chain outcome).
- Main failure modes discovered (and fixes)
- Rule fabrication: Agents invented law-like rules (e.g., “Rule A”). Fix: remove law-like wording and forbid invented thresholds → incidence fell from 57% to 3% of sell decisions.
- Fee paralysis: The placement of the fee sentence led agents to overweight fees and avoid acting. Fix: move fee language into a contextualized block → fee-led observations dropped from 32.5% to <10%.
- Tokenomics misread: Agents sold into a price crash despite off-chain payout mechanics (the reap mechanic). Fix: insert structured tokenomics context (whitepaper mechanics) so the agent reasons over the payoff → capital deployment in the affected population rose from 42.9% to 78.0%.
- Number hardening: Explicit numeric “floors” were read as hard targets, inverting intended slider gradients. Fix: replace exact numbers with comparative language tied to market edge → restored monotonic gradients.
- Cadence trading: Agents treated elapsed ticks as trading signals. Fix: ban fixed cadence signals and filter memory to avoid self-reinforcing rhythms.
- Diagnostics & pre-launch methodology
- Extensive pre-launch testing: prior large simulation (36,651 agents; 40B+ logged tokens) plus replayed scenarios (3,000 snapshots) and cohort runs to exercise multi-turn harness behavior.
- Trace labeling/classification: 4,900 sampled reasoning traces labeled (Claude Sonnet 4.5) for trade/size/observation drivers to discover failure-mode incidence.
- Control loop: iterate prompt/harness variant → measure trace & metric deltas → attribute failure → apply narrow intervention → remeasure on same scenarios.
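The response parser and policy validation layer listed among the operating-layer components (token allowlists, balance and slippage checks, max-trade limits) can be illustrated with a minimal sketch. This is not the DX Terminal Pro implementation: the names `PolicyLimits`, `ProposedTrade`, and `validate_trade`, the violation codes, and the choice to express every amount in ETH are assumptions made for illustration. The point it shows is that the checks run outside the model and return explicit reasons, so a rejected output can be logged in the trace rather than silently dropped.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyLimits:
    """Hard limits enforced outside the model (hypothetical field names)."""
    allowed_tokens: frozenset
    max_trade_eth: float
    max_slippage_bps: int


@dataclass(frozen=True)
class ProposedTrade:
    """A tool call already parsed from the model response (hypothetical shape)."""
    side: str            # "buy" or "sell"
    token: str
    amount_eth: float    # trade size, expressed in ETH for simplicity
    slippage_bps: int


def validate_trade(
    trade: ProposedTrade,
    limits: PolicyLimits,
    eth_balance: float,
    token_balance_eth: float,
) -> list[str]:
    """Return a list of violation codes; an empty list means policy-valid."""
    violations = []
    if trade.side not in ("buy", "sell"):
        violations.append("unknown_side")
    if trade.token not in limits.allowed_tokens:
        violations.append("token_not_allowlisted")
    if trade.amount_eth <= 0:
        violations.append("non_positive_amount")
    if trade.amount_eth > limits.max_trade_eth:
        violations.append("exceeds_max_trade")
    if trade.slippage_bps > limits.max_slippage_bps:
        violations.append("slippage_too_high")
    # Buys spend ETH; sells spend the token position (valued in ETH here).
    available = eth_balance if trade.side == "buy" else token_balance_eth
    if trade.amount_eth > available:
        violations.append("insufficient_balance")
    return violations
```

Returning every violation, rather than stopping at the first, is a design choice in this sketch: it keeps the validation result informative enough to attribute failures when traces are reviewed later.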
Data & Methods
- Empirical data sources: internal runtime logs, extracted per-invocation traces, and public onchain records (Dune dashboard).
- Scale & runtime details
- 7.5M invocations; ~70B inference tokens recorded; longest-lived agents accumulated 6,000+ prompt-state-action cycles.
- All agents used the same model and serving stack; variation came from user onchain config (sliders, strategy text), portfolio state, and market history.
- Evaluation metrics and diagnostics
- Operational metrics: buy:sell ratio (including early, cold-start invocations), trade rate, ETH deployment, fee-cited reasoning rate.
- Validation metrics: malformed parses, policy-rejected outputs, submission/settlement rates (99.9% settlement for valid submissions).
- Pre-launch A/B-like approach: replayed scenarios across slider grids, prompt revisions (24 pre-launch prompt versions), and consistent harness application for attribution.
- Labeling & attribution
- Reasoning traces classified to identify drivers of decisions; matched to macro trade-level outcomes to link prompt/harness features to observed behaviors.
- Safety & execution constraints
- Least-privilege operator; backend hard constraints (max trade size, slippageBps, balance checks, token allowlists) enforced outside the model.
- The harness intentionally excluded policy-invalid model outputs from settlement statistics.
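Because the 99.9% figure is conditional on policy-valid submissions, the choice of denominator matters when reading it. The sketch below shows one way such a rate could be computed from per-invocation traces while counting malformed and policy-rejected outputs separately. The `Trace` record, the outcome labels, and `summarize_outcomes` are hypothetical names introduced for illustration; the source does not describe its log schema.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical outcome labels for one invocation (not the paper's schema).
MALFORMED = "malformed"              # response could not be parsed
POLICY_REJECTED = "policy_rejected"  # parsed, but failed backend constraints
NO_ACTION = "no_action"              # valid decision not to trade
SETTLED = "settled"                  # submitted and settled onchain
FAILED_ONCHAIN = "failed_onchain"    # submitted but did not settle


@dataclass(frozen=True)
class Trace:
    agent_id: str
    outcome: str


def summarize_outcomes(traces):
    """Count outcomes and compute settlement rate over submitted trades only."""
    counts = Counter(t.outcome for t in traces)
    # Only policy-valid trades are ever submitted, so they form the denominator.
    submitted = counts[SETTLED] + counts[FAILED_ONCHAIN]
    return {
        "invocations": len(traces),
        "malformed": counts[MALFORMED],
        "policy_rejected": counts[POLICY_REJECTED],
        "submitted": submitted,
        # None when nothing was submitted, to avoid dividing by zero.
        "settlement_rate": counts[SETTLED] / submitted if submitted else None,
    }
```

Reporting the malformed and policy-rejected counts next to the settlement rate keeps the excluded outputs visible, which is the separation the validation metrics above describe.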
Implications for AI Economics
- Evaluation unit should be the full operating layer, not the model alone
- For economic systems with real capital, models interact with institutional constraints, transaction costs, and persistence; measuring only model outputs or backtests misses critical failure modes (fabrication, anchoring, narrative ordering effects).
- Instruction-to-settlement traces are essential data products for measuring incentives, attribution, diagnosis, and remediation.
- Prompt and harness design materially change market behavior and capital allocation
- Small prompt/harness edits produced large aggregate effects (e.g., capital deployment and sell behavior). Thus product/UX design choices (reading order, where fees are presented, numeric phrasing) have measurable economic impact.
- This implies that platform design is an economic policy lever — designers can shape liquidity, turnover, and trade direction via the operating layer.
- Transaction costs and market microstructure must be represented structurally, not as foregrounded single-sentence facts
- Agents overweighted fee language presented early in the prompt, so transaction costs must be contextualized against market return distributions or explicit payoff scenarios to avoid perverse inactivity or overtrading.
- Fee regimes (e.g., 2.3% total here) meaningfully alter agent behavior; simulations that ignore real fees will mis-predict live outcomes.
- Herding, flow, and systemic effects are emergent and observable
- The shared market and repeated multi-agent actions produced correlated herding and spontaneous two-sided flow. Operating-layer logging allows measurement of aggregate externalities (e.g., liquidity moves, cascade selling) that are otherwise invisible in isolated evaluations.
- For market-stability analysis and regulation, trace-level observability provides the data needed to detect and attribute manipulative or destabilizing dynamics.
- Training, auditing, and regulatory implications
- Traces (mandate → prompt → reasoning → action → settlement) are valuable for downstream training, auditing, and compliance; they enable ex post explanation and ex ante policy validation.
- Regulators and platform operators should consider requiring: auditable traces, least-privilege execution, hard-backend constraints, and demonstrable validation layers to limit market harm.
- Auditability also supports liability and contractibility in markets where agents act on behalf of principals.
- Research and policy recommendations
- Future AI-economics work should run field experiments with realistic costs, persistence, and shared-market feedback (with careful safety/ethical controls), not only backtests or toy environments.
- Design evaluations that measure both micro (per-invocation correctness) and macro (capital deployment, price impact, herding) metrics.
- Consider an explicit “operating-layer externalities” taxonomy (fee distortion, read-order anchoring, fabricated rules, cadence-induced churn, misread tokenomics) to structure both research and risk controls.
Summary takeaway: in real-capital agentic markets, operational reliability and economic outcomes are shaped more by the surrounding operating layer (controls, prompt compilation, validation, execution guards, and observability) than by the base LLM alone. For AI economics, that means focusing measurement, governance, and incentives on the whole system from mandate to settlement.
Assessment
Claims (14)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| DX Terminal Pro was deployed for 21 days with 3,505 user-funded agents trading real ETH in a bounded onchain market. | Adoption Rate | null_result | high | number of active agents | n=3505; 3,505 agents; 0.3 |
| The system produced 7.5M agent invocations during the deployment. | Adoption Rate | null_result | high | agent invocations (usage) | n=7500000; 7.5M agent invocations; 0.3 |
| The deployment produced roughly 300K onchain actions. | Adoption Rate | null_result | high | onchain actions (transactions executed) | n=300000; roughly 300K onchain actions; 0.3 |
| Agents executed about $20M in trading volume over the deployment. | Firm Revenue | null_result | high | trading volume (USD) | $20M in volume; 0.3 |
| More than 5,000 ETH was deployed by agents during the experiment. | Firm Revenue | null_result | high | ETH deployed | more than 5,000 ETH deployed; 0.3 |
| The system consumed roughly 70B inference tokens across the deployment. | Adoption Rate | null_result | high | inference token consumption | roughly 70B inference tokens; 0.3 |
| Policy-valid submitted transactions settled with 99.9% success. | Error Rate | positive | high | settlement success rate for policy-valid submissions | 99.9% settlement success; 0.3 |
| Long-running agents accumulated thousands of sequential decisions; continuously active agents reached 6,000+ prompt-state-action cycles. | Automation Exposure | null_result | high | number of prompt-state-action cycles per agent | 6,000+ prompt-state-action cycles; 0.18 |
| Reliability did not come from the base model alone; it emerged from the operating layer around the model (prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability). | AI Safety and Ethics | positive | medium | system reliability attributable to operating-layer components | 0.11 |
| Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. | AI Safety and Ethics | negative | high | types/frequency of operational failure modes | 0.18 |
| Targeted harness changes reduced fabricated sell rules from 57% to 3% in an affected test population. | Error Rate | positive | high | incidence of fabricated sell rules | reduced fabricated sell rules from 57% to 3%; 0.18 |
| Targeted harness changes reduced fee-led observations from 32.5% to below 10% in an affected test population. | Error Rate | positive | high | incidence of fee-led observations | reduced fee-led observations from 32.5% to below 10%; 0.18 |
| Targeted harness changes increased capital deployment from 42.9% to 78.0% in an affected test population. | Automation Exposure | positive | high | capital deployment rate | increased capital deployment from 42.9% to 78.0%; 0.18 |
| Capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement. | Governance and Regulation | positive | high | evaluation scope for capital-managing agents | 0.03 |