Anonymized LLM trading agents produce robust backtest returns (Sharpe ~1.40) and negative controls indicate signals are not just memorized tickers; however, the alpha is sensitive to market regime and limited to short historical windows, raising questions about long-term live performance.
For LLM trading agents to be genuinely trustworthy, they must demonstrate understanding of market dynamics rather than exploitation of memorized ticker associations. Building responsible multi-agent systems demands rigorous signal validation: proving that predictions reflect legitimate patterns, not pre-trained recall. We address two sources of spurious performance: memorization bias from ticker-specific pre-training, and survivorship bias from flawed backtesting. Our approach is to blindfold the agents by anonymizing all identifiers and then verify whether meaningful signals persist. BlindTrade anonymizes tickers and company names, and four LLM agents output scores along with reasoning. We construct a GNN graph from reasoning embeddings and trade using a PPO-DSR policy. On 2025 YTD (through 2025-08-01), we achieved Sharpe 1.40 +/- 0.22 across 20 seeds and validated signal legitimacy through negative control experiments. To assess robustness beyond a single OOS window, we additionally evaluate an extended period (2024-2025), revealing market-regime dependency: the policy excels in volatile conditions but shows reduced alpha in trending bull markets.
Summary
Main Finding
BlindTrade, an anonymization-first pipeline combining specialized LLM agents, a semantic GNN encoder (SemGAT), and a cost-aware RL allocator (PPO-DSR), can produce cross-sectional predictive signals that survive anonymization and rigorous negative controls. On the 2025 YTD OOS window (2025-01-02 to 2025-08-01), the full system achieved an annualized Sharpe of 1.40 ± 0.22 (mean cumulative return 32.22% ± 5.21%), outperforming passive and active benchmarks. However, performance is regime-dependent, comes with higher volatility and maximum drawdown (MDD), and depends critically on the GNN and LLM-derived features.
Key Points
- Anonymization-first test: all tickers, company/product/person names in inputs (prices + headlines) were replaced by synthetic IDs to block direct memorization.
- Multi-agent LLM feature generation: four specialized agents (Momentum, News-Event, Mean-Reversion, Risk-Regime) produce deterministic JSON scores and explicit textual reasoning; reasoning is embedded and combined with numeric/categorical outputs into a 394-d per-stock feature vector.
- SemGAT (Semantic Graph Encoder): constructs sector edges + semantic edges (cosine similarity of reasoning embeddings, threshold/top-k) and uses a 2-layer GATv2 to produce 128-d node embeddings; predicts return distributions via HL-Gauss (101 bins) with pairwise ranking loss.
- PPO-DSR RL allocator: intent head (defensive/neutral/aggressive) formed from global LLM/GNN statistics; node-score head + Dirichlet mean for allocation; no cash allowed, top-20 mask for tractability, execution inertia (η) to smooth turnover. Reward = Differential Sharpe minus transaction costs (10 bps/turnover).
- Validation & leakage controls:
  - IC (Spearman rank) analysis at h=21 days was used to screen features; the Risk-Regime and News-Event agents showed statistically significant positive LLM IC. The average RAW IC was negative, but the LLM features improved on it (∆IC).
  - Negative control: cross-sectional random shuffling of GNN scores reduced |RankIC| from ~0.015 to ~0.0004 and collapsed performance, indicating that the cross-sectional signal is non-trivial rather than spurious.
- Ablations:
  - Remove LLM features → Sharpe drops from 1.40 to 1.14 (∆ = -0.26).
  - Remove GNN message passing → Sharpe drops to ~0.62 (∆ = -0.78) and variance increases; graph structure is critical.
  - RL vs naive Top-K: naive top-20 equal weighting produces turnover of ~139%/day, and its Sharpe collapses after costs; RL suppresses turnover to ~1.7%/day and sustains Sharpe.
- Robustness & limits:
  - Dataset: point-in-time S&P 500 constituents (2020-01-02 to 2025-08-01; 1,403 trading days), 60-day lookback for LLMs, IC horizon h=21 days. Results are averaged across 20 seeds; hyperparameters were tuned on a validation split via Optuna.
  - The system is always fully invested (no cash), so volatility (42.34% ann.) and MDD (~-31.66%) exceed benchmarks. Performance declines in trending bull markets and excels in volatile regimes.
  - Anonymization reduces but does not guarantee elimination of all leakage (the authors note remaining paths, such as temporal patterns in synthetic IDs).
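The shuffle-based negative control listed under the validation bullets above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the function names (`rank_ic`, `mean_rank_ic`, `shuffle_within_day`), the array shapes, and the signal strength of 0.2 are assumptions made for the example, and the printed values are unrelated to the RankIC figures reported in the paper.

```python
import numpy as np


def ordinal_ranks(x):
    """Return 0..n-1 ranks of x (ties broken by position; fine for continuous scores)."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks


def rank_ic(scores, fwd_returns):
    """Spearman rank correlation between one day's scores and forward returns."""
    return float(np.corrcoef(ordinal_ranks(scores), ordinal_ranks(fwd_returns))[0, 1])


def mean_rank_ic(scores, fwd_returns):
    """Average the daily cross-sectional rank IC over all days (rows)."""
    return float(np.mean([rank_ic(s, r) for s, r in zip(scores, fwd_returns)]))


def shuffle_within_day(scores, rng):
    """Negative control: permute scores across stocks separately for each day."""
    return np.stack([rng.permutation(row) for row in scores])


rng = np.random.default_rng(0)
n_days, n_stocks = 500, 100  # hypothetical panel size
scores = rng.standard_normal((n_days, n_stocks))
# Hypothetical data: forward returns carry a weak linear dependence on the score.
fwd_returns = 0.2 * scores + rng.standard_normal((n_days, n_stocks))

ic_real = mean_rank_ic(scores, fwd_returns)
ic_shuffled = mean_rank_ic(shuffle_within_day(scores, rng), fwd_returns)
print(f"real IC: {ic_real:.4f}, shuffled IC: {ic_shuffled:.4f}")
```

Shuffling within each day keeps the score distribution intact and breaks only the stock-to-score assignment, so any IC that survives the shuffle would point to an artifact rather than a cross-sectional signal.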
Data & Methods
- Data:
  - Universe: point-in-time S&P 500 constituents, i.e., only the stocks that were index members on each date, which avoids survivorship bias.
  - Time span: 2020-01-02 to 2025-08-01; Train/Val/OOS splits (Train: 2020 to 2024-09-30; Val: 2024-10-01 to 2024-12-31; OOS: 2025-01-02 to 2025-08-01).
  - Inputs: anonymized prices, technical indicators, and up to 5 anonymized headlines per stock (t-60 to t-1).
- Anonymization:
  - Tickers and proper nouns are replaced with synthetic IDs; Google Knowledge Graph is used to identify entities in news; LLM prompts enforce a 60-day temporal cutoff.
- LLM agents:
  - Four roles with structured outputs (scores + reasoning); the reasoning is concatenated and embedded (384-d), then combined with 7 numeric + 3 categorical features → 394-d vector per stock.
  - Deterministic JSON outputs are enforced via system prompts.
- SemGAT GNN:
  - Node features = 394-d vectors; 2-layer GATv2 → 128-d node embeddings.
  - Edges = sector full-connect + semantic rewiring (cosine similarity threshold/top-10 neighbors).
  - Loss = HL-Gauss distributional loss + pairwise ranking + market risk terms + J-S regularization.
- RL (PPO-DSR):
  - Global intent head using aggregated LLM/GNN statistics; the intent conditions the temperature of the score-to-allocation mapping.
  - Action space: Dirichlet-based weights over top-20 masked stocks (no cash). Execution inertia η controls rebalancing smoothing.
  - Reward: Differential Sharpe minus a transaction cost penalty (10 bps per unit turnover). Hyperparameters are optimized on validation.
- Evaluation & robustness:
  - Main metrics: annualized Sharpe, cumulative return, annualized volatility, MDD. Results are reported as mean ± std across 20 seeds.
  - Negative control (shuffling), ablations (remove LLM, remove GNN), and training-objective variants (SemGAT vs SemGAT-C/D) were performed.
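The reward described for the RL allocator (Differential Sharpe minus a 10 bps turnover cost) can be sketched as below. The summary does not give the exact formula, so this sketch assumes the standard Moody and Saffell differential Sharpe ratio with exponential moving averages of returns and squared returns. The adaptation rate `rate` is a hypothetical parameter and is not the paper's execution inertia η. Measuring turnover as the sum of absolute weight changes, and subtracting the cost without extra scaling, are also assumptions.

```python
import numpy as np


class DifferentialSharpe:
    """Incremental differential Sharpe ratio (assumed Moody-Saffell form)."""

    def __init__(self, rate=0.01, eps=1e-12):
        self.rate = rate  # EMA adaptation rate (hypothetical value)
        self.eps = eps    # guard against a zero-variance denominator
        self.a = 0.0      # running estimate of the mean return
        self.b = 0.0      # running estimate of the mean squared return

    def step(self, r):
        """Return the differential Sharpe for return r, then update the averages."""
        da = r - self.a
        db = r * r - self.b
        var = self.b - self.a ** 2
        if var <= self.eps:
            dsr = 0.0  # undefined before any variance has been observed
        else:
            dsr = (self.b * da - 0.5 * self.a * db) / var ** 1.5
        self.a += self.rate * da
        self.b += self.rate * db
        return dsr


def turnover(w_new, w_old):
    """Sum of absolute weight changes between two portfolios (assumed definition)."""
    return float(np.abs(np.asarray(w_new) - np.asarray(w_old)).sum())


def reward(dsr, w_new, w_old, cost_rate=0.001):
    """Differential Sharpe minus transaction cost; 0.001 = 10 bps per unit turnover."""
    return dsr - cost_rate * turnover(w_new, w_old)


tracker = DifferentialSharpe()
w_old = np.array([0.5, 0.5])
for r, w_new in [(0.01, [0.6, 0.4]), (-0.02, [0.6, 0.4]), (0.015, [0.3, 0.7])]:
    print(reward(tracker.step(r), w_new, w_old))
    w_old = np.asarray(w_new)
```

Because the cost term grows with every unit of weight changed, a policy trained on this reward is penalized for churn at each step, which is consistent with the low-turnover behavior reported for the RL allocator relative to naive top-20 rebalancing.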
Implications for AI Economics
- Methodological standard: anonymization as a minimal, practical safeguard to distinguish memorization from generalization. For economic/finance AI claims, replacing identifiers and running negative-control shuffles should become standard validation steps.
- Feature value of LLM reasoning: explicit LLM reasoning embeddings can add marginal but meaningful predictive power beyond raw technical features; this supports integrating LLM interpretability outputs as quantitative signals in economic models.
- Importance of relational structure: learning inter-asset relationships via semantic graph encoders materially improves stability and alpha — suggesting networked models (GNNs) are crucial when modeling cross-sectional asset interactions even under anonymization.
- Cost-aware decision-making: RL that internalizes transaction costs and operational constraints (top-K masking, inertia) is essential to realize the value of signals; naive signal-to-portfolio mappings can destroy gains after trading frictions — a reminder that economic evaluation must include realistic execution costs.
- Regime sensitivity and risk governance: high Sharpe in volatile regimes but higher realized volatility and drawdowns imply trade-offs between alpha and tail risk; economic agents and regulators should scrutinize not just point estimates of performance but regime-dependent risk exposures.
- Reproducibility and deployment readiness: the paper demonstrates a structured validation pipeline (IC screening, negative controls, seed stability, point-in-time constituents) that raises the bar for claims about LLM-based trading systems; adopting such pipelines would improve credibility and comparability across AI-for-finance research.
- Limitations to generalization: anonymization does not fully preclude all avenues of leakage; evaluation across multiple OOS regimes and alternative universes (e.g., other markets) remains important before treating LLM-derived signals as broadly reliable.
- Policy and research direction: regulators and practitioners should require (i) identifier-agnostic stress tests, (ii) negative-control shuffles, (iii) explicit reporting of execution assumptions, and (iv) regime-conditional performance analyses when assessing AI trading products.
Short actionable takeaway: anonymize and validate — require LLMs to output reasoning, embed those explanations into relational models, and use cost-aware RL to translate signals to portfolios; but expect regime dependence and enforce rigorous leakage controls before deployment.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| BlindTrade anonymizes tickers and company names (blindfolding agents by anonymizing all identifiers). [Other] | null_result | high | presence/absence of identifier anonymization (anonymization applied to input data) | 0.48 |
| Four LLM agents output scores along with reasoning. [Other] | null_result | high | agent outputs: numeric scores and textual reasoning | n=4; 0.48 |
| A GNN graph is constructed from reasoning embeddings and trading decisions are made using a PPO-DSR policy. [Other] | null_result | high | use of GNN on reasoning embeddings and use of PPO-DSR policy to produce trading actions | 0.48 |
| On 2025 year-to-date (through 2025-08-01), the system achieved Sharpe 1.40 +/- 0.22 across 20 random seeds. [Firm Revenue] | positive | medium | Sharpe ratio (mean and +/- presumably standard error or standard deviation) over specified 2025 YTD window across 20 seeds | n=20; 1.40 ± 0.22; 0.29 |
| Signal legitimacy was validated through negative control experiments. [Other] | positive | low | legitimacy of predictive signals (i.e., whether performance persists under negative controls / blinded conditions) | 0.14 |
| An extended evaluation over 2024–2025 reveals market-regime dependency: the learned policy performs well in volatile conditions but shows reduced alpha in trending bull markets. [Firm Revenue] | mixed | medium | strategy alpha/performance (e.g., returns or Sharpe) conditional on market regime (volatile vs trending bull) | 0.29 |
| Two sources of spurious performance addressed are memorization bias from ticker-specific pre-training and survivorship bias from flawed backtesting. [Other] | negative | medium | reduction/mitigation of spurious performance attributable to memorization and survivorship biases | 0.29 |
| Blindfolding (anonymizing identifiers) allows verification of whether meaningful predictive signals persist (i.e., predictions reflect legitimate patterns rather than pre-trained recall of tickers). [Other] | positive | medium | persistence of predictive signal after anonymization (signal legitimacy) | 0.29 |