Anonymized LLM trading agents produce robust backtest returns (Sharpe ~1.40) and negative controls indicate signals are not just memorized tickers; however, the alpha is sensitive to market regime and limited to short historical windows, raising questions about long-term live performance.

Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization

Joohyoung Jeon, Hongchul Lee · March 18, 2026

Source: arXiv · Paper type: quasi_experimental · Evidence: medium · Relevance: 7/10

Using anonymized inputs and negative controls, LLM-based multi-agent trading with a GNN + PPO-DSR policy achieved a Sharpe of ~1.40 in 2025 YTD across 20 seeds, with robustness checks suggesting signals are not solely due to ticker memorization but performance is market-regime dependent.

For LLM trading agents to be genuinely trustworthy, they must demonstrate understanding of market dynamics rather than exploitation of memorized ticker associations. Building responsible multi-agent systems demands rigorous signal validation: proving that predictions reflect legitimate patterns, not pre-trained recall. We address two sources of spurious performance: memorization bias from ticker-specific pre-training, and survivorship bias from flawed backtesting. Our approach is to blindfold the agents--anonymizing all identifiers--and verify whether meaningful signals persist. BlindTrade anonymizes tickers and company names, and four LLM agents output scores along with reasoning. We construct a GNN graph from reasoning embeddings and trade using PPO-DSR policy. On 2025 YTD (through 2025-08-01), we achieved Sharpe 1.40 +/- 0.22 across 20 seeds and validated signal legitimacy through negative control experiments. To assess robustness beyond a single OOS window, we additionally evaluate an extended period (2024--2025), revealing market-regime dependency: the policy excels in volatile conditions but shows reduced alpha in trending bull markets.

Summary

Main Finding

BlindTrade — an anonymization-first pipeline combining specialized LLM agents, a semantic GNN encoder (SemGAT), and a cost-aware RL allocator (PPO-DSR) — can produce cross-sectional predictive signals that survive anonymization and rigorous negative controls. On 2025 YTD OOS (2025-01-02 to 2025-08-01) the full system achieved an annualized Sharpe of 1.40 ± 0.22 (mean cumulative return 32.22% ± 5.21%), outperforming passive and active benchmarks. However, performance is regime-dependent, comes with higher volatility/MDD, and depends critically on the GNN and LLM-derived features.

Key Points

Anonymization-first test: all tickers, company/product/person names in inputs (prices + headlines) were replaced by synthetic IDs to block direct memorization.
Multi-agent LLM feature generation: four specialized agents (Momentum, News-Event, Mean-Reversion, Risk-Regime) produce deterministic JSON scores and explicit textual reasoning; reasoning is embedded and combined with numeric/categorical outputs into a 394-d per-stock feature vector.
SemGAT (Semantic Graph Encoder): constructs sector edges + semantic edges (cosine similarity of reasoning embeddings, threshold/top-k) and uses a 2-layer GATv2 to produce 128-d node embeddings; predicts return distributions via HL-Gauss (101 bins) with pairwise ranking loss.
PPO-DSR RL allocator: intent head (defensive/neutral/aggressive) formed from global LLM/GNN statistics; node-score head + Dirichlet mean for allocation; no cash allowed, top-20 mask for tractability, execution inertia (η) to smooth turnover. Reward = Differential Sharpe minus transaction costs (10 bps/turnover).
Validation & leakage controls:
- IC (Spearman rank) analysis at h=21 days used to screen features; Risk-Regime and News-Event agents showed statistically significant positive LLM IC; average RAW IC was negative but LLM features improved ∆IC.
- Negative control: cross-sectional random shuffling of GNN scores reduced |RankIC| from ~0.015 to ~0.0004 and collapsed performance, indicating non-trivial cross-sectional signal (not trivially spurious).
Ablations:
- Remove LLM features → Sharpe drops from 1.40 to 1.14 (∆ = -0.26).
- Remove GNN message passing → Sharpe drops to ~0.62 (∆ = -0.78) and variance increases; graph structure is critical.
- RL vs naive Top-K: naive top-20 equal weighting gives turnover ~139%/day and Sharpe collapses post-costs; RL suppresses turnover to ~1.7%/day and sustains Sharpe.
Robustness & limits:
- Dataset: point-in-time S&P 500 constituents (2020-01-02 to 2025-08-01; 1,403 trading days), 60-day lookback for LLMs, IC horizon h=21 days. Results averaged across 20 seeds; hyperparams tuned on a validation split via Optuna.
- System always fully invested (no cash), so volatility (42.34% ann.) and MDD (~-31.66%) exceed benchmarks. Performance declines in trending bull markets; excels in volatile regimes.
- Anonymization reduces but does not guarantee elimination of all leakage (authors note remaining paths like temporal patterns in synthetic IDs).
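
The turnover figures in the ablations translate into cost drag through simple arithmetic. The sketch below assumes 252 trading days per year and ignores compounding (both are assumptions, not statements from the paper); the 10 bps cost and the two turnover levels come from the points above.

```python
TRADING_DAYS = 252           # assumed number of trading days per year
COST_PER_TURNOVER = 0.0010   # 10 bps per unit of turnover (from the summary)


def annual_cost_drag(daily_turnover: float) -> float:
    """Approximate yearly return lost to trading costs, ignoring compounding."""
    return daily_turnover * COST_PER_TURNOVER * TRADING_DAYS


naive_drag = annual_cost_drag(1.39)   # naive top-20: ~139% turnover per day
rl_drag = annual_cost_drag(0.017)     # RL allocator: ~1.7% turnover per day

print(f"naive top-20: {naive_drag:.1%} per year")   # about 35.0%
print(f"RL allocator: {rl_drag:.2%} per year")      # about 0.43%
```

Under these assumptions the naive mapping gives up roughly 35 percentage points of return per year to costs, which is consistent with its Sharpe collapsing after costs, while the RL allocator gives up well under one point.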

Data & Methods

Data:
- Universe: point-in-time S&P 500 constituents, i.e. on each date only the stocks that were index members on that date, which avoids survivorship bias.
- Time span: 2020-01-02 to 2025-08-01, split into Train (2020 through 2024-09-30), Val (2024-10-01 to 2024-12-31), and OOS (2025-01-02 to 2025-08-01).
- Inputs: anonymized prices, technical indicators, and up to 5 anonymized headlines per stock (t-60 to t-1).
Anonymization:
- Tickers and proper nouns replaced with synthetic IDs; Google Knowledge Graph used to identify entities in news; LLM prompts enforce 60-day temporal cutoff.
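
A minimal sketch of the identifier-replacement step, assuming the entity list is already known. In the paper, entities are identified with Google Knowledge Graph; here a hand-written set stands in for that step, and the `ENT_` ID scheme, the function names, and the example entities are all hypothetical.

```python
import re


def build_id_map(entities, prefix="ENT"):
    """Assign each known entity a stable synthetic ID such as ENT_0001."""
    return {name: f"{prefix}_{i:04d}"
            for i, name in enumerate(sorted(entities), start=1)}


def anonymize(text, id_map):
    """Replace every whole-word mention of a known entity with its synthetic ID."""
    # Try longer names first so a full company name wins over a shorter alias.
    names = sorted(id_map, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: id_map[m.group(1)], text)


# Hypothetical entities and headline, for illustration only.
id_map = build_id_map({"ACME", "Acme Corp", "Jane Doe"})
headline = "Acme Corp (ACME) rallies after Jane Doe unveils new product"
anonymized = anonymize(headline, id_map)
print(anonymized)
# ENT_0002 (ENT_0001) rallies after ENT_0003 unveils new product
```

Note that this sketch assigns IDs in alphabetical order, which is itself a potential leakage path; a production version would likely randomize the mapping.
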
LLM agents:
- Four roles with structured outputs (scores + reasoning); reasoning concatenated and embedded (384-d) then combined with 7 numeric + 3 categorical features → 394-d vector per stock.
- Deterministic JSON outputs enforced via system prompts.
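
The 394-d figure is just the sum of the three parts (384 + 7 + 3). A small sketch of the assembly follows, assuming each categorical feature is stored as a single integer code; the summary does not state the encoding, and the function name is hypothetical.

```python
EMBED_DIM, N_NUMERIC, N_CATEGORICAL = 384, 7, 3   # dimensions from the summary


def build_feature_vector(embedding, numeric, categorical_codes):
    """Concatenate reasoning embedding, numeric scores, and categorical codes."""
    if len(embedding) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM}-d embedding, got {len(embedding)}")
    if len(numeric) != N_NUMERIC:
        raise ValueError(f"expected {N_NUMERIC} numeric features")
    if len(categorical_codes) != N_CATEGORICAL:
        raise ValueError(f"expected {N_CATEGORICAL} categorical features")
    # Assumption: one float per categorical feature (integer code), not one-hot.
    return ([float(x) for x in embedding]
            + [float(x) for x in numeric]
            + [float(c) for c in categorical_codes])


vector = build_feature_vector([0.0] * 384, [0.1] * 7, [0, 2, 1])
print(len(vector))   # 394
```
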
SemGAT GNN:
- Node features = 394-d vectors; 2-layer GATv2 → 128-d node embeddings.
- Edges = sector full-connect + semantic rewiring (cosine similarity threshold/top-10 neighbors).
- Loss = HL-Gauss distributional loss + pairwise ranking + market risk terms + J-S regularization.
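
The semantic edge rule (cosine similarity with a threshold and a top-k cap) can be sketched as follows. This is an illustration in plain Python, not the authors' code: the threshold value of 0.5 is a placeholder, the edges are treated as directed, and the sector edges are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_edges(embeddings, top_k=10, threshold=0.5):
    """Directed edges (i, j) where j is among i's top_k most similar nodes
    and the similarity is at least `threshold` (placeholder value)."""
    edges = []
    for i, u in enumerate(embeddings):
        sims = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        sims.sort(reverse=True)   # most similar neighbours first
        edges.extend((i, j) for s, j in sims[:top_k] if s >= threshold)
    return edges


# Toy 2-d "reasoning embeddings": nodes 0 and 1 are similar, node 2 is not.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = semantic_edges(toy, top_k=1, threshold=0.5)
print(edges)   # [(0, 1), (1, 0)]
```

Node 2 receives no outgoing semantic edge because its best match falls below the threshold, which shows how the threshold and the top-k cap act together.
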
RL (PPO-DSR):
- Global intent head using aggregated LLM/GNN statistics; intent conditions temperature for score-to-allocation mapping.
- Action space: Dirichlet-based weights over top-20 masked stocks (no cash). Execution inertia η controls rebalancing smoothing.
- Reward: Differential Sharpe minus transaction cost penalty (10 bps per unit turnover). Hyperparameters optimized on validation.
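
To make the reward concrete, here is a sketch of a differential Sharpe ratio with a turnover penalty. It uses the standard recursive form from Moody and Saffell; whether the paper uses exactly this form, the `decay` rate, and the definition of turnover as the sum of absolute weight changes are assumptions. The name `decay` is used to avoid confusion with the execution inertia η above.

```python
class DifferentialSharpe:
    """Online differential Sharpe ratio (standard Moody-Saffell recursion)."""

    def __init__(self, decay=0.01, eps=1e-12):
        self.decay = decay   # adaptation rate of the moving estimates (assumed)
        self.eps = eps       # guards against division by ~zero variance
        self.a = 0.0         # moving estimate of the mean return
        self.b = 0.0         # moving estimate of the mean squared return

    def update(self, ret):
        """Return the differential Sharpe for `ret`, then update the estimates."""
        delta_a = ret - self.a
        delta_b = ret * ret - self.b
        variance = self.b - self.a ** 2
        dsr = 0.0
        if variance > self.eps:
            dsr = (self.b * delta_a - 0.5 * self.a * delta_b) / variance ** 1.5
        self.a += self.decay * delta_a
        self.b += self.decay * delta_b
        return dsr


def turnover(old_weights, new_weights):
    """Assumed definition: sum of absolute weight changes."""
    return sum(abs(n - o) for o, n in zip(old_weights, new_weights))


def step_reward(dsr, old_weights, new_weights, cost_rate=0.0010):
    """Differential Sharpe minus 10 bps per unit of turnover."""
    return dsr - cost_rate * turnover(old_weights, new_weights)


tracker = DifferentialSharpe()
first = tracker.update(0.01)   # 0.0: no variance estimate exists yet
reward = step_reward(first, [0.5, 0.5], [0.6, 0.4])
print(first, reward)
```

The two terms live on different scales, so a practical implementation would probably need a weighting coefficient between them; the summary does not say how the paper handles this.
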
Evaluation & robustness:
- Main metrics: annualized Sharpe, cumulative return, annualized volatility, MDD. Results reported as mean ± std across 20 seeds.
- Negative control (shuffling), ablations (remove LLM, remove GNN), training-objective variants (SemGAT vs SemGAT-C/D) performed.
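
The shuffle-based negative control can be sketched with a Spearman rank IC computed from scratch. This is a toy illustration on made-up numbers, not the paper's procedure: the number of trials, the seed, and the use of mean absolute IC over shuffles are assumptions.

```python
import random


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation (0.0 if either input is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def rank_ic(scores, forward_returns):
    """Spearman rank IC: Pearson correlation of the ranks."""
    return pearson(ranks(scores), ranks(forward_returns))


def shuffled_abs_ic(scores, forward_returns, n_trials=200, seed=0):
    """Mean |rank IC| after randomly permuting scores across stocks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        shuffled = list(scores)
        rng.shuffle(shuffled)
        total += abs(rank_ic(shuffled, forward_returns))
    return total / n_trials


# Made-up scores and forward returns with a perfect rank relationship.
scores = [0.1, 0.4, 0.2, 0.9, 0.5]
forward_returns = [-0.02, 0.01, -0.01, 0.05, 0.02]
real_ic = rank_ic(scores, forward_returns)
control_ic = shuffled_abs_ic(scores, forward_returns)
print(real_ic, control_ic)
```

If the real IC does not clearly exceed the shuffled baseline, the apparent signal is indistinguishable from chance, which is the logic behind the reported drop in |RankIC| under shuffling.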

Implications for AI Economics

Methodological standard: anonymization as a minimal, practical safeguard to distinguish memorization from generalization. For economic/finance AI claims, replacing identifiers and running negative-control shuffles should become standard validation steps.
Feature value of LLM reasoning: explicit LLM reasoning embeddings can add marginal but meaningful predictive power beyond raw technical features; this supports integrating LLM interpretability outputs as quantitative signals in economic models.
Importance of relational structure: learning inter-asset relationships via semantic graph encoders materially improves stability and alpha — suggesting networked models (GNNs) are crucial when modeling cross-sectional asset interactions even under anonymization.
Cost-aware decision-making: RL that internalizes transaction costs and operational constraints (top-K masking, inertia) is essential to realize the value of signals; naive signal-to-portfolio mappings can destroy gains after trading frictions — a reminder that economic evaluation must include realistic execution costs.
Regime sensitivity and risk governance: high Sharpe in volatile regimes but higher realized volatility and drawdowns imply trade-offs between alpha and tail risk; economic agents and regulators should scrutinize not just point estimates of performance but regime-dependent risk exposures.
Reproducibility and deployment readiness: the paper demonstrates a structured validation pipeline (IC screening, negative controls, seed stability, point-in-time constituents) that raises the bar for claims about LLM-based trading systems; adopting such pipelines would improve credibility and comparability across AI-for-finance research.
Limitations to generalization: anonymization does not fully preclude all avenues of leakage; evaluation across multiple OOS regimes and alternative universes (e.g., other markets) remains important before treating LLM-derived signals as broadly reliable.
Policy and research direction: regulators and practitioners should require (i) identifier-agnostic stress tests, (ii) negative-control shuffles, (iii) explicit reporting of execution assumptions, and (iv) regime-conditional performance analyses when assessing AI trading products.

Short actionable takeaway: anonymize and validate — require LLMs to output reasoning, embed those explanations into relational models, and use cost-aware RL to translate signals to portfolios; but expect regime dependence and enforce rigorous leakage controls before deployment.

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The authors implement sensible controls (anonymization, negative controls, multi-seed runs) and report out-of-sample performance, which supports the claim that signals are not purely memorized, but the results rest on backtests over a short historical window, with limited detail on execution costs, universe selection, and potential leakage; therefore causal claims about persistent real-world alpha remain provisional.

Methods Rigor: medium. The study uses advanced ML techniques (multi-agent LLMs, reasoning embeddings, GNN, PPO-DSR), multiple seeds, and negative controls, showing strong ML engineering rigor, but lacks full econometric safeguards (pre-registration, long multi-year OOS evaluation, detailed transaction-cost/market-impact modeling, and transparent asset-universe specification), limiting rigor from an applied finance/econometrics perspective.

Sample: Backtested equity trading experiments using outputs from four LLM agents whose textual reasoning was anonymized; reasoning embeddings were assembled into a GNN and a PPO-DSR policy executed trades; primary evaluation window is 2025 YTD through 2025-08-01, with an extended 2024–2025 period for robustness; experiments run across 20 random seeds; asset universe, turnover, transaction-cost assumptions, and selection criteria not fully specified in summary.

Themes: governance innovation

Identification: Anonymize tickers and company names ("blindfold") to remove memorized ticker associations, run negative-control experiments to test for spurious signals, evaluate across multiple random seeds (20) and multiple OOS windows (2025 YTD and extended 2024–2025) to check robustness; use anonymized LLM reasoning embeddings and a GNN + PPO-DSR trading policy to link agent reasoning to returns.

Generalizability:
- Short primary evaluation window (2025 YTD to 2025-08-01) limits long-run inference.
- Unspecified asset universe and selection raises risk of selection/survivorship biases.
- Backtest results may not generalize to live trading due to uncertain market impact, liquidity and slippage modeling.
- Anonymization may reduce but not eliminate all forms of leakage (industry-sector signals, price histories, etc.).
- Findings depend on chosen LLM models, training corpora, GNN architecture, and RL hyperparameters.
- Performance appears regime-dependent (stronger in volatile markets, weaker in trending bull markets), limiting universal applicability.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
BlindTrade anonymizes tickers and company names (blindfolding agents by anonymizing all identifiers).	Other	null_result	high	presence/absence of identifier anonymization (anonymization applied to input data)	0.48
Four LLM agents output scores along with reasoning.	Other	null_result	high	agent outputs: numeric scores and textual reasoning	n=4 0.48
A GNN graph is constructed from reasoning embeddings and trading decisions are made using a PPO-DSR policy.	Other	null_result	high	use of GNN on reasoning embeddings and use of PPO-DSR policy to produce trading actions	0.48
On 2025 year-to-date (through 2025-08-01), the system achieved Sharpe 1.40 +/- 0.22 across 20 random seeds.	Firm Revenue	positive	medium	Sharpe ratio (mean and +/- presumably standard error or standard deviation) over specified 2025 YTD window across 20 seeds	n=20 1.40 ± 0.22 0.29
Signal legitimacy was validated through negative control experiments.	Other	positive	low	legitimacy of predictive signals (i.e., whether performance persists under negative controls / blinded conditions)	0.14
An extended evaluation over 2024–2025 reveals market-regime dependency: the learned policy performs well in volatile conditions but shows reduced alpha in trending bull markets.	Firm Revenue	mixed	medium	strategy alpha/performance (e.g., returns or Sharpe) conditional on market regime (volatile vs trending bull)	0.29
Two sources of spurious performance addressed are memorization bias from ticker-specific pre-training and survivorship bias from flawed backtesting.	Other	negative	medium	reduction/mitigation of spurious performance attributable to memorization and survivorship biases	0.29
Blindfolding (anonymizing identifiers) allows verification of whether meaningful predictive signals persist (i.e., predictions reflect legitimate patterns rather than pre-trained recall of tickers).	Other	positive	medium	persistence of predictive signal after anonymization (signal legitimacy)	0.29