The Commonplace

Anonymized LLM trading agents produce robust backtest returns (Sharpe ~1.40) and negative controls indicate signals are not just memorized tickers; however, the alpha is sensitive to market regime and limited to short historical windows, raising questions about long-term live performance.

Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization
Joohyoung Jeon, Hongchul Lee · March 18, 2026
Source: arXiv · Paper type: quasi_experimental · Evidence strength: medium · Relevance: 7/10
Using anonymized inputs and negative controls, LLM-based multi-agent trading with a GNN + PPO-DSR policy achieved a Sharpe of ~1.40 in 2025 YTD across 20 seeds, with robustness checks suggesting signals are not solely due to ticker memorization but performance is market-regime dependent.

For LLM trading agents to be genuinely trustworthy, they must demonstrate understanding of market dynamics rather than exploitation of memorized ticker associations. Building responsible multi-agent systems demands rigorous signal validation: proving that predictions reflect legitimate patterns, not pre-trained recall. We address two sources of spurious performance: memorization bias from ticker-specific pre-training, and survivorship bias from flawed backtesting. Our approach is to blindfold the agents--anonymizing all identifiers--and verify whether meaningful signals persist. BlindTrade anonymizes tickers and company names, and four LLM agents output scores along with reasoning. We construct a GNN graph from reasoning embeddings and trade using PPO-DSR policy. On 2025 YTD (through 2025-08-01), we achieved Sharpe 1.40 +/- 0.22 across 20 seeds and validated signal legitimacy through negative control experiments. To assess robustness beyond a single OOS window, we additionally evaluate an extended period (2024--2025), revealing market-regime dependency: the policy excels in volatile conditions but shows reduced alpha in trending bull markets.

Summary

Main Finding

BlindTrade is an anonymization-first pipeline that combines specialized LLM agents, a semantic GNN encoder (SemGAT), and a cost-aware RL allocator (PPO-DSR). It produces cross-sectional predictive signals that survive anonymization and negative controls. On the 2025 YTD out-of-sample (OOS) window (2025-01-02 to 2025-08-01), the full system achieved an annualized Sharpe of 1.40 ± 0.22 (mean cumulative return 32.22% ± 5.21%), outperforming passive and active benchmarks. However, performance is regime-dependent, comes with higher volatility and maximum drawdown (MDD) than the benchmarks, and depends critically on the GNN and LLM-derived features.
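
As a rough illustration of how a figure such as "annualized Sharpe, mean ± std across seeds" is typically computed, here is a minimal sketch. It is not taken from the paper: the 252-day annualization factor, the zero risk-free rate, and the function names are assumptions made for this example.

```python
import math
import statistics

# Assumed annualization factor; the summary does not state which one was used.
TRADING_DAYS = 252


def annualized_sharpe(daily_returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio from a list of daily simple returns."""
    excess = [r - risk_free_daily for r in daily_returns]
    mean = statistics.fmean(excess)
    std = statistics.stdev(excess)  # sample standard deviation
    if std == 0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return mean / std * math.sqrt(TRADING_DAYS)


def summarize_over_seeds(per_seed_sharpes):
    """Mean and sample standard deviation of per-seed Sharpe ratios."""
    return statistics.fmean(per_seed_sharpes), statistics.stdev(per_seed_sharpes)
```

In this sketch each seed would contribute one out-of-sample daily return series, `annualized_sharpe` would be applied per seed, and `summarize_over_seeds` would produce the reported mean and spread.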

Key Points

  • Anonymization-first test: all tickers, company/product/person names in inputs (prices + headlines) were replaced by synthetic IDs to block direct memorization.
  • Multi-agent LLM feature generation: four specialized agents (Momentum, News-Event, Mean-Reversion, Risk-Regime) produce deterministic JSON scores and explicit textual reasoning; reasoning is embedded and combined with numeric/categorical outputs into a 394-d per-stock feature vector.
  • SemGAT (Semantic Graph Encoder): constructs sector edges + semantic edges (cosine similarity of reasoning embeddings, threshold/top-k) and uses a 2-layer GATv2 to produce 128-d node embeddings; predicts return distributions via HL-Gauss (101 bins) with pairwise ranking loss.
  • PPO-DSR RL allocator: intent head (defensive/neutral/aggressive) formed from global LLM/GNN statistics; node-score head + Dirichlet mean for allocation; no cash allowed, top-20 mask for tractability, execution inertia (η) to smooth turnover. Reward = Differential Sharpe minus transaction costs (10 bps/turnover).
  • Validation & leakage controls:
    • IC (Spearman rank) analysis at h=21 days was used to screen features. The Risk-Regime and News-Event agents showed statistically significant positive LLM IC; the average RAW IC was negative, but LLM features improved it (positive ∆IC).
    • Negative control: cross-sectional random shuffling of GNN scores reduced |RankIC| from ~0.015 to ~0.0004 and collapsed performance, indicating that the original ranking carries real cross-sectional signal rather than an artifact of the pipeline.
  • Ablations:
    • Remove LLM features → Sharpe drops from 1.40 to 1.14 (∆ = -0.26).
    • Remove GNN message passing → Sharpe drops to ~0.62 (∆ = -0.78) and variance increases; graph structure is critical.
    • RL vs naive Top-K: naive top-20 equal weighting gives turnover ~139%/day and Sharpe collapses post-costs; RL suppresses turnover to ~1.7%/day and sustains Sharpe.
  • Robustness & limits:
    • Dataset: point-in-time S&P 500 constituents (2020-01-02 to 2025-08-01; 1,403 trading days), 60-day lookback for LLMs, IC horizon h=21 days. Results are averaged across 20 seeds; hyperparameters were tuned on a validation split via Optuna.
    • System always fully invested (no cash), so volatility (42.34% ann.) and MDD (~-31.66%) exceed benchmarks. Performance declines in trending bull markets; excels in volatile regimes.
    • Anonymization reduces but does not guarantee elimination of all leakage (authors note remaining paths like temporal patterns in synthetic IDs).
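
The shuffle-based negative control in the list above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the helper names (`ranks`, `rank_ic`, `shuffled_abs_rank_ic`), the number of shuffles, and the seed are assumptions made here. The idea is that if the scores carry cross-sectional information, permuting them across stocks on the same date should push the rank correlation with forward returns toward zero.

```python
import math
import random
import statistics


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("constant input: correlation is undefined")
    return cov / math.sqrt(vx * vy)


def rank_ic(scores, forward_returns):
    """Spearman rank IC: Pearson correlation computed on ranks."""
    return pearson(ranks(scores), ranks(forward_returns))


def shuffled_abs_rank_ic(scores, forward_returns, n_shuffles=200, seed=0):
    """Mean |RankIC| after randomly permuting scores across stocks."""
    rng = random.Random(seed)
    abs_ics = []
    for _ in range(n_shuffles):
        permuted = list(scores)
        rng.shuffle(permuted)  # breaks the score-to-stock assignment
        abs_ics.append(abs(rank_ic(permuted, forward_returns)))
    return statistics.fmean(abs_ics)
```

A real check would compute `rank_ic` per date on the model scores and the h=21 forward returns, then compare its magnitude with `shuffled_abs_rank_ic` on the same data.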

Data & Methods

  • Data:
    • Universe: point-in-time S&P 500 constituents. On each date only the stocks that were index members on that date are used, which avoids survivorship bias.
    • Time span: 2020-01-02 to 2025-08-01, split into Train (2020 to 2024-09-30), Val (2024-10-01 to 2024-12-31), and OOS (2025-01-02 to 2025-08-01).
    • Inputs: anonymized prices, technical indicators, and up to 5 anonymized headlines per stock (t-60 to t-1).
  • Anonymization:
    • Tickers and proper nouns replaced with synthetic IDs; Google Knowledge Graph used to identify entities in news; LLM prompts enforce 60-day temporal cutoff.
  • LLM agents:
    • Four roles with structured outputs (scores + reasoning); reasoning concatenated and embedded (384-d) then combined with 7 numeric + 3 categorical features → 394-d vector per stock.
    • Deterministic JSON outputs enforced via system prompts.
  • SemGAT GNN:
    • Node features = 394-d vectors; 2-layer GATv2 → 128-d node embeddings.
    • Edges = sector full-connect + semantic rewiring (cosine similarity threshold/top-10 neighbors).
    • Loss = HL-Gauss distributional loss + pairwise ranking + market risk terms + J-S regularization.
  • RL (PPO-DSR):
    • Global intent head using aggregated LLM/GNN statistics; intent conditions temperature for score-to-allocation mapping.
    • Action space: Dirichlet-based weights over top-20 masked stocks (no cash). Execution inertia η controls rebalancing smoothing.
    • Reward: Differential Sharpe minus transaction cost penalty (10 bps per unit turnover). Hyperparameters optimized on validation.
  • Evaluation & robustness:
    • Main metrics: annualized Sharpe, cumulative return, annualized volatility, MDD. Results reported as mean ± std across 20 seeds.
    • Negative control (shuffling), ablations (remove LLM, remove GNN), training-objective variants (SemGAT vs SemGAT-C/D) performed.
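
The reward described under RL (PPO-DSR) can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation. It uses the standard incremental Differential Sharpe Ratio with exponential moving estimates of the first and second return moments. The adaptation rate (`adapt_rate`), the linear form of the inertia blend, the definition of turnover as the sum of absolute weight changes, and subtracting the cost directly from the DSR term are all choices made for this example. Only the 10 bps cost per unit turnover comes from the summary.

```python
from dataclasses import dataclass

COST_PER_TURNOVER = 0.0010  # 10 bps per unit turnover, as stated in the summary


@dataclass
class DifferentialSharpe:
    """Incremental Differential Sharpe Ratio tracker (assumed standard form)."""
    adapt_rate: float = 0.01  # hypothetical EMA rate, not from the paper
    eps: float = 1e-12
    a: float = 0.0  # moving estimate of mean return
    b: float = 0.0  # moving estimate of mean squared return

    def step(self, ret):
        d_a = ret - self.a
        d_b = ret * ret - self.b
        var = self.b - self.a * self.a
        if var > self.eps:
            dsr = (self.b * d_a - 0.5 * self.a * d_b) / var ** 1.5
        else:
            dsr = 0.0  # variance estimate not yet informative
        self.a += self.adapt_rate * d_a
        self.b += self.adapt_rate * d_b
        return dsr


def blend_with_inertia(prev_w, target_w, inertia):
    """Assumed inertia form: move only part of the way toward the target."""
    return [inertia * p + (1.0 - inertia) * t for p, t in zip(prev_w, target_w)]


def turnover(prev_w, new_w):
    """Assumed definition: sum of absolute weight changes."""
    return sum(abs(n - p) for p, n in zip(prev_w, new_w))


def reward(tracker, prev_w, target_w, asset_returns, inertia=0.9):
    """DSR of the executed portfolio return minus a turnover cost penalty."""
    new_w = blend_with_inertia(prev_w, target_w, inertia)
    cost = COST_PER_TURNOVER * turnover(prev_w, new_w)
    port_ret = sum(w * r for w, r in zip(new_w, asset_returns))
    return tracker.step(port_ret) - cost, new_w
```

With `inertia` close to 1 the executed weights barely move, so the cost term stays small. This is one simple way a policy could end up with the low daily turnover reported for the RL allocator, though the paper's actual mechanism may differ.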

Implications for AI Economics

  • Methodological standard: anonymization as a minimal, practical safeguard to distinguish memorization from generalization. For economic/finance AI claims, replacing identifiers and running negative-control shuffles should become standard validation steps.
  • Feature value of LLM reasoning: explicit LLM reasoning embeddings can add marginal but meaningful predictive power beyond raw technical features; this supports integrating LLM interpretability outputs as quantitative signals in economic models.
  • Importance of relational structure: learning inter-asset relationships via semantic graph encoders materially improves stability and alpha — suggesting networked models (GNNs) are crucial when modeling cross-sectional asset interactions even under anonymization.
  • Cost-aware decision-making: RL that internalizes transaction costs and operational constraints (top-K masking, inertia) is essential to realize the value of signals; naive signal-to-portfolio mappings can destroy gains after trading frictions — a reminder that economic evaluation must include realistic execution costs.
  • Regime sensitivity and risk governance: high Sharpe in volatile regimes but higher realized volatility and drawdowns imply trade-offs between alpha and tail risk; economic agents and regulators should scrutinize not just point estimates of performance but regime-dependent risk exposures.
  • Reproducibility and deployment readiness: the paper demonstrates a structured validation pipeline (IC screening, negative controls, seed stability, point-in-time constituents) that raises the bar for claims about LLM-based trading systems; adopting such pipelines would improve credibility and comparability across AI-for-finance research.
  • Limitations to generalization: anonymization does not fully preclude all avenues of leakage; evaluation across multiple OOS regimes and alternative universes (e.g., other markets) remains important before treating LLM-derived signals as broadly reliable.
  • Policy and research direction: regulators and practitioners should require (i) identifier-agnostic stress tests, (ii) negative-control shuffles, (iii) explicit reporting of execution assumptions, and (iv) regime-conditional performance analyses when assessing AI trading products.

Short actionable takeaway: anonymize and validate — require LLMs to output reasoning, embed those explanations into relational models, and use cost-aware RL to translate signals to portfolios; but expect regime dependence and enforce rigorous leakage controls before deployment.

Assessment

  • Paper Type: quasi_experimental
  • Evidence Strength: medium. The authors implement sensible controls (anonymization, negative controls, multi-seed runs) and report out-of-sample performance, which supports the claim that signals are not purely memorized, but the results rest on backtests over a short historical window, with limited detail on execution costs, universe selection, and potential leakage; therefore causal claims about persistent real-world alpha remain provisional.
  • Methods Rigor: medium. The study uses advanced ML techniques (multi-agent LLMs, reasoning embeddings, GNN, PPO-DSR), multiple seeds, and negative controls, showing strong ML engineering rigor, but lacks full econometric safeguards (pre-registration, long multi-year OOS evaluation, detailed transaction-cost/market-impact modeling, and transparent asset-universe specification), limiting rigor from an applied finance/econometrics perspective.
  • Sample: Backtested equity trading experiments using outputs from four LLM agents whose textual reasoning was anonymized; reasoning embeddings were assembled into a GNN and a PPO-DSR policy executed trades; primary evaluation window is 2025 YTD through 2025-08-01, with an extended 2024--2025 period for robustness; experiments run across 20 random seeds; asset universe, turnover, transaction-cost assumptions, and selection criteria not fully specified in summary.
  • Themes: governance, innovation
  • Identification: Anonymize tickers and company names ("blindfold") to remove memorized ticker associations, run negative-control experiments to test for spurious signals, evaluate across multiple random seeds (20) and multiple OOS windows (2025 YTD and extended 2024--2025) to check robustness; use anonymized LLM reasoning embeddings and a GNN + PPO-DSR trading policy to link agent reasoning to returns.
  • Generalizability:
    • Short primary evaluation window (2025 YTD to 2025-08-01) limits long-run inference.
    • Unspecified asset universe and selection raises risk of selection/survivorship biases.
    • Backtest results may not generalize to live trading due to uncertain market impact, liquidity and slippage modeling.
    • Anonymization may reduce but not eliminate all forms of leakage (industry-sector signals, price histories, etc.).
    • Findings depend on chosen LLM models, training corpora, GNN architecture, and RL hyperparameters.
    • Performance appears regime-dependent (stronger in volatile markets, weaker in trending bull markets), limiting universal applicability.

Claims (8)

  • BlindTrade anonymizes tickers and company names (blindfolding agents by anonymizing all identifiers). [Other]
    • Direction: null_result
    • Confidence: high
    • Outcome: presence/absence of identifier anonymization (anonymization applied to input data)
    • Details: 0.48
  • Four LLM agents output scores along with reasoning. [Other]
    • Direction: null_result
    • Confidence: high
    • Outcome: agent outputs: numeric scores and textual reasoning
    • Details: n=4; 0.48
  • A GNN graph is constructed from reasoning embeddings and trading decisions are made using a PPO-DSR policy. [Other]
    • Direction: null_result
    • Confidence: high
    • Outcome: use of GNN on reasoning embeddings and use of PPO-DSR policy to produce trading actions
    • Details: 0.48
  • On 2025 year-to-date (through 2025-08-01), the system achieved Sharpe 1.40 +/- 0.22 across 20 random seeds. [Firm Revenue]
    • Direction: positive
    • Confidence: medium
    • Outcome: Sharpe ratio (mean and +/- presumably standard error or standard deviation) over specified 2025 YTD window across 20 seeds
    • Details: n=20; 1.40 ± 0.22; 0.29
  • Signal legitimacy was validated through negative control experiments. [Other]
    • Direction: positive
    • Confidence: low
    • Outcome: legitimacy of predictive signals (i.e., whether performance persists under negative controls / blinded conditions)
    • Details: 0.14
  • An extended evaluation over 2024–2025 reveals market-regime dependency: the learned policy performs well in volatile conditions but shows reduced alpha in trending bull markets. [Firm Revenue]
    • Direction: mixed
    • Confidence: medium
    • Outcome: strategy alpha/performance (e.g., returns or Sharpe) conditional on market regime (volatile vs trending bull)
    • Details: 0.29
  • Two sources of spurious performance addressed are memorization bias from ticker-specific pre-training and survivorship bias from flawed backtesting. [Other]
    • Direction: negative
    • Confidence: medium
    • Outcome: reduction/mitigation of spurious performance attributable to memorization and survivorship biases
    • Details: 0.29
  • Blindfolding (anonymizing identifiers) allows verification of whether meaningful predictive signals persist (i.e., predictions reflect legitimate patterns rather than pre-trained recall of tickers). [Other]
    • Direction: positive
    • Confidence: medium
    • Outcome: persistence of predictive signal after anonymization (signal legitimacy)
    • Details: 0.29

Notes