The Commonplace

An autonomous agentic AI produces strong backtest returns (a reported 3.11 Sharpe ratio and 59.5% annualized return from interpretable long–short signals on U.S. equities), but the findings rest on historical backtests and may not survive trading frictions, capacity limits, or changing market regimes.

Beyond Prompting: An Autonomous Framework for Systematic Factor Investing via Agentic AI
Allen Yikuan Huang, Zheqi Fan · March 15, 2026 · arXiv (Cornell University)
OpenAlex · descriptive · low evidence · 7/10 relevance · Source PDF
An autonomous agentic AI that self-generates interpretable trading signals is reported to deliver long–short U.S. equity backtest portfolios with an annualized Sharpe ratio of 3.11 and a 59.53% annualized return after out-of-sample validation.

This paper develops an autonomous framework for systematic factor investing via agentic AI. Rather than relying on sequential manual prompts, our approach operationalizes the model as a self-directed engine that endogenously formulates interpretable trading signals. To mitigate data snooping biases, this closed-loop system imposes strict empirical discipline through out-of-sample validation and economic rationale requirements. Applying this methodology to the U.S. equity market, we document that long-short portfolios formed on the simple linear combination of signals deliver an annualized Sharpe ratio of 3.11 and a return of 59.53%. Finally, our empirics demonstrate that self-evolving AI offers a scalable and interpretable paradigm.
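The abstract's headline figures are an annualized Sharpe ratio and an annualized return. As a point of reference, a minimal sketch of how such figures are conventionally annualized from daily long-short portfolio returns (the function and the synthetic series are illustrative assumptions, not the paper's data or code):

```python
import numpy as np

def annualize(daily_returns, periods=252):
    """Annualized return and Sharpe ratio from daily portfolio returns.

    Conventional annualization: geometric compounding for the return,
    sqrt-of-time scaling for the Sharpe ratio (risk-free rate assumed zero).
    """
    r = np.asarray(daily_returns, dtype=float)
    mean, std = r.mean(), r.std(ddof=1)
    ann_return = (1.0 + mean) ** periods - 1.0       # geometric annualization
    ann_sharpe = mean / std * np.sqrt(periods)       # sqrt-of-time scaling
    return ann_return, ann_sharpe

# Illustrative only: synthetic daily spread returns, not the paper's series
rng = np.random.default_rng(0)
ret, sharpe = annualize(rng.normal(0.0015, 0.01, 252))
```

Under these conventions, a Sharpe of 3.11 would require an unusually stable daily spread, which is one reason the paper's robustness checks matter.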

Summary

Main Finding

The paper presents an autonomous agentic AI framework that endogenously discovers, tests, and refines interpretable trading signals (factors) in a closed-loop research cycle. Applied to U.S. equities, agent-discovered signals, combined first by simple linear aggregation and then by nonlinear LightGBM aggregation, produce large, robust out-of-sample returns (reported long-short portfolio: annualized Sharpe = 3.11; return = 59.53%) that survive realistic transaction-cost, turnover, and risk-adjustment tests. The system enforces strict anti–data-snooping discipline by requiring out-of-sample validation and a stated economic rationale for promoted factors.

Key Points

  • Agentic shift: Moves from manual prompt-driven workflows to an autonomous LLM-based agent that functions as an iterative quant researcher (propose → compute → evaluate → gate → update memory/policy).
  • Constrained hypothesis space: Factors are constructed from a fixed primitive set (price, volume, volatility transforms, technical operators) under a bounded expression grammar to ensure interpretability and auditability.
  • Deterministic execution and reproducibility: Language proposals are deterministically translated into panel-consistent code to compute factor time series (no hidden numerical drift).
  • Promotion gates and memory: A unified evaluator computes a common metric set; transparent gates decide promote/hold/retire; memory conditions future proposals for a mix of exploitation/exploration.
  • Overfitting controls: Strict out-of-sample testing, no-look-ahead rules, economic-rationale requirement, and multiple-hypothesis testing adjustments are integrated to combat p-hacking.
  • Two-stage portfolio construction: (1) evaluate single-factor decile sorts and long-short spreads; (2) aggregate complementary signals using nonlinear models (LightGBM) to capture interactions and dynamics.
  • Robustness: Authors report survival of performance after transaction costs, market-impact modeling, turnover constraints, risk adjustments (e.g., Fama–French), across regimes, and across alternate universes/horizons/hyperparameters.
  • Interpretability: Because factors are symbolic formulas (not black-box embeddings), they can be audited and linked to economic narratives.
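The propose → compute → evaluate → gate → update cycle described above can be sketched as a loop. This is a hypothetical outline under assumed interfaces (`propose`, `compute`, `evaluate` and the `sharpe_gate` threshold are illustrative names, not the paper's implementation):

```python
def research_loop(propose, compute, evaluate, n_rounds=100,
                  sharpe_gate=1.0, require_rationale=True):
    """Sketch of a closed-loop factor-research cycle.

    propose(memory)  -> candidate dict with a symbolic "formula" and "rationale"
    compute(formula) -> factor time series via a deterministic execution layer
    evaluate(series) -> out-of-sample metric dict (here just "oos_sharpe")
    """
    library, memory = [], []            # promoted factors; history for the agent
    for _ in range(n_rounds):
        candidate = propose(memory)                 # agent emits factor + rationale
        series = compute(candidate["formula"])      # deterministic translation to code
        metrics = evaluate(series)                  # unified evaluation suite
        promoted = (metrics["oos_sharpe"] >= sharpe_gate
                    and (bool(candidate.get("rationale")) or not require_rationale))
        if promoted:
            library.append(candidate)               # passes the promotion gate
        memory.append({"candidate": candidate, "metrics": metrics,
                       "promoted": promoted})       # conditions future proposals
    return library, memory
```

The essential design point is that promotion is a transparent, rule-based gate over a common metric set, while the memory record is what lets the agent balance exploitation of promoted ideas against exploration of new ones.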

Data & Methods

  • Data universe: Raw price and volume panel data on U.S. equities (paper discusses "extensive historical market data" and an ordinary equities sample; exact sample years not stated in the excerpt).
  • Candidate generation: LLM-based agent generates symbolic factor formulas f_{i,t} = G(X_{i,t}, ..., X_{i,t-k}; O) using a bounded operator set O (moving averages, price-relative transforms, volume/liquidity, volatility states, etc.).
  • Execution layer: Deterministic code maps symbolic recipes to factor time series with strict cross-sectional and time-series transformation rules.
  • Evaluation metrics: Common evaluation suite for every candidate (decile sorts, top-minus-bottom spreads, Sharpe, statistical significance, monotonic rank ordering, decay/out-of-sample stability).
  • Selection gates: Predefined promotion rules that require out-of-sample performance, economic rationale text, and pass multiple-testing corrections before inclusion in the factor library.
  • Aggregation: Nonlinear synthesis using LightGBM to form investable portfolios capturing interactions among promoted signals; also reports simple linear combination performance.
  • Anti-overfitting methods: No-look-ahead chronology, multiple-hypothesis testing adjustments, structured memory to prevent repeated exploitation of sample-specific spurious patterns.
  • Robustness checks: Transaction-cost and market-impact models, turnover constraints, regime-based subsample tests, alternative holding periods, cross-asset/universe checks, and hyperparameter sensitivity.
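The first-stage evaluation step, decile sorts with a top-minus-bottom spread, can be sketched as follows. A minimal illustration of the standard technique, assuming aligned (T, N) panels of factor values and next-period returns; the paper's actual metric suite is broader (significance tests, monotonicity, decay):

```python
import numpy as np

def decile_long_short(factor, fwd_returns, n_bins=10):
    """Per-period top-minus-bottom decile spread from a cross-sectional factor.

    factor, fwd_returns: (T, N) arrays of factor values and next-period returns,
    aligned so that row t's factor predicts row t's forward return.
    """
    spreads = []
    for f, r in zip(factor, fwd_returns):
        ranks = f.argsort()                 # ascending order of factor values
        k = max(1, len(f) // n_bins)        # stocks per decile
        short, long = ranks[:k], ranks[-k:] # bottom and top deciles
        spreads.append(r[long].mean() - r[short].mean())
    return np.array(spreads)
```

The resulting spread series is what quantities like the annualized Sharpe ratio and monotonic rank ordering are computed from before a candidate reaches the promotion gate.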

Implications for AI Economics

  • Research productivity and structure: Agentic systems can materially accelerate factor discovery and reduce manual human bottlenecks, shifting the role of quant research from feature engineering to governance, auditing, and deployment oversight.
  • Interpretability + automation: Producing explicit symbolic factor formulas mitigates some of the black-box concerns around ML in finance and facilitates economic interpretation and regulatory auditability relative to opaque deep-learning models.
  • Market ecology and commercialization risk: If agentic factor discovery is scalable and widely adopted, it may accelerate factor crowding, shorten alpha decay horizons, and raise competition for capacity-sensitive strategies — necessitating new attention to capacity, liquidity costs, and endogenous market impact.
  • Methodological standardization: Embedding strict out-of-sample rules, no-look-ahead constraints, and economic-rationale gating into AutoML/Agentic pipelines sets a potential industry standard for credible automated discovery and could raise the bar for reproducibility and false‑discovery control in empirical asset pricing.
  • Risks and open questions: The framework still depends on (i) fidelity of the agent’s proposal mechanism (LLM hallucinations or training-data bias), (ii) correct specification of primitives/operators and evaluation gates, and (iii) real-world implementation frictions (slippage, market impact at scale). Independent replication, capacity analysis, and regulatory/audit protocols will be crucial.
  • Directions for further research: cross-market replication (other asset classes and geographies), capacity and crowding dynamics, comparative studies versus human-led discovery, and formal economic modeling of how autonomous factor discovery changes equilibrium returns and research labor demand.

Note: This is a summary of a preliminary preprint (arXiv:2603.14288v1); reported performance figures (Sharpe 3.11, annual return 59.53%) are the authors’ claims and should be independently replicated and stress-tested before any practical deployment.

Assessment

Paper Type: descriptive
Evidence Strength: low — Evidence is based on historical backtests rather than live or experimental evaluation; although the authors report out-of-sample validation and economic-rationale filters, backtests remain vulnerable to look-ahead bias, multiple-testing/data-snooping, survivorship bias, omitted trading frictions (transaction costs, market impact), leverage and capacity constraints, and regime dependence — all of which can materially reduce realized performance.
Methods Rigor: medium — The paper employs sensible safeguards (closed-loop out-of-sample testing and economic-interpretability requirements) which improve rigor relative to naively optimized strategies, but key methodological details appear missing or unclear (exact cross-validation scheme, sample period and asset coverage, how transaction costs/turnover/market impact were modeled, multiple-hypothesis corrections, sensitivity to parameter choices and market regimes), limiting confidence in replication and robustness.
Sample: Historical U.S. equity market data used to construct long–short portfolios from interpretable signals generated by an autonomous agentic AI; the submission does not specify the time span, frequency (daily/weekly/monthly), universe construction (e.g., market-cap filters), data sources, survivorship treatment, or exact sample size in the summary provided.
Themes: innovation, productivity
Identification: No causal identification strategy; claims are based on historical backtests with enforced out-of-sample validation and economic-rationale screening to reduce data-snooping, but no experimental or quasi-experimental design to establish causal effects.
Generalizability:
  • Backtest-only results may not generalize to live trading due to trading frictions (transaction costs, market impact) and implementation shortfalls
  • Performance may be sample-period specific and sensitive to market regimes (e.g., low- vs high-volatility periods)
  • Unclear survivorship and selection biases (index composition, delisted firms) could inflate historical returns
  • Scalability limits: performance reported for a hypothetical portfolio may not hold at larger assets under management
  • Limited to U.S. equities — results may not transfer to other asset classes or geographies
  • Interpretability and stability claims may depend on model architecture and hyperparameters not fully disclosed

Claims (6)

Claim · Direction · Confidence · Outcome · Details
  • We develop an autonomous framework for systematic factor investing via agentic AI. · Other · positive · high · autonomy of investment framework (methodological capability) · 0.03
  • The approach operationalizes the model as a self-directed engine that endogenously formulates interpretable trading signals (rather than relying on sequential manual prompts). · Other · positive · high · interpretability and autonomy of generated trading signals · 0.09
  • To mitigate data snooping biases, the closed-loop system imposes strict empirical discipline through out-of-sample validation and economic rationale requirements. · Other · positive · high · mitigation of data-snooping bias (robustness of signals) · 0.18
  • Applying this methodology to the U.S. equity market, long-short portfolios formed on the simple linear combination of signals deliver an annualized Sharpe ratio of 3.11. · Firm Revenue · positive · high · portfolio Sharpe ratio · annualized Sharpe ratio of 3.11 · 0.18
  • Applying this methodology to the U.S. equity market, long-short portfolios formed on the simple linear combination of signals deliver a return of 59.53% (annualized). · Firm Revenue · positive · high · annualized portfolio return · return of 59.53% · 0.18
  • Our empirics demonstrate that self-evolving AI offers a scalable and interpretable paradigm. · Other · positive · high · scalability and interpretability of the AI-driven investing approach · 0.09
