LLMs reading headlines can build better-than-naive equity portfolios but don’t yet match AI-driven optimizers; low-turnover LLM strategies remain competitive after accounting for trading costs.
Recent advances in large language models (LLMs) have raised questions about their potential role in portfolio allocation beyond traditional sentiment analysis. This study investigated whether LLMs, when prompted directly, can autonomously generate portfolio weights that compete with classical optimization and AI-enhanced strategies. We evaluated seven medium-sized open-source LLMs—Gemma-7B, Mistral-7B, Jansen Adapt-Finance-Llama2-7B, DeepSeek-R1-8B, QuantFactory Llama-3-8B-Instruct-Finance, Qwen-7B, and Llama2-7B—using systematic prompt engineering and temperature tuning. Portfolios were constructed from financial news headlines for S&P 500 equities and benchmarked against mean–variance optimization (MVO), the Black–Litterman model, AI-driven optimizers, and naive diversification strategies. The results show that, while LLM-generated portfolios outperformed naive diversification (Sharpe ratio up to 0.741), they lagged behind AI-optimized benchmarks (Sharpe ratio up to 1.361). A transaction cost analysis revealed that low-turnover LLM strategies retain their competitiveness post-costs, surpassing cap-weighted benchmarks. Statistical tests confirmed significant performance differences (p ≤ 0.01). These findings highlight the ability of LLMs to extract signals from unstructured text, but also their limitations without explicit optimization. Future research should explore hybrid frameworks that combine LLM reasoning with quantitative optimization for cost-sensitive environments.
Summary
Main Finding
LLMs prompted to produce portfolio weights from financial-news headlines can extract tradable signals and outperform naive diversification, but they underperform dedicated AI-optimization and classical quantitative methods. Low-turnover LLM strategies can remain cost-competitive (beating cap-weighted benchmarks after transaction costs), yet they do not match the Sharpe ratios achieved by AI-optimized portfolios.
Key Points
- Models evaluated: Gemma-7B, Mistral-7B, Jansen Adapt-Finance-Llama2-7B, DeepSeek-R1-8B, QuantFactory Llama-3-8B-Instruct-Finance, Qwen-7B, Llama2-7B.
- Input: unstructured financial news headlines associated with S&P 500 equities.
- Approach: directly prompt LLMs to output portfolio weights, using systematic prompt engineering and temperature tuning (a minimal sketch follows this list).
- Benchmarks: mean–variance optimization (MVO), Black–Litterman, AI-driven optimizers, and naive diversification (including cap-weighted).
- Performance: best LLM strategy achieved Sharpe ≈ 0.741; best AI-optimized benchmark reached Sharpe ≈ 1.361.
- Statistical significance: performance differences were statistically significant (p ≤ 0.01).
- Transaction costs: low-turnover LLM strategies maintain competitiveness after accounting for trading costs and can outperform cap-weighted benchmarks.
- Interpretation: LLMs can extract signals from headlines but, when used alone to output weights, they lack the formal optimization that yields higher risk-adjusted returns.
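To make the weight-elicitation step concrete, below is a minimal sketch of prompting a model for portfolio weights and normalizing its output. The `generate` callable is a placeholder for any local LLM inference API (e.g., via llama.cpp or Hugging Face transformers), and the prompt template and JSON post-processing are illustrative assumptions rather than the paper's exact protocol.

```python
import json

def elicit_weights(generate, headlines: dict, temperature: float = 0.2) -> dict:
    """Ask an LLM for portfolio weights over the given tickers and normalize them.

    `generate(prompt, temperature)` is a stand-in for any LLM inference call;
    `headlines` maps ticker -> list of headline strings.
    """
    tickers = sorted(headlines)
    prompt = (
        "You are a portfolio manager. Given recent news headlines per ticker, "
        "return ONLY a JSON object mapping each ticker to a non-negative weight; "
        "the weights must sum to 1.\n\n"
        + "\n".join(f"{t}: {'; '.join(headlines[t])}" for t in tickers)
    )
    raw = generate(prompt, temperature=temperature)
    parsed = json.loads(raw)
    # Clip negatives and renormalize, since raw LLM output rarely sums to 1 exactly.
    weights = {t: max(float(parsed.get(t, 0.0)), 0.0) for t in tickers}
    total = sum(weights.values())
    if total <= 0:  # unusable output: fall back to equal weights
        return {t: 1.0 / len(tickers) for t in tickers}
    return {t: w / total for t, w in weights.items()}
```

Lower temperatures make the returned weights more stable across runs, which is one way temperature tuning can indirectly control turnover.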
Data & Methods
- Data: financial-news headlines linked to S&P 500 equities (study limited to headlines rather than full articles).
- Model set: seven medium-sized open-source LLMs (listed above); not all are fine-tuned on finance, except where noted (e.g., the “Adapt-Finance” and “Instruct-Finance” variants).
- Prompting: systematic prompt-engineering protocol to elicit weight vectors; temperature tuning to control generation stochasticity and turnover.
- Portfolio construction: LLMs directly produced the weights used to build portfolios; results were compared to classical portfolio optimizers (MVO, Black–Litterman), AI-driven optimizers, and naive benchmarks.
- Evaluation metrics: Sharpe ratio as the primary performance metric; turnover and transaction-cost-adjusted returns were assessed; statistical tests compared strategies (reported significance p ≤ 0.01). A sketch of these metrics follows this list.
- Limitations in methods: medium-sized models only, headlines-only text input, single asset universe (S&P 500), and no explicit integration of LLM outputs into formal optimization routines in the LLM-only approach.
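As a concrete reference for the metrics above, the sketch below computes an annualized Sharpe ratio, one-way turnover, and returns net of proportional trading costs. The 10 bps cost level and the per-rebalance cost convention are illustrative assumptions; the paper's exact cost model is not reproduced here.

```python
import numpy as np

def sharpe_ratio(returns: np.ndarray, rf: float = 0.0, periods: int = 252) -> float:
    """Annualized Sharpe ratio from per-period simple returns (rf is an annual rate)."""
    excess = returns - rf / periods
    return float(np.sqrt(periods) * excess.mean() / excess.std(ddof=1))

def net_returns(gross: np.ndarray, weights: np.ndarray, cost_bps: float = 10.0) -> np.ndarray:
    """Deduct proportional trading costs from gross portfolio returns.

    gross:   shape (T,), portfolio return over each period
    weights: shape (T, N), target weights at each rebalance date
    """
    # One-way turnover: sum of absolute weight changes at each rebalance.
    turnover = np.abs(np.diff(weights, axis=0)).sum(axis=1)
    # Convert bps to decimal; the initial allocation cost is ignored for simplicity.
    costs = np.concatenate(([0.0], turnover * cost_bps / 1e4))
    return gross - costs
```

Under this convention, a low-turnover strategy loses little to costs, which is exactly why the study's low-turnover LLM portfolios stay competitive post-costs.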
Implications for AI Economics
- Signal extraction from text: LLMs are viable tools for extracting actionable signals from unstructured text, expanding the scope of alternative data in portfolio allocation research.
- Optimization vs. reasoning trade-off: LLM reasoning alone is insufficient to outperform optimized quantitative strategies; combining LLM-generated insights with formal optimization can unlock better risk-adjusted returns.
- Cost-sensitive deployment: low-turnover LLM strategies can be attractive in environments with realistic trading costs, suggesting practical applicability for longer-horizon or low-transaction strategies.
- Research agenda:
  - Develop hybrid frameworks that feed LLM-derived signals or priors into quantitative optimizers (e.g., Black–Litterman, constrained MVO, RL-based allocators); a minimal sketch follows this section.
  - Evaluate larger or finance-fine-tuned LLMs, richer text inputs (full articles, filings), and multimodal data.
  - Study robustness across time periods, asset classes, market regimes, and transaction-cost regimes.
  - Analyze governance, interpretability, and economic costs (compute, latency, reproducibility) relative to performance gains.
- Policy and market impact: as LLMs become integrated into allocation workflows, market dynamics (liquidity, crowding) and regulatory issues (model risk, disclosure) warrant attention.
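One way to realize the hybrid idea in the research agenda is to treat LLM-derived return expectations as Black–Litterman views. The sketch below applies the standard Black–Litterman posterior formula; the view matrix, view value, and uncertainty are illustrative stand-ins for quantities that would be parsed from LLM output.

```python
import numpy as np

def black_litterman_posterior(pi, Sigma, P, q, Omega, tau=0.05):
    """Posterior expected returns blending equilibrium returns `pi` with views (P, q).

    Standard BL master formula:
    mu = [ (tau*Sigma)^-1 + P' Omega^-1 P ]^-1 [ (tau*Sigma)^-1 pi + P' Omega^-1 q ]
    """
    tS_inv = np.linalg.inv(tau * Sigma)
    O_inv = np.linalg.inv(Omega)
    A = tS_inv + P.T @ O_inv @ P
    b = tS_inv @ pi + P.T @ O_inv @ q
    return np.linalg.solve(A, b)

# Hypothetical example: an LLM headline signal implies asset 0 will
# outperform asset 1 by 2%; all numbers here are illustrative.
Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])   # asset return covariance
pi = np.array([0.05, 0.07])                      # equilibrium (prior) returns
P = np.array([[1.0, -1.0]])                      # relative view: asset 0 minus asset 1
q = np.array([0.02])                             # LLM-implied outperformance
Omega = np.array([[0.001]])                      # view uncertainty
mu_post = black_litterman_posterior(pi, Sigma, P, q, Omega)
```

Feeding `mu_post` into a constrained mean–variance step would then supply the formal optimization that the LLM-only approach lacks.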
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We evaluated seven medium-sized open-source LLMs—Gemma-7B, Mistral-7B, Jansen Adapt-Finance-Llama2-7B, DeepSeek-R1-8B, QuantFactory Llama-3-8B-Instruct-Finance, Qwen-7B, and Llama2-7B. | Other | null_result | high | models evaluated (model set) | n=7; 0.5 |
| Portfolios were constructed from financial news headlines for S&P 500 equities and benchmarked against mean–variance optimization (MVO), the Black–Litterman model, AI-driven optimizers, and naive diversification strategies. | Other | null_result | high | portfolio construction source and benchmarking set | n=500; 0.3 |
| LLM-generated portfolios outperformed naive diversification (Sharpe ratio up to 0.741). | Firm Productivity | positive | high | Sharpe ratio (risk-adjusted return) of portfolios | Sharpe ratio up to 0.741; 0.3 |
| LLM-generated portfolios lagged behind AI-optimized benchmarks (Sharpe ratio up to 1.361). | Firm Productivity | negative | high | Sharpe ratio (risk-adjusted return) of portfolios | Sharpe ratio up to 1.361; 0.3 |
| A transaction cost analysis revealed that low-turnover LLM strategies retain their competitiveness post-costs, surpassing cap-weighted benchmarks. | Firm Productivity | positive | high | post-cost portfolio performance relative to cap-weighted benchmark | 0.3 |
| Statistical tests confirmed significant performance differences (p ≤ 0.01). | Decision Quality | mixed | high | statistical significance of performance differences between strategies | p ≤ 0.01; 0.3 |
| LLMs are able to extract signals from unstructured text (financial news headlines) but have limitations without explicit quantitative optimization. | Output Quality | mixed | high | ability to extract actionable signals from unstructured text as reflected in portfolio performance | 0.3 |
| Future research should explore hybrid frameworks that combine LLM reasoning with quantitative optimization for cost-sensitive environments. | Governance And Regulation | positive | high | recommended research direction (hybrid LLM + optimization frameworks) | 0.05 |
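The claims table reports significant performance differences at p ≤ 0.01 without specifying the test. Below is a minimal sketch of one common approach, a paired bootstrap on the Sharpe-ratio difference between two strategies' return series; the procedure and parameters are illustrative assumptions, not the paper's actual test.

```python
import numpy as np

def bootstrap_sharpe_diff_pvalue(r_a: np.ndarray, r_b: np.ndarray,
                                 n_boot: int = 10_000, seed: int = 0) -> float:
    """Two-sided bootstrap p-value for H0: the two strategies share one Sharpe ratio."""
    rng = np.random.default_rng(seed)
    sharpe = lambda r: r.mean() / r.std(ddof=1)
    observed = sharpe(r_a) - sharpe(r_b)
    T = len(r_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample dates jointly so the strategies' cross-correlation is preserved.
        idx = rng.integers(0, T, size=T)
        diffs[i] = sharpe(r_a[idx]) - sharpe(r_b[idx])
    # Impose the null by centering the bootstrap distribution, then compute
    # the two-sided tail probability of the observed difference.
    centered = diffs - diffs.mean()
    return float(np.mean(np.abs(centered) >= abs(observed)))
```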