The Commonplace

LLMs reading headlines can build better-than-naive equity portfolios but don’t yet match AI-driven optimizers; low-turnover LLM strategies remain competitive after accounting for trading costs.

Few-Shot Portfolio Optimization: Can Large Language Models Outperform Quantitative Portfolio Optimization? A Comparative Study of LLMs and Optimized Portfolio Allocators
Lamukanyani Alson Mantshimuli, John Weirstrass Muteba Mwamba · April 28, 2026 · Journal of Risk and Financial Management
OpenAlex · correlational · medium evidence · relevance 7/10 · DOI · Source · PDF
LLMs can extract actionable trading signals from news headlines that produce portfolios outperforming naive diversification and remaining competitive after transaction costs, but they underperform dedicated AI-optimized portfolio methods.

Recent advances in large language models (LLMs) have raised questions about their potential role in portfolio allocation beyond traditional sentiment analyses. This study investigated whether LLMs, when prompted directly, can autonomously generate portfolio weights that compete with classical optimization and AI-enhanced strategies. We evaluated seven medium-sized open-source LLMs—Gemma-7B, Mistral-7B, Jansen Adapt-Finance-Llama2-7B, DeepSeek-R1-8B, QuantFactory Llama-3-8B-Instruct-Finance, Qwen-7B, and Llama2-7B—using systematic prompt engineering and temperature tuning. Portfolios were constructed from financial news headlines for S&P 500 equities and benchmarked against mean–variance optimization (MVO), the Black–Litterman model, AI-driven optimizers, and naive diversification strategies. The results show that, while LLM-generated portfolios outperformed naive diversification (Sharpe ratio up to 0.741), they lagged behind AI-optimized benchmarks (Sharpe ratio up to 1.361). A transaction cost analysis revealed that low-turnover LLM strategies retain their competitiveness post-costs, surpassing cap-weighted benchmarks. Statistical tests confirmed significant performance differences (p≤0.01). These findings highlight the ability of LLMs to extract signals from unstructured text, but also their limitations without explicit optimization. Future research should explore hybrid frameworks that combine LLM reasoning with quantitative optimization for cost-sensitive environments.

Summary

Main Finding

LLMs prompted to produce portfolio weights from financial-news headlines can extract tradable signals and outperform naive diversification, but they underperform dedicated AI-optimization and classical quantitative methods. Low-turnover LLM strategies can remain cost-competitive (beating cap-weighted benchmarks after transaction costs), yet they do not match the Sharpe ratios achieved by AI-optimized portfolios.

Key Points

  • Models evaluated: Gemma-7B, Mistral-7B, Jansen Adapt-Finance-Llama2-7B, DeepSeek-R1-8B, QuantFactory Llama-3-8B-Instruct-Finance, Qwen-7B, Llama2-7B.
  • Input: unstructured financial news headlines associated with S&P 500 equities.
  • Approach: directly prompt LLMs to output portfolio weights, using systematic prompt engineering and temperature tuning.
  • Benchmarks: mean–variance optimization (MVO), Black–Litterman, AI-driven optimizers, and naive diversification (including cap-weighted).
  • Performance: best LLM strategy achieved Sharpe ≈ 0.741; best AI-optimized benchmark reached Sharpe ≈ 1.361.
  • Statistical significance: performance differences were statistically significant (p ≤ 0.01).
  • Transaction costs: low-turnover LLM strategies maintain competitiveness after accounting for trading costs and can outperform cap-weighted benchmarks.
  • Interpretation: LLMs can extract signals from headlines but, when used alone to output weights, they lack the formal optimization that yields higher risk-adjusted returns.
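
The two headline metrics in the Key Points (the Sharpe ratio and post-cost returns net of turnover) can be sketched in a few lines. This is an illustrative computation on synthetic data, assuming a simple proportional cost model in basis points; it is not the paper's pipeline, and the paper's exact cost model is not specified in this summary.

```python
import numpy as np

def annualized_sharpe(returns, rf=0.0, periods=252):
    """Annualized Sharpe ratio from periodic portfolio returns."""
    excess = np.asarray(returns, dtype=float) - rf / periods
    return np.sqrt(periods) * excess.mean() / excess.std(ddof=1)

def post_cost_returns(weights, gross_returns, cost_bps=10.0):
    """Subtract proportional trading costs: cost_bps one-way per unit of turnover."""
    w = np.asarray(weights, dtype=float)
    turnover = np.abs(np.diff(w, axis=0)).sum(axis=1)   # sum of |delta weight| per rebalance
    costs = np.concatenate(([0.0], turnover)) * cost_bps / 1e4
    return np.asarray(gross_returns, dtype=float) - costs

# Synthetic example: 252 trading days, 5 assets, naive equal-weight portfolio
rng = np.random.default_rng(0)
asset_returns = rng.normal(0.0005, 0.01, size=(252, 5))
weights = np.full((252, 5), 0.2)           # constant equal weights: zero turnover
gross = (asset_returns * weights).sum(axis=1)
net = post_cost_returns(weights, gross)    # identical to gross here, since turnover is zero
print(annualized_sharpe(net))
```

The zero-turnover case illustrates why low-turnover strategies survive the cost adjustment: with constant weights no turnover cost is charged, so net and gross Sharpe ratios coincide.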

Data & Methods

  • Data: financial-news headlines linked to S&P 500 equities (study limited to headlines rather than full articles).
  • Model set: seven medium-sized open-source LLMs (listed above), not necessarily finetuned on finance except where noted (e.g., “Adapt-Finance” or “Instruct-Finance” variants).
  • Prompting: systematic prompt-engineering protocol to elicit weight vectors; temperature tuning to control generation stochasticity and turnover.
  • Portfolio construction: LLMs directly produced weights used to build portfolios; results compared to classical portfolio optimizers (MVO, Black–Litterman), AI-driven optimizers, and naive benchmarks.
  • Evaluation metrics: Sharpe ratio as primary performance metric; turnover and transaction-cost-adjusted returns assessed; statistical testing performed to compare strategies (reported significance p ≤ 0.01).
  • Limitations in methods: medium-sized models only, headlines-only text input, single asset universe (S&P 500), and no explicit integration of LLM outputs into formal optimization routines in the LLM-only approach.
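
Eliciting a usable weight vector from a generative model requires some post-processing, since raw LLM output need not be valid (weights can be missing, negative, or fail to sum to one). The sketch below shows one plausible shape for this step; the prompt text and JSON reply format are hypothetical and are not taken from the paper.

```python
import json
import numpy as np

# Hypothetical prompt shape (the paper's actual prompt-engineering protocol is
# not reproduced in this summary): the model is asked to return a JSON object
# mapping each ticker to a portfolio weight.
PROMPT_TEMPLATE = (
    "You are a portfolio manager. Given today's headlines:\n{headlines}\n"
    "Return ONLY a JSON object mapping each ticker in {tickers} to a portfolio "
    "weight. Weights must be non-negative and sum to 1."
)

def parse_weights(llm_output: str, tickers: list[str]) -> np.ndarray:
    """Parse an LLM's JSON reply into a valid long-only weight vector.

    Missing tickers get weight 0; negative weights are clipped to 0; the
    vector is renormalized to sum to 1 (falling back to equal weight if
    every entry is zero).
    """
    raw = json.loads(llm_output)
    w = np.array([max(float(raw.get(t, 0.0)), 0.0) for t in tickers])
    total = w.sum()
    return w / total if total > 0 else np.full(len(tickers), 1 / len(tickers))

# Example reply that over-allocates (weights sum to 1.2) and gets renormalized
reply = '{"AAPL": 0.5, "MSFT": 0.3, "XOM": 0.4}'
w = parse_weights(reply, ["AAPL", "MSFT", "XOM"])
```

The equal-weight fallback also gives a natural failure mode: a degenerate reply degrades gracefully to naive diversification rather than crashing the backtest.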

Implications for AI Economics

  • Signal extraction from text: LLMs are viable tools for extracting actionable signals from unstructured text, expanding the scope of alternative data in portfolio allocation research.
  • Optimization vs. reasoning trade-off: LLM reasoning alone is insufficient to outperform optimized quantitative strategies; combining LLM-generated insights with formal optimization can unlock better risk-adjusted returns.
  • Cost-sensitive deployment: low-turnover LLM strategies can be attractive in environments with realistic trading costs, suggesting practical applicability for longer-horizon or low-transaction strategies.
  • Research agenda:
    • Develop hybrid frameworks that feed LLM-derived signals or priors into quantitative optimizers (e.g., Black–Litterman, constrained MVO, RL-based allocators).
    • Evaluate larger or finance-finetuned LLMs, richer text inputs (full articles, filings), and multimodal data.
    • Study robustness across time periods, asset classes, market regimes, and transaction-cost regimes.
    • Analyze governance, interpretability, and economic costs (compute, latency, reproducibility) relative to performance gains.
  • Policy and market impact: as LLMs become integrated into allocation workflows, market dynamics (liquidity, crowding) and regulatory issues (model risk, disclosure) warrant attention.
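
One concrete form of the hybrid framework on the research agenda is to feed LLM-derived views into a Black–Litterman posterior. The sketch below is a minimal, standard Black–Litterman mean update, with illustrative toy numbers; the idea of sourcing the views (P, q) and their uncertainty Omega from LLM headline scores is an assumption about how such a hybrid could work, not a method from the paper.

```python
import numpy as np

def black_litterman_posterior(pi, Sigma, P, q, Omega, tau=0.05):
    """Standard Black-Litterman posterior mean:
        mu = [(tau*Sigma)^-1 + P' Omega^-1 P]^-1 [(tau*Sigma)^-1 pi + P' Omega^-1 q]
    In a hybrid setup, (P, q) could encode LLM-scored headline views and
    Omega their uncertainty (e.g. derived from model confidence).
    """
    ts_inv = np.linalg.inv(tau * Sigma)
    om_inv = np.linalg.inv(Omega)
    A = ts_inv + P.T @ om_inv @ P
    b = ts_inv @ pi + P.T @ om_inv @ q
    return np.linalg.solve(A, b)

# Toy 3-asset example with one relative view: asset 0 beats asset 1 by 2%
pi = np.array([0.04, 0.04, 0.04])      # equilibrium expected returns
Sigma = np.diag([0.04, 0.04, 0.09])    # asset covariance (diagonal for brevity)
P = np.array([[1.0, -1.0, 0.0]])       # view portfolio: long asset 0, short asset 1
q = np.array([0.02])                   # view magnitude
Omega = np.array([[0.01]])             # view uncertainty
mu = black_litterman_posterior(pi, Sigma, P, q, Omega)
```

A useful sanity check on the design: as the view uncertainty Omega grows, the posterior collapses back to the equilibrium returns, so a low-confidence LLM signal simply has little effect rather than corrupting the allocation.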

Assessment

Paper Type: correlational

Evidence Strength: medium. The paper provides empirical backtest results, multiple benchmarks, systematic prompt/temperature tuning, statistical tests (p ≤ 0.01), and a transaction-cost analysis, which lends credible evidence that LLM-based portfolios can extract useful signals. However, the results rely on historical simulation, and details on the sample period, out-of-sample validation, robustness checks, and sensitivity to prompt choices are limited or unspecified, leaving a risk of overfitting and limiting inference about real-world, causal effectiveness.

Methods Rigor: medium. The authors use multiple LLMs, systematic prompt engineering, temperature sweeps, varied benchmarks, and explicit transaction-cost adjustments (good empirical practice), but the description lacks clarity on key methodological elements (exact time period, headline sources, holdout procedures, hyperparameter-selection protocol, multiple-testing corrections, and robustness to alternative benchmark implementations), which weakens reproducibility and internal validity.

Sample: Portfolios constructed from financial news headlines linked to S&P 500 equities; seven medium-sized open-source LLMs evaluated (Gemma-7B, Mistral-7B, Jansen Adapt-Finance-Llama2-7B, DeepSeek-R1-8B, QuantFactory Llama-3-8B-Instruct-Finance, Qwen-7B, Llama2-7B); benchmarked against mean–variance optimization, Black–Litterman, AI-driven optimizers, and naive diversification strategies; performance assessed via Sharpe ratios and transaction-cost-adjusted returns. The exact sample period and headline sources are not specified in the summary.

Themes: innovation adoption

Identification: Comparative performance evaluation via historical backtesting. LLM-generated portfolio weights (from news headlines) are backtested and compared to benchmark strategies (MVO, Black–Litterman, AI-driven optimizers, naive diversification) using Sharpe ratios and statistical tests (p-values); there is no causal identification strategy beyond out-of-sample performance comparison.

Generalizability:
  • Restricted to S&P 500 equities (large-cap US stocks); may not generalize to small caps, other asset classes, or non-US markets.
  • Relies solely on news headlines (not full articles, filings, or alternative data), limiting signal scope.
  • Evaluated only medium-sized open-source LLMs; results may differ for larger proprietary models or domain-tuned systems.
  • Backtest-based evaluation may not translate to live trading (market impact, latency, real-time data availability).
  • Performance may be sensitive to prompt-engineering, temperature, and hyperparameter choices that may not generalize.
  • Transaction-cost assumptions and execution-modelling specifics may not match real-world trading frictions across contexts.
  • Sample period and market regimes are unspecified; results could be regime-dependent (e.g., crisis vs. calm markets).

Claims (8)

Each claim lists: Outcome category (measure) · Direction · Confidence · Details · Score.

  • We evaluated seven medium-sized open-source LLMs—Gemma-7B, Mistral-7B, Jansen Adapt-Finance-Llama2-7B, DeepSeek-R1-8B, QuantFactory Llama-3-8B-Instruct-Finance, Qwen-7B, and Llama2-7B.
    Outcome: Other (models evaluated, model set) · Direction: null_result · Confidence: high · Details: n = 7 · Score: 0.5
  • Portfolios were constructed from financial news headlines for S&P 500 equities and benchmarked against mean–variance optimization (MVO), the Black–Litterman model, AI-driven optimizers, and naive diversification strategies.
    Outcome: Other (portfolio construction source and benchmarking set) · Direction: null_result · Confidence: high · Details: n = 500 · Score: 0.3
  • LLM-generated portfolios outperformed naive diversification (Sharpe ratio up to 0.741).
    Outcome: Firm Productivity (Sharpe ratio, i.e. risk-adjusted return, of portfolios) · Direction: positive · Confidence: high · Details: Sharpe ratio up to 0.741 · Score: 0.3
  • LLM-generated portfolios lagged behind AI-optimized benchmarks (Sharpe ratio up to 1.361).
    Outcome: Firm Productivity (Sharpe ratio, i.e. risk-adjusted return, of portfolios) · Direction: negative · Confidence: high · Details: Sharpe ratio up to 1.361 · Score: 0.3
  • A transaction cost analysis revealed that low-turnover LLM strategies retain their competitiveness post-costs, surpassing cap-weighted benchmarks.
    Outcome: Firm Productivity (post-cost portfolio performance relative to cap-weighted benchmark) · Direction: positive · Confidence: high · Score: 0.3
  • Statistical tests confirmed significant performance differences (p ≤ 0.01).
    Outcome: Decision Quality (statistical significance of performance differences between strategies) · Direction: mixed · Confidence: high · Details: p ≤ 0.01 · Score: 0.3
  • LLMs are able to extract signals from unstructured text (financial news headlines) but have limitations without explicit quantitative optimization.
    Outcome: Output Quality (ability to extract actionable signals from unstructured text, as reflected in portfolio performance) · Direction: mixed · Confidence: high · Score: 0.3
  • Future research should explore hybrid frameworks that combine LLM reasoning with quantitative optimization for cost-sensitive environments.
    Outcome: Governance And Regulation (recommended research direction: hybrid LLM + optimization frameworks) · Direction: positive · Confidence: high · Score: 0.05

Notes