AEL — a two-timescale approach pairing bandit-guided memory retrieval with LLM-driven reflection — lifts trading-agent Sharpe to 2.13 and reduces variance on a 208-episode portfolio benchmark, beating prior self-improving methods; notably, adding further architectural complexity harms performance, suggesting the bottleneck is diagnosing how to use past experience rather than more components.
LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a “less is more” pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This demonstrates that the bottleneck in agent self-improvement is self-diagnosing how to use experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.
Summary
Main Finding
AEL (Agent Evolving Learning) is a two-timescale self-improvement framework for LLM agents that learns not only what to remember but—critically—how to use memories. By combining a fast-timescale Thompson-Sampling bandit that selects memory-retrieval policies with a slow-timescale LLM-driven reflection that produces causal diagnoses and (when needed) generates new policies, AEL materially improves risk-adjusted performance and robustness on a sequential portfolio benchmark. On the primary benchmark (10 sector-diverse tickers, 208 episodes, 5 seeds), AEL attains Sharpe 2.13 ± 0.47, outperforming prior self-improving methods and all non-LLM baselines tested. The authors find that memory + reflection produce the bulk of gains (cumulative +58% vs. a stateless baseline) and that adding further complexity typically degrades performance.
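As a quick check on how the stage-wise gains compound into the cumulative +58%, the arithmetic below uses the incremental-build Sharpe values reported under Key Points (1.35 stateless, 1.68 with memory, 2.13 for AEL). The rounding to three decimals is ours.

```latex
\[
\frac{1.68}{1.35} \approx 1.244 \;(+24\%), \qquad
\frac{2.13}{1.68} \approx 1.268 \;(+27\%), \qquad
\frac{2.13}{1.35} = \frac{1.68}{1.35}\cdot\frac{2.13}{1.68} \approx 1.578 \;(+58\%).
\]
```

The two stage gains multiply rather than add, which is why +24% followed by +27% gives +58% and not +51%.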
Key Points
- Two-timescale architecture
  - Fast timescale: a Thompson Sampling bandit selects among memory retrieval policies episode by episode.
  - Slow timescale: LLM reflection aggregates windows of episodes to produce causal diagnoses and regime labels, and to decide when to evolve retrieval policies (or planners/tools when enabled).
  - Principle: diagnose before prescribe. Structural changes are triggered only when reflection indicates they are needed.
- Memory design
  - Three-tier memory: episodic (raw logs), semantic (distilled cross-episode patterns), and procedural (promoted high-confidence rules injected into prompts).
  - Retrieval policies vary tier visibility, retrieval depth, and formatting (5 initial policies); the bandit learns which policy to use as memory matures.
  - Retrieved entries are ranked by a composite relevance score combining feature match, quality, recency, and tier boost.
- Learning signal & credit assignment
  - A uniform scalar episode outcome s_t ∈ [−1, 1] is mapped to a reward in [0, 1] via clip((s_t + 1)/2, 0, 1) and used to update the Beta posterior of the selected bandit arm.
  - The authors evaluated more complex credit methods (factored counterfactual credit, FCC, and an LLM-driven variant, LLM-FCC) but found that they degraded performance in this noisy domain.
- Ablation & robustness
  - Nine-variant ablation covering removal of reflection or memory, planner evolution, per-tool selection, cold-start initialization, skill extraction, and alternate credit schemes: nearly all modifications reduced performance.
  - Key empirical pattern: “less is more”. The simplest AEL configuration (memory bandit + LLM reflection) yielded the best mean and lowest variance across seeds.
- Empirical gains
  - Incremental build: Stateless → +Memory → AEL produced Sharpe 1.35 → 1.68 → 2.13 (memory +24%, reflection +27%).
  - AEL achieved the highest Sharpe, Sortino, and Calmar ratios and the lowest Max Drawdown variance among LLM-based approaches.
- Implementation notes
  - Default backbone LLM: Claude Haiku 4.5; the same 12 tools are used across methods; learning is frozen at test time (bandits and memory are read-only).
  - Code and data: https://github.com/WujiangXu/AEL (paper is a preprint under review).
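The composite relevance score described under Memory design can be illustrated with a minimal sketch. The summary names the four components (feature match, quality, recency, tier boost) but not how they are combined, so the additive form, the tier-boost values, the half-life, and every identifier below (`MemoryEntry`, `relevance`, `retrieve`) are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

# All numeric constants are illustrative assumptions, not values from the paper.
TIER_BOOST = {"episodic": 0.0, "semantic": 0.1, "procedural": 0.2}
HALF_LIFE_EPISODES = 20.0


@dataclass
class MemoryEntry:
    ticker: str
    sector: str
    tool: str
    quality: float  # assumed to lie in [0, 1]
    episode: int    # episode at which the entry was written
    tier: str       # "episodic", "semantic", or "procedural"


def feature_match(entry, ticker, sector, tool):
    """Fraction of the three query features (ticker, sector, tool) that match."""
    hits = (entry.ticker == ticker) + (entry.sector == sector) + (entry.tool == tool)
    return hits / 3.0


def recency(entry, current_episode):
    """Exponential decay in entry age: 1.0 when fresh, 0.5 after one half-life."""
    age = max(current_episode - entry.episode, 0)
    return 0.5 ** (age / HALF_LIFE_EPISODES)


def relevance(entry, ticker, sector, tool, current_episode):
    """Assumed additive combination of the four components named in the text."""
    return (
        feature_match(entry, ticker, sector, tool)
        + entry.quality
        + recency(entry, current_episode)
        + TIER_BOOST[entry.tier]
    )


def retrieve(entries, ticker, sector, tool, current_episode, depth):
    """Return the `depth` highest-scoring entries, best first."""
    ranked = sorted(
        entries,
        key=lambda e: relevance(e, ticker, sector, tool, current_episode),
        reverse=True,
    )
    return ranked[:depth]
```

Under this reading, a retrieval policy would correspond to a choice of which tiers are visible and what `depth` to use; the bandit then learns which such choice pays off as memory matures.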
Data & Methods
- Benchmark / domain
  - Sequential portfolio allocation: 10 sector-diverse tickers at hourly resolution, 208 episodes (140 train / 40 val / 28 test). The test set includes a bear→bull regime shift.
  - Objective metrics: Sharpe (primary), Sortino, Calmar, cumulative return, max drawdown, win rate, tail ratio.
- Experimental protocol
  - Baselines: 4 non-LLM strategies (equal-weight, momentum-weighted, min-variance, inverse-momentum) plus 5 prior self-improving LLM-agent methods (Reflexion, ExpeL, FactorMiner, Meta-Reflexion, EvoTool), HyperAgent, and incremental AEL variants.
  - Seeds: main comparisons are reported across 5 random seeds for stochastic methods (the incremental build used 3 seeds in one figure).
  - All methods share the same tools, LLM, and data split; learning is frozen at test time.
- Core algorithms
  - Memory-policy selection: each policy m keeps a Beta(α_m, β_m) posterior; at each episode, sample μ̃_m from every posterior and select the policy with the largest sample; after the episode, convert the outcome s_t to a reward r̃_t = clip((s_t + 1)/2, 0, 1) and update α and β of the chosen arm.
  - Retrieval ranking score: composite function of ticker/sector/tool match, entry quality, recency (exponential decay), and tier boost (episodic/semantic/procedural).
  - Slow-window reflection: the LLM ingests per-ticker summaries, per-tool accuracy over the window, market-side info, and recent reflections; it outputs a causal insight, a regime label, and a confidence; the insight is injected into next-window prompts (not stored in memory).
  - Evolution: new retrieval policies (or planners in the full variant) are generated by the LLM only when reflection indicates persistent underperformance (e.g., average bandit reward below a threshold).
- Ablation & credit study
  - Evaluated uniform reward (default), FCC (structural/counterfactual/Shapley), and LLM-FCC. Uniform credit performed best; FCC and LLM-FCC harmed results.
  - Tested nine variant changes (remove warm-up/reflection, add planner evolution, per-tool selection, cold-start initialization, skill extraction, alternate credit methods); the simplest AEL configuration was optimal.
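The memory-policy selection step under Core algorithms can be sketched as a small Thompson Sampling bandit. This is a minimal illustration, not the authors' code: the class and function names are hypothetical, the Beta(1, 1) prior is an assumption, and the fractional update (add r̃_t to α and 1 − r̃_t to β) is one common way to handle rewards in [0, 1] that the summary does not confirm.

```python
import random


def to_reward(outcome):
    """Map an episode outcome s_t in [-1, 1] to a reward: clip((s_t + 1) / 2, 0, 1)."""
    return min(max((outcome + 1.0) / 2.0, 0.0), 1.0)


class MemoryPolicyBandit:
    """Thompson Sampling over memory-retrieval policies with Beta posteriors."""

    def __init__(self, policy_names, seed=None):
        self.rng = random.Random(seed)
        # Assumption: every arm starts from a uniform Beta(1, 1) prior.
        self.alpha = {name: 1.0 for name in policy_names}
        self.beta = {name: 1.0 for name in policy_names}

    def select(self):
        """Draw one sample per arm from its posterior and return the argmax arm."""
        samples = {
            name: self.rng.betavariate(self.alpha[name], self.beta[name])
            for name in self.alpha
        }
        return max(samples, key=samples.get)

    def update(self, name, outcome):
        """Update only the chosen arm with the transformed reward."""
        reward = to_reward(outcome)
        # Assumption: fractional-reward update; the source only says
        # that alpha and beta of the chosen arm are updated.
        self.alpha[name] += reward
        self.beta[name] += 1.0 - reward
```

A training loop would call `select()` at the start of each episode and `update()` with the episode outcome at the end. Since the source freezes learning at test time, the analogue here would be to stop calling `update()` on test episodes.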
Implications for AI Economics
- Bottleneck is interpretive use of experience, not raw memory quantity
  - In noisy, regime-switching economic environments, the main value comes from diagnosing when signals are misleading and giving agents an interpretive frame to use stored experience. That suggests investments in meta-reasoning and causal interpretation can yield outsized returns compared with adding more memory or modular complexity.
- Practical gains in risk-adjusted performance and stability
  - AEL’s higher Sharpe and lower variance indicate that two-timescale self-improvement can produce more reliable economic decision agents, which matters for deployment in finance, market-making, or automated policy tools where stability across random seeds/initializations is important.
- Caution on complex credit-assignment and overengineering
  - Sophisticated attribution (Shapley-style, LLM-driven counterfactuals) can introduce noise and worsen performance in high-variance economic environments. Simple uniform returns may be preferable when feedback is noisy and sparse.
- Design lessons for economic agents
  - Two-timescale “diagnose-before-prescribe” architectures are promising: fast adaptation of retrieval/use policies with slower, aggregated causal reflection leads to targeted structural changes and avoids overfitting to short-term noise.
  - Procedural memory (promoting stable, high-confidence rules) offers a compact way to inject distilled domain insights into agents’ decision processes without heavy online retrieval costs.
- Limitations and risks
  - Domain specificity: results are demonstrated on a controlled sequential portfolio benchmark; transfer to more complex market microstructure, multi-agent strategic settings, or longer horizons remains to be validated.
  - LLM noise & costs: reliance on LLM reflection has compute and reliability costs; misdiagnoses or overconfident but incorrect causal inferences could introduce systemic risk if used in production without oversight.
  - Regulatory & interpretability concerns: automatic policy evolution (code generation) raises auditability questions; procedural rules injected into prompts need traceability for compliance.
- Future directions relevant to AI economics
  - Test AEL in multi-agent markets, limit-order book settings, and longer-horizon macroeconomic forecasting to evaluate robustness to strategic interactions.
  - Formalize when simple credit signals suffice and when more structured attribution is necessary, which is important for tuning agent learning in different economic regimes.
  - Study human-in-the-loop reflection or constrained LLM-generated rules to balance autonomy with interpretability and regulatory compliance.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We introduce Agent Evolving Learning (AEL), a two-timescale framework in which a Thompson Sampling bandit at the fast timescale learns which memory retrieval policy to apply each episode, while LLM-driven reflection at the slow timescale diagnoses failure patterns and injects causal insights into the agent's decision prompt. (Training Effectiveness) | positive | high | framework architecture / learning framework | 0.06 |
| On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47. (Decision Quality) | positive | high | Sharpe ratio (portfolio performance metric) | n=5; Sharpe ratio of 2.13±0.47; 0.12 |
| AEL outperforms five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches on the benchmark. (Decision Quality) | positive | high | relative performance (ranking) and variance across methods | n=5; 0.12 |
| A nine-variant ablation reveals that memory and reflection together produce a 58% cumulative improvement over the stateless baseline. (Decision Quality) | positive | high | cumulative improvement in performance relative to stateless baseline | n=5; 58% cumulative improvement; 0.12 |
| Every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. (Decision Quality) | negative | high | performance (e.g., Sharpe ratio or other benchmark metrics) relative to memory+reflection baseline | n=5; 0.12 |
| The central obstacle to agent self-improvement is not what to remember but how to use what has been remembered (which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change). (Training Effectiveness) | positive | medium | bottleneck characterization for agent self-improvement | 0.01 |
| The results demonstrate a 'less is more' pattern: simpler combination (memory + reflection) yields better performance than adding architectural complexity. (Training Effectiveness) | positive | high | relative performance of simpler vs. more complex agent configurations | n=5; 0.12 |