AEL — a two-timescale approach pairing bandit-guided memory retrieval with LLM-driven reflection — lifts trading-agent Sharpe to 2.13 and reduces variance on a 208-episode portfolio benchmark, beating prior self-improving methods; notably, adding further architectural complexity harms performance, suggesting the bottleneck is diagnosing how to use past experience rather than more components.

AEL: Agent Evolving Learning for Open-Ended Environments

Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu, Han Zhang, Dimitris N. Metaxas · April 23, 2026

Source: arXiv · Paper type: other · Evidence: medium · Relevance: 7/10

AEL combines a Thompson-Sampling bandit for retrieval-policy selection with LLM-driven reflective updates at a slower timescale, yielding higher Sharpe (2.13±0.47) and lower variance on a 208-episode portfolio benchmark and outperforming prior self-improving and non-LLM baselines.

LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a "less is more" pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This demonstrates that the bottleneck in agent self-improvement is self-diagnosing how to use experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.

Summary

Main Finding

AEL (Agent Evolving Learning) is a two-timescale self-improvement framework for LLM agents that learns not only what to remember but—critically—how to use memories. By combining a fast-timescale Thompson-Sampling bandit that selects memory-retrieval policies with a slow-timescale LLM-driven reflection that produces causal diagnoses and (when needed) generates new policies, AEL materially improves risk-adjusted performance and robustness on a sequential portfolio benchmark. On the primary benchmark (10 sector-diverse tickers, 208 episodes, 5 seeds), AEL attains Sharpe 2.13 ± 0.47, outperforming prior self-improving methods and all non-LLM baselines tested. The authors find that memory + reflection produce the bulk of gains (cumulative +58% vs. a stateless baseline) and that adding further complexity typically degrades performance.

Key Points

Two-timescale architecture
- Fast timescale: Thompson Sampling bandit selects among memory retrieval policies episode-by-episode.
- Slow timescale: LLM reflection aggregates windows of episodes to produce causal diagnoses and regime labels, and to decide when to evolve retrieval policies (or planners/tools when enabled).
- Principle: diagnose before prescribe—structural changes are triggered only when reflection indicates they are needed.
Memory design
- Three-tier memory: episodic (raw logs), semantic (distilled cross-episode patterns), procedural (promoted high-confidence rules injected into prompts).
- Retrieval policies vary tier visibility, retrieval depth, formatting (5 initial policies); bandit learns which policy to use as memory matures.
- Retrieved entries are ranked by a composite relevance score combining feature match, quality, recency, and tier boost.
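
The composite relevance score can be illustrated with a minimal sketch. This is not the authors' implementation: the additive form, the match weights, the half-life, the tier boosts, and every field name below are hypothetical choices made only to show how feature match, quality, recency, and tier boost might combine into one ranking key.

```python
from dataclasses import dataclass

# All constants are hypothetical; this summary does not give the paper's values.
TIER_BOOST = {"episodic": 0.0, "semantic": 0.1, "procedural": 0.2}
HALF_LIFE_EPISODES = 20.0


@dataclass
class MemoryEntry:
    ticker: str
    sector: str
    tool: str
    quality: float  # assumed to lie in [0, 1]
    episode: int    # episode at which the entry was written
    tier: str       # "episodic", "semantic", or "procedural"


def relevance(entry, query_ticker, query_sector, query_tool, current_episode):
    """Composite score: feature match + quality + recency + tier boost."""
    match = (
        0.5 * (entry.ticker == query_ticker)
        + 0.3 * (entry.sector == query_sector)
        + 0.2 * (entry.tool == query_tool)
    )
    age = max(current_episode - entry.episode, 0)
    recency = 0.5 ** (age / HALF_LIFE_EPISODES)  # exponential decay with age
    return match + entry.quality + recency + TIER_BOOST[entry.tier]


def retrieve(entries, k, **query):
    """Return the k highest-scoring entries for the given query."""
    ranked = sorted(entries, key=lambda e: relevance(e, **query), reverse=True)
    return ranked[:k]
```

Under this sketch, a retrieval policy would amount to a choice of `k`, of which tiers are passed in as `entries`, and of how the returned entries are formatted into the prompt.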
Learning signal & credit assignment
- Uniform credit: a scalar episode outcome s_t ∈ [−1, 1] is mapped to [0, 1] via clip((s_t + 1)/2, 0, 1) and used to update the Beta posterior of the selected bandit arm.
- Authors evaluated more complex credit methods (factored counterfactual credit — FCC, and LLM-driven FCC) but found they degraded performance in this noisy domain.
Ablation & robustness
- Comprehensive nine-variant ablation: removing reflection or memory, enabling planner evolution, per-tool selection, cold-start, skill extraction, or alternate credit schemes—nearly all modifications reduced performance.
- Key empirical pattern: “less is more” — simplest AEL configuration (memory bandit + LLM reflection) yielded best mean and lowest variance across seeds.
Empirical gains
- Incremental build: Stateless → +Memory → AEL produced Sharpe 1.35 → 1.68 → 2.13 (memory +24%, reflection +27%).
- AEL achieved highest Sharpe, Sortino, and Calmar ratios and lowest Max Drawdown variance among LLM-based approaches.
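
The incremental-build percentages can be checked directly from the three reported Sharpe ratios. The short calculation below uses only the numbers quoted above; it shows that the 58% cumulative figure is the ratio of the endpoints, which equals the product of the two stage-wise ratios rather than the sum of the two percentages.

```python
# Sharpe ratios as reported for the incremental build.
sharpe = {"stateless": 1.35, "memory": 1.68, "ael": 2.13}


def pct_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new / old - 1.0)


memory_gain = pct_gain(sharpe["memory"], sharpe["stateless"])   # about 24.4
reflection_gain = pct_gain(sharpe["ael"], sharpe["memory"])     # about 26.8
cumulative_gain = pct_gain(sharpe["ael"], sharpe["stateless"])  # about 57.8

# The stage-wise gains compound multiplicatively into the cumulative gain.
compounded = 100.0 * ((1 + memory_gain / 100) * (1 + reflection_gain / 100) - 1)
```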
Implementation notes
- Default backbone LLM: Claude Haiku 4.5; same 12 tools across methods; experiments freeze learning at test time (bandits and memory read-only).
- Code and data: https://github.com/WujiangXu/AEL (paper is a preprint under review).

Data & Methods

Benchmark / domain
- Sequential portfolio allocation: 10 sector-diverse tickers at hourly resolution, 208 episodes (140 train / 40 val / 28 test). Test set includes a bear→bull regime shift.
- Objective metrics: Sharpe (primary), Sortino, Calmar, cumulative return, max drawdown, win rate, tail ratio.
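
For readers unfamiliar with the metrics, the sketch below gives textbook definitions of three of them from a list of per-period returns. How the paper annualizes hourly returns, which risk-free rate it uses, and which downside target it applies are not stated in this summary, so `periods_per_year`, `risk_free`, and `target` are placeholder parameters rather than the benchmark's settings.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, periods_per_year=1.0, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


def sortino_ratio(returns, periods_per_year=1.0, target=0.0):
    """Like Sharpe, but penalizes only returns that fall below the target."""
    downside_sq = [min(r - target, 0.0) ** 2 for r in returns]
    downside_dev = math.sqrt(mean(downside_sq))
    return (mean(returns) - target) / downside_dev * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst
```

Both ratios divide by a dispersion term, so they raise `ZeroDivisionError` on a constant return series (Sharpe) or on a series with no returns below the target (Sortino); a production version would need to handle those cases explicitly.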
Experimental protocol
- Baselines: 4 non-LLM strategies (equal-weight, momentum-weighted, min-variance, inverse-momentum) plus 5 prior self-improving LLM-agent methods (Reflexion, ExpeL, FactorMiner, Meta-Reflexion, EvoTool), HyperAgent, and incremental AEL variants.
- Seeds: main comparisons reported across 5 random seeds for stochastic methods (incremental build used 3 seeds in one figure).
- All methods share the same tools, LLM, and data split; learning is frozen at test time.
Core algorithms
- Memory-policy selection: each policy m keeps a Beta(α, β) posterior; at each episode, sample μ̃_m from every posterior and pick the argmax; after the episode, convert the outcome s_t to a reward r̃_t = clip((s_t + 1)/2, 0, 1) and update α, β of the chosen arm.
- Retrieval ranking score: composite function of ticker/sector/tool match, entry quality, recency (exponential decay), and tier boost (episodic/semantic/procedural).
- Slow-window reflection: LLM ingests per-ticker summaries, per-tool accuracy over window, market-side info, and recent reflections; outputs causal insight, regime label, confidence; insight injected into next-window prompts (not stored into memory).
- Evolution: new retrieval policies (or planners in full variant) are generated by the LLM only when reflection indicates persistent underperformance (e.g., average bandit reward below threshold).
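
The memory-policy selection step can be sketched as a small Thompson Sampling bandit. Two details are assumptions, since the summary does not specify them: the uniform Beta(1, 1) prior, and the fractional update that adds the reward to α and one minus the reward to β (a common alternative is to draw a Bernoulli outcome with probability equal to the reward). The class and method names are hypothetical.

```python
import random


class MemoryPolicyBandit:
    """Thompson Sampling over retrieval policies with Beta posteriors."""

    def __init__(self, policy_names, seed=None):
        self.rng = random.Random(seed)
        # Assumption: uniform Beta(1, 1) prior for every arm.
        self.posterior = {name: [1.0, 1.0] for name in policy_names}

    def select(self):
        """Sample one value per arm from its posterior and return the argmax."""
        samples = {
            name: self.rng.betavariate(alpha, beta)
            for name, (alpha, beta) in self.posterior.items()
        }
        return max(samples, key=samples.get)

    def update(self, name, outcome):
        """Map outcome s_t in [-1, 1] to a reward in [0, 1] and update the arm."""
        reward = min(max((outcome + 1.0) / 2.0, 0.0), 1.0)
        # Assumption: fractional (soft) Beta update for real-valued rewards.
        self.posterior[name][0] += reward
        self.posterior[name][1] += 1.0 - reward
        return reward

    def add_policy(self, name):
        """Register a newly evolved policy with a fresh prior."""
        self.posterior.setdefault(name, [1.0, 1.0])
```

A training loop would call `select()` at the start of each episode, run the agent with the chosen policy, and pass the episode outcome to `update()`. At test time, where learning is frozen, only `select()` would be called.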
Ablation & credit study
- Evaluated uniform reward (default), FCC (structural/counterfactual/Shapley), and LLM-FCC. Uniform credit performed best; FCC and LLM-FCC harmed results.
- Tested nine variant changes (remove warm-up/reflection, add planner evolution, per-tool selection, cold-start initialization, skill extraction, alternate credit methods); simplest AEL was optimal.

Implications for AI Economics

Bottleneck is interpretive use of experience, not raw memory quantity
- In noisy, regime-switching economic environments, the main value comes from diagnosing when signals are misleading and giving agents an interpretive frame to use stored experience. That suggests investments in meta-reasoning and causal interpretation can yield outsized returns compared with adding more memory or modular complexity.
Practical gains in risk-adjusted performance and stability
- AEL’s higher Sharpe and lower variance indicate that two-timescale self-improvement can produce more reliable economic decision agents—important for deployment in finance, market-making, or automated policy tools where stability across random seeds/initializations matters.
Caution on complex credit-assignment and overengineering
- Sophisticated attribution (Shapley-style, LLM-driven counterfactuals) can introduce noise and worsen performance in high-variance economic environments. Simple uniform returns may be preferable when feedback is noisy and sparse.
Design lessons for economic agents
- Two-timescale “diagnose-before-prescribe” architectures are promising: fast adaptation of retrieval/use policies with slower, aggregated causal reflection leads to targeted structural changes and avoids overfitting to short-term noise.
- Procedural memory (promoting stable, high-confidence rules) offers a compact way to inject distilled domain insights into agents’ decision processes without heavy online retrieval costs.
Limitations and risks
- Domain specificity: results are demonstrated on a controlled sequential portfolio benchmark; transfer to more complex market microstructure, multi-agent strategic settings, or longer horizons remains to be validated.
- LLM noise & costs: reliance on LLM reflection has compute and reliability costs; misdiagnoses or overconfident but incorrect causal inferences could introduce systemic risk if used in production without oversight.
- Regulatory & interpretability concerns: automatic policy evolution (code generation) raises auditability questions; procedural rules injected into prompts need traceability for compliance.
Future directions relevant to AI economics
- Test AEL in multi-agent markets, limit-order book settings, and longer-horizon macroeconomic forecasting to evaluate robustness to strategic interactions.
- Formalize when simple credit signals suffice vs. when more structured attribution is necessary—important for tuning agent learning in different economic regimes.
- Study human-in-the-loop reflection or constrained LLM-generated rules to balance autonomy with interpretability and regulatory compliance.


Assessment

Paper Type: other

Evidence Strength: medium. The paper reports consistent performance gains (higher Sharpe, lower variance) across five random seeds and a nine-variant ablation, and compares against published self-improving methods and non-LLM baselines; however, the evaluation is limited to a single synthetic/benchmarked task (10 tickers, 208 episodes), uses a small number of seeds, lacks detailed statistical tests reported in the abstract, and may be sensitive to prompt/hyperparameter choices and dataset selection.

Methods Rigor: medium. The design uses a clear two-timescale algorithmic framework, includes an extensive ablation study and multiple baselines, and makes code/data available, but rigor is limited by scope (single benchmark), modest seed count, likely dependence on specific LLM prompts/models and implementation details, and absence (in the abstract) of robustness checks across domains, transaction-cost modelling, or out-of-sample market regimes.

Sample: Sequential portfolio benchmark consisting of 10 sector-diverse tickers evaluated over 208 episodes with 5 random seeds; comparisons include five published self-improving methods, several non-LLM baselines and multiple LLM-based approaches; full code and data reportedly available at the authors' GitHub.

Themes: productivity innovation

Generalizability
- Single-domain evaluation (financial portfolio trading): may not generalize to other open-ended tasks
- Small number of assets (10 tickers) and episodes (208): limited temporal and cross-sectional diversity
- Possibly simulated or historical benchmark data that may not capture real-world frictions (transaction costs, market impact, non-stationarity)
- Performance likely dependent on LLM model, prompt engineering, and hyperparameters
- Only five random seeds reported: limited evidence on robustness to initialization and randomness
- Potential for selection or tuning bias in choice of baseline tasks and reported variants

Claims (7)

Claim	Direction	Confidence	Outcome	Details
We introduce Agent Evolving Learning (AEL), a two-timescale framework in which a Thompson Sampling bandit at the fast timescale learns which memory retrieval policy to apply each episode, while LLM-driven reflection at the slow timescale diagnoses failure patterns and injects causal insights into the agent's decision prompt. Training Effectiveness	positive	high	framework architecture / learning framework	0.06
On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47. Decision Quality	positive	high	Sharpe ratio (portfolio performance metric)	n=5 Sharpe ratio of 2.13±0.47 0.12
AEL outperforms five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches on the benchmark. Decision Quality	positive	high	relative performance (ranking) and variance across methods	n=5 0.12
A nine-variant ablation reveals that memory and reflection together produce a 58% cumulative improvement over the stateless baseline. Decision Quality	positive	high	cumulative improvement in performance relative to stateless baseline	n=5 58% cumulative improvement 0.12
Every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. Decision Quality	negative	high	performance (e.g., Sharpe ratio or other benchmark metrics) relative to memory+reflection baseline	n=5 0.12
The central obstacle to agent self-improvement is not what to remember but how to use what has been remembered (which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change). Training Effectiveness	positive	medium	bottleneck characterization for agent self-improvement	0.01
The results demonstrate a 'less is more' pattern: simpler combination (memory + reflection) yields better performance than adding architectural complexity. Training Effectiveness	positive	high	relative performance of simpler vs. more complex agent configurations	n=5 0.12