A feedback-grounded training method improves long-horizon LLM-agent performance, raising Pass@128 by up to 14% over outcome-driven and experience-based baselines, by teaching models how to recover from failures rather than only optimizing final success. Because the gains are reported under fixed interaction budgets, the approach suggests better sample efficiency: firms may be able to extract more performance from the same interaction logs and reduce marginal interaction costs.

Internalizing Agency from Reflective Experience

Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang · March 17, 2026

Source: arXiv · Paper type: other · Evidence strength: medium · Relevance: 7/10

LEAFE turns rich environment feedback into corrective supervision via backtracking and distillation, substantially improving long-horizon agentic performance (up to +14% Pass@128) under fixed interaction budgets compared with outcome-driven and prior experience-based baselines.

Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.
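To make the distinction between Pass@1 and Pass@k concrete, the toy calculation below compares two made-up policies. It is an illustrative sketch, not taken from the paper: the per-problem success probabilities are hypothetical, attempts are assumed to be independent, and the function names are ours.

```python
def at_least_one_success(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(success_probs, k):
    """Average Pass@k over problems, given per-problem success probabilities."""
    return sum(at_least_one_success(p, k) for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities (illustrative only).
base_policy = [0.50, 0.50, 0.02, 0.02]       # unreliable, but nonzero everywhere
sharpened_policy = [1.00, 1.00, 0.00, 0.00]  # reliable only on already-solved problems

for k in (1, 128):
    print(
        f"k={k}: base={mean_pass_at_k(base_policy, k):.3f} "
        f"sharpened={mean_pass_at_k(sharpened_policy, k):.3f}"
    )
```

In this toy setup the sharpened policy wins at k=1 but cannot improve with more samples, because it never solves the two problems it assigns zero probability to. That is the sense in which sharpening can raise Pass@1 without expanding the set of solvable problems.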

Summary

Main Finding

LEAFE (Learning Feedback-Grounded Agency from Reflective Experience) improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback. Across interactive coding and agentic tasks under fixed interaction budgets, it improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), with gains of up to 14% on Pass@128. It does so by converting rich feedback into actionable corrective supervision rather than optimizing only final success signals.

Key Points

Problem identified: Outcome-driven post-training (optimizing final rewards) underutilizes rich environment feedback and tends to cause "distribution sharpening": the policy gets better at reproducing a narrow set of already-successful behaviors without broadening its problem-solving and recovery capacity in long-horizon settings.
Core idea: Internalize recovery agency from reflective experience so the model learns how to recover from failures, not just reproduce end-successes.
LEAFE algorithm (high level):
- During exploration, summarize environment feedback into compact, actionable "experience" items.
- Backtrack to earlier decision points identified as causal to failures and re-explore alternative action branches using the summarized experience to guide corrections.
- Distill the corrected decision trajectories into the policy via supervised fine-tuning so the model can execute recovery behaviors in future interactions.
Empirical outcomes: Pass@1 improves consistently over the base model, and Pass@k is higher than for the baselines (GRPO, Early Experience), with gains of up to 14% on Pass@128. The gains hold across diverse interactive coding and agentic tasks under fixed interaction budgets.
Advantage: LEAFE uses the same environmental interactions more effectively by converting feedback into targeted training signals that expand the model’s actionable repertoire rather than narrowing it.
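The three LEAFE steps listed above (summarize feedback, backtrack and re-explore, distill) can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the authors' implementation: the environment, `TARGET`, `run_episode`, `summarize`, and `explore_with_backtracking` are all hypothetical, the summarizer is a string parser where a real system would presumably prompt an LLM, and the choice to store (state prefix, corrected action) pairs without the experience text is our guess at what "internalizing" could look like.

```python
from dataclasses import dataclass

# Toy stand-in environment: only the action sequence TARGET solves the task.
# On failure, the feedback names the first wrong step (a proxy for rich feedback).
TARGET = ["open", "edit", "test", "commit"]
ACTIONS = ["open", "edit", "test", "commit"]


def run_episode(actions):
    """Return (success, feedback) for a full action sequence."""
    for step, (got, want) in enumerate(zip(actions, TARGET)):
        if got != want:
            return False, f"step {step}: '{got}' failed"
    return True, "ok"


@dataclass
class Experience:
    step: int   # decision point to backtrack to
    avoid: str  # action that the feedback blames


def summarize(feedback):
    """Hypothetical summarizer: parse the feedback string into an Experience."""
    step_part, rest = feedback.split(":", 1)
    return Experience(step=int(step_part.split()[1]), avoid=rest.split("'")[1])


def explore_with_backtracking(initial, budget):
    """Re-explore from blamed decision points; return SFT pairs on success."""
    actions = list(initial)
    ruled_out = {}  # step -> actions that feedback has blamed so far
    for _ in range(budget):
        success, feedback = run_episode(actions)
        if success:
            # Keep only corrections that ended in a verified success.
            return [(tuple(actions[:s]), actions[s]) for s in sorted(ruled_out)]
        exp = summarize(feedback)
        ruled_out.setdefault(exp.step, set()).add(exp.avoid)
        candidates = [a for a in ACTIONS if a not in ruled_out[exp.step]]
        if not candidates:
            break
        actions[exp.step] = candidates[0]  # branch from the earlier decision point
    return []  # budget exhausted: nothing to distill


sft_pairs = explore_with_backtracking(["open", "test", "edit", "commit"], budget=10)
for prefix, target in sft_pairs:
    print(f"state prefix={prefix} -> corrected action={target!r}")
```

Each returned pair would serve as one supervised fine-tuning example: the input is the state at the decision point and the target is the action that led to recovery. The `budget` argument plays the role of the fixed interaction budget.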

Data & Methods

Tasks: A suite of long-horizon interactive tasks emphasizing planning, acting, and recovery; examples include multi-step coding problems and agentic tasks with rich feedback channels.
Interaction regime: Fixed interaction budgets to simulate realistic deployment constraints and emphasize sample efficiency.
Experience collection: Agents explore; environment returns rich feedback (error messages, intermediate observations). LEAFE summarizes that feedback into actionable experience (e.g., what went wrong, where to change).
Reflective procedure: Identify earlier decision points causally linked to failure, backtrack, and perform targeted alternative explorations conditioned on the summarized experience.
Training: Distill corrected trajectories into the model using supervised fine-tuning (experience-guided corrections as targets), rather than relying solely on reward signals or final-outcome optimization.
Baselines: Outcome-driven RL (GRPO) and a prior experience-based method (Early Experience). Evaluation metric: Pass@k (with emphasis on Pass@1 and Pass@128), the fraction of problems for which at least one of k sampled attempts succeeds.
Results summary: LEAFE improves Pass@1 over the base model and Pass@k over the baselines across task suites, with a reported maximum gain of 14% on Pass@128.
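For readers unfamiliar with Pass@k, the sketch below shows one standard way to estimate it from n sampled attempts per problem, of which c succeed: the unbiased estimator 1 - C(n-c, k) / C(n, k) that is widely used in code-generation evaluation. Whether the paper computes Pass@k exactly this way is an assumption on our part, and the function names are ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n sampled attempts with c successes."""
    if k > n:
        raise ValueError("k must not exceed the number of sampled attempts n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset contains a success.
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average per-problem Pass@k; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: three problems, 4 attempts each, with 0, 1, and 4 successes.
results = [(4, 0), (4, 1), (4, 4)]
print(benchmark_pass_at_k(results, k=1))
print(benchmark_pass_at_k(results, k=2))
```

Averaging per problem, rather than pooling all attempts, keeps an easy problem with many successes from masking problems that are never solved.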

Implications for AI Economics

Productivity and quality of automation: LEAFE increases the effective problem-solving breadth of deployed LLM agents under limited interaction budgets, which can raise the productivity and reliability of automated coding, customer support, and other agentic services.
Returns to scale and efficiency: By extracting more training value from the same environment interactions, LEAFE improves sample efficiency, lowering the marginal data and interaction cost of each performance gain. This could shift the cost curve of deploying agentic systems (less need for massive extra interactions or reward engineering).
Competitive dynamics: Firms that implement feedback-grounded learning can achieve faster real-world capability improvements without proportional increases in compute or data, potentially altering competitive advantages toward organizations that instrument and leverage rich feedback pipelines.
Labor-market impacts: More robust recovery and broader problem-solving by agents could accelerate automation of complex, multi-step tasks (e.g., software engineering workflows), increasing substitution pressure for certain labor tasks while creating demand for roles that design feedback systems, curate experience pipelines, or supervise agent learning.
Policy and investment signals: Investing in systems that capture and structure environment feedback (logging, error annotation, backtracking mechanisms) may yield high returns if LEAFE-like methods generalize; policy should consider standards for feedback access, privacy, and reliability because feedback quality materially affects agent behavior and economic consequences.
Risk profile: Better recovery capability reduces brittle failure modes but may also enable more autonomous behavior in novel settings — amplifying both benefits and potential misuse. Regulators and firms should monitor deployed agents’ exploratory/backtracking mechanisms and audit how reflective experiences are used.
General equilibrium: If many deployed agents adopt LEAFE-like learning, the aggregate increase in agent competence could accelerate the diffusion of agentic automation across sectors, affecting wages, task allocation, and demand for complementary capital (tooling, monitoring, retraining systems).

Limitations and open questions relevant for economic assessment:
- Generality: Reported gains are on interactive coding and agentic tasks; transfer to other domains depends on feedback richness and the ability to identify causal decision points.
- Costs: The reflective/backtracking pipeline and supervised fine-tuning impose engineering and compute costs that need to be weighed against interaction savings.
- Feedback quality: LEAFE’s benefits depend on informative, actionable feedback; environments with noisy or adversarial feedback may limit improvements.
- Externalities: Widespread adoption may change the value of data and feedback infrastructures, with implications for market concentration and access to performance-improving signals.

Assessment

Paper Type: other
Evidence Strength: medium. The paper presents consistent, within-sample performance improvements (up to +14% Pass@128) across multiple interactive coding and agentic tasks, which is credible evidence that the algorithm improves agentic behavior in the evaluated settings; however, there is no measurement of real-world economic outcomes, limited domain scope, and uncertain robustness to noisy/adversarial feedback or different deployment conditions, so extrapolation to economic impacts is speculative.
Methods Rigor: medium. The evaluation uses controlled baseline comparisons, fixed interaction budgets, and relevant metrics (Pass@k), and the algorithmic procedure (backtracking, experience summarization, distillation) is well-motivated; but the summary lacks details on statistical significance, ablation studies, sensitivity to hyperparameters/model size, real-world deployment tests, and computational/engineering cost accounting, which prevents a high rigor rating.
Sample: A suite of long-horizon interactive tasks including multi-step coding problems and agentic environments that provide rich feedback (error messages, intermediate observations); agents explore under fixed interaction budgets, collect summarized 'experience' items, and are evaluated by Pass@k (Pass@1 to Pass@128) against baselines GRPO and Early Experience; specific dataset sizes, model families, and environment distributions are not detailed in the summary.
Themes: productivity, adoption, labor_markets
Identification: Controlled empirical comparisons. The authors implement LEAFE and evaluate it against outcome-driven (GRPO) and experience-based (Early Experience) baselines on the same suite of long-horizon interactive tasks under fixed interaction budgets, using Pass@k as the primary performance metric; gains are inferred from these within-task experimental comparisons rather than from an external causal identification strategy.
Generalizability:
- Evaluation limited to interactive coding and selected agentic tasks; results may not transfer to other domains (e.g., low-feedback or highly stochastic environments).
- Relies on rich, informative feedback channels (error messages, intermediate observations); performance may degrade with noisy, sparse, or adversarial feedback.
- Engineering, compute, and implementation costs for the reflective/backtracking and distillation pipeline may limit applicability in resource-constrained settings.
- Unclear sensitivity to model family/size and hyperparameters; results on one model class may not generalize to others.
- Benchmarks likely simulated or lab-style tasks rather than large-scale real-world deployments, so economic impacts are inferred rather than measured.

Claims (13)

Claim	Category	Direction	Confidence	Outcome	Details
LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback.	Output Quality	positive	high	Long-horizon agentic performance measured by Pass@k (Pass@1, Pass@k, Pass@128)	0.12
Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets.	Output Quality	positive	high	Pass@1 and Pass@k (fraction of problems solved among k candidate runs)	0.12
LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines.	Output Quality	positive	high	Pass@128 (absolute percentage point improvement)	14% 0.12
LEAFE converts rich environment feedback into actionable corrective supervision rather than optimizing only final success signals, which drives performance gains.	Other	positive	medium	Pass@k performance; also qualitative measure of learned recovery behavior (implicit)	0.07
Outcome-driven post-training (optimizing final rewards) underutilizes rich environment feedback and causes 'distribution sharpening': policies overfit a narrow set of successful behaviors and fail to broaden problem-solving/recovery capacity in long-horizon settings.	Other	negative	medium	Breadth of problem-solving/recovery capacity (inferred from failure modes and Pass@k performance across diverse cases)	0.07
LEAFE uses the same environmental interactions more effectively, improving sample efficiency under fixed interaction budgets.	Other	positive	medium	Sample efficiency operationalized as Pass@k achieved under fixed interaction budgets (performance per interaction)	0.07
LEAFE's gains occur across diverse interactive coding and agentic tasks with limited interaction budget.	Other	positive	medium	Pass@k across multiple task types (interactive coding and agentic tasks)	0.07
The LEAFE algorithmic procedure: summarize environment feedback into compact experience items; backtrack to earlier decision points causally linked to failures and re-explore corrective action branches; distill corrected trajectories into the policy via supervised fine-tuning.	Other	null_result	high	N/A (algorithmic procedure description rather than an outcome)	0.12
Distilling corrected decision trajectories into the model via supervised fine-tuning produces better recovery behavior than relying solely on reward signals or final-outcome optimization.	Other	positive	medium	Recovery behavior performance reflected in Pass@k (success rates) after training	0.07
LEAFE's benefits depend on informative, actionable feedback; environments with noisy or adversarial feedback may limit improvements.	Other	negative	medium	Change in Pass@k or recovery performance under degraded/noisy feedback (qualitative/conditional claim)	0.07
By extracting more training value from the same environment interactions, LEAFE reduces marginal data/interaction costs and shifts the cost curve of deploying agentic systems (improves returns-to-sample-effort).	Firm Productivity	positive	speculative	Effective cost per unit performance (implied reduction via higher Pass@k per interaction); not directly measured numerically	0.01
Widespread adoption of LEAFE-like learning could accelerate diffusion of agentic automation across sectors, affecting wages, task allocation, and demand for complementary capital (tooling, monitoring, retraining systems).	Task Allocation	mixed	speculative	Macro-level economic outcomes (productivity, wages, task allocation); not directly measured in the paper	0.01
Improved recovery capability from LEAFE reduces brittle failure modes but may also enable more autonomous behavior in novel settings, increasing both benefits and potential misuse risks.	AI Safety and Ethics	mixed	speculative	System brittleness and autonomy-related risk potential (qualitative; no direct empirical safety metrics reported)	0.01