A feedback-grounded training method boosts long-horizon LLM-agent performance, raising Pass@128 by up to 14% over strong baselines, by teaching models how to recover from failures rather than only optimizing final success. The approach increases sample efficiency, implying firms can extract more performance from the same interaction logs and reduce marginal interaction costs.
Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.
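The distribution-sharpening argument can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes each attempt on a problem succeeds independently with a fixed probability, and the two probability profiles (`sharpened`, `broad`) are hypothetical numbers, not results from the paper. It shows how a policy can score higher on Pass@1 while scoring lower on Pass@128.

```python
def expected_pass_at_k(solve_probs, k):
    """Expected Pass@k if problem i is solved with probability p_i per
    independent attempt: the mean over problems of 1 - (1 - p_i)^k."""
    return sum(1.0 - (1.0 - p) ** k for p in solve_probs) / len(solve_probs)


# Hypothetical per-attempt solve probabilities on four problems
# (illustrative numbers only, not taken from the paper).
sharpened = [1.0, 1.0, 0.0, 0.0]  # certain on two problems, hopeless on two
broad = [0.6, 0.6, 0.05, 0.05]    # less reliable, but nonzero everywhere

for k in (1, 128):
    print(k, expected_pass_at_k(sharpened, k), expected_pass_at_k(broad, k))
```

Under these assumptions the sharpened profile stays at 0.5 for every k, because extra attempts cannot help on problems it never solves, while the broad profile starts lower at k = 1 and approaches 1 as k grows.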
Summary
Main Finding
LEAFE (Learning Feedback-Grounded Agency from Reflective Experience) substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback. Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k across interactive coding and agentic tasks under fixed interaction budgets — up to a 14% absolute improvement on Pass@128 — by converting rich feedback into actionable corrective supervision rather than optimizing only final success signals.
Key Points
- Problem identified: Outcome-driven post-training (optimizing final rewards) underutilizes rich environment feedback and causes "distribution sharpening": policies overfit a narrow set of successful behaviors and fail to broaden problem-solving and recovery capacity in long-horizon settings.
- Core idea: Internalize recovery agency from reflective experience so the model learns how to recover from failures, not just reproduce end-successes.
- LEAFE algorithm (high level):
  - During exploration, summarize environment feedback into compact, actionable "experience" items.
  - Backtrack to earlier decision points identified as causal to failures and re-explore alternative action branches using the summarized experience to guide corrections.
  - Distill the corrected decision trajectories into the policy via supervised fine-tuning so the model can execute recovery behaviors in future interactions.
- Empirical outcomes: Consistent improvements in Pass@1 over the base model and higher Pass@k than baselines (GRPO, Early Experience), with gains of up to 14% on Pass@128. Gains hold across diverse interactive coding and agentic tasks under fixed interaction budgets.
- Advantage: LEAFE uses the same environmental interactions more effectively by converting feedback into targeted training signals that expand the model’s actionable repertoire rather than narrowing it.
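The explore, backtrack, and distill steps of LEAFE can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the authors' implementation. The task (`TARGET`), the action set, the stub `policy`, the feedback string format, and every function name are hypothetical. The distillation step is reduced to collecting (state, corrected action) pairs that a supervised fine-tuning job could consume.

```python
import re
from dataclasses import dataclass

# Hypothetical toy task: the episode succeeds only if the agent emits
# TARGET step by step. None of these names come from the paper.
TARGET = ["open", "read", "patch", "test"]
ACTIONS = ["open", "read", "patch", "test", "skip"]
DEFAULT = ["open", "read", "skip", "test"]  # base policy's flawed first guess


@dataclass(frozen=True)
class Experience:
    step: int   # decision point the feedback blames
    avoid: str  # action that failed there


def env_feedback(trajectory):
    """Return (success, feedback); feedback names the first failing step."""
    for i, action in enumerate(trajectory):
        if action != TARGET[i]:
            return False, f"step {i}: action '{action}' failed"
    return True, ""


def summarize(feedback):
    """Turn raw feedback text into a compact, actionable experience item."""
    match = re.match(r"step (\d+): action '(\w+)' failed", feedback)
    return Experience(step=int(match.group(1)), avoid=match.group(2))


def policy(step, experiences):
    """Stub policy: keep the default action unless experience rules it out."""
    avoided = {e.avoid for e in experiences if e.step == step}
    if DEFAULT[step] not in avoided:
        return DEFAULT[step]
    return next(a for a in ACTIONS if a not in avoided)


def explore(budget):
    """Roll out, summarize failures, backtrack, and retry within a budget."""
    experiences, prefix = [], []
    for _ in range(budget):
        trajectory = list(prefix)
        for step in range(len(prefix), len(TARGET)):
            trajectory.append(policy(step, experiences))
        success, feedback = env_feedback(trajectory)
        if success:
            # Corrected decisions become (state, action) targets for SFT.
            blamed = sorted({e.step for e in experiences})
            pairs = [(tuple(trajectory[:s]), trajectory[s]) for s in blamed]
            return trajectory, pairs
        experience = summarize(feedback)
        experiences.append(experience)
        prefix = trajectory[: experience.step]  # backtrack to blamed step
    return None, []


trajectory, sft_pairs = explore(budget=8)
print(trajectory)
print(sft_pairs)
```

In this toy run the agent fails at step 2 three times, backtracks to that decision point each time, and succeeds on the fourth rollout. The single distilled pair maps the state `("open", "read")` to the corrected action `"patch"`. A real system would replace the regex summarizer and the stub policy with model calls.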
Data & Methods
- Tasks: A suite of long-horizon interactive tasks emphasizing planning, acting, and recovery; examples include multi-step coding problems and agentic tasks with rich feedback channels.
- Interaction regime: Fixed interaction budgets to simulate realistic deployment constraints and emphasize sample efficiency.
- Experience collection: Agents explore; environment returns rich feedback (error messages, intermediate observations). LEAFE summarizes that feedback into actionable experience (e.g., what went wrong, where to change).
- Reflective procedure: Identify earlier decision points causally linked to failure, backtrack, and perform targeted alternative explorations conditioned on the summarized experience.
- Training: Distill corrected trajectories into the model using supervised fine-tuning (experience-guided corrections as targets), rather than relying solely on reward signals or final-outcome optimization.
- Baselines: Outcome-driven RL (GRPO) and a prior experience-based method (Early Experience).
- Evaluation metric: Pass@k (with emphasis on Pass@1 and Pass@128), the fraction of problems solved by at least one of k candidate runs.
- Results summary: LEAFE consistently raises Pass@1 and Pass@k across task suites, with reported gains of up to 14% on Pass@128 over the strongest baselines.
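Pass@k can be estimated from a finite number of runs per problem. The summary does not say which estimator the paper uses. The sketch below assumes the widely used unbiased estimator from code-generation evaluation, 1 - C(n-c, k) / C(n, k), where n is the number of runs and c the number of successful runs for a problem. The function names and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem from n runs with c successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n runs has no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average the per-problem Pass@k over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run counts for three problems, for illustration only.
results = [(4, 0), (4, 1), (4, 4)]
print(benchmark_pass_at_k(results, 1))
print(benchmark_pass_at_k(results, 4))
```

A problem with one success in four runs contributes 0.25 at k = 1 but 1.0 at k = 4, which is why Pass@k at large k rewards breadth of solvable problems rather than per-attempt reliability.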
Implications for AI Economics
- Productivity and quality of automation: LEAFE increases the effective problem-solving breadth of deployed LLM agents under limited interaction budgets, which can raise the productivity and reliability of automated coding, customer support, and other agentic services.
- Returns to scale and efficiency: By extracting more training value from the same environment interactions, LEAFE improves sample efficiency — reducing marginal data/interaction costs to improve performance. This shifts the cost curve of deploying agentic systems (less need for massive extra interactions or reward engineering).
- Competitive dynamics: Firms that implement feedback-grounded learning can achieve faster real-world capability improvements without proportional increases in compute or data, potentially altering competitive advantages toward organizations that instrument and leverage rich feedback pipelines.
- Labor-market impacts: More robust recovery and broader problem-solving by agents could accelerate automation of complex, multi-step tasks (e.g., software engineering workflows), increasing substitution pressure for certain labor tasks while creating demand for roles that design feedback systems, curate experience pipelines, or supervise agent learning.
- Policy and investment signals: Investing in systems that capture and structure environment feedback (logging, error annotation, backtracking mechanisms) yields outsized returns; policy should consider standards for feedback access, privacy, and reliability because feedback quality materially affects agent behavior and economic consequences.
- Risk profile: Better recovery capability reduces brittle failure modes but may also enable more autonomous behavior in novel settings — amplifying both benefits and potential misuse. Regulators and firms should monitor deployed agents’ exploratory/backtracking mechanisms and audit how reflective experiences are used.
- General equilibrium: If many deployed agents adopt LEAFE-like learning, the aggregate increase in agent competence could accelerate the diffusion of agentic automation across sectors, affecting wages, task allocation, and demand for complementary capital (tooling, monitoring, retraining systems).
Limitations and open questions relevant for economic assessment:
- Generality: Reported gains are on interactive coding and agentic tasks; transfer to other domains depends on feedback richness and the ability to identify causal decision points.
- Costs: The reflective/backtracking pipeline and supervised fine-tuning impose engineering and compute costs that need to be weighed against interaction savings.
- Feedback quality: LEAFE’s benefits depend on informative, actionable feedback; environments with noisy or adversarial feedback may limit improvements.
- Externalities: Widespread adoption may change the value of data and feedback infrastructures, with implications for market concentration and access to performance-improving signals.
Assessment
Claims (13)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback. [Output Quality] | positive | high | Long-horizon agentic performance measured by Pass@k (Pass@1, Pass@k, Pass@128) | 0.12 |
| Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets. [Output Quality] | positive | high | Pass@1 and Pass@k (fraction of problems solved among k candidate runs) | 0.12 |
| LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines. [Output Quality] | positive | high | Pass@128 (absolute percentage point improvement) | 14%; 0.12 |
| LEAFE converts rich environment feedback into actionable corrective supervision rather than optimizing only final success signals, which drives performance gains. [Other] | positive | medium | Pass@k performance; also qualitative measure of learned recovery behavior (implicit) | 0.07 |
| Outcome-driven post-training (optimizing final rewards) underutilizes rich environment feedback and causes 'distribution sharpening': policies overfit a narrow set of successful behaviors and fail to broaden problem-solving/recovery capacity in long-horizon settings. [Other] | negative | medium | Breadth of problem-solving/recovery capacity (inferred from failure modes and Pass@k performance across diverse cases) | 0.07 |
| LEAFE uses the same environmental interactions more effectively, improving sample efficiency under fixed interaction budgets. [Other] | positive | medium | Sample efficiency operationalized as Pass@k achieved under fixed interaction budgets (performance per interaction) | 0.07 |
| LEAFE's gains occur across diverse interactive coding and agentic tasks with limited interaction budget. [Other] | positive | medium | Pass@k across multiple task types (interactive coding and agentic tasks) | 0.07 |
| The LEAFE algorithmic procedure: summarize environment feedback into compact experience items; backtrack to earlier decision points causally linked to failures and re-explore corrective action branches; distill corrected trajectories into the policy via supervised fine-tuning. [Other] | null_result | high | N/A (algorithmic procedure description rather than an outcome) | 0.12 |
| Distilling corrected decision trajectories into the model via supervised fine-tuning produces better recovery behavior than relying solely on reward signals or final-outcome optimization. [Other] | positive | medium | Recovery behavior performance reflected in Pass@k (success rates) after training | 0.07 |
| LEAFE's benefits depend on informative, actionable feedback; environments with noisy or adversarial feedback may limit improvements. [Other] | negative | medium | Change in Pass@k or recovery performance under degraded/noisy feedback (qualitative/conditional claim) | 0.07 |
| By extracting more training value from the same environment interactions, LEAFE reduces marginal data/interaction costs and shifts the cost curve of deploying agentic systems (improves returns-to-sample-effort). [Firm Productivity] | positive | speculative | Effective cost per unit performance (implied reduction via higher Pass@k per interaction); not directly measured numerically | 0.01 |
| Widespread adoption of LEAFE-like learning could accelerate diffusion of agentic automation across sectors, affecting wages, task allocation, and demand for complementary capital (tooling, monitoring, retraining systems). [Task Allocation] | mixed | speculative | Macro-level economic outcomes (productivity, wages, task allocation); not directly measured in the paper | 0.01 |
| Improved recovery capability from LEAFE reduces brittle failure modes but may also enable more autonomous behavior in novel settings, increasing both benefits and potential misuse risks. [AI Safety and Ethics] | mixed | speculative | System brittleness and autonomy-related risk potential (qualitative; no direct empirical safety metrics reported) | 0.01 |