A predictive prompt-selection method cuts the costly rollouts needed for RL finetuning by half or more and improves reasoning accuracy across math, planning, and visual-geometry benchmarks. By making iterative finetuning cheaper and faster, the technique could materially lower the marginal compute cost of model improvement for practitioners.

Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji · March 11, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

DPS uses an HMM-based online Bayesian predictor of per-prompt learning progress to select informative prompts without exhaustive rollouts, substantially cutting redundant rollouts and speeding RL finetuning while improving final reasoning accuracy across multiple benchmark domains.

Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While significantly accelerating RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which online predicts and selects informative prompts by inferring their learning dynamics prior to costly rollouts. Specifically, we introduce a new perspective by modeling each prompt's solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.

Summary

Main Finding

Dynamics-Predictive Sampling (DPS) is an online prompt-selection method for RL finetuning of LLMs that predicts each prompt’s expected learning progress before doing expensive rollouts. By modeling per-prompt solving progress as a dynamical system (hidden Markov model) and performing online Bayesian inference on historical rollout reward signals, DPS creates a predictive prior that identifies informative prompts without exhaustive rollouts. Empirically, DPS cuts redundant rollouts, speeds up RL finetuning, and improves final reasoning performance across mathematics, planning, and visual-geometry tasks.

Key Points

Problem: Online prompt-selection strategies improve RL finetuning by focusing training on moderately challenging examples, but identifying those examples typically requires many costly LLM rollouts over large candidate batches.
Idea: Treat each prompt’s “extent of solving” under the current policy as a latent state in a dynamical system; transitions are modeled via a hidden Markov model (HMM).
Inference: Use historical rollout rewards to perform online Bayesian updates of each prompt’s latent-state distribution.
Selection: Use the inferred state distributions as a predictive prior to select prompts estimated to be most informative, avoiding rollout-heavy filtering.
Benefits: Substantially fewer redundant rollouts, faster training (fewer rollouts and steps, and less wall-clock time spent on rollouts), and improved downstream reasoning accuracy across the evaluated tasks.
Tasks evaluated: Mathematical reasoning, planning, and visual geometry (diverse reasoning domains to test generality).
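
To make the Idea and Inference points concrete, here is a minimal sketch of one filtering step for a single prompt. It is not the paper's implementation: the three state names, the transition matrix, the per-state success probabilities, and the function names `predict` and `update` are all assumptions chosen for illustration, and rewards are assumed to be binary.

```python
import numpy as np

# Hypothetical latent states for one prompt (an assumption, not taken from the paper).
STATES = ("unsolved", "partially_solved", "solved")

# Assumed transition matrix: row i gives P(next state | current state i).
TRANSITION = np.array([
    [0.90, 0.10, 0.00],
    [0.05, 0.80, 0.15],
    [0.00, 0.05, 0.95],
])

# Assumed probability, per state, that a single rollout earns reward 1.
SUCCESS_PROB = np.array([0.02, 0.50, 0.98])


def predict(belief):
    """Propagate the belief one finetuning step forward (no rollouts needed)."""
    return belief @ TRANSITION


def update(prior, rewards):
    """Bayes update of the belief from a list of binary rollout rewards."""
    rewards = np.asarray(rewards)
    wins = rewards.sum()
    losses = rewards.size - wins
    # Likelihood of the observed wins and losses under each latent state.
    likelihood = SUCCESS_PROB ** wins * (1.0 - SUCCESS_PROB) ** losses
    posterior = prior * likelihood
    return posterior / posterior.sum()


belief = np.full(3, 1.0 / 3.0)                  # uninformed starting belief
belief = update(predict(belief), [1, 0, 1, 0])  # mixed rewards were observed
print(dict(zip(STATES, belief.round(3))))
```

With mixed rewards the belief concentrates on the middle state, which is the kind of prompt the Selection point would favor. The predicted belief for the next step is available before any new rollout is run.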

Data & Methods

Modeling
- Representation: The “solving progress” of a prompt is the latent state; transitions over finetuning steps are treated as a stochastic dynamical process (HMM).
- Observation: Rollout rewards for a prompt are noisy observations tied to its latent state.
- Inference: Online Bayesian updates use past rollout signals to estimate the evolving state distribution per prompt.
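
The three modeling bullets correspond to the standard HMM filtering recursion. The notation here is an assumption for illustration, since the summary does not give the paper's exact parameterization: $b_t(s)$ is the belief that a prompt is in state $s$ at finetuning step $t$, $T(s' \mid s)$ is the transition probability, and $p(r \mid s)$ is the likelihood of the observed rollout rewards $r$ given state $s$.

```latex
% Predict: propagate the belief through the transition model (no rollouts needed)
\hat{b}_{t+1}(s') = \sum_{s} T(s' \mid s)\, b_t(s)

% Update: condition on the rewards r_{t+1} once the prompt has been rolled out
b_{t+1}(s') = \frac{p(r_{t+1} \mid s')\, \hat{b}_{t+1}(s')}
                   {\sum_{s''} p(r_{t+1} \mid s'')\, \hat{b}_{t+1}(s'')}
```

In this generic formulation, a prompt that is not rolled out at a given step receives only the predict step, so $b_{t+1} = \hat{b}_{t+1}$ and its belief drifts according to $T$ until new rewards arrive.
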
Selection Mechanism
- Predictive prior from inference ranks or filters prompts for RL finetuning without performing costly candidate rollouts.
- Selected prompts are then used for actual RL updates (with full rollouts) as usual.
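
A minimal sketch of how a predictive prior could rank prompts without candidate rollouts follows. It assumes a three-state belief per prompt in which column 1 means "partially solved" and it selects the top-k prompts by that probability. The function name `select_prompts`, the state layout, and the top-k rule are all hypothetical; the paper's actual selection rule may differ (for example, it could sample rather than rank).

```python
import numpy as np


def select_prompts(beliefs, batch_size, informative_state=1):
    """Return indices of the prompts most likely to be in the informative state.

    beliefs: array of shape (num_prompts, num_states), one predicted state
    distribution per prompt. informative_state is the column treated as
    "partially solved" (a hypothetical convention for this sketch).
    """
    scores = np.asarray(beliefs)[:, informative_state]
    # argsort is ascending, so sort the negated scores; a stable sort keeps
    # tied prompts in their original index order.
    order = np.argsort(-scores, kind="stable")
    return order[:batch_size].tolist()


# Hypothetical predicted beliefs over (unsolved, partially solved, solved).
beliefs = np.array([
    [0.90, 0.08, 0.02],   # probably still too hard
    [0.10, 0.75, 0.15],   # probably partially solved
    [0.02, 0.08, 0.90],   # probably already solved
    [0.30, 0.60, 0.10],   # probably partially solved
])
chosen = select_prompts(beliefs, batch_size=2)
print(chosen)
```

Only the prompts in `chosen` would then receive full rollouts for the RL update, and their rewards would feed the next belief update.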
Baselines & Comparisons
- Compared to standard online prompt-selection methods that rely on rollout-based filtering (i.e., evaluating large candidate batches via rollouts to find informative examples).
- Metrics include number of rollouts required, training speed (rollout- or wall-clock-based), and final reasoning performance on benchmark tasks.
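
The summary does not define how redundant rollouts are counted. One hypothetical way to operationalize the rollout metric assumes a group-relative advantage estimator such as GRPO, where a prompt whose rollouts all receive the same reward yields zero advantage and so contributes no gradient signal. The function name `redundant_rollout_fraction` and this definition are assumptions for illustration.

```python
def redundant_rollout_fraction(reward_groups):
    """Fraction of rollouts spent on prompts whose rewards were all identical.

    reward_groups: list of lists, one list of rollout rewards per prompt.
    A group with identical rewards is counted as redundant (assumed definition).
    """
    total = sum(len(group) for group in reward_groups)
    if total == 0:
        return 0.0
    redundant = sum(
        len(group) for group in reward_groups if len(set(group)) <= 1
    )
    return redundant / total


# Hypothetical reward groups: four prompts, four rollouts each.
groups = [
    [1, 1, 1, 1],   # always solved: no learning signal
    [0, 1, 0, 1],   # mixed: informative
    [0, 0, 0, 0],   # never solved: no learning signal
    [1, 0, 0, 0],   # mixed: informative
]
fraction = redundant_rollout_fraction(groups)
print(fraction)
```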
Empirical Findings
- DPS substantially reduces the number of rollouts wasted on uninformative prompts.
- RL finetuning converges within smaller rollout budgets and reaches higher final task accuracy across the tested domains.
Practical considerations
- The inference procedure uses only historical rollout reward signals, keeping its extra compute small relative to the avoided rollouts.
- Assumes sufficient rollout history per prompt; cold-start prompts with no history need separate handling.
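
The summary does not say how the paper treats cold-start prompts. One common fallback, sketched here purely as an assumption, is to give unseen prompts a uniform belief so that they are treated as maximally uncertain until their first rollouts arrive. The class `BeliefStore` and its methods are hypothetical names.

```python
import numpy as np

NUM_STATES = 3  # hypothetical: unsolved, partially solved, solved


class BeliefStore:
    """Keeps one belief vector per prompt id, with a fallback for unseen prompts."""

    def __init__(self, num_states=NUM_STATES):
        self.num_states = num_states
        self._beliefs = {}

    def get(self, prompt_id):
        # Unseen prompts get a uniform (maximally uncertain) belief.
        if prompt_id not in self._beliefs:
            return np.full(self.num_states, 1.0 / self.num_states)
        return self._beliefs[prompt_id]

    def set(self, prompt_id, belief):
        # Store the posterior computed after the prompt's latest rollouts.
        self._beliefs[prompt_id] = np.asarray(belief, dtype=float)

    def has_history(self, prompt_id):
        return prompt_id in self._beliefs


store = BeliefStore()
print(store.get("never-seen-prompt"))   # uniform belief
store.set("seen-prompt", [0.1, 0.8, 0.1])
print(store.has_history("seen-prompt"))
```

A selector built on such a store could also reserve part of each batch for prompts without history so that they accumulate observations; that is a design option, not something the summary attributes to DPS.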

Implications for AI Economics

Compute-cost reduction and productivity
- Reduces the dominant expense in online selection workflows (LLM rollouts), lowering marginal compute cost of RL finetuning and making iterative model improvement cheaper.
- Shorter finetuning cycles and fewer rollouts increase developer productivity and accelerate model deployment cadence.
Market & competitive effects
- Lowers the barrier to effective RL finetuning for organizations with limited compute budgets—potentially democratizing access to high-quality finetuned models.
- Firms that adopt DPS-like efficiencies gain a cost advantage, possibly intensifying competition and increasing returns to firms that combine algorithmic efficiency with existing model/data assets.
Pricing for compute and services
- Demand for large-batch rollout compute could fall relative to demand for lightweight inference and bookkeeping services; cloud providers may adjust pricing or product offerings (e.g., more fine-grained inference primitives, monitoring/inference tooling).
Labor and specialization
- Reduces labor/time cost for manual candidate curation and for repeating expensive rollout-based filtering loops; shifts value toward tooling, algorithmic orchestration, and data engineering for managing historical rollout signals.
Safety, externalities, and regulation
- Faster, cheaper finetuning accelerates the pace of capability improvements, which has positive productivity benefits but could amplify negative externalities (misuse, unanticipated harms) if governance does not scale accordingly.
Investment and R&D incentives
- Algorithmic improvements that cut operational costs (like DPS) increase the expected returns on investments in fine-tuning research and infrastructure. This could channel more capital into iterative model improvement rather than larger model scale alone.
Open questions for economic assessment
- Quantitative cost–benefit: how much rollout compute (and $) is saved in representative industry settings?
- Distributional impacts: who benefits most—large incumbents or smaller labs—and how does this affect market concentration?
- Interaction with hardware and data markets: does reduced demand for rollouts shift investment to other parts of the stack?
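
The first open question can be framed as simple arithmetic. The sketch assumes that a rollout-based filtering baseline rolls out every candidate prompt while a predictive selector rolls out only the prompts it trains on. The function name `rollout_savings` and every input value are placeholders for illustration; none of the figures come from the paper.

```python
def rollout_savings(candidates_per_step, trained_per_step, rollouts_per_prompt,
                    steps, cost_per_rollout):
    """Compare rollout counts for rollout-based filtering vs. predictive selection.

    Assumes the filtering baseline rolls out every candidate prompt, while the
    predictive selector rolls out only the prompts it trains on.
    """
    baseline = candidates_per_step * rollouts_per_prompt * steps
    predictive = trained_per_step * rollouts_per_prompt * steps
    saved = baseline - predictive
    return {
        "baseline_rollouts": baseline,
        "predictive_rollouts": predictive,
        "saved_rollouts": saved,
        "saved_fraction": saved / baseline if baseline else 0.0,
        "saved_cost": saved * cost_per_rollout,
    }


# Placeholder inputs for illustration only.
example = rollout_savings(
    candidates_per_step=300,
    trained_per_step=100,
    rollouts_per_prompt=8,
    steps=10,
    cost_per_rollout=0.01,
)
print(example)
```

A realistic assessment would also subtract the cost of the inference and bookkeeping that replace the filtering rollouts, and would measure the inputs in a representative deployment.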

Potential limitations to consider when assessing economic impact: DPS relies on the quality and availability of historical rollout signals and on the HMM dynamical assumption; performance under distribution shift, for very large prompt inventories, or with severe cold starts may reduce realized savings.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides empirical experimental comparisons showing large reductions in rollout counts and improved downstream accuracy across multiple benchmark reasoning domains, which supports the method's effectiveness in lab settings; however, evidence is limited to benchmark tasks and simulated RL finetuning experiments (no field or production deployment data, limited information on model sizes, hyperparameter sensitivity, or long-run replication), so external validity and robustness to real-world variation remain uncertain.
Methods Rigor: medium. The method is well-motivated (HMM dynamical model + online Bayesian updates) and evaluated against sensible rollout-based baselines on diverse tasks with multiple metrics (rollouts, wall-clock, final accuracy), but the description lacks detailed information about statistical significance, ablation studies, sensitivity to modeling choices, scale across very large prompt inventories, and reproducibility details that would justify a high-rigor rating.
Sample: Empirical evaluation uses RL finetuning experiments on benchmark reasoning tasks (mathematical reasoning, planning, and visual-geometry), with per-prompt historical rollout reward traces collected during training; comparisons are against standard rollout-based online prompt-selection baselines, measuring number of rollouts, rollout/wall-clock compute, and final task accuracy (specific LLM sizes and dataset sizes not specified in the summary).
Themes: productivity, adoption, innovation, org_design
Generalizability:
- Evaluations limited to benchmark reasoning tasks; results may not transfer to other task types (e.g., open-ended generation, dialogue).
- Relies on availability and quality of historical rollout reward signals; settings with sparse or noisy histories may reduce effectiveness.
- HMM dynamical assumption may not hold under abrupt distribution shifts or non-Markovian learning dynamics.
- Cold-start prompts (no history) weaken predictive prior benefits and may require separate handling.
- Scalability to very large prompt inventories and heterogeneous production workloads is not fully demonstrated.
- Performance may vary with LLM size, reward design, RL algorithm, and infrastructure differences in production deployments.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
Dynamics-Predictive Sampling (DPS) models each prompt’s "extent of solving" under the current policy as a latent state in a dynamical system (a hidden Markov model) and performs online Bayesian inference on historical rollout reward signals to estimate that state.	Other	null_result	high	inferred latent state distribution / predicted expected learning progress per prompt	0.18
DPS uses the inferred per-prompt state distributions as a predictive prior to select prompts estimated to be most informative, avoiding exhaustive candidate rollouts for filtering.	Training Effectiveness	null_result	high	selection of prompts (number of candidate rollouts avoided)	0.18
Compared to standard online prompt-selection methods that rely on large candidate-batch rollouts for filtering, DPS substantially reduces the number of redundant (uninformative) rollouts.	Training Effectiveness	positive	medium	number of rollouts (redundant rollouts avoided)	0.11
DPS speeds up RL finetuning in terms of required rollout budgets and wall-clock rollout compute.	Task Completion Time	positive	medium	training speed (rollout budget to convergence; wall-clock rollout compute)	0.11
DPS improves final reasoning performance (final task accuracy) across evaluated domains: mathematical reasoning, planning, and visual-geometry tasks.	Output Quality	positive	medium	final reasoning accuracy on benchmarks (mathematics, planning, visual-geometry)	0.11
The DPS inference procedure requires only historical rollout reward signals and therefore adds only a small amount of extra compute compared to the rollouts it avoids.	Organizational Efficiency	positive	medium	additional inference compute relative to avoided rollout compute	0.11
DPS creates a predictive prior that identifies informative prompts without performing exhaustive rollouts over large candidate batches.	Training Effectiveness	positive	medium	informativeness of selected prompts (as implied by downstream learning gains and reduced rollouts)	0.11
DPS was empirically evaluated across diverse reasoning domains (mathematical reasoning, planning, and visual-geometry) to test generality.	Research Productivity	null_result	high	task domains evaluated (mathematics, planning, visual-geometry)	0.18
Under realistic limitations (distribution shift, very large prompt inventories, or severe cold starts), DPS’s realized rollout savings and performance gains may be reduced.	Training Effectiveness	negative	medium	magnitude of rollout savings and performance gains under adverse conditions	0.11
Adopting DPS-like efficiencies reduces the marginal compute cost of online prompt-selection workflows (dominated by rollouts), thereby shortening finetuning cycles and increasing developer productivity.	Developer Productivity	positive	low	marginal compute cost of RL finetuning; finetuning cycle time; developer productivity (conceptual)	0.05
DPS gives organizations with limited compute budgets a cost advantage for RL finetuning, potentially democratizing access to effective finetuning or shifting demand across cloud compute products.	Adoption Rate	positive	speculative	accessibility of RL finetuning for low-compute organizations; demand patterns for compute products (speculative)	0.02
DPS compares favorably to standard rollout-based prompt-selection baselines across the reported metrics (rollouts required, training speed, final accuracy).	Training Effectiveness	positive	medium	relative performance vs baseline on number of rollouts, training speed, and final accuracy	0.11