The Commonplace

A simple evaluation-driven scaling strategy (SimpleTES) lets relatively modest LLMs outperform larger baselines across 21 scientific tasks — doubling LASSO speed, cutting quantum gate overhead by 24.5%, and finding new combinatorial constructions — and trajectory-based retraining improves generalization to new problems.

Evaluation-driven Scaling for Scientific Discovery
Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, Yuzhi Xu · April 21, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
SimpleTES scales evaluation-driven LLM discovery loops via parallel exploration, feedback-driven refinement, and local selection, yielding consistent state-of-the-art solutions across 21 scientific tasks and enabling models trained on successful trajectories to generalize to unseen problems.

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.

Summary

Main Finding

SIMPLETES is a principled, test-time “evaluation-driven” scaling framework that reallocates evaluator-query budget across three explicit axes—global width (C), refinement depth (L), and local sample size (K)—to dramatically improve LLM-driven scientific discovery. Using only open-source gpt-oss generators plus evaluators, SIMPLETES finds new state-of-the-art solutions across 21 problems in six domains (quantum compilation, GPU kernels, algorithm engineering, combinatorics, extremal mathematics, data science), often outperforming stronger frontier-model baselines and complex optimization pipelines. Trajectory histories produced by SIMPLETES can be used to post-train models, yielding more efficient discovery and transfer to unseen tasks.

Key Points

  • Core idea: Treat the test-time discovery loop explicitly as the problem of allocating a budget of N = C × L × K evaluator queries, and optimize the allocation across three axes:
    • C (global width): number of independent trajectories exploring diverse regions,
    • L (refinement depth): number of iterative feedback-driven refinement steps per trajectory,
    • K (local sample size): number of candidate variants generated per refinement step, of which only the best is kept.
  • SIMPLETES algorithm (high level):
    • Start from initial solution y0 with evaluator feedback V(y0) = (r0, m0).
    • For each of C parallel trajectories: for L iterations, sample K candidates from the generator G given a proposal built from the trajectory history S via a template Φ(S); evaluate all K, keep the best, and append it to S.
    • Return the best solution seen across all trajectories.
  • Robust empirical gains across 21 tasks spanning six domains:
    • Quantum routing (superconducting qubits): reduced total CNOT overhead from 60,189 to 45,441 in the largest case — a 24.5% improvement vs prior SOTA (LightSABRE).
    • Quantum zoned neutral-atom compilation: 33.2% geometric-mean execution-time reduction across 36 circuits.
    • GPU kernel TriMul (H100): 1.122 ms (SIMPLETES) vs prior best 1.131 ms; also beats leaderboard Triton submissions on other devices (A100, MI300).
    • LASSO regularization path solver: discovered hybrid switching strategy (LARS ↔ coordinate descent) — 2.17× speedup over glmnet and 14.08× over sklearn.
    • Erdős minimum overlap: improved best-known construction (0.380856 vs previous 0.380871).
    • Sum-Difference problem: new SOTA 1.144887 (improves human record by ~8%).
    • Many other SOTAs across circle packing, Hadamard determinant, autocorrelation inequalities, scaling law discovery, and scRNA-seq denoising.
  • SIMPLETES outperforms stronger model baselines and complex pipelines while using a single open-source LLM as the generator, highlighting that scaling the evaluation loop can be a more cost-effective axis than only scaling model capability.
  • Post-training on successful trajectories (supervising on the best outcome within each trajectory instead of immediate scores) yields models that:
    • Improve discovery efficiency on the training problems.
    • Generalize to unseen, out-of-distribution problems, finding solutions base models cannot.
  • Analysis: comprehensive ablations and scaling studies attribute the improvements to the C/L/K design choices and to trajectory-level pruning, and examine how surrogate and golden metrics can diverge.
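
The post-training supervision described in the key points can be sketched as follows. This is a minimal illustration, not the paper's training code: `Step`, `Example`, `label_trajectory`, `select_successful`, and the `threshold` cutoff are hypothetical names, and labeling every step with the trajectory-wide maximum score is one assumed reading of "best outcome within each trajectory".

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    prompt: str       # proposal shown to the generator at this step
    candidate: str    # candidate kept at this step
    score: float      # immediate evaluator score r for that candidate


class Example(NamedTuple):
    prompt: str
    candidate: str
    target: float     # trajectory-level label, not the immediate score


def label_trajectory(steps: List[Step]) -> List[Example]:
    """Label every step with the best score reached anywhere in its trajectory.

    Assumption: the label is the trajectory-wide maximum; the paper may use a
    different trajectory-level target.
    """
    best = max(step.score for step in steps)
    return [Example(s.prompt, s.candidate, best) for s in steps]


def select_successful(trajectories: List[List[Step]], threshold: float) -> List[Example]:
    """Keep trajectories whose best score clears a hypothetical threshold."""
    examples: List[Example] = []
    for steps in trajectories:
        if steps and max(s.score for s in steps) >= threshold:
            examples.extend(label_trajectory(steps))
    return examples


# Toy data: trajectory A dips at its second step before reaching its best score.
traj_a = [Step("p0", "a1", 0.2), Step("p1", "a2", 0.1), Step("p2", "a3", 0.9)]
traj_b = [Step("p0", "b1", 0.3), Step("p1", "b2", 0.25), Step("p2", "b3", 0.28)]

examples = select_successful([traj_a, traj_b], threshold=0.5)
for ex in examples:
    print(ex.candidate, ex.target)
```

In this toy data the second step of trajectory A has the lowest immediate score (0.1) yet receives the label 0.9, because it lies on the path to the trajectory's best solution. Labeling by immediate score would instead penalize that step.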

Data & Methods

  • Formalization:
    • Problem: find y ∈ Y maximizing an (often inaccessible) golden metric; use a queryable surrogate evaluator V: Y → (r, m) returning scalar score r and metadata m (verifier traces, errors, etc.).
    • Policy π uses a generator (fixed LLM G) and proposal construction Φ(S) from trajectory history S.
    • Budgeted evaluator queries N are used according to (C, L, K).
  • SIMPLETES implementation (Algorithm 1 in paper):
    • For each of C trajectories run in parallel:
      • For ℓ = 1..L:
        • Build prompt x = Φ(S) from accumulated trajectory states.
        • Sample K candidates y_k ∼ G(x).
        • Evaluate each: (r_k, m_k) = V(y_k).
        • Append the best candidate (max r_k) to trajectory S.
    • Return the globally best solution across trajectories.
  • Evaluator types:
    • Exact verifiers (mathematical correctness), empirical estimators (timing kernels on subsets), or heuristics (surrogate losses).
    • Paper analyzes fidelity issues (reward hacking, overfitting) and mitigations (coarse-to-fine, mixed surrogates).
  • Experimental scope:
    • 21 diverse open-ended tasks across 6 domains (detailed per-task pipelines in paper).
    • Generators: single open-source gpt-oss models (no expensive frontier models required).
    • Comparative baselines: human bests, prior AI-discovered solutions, frontier models/ensembles, and specialized optimization pipelines.
    • Budgeting and parallelism: experiments explore varying C × L × K allocations, asynchronous execution, and trajectory-level pruning for efficiency.
  • Post-training:
    • Collect successful trajectory histories produced by SIMPLETES.
    • Train the model to imitate successful refinements; supervision uses the best outcome within a trajectory as the label for earlier steps (not the immediate local reward).
    • Evaluate transfer to unseen instances and problems.
  • Analysis:
    • Ablations on design choices for Φ, reflection/failure handling, trajectory pruning, and surrogate vs golden metric behavior.
    • Scaling studies quantify returns along C, L, K axes.
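
The loop listed under "SIMPLETES implementation" can be sketched on a toy problem. This is a rough illustration under stated assumptions, not the paper's Algorithm 1: the generator is a random perturbation rather than an LLM, trajectories run sequentially rather than in parallel, and `simple_tes`, `toy_evaluate`, `toy_generate`, and `toy_prompt` are hypothetical names. The toy prompt exposes only the best candidate in the history, which is one of many possible choices for Φ.

```python
import random
from typing import Callable, List, Tuple

# One trajectory state: (candidate y, score r, metadata m).
State = Tuple[float, float, str]


def simple_tes(
    y0: float,
    evaluate: Callable[[float], Tuple[float, str]],    # stands in for V
    generate: Callable[[str, random.Random], float],   # stands in for G
    build_prompt: Callable[[List[State]], str],        # stands in for Phi
    C: int,
    L: int,
    K: int,
    seed: int = 0,
) -> State:
    rng = random.Random(seed)
    r0, m0 = evaluate(y0)
    best: State = (y0, r0, m0)
    for _ in range(C):                        # global width
        history: List[State] = [(y0, r0, m0)]
        for _ in range(L):                    # refinement depth
            prompt = build_prompt(history)
            scored: List[State] = []
            for _ in range(K):                # local sample size
                y = generate(prompt, rng)
                r, m = evaluate(y)
                scored.append((y, r, m))
            step_best = max(scored, key=lambda s: s[1])
            history.append(step_best)         # only the best of K is kept
            if step_best[1] > best[1]:
                best = step_best
    return best


calls = {"n": 0}  # counts evaluator queries


def toy_evaluate(y: float) -> Tuple[float, str]:
    calls["n"] += 1
    return -(y - 3.0) ** 2, f"distance={abs(y - 3.0):.3f}"


def toy_prompt(history: List[State]) -> str:
    # Assumed template: show the generator the best candidate so far.
    return repr(max(history, key=lambda s: s[1])[0])


def toy_generate(prompt: str, rng: random.Random) -> float:
    # Stand-in for an LLM call: perturb the candidate named in the prompt.
    return float(prompt) + rng.gauss(0.0, 0.5)


C, L, K = 4, 5, 3
best_y, best_r, best_m = simple_tes(0.0, toy_evaluate, toy_generate, toy_prompt, C, L, K)
print(best_y, best_r, best_m, calls["n"])
```

The evaluator is queried C × L × K = 60 times inside the loop, plus once for the initial solution in this sketch, so `calls["n"]` ends at 61. Changing C, L, and K while holding their product fixed is the allocation question that the scaling studies address.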

Implications for AI Economics

  • New axis of investment: evaluation-driven compute vs model parameter scaling
    • SIMPLETES shows that allocating budget to high-throughput evaluation and structured test-time loop design can yield SOTA results while using modest (open-source) model capacity. Firms and labs might obtain larger marginal returns by investing in evaluator infrastructure (simulators, verifiers, rapid benchmarking hardware) and clever loop designs rather than only larger models.
  • Cost-effectiveness and democratization
    • Because SIMPLETES attains top results with open-source LLMs, smaller organizations can compete on discovery tasks by investing in evaluation pipelines and engineering for parallelized test-time queries, lowering barriers to entry for some discovery workloads.
  • Shifts in labor and specialization
    • Demand may grow for expertise in building high-fidelity evaluators, scalable evaluation orchestration, and task-specific surrogate design (rather than just model-building). R&D teams could reallocate labor toward evaluator engineering and experiment orchestration.
  • Capital concentration and economies of scale
    • High-throughput evaluation benefits groups that can amortize the cost of specialized simulators and hardware (e.g., quantum hardware access, GPU clusters used for microbenchmarking). This creates an economic advantage for better-capitalized labs and could accelerate concentration of certain discovery tasks.
  • Value of trajectory data and re-use
    • Trajectory histories are valuable training data: post-training on successful trajectories materially improves model efficiency and transfer. There is economic value in collecting, curating, and licensing such trajectory datasets—analogous to labeled datasets or fine-tuning corpora.
  • Market for evaluators & measurement services
    • Demand for reliable evaluators (fast emulators, verifiers, benchmark suites) could spawn commercial services (evaluation-as-a-service), including standardized, auditable evaluators to prevent reward hacking and to ensure reproducibility.
  • Externalities and policy considerations
    • Environmental and compute externalities: mass evaluation runs (especially on expensive hardware or physical experimentation) have nontrivial resource costs. Policymakers and funders may want to consider funding shared evaluator infrastructure or subsidizing reproducible evaluator platforms to mitigate duplicated costs.
    • Reproducibility and standards: because evaluator fidelity matters, economic incentives exist for transparent, validated evaluators and standardized benchmark definitions to avoid overfitting to fragile surrogates.
  • Strategic implications for R&D management
    • R&D managers should consider mixed investments: modest models + evaluator scaling + data-collection (trajectories) can sometimes beat outright investment in frontier models, allowing faster iteration on applied discovery tasks and better ROI per dollar of compute or lab time.

Summary takeaway: SIMPLETES reframes scaling for LLM-driven discovery—evaluation-loop scaling (C, L, K) is a practical, high-leverage axis that changes the economics of discovery: organizations can trade higher evaluator/query throughput and better loop design for smaller model investments, shifting where capital and labor are most productively allocated.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper presents systematic empirical results across 21 scientific problems and compares SimpleTES to frontier-model baselines and optimization pipelines, showing consistent improvements (including specific quantitative gains). However, results are engineering/benchmark-based rather than causal; potential task-selection, compute-budget, and simulator/verifier fidelity concerns limit confidence in broad claims without further replication and statistical analysis.
Methods Rigor: medium. The work evaluates a clear algorithmic framework on a relatively large and diverse suite of problems, appears to include comparisons and post-training generalization tests, and reports quantitative improvements on key tasks; but the description lacks (or at least this abstract does not report) detailed ablations, statistical uncertainty measures, full hyperparameter and compute-budget transparency, and external replication, which would be needed for a higher rigor rating.
Sample: Empirical experiments on 21 scientific problems spanning six domains (examples reported include speeding up LASSO, quantum circuit routing, and new Erdős minimum overlap constructions); experiments use gpt-oss family models as the base LLMs, task-specific verifiers/simulators/scoring functions to evaluate candidates, comparisons to frontier model baselines and optimization pipelines, and additional experiments where models are post-trained on successful trajectory histories.
Themes: innovation, productivity, human_ai_collab
Generalizability:
  • Results shown for a specific set of 21 tasks and six domains; performance may not transfer to other scientific problems or open-ended discovery tasks.
  • Relies on the availability and fidelity of verifiers/simulators/score functions; real-world lab experiments with noisy measurements may not match simulated evaluations.
  • Evaluations primarily with gpt-oss models and particular baseline pipelines; outcomes may differ with other LLM families, model sizes, or compute budgets.
  • Potential selection bias toward problems amenable to local refinement and parallel search; tasks requiring rare insight or novel theory might not benefit similarly.
  • Reported speedups and gains may depend on engineering details, compute resources, and hyperparameter tuning that are not fully described in the abstract.

Claims (9)

Claim | Direction | Confidence | Outcome | Details
We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection. Other positive high framework design combining parallel exploration, feedback-driven refinement, and local selection
0.18
Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models. Research Productivity positive high ability to discover state-of-the-art solutions (solution quality / discovery success)
n=21
0.18
SimpleTES consistently outperforms both frontier-model baselines and sophisticated optimization pipelines. Research Productivity positive high performance relative to baselines (solution quality / discovery success)
n=21
0.18
We sped up the widely used LASSO algorithm by over 2x. Task Completion Time positive high LASSO algorithm runtime / speed
over 2x
0.3
We designed quantum circuit routing policies that reduce gate overhead by 24.5%. Task Completion Time positive high quantum circuit gate overhead
24.5% reduction
0.3
We discovered new Erdos minimum overlap constructions that surpass the best-known results. Research Productivity positive high quality of Erdos minimum overlap constructions (best-known benchmarks)
0.3
SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. Training Effectiveness positive high availability and usefulness of trajectory-level histories for supervision
0.18
When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Research Productivity positive high post-training efficiency on seen problems and generalization to unseen problems (solution discovery success)
0.18
Effective evaluation-driven loop scaling is a central axis for advancing LLM-driven scientific discovery, and SimpleTES provides a simple yet practical framework for realizing these gains. Research Productivity positive high impact of scaling evaluation-driven discovery loops on LLM-driven scientific discovery
n=21
0.03
