ASI-Evolve: AI Accelerates AI

Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.

Summary

Main Finding

ASI-EVOLVE is an agentic "AI-for-AI" framework that closes a learn–design–experiment–analyze loop by (1) injecting accumulated human priors via a cognition base and (2) distilling experimental outcomes via a dedicated analyzer. In a unified set of large-scale experiments, ASI-EVOLVE autonomously produced meaningful advances across three foundational AI development components—model architectures, pretraining data curation, and training algorithms—providing initial evidence that closed-loop AI research can accelerate AI progress. The codebase is open-sourced: https://github.com/GAIR-NLP/ASI-Evolve.

Key Points

Core design:
- Iterative pipeline: sample context from a persistent database, retrieve cognition priors, generate candidate programs (Researcher), execute experiments with early-rejection and debugging (Engineer), and distill multi-dimensional outputs into compact reports (Analyzer).
- Persistent stores: Database D (history of nodes = motivation, code, results, analysis, metadata) and Cognition C (task-relevant human priors indexed by embeddings).
- Sampling policies supported: UCB1, greedy, random, MAP-Elites/island.
Two central innovations:
- Cognition base: injects curated human literature/heuristics to steer exploration and accelerate cold-start performance.
- Analyzer: converts verbose experimental logs and multi-metric outputs into reusable, structured insights for future rounds.
Empirical achievements (selected highlights):
- Neural architecture design (linear/efficient attention): generated 1,350 candidates over 1,773 rounds, of which 105 architectures surpass the human-designed DeltaNet; the best model scores +0.97 points above DeltaNet (nearly 3× the gain of recent human-designed improvements).
- Pretraining data curation: evolved pipelines improved average benchmark performance by +3.96 points; MMLU gains exceeded +18 points.
- Reinforcement learning algorithm design: discovered algorithms outperform GRPO by up to +12.5 (AMC32), +11.67 (AIME24), and +5.04 (OlympiadBench).
- Cross-domain transfer: initial evidence in mathematics and biomedicine (drug–target interaction: +6.94 AUROC in cold-start).
Ablations and diagnostics:
- Cognition base substantially improves cold-start climb speed and iteration efficiency.
- Choice of sampling strategy strongly affects long-run trajectory and sustained improvement.
- Engineering guards (static checks, debug agent, novelty filter) are important for constraint-heavy large-codebase evolution.
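
UCB1, one of the sampling policies listed under Core design, can be illustrated with a minimal sketch. The `Node` fields, the reward definition, and the exploration constant `c` are assumptions made for illustration; this summary does not state how ASI-Evolve scores nodes or counts visits.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical database node; the field names are illustrative only."""
    name: str
    total_reward: float = 0.0  # assumed: sum of rewards credited to this node
    visits: int = 0            # assumed: times this node was sampled as context


def ucb1_select(nodes, c=math.sqrt(2)):
    """Pick the node maximizing mean reward plus an exploration bonus (UCB1)."""
    # Any node that has never been sampled is tried first.
    for node in nodes:
        if node.visits == 0:
            return node
    total_visits = sum(n.visits for n in nodes)

    def score(n):
        mean = n.total_reward / n.visits
        bonus = c * math.sqrt(math.log(total_visits) / n.visits)
        return mean + bonus

    return max(nodes, key=score)
```

Setting `c=0.0` reduces the rule to greedy selection on mean reward, which is one way to see why the choice of sampling policy can change the long-run trajectory: a larger `c` keeps revisiting rarely sampled nodes, while a smaller one concentrates on current leaders.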

Data & Methods

Conceptual framing:
- Introduces L_task = ⟨C_exec, S_space, D_feedback⟩ to characterize tasks by execution cost, search-space complexity, and feedback complexity, motivating architecture/design choices for long-horizon, high-cost scientific automation.
System modules:
- Researcher: LLM- or diff-based program generation conditioned on sampled database nodes + retrieved cognition entries.
- Engineer: runs full evaluation scripts; supports early rejection (wall-clock timeouts, quick tests) and optional LLM-based quality judging.
- Analyzer: ingests raw logs, training dynamics, per-benchmark breakdowns, and produces compact causal analyses that are stored and used for retrieval.
- Cognition store: curated human prior corpus (e.g., ~150 entries from 100 papers for the linear-attention task) used via embedding search to guide proposals.
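
The embedding-based lookup in the cognition store can be sketched as a nearest-neighbour search. This is a toy illustration, not the paper's implementation: `embed` is a bag-of-words stand-in for a real embedding model, and the function names and example entries are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, entries, k=2):
    """Return the k cognition entries most similar to the query."""
    q = embed(query)
    ranked = sorted(entries, key=lambda e: cosine(q, embed(e)), reverse=True)
    return ranked[:k]


# Hypothetical cognition entries, for illustration only.
entries = [
    "delta rule update for linear attention state",
    "gating improves memory decay in recurrent models",
    "data deduplication removes near duplicate documents",
]
top = retrieve("linear attention delta rule", entries, k=1)
```

In a real system the retrieved entries would be placed in the Researcher's prompt alongside the sampled database nodes; a dense embedding model and a vector index would replace the toy functions here.
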
Experimental workflow specifics (example: model architecture task):
- Candidate generation with static checks (complexity bounds, chunk-wise structure, causal masks), debug agent for runtime fixes, and novelty checks to avoid duplicates.
- Multi-stage evaluation: a quick, cheaper exploration stage (small models of ~20M parameters trained for a small number of steps on a data subset), followed by more expensive validation of promising candidates. (The paper gives concrete benchmarks, training steps, and evaluation splits per task.)
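
The early-rejection and multi-stage evaluation flow described above could look roughly like the following sketch. The function name, status strings, thresholds, and the after-the-fact time check are all assumptions; the paper's Engineer module may enforce timeouts and quick tests differently.

```python
import time


def evaluate_candidate(candidate, quick_test, full_eval,
                       time_budget_s=1.0, quick_threshold=0.5):
    """Run a cheap proxy test first; only promising candidates get full evaluation.

    Returns a (status, score) pair. All names and thresholds are hypothetical.
    """
    start = time.monotonic()
    quick_score = quick_test(candidate)
    elapsed = time.monotonic() - start

    # Simplification: the budget is checked after the quick test returns.
    # A production system would more likely kill the process at the deadline.
    if elapsed > time_budget_s:
        return ("rejected_timeout", None)
    if quick_score < quick_threshold:
        return ("rejected_quick", quick_score)

    # Expensive stage, reached only by candidates that passed the cheap stage.
    return ("validated", full_eval(candidate))
```

The design point is that the expensive `full_eval` call is never reached by candidates that fail the cheap stage, which is what keeps the cost per round manageable when most candidates are poor.
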
Benchmarks and metrics:
- Architecture: evaluated on a suite of language-modeling and sequence benchmarks (10 core benchmarks at exploration stage).
- Data curation: measured via downstream benchmark suites including MMLU.
- RL algorithms: evaluated on AMC32, AIME24, OlympiadBench.
Storage and selection:
- Database retains top candidate pool (e.g., top-50) and samples contexts from the high-performing set to encourage progressive improvement.
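
A bounded top-k pool of this kind can be kept with a min-heap, as in the sketch below. The class name, the record format, and uniform sampling from the pool are assumptions for illustration; the summary only states that a top pool (e.g., top-50) is retained and sampled from.

```python
import heapq
import random


class CandidatePool:
    """Keeps only the `capacity` highest-scoring records (hypothetical sketch)."""

    def __init__(self, capacity=50):
        self.capacity = capacity
        self._heap = []   # min-heap of (score, insertion_index, record)
        self._count = 0   # unique tie-breaker so records are never compared

    def add(self, score, record):
        item = (score, self._count, record)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            # Push the new item, then drop whichever item is now the lowest.
            heapq.heappushpop(self._heap, item)

    def sample(self, n, rng=random):
        """Uniformly sample up to n records from the retained pool."""
        chosen = rng.sample(self._heap, min(n, len(self._heap)))
        return [record for _, _, record in chosen]

    def best(self):
        return max(self._heap)[2] if self._heap else None
```

Uniform sampling could be swapped for any of the supported policies (UCB1, greedy, MAP-Elites/island) without changing how the pool itself is maintained.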

Implications for AI Economics

Productivity and R&D efficiency:
- If generalizable, ASI-EVOLVE-style systems could raise marginal productivity of AI R&D by automating parts of hypothesis generation, code modification, experiment orchestration, and insight distillation—reducing human time per iteration and accelerating iteration velocity.
- The cognition base plus analyzer model suggests AI agents can internalize and accumulate institutional memory, lowering the cost of rediscovery and improving long-run returns to AI R&D investments.
Capital vs. labor effects:
- Computational costs remain high (many GPU-hours per candidate), so gains primarily increase returns to capital (compute infrastructure, cloud/GPU investments) and to algorithmic automation. This could substitute for some human experimental labor (junior engineers, routine architecture-search tasks) while increasing demand for higher-skill oversight roles (tooling, validation, safety).
Returns to scale and market concentration:
- Organizations with large compute budgets and engineering pipelines can deploy closed-loop search at scale, possibly increasing first-mover advantages and market concentration. Open-sourcing mitigates but does not fully offset capital asymmetries.
Innovation dynamics and creative complementarity:
- ASI-EVOLVE shows AI can contribute non-trivial innovations (novel architectures, algorithms, data pipelines). This implies complementarity between humans (setting objectives, high-level theory, ethical oversight) and automated agents (search, experimentation), potentially shifting human roles toward meta-science, evaluation, and translational work.
Implications for R&D budgeting and policy:
- Budget models should account for higher upfront compute spending to enable automated search loops with potentially faster payoff horizons.
- Public funding and policy may need to support shared compute and benchmarking infrastructure to democratize access and limit concentration.
Risks and caveats:
- Results are promising but preliminary: replication, robustness, and evaluation of long-run generalization across more domains and against stronger baselines are needed.
- Accelerated discovery also raises risks: faster model cycles could shorten product lifecycles, raise externalities (e.g., unforeseen capabilities), and stress governance regimes.
- High compute demand emphasizes energy and environmental cost considerations; economic assessments must include these externalities.
Research economics questions opened by this work:
- How do returns to compute vs. human labor evolve if closed-loop agents scale?
- What is the impact on the allocation of funding across basic vs. applied ML research?
- How will intellectual property, publication, and reproducibility norms adapt when agents generate substantial parts of technical progress?

Caveats: ASI-EVOLVE demonstrates strong results on several tasks, but deployment and economic impact depend on generalization beyond the reported domains, reproducibility of gains under different compute constraints, and the broader organizational adoption of such closed-loop systems.

Assessment

Paper Type: other
Evidence Strength: medium. The paper demonstrates consistent and large performance gains across multiple, diverse ML problems (architecture search, pretraining-data curation, RL algorithm design) and reports specific benchmark improvements, which provides substantive empirical support for the core claim. However, the evidence is limited by potential search/computation confounds (improvements could come from massive search or compute rather than the proposed methodological innovation), uncertain reproducibility (details on budgets, seeds, and baselines may be incomplete), and lack of external independent replication or causal isolation of the key components.
Methods Rigor: medium. Multiple tasks and domains are evaluated and quantitative comparisons to prior SOTA and baselines are presented, suggesting careful experimentation; nonetheless, rigor is weakened by likely insufficient transparency on compute and hyperparameter budgets, limited discussion of statistical variation/significance, and possible absence of controlled ablation to fully attribute gains to the cognition base and analyzer components rather than search scale or implicit human tuning.
Sample: Experimental evaluations across three core AI-development domains: (1) neural architecture design: evolutionary discovery of 105 state-of-the-art linear attention architectures, with the best model outperforming DeltaNet by +0.97 points; (2) pretraining data curation: evolved data pipelines producing average benchmark gains of +3.96 points and gains exceeding 18 points on MMLU; (3) reinforcement-learning algorithm design: discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 on AIME24, and +5.04 on OlympiadBench. Additional exploratory transfers to mathematics and biomedicine are reported. Benchmarks, specific datasets, model sizes, and compute budgets are not fully enumerated in the summary.
Themes: innovation, productivity
Identification: No formal causal identification strategy; claims are supported by a suite of engineering experiments in which an evolutionary agentic loop (ASI-Evolve) is run to search for architectures, datasets, and learning algorithms, and improvements are measured by benchmark comparisons to prior models/algorithms and ablated baselines.
Generalizability:
- Findings may depend on large compute and search budgets not available to typical labs or firms.
- Benchmarks and tasks used are research-oriented and may not reflect real-world engineering complexity or long-horizon scientific discovery.
- Reproducibility may be limited if full experimental details (random seeds, hyperparameters, compute costs) are not disclosed.
- Performance gains might partly reflect overfitting to chosen benchmarks rather than broad architectural or algorithmic improvements.
- Human priors injected via the cognition base could encode dataset or researcher biases, limiting transfer to other domains.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle.	Other	positive	high	existence and operation of a learn-design-experiment-analyze closed-loop framework (ASI-Evolve)	0.12
ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations.	Other	positive	high	design and inclusion of cognition base and dedicated analyzer components in the ASI-Evolve agentic framework	0.12
To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms.	Other	positive	medium	breadth of AI-driven discovery across data, architectures, and learning algorithms	0.01
In neural architecture design, it discovered 105 SOTA linear attention architectures.	Output Quality	positive	high	count of discovered state-of-the-art (SOTA) linear attention architectures	n=105 105 SOTA linear attention architectures 0.12
The best discovered model surpasses DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements.	Output Quality	positive	high	performance difference vs DeltaNet (points)	+0.97 points 0.12
In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points.	Output Quality	positive	high	average benchmark performance (points)	+3.96 points 0.12
In pretraining data curation, gains exceed 18 points on MMLU.	Output Quality	positive	high	MMLU benchmark performance (points)	gains exceeding 18 points on MMLU 0.12
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32.	Output Quality	positive	high	performance difference vs GRPO on AMC32 (points)	up to +12.5 points on AMC32 0.12
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +11.67 points on AIME24.	Output Quality	positive	high	performance difference vs GRPO on AIME24 (points)	up to +11.67 points on AIME24 0.12
In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +5.04 points on OlympiadBench.	Output Quality	positive	high	performance difference vs GRPO on OlympiadBench (points)	up to +5.04 points on OlympiadBench 0.12
We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine.	Other	positive	medium	transferability of AI-for-AI paradigm to domains outside core AI (mathematics and biomedicine)	0.04
Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.	Other	positive	high	feasibility and promise of closed-loop AI-driven research (ASI-Evolve) to accelerate AI development	0.12