An autonomous evolutionary system, ASI-Evolve, discovered hundreds of improved models and pipelines—finding 105 state-of-the-art linear-attention architectures and yielding large benchmark gains (e.g., +18 MMLU, up to +12.5 RL points)—offering early evidence that AI can speed parts of the AI research loop.
Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.
Summary
Main Finding
ASI-Evolve is an agentic "AI-for-AI" framework that closes a learn–design–experiment–analyze loop by (1) injecting accumulated human priors via a cognition base and (2) distilling experimental outcomes via a dedicated analyzer. In a unified set of large-scale experiments, ASI-Evolve autonomously produced meaningful advances across three foundational components of AI development (model architectures, pretraining data curation, and training algorithms), providing initial evidence that closed-loop AI research can accelerate AI progress. The codebase is open-sourced: https://github.com/GAIR-NLP/ASI-Evolve.
Key Points
- Core design:
  - Iterative pipeline: sample context from a persistent database, retrieve cognition priors, generate candidate programs (Researcher), execute experiments with early rejection and debugging (Engineer), and distill multi-dimensional outputs into compact reports (Analyzer).
  - Persistent stores: Database D (a history of nodes, each holding motivation, code, results, analysis, and metadata) and Cognition C (task-relevant human priors indexed by embeddings).
  - Sampling policies supported: UCB1, greedy, random, MAP-Elites/island.
- Two central innovations:
  - Cognition base: injects curated human literature and heuristics to steer exploration and accelerate cold-start performance.
  - Analyzer: converts verbose experimental logs and multi-metric outputs into reusable, structured insights for future rounds.
- Empirical achievements (selected highlights):
  - Neural architecture design (linear/efficient attention): produced 1,350 candidates across 1,773 rounds, of which 105 surpass the human-designed DeltaNet; the best model scores +0.97 points above DeltaNet, nearly 3× the gain of recent human-designed improvements.
  - Pretraining data curation: evolved pipelines improved average benchmark performance by +3.96 points; MMLU gains exceeded +18 points.
  - Reinforcement learning algorithm design: discovered algorithms outperform GRPO by up to +12.5 (AMC32), +11.67 (AIME24), and +5.04 (OlympiadBench) points.
  - Cross-domain transfer: initial evidence in mathematics and biomedicine (drug–target interaction: +6.94 AUROC in cold-start).
- Ablations and diagnostics:
  - The cognition base substantially improves cold-start climb speed and iteration efficiency.
  - The choice of sampling strategy strongly affects the long-run trajectory and sustained improvement.
  - Engineering guards (static checks, debug agent, novelty filter) are important for constraint-heavy evolution of large codebases.
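As an illustration of one of the sampling policies listed above, the following is a minimal sketch of UCB1 selection over database nodes. It is not taken from the ASI-Evolve codebase: the `Node` fields, the `ucb1_select` and `update` helpers, and the reward scale are all hypothetical, and the only thing assumed from the summary is that UCB1 is used to pick which stored node seeds the next round.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """One database entry (hypothetical, simplified subset of fields)."""
    name: str
    total_reward: float = 0.0  # sum of scores obtained when sampling this node
    visits: int = 0            # number of times this node was sampled


def ucb1_select(nodes: list[Node], c: float = math.sqrt(2)) -> Node:
    """Return the node maximizing mean reward + c * sqrt(ln(N) / n)."""
    # Unvisited nodes are tried first so that every node has a defined mean.
    for node in nodes:
        if node.visits == 0:
            return node
    total_visits = sum(node.visits for node in nodes)

    def score(node: Node) -> float:
        mean = node.total_reward / node.visits
        bonus = c * math.sqrt(math.log(total_visits) / node.visits)
        return mean + bonus

    return max(nodes, key=score)


def update(node: Node, reward: float) -> None:
    """Record the outcome of one round that used this node as context."""
    node.visits += 1
    node.total_reward += reward


# Toy usage with made-up statistics.
nodes = [
    Node("a", total_reward=4.5, visits=5),
    Node("b", total_reward=1.6, visits=2),
    Node("c"),
]
first = ucb1_select(nodes)   # picks the unvisited node "c"
update(first, 0.1)
second = ucb1_select(nodes)  # exploration bonus now favors the rarely sampled "b"
```

With `c = sqrt(2)` the bonus equals the classic UCB1 term `sqrt(2 ln N / n)`. A greedy policy would correspond to dropping the bonus entirely, which is one way to see why the choice of policy could change the long-run search trajectory.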
Data & Methods
- Conceptual framing:
  - Introduces L_task = ⟨C_exec, S_space, D_feedback⟩ to characterize tasks by execution cost, search-space complexity, and feedback complexity, motivating the design choices for long-horizon, high-cost scientific automation.
- System modules:
  - Researcher: LLM- or diff-based program generation conditioned on sampled database nodes plus retrieved cognition entries.
  - Engineer: runs full evaluation scripts; supports early rejection (wall-clock timeouts, quick tests) and optional LLM-based quality judging.
  - Analyzer: ingests raw logs, training dynamics, and per-benchmark breakdowns, and produces compact causal analyses that are stored and used for retrieval.
  - Cognition store: curated corpus of human priors (e.g., ~150 entries from 100 papers for the linear-attention task), queried via embedding search to guide proposals.
- Experimental workflow specifics (example: model architecture task):
  - Candidate generation with static checks (complexity bounds, chunk-wise structure, causal masks), a debug agent for runtime fixes, and novelty checks to avoid duplicates.
  - Multi-stage evaluation: a quick, cheaper exploration stage (small models of ~20M parameters trained for a small number of steps on a data subset), followed by more expensive validation of promising candidates. (The paper gives concrete benchmarks, training steps, and evaluation splits per task.)
- Benchmarks and metrics:
  - Architecture: evaluated on a suite of language-modeling and sequence benchmarks (10 core benchmarks at the exploration stage).
  - Data curation: measured via downstream benchmark suites including MMLU.
  - RL algorithms: evaluated on AMC32, AIME24, and OlympiadBench.
- Storage and selection:
  - The database retains a pool of top candidates (e.g., top-50) and samples contexts from this high-performing set to encourage progressive improvement.
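The retrieval, staged evaluation, and pool retention described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the function names (`retrieve`, `evaluate`, `keep_top`), the dictionary layout, the hand-written 3-dimensional "embeddings", and the stub evaluators are all hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, cognition, k=2):
    """Return the k cognition entries most similar to the query embedding."""
    return heapq.nlargest(k, cognition, key=lambda e: cosine(query_vec, e["vec"]))


def evaluate(candidate, quick_test, full_eval, threshold):
    """Cheap screening first; the expensive run happens only if it passes."""
    if quick_test(candidate) < threshold:
        return None  # early rejection
    return full_eval(candidate)


def keep_top(pool, k=50):
    """Retain the k highest-scoring nodes."""
    return heapq.nlargest(k, pool, key=lambda node: node["score"])


# Toy cognition store: made-up vectors stand in for real embeddings.
cognition = [
    {"text": "gating improves recall", "vec": [1.0, 0.0, 0.0]},
    {"text": "delta-rule state update", "vec": [0.9, 0.1, 0.0]},
    {"text": "deduplicate web text", "vec": [0.0, 0.0, 1.0]},
]
priors = [e["text"] for e in retrieve([1.0, 0.0, 0.0], cognition, k=2)]

# Toy candidates: each "candidate" is just a number and both evaluators are stubs.
pool = []
for candidate in [0.2, 0.7, 0.9, 0.4]:
    score = evaluate(
        candidate,
        quick_test=lambda c: c,        # stand-in for a short, cheap training run
        full_eval=lambda c: c * 10.0,  # stand-in for the full benchmark suite
        threshold=0.5,
    )
    if score is not None:
        pool.append({"code": candidate, "score": score})
pool = keep_top(pool, k=1)
```

In this toy run the two attention-related entries are retrieved, candidates 0.2 and 0.4 are rejected at the cheap stage, and only the best surviving node stays in the pool. A real system would replace the stubs with an embedding model, actual training runs, and the top-50 retention mentioned above.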
Implications for AI Economics
- Productivity and R&D efficiency:
  - If generalizable, ASI-Evolve-style systems could raise the marginal productivity of AI R&D by automating parts of hypothesis generation, code modification, experiment orchestration, and insight distillation, reducing human time per iteration and increasing iteration velocity.
  - The combination of cognition base and analyzer suggests AI agents can internalize and accumulate institutional memory, lowering the cost of rediscovery and improving long-run returns to AI R&D investments.
- Capital vs. labor effects:
  - High computational costs remain (many GPU-hours per candidate), so gains primarily increase returns to capital (compute infrastructure, cloud/GPU investments) and algorithmic automation, potentially substituting some human experimental labor (junior engineers, routine architecture search tasks) while increasing demand for higher-skill oversight roles (tooling, validation, safety).
- Returns to scale and market concentration:
  - Organizations with large compute budgets and engineering pipelines can deploy closed-loop search at scale, possibly increasing first-mover advantages and market concentration. Open-sourcing mitigates but does not fully offset capital asymmetries.
- Innovation dynamics and creative complementarity:
  - ASI-Evolve shows AI can contribute non-trivial innovations (novel architectures, algorithms, data pipelines). This implies complementarity between humans (setting objectives, high-level theory, ethical oversight) and automated agents (search, experimentation), potentially shifting human roles toward meta-science, evaluation, and translational work.
- Implications for R&D budgeting and policy:
  - Budget models should account for higher upfront compute spending to enable automated search loops with potentially faster payoff horizons.
  - Public funding and policy may need to support shared compute and benchmarking infrastructure to democratize access and limit concentration.
- Risks and caveats:
  - Results are promising but preliminary: replication, robustness checks, and evaluation of long-run generalization across more domains and against stronger baselines are needed.
  - Accelerated discovery also raises risks: faster model cycles could shorten product lifecycles, raise externalities (e.g., unforeseen capabilities), and stress governance regimes.
  - High compute demand raises energy and environmental costs; economic assessments must include these externalities.
- Research economics questions opened by this work:
  - How do returns to compute vs. human labor evolve if closed-loop agents scale?
  - What is the impact on the allocation of funding across basic vs. applied ML research?
  - How will intellectual property, publication, and reproducibility norms adapt when agents generate substantial parts of technical progress?
Caveats: ASI-Evolve demonstrates strong results on several tasks, but deployment and economic impact depend on generalization beyond the reported domains, reproducibility of gains under different compute constraints, and the broader organizational adoption of such closed-loop systems.
Assessment
Claims (12)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. | Other | positive | high | existence and operation of a learn-design-experiment-analyze closed-loop framework (ASI-Evolve) | 0.12 |
| ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. | Other | positive | high | design and inclusion of cognition base and dedicated analyzer components in the ASI-Evolve agentic framework | 0.12 |
| To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. | Other | positive | medium | breadth of AI-driven discovery across data, architectures, and learning algorithms | 0.01 |
| In neural architecture design, it discovered 105 SOTA linear attention architectures. | Output Quality | positive | high | count of discovered state-of-the-art (SOTA) linear attention architectures | n=105; 105 SOTA linear attention architectures; 0.12 |
| The best discovered model surpasses DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. | Output Quality | positive | high | performance difference vs DeltaNet (points) | +0.97 points; 0.12 |
| In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points. | Output Quality | positive | high | average benchmark performance (points) | +3.96 points; 0.12 |
| In pretraining data curation, gains exceed 18 points on MMLU. | Output Quality | positive | high | MMLU benchmark performance (points) | gains exceeding 18 points on MMLU; 0.12 |
| In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32. | Output Quality | positive | high | performance difference vs GRPO on AMC32 (points) | up to +12.5 points on AMC32; 0.12 |
| In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +11.67 points on AIME24. | Output Quality | positive | high | performance difference vs GRPO on AIME24 (points) | up to +11.67 points on AIME24; 0.12 |
| In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +5.04 points on OlympiadBench. | Output Quality | positive | high | performance difference vs GRPO on OlympiadBench (points) | up to +5.04 points on OlympiadBench; 0.12 |
| We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. | Other | positive | medium | transferability of AI-for-AI paradigm to domains outside core AI (mathematics and biomedicine) | 0.04 |
| Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research. | Other | positive | high | feasibility and promise of closed-loop AI-driven research (ASI-Evolve) to accelerate AI development | 0.12 |