AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).

Summary

Main Finding

AUTOSCIENTISTS is a decentralized, self-organizing team of long-running LLM agents that coordinates scientific search via a shared experimental state (champion, experiment log, forum, per-team queues). Compared with single-agent and centrally-planned multi-agent baselines under matched compute budgets, AUTOSCIENTISTS (i) achieves higher end-to-end performance across diverse scientific ML tasks, (ii) explores productive directions in parallel while avoiding redundant or exhausted searches via a dead‑end registry and critique-before-execution, and (iii) sustains progress on long-running experiments where single-agent approaches plateau.

Key quantitative results: on BioML-Bench (24 biomedical ML tasks) AUTOSCIENTISTS attains mean leaderboard percentile 74.40% versus 66.07% for Autoresearch (+8.33 percentage points); on GPT nanochat training optimization it reaches an intermediate validation loss 1.9× faster (34 vs 65 experiments) and continues to find accepted improvements (7 vs 0) from a pre‑found champion; on ProteinGym it discovers a method that raises ACE2–Spike Spearman ρ from 0.747 → 0.840 (+12.5%) and transfers across 217 assays to increase average ρ from 0.657 → 0.700 (+6.5%).
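
The headline numbers above mix two kinds of gain: the BioML-Bench figure is an absolute difference in percentage points, while the ProteinGym figures are relative changes in Spearman ρ. The following is a minimal arithmetic check using only the rounded values quoted in this summary; the function names are illustrative and not taken from the paper.

```python
def point_gain(baseline, new):
    """Absolute difference, expressed in the metric's own units."""
    return new - baseline


def relative_gain_pct(baseline, new):
    """Relative change over the baseline, as a percentage."""
    return 100.0 * (new - baseline) / baseline


# Values copied from the summary text (already rounded there).
bioml_points = point_gain(66.07, 74.40)            # mean leaderboard percentile
ace2_rel = relative_gain_pct(0.747, 0.840)         # ACE2-Spike Spearman rho
all_assays_rel = relative_gain_pct(0.657, 0.700)   # average rho over 217 assays
speedup = 65 / 34                                  # experiments needed to reach the target

print(f"BioML-Bench gain: {bioml_points:.2f} percentage points")
print(f"ACE2-Spike relative gain: {ace2_rel:.1f}%")
print(f"All-assay relative gain: {all_assays_rel:.1f}%")
print(f"Experiments-to-target ratio: {speedup:.2f}x")
```

Because the inputs are rounded, the recomputed relative gains can differ from the reported ones in the last digit.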

Key Points

Architecture and coordination
- No central orchestrator: agents read and write a shared state S (champion p*, experiment log L, forum F, team-local queues and dead-end registries).
- Self-organization: agents discuss proposals, form teams dynamically around promising directions, and reorganize after stagnation.
- Two agent roles per team: Analysts propose/prioritize experiments; Experiment agents claim and run experiments.
- Peer critique: proposals are debated in a forum and weak proposals filtered before compute is spent.
- Negative knowledge sharing: dead-end registries and public failure logs reduce redundant exploration.
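
The shared state and dead-end registries described in the bullets above can be pictured as a small data structure. This is a hypothetical sketch, not the paper's implementation: the class names, field names, and the rule that a proposal is skipped when any team has registered it as a dead end are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    team: str
    direction: str   # short label for the idea being tested (assumed representation)
    score: float     # evaluation metric; higher is assumed to be better


@dataclass
class SharedState:
    champion_score: float
    log: list = field(default_factory=list)        # experiment log L
    forum: list = field(default_factory=list)      # forum F
    queues: dict = field(default_factory=dict)     # team -> queued proposals
    dead_ends: dict = field(default_factory=dict)  # team -> {direction: reason}

    def is_dead_end(self, direction):
        """True if any team has registered this direction as failed."""
        return any(direction in registry for registry in self.dead_ends.values())

    def propose(self, team, direction):
        """Queue a proposal unless it is a known dead end; report skips on the forum."""
        if self.is_dead_end(direction):
            self.forum.append(f"{team}: skipped '{direction}' (known dead end)")
            return False
        self.queues.setdefault(team, []).append(direction)
        return True

    def record(self, experiment, failure_reason=None):
        """Append to the log; register a dead end when a failure reason is given."""
        self.log.append(experiment)
        if failure_reason is not None:
            registry = self.dead_ends.setdefault(experiment.team, {})
            registry[experiment.direction] = failure_reason


state = SharedState(champion_score=0.70)
state.record(Experiment("team_a", "larger_batch", 0.65), failure_reason="training diverges")
skipped = state.propose("team_b", "larger_batch")      # already failed for team_a
accepted = state.propose("team_b", "label_smoothing")  # not yet explored
```

In this sketch a failure recorded by one team prevents another team from queuing the same direction, which is the redundancy-reduction effect the bullets describe.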
Long-running behavior
- Agents persist (maintain memory, adapt strategies) across heartbeat cycles.
- Teams run parallel experiments and trigger re-discussion if progress stalls.
- Improvements are confirmed (e.g., replicate with second seed) before promotion to champion.
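
The confirm-before-promotion step in the last bullet can be sketched as follows. The summary only says that improvements are replicated (for example with a second seed); the requirement that every seed beats the champion, the use of the lowest replicate score as the new champion score, and all names below are assumptions for illustration.

```python
def confirm_and_promote(champion_score, run_experiment, candidate, seeds=(0, 1)):
    """Run the candidate once per seed and promote only if every run beats the champion.

    Returns (new_champion_score, promoted). Higher scores are assumed to be better.
    """
    scores = [run_experiment(candidate, seed) for seed in seeds]
    if all(score > champion_score for score in scores):
        # Conservative choice (an assumption): keep the lowest replicate as the new bar.
        return min(scores), True
    return champion_score, False


# Hypothetical stand-in for a real training run: scores looked up from a table.
FAKE_SCORES = {
    ("stable_idea", 0): 0.82, ("stable_idea", 1): 0.81,
    ("lucky_idea", 0): 0.83, ("lucky_idea", 1): 0.78,
}


def fake_run(candidate, seed):
    return FAKE_SCORES[(candidate, seed)]


stable_result = confirm_and_promote(0.80, fake_run, "stable_idea")
lucky_result = confirm_and_promote(0.80, fake_run, "lucky_idea")
```

Here the candidate that wins on only one seed is rejected, so a single lucky run cannot displace the champion.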
Implementation and scale
- LLM-agnostic design; experiments used Claude Sonnet 4.6 (Claude Code) as the coding/agent LLM.
- Typical team in experiments: 3 analyst agents + 6 experiment agents.
- Benchmarks used H100 GPUs; matched per-domain compute budgets for fair comparison.
Empirical scope
- Benchmarks: BioML-Bench (24 tasks across imaging, drug discovery, protein engineering, single-cell omics), GPT nanochat training optimization (short GPT training runs evaluated by validation bits-per-byte), ProteinGym (217 supervised substitution assays).
- Code and artifacts publicly available (paper provides repository and website).

Data & Methods

Problem formalization: iterative search for a program p maximizing an evaluation metric ℓ_eval; agents propose program modifications, run training/evaluation on D_train/D_val (or CV), and update shared state.
Agent workflow:
- Discussion phase: multi-round proposal, critique, team formation, roster written to shared state.
- Execution phase: teams run continuous propose-execute loops; analysts populate per-team queues Q_k; experiment agents claim jobs, run experiments, write results to L and forum F.
- Stagnation detection: teams re-open discussion when no improvements appear in recent experiments (example threshold: no improvement in last 10 experiments).
Shared artifacts recorded: champion model p* (with full training script and model card), full experiment log L (results, effect sizes, diagnostics), forum posts (mechanistic analyses), dead-end registries D_k (failed directions and reason).
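
The execution phase and stagnation rule above can be sketched as a simple loop. This is an illustrative sketch under stated assumptions: a single sequential queue stands in for a team's queue, higher scores are better, and the window of 10 follows the example threshold quoted above. The function names and return labels are hypothetical.

```python
from collections import deque


def is_stagnant(improved_flags, window=10):
    """True when the last `window` experiments produced no improvement."""
    if len(improved_flags) < window:
        return False
    return not any(improved_flags[-window:])


def run_team(queue, run_experiment, champion, window=10):
    """Claim proposals in order until the queue empties or the team stagnates."""
    improved_flags = []
    while queue:
        proposal = queue.popleft()         # an experiment agent claims the next job
        score = run_experiment(proposal)
        improved = score > champion
        if improved:
            champion = score               # confirmation step omitted for brevity
        improved_flags.append(improved)
        if is_stagnant(improved_flags, window):
            return champion, "reopen_discussion", len(improved_flags)
    return champion, "queue_empty", len(improved_flags)


# Hypothetical scores: two early improvements followed by a long flat stretch.
scores = [0.71, 0.72] + [0.70] * 13
result = run_team(deque(range(len(scores))), lambda i: scores[i], champion=0.70)
```

In this toy run the team stops after twelve experiments, once ten consecutive experiments have failed to beat the champion, and signals that discussion should be reopened.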
Evaluation metrics and comparisons:
- BioML-Bench: leaderboard percentile, median-exceeding submissions, medal rates, completion rate.
- GPT nanochat: validation bits-per-byte per experiment; comparison of experiments-to-target and accepted improvements.
- ProteinGym: Spearman correlation on substitution fitness prediction tasks.
Baselines: Autoresearch (single-agent iterative code-modifying agent) and several published biomedical-agent systems; compute budgets matched to isolate orchestration effects.
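
The ProteinGym metric listed above, Spearman correlation, is the Pearson correlation of the ranks of predicted and measured fitness values. The following dependency-free sketch uses average ranks for ties; it is a generic illustration of the metric, not the benchmark's own evaluation code, and the helper names are made up for this example.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, measured):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Toy example: predictions that swap two adjacent pairs of variants.
rho = spearman([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
```

Because only ranks matter, a predictor can score well on this metric without matching the scale of the measured fitness values.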

Implications for AI Economics

Productivity and cost-efficiency of R&D
- Parallel, self-organizing agent teams can raise the rate of useful experimental discoveries per unit of computational budget (e.g., 1.9× faster to same target in GPT tuning). That implies higher R&D throughput and lower marginal cost per improvement in computational sciences where experiments are automatable.
- Negative-knowledge sharing (dead-end registries) reduces redundant search, increasing effective returns to compute and lowering wasted experimentation costs.
Organization of research and firm structure
- The decentralized coordination model here resembles decentralized market structures or platform-mediated collaboration more than hierarchical planning. Economic models of R&D allocation (central planning vs decentralized teams) should incorporate how shared public states and low-friction information sharing affect search complementarities and exploration–exploitation tradeoffs.
- Agent specialization (analyst vs experimenter) and dynamic team formation highlight gains from flexible division of labor; this suggests firms may profit by deploying ensembles of specialized AI workers rather than monolithic systems.
Labor demand and skill complementarity
- AUTOSCIENTISTS automates iterative experimental design and execution tasks, likely substituting for some routine research assistance roles (experimental operators, baseline engineering). However, humans remain crucial for high-level strategic choices, validation, safety decisions, and regulatory/ethical oversight. Complementary skills emphasizing domain expertise, interpretation, and governance may therefore rise in value.
Compute markets and investment signals
- Long-running, persistent-agent systems create demand for continuous compute and storage for shared state and agent memory; pricing models and capacity planning for cloud/HPC providers may shift toward bundles that favor long-lived agent deployments and persistent-state services.
- Higher marginal product of compute from improved coordination could revalue investments in specialized orchestration software and shared-state infrastructure.
Intellectual property, attribution, and public goods
- Shared forums and dead-end registries function as public goods that increase global R&D efficiency; however, incentives for sharing versus privatizing agent-discovered methods (e.g., in industry drug discovery) will affect diffusion and aggregate welfare. Market design (e.g., prizes, open-source incentives, or IP regimes) will shape whether such negative-knowledge externalities are internalized.
Policy and regulatory considerations
- Faster automated discovery pipelines could accelerate high-impact outcomes (e.g., drug leads) but also raise dual-use and safety concerns. Regulators may need to adapt oversight to systems that autonomously iterate on experimental designs.
Empirical research directions for AI economics
- Measure firm-level productivity changes from deploying decentralized agent teams versus single-agent or human teams.
- Model optimal allocation of compute and coordination mechanisms under budget constraints and uncertainty about shifting payoff landscapes.
- Study labor reallocation: which occupations are displaced vs complemented, and how skill premiums evolve.
- Welfare analysis of open vs private sharing of dead-end knowledge and its effect on aggregate innovation rates.

Caveats & Limitations (economically relevant)
- Results depend on LLM backend and compute budget; gains may shrink or change with different LLMs or if compute costs rise.
- Benchmarks are computational and may not capture wet-lab experimental costs, regulatory timelines, or full productization expenses.
- Self-organization reduces redundant search but could still converge to correlated failures if team diversity is insufficient; market and organizational mechanisms are needed to preserve beneficial exploration diversity.
- Human judgment remains necessary for downstream validation, safety, and ethical decision-making; pure automation is not a complete substitute.

Overall, AUTOSCIENTISTS demonstrates that decentralized, shared-state coordination among persistent agent teams materially improves the efficiency of computational scientific search. For economics, this points toward meaningful shifts in R&D productivity, organizational design, compute demand, and the incentives governing knowledge sharing and intellectual property.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Presents consistent empirical improvements across multiple computational benchmarks (BioML-Bench, GPT training optimization, ProteinGym) under matched experimental budgets and reports concrete metrics (leaderboard percentiles, speedups, Spearman gains), but evidence is limited to in-silico/benchmark environments, may reflect benchmark-specific tuning, and does not establish real-world causal impacts on laboratory productivity or economic outcomes.
Methods Rigor: medium. Evaluation spans diverse tasks and compares against prior agent baselines with matched budgets, suggesting careful experimental design; however, potential issues include dependence on benchmark choice, possible hyperparameter and baseline tuning differences, limited transparency about reproducibility details, and no wet-lab validation of discovered protein improvements.
Sample: Computational experiments on BioML-Bench (24 tasks across biomedical imaging, protein engineering, single-cell omics, and drug discovery), GPT training optimization runs measuring validation bits-per-byte against Autoresearch (reporting time-to-target and number of accepted improvements), and ProteinGym fitness-prediction experiments covering 217 assays including a focused ACE2–Spike binding improvement; comparisons made to prior strongest AI agents and state-of-the-art models under matched experimental budgets.
Themes: productivity, innovation, human_ai_collab
Generalizability:
- Results restricted to computational benchmarks and simulation environments; may not translate to physical wet-lab experiments or field settings.
- Performance could depend on benchmark-specific tuning, dataset overlap, or evaluation oracles not available in other domains.
- Requires substantial compute/resources, which limits applicability to smaller labs or organizations.
- Reported gains may vary with changes in baseline implementations, agent prompts, or budget definitions.
- Scalability and safety issues in real-world experimental pipelines (e.g., experimental failure costs, regulatory constraints) are not addressed.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
AutoScientists is a decentralized team of AI agents that interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration.	Organizational Efficiency	positive	high	agent coordination and information sharing (qualitative description)	0.18
Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction.	Research Productivity	positive	high	overall performance across multiple benchmarks	0.18
On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%.	Research Productivity	positive	high	leaderboard percentile across benchmark tasks	n=24 mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33% 0.3
On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch.	Task Completion Time	positive	high	time-to-target (validation bits-per-byte)	1.9x faster 0.3
On GPT training optimization, AutoScientists continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements).	Research Productivity	positive	high	count of accepted improvements discovered	7 vs. 0 accepted improvements 0.3
On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation.	Research Productivity	positive	high	Spearman correlation on ACE2-Spike binding fitness prediction	n=1 +12.5% (Spearman correlation) 0.3
Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).	Research Productivity	positive	high	Spearman correlation averaged across 217 ProteinGym assays	n=217 +6.5% (Spearman correlation) across all 217 assays 0.3
Agents share successes and failures to reduce redundant exploration during long-running experiments.	Organizational Efficiency	positive	high	redundant exploration (qualitative/system-level reduction)	0.18

Decentralized AI teams speed up computational science: AutoScientists outperforms prior agents across 24 biomedical ML tasks, finds faster GPT training recipes (1.9x to target) and discovers a protein-prediction method that boosts ACE2–Spike binding accuracy by 12.5%.