ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.

Summary

Main Finding

ABC-Bench, an agentic bio-capabilities benchmark, shows that frontier LLM agents with tool access match or exceed expert human baselines on multi-step molecular biology tasks (fragment design, robot-executed Gibson Assembly) and can produce scripts that, when run on a real robot, successfully assemble DNA. Models perform well on tasks grounded in published protocols but less well on a novel bioinformatics problem (screening evasion). The results have substantial implications for productivity, risk, and governance at the intersection of AI and bioeconomics.

Key Points

Benchmark overview
- ABC-Bench has three agentic tasks: Fragment Design (design DNA fragments for Gibson Assembly), Screening Evasion (obfuscate fragments to evade nucleic acid synthesis screening while remaining reconstructable), and Liquid Handling Robot (write OpenTrons scripts to perform Gibson Assembly).
- Tasks are run in an “agent” scaffold with tools (Python, Biopython, Bash, web search, OpenTrons simulator, BLAST for screening evasion).
- Scoring is objective and automated (e.g., code run-ability, correct reagent/volumes, successful assembly as verified by sequencing).
Empirical results
- Frontier models were tested (examples: Claude Sonnet/Opus, GPT-5.4, Gemini 3.1 Pro, Qwen3.5, Kimi K2.5, OpenAI o4-mini-high).
- All tested models outperformed the median expert human baseline on all three tasks. Human baseliners contributed 175 person-hours in total; mean human baseline scores were approximately 0.33 for Fragment Design, 0.22 for Screening Evasion, and 0.20 for Liquid Handling Robot (standard errors are reported in the paper).
- Models performed best on Liquid Handling Robot and Fragment Design (both rely on well-documented protocols/APIs), worst on Screening Evasion (creative, no published protocol).
- Refusal behavior: refusal rates on Screening Evasion were high for some safety-guarded models (some refused every sample) and lower on the robot task.
Wet-lab validation
- OpenAI o4-mini-high generated OpenTrons scripts, which were iteratively debugged using compile errors reported back by a human operator and then run on an OpenTrons Flex robot.
- Three independent Gibson Assembly experiments: all produced correctly assembled DNA confirmed by whole-plasmid sequencing.
Design principles emphasized by ABC-Bench
- Measure dual-use capabilities while minimizing information hazards.
- Test agentic/tool-augmented model behavior.
- Cover diverse tasks and parts of the risk chain.
- Use reproducible algorithmic scoring and include human baselines.

Data & Methods

Tasks and tooling
- Fragment Design: Python/Biopython; fragments must satisfy Gibson Assembly overlaps and commercial synthesis size criteria.
- Screening Evasion: Python, BLAST, web search; objective: evade three screening approaches while enabling reconstruction.
- Liquid Handling Robot: Python + opentrons API, OpenTrons simulation; simulation scoring plus wet-lab validation for successful DNA assembly.
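As an illustration of the kind of constraint the Fragment Design task describes, the following is a minimal sketch of an overlap and size check for a set of fragments. It is not the benchmark's grader: the function names and every threshold (20 to 40 bp overlaps, 300 to 3000 bp fragment sizes) are hypothetical placeholders, since the summary does not state the actual criteria, and it uses plain Python strings rather than Biopython objects.

```python
def shared_overlap(left: str, right: str, min_len: int = 20, max_len: int = 40) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Only lengths in [min_len, max_len] are considered; returns 0 if none match.
    The default bounds are illustrative assumptions, not benchmark values.
    """
    left, right = left.upper(), right.upper()
    upper = min(max_len, len(left), len(right))
    for k in range(upper, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def check_fragments(fragments, min_len=20, max_len=40,
                    min_size=300, max_size=3000, circular=True):
    """Return a list of problems; an empty list means every check passed.

    `min_size`/`max_size` stand in for a synthesis provider's length limits
    (hypothetical values). With `circular=True` the last fragment must also
    overlap the first, as for a plasmid.
    """
    problems = []
    for i, frag in enumerate(fragments):
        if not (min_size <= len(frag) <= max_size):
            problems.append(
                f"fragment {i}: length {len(frag)} outside [{min_size}, {max_size}]"
            )
    n = len(fragments)
    junctions = n if circular else n - 1
    for i in range(junctions):
        j = (i + 1) % n
        if shared_overlap(fragments[i], fragments[j], min_len, max_len) == 0:
            problems.append(f"fragments {i}->{j}: no {min_len}-{max_len} bp overlap")
    return problems
```

Calling `check_fragments([frag_a, frag_b])` on two fragments that share 25 bp at each junction returns an empty list; fragments with no shared ends produce one message per failed junction.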
Models and evaluation
- Each model was evaluated on N=10 runs per task (consistent with prior agentic benchmark practice).
- Refusal-corrected mean accuracy reported; automated grading for each subcriterion with possible partial credit.
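The summary does not define "refusal-corrected" or the partial-credit weighting, so the sketch below shows one plausible reading under stated assumptions: each run's score is the unweighted mean of its subcriterion credits, and refused runs are excluded before averaging. The function names and the input format are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def run_score(credits: dict) -> float:
    """Score one run as the unweighted mean of subcriterion credits in [0, 1].

    Equal weighting is an assumption; the benchmark may weight criteria differently.
    """
    if not credits:
        raise ValueError("a run needs at least one subcriterion")
    return mean(credits.values())


def refusal_corrected_mean(runs):
    """Summarise runs given as (score, refused) pairs.

    Assumes "refusal-corrected" means refused runs are dropped before averaging.
    Returns (mean, standard_error, n_scored, refusal_rate); mean is None if every
    run was refused, and standard_error is None with fewer than two scored runs.
    """
    if not runs:
        raise ValueError("no runs supplied")
    scored = [score for score, refused in runs if not refused]
    refusal_rate = 1 - len(scored) / len(runs)
    if not scored:
        return None, None, 0, refusal_rate
    m = mean(scored)
    se = stdev(scored) / sqrt(len(scored)) if len(scored) > 1 else None
    return m, se, len(scored), refusal_rate
```

With N=10 runs per task, a high refusal rate leaves very few scored runs, so the standard error returned here is worth reporting alongside the mean.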
Human baseline
- Baseliners were PhD-level or otherwise experienced practitioners with both molecular biology and Python skills. They were compensated, limited to 5 hours per task, and not allowed to use AI tools (compliance was checked).
- 175 total person-hours across baseliners.
Wet-lab protocol
- Setup: NEBuilder HiFi DNA Assembly kit, OpenTrons Flex robot, and a webcam to capture deck state. A human assistant supplies reagent concentrations and deck photos; the model writes the script; the human runs it and reports compile errors back for iterative fixes until the script runs.
- Outcome validated via Oxford Nanopore whole-plasmid sequencing (Plasmidsaurus).
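Because the human assistant supplies concentrations, a script has to turn a target molar amount of each fragment into a pipetting volume. The sketch below shows that arithmetic using the common approximation of about 650 g/mol per double-stranded base pair. The function names are hypothetical and the target amounts in the usage note are illustrative only; the summary does not give the amounts used in the experiments.

```python
AVG_BP_MASS_G_PER_MOL = 650.0  # approximate average mass of one double-stranded base pair


def ng_for_pmol(pmol: float, length_bp: int) -> float:
    """Mass in ng corresponding to `pmol` picomoles of a dsDNA fragment of `length_bp`."""
    # pmol * bp * 650 gives picograms; divide by 1000 to convert to nanograms.
    return pmol * length_bp * AVG_BP_MASS_G_PER_MOL / 1000.0


def transfer_volume_ul(pmol: float, length_bp: int, conc_ng_per_ul: float) -> float:
    """Volume in microlitres needed to deliver `pmol` of a fragment from a stock solution."""
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return ng_for_pmol(pmol, length_bp) / conc_ng_per_ul
```

For example, 0.05 pmol of a 1000 bp fragment is 32.5 ng, so a 25 ng/µL stock requires `transfer_volume_ul(0.05, 1000, 25.0)`, about 1.3 µL. A real protocol would also need to check that each volume is within the pipette's working range.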
Limitations noted by authors
- Benchmark covers only specific steps of a larger risk chain.
- Models’ refusal behavior and provider red-teaming affect measured capabilities.
- Tasks draw unevenly on published vs. novel knowledge; results may change rapidly as models evolve.

Implications for AI Economics

Productivity and R&D acceleration
- Lowering technical barriers: agentic LLMs reduce skilled-labor time for routine molecular design and protocol scripting, increasing effective lab throughput per scientist.
- Capital reallocation: greater returns to lab automation and computational tools (robots, cloud lab services, synthesis providers) as AI reduces marginal labor costs for design and execution tasks.
- Innovation rate: faster iteration cycles (design → build → test) could accelerate discovery, shorten time-to-market for biotech startups, and raise aggregate R&D productivity.
Labor market effects
- Skill shift: demand likely shifts toward higher-level experimental design, oversight, verification, and biosecurity expertise; routine cloning and scripting tasks may be automated.
- Wage and employment implications: downward pressure on certain technician/engineer tasks; potential premium for biosecurity, wet-lab validation, and model-auditing skills.
Market structure and competition
- Competitive advantage for integrated providers: firms that combine advanced LLM agents, lab automation, and synthesis services may capture value through vertically integrated pipelines.
- Concentration risks: capability concentration among model providers could create market power and systemic risk; customers may prefer providers who include safety/benchmarking guarantees.
Risk externalities and cost of governance
- Dual-use externalities: easier construction of hazardous sequences raises social risk; private decisions do not internalize societal risk, suggesting justification for regulation, monitoring, and insurance requirements.
- Compliance and transaction costs: firms will face new compliance costs (screening, audits, certifications) and potential operational frictions (model refusals, forced red-teaming), altering marginal costs of R&D.
- Insurance and liability markets: insurers will need models to price bio-risk, potentially requiring ABC-Bench–style verifications as underwriting conditions; higher premiums or exclusions for insufficient governance.
Investment and financing
- Valuation dynamics: startups demonstrating safe, benchmarked AI pipelines may command premiums; conversely, firms facing high regulatory uncertainty may see valuation discounts.
- Shifts in VC funding: increased investment in biosecurity tooling, model safety, detection and provenance technologies, and in lab automation platforms.
Policy and regulatory design
- Benchmarking as a regulatory tool: ABC-Bench–style objective, reproducible tests can be incorporated into ex ante governance (e.g., model capability disclosures, mandatory pre-deployment evaluation for high-risk models).
- Cost-effective regulation: objective benchmarks enable targeted interventions (e.g., gating access to agentic tools based on benchmark scores), which may be more economically efficient than blanket restrictions.
- International coordination: capabilities cross borders rapidly; harmonized standards/benchmarks reduce regulatory arbitrage and facilitate global insurance and trade frameworks.
Market for safety and compliance services
- Demand growth: a market for third-party benchmarking, model auditing, secure deployment, and incident response will expand.
- Business models: subscription or certification services that provide ABC-Bench compliance reports, secure sandboxing, or required refusal behaviors could be monetized.
Strategic and macroeconomic considerations
- Diffusion of capability: as agentic models permeate labs, aggregate productivity gains may be large but accompanied by distributional concerns (who captures gains vs. who bears risk).
- Externalities and social welfare: absent governance, benefits from faster R&D may be offset by increased biosecurity risks; optimal policy should balance innovation incentives with externality mitigation (e.g., subsidies for safety tech, taxes or permit systems for high-risk activities).
Practical near-term actions for economic actors
- Firms: integrate reproducible capability testing into procurement and vendor diligence; require third-party benchmarks for any agentic model used in wet-lab or synthesis workflows.
- Investors: include biosecurity maturity and benchmark compliance in due diligence; price regulatory uncertainty into valuations.
- Insurers/regulators: require objective benchmark evidence (or equivalents) for underwriting or licensing of agentic bio tools; fund public-good benchmarking infrastructure.
- Public sector: invest in open, transparent benchmarks and in workforce retraining for higher-skill oversight roles.

Caveats and uncertainty
- Rapid model improvements and shifting provider safety policies mean measured capabilities can change quickly.
- ABC-Bench covers specific tasks and not the full chain from design to deployed pathogen; economic risk assessments should model probable extensions and mitigation timelines.
- Empirical measurements are subject to sampling (N=10 runs per model) and provider-determined refusal behavior, which interacts with observed capability.

Overall, ABC-Bench demonstrates that agentic LLMs materially lower technical barriers to common molecular biology tasks. For AI economics, this implies higher R&D productivity and demand for complementary capital (robots, cloud labs), shifting labor demand and creating a growing market for safety, auditing, and regulatory compliance—while also generating nontrivial externalities that policymakers, insurers, and markets must factor into pricing, governance, and investment decisions.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides empirical evaluations and three wet-lab validations showing an LLM-produced script successfully assembled DNA, which is strong proof-of-concept evidence of capability; however the scope is narrow (few tasks, limited hardware and models), human baselines and agent selection are not fully detailed, and results are not demonstrated across diverse real-world lab settings.

Methods Rigor: medium. Benchmark tasks span relevant capabilities and include both simulation and wet-lab validation, and performance is compared to an expert baseline; but the paper appears to test a limited set of tasks and agents, lacks full transparency on human baseline sampling and inter-rater reliability, and uses a single robotic platform for lab validation, which limits internal robustness and reproducibility.

Sample: Suite of three task types (liquid-handling robot code generation, DNA fragment design for in vitro assembly, and evasion of DNA synthesis screening) evaluated across multiple LLM agents (including OpenAI o4-mini-high) and compared to a median expert human baseline; performance characterized on tasks drawing on published protocols versus novel bioinformatics reasoning; three wet-lab validation experiments executed on an OpenTrons liquid-handling robot using scripts generated by o4-mini-high that produced assembled DNA with expected sequences.

Themes: productivity, governance

Generalizability
- Tasks are a small, specific subset of bio-lab procedures and may not reflect broader biological R&D workflows.
- Wet-lab validation was performed on a single robotic platform (OpenTrons) and one model, limiting transfer to other hardware or lab environments.
- Human baseline details (sample size, expertise distribution, scoring) are not fully specified, limiting comparison robustness.
- Performance may depend on prompt engineering and model versions; results may not generalize to other LLMs or future updates.
- Assembled sequences and experimental conditions appear controlled/synthetic and may not capture the complexity or biosafety constraints of real-world biological work.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data.	Skill Acquisition	positive	high	acquisition of research-relevant capabilities by LLMs	0.03
LLM agents can perform in silico biology tasks that previously required experienced human biologists.	Automation Exposure	positive	high	ability of LLM agents to perform in silico biology tasks	0.18
We introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities.	Other	neutral	high	existence and composition of the ABC-Bench benchmark	0.09
ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks, including: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening.	Other	neutral	high	types of tasks included in ABC-Bench	0.09
All tested LLM agents outperformed the median expert human baseline on all three tasks.	Output Quality	positive	high	task performance relative to median expert human baseline	0.18
Agents performed highly on tasks drawing on published knowledge and well-documented protocols.	Output Quality	positive	high	performance on protocol- and literature-based tasks	0.18
Agents performed more weakly on a task requiring novel bioinformatics reasoning.	Output Quality	negative	high	performance on novel bioinformatics reasoning task	0.18
In three wet-lab validation experiments, OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.	Output Quality	positive	high	successful DNA assembly with expected sequences	n=3 successful assembly (qualitative) 0.3
The ABC-Bench tasks require a combination of biology and software expertise.	Other	neutral	high	skill mix required for benchmark tasks	0.09

Large language models outpace median human experts on a suite of bio-lab tasks, and one model produced working robot scripts that assembled DNA in lab tests, underscoring rapid productivity gains and heightened biosecurity concerns.