SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.

Summary

Main Finding

A small, simulator-specific grounding adapter (SIGA) wrapped around an off-the-shelf coding agent can turn that agent into a practical operator of complex scientific simulators. On a representative GEOS task, the SIGA-equipped agent produced a complete deck in ≈5 minutes with TreeSim above 0.90, matching an expert who needed ~3 hours (≈36× wall-clock speedup). On a harder held-out GEOS set, grounding raised mean TreeSim from 0.720 to 0.789 (~10% relative gain) and reduced across-seed standard deviation by up to ~16×. A self-evolving variant that rewrites the adapter contents from prior trajectories further improved held-out performance, matching or beating the strongest hand-designed configuration. Transfer experiments (OpenFOAM, LAMMPS) show that the dominant mechanism differs by simulator: validation matters most where structural completeness is the bottleneck, while retrieval and memory matter most where domain correctness is the bottleneck.

Key Points

Problem framing: simulator setup is an agent-tool interface grounding problem — general coding agents already handle file navigation, edits, shell use, and repairs but lack the simulator’s executable contract (vocabulary, schema/structural constraints, validation/termination criteria).
SIGA design (minimal, plugin-style adapter around an existing harness/model):
- M (procedural memory): an always-visible cheatsheet (775 tokens for GEOS) appended to system prompt to surface high-frequency vocabulary/patterns.
- R (retrieval): semantic search tools (doc, schema, examples) exposed as callable tools (RAG over simulator docs, XSD, example files) to recover correct tokens/phrases.
- X (validator tool): agent-callable validator for in-trajectory checks and guided repair (e.g., xmllint --schema).
- S (stop-hook validator): termination-gating validator that prevents finishing until parse/validation succeeds (with bounded retries).
Integration choice: adapter modifies only three harness interfaces (context, tools, termination). The base model and harness remain frozen — the approach adapts the harness rather than reimplementing agent logic.
Empirical outcomes:
- GEOS representative task: SIGA matched expert-level deck quality in ≈5 minutes vs expert ≈3 hours.
- GEOS held-out set: mean TreeSim rose from 0.720 to 0.789, a ~10% relative gain (≈0.07 absolute) over the bare agent; across-seed standard deviation dropped by up to ~16×.
- Self-evolution (adapter rewrite from logged trajectories) improved held-out performance and matched/outperformed best hand-designed adapters.
- Transfer: validation (S/X) is most impactful in GEOS and OpenFOAM; in LAMMPS, retrieval and memory (R/M) matter more because they supply domain parameters and names.
Cost-effectiveness and portability: the adapter is compact and simulator-agnostic (instantiate same slots with simulator-appropriate checks and retrieval), making it cheap to port and amenable to automated self-improvement.
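The three-interface integration described above (context, tools, termination) can be pictured as a small data structure. The sketch below is illustrative only: `Harness`, `Adapter`, `apply_adapter`, and the tool name `validate` are hypothetical names chosen for this example, not SIGA's or any harness's actual API. It only shows how the four slots M, R, X, and S could map onto a frozen base harness without changing it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical types for illustration; not the real SIGA or harness API.

@dataclass
class Harness:
    system_prompt: str                                           # context interface
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)  # tool interface
    stop_check: Optional[Callable[[str], bool]] = None           # termination interface


@dataclass
class Adapter:
    memory: Optional[str] = None                                 # M: cheatsheet text
    retrieval_tools: Dict[str, Callable[..., str]] = field(default_factory=dict)  # R
    validator_tool: Optional[Callable[[str], str]] = None        # X: agent-callable check
    stop_validator: Optional[Callable[[str], bool]] = None       # S: termination gate


def apply_adapter(base: Harness, adapter: Adapter) -> Harness:
    """Build a grounded harness; the base harness object is left unmodified."""
    prompt = base.system_prompt
    if adapter.memory:
        # M: append procedural memory so it stays in view for every turn.
        prompt = prompt + "\n\n" + adapter.memory

    tools = dict(base.tools)              # copy, so the base tool set is untouched
    tools.update(adapter.retrieval_tools)  # R: add retrieval tools
    if adapter.validator_tool is not None:
        tools["validate"] = adapter.validator_tool  # X: optional in-trajectory check

    # S: replace the termination rule only when a stop validator is supplied.
    stop = adapter.stop_validator if adapter.stop_validator is not None else base.stop_check
    return Harness(system_prompt=prompt, tools=tools, stop_check=stop)
```

Because each slot is optional, the same function covers the bare agent (an empty `Adapter`) and any partial configuration, which is what makes a slot-by-slot ablation straightforward.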

Data & Methods

Experimental setup:
- Base: a frozen coding-harness (H0) wrapping a frozen model π (the paper built SIGA on top of an existing coding agent harness such as Claude Code).
- SIGA modifies the harness by (1) appending procedural memory to system context, (2) adding retrieval and validator tools to the toolset, and (3) replacing termination with a validation-gated stop-hook.
Components and implementation details:
- Retrieval (R): Model Context Protocol (MCP) server with three semantic search tools — search_navigator (documentation pages), search_schema (XSD entries), search_technical (example XML/technical snippets). Targets unknown-vocabulary substitution.
- Validator (X, S): For GEOS, xmllint --schema against GEOS .xsd is exposed both as an optional in-trajectory tool (X) and as an enforced termination gate (S). Agents call validator ≈3× per task on average when X is enabled.
- Procedural memory (M): 775-token cheatsheet distilled from 18 training trajectories (kept in view via appended system prompt).
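The validation-gated termination (S) with bounded retries can be sketched as a loop that refuses to finish until a validator passes. This is a minimal sketch under stated assumptions: `run_with_stop_hook`, `well_formed`, and `xmllint_command` are hypothetical names, and the default check here is only XML well-formedness from the Python standard library, standing in for the schema validation (`xmllint --schema`) described above. The `xmllint_command` helper builds the command line but does not run it, so the example does not require xmllint to be installed.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple

Validator = Callable[[str], Tuple[bool, str]]


def xmllint_command(schema_path: str, deck_path: str) -> List[str]:
    """Command line for XSD validation with xmllint (built here, not executed)."""
    return ["xmllint", "--noout", "--schema", schema_path, deck_path]


def well_formed(xml_text: str) -> Tuple[bool, str]:
    """Stand-in validator: checks XML well-formedness only, not the schema."""
    try:
        ET.fromstring(xml_text)
        return True, ""
    except ET.ParseError as exc:
        return False, str(exc)


def run_with_stop_hook(
    propose: Callable[[str], str],
    validate: Validator = well_formed,
    max_attempts: int = 3,
) -> Tuple[str, bool, int]:
    """Ask for drafts until one validates or the attempt budget is exhausted.

    `propose` receives the previous validator message (empty on the first call)
    and returns a new draft. Returns (last_draft, passed, attempts_used).
    """
    feedback = ""
    draft = ""
    for attempt in range(1, max_attempts + 1):
        draft = propose(feedback)
        ok, message = validate(draft)
        if ok:
            return draft, True, attempt
        feedback = message  # fed back to the agent to guide the repair
    return draft, False, max_attempts


if __name__ == "__main__":
    # Toy "agent": the first draft has a mismatched tag, the second is repaired.
    drafts = iter(["<Problem><Solvers></Problem>", "<Problem><Solvers/></Problem>"])
    deck, passed, attempts = run_with_stop_hook(lambda feedback: next(drafts))
    print(passed, attempts)
```

The bound on attempts matters: without it, a deck that can never validate would keep the agent from terminating at all.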
Objectives and evaluation:
- Reward metric: TreeSim (a tree-edit similarity) for GEOS (score in [0,1], failures-as-zero convention). For transfer studies: file-coverage and LLM-judge metrics.
- Design-space exploration: factorial ablation over the four binary slots {R,S,X,M} to see which ideas matter per simulator.
- Self-evolution: offline search for adapter contents θ (primer, cheatsheet, auxiliary skills) using a proposer agent that rewrites adapter text based on logged trajectories to maximize expected reward on a held-out validation-selection split.
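A factorial ablation over four binary slots means 2^4 = 16 configurations. The sketch below enumerates them and computes a simple main effect per slot (mean score with the slot on minus mean score with it off). The function names are hypothetical, and any scores passed in are the caller's own; nothing here reproduces the paper's numbers or its exact analysis.

```python
from itertools import product
from statistics import mean
from typing import Dict, List, Tuple

SLOTS = ("R", "S", "X", "M")  # retrieval, stop-hook, validator tool, memory

Config = Tuple[bool, ...]  # one flag per slot, in SLOTS order


def all_configs() -> List[Config]:
    """All 2**4 = 16 on/off combinations of the four slots."""
    return list(product((False, True), repeat=len(SLOTS)))


def label(config: Config) -> str:
    """Readable name such as 'R+S' for a configuration, or 'bare' if all off."""
    active = [slot for slot, on in zip(SLOTS, config) if on]
    return "+".join(active) if active else "bare"


def main_effect(scores: Dict[Config, float], slot: str) -> float:
    """Mean score over configs with `slot` on minus mean over configs with it off."""
    i = SLOTS.index(slot)
    on = [s for cfg, s in scores.items() if cfg[i]]
    off = [s for cfg, s in scores.items() if not cfg[i]]
    return mean(on) - mean(off)


if __name__ == "__main__":
    # Synthetic scores for illustration only: here only the S slot helps.
    synthetic = {cfg: 0.5 + (0.1 if cfg[SLOTS.index("S")] else 0.0) for cfg in all_configs()}
    for slot in SLOTS:
        print(slot, round(main_effect(synthetic, slot), 3))
```

Main effects are the coarsest summary a full factorial supports; the same table of 16 scores also allows interaction effects (for example, whether X adds anything once S is on) to be read off.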
Benchmarks and baselines:
- Representative GEOS task compared to a geoscience expert (time and quality).
- Held-out GEOS task set for robustness/generalization.
- Transfers to OpenFOAM and LAMMPS with comparable evaluation metrics and comparisons to lint-only baselines used in prior OpenFOAM agents (e.g., Foam-Agent 2.0 / MetaOpenFOAM lint mode).
Key quantitative outcomes:
- Representative GEOS: SIGA ≈5 minutes vs expert ≈3 hours; TreeSim >0.90 (parity).
- Held-out GEOS: TreeSim from 0.720 (bare agent) to 0.789 (grounded) — ~10% relative gain; across-seed std dev reduced ~16×.
- Self-evolution: yielded highest held-out GEOS mean and matched/beat best hand-tuned adapter.
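The headline figures above follow from simple arithmetic, reproduced here as a check. The speedup and gain use only the numbers reported in this summary. The helper `mean_failures_as_zero` is a hypothetical illustration of a failures-as-zero convention (a failed run contributes a score of 0 instead of being dropped); the paper's exact aggregation may differ.

```python
from typing import Optional, Sequence

# Wall-clock speedup: about three hours for the expert vs about five minutes for SIGA.
expert_minutes = 3 * 60
agent_minutes = 5
speedup = expert_minutes / agent_minutes  # 180 / 5

# Held-out GEOS TreeSim: bare agent vs grounded agent.
bare, grounded = 0.720, 0.789
absolute_gain = grounded - bare           # about 0.069 TreeSim points
relative_gain = absolute_gain / bare      # about 0.096, i.e. roughly 10% relative


def mean_failures_as_zero(scores: Sequence[Optional[float]]) -> float:
    """Mean score where a failed run (None) counts as 0 rather than being skipped."""
    if not scores:
        raise ValueError("need at least one run")
    return sum(0.0 if s is None else s for s in scores) / len(scores)


if __name__ == "__main__":
    print(f"speedup: {speedup:.0f}x")
    print(f"absolute gain: {absolute_gain:.3f}")
    print(f"relative gain: {relative_gain:.1%}")
```

The distinction between the two gains is worth keeping explicit: the improvement is about 0.07 in absolute TreeSim and about 10% relative to the bare agent's 0.720.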

Implications for AI Economics

Immediate productivity and cost savings:
- Large reductions in expert setup time (36× speedup on representative GEOS task) imply substantial labor cost savings per simulation setup and faster experimental iteration cycles. This lowers effective cost-per-experiment and increases throughput for research groups and engineering teams.
High return on small engineering investments:
- A lightweight adapter (textual/plug-in artifacts + validators + retrieval index) can deliver major productivity gains without retraining models. For organizations, investing in simulator-specific grounding plugins is likely much cheaper and faster than model fine-tuning or building bespoke agent loops.
Market and productization opportunities:
- Demand for simulator adapters/plugins as commercial products or open-source modules. Firms could specialize in high-quality grounding packages (validators, schema mappings, curated memories, retrieval indices) for particular scientific tools.
- Platform providers (LLM-as-a-service, coding agent vendors) can offer official adapter ecosystems, charging for curated grounding bundles or hosting validation/retrieval backends.
Labor-market and skill-shift effects:
- Routine simulator setup tasks become automatable; junior/technical roles focused on configuration may shrink, while demand grows for higher-skilled roles (validator engineering, adapter authoring, domain verification, scientific interpretation).
- Time reallocated from setup to higher-level science (experiment design, analysis) could raise the value of domain expertise and accelerate discovery.
Competitive and research implications:
- Faster per-task iteration increases the pace of empirical work and may accelerate R&D cycles across industries that rely on simulation (energy, materials, aerospace, pharma), giving adopters an early advantage.
- Self-evolving adapters reduce maintenance costs over time since adapters can improve from logged runs — compounding productivity gains without continuous human retuning.
Allocation of R&D spending:
- The results argue for shifting marginal R&D spend from large model retraining to engineering robust grounding layers and validator infrastructure for domain-specific tools, particularly where simulator interfaces are complex and stable.
Risks, limits, and governance:
- Overreliance: if adapters allow automated generation of runnable simulations without sufficient domain oversight, there is risk of misconfigured experiments that produce plausible but scientifically wrong outcomes. Validator logic helps for structural correctness but not necessarily for physical validity.
- Externalities: easier setup may increase aggregate compute usage (more simulations), raising costs and energy footprint. Institutions should weigh compute budgets and environmental impacts.
- Platform lock-in and standards: adapters tied to particular harnesses or proprietary retrievers could create lock-in; open standards for validator schemas and retrieval indices could mitigate this.
Distributional considerations:
- Distribution of benefits may skew to well-resourced labs and firms that can build or buy adapters, potentially widening capability gaps. Open-source adapters and shared retrieval corpora can help democratize access.
Scalability and multiplier effects:
- Because the adapter approach composes with future improvements in base models and harnesses (they remain replaceable/frozen), gains can compound: better base agents make adapters more effective and vice versa, increasing the long-run economic value of the approach.

Caveats and limitations (important for economic assessment):
- SIGA addresses interface grounding (vocabulary, schema, validation); it does not replace domain expertise for experiment design, result interpretation, or verification of physical correctness beyond schema-level checks.
- Reported improvements depend on availability of good schemas, example corpora, and validators (not all simulators have machine-checkable schemas of equivalent quality).
- Quantitative results are simulator-dependent (GEOS, OpenFOAM, LAMMPS) and hinge on the chosen metrics (TreeSim, file-coverage, LLM-judge); real-world economic gains require integration into full research workflows.

Overall takeaway for AI economics: small, targeted engineering (grounding adapters + validation + retrieval) can unlock outsized productivity gains for simulation-driven workflows, making adapter development a high-impact, cost-effective lever for organizations that rely on scientific software.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper presents controlled experimental comparisons (bare agent vs SIGA, ablations, held-out sets, and transfers to other simulators) showing large relative improvements and an expert baseline; however, evidence is limited to a small set of scientific simulators and benchmark tasks, with few details about statistical significance, user diversity, or real-world deployment, so external validity and robustness beyond the tested setups are uncertain.

Methods Rigor: medium. The authors evaluate across in-domain, held-out, and transfer tasks, run multiple seeds, and include ablations (validation, memory, retrieval, self-evolution) and a human expert baseline, indicating solid experimental design; but the paper appears to lack broader user studies, detailed statistical testing/reporting, large-scale field validation, and full transparency on dataset sizes and selection criteria.

Sample: Benchmarks consist primarily of GEOS (an open-source multiphysics simulator) configuration/setup tasks, including a main set and a harder held-out set; comparisons include an off-the-shelf coding agent, SIGA (with components: retrieval, procedural memory, in-trajectory validation, validation-enforced termination), ablated variants, a human expert who took ~3 hours, and transfers evaluated on OpenFOAM and LAMMPS; performance measured by TreeSim similarity metric, wall-clock time, and across-seed variance.

Themes: productivity, human_ai_collab, adoption

Generalizability:
- Evaluated on a limited set of scientific simulators (GEOS, OpenFOAM, LAMMPS); other simulators or domains may differ substantially.
- Depends on the capabilities of the underlying coding agent; results may not hold for weaker/stronger base models.
- TreeSim metric and benchmark tasks may not capture full domain correctness or practical experiment readiness.
- Small or unclear sample sizes, limited human baseline diversity (single or few experts), and synthetic/benchmark tasks limit external validity.
- Real-world operational constraints (compute, institutional workflows, safety/validation norms) not fully tested.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Learning specialized simulator input languages can cost domain scientists hours to days.	Skill Acquisition	negative	high	time required to learn simulator input languages	hours to days 0.09
SIGA (Simulator-Interface Grounding Adapter) supplies a simulator's executable contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination.	Other	positive	high	capacity to encode and supply simulator interface contract	0.03
On GEOS, SIGA produces a complete GEOS deck in about five minutes, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup.	Task Completion Time	positive	high	time to produce complete GEOS deck (wall-clock time)	about five minutes (SIGA) vs about three hours (human); roughly 36x wall-clock speedup 0.18
On GEOS, SIGA attains TreeSim above 0.90, matching the extended-budget human expert.	Output Quality	positive	high	TreeSim (similarity / output-quality metric)	TreeSim above 0.90 0.18
On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent.	Output Quality	positive	high	TreeSim (output quality) on held-out set	from 0.720 to 0.789; roughly 10% relative gain 0.18
Grounding can reduce the across-seed standard deviation by 16x on the held-out GEOS set.	Output Quality	positive	high	across-seed standard deviation of performance	reduce the across-seed standard deviation by 16x 0.18
Self-evolution (rewriting adapter contents from prior trajectories) further improves SIGA, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration.	Output Quality	positive	medium	held-out GEOS mean performance (e.g., TreeSim)	yields the highest held-out GEOS mean and matches or outperforms the strongest hand-designed configuration 0.11
When transferring SIGA to OpenFOAM and LAMMPS, the dominant grounding mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck.	Task Allocation	mixed	medium	relative importance of SIGA mechanisms (validation vs memory/retrieval) for performance on different simulators	mechanism importance shifts by interface (validation vs memory/retrieval) 0.11
Lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.	Adoption Rate	positive	high	practicality of general coding agents operating scientific software (combined metrics: time, quality, robustness)	0.18
Coding agents already know how to navigate files, edit code, run commands, and repair outputs, but lack the simulator's executable contract (vocabulary, structural constraints, validation rules, termination conditions).	Other	mixed	high	agents' pre-existing capabilities vs missing simulator-specific contract	0.03