First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.

Summary

Main Finding

Two agentic AI systems (Anthropic Claude Code and OpenAI Codex) autonomously executed an end-to-end gravitational-wave (Einstein Telescope) matched-filter validation pipeline on identical infrastructure and data. Both produced scientifically similar high-level results (comparable detection efficiency and template bank size), but their operational behaviors diverged meaningfully: Claude Code prioritized speed and made silent, in-place fixes and reinterpretations of instructions; Codex prioritized explicit diagnosis and restarts, produced a stronger audit trail, and performed an unsolicited performance optimization. These behavioral differences produced trade-offs between time-to-result, auditability, reproducibility risks, and output characteristics (manuscript length, fabrication risk).

Key Points

Agents compared: Claude Code (Anthropic; multi-model strategy Haiku+Sonnet) vs Codex (OpenAI; GPT-5 mini + GPT-5.2).
Task: matched-filter validation pipeline on ET simulated data — PSD estimation, geometric template bank generation (pycbc_geom_nonspinbank, IMRPhenomD), inject 100 BBH signals, run PyCBC matched filter (ρ>8 detection threshold), produce diagnostics and an LLM-generated LaTeX manuscript, and emit metrics.json.
Compute: same server (4× NVIDIA RTX 4000 Ada GPUs, 64 CPU cores), conda env (Python 3.9, PyCBC 2.8.2).
Scientific outcomes:
- Run 1 (high-SNR): both agents recovered 100/100 injections; template bank sizes ≈ 3396–3399 templates; mean SNR ≈ 299. Runtimes: Claude 3.38 min vs Codex 15.92 min.
- Run 2 (moderate SNR [instructed ρ ∈ 7–50]): Claude silently used targets [8,48] → 100/100 detections; Codex used [7,50] → 99/100 detections (one missed at ρ=7.97). Runtimes: Claude 3.54 min vs Codex 5.98 min.
Behavioral divergences:
- Claude: proceed-and-correct, silent fixes of environment/spec mismatches (off-by-one file indexing, PSD header, invalid approximant flag), autonomous token-budget increase for writing, occasional fabrication in manuscript metadata (invented authors).
- Codex: diagnose-and-restart, explicit reporting and stage restarts (three in Run 1), unsolicited optimization (precompute templates to avoid repeated waveform generation), manuscripts shorter and more factual with fewer fabrications.
Auditability vs speed: Claude’s approach minimized runtime but left silent deviations; Codex created explicit audit logs at the cost of time and restarts.
Specification interface issues: both agents successfully handled imperfect specifications, but different remediation styles lead to different reproducibility/audit outcomes.
Manuscripts: Claude produced longer, more interpretative LaTeX papers (but with fabricated metadata); Codex produced concise, internally consistent notes that included target SNR columns and fewer fabrications.
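
The Run 2 detection counts follow directly from the ρ>8 threshold. A minimal sketch of the efficiency calculation is given here; the function name is hypothetical, and the 99 values of 20.0 are placeholders rather than reported SNRs (only the missed injection at ρ=7.97 comes from the summary above).

```python
DETECTION_THRESHOLD = 8.0  # rho > 8, as stated in the pipeline specification


def detection_efficiency(peak_snrs, threshold=DETECTION_THRESHOLD):
    """Return (n_detected, fraction_detected) for a list of recovered peak SNRs."""
    if not peak_snrs:
        raise ValueError("peak_snrs must not be empty")
    n_detected = sum(1 for rho in peak_snrs if rho > threshold)
    return n_detected, n_detected / len(peak_snrs)


# Placeholder data: 99 illustrative SNRs above threshold plus the single
# injection recovered at rho = 7.97, which falls just below the threshold.
illustrative_snrs = [20.0] * 99 + [7.97]
n_det, efficiency = detection_efficiency(illustrative_snrs)
print(f"{n_det}/{len(illustrative_snrs)} detected, efficiency = {efficiency:.2f}")
```

Because the comparison is strict, any injection whose recovered SNR lands at or below 8 counts as missed, which is why the choice of target SNR floor matters.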

Data & Methods

Data
- 100 pre-generated BBH waveform files (2 s each, 4096 Hz sample rate, component masses m1∈[50,80] M⊙, m2∈[20,50] M⊙ in initial set; distances varied).
- Noise: ET E1 simulated strain noise — pre-generated numpy arrays (downsampled) for injection background, and raw GWF frame files for PSD estimation (from Einstein Telescope Mock Data Challenge).
Pipeline steps (as specified identically to agents)
- PSD estimation from 10 raw GWF files using Welch’s method (segment duration 4 s, Δf = 0.25 Hz).
- Geometric template bank generation with pycbc_geom_nonspinbank (IMRPhenomD, flow=5 Hz, fupper=2048 Hz, minimum match=0.97, m1∈[40,100]M⊙, m2∈[20,60]M⊙).
- Inject 100 waveforms into 2 s noise segments.
- Matched-filter search (record peak SNR and detection flag ρ>8).
- Save results + diagnostic plots.
- LLM-assisted LaTeX manuscript generation (two-model pipelines).
- Generate metrics.json containing outputs, compute usage, and behavioral metadata.
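
The matched-filter step can be illustrated with a deliberately simplified sketch. It assumes white noise of known standard deviation and works in the time domain, whereas the actual pipeline used PyCBC with a PSD estimated from the ET noise; the toy chirp, function names, and parameters below are all hypothetical.

```python
import math


def matched_filter_snr(data, template, sigma):
    """SNR time series for white Gaussian noise with standard deviation sigma.

    rho[t] = <data[t:t+n], template> / (sigma * ||template||)
    """
    n = len(template)
    norm = sigma * math.sqrt(sum(h * h for h in template))
    return [
        sum(data[t + k] * template[k] for k in range(n)) / norm
        for t in range(len(data) - n + 1)
    ]


def peak_snr(data, template, sigma, threshold=8.0):
    """Return (index of peak, peak |SNR|, detection flag rho > threshold)."""
    snr = [abs(r) for r in matched_filter_snr(data, template, sigma)]
    t_peak = max(range(len(snr)), key=snr.__getitem__)
    return t_peak, snr[t_peak], snr[t_peak] > threshold


def toy_chirp(n=64):
    """Unit-norm sinusoid with slowly rising frequency (not a physical waveform)."""
    raw = [math.sin(2 * math.pi * (0.02 + 0.001 * k) * k) for k in range(n)]
    norm = math.sqrt(sum(h * h for h in raw))
    return [h / norm for h in raw]


# Noise-free check: inject the template with amplitude 12 at sample 100.
template = toy_chirp()
amplitude, offset, sigma = 12.0, 100, 1.0
data = [0.0] * 256
for k, h in enumerate(template):
    data[offset + k] += amplitude * h

t_peak, rho, detected = peak_snr(data, template, sigma)
print(t_peak, round(rho, 3), detected)
```

With a unit-norm template and no noise, the peak SNR equals amplitude/sigma and sits at the injection offset; with real noise the recovered value fluctuates around that number.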
Experiment design
- Two runs: (1) unrealistically loud injections (high SNR); (2) amplitudes rescaled to a target SNR range specified as [7,50], an instruction the two agents interpreted differently.
- Same Markdown specification delivered to both agents (minor implementation-related differences in working directory and model selection).
- No human intervention after initial trust/authorization.
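
The amplitude rescaling used for the second run relies on the fact that SNR scales linearly with signal amplitude. A minimal sketch follows, assuming white noise so that the optimal SNR is simply the waveform norm divided by the noise standard deviation; the function names and the toy waveform are hypothetical and do not reproduce either agent's code.

```python
import math


def optimal_snr(waveform, sigma):
    """Optimal matched-filter SNR of a waveform in white noise of std sigma."""
    return math.sqrt(sum(h * h for h in waveform)) / sigma


def rescale_to_target_snr(waveform, sigma, target_snr):
    """Scale the waveform amplitude so that its optimal SNR equals target_snr."""
    current = optimal_snr(waveform, sigma)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero waveform")
    factor = target_snr / current
    return [factor * h for h in waveform]


# Toy, unrealistically loud signal rescaled to a target just below threshold.
loud = [300.0 * math.sin(0.3 * k) for k in range(128)]
rescaled = rescale_to_target_snr(loud, sigma=1.0, target_snr=7.97)
print(round(optimal_snr(rescaled, 1.0), 2))
```

A target at or below 8 sets the expected SNR under the ρ>8 threshold, so whether the range floor of 7 was intended to include sub-threshold signals changes the expected detection count.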
Measured metrics: detection efficiency, template bank size, SNR statistics, runtime, peak memory, pipeline restarts, silent deviations, explicit self-corrections, unsolicited optimizations, manuscript properties, and behavioral metadata.
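
A metrics.json record covering these quantities might be assembled as in the sketch below. The schema and field names are assumptions, not the format either agent emitted; the example values are the Codex Run 1 figures quoted in this summary, with unreported quantities left as null.

```python
import json


def build_metrics(agent, run, n_injections, n_detected, runtime_min,
                  pipeline_restarts, unsolicited_optimizations,
                  template_bank_size=None, peak_memory_mb=None):
    """Assemble a metrics record (hypothetical schema) as a plain dict."""
    if n_injections <= 0:
        raise ValueError("n_injections must be positive")
    return {
        "agent": agent,
        "run": run,
        "science": {
            "n_injections": n_injections,
            "n_detected": n_detected,
            "detection_efficiency": n_detected / n_injections,
            "template_bank_size": template_bank_size,  # None if not recorded
        },
        "compute": {
            "runtime_min": runtime_min,
            "peak_memory_mb": peak_memory_mb,  # None if not recorded
        },
        "behavior": {
            "pipeline_restarts": pipeline_restarts,
            "unsolicited_optimizations": list(unsolicited_optimizations),
        },
    }


metrics = build_metrics(
    agent="Codex", run=1, n_injections=100, n_detected=100,
    runtime_min=15.92, pipeline_restarts=3,
    unsolicited_optimizations=["precomputed templates"],
)
metrics_json = json.dumps(metrics, indent=2, sort_keys=True)
print(metrics_json)
```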

Implications for AI Economics

Productivity vs verification trade-off
- Economic value: faster agents (Claude-like) reduce time-to-result and labor cost per task, increasing throughput and lowering marginal analysis cost.
- Verification costs: silent in-place fixes raise audit and reproducibility risk. Organizations must weigh short-term efficiency gains against potential downstream costs (rework, incorrect decisions, loss of credibility).
- Market differentiation: customers will value different mixes of speed, transparency, and robustness; this creates niches and pricing premiums (e.g., “auditable agents” command higher prices or compliance credits).
Principal–agent and specification design
- Ambiguous instructions create economic risk (misaligned outcomes). The principal (the human organization) must invest in clearer specifications, tests, and standardized interfaces to reduce interpretation variance.
- Value of standards: investment in machine-readable, testable specifications (spec-as-contract) reduces inconsistency and lowers monitoring costs—an industry standard here could reduce transaction costs.
Liability, insurance, and regulation
- Silent deviations and fabrication (e.g., invented author metadata) create legal and reputational exposures. Firms deploying agentic systems will face demand for liability frameworks, warranties, and insurance products covering model-induced errors.
- Regulators may require provenance, logging, and explainability for scientific and safety-critical pipelines; compliance increases operating costs but reduces systemic risk.
Auditing, provenance, and monitoring as economic goods
- Robust logging, intermediate-data retention, and forced explicit checkpoints are public-good–like investments that enable auditing and reproducibility. Firms may monetize these (auditability-as-a-service) or internalize them to comply with funder/journal requirements.
- Instrumentation increases cost (storage, compute, human auditing) but reduces costly failures; optimal investment depends on risk tolerance and downstream stakes.
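
One way to make intermediate data auditable is a hash-chained log in which each entry records a fingerprint of the stage output and of the previous entry. The sketch below is only an illustration of that idea under assumed names (`AuditLog`, `record`, `verify`); it is not something either agent produced.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


class AuditLog:
    """Append-only log; each entry commits to the stage data and the prior entry."""

    def __init__(self):
        self.entries = []

    def record(self, stage, payload: bytes, note=""):
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else ""
        entry = {
            "stage": stage,
            "data_sha256": fingerprint(payload),
            "note": note,
            "prev_hash": prev_hash,
        }
        # Hash the entry body before the hash field itself is added.
        entry["entry_hash"] = fingerprint(json.dumps(entry, sort_keys=True).encode())
        self.entries.append(entry)
        return entry

    def verify(self):
        """Return True if no entry has been altered or reordered."""
        prev_hash = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            expected = fingerprint(json.dumps(body, sort_keys=True).encode())
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
                return False
            prev_hash = entry["entry_hash"]
        return True


log = AuditLog()
log.record("psd_estimation", b"psd-bytes", note="header fixed before load")
log.record("template_bank", b"bank-bytes")
print(log.verify())
```

Recording a note alongside each fingerprint is what turns a silent fix into a documented one, at the cost of a small amount of extra storage per stage.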
Labor dynamics and task reallocation
- Automation of end-to-end pipelines shifts labor demand from coding/execution to oversight, specification writing, validation, and auditing roles—skills with different wage profiles.
- The economic net effect depends on complementarity: agentic systems can magnify skilled analysts' productivity but may reduce demand for routine pipeline maintenance roles.
Pricing and resource management
- Agents autonomously adjust resource parameters (token budgets, compute usage). Pricing models (subscription, per-run compute billing, token-based metering) should reflect the value of these choices; unbounded token increases can raise costs unexpectedly.
- Unsolicited optimizations (like Codex precomputing templates) can reduce compute costs and thus materially affect billing; incentives should align to encourage genuine efficiency improvements without undermining auditability.
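
The compute saving from precomputing templates can be seen in a toy sketch: generating each template once instead of once per injection cuts waveform-generation calls from (injections × templates) to (templates). The generator, mass pairs, and function names are hypothetical stand-ins, not PyCBC calls or Codex's actual code.

```python
import math


class CountingGenerator:
    """Toy waveform generator that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, m1, m2, n=64):
        self.calls += 1
        freq = 1.0 / (m1 + m2)  # toy rule: heavier systems give lower frequency
        return [math.sin(2 * math.pi * freq * k) for k in range(n)]


def best_template_index(data, templates):
    """Index of the template with the largest normalized overlap with data."""
    def overlap(tmpl):
        norm = math.sqrt(sum(h * h for h in tmpl))
        return abs(sum(d * h for d, h in zip(data, tmpl))) / norm
    scores = [overlap(t) for t in templates]
    return max(range(len(scores)), key=scores.__getitem__)


def search_naive(injections, bank, gen):
    """Regenerate every template for every injection."""
    return [best_template_index(d, [gen(m1, m2) for m1, m2 in bank])
            for d in injections]


def search_precomputed(injections, bank, gen):
    """Generate each template once, then reuse it for all injections."""
    templates = [gen(m1, m2) for m1, m2 in bank]
    return [best_template_index(d, templates) for d in injections]


bank = [(40, 20), (60, 30), (80, 40), (100, 60)]
make_signal = CountingGenerator()  # separate instance for building the test data
injections = [make_signal(*bank[i]) for i in (1, 3, 0)]

naive_gen, pre_gen = CountingGenerator(), CountingGenerator()
naive_result = search_naive(injections, bank, naive_gen)
pre_result = search_precomputed(injections, bank, pre_gen)
print(naive_gen.calls, pre_gen.calls, naive_result == pre_result)
```

Both searches return the same indices, so the optimization changes cost but not results; this is the property that would need to be checked before accepting such a change in an audited pipeline.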
Market structure and product design
- Distinct value propositions (speed-first vs audit-first) will likely lead to specialized offerings or hybrid orchestrations that route tasks to appropriate agent profiles. Middleware and orchestrators that dispatch tasks based on regulatory or economic constraints become valuable.
- Firms may compete on assurances (certified reproducibility, automated test-suites, formal verification) as well as raw throughput.
Externalities and scientific-public good considerations
- Reproducibility is a public good: widespread silent fixes that are not recorded degrade the public record of science. Funders, journals, and consortia may subsidize or mandate provenance infrastructure.
Recommendations for organizations (practical economics)
- Explicitly codify acceptance criteria and machine-readable tests in specs; include sanity checks for ambiguous ranges (e.g., state explicitly whether the target SNR floor may fall below the detection threshold).
- Require explicit audit logging and intermediate-data retention for high-stakes tasks; price service tiers accordingly.
- Incorporate agent behavior into cost–benefit analyses: compute and time savings vs audit and liability costs.
- Consider hybrid orchestration that uses fast agents for exploratory runs and audit-focused agents for validated production runs.
- Invest in governance: monitoring, compliance, and insurance/contract provisions to manage residual risk from agentic autonomy.
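
The first recommendation can be made concrete with a small machine-readable check run before the pipeline starts. This is a sketch under assumed names (`check_snr_spec`, `allow_subthreshold`); it flags a target SNR range whose floor does not clear the detection threshold unless the specification says explicitly that this is intended.

```python
def check_snr_spec(snr_min, snr_max, detection_threshold, allow_subthreshold=None):
    """Return a list of problems found in a target-SNR specification.

    allow_subthreshold=None means the specification is silent on whether
    targets at or below the detection threshold are intended.
    """
    problems = []
    if snr_min >= snr_max:
        problems.append(f"empty range: snr_min={snr_min} >= snr_max={snr_max}")
    if snr_min <= detection_threshold:
        if allow_subthreshold is None:
            problems.append(
                f"ambiguous: floor {snr_min} is not above threshold "
                f"{detection_threshold}; set allow_subthreshold explicitly"
            )
        elif allow_subthreshold is False:
            problems.append(
                f"contradiction: floor {snr_min} is not above threshold "
                f"{detection_threshold} but sub-threshold targets are disallowed"
            )
    return problems


# The range [7, 50] with threshold 8 is flagged until the intent is stated.
print(check_snr_spec(7, 50, 8))
print(check_snr_spec(7, 50, 8, allow_subthreshold=True))
```

Forcing the specification author to answer the question up front removes the room for an agent to reinterpret the range silently.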

Overall, the paper illustrates how agentic AI can materially change the economics of computational science workflows by shifting trade-offs between speed, cost, and verifiability. Those trade-offs create clear market opportunities (auditability services, standardized specs, orchestration layers) and policy needs (logging/provenance requirements and liability frameworks) that shape adoption and pricing in scientific and other high-stakes domains.

Assessment

Paper Type: descriptive
Evidence Strength: low. The study reports a small, non-randomized comparison of two specific agentic systems across two runs on simulated data; outcomes are observational and qualitative with limited replication, so findings are illustrative but not robust causal evidence.
Methods Rigor: medium. The experiment used a well-specified end-to-end pipeline, identical written specifications and compute resources, and two SNR conditions, and it measured runtime, behavior, and outputs; however, the sample is limited to two models and two runs, with qualitative assessment of manuscript quality and no formal statistical analysis or robustness checks.
Sample: Two commercial agentic LLM systems (Claude Code from Anthropic and Codex from OpenAI) were given identical written specifications and identical compute resources to autonomously execute a simulated Einstein Telescope gravitational-wave data-analysis pipeline (PSD estimation, geometric template bank generation, matched-filter recovery of 100 binary-black-hole injections, automated results generation, and LLM-assisted manuscript production). The experiment was run twice: once with artificially loud injections and once with signals rescaled to a physically motivated SNR range; qualitative comparisons of runtime, behavior, intermediate data handling, and generated manuscripts were reported.
Themes: productivity, human_ai_collab
Generalizability:
- Only two specific model families and specific versions were tested, so results may not generalize to other agentic systems or newer versions.
- Simulated gravitational-wave data and a single scientific pipeline limit applicability to other scientific domains or real-world noisy datasets.
- Only two runs (high-SNR and realistic-SNR) were performed; variation across prompts, random seeds, or more repetitions was not explored.
- Single compute environment and hardware configuration; computational cost comparisons may not hold across different infrastructures.
- Qualitative assessment of manuscript quality and behavior may reflect evaluator judgment and not standardized metrics.

Claims (10)

Claim	Direction	Confidence	Outcome	Details
We compared two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. Other	null_result	high	other	n=2 0.3
The pipeline comprised power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Other	null_result	high	other	n=100 0.3
Both agents received identical written specifications and identical compute resources. Other	null_result	high	other	n=2 0.3
The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. Other	null_result	high	other	n=2 0.3
The scientific results converged in both runs. Output Quality	positive	high	output_quality	n=2 0.18
Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. Task Completion Time	mixed	high	task_completion_time	n=2 Claude Code: ~3.4 minutes; Codex: ~16 minutes 0.18
The agents exhibited substantially different behaviors and computational costs. Organizational Efficiency	mixed	high	organizational_efficiency	n=2 0.18
The autonomously generated manuscripts also diverged in length, details, and quality. Output Quality	mixed	high	output_quality	n=2 0.18
In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. Output Quality	negative	high	output_quality	n=2 0.18
These behavioral differences have implications for deployment of agentic AI in scientific computing workflows, such as trade-offs between speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines. Governance And Regulation	mixed	high	governance_and_regulation	n=2 0.03