An LLM-driven synthesis system shrinks months of expert formal-verification work to hours: IDS fully implements and verifies seven distributed key-value-store specs in ~6.8 hours at $106 each, outperforming prior coding agents and producing implementations up to three times faster than published verified systems.
AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.
Summary
Main Finding
Inductive Deductive Synthesis (IDS) is a multi-agent LLM system that jointly and incrementally synthesizes implementations and machine-checked proofs from a formal specification. On a suite of distributed key-value-store consistency specifications, IDS produced fully verified implementations for all 7 specs (7/7) automatically in about 6.8 hours and roughly $106 per spec on average — outperforming state-of-the-art coding agents (Codex/GPT-5.4 and Claude Code/Opus 4.6), which solved only 2/7 under the same prompt and budget. IDS’ loop of interleaving partial proofs, verifier feedback, strategy learning from failures, and performance benchmarking also found implementations that match or exceed hand-written verified references (up to 3× throughput vs. Chapar’s vector-clock reference). As a general prover, IDS improved state-of-the-art performance on multiple public verification benchmarks.
Key Points
- Joint inductive–deductive loop: IDS synthesizes code and proof together, incrementally. Partial proofs (holes/admitted lemmas) are allowed so Rocq’s type-checker can validate intermediate states and rule out dead-end design choices early.
- Agentic architecture:
- Deductive Synthesis Agents (DSAs): LLM-guided, stepwise builders that extend implementations/proofs and query Rocq after each step.
- Inductive Synthesis Agent (ISA): monitors failures/stalls and proposes (a) tactical fixes (helper lemmas, decompositions) or (b) strategic reloads (entirely new designs).
- Coordinator: runs parallel DSAs, benchmarks completed implementations, audits final proofs for non-vacuity and interface conformance, and feeds performance data back to the ISA.
- Verifier-as-oracle: Rocq’s type-checker provides precise, no-false-positive/-negative feedback on both full and partial artifacts, which is leveraged as a high-fidelity guide rather than relying solely on tests.
- Performance-aware synthesis: completed candidate implementations are extracted to OCaml and benchmarked on a distributed testbed; these metrics guide the search toward efficient designs.
- Empirical results summary:
- 7/7 verified implementations in ~6.8 hours and ~$106/spec (max 11h, $155).
- SOTA coding agents solved 2/7 under same budget/prompt.
- IDS’ implementations matched or beat expert references (up to 3× throughput).
- General verifier performance improvements: DSA alone improved on four public proof benchmarks by +10% to +51%.
- Cross-benchmark synthesis gains: e.g., 176/189 on VERINA (prior 38/189); 65/77 on AlgoVeri (prior 31/77); 62/62 on CloverBench (perfect).
- IDS substantially reduces the human effort and time needed for verified distributed systems (authors cite ~200× faster than prior expert effort estimates).
Data & Methods
- Tasks / Benchmarks:
- Primary: 7 Rocq specifications of distributed key-value-store consistency: Chapar’s causal-consistency (CC) spec plus six IDS-suite specs (read-your-writes, monotonic writes, monotonic reads, RYW+MW composition, IDS-CC, LCC). The IDS suite was co-developed with experts and released as a benchmark.
- Secondary: public verification benchmarks across languages (DafnyBench, miniCodeProps, Verus-Bench, CoqStoq) and synthesis benchmarks (VERINA, AlgoVeri, CloverBench).
- System:
- IDS is a multi-agent LLM system integrating Rocq (a Coq-like proof assistant) as the formal checker/oracle.
- Agents are LLMs pretrained on code and proofs; the DSA enforces strict constraints (well-typed before proving, specs immutable, final outputs without Admitted).
- ISA provides strategy evolution (proposer for lemmas; reloader for new designs).
- Evaluation pipeline:
- Synthesis proceeds until a closed proof is achieved; completed Gallina implementations are extracted to OCaml via Rocq extraction.
- Implementations are run on a 5-VM Google Cloud cluster; each run uses 4 parallel workers × 1,000 random operations, sweeps over put rates, and scaling at chosen put rate.
- Benchmarks compare throughput (ops/s) and robustness; every run uses the same wall-clock and dollar budgets across methods.
- Baselines and costs:
- Baselines: Codex (GPT-5.4) and Claude Code (Opus 4.6) under identical prompts and budgets.
- Reported average synthesis time ≈ 6.8 hours and average dollar cost ≈ $106 per spec for IDS; compared to SOTA agents and historical expert-month estimates (months→years).
- Verification and audit:
- Rocq’s type-checker verifies both partial and complete artifacts.
- Final audit checks proofs are non-vacuous and interfaces conform to spec; performance benchmarks ensure implementations are not only correct but also practical.
Implications for AI Economics
- Large reductions in verification labor cost and calendar time:
- IDS compresses months-to-years of formal-proof effort into hours at modest compute cost (~$100/spec), implying large productivity gains and sharply lower marginal cost for producing machine-checked systems. This suggests major downward pressure on prices for verified-system development and large productivity increases for teams that adopt these tools.
- Labor-market effects and skill shifting:
- Demand for routine, manual formal-proof work may decline as IDS-like tools mature, but demand for higher-level roles (tooling oversight, formal-spec authoring, audit/regulation, and complex strategy design) may persist or shift upward in skill premium.
- The complementary skillset may move toward hybrid roles that combine system design, verification oversight, and LLM orchestration.
- Capitalization and competitive advantage:
- Firms with early access to robust verified-synthesis pipelines gain faster time-to-market for high-assurance services, lower liability risk, and potential cost advantages. This may accelerate concentration if access to the required compute, expert prompts, or private models is uneven.
- Product and industry impacts:
- Lower cost of verification could enable wider deployment of formally-verified components in safety- and correctness-critical domains (distributed databases, storage systems, critical infrastructure), potentially changing investment decisions and risk assessments.
- Insurance, compliance, and liability frameworks may evolve: cheaper formal verification lowers expected failure risk and could reduce premiums or change liability expectations; regulators might raise the bar by expecting machine-checked guarantees in certain domains.
- Market for verified components and standards:
- A commoditized pipeline for verified implementations could spawn a market for certified, drop-in verified modules. Standardization and third-party audit markets may grow (and be profitable) to validate IDS outputs in high-stakes contexts.
- Research and capital allocation:
- Economists and firms should model substitution effects between human expert months and automated verification compute. Returns to scale may favor organizations that can amortize fixed engineering of agent orchestration and integrate performance testing loops.
- Caution and externalities:
- Over-reliance on automated verification is a risk: the correctness of the formal specification, trust in the proof assistant, and soundness of extraction/runtime remain critical. Mis-specified requirements lead to formally-verified-but-wrong systems.
- IDS requires formal specifications as input; costs and expertise to write correct specs remain a bottleneck and a potential niche for labor and services.
- Access inequality (proprietary LLMs, cloud resources) may create uneven adoption and market power.
- Directions for AI economics research:
- Empirically quantify substitution elasticities between formal-verification specialists and automated synthesis.
- Model pricing for verified software components, auditing services, and liability insurance in a world with low-cost formal synthesis.
- Study diffusion dynamics: adoption thresholds, regulatory incentives, and competitive responses in infrastructure and cloud markets.
Summary conclusion: IDS demonstrates that automated, LLM-driven joint code+proof synthesis can drastically reduce the time and cost of producing formally-verified distributed systems while improving performance. The economic implications are broad — from labor-market shifts and new product markets to regulatory and liability effects — and warrant targeted economic modeling and empirical study.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. Output Quality | negative | high | output_quality |
n=7
2/7
0.12
|
| IDS achieves 7/7 (succeeds on all 7 specs) in about 6.8 hours per spec on average. Task Completion Time | positive | high | task_completion_time |
n=7
about 6.8 hours
0.12
|
| IDS costs $106 per spec on average. Organizational Efficiency | positive | high | organizational_efficiency |
n=7
$106 per spec
0.12
|
| IDS is roughly 200x faster than expert effort. Task Completion Time | positive | high | task_completion_time |
n=7
roughly 200x faster
0.06
|
| IDS is 17% cheaper than SOTA agents. Organizational Efficiency | positive | high | organizational_efficiency |
n=7
17% cheaper
0.12
|
| IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems. Organizational Efficiency | positive | high | organizational_efficiency |
up to 3x faster
0.12
|
| This paper presents the first effective approach to addressing the gap between LLM coding agents and mechanized formal verification for distributed systems (Inductive Deductive Synthesis, IDS). Research Productivity | positive | high | research_productivity |
n=7
0.06
|
| IDS jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Other | positive | high | other |
0.02
|