An LLM-driven synthesis system shrinks months of expert formal-verification work to hours: IDS fully implements and verifies seven distributed key-value-store specs in ~6.8 hours at $106 each, outperforming prior coding agents and producing implementations up to three times faster than published verified systems.

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

Shubham Agarwal, Alexander Krentsel, Shu Liu, Mert Cemri, Audrey Cheng, Rui Meng, Tomas Pfister, Chun-Liang Li, Sylvia Ratnasamy, Aditya Parameswaran, Matei Zaharia, Ion Stoica, Mohsen Lesani · May 22, 2026

arxiv other medium evidence 8/10 relevance Source PDF

IDS, an agentic LLM system that jointly synthesizes implementations and proofs and learns from failures, solved 7/7 distributed key-value-store specs in about 6.8 hours at $106 per spec—roughly 200× faster than expert effort, 17% cheaper than prior SOTA agents, and producing implementations up to 3× faster than published verified systems.

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

Summary

Main Finding

Inductive Deductive Synthesis (IDS) is a multi-agent LLM system that jointly and incrementally synthesizes implementations and machine-checked proofs from a formal specification. On a suite of distributed key-value-store consistency specifications, IDS produced fully verified implementations for all 7 specs (7/7) automatically in about 6.8 hours and roughly $106 per spec on average — outperforming state-of-the-art coding agents (Codex/GPT-5.4 and Claude Code/Opus 4.6), which solved only 2/7 under the same prompt and budget. IDS’ loop of interleaving partial proofs, verifier feedback, strategy learning from failures, and performance benchmarking also found implementations that match or exceed hand-written verified references (up to 3× throughput vs. Chapar’s vector-clock reference). As a general prover, IDS improved state-of-the-art performance on multiple public verification benchmarks.

Key Points

Joint inductive–deductive loop: IDS synthesizes code and proof together, incrementally. Partial proofs (holes/admitted lemmas) are allowed so Rocq’s type-checker can validate intermediate states and rule out dead-end design choices early.
Agentic architecture:
- Deductive Synthesis Agents (DSAs): LLM-guided, stepwise builders that extend implementations/proofs and query Rocq after each step.
- Inductive Synthesis Agent (ISA): monitors failures/stalls and proposes (a) tactical fixes (helper lemmas, decompositions) or (b) strategic reloads (entirely new designs).
- Coordinator: runs parallel DSAs, benchmarks completed implementations, audits final proofs for non-vacuity and interface conformance, and feeds performance data back to the ISA.
Verifier-as-oracle: Rocq’s type-checker provides precise, no-false-positive/-negative feedback on both full and partial artifacts, which is leveraged as a high-fidelity guide rather than relying solely on tests.
Performance-aware synthesis: completed candidate implementations are extracted to OCaml and benchmarked on a distributed testbed; these metrics guide the search toward efficient designs.
Empirical results summary:
- 7/7 verified implementations in ~6.8 hours and ~$106/spec (max 11h, $155).
- SOTA coding agents solved 2/7 under same budget/prompt.
- IDS’ implementations matched or beat expert references (up to 3× throughput).
- General verifier performance improvements: DSA alone improved on four public proof benchmarks by +10% to +51%.
- Cross-benchmark synthesis gains: e.g., 176/189 on VERINA (prior 38/189); 65/77 on AlgoVeri (prior 31/77); 62/62 on CloverBench (perfect).
IDS substantially reduces the human effort and time needed for verified distributed systems (authors cite ~200× faster than prior expert effort estimates).

Data & Methods

Tasks / Benchmarks:
- Primary: 7 Rocq specifications of distributed key-value-store consistency: Chapar’s causal-consistency (CC) spec plus six IDS-suite specs (read-your-writes, monotonic writes, monotonic reads, RYW+MW composition, IDS-CC, LCC). The IDS suite was co-developed with experts and released as a benchmark.
- Secondary: public verification benchmarks across languages (DafnyBench, miniCodeProps, Verus-Bench, CoqStoq) and synthesis benchmarks (VERINA, AlgoVeri, CloverBench).
System:
- IDS is a multi-agent LLM system integrating Rocq (a Coq-like proof assistant) as the formal checker/oracle.
- Agents are LLMs pretrained on code and proofs; the DSA enforces strict constraints (well-typed before proving, specs immutable, final outputs without Admitted).
- ISA provides strategy evolution (proposer for lemmas; reloader for new designs).
Evaluation pipeline:
- Synthesis proceeds until a closed proof is achieved; completed Gallina implementations are extracted to OCaml via Rocq extraction.
- Implementations are run on a 5-VM Google Cloud cluster; each run uses 4 parallel workers × 1,000 random operations, sweeps over put rates, and scaling at chosen put rate.
- Benchmarks compare throughput (ops/s) and robustness; every run uses the same wall-clock and dollar budgets across methods.
Baselines and costs:
- Baselines: Codex (GPT-5.4) and Claude Code (Opus 4.6) under identical prompts and budgets.
- Reported average synthesis time ≈ 6.8 hours and average dollar cost ≈ $106 per spec for IDS; compared to SOTA agents and historical expert-month estimates (months→years).
Verification and audit:
- Rocq’s type-checker verifies both partial and complete artifacts.
- Final audit checks proofs are non-vacuous and interfaces conform to spec; performance benchmarks ensure implementations are not only correct but also practical.

Implications for AI Economics

Large reductions in verification labor cost and calendar time:
- IDS compresses months-to-years of formal-proof effort into hours at modest compute cost (~$100/spec), implying large productivity gains and sharply lower marginal cost for producing machine-checked systems. This suggests major downward pressure on prices for verified-system development and large productivity increases for teams that adopt these tools.
Labor-market effects and skill shifting:
- Demand for routine, manual formal-proof work may decline as IDS-like tools mature, but demand for higher-level roles (tooling oversight, formal-spec authoring, audit/regulation, and complex strategy design) may persist or shift upward in skill premium.
- The complementary skillset may move toward hybrid roles that combine system design, verification oversight, and LLM orchestration.
Capitalization and competitive advantage:
- Firms with early access to robust verified-synthesis pipelines gain faster time-to-market for high-assurance services, lower liability risk, and potential cost advantages. This may accelerate concentration if access to the required compute, expert prompts, or private models is uneven.
Product and industry impacts:
- Lower cost of verification could enable wider deployment of formally-verified components in safety- and correctness-critical domains (distributed databases, storage systems, critical infrastructure), potentially changing investment decisions and risk assessments.
- Insurance, compliance, and liability frameworks may evolve: cheaper formal verification lowers expected failure risk and could reduce premiums or change liability expectations; regulators might raise the bar by expecting machine-checked guarantees in certain domains.
Market for verified components and standards:
- A commoditized pipeline for verified implementations could spawn a market for certified, drop-in verified modules. Standardization and third-party audit markets may grow (and be profitable) to validate IDS outputs in high-stakes contexts.
Research and capital allocation:
- Economists and firms should model substitution effects between human expert months and automated verification compute. Returns to scale may favor organizations that can amortize fixed engineering of agent orchestration and integrate performance testing loops.
Caution and externalities:
- Over-reliance on automated verification is a risk: the correctness of the formal specification, trust in the proof assistant, and soundness of extraction/runtime remain critical. Mis-specified requirements lead to formally-verified-but-wrong systems.
- IDS requires formal specifications as input; costs and expertise to write correct specs remain a bottleneck and a potential niche for labor and services.
- Access inequality (proprietary LLMs, cloud resources) may create uneven adoption and market power.
Directions for AI economics research:
- Empirically quantify substitution elasticities between formal-verification specialists and automated synthesis.
- Model pricing for verified software components, auditing services, and liability insurance in a world with low-cost formal synthesis.
- Study diffusion dynamics: adoption thresholds, regulatory incentives, and competitive responses in infrastructure and cloud markets.

Summary conclusion: IDS demonstrates that automated, LLM-driven joint code+proof synthesis can drastically reduce the time and cost of producing formally-verified distributed systems while improving performance. The economic implications are broad — from labor-market shifts and new product markets to regulatory and liability effects — and warrant targeted economic modeling and empirical study.

Assessment

Paper Typeother Evidence Strengthmedium — The paper provides direct empirical comparisons (task success rates, wall-clock time, monetary cost, and runtime performance) showing large improvements over state-of-the-art coding agents and expert effort, which is persuasive for the specific domain. However, the evaluation is limited to seven distributed key-value-store specifications, details about selection of those specs and variability across runs are not supplied, and cost/time calculations depend on system-specific engineering and pricing assumptions; these factors limit the strength of inference about broader effectiveness. Methods Rigormedium — The authors implement a novel method (IDS) and report multiple quantitative metrics versus baselines (SOTA agents and expert effort) and include performance benchmarking versus published verified systems, indicating careful empirical work. Yet the methods section appears to lack key reproducibility details (exact hardware/compute, number of trials/random seeds, hyperparameters, how specs were selected, and sensitivity analyses), and the empirical scope (7 specs) is small, leaving open potential selection bias and uncertainty about robustness. SampleEvaluation on seven distributed key-value-store specifications; comparisons against state-of-the-art coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) and against reported expert time-to-verify; measured outcomes include correctness (verification success), wall-clock synthesis time (~6.8 hours average), monetary cost (~$106 per spec), and runtime performance (implementations up to 3× faster than published verified systems). Details on hardware, compute budgets, number of random seeds, and selection criteria for the seven specs are not provided in the summary. Themesproductivity labor_markets GeneralizabilitySmall, domain-specific sample (only seven distributed key-value-store specs) limits how results translate across other types of software or larger systems., Performance and cost depend on particular LLMs, system engineering, cloud/pricing assumptions, and hardware used — results may not hold with different models or configurations., Success may rely on specification quality; unclear how IDS handles messy/partial/ambiguous real-world specs., Unclear robustness to harder or larger-scale distributed protocols beyond the tested benchmarks., Potential selection bias if specs were chosen where inductive/deductive synthesis is particularly well-suited.

Claims (8)

Claim	Direction	Confidence	Outcome	Details
Even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. Output Quality	negative	high	output_quality	n=7 2/7 0.12
IDS achieves 7/7 (succeeds on all 7 specs) in about 6.8 hours per spec on average. Task Completion Time	positive	high	task_completion_time	n=7 about 6.8 hours 0.12
IDS costs $106 per spec on average. Organizational Efficiency	positive	high	organizational_efficiency	n=7 $106 per spec 0.12
IDS is roughly 200x faster than expert effort. Task Completion Time	positive	high	task_completion_time	n=7 roughly 200x faster 0.06
IDS is 17% cheaper than SOTA agents. Organizational Efficiency	positive	high	organizational_efficiency	n=7 17% cheaper 0.12
IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems. Organizational Efficiency	positive	high	organizational_efficiency	up to 3x faster 0.12
This paper presents the first effective approach to addressing the gap between LLM coding agents and mechanized formal verification for distributed systems (Inductive Deductive Synthesis, IDS). Research Productivity	positive	high	research_productivity	n=7 0.06
IDS jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Other	positive	high	other	0.02