Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau Equilibrium

We present a complete Lean 4 formalization of the equilibrium characterization in the Vlasov-Maxwell-Landau (VML) system, which describes the motion of charged plasma. The project demonstrates the full AI-assisted mathematical research loop: an AI reasoning model (Gemini DeepThink) generated the proof from a conjecture, an agentic coding tool (Claude Code) translated it into Lean from natural-language prompts, a specialized prover (Aristotle) closed 111 lemmas, and the Lean kernel verified the result. A single mathematician supervised the process over 10 days at a cost of \$200, writing zero lines of code. The entire development process is public: all 229 human prompts, and 213 git commits are archived in the repository. We report detailed lessons on AI failure modes -- hypothesis creep, definition-alignment bugs, agent avoidance behaviors -- and on what worked: the abstract/concrete proof split, adversarial self-review, and the critical role of human review of key definitions and theorem statements. Notably, the formalization was completed before the final draft of the corresponding math paper was finished.

Summary

Main Finding

A complete formalization in Lean 4 of the equilibrium characterization for the Vlasov–Maxwell–Landau (VML) system was produced through an AI-assisted pipeline. The pipeline combined: an AI reasoning model (Gemini DeepThink) to generate the proof, an agentic coding tool (Claude Code) to translate the proof to Lean, a specialized automated prover (Aristotle) that closed 111 lemmas, and the Lean kernel to fully verify the result. One mathematician supervised the process over 10 days at a reported cost of $200 and wrote no code. The entire development (229 human prompts, 213 git commits) and artifacts are public.

Key Points

Project scope: formalization of an equilibrium characterization in the VML system (charged plasma dynamics).
AI components and roles:
- Gemini DeepThink: generated the natural-language/mathematical proof from a conjecture.
- Claude Code: translated the proof into Lean 4 code based on prompts.
- Aristotle: a specialized prover that automatically closed 111 lemmas (automation inside the proof assistant workflow).
- Lean kernel: provided final machine verification of all formalized statements.
Human role: a single mathematician acted as supervisor/reviewer for ~10 days, cost ≈ $200, no code written by the human.
Transparency and reproducibility: full archive of 229 prompts and 213 commits available in the repository.
Process-level lessons:
- Failure modes observed: hypothesis creep, definition-alignment bugs (mismatch between informal and formal definitions), agent avoidance behaviors (agents delegating or failing to complete tasks).
- Successful practices: splitting proofs into abstract (high-level reasoning) and concrete (formalization) parts, adversarial self-review by agents, and targeted human review of key definitions and theorem statements.
Timing: the formalization finished prior to the final draft of the corresponding math paper.

Data & Methods

Artifacts:
- Full Lean 4 development containing formal statements, proof scripts, and verified theorems.
- Archive of 229 human prompts used to steer the agents.
- Git history with 213 commits documenting the code evolution.
Verification metrics:
- 111 lemmas automatically closed by the specialized prover.
- Full verification by the Lean kernel (i.e., machine-checked correctness).
Workflow/method:
Conjecture → AI reasoning model produces a human-readable proof outline.
Agentic coding tool translates outline into Lean code via iterative prompting.
Specialized prover attempts automated lemma discharge.
Human supervisor reviews definitions and key statements, runs the Lean verifier, and directs retries/repairs when agents fail or definitions misalign.
Empirical reporting: detailed qualitative analysis of failure modes and what mitigations worked, plus quantitative logs (prompts, commits, lemma counts).
Cost accounting: reported human supervision time (10 days) and monetary cost ($200); no coding time by human.

Implications for AI Economics

Productivity and task automation:
- Demonstrates that high-skill, knowledge-intensive tasks (formal mathematics) can be substantially automated with a heterogeneous AI toolchain, reducing human coding labor but retaining supervisory oversight.
- Suggests a new unit of productivity: formally verified outputs per human-supervised time; here, a significant formalization completed in 10 days with minimal monetary cost to labor.
Labor substitution and complementarity:
- Agents substitute for routine coding and lemma-proving work but complement humans for specification, definition alignment, and critical review. Economies will likely see task reallocation toward supervision, specification, and validation roles rather than full replacement of domain experts.
- The value of a small number of highly skilled supervisors may increase (scarcity of effective supervisors who can validate definitions/theorems).
Quality-adjusted output and verification:
- Machine verification (Lean kernel) increases the trustworthiness of outputs relative to informal results; this quality premium should be modeled (e.g., higher market value or lower downstream verification costs).
- However, failure modes (definition misalignment, hypothesis creep) imply monitoring/verification frictions; models should include nonnegligible verification and correction costs even when AI automates production.
Cost structure and scaling:
- Low reported monetary cost for supervision here ($200) partly reflects short time and specific setup; general equilibrium effects may differ as demand for supervisors rises or as more complex projects require more human effort.
- Fixed costs (toolchain development, model integration, prover tuning) and coordination costs (prompt engineering, commit reviews) matter; firms may internalize these as overhead, changing returns to scale.
Innovation and diffusion:
- Public archives (prompts, commits) accelerate diffusion by lowering search/learning costs and enabling replication, which increases adoption speed and lowers entry barriers for other researchers/firms.
- Early adopters in technical domains may gain outsized advantages due to knowledge accumulation of prompt libraries and toolchain configurations.
Policy and governance:
- The need for human review and the recorded failure modes argue for regulation/standards around AI-assisted scientific claims (e.g., disclosure of human/AI roles, archival of prompts and verification artifacts).
- Public-good value of verified artifacts suggests potential for subsidized infrastructure (formal proof repositories, automated provers) to maximize social returns.
Modeling recommendations for economists:
- Incorporate heterogeneous task decomposition (specification, translation, automation, verification) into models of automation.
- Treat agent reliability as stochastic with correlated failure modes; include monitoring costs and the supply of supervisors as constraints.
- Model quality-adjusted output and optionality value from verifiability (e.g., lower downstream liability or faster certification).
- Consider dynamic effects: prompt/commit archives reduce future costs (learning-by-sharing), potentially generating increasing returns and winner-take-most dynamics.
Measurement challenges:
- Single-project metrics (time and money) are informative but insufficient—need cross-project studies to estimate general equilibrium labor effects, scaling properties, and distributional consequences.
- Important to measure not just speed/cost but verification depth, residual human effort, and incidence of critical failures that require rework.

If you want, I can (a) extract quantitative timeline and resource-use charts from the repository, (b) draft a simple economic model that captures supervision vs automation tradeoffs and verification externalities, or (c) list specific variables and parameters economists should collect in field studies of AI-assisted scientific workflows. Which would be most useful?

Assessment

Paper Typedescriptive Evidence Strengthlow — Single-case, observational project-level evidence: richly documented and reproducible for this one formalization, but no counterfactual, no cross-project variation, and strong risk of selection and domain-specificity, so weak basis for general causal claims about AI and economic outcomes. Methods Rigormedium — High transparency and reproducibility (complete Lean development, 229 prompts, 213 commits, automated-verification outputs) and careful qualitative analysis of failure modes; however, no systematic empirical design, small N=1, and limited measurement of time/cost components beyond a reported human-supervision estimate. SampleA single Lean 4 formalization project of an equilibrium characterization in the Vlasov–Maxwell–Landau (VML) system (charged plasma dynamics); artifacts include the full Lean 4 development, 229 human prompts, 213 git commits, and metrics such as 111 lemmas closed by a specialized prover; AI components used were Gemini DeepThink (reasoning), Claude Code (code translation), Aristotle (automated prover), and the Lean kernel (verification); supervised by one mathematician for ~10 days (reported cost ≈ $200). Themesproductivity human_ai_collab adoption governance GeneralizabilitySingle-project case study: results may not generalize beyond this one formalization., Domain-specific: formalized pure/applied mathematics (theorem proving) differs from other knowledge-work tasks (e.g., empirical analysis, engineering design, customer service)., Toolchain- and model-version dependent: outcomes may change with different models, provers, or software configurations., Supervisor skill and availability: relies on a highly skilled mathematician for review—supply constraints could alter cost and feasibility at scale., Reported cost/time incomplete: monetary accounting excludes model/compute costs, tool setup, and upstream engineering overhead., Short-run snapshot: does not capture learning-by-doing, longer-run scaling, or market equilibrium effects.

Claims (13)

Claim	Direction	Confidence	Outcome	Details
A complete formalization in Lean 4 of the equilibrium characterization for the Vlasov–Maxwell–Landau (VML) system was produced through an AI-assisted pipeline. Other	positive	high	completeness of formalization / machine-checked verification of the VML equilibrium characterization	n=1 0.09
The AI-assisted pipeline combined an AI reasoning model (Gemini DeepThink) to generate the proof, an agentic coding tool (Claude Code) to translate the proof to Lean, a specialized automated prover (Aristotle) that closed 111 lemmas, and the Lean kernel to fully verify the result. Other	positive	high	composition of toolchain and number of lemmas automatically discharged (111)	n=1 0.09
A specialized prover (Aristotle) automatically closed 111 lemmas during the development. Other	positive	high	number of lemmas automatically discharged by Aristotle (111)	n=1 0.09
The Lean kernel provided full machine verification of all formalized statements in the development. Other	positive	high	machine-checked verification status of formalized statements (verified/unverified)	n=1 verified 0.09
One mathematician supervised the process over approximately 10 days, reported a human cost of about $200, and wrote no code. Other	positive	medium	human supervision time (≈10 days), monetary supervision cost (≈$200), human coding contribution (none)	n=1 ≈10 days supervision; ≈$200 cost; no code written 0.05
The development artifacts are fully transparent and reproducible: the repository includes an archive of 229 human prompts and a git history with 213 commits. Other	positive	high	number of human prompts archived (229); number of git commits (213); public availability of artifacts	n=1 229 prompts; 213 commits 0.09
Observed failure modes during the workflow included hypothesis creep, definition-alignment bugs (mismatch between informal and formal definitions), and agent avoidance behaviors (agents delegating or failing to complete tasks). Other	negative	medium	presence and types of failure modes observed in the workflow (hypothesis creep, definition-alignment bugs, agent avoidance)	n=1 hypothesis creep; definition-alignment bugs; agent avoidance 0.05
Effective practices included splitting proofs into abstract (high-level reasoning) and concrete (formalization) parts, having agents perform adversarial self-review, and targeting human review to key definitions and theorem statements. Other	positive	medium	process practices associated with smoother formalization (binary presence/use of practices)	n=1 use of split proof workflow, agent adversarial self-review, targeted human review 0.05
The formalization finished prior to the final draft of the corresponding informal math paper. Other	positive	medium	relative completion timing (formalization finished before final draft of math paper)	n=1 formalization completed prior to final draft 0.05
The project demonstrates that high-skill, knowledge-intensive tasks (formal mathematics) can be substantially automated with a heterogeneous AI toolchain, reducing human coding labor while retaining supervisory oversight. Research Productivity	positive	medium	degree of automation in formal mathematics work (reduction in human coding effort in this project)	n=1 0.05
Reported monetary supervision cost was low (~$200) for this project, but the paper cautions that general equilibrium effects and scaling may change costs as demand for supervisors rises. Research Productivity	mixed	medium	monetary supervision cost for this project (≈$200) and authors' caution about scalability	n=1 $200 (reported supervision cost) 0.05
Public archives of prompts and commits accelerate diffusion by lowering search/learning costs and enabling replication, thereby increasing adoption speed and lowering entry barriers. Adoption Rate	positive	speculative	hypothesized effect on diffusion/adoption (not directly measured in the project)	0.01
Because failure modes such as definition misalignment and hypothesis creep were observed, the authors argue for regulation/standards around disclosure of AI-assisted scientific claims and archival of verification artifacts. Governance And Regulation	mixed	speculative	policy recommendation presence (advocacy for disclosure/archival standards) based on observed failure modes	advocacy for disclosure/archival standards 0.01