A Multi-agent AI System for Deep Learning Model Migration from TensorFlow to JAX

The rapid development of AI-based products and their underlying models has led to constant innovation in deep learning frameworks. Google has been pioneering machine learning usage across dozens of products. Maintaining the multitude of model source codes in different ML frameworks and versions is a significant challenge. So far the maintenance and migration work was done largely manually by human experts. We describe an AI-based multi-agent system that we built to support automatic migration of TensorFlow-based deep learning models into JAX-based ones. We make three main contributions: First, we show how an AI planner that uses a mix of static analysis with AI instructions can create migration plans for very complex code components that are reliably followed by the combination of an orchestrator and coders, using AI-generated example-based playbooks. Second, we define quality metrics and AI-based judges that accelerate development when the code to evaluate has no tests and has to adhere to strict style and dependency requirements. Third, we demonstrate how the system accelerates code migrations in a large hyperscaler environment on commercial real-world use-cases. Our approach dramatically reduces the time (6.4x-8x speedup) for deep learning model migrations and creates a virtuous circle where effectively AI supports its own development workflow. We expect that the techniques and approaches described here can be generalized for other framework migrations and general code transformation tasks.

Summary

Main Finding

The authors build a bespoke multi-agent AI system that automates large-scale migrations of deep-learning code from TensorFlow to JAX (Flax Linen). By combining static analysis, hierarchical “playbooks,” and a three-role agent architecture (Planner, Orchestrator, Coder) plus LLM-based judges, the system produces buildable, testable migrations and yields large productivity gains: reported end-to-end speedups of about 6.4x–8x versus manual migration workflows.

Key Points

Problem scope
- TensorFlow and JAX have fundamental architectural differences (stateful OO vs. functional/stateless, different defaults, implicit side-channels like TF regularizers vs. explicit PyTrees in JAX).
- Migrating thousands of production models manually would cost “hundreds of expert engineering years.”
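The stateful versus functional contrast can be illustrated with a small plain-Python analogy. This is only a sketch: it imports neither TensorFlow nor JAX, and the names `StatefulDense`, `dense_apply`, and `l2_penalty` are hypothetical, chosen here to show why an implicit side channel has to become an explicit value during migration.

```python
from dataclasses import dataclass, field
from typing import List


# Stateful, object-oriented style (an analogy for TF/Keras layers):
# the layer owns its weight and records a regularization penalty
# as a side effect of being called.
@dataclass
class StatefulDense:
    weight: float
    l2: float = 0.01
    losses: List[float] = field(default_factory=list)  # implicit side channel

    def __call__(self, x: float) -> float:
        self.losses.append(self.l2 * self.weight ** 2)
        return self.weight * x


# Functional, stateless style (an analogy for JAX): parameters live in an
# explicit nested dict (a PyTree-like structure) that is passed in, and
# everything the caller needs is returned rather than stored.
def dense_apply(params: dict, x: float) -> float:
    return params["dense"]["weight"] * x


def l2_penalty(params: dict, l2: float = 0.01) -> float:
    return l2 * params["dense"]["weight"] ** 2


layer = StatefulDense(weight=2.0)
y_stateful = layer(3.0)          # penalty is now hidden in layer.losses

params = {"dense": {"weight": 2.0}}
y_functional = dense_apply(params, 3.0)
penalty = l2_penalty(params)     # penalty must be requested explicitly
```

A migration tool has to find every such hidden channel in the source model and turn it into an explicit argument or return value in the target.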
Multi-agent architecture
- Planner: hybrid static+LLM approach that discovers dependencies (using Kythe static analysis) and generates a fine-grained JSON plan of migration steps with explicit validation conditions.
- Orchestrator: groups planner steps into manageable sub-steps, injects only relevant playbooks, handles retries/failure strategies, and persists progress.
- Coder: a ReAct-style coding agent with repository tools (read/write, search, build, test, dependency fixes). Each coder invocation must produce buildable/testable outputs and runs an internal self-review before claiming completion.
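The division of labor between the Orchestrator and the Coder can be sketched as a loop that chunks plan steps, calls a coder, and retries until the output builds. This is a minimal sketch under stated assumptions: `Step`, `CoderResult`, `group_steps`, `run_group`, and `fake_coder` are hypothetical names, and the coder is a stub standing in for a real LLM agent with repository tools.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    step_id: str
    instruction: str


@dataclass
class CoderResult:
    builds: bool   # did the coder's output build?
    summary: str   # short summary of what was changed


Coder = Callable[[List[Step], int], CoderResult]


def group_steps(steps: List[Step], max_per_group: int) -> List[List[Step]]:
    """Orchestrator role: chunk planner steps into manageable sub-steps."""
    return [steps[i:i + max_per_group] for i in range(0, len(steps), max_per_group)]


def run_group(group: List[Step], coder: Coder, max_retries: int = 2) -> CoderResult:
    """Call the coder and retry while its output does not build."""
    attempt = 0
    result = coder(group, attempt)
    while not result.builds and attempt < max_retries:
        attempt += 1
        result = coder(group, attempt)
    return result


def orchestrate(steps: List[Step], coder: Coder, max_per_group: int = 2) -> List[str]:
    summaries = []
    for group in group_steps(steps, max_per_group):
        result = run_group(group, coder)
        if not result.builds:
            ids = [s.step_id for s in group]
            raise RuntimeError(f"steps {ids} still fail to build after retries")
        summaries.append(result.summary)
    return summaries


# Hypothetical stub coder: fails on the first attempt, succeeds on the retry.
def fake_coder(group: List[Step], attempt: int) -> CoderResult:
    ids = ",".join(s.step_id for s in group)
    return CoderResult(builds=attempt >= 1, summary=f"migrated {ids} (attempt {attempt})")


plan = [
    Step("s1", "Port the layers module."),
    Step("s2", "Port the encoder."),
    Step("s3", "Port the training loop."),
]
summaries = orchestrate(plan, fake_coder)
```

Keeping the build check inside the loop mirrors the requirement that every coder invocation ends in a buildable state, so a failure is caught at the step that caused it.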
Playbook hierarchy and prompt engineering
- Playbooks (markdown) encode repository rules, stylistic conventions, task-specific API differences, and client-specific conventions.
- Client-specific playbooks can be auto-generated from a small set of “golden” migrated examples (authors report good playbooks from as few as 2 examples: one medium, one highly complex).
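One way to picture the playbook hierarchy is as a set of tagged documents from which only the relevant ones are placed in a coder prompt. The sketch below is an assumption about how such selection could work, not the authors' implementation; the `Playbook` fields, the level names, and the playbook texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Playbook:
    name: str
    level: str               # "repository", "style", "task", or "client"
    topics: FrozenSet[str]   # hypothetical tags used for relevance matching
    text: str


ALWAYS_ON = {"repository", "style"}                     # injected into every prompt
LEVEL_ORDER = ["repository", "style", "task", "client"]  # general to specific


def select_playbooks(playbooks: List[Playbook], step_topics: Set[str]) -> List[Playbook]:
    """Keep always-on playbooks plus those sharing a topic with the step."""
    chosen = [p for p in playbooks if p.level in ALWAYS_ON or p.topics & step_topics]
    return sorted(chosen, key=lambda p: LEVEL_ORDER.index(p.level))


def build_prompt(instruction: str, playbooks: List[Playbook]) -> str:
    sections = [f"## {p.name}\n{p.text}" for p in playbooks]
    return "\n\n".join(sections + [f"## Task\n{instruction}"])


# Hypothetical playbook contents, for illustration only.
library = [
    Playbook("repo-rules", "repository", frozenset(), "Every new file needs a build target."),
    Playbook("optimizers-api", "task", frozenset({"optimizers"}), "Notes on optimizer differences."),
    Playbook("layers-api", "task", frozenset({"layers"}), "Notes on layer API differences."),
    Playbook("client-conventions", "client", frozenset({"layers", "naming"}), "Client naming rules."),
    Playbook("style-guide", "style", frozenset(), "Follow the repository style guide."),
]

selected = select_playbooks(library, {"layers"})
prompt = build_prompt("Port the encoder layers.", selected)
```

Here the optimizer playbook is left out of the prompt because the step does not touch optimizers, which is the context-pollution saving the hierarchy is meant to provide.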
Memory and state
- File-based memory bank stores plan, step summaries, migration state, and versioned playbooks to avoid context drift and redundant work among agents.
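A file-based memory bank of this kind can be approximated with a directory holding a state file and one summary per completed step. This is a minimal sketch; the `MemoryBank` class, the `state.json` file name, and the directory layout are assumptions made for illustration, not the layout used in the paper.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class MemoryBank:
    """Hypothetical on-disk store for migration state and step summaries."""

    def __init__(self, root: Path):
        self.root = root
        (root / "summaries").mkdir(parents=True, exist_ok=True)
        self.state_path = root / "state.json"

    def load_state(self) -> dict:
        if self.state_path.exists():
            return json.loads(self.state_path.read_text())
        return {"completed": []}

    def mark_done(self, step_id: str, summary_md: str) -> None:
        # Persist the step summary first, then record completion.
        (self.root / "summaries" / f"{step_id}.md").write_text(summary_md)
        state = self.load_state()
        if step_id not in state["completed"]:
            state["completed"].append(step_id)
        self.state_path.write_text(json.dumps(state, indent=2))

    def pending(self, plan_step_ids: List[str]) -> List[str]:
        done = set(self.load_state()["completed"])
        return [s for s in plan_step_ids if s not in done]


with tempfile.TemporaryDirectory() as tmp:
    bank = MemoryBank(Path(tmp))
    bank.mark_done("s1", "# s1\nPorted the layers module.")

    # A fresh instance (for example, a later agent run) sees the same state.
    resumed = MemoryBank(Path(tmp))
    remaining = resumed.pending(["s1", "s2"])
```

Because state lives on disk rather than in any one agent's context window, a restarted or different agent can skip finished steps instead of redoing them.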
Validation and quality control
- Emphasis on producing buildable and (where possible) testable code at every step.
- LLM-based checklist/judge evaluates code quality when unit tests are sparse; judges verify style, dependency, and functional-parity constraints.
- Self-review prompt added to coder to reduce premature completion/hallucinations.
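A checklist-style judge can be sketched as a list of yes/no questions that are each answered about a candidate file and then aggregated into a score. In the sketch below the judge is a pluggable function; `keyword_judge` is a deliberately crude keyword stand-in for an LLM call, and the checklist questions are hypothetical examples rather than the paper's actual checklist.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    item_id: str
    question: str


@dataclass
class Verdict:
    item_id: str
    passed: bool


# A judge answers one yes/no question about one piece of code.
JudgeFn = Callable[[str, str], bool]


def run_checklist(code: str, checklist: List[ChecklistItem], ask_judge: JudgeFn) -> List[Verdict]:
    return [Verdict(item.item_id, ask_judge(item.question, code)) for item in checklist]


def score(verdicts: List[Verdict]) -> float:
    """Fraction of checklist items that passed."""
    return sum(v.passed for v in verdicts) / len(verdicts)


# Stand-in judge (assumption): keyword rules instead of a real model call.
def keyword_judge(question: str, code: str) -> bool:
    if "TensorFlow" in question:
        return "import tensorflow" not in code
    if "docstring" in question:
        return '"""' in code
    return False


checklist = [
    ChecklistItem("deps", "Is the code free of TensorFlow imports?"),
    ChecklistItem("docs", "Does every public function have a docstring?"),
]

# Candidate code is only inspected as text; it is never executed here.
candidate = "import jax.numpy as jnp\n\ndef relu(x):\n    return jnp.maximum(x, 0)\n"

verdicts = run_checklist(candidate, checklist, keyword_judge)
quality = score(verdicts)
```

Swapping `keyword_judge` for a function that prompts a model keeps the rest of the harness unchanged, and per-item verdicts make it clear which requirement a migration failed.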
Technical tools & patterns
- Static dependency discovery via Kythe for deterministic, cheap analysis.
- Dynamic knowledge injection by orchestrator to reduce context pollution.
- Migration plan schema (JSON) includes step IDs, source/target files, instructions, validations, and dependencies.
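A plan with these fields might look like the JSON below, which is then validated and put into dependency order. The exact key names (`id`, `source`, `target`, `instructions`, `validations`, `depends_on`) and the file paths are assumptions for illustration; the summary only states which kinds of fields the schema contains.

```python
import json
from graphlib import TopologicalSorter
from typing import List

# Hypothetical key names; the paper's actual schema may differ.
REQUIRED_KEYS = {"id", "source", "target", "instructions", "validations", "depends_on"}

PLAN_JSON = """
{
  "steps": [
    {"id": "s3", "source": "model/train.py", "target": "model_jax/train.py",
     "instructions": "Port the training loop.",
     "validations": ["target builds", "unit tests pass"], "depends_on": ["s2"]},
    {"id": "s1", "source": "model/layers.py", "target": "model_jax/layers.py",
     "instructions": "Port the layer definitions.",
     "validations": ["target builds"], "depends_on": []},
    {"id": "s2", "source": "model/encoder.py", "target": "model_jax/encoder.py",
     "instructions": "Port the encoder.",
     "validations": ["target builds"], "depends_on": ["s1"]}
  ]
}
"""


def load_plan(text: str) -> List[dict]:
    """Parse a plan and reject missing fields, duplicate IDs, or unknown dependencies."""
    steps = json.loads(text)["steps"]
    for step in steps:
        missing = REQUIRED_KEYS - set(step)
        if missing:
            raise ValueError(f"step {step.get('id')!r} is missing {sorted(missing)}")
    ids = [step["id"] for step in steps]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate step IDs")
    for step in steps:
        unknown = set(step["depends_on"]) - set(ids)
        if unknown:
            raise ValueError(f"step {step['id']!r} depends on unknown {sorted(unknown)}")
    return steps


def execution_order(steps: List[dict]) -> List[str]:
    """Order step IDs so that every dependency runs before its dependents."""
    graph = {step["id"]: set(step["depends_on"]) for step in steps}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(load_plan(PLAN_JSON))
```

`TopologicalSorter.static_order` raises `graphlib.CycleError` if the plan contains a dependency cycle, so a malformed plan fails before any coder is invoked.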
Empirical outcomes
- Reported 6.4x–8x speedups in migration time on real, large-scale commercial models.
- Improved repeatability and scaling: playbook generation creates a virtuous cycle (more migrations → better playbooks → better future migrations).
Failure modes & mitigations
- Early issues: API hallucinations, style inconsistency, non-buildable outputs.
- Mitigations: playbooks, multi-agent decomposition, build/test requirements per step, static dependency discovery, and LLM judges.

Data & Methods

Dataset / workloads
- Experiments run on a mix of open-source models and large internal Google models (including a client-specific example for YouTube).
- The paper focuses on real-world, production-scale migration tasks (models spanning multiple files, custom ops, and ecosystem scripts).
System components & implementation details
- Planner: static analysis via Kythe to enumerate dependency graph; iterative LLM prompting to generate stepwise JSON plans.
- Orchestrator: implements chunking strategies, dynamic playbook selection, failure-handling policies, and persistence to the memory bank.
- Coder: ReAct agent with repo tools (list, read, write files; grep-like search; run builds; run unit tests; auto-fix deps). Outputs include code changes plus a Markdown summary of changes (used as short-term memory).
- Playbook generation: LLM-assisted summarization of migrated pairs into client-specific playbooks; human-in-the-loop review.
- Judges & metrics: checklist-based LLM judges for assessing quality where unit tests are lacking; explicit validation fields in each plan step (e.g., “file X exists and compiles”).
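The Planner's deterministic dependency enumeration can be illustrated with a much simpler stand-in. The sketch below does not use Kythe; it uses Python's standard `ast` module to read import statements and build an in-repository dependency graph, and the module names and sources are hypothetical.

```python
import ast
from typing import Dict, Set


def imported_modules(source: str) -> Set[str]:
    """Collect absolute module names imported by a Python source string."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
    return found


def dependency_graph(files: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each module to the in-repository modules it imports."""
    known = set(files)
    return {name: imported_modules(src) & known for name, src in files.items()}


# Hypothetical repository contents keyed by module name. The sources are
# only parsed, never executed, so TensorFlow does not need to be installed.
repo = {
    "model.layers": "import tensorflow as tf\n",
    "model.encoder": "from model.layers import Dense\nimport tensorflow as tf\n",
    "model.train": "import model.encoder\n",
}

graph = dependency_graph(repo)
```

Because this step is pure parsing, it gives the same answer on every run and costs no model calls, which is the property the summary attributes to static analysis in the Planner.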
Evaluation methodology
- Measured migration time and developer effort against manual migration to compute the speedup (the paper reports 6.4x–8x).
- Ablations and case studies on open-source and internal models to assess generalization, playbook effectiveness, and the impact of different components (planner, orchestrator, judge).
- Qualitative assessment of migration correctness via builds, tests (where available), style adherence, and human inspection.
Reproducibility notes
- The core static tool (Kythe) and the plan schema are deterministic, but many steps rely on proprietary infrastructure and specific LLMs (the authors used Gemini and internal tools), so exact replication requires equivalent tooling and models.

Implications for AI Economics

Large productivity gains and cost savings
- 6.4x–8x speedups imply substantial reductions in engineering time and cost for large migration programs; if manual migration cost is on the order of “hundreds of expert years,” automation can reduce headcount/time and free experts for higher-value tasks.
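To make the size of the reduction concrete, a speedup factor converts directly into the fraction of manual time that remains: a 6.4x speedup leaves about 15.6% of the manual time and an 8x speedup leaves 12.5%, so roughly 84% to 87.5% of the time is saved. The short sketch below performs only this arithmetic on the reported range; the function names are hypothetical and no cost or headcount figures are assumed.

```python
def remaining_fraction(speedup: float) -> float:
    """Fraction of the manual time still needed at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def time_saved_fraction(speedup: float) -> float:
    """Fraction of the manual time that is saved at a given speedup."""
    return 1.0 - remaining_fraction(speedup)


# The two endpoints of the speedup range reported in the paper.
for speedup in (6.4, 8.0):
    print(
        f"{speedup}x speedup: "
        f"{remaining_fraction(speedup):.1%} of manual time remains, "
        f"{time_saved_fraction(speedup):.1%} saved"
    )
```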
Scale and capital intensity
- Building such a bespoke multi-agent migration system requires up-front engineering and compute investments (LLM usage, orchestration infra, static-analysis integration). Large organizations (hyperscalers, cloud providers) are best positioned to realize net economic benefits.
Labor reallocation and skill shifts
- Demand shifts from mass manual rewriting to tasks such as playbook curation, planner/orchestrator tuning, judge design, and human oversight/auditing.
- New roles may emerge (agent orchestration engineers, playbook authors, LLM-judge maintainers).
Faster adoption of efficient frameworks → operational savings
- Automating framework migration accelerates adoption of more performant frameworks (here, JAX on TPUs), which can reduce inference/training cost per model and increase compute utilization efficiency—an operational cost advantage with direct economic impact.
Market and competitive dynamics
- Firms that automate large code transformations can reduce technical-debt costs and shorten product cycles, yielding competitive advantages.
- Vendors offering migration-as-a-service or enterprise-grade agent orchestration may emerge; however, incumbents with internal infra and proprietary tooling will capture outsized gains.
Risk-adjusted economic considerations
- Quality assurance remains critical—errors in automated migrations of production ML systems have non-trivial costs (performance regressions, production outages). The need for human-in-the-loop validation imposes residual labor and liability costs.
- Overreliance on proprietary LLMs or ecosystem-specific tooling could create vendor lock-in or raise variable OPEX (LLM inference costs).
Generalization beyond framework migration
- Techniques (static analysis + hierarchical playbooks + multi-agent orchestration + checklist judges) can apply to many large-code transformations (API upgrades, cross-language ports, security patches), suggesting broader economic potential in automating routine but complex software engineering workflows.
Policy and workforce implications
- Widespread adoption could reduce demand for some classes of routine coding work while increasing demand for higher-level engineering and QA roles; policy-makers and firms should anticipate and invest in reskilling.

Summary takeaway: The study demonstrates that a carefully engineered, multi-agent system—combining deterministic static analysis with LLM-driven planning, coders, and judges—can materially reduce the cost and time of complex ML framework migrations. Economically, this enables faster, cheaper adoption of more efficient infrastructures (e.g., JAX/TPU), reshapes software labor toward oversight and curation roles, and creates opportunities (and risks) for firms that internalize or provide automated migration capabilities.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Provides quantitative, real-world measurements (6.4x–8x speedups) from commercial migrations at a hyperscaler, but lacks randomized or controlled comparisons, does not report sample sizes or selection criteria in detail, and relies on proprietary infrastructure that prevents independent replication.
Methods Rigor: medium. Uses a concrete engineering pipeline (planner, orchestrator, AI coders, and AI judges) and defines quality metrics for evaluating migrations, but the paper appears to omit detailed experimental design, baseline definitions, ablation studies, and full reproducibility information (e.g., exact models, workloads, hyperparameters, and statistical uncertainty).
Sample: Commercial TensorFlow-based deep learning models and code components from a large hyperscaler (Google) migrated to JAX using the proposed multi-agent system; evaluation reported on real-world use-cases across multiple internal products, though the exact number, size, and selection criteria of migrated models are not specified.
Themes: productivity, human_ai_collab
Generalizability
- Results obtained on proprietary Google infrastructure and workflows may not generalize to smaller teams or different organizations.
- Evaluation focused on TensorFlow-to-JAX migrations; applicability to other framework translations or languages is unproven.
- Details about model selection and sample size are not provided, so results may reflect selected high-impact cases (selection bias).
- Performance depends on internal AI models, orchestration tooling, and developer-process integration that may not be available elsewhere.
- Quality metrics and AI judges are tailored for codebases with strict style/dependency constraints; open-source or differently structured projects may behave differently.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
We built an AI-based multi-agent system to support automatic migration of TensorFlow-based deep learning models into JAX-based ones.	Other	positive	high	existence and functioning of an AI-based migration system	0.18
An AI planner that uses a mix of static analysis with AI instructions can create migration plans for very complex code components that are reliably followed by the combination of an orchestrator and coders, using AI-generated example-based playbooks.	Task Allocation	positive	medium	reliability of migration plans being followed (plan adherence)	0.11
We define quality metrics and AI-based judges that accelerate development when the code to evaluate has no tests and has to adhere to strict style and dependency requirements.	Developer Productivity	positive	high	development speed / time to develop when evaluating untested code under strict style/dependency constraints	0.18
The system accelerates code migrations in a large hyperscaler environment on commercial real-world use-cases.	Task Completion Time	positive	high	speed of code migrations in commercial/hyperscaler environment	0.18
Our approach dramatically reduces the time (6.4x-8x speedup) for deep learning model migrations.	Task Completion Time	positive	high	time required to perform deep learning model migrations	6.4x-8x speedup 0.18
The system creates a virtuous circle where effectively AI supports its own development workflow.	Organizational Efficiency	positive	high	self-supporting/iterative improvement of AI-assisted development workflow	0.03
The techniques and approaches described can be generalized for other framework migrations and general code transformation tasks.	Other	positive	high	generalizability to other framework migrations / code transformation tasks	0.03
So far the maintenance and migration work was done largely manually by human experts.	Task Allocation	negative	high	degree of manual effort for model maintenance and migration historically	0.09
Google has been pioneering machine learning usage across dozens of products.	Other	positive	high	extent of ML usage across Google products	0.09

An AI multi-agent system cuts TensorFlow-to-JAX migration time roughly 6–8x on commercial Google workloads by combining planners, orchestrators and AI judges; the approach speeds model framework transitions but rests on proprietary tooling and case selection.