Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

Summary

Main Finding

Compiling procedural agent workflows into the weights of small, fine-tuned LLMs (“subterranean agents”) achieves near-frontier quality while cutting per-conversation inference cost by roughly two orders of magnitude. Small models (3B–8B) fine-tuned on synthetic conversations internalize workflows and (a) match or beat the same base model run under explicit orchestration, (b) reach 87–98% of an in-context frontier baseline on quality metrics, and (c) are 128–462× cheaper per conversation (and faster in latency) than the frontier in-context approach.

Key Points

Architectural contrast
- Surface orchestration: external orchestrator injects prompts and routes every turn (e.g., LangGraph + frontier LLM).
- Subterranean compilation: generate training conversations from a flowchart with an orchestrator, then fine-tune a model so it self-orchestrates at runtime with only a minimal system prompt.
Pipeline (high level)
- Encode the procedure as a directed flowchart (nodes, edges, decision conditions).
- Generate synthetic conversation data by traversing valid paths (Claude Sonnet 4.5 used for generation).
- Run full-parameter fine-tuning on the generated conversations.
- Deploy the fine-tuned model; no orchestrator at runtime.
Experimental domains and models
- Travel booking (14 nodes): controlled same-model comparison using Qwen 2.5 3B; 2,125 synthetic conversations (1,912 train / 213 eval).
- Zoom support (14 nodes + product knowledge): Qwen3-8B, expanded training to ~6,264 conversations.
- Insurance claims (55 nodes, 2,381 paths): Qwen3-8B, 3,000 synthetic conversations (2,700 train / 300 eval).
Quality metrics (1–5 scale): Task success, Information accuracy, Consistency, Graceful handling, Naturalness. Judged with Claude Sonnet 4.5; robustness check with independent GPT-4.1 judge.
Key quantitative results
- Same-model comparison (travel): 3B compiled > 3B orchestrated on 4/5 metrics (significant; p < 0.001). 3B compiled reaches ~82% of in-context baseline on graceful handling & naturalness.
- Scaling to 8B (Zoom, Insurance): closes the gap. Compiled models achieve ~87–98% of the in-context baseline across metrics and match a LangGraph orchestrator built on a ~70× larger frontier model on several metrics.
- Failure rates: on Travel, 3B compiled 5.5% vs. LangGraph orchestrator 24.0%; on Insurance, 8B compiled 9% vs. 17% for the orchestrator; on Zoom, comparable.
- Efficiency/cost: Compiled models are 128–462× cheaper per conversation than the in-context frontier baseline. Cost advantage decomposed into ~65× per-token savings from self-hosting plus a further ~2–7× token-volume reduction (compiled model’s prompt is constant-size). Latency: compiled inference is faster (example: insurance 2.8×).
- Recompile (update) cycle: fine-tune/compile new model in ~30–50 minutes on production hardware (CI/CD-style cycle, not multi-day retraining).
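
The cost decomposition in the list above can be checked with simple arithmetic. The sketch below assumes the two reported factors (about 65× per-token savings and about 2–7× token-volume reduction) combine multiplicatively; that assumption and the function name are illustrative, not taken from the paper. The product, roughly 130× to 455×, is close to the reported 128–462× range, and the small residual is presumably due to rounding of the factors.

```python
# Figures reported in the summary (rounded): ~65x cheaper per token when
# self-hosting, and a further ~2-7x fewer tokens per conversation.
PER_TOKEN_SAVINGS = 65.0
TOKEN_VOLUME_REDUCTION = (2.0, 7.0)  # (low, high)


def combined_cost_ratio(per_token_savings: float, token_volume_reduction: float) -> float:
    """Overall cost ratio, ASSUMING the two savings multiply."""
    return per_token_savings * token_volume_reduction


low = combined_cost_ratio(PER_TOKEN_SAVINGS, TOKEN_VOLUME_REDUCTION[0])
high = combined_cost_ratio(PER_TOKEN_SAVINGS, TOKEN_VOLUME_REDUCTION[1])
print(f"combined advantage: {low:.0f}x to {high:.0f}x")  # 130x to 455x
```
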
Practical notes & constraints
- Full-parameter fine-tuning was used; parameter-efficient methods (e.g., low-rank LoRA at ranks 16–128) failed to match full fine-tuning for procedural internalization.
- Remaining weakness: information that depends on broad world knowledge (not encoded in the procedure) can lag frontier in-context models (observed as lower information-accuracy in some experiments).
- Compiled agents learn interaction style from training data (e.g., single-question-per-turn “interview style”), which changes granularity of turns but not total conveyed information.

Data & Methods

Synthetic dataset generation
- Procedures represented as directed graphs F = (N,E,n0,T). Nodes have role and prompt templates; edges may have conditions.
- Path sampling plus scenario-variable sampling (destinations, budgets, user persona, claim type, etc.) yields diverse dialogues; data generated by Claude Sonnet 4.5.
- Experimental sampling: n = 200 scenarios per condition per domain for evaluation; training set sizes per domain noted above.
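
The flowchart representation and path sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Flowchart` class, the toy travel graph, and the scenario variables are hypothetical, edge conditions are omitted for brevity, and the step that would prompt a generator model to write the dialogue is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Flowchart:
    """Directed flowchart F = (N, E, n0, T)."""
    nodes: set[str]               # N
    edges: dict[str, list[str]]   # E, as an adjacency list
    start: str                    # n0
    terminals: set[str]           # T


def enumerate_paths(flow: Flowchart) -> list[list[str]]:
    """Return every simple path from the start node to a terminal node."""
    paths: list[list[str]] = []

    def walk(node: str, path: list[str]) -> None:
        if node in flow.terminals:
            paths.append(path)
            return
        for nxt in flow.edges.get(node, []):
            if nxt not in path:  # skip cycles so enumeration terminates
                walk(nxt, path + [nxt])

    walk(flow.start, [flow.start])
    return paths


def sample_scenario(flow: Flowchart, variables: dict[str, list[str]],
                    rng: random.Random) -> dict:
    """Pick one valid path plus one value per scenario variable."""
    path = rng.choice(enumerate_paths(flow))
    scenario = {name: rng.choice(values) for name, values in variables.items()}
    return {"path": path, "scenario": scenario}


# Hypothetical toy flowchart (far smaller than the 14- and 55-node graphs).
toy = Flowchart(
    nodes={"greet", "collect_trip", "offer_options", "book", "handoff"},
    edges={
        "greet": ["collect_trip"],
        "collect_trip": ["offer_options", "handoff"],
        "offer_options": ["book", "handoff"],
    },
    start="greet",
    terminals={"book", "handoff"},
)
variables = {"destination": ["Lisbon", "Osaka"], "budget": ["low", "high"]}
print(sample_scenario(toy, variables, random.Random(0)))
```

Each sampled record would then seed one synthetic conversation; enumerating all paths is only practical for small graphs, so a larger flowchart would more plausibly use random walks.
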
Fine-tuning & compute
- Travel (3B): Qwen 2.5 3B Instruct, full fine-tuning on a single RTX 5090; 8-bit AdamW, LR 2e-5, gradient accumulation to an effective batch size of 16, 20 epochs (~3.5 hrs wall-clock), best checkpoint at epoch ≈4.
- Zoom & Insurance (8B): Qwen3-8B in bf16, DeepSpeed ZeRO-3 across 8×A100, AdamW, LR 2e-5, effective batch size 32; Zoom 10 epochs (best at epoch 2 on the expanded data), Insurance 20 epochs (best at epoch 3).
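
The effective batch sizes above follow the usual data-parallel relationship: per-device micro-batch × gradient-accumulation steps × number of devices. The sketch below shows that arithmetic; the per-device batch sizes in the usage lines are hypothetical, since the summary reports only the effective batch sizes (16 and 32) and the device counts.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int,
                         num_devices: int) -> int:
    """Samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def accumulation_steps(target_effective_batch: int, per_device_batch: int,
                       num_devices: int) -> int:
    """Accumulation steps needed to reach a target effective batch size."""
    per_step = per_device_batch * num_devices
    if target_effective_batch % per_step != 0:
        raise ValueError("target is not a multiple of per_device_batch * num_devices")
    return target_effective_batch // per_step


# Single GPU, effective batch 16; a per-device batch of 2 is an assumption.
print(accumulation_steps(16, per_device_batch=2, num_devices=1))  # 8
# Eight GPUs, effective batch 32; a per-device batch of 1 is an assumption.
print(accumulation_steps(32, per_device_batch=1, num_devices=8))  # 4
```
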
Evaluation methodology
- Two operational baselines: LangGraph orchestrator using Claude Sonnet 4.5 (frontier orchestrated baseline) and in-context baseline (Claude Sonnet 4.5 given serialized flowchart in system prompt).
- Judges: the primary judge was Claude Sonnet 4.5, used as an approach-agnostic LLM-as-judge. A robustness check with GPT-4.1 yielded comparable conclusions (compiled models at 83–99% of in-context quality).
- Metrics scored 1–5 with specific behavioral anchors. Statistical tests: Wilcoxon signed-rank (paired) or Mann–Whitney U (unpaired), Cohen’s d, bootstrap 95% CIs, Holm–Bonferroni correction.
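
Three of the statistical tools named above are simple enough to sketch with the Python standard library: Cohen's d, a percentile bootstrap confidence interval for a mean, and the Holm–Bonferroni step-down correction. This is an illustrative sketch, not the paper's analysis code; the function names and defaults are assumptions, and the rank tests themselves would normally come from a statistics library (for example `scipy.stats.wilcoxon` and `scipy.stats.mannwhitneyu`).

```python
import random
import statistics


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d for two unpaired samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def bootstrap_ci(values: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject/keep decision per hypothesis, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Made-up p-values for five metrics, purely to exercise the function.
print(holm_bonferroni([0.001, 0.04, 0.03, 0.2, 0.012]))
# [True, False, False, False, True]
```
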
Efficiency measurement
- Measured avg turns, wall-clock per conversation (includes LLM calls and routing), and counted API-induced latencies for orchestrated baselines versus local inference for compiled agents.

Implications for AI Economics

Major cost shift for procedural automation
- Two-orders-of-magnitude per-conversation cost reductions make serving large volumes of procedural agent interactions far cheaper when procedures are compiled and self-hosted. This materially changes unit economics for services like customer support, claims intake, and guided workflows.
- Cost advantage increases with workflow complexity because compiled models carry a constant-size prompt regardless of procedure size, while orchestration/in-context approaches grow token volume (and API cost) with procedure complexity.
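
The scaling argument above can be illustrated with a toy token-count model. Everything numeric here is a placeholder chosen for illustration, not a measurement from the paper: the sketch assumes the system prompt is re-sent on every turn without caching, that a serialized flowchart costs a fixed number of tokens per node, and that both approaches pay the same dialogue tokens. Only the node counts (14 and 55) come from the summary.

```python
def in_context_prompt_tokens(nodes: int, tokens_per_node: int, base_tokens: int) -> int:
    """System prompt size when the serialized flowchart is included."""
    return base_tokens + nodes * tokens_per_node


def conversation_input_tokens(turns: int, prompt_tokens: int, dialogue_tokens: int) -> int:
    """Input tokens per conversation: prompt re-sent each turn, plus dialogue."""
    return turns * prompt_tokens + dialogue_tokens


def token_volume_ratio(nodes: int, turns: int = 10, tokens_per_node: int = 40,
                       base_tokens: int = 100, dialogue_tokens: int = 3000) -> float:
    """In-context tokens divided by compiled tokens (all defaults hypothetical)."""
    in_context = conversation_input_tokens(
        turns, in_context_prompt_tokens(nodes, tokens_per_node, base_tokens),
        dialogue_tokens)
    compiled = conversation_input_tokens(turns, base_tokens, dialogue_tokens)
    return in_context / compiled


for nodes in (14, 55):  # node counts of the travel/Zoom and insurance flowcharts
    print(nodes, round(token_volume_ratio(nodes), 2))
```

Under these assumptions the ratio grows with the node count because only the in-context prompt scales with procedure size, which is the mechanism the bullet describes.
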
Market and competitive dynamics
- Reduced dependence on frontier API providers: firms can internalize proprietary procedures in small, self-hosted models, lowering per-use payments to API providers and improving confidentiality of sensitive workflows.
- Orchestration frameworks may need to pivot: their value today is ease of development; compiled workflows shift value to data-generation pipelines, fine-tuning tooling, and CI/CD for model recompilation.
- Product differentiation: companies can embed domain-specific procedural knowledge (e.g., Zoom UI steps, insurer rules) into small models, enabling differentiated products with lower operating costs and privacy advantages.
Infrastructure and capital trade-offs
- Upfront and operational capital for hosting GPUs (or inference accelerators) is required, but per-conversation marginal costs fall sharply. The economics favors organizations with predictable, high-volume procedural workloads.
- The 30–50 minute recompile cycle implies feasible CI/CD integration; organizations can update procedures quickly without lengthy retraining, making compiled workflows operationally practical.
Labor, productivity, and demand effects
- Lower-cost automated agents can scale to handle more customer interactions, potentially substituting routine human labor in procedural tasks. This may compress labor demand for scripted support but could reallocate human effort to edge-case handling, escalation, and oversight.
- Faster and cheaper automation could expand the set of tasks firms choose to automate (lowering the threshold), raising productivity and potentially increasing demand for higher-value human roles.
Risks, limits, and policy considerations
- Information-quality trade-offs: compiled models can lag frontier models on broad world knowledge; firms must decide when to rely on compiled agents vs. call out to stronger knowledge-enabled models.
- Governance & auditability: compiled agents internalize procedures in weights, complicating human-readable audits of decision logic; companies may need toolchains to extract or verify procedural compliance.
- Data/privacy/regulatory: compiling proprietary procedures into self-hosted weights reduces third-party exposure (privacy upside) but concentrates responsibility for correctness and compliance on the deploying firm (regulatory risk).
Research & deployment priorities
- Investment in robust synthetic-data pipelines, provenance, testing on edge-case paths, and tooling for rapid safe recompilation will have outsized economic returns.
- Further work needed on continual/online fine-tuning, multi-procedure composition, and hybrid architectures that combine compiled procedural knowledge with on-demand frontier knowledge for exceptions.

Limitations to bear in mind: results use synthetic conversations and simulated users; judges are LLMs (albeit cross-checked); experiments focus on procedural domains (not open-ended creative tasks); full fine-tuning was required—parameter-efficient adapters underperformed. These caveats suggest practical but not universal applicability; nevertheless, the cost-quality trade-off shown is large enough to materially affect AI economics in procedural automation.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper reports controlled, multi-domain comparisons and ablation tests that support causal claims about architecture performance on procedural tasks, but it does not present randomized field trials, long-run deployment outcomes, or direct economic measures (costs at scale, developer adoption rates in production, user-level productivity or firm-level outcomes), limiting external causal inference about economic impact.
Methods Rigor: medium. The study is rigorous in scope (three distinct domains, multiple nodes including a large insurance-workflow test) and includes targeted empirical tests of stated barriers, but the description lacks detail about statistical methods, sample sizes per condition, model/hyperparameter choices, robustness checks across model families, and whether evaluations include real users or only simulated/task-based metrics.
Sample: Procedural task suites in three domains: travel booking (14 decision nodes), Zoom support (14 nodes with product-specific knowledge), and insurance claims processing (55 nodes across 6 decision hubs); comparisons are between orchestration using frontier LLMs (system-prompted procedures) and small fine-tuned models that encode procedures in weights (subterranean agents); likely evaluated via automated task-completion metrics and domain-specific correctness checks (paper does not detail model sizes, exact datasets, or real-world user samples).
Themes: adoption, org_design
Identification: Controlled comparative experiments that benchmark two architectures, external orchestration (frontier LLM in system prompt) versus small fine-tuned 'subterranean' agents (procedure compiled into model weights), across three procedural domains (travel booking, Zoom support, insurance claims), with ablations addressing three hypothesized adoption barriers; performance is measured on task-completion, node-level correctness, context footprint, and exposure of proprietary procedures.
Generalizability:
- Limited to procedural, multi-step workflows; results may not extend to open-ended or creative tasks
- Evaluated on three domains; other industries or more heterogeneous workflows may differ
- Findings may depend on the specific frontier models and fine-tuning procedures used (unclear cross-model robustness)
- Experiments appear task/simulation-based rather than deployed long-term with real users, so operational and adoption dynamics are uncertain
- Language, locale, and regulatory contexts (e.g., privacy requirements) may affect applicability
- Cost, latency, and maintenance trade-offs at production scale (multi-tenant serving, frequent updates) are not fully assessed

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex.	Adoption Rate	positive	high	GitHub star count (popularity/adoption proxy) across listed agent orchestration frameworks	n=7 exceeding 290,000 GitHub stars 0.48
All [the listed orchestration frameworks] follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn.	Other	null_result	high	architectural pattern (external orchestrator behavior)	n=7 0.48
Recent work has shown this [orchestration] architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a].	Task Completion Time	positive	medium	performance on procedural tasks (dominance of system-prompted frontier model approach)	0.29
Using a frontier model's system prompt to supply the procedure has costs: it consumes the context window.	Other	negative	high	context-window usage	0.48
Using a frontier model's system prompt to supply the procedure requires a frontier model for every conversation.	Other	negative	high	requirement to use frontier model per conversation (operational/deployment cost)	0.48
Using a frontier model's system prompt to supply the procedure exposes proprietary procedures to third-party providers.	Other	negative	high	exposure of proprietary procedures to third-party providers (privacy/intellectual-property risk)	0.48
Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns.	Other	positive	high	mitigation of the named concerns (context usage, frontier-model requirement, exposure of procedures)	0.08
Prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique [compiling procedures into model weights / subterranean agents] works.	Other	positive	medium	feasibility/effectiveness of compiling procedures into model weights (approach success)	n=6 0.29
Developer adoption has overwhelmingly favored orchestration (despite the viability of subterranean agents).	Adoption Rate	negative	medium	developer adoption preference (orchestration vs. subterranean agents)	0.14
We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).	Other	positive	high	empirical evaluation across three workflow domains (breadth/complexity of tests measured by node counts and decision hubs)	n=3 travel booking (14 nodes), Zoom support (14 nodes), insurance claims (55 nodes, 6 decision hubs) 0.48

Encoding procedures into small fine‑tuned agents matches orchestration for procedural tasks while cutting context costs and protecting proprietary logic; this approach resolves several practical barriers that have nevertheless kept developers favoring external orchestrators.