The Commonplace

Compiling LLM outputs into deterministic code sharply cuts token costs while preserving task accuracy: compiled AI reduced token consumption by about 57x at 1,000 transactions while matching or improving document- and function-level accuracy. The approach also improves auditability and detects common prompt-injection and code-safety issues, making it attractive for reliability- and compliance-sensitive enterprise settings such as healthcare.

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
Geert Trooskens, Aaron Karlsberg, Anmol Sharma, Lamara De Brouwer, Max Van Puyvelde, Matthew Young, John Thickstun, Gil Alterovitz, Walter A. De Brouwer · April 06, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
Compiled AI converts LLM-generated code into deterministic, auditable artifacts and, in evaluated tasks, matches or exceeds runtime-LM accuracy while substantially reducing token usage and improving security/determinism for enterprise workflows.

We study compiled AI, a paradigm in which large language models generate executable code artifacts during a compilation phase, after which workflows execute deterministically without further model invocation. This paradigm has antecedents in prior work on declarative pipeline optimization (DSPy) and hybrid neural-symbolic planning (LLM+P); our contribution is a systems-oriented study of its application to high-stakes enterprise workflows, with particular emphasis on healthcare settings where reliability and auditability are critical. By constraining generation to narrow business-logic functions embedded in validated templates, compiled AI trades runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure. We introduce (i) a system architecture for constrained LLM-based code generation, (ii) a four-stage generation-and-validation pipeline that converts probabilistic model output into production-ready code artifacts, and (iii) an evaluation framework measuring operational metrics including token amortization, determinism, reliability, security, and cost. We evaluate on two task types: function-calling (BFCL, n=400) and document intelligence (DocILE, n=5,680 invoices). On function-calling, compiled AI achieves 96% task completion with zero execution tokens, breaking even with runtime inference at approximately 17 transactions and reducing token consumption by 57x at 1,000 transactions. On document intelligence, our Code Factory variant matches Direct LLM on key field extraction (KILE: 80.0%) while achieving the highest line item recognition accuracy (LIR: 80.4%). Security evaluation across 135 test cases demonstrates 96.7% accuracy on prompt injection detection and 87.5% on static code safety analysis with zero false positives.

Summary

Main Finding

Compiled AI — invoking an LLM once to generate validated, template-constrained code artifacts that execute deterministically with zero runtime model calls — can deliver large operational and economic gains for well-specified, high-volume enterprise workflows. In experiments (function-calling and document intelligence), compiled AI achieved near-perfect determinism and substantial token/cost savings (break-even ≈ 17 transactions; 57× token savings at 1,000 transactions), while matching or exceeding accuracy on the evaluated tasks when using a Code Factory variant for noisy inputs.

Key Points

  • Definition: Compiled AI requires (1) one-time LLM invocation, (2) zero-token deterministic execution of deployed artifacts, and (3) mandatory multi-stage validation before deployment.
  • Architecture: YAML workflow → Orchestrator selects templates/modules → single LLM generation → four-stage validation → deploy static Temporal activities. Templates + modules constrain generation to 20–50 line functions.
  • Four-stage validation pipeline: Security (static analysis + injection detection), Syntax (AST, type checks, linting), Execution (sandboxed tests), Accuracy (task-specific metrics vs. golden data).
  • Bounded agentic invocation: generated code may call LLMs at runtime only for narrowly scoped subtasks, with schemas, fallback logic, and monitoring.
  • Security gates: Input Gate (DeBERTa-v3 + Presidio), Code Gate (Bandit, Semgrep), Output Gate (cryptographic canaries). Dual-LLM separation for privileged/quarantined models.
  • Empirical robustness: determinism (entropy H = 0), reproducibility 100% for compiled artifacts; runtime inference shows non-zero variance even at temperature = 0.
  • Limitations noted: some tasks still require runtime LLM calls (Code Factory), validation overhead, experiment scope limited to two task families and one LLM.
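The determinism claim in the key points above (output entropy H = 0 for compiled artifacts) can be illustrated with a small measurement. The sketch below is an assumption about how such a metric might be computed, not the paper's evaluation code: it takes the outputs of repeated runs on the same input and computes the Shannon entropy of their empirical distribution. The function name `output_entropy` and the sample outputs are hypothetical.

```python
import math
from collections import Counter


def output_entropy(outputs):
    """Shannon entropy (in bits) of the empirical distribution of outputs.

    Zero means every run on the same input produced an identical output.
    """
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(outputs)
    total = len(outputs)
    # Sum p * log2(1/p) over the distinct outputs observed.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical repeated runs on one input (illustrative values only).
compiled_runs = ['{"amount": 120}'] * 5
runtime_runs = ['{"amount": 120}'] * 4 + ['{"amount": 120.0}']

print(output_entropy(compiled_runs))  # identical outputs give zero entropy
print(output_entropy(runtime_runs))   # any variation gives entropy above zero
```

Comparing serialized outputs as strings is a deliberately strict choice: two outputs that differ only in formatting still count as distinct, which is the conservative reading of "deterministic".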

Key quantitative results (selected)

  • One-time generation cost: 9,600 tokens. Marginal runtime tokens after compilation: 0 (for pure compiled artifacts).
  • Break-even transaction count n* ≈ 17 (formula in paper: n* = GenTokens_compiled / (RuntimeTokens_per-tx_runtime − RuntimeTokens_per-tx_compiled)).
  • Token savings: 57× vs Direct LLM at 1,000 transactions; 84× vs AutoGen at 1,000 transactions.
  • Cost example (1M transactions/month): Compiled AI TCO ≈ $555 vs Direct LLM ≈ $22,000 (≈ 40× cheaper).
  • Latency: compiled execution P50 = 4.5 ms vs Direct LLM P50 ≈ 2,004 ms (≈ 450× speedup); jitter much lower (10.5 ms vs 1,123 ms).
  • BFCL (function-calling, n=400): task completion 96% (384/400). Validation first-pass: 100% syntax & execution, 96% accuracy gate.
  • DocILE (5,680 invoices): Code Factory variant KILE = 80.0% (matches Direct LLM), LIR = 80.4% (best). Pure deterministic regex: KILE = 20.3%.
  • Code quality: 99% type coverage, 96% test pass rate. Cyclomatic complexity is higher (avg 23.8) than for human code (8).
  • Security eval (135 tests): prompt injection detection recall 95.8%, precision 100%; code safety gate recall 75%, precision 100%; overall precision ~96.1%. Output-gate canary detection had low recall (12.5%) in synthetic tests (authors note simulation limits).
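The break-even formula and savings ratio quoted above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: only the 9,600-token generation cost, the zero marginal tokens for compiled artifacts, and the 1,000-transaction horizon come from the summary. The per-transaction runtime cost of 550 tokens is an assumed round number chosen to be roughly consistent with the reported break-even, not a figure from the paper, and the function names are hypothetical.

```python
def break_even_transactions(gen_tokens, runtime_tokens_per_tx, compiled_tokens_per_tx=0):
    """n* = GenTokens_compiled / (runtime per-tx tokens - compiled per-tx tokens)."""
    saving_per_tx = runtime_tokens_per_tx - compiled_tokens_per_tx
    if saving_per_tx <= 0:
        raise ValueError("compilation never breaks even without a per-transaction saving")
    return gen_tokens / saving_per_tx


def token_savings_ratio(n_tx, gen_tokens, runtime_tokens_per_tx, compiled_tokens_per_tx=0):
    """Total runtime-inference tokens divided by total compiled tokens over n_tx transactions."""
    runtime_total = n_tx * runtime_tokens_per_tx
    compiled_total = gen_tokens + n_tx * compiled_tokens_per_tx
    return runtime_total / compiled_total


GEN_TOKENS = 9_600            # one-time generation cost reported in the summary
ASSUMED_RUNTIME_PER_TX = 550  # ASSUMPTION: illustrative, not a figure from the paper

n_star = break_even_transactions(GEN_TOKENS, ASSUMED_RUNTIME_PER_TX)
ratio = token_savings_ratio(1_000, GEN_TOKENS, ASSUMED_RUNTIME_PER_TX)
print(f"break-even after about {n_star:.1f} transactions")
print(f"token savings at 1,000 transactions: {ratio:.1f}x")
```

With these assumed inputs the sketch gives a break-even of roughly 17.5 transactions and a savings ratio of roughly 57.3x, in line with the reported figures. Passing a non-zero `compiled_tokens_per_tx` models a hybrid artifact that still makes bounded runtime calls.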

Data & Methods

  • Tasks and datasets:
    • BFCL (Berkeley Function-Calling Leaderboard): 400 function-calling instances requiring identification of function and argument extraction.
    • DocILE: 5,680 degraded-OCR invoices, heterogeneous formats, evaluated with KILE (exact-match on key fields) and LIR (line-item recognition).
  • Baselines compared:
    • Direct LLM (Claude Opus 4.5) per-transaction inference.
    • LangChain and AutoGen orchestration frameworks (same LLM, extra overhead).
    • Deterministic variant (pure regex) for DocILE as lower-bound baseline.
    • Compiled AI / Code Factory (single generation + validated static code; Code Factory allows bounded LLM calls for hard subtasks).
  • Model & tooling:
    • Claude Opus 4.5 at temperature = 0 for generation.
    • Templates, Module Library, Pydantic for structured orchestration, Temporal activities for runtime.
    • Security & validation tooling: Bandit, Semgrep, mypy, ruff, DeBERTa-v3 for injection detection, Presidio for PII, cryptographic canaries for output leakage tests.
  • Metrics and evaluation framework:
    • Operational metrics emphasized: token efficiency (and break-even n*), latency, determinism (output entropy), reliability, validation first-pass rates, code quality, and security detection rates.
    • Experiments measured token consumption, latency distributions, reproducibility across repeated inputs, validation pass/fail rates, and downstream task accuracy.
  • Validation/regeneration loop: on validation failure, the system regenerates using the error as context until the artifact passes all four stages (or deployment fails).
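The four validation stages and the regeneration loop described above can be sketched with the standard library alone. This is a minimal illustration, not the paper's implementation: the deny-list stands in for real static analysis (the summary names Bandit and Semgrep), in-process `exec` stands in for a real sandbox and provides no isolation, and `fake_generate` is a hypothetical stub replacing the LLM call. All function names are assumptions.

```python
import ast

# ASSUMPTION: toy deny-list standing in for real static analysis tools.
FORBIDDEN_CALLS = {"eval", "exec", "__import__", "open"}


def security_gate(source):
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None  # unparseable code is reported by the syntax stage
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return "security: imports are not allowed in generated functions"
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            return f"security: forbidden call to {node.func.id}()"
    return None


def syntax_gate(source):
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"syntax: {exc.msg} (line {exc.lineno})"
    return None


def load_function(source, name):
    # NOTE: exec() in-process is NOT a sandbox; a real system would isolate this.
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def execution_gate(source, name, sample_input):
    try:
        load_function(source, name)(sample_input)
    except Exception as exc:
        return f"execution: {type(exc).__name__}: {exc}"
    return None


def accuracy_gate(source, name, golden, threshold):
    fn = load_function(source, name)
    correct = 0
    for x, expected in golden:
        try:
            correct += fn(x) == expected
        except Exception:
            pass  # a crash on a golden case counts as a miss
    score = correct / len(golden)
    if score < threshold:
        return f"accuracy: {score:.2f} below threshold {threshold:.2f}"
    return None


def compile_artifact(generate, name, golden, threshold=1.0, max_attempts=3):
    """Regenerate with error context until all four stages pass, or give up."""
    error = None
    for attempt in range(1, max_attempts + 1):
        source = generate(error)
        stages = (
            lambda: security_gate(source),
            lambda: syntax_gate(source),
            lambda: execution_gate(source, name, golden[0][0]),
            lambda: accuracy_gate(source, name, golden, threshold),
        )
        # Stages run lazily in order; the first failure stops the chain.
        error = next((e for e in (stage() for stage in stages) if e), None)
        if error is None:
            return source, attempt
    raise RuntimeError(f"deployment failed after {max_attempts} attempts: {error}")


# Hypothetical stub for the LLM: the first draft ignores quantity, and the
# "model" corrects it once it is shown a validation error.
def fake_generate(error_context):
    if error_context is None:
        return "def total(items):\n    return sum(i['price'] for i in items)\n"
    return "def total(items):\n    return sum(i['price'] * i['qty'] for i in items)\n"


GOLDEN = [
    ([{"price": 2, "qty": 3}], 6),
    ([{"price": 5, "qty": 1}, {"price": 1, "qty": 4}], 9),
]

source, attempts = compile_artifact(fake_generate, "total", GOLDEN)
print(f"deployed after {attempts} attempt(s)")
```

Running the stages lazily and in order means code that fails the security or syntax stage is never executed, which is the reason for placing those stages first.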

Implications for AI Economics

  • Token amortization & marginal cost structure:
    • Compiled AI shifts cost from per-transaction inference to one-time generation + validation. For high-volume workflows, marginal cost per transaction approaches zero, massively lowering per-transaction inference expenses.
    • The break-even transaction count (n*) is a practical adoption threshold: workflows with expected volume above n* favor compilation economically.
  • Provider revenue impact:
    • A large shift to compiled workflows would reduce recurring inference-token revenue for LLM providers; demand may move toward higher-priced one-time generation, fine-tuning, or model hosting for bounded agentic calls.
    • New business models: compilation-as-a-service, validated-code subscription, or charging for template libraries and validation tooling rather than per-request tokens.
  • Operational cost & capital allocation:
    • Organizations can trade increased engineering/validation overhead for vastly reduced variable costs and latency. Investment shifts toward template engineering, test fixture curation, compliance encoding, and security tooling.
    • For regulated domains (healthcare, finance), compiled AI creates economic value beyond direct token savings: lower audit and compliance costs, reduced legal exposure, and improved SLA guarantees. These benefits can be monetized or can reduce insurance and penalty risk.
  • Labor & labor-market effects:
    • Less need for post-deployment prompt engineering and runtime monitoring; more need for software-engineering workflows around generation, validation, security review, and template maintenance.
    • Potential increase in demand for roles combining domain expertise + template/validation engineering.
  • Market segmentation & adoption:
    • Best fit: high-volume, well-specified workflows (billing, prior authorization, API-driven business logic). These yield immediate ROI.
    • Partial fit: noisy/unstructured tasks benefit from Code Factory hybrid (compiled orchestration that calls LLMs for bounded subtasks), still delivering latency and audit benefits while retaining some runtime token costs.
    • Poor fit: inherently interactive, highly dynamic, or exploratory uses where per-request reasoning is essential; here compiled AI’s upfront cost and validation friction may be prohibitive.
  • Policy & regulatory economics:
    • Deterministic, auditable artifacts align with compliance demands (HIPAA, FDA/CMS), potentially reducing regulatory friction and associated costs when deploying AI in regulated sectors.
  • Risk & concentration effects:
    • If many firms adopt compiled artifacts that embed vendor-specific templates or validators, lock-in to particular toolchains/templates may increase; conversely, an ecosystem for validated templates could emerge and be monetized.
  • Recommendations for economists and decision-makers:
    • Compute n* for candidate workflows to prioritize compilation investments.
    • Estimate total cost of ownership including validation engineering and template maintenance, not just token spend.
    • Model provider-side impacts: lower per-transaction token demand may push providers to offer higher-margin products (e.g., code-factory generation, compliance-certified models).
    • Consider hybrid pricing or bundling (one-time compilation + optional bounded runtime calls) as a commercial product offering.
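The recommendation above to estimate total cost of ownership, rather than token spend alone, can be made concrete with a toy model. Every number below is a hypothetical input chosen for illustration, not data from the paper, and the `CostModel` class, its fields, and the workflow names are assumptions. The sketch adds a fixed monthly overhead (validation engineering, template maintenance) to a per-transaction cost and finds the volume at which a compiled deployment becomes cheaper.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Monthly cost model; every field is a hypothetical input."""
    fixed_monthly: float  # validation engineering, template maintenance, hosting
    cost_per_tx: float    # variable cost per transaction (tokens, compute)

    def monthly_cost(self, volume: int) -> float:
        return self.fixed_monthly + self.cost_per_tx * volume


def cost_break_even_volume(compiled: CostModel, runtime: CostModel) -> float:
    """Monthly volume above which the compiled deployment is cheaper."""
    per_tx_saving = runtime.cost_per_tx - compiled.cost_per_tx
    if per_tx_saving <= 0:
        return float("inf")  # no per-transaction saving, so it never pays off
    return max(0.0, (compiled.fixed_monthly - runtime.fixed_monthly) / per_tx_saving)


# ASSUMPTION: illustrative dollar figures only.
compiled = CostModel(fixed_monthly=2_000.0, cost_per_tx=0.0001)
runtime = CostModel(fixed_monthly=200.0, cost_per_tx=0.02)
threshold = cost_break_even_volume(compiled, runtime)

# Hypothetical candidate workflows with expected monthly volumes.
workflows = {"prior_authorization": 250_000, "ad_hoc_reports": 5_000}
for name, volume in workflows.items():
    choice = "compile" if volume > threshold else "keep runtime inference"
    print(f"{name}: {volume} tx/month -> {choice}")
```

Including the fixed overhead is what separates this from a pure token break-even: a workflow can clear the token threshold easily and still fall short once engineering and maintenance costs are counted.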

Caveats & open questions

  • Experiments are limited to two task families and a single LLM; generality to other tasks/models needs further evaluation.
  • Validation and security measures depend on tooling and red-teaming quality; simulated output-gate results understate practical attack complexity.
  • Higher cyclomatic complexity suggests maintainability and refactoring costs over time; template engineering can mitigate this but requires investment.
  • Empirical economics of maintenance (template drift, retraining/regeneration frequency) were not deeply quantified; these affect long-run TCO.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper presents empirical evaluations with concrete metrics (task completion, accuracy, token usage, security detection) on multiple datasets and benchmarks, giving credible within-sample evidence of operational benefits; however, it lacks causal identification versus alternative deployment strategies, randomized or field experiments, and broader real-world deployment results that would support stronger causal or general claims.

Methods Rigor: medium. The authors define a clear architecture, a four-stage generation/validation pipeline, and an evaluation framework with multiple operational metrics; they report sample sizes (BFCL n=400, DocILE n=5,680, security tests n=135) and multiple performance measures. However, important methodological details are missing or unclear (e.g., which LLMs and prompts were used, how datasets were sampled or preprocessed, statistical uncertainty or baselines beyond 'Direct LLM', reproducibility artifacts), and tests are limited to selected tasks rather than live deployments.

Sample: Two primary evaluation tasks: function-calling (BFCL, n=400 transactions) and document intelligence on invoices (DocILE, n=5,680 invoices). Security evaluation comprised 135 adversarial/test cases for prompt injection detection and static code safety analysis. The settings are systems-oriented experiments applying a 'compiled AI' pipeline and a 'Code Factory' variant; datasets appear domain-specific (enterprise / healthcare-oriented workflows and invoices) and likely include synthetic or curated test cases for security checks.

Themes: productivity, adoption, governance

Generalizability:

  • Evaluations are task- and dataset-specific (function-calling and invoice extraction) and may not generalize to other workflow types or domains.
  • Results depend on implementation details (LLM family, prompting, templates, validation rules) which are not fully specified; different models or prompt engineering could change outcomes.
  • Security and determinism claims are based on limited adversarial test cases (n=135); adversaries in the wild may be more diverse and adaptive.
  • Enterprise integration costs, maintenance burden, and human-in-the-loop workflows (governance, change management) are not measured, limiting inference about real-world adoption and long-run productivity impacts.
  • Healthcare-specific regulatory and data-privacy constraints may alter feasibility; dataset representativeness and access restrictions are unclear.

Claims (10)

| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| By constraining generation to narrow business-logic functions embedded in validated templates, compiled AI trades runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure. [Organizational Efficiency] | positive | high | predictability, auditability, cost efficiency, security exposure (design trade-offs) | 0.03 |
| We introduce a system architecture for constrained LLM-based code generation, a four-stage generation-and-validation pipeline that converts probabilistic model output into production-ready code artifacts, and an evaluation framework measuring operational metrics including token amortization, determinism, reliability, security, and cost. [Other] | positive | high | availability of system architecture, pipeline, and evaluation framework (methodological contribution) | 0.3 |
| We evaluate on two task types: function-calling (BFCL, n=400) and document intelligence (DocILE, n=5,680 invoices). [Other] | null_result | high | evaluation datasets and sample sizes | n=5680; 0.3 |
| On function-calling, compiled AI achieves 96% task completion with zero execution tokens. [Task Completion Time] | positive | high | task completion rate | n=400; 96% task completion; 0.18 |
| Compiled AI breaks even with runtime inference at approximately 17 transactions. [Organizational Efficiency] | positive | high | cost trade-off / break-even transaction count | n=400; approximately 17 transactions; 0.18 |
| Compiled AI reduces token consumption by 57x at 1,000 transactions. [Organizational Efficiency] | positive | high | token consumption | reducing token consumption by 57x at 1,000 transactions; 0.18 |
| On document intelligence (DocILE), our Code Factory variant matches Direct LLM on key field extraction (KILE: 80.0%). [Output Quality] | null_result | high | key field extraction accuracy (KILE) | n=5680; KILE: 80.0%; 0.18 |
| On document intelligence (DocILE), Code Factory achieves the highest line item recognition accuracy (LIR: 80.4%). [Output Quality] | positive | high | line item recognition accuracy (LIR) | n=5680; LIR: 80.4%; 0.18 |
| Security evaluation across 135 test cases demonstrates 96.7% accuracy on prompt injection detection. [AI Safety and Ethics] | positive | high | prompt injection detection accuracy | n=135; 96.7% accuracy; 0.18 |
| Security evaluation across 135 test cases demonstrates 87.5% accuracy on static code safety analysis with zero false positives. [AI Safety and Ethics] | positive | high | static code safety analysis accuracy and false positive rate | n=135; 87.5% accuracy with zero false positives; 0.18 |

Notes