
“Token prices have fallen 280x since 2022. Enterprise AI spend has risen 320% in the same period. We keep optimizing inference when we should be eliminating it.”

TL;DR

The dominant assumption in AI engineering is that LLMs run at request time. Compiled AI (arXiv:2604.05150) inverts this: use the LLM once during code generation, then deploy deterministic executables with zero runtime inference cost. On function-calling tasks, this achieves 96% completion with 57x fewer tokens at scale and 450x lower latency. The break-even point is 17 transactions. Not every task qualifies — creativity and open-ended reasoning still need runtime inference. But the structured workflows consuming most enterprise AI budgets can be compiled away. For the decision framework on which workflow steps need LLMs at all, see hybrid agentic workflows: LLM vs code nodes.

Molten metal poured into a precision mold transitioning from liquid to crystalline solid, representing the shift from fluid LLM inference to rigid compiled code

The interpreter trap

Most AI systems today work like interpreted languages. Every request hits the LLM. Every response costs tokens. Every call adds latency. This made sense when AI tasks were novel and unpredictable, when you genuinely needed the model’s reasoning on each input.

But look at what enterprise agents actually do. They parse invoices with the same fields. They route support tickets through the same decision trees. They call APIs with the same parameter patterns. They generate configs from the same templates. These workflows crystallized months ago. The model stopped learning anything new from them around the hundredth execution. Yet every single call still burns tokens at runtime.

Agentic workflows make this worse. A single user task triggers 10-20 LLM calls through chains of tool use, planning, and reflection steps. RAG inflates context windows 3-5x. Always-on monitoring agents run around the clock. The per-token price drops, but the volume grows faster.

Gartner projects inference costs will fall 90% by 2030. That does not matter if volume grows 1000%.

What compiled AI actually does

Compiled AI treats the LLM as a compiler rather than an interpreter. The model runs once during a build phase to generate executable code. That code then handles all future requests deterministically. No model calls. No token costs. No latency variance.

The analogy to traditional computing is exact. An interpreter reads source code line by line at runtime. A compiler translates source code to machine instructions once, then the binary runs without the compiler. Compiled AI translates high-level workflow specifications to executable Python once, then the generated code runs without the LLM.
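The pattern is easiest to see in miniature. Below is a toy illustration of compile-once, not the paper's system: `llm` is a stand-in that returns canned source instead of calling a real model, and `extract_year` is an invented example task.

```python
# Toy illustration of the compile-once pattern; `llm` is a stand-in that
# returns canned source instead of calling a real inference API.
def llm(prompt: str) -> str:
    return "def extract_year(s):\n    return int(s.split('-')[0])\n"

# Build phase: the "model" runs once to produce ordinary Python.
namespace: dict = {}
exec(llm("spec: extract the year from an ISO-8601 date string"), namespace)
extract_year = namespace["extract_year"]

# Run phase: every subsequent call is plain code -- no tokens, no network hop.
print(extract_year("2026-03-14"))
```

After the one-time `exec`, `extract_year` is indistinguishable from hand-written code: deterministic, fast, and free to call.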

XY.AI Labs, Stanford, and Harvard formalized this in a system with a four-stage validation pipeline:

```mermaid
graph LR
    A[YAML Spec] --> B[LLM Compilation]
    B --> C{Security Gate}
    C -->|Pass| D{Syntax Gate}
    C -->|Fail| B
    D -->|Pass| E{Execution Gate}
    D -->|Fail| B
    E -->|Pass| F{Accuracy Gate}
    E -->|Fail| B
    F -->|Pass| G[Deploy Deterministic Code]
    F -->|Fail| B

    style A fill:#f9f,stroke:#333
    style G fill:#0f0,stroke:#333
```

Security gate. Static analysis with Bandit and Semgrep catches injection vulnerabilities, path traversal, and secrets exposure in the generated code.

Syntax gate. AST parsing, mypy type checking, and ruff linting verify the code is structurally sound and type-safe.

Execution gate. Sandboxed runs against test fixtures confirm the code actually works: proper error handling, correct output structure, successful completion.

Accuracy gate. Comparison against golden datasets using task-specific thresholds validates that the generated code produces correct results.
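A minimal sketch of the four gates in sequence, with toy checks standing in for the real tools (Bandit, Semgrep, mypy, ruff, and a proper sandbox). The `handler` entry-point name and the gate internals are assumptions for illustration, not the paper's implementation:

```python
import ast

BANNED_CALLS = {"eval", "exec", "compile", "__import__"}  # toy security policy

def security_gate(source: str) -> bool:
    # Stand-in for Bandit/Semgrep: reject calls to obviously dangerous builtins.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return not any(
        isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
        and n.func.id in BANNED_CALLS
        for n in ast.walk(tree)
    )

def syntax_gate(source: str) -> bool:
    # Stand-in for AST parsing + mypy + ruff: structural soundness only.
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def execution_gate(source: str, fixture) -> bool:
    # Run the candidate against a test fixture (a real system would sandbox this).
    ns: dict = {}
    try:
        exec(source, ns)
        return ns["handler"](fixture) is not None
    except Exception:
        return False

def accuracy_gate(source: str, golden) -> bool:
    # Compare outputs against a golden dataset.
    ns: dict = {}
    exec(source, ns)
    return all(ns["handler"](x) == y for x, y in golden)

def compile_and_validate(source: str, fixture, golden) -> bool:
    # All four gates must pass before the artifact is deployable; a failure
    # at any gate sends the spec back to the LLM for another attempt.
    return (security_gate(source) and syntax_gate(source)
            and execution_gate(source, fixture) and accuracy_gate(source, golden))
```

The short-circuiting `and` mirrors the diagram: each failed gate loops back to recompilation rather than deploying a broken artifact.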

The critical finding: all 16 failures in their function-calling evaluation (out of 400 tasks) were caught during compilation. Once code passed all four gates, it executed with 100% reliability. Zero runtime failures.

The numbers

Two benchmarks tell the story.

Function calling (BFCL)

400 function-calling tasks from the Berkeley Function-Calling Leaderboard. The model generates code once, that code handles all subsequent calls.

| Metric | Compiled AI | Direct LLM | LangChain | AutoGen |
|---|---|---|---|---|
| Task completion | 96% | 100% | 100% | 100% |
| Runtime tokens per call | 0 | 552 | 720 | 1,080 |
| P50 latency | 4.5 ms | 2,025 ms | 2,025 ms | 2,025 ms |
| Output reproducibility | 100% | Variable | Variable | Variable |
| Tokens at 1,000 calls | 9,600 | 552,000 | 720,000 | 1,080,000 |

That 4.5ms versus 2,025ms is not a typo. Once compiled, the function calls execute as native Python. No network round-trip to an inference API. The code just runs. 450x faster.

The token economics are even more striking at scale. At 1 million transactions per month:

| System | Monthly inference cost | Infrastructure | Total |
|---|---|---|---|
| Compiled AI | $55 | $500 | $555 |
| Direct LLM | $21,500 | $500 | $22,000 |
| LangChain | $28,900 | $500 | $29,400 |
| AutoGen | $31,400 | $500 | $31,900 |

The 4% completion gap (96% vs 100%) deserves scrutiny. Those 16 failed tasks were all caught at compile time — the pipeline rejected them before deployment. In production, you would fall back to runtime LLM calls for those specific cases. A 96% compile rate with 4% runtime fallback still eliminates the vast majority of inference costs.

Document intelligence (DocILE)

5,680 invoices with degraded OCR, variable vendor formats, and semantic ambiguity. This is a harder test. Documents are messy, and pure regex extraction only achieves 20.3% accuracy on key fields.

| Approach | Key field extraction | Line item recognition | Median latency |
|---|---|---|---|
| Pure regex | 20.3% | 59.7% | 0.6 ms |
| Compiled AI | 80.0% | 80.4% | 2,695 ms |
| Direct LLM | 80.0% | 74.5% | 6,339 ms |
| LangChain | 80.0% | 75.6% | 6,207 ms |
| AutoGen | 77.8% | 78.9% | 13,742 ms |

Compiled AI matches the LLM on key fields, beats it on line items (80.4% vs 74.5%), and runs at less than half the latency. The generated code handles the semantic complexity — interpreting customer names from variable formats, parsing currency amounts with OCR errors — through logic the LLM baked in during compilation.

The break-even math

The formula is simple:

Break-even = compilation tokens / tokens saved per transaction

For BFCL: 9,600 compilation tokens / 552 tokens per direct LLM call = ~17 transactions.

After 17 calls, compiled AI has consumed fewer total tokens than calling the LLM directly. After 1,000 calls, it is 57x cheaper. After 1 million, 40x cheaper on total cost of ownership.

Any workflow that runs more than 17 times benefits from compilation. That covers essentially every production deployment.
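The arithmetic is simple enough to sanity-check directly, using the BFCL figures from the tables above:

```python
def break_even_calls(compile_tokens: float, tokens_per_call: float) -> float:
    # Calls at which cumulative direct-LLM tokens match the one-time compile cost.
    return compile_tokens / tokens_per_call

def savings_ratio(compile_tokens: float, tokens_per_call: float, calls: int) -> float:
    # How many times fewer tokens the compiled path uses at a given volume.
    return (tokens_per_call * calls) / compile_tokens

# BFCL figures: 9,600 compile tokens, 552 tokens per direct LLM call.
print(int(break_even_calls(9_600, 552)))     # ~17 transactions
print(savings_ratio(9_600, 552, 1_000))      # 57.5, i.e. the "57x" at 1,000 calls
```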

Which tasks can be compiled

Not everything qualifies. The key variable is whether the task’s logic can be fully determined at compile time, or whether it requires runtime context that cannot be anticipated.

| Task class | Compilable? | Reasoning |
|---|---|---|
| Function calling with typed signatures | Yes | Input/output schemas are fixed; parameter extraction follows deterministic patterns |
| Configuration generation | Yes | Template structure is known; variable substitution is mechanical |
| Data extraction from structured documents | Yes | Field locations and formats can be learned from examples |
| Typed API call construction | Yes | API schemas define the output space completely |
| Rule-based routing and classification | Yes | Decision boundaries are enumerable and testable |
| Data transformation pipelines | Yes | Transformations are pure functions over known schemas |
| Open-ended text generation | No | Output space is unbounded; quality depends on runtime reasoning |
| Multi-turn conversation | No | Each turn depends on the previous; cannot be pre-compiled |
| Novel problem solving | No | Input space is too large to enumerate at compile time |
| Real-time context integration | No | External state changes between compilation and execution |
| Creative content generation | No | Requires the model's generative capability per request |

The boundary is not binary. Many real workflows are partially compilable. An invoice processing pipeline might compile the field extraction (deterministic) while keeping the LLM for anomaly detection (requires judgment). A customer support system might compile the routing logic and response templates while keeping the LLM for cases that do not match any template.
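A partially compiled system needs a seam between the two paths. One simple shape, sketched here with hypothetical names (`compiled_handlers` maps a request kind to generated code; `call_llm` stands in for whatever runtime inference client you use), is a router that prefers compiled code and falls back to inference:

```python
from typing import Any, Callable, Optional

CompiledHandler = Callable[[dict], Any]

def make_hybrid_router(
    compiled_handlers: dict,
    call_llm: Callable[[dict], Any],
) -> Callable[[dict], Any]:
    """Route a request to compiled code when one exists, else fall back to the LLM."""
    def route(request: dict) -> Any:
        handler: Optional[CompiledHandler] = compiled_handlers.get(request["kind"])
        if handler is not None:
            try:
                return handler(request)  # zero-token, deterministic path
            except Exception:
                pass                     # compiled edge-case failure: fall through
        return call_llm(request)         # probabilistic path
    return route
```

The fallback branch is also where the 4% of tasks that resist compilation land, so a single router covers both the hybrid split and the compile-failure case.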

Progressive compilation: the practical workflow

In practice, you compile progressively. Start with the most deterministic parts of your workflow and expand the compiled surface over time.

```mermaid
graph TD
    A[Production Workflow] --> B[Audit: classify each step]
    B --> C[Deterministic steps]
    B --> D[Probabilistic steps]
    C --> E[Compile to code]
    D --> F[Keep as LLM calls]
    E --> G[Validate with 4-stage pipeline]
    G -->|Pass| H[Deploy compiled artifact]
    G -->|Fail| F
    H --> I[Monitor accuracy in production]
    I -->|Drift detected| J[Re-compile with updated spec]
    I -->|Stable| K[Identify next compilation candidate]
    K --> B

    style C fill:#0d0,stroke:#333
    style D fill:#f90,stroke:#333
    style H fill:#0d0,stroke:#333
```

Step 1: Audit your agent’s task distribution. Log every LLM call for a week. Classify each by whether the logic is deterministic or probabilistic. Most teams find 40-60% of their calls are deterministic: the same transformations, the same routing decisions, the same API parameter construction.
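The audit can be as simple as counting repeated call signatures. A sketch, assuming each logged call is a dict with `task` and `schema` fields (both field names are illustrative; adapt to whatever your logging actually captures):

```python
from collections import Counter

def audit_call_log(calls: list, min_repeats: int = 50) -> list:
    """Rank logged LLM calls by how often the same (task, output-schema) pair repeats.

    Highly repetitive signatures are the strongest compilation candidates;
    the min_repeats threshold is an arbitrary starting point, not a rule.
    """
    signatures = Counter((c["task"], c["schema"]) for c in calls)
    return [(f"{task} -> {schema}", n)
            for (task, schema), n in signatures.most_common()
            if n >= min_repeats]
```

Anything that surfaces at the top of this list with a stable output schema is a candidate for step 2.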

Step 2: Compile the deterministic slice. Start with the highest-volume, most repetitive calls. Function calling is the easiest win, config generation the second, and data extraction from well-structured sources the third. Use the four-stage validation pipeline to verify correctness before deployment.

Step 3: Monitor and expand. Track accuracy of compiled artifacts in production. When the compiled code handles a case correctly that you expected to need the LLM for, that is evidence to compile more. When it fails on edge cases, that defines the boundary.

Step 4: Re-compile on drift. Specifications change. APIs update. Document formats evolve. The compiled artifacts need periodic regeneration — but the compilation cost is negligible compared to the ongoing inference cost you are eliminating. The paper’s authors describe this as “an evolutionary system: deterministic execution with adaptive improvement.”
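Drift detection does not need to be elaborate. A rolling accuracy window over sampled production outputs is enough to trigger step 4; the window size and threshold below are illustrative choices, not values from the paper:

```python
from collections import deque

class DriftMonitor:
    """Track rolling accuracy of a compiled artifact; flag when re-compilation is due."""

    def __init__(self, window: int = 200, threshold: float = 0.95):
        self.outcomes: deque = deque(maxlen=window)
        self.threshold = threshold

    def record(self, correct: bool) -> None:
        # Feed in spot-check results (e.g. sampled comparisons against an LLM oracle).
        self.outcomes.append(correct)

    def needs_recompile(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return sum(self.outcomes) / len(self.outcomes) < self.threshold
```

When `needs_recompile()` fires, the artifact goes back through the four-stage pipeline with an updated spec, which is the "adaptive improvement" half of the evolutionary loop.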

For teams already using the hybrid workflow pattern of mixing LLM and code nodes, compiled AI is the natural next step: instead of hand-writing code nodes, let the LLM generate them during a build phase.

Limitations worth knowing

The paper is honest about its boundaries.

The specification problem. Compiled AI requires accurate YAML specifications of what the workflow should do. Writing those specs is itself a nontrivial task. The authors acknowledge that “iterative refinement is often necessary” — the first spec rarely produces correct compiled code. Natural language specification is listed as future work.

4% compilation failure rate. On function-calling tasks, 16 out of 400 failed to compile. These were all caught before deployment (no silent failures), but they represent tasks where the LLM could not generate correct deterministic code. For production use, you need a fallback path to runtime inference for the tasks that resist compilation.

Two benchmarks. The evaluation covers function calling and document extraction. Other task classes — multi-step reasoning chains, complex data transformations, domain-specific calculations — have not been tested. Generalizability is assumed but not proven.

Model coupling. The quality of compiled artifacts depends on the underlying LLM. Model updates may require re-validation of all compiled workflows. This is analogous to compiler version upgrades breaking builds. Manageable, but it adds operational overhead.

What this means for agent cost management

Inference accounts for 85% of enterprise AI budgets in 2026. Compiled AI attacks the largest line item directly — by eliminating tokens entirely for qualifying workflows, rather than making each token cheaper.

The math compounds with the insight from token efficiency optimization: first reduce the tokens you send per call, then eliminate the calls that do not need to happen. Compiled AI is the logical endpoint of token efficiency — zero tokens per transaction.

For cost management at the system level, compilation adds a new category to the cost optimization toolkit: amortized generation. Instead of paying per-transaction inference costs, you pay a one-time compilation cost that amortizes to near-zero over the workflow’s lifetime.

The competitive implication is that teams running compiled AI on their high-volume deterministic workflows will operate at fundamentally different cost structures than teams paying per-token for every request. At $555/month versus $22,000/month for the same workload, that is not a marginal efficiency gain. It is a structural advantage.

Key takeaways

  • Compiled AI eliminates runtime inference for deterministic tasks. The LLM runs once during compilation. Generated code handles all subsequent requests with zero tokens, 450x lower latency, and 100% reproducibility.
  • Break-even is 17 transactions. After that, every call is essentially free. At 1 million monthly transactions, total cost drops from $22,000 to $555.
  • Not everything compiles. Open-ended generation, multi-turn conversation, novel problem solving, and tasks requiring real-time context still need runtime LLM inference.
  • Progressive compilation is the practical path. Audit your workflows, compile the deterministic slice (typically 40-60% of calls), keep the LLM for the rest, and expand the compiled surface over time.
  • The four-stage validation pipeline is the safety net. Security, syntax, execution, and accuracy gates catch all failures before deployment. Once compiled code passes, it runs with 100% reliability.


Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch