Compiled AI: moving LLM inference from runtime to compile-time
> “Token prices have fallen 280x since 2022. Enterprise AI spend has risen 320% in the same period. We keep optimizing inference when we should be eliminating it.”
TL;DR
The dominant assumption in AI engineering is that LLMs run at request time. Compiled AI (arXiv:2604.05150) inverts this: use the LLM once during code generation, then deploy deterministic executables with zero runtime inference cost. On function-calling tasks, this achieves 96% completion with 57x fewer tokens at scale and 450x lower latency. The break-even point is 17 transactions. Not every task qualifies — creativity and open-ended reasoning still need runtime inference. But the structured workflows consuming most enterprise AI budgets can be compiled away. For the decision framework on which workflow steps need LLMs at all, see hybrid agentic workflows: LLM vs code nodes.

The interpreter trap
Most AI systems today work like interpreted languages. Every request hits the LLM. Every response costs tokens. Every call adds latency. This made sense when AI tasks were novel and unpredictable, when you genuinely needed the model’s reasoning on each input.
But look at what enterprise agents actually do. They parse invoices with the same fields. They route support tickets through the same decision trees. They call APIs with the same parameter patterns. They generate configs from the same templates. These workflows crystallized months ago. The model stopped learning anything new from them around the hundredth execution. Yet every single call still burns tokens at runtime.
Agentic workflows make this worse. A single user task triggers 10-20 LLM calls through chains of tool use, planning, and reflection steps. RAG inflates context windows 3-5x. Always-on monitoring agents run around the clock. The per-token price drops, but the volume grows faster.
Gartner projects inference costs will fall 90% by 2030. That does not matter if volume grows 1000%.
What compiled AI actually does
Compiled AI treats the LLM as a compiler rather than an interpreter. The model runs once during a build phase to generate executable code. That code then handles all future requests deterministically. No model calls. No token costs. No latency variance.
The analogy to traditional computing is exact. An interpreter reads source code line by line at runtime. A compiler translates source code to machine instructions once, then the binary runs without the compiler. Compiled AI translates high-level workflow specifications to executable Python once, then the generated code runs without the LLM.
XY.AI Labs, Stanford, and Harvard formalized this in a system with a four-stage validation pipeline:
```mermaid
graph LR
A[YAML Spec] --> B[LLM Compilation]
B --> C{Security Gate}
C -->|Pass| D{Syntax Gate}
C -->|Fail| B
D -->|Pass| E{Execution Gate}
D -->|Fail| B
E -->|Pass| F{Accuracy Gate}
E -->|Fail| B
F -->|Pass| G[Deploy Deterministic Code]
F -->|Fail| B
style A fill:#f9f,stroke:#333
style G fill:#0f0,stroke:#333
```
Security gate. Static analysis with Bandit and Semgrep catches injection vulnerabilities, path traversal, and secrets exposure in the generated code.
Syntax gate. AST parsing, mypy type checking, and ruff linting verify the code is structurally sound and type-safe.
Execution gate. Sandboxed runs against test fixtures confirm the code actually works: proper error handling, correct output structure, successful completion.
Accuracy gate. Comparison against golden datasets using task-specific thresholds validates that the generated code produces correct results.
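To make the gating logic concrete, here is a minimal sketch of how such a pipeline could be wired. The gate internals are stubs (the paper's actual implementation uses Bandit, Semgrep, mypy, and ruff; the `handler` contract and thresholds here are illustrative assumptions):

```python
import ast

def security_gate(code: str) -> bool:
    # Stub: reject obviously dangerous calls. Real pipelines run Bandit/Semgrep.
    banned = ("eval(", "exec(", "os.system(")
    return not any(tok in code for tok in banned)

def syntax_gate(code: str) -> bool:
    # AST parsing stands in for the full mypy + ruff checks.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def execution_gate(code: str, fixtures) -> bool:
    # Run against test fixtures; a real system would sandbox this exec.
    ns = {}
    try:
        exec(code, ns)
        fn = ns["handler"]  # assumed entry point of the generated artifact
        return all(fn(x) is not None for x, _ in fixtures)
    except Exception:
        return False

def accuracy_gate(code: str, golden) -> bool:
    # Compare outputs against a golden dataset with a task-specific threshold.
    ns = {}
    exec(code, ns)
    fn = ns["handler"]
    correct = sum(fn(x) == y for x, y in golden)
    return correct / len(golden) >= 0.95

def compile_workflow(code: str, fixtures, golden) -> bool:
    # All four gates must pass before the artifact is deployed.
    return (security_gate(code) and syntax_gate(code)
            and execution_gate(code, fixtures) and accuracy_gate(code, golden))
```

A candidate that fails any gate is sent back to the LLM for regeneration rather than deployed, which is what makes every failure a compile-time failure.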
The critical finding: all 16 failures in their function-calling evaluation (out of 400 tasks) were caught during compilation. Once code passed all four gates, it executed with 100% reliability. Zero runtime failures.
The numbers
Two benchmarks tell the story.
Function calling (BFCL)
400 function-calling tasks from the Berkeley Function-Calling Leaderboard. The model generates code once; that code handles all subsequent calls.
| Metric | Compiled AI | Direct LLM | LangChain | AutoGen |
|---|---|---|---|---|
| Task completion | 96% | 100% | 100% | 100% |
| Runtime tokens per call | 0 | 552 | 720 | 1,080 |
| P50 latency | 4.5ms | 2,025ms | 2,025ms | 2,025ms |
| Output reproducibility | 100% | Variable | Variable | Variable |
| Tokens at 1,000 calls | 9,600 | 552,000 | 720,000 | 1,080,000 |
That 4.5ms versus 2,025ms is not a typo. Once compiled, the function calls execute as native Python. No network round-trip to an inference API. The code just runs. 450x faster.
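For intuition, a compiled function-calling artifact is just ordinary Python. This is a hypothetical example of what the build phase might emit for a weather-lookup tool; the tool name, regex, and fields are illustrative, not taken from the paper:

```python
import re

# Hypothetical output of the compile phase: deterministic parameter
# extraction for a get_weather(city, unit) tool, with no LLM in the loop.
def build_weather_call(user_request: str) -> dict:
    # The LLM baked these patterns in at compile time; at runtime it is plain regex.
    city_match = re.search(r"in ([A-Z][a-zA-Z ]+?)(?:\?|$| in )", user_request)
    unit = "fahrenheit" if "fahrenheit" in user_request.lower() else "celsius"
    return {
        "name": "get_weather",
        "arguments": {
            "city": city_match.group(1).strip() if city_match else None,
            "unit": unit,
        },
    }
```

A call like `build_weather_call("What's the weather in Berlin?")` resolves to a structured tool call in microseconds: one regex and a dict construction, no network round-trip to an inference API.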
The token economics are even more striking at scale. At 1 million transactions per month:
| System | Monthly inference cost | Infrastructure | Total |
|---|---|---|---|
| Compiled AI | $55 | $500 | $555 |
| Direct LLM | $21,500 | $500 | $22,000 |
| LangChain | $28,900 | $500 | $29,400 |
| AutoGen | $31,400 | $500 | $31,900 |
The 4% completion gap (96% vs 100%) deserves scrutiny. Those 16 failed tasks were all caught at compile time — the pipeline rejected them before deployment. In production, you would fall back to runtime LLM calls for those specific cases. A 96% compile rate with 4% runtime fallback still eliminates the vast majority of inference costs.
Document intelligence (DocILE)
5,680 invoices with degraded OCR, variable vendor formats, and semantic ambiguity. This is a harder test. Documents are messy, and pure regex extraction only achieves 20.3% accuracy on key fields.
| Approach | Key field extraction | Line item recognition | Median latency |
|---|---|---|---|
| Pure regex | 20.3% | 59.7% | 0.6ms |
| Compiled AI | 80.0% | 80.4% | 2,695ms |
| Direct LLM | 80.0% | 74.5% | 6,339ms |
| LangChain | 80.0% | 75.6% | 6,207ms |
| AutoGen | 77.8% | 78.9% | 13,742ms |
Compiled AI matches the LLM on key fields, beats it on line items (80.4% vs 74.5%), and runs at less than half the latency. The generated code handles the semantic complexity — interpreting customer names from variable formats, parsing currency amounts with OCR errors — through logic the LLM baked in during compilation.
The break-even math
The formula is simple:
Break-even = compilation tokens / tokens saved per transaction
For BFCL: 9,600 compilation tokens / 552 tokens per direct LLM call = ~17 transactions.
After 17 calls, compiled AI has consumed fewer total tokens than calling the LLM directly. After 1,000 calls, it is 57x cheaper. After 1 million, 40x cheaper on total cost of ownership.
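The arithmetic checks out directly against the paper's BFCL numbers:

```python
compilation_tokens = 9_600       # one-time cost to generate and validate the code
tokens_per_direct_call = 552     # runtime cost of the Direct LLM baseline

# Break-even: calls after which cumulative direct-LLM tokens
# exceed the one-time compilation cost.
break_even = compilation_tokens / tokens_per_direct_call   # ≈ 17.4, i.e. ~17 calls

# At 1,000 calls, the compiled approach still costs only its one-time budget.
direct_total = tokens_per_direct_call * 1_000              # 552,000 tokens
savings = direct_total / compilation_tokens                # ≈ 57x
```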
Any workflow that runs more than 17 times benefits from compilation. That covers essentially every production deployment.
Which tasks can be compiled
Not everything qualifies. The key variable is whether the task’s logic can be fully determined at compile time, or whether it requires runtime context that cannot be anticipated.
| Task class | Compilable? | Reasoning |
|---|---|---|
| Function calling with typed signatures | Yes | Input/output schemas are fixed; parameter extraction follows deterministic patterns |
| Configuration generation | Yes | Template structure is known; variable substitution is mechanical |
| Data extraction from structured documents | Yes | Field locations and formats can be learned from examples |
| Typed API call construction | Yes | API schemas define the output space completely |
| Rule-based routing and classification | Yes | Decision boundaries are enumerable and testable |
| Data transformation pipelines | Yes | Transformations are pure functions over known schemas |
| Open-ended text generation | No | Output space is unbounded; quality depends on runtime reasoning |
| Multi-turn conversation | No | Each turn depends on the previous; cannot be pre-compiled |
| Novel problem solving | No | Input space is too large to enumerate at compile time |
| Real-time context integration | No | External state changes between compilation and execution |
| Creative content generation | No | Requires the model’s generative capability per request |
The boundary is not binary. Many real workflows are partially compilable. An invoice processing pipeline might compile the field extraction (deterministic) while keeping the LLM for anomaly detection (requires judgment). A customer support system might compile the routing logic and response templates while keeping the LLM for cases that do not match any template.
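The partial-compilation pattern can be sketched in a few lines. Here `COMPILED_ROUTES` stands for the deterministic slice and `llm_fallback` is a hypothetical placeholder for a real runtime inference call (both names are illustrative):

```python
COMPILED_ROUTES = {
    # Compiled slice: enumerable routing decisions baked in at build time.
    "refund": "billing_team",
    "invoice": "billing_team",
    "password": "security_team",
    "outage": "incident_team",
}

def llm_fallback(ticket: str) -> str:
    # Placeholder for a runtime LLM call; only reached for unmatched cases.
    return "triage_queue"

def route_ticket(ticket: str) -> str:
    lowered = ticket.lower()
    for keyword, team in COMPILED_ROUTES.items():
        if keyword in lowered:
            return team              # deterministic path: zero tokens
    return llm_fallback(ticket)      # probabilistic path: pay for inference
```

Every ticket that hits the compiled table costs nothing; only the residue that resists compilation still pays per-token prices.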
Progressive compilation: the practical workflow
In practice, you compile progressively. Start with the most deterministic parts of your workflow and expand the compiled surface over time.
```mermaid
graph TD
A[Production Workflow] --> B[Audit: classify each step]
B --> C[Deterministic steps]
B --> D[Probabilistic steps]
C --> E[Compile to code]
D --> F[Keep as LLM calls]
E --> G[Validate with 4-stage pipeline]
G -->|Pass| H[Deploy compiled artifact]
G -->|Fail| F
H --> I[Monitor accuracy in production]
I -->|Drift detected| J[Re-compile with updated spec]
I -->|Stable| K[Identify next compilation candidate]
K --> B
style C fill:#0d0,stroke:#333
style D fill:#f90,stroke:#333
style H fill:#0d0,stroke:#333
```
Step 1: Audit your agent’s task distribution. Log every LLM call for a week. Classify each by whether the logic is deterministic or probabilistic. Most teams find 40-60% of their calls are deterministic: the same transformations, the same routing decisions, the same API parameter construction.
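A minimal audit sketch, assuming you have a week of call logs tagged by prompt template; the log schema and `was_output_schema_fixed` flag are hypothetical, standing in for whatever your observability stack records:

```python
from collections import Counter

# Hypothetical log entries: (prompt_template_id, was_output_schema_fixed)
call_log = [
    ("extract_invoice_fields", True),
    ("extract_invoice_fields", True),
    ("route_ticket", True),
    ("summarize_thread", False),   # open-ended output: not a candidate
    ("extract_invoice_fields", True),
]

def compilation_candidates(log, min_calls: int = 2):
    # A call is a candidate when its output schema is fixed (deterministic
    # logic) and it repeats often enough to amortize the compile cost.
    counts = Counter(template for template, fixed in log if fixed)
    return [template for template, n in counts.items() if n >= min_calls]
```

Ranking candidates by call volume then gives you the compilation order for Step 2.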
Step 2: Compile the deterministic slice. Start with the highest-volume, most repetitive calls. Function calling is the easiest win, config generation the second, and data extraction from well-structured sources the third. Use the four-stage validation pipeline to verify correctness before deployment.
Step 3: Monitor and expand. Track accuracy of compiled artifacts in production. When the compiled code handles a case correctly that you expected to need the LLM for, that is evidence to compile more. When it fails on edge cases, that defines the boundary.
Step 4: Re-compile on drift. Specifications change. APIs update. Document formats evolve. The compiled artifacts need periodic regeneration — but the compilation cost is negligible compared to the ongoing inference cost you are eliminating. The paper’s authors describe this as “an evolutionary system: deterministic execution with adaptive improvement.”
For teams already using the hybrid workflow pattern of mixing LLM and code nodes, compiled AI is the natural next step: instead of hand-writing code nodes, let the LLM generate them during a build phase.
Limitations worth knowing
The paper is honest about its boundaries.
The specification problem. Compiled AI requires accurate YAML specifications of what the workflow should do. Writing those specs is itself a nontrivial task. The authors acknowledge that “iterative refinement is often necessary” — the first spec rarely produces correct compiled code. Natural language specification is listed as future work.
4% compilation failure rate. On function-calling tasks, 16 out of 400 failed to compile. These were all caught before deployment (no silent failures), but they represent tasks where the LLM could not generate correct deterministic code. For production use, you need a fallback path to runtime inference for the tasks that resist compilation.
Two benchmarks. The evaluation covers function calling and document extraction. Other task classes — multi-step reasoning chains, complex data transformations, domain-specific calculations — have not been tested. Generalizability is assumed but not proven.
Model coupling. The quality of compiled artifacts depends on the underlying LLM. Model updates may require re-validation of all compiled workflows. This is analogous to compiler version upgrades breaking builds. Manageable, but it adds operational overhead.
What this means for agent cost management
Inference accounts for 85% of enterprise AI budgets in 2026. Compiled AI attacks the largest line item directly — by eliminating tokens entirely for qualifying workflows, rather than making each token cheaper.
The math compounds with the insight from token efficiency optimization: first reduce the tokens you send per call, then eliminate the calls that do not need to happen. Compiled AI is the logical endpoint of token efficiency — zero tokens per transaction.
For cost management at the system level, compilation adds a new category to the cost optimization toolkit: amortized generation. Instead of paying per-transaction inference costs, you pay a one-time compilation cost that amortizes to near-zero over the workflow’s lifetime.
The competitive implication is that teams running compiled AI on their high-volume deterministic workflows will operate at fundamentally different cost structures than teams paying per-token for every request. At $555/month versus $22,000/month for the same workload, that is not a marginal efficiency gain. It is a structural advantage.
Key takeaways
- Compiled AI eliminates runtime inference for deterministic tasks. The LLM runs once during compilation. Generated code handles all subsequent requests with zero tokens, 450x lower latency, and 100% reproducibility.
- Break-even is 17 transactions. After that, every call is essentially free. At 1 million monthly transactions, total cost drops from $22,000 to $555.
- Not everything compiles. Open-ended generation, multi-turn conversation, novel problem solving, and tasks requiring real-time context still need runtime LLM inference.
- Progressive compilation is the practical path. Audit your workflows, compile the deterministic slice (typically 40-60% of calls), keep the LLM for the rest, and expand the compiled surface over time.
- The four-stage validation pipeline is the safety net. Security, syntax, execution, and accuracy gates catch all failures before deployment. Once compiled code passes, it runs with 100% reliability.
Further reading
- Hybrid agentic workflows: LLM vs code nodes — the decision framework for which workflow steps need LLMs
- Cost management for agents — system-level cost optimization strategies
- Token efficiency optimization — reducing per-call token consumption before eliminating calls
- Terminal agents suffice — another case for simplicity over framework complexity
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch