Speculative decoding in 2026: Saguaro, Nightjar, and the universal draft model
“The draft model sits idle for half the wall-clock time. That’s the bottleneck nobody talks about.”
TL;DR
Three papers solve three different speculative decoding bottlenecks. Saguaro (ICLR 2026, Tri Dao et al.) fills the draft model’s idle time during verification to hit 255.8 tok/s on Llama-70B, 4.7x over autoregressive. Nightjar adapts speculation depth to server load, gaining 27% throughput under realistic traffic patterns. Intel’s SLEM algorithm lets any draft model accelerate any target model regardless of vocabulary, reaching 2.8x speedup. Each targets a different deployment scenario, and understanding which one fits your serving stack is the point of this post. For the foundational inference techniques these build on, see 14 techniques for high-performance ML inference.

Why does speculative decoding matter more in 2026 than in 2024?
Because inference is now the dominant cost. Deloitte’s 2026 TMT Predictions report puts inference at roughly two-thirds of all AI compute spending, up from one-third in 2023 and half in 2025. The inference-optimized chip market alone exceeds $50 billion in 2026. Meanwhile, 55% of AI-optimized cloud infrastructure spending ($20.6B) now goes to inference workloads (ByteIota, 2026).
Speculative decoding directly attacks this cost center. A 2x throughput improvement means serving the same traffic with half the GPUs. At batch size 1 (the latency-sensitive interactive case), standard speculative decoding already delivers 2-3x speedups. The three papers covered here push that envelope further: Saguaro to 4.7x, Nightjar to variable-load scenarios where fixed speculation hurts, and universal drafting to model pairs that were previously incompatible.
The production ecosystem has caught up too. EAGLE-3 (NeurIPS 2025) is now the default in vLLM, SGLang, and TensorRT-LLM. AWS SageMaker auto-selects it. P-EAGLE in vLLM v0.16.0 adds parallel drafting for 1.69x over EAGLE-3 on B200 hardware. Speculative decoding moved from research novelty to production infrastructure between 2024 and 2026.
How does standard speculative decoding work?
A small draft model generates K candidate tokens. The large target model then verifies all K tokens in a single forward pass. Rejection sampling at each position ensures the output distribution is mathematically identical to sampling from the target alone. No quality loss.
Standard speculative decoding (one round):
Draft model (1B): [t1] → [t2] → [t3] → [t4] (fast, sequential)
|
Target model (70B): ══════ verify all 4 in parallel ═══ (one forward pass)
|
Result: [t1 ✓] [t2 ✓] [t3 ✗] → resample [t3*] + bonus token
Why it works: the target model’s forward pass cost at batch size 1 is nearly identical whether processing 1 or K tokens (memory-bandwidth bound). If the draft model is correct 60-80% of the time, you generate ~3 tokens per target model call instead of 1.
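To make the acceptance rule concrete, here is a minimal sketch of one draft-then-verify round with standard rejection sampling. The toy stand-in "models", function names, and per-position scoring loop are mine for illustration; a real implementation scores all positions in a single batched target forward pass.

```python
import torch

# Toy stand-ins: each "model" maps a token sequence to next-token probabilities
# over a small vocabulary. In practice these are the 1B draft and 70B target.
VOCAB = 32

def toy_probs(seq: list[int], seed: int) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed + sum(seq))
    return torch.softmax(torch.randn(VOCAB, generator=g), dim=-1)

def draft_probs(seq):  return toy_probs(seq, seed=1)   # small draft model
def target_probs(seq): return toy_probs(seq, seed=2)   # large target model

def speculative_round(seq: list[int], k: int = 4) -> list[int]:
    """One draft-then-verify round. Returns the tokens emitted this round."""
    # 1) Draft model proposes k tokens autoregressively (cheap, sequential).
    drafted, q = [], []
    for _ in range(k):
        p = draft_probs(seq + drafted)
        t = int(torch.multinomial(p, 1))
        drafted.append(t)
        q.append(p)

    # 2) Target model scores all k+1 positions (here per position for clarity;
    #    a real target does this in ONE parallel forward pass).
    p_tgt = [target_probs(seq + drafted[:i]) for i in range(k + 1)]

    # 3) Rejection sampling: accept drafted[i] with prob min(1, p_target/p_draft),
    #    which keeps the output distribution exactly the target's.
    out = []
    for i, t in enumerate(drafted):
        if torch.rand(()) < min(1.0, (p_tgt[i][t] / q[i][t]).item()):
            out.append(t)
        else:
            # Resample the rejected position from the residual max(p - q, 0).
            resid = torch.clamp(p_tgt[i] - q[i], min=0)
            out.append(int(torch.multinomial(resid / resid.sum(), 1)))
            return out
    # All k accepted: the target's extra position yields a free "bonus" token.
    out.append(int(torch.multinomial(p_tgt[k], 1)))
    return out

# Expected tokens per target call at acceptance rate a and depth k:
#   (1 - a**(k + 1)) / (1 - a)  ->  a = 0.7, k = 4 gives ~2.8 tokens per call.
print(speculative_round([1, 2, 3]))
```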
Typical acceptance rates by task:
| Task type | Acceptance rate (α) |
|---|---|
| Code, structured output | 0.75-0.85 |
| General chat, instructions | 0.60-0.80 |
| Creative writing, domain-specific | 0.50-0.65 |
Below α = 0.50, speculation hurts rather than helps. The three papers below each address a specific failure mode of this basic approach.
What does Saguaro change by speculating on speculations?
Saguaro (arXiv:2603.03251, ICLR 2026) eliminates the sequential bottleneck between drafting and verification. In standard speculative decoding, the draft model sits idle during the entire verification phase. Saguaro fills that dead time.
sequenceDiagram
participant D as Draft model (1B)
participant T as Target model (70B)
participant C as Speculation cache
Note over D,T: Standard SD: draft idles during verify
D->>T: Send draft tokens [t1..t4]
Note over D: Idle waiting...
T->>D: Verified: accept 3, bonus token t*
Note over D,T: Saguaro: draft works during verify
D->>T: Send draft tokens [t1..t4]
D->>C: Predict likely outcomes (k=2,t*=A), (k=3,t*=B)
D->>C: Generate continuations for each outcome
T->>C: Verified: accept 3, bonus=B
C->>T: Cache hit → instant next draft
The mechanics: while the target model verifies round T, the draft model predicts the most likely verification outcomes as (k, t*) pairs, where k is the number of accepted tokens and t* is the bonus token. It generates continuations for several likely outcomes and stores them in a cache. When verification completes, a cache lookup returns the next draft with near-zero latency.
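Here is a minimal sketch of that cache as I read the paper's description: during verification, the draft GPU pre-drafts a continuation for each of a handful of guessed (k, t*) outcomes, and the verifier's actual result becomes a dictionary lookup. The function names and data structures are illustrative, not Saguaro's code.

```python
from typing import Callable

# Sketch of Saguaro's idea: while the target verifies the current draft,
# the draft model pre-computes continuations for the most likely
# verification outcomes, keyed by (accepted_count, bonus_token).

def predraft_cache(
    draft_continue: Callable[[list[int]], list[int]],  # drafts the next k tokens
    prefix: list[int],
    drafted: list[int],
    likely_outcomes: list[tuple[int, int]],  # guessed (k_accepted, bonus_token) pairs
) -> dict[tuple[int, int], list[int]]:
    """Runs on the draft GPU *during* target verification."""
    cache = {}
    for k_accepted, bonus in likely_outcomes:
        hypothetical = prefix + drafted[:k_accepted] + [bonus]
        cache[(k_accepted, bonus)] = draft_continue(hypothetical)
    return cache

def next_draft(cache, k_accepted, bonus, draft_continue, prefix, drafted):
    """Called when verification returns: a cache hit means zero extra draft latency."""
    key = (k_accepted, bonus)
    if key in cache:
        return cache[key]                      # hit: next draft is already ready
    hypothetical = prefix + drafted[:k_accepted] + [bonus]
    return draft_continue(hypothetical)        # miss: fall back to normal drafting
```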
Three optimizations make this work:
- Geometric fan-out distributes the speculation budget across outcomes weighted by acceptance probability. Achieves ~90% bonus token prediction accuracy.
- Acceptance-cache tradeoff sampling downweights already-cached tokens during drafting, maximizing cache hit rate with a tunable tension parameter.
- Adaptive fallback switches strategies based on batch size: high-quality fallback at small batches, random-token fallback at large batches where the overhead matters more.
Measured on Llama-3.1-70B Instruct (4x H100, tensor parallel) with Llama-3.2-1B as the draft on a separate H100:
| Method | Tokens/sec | Speedup vs AR |
|---|---|---|
| Autoregressive | 54.7 | 1.0x |
| Standard SD | 161.8 | 3.0x |
| Saguaro | 255.8 | 4.7x |
The catch: Saguaro requires a dedicated GPU for the draft model, separate from the target model’s hardware. Standard SD often collocates draft and target on the same device. This is a deployment architecture change, not just a configuration flag.
How does Nightjar adapt to changing server load?
Nightjar (arXiv:2512.22420) solves a problem Saguaro ignores: what happens when server load changes. Standard speculative decoding with a fixed speculation length (say, γ=4) helps at low QPS but hurts at high QPS. The target model becomes compute-saturated at large batch sizes, and speculation adds overhead without benefit.
Nightjar uses a multi-armed bandit (MAB) planner that continuously adjusts speculation depth γ ∈ {0, 1, 2, 3, 4, 5} based on current batch size. Setting γ=0 disables speculation entirely. The planner monitors batch size as its primary context signal and uses exploration-exploitation balancing to find the optimal depth for each load level.
graph TD
A[Incoming requests] --> B{Current batch size?}
B -->|B < 4| C[γ = 4-5\nFull speculation\nMax latency benefit]
B -->|B = 4-16| D[γ = 2-3\nReduced speculation\nBalance throughput/latency]
B -->|B > 16| E[γ = 0-1\nMinimal or no speculation\nAvoid overhead]
E -->|Memory pressure| F[Offload draft model to CPU\nReclaim GPU for KV cache]
F -->|Load drops| D
When the server is heavily loaded and speculation is disabled (γ=0), Nightjar does something surprising: it offloads the draft model entirely to CPU via async CUDA streams, reclaiming that GPU memory for KV cache. This enables larger batch sizes under memory pressure. Reload triggers when the request queue empties. The overhead is 143.9ms for GPU-to-CPU transfer and 11.9ms for reload (Triton-accelerated).
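Here is a rough sketch of what draft-model offload can look like with PyTorch CUDA streams. Nightjar's actual implementation uses a Triton-accelerated reload path and is more careful about buffers and memory management; treat this as the shape of the idea, not their code.

```python
import torch

offload_stream = torch.cuda.Stream()  # separate stream so offload overlaps serving

def offload_draft_to_cpu(draft_model: torch.nn.Module) -> None:
    """Free draft-model GPU memory for KV cache when speculation is disabled (γ=0)."""
    with torch.cuda.stream(offload_stream):
        for p in draft_model.parameters():
            # Asynchronous device-to-host copy needs a pinned host destination.
            cpu_copy = torch.empty(p.shape, dtype=p.dtype, device="cpu", pin_memory=True)
            cpu_copy.copy_(p, non_blocking=True)
            p.data = cpu_copy
    offload_stream.synchronize()
    torch.cuda.empty_cache()  # let the serving engine grow its KV cache

def reload_draft_to_gpu(draft_model: torch.nn.Module, device: str = "cuda") -> None:
    """Triggered when the request queue drains and speculation becomes useful again."""
    for p in draft_model.parameters():
        p.data = p.data.to(device, non_blocking=True)
    torch.cuda.synchronize()
```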
Results on real workloads (Azure LLM Inference Dataset traces, Poisson arrival 5-35 QPS):
| Comparison | Improvement |
|---|---|
| Throughput (vs standard SD) | +27.29% |
| Latency (vs standard SD) | -20.18% |
| TTFT under high load (vs standard SD) | -47.2% |
| vs DSD baseline | +22.89% |
| vs BanditSpec | +19.76% |
Tested on DeepSeek-R1-Distill-Qwen-7B (RTX 4090), Vicuna-13B (A100), and Qwen2.5-32B (2x L20 tensor parallel).
The KV cache switching cost ranges from 17.87ms to 102.03ms depending on batch size and context length. The MAB planner factors this penalty into its arm-selection formula, preventing thrashing between speculation depths.
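Putting the planner together, here is a minimal UCB-style sketch in which each speculation depth γ is an arm, the context is a batch-size bucket, and a switching penalty is subtracted whenever the chosen depth would change. The bucketing, reward definition, and penalty handling are my assumptions; Nightjar's actual arm-selection formula differs in its details.

```python
import math
from collections import defaultdict

class DepthPlanner:
    """Contextual multi-armed bandit over speculation depths γ ∈ {0..5}."""

    ARMS = (0, 1, 2, 3, 4, 5)

    def __init__(self, explore: float = 1.0):
        self.explore = explore
        self.counts = defaultdict(int)    # (bucket, γ) -> pulls
        self.values = defaultdict(float)  # (bucket, γ) -> mean observed reward
        self.last_arm = None

    @staticmethod
    def bucket(batch_size: int) -> str:
        return "low" if batch_size < 4 else "mid" if batch_size <= 16 else "high"

    def choose(self, batch_size: int, switch_penalty: float) -> int:
        # switch_penalty: the measured KV-cache switching cost (18-102 ms),
        # converted into the same units as the reward.
        ctx = self.bucket(batch_size)
        total = sum(self.counts[(ctx, a)] for a in self.ARMS) + 1
        best, best_score = self.ARMS[0], float("-inf")
        for a in self.ARMS:
            n = self.counts[(ctx, a)]
            if n == 0:
                return a  # try every depth at least once per load bucket
            ucb = self.values[(ctx, a)] + self.explore * math.sqrt(math.log(total) / n)
            if a != self.last_arm:
                ucb -= switch_penalty  # discourage thrashing between depths
            if ucb > best_score:
                best, best_score = a, ucb
        return best

    def update(self, batch_size: int, arm: int, reward: float) -> None:
        # reward could be observed throughput (tok/s) for the window, for example.
        key = (self.bucket(batch_size), arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
        self.last_arm = arm

# e.g. planner.choose(batch_size=24, switch_penalty=0.1) tends toward γ = 0-1
```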
What if your draft model has a different vocabulary?
Every approach above requires the draft and target models to share a tokenizer. You can’t use a Llama-3 draft to accelerate a Gemma target because their vocabularies differ. Intel Labs and the Weizmann Institute solved this with three algorithms published at ICML 2025 (oral, top 1% of ~15,000 submissions).
The core idea behind SLEM (String-Level Exact Match): use plain text as a universal bridge between vocabularies.
- Draft model generates tokens in its own vocabulary
- Decode those tokens to a plain text string
- Re-tokenize that string into the target model’s vocabulary
- Target model verifies in a single parallel forward pass
- Accept via rejection sampling up to the first mismatch
This enables pairing any draft with any target. A tiny GPT-2 variant can accelerate a Llama-3.1-70B. The SLEM algorithm achieves 2.8x speedup over autoregressive decoding on summarization, programming, and long-context tasks (arXiv:2502.05202).
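The text bridge itself is easy to sketch with two off-the-shelf tokenizers. This is a simplified greedy version of the idea (keep the longest agreeing prefix), not the exact SLEM acceptance rule, and the model names are just examples.

```python
from transformers import AutoTokenizer

# Two unrelated vocabularies: the draft drafts in its own token space,
# plain text is the bridge, and the target verifies in its own token space.
draft_tok = AutoTokenizer.from_pretrained("gpt2")
target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example target

def bridge_draft_to_target(draft_token_ids: list[int]) -> list[int]:
    """Decode draft tokens to text, then re-tokenize into the target vocabulary."""
    text = draft_tok.decode(draft_token_ids, skip_special_tokens=True)
    return target_tok.encode(text, add_special_tokens=False)

def accept_prefix(candidate_ids: list[int], target_ids: list[int]) -> list[int]:
    """Simplified string-level exact match: keep the longest agreeing prefix.
    target_ids stands in for what the target would emit at each position,
    all obtained from one parallel verification pass."""
    accepted = []
    for cand, tgt in zip(candidate_ids, target_ids):
        if cand != tgt:
            break
        accepted.append(cand)
    return accepted
```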
Two additional algorithms cover different tradeoff points:
| Algorithm | How it works | Speedup | When to use |
|---|---|---|---|
| SLEM | Text-bridge between vocabularies | 2.8x | Maximum speed, any model pair |
| SLRS | String-level rejection sampling | ~2x | When draft quality varies |
| TLI | Token intersection (shared vocab subset) | 1.7x | Conservative, no text decoding overhead |
SLEM has been merged into Hugging Face Transformers (tracking: github.com/huggingface/transformers/issues/39545). For teams already using HF’s generate() pipeline, this is the lowest-friction path to cross-vocabulary speculative decoding.
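If you're on a recent Transformers release, cross-tokenizer assisted generation is exposed through generate() by passing both tokenizers alongside the assistant model. The model pairing below is illustrative and the minimum version requirement varies, so check the Transformers docs for your installed release.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target and draft use DIFFERENT tokenizers; generate() bridges through plain text.
target_name = "meta-llama/Llama-3.1-8B-Instruct"  # example target
draft_name = "double7/vicuna-68m"                 # example tiny draft

target_tok = AutoTokenizer.from_pretrained(target_name)
draft_tok = AutoTokenizer.from_pretrained(draft_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = target_tok(
    "Summarize speculative decoding in one sentence.", return_tensors="pt"
).to(target.device)

out = target.generate(
    **inputs,
    assistant_model=draft,          # draft with a mismatched vocabulary
    tokenizer=target_tok,           # both tokenizers are needed so generate()
    assistant_tokenizer=draft_tok,  # can decode and re-encode through text
    max_new_tokens=64,
)
print(target_tok.decode(out[0], skip_special_tokens=True))
```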
Which approach fits your serving stack?
| Scenario | Best approach | Why |
|---|---|---|
| Single-user latency, dedicated GPU budget | Saguaro | 4.7x speedup at batch 1, needs separate draft GPU |
| Variable traffic, shared serving infra | Nightjar | Adapts to load, offloads draft when speculation hurts |
| Mismatched draft/target vocabularies | SLEM | Only option that works across tokenizers |
| Production default, minimum integration effort | EAGLE-3 | Native in vLLM, SGLang, TensorRT-LLM |
| Maximum throughput on latest hardware | P-EAGLE | 1.69x over EAGLE-3 on B200, vLLM v0.16.0+ |
Framework support as of March 2026:
| Variant | vLLM | SGLang | TensorRT-LLM | HF Transformers |
|---|---|---|---|---|
| EAGLE-3 | Yes | Yes | Yes | Via speculators |
| P-EAGLE | Yes (v0.16+) | No | No | No |
| Saguaro | No (standalone repo) | No | No | No |
| Nightjar | No | No | No | No |
| SLEM/Universal | No | No | No | Yes (merged) |
| MTP (DeepSeek) | Yes | Yes | Yes | N/A |
| Medusa | Yes | No | Partial | Yes |
Saguaro and Nightjar are research implementations with standalone codebases (github.com/tanishqkumar/ssd for Saguaro). Neither has first-class framework integration yet. For production today, EAGLE-3 or P-EAGLE through vLLM remains the pragmatic choice.
The practical composition: use EAGLE-3 as your production baseline, evaluate Saguaro if you have dedicated draft GPUs and need maximum single-user latency, and adopt SLEM when your best available draft model doesn’t share a tokenizer with your target. Nightjar’s load-adaptive approach will likely be absorbed into serving frameworks within 6-12 months rather than deployed standalone.
For the serving infrastructure these techniques plug into, see LLM serving infrastructure.
Takeaways
- Speculative decoding is now production infrastructure, not a research trick. EAGLE-3 ships in every major serving framework.
- Saguaro (4.7x over AR) fills the draft model’s idle time during verification. The cost is a dedicated draft GPU, making it best for latency-critical deployments.
- Nightjar (+27% throughput) adapts speculation depth to server load. It addresses the real-world problem that fixed speculation hurts at high batch sizes.
- SLEM (2.8x, ICML 2025) breaks the shared-vocabulary requirement. Any draft model can now accelerate any target model by using text as the bridge.
- Inference is two-thirds of AI compute spending in 2026 (Deloitte). A 2x throughput gain from speculative decoding halves your GPU fleet for interactive workloads.
- For production today: start with EAGLE-3 in vLLM or SGLang. Evaluate Saguaro and SLEM as your requirements diverge from the default.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch