Speculative decoding in 2026: Saguaro, Nightjar, and the universal draft model
“The draft model sits idle for half the wall-clock time. That’s the bottleneck nobody talks about.”
TL;DR
Three papers solve three different speculative decoding bottlenecks. Saguaro (ICLR 2026, Tri Dao et al.) fills the draft model’s idle time during verification to hit 255.8 tok/s on Llama-70B, 4.7x over autoregressive. Nightjar adapts speculation depth to server load, gaining 27% throughput under realistic traffic patterns. Intel’s SLEM algorithm lets any draft model accelerate any target model regardless of vocabulary, reaching 2.8x speedup. Each targets a different deployment scenario, and understanding which one fits your serving stack is the point of this post. For the foundational inference techniques these build on, see 14 techniques for high-performance ML inference.

Why does speculative decoding matter more in 2026 than in 2024?
Because inference is now the dominant cost. Deloitte’s 2026 TMT Predictions report puts inference at roughly two-thirds of all AI compute spending, up from one-third in 2023 and half in 2025. The inference-optimized chip market alone exceeds $50 billion in 2026. Meanwhile, 55% of AI-optimized cloud infrastructure spending ($20.6B) now goes to inference workloads (ByteIota, 2026).
Speculative decoding directly attacks this cost center. A 2x throughput improvement means serving the same traffic with half the GPUs. At batch size 1 (the latency-sensitive interactive case), standard speculative decoding already delivers 2-3x speedups. The three papers covered here push that envelope further: Saguaro to 4.7x, Nightjar to variable-load scenarios where fixed speculation hurts, and universal drafting to model pairs that were previously incompatible.
The production ecosystem has caught up too. EAGLE-3 (NeurIPS 2025) is now the default in vLLM, SGLang, and TensorRT-LLM. AWS SageMaker auto-selects it. P-EAGLE in vLLM v0.16.0 adds parallel drafting for 1.69x over EAGLE-3 on B200 hardware. Speculative decoding moved from research novelty to production infrastructure between 2024 and 2026.
How does standard speculative decoding work?
A small draft model generates K candidate tokens. The large target model then verifies all K tokens in a single forward pass. Rejection sampling at each position ensures the output distribution is mathematically identical to sampling from the target alone. No quality loss.
Standard speculative decoding (one round):
Draft model (1B): [t1] → [t2] → [t3] → [t4] (fast, sequential)
|
Target model (70B): ══════ verify all 4 in parallel ═══ (one forward pass)
|
Result: [t1 ✓] [t2 ✓] [t3 ✗] → resample [t3*] + bonus token
Why it works: the target model’s forward pass cost at batch size 1 is nearly identical whether processing 1 or K tokens (memory-bandwidth bound). If the draft model is correct 60-80% of the time, you generate ~3 tokens per target model call instead of 1.
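To make the acceptance rule concrete, here is a minimal sketch of one draft-then-verify round with standard rejection sampling. The toy stand-in "models", function names, and per-position scoring loop are mine for illustration; a real implementation scores all positions in a single batched target forward pass.

```python
import torch

# Toy stand-ins: each "model" maps a token sequence to next-token probabilities
# over a small vocabulary. In practice these are the 1B draft and 70B target.
VOCAB = 32

def toy_probs(seq: list[int], seed: int) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed + sum(seq))
    return torch.softmax(torch.randn(VOCAB, generator=g), dim=-1)

def draft_probs(seq):  return toy_probs(seq, seed=1)   # small draft model
def target_probs(seq): return toy_probs(seq, seed=2)   # large target model

def speculative_round(seq: list[int], k: int = 4) -> list[int]:
    """One draft-then-verify round. Returns the tokens emitted this round."""
    # 1) Draft model proposes k tokens autoregressively (cheap, sequential).
    drafted, q = [], []
    for _ in range(k):
        p = draft_probs(seq + drafted)
        t = int(torch.multinomial(p, 1))
        drafted.append(t)
        q.append(p)

    # 2) Target model scores all k+1 positions (here per position for clarity;
    #    a real target does this in ONE parallel forward pass).
    p_tgt = [target_probs(seq + drafted[:i]) for i in range(k + 1)]

    # 3) Rejection sampling: accept drafted[i] with prob min(1, p_target/p_draft),
    #    which keeps the output distribution exactly the target's.
    out = []
    for i, t in enumerate(drafted):
        if torch.rand(()) < min(1.0, (p_tgt[i][t] / q[i][t]).item()):
            out.append(t)
        else:
            # Resample the rejected position from the residual max(p - q, 0).
            resid = torch.clamp(p_tgt[i] - q[i], min=0)
            out.append(int(torch.multinomial(resid / resid.sum(), 1)))
            return out
    # All k accepted: the target's extra position yields a free "bonus" token.
    out.append(int(torch.multinomial(p_tgt[k], 1)))
    return out

# Expected tokens per target call at acceptance rate a and depth k:
#   (1 - a**(k + 1)) / (1 - a)  ->  a = 0.7, k = 4 gives ~2.8 tokens per call.
print(speculative_round([1, 2, 3]))
```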
Typical acceptance rates by task:
| Task type | Acceptance rate (α) |
|---|---|
| Code, structured output | 0.75-0.85 |
| General chat, instructions | 0.60-0.80 |
| Creative writing, domain-specific | 0.50-0.65 |
Below α = 0.50, speculation hurts rather than helps. The three papers below each address a specific failure mode of this basic approach.
What does Saguaro change by speculating on speculations?
Saguaro (arXiv:2603.03251, ICLR 2026) eliminates the sequential bottleneck between drafting and verification. In standard speculative decoding, the draft model sits idle during the entire verification phase. Saguaro fills that dead time.
sequenceDiagram
participant D as Draft model (1B)
participant T as Target model (70B)
participant C as Speculation cache
Note over D,T: Standard SD: draft idles during verify
D->>T: Send draft tokens [t1..t4]
Note over D: Idle waiting...
T->>D: Verified: accept 3, bonus token t*
Note over D,T: Saguaro: draft works during verify
D->>T: Send draft tokens [t1..t4]
D->>C: Predict likely outcomes (k=2,t*=A), (k=3,t*=B)
D->>C: Generate continuations for each outcome
T->>C: Verified: accept 3, bonus=B
C->>T: Cache hit → instant next draft
The mechanics: while the target model verifies round T, the draft model predicts the most likely verification outcomes as (k, t*) pairs, where k is the number of accepted tokens and t* is the bonus token. It generates continuations for several likely outcomes and stores them in a cache. When verification completes, a cache lookup returns the next draft with near-zero latency.
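Here is a minimal sketch of that cache as I read the paper's description: during verification, the draft GPU pre-drafts a continuation for each of a handful of guessed (k, t*) outcomes, and the verifier's actual result becomes a dictionary lookup. The function names and data structures are illustrative, not Saguaro's code.

```python
from typing import Callable

# Sketch of Saguaro's idea: while the target verifies the current draft,
# the draft model pre-computes continuations for the most likely
# verification outcomes, keyed by (accepted_count, bonus_token).

def predraft_cache(
    draft_continue: Callable[[list[int]], list[int]],  # drafts the next k tokens
    prefix: list[int],
    drafted: list[int],
    likely_outcomes: list[tuple[int, int]],  # guessed (k_accepted, bonus_token) pairs
) -> dict[tuple[int, int], list[int]]:
    """Runs on the draft GPU *during* target verification."""
    cache = {}
    for k_accepted, bonus in likely_outcomes:
        hypothetical = prefix + drafted[:k_accepted] + [bonus]
        cache[(k_accepted, bonus)] = draft_continue(hypothetical)
    return cache

def next_draft(cache, k_accepted, bonus, draft_continue, prefix, drafted):
    """Called when verification returns: a cache hit means zero extra draft latency."""
    key = (k_accepted, bonus)
    if key in cache:
        return cache[key]                      # hit: next draft is already ready
    hypothetical = prefix + drafted[:k_accepted] + [bonus]
    return draft_continue(hypothetical)        # miss: fall back to normal drafting
```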
Three optimizations make this work:
- Geometric fan-out distributes the speculation budget across outcomes weighted by acceptance probability. Achieves ~90% bonus token prediction accuracy.
- Acceptance-cache tradeoff sampling downweights already-cached tokens during drafting, maximizing cache hit rate with a tunable tension parameter.
- Adaptive fallback switches strategies based on batch size: high-quality fallback at small batches, random-token fallback at large batches where the overhead matters more.
Measured on Llama-3.1-70B Instruct (4x H100, tensor parallel) with Llama-3.2-1B as the draft on a separate H100:
| Method | Tokens/sec | Speedup vs AR |
|---|---|---|
| Autoregressive | 54.7 | 1.0x |
| Standard SD | 161.8 | 3.0x |
| Saguaro | 255.8 | 4.7x |
The catch: Saguaro requires a dedicated GPU for the draft model, separate from the target model’s hardware. Standard SD often collocates draft and target on the same device. This is a deployment architecture change, not just a configuration flag.
How does Nightjar adapt to changing server load?
Nightjar (arXiv:2512.22420) solves a problem Saguaro ignores: what happens when server load changes. Standard speculative decoding with a fixed speculation length (say, γ=4) helps at low QPS but hurts at high QPS. The target model becomes compute-saturated at large batch sizes, and speculation adds overhead without benefit.
Nightjar uses a multi-armed bandit (MAB) planner that continuously adjusts speculation depth γ ∈ {0, 1, 2, 3, 4, 5} based on current batch size. Setting γ=0 disables speculation entirely. The planner monitors batch size as its primary context signal and uses exploration-exploitation balancing to find the optimal depth for each load level.
graph TD
A[Incoming requests] --> B{Current batch size?}
B -->|B < 4| C[γ = 4-5\nFull speculation\nMax latency benefit]
B -->|B = 4-16| D[γ = 2-3\nReduced speculation\nBalance throughput/latency]
B -->|B > 16| E[γ = 0-1\nMinimal or no speculation\nAvoid overhead]
E -->|Memory pressure| F[Offload draft model to CPU\nReclaim GPU for KV cache]
F -->|Load drops| D
When the server is heavily loaded and speculation is disabled (γ=0), Nightjar does something surprising: it offloads the draft model entirely to CPU via async CUDA streams, reclaiming that GPU memory for KV cache. This enables larger batch sizes under memory pressure. Reload triggers when the request queue empties. The overhead is 143.9ms for GPU-to-CPU transfer and 11.9ms for reload (Triton-accelerated).
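Here is a rough sketch of what draft-model offload can look like with PyTorch CUDA streams. Nightjar's actual implementation uses a Triton-accelerated reload path and is more careful about buffers and memory management; treat this as the shape of the idea, not their code.

```python
import torch

offload_stream = torch.cuda.Stream()  # separate stream so offload overlaps serving

def offload_draft_to_cpu(draft_model: torch.nn.Module) -> None:
    """Free draft-model GPU memory for KV cache when speculation is disabled (γ=0)."""
    with torch.cuda.stream(offload_stream):
        for p in draft_model.parameters():
            # Asynchronous device-to-host copy needs a pinned host destination.
            cpu_copy = torch.empty(p.shape, dtype=p.dtype, device="cpu", pin_memory=True)
            cpu_copy.copy_(p, non_blocking=True)
            p.data = cpu_copy
    offload_stream.synchronize()
    torch.cuda.empty_cache()  # let the serving engine grow its KV cache

def reload_draft_to_gpu(draft_model: torch.nn.Module, device: str = "cuda") -> None:
    """Triggered when the request queue drains and speculation becomes useful again."""
    for p in draft_model.parameters():
        p.data = p.data.to(device, non_blocking=True)
    torch.cuda.synchronize()
```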
Results on real workloads (Azure LLM Inference Dataset traces, Poisson arrival 5-35 QPS):
| Comparison | Improvement |
|---|---|
| Throughput (vs standard SD) | +27.29% |
| Latency (vs standard SD) | -20.18% |
| TTFT under high load (vs standard SD) | -47.2% |
| vs DSD baseline | +22.89% |
| vs BanditSpec | +19.76% |
Tested on DeepSeek-R1-Distill-Qwen-7B (RTX 4090), Vicuna-13B (A100), and Qwen2.5-32B (2x L20 tensor parallel).
The KV cache switching cost ranges from 17.87ms to 102.03ms depending on batch size and context length. The MAB planner factors this penalty into its arm-selection formula, preventing thrashing between speculation depths.
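Putting the planner together, here is a minimal UCB-style sketch in which each speculation depth γ is an arm, the context is a batch-size bucket, and a switching penalty is subtracted whenever the chosen depth would change. The bucketing, reward definition, and penalty handling are my assumptions; Nightjar's actual arm-selection formula differs in its details.

```python
import math
from collections import defaultdict

class DepthPlanner:
    """Contextual multi-armed bandit over speculation depths γ ∈ {0..5}."""

    ARMS = (0, 1, 2, 3, 4, 5)

    def __init__(self, explore: float = 1.0):
        self.explore = explore
        self.counts = defaultdict(int)    # (bucket, γ) -> pulls
        self.values = defaultdict(float)  # (bucket, γ) -> mean observed reward
        self.last_arm = None

    @staticmethod
    def bucket(batch_size: int) -> str:
        return "low" if batch_size < 4 else "mid" if batch_size <= 16 else "high"

    def choose(self, batch_size: int, switch_penalty: float) -> int:
        # switch_penalty: the measured KV-cache switching cost (18-102 ms),
        # converted into the same units as the reward.
        ctx = self.bucket(batch_size)
        total = sum(self.counts[(ctx, a)] for a in self.ARMS) + 1
        best, best_score = self.ARMS[0], float("-inf")
        for a in self.ARMS:
            n = self.counts[(ctx, a)]
            if n == 0:
                return a  # try every depth at least once per load bucket
            ucb = self.values[(ctx, a)] + self.explore * math.sqrt(math.log(total) / n)
            if a != self.last_arm:
                ucb -= switch_penalty  # discourage thrashing between depths
            if ucb > best_score:
                best, best_score = a, ucb
        return best

    def update(self, batch_size: int, arm: int, reward: float) -> None:
        # reward could be observed throughput (tok/s) for the window, for example.
        key = (self.bucket(batch_size), arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
        self.last_arm = arm

# e.g. planner.choose(batch_size=24, switch_penalty=0.1) tends toward γ = 0-1
```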
What if your draft model has a different vocabulary?
Every approach above requires the draft and target models to share a tokenizer. You can’t use a Llama-3 draft to accelerate a Gemma target because their vocabularies differ. Intel Labs and the Weizmann Institute solved this with three algorithms published at ICML 2025 (oral, top 1% of ~15,000 submissions).
The core idea behind SLEM (String-Level Exact Match): use plain text as a universal bridge between vocabularies.
- Draft model generates tokens in its own vocabulary
- Decode those tokens to a plain text string
- Re-tokenize that string into the target model’s vocabulary
- Target model verifies in a single parallel forward pass
- Accept via rejection sampling up to the first mismatch
This enables pairing any draft with any target. A tiny GPT-2 variant can accelerate a Llama-3.1-70B. The SLEM algorithm achieves 2.8x speedup over autoregressive decoding on summarization, programming, and long-context tasks (arXiv:2502.05202).
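The text bridge itself is easy to sketch with two off-the-shelf tokenizers. This is a simplified greedy version of the idea (keep the longest agreeing prefix), not the exact SLEM acceptance rule, and the model names are just examples.

```python
from transformers import AutoTokenizer

# Two unrelated vocabularies: the draft drafts in its own token space,
# plain text is the bridge, and the target verifies in its own token space.
draft_tok = AutoTokenizer.from_pretrained("gpt2")
target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example target

def bridge_draft_to_target(draft_token_ids: list[int]) -> list[int]:
    """Decode draft tokens to text, then re-tokenize into the target vocabulary."""
    text = draft_tok.decode(draft_token_ids, skip_special_tokens=True)
    return target_tok.encode(text, add_special_tokens=False)

def accept_prefix(candidate_ids: list[int], target_ids: list[int]) -> list[int]:
    """Simplified string-level exact match: keep the longest agreeing prefix.
    target_ids stands in for what the target would emit at each position,
    all obtained from one parallel verification pass."""
    accepted = []
    for cand, tgt in zip(candidate_ids, target_ids):
        if cand != tgt:
            break
        accepted.append(cand)
    return accepted
```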
Two additional algorithms cover different tradeoff points:
| Algorithm | How it works | Speedup | When to use |
|---|---|---|---|
| SLEM | Text-bridge between vocabularies | 2.8x | Maximum speed, any model pair |
| SLRS | String-level rejection sampling | ~2x | When draft quality varies |
| TLI | Token intersection (shared vocab subset) | 1.7x | Conservative, no text decoding overhead |
SLEM has been merged into Hugging Face Transformers (tracking: github.com/huggingface/transformers/issues/39545). For teams already using HF’s generate() pipeline, this is the lowest-friction path to cross-vocabulary speculative decoding.
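If you're on a recent Transformers release, cross-tokenizer assisted generation is exposed through generate() by passing both tokenizers alongside the assistant model. The model pairing below is illustrative and the minimum version requirement varies, so check the Transformers docs for your installed release.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target and draft use DIFFERENT tokenizers; generate() bridges through plain text.
target_name = "meta-llama/Llama-3.1-8B-Instruct"  # example target
draft_name = "double7/vicuna-68m"                 # example tiny draft

target_tok = AutoTokenizer.from_pretrained(target_name)
draft_tok = AutoTokenizer.from_pretrained(draft_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = target_tok(
    "Summarize speculative decoding in one sentence.", return_tensors="pt"
).to(target.device)

out = target.generate(
    **inputs,
    assistant_model=draft,          # draft with a mismatched vocabulary
    tokenizer=target_tok,           # both tokenizers are needed so generate()
    assistant_tokenizer=draft_tok,  # can decode and re-encode through text
    max_new_tokens=64,
)
print(target_tok.decode(out[0], skip_special_tokens=True))
```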
Which approach fits your serving stack?
| Scenario | Best approach | Why |
|---|---|---|
| Single-user latency, dedicated GPU budget | Saguaro | 4.7x speedup at batch 1, needs separate draft GPU |
| Variable traffic, shared serving infra | Nightjar | Adapts to load, offloads draft when speculation hurts |
| Mismatched draft/target vocabularies | SLEM | Only option that works across tokenizers |
| Production default, minimum integration effort | EAGLE-3 | Native in vLLM, SGLang, TensorRT-LLM |
| Maximum throughput on latest hardware | P-EAGLE | 1.69x over EAGLE-3 on B200, vLLM v0.16.0+ |
Framework support as of March 2026:
| Variant | vLLM | SGLang | TensorRT-LLM | HF Transformers |
|---|---|---|---|---|
| EAGLE-3 | Yes | Yes | Yes | Via speculators |
| P-EAGLE | Yes (v0.16+) | No | No | No |
| Saguaro | No (standalone repo) | No | No | No |
| Nightjar | No | No | No | No |
| SLEM/Universal | No | No | No | Yes (merged) |
| MTP (DeepSeek) | Yes | Yes | Yes | N/A |
| Medusa | Yes | No | Partial | Yes |
Saguaro and Nightjar are research implementations with standalone codebases (github.com/tanishqkumar/ssd for Saguaro). Neither has first-class framework integration yet. For production today, EAGLE-3 or P-EAGLE through vLLM remains the pragmatic choice.
The practical composition: use EAGLE-3 as your production baseline, evaluate Saguaro if you have dedicated draft GPUs and need maximum single-user latency, and adopt SLEM when your best available draft model doesn’t share a tokenizer with your target. Nightjar’s load-adaptive approach will likely be absorbed into serving frameworks within 6-12 months rather than deployed standalone.
For the serving infrastructure these techniques plug into, see LLM serving infrastructure.
Takeaways
- Speculative decoding is now production infrastructure, not a research trick. EAGLE-3 ships in every major serving framework.
- Saguaro (4.7x over AR) fills the draft model’s idle time during verification. The cost is a dedicated draft GPU, making it best for latency-critical deployments.
- Nightjar (+27% throughput) adapts speculation depth to server load. It addresses the real-world problem that fixed speculation hurts at high batch sizes.
- SLEM (2.8x, ICML 2025) breaks the shared-vocabulary requirement. Any draft model can now accelerate any target model by using text as the bridge.
- Inference is two-thirds of AI compute spending in 2026 (Deloitte). A 2x throughput gain from speculative decoding halves your GPU fleet for interactive workloads.
- For production today: start with EAGLE-3 in vLLM or SGLang. Evaluate Saguaro and SLEM as your requirements diverge from the default.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch