
A glowing power-law curve on a dark grid surface with scattered data points, representing scaling law predictions

TL;DR — Speculative decoding has been a tuning problem: pick a draft model, measure acceptance rate, iterate. SDSL (arXiv 2603.11053) turns it into a systems problem with predictive scaling laws. The headline finding: the throughput-optimal draft model is roughly 200x smaller than the target. For Llama-3-70B, that is a 189M parameter model, not the 1B-8B range most practitioners reach for. Stanford’s Speculative Speculative Decoding (arXiv 2603.03251, ICLR 2026) applies the same insight recursively, pre-speculating verification outcomes to eliminate draft latency entirely and hitting 5x over autoregressive on Llama-70B. Together, these papers shift speculative decoding from craft to engineering.


The guessing game that runs your inference stack

Every production team deploying speculative decoding makes the same bet: which draft model produces the best throughput? The decision process is remarkably informal for infrastructure that handles millions of tokens per day. You try the smallest model in your family. You benchmark a few acceptance rates. You pick the one that feels right.

This is how most teams landed on 1B-8B draft models for 70B targets. The logic is intuitive — a bigger draft model accepts more tokens, so you verify fewer times. And that logic is wrong, or at least incomplete, because it optimizes for acceptance rate when the actual objective is tokens per FLOP.

SDSL makes the case that speculative decoding obeys the same kind of empirical scaling laws that govern model training. The acceptance rate between any draft-target pair is predictable from their perplexities. The optimal draft size is predictable from the target size. And the prediction says most teams are burning compute on draft models 5-40x larger than they need.

For context on the speculative decoding work this builds on, see Saguaro, Nightjar, and universal draft models and 14 inference optimization techniques.

What SDSL actually says

Bozorgkhoo and Molybog start from a premise borrowed directly from Chinchilla: if you can predict a model’s perplexity from its parameter count and training data, and you can predict acceptance rate from the perplexities of two models, then you can predict the optimal draft-target pairing without running a single inference benchmark.

The acceptance rate formula

The expected acceptance rate between a draft model (perplexity x) and target model (perplexity y) follows a linear model:

\[\alpha = Ax + By + C\]

Fitted across OPT, Qwen 1.5/2.5, and Llama 3 families:

| Coefficient | Value | Interpretation |
|---|---|---|
| A (draft perplexity) | -0.00670 | Lower draft perplexity raises acceptance |
| B (target perplexity) | +0.01297 | Lower target perplexity lowers acceptance |
| C (intercept) | 0.64208 | Baseline acceptance rate |
| R-squared | 0.6023 | Explains 60% of variance — enough to size hardware, not enough for SLA math |

Mind the signs. Higher draft perplexity lowers acceptance (A is negative), while higher target perplexity raises it (B is positive): a noisier target distribution is easier for a draft to match. In deployment the target is fixed, so y is a constant and draft perplexity is the only lever you control. Improving your draft model is what moves speculative decoding throughput, but only up to the point where the draft model's inference cost starts eating the throughput gain.
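As a sanity check on any candidate pairing, the fitted model is a one-liner. A minimal sketch using the paper's reported coefficients; the example perplexities are illustrative inputs, not measured values:

```python
# Predicted acceptance rate from SDSL's fitted linear model.
# Coefficients are the values reported in the paper; the example
# perplexities below are illustrative, not measured.
A, B, C = -0.00670, 0.01297, 0.64208

def predicted_acceptance(draft_ppl: float, target_ppl: float) -> float:
    """alpha = A * x + B * y + C, clamped to a valid probability."""
    alpha = A * draft_ppl + B * target_ppl + C
    return max(0.0, min(1.0, alpha))

# Hypothetical perplexities for a small draft and a large target:
print(predicted_acceptance(draft_ppl=12.0, target_ppl=6.0))  # ~0.64
```

Given the 0.60 R-squared, treat the output as a sizing estimate, not a guarantee.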

Acceptance rates in practice

Measured on HellaSwag with OPT-13B as target:

| Draft Model | Parameters | Acceptance Rate |
|---|---|---|
| OPT-125M | 125M | 0.596 |
| OPT-350M | 350M | 0.628 |
| OPT-1.3B | 1.3B | 0.669 |
| OPT-2.7B | 2.7B | 0.679 |
| Qwen2.5-0.5B | 500M | 0.652 |
| Qwen2.5-1.5B | 1.5B | 0.668 |
| Qwen2.5-3B | 3B | 0.673 |

Going from 125M to 2.7B — a 22x increase in draft size — buys you 0.083 in acceptance rate. Going from 125M to 350M — a 2.8x increase — buys you 0.032. The marginal return on draft model size drops off fast.

The throughput formula

Raw acceptance rate is not the objective. Tokens per FLOP is. The throughput formula uses the Lambert W function to find the optimal speculation length for a given draft-target pair:

\[\mathcal{T} = \frac{-\log(\alpha)}{2N(-1+\alpha)W(-\alpha^{(M/N-1)}/e)}\]

Where N is draft model size, M is target model size, and W is the Lambert W function. The optimal speculation length (lookahead) is:

\[\gamma_{opt} = \frac{-M\log(\alpha) + N \cdot W(-\alpha^{(M/N-1)}/e) + N}{N \cdot \log(\alpha)}\]
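The closed form is opaque, but the quantity it optimizes is easy to search numerically. The sketch below brute-forces the speculation length under the standard expected-tokens-per-cycle model with a FLOP-proportional cost; this is an approximation of the paper's objective, not its exact derivation:

```python
# Numerically search for the throughput-optimal speculation length gamma.
# This brute-forces the quantity the Lambert W closed form optimizes,
# under the standard speculative-decoding model (an approximation, not
# the paper's exact formula):
#   expected tokens per cycle = (1 - alpha**(gamma + 1)) / (1 - alpha)
#   FLOP cost per cycle       ~ gamma * N + M
def tokens_per_flop(alpha: float, n: float, m: float, gamma: int) -> float:
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return expected_tokens / (gamma * n + m)

def optimal_gamma(alpha: float, n: float, m: float, max_gamma: int = 64) -> int:
    return max(range(1, max_gamma + 1),
               key=lambda g: tokens_per_flop(alpha, n, m, g))

# A 189M draft for a 70B target at alpha = 0.60 (sizes in billions of params):
print(optimal_gamma(0.60, n=0.189, m=70))
```

With these inputs the search lands around gamma = 9, noticeably higher than the gamma = 5 default many configs ship with.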

The prediction that matters: optimal draft size

All of the above collapses into a linear relationship between target model size and optimal draft model size:

\[N^*(M) = \mu M + M_0\]

Where the slope is approximately 0.0027 and the offset is approximately 450M parameters. In plain terms: the throughput-optimal draft is roughly 200x smaller than the target, plus a fixed overhead around 450M.

| Target Model | Target Size | SDSL Optimal Draft | What Teams Actually Use |
|---|---|---|---|
| OPT-13B | 13B | ~35M | OPT-1.3B (37x too large) |
| OPT-30B | 30B | ~81M | OPT-2.7B (33x too large) |
| OPT-66B | 66B | ~178M | OPT-6.7B (38x too large) |
| Llama-3-70B | 70B | ~189M | Llama-3.2-1B (5x too large) |
| Qwen1.5-110B | 110B | ~298M | Qwen2.5-1.5B (5x too large) |
| Llama-3.1-405B | 405B | ~1.5B | Llama-3.2-8B (5x too large) |

That last column is the practical gap. Nobody deploys a 189M draft for a 70B target because the model does not exist off the shelf. Teams grab the nearest available checkpoint. SDSL suggests this convenience costs real throughput.
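For back-of-envelope sizing, the linear rule is trivial to apply. The sketch below uses the reported slope, which reproduces most of the table's entries; the ~450M offset mainly matters at the largest scales, so it is exposed as an optional term:

```python
# Throughput-optimal draft size from SDSL's linear rule N*(M) = mu * M + M0.
# mu is the slope reported in the paper; the slope term alone reproduces
# the table's mid-size entries (e.g. 70B -> ~189M). Pass the reported
# ~450M offset for a more conservative size at the largest scales.
MU = 0.0027

def optimal_draft_params(target_params: float, offset: float = 0.0) -> float:
    return MU * target_params + offset

for target in (13e9, 70e9, 405e9):
    draft = optimal_draft_params(target)
    print(f"{target / 1e9:.0f}B target -> {draft / 1e9:.3f}B draft")
```

The ~200x figure quoted in the text is just 1/mu rounded.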

Why smaller drafts win on throughput despite lower acceptance

The intuition trap is focusing on acceptance rate. A 1B draft accepting 67% of tokens sounds better than a 189M draft accepting 60%. But the 189M draft runs roughly 5x faster per forward pass, which means it proposes tokens 5x more frequently. The net token throughput — tokens verified per unit time — favors the faster draft even though it wastes more proposals.

This is exactly the dynamic behind Chinchilla scaling laws for training: smaller models trained on more data beat larger models trained on less data, because the compute budget is fixed. Here, the compute budget per verification cycle is fixed by the target model’s forward pass time. The question is how many draft tokens you can generate in that window, not how many will be accepted.

```mermaid
graph LR
    subgraph "Standard Practice"
        A[70B Target] --> B[1B Draft]
        B --> C["α = 0.67"]
        C --> D["~2.3x speedup"]
    end
    subgraph "SDSL Optimal"
        E[70B Target] --> F[189M Draft]
        F --> G["α = 0.60"]
        G --> H["~3.1x speedup<br/>(more proposals/sec)"]
    end
    subgraph "SSD Recursive"
        I[70B Target] --> J[1B Draft]
        J --> K[Pre-speculate<br/>outcomes]
        K --> L["5x speedup<br/>(zero draft latency)"]
    end
```

Speculative Speculative Decoding: the recursive extension

If SDSL tells you the optimal draft model is tiny, SSD (Kumar, Dao, May — ICLR 2026) asks: what if you could eliminate the draft model’s latency entirely?

Standard speculative decoding is sequential: draft proposes, target verifies, draft proposes again. The draft model sits idle during verification, and the target sits idle during drafting. Post 0067 covered how Saguaro fills the draft model’s idle time. SSD goes further — it fills the draft’s idle time with pre-speculation of the next round’s tokens.

How SSD works

During the verification step, the draft model does not wait. Instead, it predicts what the verification outcome will be (how many tokens accepted, what the bonus token is) and pre-generates candidate continuations for each likely outcome. When verification finishes, if the actual outcome matches one of the pre-computed branches, the next draft is already done. Latency drops to near zero.

The challenge is combinatorial: at speculation length K with vocabulary V, there are (K + 1) × V possible verification outcomes. SSD addresses this with a geometric fan-out strategy that allocates prediction budget following a power law:

\[F_k = F_0 \cdot \alpha_p^{k/(1+r)}\]

Where alpha_p is the primary speculator’s acceptance rate and r is a power-law cache hit rate exponent. Empirically, this achieves roughly 90% accuracy on predicting bonus tokens.
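As an illustration of the allocation (not the paper's implementation), the power-law budget takes a few lines; the F0 and r values below are made-up inputs:

```python
# Geometric fan-out budget for SSD-style pre-speculation, following the
# power-law allocation F_k = F_0 * alpha_p ** (k / (1 + r)).
# F0 (branches at depth 0) and r (cache hit rate exponent) are
# illustrative choices here, not values from the paper.
def fanout_budget(f0: float, alpha_p: float, r: float, depth: int) -> list:
    """Number of candidate branches to pre-compute at each depth k."""
    return [max(1, round(f0 * alpha_p ** (k / (1 + r)))) for k in range(depth)]

print(fanout_budget(f0=8, alpha_p=0.67, r=0.9, depth=5))
```

The budget decays with depth, which matches the intuition: deeper speculation positions are less likely to be reached, so they earn fewer pre-computed branches.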

SSD performance numbers

On Llama-3.1-70B with a Llama-3.2-1B draft on 4x H100 + 1x H100 (dedicated draft):

| Metric | Standard SD | SSD | Improvement |
|---|---|---|---|
| vs. autoregressive | 3.5x | 5.0x | +43% |
| vs. standard SD baseline | 1.0x (baseline) | 1.3x | +30% |
| Cache hit rate | N/A | ~90% | |

The 30% improvement over standard speculative decoding comes entirely from eliminating the sequential dependency between draft and verify phases. No new model training required. No architecture changes to either model.

The SDSL-SSD connection

These two papers are complementary. SDSL tells you how small your draft model should be for optimal throughput. SSD tells you how to squeeze more from whatever draft model you choose by pre-speculating during verification. A deployment using both insights — a 189M draft for a 70B target, running SSD’s pre-speculation — would combine the throughput-optimal sizing with latency-optimal scheduling.

Neither paper tests this combination explicitly, but the math is compositional. SDSL’s throughput formula assumes sequential draft-verify cycles. SSD’s speedup factor multiplies the tokens-per-cycle by eliminating inter-cycle latency. The predicted combined speedup for 70B would be in the 4-6x range at batch size 1.
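Under the untested assumption that the two gains compose multiplicatively, the estimate is simple arithmetic:

```python
# Back-of-envelope composition of the two gains, assuming they multiply.
# Neither paper tests the combination; both inputs come from the text above.
sdsl_speedup = 3.1   # SDSL-optimal 189M draft vs autoregressive, 70B target
ssd_factor = 1.3     # SSD's gain over standard speculative decoding
print(f"predicted combined speedup: {sdsl_speedup * ssd_factor:.1f}x")
```

That lands at the low end of the 4-6x range; the high end assumes SSD's pre-speculation benefits more from the smaller draft's faster forward pass.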

What this means for 405B deployments

The 405B case is where SDSL’s predictions become most useful in practice. Current production deployments of Llama-3.1-405B (including those using quantized approaches) typically pair with an 8B draft model. SDSL predicts the optimal draft size at roughly 1.5B — which happens to be an available checkpoint in the Qwen2.5 family, and close to Llama-3.2-1B.

Real-world benchmarks on AMD MI300X partially validate this. The reported speedups for 405B:

| Draft Model | Speedup |
|---|---|
| 8B draft | 1.8x |
| 3B draft | 1.5x |
| 1B draft | 1.2x |

These numbers appear to contradict SDSL — the 8B draft wins. But these benchmarks used off-the-shelf models without optimizing speculation length per pairing. SDSL’s prediction includes co-optimizing gamma (lookahead length) with draft size, using the Lambert W function to find the sweet spot. The paper’s latency measurements on OPT-13B show the theoretical prediction matches wall-clock reality when speculation length is also optimized.

The gap between theory and current benchmarks is an implementation gap, not a theoretical flaw. Frameworks like vLLM and SGLang set speculation length as a static config parameter rather than computing it from the draft-target perplexity relationship. Closing that gap means integrating SDSL’s formulas into the serving framework’s auto-tuning.

Practical implications for inference engineers

1. Train purpose-built draft models

Stop grabbing the smallest checkpoint in your model family. SDSL gives you a formula for the right size. For a 70B target, train a 200M draft model on the same data distribution. The acceptance rate will be lower than a 1B draft, but throughput will be higher because the draft runs faster.

2. Co-optimize speculation length with draft size

Current frameworks let you set gamma (speculation length) as a constant. SDSL shows the optimal gamma depends on both models’ perplexities and sizes through a non-trivial Lambert W function relationship. The right gamma for a 200M draft is different from the right gamma for a 1B draft. Hard-coding gamma=5 everywhere leaves throughput on the table.

3. Consider the SSD architecture for latency-sensitive paths

If your deployment already dedicates separate hardware for draft and target models (as recommended in the Saguaro analysis), SSD’s pre-speculation adds 30% throughput at zero additional hardware cost. The cache hit rate of 90% means the draft model’s idle time during verification is productively used in 9 out of 10 rounds.

4. Self-speculation changes the calculus

QuantSpec and other self-speculative methods use the model’s own quantized layers as a draft. SDSL’s framework does not directly apply to self-speculation because the “draft model” is not a separate model with independent perplexity. But the directional insight holds: faster draft computation matters more than higher acceptance rate, within bounds.

5. Watch for framework integration

SDSL’s code is available. The natural next step is auto-tuning in serving frameworks: given a draft and target model, compute the optimal speculation length automatically. vLLM and SGLang both have the plugin architecture to support this. Until it ships, you can compute gamma_opt yourself using the formula and your models’ perplexities.
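Until that integration lands, a self-contained auto-tuner is a few dozen lines. This sketch chains the paper's fitted acceptance model into a brute-force search over speculation length (standing in for the Lambert W closed form); the perplexity inputs are placeholders you would measure on your own traffic:

```python
# Sketch of SDSL-style auto-tuning: measured perplexities in, gamma out.
# The linear coefficients are the paper's fitted values; the brute-force
# search is an approximation of the Lambert W closed form. The example
# perplexities are placeholders -- measure your own on a held-out sample.
A, B, C = -0.00670, 0.01297, 0.64208

def auto_tune_gamma(draft_ppl: float, target_ppl: float,
                    draft_params: float, target_params: float) -> int:
    # Step 1: predict acceptance rate from the two perplexities.
    alpha = min(1.0, max(0.0, A * draft_ppl + B * target_ppl + C))

    # Step 2: search for the gamma maximizing tokens per FLOP, modeling
    # one cycle as gamma draft forward passes plus one target pass.
    def tokens_per_flop(g: int) -> float:
        expected = (1 - alpha ** (g + 1)) / (1 - alpha)
        return expected / (g * draft_params + target_params)

    return max(range(1, 65), key=tokens_per_flop)

# Placeholder perplexities for a 189M draft paired with a 70B target:
print(auto_tune_gamma(draft_ppl=12.0, target_ppl=6.0,
                      draft_params=0.189e9, target_params=70e9))
```

The same function could back a serving-framework plugin: recompute gamma whenever the draft model or traffic distribution changes, instead of shipping a static config value.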

Limitations and open questions

SDSL has real limitations worth understanding before you build on it:

  • R-squared of 0.60. The linear acceptance rate model explains 60% of variance. The remaining 40% includes architecture-specific effects, training recipe differences, and domain-specific token distributions. Good enough for sizing decisions, not precise enough for SLA predictions.

  • Autoregressive text-only models. The framework excludes mixture-of-experts (which have different compute scaling), encoder-decoder architectures, and multimodal models. Given that MoE is increasingly common at scale, this is a significant gap.

  • Single GPU validation. SDSL’s empirical validation uses a single A100 GPU. Real deployments use tensor parallelism across 4-8 GPUs, which changes the draft-to-target latency ratio. The directional prediction (smaller drafts win) likely holds, but the exact optimal size may shift.

  • No batching analysis. Speculative decoding’s benefits degrade at high batch sizes because the target model becomes compute-bound rather than memory-bandwidth-bound. SDSL’s framework does not model this transition, which is exactly the regime where Nightjar’s adaptive approach matters.

  • SSD’s hardware requirement. SSD’s gains assume a dedicated GPU for the draft model. In shared-resource deployments, the pre-speculation compute may contend with other workloads.

The bigger picture: inference is becoming an engineering discipline

What matters about SDSL is not the specific prediction that draft models should be 200x smaller. That number will be refined by future work, and it may not generalize cleanly to MoE or multimodal architectures. What matters is that speculative decoding now has a predictive theory at all.

Before SDSL, speculative decoding configuration was empirical. You benchmarked. You tuned. You hoped the result generalized across prompts and traffic patterns. After SDSL, you can predict. The prediction is imperfect (R-squared 0.60), but it is a prediction — the same kind that Chinchilla gave training: imperfect, directionally correct, and enormously useful for planning.

SSD extends this from “what model to pair” to “how to schedule the pair.” Together, they represent the maturation of speculative decoding from a clever trick into a characterizable system with known trade-offs, predictable behavior, and analytically optimal configurations.

For inference engineers, the takeaway is concrete: your draft model is probably too big, your speculation length is probably not co-optimized, and 30% more throughput is available from pre-speculation at zero hardware cost. The formulas exist. The code is open. The remaining gap is framework integration.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch