SDSL: speculative decoding finally has scaling laws, and they predict your draft model is too big

TL;DR — Speculative decoding has been a tuning problem: pick a draft model, measure acceptance rate, iterate. SDSL (arXiv 2603.11053) turns it into a systems problem with predictive scaling laws. The headline finding: the throughput-optimal draft model is roughly 200x smaller than the target. For Llama-3-70B, that is a 189M parameter model, not the 1B-8B range most practitioners reach for. Stanford’s Speculative Speculative Decoding (arXiv 2603.03251, ICLR 2026) applies the same insight recursively, pre-speculating verification outcomes to eliminate draft latency entirely and hitting 5x over autoregressive on Llama-70B. Together, these papers shift speculative decoding from craft to engineering.
The guessing game that runs your inference stack
Every production team deploying speculative decoding makes the same bet: which draft model produces the best throughput? The decision process is remarkably informal for infrastructure that handles millions of tokens per day. You try the smallest model in your family. You benchmark a few acceptance rates. You pick the one that feels right.
This is how most teams landed on 1B-8B draft models for 70B targets. The logic is intuitive — a bigger draft model accepts more tokens, so you verify fewer times. And that logic is wrong, or at least incomplete, because it optimizes for acceptance rate when the actual objective is tokens per FLOP.
SDSL makes the case that speculative decoding obeys the same kind of empirical scaling laws that govern model training. The acceptance rate between any draft-target pair is predictable from their perplexities. The optimal draft size is predictable from the target size. And the prediction says most teams are burning compute on draft models 5-40x larger than they need.
For context on the speculative decoding work this builds on, see Saguaro, Nightjar, and universal draft models and 14 inference optimization techniques.
What SDSL actually says
Bozorgkhoo and Molybog start from a premise borrowed directly from Chinchilla: if you can predict a model’s perplexity from its parameter count and training data, and you can predict acceptance rate from the perplexities of two models, then you can predict the optimal draft-target pairing without running a single inference benchmark.
The acceptance rate formula
The expected acceptance rate between a draft model (perplexity x) and target model (perplexity y) follows a linear model:
\[\alpha = Ax + By + C\]
Fitted across OPT, Qwen 1.5/2.5, and Llama 3 families:
| Coefficient | Value | Interpretation |
|---|---|---|
| A (draft perplexity) | -0.00670 | Lower draft perplexity raises acceptance |
| B (target perplexity) | +0.01297 | Lower target perplexity lowers acceptance |
| C (intercept) | 0.64208 | Baseline acceptance rate |
| R-squared | 0.6023 | Explains 60% of variance — enough to size hardware, not enough for SLA math |
The asymmetry is what matters here. Draft perplexity is the lever that moves acceptance rate in practice: although B’s magnitude is larger than A’s, the targets in these families cluster at low perplexity while draft perplexities span a much wider range across model sizes, so the draft term dominates the variation. This means that improving your draft model matters far more than improving your target model for speculative decoding throughput — but only up to the point where the draft model’s inference cost starts eating the throughput gain.
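Plugging the fitted coefficients into the linear model makes this concrete. A minimal sketch; the perplexity values in the example are illustrative assumptions, not measurements from the paper:

```python
# Fitted coefficients from SDSL's linear acceptance-rate model:
# alpha = A * draft_ppl + B * target_ppl + C
A, B, C = -0.00670, 0.01297, 0.64208

def predicted_acceptance(draft_ppl: float, target_ppl: float) -> float:
    """Predict expected acceptance rate from the two models' perplexities."""
    return A * draft_ppl + B * target_ppl + C

# Illustrative (assumed) perplexities: a small draft at ppl 20,
# a strong target at ppl 8.
print(round(predicted_acceptance(20.0, 8.0), 3))
```

A lower draft perplexity raises the prediction, matching the sign of A in the table.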
Acceptance rates in practice
Measured on HellaSwag with OPT-13B as target:
| Draft Model | Parameters | Acceptance Rate |
|---|---|---|
| OPT-125M | 125M | 0.596 |
| OPT-350M | 350M | 0.628 |
| OPT-1.3B | 1.3B | 0.669 |
| OPT-2.7B | 2.7B | 0.679 |
| Qwen2.5-0.5B | 500M | 0.652 |
| Qwen2.5-1.5B | 1.5B | 0.668 |
| Qwen2.5-3B | 3B | 0.673 |
Going from 125M to 2.7B — a 22x increase in draft size — buys you 0.083 in acceptance rate. Going from 125M to 350M — a 2.8x increase — buys you 0.032. The marginal return on draft model size drops off fast.
The throughput formula
Raw acceptance rate is not the objective. Tokens per FLOP is. The throughput formula uses the Lambert W function to find the optimal speculation length for a given draft-target pair:
\[\mathcal{T} = \frac{-\log(\alpha)}{2N(-1+\alpha)W(-\alpha^{(M/N-1)}/e)}\]
Where N is draft model size, M is target model size, and W is the Lambert W function. The optimal speculation length (lookahead) is:
\[\gamma_{opt} = \frac{-M\log(\alpha) + N \cdot W(-\alpha^{(M/N-1)}/e) + N}{N \cdot \log(\alpha)}\]
The prediction that matters: optimal draft size
All of the above collapses into a linear relationship between target model size and optimal draft model size:
\[N^*(M) = \mu M + M_0\]
Where the slope \(\mu\) is approximately 0.0027 and the offset \(M_0\) is approximately 450M parameters. In plain terms: the throughput-optimal draft is roughly 200x smaller than the target, plus a fixed overhead around 450M.
| Target Model | Target Size | SDSL Optimal Draft | What Teams Actually Use |
|---|---|---|---|
| OPT-13B | 13B | ~35M | OPT-1.3B (37x too large) |
| OPT-30B | 30B | ~81M | OPT-2.7B (33x too large) |
| OPT-66B | 66B | ~178M | OPT-6.7B (38x too large) |
| Llama-3-70B | 70B | ~189M | Llama-3.2-1B (5x too large) |
| Qwen1.5-110B | 110B | ~298M | Qwen2.5-1.5B (5x too large) |
| Llama-3.1-405B | 405B | ~1.5B | Llama-3.1-8B (5x too large) |
That last column is the practical gap. Nobody deploys a 189M draft for a 70B target because the model does not exist off the shelf. Teams grab the nearest available checkpoint. SDSL suggests this convenience costs real throughput.
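The middle column of the table can be reproduced from the fitted slope alone. A sketch; for these target sizes the \(\mu M\) term dominates, so the offset is omitted here as a simplifying assumption:

```python
# SDSL's fitted slope: optimal draft size is ~0.27% of target size.
MU = 0.0027

def optimal_draft_params(target_params: float) -> float:
    """Throughput-optimal draft size from the target size (slope term only)."""
    return MU * target_params

# Reproduce a few rows of the table above.
for name, m in [("OPT-13B", 13e9), ("Llama-3-70B", 70e9), ("Qwen1.5-110B", 110e9)]:
    print(f"{name}: ~{optimal_draft_params(m) / 1e6:.0f}M draft")
```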
Why smaller drafts win on throughput despite lower acceptance
The intuition trap is focusing on acceptance rate. A 1B draft accepting 67% of tokens sounds better than a 189M draft accepting 60%. But the 189M draft runs roughly 5x faster per forward pass, which means it proposes tokens 5x more frequently. The net token throughput — tokens verified per unit time — favors the faster draft even though it wastes more proposals.
This is exactly the dynamic behind Chinchilla scaling laws for training: smaller models trained on more data beat larger models trained on less data, because the compute budget is fixed. Here, the compute budget per verification cycle is fixed by the target model’s forward pass time. The question is how many draft tokens you can generate in that window, not how many will be accepted.
```mermaid
graph LR
subgraph "Standard Practice"
A[70B Target] --> B[1B Draft]
B --> C["α = 0.67"]
C --> D["~2.3x speedup"]
end
subgraph "SDSL Optimal"
E[70B Target] --> F[189M Draft]
F --> G["α = 0.60"]
G --> H["~3.1x speedup<br/>(more proposals/sec)"]
end
subgraph "SSD Recursive"
I[70B Target] --> J[1B Draft]
J --> K[Pre-speculate<br/>outcomes]
K --> L["5x speedup<br/>(zero draft latency)"]
end
```
Speculative Speculative Decoding: the recursive extension
If SDSL tells you the optimal draft model is tiny, SSD (Kumar, Dao, May — ICLR 2026) asks: what if you could eliminate the draft model’s latency entirely?
Standard speculative decoding is sequential: draft proposes, target verifies, draft proposes again. The draft model sits idle during verification, and the target sits idle during drafting. Post 0067 covered how Saguaro fills the draft model’s idle time. SSD goes further — it fills the draft’s idle time with pre-speculation of the next round’s tokens.
How SSD works
During the verification step, the draft model does not wait. Instead, it predicts what the verification outcome will be (how many tokens accepted, what the bonus token is) and pre-generates candidate continuations for each likely outcome. When verification finishes, if the actual outcome matches one of the pre-computed branches, the next draft is already done. Latency drops to near zero.
The challenge is combinatorial: at speculation length K with vocabulary V, there are (K+1) x V possible verification outcomes. SSD addresses this with a geometric fan-out strategy that allocates prediction budget following a power law:
\[F_k = F_0 \cdot \alpha_p^{k/(1+r)}\]
Where \(\alpha_p\) is the primary speculator’s acceptance rate and r is a power-law cache hit rate exponent. Empirically, this achieves roughly 90% accuracy on predicting bonus tokens.
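The fan-out schedule itself is easy to sketch. The values of F0 and r below are hypothetical, chosen only to illustrate the shape of the decay; the paper’s fitted values are not reproduced here:

```python
def fanout_budget(f0: float, alpha_p: float, r: float, depth: int) -> list[float]:
    """Per-position prediction budget F_k = F0 * alpha_p^(k / (1 + r)).

    Budget decays geometrically with draft position k: later positions
    are less likely to be reached, so they get fewer pre-speculated
    branches.
    """
    return [f0 * alpha_p ** (k / (1 + r)) for k in range(depth)]

# Hypothetical settings: budget 8 at position 0, acceptance 0.6, exponent 0.5.
budgets = fanout_budget(8.0, 0.6, 0.5, 5)
print([round(b, 2) for b in budgets])
```

Most of the prediction budget lands on the earliest positions, which is what keeps the combinatorics of (K+1) x V possible outcomes tractable.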
SSD performance numbers
On Llama-3.1-70B with a Llama-3.2-1B draft on 4x H100 + 1x H100 (dedicated draft):
| Metric | Standard SD | SSD | Improvement |
|---|---|---|---|
| vs. autoregressive | 3.5x | 5.0x | +43% |
| vs. standard SD | baseline | 1.3x | +30% |
| Cache hit rate | N/A | ~90% | — |
The 30% improvement over standard speculative decoding comes entirely from eliminating the sequential dependency between draft and verify phases. No new model training required. No architecture changes to either model.
The SDSL-SSD connection
These two papers are complementary. SDSL tells you how small your draft model should be for optimal throughput. SSD tells you how to squeeze more from whatever draft model you choose by pre-speculating during verification. A deployment using both insights — a 189M draft for a 70B target, running SSD’s pre-speculation — would combine the throughput-optimal sizing with latency-optimal scheduling.
Neither paper tests this combination explicitly, but the math is compositional. SDSL’s throughput formula assumes sequential draft-verify cycles. SSD’s speedup factor multiplies the tokens-per-cycle by eliminating inter-cycle latency. The predicted combined speedup for 70B would be in the 4-6x range at batch size 1.
What this means for 405B deployments
The 405B case is where SDSL’s predictions become most useful in practice. Current production deployments of Llama-3.1-405B (including those using quantized approaches) typically pair with an 8B draft model. SDSL predicts the optimal draft size at roughly 1.5B — which happens to be an available checkpoint in the Qwen2.5 family, and close to Llama-3.2-1B.
Real-world benchmarks on AMD MI300X partially validate this. The reported speedups for 405B:
| Draft Model | Speedup |
|---|---|
| 8B draft | 1.8x |
| 3B draft | 1.5x |
| 1B draft | 1.2x |
These numbers appear to contradict SDSL — the 8B draft wins. But these benchmarks used off-the-shelf models without optimizing speculation length per pairing. SDSL’s prediction includes co-optimizing gamma (lookahead length) with draft size, using the Lambert W function to find the sweet spot. The paper’s latency measurements on OPT-13B show the theoretical prediction matches wall-clock reality when speculation length is also optimized.
The gap between theory and current benchmarks is an implementation gap, not a theoretical flaw. Frameworks like vLLM and SGLang set speculation length as a static config parameter rather than computing it from the draft-target perplexity relationship. Closing that gap means integrating SDSL’s formulas into the serving framework’s auto-tuning.
Practical implications for inference engineers
1. Train purpose-built draft models
Stop grabbing the smallest checkpoint in your model family. SDSL gives you a formula for the right size. For a 70B target, train a 200M draft model on the same data distribution. The acceptance rate will be lower than a 1B draft, but throughput will be higher because the draft runs faster.
2. Co-optimize speculation length with draft size
Current frameworks let you set gamma (speculation length) as a constant. SDSL shows the optimal gamma depends on both models’ perplexities and sizes through a non-trivial Lambert W function relationship. The right gamma for a 200M draft is different from the right gamma for a 1B draft. Hard-coding gamma=5 everywhere leaves throughput on the table.
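A numerical sketch of that co-optimization. Instead of evaluating the closed-form Lambert W expression, this uses the standard speculative decoding cost model (expected accepted tokens per cycle from Leviathan et al., with draft and target costs taken proportional to parameter counts, an assumption) and searches gamma directly:

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    # Expected tokens per verification cycle (Leviathan et al., 2023):
    # geometric acceptance over gamma drafts, plus one bonus token.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def optimal_gamma(alpha: float, n_draft: float, m_target: float,
                  max_gamma: int = 32) -> int:
    # Cost per cycle ~ gamma draft passes (n_draft FLOPs each)
    # plus one target verification pass (m_target FLOPs).
    return max(range(1, max_gamma + 1),
               key=lambda g: expected_tokens(alpha, g) / (g * n_draft + m_target))

# 189M draft vs 70B target at alpha = 0.60, SDSL's predicted pairing.
print(optimal_gamma(0.60, 0.189e9, 70e9))
# 1B draft vs 70B target at alpha = 0.67, the common pairing.
print(optimal_gamma(0.67, 1e9, 70e9))
```

The two pairings land on different optima, which is the point: a fixed gamma=5 is the right answer for neither.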
3. Consider the SSD architecture for latency-sensitive paths
If your deployment already dedicates separate hardware for draft and target models (as recommended in the Saguaro analysis), SSD’s pre-speculation adds 30% throughput at zero additional hardware cost. The cache hit rate of 90% means the draft model’s idle time during verification is productively used in 9 out of 10 rounds.
4. Self-speculation changes the calculus
QuantSpec and other self-speculative methods use the model’s own quantized layers as a draft. SDSL’s framework does not directly apply to self-speculation because the “draft model” is not a separate model with independent perplexity. But the directional insight holds: faster draft computation matters more than higher acceptance rate, within bounds.
5. Watch for framework integration
SDSL’s code is available. The natural next step is auto-tuning in serving frameworks: given a draft and target model, compute the optimal speculation length automatically. vLLM and SGLang both have the plugin architecture to support this. Until it ships, you can compute gamma_opt yourself using the formula and your models’ perplexities.
Limitations and open questions
SDSL has real limitations worth understanding before you build on it:
- R-squared of 0.60. The linear acceptance rate model explains 60% of variance. The remaining 40% includes architecture-specific effects, training recipe differences, and domain-specific token distributions. Good enough for sizing decisions, not precise enough for SLA predictions.
- Autoregressive text-only models. The framework excludes mixture-of-experts models (which have different compute scaling), encoder-decoder architectures, and multimodal models. Given that MoE is increasingly common at scale, this is a significant gap.
- Single-GPU validation. SDSL’s empirical validation uses a single A100 GPU. Real deployments use tensor parallelism across 4-8 GPUs, which changes the draft-to-target latency ratio. The directional prediction (smaller drafts win) likely holds, but the exact optimal size may shift.
- No batching analysis. Speculative decoding’s benefits degrade at high batch sizes because the target model becomes compute-bound rather than memory-bandwidth-bound. SDSL’s framework does not model this transition, which is exactly the regime where Nightjar’s adaptive approach matters.
- SSD’s hardware requirement. SSD’s gains assume a dedicated GPU for the draft model. In shared-resource deployments, the pre-speculation compute may contend with other workloads.
The bigger picture: inference is becoming an engineering discipline
What matters about SDSL is not the specific prediction that draft models should be 200x smaller. That number will be refined by future work, and it may not generalize cleanly to MoE or multimodal architectures. What matters is that speculative decoding now has a predictive theory at all.
Before SDSL, speculative decoding configuration was empirical. You benchmarked. You tuned. You hoped the result generalized across prompts and traffic patterns. After SDSL, you can predict. The prediction is imperfect (R-squared 0.60), but it is a prediction — the same kind that Chinchilla gave training: imperfect, directionally correct, and enormously useful for planning.
SSD extends this from “what model to pair” to “how to schedule the pair.” Together, they represent the maturation of speculative decoding from a clever trick into a characterizable system with known trade-offs, predictable behavior, and analytically optimal configurations.
For inference engineers, the takeaway is concrete: your draft model is probably too big, your speculation length is probably not co-optimized, and 30% more throughput is available from pre-speculation at zero hardware cost. The formulas exist. The code is open. The remaining gap is framework integration.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch