SDSL: speculative decoding finally has scaling laws, and they predict your draft model is too big

TL;DR — Speculative decoding has been a tuning problem: pick a draft model, measure acceptance rate, iterate. SDSL (arXiv 2603.11053) turns it into a systems problem with predictive scaling laws. The headline finding: the throughput-optimal draft model is roughly 200x smaller than the target. For Llama-3-70B, that is a 189M parameter model, not the 1B-8B range most practitioners reach for. Stanford’s Speculative Speculative Decoding (arXiv 2603.03251, ICLR 2026) applies the same insight recursively, pre-speculating verification outcomes to eliminate draft latency entirely and hitting 5x over autoregressive on Llama-70B. Together, these papers shift speculative decoding from craft to engineering.
The guessing game that runs your inference stack
Every production team deploying speculative decoding makes the same bet: which draft model produces the best throughput? The decision process is remarkably informal for infrastructure that handles millions of tokens per day. You try the smallest model in your family. You benchmark a few acceptance rates. You pick the one that feels right.
This is how most teams landed on 1B-8B draft models for 70B targets. The logic is intuitive — a bigger draft model accepts more tokens, so you verify fewer times. And that logic is wrong, or at least incomplete, because it optimizes for acceptance rate when the actual objective is tokens per FLOP.
SDSL makes the case that speculative decoding obeys the same kind of empirical scaling laws that govern model training. The acceptance rate between any draft-target pair is predictable from their perplexities. The optimal draft size is predictable from the target size. And the prediction says most teams are burning compute on draft models 5-40x larger than they need.
For context on the speculative decoding work this builds on, see Saguaro, Nightjar, and universal draft models and 14 inference optimization techniques.
What SDSL actually says
Bozorgkhoo and Molybog start from a premise borrowed directly from Chinchilla: if you can predict a model’s perplexity from its parameter count and training data, and you can predict acceptance rate from the perplexities of two models, then you can predict the optimal draft-target pairing without running a single inference benchmark.
The acceptance rate formula
The expected acceptance rate between a draft model (perplexity x) and target model (perplexity y) follows a linear model:
\[\alpha = Ax + By + C\]
Fitted across OPT, Qwen 1.5/2.5, and Llama 3 families:
| Coefficient | Value | Interpretation |
|---|---|---|
| A (draft perplexity) | -0.00670 | Lower draft perplexity raises acceptance |
| B (target perplexity) | +0.01297 | Lower target perplexity lowers acceptance |
| C (intercept) | 0.64208 | Baseline acceptance rate |
| R-squared | 0.6023 | Explains 60% of variance — enough to size hardware, not enough for SLA math |
The asymmetry is what matters here. Draft perplexity is the lever that moves acceptance rate in practice: although B’s magnitude is larger than A’s, the targets in these families cluster at low perplexity while draft perplexities span a much wider range across model sizes, so the draft term dominates the variation. This means that improving your draft model matters far more than improving your target model for speculative decoding throughput — but only up to the point where the draft model’s inference cost starts eating the throughput gain.
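Plugging the fitted coefficients into the linear model makes this concrete. A minimal sketch; the perplexity values in the example are illustrative assumptions, not measurements from the paper:

```python
# Fitted coefficients from SDSL's linear acceptance-rate model:
# alpha = A * draft_ppl + B * target_ppl + C
A, B, C = -0.00670, 0.01297, 0.64208

def predicted_acceptance(draft_ppl: float, target_ppl: float) -> float:
    """Predict expected acceptance rate from the two models' perplexities."""
    return A * draft_ppl + B * target_ppl + C

# Illustrative (assumed) perplexities: a small draft at ppl 20,
# a strong target at ppl 8.
print(round(predicted_acceptance(20.0, 8.0), 3))
```

A lower draft perplexity raises the prediction, matching the sign of A in the table.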
Acceptance rates in practice
Measured on HellaSwag with OPT-13B as target:
| Draft Model | Parameters | Acceptance Rate |
|---|---|---|
| OPT-125M | 125M | 0.596 |
| OPT-350M | 350M | 0.628 |
| OPT-1.3B | 1.3B | 0.669 |
| OPT-2.7B | 2.7B | 0.679 |
| Qwen2.5-0.5B | 500M | 0.652 |
| Qwen2.5-1.5B | 1.5B | 0.668 |
| Qwen2.5-3B | 3B | 0.673 |
Going from 125M to 2.7B — a 22x increase in draft size — buys you 0.083 in acceptance rate. Going from 125M to 350M — a 2.8x increase — buys you 0.032. The marginal return on draft model size drops off fast.
The throughput formula
Raw acceptance rate is not the objective. Tokens per FLOP is. The throughput formula uses the Lambert W function to find the optimal speculation length for a given draft-target pair:
\[\mathcal{T} = \frac{-\log(\alpha)}{2N(-1+\alpha)W(-\alpha^{(M/N-1)}/e)}\]
Where N is draft model size, M is target model size, and W is the Lambert W function. The optimal speculation length (lookahead) is:
\[\gamma_{opt} = \frac{-M\log(\alpha) + N \cdot W(-\alpha^{(M/N-1)}/e) + N}{N \cdot \log(\alpha)}\]
The prediction that matters: optimal draft size
All of the above collapses into a linear relationship between target model size and optimal draft model size:
\[N^*(M) = \mu M + M_0\]
Where the slope \(\mu\) is approximately 0.0027 and the offset \(M_0\) is approximately 450M parameters. In plain terms: the throughput-optimal draft is roughly 200x smaller than the target, plus a fixed overhead around 450M.
| Target Model | Target Size | SDSL Optimal Draft | What Teams Actually Use |
|---|---|---|---|
| OPT-13B | 13B | ~35M | OPT-1.3B (37x too large) |
| OPT-30B | 30B | ~81M | OPT-2.7B (33x too large) |
| OPT-66B | 66B | ~178M | OPT-6.7B (38x too large) |
| Llama-3-70B | 70B | ~189M | Llama-3.2-1B (5x too large) |
| Qwen1.5-110B | 110B | ~298M | Qwen2.5-1.5B (5x too large) |
| Llama-3.1-405B | 405B | ~1.5B | Llama-3.1-8B (5x too large) |
That last column is the practical gap. Nobody deploys a 189M draft for a 70B target because the model does not exist off the shelf. Teams grab the nearest available checkpoint. SDSL suggests this convenience costs real throughput.
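The middle column of the table can be reproduced from the fitted slope alone. A sketch; for these target sizes the \(\mu M\) term dominates, so the offset is omitted here as a simplifying assumption:

```python
# SDSL's fitted slope: optimal draft size is ~0.27% of target size.
MU = 0.0027

def optimal_draft_params(target_params: float) -> float:
    """Throughput-optimal draft size from the target size (slope term only)."""
    return MU * target_params

# Reproduce a few rows of the table above.
for name, m in [("OPT-13B", 13e9), ("Llama-3-70B", 70e9), ("Qwen1.5-110B", 110e9)]:
    print(f"{name}: ~{optimal_draft_params(m) / 1e6:.0f}M draft")
```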
Why smaller drafts win on throughput despite lower acceptance
The intuition trap is focusing on acceptance rate. A 1B draft accepting 67% of tokens sounds better than a 189M draft accepting 60%. But the 189M draft runs roughly 5x faster per forward pass, which means it proposes tokens 5x more frequently. The net token throughput — tokens verified per unit time — favors the faster draft even though it wastes more proposals.
This is exactly the dynamic behind Chinchilla scaling laws for training: smaller models trained on more data beat larger models trained on less data, because the compute budget is fixed. Here, the compute budget per verification cycle is fixed by the target model’s forward pass time. The question is how many draft tokens you can generate in that window, not how many will be accepted.
```mermaid
graph LR
subgraph "Standard Practice"
A[70B Target] --> B[1B Draft]
B --> C["α = 0.67"]
C --> D["~2.3x speedup"]
end
subgraph "SDSL Optimal"
E[70B Target] --> F[189M Draft]
F --> G["α = 0.60"]
G --> H["~3.1x speedup<br/>(more proposals/sec)"]
end
subgraph "SSD Recursive"
I[70B Target] --> J[1B Draft]
J --> K[Pre-speculate<br/>outcomes]
K --> L["5x speedup<br/>(zero draft latency)"]
end
```
Speculative Speculative Decoding: the recursive extension
If SDSL tells you the optimal draft model is tiny, SSD (Kumar, Dao, May — ICLR 2026) asks: what if you could eliminate the draft model’s latency entirely?
Standard speculative decoding is sequential: draft proposes, target verifies, draft proposes again. The draft model sits idle during verification, and the target sits idle during drafting. Post 0067 covered how Saguaro fills the draft model’s idle time. SSD goes further — it fills the draft’s idle time with pre-speculation of the next round’s tokens.
How SSD works
During the verification step, the draft model does not wait. Instead, it predicts what the verification outcome will be (how many tokens accepted, what the bonus token is) and pre-generates candidate continuations for each likely outcome. When verification finishes, if the actual outcome matches one of the pre-computed branches, the next draft is already done. Latency drops to near zero.
The challenge is combinatorial: at speculation length K with vocabulary V, there are (K+1) x V possible verification outcomes. SSD addresses this with a geometric fan-out strategy that allocates prediction budget following a power law:
\[F_k = F_0 \cdot \alpha_p^{k/(1+r)}\]
Where \(\alpha_p\) is the primary speculator’s acceptance rate and r is a power-law cache hit rate exponent. Empirically, this achieves roughly 90% accuracy on predicting bonus tokens.
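The fan-out schedule itself is easy to sketch. The values of F0 and r below are hypothetical, chosen only to illustrate the shape of the decay; the paper’s fitted values are not reproduced here:

```python
def fanout_budget(f0: float, alpha_p: float, r: float, depth: int) -> list[float]:
    """Per-position prediction budget F_k = F0 * alpha_p^(k / (1 + r)).

    Budget decays geometrically with draft position k: later positions
    are less likely to be reached, so they get fewer pre-speculated
    branches.
    """
    return [f0 * alpha_p ** (k / (1 + r)) for k in range(depth)]

# Hypothetical settings: budget 8 at position 0, acceptance 0.6, exponent 0.5.
budgets = fanout_budget(8.0, 0.6, 0.5, 5)
print([round(b, 2) for b in budgets])
```

Most of the prediction budget lands on the earliest positions, which is what keeps the combinatorics of (K+1) x V possible outcomes tractable.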
SSD performance numbers
On Llama-3.1-70B with a Llama-3.2-1B draft on 4x H100 + 1x H100 (dedicated draft):
| Metric | Standard SD | SSD | Improvement |
|---|---|---|---|
| vs. autoregressive | 3.5x | 5.0x | +43% |
| vs. standard SD | baseline | 1.3x | +30% |
| Cache hit rate | N/A | ~90% | — |
The 30% improvement over standard speculative decoding comes entirely from eliminating the sequential dependency between draft and verify phases. No new model training required. No architecture changes to either model.
The SDSL-SSD connection
These two papers are complementary. SDSL tells you how small your draft model should be for optimal throughput. SSD tells you how to squeeze more from whatever draft model you choose by pre-speculating during verification. A deployment using both insights — a 189M draft for a 70B target, running SSD’s pre-speculation — would combine the throughput-optimal sizing with latency-optimal scheduling.
Neither paper tests this combination explicitly, but the math is compositional. SDSL’s throughput formula assumes sequential draft-verify cycles. SSD’s speedup factor multiplies the tokens-per-cycle by eliminating inter-cycle latency. The predicted combined speedup for 70B would be in the 4-6x range at batch size 1.
What this means for 405B deployments
The 405B case is where SDSL’s predictions become most useful in practice. Current production deployments of Llama-3.1-405B (including those using quantized approaches) typically pair with an 8B draft model. SDSL predicts the optimal draft size at roughly 1.5B — which happens to be an available checkpoint in the Qwen2.5 family, and close to Llama-3.2-1B.
Real-world benchmarks on AMD MI300X partially validate this. The reported speedups for 405B:
| Draft Model | Speedup |
|---|---|
| 8B draft | 1.8x |
| 3B draft | 1.5x |
| 1B draft | 1.2x |
These numbers appear to contradict SDSL — the 8B draft wins. But these benchmarks used off-the-shelf models without optimizing speculation length per pairing. SDSL’s prediction includes co-optimizing gamma (lookahead length) with draft size, using the Lambert W function to find the sweet spot. The paper’s latency measurements on OPT-13B show the theoretical prediction matches wall-clock reality when speculation length is also optimized.
The gap between theory and current benchmarks is an implementation gap, not a theoretical flaw. Frameworks like vLLM and SGLang set speculation length as a static config parameter rather than computing it from the draft-target perplexity relationship. Closing that gap means integrating SDSL’s formulas into the serving framework’s auto-tuning.
Practical implications for inference engineers
1. Train purpose-built draft models
Stop grabbing the smallest checkpoint in your model family. SDSL gives you a formula for the right size. For a 70B target, train a 200M draft model on the same data distribution. The acceptance rate will be lower than a 1B draft, but throughput will be higher because the draft runs faster.
2. Co-optimize speculation length with draft size
Current frameworks let you set gamma (speculation length) as a constant. SDSL shows the optimal gamma depends on both models’ perplexities and sizes through a non-trivial Lambert W function relationship. The right gamma for a 200M draft is different from the right gamma for a 1B draft. Hard-coding gamma=5 everywhere leaves throughput on the table.
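A numerical sketch of that co-optimization. Instead of evaluating the closed-form Lambert W expression, this uses the standard speculative decoding cost model (expected accepted tokens per cycle from Leviathan et al., with draft and target costs taken proportional to parameter counts, an assumption) and searches gamma directly:

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    # Expected tokens per verification cycle (Leviathan et al., 2023):
    # geometric acceptance over gamma drafts, plus one bonus token.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def optimal_gamma(alpha: float, n_draft: float, m_target: float,
                  max_gamma: int = 32) -> int:
    # Cost per cycle ~ gamma draft passes (n_draft FLOPs each)
    # plus one target verification pass (m_target FLOPs).
    return max(range(1, max_gamma + 1),
               key=lambda g: expected_tokens(alpha, g) / (g * n_draft + m_target))

# 189M draft vs 70B target at alpha = 0.60, SDSL's predicted pairing.
print(optimal_gamma(0.60, 0.189e9, 70e9))
# 1B draft vs 70B target at alpha = 0.67, the common pairing.
print(optimal_gamma(0.67, 1e9, 70e9))
```

The two pairings land on different optima, which is the point: a fixed gamma=5 is the right answer for neither.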
3. Consider the SSD architecture for latency-sensitive paths
If your deployment already dedicates separate hardware for draft and target models (as recommended in the Saguaro analysis), SSD’s pre-speculation adds 30% throughput at zero additional hardware cost. The cache hit rate of 90% means the draft model’s idle time during verification is productively used in 9 out of 10 rounds.
4. Self-speculation changes the calculus
QuantSpec and other self-speculative methods use the model’s own quantized layers as a draft. SDSL’s framework does not directly apply to self-speculation because the “draft model” is not a separate model with independent perplexity. But the directional insight holds: faster draft computation matters more than higher acceptance rate, within bounds.
5. Watch for framework integration
SDSL’s code is available. The natural next step is auto-tuning in serving frameworks: given a draft and target model, compute the optimal speculation length automatically. vLLM and SGLang both have the plugin architecture to support this. Until it ships, you can compute gamma_opt yourself using the formula and your models’ perplexities.
Limitations and open questions
SDSL has real limitations worth understanding before you build on it:
- R-squared of 0.60. The linear acceptance rate model explains 60% of variance. The remaining 40% includes architecture-specific effects, training recipe differences, and domain-specific token distributions. Good enough for sizing decisions, not precise enough for SLA predictions.
- Autoregressive text-only models. The framework excludes mixture-of-experts models (which have different compute scaling), encoder-decoder architectures, and multimodal models. Given that MoE is increasingly common at scale, this is a significant gap.
- Single-GPU validation. SDSL’s empirical validation uses a single A100 GPU. Real deployments use tensor parallelism across 4-8 GPUs, which changes the draft-to-target latency ratio. The directional prediction (smaller drafts win) likely holds, but the exact optimal size may shift.
- No batching analysis. Speculative decoding’s benefits degrade at high batch sizes because the target model becomes compute-bound rather than memory-bandwidth-bound. SDSL’s framework does not model this transition, which is exactly the regime where Nightjar’s adaptive approach matters.
- SSD’s hardware requirement. SSD’s gains assume a dedicated GPU for the draft model. In shared-resource deployments, the pre-speculation compute may contend with other workloads.
The bigger picture: inference is becoming an engineering discipline
What matters about SDSL is not the specific prediction that draft models should be 200x smaller. That number will be refined by future work, and it may not generalize cleanly to MoE or multimodal architectures. What matters is that speculative decoding now has a predictive theory at all.
Before SDSL, speculative decoding configuration was empirical. You benchmarked. You tuned. You hoped the result generalized across prompts and traffic patterns. After SDSL, you can predict. The prediction is imperfect (R-squared 0.60), but it is a prediction — the same kind that Chinchilla gave training: imperfect, directionally correct, and enormously useful for planning.
SSD extends this from “what model to pair” to “how to schedule the pair.” Together, they represent the maturation of speculative decoding from a clever trick into a characterizable system with known trade-offs, predictable behavior, and analytically optimal configurations.
For inference engineers, the takeaway is concrete: your draft model is probably too big, your speculation length is probably not co-optimized, and 30% more throughput is available from pre-speculation at zero hardware cost. The formulas exist. The code is open. The remaining gap is framework integration.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch