
“We’ve spent five years optimizing the GPU. The CPU was the bottleneck the entire time.”

TL;DR

Every LLM serving framework today keeps the host CPU on the critical path for scheduling, batching, and token management. Blink (arXiv:2604.07609) eliminates this dependency by moving request handling to an NVIDIA BlueField-3 DPU and replacing host-driven scheduling with persistent CUDA kernels that run batching and KV-cache management directly on the GPU. The result on H100 hardware: 8.47x P99 TTFT reduction versus SGLang, 2.1x throughput over TensorRT-LLM, and 48.6% less energy per token. Under CPU interference from co-located workloads, baselines degrade by up to 139x on TTFT while Blink stays flat. Combined with AMD PACE’s NUMA-aware CPU inference engine (1.6x over vLLM on EPYC), the inference optimization frontier is moving beyond GPUs to every component in the server. For the foundational inference techniques these build on, see 14 techniques for high-performance ML inference.


Why is the CPU a problem for LLM inference?

The standard LLM serving architecture looks straightforward: a request arrives over HTTP, the CPU parses it, tokenizes the input, submits work to the GPU, collects output tokens, and streams them back to the client. The CPU orchestrates every step.

This creates three problems that get worse at scale.

First, the CPU is shared. In production data centers, the host CPU runs the operating system, network stack, monitoring agents, logging, and often co-located non-inference workloads. Every LLC miss, every context switch, every TLB flush from a neighbor process adds latency jitter to inference. The Blink paper measured this directly: under CPU interference, LLC miss rates jump from 7.0% to 71.6%, and dTLB load misses increase from 66 million to over 1 billion. That is not a rounding error. It is an order-of-magnitude degradation in memory subsystem performance.

Second, the CPU is slow relative to what it orchestrates. A modern GPU completes a transformer layer forward pass in microseconds. Launching a single CUDA graph from the host costs 11-17 microseconds of CPU overhead. When every decode iteration requires a host round-trip for scheduling decisions, batching updates, and token sampling, the CPU adds milliseconds of overhead per token across a batch. At 30+ tokens per second per request, those milliseconds compound.

Third, the CPU is expensive to reserve. Operators who discover CPU interference must either pin cores for inference (wasting multi-tenant capacity) or over-provision CPU resources (increasing cost per GPU hour). Neither option scales well. The Blink paper reports that baselines retain only 28-48% of their throughput under realistic colocation scenarios.

Blink splits the serving stack across two non-CPU components: the BlueField-3 DPU handles the frontend, and a persistent GPU kernel handles the backend. The host CPU is not on the critical path at any point during steady-state inference.

```mermaid
graph LR
    subgraph Traditional["Traditional: CPU on critical path"]
        A1[Client] -->|HTTP| B1[CPU: Parse + Tokenize]
        B1 -->|Host memory copy| C1[CPU: Schedule batch]
        C1 -->|CUDA launch| D1[GPU: Forward pass]
        D1 -->|Host callback| E1[CPU: Sample token]
        E1 -->|Host memory| F1[CPU: Stream response]
    end

    subgraph Blink["Blink: CPU-free path"]
        A2[Client] -->|HTTP| B2[DPU: Parse + Tokenize]
        B2 -->|RDMA to GPU VRAM| C2[GPU: Ring buffer slot]
        C2 -->|Persistent kernel| D2[GPU: Schedule + Forward + Sample]
        D2 -->|Ring buffer| E2[DPU: Poll + Stream response]
    end

    style B1 fill:#ff6b6b,color:#000
    style C1 fill:#ff6b6b,color:#000
    style E1 fill:#ff6b6b,color:#000
    style F1 fill:#ff6b6b,color:#000
    style B2 fill:#4ecdc4,color:#000
    style D2 fill:#4ecdc4,color:#000
    style E2 fill:#4ecdc4,color:#000
```

The key insight is that the GPU already has all the compute needed to run scheduling and batching decisions. A single thread block (256 threads) scanning a ring buffer of 4,096 slots takes 1-5 microseconds. That is faster than a single CPU-to-GPU CUDA launch. By keeping the scheduler resident on the GPU, Blink eliminates the round-trip that every other serving framework pays per decode iteration.

How does the DPU frontend work?

The BlueField-3 DPU runs 16 ARM Cortex-A78 cores with 32 GB of dedicated memory. Blink’s frontend runs entirely on these cores, implementing:

HTTP server with OpenAI-compatible API. Accepts requests, validates parameters, and manages SSE streaming for token-by-token output. The DPU handles all network I/O without touching the host CPU or host memory.

ARM-optimized tokenizer. The paper reports their tokenizer runs 8-19.7x faster than HuggingFace’s implementation by using ARM NEON SIMD instructions for regex pre-tokenization at 16 bytes per cycle. Tokenization happens on the DPU, not on the host.

One-sided RDMA to GPU memory. After tokenization, input token IDs are written directly into GPU VRAM via RDMA through the 200 Gbps link. No intermediate copy through host memory. The DPU writes to a specific ring buffer slot in GPU memory, sets the slot state to prefill_pending via atomic CAS, and the GPU-resident scheduler picks it up.

Adaptive token polling. The DPU reads completed output tokens from the ring buffer and streams them to the client. A background reader with adaptive polling avoids busy-waiting while maintaining low latency.
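The paper does not spell out the polling policy; a minimal sketch of the adaptive idea, assuming exponential backoff while idle and an aggressive reset on arrival (all names here are hypothetical, not Blink's API):

```python
import time

def adaptive_poll(read_slot, min_interval=1e-5, max_interval=1e-3, deadline=1.0):
    """Poll a ring-buffer slot for output tokens, backing off while idle.

    `read_slot` is a callable returning the next token or None. The polling
    interval doubles on every empty read (capped at `max_interval`) and
    resets to `min_interval` whenever a token arrives, trading CPU burn
    for first-byte latency.
    """
    interval = min_interval
    start = time.monotonic()
    tokens = []
    while time.monotonic() - start < deadline:
        token = read_slot()
        if token is not None:
            tokens.append(token)
            interval = min_interval          # hot stream: poll aggressively
            if token == "<eos>":
                break
        else:
            time.sleep(interval)             # idle: back off exponentially
            interval = min(interval * 2, max_interval)
    return tokens
```

The same shape applies whether the reader runs on DPU ARM cores or anywhere else: the backoff bounds the idle CPU cost, the reset bounds the added latency.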

The total implementation is roughly 17,000 lines of C/C++ using the DOCA 3.2 SDK for RDMA orchestration.

How does the GPU-resident scheduler work?

This is the core technical contribution. Instead of the host CPU launching CUDA graphs and making scheduling decisions between iterations, Blink keeps a persistent kernel alive on the GPU that handles everything.

Persistent scheduler kernel. One thread block (256 threads) runs continuously. Each thread scans a range of the 4,096-slot ring buffer in parallel, looking for slots in prefill_pending or decode_processing states. The scan completes in 1-5 microseconds.

Device-side CUDA graph launch. When the scheduler identifies work, it launches pre-captured TensorRT inference graphs directly from the GPU in fire-and-forget mode. This takes approximately 2 microseconds versus 11-17 microseconds for a host-initiated launch. The system maintains a cache of 650-1,000 pre-captured graphs covering different batch sizes and sequence lengths.
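A cache of 650-1,000 pre-captured graphs implies a lookup keyed by discretized shape. A hypothetical Python sketch, assuming sequence lengths are bucketed up to the nearest captured size (the bucket boundaries and class names below are illustrative, not taken from the paper):

```python
def bucket(seq_len, buckets=(128, 256, 512, 1024, 2048, 4096)):
    """Round a sequence length up to the nearest pre-captured bucket."""
    for b in buckets:
        if seq_len <= b:
            return b
    raise ValueError(f"seq_len {seq_len} exceeds largest captured graph")

class GraphCache:
    """Map (batch_size, seq-length bucket) -> a pre-captured graph handle."""
    def __init__(self):
        self._graphs = {}

    def capture(self, batch, seq_bucket, graph):
        self._graphs[(batch, seq_bucket)] = graph

    def lookup(self, batch, seq_len):
        # Hit: reuse the captured graph. Miss: caller must capture or pad.
        return self._graphs.get((batch, bucket(seq_len)))
```

Bucketing is what keeps the cache finite: without it, every (batch, length) pair would need its own graph.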

On-GPU token sampling. Top-P sampling runs inside the CUDA graphs themselves. There is no host round-trip to read logits, compute probabilities, and send the sampled token back. The entire decode loop — forward pass, sampling, KV-cache update, scheduling decision for the next iteration — happens without the CPU.

Ring buffer state machine. Each slot transitions through: empty → prefill_pending → prefill_processing → decode_processing → decode_completed. Ownership transfers use atomic compare-and-swap operations with no locks. The DPU writes one end, the GPU writes the other.
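The slot protocol can be modeled in a few lines. This Python sketch emulates the atomicity of a single CAS with a lock (the real system uses GPU atomics over VRAM); the class and state names are illustrative, not Blink's:

```python
from enum import Enum, auto
import threading

class Slot(Enum):
    EMPTY = auto()
    PREFILL_PENDING = auto()
    PREFILL_PROCESSING = auto()
    DECODE_PROCESSING = auto()
    DECODE_COMPLETED = auto()

class RingBuffer:
    """Slot ownership transfer via compare-and-swap, no explicit locks.

    The producer (DPU side) may only move EMPTY -> PREFILL_PENDING; the
    consumer (GPU side) advances the remaining states. A failed CAS means
    the other side got there first, so neither party can double-claim.
    """
    def __init__(self, n_slots=4096):
        self.states = [Slot.EMPTY] * n_slots
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def cas(self, idx, expected, new):
        """Atomically move slot idx from `expected` to `new`; report success."""
        with self._guard:
            if self.states[idx] is expected:
                self.states[idx] = new
                return True
            return False
```

The point of the state machine is that every transition has exactly one legal writer, so DPU and GPU never need a shared lock across PCIe.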

Window-based tail-launch recovery. CUDA has a hard limit of 120 concurrent graph launches. The scheduler implements a windowing mechanism that tracks in-flight launches and recovers when hitting the ceiling, preventing deadlocks under burst traffic.
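A sketch of the windowing idea in Python. The 120-launch ceiling and the existence of a windowing mechanism come from the paper; the defer-and-replay logic below is an assumption about how recovery might work:

```python
class LaunchWindow:
    """Track in-flight device-side graph launches against a hard ceiling.

    New launches are refused once the ceiling is hit and parked in a
    deferred queue; retiring a completed launch frees a slot and replays
    one deferred item, so burst traffic drains instead of deadlocking.
    """
    def __init__(self, ceiling=120):
        self.ceiling = ceiling
        self.in_flight = 0
        self.deferred = []

    def try_launch(self, work):
        if self.in_flight < self.ceiling:
            self.in_flight += 1      # fire-and-forget graph launch goes here
            return True
        self.deferred.append(work)   # at the ceiling: park, don't block
        return False

    def retire(self):
        """A launch completed: free its slot and replay one deferred item."""
        self.in_flight -= 1
        if self.deferred:
            self.try_launch(self.deferred.pop(0))
```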

The GPU backend is approximately 16,000 lines of CUDA and C++ targeting CUDA 13.1.

How do the benchmarks compare?

Blink was evaluated against vLLM v0.13.0, SGLang v0.5.8, and TensorRT-LLM v1.1.0 on an NVIDIA H100 (96 GB HBM3) with the same TensorRT v10.14 engines. The host machine had 2x Intel Xeon Gold 6336Y (48 cores, 96 threads) and 256 GB DDR5.

Isolation (no CPU interference)

| Model | System | P99 TTFT (ms) | P99 TPOT (ms) | Throughput (req/s) |
|---|---|---|---|---|
| Llama-3 8B | Blink | 653.8 | 15.1 | 11.87 |
| Llama-3 8B | TensorRT-LLM | 880.0 | 17.7 | 10.80 |
| Llama-3 8B | vLLM | 1309.6 | 22.4 | 9.12 |
| Llama-3 8B | SGLang | 1747.1 | 30.7 | 7.88 |
| Qwen-3 30B-A3B (MoE) | Blink | 1397.5 | 35.5 | 4.85 |
| Qwen-3 30B-A3B (MoE) | TensorRT-LLM | 4814.7 | 65.8 | 3.61 |
| Qwen-3 30B-A3B (MoE) | vLLM | 8919.2 | 89.3 | 2.91 |
| Qwen-3 30B-A3B (MoE) | SGLang | 11839.8 | 120.8 | 2.62 |

Even without interference, Blink wins across every metric. The gains are largest on the MoE model (Qwen-3 30B-A3B), where irregular expert routing patterns amplify the cost of CPU scheduling jitter.

Under CPU interference (multi-tenant colocation)

This is where the architecture matters most.

| Model | Metric | Blink | TensorRT-LLM | vLLM | SGLang |
|---|---|---|---|---|---|
| Llama-3 8B | TTFT inflation | 1.02x | 8.43x | 12.71x | 18.84x |
| Llama-3 8B | Throughput retained | 98% | 48% | 41% | 38% |
| Phi-4 15B | TTFT inflation | 0.97x | 3.82x | 7.14x | 10.66x |
| Qwen-3 30B-A3B | TTFT inflation | 1.14x | 1.98x | 3.27x | 4.90x |

Blink’s TTFT inflation under interference stays between 0.92x and 1.14x across all models. It is effectively immune to CPU contention because the CPU is not involved. The baselines degrade catastrophically — SGLang’s TTFT on Llama-3 8B inflates by 18.84x, meaning a user waits nearly 19 times longer for the first token because a neighbor process is thrashing the LLC.

The throughput story is equally stark. Under interference, vLLM retains 28-47% of its isolated throughput depending on model. Blink retains 92-102%.

Energy efficiency

| Model | Blink (mJ/tok) | TensorRT-LLM (mJ/tok) | Savings |
|---|---|---|---|
| Llama-3 8B | 312 | 607 | 48.6% |
| Qwen-3 30B-A3B | 607 | 1180 | 48.6% |
| Llama-3 8B (interference) | 318 | 1083 | 70.7% |

The energy savings come from two sources: eliminating CPU power consumption on the inference path, and completing requests faster (less time under load means less total energy). Under interference, the gap widens to 70.7% because the baselines burn CPU cycles fighting for cache lines while Blink’s CPU sits idle.
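The headline percentages can be recomputed from the per-token numbers in the table (the interference row lands a rounding step away, at 70.6%, because the table's mJ figures are themselves rounded):

```python
def savings(baseline_mj, ours_mj):
    """Percent energy saved per token relative to the baseline."""
    return round(100 * (1 - ours_mj / baseline_mj), 1)

print(savings(607, 312))    # → 48.6  (Llama-3 8B, isolation)
print(savings(1180, 607))   # → 48.6  (Qwen-3 30B-A3B)
```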

Where does AMD PACE fit in?

While Blink removes the CPU from GPU-based inference, AMD is moving in the opposite direction: making the CPU itself a viable inference accelerator.

AMD PACE (Platform Aware Compute Engine), released April 8, 2026 for 5th Gen EPYC processors, is a CPU-native inference engine that understands the memory hierarchy it runs on. The key architectural decisions:

NUMA-aware workload partitioning. On a 128-core EPYC 9755, PACE splits inference across NUMA domains rather than running a single monolithic instance. Each domain gets dedicated cores and local memory bandwidth, eliminating cross-socket traffic that kills throughput on large NUMA machines.
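The partitioning idea reduces to confining each inference instance to one domain's cores so its memory traffic stays local. A hypothetical sketch (PACE's actual partitioning logic is not public; the core numbering assumes contiguous core IDs per domain):

```python
def partition_cores(total_cores=128, numa_domains=4):
    """Split a core budget into per-NUMA-domain worker sets.

    Each returned list is the core set for one inference instance; keeping
    an instance inside one domain avoids cross-socket memory traffic.
    """
    per_domain = total_cores // numa_domains
    return [list(range(d * per_domain, (d + 1) * per_domain))
            for d in range(numa_domains)]
```

On real hardware the domain-to-core mapping comes from the OS (e.g. `/sys/devices/system/node` on Linux) rather than arithmetic, but the scheduling consequence is the same: one instance, one domain.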

Unified prefill-decode scheduling. PACE runs prefill and single-token decode together with workload partitioning so neither phase blocks the other. This is the same problem Blink solves on the GPU with its persistent scheduler kernel, but solved for CPU execution.

Speculative decoding integration. Combined with AMD’s PARD (Parallel Draft Models), PACE achieves 2.5-3.2x geometric mean speedup over vLLM autoregressive mode at batch size 1, peaking at 4.45x on Llama-3.1-8B with long output sequences. In autoregressive mode alone, PACE delivers 1.6x geometric mean speedup over vLLM across Llama-3.1-8B, Qwen2-7B, Phi-4, and Gemma-3 models.

The connection between Blink and PACE matters for a specific reason: they define the two endpoints of a spectrum.

Approach Where inference runs CPU role Best for
Blink GPU + DPU None (bypassed) Latency-critical GPU serving, multi-tenant
Standard vLLM/SGLang GPU Orchestrator (bottleneck) Single-tenant GPU serving
AMD PACE CPU The compute engine CPU-only inference, cost-sensitive

If you run GPU inference in a multi-tenant environment, Blink shows the CPU is a liability. If you cannot afford GPUs or need to serve smaller models at massive scale, PACE shows the CPU is an underutilized asset. Either way, the default architecture — CPU orchestrating GPU — loses.

What does this mean for BlueField-4 and the DPU roadmap?

NVIDIA announced BlueField-4 for second-half 2026 availability as part of the Vera Rubin platform. BlueField-4 doubles the network bandwidth to 800 Gbps and delivers 6x the compute of BlueField-3. More relevant to Blink’s architecture: BlueField-4 is explicitly positioned for KV-cache management and inference memory hierarchy, integrated with the NIXL library and NVIDIA Dynamo software for cluster-level coordination.

If Blink achieves 8.47x P99 TTFT improvement on BlueField-3 (16 ARM cores, 200 Gbps), the BlueField-4 generation has headroom to push further — handling larger ring buffers, faster RDMA transfers, and potentially running parts of the tokenization pipeline that currently bottleneck on ARM core count.

NVIDIA is building inference awareness into the network fabric itself. BlueField-4’s DOCA microservices are already adopted by CoreWeave, Lambda, Together.ai, and xAI for multi-tenant networking in AI clusters. Blink’s architecture aligns with where the hardware vendor is heading.

What are the limitations?

Blink requires specific hardware. The BlueField-3 DPU is not a commodity component — it adds cost and complexity to server configurations. The DOCA SDK has a learning curve. And the roughly 33,000-line CUDA/C++ implementation is not something you drop into an existing vLLM deployment.

The ring buffer design caps at 4,096 concurrent request slots. For serving scenarios with tens of thousands of concurrent connections (streaming chat applications), this would need scaling — likely through multiple ring buffers or hierarchical scheduling.

Blink does not accelerate the decode computation itself. The persistent scheduler eliminates CPU overhead between decode iterations, but the transformer math is unchanged. Techniques like speculative decoding or fleet-level routing compose orthogonally with Blink's architecture.

The paper evaluates single-GPU configurations. Multi-GPU tensor parallel and pipeline parallel setups would require extending the ring buffer protocol across GPUs, which introduces coordination challenges that may reintroduce some CPU involvement.

Finally, the evaluation uses FP16 precision throughout. Quantized models (INT8, INT4) change the compute-to-memory ratio and may shift where bottlenecks appear. The CPU overhead that Blink eliminates might matter less when the GPU is already memory-bandwidth bound from quantization.

How would you adopt this today?

Blink is a research system from KTH Royal Institute of Technology, not a production framework. There is no pip install. But the architectural ideas translate to actionable decisions:

For multi-tenant GPU inference operators: The CPU interference data alone justifies isolating inference from co-located workloads. If you cannot deploy a DPU, at minimum pin CPU cores for your inference process, disable hyperthreading on those cores, and use numactl --membind to prevent cross-socket memory access. These are table-stakes mitigations that the Blink paper shows most operators skip.
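The pinning step can also be done from inside the serving process. This Linux-only Python sketch is a userspace stand-in for `taskset`/`numactl` core pinning, not anything from the Blink or vLLM codebases:

```python
import os

def pin_inference_process(cores):
    """Pin the current process to dedicated cores (Linux only).

    Confines the inference process to a reserved core set so neighbor
    processes cannot be scheduled onto those cores and evict its working
    set mid-request. Returns the resulting affinity, or None on platforms
    without sched_setaffinity (e.g. macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cores))           # 0 = current process
    return sorted(os.sched_getaffinity(0))
```

Note this only controls where your process runs; keeping neighbors off those cores still requires cgroup cpusets or `isolcpus` at the host level.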

For vLLM/SGLang users: Watch for GPU-resident scheduling features. vLLM v0.13.0 and SGLang v0.5.8 (the tested baselines) are entirely CPU-orchestrated. The performance gap Blink demonstrates will pressure framework developers to move scheduling logic onto the GPU, likely as an optional mode for latency-sensitive deployments.

For infrastructure planners: The BlueField-3 DPU is already deployed in hyperscaler fleets for networking offload. Repurposing it for inference frontend duties requires no new hardware purchase if your servers already have DPUs — only new software. The DOCA 3.2 SDK is the entry point.

For CPU-only inference: AMD PACE is the more directly useful result. If you serve models under 15B parameters and throughput matters more than per-request latency, PACE’s 1.6x over vLLM on EPYC hardware changes the cost calculus. The speculative decoding integration (4.45x peak) makes CPU inference viable for a wider set of models than previously assumed.

Takeaways

  • Blink eliminates the host CPU from LLM inference by splitting the serving stack between a BlueField-3 DPU (frontend) and persistent GPU kernels (backend scheduling). The result: 8.47x P99 TTFT reduction, 2.1x throughput, 48.6% energy savings per token on H100.
  • Under CPU interference from co-located workloads, conventional serving frameworks degrade by up to 139x on TTFT. Blink stays flat because the CPU is not on the critical path. This is the strongest argument for the architecture.
  • The GPU-resident scheduler scans 4,096 ring buffer slots in 1-5 microseconds using 256 threads, then launches TensorRT graphs directly from the GPU in 2 microseconds versus 11-17 microseconds from the host. Token sampling also runs on-GPU, eliminating every host round-trip in the decode loop.
  • AMD PACE represents the inverse approach: making the CPU itself a first-class inference engine via NUMA-aware scheduling and speculative decoding. 1.6x over vLLM in autoregressive mode, 4.45x peak with parallel draft models.
  • The default architecture — CPU orchestrating GPU — is increasingly the worst option for both latency and efficiency. Either remove the CPU from the path (Blink) or make the CPU the path (PACE).
  • BlueField-4 (800 Gbps, 6x compute over BF-3, second-half 2026) with explicit inference memory hierarchy support suggests DPU-accelerated serving will become a first-class deployment option, not a research prototype.

For more on optimizing the serving stack these systems plug into, see vLLM fleet inference and workload routing and the semantic router architecture for multi-model inference.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch