DeepSeek V4 Adopts DSpark Speculative Decoding Framework, Boosting Generation Speed by Up to 85%

DeepSeek recently deployed DSpark, a new speculative decoding framework, for its V4 models (Flash and Pro) and open-sourced the accompanying training framework, DeepSpec. Developed in collaboration with Peking University, DSpark has replaced the previous MTP-1 solution for production traffic.

Core Innovations

DSpark is not a new model. It is a speculative decoding module added on top of V4, designed with production deployment in mind. Its two key innovations are:

Semi-autoregressive Generation: DSpark keeps the high throughput of parallel draft models and adds a lightweight serial module (by default a Markov head with a low-rank decomposition, r=256) to model dependencies between tokens, which mitigates the drop in acceptance rate toward the tail of a parallel draft. The paper found that parallel drafts actually have a higher first-token acceptance rate than autoregressive drafts (e.g., 0.88 vs 0.81 on math tasks), so DSpark aims to combine the strengths of both approaches.
Confidence-based Verification Scheduling: Each draft position has a confidence head that predicts the probability that its token will pass verification. After STS calibration, a hardware-aware scheduler uses these confidences to set the verification length dynamically according to real-time load, so that under high concurrency no computation is wasted on low-probability tokens at the tail of the draft.
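
As a rough illustration of the second idea, the Python sketch below picks a verification length from per-position confidences. It is not DSpark's scheduler: the function names, the cost_per_token parameter that stands in for real-time load, and the stopping rule are all assumptions made for this example. It treats each calibrated confidence as the probability that a draft token is accepted given that all earlier tokens were accepted, so the product of the first j confidences is the chance that position j is reached and accepted.

```python
from itertools import accumulate
from operator import mul


def expected_accepted(confidences):
    """Expected number of accepted draft tokens for each prefix length 1..n.

    Assumes confidences[i] is the calibrated probability that draft token i
    is accepted given that all earlier draft tokens were accepted.
    """
    survival = accumulate(confidences, mul)  # P(first j tokens all accepted)
    return list(accumulate(survival))        # running sum = expected count


def choose_verify_length(confidences, cost_per_token):
    """Hypothetical rule: verify position j only while its expected gain,
    P(first j tokens all accepted), exceeds the current cost of verifying
    one more token. A busier system would pass a higher cost_per_token.
    """
    length = 0
    for j, gain in enumerate(accumulate(confidences, mul), start=1):
        if gain <= cost_per_token:
            break
        length = j
    return length


confs = [0.9, 0.8, 0.5, 0.3]  # made-up confidences for a 4-token draft
print(choose_verify_length(confs, cost_per_token=0.1))  # light load: 4
print(choose_verify_length(confs, cost_per_token=0.5))  # heavy load: 2
```

Because the survival probability can only shrink along the draft, stopping at the first position whose gain falls below the cost is enough. No later position can be worth verifying once an earlier one is not.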

Performance Data

Offline Evaluation: On the Qwen3 series (4B/8B/14B) and Gemma4-12B, DSpark improves the average acceptance length by 26.7%–30.9% over Eagle3 and by 16.3%–18.4% over DFlash.
Online Production: At the same overall throughput, per-user generation speed increases by 60%–85% on V4-Flash and by 57%–78% on V4-Pro. Under strict latency requirements (e.g., 120 tok/s/user), DSpark sustains higher concurrency, with a relative throughput gap of up to +661% (the paper emphasizes that this figure illustrates the expanded usable interaction range).
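
To put these percentages on a common scale, the short Python sketch below converts a relative gain into a multiplier over the baseline. The helper names are hypothetical. Only the percentages come from the article, and the 100-to-185 throughput values in the last line are made up to check the arithmetic.

```python
def speedup_factor(gain_percent):
    """Turn a relative gain such as +85% into a multiplier such as 1.85x."""
    return 1.0 + gain_percent / 100.0


def relative_gain_percent(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Percentages quoted in the article.
print(f"{speedup_factor(85):.2f}x")   # V4-Flash upper bound: 1.85x
print(f"{speedup_factor(78):.2f}x")   # V4-Pro upper bound: 1.78x
print(f"{speedup_factor(661):.2f}x")  # strict-latency throughput gap: 7.61x

# Illustrative numbers only: going from 100 to 185 units is a +85% gain.
print(f"{relative_gain_percent(100.0, 185.0):.1f}%")
```

Read this way, a gap of +661% means roughly 7.6 times the baseline throughput at that latency target, not 6.6 times.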

Open-Source Framework DeepSpec

DeepSpec provides a full-stack toolchain covering data preparation, training, and evaluation. Data preparation requires planning for the size of the target cache (about 38 TB for Qwen3-4B, for example). The framework includes three draft models (DSpark, DFlash, and Eagle3), supports target models from the Qwen3 and Gemma series, and ships with a default configuration for a single node with 8 GPUs.

Background and Limitations

DeepSeek has invested steadily in inference efficiency: MLA in V2, MTP in V3, and sparse attention in V3.2. DSpark is the first to be used directly in flagship products. The paper also notes limitations: the drafting cost remains a fixed overhead, and for complex requests with low acceptance rates that upfront cost may not be recouped. Future directions include letting the draft model stop early based on request difficulty.
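
The break-even condition behind this limitation can be sketched with a simple cost model. The Python example below is an assumption-laden illustration, not the paper's analysis. Costs are measured in units of one target-model forward pass, and verifying the whole draft is assumed to cost a single pass. Each cycle emits one token from the verification pass itself plus any accepted draft tokens, as in standard speculative decoding. The acceptance probabilities are made up.

```python
def speculative_speedup(accept_probs, draft_cost, verify_cost=1.0):
    """Estimated speedup over plain decoding (one token per target pass).

    accept_probs[i]: assumed probability that draft token i is accepted,
                     given that all earlier draft tokens were accepted.
    draft_cost:      cost of producing the draft, in target-pass units.
    verify_cost:     cost of one verification pass, in the same units.
    """
    survival = 1.0
    expected_tokens = 1.0  # the verification pass always yields one token
    for p in accept_probs:
        survival *= p
        expected_tokens += survival
    return expected_tokens / (draft_cost + verify_cost)


easy_request = [0.9, 0.9, 0.9, 0.9]  # high acceptance, illustrative
hard_request = [0.2, 0.2, 0.2, 0.2]  # low acceptance, illustrative

print(f"{speculative_speedup(easy_request, draft_cost=0.3):.2f}")
print(f"{speculative_speedup(hard_request, draft_cost=0.3):.2f}")
```

Under these assumptions the easy request comes out well above 1, while the hard request falls slightly below 1: the fixed drafting cost is paid in both cases, but only the first recovers it. An early-stopping draft model would reduce draft_cost for requests of the second kind.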
