DeepSeek Releases DSpark Inference Acceleration Technology, Boosting V4 Online Inference Speed by Up to 85%
DeepSeek, in collaboration with Peking University, recently released the speculative decoding framework DSpark and open-sourced DeepSpec, a full-stack training framework. DSpark now serves online traffic for both the Flash and Pro versions of DeepSeek-V4, replacing the previous MTP-1 solution. It is not a new model architecture: it adds a speculative decoding module on top of V4, and the work is primarily an engineering effort.
Core Technology: Semi-Autoregressive Generation + Confidence-Scheduled Verification
DSpark has two core innovations:
- Semi-Autoregressive Generation: Combines the high throughput of parallel draft models (e.g., DFlash) with the coherence of autoregressive draft models (e.g., Eagle3). The parallel backbone generates logits for all candidate tokens at once, followed by a lightweight serial module (default Markov head, low-rank decomposition r=256) that injects prefix dependencies token by token, correcting the "suffix decay" problem common in parallel approaches. This serial module adds only 0.2%–1.3% latency.
- Hardware-Aware Confidence-Scheduled Verification: Each draft position is equipped with a confidence head that predicts the probability of a token passing verification, calibrated via Sequential Temperature Scaling (STS). A scheduler dynamically determines the verification length for each request based on real-time GPU load: more verification under low load, tighter under high load, avoiding wasted computation on tail tokens likely to be rejected.
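The two mechanisms above can be illustrated with a minimal Python sketch. It is not DSpark's implementation. The sketch assumes that the Markov head adds a low-rank bias that depends only on the previous draft token, that calibration uses one temperature per draft position, and that the scheduler keeps a draft position while the estimated probability of the whole prefix passing stays above a load-dependent threshold. All function names, the threshold mapping, and the toy dimensions are hypothetical.

```python
import math
from typing import List, Sequence


def markov_bias(prev_token: int,
                left: Sequence[Sequence[float]],
                right: Sequence[Sequence[float]]) -> List[float]:
    """Low-rank bias over the vocabulary: left[prev_token] (length r) times right (r x V)."""
    row = left[prev_token]
    vocab_size = len(right[0])
    return [sum(row[j] * right[j][v] for j in range(len(row)))
            for v in range(vocab_size)]


def semi_autoregressive_draft(parallel_logits: Sequence[Sequence[float]],
                              left: Sequence[Sequence[float]],
                              right: Sequence[Sequence[float]],
                              last_token: int) -> List[int]:
    """Greedy draft: logits produced in parallel, then corrected one position at a time."""
    draft: List[int] = []
    prev = last_token
    for logits in parallel_logits:
        # Assumed form of the serial step: a bias conditioned on the previous token.
        bias = markov_bias(prev, left, right)
        corrected = [l + b for l, b in zip(logits, bias)]
        prev = max(range(len(corrected)), key=corrected.__getitem__)
        draft.append(prev)
    return draft


def calibrated_confidence(logit: float, temperature: float) -> float:
    """Temperature-scaled sigmoid of a raw confidence logit."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def verify_length(conf_logits: Sequence[float],
                  temperatures: Sequence[float],
                  load: float) -> int:
    """Number of draft tokens to send for verification; load is in [0, 1]."""
    threshold = 0.1 + 0.8 * load  # hypothetical mapping: busier GPU, stricter cut-off
    prefix_survives = 1.0
    kept = 0
    for logit, temperature in zip(conf_logits, temperatures):
        prefix_survives *= calibrated_confidence(logit, temperature)
        if prefix_survives < threshold:
            break
        kept += 1
    return kept
```

With a toy vocabulary of three tokens and rank 1, the same parallel logits yield different drafts depending on the preceding token, which is the prefix dependency the serial module is meant to restore. With confidence logits of 3.0, 2.0, 0.0 and -1.0, the sketch verifies all four positions on an idle GPU and only the first one at full load.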
Performance Data: Significant Improvements Both Offline and Online
In offline evaluations with the Qwen3 series (4B/8B/14B) and Gemma4-12B as target models, DSpark's average acceptance length was 26.7%–30.9% higher than Eagle3's and 16.3%–18.4% higher than DFlash's. Structured tasks (math, code) had higher acceptance lengths than open-ended dialogue.
Online production data (compared to MTP-1 baseline):
- Under the same overall throughput, V4-Flash user generation speed increased by 60%–85%, and V4-Pro by 57%–78%.
- Under strict per-user speed requirements (e.g., 120 tok/s/user), MTP-1 was near its limit while DSpark still sustained the target speed, giving a relative throughput gap of up to +661%. The paper emphasizes that this figure reflects scalable interaction tiers, not an actual six-fold improvement.
Open-Source Framework DeepSpec
DeepSpec is a companion full-stack training and evaluation codebase supporting three draft models (DSpark, DFlash, Eagle3) and target models like Qwen3 and Gemma. The pipeline includes data preparation (requiring ~38 TB of target cache), training (default 8 GPUs), and evaluation (covering benchmarks like GSM8K, MATH500, HumanEval). This framework standardizes the engineering practices of speculative decoding, facilitating reproduction and customization by researchers.
Limitations and Future Directions
The paper notes that DSpark's drafting cost remains unavoidable: the parallel backbone's initial draft generation is a fixed overhead, and for complex requests with low acceptance rates, the upfront investment may not be recouped. The team's future direction is to enable the draft model to stop early based on difficulty.
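The trade-off can be made concrete with a simplified cost model, sketched below in Python. It is an illustration under stated assumptions rather than the paper's analysis: verification is assumed to cost one target forward pass regardless of draft length, the target model is assumed to contribute one token per step after the accepted prefix, and `draft_cost` is a hypothetical parameter expressing the drafting overhead in units of one target forward pass.

```python
from typing import Sequence


def expected_tokens_per_step(accept_probs: Sequence[float]) -> float:
    """Expected tokens emitted by one draft-and-verify step.

    Draft token k is kept only if tokens 1..k all pass verification, and the
    target model adds one token of its own after the accepted prefix.
    """
    expected = 1.0
    prefix_survives = 1.0
    for p in accept_probs:
        prefix_survives *= p
        expected += prefix_survives
    return expected


def speedup(accept_probs: Sequence[float], draft_cost: float) -> float:
    """Speedup over plain decoding under the simplified cost model."""
    return expected_tokens_per_step(accept_probs) / (1.0 + draft_cost)
```

For four draft tokens and a drafting overhead of 0.3, a per-token acceptance probability of 0.9 gives a speedup of about 3.15x, while a probability of 0.2 gives about 0.96x, slower than decoding without a draft. This is the case where the upfront investment is not recouped, and where stopping the draft early would help.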