StepFun Open-Sources JetSpec: Up to 10x Decoding Speed Boost for LLMs
StepFun, in collaboration with several universities, has proposed a new speculative decoding method called JetSpec. Using causal parallel tree draft generation, it achieves up to 9.64x end-to-end decoding acceleration on Qwen3-8B, with an average of 10.76 tokens accepted per verification step. DeepSeek concurrently released DSpark; the two methods attack inference bottlenecks from different angles, and both target the inference efficiency that large-scale agent deployment requires.
Core Acceleration Results
- End-to-end speedup: Compared to standard autoregressive decoding, JetSpec achieves 9.64x speedup on MATH-500, 7.12x on HumanEval, 7.67x on LiveCodeBench, and 4.58x on MT-Bench using Qwen3-8B.
- Acceptance length: On MATH-500, an average of 10.76 tokens are accepted per verification step; with a speculation budget of 128, the average acceptance length reaches 9.82, surpassing DFlash's 7.34 and DDTree's 8.66.
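The gap between the two MATH-500 figures above (10.76 tokens per verification step, 9.64x speedup) can be read through a simple back-of-the-envelope model. This is a rough sketch, not a formula from the paper: it assumes both figures come from the same configuration, that the accepted-token count already includes every token emitted per step, and that speedup is simply tokens per step divided by the cost of one draft-plus-verify step measured in units of one ordinary target decoding step. The function name `implied_step_cost` is made up for illustration.

```python
def implied_step_cost(tokens_per_step: float, speedup: float) -> float:
    """Cost of one draft+verify step, in units of one plain target decoding step.

    Assumed model (not taken from the paper):
        speedup ~= tokens_per_step / step_cost
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return tokens_per_step / speedup


# Figures reported for Qwen3-8B on MATH-500.
cost = implied_step_cost(tokens_per_step=10.76, speedup=9.64)

# Under this model, one speculative step costs only slightly more than
# one ordinary decoding step (a ratio a little above 1.1).
print(f"implied relative step cost: {cost:.3f}")
```

Under these assumptions, drafting and verification together add only a small overhead per step, so most of the acceptance length carries through to end-to-end speedup.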
Technical Principle: Causal Parallel Tree Drafting
Speculative decoding uses a lightweight draft model to generate candidate tokens, which the target model then verifies in parallel. Traditional methods face a trade-off between causal consistency and parallel efficiency: autoregressive drafting (e.g., EAGLE) preserves causality but requires many serial steps, while block-parallel drafting (e.g., DFlash) is cheap but lacks branch-level causal constraints, which lowers acceptance rates.
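The basic draft-then-verify loop can be illustrated with a toy example. This is a minimal sketch of generic greedy speculative decoding with a serial draft, not JetSpec's implementation: `target_next` and `draft_next` are hypothetical stand-ins for real models, and verification is written as a Python loop where a real system would run one batched forward pass of the target model.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[int]) -> int:
    """Toy 'target model': deterministic greedy next token."""
    return (sum(prefix) * 31 + len(prefix) * 7) % VOCAB


def draft_next(prefix: List[int]) -> int:
    """Toy 'draft model': agrees with the target except at some positions."""
    tok = target_next(prefix)
    return tok if len(prefix) % 4 != 3 else (tok + 1) % VOCAB


def autoregressive_decode(prefix: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_step(prefix: List[int], k: int) -> List[int]:
    """Draft k tokens serially, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the bad draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted


def speculative_decode(prefix: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Return the decoded sequence and the number of verification steps used."""
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, k))
        steps += 1
    return out[: len(prefix) + n_new], steps


if __name__ == "__main__":
    seq, steps = speculative_decode([1, 2], n_new=20)
    print(seq == autoregressive_decode([1, 2], 20), steps)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding; the saving comes from needing fewer verification steps than generated tokens.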
JetSpec directly incorporates causality into the parallel draft head, generating path-conditioned draft trees. This allows larger draft budgets to translate into longer acceptable prefixes. In low-latency scenarios, the system can tolerate slightly higher draft computation costs to improve acceptance rates, thereby converting compute directly into lower per-user latency.
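Why a path-conditioned tree can turn a larger draft budget into a longer accepted prefix can be shown with a toy example. This is a hedged sketch of generic tree-style verification, not JetSpec's draft head: `target_next` and `draft_candidates` are hypothetical stand-ins for the models, and the tree is walked in Python, whereas a real system would typically score all nodes in a single target forward pass.

```python
from dataclasses import dataclass, field
from typing import Dict, List

VOCAB = 50  # toy vocabulary size (assumption for illustration)


@dataclass
class Node:
    token: int
    children: Dict[int, "Node"] = field(default_factory=dict)


def target_next(prefix: List[int]) -> int:
    """Toy 'target model': deterministic greedy next token."""
    return (sum(prefix) * 31 + len(prefix) * 7) % VOCAB


def draft_candidates(path: List[int], width: int) -> List[int]:
    """Toy draft: ranks candidates conditioned on the full path so far.

    The correct token is ranked second at some positions, so a width-1
    chain misses it while a width-2 tree still contains it.
    """
    true_tok = target_next(path)
    wrong_tok = (true_tok + 1) % VOCAB
    ranked = [wrong_tok, true_tok] if len(path) % 3 == 0 else [true_tok, wrong_tok]
    return ranked[:width]


def build_tree(prefix: List[int], depth: int, width: int) -> Node:
    """Expand each node using candidates conditioned on its own root-to-node path."""
    root = Node(token=-1)  # sentinel root standing for the current prefix

    def expand(node: Node, path: List[int], remaining: int) -> None:
        if remaining == 0:
            return
        for tok in draft_candidates(path, width):
            child = Node(tok)
            node.children[tok] = child
            expand(child, path + [tok], remaining - 1)

    expand(root, list(prefix), depth)
    return root


def longest_accepted_path(root: Node, prefix: List[int]) -> List[int]:
    """Follow the branch that matches the target's greedy choice at every level."""
    accepted, node, ctx = [], root, list(prefix)
    while node.children:
        expected = target_next(ctx)
        child = node.children.get(expected)
        if child is None:
            break
        accepted.append(expected)
        ctx.append(expected)
        node = child
    return accepted


if __name__ == "__main__":
    prefix = [1, 2, 3]
    chain = longest_accepted_path(build_tree(prefix, depth=4, width=1), prefix)
    tree = longest_accepted_path(build_tree(prefix, depth=4, width=2), prefix)
    print(len(chain), len(tree))
```

In this toy setup the single chain is rejected at its first token, while the wider tree keeps a fully accepted branch: the extra draft compute buys a longer accepted prefix per verification step.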
Complementary Relationship with DSpark
DeepSeek's concurrently released DSpark targets high-concurrency, budget-constrained scenarios, using lightweight serial heads and confidence estimation to control verification costs and improve throughput. JetSpec targets low-concurrency, latency-sensitive scenarios, maximizing the acceptance length per verification step. The two approaches sit on opposite sides of the throughput-latency trade-off, and together they suggest that inference efficiency is becoming a fundamental variable for large-scale agent deployment.
Team and Open Source
The JetSpec paper's authors include StepFun CEO Jiang Daxin and CTO Zhu Yibo. The first author is Lanxiang Hu, a UCSD PhD student who completed the work during an internship at StepFun. The other authors are from Zhejiang University, UIUC, and Nanjing University. The project is open source: the paper is at https://arxiv.org/abs/2606.18394 and the code repository is at https://github.com/hao-ai-lab/JetSpec .