StepFun and Multiple Universities Open-Source JetSpec, Achieving Up to 9.64x Speedup in Speculative Decoding
StepFun, in collaboration with teams from UC San Diego, Zhejiang University, University of Illinois, and Nanjing University, has recently open-sourced JetSpec, a speculative decoding framework. JetSpec introduces a causal parallel draft head that generates an entire candidate tree in a single forward pass while maintaining branch-level causal consistency, breaking through the scaling ceiling of traditional speculative decoding.
Core Speedup Results
On H100 GPUs with the Qwen3-8B model, JetSpec reports the following speedups by benchmark:
- MATH-500: End-to-end decoding speedup of 9.64x, with an average accepted length of 10.76 tokens
- GSM8K: 7.82x
- AIME25: 8.78x
- HumanEval: 7.12x
- MBPP: 6.73x
- LiveCodeBench: 7.67x
- MT-Bench: 4.58x
On the MoE model Qwen3-30B-A3B, JetSpec reaches a 9.45x speedup on MATH-500 and 9.35x on AIME25, suggesting that the method carries over from dense to MoE architectures.
Technical Principle: Resolving the Causality-Efficiency Dilemma
The speedup of speculative decoding is limited by the draft generation cost (c) and the per-token acceptance rate (α). Traditional methods face a dilemma:
- Autoregressive drafts (e.g., EAGLE series): Good causal consistency and high acceptance rates, but serial generation steps increase draft cost with tree depth.
- Block parallel drafts (e.g., DFlash series): Generate multiple candidates in one forward pass with very low cost, but lack branch-level causal conditioning, leading to "locally plausible, globally inconsistent" outputs and low acceptance rates.
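The trade-off between draft cost and acceptance rate can be made concrete with the standard chain-draft model from the general speculative decoding literature. This is a minimal sketch, not JetSpec's own analysis: it assumes each draft token is accepted independently with probability α, that c is the cost of one draft forward pass relative to one target forward pass, and that the draft is a single chain of `gamma` tokens rather than a tree. The function names and the `serial_draft` flag are hypothetical.

```python
def expected_accepted(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step for a chain of gamma
    draft tokens, assuming i.i.d. acceptance with probability alpha."""
    if alpha == 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, c: float, serial_draft: bool = True) -> float:
    """Idealized wall-time speedup over plain autoregressive decoding.

    serial_draft=True  : the draft runs gamma sequential passes (cost gamma * c).
    serial_draft=False : the draft runs one parallel pass (cost c), which is an
                         illustrative assumption, not a measured JetSpec figure.
    """
    draft_cost = gamma * c if serial_draft else c
    # One verification pass of the target model costs 1 unit.
    return expected_accepted(alpha, gamma) / (draft_cost + 1.0)


if __name__ == "__main__":
    for gamma in (4, 8, 16):
        s_serial = speedup(alpha=0.8, gamma=gamma, c=0.05, serial_draft=True)
        s_parallel = speedup(alpha=0.8, gamma=gamma, c=0.05, serial_draft=False)
        print(f"gamma={gamma:2d}  serial={s_serial:.2f}x  parallel={s_parallel:.2f}x")
```

Under this model, a serial draft pays a cost that grows with depth, while a parallel draft only helps if its acceptance rate α stays high, which is the dilemma described above.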
JetSpec's causal parallel draft head reuses frozen hidden states from the target model and employs a tree-causal attention mask: each tree node can only see the original prefix and its branch's ancestor tokens. All nodes are computed in parallel in a single forward pass while maintaining autoregressive dependencies within branches. Training uses forward KL divergence distillation on a dataset comprising 780K samples from the Nemotron Post-Training Dataset V2 and 20K samples from CodeAlpaca.
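The visibility rule of a tree-causal attention mask can be sketched as follows. This is an illustrative construction, not code from the JetSpec repository: the `parents` array encoding (with -1 marking a node attached directly to the prefix), the function name, and the choice to let each node attend to itself are all assumptions.

```python
from typing import List


def tree_causal_mask(prefix_len: int, parents: List[int]) -> List[List[bool]]:
    """Build a boolean mask of shape [n_nodes][prefix_len + n_nodes].

    mask[i][j] is True when tree node i may attend to position j.
    Columns 0..prefix_len-1 are prefix tokens; column prefix_len + k is node k.
    parents[k] is the parent node index of node k, or -1 if node k hangs
    directly off the prefix. Parents are assumed to precede their children.
    """
    n = len(parents)
    for k, p in enumerate(parents):
        if not (-1 <= p < k):
            raise ValueError(f"node {k} has invalid parent {p}")

    mask = [[False] * (prefix_len + n) for _ in range(n)]
    for node in range(n):
        # Every node sees the whole original prefix.
        for j in range(prefix_len):
            mask[node][j] = True
        # Walk up the branch: the node itself (assumption) and its ancestors.
        cur = node
        while cur != -1:
            mask[node][prefix_len + cur] = True
            cur = parents[cur]
    return mask


if __name__ == "__main__":
    # Two branches under the prefix: 0 -> {2, 3} and 1 -> {4}.
    example = tree_causal_mask(prefix_len=2, parents=[-1, -1, 0, 0, 1])
    for row in example:
        print("".join("1" if allowed else "." for allowed in row))
```

Because no row depends on another row's output, all nodes can be evaluated in one forward pass, while sibling branches remain invisible to each other.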
Complementarity with the Concurrent Work DSpark
JetSpec and DSpark, which DeepSeek open-sourced around the same time, optimize inference efficiency from different angles:
- DSpark: Targets high-concurrency, budget-constrained scenarios, improving throughput via lightweight correction heads and confidence scheduling.
- JetSpec: Targets low-latency scenarios with ample compute budget, using causal parallel tree generation to maximize the acceptance rate per verification step and so reduce single-user latency.
Both point to causality as the key to next-generation speculative decoding.
Service Scenarios and Budget Strategies
- Low concurrency (batch size=1): Increasing tree budget from 16 to 128 raises throughput from 443.3 TPS to 968.2 TPS, with speedup increasing from 3.09x to 6.75x.
- High concurrency (batch size=32): At budget 256, speedup drops to 2.85x; small to medium budgets are recommended.
The team currently evaluates only static budgets; dynamic adjustment is left for future work.
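As a quick consistency check on the batch size 1 figures, the baseline throughput implied by each (throughput, speedup) pair can be computed directly. This sketch assumes the reported speedup is the ratio of JetSpec throughput to plain autoregressive throughput in the same setting; the variable and function names are hypothetical.

```python
def implied_baseline_tps(tps: float, speedup: float) -> float:
    """Baseline tokens/s implied by a measured throughput and its speedup,
    assuming speedup = tps / baseline_tps."""
    return tps / speedup


# Figures quoted above for batch size 1.
reported = {
    16: (443.3, 3.09),   # tree budget -> (TPS, speedup)
    128: (968.2, 6.75),
}

if __name__ == "__main__":
    for budget, (tps, speedup) in reported.items():
        print(f"budget={budget:3d}  implied baseline = {implied_baseline_tps(tps, speedup):.1f} TPS")
```

Both budgets imply a baseline of roughly 143 TPS, so the two reported data points are consistent with a single autoregressive baseline.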
Open Source and Team
The project is open-sourced on GitHub (hao-ai-lab/JetSpec), the paper is on arXiv (2606.18394), and model weights are on Hugging Face (JetSpec). Authors include StepFun CEO Jiang Daxin, CTO Zhu Yibo, and UCSD PhD student Lanxiang Hu, among others.