Frameworks · Jul 5, 2026
DeepSeek DSpark Technology Ported to Apple Silicon, Boosts Local Mac LLM Speed by Up to 60%
DeepSeek's speculative decoding technology DSpark, open-sourced on June 27, has been ported to Apple Silicon (Mac) by engineer Abdur Rahim under the project name mlx-dspark. The port supports Gemma-4 12B and Qwen3-4B models, achieving speedups of approximately 1.6× (from 18.4 tok/s to ~30 tok/s) and 1.4× (from 52.9 tok/s to ~73 tok/s) on M4 Pro, respectively.
Technical Principles and Implementation
- DSpark core idea: A small draft model quickly generates candidate tokens, which the target model then verifies in a single batch. Accepted candidates are kept; from the first rejected candidate onward, the draft is discarded and generation resumes from that position.
- Cost differences on Apple Silicon: On data center GPUs the cost of batch verification is roughly fixed, whereas on Apple Silicon it scales linearly with the number of candidate tokens. Rahim measured that each additional candidate token adds about 14 ms of verification time for Gemma-4 12B, and built a cost model that yields a theoretical speedup upper bound of ~2.2×.
- Implementation details: The draft model is extracted from HuggingFace checkpoints and quantized to 4-bit (only 1.8 GB). The target model defaults to 8-bit precision, because bf16 verification is more expensive and ends up slower overall.
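The draft-then-verify idea can be illustrated with a minimal greedy sketch. This is not mlx-dspark's code: the "models" below are hypothetical toy functions that map a token prefix to the next token, and the verification loop stands in for what a real implementation would do in one batched forward pass of the target model.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def plain_greedy(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Ordinary one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding with k candidates per step."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k candidate tokens autoregressively.
        candidates: List[int] = []
        for _ in range(k):
            candidates.append(draft(out + candidates))

        # 2) The target checks each candidate position in order.
        #    (A real implementation scores all positions in a single batch.)
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + candidates[:i])
            if candidates[i] == expected:
                accepted.append(candidates[i])
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every candidate matched, so the target contributes one extra token.
            accepted.append(target(out + candidates))

        out.extend(accepted)
    return out[:len(prompt) + max_new]


# Toy models (assumptions for illustration): the draft agrees with the
# target except when the previous token is a multiple of 5.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 17


def toy_draft(prefix: List[int]) -> int:
    token = (prefix[-1] * 3 + 1) % 17
    return token if prefix[-1] % 5 else (token + 1) % 17


if __name__ == "__main__":
    reference = plain_greedy(toy_target, [2], 20)
    fast = speculative_greedy(toy_target, toy_draft, [2], 20, k=4)
    print(fast == reference)
```

Because every emitted token is either a candidate the target agreed with or the target's own choice, the greedy output matches plain decoding with the target model; the draft only affects how many target passes are needed.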
High-Fidelity Reproduction and Sampling Support
- Most local ports only support greedy decoding, but mlx-dspark implements the temperature sampling method from the paper, producing output distributions strictly identical to the target model, byte-for-byte.
- Rahim found that pairing the draft model with a non-instruction-tuned base target model yields a candidate acceptance rate of only 47%; switching to the instruction-tuned version raises it to 82%.
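For sampling with temperature, speculative decoding is usually made distribution-preserving with an accept/reject rule: accept a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p − q). The sketch below shows that standard rule for a single token; whether the DSpark paper and mlx-dspark use exactly this formulation is an assumption, and the logits are made-up illustrative values.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(weights: List[float], rng: random.Random) -> int:
    """Draw an index in proportion to (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_sample_one(p_target: List[float], q_draft: List[float],
                           rng: random.Random) -> int:
    """Return one token distributed according to p_target, proposed by q_draft."""
    x = sample(q_draft, rng)
    # Accept the drafted token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    # Rejected: resample from the residual mass the draft under-covered.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return sample(residual, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative logits only; not taken from any real model.
    p = softmax([2.0, 1.0, 0.1], temperature=0.8)
    q = softmax([1.0, 1.5, 0.3], temperature=0.8)
    n = 20000
    counts = [0, 0, 0]
    for _ in range(n):
        counts[speculative_sample_one(p, q, rng)] += 1
    print([round(c / n, 3) for c in counts], [round(v, 3) for v in p])
```

The empirical frequencies should track the target distribution p rather than the draft distribution q, which is the property that lets a sampling-capable port stay faithful to the target model.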
Integration of DFlash Scheme
- At the request of DFlash paper author Jian Chen, Rahim integrated the DFlash scheme into mlx-dspark. DFlash uses parallel block diffusion (generating 16 tokens at once), achieving a speedup of ~2.1× (~36 tok/s) on code and math tasks, outperforming DSpark.
- However, in open-ended chat scenarios, DFlash's acceptance length is limited, making DSpark faster. mlx-dspark v0.0.3 allows users to manually adjust DFlash's effective block length to suit different tasks.
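Why the best block length depends on the task can be seen from a simple throughput model that combines the linear verification cost described earlier with an acceptance rate. The sketch below is not Rahim's cost model: it assumes each candidate is accepted independently with probability `accept_prob`, that one step costs `base_ms + per_token_ms * block_len + draft_ms`, and it uses a hypothetical `draft_ms` value. The 14 ms per token and the 18.4 tok/s baseline come from the article; everything else is illustrative.

```python
def expected_tokens(accept_prob: float, block_len: int) -> float:
    """Expected tokens emitted per step: the accepted prefix plus one
    token from the target, assuming independent per-token acceptance."""
    if accept_prob >= 1.0:
        return block_len + 1.0
    return (1.0 - accept_prob ** (block_len + 1)) / (1.0 - accept_prob)


def tokens_per_second(accept_prob: float, block_len: int, base_ms: float,
                      per_token_ms: float, draft_ms: float) -> float:
    """Throughput when verification cost grows linearly with block length."""
    step_ms = base_ms + per_token_ms * block_len + draft_ms
    return 1000.0 * expected_tokens(accept_prob, block_len) / step_ms


def best_block_len(accept_prob: float, base_ms: float, per_token_ms: float,
                   draft_ms: float, max_len: int = 16) -> int:
    """Block length in 1..max_len that maximizes modeled throughput."""
    return max(
        range(1, max_len + 1),
        key=lambda k: tokens_per_second(accept_prob, k, base_ms,
                                        per_token_ms, draft_ms),
    )


if __name__ == "__main__":
    base_ms = 1000.0 / 18.4   # one target step at the reported 18.4 tok/s baseline
    per_token_ms = 14.0       # reported cost per additional verified token
    draft_ms = 5.0            # hypothetical cost of producing one draft block
    for accept_prob in (0.47, 0.82):
        k = best_block_len(accept_prob, base_ms, per_token_ms, draft_ms)
        rate = tokens_per_second(accept_prob, k, base_ms, per_token_ms, draft_ms)
        print(f"accept={accept_prob:.2f} best_block_len={k} modeled={rate:.1f} tok/s")
```

Under these assumptions, a lower acceptance rate favors shorter blocks, since each extra candidate still costs verification time but is less likely to be kept. That is consistent with making the effective block length adjustable per task.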
Impact and Outlook
- This is the first native Apple Silicon implementation of DSpark since its open-source release, enabling Mac users to enjoy acceleration without relying on data center GPUs.
- Rahim indicates the method can scale to larger draft models (e.g., Qwen3-8B and 14B).
- Concurrently, DeepSeek is actively recruiting, including Tsinghua University PhD student Gu Yuxian (an Apple PhD Scholar), whose research on model compression and efficient architectures complements technologies like DSpark.