Google Open-Sources 26B Text Diffusion MoE Model DiffusionGemma, Achieving Up to 4x Speed Boost
Google has open-sourced DiffusionGemma, an experimental text diffusion model, under the Apache 2.0 license. Built on the Gemma 4 architecture, it is a mixture-of-experts (MoE) model with 26B total parameters, of which only 3.8B are active during inference. Unlike traditional autoregressive models, which generate tokens one at a time, DiffusionGemma uses a diffusion process to generate blocks of 256 tokens at once. It reaches over 1000 tokens per second on a single NVIDIA H100 and over 700 tokens per second on an RTX 5090, about 4x faster than autoregressive models of similar size.
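To put the headline figures in perspective, here is a back-of-the-envelope calculation in Python. It uses only the numbers quoted above, treating "over 1000 tokens per second" as a floor. The weight-footprint part is an assumption: the article does not say which quantization format is used, so 16, 8, and 4 bits per weight are illustrative choices, and the estimate covers weights only (no KV cache or activations).

```python
# Figures quoted in the article.
TOTAL_PARAMS = 26e9        # total parameters
ACTIVE_PARAMS = 3.8e9      # parameters active per token (MoE)
BLOCK_TOKENS = 256         # tokens generated per block
H100_TOKENS_PER_S = 1000   # "over 1000", so treated as a lower bound
SPEEDUP = 4                # "about 4x" versus autoregressive models


def weight_gb(params: float, bits_per_weight: int) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
block_seconds = BLOCK_TOKENS / H100_TOKENS_PER_S       # upper bound per block
implied_baseline = H100_TOKENS_PER_S / SPEEDUP         # tokens/s, autoregressive

# Assumed bit widths; the article does not name a quantization format.
footprints = {bits: weight_gb(TOTAL_PARAMS, bits) for bits in (16, 8, 4)}

print(f"active fraction: {active_fraction:.1%}")
print(f"time per 256-token block on H100: under {block_seconds:.3f} s")
print(f"implied autoregressive baseline: about {implied_baseline:.0f} tokens/s")
for bits, gb in footprints.items():
    print(f"{bits}-bit weights: {gb:.0f} GB")
```

Under these assumptions, roughly 15% of the weights are active per token, and a 4-bit copy of the weights takes about 13 GB. That is in line with the sub-18GB VRAM figure the article gives for quantized deployment.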
Core Principles and Speed Advantages
- Parallel Generation: The model starts from random noise and iteratively denoises to generate entire text blocks simultaneously, shifting the bottleneck from memory bandwidth to computation and fully leveraging GPU parallelism.
- Hardware Friendly: After quantization, it can run on consumer GPUs with less than 18GB VRAM (e.g., RTX 4090), lowering the barrier for local deployment.
- Bidirectional Attention: Each token can see all other tokens during generation, enabling real-time self-correction. This helps on tasks that require coordinating context across positions, such as Sudoku, where the success rate after fine-tuning jumps from 0% to 80%.
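The parallel generation idea can be sketched with a toy loop. This is not DiffusionGemma's sampler, which the article does not describe in detail. It uses a masked-token variant of discrete diffusion, swapped in for simplicity: the block starts fully masked rather than as random noise, and each step commits the most confident guesses. `toy_denoiser` is a hypothetical stand-in for a model forward pass, and it cheats by knowing the target text.

```python
import random

MASK = "<mask>"


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for one model forward pass.

    It sees the whole block at once (bidirectional context) and returns a
    (token, confidence) guess for every position that is still masked.
    A real model would predict tokens; this toy reads them from `target`.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def generate_block(target, tokens_per_step=2, seed=0):
    """Fill a fully masked block by committing the most confident guesses."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    steps = 0
    while MASK in block:
        guesses = toy_denoiser(block, target, rng)
        # Rank masked positions by confidence, highest first.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        # Commit several positions in one step; the rest stay masked.
        for i in ranked[:tokens_per_step]:
            block[i] = guesses[i][0]
        steps += 1
    return block, steps


target = "the quick brown fox jumps over the lazy dog".split()
block, steps = generate_block(target, tokens_per_step=3)
print(steps, " ".join(block))
```

Here nine tokens are filled in three passes, where token-by-token decoding would need nine. The order in which positions are committed depends on confidence, not on left-to-right position.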
Performance and Quality Trade-offs
- Benchmarks: On several standard benchmarks, DiffusionGemma's generation quality is lower than the autoregressive Gemma 4 of the same size. Google explicitly states that standard Gemma 4 remains the top choice for high-quality production output.
- Use Cases: DiffusionGemma targets speed-sensitive local interaction scenarios like inline editing, code completion, rapid iteration, and nonlinear text structure generation. In high-concurrency cloud services, autoregressive models can fully utilize compute through batching, potentially diminishing the parallel advantage of diffusion models.
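The batching argument can be illustrated with a toy roofline model. This is a sketch under invented constants, not a measurement. One forward pass takes the larger of a fixed weight-streaming time and a compute time proportional to the number of token positions processed. The values `weight_time`, `compute_time_per_position`, and `steps` are all hypothetical, and `steps=64` is chosen only so the single-request ratio matches the article's "about 4x". MoE routing effects are ignored.

```python
def step_time(positions, weight_time=1.0, compute_time_per_position=0.001):
    """Time for one forward pass in arbitrary units (hypothetical constants).

    The pass is limited either by streaming the weights once (memory-bound)
    or by the arithmetic for all positions processed (compute-bound).
    """
    return max(weight_time, positions * compute_time_per_position)


def autoregressive_throughput(batch):
    # One pass over `batch` positions yields `batch` new tokens.
    return batch / step_time(batch)


def diffusion_throughput(batch, block=256, steps=64):
    # `steps` denoising passes over batch * block positions
    # yield batch * block tokens. `steps` is an assumed value.
    positions = batch * block
    return positions / (steps * step_time(positions))


for batch in (1, 8, 64, 512):
    ratio = diffusion_throughput(batch) / autoregressive_throughput(batch)
    print(f"batch={batch:4d}  diffusion/autoregressive throughput ratio: {ratio:.3f}")
```

With one request, the autoregressive pass is memory-bound and the ratio is 4 by construction. As the batch grows, the autoregressive model amortizes the same weight read over more tokens, while the diffusion model is already compute-bound, so the ratio falls. Only the trend is meaningful here; the absolute values depend entirely on the assumed constants.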
Ecosystem Support and Open Source
- Framework Compatibility: Already supported by inference frameworks such as vLLM, MLX, Unsloth, and NeMo; llama.cpp integration is underway.
- Hardware Coverage: Supported across NVIDIA hardware, from the RTX 4090 to the H100 and DGX Spark.
- Open Source License: Apache 2.0, with weights available on Hugging Face for commercial use.
Industry Context
Text diffusion models are not entirely new. In February, startup Inception Labs released Mercury 2, claiming speeds 5-10x faster than Claude and Gemini. Google had previously showcased Gemini Diffusion at last year's I/O, reporting a sampling speed of 1479 tokens/s, but then went quiet. With DiffusionGemma's release and comprehensive ecosystem support, Google is actively pushing diffusion models toward practical text generation.