Model · Jun 14, 2026

Google Open-Sources 26B Text Diffusion MoE Model DiffusionGemma, Achieving Up to 4x Speed Boost

Google has open-sourced DiffusionGemma, an experimental text diffusion model, under the Apache 2.0 license. Built on the Gemma 4 architecture, it is a mixture-of-experts (MoE) model with 26B total parameters, of which only 3.8B are active during inference. Unlike traditional autoregressive models, which generate tokens one at a time, DiffusionGemma uses a diffusion process to generate blocks of 256 tokens at once. It reaches over 1000 tokens per second on a single NVIDIA H100 and over 700 tokens per second on an RTX 5090, about 4x faster than autoregressive models of similar size.
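To put the headline figures in perspective, the short Python sketch below works through the arithmetic they imply. It uses only numbers quoted above; the function names are illustrative, and the autoregressive baseline is inferred from the "about 4x" claim rather than measured.

```python
# Figures quoted in the article. Both throughput numbers are stated as "over",
# so the block latencies computed from them are upper bounds.
TOTAL_PARAMS_B = 26.0
ACTIVE_PARAMS_B = 3.8
BLOCK_TOKENS = 256
THROUGHPUT_TOK_S = {"H100": 1000.0, "RTX 5090": 700.0}
QUOTED_SPEEDUP = 4.0


def active_fraction(active_b, total_b):
    """Share of the weights that take part in one forward pass."""
    return active_b / total_b


def block_latency_s(block_tokens, tokens_per_s):
    """Seconds needed to emit one block at a given sustained throughput."""
    return block_tokens / tokens_per_s


def implied_baseline_tok_s(tokens_per_s, speedup):
    """Autoregressive throughput implied by the quoted speedup (an inference, not a measurement)."""
    return tokens_per_s / speedup


print(f"active share of weights: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
for gpu, tps in THROUGHPUT_TOK_S.items():
    latency = block_latency_s(BLOCK_TOKENS, tps)
    baseline = implied_baseline_tok_s(tps, QUOTED_SPEEDUP)
    print(f"{gpu}: <= {latency:.3f} s per 256-token block, implied baseline ~{baseline:.0f} tok/s")
```

In other words, roughly one seventh of the weights are active per token, and a full 256-token block arrives in about a quarter of a second on an H100 at the quoted rate.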

Core Principles and Speed Advantages

  • Parallel Generation: The model starts from random noise and iteratively denoises to generate entire text blocks simultaneously, shifting the bottleneck from memory bandwidth to computation and fully leveraging GPU parallelism.
  • Hardware Friendly: After quantization, the model needs less than 18GB of VRAM, so it can run on consumer GPUs such as the RTX 4090, lowering the barrier for local deployment.
  • Bidirectional Attention: Each token can attend to all other tokens during generation, enabling real-time self-correction. This helps on tasks that require coordinating context across the whole output, such as Sudoku, where the success rate after fine-tuning jumps from 0% to 80%.
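The parallel-generation idea can be illustrated with a toy Python sketch. This is not DiffusionGemma's sampler: `toy_denoiser` is a hypothetical stub that peeks at a fixed `TARGET` string instead of running a network, and the "commit the most confident positions each step" schedule is an assumption borrowed from masked-diffusion-style decoders. The point is only that a block is refined in a few full-block passes rather than one pass per token.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
# Hypothetical stand-in for "what a trained model would prefer to write".
TARGET = "parallel block denoising"


def toy_denoiser(block, rng):
    """Stub for the network: propose a (token, confidence) pair for every position at once.

    A real model would compute these from the noisy block using bidirectional
    attention; this stub ignores the block contents and reads TARGET.
    """
    return [(TARGET[i], rng.random()) for i in range(len(block))]


def generate_block(length, steps, seed=0):
    """Refine a random block over `steps` passes; returns (text, number_of_passes)."""
    rng = random.Random(seed)
    block = [rng.choice(VOCAB) for _ in range(length)]  # start from random noise
    committed = [False] * length
    per_step = -(-length // steps)  # ceiling division: positions to commit per pass
    passes = 0
    for _ in range(steps):
        proposals = toy_denoiser(block, rng)  # one pass covers the whole block
        passes += 1
        open_positions = [i for i in range(length) if not committed[i]]
        # Commit the open positions the denoiser is most confident about.
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:per_step]:
            block[i] = proposals[i][0]
            committed[i] = True
    return "".join(block), passes


text, passes = generate_block(len(TARGET), steps=4)
print(f"{len(text)} tokens in {passes} passes: {text!r}")
```

An autoregressive decoder would need one pass per token for the same block. Note that this toy freezes a position once it is committed, whereas the self-correction described above implies that a real diffusion model can still revise earlier choices.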

Performance and Quality Trade-offs

  • Benchmarks: On several standard benchmarks, DiffusionGemma's generation quality is lower than the autoregressive Gemma 4 of the same size. Google explicitly states that standard Gemma 4 remains the top choice for high-quality production output.
  • Use Cases: DiffusionGemma targets speed-sensitive local interaction scenarios like inline editing, code completion, rapid iteration, and nonlinear text structure generation. In high-concurrency cloud services, autoregressive models can fully utilize compute through batching, potentially diminishing the parallel advantage of diffusion models.
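The batching argument can be made concrete with a rough roofline-style estimate in Python. Everything about the hardware here is an assumption: the bandwidth and FLOP figures are round illustrative numbers rather than the specs of any GPU, 16-bit weights are assumed, and `DENOISE_STEPS = 64` is a hypothetical value chosen so that the single-user case reproduces the quoted 4x. The model also ignores attention and KV-cache costs as well as MoE routing effects.

```python
ACTIVE_PARAMS = 3.8e9     # active parameters per token (from the article)
BLOCK_TOKENS = 256        # block size (from the article)
BYTES_PER_PARAM = 2.0     # assumption: 16-bit weights
MEM_BANDWIDTH = 3.0e12    # assumption: bytes/s, illustrative round figure
PEAK_FLOPS = 1.0e15       # assumption: FLOP/s, illustrative round figure
DENOISE_STEPS = 64        # assumption: forward passes needed per block


def pass_time_s(tokens_in_pass):
    """One forward pass costs whichever is slower: streaming the weights or doing the math."""
    memory_time = ACTIVE_PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH
    compute_time = 2.0 * ACTIVE_PARAMS * tokens_in_pass / PEAK_FLOPS
    return max(memory_time, compute_time)


def autoregressive_tok_s(batch):
    """Each pass yields one new token per sequence in the batch."""
    return batch / pass_time_s(batch)


def diffusion_tok_s(batch):
    """Each pass touches a full block per sequence, and a block takes DENOISE_STEPS passes."""
    tokens_in_pass = batch * BLOCK_TOKENS
    return tokens_in_pass / (DENOISE_STEPS * pass_time_s(tokens_in_pass))


def speedup(batch):
    return diffusion_tok_s(batch) / autoregressive_tok_s(batch)


for batch in (1, 8, 64, 512):
    print(f"batch {batch:>3}: diffusion/autoregressive throughput ratio = {speedup(batch):.2f}")
```

Under these assumptions the diffusion model wins at batch size 1, where autoregressive decoding is bound by memory bandwidth, and loses its edge once batching lets the autoregressive model fill the same compute. The exact crossover depends entirely on the assumed numbers.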

Ecosystem Support and Open Source

  • Framework Compatibility: Already supported by inference frameworks such as vLLM, MLX, Unsloth, and NeMo; llama.cpp integration is underway.
  • Hardware Coverage: Supported across NVIDIA hardware, from the RTX 4090 to the H100 and DGX Spark.
  • Open Source License: Apache 2.0, with weights available on Hugging Face for commercial use.

Industry Context

Diffusion text models are not entirely new. In February, startup Inception Labs released Mercury 2, claiming speeds 5-10x faster than Claude and Gemini. Google had previously showcased Gemini Diffusion at last year's I/O, reporting a sampling speed of 1479 tokens/s, but then went quiet. With DiffusionGemma's release and broad ecosystem support, Google is actively pushing diffusion models toward practical text generation.
