Google DiffusionGemma generates text up to four times faster than standard models

Google has released DiffusionGemma, an experimental open AI model that takes a fundamentally different approach to generating text. Brendan O’Donoghue and Sebastian Flennerhag write for Google’s blog The Keyword that the model can produce text up to four times faster than conventional large language models on dedicated GPUs. It is available under an Apache 2.0 license.

Most AI language models work like a typewriter: they generate one token (a word or word fragment) at a time, from left to right. DiffusionGemma works differently. Instead of producing tokens sequentially, it drafts an entire block of 256 tokens at once and then refines the result over multiple passes. Google compares this to upgrading from a typewriter to a printing press.
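The draft-and-refine idea can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm, which the announcement does not spell out. It assumes a simple masked-refinement schedule in which every position starts as a placeholder and each pass commits the most confident proposals. The names toy_denoiser and generate_block are hypothetical, and the "model" here simply peeks at a known target with random confidences.

```python
import random

MASK = "<mask>"

def toy_denoiser(block, target, rng):
    """Stand-in for a real model (hypothetical): propose a token and a
    confidence score for every position that is still masked."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }

def generate_block(target, num_steps=8, seed=0):
    """Fill a whole block in a fixed number of parallel passes.

    Returns the finished block and the number of passes actually used.
    """
    rng = random.Random(seed)
    block = [MASK] * len(target)
    passes = 0
    for step in range(num_steps):
        if MASK not in block:
            break  # nothing left to refine
        proposals = toy_denoiser(block, target, rng)
        passes += 1
        # Commit an even share of the remaining positions, most confident
        # first; the last pass commits everything that is still masked.
        remaining_steps = num_steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            block[i] = proposals[i][0]
    return block, passes

if __name__ == "__main__":
    target = [f"tok{i}" for i in range(256)]
    block, passes = generate_block(target, num_steps=8)
    print(f"filled {len(block)} tokens in {passes} passes")
```

The point of the sketch is the cost model: a left-to-right generator would need one model call per token, 256 in this case, while the block approach needs one call per refinement pass.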

Who benefits and who should wait

This approach pays off most on local hardware with a single user. On an NVIDIA H100 data center GPU, the model exceeds 1,000 tokens per second. On a consumer GeForce RTX 5090, it still achieves over 700 tokens per second. In cloud environments serving many users at once, the advantage shrinks and costs can rise.
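For interactive use, throughput is easier to judge as waiting time. The following back-of-the-envelope sketch converts the quoted figures into the time needed for one 256-token block. It assumes throughput stays constant over the block and treats the quoted numbers as lower bounds. The function name is illustrative.

```python
def seconds_for_tokens(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce num_tokens at a constant throughput."""
    return num_tokens / tokens_per_second

if __name__ == "__main__":
    block_size = 256  # block length quoted in the article
    # Throughput figures quoted in the article (both stated as minimums).
    for name, tps in [("H100", 1000.0), ("RTX 5090", 700.0)]:
        t = seconds_for_tokens(block_size, tps)
        print(f"{name}: at most about {t:.2f} s per {block_size}-token block")
```

Under these assumptions a full block arrives in roughly a quarter of a second on the H100 and a little over a third of a second on the RTX 5090.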

The model is a 26-billion-parameter Mixture of Experts architecture, but it activates only 3.8 billion parameters during inference. When quantized, it fits within 18 GB of video memory, making it compatible with high-end consumer graphics cards.
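The memory figure can be sanity-checked with simple arithmetic. The announcement does not state which quantization level yields 18 GB, so the bit widths below are assumptions chosen for illustration, and 1 GB is taken as 10^9 bytes. Note that all 26 billion parameters must be held in memory even though only 3.8 billion are active per token.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def bits_per_param_for_budget(budget_gb: float, num_params: float) -> float:
    """Average bits per parameter if the whole budget went to weights."""
    return budget_gb * 1e9 * 8 / num_params

if __name__ == "__main__":
    total_params = 26e9  # total parameter count quoted in the article
    for bits in (16, 8, 4):  # assumed precisions, not from the announcement
        print(f"{bits:>2}-bit weights: {weight_memory_gb(total_params, bits):.0f} GB")
    print(f"18 GB budget: {bits_per_param_for_budget(18, total_params):.1f} bits per parameter")
```

At 16 bits the weights alone would need 52 GB and at 4 bits 13 GB, so an 18 GB footprint is consistent with roughly 4 to 5 bits per weight plus working memory for activations.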

Google positions DiffusionGemma for developers working on speed-critical and interactive tasks such as inline text editing, code completion, and non-linear content structures. Because the model generates all tokens in parallel, every token can reference every other token in the block, which benefits tasks like code infilling or solving Sudoku puzzles. Standard Gemma 4 remains the recommended choice for applications that require the highest output quality, as DiffusionGemma trades some accuracy for speed.

The model weights are available on Hugging Face. Supported inference frameworks include MLX, vLLM, and Hugging Face Transformers. Support for llama.cpp has been announced but is not yet available.
