Inception Labs has unveiled Mercury, a new family of diffusion-based large language models (dLLMs) that the company claims can generate text up to 10 times faster than conventional autoregressive LLMs. The models reportedly process more than 1,000 tokens per second on NVIDIA H100 GPUs, a throughput previously achievable only with specialized hardware.
The company’s first publicly available model, Mercury Coder, is designed specifically for code generation. Inception Labs claims the model matches or exceeds the performance of speed-optimized models like GPT-4o Mini and Claude 3.5 Haiku on standard coding benchmarks, while operating at significantly higher speeds.
Unlike traditional autoregressive models that generate text one token at a time from left to right, diffusion models refine their output through multiple “denoising” steps in a coarse-to-fine approach. This allows Mercury to edit multiple tokens simultaneously during generation.
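Inception Labs has not published Mercury's architecture or decoding algorithm, but the coarse-to-fine idea can be illustrated with a toy masked-denoising loop. In the sketch below, everything is an illustrative assumption rather than a Mercury internal: toy_model is a hypothetical stand-in for a trained denoiser, and the vocabulary size and confidence-based unmasking schedule are invented for the example. The sequence starts fully masked, the stand-in model scores every position in one parallel pass, and the most confident predictions are committed at each of a handful of denoising steps.

```python
import numpy as np

# Toy sketch of diffusion-style (masked-denoising) text generation.
# NOTE: toy_model is a hypothetical stand-in for a trained denoiser; the
# real Mercury models and their decoding schedule are not public.

VOCAB = 100     # toy vocabulary size (assumption)
MASK = -1       # sentinel id for a still-masked position
SEQ_LEN = 16    # length of the sequence to generate
STEPS = 4       # denoising steps, vs. SEQ_LEN steps for autoregression

rng = np.random.default_rng(0)

def toy_model(tokens: np.ndarray) -> np.ndarray:
    """Stand-in denoiser: one parallel pass returns (SEQ_LEN, VOCAB) logits."""
    return rng.normal(size=(len(tokens), VOCAB))

tokens = np.full(SEQ_LEN, MASK)                 # start from pure "noise": all masked
for step in range(STEPS):
    logits = toy_model(tokens)                  # predict every position at once
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)       # softmax over the vocabulary
    best, conf = probs.argmax(-1), probs.max(-1)

    masked_idx = np.flatnonzero(tokens == MASK)
    k = int(np.ceil(len(masked_idx) / (STEPS - step)))  # unmasking budget this step
    # Commit the k most confident masked positions; the rest stay masked,
    # so the draft sharpens coarse-to-fine rather than left-to-right.
    commit = masked_idx[np.argsort(-conf[masked_idx])[:k]]
    tokens[commit] = best[commit]

assert (tokens != MASK).all()
print(tokens)   # a full 16-token sequence produced in 4 parallel steps
```

A production dLLM would presumably also be able to revise tokens committed in earlier steps, which is what makes the error correction Inception Labs describes possible; this sketch only unmasks monotonically for brevity.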
“Because diffusion models are not restricted to only considering previous output, they are better at reasoning and at structuring their responses,” Inception Labs stated in its announcement.
Mercury Coder is available for testing in a public playground. The company also offers enterprise access to both the code and generalist models via an API and on-premises deployments, with fine-tuning support.
Inception Labs suggests its technology will enable new capabilities, including more efficient AI agents, improved reasoning with error correction, controllable text generation, and better performance in resource-constrained environments like mobile devices.