French AI company Mistral has released Voxtral TTS, an open-weight text-to-speech model aimed at enterprise use cases such as customer support, sales, and real-time translation. Unlike competitors such as ElevenLabs, Deepgram, and OpenAI, Mistral is releasing the full model weights, allowing companies to run the system on their own infrastructure without sending data to a third party.
The model is built on a 3.4-billion-parameter architecture and requires roughly three gigabytes of memory when optimized for inference. According to Mistral, it can run in real time on a laptop, a smartphone, or even older hardware. It supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
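For a rough sense of how a 3.4-billion-parameter model can fit in about three gigabytes, the back-of-the-envelope sketch below estimates weight storage at several numeric precisions. The precisions listed are illustrative assumptions; the reporting does not say which optimization Mistral applies, and the estimate ignores activations and other runtime overhead.

```python
# Rough estimate of weight storage only. The bit widths are assumptions
# for illustration; Mistral's actual optimization is not described.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

PARAMS = 3.4e9  # parameter count reported for Voxtral TTS

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights would need about 6.8 GB and 8-bit weights about 3.4 GB, so the reported figure of roughly three gigabytes is in line with storing weights at around eight bits or fewer each.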
Voxtral TTS can adapt to a custom voice using less than five seconds of reference audio and can switch between languages while preserving the speaker’s vocal characteristics. Mistral says the model produces its first audio output within 90 milliseconds and generates speech at six times real-time speed.
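To put the speed figures in concrete terms, the short sketch below converts the claimed six-times-real-time rate into compute time for a clip of a given length. The function name and the 30-second example are hypothetical, and the calculation assumes the rate holds steadily for the whole clip, which the reporting does not confirm.

```python
# Illustrative arithmetic based on Mistral's stated figures; the helper
# name and the sample clip length are assumptions, not part of any API.

FIRST_AUDIO_LATENCY_S = 0.090  # claimed time to first audio output
REAL_TIME_SPEEDUP = 6.0        # claimed generation speed relative to playback

def synthesis_time_s(audio_seconds: float, speedup: float = REAL_TIME_SPEEDUP) -> float:
    """Return the compute time needed to generate a clip of the given length."""
    return audio_seconds / speedup

clip_seconds = 30.0
print(f"Compute time for a {clip_seconds:.0f}s clip: {synthesis_time_s(clip_seconds):.1f}s")
print(f"Playback could begin after about {FIRST_AUDIO_LATENCY_S * 1000:.0f} ms")
```

Because the claimed generation rate is faster than playback, a streaming setup could in principle start speaking after the first-audio delay and stay ahead of the listener for the rest of the clip.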
In human evaluations conducted by Mistral, listeners preferred Voxtral TTS over ElevenLabs Flash v2.5 roughly 63 percent of the time on standard voices and nearly 70 percent of the time on voice customization tasks. ElevenLabs has not commented on these results.
Pierre Stock, Mistral’s VP of science operations, described voice as a central interface for AI agents going forward. “We see audio as a big bet and as a critical and maybe the only future interface with all the AI models,” he told VentureBeat.
Voxtral TTS complements Mistral’s existing Voxtral Transcribe speech-to-text model, moving the company toward a complete voice pipeline for enterprise customers.
Sources: TechCrunch, VentureBeat