Mistral AI has released Voxtral Transcribe 2, a family of speech-to-text models designed for both batch processing and real-time transcription. The company positions the technology as more accurate and significantly cheaper than competing services while enabling on-device processing for sensitive data.
The release includes two models. Voxtral Mini Transcribe V2 handles pre-recorded audio files at $0.003 per minute, which Mistral claims is approximately one-fifth the cost of major competitors. Voxtral Realtime processes live audio with latency configurable down to 200 milliseconds. Both models support 13 languages including English, Chinese, Hindi, Spanish, and Arabic.
Pierre Stock, Mistral’s vice president of science operations, emphasized the privacy advantage of on-device processing. The Realtime model has 4 billion parameters, making it small enough to run on smartphones and laptops without transmitting audio to remote servers. That addresses data-sovereignty concerns in regulated industries such as healthcare and finance.
The company says Voxtral Realtime ships under an Apache 2.0 open-source license, allowing developers to download and modify the model weights without licensing fees. For developers who prefer a hosted service, API access to the Realtime model costs $0.006 per minute.
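To put the two per-minute prices in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the 1,000-hour monthly volume is a hypothetical figure, the function name is invented for the example, and the sketch assumes the $0.006 rate applies to the Realtime API and that billing is linear in audio minutes with no minimums or volume discounts.

```python
from decimal import Decimal

# Per-minute prices quoted in the article (USD).
BATCH_PRICE_PER_MIN = Decimal("0.003")     # Voxtral Mini Transcribe V2 (pre-recorded audio)
REALTIME_PRICE_PER_MIN = Decimal("0.006")  # Voxtral Realtime via API (assumed from context)


def transcription_cost(minutes: int, price_per_min: Decimal) -> Decimal:
    """Return the cost in USD for a given number of audio minutes.

    Assumes simple linear pricing; real invoices may differ.
    """
    return Decimal(minutes) * price_per_min


# Hypothetical workload: 1,000 hours of audio per month.
hours = 1000
minutes = hours * 60

batch_cost = transcription_cost(minutes, BATCH_PRICE_PER_MIN)
realtime_cost = transcription_cost(minutes, REALTIME_PRICE_PER_MIN)

print(f"Batch:    ${batch_cost:.2f}")
print(f"Realtime: ${realtime_cost:.2f}")
```

At these rates the hosted Realtime model costs exactly twice as much per minute as the batch model, so the choice between them is largely a question of whether live latency is needed.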
Enterprise features include speaker diarization, which identifies who spoke when, and context biasing, which allows customers to provide lists of specialized terminology the model should favor during transcription. Mistral says the models maintain accuracy in high-noise environments like factory floors and call centers.
The company claims its models achieve lower word error rates than offerings from OpenAI, Google, and specialized transcription services. Mistral has released an audio playground in Mistral Studio where developers can test the technology.
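For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below implements that standard definition with a word-level edit distance. It is not Mistral's evaluation code, and it skips the text normalization (casing, punctuation, number formatting) that published benchmarks typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (substitutions + deletions + insertions) / reference word count.

    Words are split on whitespace with no normalization, which is a
    simplification compared with published benchmark setups.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Usage: one dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is worth keeping in mind when comparing vendors' reported figures.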
Sources: Mistral News, VentureBeat