Nvidia has launched a new open-source automatic speech recognition (ASR) model called Parakeet-TDT-0.6B-v2. According to VentureBeat reporter Carl Franzen, the model can transcribe 60 minutes of audio in just one second when running on Nvidia’s GPU hardware. The new model currently tops the Hugging Face Open ASR Leaderboard with a word error rate of only 6.05%.
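To make the two headline numbers concrete, here is a minimal Python sketch of how a word error rate and a speed-over-real-time figure can be computed. The function names are illustrative, not part of any Nvidia or Hugging Face tooling. The sketch only lowercases and splits on whitespace, whereas leaderboard evaluations typically apply fuller text normalization before scoring, so it will not reproduce the 6.05% figure exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: simple lowercasing and whitespace tokenization only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds
```

Under this definition, a 6.05% word error rate corresponds to roughly six word-level mistakes per hundred reference words, and transcribing 60 minutes of audio in one second corresponds to a speedup of 3,600 times real time.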
Released on May 1, 2025, Parakeet-TDT-0.6B-v2 is available under a Creative Commons CC-BY-4.0 license, making it free for commercial use. This gives developers and companies an alternative to proprietary models like OpenAI’s GPT-4o-transcribe and ElevenLabs Scribe, which have slightly lower error rates but are not freely available.
The 600-million-parameter model supports punctuation, capitalization, and word-level timestamping. It was trained on the Granary dataset, which includes approximately 120,000 hours of English audio from various sources. Nvidia plans to make this dataset publicly available following its presentation at Interspeech 2025.
The model is also relatively lightweight: it can run on systems with as little as 2GB of RAM, though it performs best on Nvidia GPUs such as the A100, H100, T4, and V100.
Developers can access Parakeet-TDT-0.6B-v2 through Hugging Face or Nvidia’s NeMo toolkit, with documentation available for setup and use. The model is designed for applications including transcription services, voice assistants, subtitle generation, and conversational AI platforms.
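As a rough illustration of the NeMo route, the sketch below loads the model by its Hugging Face ID and transcribes a list of audio files. It assumes the NeMo toolkit with its ASR collection is installed, that the model ID is `nvidia/parakeet-tdt-0.6b-v2`, and that the input files are in a format NeMo accepts. The helper name `transcribe_files` is hypothetical. Treat this as a starting point and check the model card for current usage.

```python
from typing import List

# Assumed Hugging Face model ID for Parakeet-TDT-0.6B-v2.
MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v2"


def transcribe_files(paths: List[str]) -> List[str]:
    """Transcribe audio files with NeMo and return one text string per file.

    Hypothetical helper: NeMo is imported lazily so this module can be
    imported even where the toolkit is not installed.
    """
    import nemo.collections.asr as nemo_asr

    # Downloads the checkpoint on first use, then loads it.
    model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)
    results = model.transcribe(paths)

    # Depending on the NeMo version, results may be plain strings or
    # hypothesis objects with a `.text` attribute; handle both.
    return [getattr(result, "text", result) for result in results]


# Example usage (requires NeMo and a local audio file):
#   texts = transcribe_files(["meeting.wav"])
#   print(texts[0])
```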