OpenAI has introduced three new AI models designed to enhance speech-to-text and text-to-speech capabilities. The two transcription models, gpt-4o-transcribe and gpt-4o-mini-transcribe, offer improved accuracy, while gpt-4o-mini-tts gives developers new customization options for building voice applications.
According to OpenAI, the new transcription models significantly outperform their predecessor, Whisper, particularly in noisy environments and with various accents. The company’s internal benchmarks show the gpt-4o-transcribe model achieves a word error rate of just 2.46% in English, though performance varies across languages. For some Indic and Dravidian languages, the error rate approaches 30%.
The text-to-speech model, gpt-4o-mini-tts, delivers more realistic-sounding speech and offers greater control over vocal qualities. Developers can customize how the AI speaks through natural language instructions like “speak like a mad scientist” or “use a serene voice, like a mindfulness teacher.”
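Those natural-language style prompts are passed alongside the text to synthesize. As a minimal sketch, the request body for OpenAI's speech endpoint might be assembled like this (the field names follow the public `/v1/audio/speech` API; the helper name, example text, and `alloy` voice choice are illustrative assumptions, not from the article):

```python
import json

def build_tts_request(text: str, style_instructions: str,
                      model: str = "gpt-4o-mini-tts",
                      voice: str = "alloy") -> str:
    """Return a JSON body for a text-to-speech request, where
    `instructions` carries the natural-language style prompt."""
    payload = {
        "model": model,
        "voice": voice,
        "input": text,
        "instructions": style_instructions,  # e.g. "speak like a mad scientist"
    }
    return json.dumps(payload)

body = build_tts_request(
    "Your appointment is confirmed for 3 p.m. tomorrow.",
    "use a serene voice, like a mindfulness teacher",
)
```

The same request with a different `instructions` string yields a different delivery of the identical text, which is the customization the model adds over fixed preset voices.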
Key features and applications
These models integrate with OpenAI’s API, allowing developers to implement voice interactions with minimal code changes. Jeff Harris, a technical staff member at OpenAI, explained during a demonstration that existing text-based applications can add voice capabilities with just “nine lines of code.”
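For an existing app already transcribing audio through OpenAI's SDK, the upgrade can be little more than a model-name change. A hedged sketch, assuming the SDK's `client.audio.transcriptions.create(...)` call shape (the helper and file names below are hypothetical; the real call takes an open file object rather than a filename):

```python
def transcription_request(audio_filename: str,
                          model: str = "gpt-4o-transcribe") -> dict:
    """Keyword arguments one would unpack into the SDK call
    client.audio.transcriptions.create(**params)."""
    return {
        "model": model,          # previously "whisper-1" in Whisper-based apps
        "file": audio_filename,  # sketch: pass an open binary file handle here
    }

# Swapping in the cheaper model is a one-word change:
small = transcription_request("meeting.wav", model="gpt-4o-mini-transcribe")
```

This keeps the rest of the application untouched, which is the "minimal code changes" point Harris was making in the demonstration.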
The company positions these models as ideal for:
- Customer service call centers
- Meeting transcription
- AI-powered assistants
- E-commerce applications with voice interaction
Pricing and availability
The new models are immediately available through OpenAI’s API with the following pricing structure:
- gpt-4o-transcribe: $6.00 per million audio input tokens (approximately $0.006 per minute)
- gpt-4o-mini-transcribe: $3.00 per million audio input tokens (approximately $0.003 per minute)
- gpt-4o-mini-tts: $0.60 per million text input tokens, $12.00 per million audio output tokens (approximately $0.015 per minute)
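The per-minute approximations above imply roughly 1,000 audio tokens per minute of speech ($0.006 ÷ $6.00 × 1,000,000). A back-of-the-envelope cost check, assuming that rate (it is implied by OpenAI's own approximations, not an official constant):

```python
# Implied by the article's pricing: ~1,000 audio tokens per minute of speech.
AUDIO_TOKENS_PER_MINUTE = 1_000

def transcription_cost(minutes: float, price_per_million_tokens: float) -> float:
    """Estimated dollar cost of transcribing `minutes` of audio."""
    tokens = minutes * AUDIO_TOKENS_PER_MINUTE
    return tokens / 1_000_000 * price_per_million_tokens

# One hour of audio with gpt-4o-transcribe at $6.00 per million tokens:
cost = transcription_cost(60, 6.00)  # -> 0.36 dollars
```

By the same estimate, an hour with gpt-4o-mini-transcribe at $3.00 per million tokens comes to about $0.18.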
Unlike Whisper, which was released as open-source software, OpenAI will not make these new transcription models openly available. Harris noted that the models are “much bigger than Whisper” and not suitable for running locally on devices.
The models face competition from specialized speech AI companies like ElevenLabs, which offers its Scribe model with similar pricing, and Hume AI, which provides customizable text-to-speech with its Octave TTS model.
Early adopters report promising results. EliseAI, a property management automation company, claims the new text-to-speech model has enabled more natural interactions with tenants. Decagon, which builds AI voice experiences, reports a 30% improvement in transcription accuracy.
OpenAI states these models fit into its broader vision of building automated systems that can independently accomplish tasks for users, with more “agent” applications expected in the coming months.
Sources: TechCrunch, VentureBeat