Google has released Gemini 3.1 Flash TTS, a new text-to-speech model that the company describes as its most natural and expressive to date. The model is available in preview through the Gemini API, Google AI Studio, Vertex AI for enterprise users, and Google Vids for Workspace users.
The model supports more than 70 languages and can handle conversations between multiple speakers. On the Artificial Analysis TTS leaderboard, which measures blind human preferences, it achieved an Elo score of 1,211. According to Artificial Analysis, the model falls in the highest-value category for its combination of output quality and cost.
Audio tags: directing speech with plain text
The headline feature of the new model is audio tags: short, plain-language instructions embedded directly into a script that control how the model reads the text. Developers and users can specify tone, pace, accent, and non-verbal sounds without writing any code. Examples include:
- “[Read this like you’re excited]: Your script here.”
- “This [pause] is amazing!”
- “[laugh] That was a great point.”
Google describes this approach as placing the developer in a “director’s chair.” In Google AI Studio, users can define scene context, assign individual speaker profiles, and export the final configuration as API code for use across projects.
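The tag convention above is just plain text, so assembling a multi-speaker script is ordinary string handling. A minimal sketch follows; the bracketed tag names are taken from the examples in this article, while the `Speaker: line` labeling format and the helper names are assumptions for illustration, not a documented API:

```python
def tag(instruction: str, text: str) -> str:
    """Prefix text with a bracketed audio tag, e.g. [laugh] or [pause]."""
    return f"[{instruction}] {text}"

def scripted_line(speaker: str, instruction: str, text: str) -> str:
    """Pair a speaker label with a tagged line for a multi-speaker script."""
    return f"{speaker}: {tag(instruction, text)}"

# Build a two-speaker script using tags from the article's examples.
script = "\n".join([
    scripted_line("Host", "Read this like you're excited", "Welcome back!"),
    scripted_line("Guest", "laugh", "That was a great point."),
])
```

The resulting string is what would be sent as the `contents` of a TTS request; in Google AI Studio the equivalent scene and speaker setup can be exported as API code directly.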
Enterprise and consumer rollout
For enterprise customers on Vertex AI, audio tags are available as part of the preview rollout. Google Workspace users get access through Google Vids, where the feature supports 30 new voice options across 24 languages. The update expands supported languages in Vids to include Arabic, Bengali, Hindi, Russian, Ukrainian, and others, joining previously available languages such as English, Spanish, French, and German.
In Google Vids, the voiceover tool generates speech one scene at a time or across all scenes at once. Scripts are limited to 2,500 characters per voiceover and support plain text only. The interface also flags outdated voiceovers when a script has been changed after audio was generated.
Pricing and watermarking
According to The Decoder, the paid API tier costs $1.00 per million tokens for text input and $20.00 per million tokens for audio output. Batch processing cuts those rates in half. A free tier exists but allows Google to use the data for product improvement. On the paid tier, Google states it does not use the data for this purpose.
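At those rates, a request's cost is simple arithmetic. A small sketch, using the paid-tier prices reported above; the token counts in the example are hypothetical:

```python
# Paid-tier rates reported by The Decoder (USD per million tokens).
TEXT_IN_PER_M = 1.00
AUDIO_OUT_PER_M = 20.00

def tts_cost(text_tokens: int, audio_tokens: int, batch: bool = False) -> float:
    """Estimate request cost; batch processing halves both rates."""
    cost = (text_tokens / 1e6) * TEXT_IN_PER_M \
         + (audio_tokens / 1e6) * AUDIO_OUT_PER_M
    return round(cost * (0.5 if batch else 1.0), 6)

# Hypothetical request: 10,000 input tokens, 50,000 audio output tokens.
standard = tts_cost(10_000, 50_000)          # 1.01 USD
batched = tts_cost(10_000, 50_000, batch=True)  # 0.505 USD
```

As the example shows, audio output dominates the bill: at 20x the input rate, even a modest voiceover costs far more on the output side than the input side.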
All audio generated by the model is tagged with SynthID, Google’s imperceptible watermark for AI-generated content. Google says the watermark is embedded directly into the audio and allows reliable detection of AI-generated material.
Sources: Google Blog, Google Workspace Updates, Google Help, The Decoder