Google has released Gemini 3.1 Flash TTS, a new text-to-speech model that the company describes as its most natural and expressive to date. The model is available in preview through the Gemini API, Google AI Studio, Vertex AI for enterprise users, and Google Vids for Workspace users.
The model supports more than 70 languages and can handle conversations between multiple speakers. On the Artificial Analysis TTS leaderboard, which measures blind human preferences, it achieved an Elo score of 1,211. According to Artificial Analysis, the model falls in the highest-value category for its combination of output quality and cost.
Audio tags: directing speech with plain text
The headline feature of the new model is audio tags: short, plain-language instructions embedded directly into a script that control how the model reads the text. Developers and users can specify tone, pace, accent, and non-verbal sounds without writing any code. Examples include:
- “[Read this like you’re excited]: Your script here.”
- “This [pause] is amazing!”
- “[laugh] That was a great point.”
Google describes this approach as placing the developer in a “director’s chair.” In Google AI Studio, users can define scene context, assign individual speaker profiles, and export the final configuration as API code for use across projects.
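The tag convention above is just plain text, so assembling a multi-speaker script is ordinary string handling. A minimal sketch follows; the bracketed tag names are taken from the examples in this article, while the `Speaker: line` labeling format and the helper names are assumptions for illustration, not a documented API:

```python
def tag(instruction: str, text: str) -> str:
    """Prefix text with a bracketed audio tag, e.g. [laugh] or [pause]."""
    return f"[{instruction}] {text}"

def scripted_line(speaker: str, instruction: str, text: str) -> str:
    """Pair a speaker label with a tagged line for a multi-speaker script."""
    return f"{speaker}: {tag(instruction, text)}"

# Build a two-speaker script using tags from the article's examples.
script = "\n".join([
    scripted_line("Host", "Read this like you're excited", "Welcome back!"),
    scripted_line("Guest", "laugh", "That was a great point."),
])
```

The resulting string is what would be sent as the `contents` of a TTS request; in Google AI Studio the equivalent scene and speaker setup can be exported as API code directly.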
Enterprise and consumer rollout
For enterprise customers on Vertex AI, audio tags are available as part of the preview rollout. Google Workspace users get access through Google Vids, where the feature supports 30 new voice options across 24 languages. The update expands supported languages in Vids to include Arabic, Bengali, Hindi, Russian, Ukrainian, and others, joining previously available languages such as English, Spanish, French, and German.
In Google Vids, the voiceover tool generates speech one scene at a time or across all scenes at once. Scripts are limited to 2,500 characters per voiceover and support plain text only. The interface also flags outdated voiceovers when a script has been changed after audio was generated.
Pricing and watermarking
According to The Decoder, the paid API tier costs $1.00 per million tokens for text input and $20.00 per million tokens for audio output. Batch processing cuts those rates in half. A free tier exists but allows Google to use the data for product improvement. On the paid tier, Google states it does not use the data for this purpose.
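At those rates, a request's cost is simple arithmetic. A small sketch, using the paid-tier prices reported above; the token counts in the example are hypothetical:

```python
# Paid-tier rates reported by The Decoder (USD per million tokens).
TEXT_IN_PER_M = 1.00
AUDIO_OUT_PER_M = 20.00

def tts_cost(text_tokens: int, audio_tokens: int, batch: bool = False) -> float:
    """Estimate request cost; batch processing halves both rates."""
    cost = (text_tokens / 1e6) * TEXT_IN_PER_M \
         + (audio_tokens / 1e6) * AUDIO_OUT_PER_M
    return round(cost * (0.5 if batch else 1.0), 6)

# Hypothetical request: 10,000 input tokens, 50,000 audio output tokens.
standard = tts_cost(10_000, 50_000)          # 1.01 USD
batched = tts_cost(10_000, 50_000, batch=True)  # 0.505 USD
```

As the example shows, audio output dominates the bill: at 20x the input rate, even a modest voiceover costs far more on the output side than the input side.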
All audio generated by the model is tagged with SynthID, Google’s imperceptible watermark for AI-generated content. Google says the watermark is embedded directly into the audio and allows reliable detection of AI-generated material.
Sources: Google Blog, Google Workspace Updates, Google Help, The Decoder