Text To Speech (TTS), also known as speech synthesis, is a process in which text is converted into a human-sounding voice. Developers and business users alike use TTS to turn traditional human-to-human interactions into seamless, machine-to-human interactions, and make every interaction over voice a frictionless and first-class experience.
Instead of recording audio files with human voices to play back in a call, which has limited flexibility and is not a scalable option, TTS prompts can be dynamically, programmatically generated from raw text as a response to events in your application. Whether the use case is an Interactive Voice Response (IVR), a conversational assistant for scaling contact centers, or Voice notifications to deliver critical messages over a phone call, Text To Speech capabilities enable efficiency at global scale while enhancing customer engagement.
You can provide text and Twilio will synthesize speech in real time and speak back the audio in any call or conference. TTS is available via the <Say> TwiML verb and Studio's Say/Play Widget.
The <Say> verb allows you to provide plain text that Twilio converts to synthesized speech.
For example, when Twilio executes the following TwiML during a call, the caller hears "Hello world!" The synthesized voice the caller hears is the default voice and language of the Twilio Account (configured in the Twilio Console).
<Response> <Say>Hello world!</Say> </Response>
<Say> also allows you to modify the language, accent, and voice of the synthesized speech via the language and voice attributes. The example below uses Amazon Polly's "Joanna" voice and American English:
<Response> <Say language="en-US" voice="Polly.Joanna">Hello. I am Joanna and I speak American English!</Say> </Response>
<Say> offers different options for voices, each with its own supported set of languages and genders, so you can customize your application with Text To Speech capabilities according to your needs and preferences.
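The two TwiML examples above can also be generated programmatically. The following is a minimal sketch that uses only Python's standard library rather than a Twilio helper library; the function name `build_say_twiml` is hypothetical and chosen for illustration. Attributes that are not passed are left out of the TwiML, so the Console defaults would apply.

```python
import xml.etree.ElementTree as ET


def build_say_twiml(text, language=None, voice=None):
    """Return a TwiML document containing a single <Say> verb.

    Attributes left as None are omitted from the output.
    """
    response = ET.Element("Response")
    attributes = {}
    if language is not None:
        attributes["language"] = language
    if voice is not None:
        attributes["voice"] = voice
    say = ET.SubElement(response, "Say", attributes)
    say.text = text
    return ET.tostring(response, encoding="unicode")


# No attributes: relies on the Account-wide defaults.
print(build_say_twiml("Hello world!"))

# Explicit language and voice, as in the Joanna example.
print(build_say_twiml(
    "Hello. I am Joanna and I speak American English!",
    language="en-US",
    voice="Polly.Joanna",
))
```

Because the text is set through `ElementTree`, characters such as `&` or `<` in the message are escaped automatically, which hand-built strings would not guarantee.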
To start using TTS, complete the following steps:
- Configure your Account-wide Text To Speech Settings in the Twilio Console.
- Use <Say> to programmatically define TTS instructions.
Twilio Studio is a visual, serverless tool that uses Widgets to represent various parts of Twilio's platform features and functionality to design and build applications with little or no code.
The Say/Play Widget allows you to add Text To Speech capabilities to your application with ease, including embedding SSML for supported voices.
To start using TTS with Studio, complete the following steps:
Twilio's Text To Speech offering has a variety of different voices in multiple languages and locales with their associated accents and pronunciations. There are three types of voices with different quality, language coverage and pricing: Basic, Standard and Premium.
Basic voices are first-generation voices. They can be used to get started and familiarize yourself with Text To Speech capabilities using <Say>, but may not have enough human-like qualities to build conversational applications and deliver superior user experiences over a voice call. The voices in this tier are available in a limited number of languages at no cost.
Standard voices offer standard TTS technology and produce natural-sounding synthesized speech with a variety of lifelike voices. The voices in this tier are provided by Amazon (Amazon Polly) and Google (Standard), with support for SSML (Speech Synthesis Markup Language), which allows developers to control many aspects of the synthesized speech.
Premium voices are generated using the latest technology and innovation in synthesized speech, providing the most human-like, expressive and natural-sounding text-to-speech voices possible, with higher quality than Standard voices. The voices in this tier are provided by Amazon (Amazon Polly Neural) and Google (WaveNet, Neural2), with support for SSML, which allows developers to control many aspects of the synthesized speech.
See the Pricing section below for additional information.
Effective June 26, 2023, Alice voices are no longer supported for Text To Speech, and any request that uses them is redirected to an alternate voice. Update the configuration in your Console, Studio Flows, and backend application to remove any references to alice voices. For more information, visit the Changelog.
Google voices (Standard, WaveNet and Neural2) are available in Public Beta.
The following table contains all voices available for each language and locale. You can test the different voices from the TTS Settings page in the Twilio Console.
Note: An invalid combination of voice and language attributes may result in an error and <Say> instruction failure.
Voices listed with (*) are fully bilingual voices. At the moment only Amazon Polly has this capability for a limited number of voices. Learn more by visiting Amazon's Bilingual Voices documentation.
The Text To Speech page in the Twilio Console allows you to define a default voice and language for your Account. These defaults are used when no language or voice attribute is provided in your <Say> TwiML. If you are using Studio, the defaults are used when "Default" is selected.
You can test different voices and messages in this section of the Console.
In the screenshot above, the DEFAULT PROVIDER is set to Basic and the DEFAULT VOICE is set to Man, en-US. With these TTS settings, Twilio uses the Man voice and the en-US (American English) accent and pronunciation when executing the following TwiML:
<Response> <Say>Hello. I am a man!</Say> </Response>
Twilio updates its Text To Speech voice offering regularly. To get access to the latest voices without changing voice names in your code, use the Language Mapping feature. Your application only needs to supply the language and the text; Twilio automatically selects the voice mapped to that language, and you can update the mapping at any time from the Console.
On the Text To Speech page in the Console, you can set a voice for every locale. This means that you can specify the language without needing to specify the voice when using TTS capabilities in your application.
To set a voice for a locale, complete the following:
- Under the Current Language Mapping heading, click on the language/locale you wish to configure, e.g. English (British)(en-GB).
- In the Test & Configure Voices By Language modal, select the PROVIDER and VOICE you wish to use, e.g. Amazon Polly and Emma.
- Click Save.
- Repeat steps 1-3 for other languages/locales if necessary.
For example, if you configure English (British)(en-GB) to use Amazon Polly and Emma, Twilio uses the Amazon Polly Emma voice when executing <Say> with the language attribute set to en-GB and no voice attribute (see TwiML example below).
<Response> <Say language="en-GB">Hello. I am Emma!</Say> </Response>
The voice attribute allows you to override any default provider/voice settings that were configured in the Console (i.e. Account-level and Language Mapping defaults).
For example, if your Account's default TTS voice is Amazon Polly Salli but you want to use Amazon Polly Joanna for a specific call, set the voice attribute to Polly.Joanna:
<Response> <Say voice="Polly.Joanna">Hello. I am Joanna!</Say> </Response>
You can also use the voice attribute to override a Language Mapping's defaults.
For example, if your Language Mapping for English (British)(en-GB) uses Amazon Polly and Emma but you want to use the Amazon Polly Joanna voice for a specific <Say> instruction, you would use the voice attribute set to Polly.Joanna. The TwiML below causes Twilio to use the Amazon Polly Joanna voice, which overrides your Account's default Language Mapping:
<Response> <Say language="en-GB" voice="Polly.Joanna">Hello. I am Joanna!</Say> </Response>
The language attribute allows you to override any default language/locale settings that were configured in the Console.
For example, if your Account's default TTS Language is English (US) (en-US), but you wish to use German for a specific call, set the language attribute to de-DE in your TwiML:
<Response> <Say language="de-DE">Hallo. Ich spreche Deutsch!</Say> </Response>
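The precedence between the Account default, the Language Mapping, and the <Say> attributes can be modeled in a few lines. This is only an illustrative sketch of the behavior described in this section, not Twilio's implementation: the function `resolve_voice` and its parameters are hypothetical, and the fallback for a language that has no mapping is an assumption the text does not cover.

```python
def resolve_voice(account_default, language_mapping, language=None, voice=None):
    """Model which voice a <Say> instruction would use.

    account_default:  the Account-wide default voice from the Console
    language_mapping: dict of locale -> voice, as configured in the Console
    language, voice:  the attributes set on <Say>, or None if absent
    """
    # An explicit voice attribute overrides every Console default.
    if voice is not None:
        return voice
    # With only a language attribute, the Language Mapping decides.
    if language is not None and language in language_mapping:
        return language_mapping[language]
    # ASSUMPTION: anything else falls back to the Account default.
    return account_default


mapping = {"en-GB": "Polly.Emma"}
print(resolve_voice("Polly.Salli", mapping))
print(resolve_voice("Polly.Salli", mapping, language="en-GB"))
print(resolve_voice("Polly.Salli", mapping, language="en-GB", voice="Polly.Joanna"))
```

The three calls correspond to the three cases in this section: no attributes, a language attribute only, and a voice attribute that overrides the mapping.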
SSML support is only available in Standard and Premium voices.
Speech Synthesis Markup Language (SSML) uses XML-based tags that allow you to fine-tune the synthesized speech generated by TTS. SSML functionality includes the ability to: specify where pauses should be, provide pronunciations for acronyms, abbreviations, dates and times, and increase or decrease the speed at which text is spoken.
While the W3C specification covers many capabilities, Twilio currently only supports a subset of them.
In addition, SSML support (including tags and accepted values) may differ between TTS providers and may be limited to specific voices. Review the provider-specific SSML documentation and test your application. Use of unsupported SSML tags with any TTS provider may result in an error and <Say> instruction failure.
As per the SSML specification, the root element for SSML is <speak>. However, it is not needed when you are using SSML with <Say>, so you can skip <speak> and insert the rest of the SSML inside <Say>.
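If you already have SSML documents wrapped in <speak>, the wrapper can be removed mechanically. The sketch below uses only Python's standard library; the function name `ssml_to_say` is hypothetical, and it assumes the SSML is well-formed XML whose root is a plain <speak> element without an XML namespace.

```python
import xml.etree.ElementTree as ET


def ssml_to_say(ssml, voice):
    """Move the contents of an SSML <speak> root into a TwiML <Say> verb."""
    speak = ET.fromstring(ssml)
    if speak.tag != "speak":
        raise ValueError("expected <speak> as the SSML root element")
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", {"voice": voice})
    say.text = speak.text      # text that precedes the first child tag
    say.extend(list(speak))    # child tags, each with its trailing text
    return ET.tostring(response, encoding="unicode")


print(ssml_to_say('<speak>Hello <break time="1s"/> world</speak>', "Polly.Joanna"))
```

The result contains the same text and tags as the input, with <Say> taking the place of <speak>.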
The table below lists the supported SSML tags, but you should refer to the appropriate, provider-specific documentation to ensure you're using the SSML tags correctly.
|Action||SSML tag||Provider documentation|
|Add a pause||<break>||
|Specify another language for specific words||<lang>||
|Add a pause between paragraphs||<p>||
|Use phonetic pronunciation||<phoneme>||
|Control volume, speaking rate, and pitch||<prosody>||
|Add a pause between sentences||<s>||
|Control how special types of words are spoken||<say-as>||
|Pronounce acronyms and abbreviations||<sub>||
|Improve pronunciation by specifying parts of speech||<w>||Amazon Polly; Google N/A|
The <prosody> tag allows you to control the volume, rate, and pitch of synthesized speech.
<Response> <Say voice="Polly.Joanna"> Prosody can be used to change the way words sound. The following words are <prosody volume="x-loud"> quite a bit louder than the rest of this passage. </prosody> Each morning when I wake up, <prosody rate="x-slow">I speak slowly and deliberately until I have my coffee.</prosody> I can also change the pitch of my voice using prosody. Do you like <prosody pitch="+5%"> speech with a pitch higher,</prosody> or <prosody pitch="-10%"> is a lower pitch preferable?</prosody> </Say> </Response>
The <say-as> tag allows you to indicate specific categories of text, so that the synthesized speech pronounces the text correctly.
Without <say-as>, a phone number would be pronounced as a number, e.g. "four billion, one hundred fifty-five million, five hundred fifty-one thousand, two hundred twelve."
The TwiML example below uses <say-as> so that the synthesized speech reads the phone number as "four one five, five five five, one two one two."
<Response> <Say voice="Polly.Joanna">John’s phone number is, <say-as interpret-as="telephone">4155551212</say-as></Say> </Response>
The Helper Libraries can generate SSML inside TwiML, such as the following:
<Response> <Say voice="Polly.Joanna"> Hi <break strength="x-weak" time="100ms"/> <emphasis level="moderate">Words to emphasize</emphasis> <p>Words to speak</p> aaaaaa <phoneme alphabet="ipa" ph="pɪˈkɑːn">Words to speak</phoneme> bbbbbbb <prosody pitch="-10%" rate="85%" volume="-6dB">Words to speak</prosody> <s>Words to speak</s> <say-as interpret-as="spell-out">Words to speak</say-as> <sub alias="alias">Words to be substituted</sub> <w>Words to speak</w> </Say> </Response>
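As a stand-in for helper library code, here is a rough sketch that builds a shorter version of that TwiML with Python's standard library only. It does not use Twilio's helper library; the helper `add_ssml` is hypothetical, and its convention of writing `_` for `-` in attribute names is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def add_ssml(parent, tag, text=None, tail=None, **attributes):
    """Append an SSML element to parent.

    Keyword names use '_' where the SSML attribute has '-',
    e.g. interpret_as becomes interpret-as.
    """
    attrs = {name.replace("_", "-"): value for name, value in attributes.items()}
    element = ET.SubElement(parent, tag, attrs)
    element.text = text   # text inside the tag
    element.tail = tail   # plain text that follows the closing tag
    return element


response = ET.Element("Response")
say = ET.SubElement(response, "Say", {"voice": "Polly.Joanna"})
say.text = "Hi "
add_ssml(say, "break", tail=" ", strength="x-weak", time="100ms")
add_ssml(say, "emphasis", "Words to emphasize", tail=" ", level="moderate")
add_ssml(say, "prosody", "Words to speak", tail=" ",
         pitch="-10%", rate="85%", volume="-6dB")
add_ssml(say, "say-as", "Words to speak", interpret_as="spell-out")

twiml = ET.tostring(response, encoding="unicode")
print(twiml)
```

Building the tree element by element keeps the SSML well-formed, which matters because an unsupported or malformed tag can cause the <Say> instruction to fail.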
- There is a 4,000 character limit on text that <Say> can process with Basic voices (man, woman).
- Basic voices (man, woman) don’t support SSML tags.
- There is a 3,000 character limit on text (excluding SSML tags) that <Say> can process with Amazon Polly voices.
- Amazon-specific SSML tags, such as <amazon:effect>, are not currently supported.
- Lexicons are not supported in Amazon Polly voices.
- SSML support in Amazon TTS may vary between Polly and Polly Neural voices. Refer to the Amazon Polly SSML documentation for detailed information.
- SSML support in Google TTS may vary between Standard, WaveNet and Neural2 voices. Refer to the Google SSML documentation for detailed information.
- There is a 5,000 character limit on text, including SSML, that <Say> can process with Google voices.
- Google TTS includes SSML tags, newlines and spaces in the total character count, so they are billed.
- Google-specific SSML tags such as <par> or <seq> among others are not currently supported.
Note: Use of unsupported SSML tags with any TTS provider may result in an error and <Say> instruction failure. Review the provider-specific SSML documentation and test your application.
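A pre-flight length check can catch oversized messages before they reach a call. The sketch below applies the character limits listed above; the provider keys, function names, and the regular expression used to strip tags are assumptions made for illustration, and treating each limit as inclusive is also an assumption.

```python
import re

# Limits quoted in the list above. The keys are labels invented for this sketch.
CHARACTER_LIMITS = {"basic": 4000, "polly": 3000, "google": 5000}

# Rough pattern for an SSML tag: anything between '<' and the next '>'.
TAG_PATTERN = re.compile(r"<[^>]+>")


def counted_length(say_body, provider):
    """Return the number of characters that count toward the provider's limit."""
    if provider == "google":
        # Google counts everything, including SSML tags, newlines and spaces.
        return len(say_body)
    # The Amazon Polly limit applies to text without SSML tags.
    # Basic voices do not support SSML, so stripping tags changes nothing there.
    return len(TAG_PATTERN.sub("", say_body))


def fits_limit(say_body, provider):
    """Return True if the <Say> body is within the provider's character limit."""
    return counted_length(say_body, provider) <= CHARACTER_LIMITS[provider]


body = 'Hi <break time="1s"/> there'
print(counted_length(body, "polly"), counted_length(body, "google"))
print(fits_limit(body, "polly"))
```

The same body counts differently per provider, so a message that fits for Amazon Polly can still exceed the Google limit once its tags are included.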
Basic voices (man, woman) are free of charge.
Standard voices (Amazon Polly and Google Standard) pricing starts at $0.0008 per 100 characters with the following volume discounts:
|Minimum characters||Maximum characters||Price per 100 characters*|
* Usage is rounded up at the end of the call and priced in blocks of 100 characters. For example, if 546 characters are used on a call, usage is rounded up to 6 blocks and, at the starting rate, you’re charged $0.0048 for the use of Standard voices on that call. If fewer than 100 characters are used, you’re charged $0.0008, even for using just one character.
Premium voices (Amazon Polly Neural, Google WaveNet and Google Neural2) pricing starts at $0.0032 per 100 characters with the following volume discounts:
|Minimum characters||Maximum characters||Price per 100 characters*|
* Usage is rounded up at the end of the call and priced in blocks of 100 characters. For example, if 546 characters are used on a call, usage is rounded up to 6 blocks and, at the starting rate, you’re charged $0.0192 for the use of Premium voices on that call. If fewer than 100 characters are used, you’re charged $0.0032, even for using just one character.
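The block-based billing rule can be worked through in code. This sketch assumes that usage is rounded up to the next whole block of 100 characters (consistent with a single character being billed as a full block) and applies only the starting rates, ignoring volume discounts. The function and constant names are hypothetical.

```python
from decimal import Decimal

# Starting prices per block of 100 characters, before volume discounts.
STANDARD_RATE = Decimal("0.0008")
PREMIUM_RATE = Decimal("0.0032")

BLOCK_SIZE = 100


def call_charge(characters, rate_per_block):
    """Return the TTS charge for one call at a flat per-block rate."""
    if characters <= 0:
        return Decimal("0")
    # Integer ceiling division: 546 characters -> 6 blocks, 1 character -> 1 block.
    blocks = (characters + BLOCK_SIZE - 1) // BLOCK_SIZE
    return blocks * rate_per_block


print(call_charge(546, STANDARD_RATE))
print(call_charge(546, PREMIUM_RATE))
print(call_charge(1, STANDARD_RATE))
```

`Decimal` is used instead of floating point so that small per-block prices add up exactly.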