Text-to-Speech (TTS)
Text To Speech (TTS), also known as speech synthesis, is a process in which text is converted into a human-sounding voice. Developers and business users alike use TTS to turn traditional human-to-human interactions into seamless, machine-to-human interactions, and make every interaction over voice a frictionless and first-class experience.
Instead of recording audio files with human voices to play back in a call, which has limited flexibility and is not a scalable option, TTS prompts can be dynamically, programmatically generated from raw text as a response to events in your application. Whether the use case is an Interactive Voice Response (IVR), a conversational assistant for scaling contact centers, or Voice notifications to deliver critical messages over a phone call, Text To Speech capabilities enable efficiency at global scale while enhancing customer engagement.
Get Started with Text To Speech
You can provide text and Twilio will synthesize speech in real time and speak back the audio in any call or conference. TTS is available via the <Say> TwiML verb and Studio's Say/Play Widget.
TwiML
The <Say> verb allows you to provide plain text that Twilio converts to synthesized speech.
For example, when Twilio executes the following TwiML during a call, the caller hears "Hello world!" The synthesized voice the caller hears is the default voice and language of the Twilio Account (configured in the Twilio Console).
<Response>
<Say>Hello world!</Say>
</Response>
<Say> also allows you to modify the language, accent, and voice of the synthesized speech via the language and voice attributes. The example below uses Amazon Polly's "Joanna" voice and American English:
<Response>
<Say language="en-US" voice="Polly.Joanna">Hello. I am Joanna and I speak American English!</Say>
</Response>
<Say> offers different options for voices, each with its own supported set of languages and genders, so you can customize your application with Text To Speech capabilities according to your needs and preferences.
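If your application builds TwiML programmatically, the two examples above can be produced from a small helper. The following is a minimal sketch that uses only Python's standard library rather than a Twilio helper library; the function name build_say_twiml is hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_say_twiml(text, language=None, voice=None):
    """Return a TwiML document containing a single <Say> verb."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    # Only emit the attributes that were provided. Omitted attributes
    # fall back to the defaults configured in the Twilio Console.
    if language is not None:
        say.set("language", language)
    if voice is not None:
        say.set("voice", voice)
    say.text = text
    return ET.tostring(response, encoding="unicode")


print(build_say_twiml("Hello world!"))
print(build_say_twiml(
    "Hello. I am Joanna and I speak American English!",
    language="en-US",
    voice="Polly.Joanna",
))
```

Building the document with an XML library instead of string concatenation means characters such as & and < in the spoken text are escaped for you.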
To start using TTS, complete the following steps:
- Configure your Account-wide Text To Speech Settings in the Twilio Console.
- Use <Say> to programmatically define TTS instructions.
Studio
Twilio Studio is a visual, serverless tool that uses Widgets to represent various parts of Twilio's platform features and functionality to design and build applications with little or no code.
The Say/Play Widget allows you to add Text To Speech capabilities to your application with ease, including embedding SSML for supported voices.
To start using TTS with Studio, complete the following steps:
- Configure your Account-wide Text To Speech Settings in the Twilio Console.
- Use the Say/Play Widget to add TTS to your Studio Flow.
Text To Speech voices overview
Twilio's Text To Speech offering has a variety of different voices in multiple languages and locales with their associated accents and pronunciations. There are three types of voices with different quality, language coverage and pricing: Basic, Standard and Premium.
Basic voices
Basic voices are first-generation voices. They can be used to get started and familiarize yourself with Text To Speech capabilities using <Say>, but may not have enough human-like qualities to build conversational applications and deliver superior user experiences over a voice call. The voices in this tier are available in a limited number of languages at no cost.
Standard voices
Standard voices offer standard TTS technology and produce natural-sounding synthesized speech with a variety of lifelike voices. The voices in this tier are provided by Amazon (Amazon Polly), with support for SSML (Speech Synthesis Markup Language), which allows developers to control many aspects of the synthesized speech.
Premium voices
These voices are generated using the latest technology and innovation in synthesized speech, providing the most human-like, expressive and natural-sounding text-to-speech voices possible, with higher quality than Standard voices. The voices in this tier are provided by Amazon (Amazon Polly Neural) with support for SSML, which allows developers to control many aspects of the synthesized speech.
See the Pricing section below for additional information.
Available voices and languages
Effective June 26, 2023, Alice voices are no longer supported for Text To Speech, and any request that uses one is redirected to an alternate voice. Update the configuration in your Console, Studio Flows, and backend application to remove any references to alice voices. Learn more in the Help Center.
The following table contains all voices available for each language and locale. You can test the different voices from the TTS Settings page in the Twilio Console.
Note: An invalid combination of voice and language attributes may result in an error and <Say> instruction failure.
Voices listed with (*) are fully bilingual voices. At the moment only Amazon Polly has this capability for a limited number of voices. Learn more by visiting Amazon's Bilingual Voices documentation.
Text To Speech settings
Default voice and language
The Text To Speech page in the Twilio Console allows you to define a default voice and language for your Account. These defaults are used when no language or voice attribute is provided in your <Say> TwiML. If you are using Studio, the defaults are used when “Default” is selected.
You can test different voices and messages in this section of the Console.
For example, suppose the DEFAULT PROVIDER is set to Basic and the DEFAULT VOICE is set to Man, en-US. With these TTS settings, Twilio uses the Man voice and the en-US (American English) accent and pronunciation when executing the following TwiML:
<Response>
<Say>Hello. I am a man!</Say>
</Response>
Language mapping
Twilio updates its Text To Speech voice offering regularly. To get access to the latest voices without changing your code each time you switch voices, use the Language Mapping feature. Your application specifies only the language and the text, and Twilio uses the voice mapped to that language, which you can update at any time from the Console.
On the Text To Speech page in the Console, you can set a voice for every locale. This means that you can specify the language without needing to specify the voice when using TTS capabilities in your application.
To set a voice for a locale, complete the following:
1. Under the Current Language Mapping heading, click on the language/locale you wish to configure, e.g. English (British)(en-GB).
2. In the Test & Configure Voices By Language modal, select the PROVIDER and VOICE you wish to use, e.g. Amazon Polly and Emma.
3. Click Save.
4. Repeat steps 1-3 for other languages/locales if necessary.
For example, if you configure English (British)(en-GB) to use Amazon Polly and Emma, Twilio uses the Amazon Polly Emma voice when executing <Say> with the language attribute set to en-GB and no voice attribute (see TwiML example below).
<Response>
<Say language="en-GB">Hello. I am Emma!</Say>
</Response>
Override default settings
Override default providers/voices
<Say>'s voice attribute allows you to override any default provider/voice settings that were configured in the Console (i.e. Account-level and Language Mapping defaults).
For example, if your Account's default TTS voice is Amazon Polly Salli but you want to use Amazon Polly Joanna for a specific call, set the voice attribute to Polly.Joanna:
<Response>
<Say voice="Polly.Joanna">Hello. I am Joanna!</Say>
</Response>
You can also use the voice attribute to override a Language Mapping's defaults.
For example, if your Language Mapping for English (British)(en-GB) uses Amazon Polly and Emma but you want to use the Amazon Polly Joanna voice for a specific <Say> instruction, you would use the voice attribute set to Polly.Joanna. The TwiML below causes Twilio to use the Amazon Polly Joanna voice, which overrides your Account's default Language Mapping:
<Response>
<Say language="en-GB" voice="Polly.Joanna">Hello. I am Joanna!</Say>
</Response>
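The precedence described so far (an explicit voice attribute first, then the Language Mapping for the given language, then the Account default) can be modeled in a few lines. This is only an illustrative model of the documented behavior, not Twilio's implementation. The function name resolve_voice and its parameters are hypothetical, and the sketch assumes every language passed without a voice has a Language Mapping entry.

```python
def resolve_voice(account_default_voice, language_mapping, language=None, voice=None):
    """Model which voice a <Say> instruction ends up using.

    account_default_voice: default voice configured for the Account
    language_mapping:      dict of locale -> voice configured in the Console
    language, voice:       the <Say> attributes, or None when omitted
    """
    # An explicit voice attribute overrides every Console default.
    if voice is not None:
        return voice
    # A language without a voice uses that locale's Language Mapping.
    # (Assumption: the locale has a mapping entry.)
    if language is not None:
        return language_mapping[language]
    # Neither attribute is set: fall back to the Account default.
    return account_default_voice


mapping = {"en-GB": "Polly.Emma"}
print(resolve_voice("Polly.Salli", mapping))
print(resolve_voice("Polly.Salli", mapping, language="en-GB"))
print(resolve_voice("Polly.Salli", mapping, language="en-GB", voice="Polly.Joanna"))
```

The three calls correspond to the Account default, Language Mapping, and voice override examples in this section.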
Override default language/locales
<Say>'s language attribute allows you to override any default language/locale settings that were configured in the Console.
For example, if your Account's default TTS Language is English (US) (en-US), but you wish to use German for a specific call, set the language attribute to de-DE in your TwiML:
<Response>
<Say language="de-DE">Hallo. Ich spreche Deutsch!</Say>
</Response>
SSML
SSML support is only available in Standard and Premium voices.
Speech Synthesis Markup Language (SSML) uses XML-based tags that allow you to fine-tune the synthesized speech generated by TTS. SSML functionality includes the ability to: specify where pauses should be, provide pronunciations for acronyms, abbreviations, dates and times, and increase or decrease the speed at which text is spoken.
Supported SSML tags
While the W3C specification covers many capabilities, Twilio currently only supports a subset of them.
In addition, SSML support (including tags and accepted values) may differ between TTS providers and/or may be limited to specific voices. Review the provider-specific SSML documentation and test your application. Use of unsupported SSML tags with any TTS provider may result in an error and <Say> instruction failure.
As per the SSML specification, the root element for SSML starts with <speak>. However, when you are using SSML with <Say> it is not needed, so you can skip <speak> and insert the rest of the SSML inside <Say> directly.
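If you already have SSML documents that start with <speak>, one option is to unwrap them before placing the content in <Say>. The sketch below shows one way to do that with Python's standard library. The function name ssml_to_say is hypothetical, and the sketch assumes the <speak> element carries no XML namespace.

```python
import xml.etree.ElementTree as ET


def ssml_to_say(ssml, voice):
    """Move the contents of a <speak> document into a TwiML <Say> verb."""
    speak = ET.fromstring(ssml)
    if speak.tag != "speak":
        raise ValueError("expected <speak> as the SSML root element")
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", {"voice": voice})
    # Copy the leading text and the child elements; <speak> itself is dropped.
    say.text = speak.text
    say.extend(list(speak))
    return ET.tostring(response, encoding="unicode")


ssml = '<speak>Hi <break time="100ms"/> there</speak>'
print(ssml_to_say(ssml, "Polly.Joanna"))
```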
The table below lists the supported SSML tags, but you should refer to the appropriate, provider-specific documentation to ensure you're using the SSML tags correctly.
Action | SSML tag | Provider documentation |
Add a pause | <break> | |
Emphasize words | <emphasis> | |
Specify another language for specific words | <lang> | |
Add a pause between paragraphs | <p> | |
Use phonetic pronunciation | <phoneme> | |
Control volume, speaking rate, and pitch | <prosody> | |
Add a pause between sentences | <s> | |
Control how special types of words are spoken | <say-as> | |
Pronounce acronyms and abbreviations | <sub> | |
Improve pronunciation by specifying parts of speech | <w> | |
SSML Examples
Modify speed and volume of synthesized speech
The SSML <prosody> tag allows you to control the volume, rate, and pitch of synthesized speech.
<Response>
<Say voice="Polly.Joanna">
Prosody can be used to change the way words sound. The following words are
<prosody volume="x-loud"> quite a bit louder than the rest of this passage.
</prosody> Each morning when I wake up, <prosody rate="x-slow">I speak slowly and
deliberately until I have my coffee.</prosody> I can also change the pitch of my voice
using prosody. Do you like <prosody pitch="+5%"> speech with a pitch higher,</prosody>
or <prosody pitch="-10%"> is a lower pitch preferable?</prosody>
</Say>
</Response>
Read a phone number correctly
The SSML <say-as> tag allows you to indicate specific categories of text, so that the synthesized speech pronounces the text correctly.
Without <say-as>, a phone number would be pronounced as a number, e.g. "four billion, one hundred fifty-five million, five hundred fifty-one thousand, two hundred twelve."
The TwiML example below uses <say-as> so that the synthesized speech reads the phone number as "four one five, five five five, one two one two."
<Response>
<Say voice="Polly.Joanna">John’s phone number is, <say-as interpret-as="telephone">4155551212</say-as></Say>
</Response>
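When the phone number comes from application data, the same TwiML can be assembled at runtime. The following is a minimal sketch using Python's standard library; the function name say_phone_number is hypothetical, and the default voice simply mirrors the example above.

```python
import xml.etree.ElementTree as ET


def say_phone_number(intro, digits, voice="Polly.Joanna"):
    """Build TwiML that reads `digits` aloud as a telephone number."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", {"voice": voice})
    say.text = intro
    # <say-as interpret-as="telephone"> tells the voice to read digit by digit.
    say_as = ET.SubElement(say, "say-as", {"interpret-as": "telephone"})
    say_as.text = digits
    return ET.tostring(response, encoding="unicode")


print(say_phone_number("John's phone number is, ", "4155551212"))
```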
Generate SSML with Twilio's Helper Libraries
You can generate TwiML with SSML within the <Say> verb using one of Twilio's helper libraries for C#, Java, Node.js, PHP, Python, Ruby, or Go.
Helper Library code can generate SSML and TwiML such as the following:
<Response>
<Say voice="Polly.Joanna">
Hi
<break strength="x-weak" time="100ms"/>
<emphasis level="moderate">Words to emphasize</emphasis>
<p>Words to speak</p>
aaaaaa
<phoneme alphabet="ipa" ph="pɪˈkɑːn">Words to speak</phoneme>
bbbbbbb
<prosody pitch="-10%" rate="85%" volume="-6dB">Words to speak</prosody>
<s>Words to speak</s>
<say-as interpret-as="spell-out">Words to speak</say-as>
<sub alias="alias">Words to be substituted</sub>
<w>Words to speak</w>
</Say>
</Response>
Limits
- There is a 4,000 character limit on text that <Say> can process with Basic voices (man and woman).
- Basic voices (man and woman) don’t support SSML tags.
- There is a 3,000 character limit on text, non-SSML, that <Say> can process with Amazon Polly voices.
- Amazon-specific SSML tags such as <amazon:auto-breath> or <amazon:effect>, among others, are not currently supported.
- Lexicons are not supported in Amazon Polly voices.
- SSML support in Amazon TTS may vary between Polly and Polly Neural voices. Refer to the Amazon Polly SSML documentation for detailed information.
Note: Use of unsupported SSML tags with any TTS provider may result in an error and <Say> instruction failure. Review the provider-specific SSML documentation and test your application.
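To stay within the character limits listed above, an application can split long plain text into chunks before sending it. The sketch below is one possible approach, not a Twilio feature. The names char_limit and split_for_say are hypothetical, any voice other than man or woman is assumed to be an Amazon Polly voice, and speaking each chunk with its own <Say> verb is an assumption. The sketch handles plain text only and does not account for SSML tags.

```python
BASIC_VOICES = {"man", "woman"}
BASIC_LIMIT = 4000  # characters, Basic voices
POLLY_LIMIT = 3000  # characters of non-SSML text, Amazon Polly voices


def char_limit(voice):
    """Return the character limit for a voice (non-Basic is assumed to be Polly)."""
    return BASIC_LIMIT if voice in BASIC_VOICES else POLLY_LIMIT


def split_for_say(text, voice):
    """Split plain text on whitespace into chunks that fit the voice's limit."""
    limit = char_limit(voice)
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any single token that is longer than the limit.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


long_text = "word " * 1000
for chunk in split_for_say(long_text, "Polly.Joanna"):
    print(len(chunk))
```

Splitting on whitespace keeps words intact; a production version would more likely split on sentence boundaries so that pauses fall in natural places.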
Pricing
Basic voices
Basic voices (man and woman) are free of charge.
Standard voices
Pricing for Standard voices (Amazon Polly) starts at $0.0008 per 100 characters, with the following volume discounts:
Minimum characters | Maximum characters | Price per 100 characters* |
0 | 5,000,000 | $0.00080 |
5,000,001 | 50,000,000 | $0.00072 |
50,000,001 | 100,000,000 | $0.00068 |
100,000,001 | | $0.00064 |
* Usage is rounded towards the end of call and priced in blocks of 100 characters. For example, if 546 characters are used on a call, then you’re charged $0.004 for the use of Standard voices on that call.
Premium voices
Pricing for Premium voices (Amazon Polly Neural) starts at $0.0032 per 100 characters, with the following volume discounts:
Minimum characters | Maximum characters | Price per 100 characters* |
0 | 5,000,000 | $0.0032 |
5,000,001 | 50,000,000 | $0.0029 |
50,000,001 | 100,000,000 | $0.0027 |
100,000,001 | | $0.0025 |
* Usage is rounded towards the end of call and priced in blocks of 100 characters. For example, if 546 characters are used on a call, then you’re charged $0.0128 for the use of Premium voices on that call.
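For cost estimates, the two volume-discount tables can be encoded as data and queried by character volume. The sketch below only looks up the listed price per 100 characters. It does not model how usage is rounded into 100-character blocks, nor whether a discount applies to all usage or only to usage above a threshold, since the tables do not state either. The names STANDARD_TIERS, PREMIUM_TIERS, and price_per_100 are hypothetical.

```python
from decimal import Decimal

# (minimum characters, price per 100 characters), highest tier first.
STANDARD_TIERS = [
    (100_000_001, Decimal("0.00064")),
    (50_000_001, Decimal("0.00068")),
    (5_000_001, Decimal("0.00072")),
    (0, Decimal("0.00080")),
]
PREMIUM_TIERS = [
    (100_000_001, Decimal("0.0025")),
    (50_000_001, Decimal("0.0027")),
    (5_000_001, Decimal("0.0029")),
    (0, Decimal("0.0032")),
]


def price_per_100(tiers, volume):
    """Return the listed price per 100 characters for a character volume."""
    for minimum, price in tiers:
        if volume >= minimum:
            return price
    raise ValueError("volume must be non-negative")


print(price_per_100(STANDARD_TIERS, 2_000_000))
print(price_per_100(PREMIUM_TIERS, 75_000_000))
```

Decimal is used instead of float so that the small per-block prices are represented exactly.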