Text-to-Speech
Text-to-Speech (TTS), also known as speech synthesis, is the process of converting text into a human-sounding voice. TTS is a popular choice among developers and business users for building IVR (Interactive Voice Response) solutions and other voice applications because it shortens time to production: there are no audio files to record with a human voice. With recorded files, changing a message means re-recording it; TTS prompts can instead be generated dynamically from raw text.
Getting Started
Twilio’s <Say> verb makes it easy to synthesize speech. You provide the text, and Twilio synthesizes the speech in real time and plays back the audio on any call. For example, the following TwiML plays back “Hello World!”. By default, the text is spoken in US English using Twilio’s default male voice.
<Response> <Say>Hello World!</Say> </Response>
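If your application builds TwiML in code rather than serving a static file, the same document can be produced programmatically. The following is a minimal sketch using only the Python standard library; the function name hello_twiml is hypothetical, and in practice you may prefer one of Twilio's helper libraries.

```python
import xml.etree.ElementTree as ET


def hello_twiml(text: str) -> str:
    """Return a TwiML document that speaks `text` with the account's default voice."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    # ElementTree escapes &, < and > in text when serializing,
    # so arbitrary prompt text cannot break the XML.
    say.text = text
    return ET.tostring(response, encoding="unicode")


print(hello_twiml("Hello World!"))
# <Response><Say>Hello World!</Say></Response>
```

Building the document with an XML library instead of string concatenation keeps the TwiML well-formed even when the prompt contains characters such as & or <.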
When using <Say>, you can choose between the Man, Woman, Alice, and Amazon Polly voices. To use one of these voices, either configure the Text to Speech settings in the Twilio Console or provide the voice attribute on <Say>.
Text to Speech Console
The TTS console page makes it easy to test different voices and to set the default TTS voice and locale for your account. To get started, navigate to https://www.twilio.com/console/voice/twiml/text-to-speech
All Twilio accounts default to the Basic provider on account creation. To change the provider from Basic to Amazon Polly, navigate to the console page and select Amazon Polly as the provider.
Once the default provider is changed to Amazon Polly, the default voice and locale change to Salli and en-US, respectively. After this change, Twilio synthesizes the text in the following TwiML using the Salli voice:
<Response> <Say>Hello I am Salli!</Say> </Response>
In the past, developers had to use attributes on <Say> to synthesize text with a different voice or locale. While this option is still available, the TTS console makes it easy to select the voice and locale for your account so that no code change is required. To change the default voice, click the edit link next to Default Voices and select the appropriate default locale and voice for your account. For example, to change the default locale to French, select French under Locale and press Save.
You can also use the console page to change the voice that Twilio assigns by default to each locale. For example, the default voice for en-GB is Amy. To change it to Emma, click English (British) (en-GB) in the Locale Mapping table and select Emma in the Voice drop-down.
Once these changes are made, return the following TwiML to hear the text in Emma’s voice:
<Response> <Say language="en-GB">Hello I am Emma!!</Say> </Response>
As a developer, you can always override the default voice and locale on your account by providing attributes on the <Say> verb. For example, if the default voice and locale on your account are set to Salli and en-US, and you’d like to use Joanna on a specific call, provide the voice attribute:
<Response> <Say voice="Polly.Joanna">Hello I am Joanna!</Say> </Response>
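A per-call override like this is often decided at runtime, for example from a caller's language preference. The sketch below shows one way to set the voice and language attributes only when an override is wanted; say_twiml and its parameters are hypothetical names, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def say_twiml(text: str, voice: Optional[str] = None,
              language: Optional[str] = None) -> str:
    """Build a <Say> document, adding attributes only for explicit overrides.

    Omitting both attributes leaves the account-level defaults from the
    TTS console in effect.
    """
    attrs = {}
    if voice is not None:
        attrs["voice"] = voice
    if language is not None:
        attrs["language"] = language

    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", attrs)
    say.text = text
    return ET.tostring(response, encoding="unicode")


print(say_twiml("Hello I am Joanna!", voice="Polly.Joanna"))
# <Response><Say voice="Polly.Joanna">Hello I am Joanna!</Say></Response>
```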
You can learn more about these attributes on the <Say> API docs page.
Amazon Polly
Amazon Polly is one of the leading providers of lifelike text-to-speech. It offers voices across many languages and locales, and it supports SSML, which lets developers control many aspects of the synthesized speech.
Voices
The following table lists the Polly voices that can be used with the voice attribute on <Say>.
Polly Voice | Gender
Danish (da-DK) |
Polly.Mads | Male
Polly.Naja | Female
Dutch (nl-NL) |
Polly.Lotte | Female
Polly.Ruben | Male
English (Australian) (en-AU) |
Polly.Nicole | Female
Polly.Russell | Male
English (British) (en-GB) |
Polly.Amy | Female
Polly.Brian | Male
Polly.Emma | Female
English (Indian) (en-IN) |
Polly.Raveena | Female
English (US) (en-US) |
Polly.Ivy | Female
Polly.Joanna | Female
Polly.Joey | Male
Polly.Justin | Male
Polly.Kendra | Female
Polly.Kimberly | Female
Polly.Matthew | Male
Polly.Salli | Female
English (Welsh) (en-GB-WLS) |
Polly.Geraint | Male
French (fr-FR) |
Polly.Céline/Polly.Celine | Female
Polly.Mathieu | Male
French (Canadian) (fr-CA) |
Polly.Chantal | Female
German (de-DE) |
Polly.Hans | Male
Polly.Marlene | Female
Polly.Vicki | Female
Icelandic (is-IS) |
Polly.Dóra/Polly.Dora | Female
Polly.Karl | Male
Italian (it-IT) |
Polly.Carla | Female
Polly.Giorgio | Male
Japanese (ja-JP) |
Polly.Mizuki | Female
Polly.Takumi | Male
Norwegian (nb-NO) |
Polly.Liv | Female
Polish (pl-PL) |
Polly.Jacek | Male
Polly.Jan | Male
Polly.Ewa | Female
Polly.Maja | Female
Portuguese (Brazilian) (pt-BR) |
Polly.Ricardo | Male
Polly.Vitória/Polly.Vitoria | Female
Portuguese (European) (pt-PT) |
Polly.Cristiano | Male
Polly.Inês/Polly.Ines | Female
Romanian (ro-RO) |
Polly.Carmen | Female
Russian (ru-RU) |
Polly.Maxim | Male
Polly.Tatyana | Female
Spanish (Castilian) (es-ES) |
Polly.Conchita | Female
Polly.Enrique | Male
Spanish (Latin American) (es-US) |
Polly.Miguel | Male
Polly.Penélope/Polly.Penelope | Female
Swedish (sv-SE) |
Polly.Astrid | Female
Turkish (tr-TR) |
Polly.Filiz | Female
Welsh (cy-GB) |
Polly.Gwyneth | Female
SSML
Speech Synthesis Markup Language (SSML) is a W3C specification that allows developers to use an XML-based markup language to assist the generation of synthesized speech. We are excited to bring these capabilities to you through our partnership with Amazon Polly, so that you can use <Say> to control the synthesized speech.
Per the SSML spec, the root element of an SSML document is <speak>. However, when you’re using SSML with <Say>, you can omit <speak> and place the rest of the SSML inside <Say>. For example:
<Response> <Say><prosody rate="fast"> Speech Synthesis Markup Language (SSML) is a W3C specification that allows developers to use XML-based markup language for assisting the generation of synthesized speech. </prosody></Say> </Response>
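If you already have complete SSML documents with a <speak> root (for example, from another TTS integration), one way to reuse them is to move the contents of <speak> into <Say>. The sketch below illustrates the idea with the Python standard library; ssml_to_say_twiml is a hypothetical helper, and it assumes the input has no XML namespace declared on <speak>.

```python
import xml.etree.ElementTree as ET


def ssml_to_say_twiml(ssml: str) -> str:
    """Re-root the contents of a <speak> document under <Response><Say>."""
    speak = ET.fromstring(ssml)
    if speak.tag != "speak":
        # Assumption: un-namespaced SSML; a namespaced root would appear
        # here as "{namespace-uri}speak" and is not handled by this sketch.
        raise ValueError("expected <speak> as the root element")

    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = speak.text    # any text before the first child element
    say.extend(list(speak))  # child elements carry their own text and tails
    return ET.tostring(response, encoding="unicode")


print(ssml_to_say_twiml('<speak>Hi <prosody rate="fast">there</prosody>!</speak>'))
# <Response><Say>Hi <prosody rate="fast">there</prosody>!</Say></Response>
```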
Let’s take a quick look at a few SSML tags and how you can use them with <Say>.
<prosody>
You can use <prosody> to control the volume, rate, and pitch of synthesized speech.
<Response> <Say>Prosody can be used to change the way words sound. The following words are <prosody volume="x-loud"> quite a bit louder than the rest of this passage. </prosody> Each morning when I wake up, <prosody rate="x-slow">I speak slowly and deliberately until I have my coffee.</prosody> I can also change the pitch of my voice using prosody. Do you like <prosody pitch="+5%"> speech with a pitch higher,</prosody> or <prosody pitch="-10%"> is a lower pitch preferable?</prosody></Say> </Response>
<say-as>
As per the W3C spec, the say-as element allows you to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.
For example, if you read back a phone number without <say-as>, as in the following <Say>, you will hear “John’s phone number is ... four billion one hundred fifty five million five hundred fifty one thousand two hundred twelve”.
<Response> <Say>John’s phone number is, 4155551212</Say> </Response>
To synthesize the text so that the phone number is read back correctly, you’d rewrite the <Say> as follows so that you hear, “John’s phone number is ... four one five ... five five five ... one two one two”.
<Response> <Say>John’s phone number is, <say-as interpret-as="telephone">4155551212</say-as></Say> </Response>
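When the phone number comes from a database or user input, the <say-as> wrapper can be generated in code. The following sketch, using only the Python standard library, wraps a digit string in <say-as interpret-as="telephone">; phone_number_twiml is a hypothetical name, and the digits-only check is an illustrative safeguard rather than a Twilio or Polly requirement.

```python
import xml.etree.ElementTree as ET


def phone_number_twiml(lead_in: str, digits: str) -> str:
    """Speak `lead_in`, then read `digits` back as a telephone number."""
    # Illustrative validation: accept ASCII digits only, e.g. "4155551212".
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of digits, e.g. 4155551212")

    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = lead_in
    say_as = ET.SubElement(say, "say-as", {"interpret-as": "telephone"})
    say_as.text = digits
    return ET.tostring(response, encoding="unicode")


print(phone_number_twiml("John's phone number is, ", "4155551212"))
```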
Amazon Polly SSML Support
While the W3C specification covers many capabilities, Amazon Polly currently supports only the following SSML tags.
- <break>
- <emphasis>
- <lang>
- <p>
- <phoneme>
- <prosody>
- <s>
- <say-as>
- <sub>
- <w>
Pricing
Amazon Polly pricing starts at $0.0008 per 100 characters, with the following volume discounts.
Characters Min | Characters Max | *Price per 100 Characters
0 | 5,000,000 | $0.00080
5,000,001 | 50,000,000 | $0.00072
50,000,001 | 100,000,000 | $0.00068
100,000,001 | | $0.00064
* Usage is rounded at the end of the call and priced in blocks of 100 characters. For example, if 546 characters are used on a call, then you’re charged $0.004 for the use of Polly Voices on that call.
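To estimate costs, the pricing table and footnote can be expressed as a small calculation. The sketch below rests on two assumptions that are inferred rather than stated. First, per-call usage is rounded down to whole 100-character blocks, which is what the 546-character example ($0.004, or five blocks) implies. Second, the rate is chosen by looking up a cumulative character volume in the tier table; how that volume is accumulated is not specified here. All names are hypothetical.

```python
from decimal import Decimal

# (inclusive upper bound in characters, price per 100 characters);
# None marks the open-ended top tier.
PRICE_TIERS = [
    (5_000_000, Decimal("0.00080")),
    (50_000_000, Decimal("0.00072")),
    (100_000_000, Decimal("0.00068")),
    (None, Decimal("0.00064")),
]


def rate_per_block(volume_characters: int) -> Decimal:
    """Return the per-100-character price for a given cumulative volume."""
    for upper_bound, price in PRICE_TIERS:
        if upper_bound is None or volume_characters <= upper_bound:
            return price
    raise AssertionError("unreachable: the last tier has no upper bound")


def call_cost(characters: int, rate: Decimal = PRICE_TIERS[0][1]) -> Decimal:
    """Estimate the Polly charge for one call.

    Assumption: usage is rounded DOWN to whole blocks of 100 characters,
    inferred from the 546-character example in the pricing footnote.
    """
    blocks = characters // 100
    return blocks * rate


print(call_cost(546))  # 0.00400
```

Decimal is used instead of float so that fractions of a cent are represented exactly.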
Commit to a monthly volume and receive a significant discount beyond standard volume discounts. Contact our sales team to learn more.
Limits
The following limits apply when using Amazon Polly Voices.
- There is a 3,000 character limit on text that <Say> can process with Polly Voices.
- Amazon-specific SSML tags, such as <amazon:auto-breath>, are not currently supported.
- Lexicons are not supported.
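These limits can be checked before TwiML is returned, so that problems surface in your application rather than on a live call. The sketch below is one possible pre-flight check. It assumes the 3,000-character limit applies to the full body of <Say>, markup included, which is not stated above; check_polly_say and the constant names are hypothetical.

```python
import re
from typing import List

POLLY_CHAR_LIMIT = 3000  # limit on text that <Say> can process with Polly voices
AMAZON_TAG = re.compile(r"</?amazon:[\w-]+")  # e.g. <amazon:auto-breath>


def check_polly_say(body: str) -> List[str]:
    """Return a list of problems found in the body of a Polly <Say>.

    Assumption: the character limit counts the whole body, including
    any SSML markup. An empty list means no problems were detected.
    """
    problems = []
    if len(body) > POLLY_CHAR_LIMIT:
        problems.append(
            f"body is {len(body)} characters; the limit is {POLLY_CHAR_LIMIT}"
        )
    match = AMAZON_TAG.search(body)
    if match:
        problems.append(f"Amazon-specific SSML tag not supported: {match.group(0)}>")
    return problems


print(check_polly_say("Hello World!"))  # []
```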