Text-to-Speech (TTS)

Text To Speech (TTS), also known as speech synthesis, is a process in which text is converted into a human-sounding voice. Developers and business users alike use TTS to turn traditional human-to-human interactions into seamless, machine-to-human interactions, and make every interaction over voice a frictionless and first-class experience.

Instead of recording audio files with human voices to play back in a call, which has limited flexibility and is not a scalable option, TTS prompts can be dynamically, programmatically generated from raw text as a response to events in your application. Whether the use case is an Interactive Voice Response (IVR), a conversational assistant for scaling contact centers, or Voice notifications to deliver critical messages over a phone call, Text To Speech capabilities enable efficiency at global scale while enhancing customer engagement.

Get Started with Text To Speech

You can provide text and Twilio will synthesize speech in real time and speak back the audio in any call or conference. TTS is available via the <Say> TwiML verb and Studio's Say/Play Widget.

TwiML

The <Say> verb allows you to provide plain text that Twilio converts to synthesized speech.

For example, when Twilio executes the following TwiML during a call, the caller hears "Hello world!" The synthesized voice the caller hears is the default voice and language of the Twilio Account (configured in the Twilio Console).

<Response>
   <Say>Hello world!</Say>
</Response>

<Say> also allows you to modify the language, accent, and voice of the synthesized speech via the language and voice attributes. The example below uses Amazon Polly's "Joanna" voice and American English:

<Response>
   <Say language="en-US" voice="Polly.Joanna">Hello. I am Joanna and I speak American English!</Say>
</Response>

<Say> offers different options for voices, each with its own supported set of languages and genders, so you can customize your application with Text To Speech capabilities according to your needs and preferences.

To start using TTS, complete the following steps:

  1. Configure your Account-wide Text To Speech Settings in the Twilio Console.
  2. Use <Say> to programmatically define TTS instructions.
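
The two TwiML documents shown earlier in this section can also be produced programmatically. The sketch below uses only the Python standard library rather than a Twilio helper library, and `build_say_twiml` is a hypothetical helper name chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_say_twiml(text, voice=None, language=None):
    """Return a TwiML document containing a single <Say> verb.

    Hypothetical helper for illustration; not part of any Twilio library.
    """
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    # Set only the attributes that were requested. When an attribute is
    # omitted, the Console defaults decide the voice and language.
    if language is not None:
        say.set("language", language)
    if voice is not None:
        say.set("voice", voice)
    say.text = text
    return ET.tostring(response, encoding="unicode")


print(build_say_twiml("Hello world!"))
print(build_say_twiml(
    "Hello. I am Joanna and I speak American English!",
    voice="Polly.Joanna",
    language="en-US",
))
```

This helper treats `text` as plain text, so any markup in it is escaped. It is not suitable for embedding SSML tags as written.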

Studio

Twilio Studio is a visual, serverless tool for designing and building applications with little or no code. Its Widgets represent features and functionality of Twilio's platform.

The Say/Play Widget allows you to add Text To Speech capabilities to your application with ease, including embedding SSML for supported voices.

To start using TTS with Studio, complete the following steps:

  1. Configure your Account-wide Text To Speech Settings in the Twilio Console.
  2. Use the Say/Play Widget to add TTS to your Studio Flow.

Text To Speech voices overview

Twilio's Text To Speech offering has a variety of different voices in multiple languages and locales with their associated accents and pronunciations. There are three types of voices with different quality, language coverage and pricing: Basic, Standard and Premium.

Basic voices

Basic voices are first-generation voices. They can be used to get started and familiarize yourself with Text To Speech capabilities using <Say>, but may not have enough human-like qualities to build conversational applications and deliver superior user experiences over a voice call. The voices in this tier are available in a limited number of languages at no cost.

Standard voices

Standard voices offer standard TTS technology and produce natural-sounding synthesized speech with a variety of lifelike voices. The voices in this tier are provided by Amazon (Amazon Polly) and Google (Standard), with support for SSML (Speech Synthesis Markup Language), which allows developers to control many aspects of the synthesized speech.

Premium voices

These voices are generated using the latest technology and innovation in synthesized speech, providing the most human-like, expressive and natural-sounding text-to-speech voices possible, with higher quality than Standard voices. The voices in this tier are provided by Amazon (Amazon Polly Neural) and Google (WaveNet, Neural2), with support for SSML, which allows developers to control many aspects of the synthesized speech.

See the Pricing section below for additional information.

Available voices and languages

Effective June 26, 2023, Alice voices are no longer supported for Text To Speech, and any request that uses them is redirected to an alternate voice. We recommend updating the configuration in your Console, Studio Flows, and backend application to remove any references to Alice voices. For more information, visit the Changelog.

Google voices (Standard, WaveNet and Neural2) are available in Public Beta.

The following table contains all voices available for each language and locale. You can test the different voices from the TTS Settings page in the Twilio Console.

Note: An invalid combination of voice and language attributes may result in an error and failure of the <Say> instruction.

Voices listed with (*) are fully bilingual voices. At the moment only Amazon Polly has this capability for a limited number of voices. Learn more by visiting Amazon's Bilingual Voices documentation.

Text To Speech settings

Default voice and language

The Text To Speech page in the Twilio Console allows you to define a default voice and language for your Account. These defaults are used when no language or voice attribute is provided in your <Say> TwiML. If you are using Studio, the defaults are used when “Default” is selected.

You can test different voices and messages in this section of the Console.

[Screenshot: the Text To Speech Settings page in the Console, with the Default Provider set to "Basic".]

In the screenshot above, the DEFAULT PROVIDER is set to Basic and the DEFAULT VOICE is set to Man, en-US. With these TTS settings, Twilio uses the Man voice and the en-US (American English) accent and pronunciation when executing the following TwiML:

<Response>
  <Say>Hello. I am a man!</Say>
</Response>

Language mapping

Twilio updates its Text To Speech voice offering regularly. To get access to the latest voices without changing voice names in your code, use the Language Mapping feature. Your application provides only the language and the text, and Twilio selects the voice mapped to that language. You can update the mapping at any time from the Console.

On the Text to Speech page in the Console, you can set a voice for every locale. This means that you can specify the language without needing to specify the voice when using TTS capabilities in your application.

To set a voice for a locale, complete the following:

[Screenshot: the Console steps for configuring Language Mapping, as described in the list below.]

  1. Under the Current Language Mapping heading, click on the language/locale you wish to configure, e.g. English (British)(en-GB).
  2. In the Test & Configure Voices By Language modal, select the PROVIDER and VOICE you wish to use, e.g. Amazon Polly and Emma.
  3. Click Save.
  4. Repeat steps 1-3 for other languages/locales if necessary.

For example, if you configure English (British)(en-GB) to use Amazon Polly and Emma, Twilio uses the Amazon Polly Emma voice when executing <Say> with the language attribute set to en-GB and no voice attribute (see TwiML example below).

<Response>
  <Say language="en-GB">Hello. I am Emma!</Say>
</Response>

Override default settings

Override default providers/voices

<Say>'s voice attribute allows you to override any default provider/voice settings that were configured in the Console (i.e. Account-level and Language Mapping defaults).

For example, if your Account's default TTS voice is Amazon Polly Salli but you want to use Amazon Polly Joanna for a specific call, set the voice attribute to Polly.Joanna:

<Response>
  <Say voice="Polly.Joanna">Hello. I am Joanna!</Say>
</Response>

You can also use the voice attribute to override a Language Mapping's defaults.

For example, if your Language Mapping for English (British)(en-GB) uses Amazon Polly and Emma but you want to use the Amazon Polly Joanna voice for a specific <Say> instruction, you would use the voice attribute set to Polly.Joanna. The TwiML below causes Twilio to use the Amazon Polly Joanna voice, which overrides your Account's default Language Mapping:

<Response>
  <Say language="en-GB" voice="Polly.Joanna">Hello. I am Joanna!</Say>
</Response>

Override default language/locales

<Say>'s language attribute allows you to override any default language/locale settings that were configured in the Console.

For example, if your Account's default TTS language is English (US) (en-US) but you want to use German for a specific call, set the language attribute to de-DE in your TwiML:

<Response>
  <Say language="de-DE">Hallo. Ich spreche Deutsch!</Say>
</Response>
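
The precedence described in the settings and override sections above can be summarized as a small model. This is only a sketch of the documented behavior, not Twilio's implementation. `resolve_voice` and its parameters are hypothetical names, and the fallback for a language that has no mapping entry is an assumption.

```python
def resolve_voice(account_default, language_mapping, voice=None, language=None):
    """Pick the voice for one <Say> instruction (illustrative model only).

    account_default:  voice used when <Say> has neither attribute.
    language_mapping: dict from locale (for example "en-GB") to a voice.
    voice, language:  values of the <Say> attributes, or None if omitted.
    """
    if voice is not None:
        # An explicit voice attribute overrides every Console default.
        return voice
    if language is not None and language in language_mapping:
        # Language without a voice: use the Language Mapping entry.
        return language_mapping[language]
    # Assumption: everything else falls back to the account-wide default.
    return account_default


mapping = {"en-GB": "Polly.Emma"}
print(resolve_voice("Polly.Salli", mapping))
print(resolve_voice("Polly.Salli", mapping, language="en-GB"))
print(resolve_voice("Polly.Salli", mapping, language="en-GB", voice="Polly.Joanna"))
```

The three calls correspond to a <Say> with no attributes, a <Say> with only the language attribute, and a <Say> with both attributes.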

SSML

SSML support is only available in Standard and Premium voices.

Speech Synthesis Markup Language (SSML) uses XML-based tags that allow you to fine-tune the synthesized speech generated by TTS. SSML functionality includes the ability to: specify where pauses should be, provide pronunciations for acronyms, abbreviations, dates and times, and increase or decrease the speed at which text is spoken.

Supported SSML tags

While the W3C specification covers many capabilities, Twilio currently only supports a subset of them.

In addition, SSML support (including tags and accepted values) may differ between TTS providers and may be limited to specific voices. Review the provider-specific SSML documentation and test your application. Using unsupported SSML tags with any TTS provider may result in an error and failure of the <Say> instruction.

As per the SSML specification, the root element of an SSML document is <speak>. When you use SSML with <Say>, the <speak> element is not needed: omit it and place the rest of the SSML directly inside <Say>.
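
If your SSML comes from a tool that always emits a <speak> root, you can remove the wrapper before placing the content inside <Say>. The sketch below is one minimal way to do that with a regular expression. `strip_speak_root` is a hypothetical helper name, and the pattern assumes a single, well-formed <speak> wrapper.

```python
import re

# Matches an optional-whitespace-padded <speak ...> ... </speak> wrapper
# and captures everything between the opening and closing tags.
_SPEAK_ROOT = re.compile(r"^\s*<speak\b[^>]*>(.*)</speak>\s*$", re.DOTALL)


def strip_speak_root(ssml):
    """Return the SSML content without a surrounding <speak> element."""
    match = _SPEAK_ROOT.match(ssml)
    # Input without a <speak> wrapper is returned unchanged.
    return match.group(1) if match else ssml


print(strip_speak_root('<speak>Hi <break time="1s"/> there</speak>'))
```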

The table below lists the supported SSML tags, but you should refer to the appropriate, provider-specific documentation to ensure you're using the SSML tags correctly.

Action                                                SSML tag     Provider documentation
Add a pause                                           <break>      Amazon Polly; Google
Emphasize words                                       <emphasis>   Amazon Polly; Google
Specify another language for specific words           <lang>       Amazon Polly; Google
Add a pause between paragraphs                        <p>          Amazon Polly; Google
Use phonetic pronunciation                            <phoneme>    Amazon Polly; Google
Control volume, speaking rate, and pitch              <prosody>    Amazon Polly; Google
Add a pause between sentences                         <s>          Amazon Polly; Google
Control how special types of words are spoken         <say-as>     Amazon Polly; Google
Pronounce acronyms and abbreviations                  <sub>        Amazon Polly; Google
Improve pronunciation by specifying parts of speech   <w>          Amazon Polly; Google: N/A
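
Because unsupported tags can make a <Say> instruction fail, it can help to check SSML before sending it. The following is a minimal sketch that compares the tag names in a fragment against the tags listed in the table above. `unsupported_tags` is a hypothetical helper, and it checks tag names only, not attributes, accepted values, or per-voice support.

```python
import re

# Tag names taken from the table of supported SSML tags.
SUPPORTED_TAGS = {
    "break", "emphasis", "lang", "p", "phoneme",
    "prosody", "s", "say-as", "sub", "w",
}

# Captures the name of every opening (or self-closing) tag.
# Closing tags start with "</" and are therefore not matched.
_OPEN_TAG = re.compile(r"<([A-Za-z][\w:.-]*)")


def unsupported_tags(ssml):
    """Return the sorted tag names in `ssml` that are not in the table."""
    found = set(_OPEN_TAG.findall(ssml))
    return sorted(found - SUPPORTED_TAGS)


fragment = 'Hi <break time="1s"/> <amazon:effect name="whispered">psst</amazon:effect>'
print(unsupported_tags(fragment))
```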

SSML Examples

Modify speed and volume of synthesized speech

The SSML <prosody> tag allows you to control the volume, rate, and pitch of synthesized speech.

<Response>
  <Say voice="Polly.Joanna">
    Prosody can be used to change the way words sound. The following words are
    <prosody volume="x-loud"> quite a bit louder than the rest of this passage.
    </prosody> Each morning when I wake up, <prosody rate="x-slow">I speak slowly and 
    deliberately until I have my coffee.</prosody> I can also change the pitch of my voice 
    using prosody. Do you like <prosody pitch="+5%"> speech with a pitch higher,</prosody> 
    or <prosody pitch="-10%"> is a lower pitch preferable?</prosody>
  </Say>
</Response>

Read a phone number correctly

The SSML <say-as> tag allows you to indicate specific categories of text, so that the synthesized speech pronounces the text correctly.

Without <say-as>, a phone number would be pronounced as a number, e.g. "four billion, one hundred fifty-five million, five hundred fifty-one thousand, two hundred twelve."

The TwiML example below uses <say-as> so that the synthesized speech reads the phone number as "four one five, five five five, one two one two."

<Response>
   <Say voice="Polly.Joanna">John’s phone number is, <say-as interpret-as="telephone">4155551212</say-as></Say>
</Response>

Generate SSML with Twilio's Helper Libraries

You can generate TwiML with SSML within the <Say> verb using one of Twilio's helper libraries for C#, Java, Node.js, PHP, Python, Ruby, or Go.

Helper Library code can generate SSML inside TwiML such as the following:

<Response>
  <Say voice="Polly.Joanna">
    Hi
    <break strength="x-weak" time="100ms"/>
    <emphasis level="moderate">Words to emphasize</emphasis>
    <p>Words to speak</p>
    aaaaaa
    <phoneme alphabet="ipa" ph="pɪˈkɑːn">Words to speak</phoneme>
    bbbbbbb
    <prosody pitch="-10%" rate="85%" volume="-6dB">Words to speak</prosody>
    <s>Words to speak</s>
    <say-as interpret-as="spell-out">Words to speak</say-as>
    <sub alias="alias">Words to be substituted</sub>
    <w>Words to speak</w>
  </Say>
</Response>

Limits

• There is a 4,000 character limit on text that <Say> can process with Basic voices (man and woman).
• Basic voices (man and woman) don’t support SSML tags.
• There is a 3,000 character limit on text, excluding SSML tags, that <Say> can process with Amazon Polly voices.
• Amazon-specific SSML tags, such as <amazon:auto-breath> or <amazon:effect>, are not currently supported.
• Lexicons are not supported in Amazon Polly voices.
• SSML support in Amazon TTS may vary between Polly and Polly Neural voices. Refer to the Amazon Polly SSML documentation for detailed information.
• SSML support in Google TTS may vary between Standard, WaveNet and Neural2 voices. Refer to the Google SSML documentation for detailed information.
• There is a 5,000 character limit on text, including SSML, that <Say> can process with Google voices.
• Google TTS includes SSML tags, newlines and spaces in the total character count, so they are billed.
• Google-specific SSML tags, such as <par> or <seq>, are not currently supported.

Note: Using unsupported SSML tags with any TTS provider may result in an error and failure of the <Say> instruction. Review the provider-specific SSML documentation and test your application.
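
The character limits above can be checked before a request is made. The sketch below is an approximation under stated assumptions: the provider keys (`"basic"`, `"polly"`, `"google"`) and the `within_limit` helper are hypothetical, and stripping anything that looks like a tag is only a rough stand-in for how a provider counts non-SSML text.

```python
import re

_TAG = re.compile(r"<[^>]+>")

# provider key -> (character limit, whether SSML tags count toward it)
LIMITS = {
    "basic": (4000, True),    # Basic voices do not support SSML at all
    "polly": (3000, False),   # limit applies to the non-SSML text
    "google": (5000, True),   # limit includes SSML tags
}


def within_limit(say_body, provider):
    """Return True if `say_body` fits the documented limit for `provider`."""
    limit, tags_count = LIMITS[provider]
    counted = say_body if tags_count else _TAG.sub("", say_body)
    return len(counted) <= limit


body = "a" * 2990 + '<break time="1s"/>'
print(within_limit(body, "polly"))
print(within_limit("a" * 4990 + '<break time="1s"/>', "google"))
```

In the second call the 18 characters of the <break> tag push the total past 5,000, which is why tags matter for Google voices.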

Pricing

Basic voices

Basic voices (man and woman) are free of charge.

Standard voices

Standard voices (Amazon Polly and Google Standard) pricing starts at $0.0008 per 100 characters with the following volume discounts:

Minimum characters   Maximum characters   Price per 100 characters*
0                    5,000,000            $0.00080
5,000,001            50,000,000           $0.00072
50,000,001           100,000,000          $0.00068
100,000,001          (none)               $0.00064

* Usage is rounded up at the end of the call and priced in blocks of 100 characters. For example, if 546 characters are used on a call, usage is rounded up to 6 blocks and you’re charged 6 × $0.0008 = $0.0048 for the use of Standard voices on that call. If fewer than 100 characters are used, you’re charged $0.0008, even for a single character.

Premium voices

Premium voices (Amazon Polly Neural, Google WaveNet and Google Neural2) pricing starts at $0.0032 per 100 characters with the following volume discounts:

Minimum characters   Maximum characters   Price per 100 characters*
0                    5,000,000            $0.0032
5,000,001            50,000,000           $0.0029
50,000,001           100,000,000          $0.0027
100,000,001          (none)               $0.0025

* Usage is rounded up at the end of the call and priced in blocks of 100 characters. For example, if 546 characters are used on a call, usage is rounded up to 6 blocks and you’re charged 6 × $0.0032 = $0.0192 for the use of Premium voices on that call. If fewer than 100 characters are used, you’re charged $0.0032, even for a single character.
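
The per-call block pricing can be estimated with a few lines of arithmetic. This sketch assumes that usage is rounded up to the next block of 100 characters and uses only the first-tier rates. It does not model the volume discounts, because the text does not state how tiers are applied across usage. `call_cost` is a hypothetical helper name.

```python
from decimal import Decimal

# First-tier prices per block of 100 characters, from the tables above.
STANDARD_RATE = Decimal("0.0008")
PREMIUM_RATE = Decimal("0.0032")


def call_cost(characters, rate_per_block):
    """Estimate the TTS charge for one call, in US dollars."""
    # Assumption: round up, so 1 to 100 characters is billed as one block.
    blocks = (characters + 99) // 100
    return blocks * rate_per_block


print(call_cost(1250, STANDARD_RATE))  # 13 blocks -> 0.0104
print(call_cost(1250, PREMIUM_RATE))   # 13 blocks -> 0.0416
print(call_cost(1, STANDARD_RATE))     # 1 block   -> 0.0008
```

`Decimal` is used instead of floating point so that small per-block prices add up exactly.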
