The Pronunciation Challenge


Text to Speech is hard. When TTS works well and pronounces everything correctly, no one cares. When you mispronounce common words – or worse, names – you can annoy or even offend your customers. As humans, we have it quite a bit easier. We can understand the context and know when to break the normal pronunciation rules.

While our man/woman voices are good the vast majority of the time, the new Alice Text to Speech (TTS) voice gives us a new tool to improve your customers’ experience. Not only does she add 26 dialects to the mix, but she’s just plain better.

Of course, the important question is: “how much better?”

Let’s find out by comparing them head to head in three scenarios: acronyms, heteronyms, and the VoIP/Text to Speech testing standards known as “Harvard sentences.”

Pronouncing Abbreviations and Acronyms

There are numerous types of acronyms, but we’re going to test the two most common. The first type – terms such as CNN or HTTP – is spelled out letter by letter. The second type – terms such as NASA or SCUBA – is pronounced phonetically as a word.

Saying Abbreviations

To test the spelled acronyms, we’ll start with a simple array of terms said with each voice:
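As a rough sketch of that kind of test script in Python, the snippet below builds a TwiML response that says each term once per voice. It assumes only the terms named in this section (CNN, HTTP, UN) and the "woman" and "alice" values of the Say verb's voice attribute; `build_twiml` is a hypothetical helper, not part of any Twilio library.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# Assumption: only the terms named in this section are listed here.
TERMS = ["CNN", "HTTP", "UN"]
# The two voices being compared.
VOICES = ["woman", "alice"]


def build_twiml(terms, voices):
    """Return a TwiML <Response> string that says every term once per voice."""
    response = Element("Response")
    for term in terms:
        for voice in voices:
            say = SubElement(response, "Say", {"voice": voice})
            say.text = term
    return tostring(response, encoding="unicode")


twiml = build_twiml(TERMS, VOICES)
print(twiml)
```

Serving this document from the URL configured for a phone number would let you hear each term in both voices back to back.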

And here are our results:

While the inflection is slightly different between the voices, all of the words are handled the same until UN. The woman voice says “un” as in “fun” while Alice says UN as in the organization.

Tip #1: To make the voices pronounce individual letters, we can take one of two approaches. First and easiest, we can just use Alice. Alternatively, we can insert spaces (or periods) between the letters. If the text is static, we can do it manually. But if the text is dynamic, we can use a regular expression. Just remember that regular expressions can be expensive depending on the complexity of the search and length of the text.

Here’s one solution:
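A minimal sketch of the regular expression approach in Python, assuming each term arrives as its own string. `spell_out` is a hypothetical helper written for illustration, and the default acronym is simply the one that caused trouble above.

```python
import re


def spell_out(term, acronym="UN"):
    """Insert spaces between the letters if the term is exactly the acronym."""
    # \A and \Z anchor the match to the start and end of the string,
    # so "UN" matches but "UNICEF" does not.
    pattern = r"\A" + re.escape(acronym) + r"\Z"
    if re.search(pattern, term):
        return " ".join(term)
    return term


print(spell_out("UN"))
print(spell_out("UNICEF"))
```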

For context, \A anchors the match to the beginning of the string while \Z anchors it to the end. By specifying both and testing each term on its own, we can catch UN while avoiding longer acronyms that merely start with it, like UNICEF.

Saying Acronyms

To test pronouncing acronyms, we’ll use the same script as the sample above with a new set of terms: NASA, NATO, AIDS, scuba, laser, and POTUS. And here are our results:

Other than inflection, there were no differences between the two voices until the last term: POTUS. Admittedly, using this term is a reach. It is rarely used outside the US political circles of Washington, DC, but it leads to our next strategy.

Tip #2: To make the voices correctly pronounce uncommon words – especially those unique to a particular industry – we can switch specific words for their phonetic spellings with a relatively simple regular expression:
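A minimal sketch of that substitution in Python. The lookup table and the phonetic spelling "poe tuss" are assumptions made for illustration; in practice you would tune each spelling by ear against the voice you use. `apply_phonetics` is a hypothetical helper.

```python
import re

# Assumption: an illustrative spelling, not a tested one.
PHONETIC = {"POTUS": "poe tuss"}


def apply_phonetics(text, table=PHONETIC):
    """Replace whole words found in the table with their phonetic spellings."""
    # \b keeps the match to whole words so longer words are left alone.
    alternatives = "|".join(re.escape(word) for word in table)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: table[match.group(1)], text)


print(apply_phonetics("POTUS spoke today."))
```

Keeping the table as data means new industry terms can be added without touching the expression itself.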

Once again, regular expressions can be expensive, so if this text can be static, manually inserting phonetic spellings may be a better approach.

Pronouncing Heteronyms

Time for a quick grammar lesson! Heteronyms are words which are spelled the same but are pronounced differently and have different meanings. For example:

“I project my voice when I speak about my project” or “I object to your use of objects.”

To pronounce these correctly, we need to understand the context in which they’re used. From a grammatical perspective, it usually turns out that one is a noun while the other is a verb. To read it aloud, most people will read the entire sentence to themselves to figure out the context and then say it out loud.

To test this one, we’ll use the first ten heteronym phrases from Wikipedia:

Here are our results:

And this is where we find our first big challenge: both voices were correct in four cases, both were wrong in four, and each was wrong on its own in one more, for a total of six problem cases.

Luckily, there are a few more hacks in the English language. One comes in the form of homonyms. Homonyms (strictly speaking, homophones) are words which are pronounced the same but spelled differently, such as does (the plural of doe, a female deer) versus doze.

Tip #3: Using homonyms in the place of heteronyms lets us use peculiarities of the English language in our favor for the TTS voice.

While there are a limited number of homonyms useful to us, this strategy solves two of six problem cases:

1. The buck does funny things when doze are present.
2. He could lead if he got the led out.

The Harvard Sentences

Finally, the Harvard Sentences serve as a standardized test for speech quality measurements within telecommunication systems. Each list of ten sentences includes all of the sounds of the English language while also using them at the same frequency as they appear in English. They are considered the gold standard and date back to IEEE recommendations from 1965 and 1969.

In the case of both voices, the results are fantastic. Starting with List #11 (available from Wikipedia):

We get a perfectly pronounced set of sentences back:

Tip #4: Think less about the proper spelling of words and more about the proper pronunciation. In some cases, that may include treating your TTS options as another translation within your internationalization layer.
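One way to picture the translation idea is a message catalog with a spoken variant alongside the written one. The sketch below is an assumption-laden illustration: the catalog layout, the "en-tts" locale name, and the `translate` helper are all hypothetical, and the spoken variants reuse the two homonym fixes shown earlier.

```python
# Hypothetical catalog: "en" is what is displayed, "en-tts" is what is spoken.
MESSAGES = {
    "buck": {
        "en": "The buck does funny things when does are present.",
        "en-tts": "The buck does funny things when doze are present.",
    },
    "lead": {
        "en": "He could lead if he got the lead out.",
        "en-tts": "He could lead if he got the led out.",
    },
    # No spoken variant needed, so only the written text is stored.
    "welcome": {"en": "Welcome back."},
}


def translate(key, locale):
    """Look up a message, falling back from e.g. "en-tts" to "en"."""
    variants = MESSAGES[key]
    if locale in variants:
        return variants[locale]
    base_locale = locale.split("-")[0]
    return variants[base_locale]


print(translate("buck", "en-tts"))
print(translate("welcome", "en-tts"))
```

The fallback means only the phrases that a voice gets wrong need a spoken variant; everything else is read from the written text.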

As a thought exercise, I encourage you to listen to the final version of the heteronym phrases and see if you can figure out both my solution and how to automate it beginning from the base sentences above:

Hint: Pronouncing heteronyms correctly is usually dependent on which part of speech the word is serving as.

When you have a solution, let me know via @CaseySoftware. Happy hacking!

  • robert

    alice sounds much more robotic :(

    • Hey Robert,

      We’d love your feedback to see how we can improve. Can you shoot a note to

  • ConfusedCarrier

    It’s great to see how they compare and learn that Alice is better, but how do we switch our Twilio requests to force the system to use the Alice voice? Seems like that would be the main point of the article, but I read it twice, and unless I managed to completely miss it, I didn’t see that answered.

    • rickyrobinett

      Great question. You can change the voice to alice by using the voice attribute on the Say verb in your TwiML:

      <Say voice="alice">Hello World!</Say>

      If you’re using a client library to generate your TwiML, you can add the voice attribute there. This post shows how to do this with the PHP library by passing a second argument of parameters to the say function:
      $response->say($item, array('voice' => 'alice'));

      Hope that helps! Don’t hesitate to reach out if you have any other questions.

  • Simon Bernier

    I wish there would be a way to specify phonetically what we want to say in a specific way, using for example Are there any plans in the future for this kind of customization?