The Pronunciation Challenge

Text to Speech is hard. When TTS works well and pronounces everything correctly, no one cares. When it mispronounces common words – or worse, names – it can annoy or even offend your customers. As humans, we have it quite a bit easier. We can understand the context and know when to break the normal pronunciation rules.

While our man/woman voice is good the vast majority of the time, the new Alice Text to Speech (TTS) voice gives us a new tool to improve your customers’ experience. Not only does she add 26 dialects to the mix, but she’s just plain better.

Of course, the important question is: “how much better?”

Let’s find out by comparing them head to head in three scenarios: acronyms, heteronyms, and the VoIP/Text to Speech testing standards known as “Harvard sentences.”

Pronouncing Abbreviations and Acronyms

There are numerous types of acronyms, but we’re going to test the two most common. The first type – terms such as CNN or HTTP – is spelled out letter by letter. The second type – terms such as NASA or SCUBA – is pronounced as a word.

Saying Abbreviations

To test the spelled acronyms, we’ll start with a simple array of terms said with each voice:
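A minimal sketch of such a test in Python, assuming TwiML's `<Say>` verb and its `voice` attribute (`woman` and `alice`). The term list and the `build_twiml` helper are illustrative assumptions, not the exact script or list behind the results:

```python
import xml.etree.ElementTree as ET

# Illustrative terms only; the exact list used for the results is not shown here.
TERMS = ["CNN", "HTTP", "UN"]
VOICES = ["woman", "alice"]


def build_twiml(terms, voices):
    """Build a TwiML response that says every term once with each voice."""
    response = ET.Element("Response")
    for term in terms:
        for voice in voices:
            say = ET.SubElement(response, "Say", voice=voice)
            say.text = term
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(response, encoding="unicode")


print(build_twiml(TERMS, VOICES))
```

Saying each term with both voices back to back makes the differences easier to hear than playing two separate recordings.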

And here are our results:

While the inflection is slightly different between the voices, all of the words are handled the same until UN. The woman voice says “un” as in “fun” while Alice spells out UN as in the organization.

Tip #1: To make the voices pronounce individual letters, we can take one of two approaches. First and easiest, we can just use Alice. Alternatively, we can insert spaces (or periods) between the letters. If the text is static, we can do it manually. But if the text is dynamic, we can use a regular expression. Just remember that regular expressions can be expensive depending on the complexity of the search and length of the text.

Here’s one solution:

For context, \A anchors the match at the beginning of the text being tested while \Z anchors it at the end. By specifying both and applying the pattern to one word at a time, we can catch UN while avoiding similar acronyms like UNICEF.
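As a rough Python sketch of this anchored match (the `spell_out` name and the `SPELLED_OUT` list are assumptions made for illustration, not the original listing):

```python
import re

# Hypothetical list of acronyms that should be read letter by letter.
SPELLED_OUT = ("UN",)

# \A and \Z pin the match to the whole term, so "UNICEF" does not match "UN".
PATTERN = re.compile(r"\A(?:" + "|".join(map(re.escape, SPELLED_OUT)) + r")\Z")


def spell_out(term):
    """Insert spaces between the letters of a known spelled-out acronym."""
    if PATTERN.match(term):
        return " ".join(term)
    return term


print(spell_out("UN"))      # spaced out for the TTS voice
print(spell_out("UNICEF"))  # left untouched
```

Compiling the pattern once and reusing it keeps the per-term cost low when the text is dynamic.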

Saying Acronyms

To test pronouncing acronyms, we’ll use the same script as the sample above with a new set of terms: NASA, NATO, AIDS, scuba, laser, and POTUS. And here are our results:

Other than inflection, there were no differences between the two voices until the last term: POTUS. Admittedly, using this term is a reach. It is rarely used outside the US political circles of Washington, DC, but it leads to our next strategy.

Tip #2: To make the voices correctly pronounce uncommon words – especially those unique to a particular industry – we can switch specific words for their phonetic spellings with a relatively simple regular expression:
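One way to sketch this substitution in Python is shown below. The `PHONETIC` table, the `respell` function, and the respelling "poe tuss" are assumptions for illustration; any real respelling needs to be tuned by ear against the voice in use:

```python
import re

# Hypothetical respellings; adjust each one by listening to the voice.
PHONETIC = {
    "POTUS": "poe tuss",
}

# \b limits matches to whole words so longer words are left alone.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, PHONETIC)) + r")\b")


def respell(text):
    """Swap each listed word for its phonetic spelling before sending it to TTS."""
    return PATTERN.sub(lambda match: PHONETIC[match.group(1)], text)


print(respell("The POTUS spoke."))
```

Keeping the respellings in a single table means the pattern is built once and new industry terms can be added without touching the matching logic.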

Once again, regular expressions can be expensive, so if this text can be static, manually inserting phonetic spellings may be a better approach.

Pronouncing Heteronyms

Time for a quick grammar lesson! Heteronyms are words that are spelled the same but pronounced differently depending on their meaning. For example:

“I project my voice when I speak about my project” or “I object to your use of objects.”

To pronounce these correctly, we need to understand the context in which they’re used. From a grammatical perspective, it usually turns out that one is a noun while the other is a verb. To read such a sentence aloud, most people first read the entire sentence to themselves to figure out the context and only then say it out loud.

To test this one, we’ll use the first ten heteronym phrases from Wikipedia:

Here are our results:
 

And this is where we find our first big challenge: both voices were correct in four cases, both were wrong in four, and each was wrong on its own in one more, for a total of six problem cases.

Luckily, there are a few more hacks in the English language. One comes in the form of homophones. Homophones are words which are pronounced the same but spelled differently, such as does (female deer) versus doze.

Tip #3: Using homophones in the place of heteronyms lets us use peculiarities of the English language in our favor for the TTS voice.

While there are a limited number of homophones useful to us, this strategy solves two of the six problem cases:

1. The buck does funny things when doze are present.
2. He could lead if he got the led out.
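When the problem sentences are known ahead of time, a swap like the ones above can be sketched in Python as follows. The `replace_nth` helper is hypothetical, and it assumes we already know which occurrence of the word needs the alternate spelling:

```python
import re


def replace_nth(text, word, replacement, n):
    """Replace the n-th (1-based) whole-word occurrence of `word` in `text`."""
    matches = list(re.finditer(r"\b" + re.escape(word) + r"\b", text))
    if n < 1 or n > len(matches):
        return text  # nothing to replace; leave the sentence as-is
    match = matches[n - 1]
    return text[:match.start()] + replacement + text[match.end():]


sentence = "The buck does funny things when does are present."
print(replace_nth(sentence, "does", "doze", 2))
```

This only handles the mechanical part of the swap. Deciding which occurrence to change is the hard part and still depends on understanding the sentence.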

The Harvard Sentences

Finally, the Harvard Sentences serve as a standardized test for speech quality measurements within telecommunication systems. Each list of ten sentences includes all of the phonemes of the English language while also using them at the same frequency as they appear in the language. They are considered the gold standard and date back to IEEE recommendations from 1965 and 1969.

In the case of both voices, the results are fantastic. Starting with List #11 (available from Wikipedia):

We get a perfectly pronounced set of sentences back:
 

Tip #4: Think less about the proper spelling of words and more about the proper pronunciation. In some cases, that may include treating your TTS options as another translation within your internationalization layer.
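To make the internationalization idea concrete, here is a minimal Python sketch. It assumes a simple dictionary-based message catalog; the `-x-tts` locale suffix, the `MESSAGES` table, and the `message_for` function are all hypothetical names chosen for illustration:

```python
# Hypothetical message catalog: the "-x-tts" entries hold spoken variants.
MESSAGES = {
    "en": {"guide": "Please read the welcome guide."},
    "en-x-tts": {"guide": "Please reed the welcome guide."},
}


def message_for(key, locale, spoken=False):
    """Return the TTS variant of a message if one exists, else the written text."""
    if spoken:
        tts_catalog = MESSAGES.get(locale + "-x-tts", {})
        if key in tts_catalog:
            return tts_catalog[key]
    return MESSAGES[locale][key]


print(message_for("guide", "en"))               # written form for the screen
print(message_for("guide", "en", spoken=True))  # respelled form for the voice
```

Falling back to the written text means only the phrases that are actually mispronounced need a spoken variant.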

As a thought exercise, I encourage you to listen to the final version of the heteronym phrases and see if you can figure out both my solution and how to automate it beginning from the base sentences above:

Hint: Pronouncing a heteronym correctly usually depends on which part of speech the word is serving as.

When you have a solution, let me know via @CaseySoftware or keith@twilio.com. Happy hacking!