Eleven Best Practices, Tips, and Tricks for using Speech Recognition and Virtual Agent Bots with Voice Calling on the Twilio CPaaS Platform

October 13, 2023

Conversational bot designers and developers – and callers into speech-enabled Interactive Voice Response (IVR) systems and virtual agents alike – are continually asking themselves the same questions: “Why doesn’t this bot understand me? What more does it need to be able to understand what I just said to it?”

While AI-based Automated Speech Recognition (ASR) is inherently challenging (especially in noisy environments) and there are accuracy and latency trade-offs to navigate, there are ways to improve speech recognition performance. This post gives you best practices to maximize the odds of a superior automated self-service experience with Twilio.

Although the tools provided in the Twilio CPaaS tool bench are powerful, some of the coolest features of our recently GA’d <VirtualAgent> integration with Google remain somewhat hidden. Read on below (or afterwards, in our follow-up post on Dialogflow CX tips) for how to get at the best parts of them.

Twilio’s recommendations for improving <Gather> speech recognition in an IVR

By implementing the following tips and recommendations, you can increase the likelihood that the Google ASR used by Twilio will recognize spoken input correctly, and that your application (or Twilio Studio Flow) can take the appropriate next action to confirm or validate the caller’s input. These best practices minimize disturbance to the caller, delivering a more conversational IVR or self-service experience. All of this reduces caller frustration and improves the overall efficiency and cost performance of your IVR system.

1. Rely on the (ever-improving) powers of mobile devices

Twilio recommends that customers use the mobile phone’s microphone for improved audio quality, along with the noise-canceling features already built into the devices themselves.

To minimize outside noise interference, we recommend capturing user input with the phone's handset mode rather than speakerphone mode. Noise-canceling microphones and acoustic echo cancellation reduce the impact of background noise and can significantly enhance recognition accuracy.

2. Choose a high-quality PSTN connection provider  

You cannot recognize speech on calls that don't successfully connect to your app. Twilio has high-quality, reliable interconnects with multiple providers, serving both inbound and outbound calling use cases (including number porting) – at cloud/elastic scale – the world over. Don’t let poor connectivity foil your attempts to engage and serve your customers!

3. Leverage "hints" in the <Gather> verb to the max

Include all the possible inputs that a user may speak as part of the hints attribute of the <Gather> verb. Add as many as you like into your code; there is no scaling penalty for running an app with 1 or 10 hints vs. 99 (we allow hundreds – here are the current limits). Adding these will guide recognition of users’ input and increase the likelihood of accurate results.

Class tokens are particularly relevant if you’re expecting an address or a dollar/currency amount.

Here are some examples of supported class tokens by language in Twilio’s Docs and Google’s Docs:

  • $ADDRESSNUM (street number), $STREET (street name), and $POSTALCODE
  • $MONEY (amount with currency unit)
  • $OPERAND (numeric)
  • DTMF, etc.

In this next example, we use the class token $OOV_CLASS_DIGIT_SEQUENCE, since the requested account number is a sequence of digits. When the <Gather> completes, the result is sent to the URL specified in the action attribute.

Digits

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather action="https://example.com/gather-digits" input="speech" timeout="3" hints="$OOV_CLASS_DIGIT_SEQUENCE">
    <Say>Please speak your account number.</Say>
  </Gather>
</Response>

Temperature

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather action="https://example.com/gather-temperature" input="speech" timeout="3" hints="$OOV_CLASS_TEMPERATURE">
    <Say>Please speak your local temperature.</Say>
  </Gather>
</Response>

Phone Number

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather action="https://example.com/gather-phone" input="speech" timeout="3" hints="$OOV_CLASS_FULLPHONENUM">
    <Say>Please speak your phone number.</Say>
  </Gather>
</Response>

Street Address

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather action="https://example.com/gather-address" input="speech" timeout="3" hints="$ADDRESSNUM, $STREET, $POSTALCODE">
    <Say>Please speak your street address.</Say>
  </Gather>
</Response>

Something you define 

<Response>
  <Gather input="speech" hints="this is a phrase I expect to hear, keyword, product name, name">
    <Say>Please say something</Say>
  </Gather>
</Response>

<Gather> speech recognition is not yet optimized for alphanumeric input (e.g., ABC123). Alphanumerics contain many homonyms, which makes them harder to recognize. Please see Tip #11 for more information on dealing with mixed alphanumerics.

Using hints to discern between relevant and irrelevant homonyms for your use case (e.g., between “chicken” and “checking”) is one strategy; re-prompting based on the available relevant choices (e.g., “I think you said ‘checking,’ not ‘savings’ – is that correct?”) is another. A third strategy is to use a virtual agent “bot” that can make probabilistic, statistically informed guesses (i.e., use predictive AI) to pick the best choice from among the relevant alternatives, such as our integration with Dialogflow CX.
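
As a minimal sketch of that first strategy, the TwiML below seeds the recognizer with the only two valid account types (the prompt wording and action URL are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather action="https://example.com/account-type" input="speech" hints="checking, savings">
    <Say>Would you like your checking or savings balance?</Say>
  </Gather>
</Response>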

4. Use enhanced speech recognition, and pick the <Gather> Google ASR speech model best suited for your use case

The enhanced attribute instructs <Gather> to use a premium speech model that improves the accuracy of transcription results. The premium speech model is only supported with the phone_call speechModel. The premium phone_call model was built using thousands of hours of training data, and it produces 54% fewer errors when transcribing phone conversations compared to the basic phone_call model.

The following TwiML instructs <Gather> to use the premium phone_call model:

<Gather input="speech" enhanced="true" speechModel="phone_call">
  <Say>Please tell us why you're calling</Say>
</Gather>

<Gather> will ignore the enhanced attribute if any speechModel other than phone_call is used.

For most voice-input use cases collecting short, individual utterances from an English-speaking user, Twilio recommends using the enhanced phone_call model with speechTimeout set to auto, instead of Google’s default speech model; phone_call is the speech model best suited for use cases where you'd expect to receive queries such as voice commands or voice searches.
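
In TwiML, that recommended combination looks like this (the prompt is illustrative):

<Gather input="speech" enhanced="true" speechModel="phone_call" speechTimeout="auto">
  <Say>How can we help you today?</Say>
</Gather>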

In languages other than English, experimental_utterances may be a better choice for endpointing (i.e., a lower-latency start of speech recognition). For more on the experimental models, see below.

The Dialogflow CX <VirtualAgent> uses Google’s default speech model by default, but other speech models can be specified with the <Config> noun nested inside the <VirtualAgent> noun. The phone_call speech model is also best for audio that originated from a PSTN phone call (typically an 8 kHz sample rate), because of the data the model was trained on and its noise tolerance and noise reduction characteristics, particularly in the enhanced variant. Google has written extensively about how it trained these models and how they perform versus other models.
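
As a sketch, overriding the connector’s speech model might look like the following – the connectorName value is a placeholder, and the Config name shown is illustrative; check the <VirtualAgent> docs for the supported names:

<Response>
  <Connect>
    <VirtualAgent connectorName="my-dialogflow-cx-connector">
      <!-- "speechModel" is an illustrative Config name; see the docs for supported names -->
      <Config name="speechModel" value="phone_call"/>
    </VirtualAgent>
  </Connect>
</Response>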

With phone_call, if you don’t set speechTimeout to auto as suggested above, you will need to set speechTimeout to a positive integer number of seconds.

If you’re picking these attributes and aren’t sure whether the combination is valid, Twilio’s Notifications will send a debugger event letting you know when it isn’t (see Warning 13335 or Warning 13334).

Twilio’s experimental speech models are designed to give access to Google’s latest speech technology and machine learning research for some more specialized use cases. They can provide higher speech recognition accuracy than the other available models, depending on use case and language. However, the experimental models do not yet support some features that the other speech models do, such as confidence scores (more on that below).

Of special note, the experimental_utterances model is best suited to short utterances only a few seconds in length, in languages other than English. It’s especially useful for capturing commands or other single-shot directed speech (e.g., "press 0 or say 'support' to speak with an agent") in non-English languages. Alternatively, the numbers_and_commands speech model might also work for such cases.

The experimental_conversations model supports longer, spontaneous speech and conversations. For example, it is useful for responses to a prompt like "tell us why you're calling today," for capturing the transcript of an interactive session, or for longer spoken messages within the 60-second snippets that <Gather> supports.

Both experimental_conversations and experimental_utterances values for speechModel support the set of languages listed here.

One final but especially important point about building speech models into your ASR application: you can change the speech model multiple times within a single TwiML application, over the course of multiple questions or prompts, to best suit the type of speech input you expect at each step. That is, you can specify the speech model, hints, and other attributes per individual <Gather> in a TwiML app to optimize recognition accuracy.
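
For instance, a single flow might collect a numeric account number with the enhanced phone_call model and a digit-sequence hint, then let the action handler respond with a second <Gather> tuned for free-form speech. A sketch (URLs and prompts are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <!-- Short numeric input: phone_call plus a digit-sequence class token -->
  <Gather action="https://example.com/account" input="speech" enhanced="true" speechModel="phone_call" speechTimeout="auto" hints="$OOV_CLASS_DIGIT_SEQUENCE">
    <Say>Please say your account number.</Say>
  </Gather>
</Response>

And the TwiML the /account handler might return for the open-ended follow-up:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <!-- Spontaneous, longer speech: switch to experimental_conversations -->
  <Gather action="https://example.com/reason" input="speech" speechModel="experimental_conversations">
    <Say>Thanks. Now, tell us why you're calling today.</Say>
  </Gather>
</Response>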

5. Engineer your prompts to encourage natural and clear speech input

Encourage users to speak naturally and avoid rushing during interactions with the IVR system. Natural speech patterns improve the accuracy of speech recognition.

First, you’ll want prompts to be long enough that Twilio and Google are ready for the speech input by the time the caller starts talking – but not so long that the user is put off. Telling users what to say or how to say it – or giving examples – isn’t a bad idea.

In addition, the prompting questions asked should either:

  1. Be sufficiently narrow in scope that a generalized speech recognition engine has a decent chance of recognizing the answer from among a limited set of possible valid ones (for example, using plenty of verbal cues, such as “you can say things like ‘account balance’, or ask when your local branch is open”), or
  2. If a very broad question is the right starting point for conversations with your customers, consider using other, more structured tools for managing detected intents, utterances, and phrases, and take advantage of those tools’ auto-generation of training phrases and management of homonyms.

In short, if the set of answers and possible actions is small and short, building your own bot with ASR tools alone is a great idea. If the list of answers and actions is long, getting successful recognitions, correct routing, and correct answers can be complicated, so consider pairing speech recognition with a predictive AI bot-building tool like Twilio’s <VirtualAgent> connector for Google Dialogflow CX.
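
To illustrate the first approach, a narrow prompt with explicit verbal cues might look like this (the wording and hints are just examples):

<Response>
  <Gather input="speech" hints="account balance, branch hours, speak to an agent">
    <Say>You can say things like account balance, or ask when your local branch is open.</Say>
  </Gather>
</Response>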

6. Keep it clean (if you want)

The profanityFilter attribute of <Gather> specifies whether Twilio should filter profanities out of your speech recognition results and transcriptions. This attribute defaults to true, which replaces all but the initial character of each filtered profane word with asterisks. You can also use Twilio Voice Intelligence on recorded transcripts to detect customer sentiment for later flagging or Segment profile updates.
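
If you’d rather receive the unfiltered transcription, switch the filter off explicitly:

<Response>
  <Gather input="speech" profanityFilter="false">
    <Say>In your own words, tell us about your experience today.</Say>
  </Gather>
</Response>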

7. Offer DTMF as Backup

Provide Dual-Tone Multi-frequency (DTMF), also known as “touch tones,” as an alternative input method when speech recognition fails. This allows users to input responses using the keypad if needed.

The input attribute allows you to specify which inputs (DTMF or speech) Twilio should accept. The default input for <Gather> is dtmf, but you can set input to dtmf, speech, or dtmf speech.

If you’re expecting DTMF but the input from the caller might be speech, see the “Hints” section above in tip #3. You can set the number of digits you expect from your caller by including numDigits in <Gather>.

If you set input to speech, Twilio will gather speech from the caller for a maximum duration of 60 seconds.

If you set input to dtmf speech, the first detected input (speech or DTMF) takes precedence. If speech is detected first, finishOnKey (which finishes input on a specified DTMF key press) will be ignored.
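
Here’s a sketch that accepts either input for a four-digit PIN (the action URL is a placeholder):

<Response>
  <Gather action="https://example.com/pin" input="dtmf speech" numDigits="4" finishOnKey="#" hints="$OOV_CLASS_DIGIT_SEQUENCE">
    <Say>Please say or enter your four digit PIN.</Say>
  </Gather>
</Response>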

8. Stream it

Particularly if multiple real-time call orchestration steps are NOT required (depending on your use case), consider using Twilio Media Streams with an external speech recognition provider. Media streams let you send speech audio to an external speech recognition provider through the Twilio Marketplace.

Twilio Marketplace speech recognition partners can, for example, develop vocabularies optimized for certain industry verticals or use cases, or optimize for longer speech recognition “batching,” leading to improved recognition accuracy and performance in your application. But do note that Media Streams doesn’t yet support DTMF – that’s coming in a future version of Media Streams.
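
Forking the call audio out over a WebSocket takes one extra verb – the stream URL below is a placeholder for your provider’s endpoint:

<Response>
  <Start>
    <Stream url="wss://example.com/asr-audio" />
  </Start>
  <Say>Your call audio is now being streamed.</Say>
</Response>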

9. Leverage confidence scoring in the prompting application

When the caller finishes speaking or entering digits (or the timeout is reached), Twilio will make an HTTP request to the URL that the action attribute takes as a value. Twilio may send some extra parameters with its request after the <Gather> ends.

If you specify speech as an input with input="speech", Twilio will also include a Confidence parameter value along with the recognized speech result. Confidence contains a confidence score between 0.0 and 1.0 (the percentage confidence level of the result from 0% to 100% confidence). A higher confidence score means a better likelihood that the transcribed speech result is accurate.

Not all speech models return Confidence. Depending on your model choice, your code should not expect Confidence to be returned – but when it is present, you can leverage its value to take various actions.

After <Gather> ends and Twilio sends its request to your action URL, you can act on the result if the Confidence score is present. Speech recognition will never explicitly tell you it didn’t recognize a word; you need to infer that from the Confidence score.

For example, you could run a re-prompting routine when a result comes back below a specific Confidence threshold (e.g., < 0.2), until the input is recognized. Re-prompting after low Confidence scores, rather than simply moving forward with an empty or low-confidence speech recognition result, can avoid 500 errors sent back from your programmatic endpoint – or frustrated end users.
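
Here is a minimal sketch of that re-prompting pattern in Python, using Flask and the Twilio helper library; the route, threshold, and prompts are illustrative, and note that Confidence may be absent for some models:

from flask import Flask, request
from twilio.twiml.voice_response import VoiceResponse, Gather

app = Flask(__name__)

@app.route("/gather-result", methods=["POST"])
def gather_result():
    speech = request.form.get("SpeechResult")
    confidence = request.form.get("Confidence")  # may be absent for some models

    response = VoiceResponse()
    # Only gate on Confidence when the model actually returned one.
    if speech and (confidence is None or float(confidence) >= 0.2):
        response.say(f"I think you said {speech}.")
        # ...continue the flow from here...
    else:
        # Empty or low-confidence result: re-prompt with a narrower choice.
        gather = Gather(action="/gather-result", input="speech",
                        hints="checking, savings")
        gather.say("Sorry, I didn't catch that. Checking or savings?")
        response.append(gather)
    return str(response), 200, {"Content-Type": "text/xml"}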

Using re-prompting cleverly (for instance, applying hints along with a more specific, constrained-choice re-prompt) to select from among the available relevant choices is a great “combination” strategy, pairing this tip with tip #3 above.

10. Don’t exhaust your callers’ patience

After a reasonable number of retries – likely two or three at most – try another tactic.

After a certain number of failures, consider transferring the call to a live agent, who may be better able to cope with noisy, indistinct, or unexpected input. Studio and Dialogflow CX make this straightforward with a “Live Agent Handoff” option configurable on the Studio widget. Or, if a customer’s response or question is sufficiently off-script but you still wish to handle it with an automated agent, consider doing a voice-enabled generative AI search for an answer to their query.
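
In raw TwiML, the final fallback after exhausting your retry budget can be as simple as dialing out to a queue (the queue name is illustrative):

<Response>
  <Say>Sorry, I'm having trouble understanding. Let me connect you with an agent.</Say>
  <Dial>
    <Queue>support</Queue>
  </Dial>
</Response>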

11. Implement 2FA or other post-processing techniques to deal with near-homonyms

Though its capabilities are improving quickly, ASR struggles mightily with homonyms: words or phonemes that sound alike but have different meanings. In particular, alphanumerics – for example, an insurance policy number, bank account number, or patient ID with both letters and numbers in it – can be extremely problematic.

Unfortunately, these are also quite commonly needed in the self-service (IVR) automation and notification-delivery use cases where ASR is used.

One solution is to use a combination of tools, such as Twilio Verify for Two-Factor Authentication (2FA), to request only a portion of a mixed alphanumeric ID. For example, you could ask for the last four digits, or only the numeric section, of an ID.

With part of an ID, you can then verify via a text message not only that the system has looked up the correct account number, but also that the system is talking to the correct person. Other post-processing solutions involve using 2FA along with some combination of the techniques above: re-prompting upon a low-confidence recognition score, prompt engineering (being more specific about the letters used in the re-prompt), and so on.
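
As a sketch of the Verify leg in Python (the service SID and phone number handling are placeholders; in practice the number would come from the account you looked up):

import os
from twilio.rest import Client

client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])
VERIFY_SERVICE_SID = "VAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # placeholder

def send_code(phone_number: str) -> None:
    # Text a one-time code to the number on file for the matched account.
    client.verify.v2.services(VERIFY_SERVICE_SID).verifications.create(
        to=phone_number, channel="sms")

def check_code(phone_number: str, code: str) -> bool:
    # Confirm the code the caller reads back (or keys in via DTMF).
    check = client.verify.v2.services(VERIFY_SERVICE_SID) \
        .verification_checks.create(to=phone_number, code=code)
    return check.status == "approved"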

Maximizing your chance of success with Automatic Speech Recognition applications

Hopefully, this post has given you some valuable hints, tips, and tricks to architect your speech recognition application for success. By implementing these best practices, you’ll find yourself with happier users – both customers and agents, alike!

Once you’ve implemented the best practices, read our next post in this series about using Google Dialogflow CX’s Virtual Agent Bot.

More resources

If you're developing an IVR with Twilio, we've put together an Interactive IVR Demo you'll want to see!

Russ Kahan is the Principal Product Manager for <Gather> Speech Recognition, Dialogflow Virtual Agents, Media Streams, and SIPREC at Twilio. He’s enjoyed programming voice apps and conversing with robots since sometime back in the late nineties – when this stuff was still called “CTI,” for “Computer Telephony Integration” – but he also enjoys real-world pursuits like scouting, skiing, swimming, and mountain biking with his kids. Reach him at rkahan [at] twilio.com.

Jeff Foster is a Software Engineer on Twilio's Programmable Voice team, and he’s been working on Speech Recognition at Twilio for the last 6 years – including the original Dialogflow prototype implementations more than 2 years ago. He can be reached at jfoster [at] twilio.com.