Speech Recognition is Now Generally Available

October 25, 2017
Written by
Kris Gutta
Twilion

Speech Recognition Generally Available
  • Convert speech to text and analyze its intent during any voice call.
  • Support for 119 languages and dialects.
  • Now Generally Available.

We are excited to announce that Speech Recognition is now Generally Available. This allows developers to convert speech to text and analyze its intent during any call coming through the Twilio Programmable Voice API.

With Speech Recognition, there are no models to train or machine learning to orchestrate. You simply specify that you’d like to take voice input and you’re good to go. It works in 119 languages and dialects and has a simple, pay-as-you-go pricing structure.

Here are some areas where Speech Recognition will come in handy:

  • IVR phone tree navigation: With Speech Recognition, you can build IVRs that callers navigate by voice instead of by keypad (DTMF) input. Letting your customers just say what they need is faster, easier, and leaves them happy.
  • Data capture: Now, you don’t need an agent to capture short snippets of data such as addresses, names, and account numbers—simply use Speech Recognition in your app.
  • Conversational bots: Use Speech Recognition together with Understand to let customers talk to bots instead of agents.
  • Real-time transcriptions: Convert conversations between your customers and agents to text, then measure call success by analyzing keywords or running sentiment analysis. You can also use keyword spotting to get alerted to conversations that need your attention.

How Does it Work?

Speech recognition is integrated directly into Twilio’s <Gather> verb so you can update the code you already have in place. Because it supports 119 languages and dialects, you can upgrade your application to support customers across a broad range of regions with hardly any effort at all. Adding speech is as simple as adding a new parameter called “input” as shown in the TwiML below:

<?xml version="1.0" encoding="UTF-8"?> 
<Response> 
  <Gather input="speech"> </Gather>
</Response>

You can also provide up to 500 words of hints to boost the accuracy of the speech-to-text result. Customers in the beta version of Speech Recognition found that providing good hints, meaning the words and phrases callers are likely to say, makes the matching very accurate.

<?xml version="1.0" encoding="UTF-8"?> 
<Response> 
  <Gather input="speech" hints="customer support, sales, marketing, engineering, product, sales enablement, sales engineering"> </Gather>
</Response>
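If your hints come from application data rather than a hand-written string, you may want to assemble the attribute in code. The following is a minimal sketch in Python using only the standard library. The function name build_gather_twiml and the constant MAX_HINT_WORDS are hypothetical, and counting the 500-word limit as whitespace-separated words across all hints is an assumption about how the limit is measured.

```python
import xml.etree.ElementTree as ET

# Limit stated in this post. Counting it as whitespace-separated words
# across all hint phrases is an assumption, not documented behavior.
MAX_HINT_WORDS = 500


def build_gather_twiml(hints):
    """Return a TwiML document with a speech <Gather> carrying the given hints."""
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [h.strip() for h in hints if h.strip()]
    word_count = sum(len(h.split()) for h in cleaned)
    if word_count > MAX_HINT_WORDS:
        raise ValueError(
            f"{word_count} hint words exceeds the limit of {MAX_HINT_WORDS}"
        )

    response = ET.Element("Response")
    gather = ET.SubElement(response, "Gather", {"input": "speech"})
    if cleaned:
        # Hints are passed as one comma-separated attribute value.
        gather.set("hints", ", ".join(cleaned))

    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_gather_twiml(["customer support", "sales", "marketing"]))
```

Building the document with an XML library rather than string concatenation means characters such as & or quotes in a hint are escaped for you.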

Finally, you can use speechTimeout to control how long Twilio waits after the caller pauses before it stops recognition and sends the results back to you.

If you’re expecting a “short utterance” as an input, you should use speechTimeout with value auto. Examples of “short utterances” include collecting numbers, addresses or short commands.

On the other hand, if you expect your users to speak “long utterances”, set speechTimeout to a number of seconds greater than 0. For example, if you expect your customer to say something like “I would like help resetting my debit card PIN”, you’ll want to specify a timeout.

Below is an example where the speechTimeout has been set to two seconds.

<?xml version="1.0" encoding="UTF-8"?> 
<Response> 
  <Gather input="speech" hints="customer support, sales, marketing, engineering, product, sales enablement, sales engineering" speechTimeout="2"> </Gather>
</Response>
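The choice between auto and a numeric speechTimeout can be made per prompt. Here is a small sketch of that decision in Python. The helper names speech_timeout_for and gather_attributes are hypothetical, and the two-second default simply mirrors the example above rather than any recommended value.

```python
def speech_timeout_for(expect_long_utterance, pause_seconds=2):
    """Pick a speechTimeout value following the guidance in this post."""
    if not expect_long_utterance:
        # Short utterances (numbers, addresses, short commands) use "auto".
        return "auto"
    # Long utterances need a whole number of seconds greater than 0.
    if not isinstance(pause_seconds, int) or pause_seconds <= 0:
        raise ValueError("pause_seconds must be a whole number greater than 0")
    return str(int(pause_seconds))


def gather_attributes(hints, expect_long_utterance, pause_seconds=2):
    """Assemble the attributes for a speech <Gather> as a plain dict."""
    return {
        "input": "speech",
        "hints": ", ".join(hints),
        "speechTimeout": speech_timeout_for(expect_long_utterance, pause_seconds),
    }


if __name__ == "__main__":
    # A department prompt expects a short command.
    print(gather_attributes(["sales", "marketing"], expect_long_utterance=False))
    # An open-ended "how can we help?" prompt expects a full sentence.
    print(gather_attributes(["debit card", "PIN"], expect_long_utterance=True))
```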

To see Speech Recognition in action, check out this video—you’ll learn how to build a facts hotline that returns facts about cats, numbers, and Chuck Norris. You’ll also be able to see the usefulness of Speech Recognition in interactive voice response (IVR) applications.

 

Summary

Speech Recognition works in 119 languages and dialects and has a simple, pay-as-you-go pricing structure. To get started, log in or create a Twilio account, and check out our docs to start building.