Introducing Speech Recognition – Public Beta Now Open

May 24, 2017

  • Convert speech to text and analyze its intent during any voice call.
  • Support for 89 languages and dialects.
  • Available now in public beta.

Speech is a powerful and expressive medium for customer communications. With speech technology improving massively over the last four years, we were excited to leverage that progress to finally offer Twilio developers a speech recognition feature for Programmable Voice. Starting today, Twilio Speech Recognition allows developers to convert speech to text and analyze its intent during any voice call, and is available in public beta. There are no models to train or complicated machine learning to orchestrate.

Our customers have long used keypad input to navigate users through phone menus and collect their feedback on surveys. While keypad input is now universally understood by users, it can be cumbersome and imprecise, and isn’t always a great experience for the caller.

Over the next several years, we expect speech-driven interfaces to become ubiquitous. The potential for nuanced human-machine interaction driven by speech is readily apparent to anyone who has asked Alexa to play their favorite music from Spotify.

With Speech Recognition, you can now capture speech from your customers in real time. It works in 89 languages and dialects, and has a simple, pay-as-you-go pricing structure.

<Gather> with Speech

Speech recognition is integrated directly into Twilio’s <Gather> verb so you can update the code you already have in place. Because it supports 89 languages and dialects, you can upgrade your application to support customers across a broad range of regions. Adding speech is as simple as adding a new parameter called “input” as shown in the TwiML below.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech" action="/finalresult">
    <Say>Welcome to Twilio, how can I help you?</Say>
  </Gather>
</Response>

If you specify speech as an input, Twilio will add a new parameter called SpeechResult, along with a Confidence score, to the request it sends to your action URL.

 

AccountSid: AC25e16e9a616a4a1786a7c83f58e30082
ApiVersion: 2010-04-01
CallSid: CA607dee6b7647243904ebc8db64a2a5c2
CallStatus: in-progress
Called: +18182004120
Confidence: 0.77388394
Direction: inbound
From: +15623000628
Language: en-US
SpeechResult: I’d like to learn more about Speech Recognition
To: +18182104120
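
As a rough illustration of consuming these parameters, here is a minimal Python sketch that pulls the speech fields out of a request body. It assumes the request arrives as a form-encoded POST body, and the function names and the 0.6 confidence threshold are hypothetical choices for this example, not part of the Twilio API.

```python
from urllib.parse import parse_qs


def parse_gather_result(body: str) -> dict:
    """Extract the speech fields from a form-encoded request body."""
    fields = parse_qs(body, keep_blank_values=True)
    # parse_qs returns a list of values per key; keep the first of each.
    first = {key: values[0] for key, values in fields.items()}
    confidence = first.get("Confidence")
    return {
        "call_sid": first.get("CallSid"),
        "language": first.get("Language"),
        "speech_result": first.get("SpeechResult", ""),
        "confidence": float(confidence) if confidence else None,
    }


def is_confident(result: dict, threshold: float = 0.6) -> bool:
    """Hypothetical helper: treat a result as usable above a chosen threshold."""
    confidence = result["confidence"]
    return confidence is not None and confidence >= threshold
```

An application might use a check like `is_confident` to decide whether to act on the transcript or ask the caller to repeat themselves. The right threshold depends on the use case.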

 

If you’d like to build more responsive applications, we also offer the ability to get speech results in real time as we process speech. To access the real-time voice stream, you can specify a partial results callback:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech" action="/finalresult" partialResultCallback="/partialresult">
    <Say>Welcome to Twilio, how can I help you?</Say>
  </Gather>
</Response>

Once you specify a callback URL for partialResultCallback, you will get requests as your customers speak. Since HTTP requests may arrive out of order, we include a sequence number so you can process your customer’s speech in the order it was spoken.

SequenceNumber: 1
UnstableSpeechResult: Disco
 
SequenceNumber: 2
UnstableSpeechResult: fiscal
 
SequenceNumber: 0
UnstableSpeechResult: this
 
SequenceNumber: 3
UnstableSpeechResult: Fiscal Sanity
 
SequenceNumber: 4
UnstableSpeechResult: this call Sandra
 
SequenceNumber: 5
UnstableSpeechResult: this will send
 
SequenceNumber: 7
UnstableSpeechResult: this will send requests
 
SequenceNumber: 6
UnstableSpeechResult: this will send requests
 
SequenceNumber: 8
UnstableSpeechResult: this will send requests
 
SequenceNumber: 9
UnstableSpeechResult: this will send requests
 
SequenceNumber: 10
UnstableSpeechResult: this will send requests as
 
SequenceNumber: 11
UnstableSpeechResult: This will send requests as you.
 
SequenceNumber: 12
UnstableSpeechResult: This will send requests as you see.
 
SequenceNumber: 13
UnstableSpeechResult: This will send requests as you speak.
 
SequenceNumber: 14
UnstableSpeechResult: This will send requests as you speak.
 
SequenceNumber: 15
UnstableSpeechResult: This will send requests as you speak.
 
SequenceNumber: 16
UnstableSpeechResult: This will send requests as you speak.
 
SpeechResult: This will send requests as you speak.

This allows you to evaluate the speech of your user as they speak to build responsive voice applications. A detailed explanation of Speech Recognition features and TwiML examples can be found here.
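
One way to cope with out-of-order partial results is to keep only the transcript with the highest sequence number seen so far. The Python sketch below assumes each partial result is a fresh guess at the whole utterance (as the sample above suggests), so a newer result supersedes older ones. The class and method names are hypothetical.

```python
class PartialResultBuffer:
    """Tracks the newest partial transcript seen so far, by sequence number."""

    def __init__(self) -> None:
        self.latest_seq = -1
        self.latest_text = ""

    def add(self, sequence_number: int, text: str) -> bool:
        """Record a partial result; return True if it became the newest one."""
        if sequence_number <= self.latest_seq:
            # A stale or duplicate request that arrived late: ignore it.
            return False
        self.latest_seq = sequence_number
        self.latest_text = text
        return True
```

With the sample above, the request carrying SequenceNumber 0 arrives after 1 and 2 and would be dropped, as would 6 arriving after 7. A real application would keep one buffer per CallSid.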

Pricing

Speech Recognition uses a scalable pay-as-you-go model, with requests starting at $0.02 per 15 seconds of recognition. Those who have operated a speech recognition system know how time-consuming and difficult planning for channels or ports can be. Speech Recognition from Twilio does away with this burden and scales with your business: plug it in and it just works. If you’re planning for significant traffic, it’s important to know that volume-based discounts can cut the price of Speech Recognition to as little as $0.008 per 15 seconds. Full volume tiers can be found here.
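
For a back-of-the-envelope estimate, the per-15-second prices above can be turned into a small calculation. This Python sketch assumes usage is rounded up to whole 15-second increments, which the pricing description here does not confirm. The function name is hypothetical, and the actual volume tiers are not modeled.

```python
import math
from decimal import Decimal

LIST_PRICE = Decimal("0.02")         # USD per 15 seconds, list price from the post
DISCOUNTED_PRICE = Decimal("0.008")  # USD per 15 seconds, lowest volume price from the post


def recognition_cost(seconds: float, price_per_increment: Decimal = LIST_PRICE) -> Decimal:
    """Estimate cost, ASSUMING billing rounds up to whole 15-second increments."""
    increments = math.ceil(seconds / 15)
    return price_per_increment * increments
```

Under that assumption, 40 seconds of recognition spans three increments, so `recognition_cost(40)` comes to $0.06 at list price and `recognition_cost(40, DISCOUNTED_PRICE)` to $0.024 at the lowest volume price.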

How to Get Started

Speech recognition is available to all Twilio developers today. To get started, check out our docs. If you have any questions about moving your traffic or adding Speech Recognition to your Twilio application, don’t hesitate to reach out to our Sales team.

What’s Next: Understand

Speech Recognition is only the beginning for voice-driven interfaces built on Twilio. Coming soon, we will be releasing a new verb: Understand. It’s exactly what you hope it is: an API to analyze text and determine intent during a live call using natural language understanding. Powered by machine learning, Understand will give developers what they need to build intelligent, nuanced human-machine interactions in order to turn freeform text into structured data. It will work natively with both Twilio Programmable Voice and SMS, as well as Amazon Alexa.

Stay tuned for more. We can’t wait to see what you build.