- Convert speech to text and analyze its intent during any voice call.
- Support for 89 languages and dialects.
- Available now in public beta.
Speech is a powerful and expressive medium for customer communications. With speech technology improving massively over the last four years, we were excited to leverage that progress to finally offer Twilio developers a speech recognition feature for Programmable Voice. Starting today, Twilio Speech Recognition allows developers to convert speech to text and analyze its intent during any voice call, and is available in public beta. There are no models to train or complicated machine learning to orchestrate.
Our customers have long used keypad input to navigate users through phone menus and collect their feedback on surveys. While keypad input is now universally understood by users, it can be cumbersome and imprecise, and isn’t always a great experience for the caller.
Over the next several years, we expect speech-driven interfaces to become ubiquitous. The potential for nuanced human-machine interaction driven by speech is readily apparent to anyone who has asked Alexa to play their favorite music from Spotify.
With Speech Recognition, you can now capture speech from your customers in real time. It works in 89 languages and dialects and has a simple, pay-as-you-go pricing structure.
<Gather> with Speech
Speech recognition is integrated directly into Twilio’s <Gather> verb so you can update the code you already have in place. Because it supports 89 languages and dialects, you can upgrade your application to support customers across a broad range of regions. Adding speech is as simple as adding a new parameter called “input” as shown in the TwiML below.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech" action="/finalresult">
    <Say>Welcome to Twilio, how can I help you?</Say>
  </Gather>
</Response>
If you specify speech as an input, Twilio will add a new parameter called SpeechResult to the request it sends to your action URL.
SpeechResult: I’d like to learn more about Speech Recognition
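As a rough illustration of consuming SpeechResult at the action URL, here is a minimal Python sketch. It assumes the request body arrives form-encoded (an assumption here, not something stated above), and the function name handle_final_result is hypothetical. Wiring it into a web framework is left out.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape


def handle_final_result(raw_body: str) -> str:
    """Build a TwiML reply from the body of a request to the action URL.

    Assumption: the body is form-encoded, e.g. "SpeechResult=Hello+world".
    """
    params = parse_qs(raw_body)
    # parse_qs maps each key to a list of values; take the first, if any.
    speech = params.get("SpeechResult", [""])[0].strip()
    if speech:
        message = f"You said: {speech}"
    else:
        message = "Sorry, I didn't catch that."
    # Escape the transcript so characters like & or < keep the XML well-formed.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Say>{escape(message)}</Say></Response>"
    )
```

The fallback branch matters in practice: a caller may stay silent, in which case this sketch assumes SpeechResult is missing or empty.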
If you’d like to build more responsive applications, we also offer the ability to get speech results in real time as we process speech. To access the real-time voice stream, you can specify a partial results callback:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech" action="/finalresult" partialResultCallback="/partialresult">
    <Say>Welcome to Twilio, how can I help you?</Say>
  </Gather>
</Response>
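If you generate TwiML from code rather than serving a static file, the same document can be assembled programmatically. The sketch below uses only Python's standard library; build_gather_twiml is a hypothetical helper name, and the attribute names are the ones shown in the TwiML above.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_gather_twiml(prompt: str, action: str,
                       partial_callback: Optional[str] = None) -> str:
    """Return a <Response> containing a speech-enabled <Gather>."""
    response = ET.Element("Response")
    attrs = {"input": "speech", "action": action}
    if partial_callback is not None:
        # Only request partial results when a callback URL is supplied.
        attrs["partialResultCallback"] = partial_callback
    gather = ET.SubElement(response, "Gather", attrs)
    ET.SubElement(gather, "Say").text = prompt
    # tostring() with encoding="unicode" omits the XML declaration, so add it.
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


print(build_gather_twiml("Welcome to Twilio, how can I help you?",
                         "/finalresult", "/partialresult"))
```

Building the tree with ElementTree instead of string concatenation means the prompt text and URLs are escaped automatically.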
Once you specify a callback URL for partialResultCallback, you will get requests as your customers speak. Since HTTP requests may arrive out of order, we include a sequence number so you can put the partial results back in the order they were spoken.
SequenceNumber: 1 UnstableSpeechResult: Disco
SequenceNumber: 2 UnstableSpeechResult: fiscal
SequenceNumber: 0 UnstableSpeechResult: this
SequenceNumber: 3 UnstableSpeechResult: Fiscal Sanity
SequenceNumber: 4 UnstableSpeechResult: this call Sandra
SequenceNumber: 5 UnstableSpeechResult: this will send
SequenceNumber: 7 UnstableSpeechResult: this will send requests
SequenceNumber: 6 UnstableSpeechResult: this will send requests
SequenceNumber: 8 UnstableSpeechResult: this will send requests
SequenceNumber: 9 UnstableSpeechResult: this will send requests
SequenceNumber: 10 UnstableSpeechResult: this will send requests as
SequenceNumber: 11 UnstableSpeechResult: This will send requests as you.
SequenceNumber: 12 UnstableSpeechResult: This will send requests as you see.
SequenceNumber: 13 UnstableSpeechResult: This will send requests as you speak.
SequenceNumber: 14 UnstableSpeechResult: This will send requests as you speak.
SequenceNumber: 15 UnstableSpeechResult: This will send requests as you speak.
SequenceNumber: 16 UnstableSpeechResult: This will send requests as you speak.
SpeechResult: This will send requests as you speak.
This allows you to evaluate the speech of your user as they speak to build responsive voice applications. A detailed explanation of Speech Recognition features and TwiML examples can be found here.
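One way to cope with out-of-order callbacks is to keep only the partial result with the highest sequence number seen so far. The sketch below assumes, based on the sample output above, that each UnstableSpeechResult is a complete hypothesis for the utterance so far, so a newer one supersedes older ones. The class name PartialTranscript is hypothetical.

```python
class PartialTranscript:
    """Tracks the newest partial result seen, keyed by sequence number."""

    def __init__(self) -> None:
        self.latest_seq = -1
        self.text = ""

    def update(self, sequence_number: int, unstable_result: str) -> bool:
        """Record a partial result; return True if it replaced the stored text."""
        if sequence_number <= self.latest_seq:
            # Late arrival: a newer hypothesis is already stored, so drop it.
            return False
        self.latest_seq = sequence_number
        self.text = unstable_result
        return True


transcript = PartialTranscript()
# Arrival order taken from the start of the sample above: 1, 2, then a late 0.
for seq, text in [(1, "Disco"), (2, "fiscal"), (0, "this")]:
    transcript.update(seq, text)
print(transcript.text)
```

In a real application you would keep one such object per call, and treat the final SpeechResult sent to the action URL as authoritative.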
Speech Recognition uses a scalable pay-as-you-go model, with requests starting at $0.02 per 15 seconds of recognition. Those who have operated a speech recognition system know how time-consuming and difficult planning for channels or ports can be. Speech Recognition from Twilio does away with this burden and scales with your business: plug it in and it just works. If you’re planning for significant traffic, it’s important to know that volume-based discounts can cut the price of Speech Recognition to as little as $0.008 per 15 seconds. Full volume tiers can be found here.
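To get a feel for the numbers, here is a small estimator in Python. It assumes that recognition time is billed in whole 15-second blocks, rounded up; that rounding rule is an assumption made for illustration, so check the pricing page for the actual billing behavior. The function name estimate_cost is hypothetical.

```python
import math
from decimal import Decimal

BLOCK_SECONDS = 15  # pricing is quoted per 15 seconds of recognition


def estimate_cost(seconds: float,
                  price_per_block: Decimal = Decimal("0.02")) -> Decimal:
    """Estimate the recognition cost in dollars for one request.

    Assumption: partial blocks are rounded up to a full 15-second block.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    blocks = math.ceil(seconds / BLOCK_SECONDS)
    # Decimal keeps the currency arithmetic exact.
    return blocks * price_per_block


print(estimate_cost(40))                     # 3 blocks at the list price
print(estimate_cost(40, Decimal("0.008")))   # same request at the lowest tier
```

Under that assumption, 40 seconds of recognition spans three blocks, which comes to $0.06 at the starting price and $0.024 at the lowest volume tier.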
How to Get Started
Speech recognition is available to all Twilio developers today. To get started, check out our docs. If you have any questions about moving your traffic or adding Speech Recognition to your Twilio application, don’t hesitate to reach out to our Sales team.
What’s Next: Understand
Speech Recognition is only the beginning for voice-driven interfaces built on Twilio. Coming soon, we will be releasing a new verb: Understand. It’s exactly what you hope it is: an API to analyze text and determine intent during a live call using natural language understanding. Powered by machine learning, Understand will give developers what they need to build intelligent, nuanced human-machine interactions in order to turn freeform text into structured data. It will work natively with both Twilio Programmable Voice and SMS, as well as Amazon Alexa.
Stay tuned for more—we can’t wait to see what you build.