Introducing Speech Recognition – Public Beta Now Open

Speech Recognition for Programmable Voice Twilio
  • Convert speech to text and analyze its intent during any voice call.
  • Support for 89 languages and dialects.
  • Available now in public beta.

Speech is a powerful and expressive medium for customer communications. With speech technology improving massively over the last four years, we were excited to leverage that progress to finally offer Twilio developers a speech recognition feature for Programmable Voice. Starting today, Twilio Speech Recognition allows developers to convert speech to text and analyze its intent during any voice call, and is available in public beta. There are no models to train or complicated machine learning to orchestrate.

Our customers have long used keypad input to navigate users through phone menus and collect their feedback on surveys. While keypad input is now universally understood by users, it can be cumbersome and imprecise, and isn’t always a great experience for the caller.

Over the next several years, we expect speech-driven interfaces to become ubiquitous. The potential for nuanced human-machine interaction driven by speech is readily apparent to anyone who has asked Alexa to play their favorite music from Spotify.

With Speech Recognition, you can now capture speech from your customers in real time. It works in 89 languages and dialects, and has a simple, pay-as-you-go pricing structure.

<Gather> with Speech

Speech recognition is integrated directly into Twilio’s <Gather> verb so you can update the code you already have in place. Because it supports 89 languages and dialects, you can upgrade your application to support customers across a broad range of regions. Adding speech is as simple as adding a new parameter called “input” as shown in the TwiML below.
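As a minimal sketch of what such a TwiML document might look like, the Python below builds a <Gather> that accepts speech, using only the standard library. The action path /completed and the prompt text are hypothetical placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_speech_gather(action_url: str, prompt: str) -> str:
    """Return a TwiML document whose <Gather> collects speech input."""
    response = ET.Element("Response")
    # input="speech" is the new parameter described above; "action" is the
    # URL that receives the recognition result.
    gather = ET.SubElement(
        response, "Gather", {"input": "speech", "action": action_url}
    )
    # Prompt the caller while Twilio listens for speech.
    say = ET.SubElement(gather, "Say")
    say.text = prompt
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical action path and prompt, for illustration only.
twiml = build_speech_gather("/completed", "Tell us why you are calling today.")
print(twiml)
```

Generating the XML from code rather than writing it by hand keeps attribute values properly escaped, but a static TwiML file with the same attributes would work equally well.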

If you specify speech as an input, Twilio will add a new parameter called SpeechResult to the request it sends to your action URL.

 

AccountSid AC25e16e9a616a4a1786a7c83f58e30082
ApiVersion 2010-04-01
CallSid CA607dee6b7647243904ebc8db64a2a5c2
CallStatus in-progress
Called +18182004120
Confidence 0.77388394
Direction inbound
From +15623000628
Language en-US
SpeechResult I’d like to learn more about Speech Recognition
To +18182104120
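The parameters above could be read in an action handler roughly as follows. This is a sketch that assumes the request body arrives form-encoded. The min_confidence threshold is a hypothetical application-level choice, not something Twilio prescribes.

```python
from urllib.parse import parse_qs, urlencode


def read_speech_result(form_body: str, min_confidence: float = 0.5):
    """Extract the transcript and confidence from a form-encoded request body.

    Returns (text, confidence, accepted). accepted is False when there is no
    transcript or the confidence falls below min_confidence (a hypothetical
    threshold chosen by the application).
    """
    fields = {name: values[0] for name, values in parse_qs(form_body).items()}
    text = fields.get("SpeechResult", "")
    confidence = float(fields.get("Confidence", "0"))
    accepted = bool(text) and confidence >= min_confidence
    return text, confidence, accepted


# Simulated request body using values from the sample request above.
sample_body = urlencode({
    "CallSid": "CA607dee6b7647243904ebc8db64a2a5c2",
    "Confidence": "0.77388394",
    "Language": "en-US",
    "SpeechResult": "I'd like to learn more about Speech Recognition",
})
text, confidence, accepted = read_speech_result(sample_body)
print(text, confidence, accepted)
```

When accepted is False, an application might re-prompt the caller or fall back to keypad input instead of acting on a low-confidence transcript.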

 

If you’d like to build more responsive applications, we also offer the ability to get speech results in real time as we process speech. To access the real-time voice stream, you can specify a partial results callback:
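One way this might look, sketched in Python with the standard library: the same <Gather> with a partialResultCallback attribute added. Both endpoint paths (/completed and /partial) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical endpoint paths, for illustration only.
partial_gather = ET.Element("Gather", {
    "input": "speech",
    "action": "/completed",               # receives the final result
    "partialResultCallback": "/partial",  # receives results as the caller speaks
})
partial_response = ET.Element("Response")
partial_response.append(partial_gather)

partial_twiml = ET.tostring(partial_response, encoding="unicode")
print(partial_twiml)
```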

Once you specify a callback URL for partialResultCallback, you will get requests as your customers speak. Since HTTP requests may arrive out of order, we include a sequence number so you can reassemble your customer’s speech in the order it was spoken.
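A small reorder buffer is one way to cope with out-of-order callbacks. The sketch below is hypothetical and deliberately leaves out how the sequence number and text are pulled from each request, since those field names should come from the Speech Recognition docs.

```python
from typing import Dict, List


class PartialResultBuffer:
    """Collect partial results and read them back in spoken order."""

    def __init__(self) -> None:
        self._by_sequence: Dict[int, str] = {}

    def add(self, sequence_number: int, text: str) -> None:
        # Keep the first payload seen for each sequence number, so a
        # retried or duplicated request does not overwrite it.
        self._by_sequence.setdefault(sequence_number, text)

    def in_order(self) -> List[str]:
        # Sort by sequence number to recover the order of speech.
        return [self._by_sequence[n] for n in sorted(self._by_sequence)]


# Simulate three callbacks arriving out of order.
buffer = PartialResultBuffer()
buffer.add(2, "more about Speech Recognition")
buffer.add(0, "I'd like")
buffer.add(1, "to learn")
print(buffer.in_order())
```

Whether each partial result carries only new words or the whole utterance so far is not stated here, so the buffer simply preserves ordering and leaves interpretation to the caller.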

This allows you to evaluate the speech of your user as they speak to build responsive voice applications. A detailed explanation of Speech Recognition features and TwiML examples can be found here.

Pricing

Speech Recognition uses a scalable pay-as-you-go model, with requests starting at $0.02 per 15 seconds of recognition. Those who have operated a speech recognition system know how time-consuming and difficult planning for channels or ports can be. Speech Recognition from Twilio does away with this burden and scales with your business: plug it in and it just works. If you’re planning for significant traffic, it’s important to know that volume-based discounts can cut the price of Speech Recognition to as little as $0.008 per 15 seconds. Full volume tiers can be found here.
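To get a rough feel for these rates, here is a back-of-the-envelope estimator in Python. It assumes that a partially used 15-second block is billed as a full block. That rounding rule is an assumption for illustration, not a statement of Twilio's billing policy.

```python
from decimal import Decimal

STANDARD_RATE = Decimal("0.02")        # USD per 15 seconds, starting price
LOWEST_VOLUME_RATE = Decimal("0.008")  # USD per 15 seconds, best volume tier


def estimate_cost(seconds: int, rate: Decimal = STANDARD_RATE) -> Decimal:
    """Estimate the recognition cost for a given number of seconds."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    # ASSUMPTION: partial blocks are rounded up to a full 15-second block.
    blocks = -(-seconds // 15)  # ceiling division on integers
    return blocks * rate


# 40 seconds of recognition spans three 15-second blocks.
print(estimate_cost(40))                      # 0.06
print(estimate_cost(40, LOWEST_VOLUME_RATE))  # 0.024
```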

How to Get Started

Speech recognition is available to all Twilio developers today. To get started, check out our docs. If you have any questions about moving your traffic or adding Speech Recognition to your Twilio application, don’t hesitate to reach out to our Sales team.

What’s Next: Understand

Speech Recognition is only the beginning for voice-driven interfaces built on Twilio. Coming soon, we will be releasing a new verb: Understand. It’s exactly what you hope it is: an API to analyze text and determine intent during a live call using natural language understanding. Powered by machine learning, Understand will give developers what they need to build intelligent, nuanced human-machine interactions in order to turn freeform text into structured data. It will work natively with both Twilio Programmable Voice and SMS, as well as Amazon Alexa.

Stay tuned for more. We can’t wait to see what you build.

  • Michael Shields

    Awesome! I’ve been building this feature w/BlueMix this week; I’ll try this out today.

  • Rishav

    The link for full volume tier seems to be broken (404!) https://www.twilio.com/voice/speech-recognition

    • Thanks for the heads up. Want to drop me an email at gb@twilio.com? Would love to send you something to say thanks.

  • Mark Cofano

    A much needed feature! I know just how to put it to work.

  • paulatmodaco

    Very interesting, what is the correct link for volume pricing? Can’t find it!

  • coolspot

    Will the engine also support speech recognition grammars? It can be pretty difficult to parse the text manually.

  • Sam

    I see .012 for volume pricing but not .08

  • Exooc news

    Great and 10x for adding gather in czech language..
    Script now understands .CZ customers 👍 but to speak in CZ language I have to use a 3rd-party speech synthesizer :(

  • john harkin

    Thanks, it looks like the Record verb also supports it https://www.twilio.com/docs/api/twiml/record and https://stackoverflow.com/questions/40255132/speech-to-text-using-twilio. Looks like 2 ways to achieve same thing? Which is the way to go?

    • Hey John,

      Speech to text using <Record> and transcription was the only way you could achieve this with Twilio until now. However, it’s not the best method, as the transcriptions were asynchronous to the call and you could not know how long they would take. Using <Gather> allows you to get partial results as they happen, as well as the final result in a callback, and is definitely the way to go now that it is available. I would leave <Record> to just taking messages as part of a call now.

  • Jackpile

    I don’t mean to be negative. This is great, but your cost is 3.3x Google ASR ($0.006) and 5x Microsoft ($0.004). I would jump at this if you could offer this at $0.002/15-second utterance. What’s the point if you can’t be competitive?

    • Nico Acosta

      Hi Jack, thanks for the note, I appreciate your feedback. The two services you mention are different in an important way. Speech-to-text APIs like the ones you mention are great for web and mobile apps, but as standalone services they are not usable with telephony. Twilio’s Speech Recognition API is designed for real-time communications and fully integrated with the telephony stack, with key features for this use case like barge-in, where you can interrupt a prompt with speech, and support for both DTMF tones and speech at the same time. Other alternatives for doing Speech Recognition in telephony IVRs cost hundreds of thousands of dollars, require you to define complicated grammars, and make you host your own hardware. We’re happy to help with any use case you have in mind. Let us know how we can help: nico at twilio dot com

      • Jackpile

        Hi Nico, what’s your direct line and email so we can connect privately?

        • Megan Speir

          Hey Jack, Nico’s contact is nico at twilio dot com.

  • snewport

    Does this work in two-way conversations, or only in IVR-like scenarios? In other words, if my CSR is using Twilio Voice, can I be actively transcribing the inbound caller’s voice?