Transcribing Phone Calls using Twilio Media Streams with Java, WebSockets and Spring Boot

February 03, 2020

WebSockets are a web technology used to create long-lived bidirectional connections between a client and a web server over the internet. With Twilio Media Streams you can stream real-time audio from a phone call to your web application using WebSockets.

This blog post will show how to create a WebSocket server in Java using Spring Boot which will receive real-time audio from a phone call, and forward the audio data to Google’s Speech-to-Text to provide a live transcription of the voices on the call.

Requirements

In order to follow along, you will need to have:

If you just want to skip to the end, you can check out the completed project on GitHub.

Getting Started

The fastest way to create a new project with Spring Boot is to use the Spring Initializr. Leave Project, Language and Spring Boot version at their defaults. For Group and Artifact in the Project Metadata you are free to choose your own values, so long as you follow the Maven naming conventions. In the example code, I used lol.gilliard as the Group and websockets-transcription for the Artifact.

Lower down the page, add the WebSocket dependency then click the “Generate” button to generate and download the project. It will be downloaded as a zip file which you can unzip and import into your favourite IDE.

Screenshot of the Spring Initializr as described in the text.

Building the WebSocket Server

Spring will be able to manage the creation of the WebSocket connection, leaving the job of handling the data to us. Data is sent over WebSockets in small chunks called “Messages”.

Spring expects us to write code that can handle these messages. The easiest way to do this is to extend Spring’s AbstractWebSocketHandler.

Creating the WebSocket Handler

The project you downloaded from the Spring Initializr will have a single class in a subdirectory under src/main/java called <your_artifact_name>Application.java. In the same package, create a new class called TwilioMediaStreamsHandler with this code:

// Note: package name will depend on your group and artifact name
// Your IDE should be able to help you here
package lol.gilliard.websocketstranscription;

import org.springframework.web.socket.CloseStatus;
import org.springframework.web.socket.TextMessage;
import org.springframework.web.socket.WebSocketSession;
import org.springframework.web.socket.handler.AbstractWebSocketHandler;

public class TwilioMediaStreamsHandler extends AbstractWebSocketHandler {

  @Override
  public void afterConnectionEstablished(WebSocketSession session) {
     System.out.println("New connection has been established");
  }

  @Override
  public void handleTextMessage(WebSocketSession webSocketSession, TextMessage textMessage) {
     System.out.println("Message received, length is " + textMessage.getPayloadLength());
  }

  @Override
  public void afterConnectionClosed(WebSocketSession session, CloseStatus status) {
     System.out.println("Connection closed");
  }
}

See the full code on GitHub

When a new WebSocket client connects, the afterConnectionEstablished method will be called. Then the handleTextMessage method will be called repeatedly, once for every message. When the connection is closed, the afterConnectionClosed method will be called.

Notice that the WebSocketSession is passed into all these methods, which enables the app to keep track of multiple WebSocket connections at once.

Spring’s WebSocket support can handle binary messages and text messages using separate methods. Twilio Media Streams supplies the audio data as Base64-encoded payloads inside JSON text messages, which is why it’s only necessary to override handleTextMessage. The first iteration of this code will print the size of each message to System.out to verify that messages are being received, which is done in the body of the handleTextMessage method:

System.out.println("Message received, length is " + textMessage.getPayloadLength());

See the source on GitHub
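To give a feel for what is inside those text messages, here is a minimal sketch that pulls the audio out of one. It assumes the message shape described in Twilio's Media Streams documentation: an "event" field, and for "media" events a Base64 "payload" nested under "media". The class name MediaMessageSketch and the regex-based extraction are my own illustration, chosen to keep the example dependency-free. In a real handler a proper JSON library is the better choice.

```java
import java.util.Base64;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the finished project.
final class MediaMessageSketch {

    // Naive field extraction: fine for a demo, but it would break on escaped quotes
    // or escaped slashes. Use a JSON parser in production code.
    private static final Pattern EVENT = Pattern.compile("\"event\"\\s*:\\s*\"([^\"]*)\"");
    private static final Pattern PAYLOAD = Pattern.compile("\"payload\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the decoded audio bytes of a "media" message, or empty for any other message. */
    static Optional<byte[]> audioBytes(String json) {
        Matcher event = EVENT.matcher(json);
        if (!event.find() || !"media".equals(event.group(1))) {
            return Optional.empty();
        }
        Matcher payload = PAYLOAD.matcher(json);
        if (!payload.find()) {
            return Optional.empty();
        }
        return Optional.of(Base64.getDecoder().decode(payload.group(1)));
    }

    public static void main(String[] args) {
        // Made-up sample message with a tiny payload (four bytes once decoded)
        String sample = "{\"event\":\"media\",\"media\":{\"payload\":\"/////w==\"}}";
        int length = audioBytes(sample).map(bytes -> bytes.length).orElse(0);
        System.out.println(length + " bytes of audio");
    }
}
```

Messages with other event types carry no audio, so the sketch returns an empty Optional for them rather than failing.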

Configuring Spring to use our Handler

A WebSocket connection is established by the client sending a regular HTTP request. The server lets the client know that this endpoint expects WebSocket data with a handshake that starts with a response of HTTP 101 Switching Protocols. The client will acknowledge this and start sending messages. Spring can handle all this for us, with a little configuration.
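If you are curious about what Spring is doing during that handshake, one small piece of it can be sketched in plain Java. As part of the 101 response, the server proves it understood the request by hashing the client's Sec-WebSocket-Key header together with a fixed GUID from RFC 6455 and sending the result back as Sec-WebSocket-Accept. The class and method names below are my own, and this is only an illustration of that one step, not Spring's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical illustration of one handshake step; Spring does this for you.
final class HandshakeSketch {

    // Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept
    private static final String WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

    /** Computes the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key. */
    static String acceptValueFor(String secWebSocketKey) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(
                (secWebSocketKey + WEBSOCKET_GUID).getBytes(StandardCharsets.US_ASCII));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample key from RFC 6455; prints s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
        System.out.println(acceptValueFor("dGhlIHNhbXBsZSBub25jZQ=="));
    }
}
```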

In the same package as your existing classes create a new class called WebSocketConfig. This class will configure Spring to ensure that requests to a particular path (in our case /messages) will be handled by our WebSocketHandler code from above.

// Note: package name will depend on your group and artifact name
// Your IDE should be able to help you here
package lol.gilliard.websocketstranscription;

import org.springframework.context.annotation.Configuration;
import org.springframework.web.socket.config.annotation.EnableWebSocket;
import org.springframework.web.socket.config.annotation.WebSocketConfigurer;

@Configuration
@EnableWebSocket
public class WebSocketConfig implements WebSocketConfigurer {
  
}

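This doesn’t compile as-is because implementing WebSocketConfigurer means we need to implement a method called registerWebSocketHandlers. Add it to the class body, along with an import for WebSocketHandlerRegistry:

// Also needs: import org.springframework.web.socket.config.annotation.WebSocketHandlerRegistry;
@Override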
public void registerWebSocketHandlers(WebSocketHandlerRegistry registry) {
   registry.addHandler(new TwilioMediaStreamsHandler(), "/messages").setAllowedOrigins("*");
}

See the full code on GitHub

Streaming a Phone Call

This is enough to handle WebSocket clients - we now need to configure something to send data to our WebSocket endpoint. Enter Twilio Media Streams.

Twilio 101

You can buy a phone number from Twilio and configure what happens when someone calls it by creating a webhook which responds with a configuration language we like to call TwiML.

The Spring Boot application will serve this TwiML, as well as handling the WebSocket connections. Use the Twilio Java Helper Library by adding the following to the <dependencies> section of pom.xml, next to the spring-boot-starter-websocket dependency:

<dependency>
  <groupId>com.twilio.sdk</groupId>
  <artifactId>twilio</artifactId>
  <version>7.47.2</version>
</dependency>

Next, create a class called TwiMLController in the same package as your others which will serve the TwiML:

// Note: package name will depend on your group and artifact name
// Your IDE should be able to help you here
package lol.gilliard.websocketstranscription;

import com.twilio.twiml.VoiceResponse;
import com.twilio.twiml.voice.Pause;
import com.twilio.twiml.voice.Say;
import com.twilio.twiml.voice.Start;
import com.twilio.twiml.voice.Stream;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.ResponseBody;
import org.springframework.web.util.UriComponentsBuilder;

@Controller
public class TwiMLController {

   @GetMapping(value = "/twiml", produces = "application/xml")
   @ResponseBody
   public String getStreamsTwiml(UriComponentsBuilder uriInfo) {
       String wssUrl = "wss://" + uriInfo.build().getHost() + "/messages";

       return new VoiceResponse.Builder()
           .say(new Say.Builder("Hello! Start talking and the live audio will be streamed to your app").build())
           .start(new Start.Builder().stream(new Stream.Builder().url(wssUrl).build()).build())
           .pause(new Pause.Builder().length(30).build())
           .build().toXml();
   }
}

See the full code on GitHub

The TwiML created here has 3 parts:

  • Say a welcome message
  • Start the Media Stream, using the same hostname as the TwiML request and a path of /messages
  • Pause for 30 seconds, to give the caller time to speak. After 30s the call will be ended, but of course the caller can hang up before that if they want.
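The second step relies on the stream URL being derived from the incoming request. As a rough stand-alone sketch of that idea, the snippet below takes the URL that the TwiML was requested on and builds the matching wss:// URL. The class StreamUrlSketch is hypothetical and uses java.net.URI in place of Spring's UriComponentsBuilder so that it runs without any dependencies.

```java
import java.net.URI;

// Hypothetical stand-alone version of the URL logic in TwiMLController.
final class StreamUrlSketch {

    /** Builds the Media Stream URL from the URL the TwiML request arrived on. */
    static String streamUrlFor(String twimlRequestUrl) {
        // Only the hostname is reused; scheme, port and path are replaced
        String host = URI.create(twimlRequestUrl).getHost();
        return "wss://" + host + "/messages";
    }

    public static void main(String[] args) {
        // prints wss://0dd24d67.ngrok.io/messages
        System.out.println(streamUrlFor("https://0dd24d67.ngrok.io/twiml"));
    }
}
```

Like the controller code, this drops any port number from the request, which is fine when the app sits behind a tunnel or proxy on the default HTTPS port.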

Configuring Twilio to use the Application

In order for Twilio to call your app, it will need to be available on a publicly accessible URL. As it is currently configured, the app will listen only on localhost, which is probably (hopefully!) not accessible from the internet. There are several options for public hosting, such as AWS, DigitalOcean or Azure, but for our purposes it is simpler to use ngrok. ngrok is a free tool that, once installed, can create a temporary tunnel from a public URL to your localhost.

Start your application running by using this command in a terminal, or through your IDE:

./mvnw spring-boot:run

Then start ngrok with

ngrok http 8080

You will see a public URL in the output for ngrok, different from the one below, but similarly composed of random letters and numbers:

Screenshot of ngrok output

You can test it by loading https://<YOUR_NGROK_SUBDOMAIN>.ngrok.io/twiml in your browser, and you will see a response like:

<Response>
  <Say>Hello! Start talking and the live audio will be streamed to your app</Say>
  <Start><Stream url="wss://0dd24d67.ngrok.io/messages" /></Start>
  <Pause length="30" />
</Response>

Setting up a Twilio phone number

Buying and configuring a phone number with Twilio only takes a couple of minutes. If you don’t already have a Twilio account, then a free trial account will work just fine for this app.

Buying a Phone Number

On the phone numbers page in your console you can buy numbers from hundreds of countries:

Screenshot of Twilio console buying a phone number

Choose one which is local to you, making sure that you select Voice capability:

Screenshot of Twilio console purchasing a phone number

After buying the number, you will be looking at the phone number configuration screen. Use the ngrok URL as above (don’t forget the /twiml at the end), and because we used the @GetMapping annotation in code, change the method to HTTP GET:

Screenshot of Twilio console configuring incoming calls

Save this configuration and you are all ready to call the number.

A cat in sunglasses. Caption "I'm ready"

Call your new phone number, and you’ll hear the <Say> message read out by a robot, then the Media Stream will start and the console will show something like this as you talk:

New connection has been established
Message received, length is 57
Message received, length is 338
Message received, length is 374
… Many more lines like this ...
Message received, length is 379
Message received, length is 194
Connection closed

🎉🎉 Congratulations 🎉🎉

You’ve got a WebSocket server up and running with Spring Boot, receiving live audio data from a phone call to your Twilio number.

There are loads of things you could do with the audio stream. The next part of this post will show one example: forwarding the data to Google’s Speech-to-Text service for live transcription.

Streaming Data to Google’s Transcription Service

Google’s Speech-to-Text service can accept streaming data, which makes it a good fit for our project. To use it you will need to set up a project and download your credentials to a file whose location is stored in the GOOGLE_APPLICATION_CREDENTIALS environment variable. It’s free to do but you need a credit card to create the account. You can follow Google’s instructions to do all that, and I’ll have a cup of tea and wait for you to come back.

A steaming hot cup of tea

Now that you’ve set up your Google project, we can continue.

You will need to add a new class to extract the data from Twilio’s WebSocket messages and send it to Google in the right format. Based on Google’s example code, I have created a class which can be copied from the repo on GitHub and used directly. Remember that your package name will probably be different depending on what you chose for Group and Artifact at the beginning. Your IDE should help you out here.

You will need to add a couple more dependencies into your pom.xml (next to where you added the dependency on the Twilio Helper Library):

<dependency>
  <groupId>org.json</groupId>
  <artifactId>json</artifactId>
  <version>20190722</version>
</dependency>

<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-speech</artifactId>
  <version>1.22.1</version>
</dependency>

Now, the last thing to do is change the code in TwilioMediaStreamsHandler to use the GoogleTextToSpeechService:

package lol.gilliard.websocketstranscription;

import org.springframework.web.socket.CloseStatus;
import org.springframework.web.socket.TextMessage;
import org.springframework.web.socket.WebSocketSession;
import org.springframework.web.socket.handler.AbstractWebSocketHandler;

import java.util.HashMap;
import java.util.Map;

public class TwilioMediaStreamsHandler extends AbstractWebSocketHandler {

   private Map<WebSocketSession, GoogleTextToSpeechService> sessions = new HashMap<>();

   @Override
   public void afterConnectionEstablished(WebSocketSession session) throws Exception {
       sessions.put(session, new GoogleTextToSpeechService(
           transcription -> {
               System.out.println("Transcription: " + transcription);
           }
       ));
   }

   @Override
   protected void handleTextMessage(WebSocketSession session, TextMessage textMessage) {
       sessions.get(session).send(textMessage.getPayload());
   }

   @Override
   public void afterConnectionClosed(WebSocketSession session, CloseStatus status) throws Exception {
       sessions.get(session).close();
       sessions.remove(session);
   }
}

On line 13, a Map from WebSocketSession to GoogleTextToSpeechService is created so that the app can support multiple inbound calls simultaneously without confusing Google by mixing up the audio streams. Then on line 26, each incoming message’s payload is sent to the GoogleTextToSpeechService for that session, which is configured to print out the transcription whenever Google sends it back.
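One thing to consider if you expect several calls at once: callbacks for different sessions may arrive on different threads, and a plain HashMap is not safe for concurrent modification. The sketch below shows the same one-resource-per-session idea using a ConcurrentHashMap instead. PerSessionRegistry is a hypothetical class of my own, keyed by a session ID string (Spring's WebSocketSession has a getId() method that could supply it) and using a StringWriter as a stand-in for the per-call transcription service.

```java
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper: one closeable resource per WebSocket session.
final class PerSessionRegistry<R extends AutoCloseable> {

    private final Map<String, R> resources = new ConcurrentHashMap<>();
    private final Supplier<R> factory;

    PerSessionRegistry(Supplier<R> factory) {
        this.factory = factory;
    }

    /** Creates the resource for a session on first use and returns it. */
    R open(String sessionId) {
        return resources.computeIfAbsent(sessionId, id -> factory.get());
    }

    /** Returns the resource for a session, or null if none is open. */
    R get(String sessionId) {
        return resources.get(sessionId);
    }

    /** Removes and closes the resource for a session, if there is one. */
    void close(String sessionId) {
        R resource = resources.remove(sessionId);
        if (resource == null) {
            return;
        }
        try {
            resource.close();
        } catch (Exception e) {
            throw new IllegalStateException("Failed to close resource for " + sessionId, e);
        }
    }

    int size() {
        return resources.size();
    }

    public static void main(String[] args) {
        PerSessionRegistry<StringWriter> registry = new PerSessionRegistry<>(StringWriter::new);
        registry.open("call-1").write("audio for call 1");
        registry.open("call-2").write("audio for call 2");
        System.out.println(registry.get("call-1")); // each call keeps its own buffer
        registry.close("call-1");
        System.out.println(registry.size() + " call still open");
    }
}
```

Removing the entry before closing it means a late message for that session finds nothing in the map instead of a half-closed resource.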

You should still have ngrok running - if not, restart it with ngrok http 8080. Restart the server with ./mvnw spring-boot:run and call your number again.

After the spoken message you can talk and you will see something like this in your console:

Transcription:  Google
Transcription:  Google is
Transcription:  Google is
Transcription:  Google is transcribing
Transcription:  Google is transcribing
Transcription:  Google is transcribing
Transcription:  Google is transcribing the
Transcription:  Google is transcribing the live audio
Transcription:  Google is transcribing the live audio
Transcription:  Google is transcribing the live audio
Transcription:  Google is transcribing the live audio from
Transcription:  Google is transcribing the live audio from
Transcription:  Google is transcribing the live audio from
Transcription:  Google is transcribing the live audio from this
Transcription:  Google is transcribing the live audio from this phone call.
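You will notice that the same text is often printed several times in a row as Google refines its interim results. If that is too noisy, one option is to wrap the callback so it only fires when the text changes. This is a small sketch of that idea, assuming the callback passed to the transcription service behaves like a Consumer<String>. The ChangedOnly class is hypothetical and not part of the finished project.

```java
import java.util.Objects;
import java.util.function.Consumer;

// Hypothetical wrapper that suppresses consecutive duplicate transcriptions.
final class ChangedOnly implements Consumer<String> {

    private final Consumer<String> downstream;
    private String last; // most recent text passed downstream, null before the first call

    ChangedOnly(Consumer<String> downstream) {
        this.downstream = downstream;
    }

    @Override
    public void accept(String transcription) {
        // Forward only when the text differs from the previous one
        if (!Objects.equals(transcription, last)) {
            last = transcription;
            downstream.accept(transcription);
        }
    }

    public static void main(String[] args) {
        Consumer<String> printer = new ChangedOnly(
            text -> System.out.println("Transcription: " + text));
        printer.accept("Google");
        printer.accept("Google is");
        printer.accept("Google is"); // duplicate, not printed
        printer.accept("Google is transcribing");
    }
}
```

Each wrapper keeps its own state, so it would need to be created once per call rather than shared between sessions.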

Isn’t it wonderful what you can achieve with a few classes and some powerful cloud services?

What next?

There are tons of possibilities now: You could stream the text to a translation service, record it into a file, try your hand at some sentiment analysis or pick out keywords that can trigger a follow-up text message after the call. I’m excited to hear what you do next with Java, WebSockets and Twilio’s Media Streams. Let me know in the comments below, or find me online:

Twitter: @MaximumGilliard

Email: mgilliard@twilio.com