Live Transcribing Simultaneous Phone Calls

April 22, 2020


With Twilio Media Streams, you can extend the capabilities of your Twilio-powered voice application with real-time access to the raw audio stream of phone calls.

This blog post follows on from my previous post that shows you how to get started with Twilio Media Streams and live transcription. If you haven’t set up a live call transcription before, I recommend working through that tutorial before moving on to this one. In this post we will scale our application to be able to handle multiple phone calls at the same time. We will be able to monitor the transcribed speech from multiple phone calls, live, in the browser, using Twilio and Google Speech-to-Text with Node.js.

You can quickly spin up working code by cloning my GitHub Repository and following the README to get set up. If you’d like to see how to refactor your code to accommodate simultaneous calls, follow these steps.

Requirements

Before we can get started, you’ll need to make sure you have:

Recap

Let’s recap how our basic call transcription application works. This picks up from a previous post. You can find that working code here: Basic Transcription Application. Follow the README to get it working.

  1. Our Twilio number receives a call and Twilio makes a POST request to our web server
  2. Our Express application responds with TwiML, instructing Twilio to stream the audio from the call to our websocket server
  3. Our websocket server uses the Google Speech-to-Text API to transcribe the audio into text
  4. Finally, our websocket server broadcasts the text to any browser clients that are connected and, like magic, words appear in our web browser
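Steps 1 and 2 can be sketched as a small helper that builds the TwiML response our webhook returns. This is a sketch under assumptions: the helper name `twimlFor` is illustrative, not from the repository, and the real route lives in the Express app from the previous post.

```javascript
// Sketch of the TwiML our voice webhook returns (steps 1–2 above).
// The helper name twimlFor is an illustrative assumption.
function twimlFor(host) {
  // `host` is the public hostname Twilio should stream audio back to
  // (req.headers.host inside the Express handler).
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Stream url="wss://${host}/" />
  </Start>
  <Say>I will stream the next 60 seconds of audio through your websocket</Say>
  <Pause length="60" />
</Response>`;
}

// Inside the Express app, the voice webhook would send this back, e.g.:
//   app.post("/", (req, res) => {
//     res.set("Content-Type", "text/xml");
//     res.send(twimlFor(req.headers.host));
//   });
```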

When we call our Twilio number from more than one phone simultaneously, you may notice that the transcription text gets a bit confused. Let’s fix this.

Differentiating incoming calls

First we need to differentiate incoming calls to our server. Thankfully, with Twilio we can add custom parameters to the Twilio Media Stream. Head over to the TwiML that your application returns and let’s add a custom parameter to hold the caller’s phone number. You could also add parameters for the caller’s name or other information that you may have collected.

index.js


    <Response>
      <Start>
        <Stream url="wss://${req.headers.host}/">
          <Parameter name="number" value="${req.body.From}"/>
        </Stream>
      </Start>
      <Say>I will stream the next 60 seconds of audio through your websocket</Say>
      <Pause length="60" />
    </Response>
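When the stream begins, our `<Parameter>` element arrives inside the `start` message that Twilio sends over the websocket. A sketch of the shape our server relies on (the SID and number values here are illustrative placeholders, and the real message carries additional fields):

```javascript
// Illustrative shape of the "start" message Twilio sends over the
// websocket. SID and phone number values are placeholders, and the
// real message includes other fields (callSid, mediaFormat, etc.).
const startMessage = {
  event: "start",
  streamSid: "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  start: {
    streamSid: "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    // Our <Parameter> elements arrive here as key/value pairs:
    customParameters: {
      number: "+15550100123",
    },
  },
};

// This is why the server can read the caller's number like this:
const callerNumber = startMessage.start.customParameters.number;
```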

Tracking ongoing calls

Every time we receive a new phone call to our number, Twilio will establish a new websocket connection. We need to keep tabs on all of these calls and their respective transcription responses. Let’s create a global variable to hold our active calls. I have placed mine just before the websocket on connection event listener.

index.js



let activeCalls = [];

wss.on("connection", function connection(ws) {

Whenever a new audio stream from Twilio starts, we want to add the new call’s details to this array. Let’s head over to the start case in our switch statement. We’ll make a few changes.

First, we will attach the streamSid to this websocket client as a property; this will be important when the call ends. Next, we’ll add the streamSid to the information we send out to browser clients, and we’ll also push the new call’s details to our active calls array. Finally, just to keep track, we’ll log how many active calls we currently have to the console.

index.js


      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        ws.streamSid = msg.streamSid;
        // Create Stream to the Google Speech to Text API
        recognizeStream = speechClient
          .streamingRecognize(transcriptionConfig)
          .on("error", console.error)
          .on("data", (data) => {
            wss.clients.forEach((client) => {
              if (
                client.readyState === WebSocket.OPEN
              ) {
                client.send(
                  JSON.stringify({
                    stream: msg.streamSid,
                    event: "interim-transcription",
                    text: data.results[0].alternatives[0].transcript,
                  })
                );
              }
            });
          });
        activeCalls.push({
          twilioStreamSid: msg.streamSid,
          fromNumber: msg.start.customParameters.number,
        });
        console.log(`There are ${activeCalls.length} active calls`);
        break;

We still have a problem: we are sending the transcripts from every call to every client connected to our websocket server. To fix that, we’ll only send transcription data to the clients that have subscribed to this particular media stream. We’ll handle subscriptions in the next step.

index.js


        recognizeStream = speechClient
          .streamingRecognize(transcriptionConfig)
          .on("error", console.error)
          .on("data", (data) => {
            wss.clients.forEach((client) => {
              if (
                client.readyState === WebSocket.OPEN &&
                client.subscribedStream === msg.streamSid
              ) {
                client.send(
                  JSON.stringify({
                    stream: msg.streamSid,
                    event: "interim-transcription",
                    text: data.results[0].alternatives[0].transcript,
                  })
                );
              }
            });
          });

Subscribing from the browser

Now we need to add functionality to our web page to allow browser clients to see all the active calls, subscribe to a call and then display the transcript from that call.

First, let’s restructure our index.html file. We’ll modify the contents to include a div holding a list of all the active calls, alongside the transcription text we had before.

index.html


<!DOCTYPE html>
<html>
  <head>
    <title>Live Transcription with Twilio Media Streams</title>
    <link rel="stylesheet" type="text/css" href="style.css" />
  </head>
  <body>
    <h1>Live Transcription with Twilio Media Streams</h1>
    <h3>
      Call your Twilio Number, start talking and watch your words magically
      appear.
    </h3>
    <div class="wrapper">
      <div id="calls">
        <h3>Active Calls:</h3>
        <div id="call-list"></div>
      </div>
      <div id="transcription-container">
        <h4>Transcription Text:</h4>
        <p id="transcription-text"></p>
      </div>
    </div>
    <script>
      document.addEventListener("DOMContentLoaded", (event) => {
        webSocket = new WebSocket("ws://localhost:8080");

        webSocket.onmessage = function (msg) {
          const data = JSON.parse(msg.data);
          if (data.event === "interim-transcription") {
            document.getElementById("transcription-text").innerHTML = data.text;
          }
        };
      });
    </script>
  </body>
</html>

We need to populate this list with active calls. First, we need to have our server broadcast the list of active calls out to all the connected browser clients. Let’s head back over to our index.js file and add the following lines of code.

index.js


      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        ws.streamSid = msg.streamSid;
        // Create Stream to the Google Speech to Text API
        recognizeStream = speechClient
          .streamingRecognize(transcriptionConfig)
          .on("error", console.error)
          .on("data", (data) => {
            wss.clients.forEach((client) => {
              if (
                client.readyState === WebSocket.OPEN &&
                client.subscribedStream === msg.streamSid
              ) {
                client.send(
                  JSON.stringify({
                    stream: msg.streamSid,
                    event: "interim-transcription",
                    text: data.results[0].alternatives[0].transcript,
                  })
                );
              }
            });
          });
        activeCalls.push({
          twilioStreamSid: msg.streamSid,
          fromNumber: msg.start.customParameters.number,
        });
        wss.clients.forEach((client) => {
          client.send(
            JSON.stringify({
              event: "updateCalls",
              activeCalls,
            })
          );
        });
        console.log(`There are ${activeCalls.length} active calls`);
        break;

Let’s update the script in our HTML to populate our ‘active calls’ list whenever the updateCalls event is emitted from our websocket server.

index.html


<!DOCTYPE html>
<html>
  <head>
    <title>Live Transcription with Twilio Media Streams</title>
    <link rel="stylesheet" type="text/css" href="style.css" />
  </head>
  <body>
    <h1>Live Transcription with Twilio Media Streams</h1>
    <h3>
      Call your Twilio Number, start talking and watch your words magically
      appear.
    </h3>
    <div class="wrapper">
      <div id="calls">
        <h3>Active Calls:</h3>
        <div id="call-list"></div>
      </div>
      <div id="transcription-container">
        <h4>Transcription Text:</h4>
        <p id="transcription-text"></p>
      </div>
    </div>
    <script>
      document.addEventListener("DOMContentLoaded", (event) => {
        webSocket = new WebSocket("ws://localhost:8080");
        const callList = document.getElementById("call-list");

        webSocket.onmessage = function (msg) {
          const data = JSON.parse(msg.data);
          if (data.event === "interim-transcription") {
            document.getElementById("transcription-text").innerHTML = data.text;
          } else if (data.event === "updateCalls") {
            console.log(data.activeCalls);
            callList.innerHTML = "";
            data.activeCalls.forEach((call) => {
              const button = document.createElement("BUTTON");
              button.className = "open-call";
              button.innerHTML = call.fromNumber;
              callList.appendChild(button);
            });
          }
        };
      });
    </script>
  </body>
</html>

Let’s pause for a moment and run a test. Save all the files, restart your web server and navigate your browser to ‘http://localhost:8080’. Give your Twilio number a call. You should see a new button appear with your phone number as its label.

If you try clicking on the button, nothing happens. Let’s fix that. Next we’ll add some code that sends a ‘subscribe’ message to our server whenever a call button is clicked. Add the following lines of code to your HTML file.

index.html


<!DOCTYPE html>
<html>
  <head>
    <title>Live Transcription with Twilio Media Streams</title>
    <link rel="stylesheet" type="text/css" href="style.css" />
  </head>
  <body>
    <h1>Live Transcription with Twilio Media Streams</h1>
    <h3>
      Call your Twilio Number, start talking and watch your words magically
      appear.
    </h3>
    <div class="wrapper">
      <div id="calls">
        <h3>Active Calls:</h3>
        <div id="call-list"></div>
      </div>
      <div id="transcription-container">
        <h4>Transcription Text:</h4>
        <p id="transcription-text"></p>
      </div>
    </div>
    <script>
      document.addEventListener("DOMContentLoaded", (event) => {
        webSocket = new WebSocket("ws://localhost:8080");
        const callList = document.getElementById("call-list");

        webSocket.onmessage = function (msg) {
          const data = JSON.parse(msg.data);
          if (data.event === "interim-transcription") {
            document.getElementById("transcription-text").innerHTML = data.text;
          } else if (data.event === "updateCalls") {
            console.log(data.activeCalls);
            callList.innerHTML = "";
            data.activeCalls.forEach((call) => {
              const button = document.createElement("BUTTON");
              button.className = "open-call";
              button.innerHTML = call.fromNumber;
              button.addEventListener("click", () => {
                webSocket.send(
                  JSON.stringify({
                    event: "subscribe",
                    streamSid: call.twilioStreamSid,
                  })
                );
              });
              callList.appendChild(button);
            });
          }
        };
      });
    </script>
  </body>
</html>

Let’s go back to our server and add code to handle these subscriptions. Back in our switch statement, we’ll add a new case for incoming websocket messages with the ‘subscribe’ event. We will set the client’s subscribedStream property to the streamSid received in the message. From then on, that client will only receive transcript data from the stream it is subscribed to.

index.js


    switch (msg.event) {
      case "connected":
        console.log(`A new call has connected.`);
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        ws.streamSid = msg.streamSid;
        // Create Stream to the Google Speech to Text API
        recognizeStream = speechClient
          .streamingRecognize(transcriptionConfig)
          .on("error", console.error)
          .on("data", (data) => {
            wss.clients.forEach((client) => {
              if (
                client.readyState === WebSocket.OPEN &&
                client.subscribedStream === msg.streamSid
              ) {
                client.send(
                  JSON.stringify({
                    stream: msg.streamSid,
                    event: "interim-transcription",
                    text: data.results[0].alternatives[0].transcript,
                  })
                );
              }
            });
          });
        activeCalls.push({
          twilioStreamSid: msg.streamSid,
          fromNumber: msg.start.customParameters.number,
        });
        wss.clients.forEach((client) => {
          client.send(
            JSON.stringify({
              event: "updateCalls",
              activeCalls,
            })
          );
        });
        console.log(`There are ${activeCalls.length} active calls`);
        break;
      case "media":
        // Write Media Packets to the recognize stream
        recognizeStream.write(msg.media.payload);
        break;
      case "stop":
        console.log(`Call Has Ended`);
        recognizeStream.destroy();
        break;
      case "subscribe":
        console.log("Client Subscribed");
        ws.subscribedStream = msg.streamSid;
        break;
      default:
        break;
    }

Let’s test it out! Restart your server, head back to the browser and refresh the page. Now give your Twilio number a ring and click on the button for your call. Start talking! You should see the words start to appear.

Incoming Call Being Transcribed

Ending Calls

One more loose end we need to tie up is removing calls from the active calls list when they have ended. Let’s go back to our switch statement and edit the stop case. We’ll search through the activeCalls array for the index whose twilioStreamSid matches the streamSid of the call that has just ended. Once we find it, we’ll splice it out of the array and then send an updated active calls list to all the connected clients.

index.js


wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");
  let recognizeStream;
  ws.on("message", function incoming(message) {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case "connected":
        console.log(`A new call has connected.`);
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        ws.streamSid = msg.streamSid;
        // Create Stream to the Google Speech to Text API
        recognizeStream = speechClient
          .streamingRecognize(transcriptionConfig)
          .on("error", console.error)
          .on("data", (data) => {
            wss.clients.forEach((client) => {
              if (
                client.readyState === WebSocket.OPEN &&
                client.subscribedStream === msg.streamSid
              ) {
                client.send(
                  JSON.stringify({
                    stream: msg.streamSid,
                    event: "interim-transcription",
                    text: data.results[0].alternatives[0].transcript,
                  })
                );
              }
            });
          });
        activeCalls.push({
          twilioStreamSid: msg.streamSid,
          fromNumber: msg.start.customParameters.number,
        });
        wss.clients.forEach((client) => {
          client.send(
            JSON.stringify({
              event: "updateCalls",
              activeCalls,
            })
          );
        });
        console.log(`There are ${activeCalls.length} active calls`);
        break;
      case "media":
        // Write Media Packets to the recognize stream
        recognizeStream.write(msg.media.payload);
        break;
      case "stop":
        console.log(`Call Has Ended`);
        const i = activeCalls.findIndex(
          (call) => call.twilioStreamSid === ws.streamSid
        );
        activeCalls.splice(i, 1);
        wss.clients.forEach((client) => {
          client.send(
            JSON.stringify({
              event: "updateCalls",
              activeCalls,
            })
          );
        });
        recognizeStream.destroy();
        break;
      case "subscribe":
        console.log("Client Subscribed");
        ws.subscribedStream = msg.streamSid;
        break;
      default:
        break;
    }
  });
});
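The lookup-and-splice in the stop case can be distilled into a small pure helper. This is a sketch for illustration only (the name `removeCallBySid` is hypothetical; the server inlines this logic); the key point is that the match must be on twilioStreamSid, the key we used when pushing to activeCalls.

```javascript
// Hypothetical helper illustrating the stop-case logic: find the ended
// call by its stream SID and splice it out of the active-calls array.
function removeCallBySid(calls, streamSid) {
  // Match on twilioStreamSid, the key used when the call was pushed.
  const i = calls.findIndex((call) => call.twilioStreamSid === streamSid);
  if (i !== -1) {
    calls.splice(i, 1); // mutate in place, as the server does
  }
  return calls;
}
```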

One last test, bring a friend

Now for the last test, we’ll call our Twilio number from multiple phones simultaneously. If you have any willing friends or colleagues, ask them to call your Twilio number and speak all at the same time. You should see multiple phone numbers appear, and you can switch between their transcriptions by clicking on their phone numbers. As they hang up, you should see the active calls disappear.

Transcribe multiple incoming calls


Wrapping up

Congratulations! You can now harness the power of Twilio Media Streams to extend your voice applications. This could be useful in a call center where multiple agents can see the transcribed text from the calls they are on. Now that you have live transcription, try translating the text with Google’s Translate API to create live speech translation, or run sentiment analysis on the transcribed text to work out the emotions behind the speech.

If you have any questions, feedback or just want to show me what you build, feel free to reach out to me: