Transcribe Phone Calls in Real-Time Using Node.js, AssemblyAI, and Twilio

February 07, 2024
Written by Twilion
Reviewed by Twilion

In this tutorial, you will build an application that transcribes a phone call to text in real-time. When someone calls your Twilio phone number, you will use the Media Streams API to stream the voice call audio to your WebSocket server. Your server will pass the voice audio to AssemblyAI's real-time transcription service to get the text back live.

Prerequisites

You'll need these things to follow along:

  • Node.js installed on your machine
  • A Twilio account with a voice-enabled phone number
  • An AssemblyAI account and its API key
  • ngrok to make your locally running server publicly accessible

You can experiment with AssemblyAI's APIs on a free tier, but the real-time transcription feature requires an upgraded account, so make sure to upgrade your account before continuing.

Create a WebSocket server for Twilio media streams

You'll need to create a Node.js project and add some modules to build your application.

First, open up your terminal and run the following commands to create a Node.js project:

mkdir transcriber-media-streams
cd transcriber-media-streams
npm init -y

Then run the following command to add the necessary NPM dependencies:

npm install --save express ws
  • express: This is a web framework for Node.js. Express makes it easier to route incoming HTTP requests and send back HTTP responses.
  • ws: This is a WebSocket client and server library for Node.js. You'll use ws to create a WebSocket server to which Twilio media streams will connect.

Open the package.json file on your preferred IDE and add the following property:

  "type": "module",

This tells Node.js that you'll be using the ES module syntax for importing and exporting modules and not the CommonJS syntax.
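As a quick illustration (not part of the tutorial's code), here's the difference the setting makes in any .js file in the project:

```javascript
// With "type": "module" in package.json, Node.js treats .js files as ES
// modules, so you can use import/export syntax:
import { createServer } from 'http';

// The CommonJS equivalent would have been:
//   const { createServer } = require('http');

console.log(typeof createServer); // → 'function'
```

Without the setting, the import statement above would throw a SyntaxError when run with node.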

Next, create a file named server.js with the following code:

import { createServer } from 'http';
import express from 'express';
import { WebSocketServer } from 'ws';
const app = express();
const server = createServer(app);
app.get('/', (_, res) => res.type('text').send('Twilio media stream transcriber'));
// Tell Twilio to say something and then establish a media stream with the WebSocket server
app.post('/', async (req, res) => {
  res.type('xml')
    .send(
      `<Response>
        <Say>
          Speak to see your audio transcribed in the console.
        </Say>
        <Connect>
          <Stream url='wss://${req.headers.host}' />
        </Connect>
      </Response>`
    );
});
console.log('Listening on port 3000');
server.listen(3000);

The code above responds to HTTP GET requests with "Twilio media stream transcriber", and to HTTP POST requests with the following TwiML:

<Response>
  <Say>
    Speak to see your audio transcribed in the console.
  </Say>
  <Connect>
    <Stream url='wss://<your-host>' />
  </Connect>
</Response>

This TwiML tells Twilio to speak a message to the caller using the <Say> verb, and then open a media stream that connects to your WebSocket server using the <Connect> verb.
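As a sketch, the same TwiML string could be produced by a small helper function (hypothetical, not part of the tutorial's code) that interpolates the WebSocket host:

```javascript
// Hypothetical helper that builds the tutorial's TwiML for a given host.
function buildTwiml(host) {
  return [
    '<Response>',
    '  <Say>Speak to see your audio transcribed in the console.</Say>',
    '  <Connect>',
    `    <Stream url='wss://${host}' />`,
    '  </Connect>',
    '</Response>',
  ].join('\n');
}

console.log(buildTwiml('your-subdomain.ngrok-free.app'));
```

In a larger application you may prefer the TwiML builder that ships with the official twilio npm package instead of string interpolation, but for this tutorial the inline template string is enough.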

Next, add the following WebSocket server code before console.log('Listening on port 3000');:

// the WebSocket server for the Twilio media stream to connect to.
const wss = new WebSocketServer({ server });
wss.on('connection', async (ws) => {
  console.log('Twilio media stream WebSocket connected')
  ws.on('message', async (message) => {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case 'connected':
        console.info('Twilio media stream connected');
        break;
      case 'start':
        console.info('Twilio media stream started');
        break;
      case 'media':
        console.log(msg);
        break;
      case 'stop':
        console.info('Twilio media stream stopped');
        break;
    }
  });
  ws.on('close', async () => {
    console.log('Twilio media stream WebSocket disconnected');
  })
});

The code above starts a WebSocket server and handles the different media stream messages that Twilio will send.
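Each message arrives as a JSON string, and media messages carry the raw audio as a base64-encoded payload. As a minimal sketch, using a shortened, made-up payload:

```javascript
// A 'media' message as Twilio sends it over the WebSocket (payload shortened
// for illustration; real payloads are much longer).
const message = '{"event":"media","media":{"payload":"f39/fw=="}}';

const msg = JSON.parse(message);
const audio = Buffer.from(msg.media.payload, 'base64'); // raw mu-law bytes

console.log(msg.event);    // → 'media'
console.log(audio.length); // → 4 (four 8-bit mu-law samples)
```

This decode step is exactly what you'll do later when forwarding the audio to AssemblyAI.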

This is all the code you’ll need to implement the Twilio part of this application. Let's try it out.

Run the application by running the following command on your terminal:

node server.js

For Twilio to be able to reach your server, you need to make your application publicly accessible. Open a different shell and run the following command to tunnel your locally running server to the internet using ngrok:

ngrok http 3000

Now copy the Forwarding URL that the ngrok command outputs. It should look something like this: https://d226-71-163-163-158.ngrok-free.app.

Go to the Twilio Console, navigate to your active phone numbers, and click on your Twilio phone number.

Update the Voice Configuration so that Twilio sends a Webhook when a call comes in, to your ngrok forwarding URL, using HTTP POST.

Scroll to the bottom of the page and click Save configuration.

As a result of this configuration, when your Twilio number is called, Twilio will send a webhook to your ngrok URL which will pass the HTTP request to your server.js application. Your application will respond with the TwiML instructions you wrote earlier.

Call your Twilio phone number, say a couple of words, and hang up.

Then, observe the output shown on your terminal where you ran the application.

Listening on port 3000
Twilio media stream WebSocket connected
Twilio media stream connected
Twilio media stream started
{
  event: 'media',
  sequenceNumber: '311',
  media: {
    track: 'inbound',
    chunk: '310',
    timestamp: '6200',
    payload: 'f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/fw=='
  },
  streamSid: 'MZcf793bae35470f8cb2110a37bd9ce82b'
}
{
  event: 'media',
  sequenceNumber: '312',
  media: {
    track: 'inbound',
    chunk: '311',
    timestamp: '6220',
    payload: 'f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/fw=='
  },
  streamSid: 'MZcf793bae35470f8cb2110a37bd9ce82b'
}
{
  event: 'media',
  sequenceNumber: '313',
  media: {
    track: 'inbound',
    chunk: '312',
    timestamp: '6240',
    payload: 'f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/f39/fw=='
  },
  streamSid: 'MZcf793bae35470f8cb2110a37bd9ce82b'
}
Twilio media stream stopped
Twilio media stream WebSocket disconnected

You'll see logs for the different media stream events, and in particular you'll be flooded with media messages.
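Incidentally, the repetitive payloads above aren't garbage: Twilio media streams carry 8 kHz G.711 mu-law audio, and the repeated f39/ pattern decodes to 0x7f bytes, which represent silence. A standard G.711 mu-law decoder (not needed for this tutorial, shown only for illustration) confirms this:

```javascript
// Standard G.711 mu-law to 16-bit linear PCM decoding.
function mulawToPcm(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const t = ((mantissa << 3) + 0x84) << exponent;
  return (u & 0x80) ? 0x84 - t : t - 0x84;
}

const bytes = Buffer.from('f39/', 'base64'); // → <Buffer 7f 7f 7f>
console.log([...bytes].map(mulawToPcm));     // → [ 0, 0, 0 ]: silence
```

So a quiet phone line produces a steady stream of 0x7f samples, which is why the payload strings look so uniform.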

Great job! You've finished one half of the puzzle; now let's solve the other half.

Transcribe media stream using AssemblyAI real-time transcription

You're already receiving the audio from the Twilio voice call. Now, you have to forward the audio to AssemblyAI's real-time transcription service to turn the audio into text.

You'll need a couple more NPM packages. Stop the running application on your terminal and add the packages using the following command:

npm install --save assemblyai dotenv
  • The assemblyai module is the JavaScript SDK for AssemblyAI. The SDK makes it easier to interact with AssemblyAI's APIs.
  • dotenv loads secrets from the .env file into the process's environment variables.

Open the server.js file and update the imports at the top with the following highlighted lines:

import { createServer } from 'http';
import express from 'express';
import { WebSocketServer } from 'ws';
import 'dotenv/config';
import { RealtimeService } from 'assemblyai';

When you import dotenv/config, the dotenv module loads secrets from the .env file and adds them to the process's environment variables. Create that .env file in the root of your project with the following contents, and replace <ASSEMBLYAI_API_KEY> with your AssemblyAI API key, which you can find in your AssemblyAI dashboard.

ASSEMBLYAI_API_KEY=<ASSEMBLYAI_API_KEY>

In the incoming connection handler for the WebSocket server, update the code to pass the audio to the RealtimeService and print the transcripts to the console.

wss.on('connection', async (ws) => {
  console.log('Twilio media stream WebSocket connected')
  const transcriber = new RealtimeService({
    apiKey: process.env.ASSEMBLYAI_API_KEY,
    // Twilio media stream sends audio in mulaw format
    encoding: 'pcm_mulaw',
    // Twilio media stream sends audio at 8000 sample rate
    sampleRate: 8000
  })
  const transcriberConnectionPromise = transcriber.connect();
  transcriber.on('transcript.partial', (partialTranscript) => {
    // Don't print anything when there's silence
    if (!partialTranscript.text) return;
    console.clear();
    console.log(partialTranscript.text);
  });
  transcriber.on('transcript.final', (finalTranscript) => {
    console.clear();
    console.log(finalTranscript.text);
  });
  transcriber.on('open', () => console.log('Connected to real-time service'));
  transcriber.on('error', console.error);
  transcriber.on('close', () => console.log('Disconnected from real-time service'));
  // Message from Twilio media stream
  ws.on('message', async (message) => {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case 'connected':
        console.info('Twilio media stream connected');
        break;
      case 'start':
        console.info('Twilio media stream started');
        break;
      case 'media':
        // Make sure the transcriber is connected before sending audio
        await transcriberConnectionPromise;
        transcriber.sendAudio(Buffer.from(msg.media.payload, 'base64'));
        break;
      case 'stop':
        console.info('Twilio media stream stopped');
        break;
    }
  });
  ws.on('close', async () => {
    console.log('Twilio media stream WebSocket disconnected');
    await transcriber.close();
  })
  await transcriberConnectionPromise;
});

Let's take a deeper look at the significant parts of the code.

  const transcriber = new RealtimeService({
    apiKey: process.env.ASSEMBLYAI_API_KEY,
    // Twilio media stream sends audio in mulaw format
    encoding: 'pcm_mulaw',
    // Twilio media stream sends audio at 8000 sample rate
    sampleRate: 8000
  })
  const transcriberConnectionPromise = transcriber.connect();

The code above creates a new RealtimeService, passing in the API key you configured in .env , and configuring the encoding and sampleRate to match that of Twilio media streams. Finally, you connect to the real-time transcription service using .connect() which returns a promise. The promise resolves when the service is ready and the transcription session has begun.

      case 'media':
        // Make sure the transcriber is connected before sending audio
        await transcriberConnectionPromise;
        transcriber.sendAudio(Buffer.from(msg.media.payload, 'base64'));
        break;

When Twilio sends a media message, the server first waits for the connection to the real-time service to be established, then turns the base64-encoded audio data into a buffer, and finally sends it to the real-time service.

  transcriber.on('transcript.partial', (partialTranscript) => {
    // Don't print anything when there's silence
    if (!partialTranscript.text) return;
    console.clear();
    console.log(partialTranscript.text);
  });
  transcriber.on('transcript.final', (finalTranscript) => {
    console.clear();
    console.log(finalTranscript.text);
  });

The real-time transcription service uses a two-phase transcription strategy, broken into partial and final transcripts. Partial transcripts are returned immediately as you send audio. Final transcripts are returned at the end of an "utterance" (usually a pause in speech): the service finalizes the results and returns a higher-accuracy transcript, with punctuation and casing added to the text. To learn more about the transcripts and the data sent with them, check out the AssemblyAI documentation on real-time transcription.

The code above prints both partial and final transcripts to the console, but clears the console first so that newly spoken words appear appended to the end of the current line until an utterance is finished. Partial transcripts can have empty text when the audio contains silence. If the partial transcript text is empty, nothing is printed and the console isn't cleared.
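As a sketch of this behavior, here is the same handler logic driven by a mock transcriber (a plain EventEmitter standing in for the real RealtimeService), fed with made-up transcript events:

```javascript
import { EventEmitter } from 'events';

// Mock transcriber emitting the same event names the tutorial handles.
const transcriber = new EventEmitter();
const printed = [];

transcriber.on('transcript.partial', (t) => {
  if (!t.text) return;                // silence: skip empty partials
  printed.push(`partial: ${t.text}`); // the real code clears the console here
});
transcriber.on('transcript.final', (t) => {
  printed.push(`final: ${t.text}`);
});

transcriber.emit('transcript.partial', { text: '' });        // ignored
transcriber.emit('transcript.partial', { text: 'hello' });
transcriber.emit('transcript.partial', { text: 'hello world' });
transcriber.emit('transcript.final', { text: 'Hello world.' });

console.log(printed);
// → [ 'partial: hello', 'partial: hello world', 'final: Hello world.' ]
```

Notice how the partials grow word by word and the final arrives once, punctuated and cased, at the end of the utterance.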

Test the application

That's all the code you need to write. Let's test it out. Restart the Node.js application (leave ngrok running), and give your Twilio phone number a call. As you talk in the call, you'll see the words you're saying printed on the console.

Console logs showing real-time transcription.
If the real-time transcription service returns an error, make sure you have an upgraded account with sufficient funds. You can find a list of error codes and their meanings in the AssemblyAI docs.

Conclusion

You learned how to create a WebSocket server that handles Twilio media streams so you can receive the audio of a Twilio voice call. You then passed the audio of the media stream to AssemblyAI's real-time transcription service to turn speech into text and print the text to the console.

You can build on top of this to create many types of voice applications. For example, you could pass the final transcript to a Large Language Model (LLM) to generate a response, then use a text-to-speech service to turn the response text into audio.

Or you could tell an LLM that there are certain actions that the caller can take, and ask the LLM which action the caller wants based on their final transcript, then execute that action.

We can't wait to see what you build! Let us know!

Niels Swimberghe is a Belgian-American software engineer, a developer educator at AssemblyAI, and a Microsoft MVP. Contact Niels on Twitter @RealSwimburger and follow Niels' blog on .NET, Azure, and web development at swimburger.net.