How to Add Live Transcriptions to Video Calls with OpenAI’s Whisper

September 25, 2023
Written by
Carlos Mucuho

This application is built with the Twilio Programmable Video JavaScript SDK, which is used to create a video room where participants can join, communicate, and share transcripts.

To add live transcriptions to this web application, you will use an enhanced version of the MediaRecorder API to record the room participants' microphones in 6-second intervals. You will use the Transformers.js package alongside a Whisper model to transcribe the audio sent by the room participants. You will also use the DataTrack API to share the transcripts among the room participants.

The Twilio Programmable Video JavaScript SDK is a set of utilities that enables you to seamlessly integrate real-time video and voice functionalities into your web or mobile applications.

The MediaRecorder API is a JavaScript interface that provides a standardized way to record media streams in web applications. It allows you to capture audio and video from various sources, such as microphones and webcams, and save the recorded data in a specified format.
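
As a point of reference, here is a minimal sketch of the standard MediaRecorder API in browser JavaScript, separate from the tutorial code, that records the microphone for a few seconds and logs the captured data:

async function recordForThreeSeconds() {
  // Ask the browser for microphone access and create a recorder for the stream
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);

  // Each chunk of recorded data arrives as a Blob
  recorder.ondataavailable = (event) => {
    console.log('recorded chunk:', event.data);
  };

  recorder.start();
  setTimeout(() => recorder.stop(), 3000); // stop after 3 seconds
}

The extendable-media-recorder package used later in this tutorial keeps this same interface while letting you register additional encoders, such as the WAV encoder.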

Transformers.js is a package intended to be functionally equal to Hugging Face's Python package, meaning that the same pre-trained models may be run using a very similar API.

Whisper is a speech recognition model that can be used for a variety of purposes. It is a multitasking model that can do multilingual voice recognition, speech translation, and language identification.

By the end of this tutorial you will have an application that looks similar to the following:

Complete live transcriptions web app demo

Tutorial Requirements

To follow this tutorial you will need the following:

  • A free Twilio account.
  • A basic understanding of how to use Twilio and JavaScript to build a video web app.
  • Node.js v12+, npm, and Git installed.
  • A graphics card (optional; Whisper will still run on just a CPU, but it will be slower).

Getting the boilerplate code

In this section, you will clone a repository containing the boilerplate code needed to build the video call web application.

This boilerplate code is a modified version of the code obtained from this tutorial.

Open a terminal window and navigate to a suitable location for your project. Run the following commands to clone the repository containing the boilerplate code and navigate to the boilerplate directory:

git clone https://github.com/CSFM93/twilio-video-live-transcription.git
cd twilio-video-live-transcription

This boilerplate code includes an Express.js project that serves the client application and generates the necessary access tokens for utilizing the Twilio Video API within the client application.

This Node.js application comes with the following packages:

  • dotenv: is a package that allows you to load environment variables from a .env file without additional dependencies.
  • express: is a lightweight Node.js web application framework that offers a wide range of features for creating web and mobile apps. You will use the package to create the application server.
  • multer: is a Node.js middleware designed to handle multipart/form-data, primarily used for file uploads. You will use this module to receive and store the audio recordings sent by each room participant.
  • node-dev: is a Node.js module used in development to automatically restart Node.js applications when code changes are detected, streamlining the development process.
  • transformers.js: is a JavaScript library designed to closely mirror the functionality of Hugging Face's transformers Python library, enabling you to utilize the same pre-trained models through a highly similar API. You will use this library to run the Whisper Tiny model on the server to transcribe the audio recordings.
  • twilio: is a package that allows you to interact with the Twilio API.
  • uuid: is a package used to generate universally unique identifiers. It will be used to create a unique identity when creating a Twilio access token.
  • wavefile: is a package that allows you to create, read, write, and process WAV files. It will be used to pre-process the audio recordings before transcribing them.

Use the following command to install the packages mentioned above:

npm install

Understanding the directories and files

Excluding the package.json and .gitignore files, the project's root directory contains two directories named public and uploads, and a file named server.js.

The public directory is where the client application's static asset files are stored. It contains three files named index.html, styles.css, and index.js.

The index.html file creates the user interface for the application, including a div to display the main webcam feed, a container below this div to show the transcriptions, buttons to control the call, and a sidebar to display all webcam feeds. The file also links to the styles.css stylesheet and includes scripts for the Twilio Video SDK, Bootstrap, and a file named index.js.

The styles.css file and Bootstrap are used to style the application.

The index.js file contains the code that creates the Local Audio, Video, and Data tracks and then uses these tracks to create a video room.
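
The boilerplate in index.js isn't reproduced here, but a simplified sketch of what creating the local tracks and joining a room with the Twilio Video JavaScript SDK typically looks like is shown below; token and roomName are placeholders for values returned by the server:

// A simplified sketch, not the boilerplate verbatim: create local tracks
// (including a data track) and join a room with the Twilio Video JS SDK.
async function joinRoom(token, roomName) {
  const localDataTrack = new Twilio.Video.LocalDataTrack();
  const localTracks = await Twilio.Video.createLocalTracks({ audio: true, video: true });

  const room = await Twilio.Video.connect(token, {
    name: roomName,
    tracks: [...localTracks, localDataTrack],
  });
  return room;
}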

The uploads directory is where the audio recordings will be stored. It initially contains a sample audio recording file named audio.wav which will be used to test the speech-to-text functionality.

The server.js file sets up an Express.js server that serves static assets from the public directory and handles video call room creation and access token generation using the Twilio Video API. Clients can request to join a room by providing a room name, and the server ensures the room exists and provides them with the necessary access token for the video call.
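
The token-generation code lives in the boilerplate, but a hedged sketch of how an access token with a video grant is typically created with the twilio package looks like the following; the function name createAccessToken is illustrative, and the credentials referenced here are the ones you will store in the next section:

// Illustrative sketch of Twilio Video access token creation; the boilerplate
// may differ in details, but these are the standard twilio package calls.
const twilio = require('twilio');
const { AccessToken } = twilio.jwt;
const { VideoGrant } = AccessToken;

function createAccessToken(identity, roomName) {
  const token = new AccessToken(
    process.env.TWILIO_ACCOUNT_SID,
    process.env.TWILIO_API_KEY_SID,
    process.env.TWILIO_API_KEY_SECRET,
    { identity },
  );
  token.addGrant(new VideoGrant({ room: roomName }));
  return token.toJwt();
}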

Collecting and storing your Twilio credentials

Create a new file named .env in your project root directory. This file is where you will store the Twilio account credentials needed to create access tokens for the Twilio Video API. In total, you will need to collect and store three Twilio credentials.

The first credential required is the Account SID, which is located in the Twilio Console. This credential should be stored in the .env file once obtained:


TWILIO_ACCOUNT_SID=<Your Account SID>

The second and third required credentials are an API Key SID and API Key Secret, which can be obtained by following this guide. After obtaining both credentials, copy and store them in the .env file:


TWILIO_ACCOUNT_SID=<Your Account SID>
TWILIO_API_KEY_SID=<Your API Key>
TWILIO_API_KEY_SECRET=<Your Secret Key>

After generating an API key you will only be able to see the API Key Secret one time. Be sure to copy and paste it somewhere safe, or directly into your .env file right away.
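
With the .env file in place, the server presumably loads these values through dotenv, roughly like this (the variable names below are illustrative):

// Load the variables from .env into process.env (already wired up in the boilerplate)
require('dotenv').config();

const accountSid = process.env.TWILIO_ACCOUNT_SID;
const apiKeySid = process.env.TWILIO_API_KEY_SID;
const apiKeySecret = process.env.TWILIO_API_KEY_SECRET;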

Go back to your terminal and run the following command to start the application:

npm start

Open two tabs in your preferred browser and navigate to the http://localhost:3000/ URL in each. You should see a web page similar to the following:

Starter project demo

Adding Whisper to your application

In this section, you will learn how to use the Transformers.js package alongside a Whisper model to transcribe audio files.

The Transformers.js package is similar to its Python counterpart and relies on the pipeline API. Pipelines offer a user-friendly and high-level API for executing machine learning models.

Pipelines bundle a pre-trained model along with input preprocessing and output post-processing, creating the most straightforward method for utilizing models within the library.
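
As a quick illustration (separate from the tutorial code), a sentiment-analysis pipeline can be created and used in just a few lines; the speech-recognition pipeline you will create next follows the same pattern:

// A minimal, generic example of the Transformers.js pipeline API
async function classifyExample() {
  const { pipeline } = await import('@xenova/transformers');
  const classifier = await pipeline('sentiment-analysis');
  const result = await classifier('I love live transcriptions!');
  console.log(result); // e.g. [{ label: 'POSITIVE', score: 0.99 }]
}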

Create a file named whisper.js in your project root directory and add the following code to it:

const fs = require('fs');
const wavefile = require('wavefile');
let transcriber;

async function initialize() {
  const { pipeline } = await import('@xenova/transformers');
  transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
}

First, the code imports the fs and wavefile modules and declares a variable named transcriber.

Next, the code declares a function named initialize(). This function uses the dynamic import() function to load the pipeline function from the @xenova/transformers package, which is distributed as an ES module and therefore cannot be loaded with require().

The initialize() function then initializes the transcriber variable by creating a pipeline for automatic speech recognition using the Xenova/whisper-tiny.en model.

Add the following code below the initialize() function:

async function getAudioData(filePath) {
  const file = fs.readFileSync(filePath);
  const buffer = Buffer.from(file);

  const wav = new wavefile.WaveFile(buffer);
  wav.toBitDepth('32f'); // convert samples to 32-bit floating point
  wav.toSampleRate(16000); // resample to 16 kHz, the rate Whisper expects
  let audioData = wav.getSamples();
  if (Array.isArray(audioData)) {
    if (audioData.length > 1) {
      const SCALING_FACTOR = Math.sqrt(2);
      for (let i = 0; i < audioData[0].length; ++i) {
        audioData[0][i] = SCALING_FACTOR * (audioData[0][i] + audioData[1][i]) / 2;
      }
    }
    audioData = audioData[0];
  }
  return audioData;
}

The code above defines a function named getAudioData() which takes a file path as a parameter. This function is responsible for reading and preprocessing audio data from the given file.

The code reads the audio file at the given file path synchronously using fs.readFileSync() and then converts the file's contents into a buffer.

The buffer is then processed using the wavefile module. The audio is converted to a 32-bit floating point format using the .toBitDepth('32f') method and resampled to a rate of 16,000 samples per second using the .toSampleRate(16000) method.

The code uses conditional logic to check if the audio data contains multiple channels and merge them into a single channel if that is the case. This is done by averaging the values of the channels and scaling them with a factor. The final audio data is stored in the audioData variable.
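
For example, if a given sample is 0.2 in the left channel and 0.4 in the right channel, the merged mono sample becomes √2 × (0.2 + 0.4) / 2 ≈ 0.42.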

Add the following code below the getAudioData() function:

function deleteFile(filePath) {
  fs.unlink(filePath, (err) => {
    if (err) console.log(err);
  });
}

The code above defines a function named deleteFile() that takes a file path as a parameter and deletes the audio file located at that path.

Add the following code below the deleteFile() function:

async function transcribe(filePath) {
  const audioData = await getAudioData(filePath);
  const transcript = await transcriber(audioData);
  // deleteFile(filePath)
  return transcript;
}

The code defines a function named transcribe() that takes a file path as a parameter and transcribes the speech in the audio file located at that path.

The code inside this function calls the getAudioData() function and passes a file path as an argument to retrieve the preprocessed audio data.

Next, it uses the previously initialized transcriber pipeline to transcribe the audio data into text and stores the value returned in a variable named transcript.

Lastly, it returns the transcript variable.

Add the following code below the transcribe() function:

async function firstRun() {
  await initialize();
  const transcript = await transcribe('./uploads/audio.wav');
  console.log('transcript', transcript);
}

firstRun();

Here the code defines and calls a function named firstRun().

The code inside this function initializes the transcriber using the initialize() function.

It then transcribes the speech in the audio file named audio.wav located in the uploads directory using the transcribe() function and logs the resulting transcript.

You will use this function mainly to have the pipeline download and cache the Xenova/whisper-tiny.en model before wiring this script into your server application. When running for the first time, the pipeline can take a while to download and cache the pre-trained model; subsequent calls will be much faster.

Go back to your terminal and use the following command to run this script:

node whisper.js

Wait a few seconds and you should get an output similar to this:

transcript { text: ' Hello there.' }

Run the command above as many times as you wish and watch how subsequent calls are much faster since the model has already been downloaded and cached.

Go back to your whisper.js file and uncomment the line where the code calls the deleteFile() function located inside the transcribe() function.

Next, comment out the firstRun() function call, and export the initialize() and transcribe() functions by adding the following code below the firstRun() function call:

module.exports = { initialize, transcribe };

Open your server.js file and add the following code to the bottom of the import statements section:

const { initialize, transcribe } = require('./whisper');
initialize();

Here the code imports the initialize() and the transcribe() functions and calls the initialize() function to initialize the transformers.js pipeline.

Add the following code below the /join-room route:

app.post('/uploadAudio', upload.single('audio'), async (req, res) => {
  if (req.file === undefined) {
    res.send({
      success: false,
    });
  } else {
    const filePath = req.file.path;
    const transcript = await transcribe(filePath);
    res.send({
      success: true,
      transcript: transcript === undefined ? '' : transcript.text,
    });
  }
});

The code above defines a route named /uploadAudio. This route uses the multer module to handle audio file uploads and the transcribe() function to transcribe the audio file. These audio files are microphone recordings of the room participants who have enabled the transcription functionality.

The code first uses conditional logic to check if a file was uploaded in the request received.

If a file wasn't uploaded, a response stating that the request failed is sent back to the client.

If a file was uploaded the code calls the transcribe() function from the whisper.js module to transcribe the uploaded audio file and then sends the transcription result in a response.
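
The upload middleware referenced in the route comes from the boilerplate; it is presumably a multer instance configured along these lines, with the uploads directory as the destination:

// A hedged sketch of the multer setup assumed by the /uploadAudio route
const multer = require('multer');
const upload = multer({ dest: 'uploads/' });

With upload.single('audio') applied to the route, multer stores the uploaded file on disk and exposes its metadata, including the saved path, on req.file.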

Adding live transcriptions to the application

In this section, you will first use the MediaRecorder API to allow the room participants to record their microphones. Next, you will upload the recordings to the server, where they will be transcribed. Lastly, you will use the DataTrack API to share the returned transcript among the room participants and then display it below the main webcam feed.

Open up the index.html file, go to the bottom of the body tag and add a script named transcription.js of type module to it:


<!DOCTYPE html>
<html lang="en">
…
<body>
  …
  <script src="index.js"></script>
  <script type="module" src="transcription.js"></script>
</body>
</html>

Create a file named transcription.js within the public folder and add the following code to it:

import { MediaRecorder, register } from 'https://cdn.skypack.dev/extendable-media-recorder';
import { connect } from 'https://cdn.skypack.dev/extendable-media-recorder-wav-encoder';

let isMuted = false;
let isTranscribing = false;
let mediaRecorder;

const btnMuteUnmute = document.getElementById('btnMuteUnmute');
const btnTranscribe = document.getElementById('btnTranscribe');
const btnHangup = document.getElementById('btnHangup');

The code first uses the Skypack CDN to import the extendable-media-recorder and extendable-media-recorder-wav-encoder packages. These packages allow the application to use the MediaRecorder API to record the room participants' microphones in the WAV format, which is what the whisper.js module you created earlier expects to receive.

Next, it defines three variables named isMuted, isTranscribing, and mediaRecorder, which will be used to keep track of whether the room participant's microphone is muted, to keep track of whether the transcription functionality is enabled, and to store the MediaRecorder instance, respectively.

Finally, it defines variables named btnMuteUnmute, btnTranscribe, and btnHangup that will hold the Mute/Unmute, Transcribe, and HangUp buttons respectively. As the names suggest these buttons will allow the room participants to mute/unmute the microphone, toggle live transcription, and end the video call.

Add the following code below the btnHangup declaration:

async function initializeWaveEncoder() {
  await register(await connect());
}
initializeWaveEncoder();

Here, you define and call a function named initializeWaveEncoder(), which registers the WAV encoder so that the MediaRecorder can record audio in the WAV format.

Add the following code below the initializeWaveEncoder() function call:

function handleMuteUnmute() {
  if (room === undefined) {
    return;
  }

  const iElement = btnMuteUnmute.getElementsByTagName('i')[0];
  const tooltipInstance = bootstrap.Tooltip.getInstance(btnMuteUnmute);
  if (!isMuted) {
    isMuted = true;
    iElement.classList.replace('bi-mic-mute-fill', 'bi-mic-fill');
    btnMuteUnmute.classList.replace('btn-danger', 'btn-success');
    tooltipInstance.setContent({ '.tooltip-inner': 'Unmute' });
  } else {
    isMuted = false;
    iElement.classList.replace('bi-mic-fill', 'bi-mic-mute-fill');
    btnMuteUnmute.classList.replace('btn-success', 'btn-danger');
    tooltipInstance.setContent({ '.tooltip-inner': 'Mute' });
  }
}

This code defines a function named handleMuteUnmute(). Its purpose is to toggle the mute/unmute status of a local audio track within your video call application.

The code begins with a conditional statement that checks if the room is undefined. If that is the case, the function immediately returns indicating that the user hasn’t joined a room yet.

The code then checks if the isMuted variable is set to false. If that is the case, the code sets isMuted to true and changes the Mute/Unmute button color to green, the icon to a plain microphone, and the tooltip text to Unmute to indicate that the microphone is muted.

However, if the isMuted variable is set to true, the code sets it to false and changes the Mute/Unmute button color to red, the icon to a microphone with a slash, and the tooltip text to Mute to indicate that the microphone isn't muted.

Add the following code to the bottom of the handleMuteUnmute() function:


function handleMuteUnmute() {
  …
  room.localParticipant.audioTracks.forEach((trackPublication) => {
    if (isMuted) {
      trackPublication.track.disable();
    } else {
      trackPublication.track.enable();
    }
  });
}

The code above iterates over each audio track that belongs to the local participant and for each track found it disables the track if the isMuted variable is set to true (microphone muted) and enables the track if isMuted is set to false (microphone unmuted).

Add the following code below the handleMuteUnmute() function:

async function transcribe() {
  if (room === undefined) {
    return;
  }

  const tooltipInstance = bootstrap.Tooltip.getInstance(btnTranscribe);
  if (!isTranscribing) {
    isTranscribing = true;
    btnTranscribe.classList.replace('btn-success', 'btn-danger');
    tooltipInstance.setContent({ '.tooltip-inner': 'Disable live captions' });

    const participantDiv = document.getElementById(room.localParticipant.identity);
    const audioElement = participantDiv.getElementsByTagName('audio')[0];
    const audioStream = audioElement.srcObject;
    mediaRecorder = new MediaRecorder(audioStream, { mimeType: 'audio/wav' });
    recordAudio();
  } else {
  }
}

This code defines a function named transcribe. This function is responsible for toggling the live transcription feature, which can enable or disable real-time transcription of the video call audio content.

This code also begins with a conditional statement that checks if the room is undefined. If that is the case, the function immediately returns, indicating that the user hasn't joined a room yet.

The code then checks whether the isTranscribing variable is set to false, meaning that the transcription feature is disabled.

If that is the case, the code sets isTranscribing to true. The code then changes the Transcribe button color to red, and the tooltip to Disable live captions to indicate that the transcription is active.

Next, it retrieves the audio stream from the local participant's audio element, initializes and stores a MediaRecorder instance in a variable named mediaRecorder using the audio stream, and specifies the MIME type as 'audio/wav'.

Finally, it calls a function named recordAudio() to start recording audio.

Add the following code to the else statement in the transcribe() function:


async function transcribe() {
  …
  const tooltipInstance = bootstrap.Tooltip.getInstance(btnTranscribe);
  if (!isTranscribing) {
    …
  } else {
    isTranscribing = false;
    btnTranscribe.classList.replace('btn-danger', 'btn-success');
    tooltipInstance.setContent({ '.tooltip-inner': 'Enable live captions' });
    mediaRecorder.stop();
  }
}

If isTranscribing is set to true meaning the transcription feature is enabled, the code sets isTranscribing to false.

Next, it changes the Transcribe button color to green, and the tooltip to Enable live captions to indicate that the transcription is disabled.

Finally, it uses the mediaRecorder.stop() method to stop the MediaRecorder.

Add the following code below the transcribe() function:

function recordAudio() {
  let chunks = [];
  mediaRecorder.start();

  const interval = setInterval(() => {
    mediaRecorder.stop();
  }, 6000);

  mediaRecorder.onstart = () => {
    console.log('recorder started');
  };

  mediaRecorder.ondataavailable = (e) => {
    if (!isMuted) {
      chunks.push(e.data);
    }
  };
}

This code defines the function named recordAudio() which is responsible for handling the recording of audio data using the MediaRecorder API and the extendable-media-recorder package.

The code first initializes an empty array called chunks. This array will be used to store audio data in chunks as it's recorded. It then uses the mediaRecorder.start() method to start the MediaRecorder, initiating the recording process.

Next, the code sets up an interval timer configured to call mediaRecorder.stop() every 6 seconds. This timer effectively stops the recording every 6 seconds, limiting each recording to 6 seconds of audio.

Next, it sets an event handler for the onstart event of the MediaRecorder. When the recording starts, it logs a message to the console indicating that the recorder has started.

Finally, it sets an event handler for the ondataavailable event of the MediaRecorder. This event is fired when audio data becomes available for the recorder.

Inside this event handler, a conditional check verifies that the microphone is not muted. If it isn't, the audio data e.data is pushed into the chunks array.

Add the remaining MediaRecorder event handler to the bottom of the recordAudio() function:


function recordAudio() {
  …

  mediaRecorder.onstop = async (e) => {
    console.log('recorder stopped');
    const blob = new Blob(chunks, { type: 'audio/wav' });
    uploadAudio(blob);

    chunks = [];
    if (isTranscribing) {
      mediaRecorder.start();
    } else {
      clearInterval(interval);
    }
  };
}

This code sets an event handler for the onstop event of the MediaRecorder. This event is fired when the recording stops, either due to manual stopping (pressing the Transcribe button) or because the 6-second interval timer triggered it.

Inside this event handler, the code first logs a message to the console indicating that the recorder has stopped.

The code then creates a Blob object named blob by assembling all the recorded audio data stored in the chunks array and sets the Blob type to audio/wav.

Next, it calls the uploadAudio() function, passing the blob as an argument, to upload the recorded audio data.

After uploading the recorded data it resets the chunks array to an empty array.

Finally, if isTranscribing is set to true, indicating that transcription is still active, it restarts the MediaRecorder using the mediaRecorder.start() method. However, if isTranscribing is set to false, it clears the interval timer, effectively stopping the periodic recording.

Add the following code below the recordAudio() function:

function uploadAudio(blob) {
  const dlUrl = URL.createObjectURL(blob).split('/');
  const filename = `${dlUrl[3]}.wav`;
  const file = new File([blob], filename);

  const formData = new FormData();
  formData.append('audio', file);

};

This code defines a function named uploadAudio() which is responsible for uploading audio data to the /uploadAudio endpoint created earlier.

The code first creates a URL from the provided blob object and uses part of the URL to create a filename for the audio file.

The code then creates a new File object named file using the blob as its content and the filename as its name.

Next, the code creates an empty FormData object named formData and appends the file object to it with the field name 'audio'.

Add the following code to the bottom of the uploadAudio() function:


function uploadAudio(blob) {
  …

  fetch('/uploadAudio', {
    method: 'POST',
    body: formData,
  })
    .then(async (response) => {
      const res = await response.json();
      if (res.success) {
        showTranscript(res.transcript, room.localParticipant.identity);
        sendTranscript(res.transcript);
      }
      console.log(res);
    })
    .catch((err) => {
      console.log('err', err);
    });
};

The code added uses the fetch API to send an HTTP POST request containing the formData to the server at the '/uploadAudio' endpoint.

After the POST request completes, the code checks whether the request was successful. If so, it calls a function named showTranscript() with the transcript data and the local participant's identity as arguments, and a function named sendTranscript() with the transcript data as an argument.

Add the following code below the uploadAudio() function:

function sendTranscript(transcript) {
  localDataTrack.send(JSON.stringify({
    transcript: transcript,
  }));
}

function hangUp() {
  if (room !== undefined) {
    room.disconnect();
    window.location.replace('https://www.twilio.com/en-us/video');
  }
};

Here, the code defines two functions named sendTranscript() and hangUp().

The sendTranscript() function uses the DataTrack API to send a message containing the transcript of the local participant's microphone recording to the remote participants in the room.

The hangUp() function checks if a room exists and if that is the case it disconnects from the room and redirects the user to the Twilio Programmable Video API page.

Add the following code below the hangUp() function:

btnMuteUnmute.addEventListener('click', handleMuteUnmute);
btnTranscribe.addEventListener('click', transcribe);
btnHangup.addEventListener('click', hangUp);

Here you added click event listeners for the Mute/Unmute, Transcribe, and HangUp buttons.

When a button is clicked, it invokes the corresponding function (handleMuteUnmute(), transcribe(), or hangUp()), which carry out the intended action associated with that button, such as muting/unmuting audio, toggling live transcription, or ending a call.

Open the index.js file and add the following code to the bottom of this file:

function showTranscript(transcript, identity) {
  if (transcript !== '') {
    const pElement = document.createElement('p');
    pElement.setAttribute('class', 'transcript-p');
    const username = identity === room.localParticipant.identity ? '[ You ]' : `[ user-${truncate(identity, 10)} ]`;
    pElement.innerText = `${username}: ${transcript}`;

    if (divLiveTranscript.children.length < 2) {
      divLiveTranscript.appendChild(pElement);
    } else {
      divLiveTranscript.removeChild(divLiveTranscript.firstElementChild);
      divLiveTranscript.appendChild(pElement);
    }
  }
}

This code defines the showTranscript() function which is responsible for displaying the recording transcript in a user-friendly format on the web page below the main webcam feed.

It creates HTML paragraph elements for each recording transcript, uses the participant’s identity to create a formatted username, sets the text content, and manages the display of these messages within a container where the username appears before the transcript.

It ensures that only two transcripts are shown at a time, replacing the oldest transcript with the newest one when necessary.

Add the following code to the data track's message event handler inside the handleTrackSubscription() function (located around line 109 of index.js):


function handleTrackSubscription(track, participant) {
  if (track.kind === 'data') {
    console.log('data track subscription');
    track.on('message', (data) => {
      const message = JSON.parse(data);
      showTranscript(message.transcript, participant.identity);
    });
  }
}

Here you are calling the showTranscript() function and passing the transcript stored in the message received and the participant’s identity as arguments to display the transcript on the web page.

Use the following command to start the server application:

npm start

Once the server starts, refresh the tabs where you visited the http://localhost:3000/ URL. In one of the tabs, click the Transcribe button, say something, and watch a live transcript appear in both the local and remote room participants' tabs.

Complete live transcriptions web app demo

Conclusion

In this tutorial, you learned how to add live transcriptions to a video call web app. First, you cloned a video call web app project, familiarized yourself with the code, and learned how the Twilio Video API creates a video room where room participants can join and communicate with each other. Next, you learned how to use the Transformers.js package alongside the Whisper Tiny model to transcribe audio files and added this functionality to the server application. Lastly, you used the MediaRecorder API to let room participants record their microphones, sent the recordings to the server to be transcribed, used the DataTrack API to share the transcripts among the room participants, and displayed each transcript below the main webcam feed.