How to Play Audio Files in a Twilio Video Call

March 10, 2022


This article is for reference only. We're not onboarding new customers to Programmable Video. Existing customers can continue to use the product until December 5, 2024.


We recommend migrating your application to the API provided by our preferred video partner, Zoom. We've prepared a migration guide to help you minimize any service disruption.

A common need in video calling applications is to allow a user to play an audio file for the other participants. This could be used to add background music to a call, to share a recorded conversation, or just to make calls more fun with sound effects or a rickroll.

In this short tutorial you will learn how a participant in a Twilio Video call can publish a secondary audio track and play back an audio file on it.

Brief introduction to the MediaStream API

The JavaScript Twilio Video library uses existing APIs in the browser to obtain access to the camera and microphone. More specifically, it uses the MediaDevices.getUserMedia() function to access MediaStream objects for these devices, which expose the raw video and audio tracks that are then published to the video call.

Interestingly, there are other ways to obtain MediaStream objects, beyond the camera and the microphone. The MediaDevices.getDisplayMedia() function prompts the user to select a display, window or browser tab to share, and returns a MediaStream object with a video track of the selected screen element, allowing the user to share their screen on the call (there is a tutorial available for this if you are interested).

A less well-known option is the HTMLMediaElement.captureStream() function, which returns a MediaStream object associated with a <video> or <audio> element. To play an audio file on a video call, the application running in the originating participant's browser must get the MediaStream from an <audio> element and publish it to the call as a local audio track. Once the track is published, any audio played on this element will be received by the other participants in the call.
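Browser support for captureStream() on media elements is uneven: Firefox has exposed it only under the prefixed name mozCaptureStream(), and other browsers may not offer it at all, so check current compatibility data for the browsers you target. A minimal sketch of a feature-detecting helper, where captureMediaStream is a hypothetical name and the argument is assumed to be an <audio> or <video> element:

```javascript
// Hypothetical helper: return a MediaStream for an <audio> or <video>
// element, or throw if this browser offers no way to capture one.
function captureMediaStream(mediaElement) {
  if (typeof mediaElement.captureStream === 'function') {
    return mediaElement.captureStream();
  }
  // Firefox exposes the same functionality under a prefixed name.
  if (typeof mediaElement.mozCaptureStream === 'function') {
    return mediaElement.mozCaptureStream();
  }
  throw new Error('This browser cannot capture a MediaStream from a media element');
}
```

The canplay handlers shown later in this article could call captureMediaStream(bgAudio) in place of bgAudio.captureStream().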

The following sections describe how to use HTMLMediaElement.captureStream() to play audio files on a video call. Near the end of the article you can find a link to a complete example that you can try.

Adding an audio element

The audio file will need to play in an <audio> element on the page of the originating participant. This element can be created dynamically with JavaScript when needed, but given that it can be a completely invisible element, it can also be conveniently created from the start.

An invisible audio element can be added to the page with the following HTML:

<audio id="bgaudio"></audio>

The id attribute is not required, but makes it easier to locate this element from JavaScript later on. Using vanilla JavaScript, the element can be accessed as follows:

const bgAudio = document.getElementById('bgaudio');
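If you prefer to create the element from JavaScript only when it is first needed, as mentioned above, a minimal sketch could look like this. The helper name getOrCreateAudioElement is hypothetical, and the default id matches the bgaudio id used in this article:

```javascript
// Hypothetical helper: find the hidden <audio> element by id, creating it
// on first use if the page does not already include it.
function getOrCreateAudioElement(id = 'bgaudio') {
  let audioElement = document.getElementById(id);
  if (!audioElement) {
    audioElement = document.createElement('audio');
    audioElement.id = id;
    // Without a "controls" attribute the element is not rendered.
    // Adding it to the page lets later calls find it by id.
    document.body.appendChild(audioElement);
  }
  return audioElement;
}
```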

Loading an audio file

When the participant decides to play audio on the call, the application needs to know which audio file to play. If the audio file is known in advance, provide it as the src attribute when the <audio> element is defined:

<audio id="bgaudio" src="myAudioFile.mp3"></audio>

In many cases the application should allow the user to select an audio file while the call is taking place. This can be done in the browser via drag and drop, or with a file input element. In both cases, the selected file can be retrieved with JavaScript as a File object.

This File object needs to be converted to a URL that can be assigned to the src attribute of the audio element. The URL.createObjectURL() function does this for us:

bgAudio.src = URL.createObjectURL(file);
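For drag and drop, the File comes from the drop event's dataTransfer.files list; for a file input, it comes from the input's files property. A minimal sketch, assuming a hypothetical loadAudioFile helper that also revokes the previous object URL so the browser can release the old file when a new one is loaded. The audio/ MIME type check is an assumption of this sketch; browsers may report an empty type for files they cannot identify, and those are rejected here.

```javascript
// Object URL currently assigned to the audio element, kept so it can be
// released when a different file is loaded.
let currentAudioUrl = null;

// Hypothetical helper: point the audio element at a user-selected File.
// Returns false, and leaves the element untouched, if it is not audio.
function loadAudioFile(audioElement, file) {
  if (!file || !file.type.startsWith('audio/')) {
    return false;
  }
  if (currentAudioUrl) {
    URL.revokeObjectURL(currentAudioUrl);
  }
  currentAudioUrl = URL.createObjectURL(file);
  audioElement.src = currentAudioUrl;
  return true;
}

// Example wiring for a drop target (dropTarget is an assumed element):
// dropTarget.addEventListener('dragover', (event) => event.preventDefault());
// dropTarget.addEventListener('drop', (event) => {
//   event.preventDefault();
//   loadAudioFile(bgAudio, event.dataTransfer.files[0]);
// });
```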

Loading the audio into the audio element happens asynchronously. The canplay event fires when the audio element is ready to play the file.

bgAudio.oncanplay = async () => {
  // TODO
};
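Assigning oncanplay replaces any handler set earlier, and canplay can fire more than once for the same file, for example after playback recovers from a stall. If you prefer to await readiness from inside an async function, a minimal promise-based sketch could look like this (waitForCanPlay is a hypothetical name):

```javascript
// Hypothetical helper: resolve once the media element can start playing,
// or reject if loading fails. canplay fires when readyState reaches
// HAVE_FUTURE_DATA (3), so check readyState first in case that already
// happened before this function was called.
function waitForCanPlay(mediaElement) {
  const HAVE_FUTURE_DATA = 3;
  if (mediaElement.readyState >= HAVE_FUTURE_DATA) {
    return Promise.resolve();
  }
  return new Promise((resolve, reject) => {
    const onCanPlay = () => {
      cleanup();
      resolve();
    };
    const onError = () => {
      cleanup();
      reject(new Error('The audio file could not be loaded'));
    };
    // Remove both listeners so neither outlives the promise.
    const cleanup = () => {
      mediaElement.removeEventListener('canplay', onCanPlay);
      mediaElement.removeEventListener('error', onError);
    };
    mediaElement.addEventListener('canplay', onCanPlay);
    mediaElement.addEventListener('error', onError);
  });
}
```

With it, an async function could run bgAudio.src = URL.createObjectURL(file); await waitForCanPlay(bgAudio); and then continue with the capture and publish steps.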

Publishing the audio track to the video room

The audio element is now ready to play the audio file, so the next step is to publish a new audio track to the video room.

The captureStream() method of the audio element returns a MediaStream instance, which includes all the media tracks that are available. The audio tracks are provided by the getAudioTracks() method. Since we are interested in a single audio track, we can take the first track and ignore any extra ones.


bgAudio.oncanplay = async () => {
  const stream = bgAudio.captureStream();
  const audioStream = stream.getAudioTracks()[0];
  // TODO
};

The Twilio Video library uses the LocalAudioTrack class to represent audio from the local participant. Its constructor accepts a standard browser audio track (a MediaStreamTrack), such as the one stored in audioStream above.


let bgAudioTrack;

bgAudio.oncanplay = async () => {
  const stream = bgAudio.captureStream();
  const audioStream = stream.getAudioTracks()[0];
  bgAudioTrack = new Twilio.Video.LocalAudioTrack(audioStream);
  // TODO
};

The bgAudioTrack variable is defined globally because this track will need to be accessed later when audio playback ends to clean everything up.

Now the LocalAudioTrack can be published to the video room:


bgAudio.oncanplay = async () => {
  const stream = bgAudio.captureStream();
  const audioStream = stream.getAudioTracks()[0];
  bgAudioTrack = new Twilio.Video.LocalAudioTrack(audioStream);
  await room.localParticipant.publishTrack(bgAudioTrack);
  // TODO
};
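Remote participants receive this track alongside the microphone track, so it can help to give it a recognizable name. The LocalAudioTrack constructor accepts an options object as its second argument, and its name is reported to remote participants as the track's name. A minimal sketch, where both the helper publishBackgroundAudio and the name 'background-audio' are hypothetical choices, and room is assumed to be a connected Room:

```javascript
// Hypothetical name used to tell this track apart from the microphone.
const BG_AUDIO_TRACK_NAME = 'background-audio';

// Wrap a captured MediaStreamTrack in a named LocalAudioTrack and publish
// it. Returns the LocalAudioTrack so it can be unpublished later.
async function publishBackgroundAudio(room, mediaStreamTrack) {
  const track = new Twilio.Video.LocalAudioTrack(mediaStreamTrack, {
    name: BG_AUDIO_TRACK_NAME,
  });
  await room.localParticipant.publishTrack(track);
  return track;
}
```

On the receiving side, the name is available as the name property of the remote track, which lets the application treat background audio differently from a participant's microphone.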

Playing the audio

The audio track is now published, and all participants are ready to receive audio on it. The last step to share this audio is to tell the audio element to start playing. If you are using a visible audio element, this can be done manually by the user, but in the case of an invisible audio element, you can use the play() method:


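bgAudio.oncanplay = async () => {
  // canplay can fire more than once for the same file, so only publish
  // a new track if one is not already being shared.
  if (bgAudioTrack) return;
  const stream = bgAudio.captureStream();
  const audioStream = stream.getAudioTracks()[0];
  bgAudioTrack = new Twilio.Video.LocalAudioTrack(audioStream);
  await room.localParticipant.publishTrack(bgAudioTrack);
  // play() returns a promise that rejects if the browser refuses to play.
  await bgAudio.play();
};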

At this point the audio will start playing for the local participant, and will also be streamed to the remaining participants as a secondary audio track from this participant. The participant sharing this audio file will still be able to speak on their microphone, as these are two independent audio tracks.
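Because play() returns a promise, it can reject, for example when an autoplay policy blocks playback or the file cannot be decoded. If that happens after the track has been published, the other participants are left with a silent track. A minimal sketch of one way to handle it, assuming room and the published LocalAudioTrack are passed in; startPlayback is a hypothetical helper name:

```javascript
// Hypothetical helper: start playback after the track has been published.
// If the browser refuses to play, unpublish the track so remote
// participants are not left with a silent one. Returns true on success.
async function startPlayback(room, audioElement, localAudioTrack) {
  try {
    await audioElement.play();
    return true;
  } catch (err) {
    console.warn('Could not play the audio file:', err);
    room.localParticipant.unpublishTrack(localAudioTrack);
    return false;
  }
}
```

In the canplay handler, the call to bgAudio.play() could then become if (!(await startPlayback(room, bgAudio, bgAudioTrack))) bgAudioTrack = null;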

Cleanup

The application needs to decide when to stop sharing audio. One option is to offer a UI element for the user to stop playback, or it may rely on the controls offered in a visible audio element. Another option is to wait for the audio playback to end. This really depends on the application, but whenever the application decides to stop sharing this audio, the audio track that was published to the call must be unpublished.
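For the stop-button option, note that pausing an audio element does not fire the ended event, so the button's handler has to withdraw the track itself. A minimal sketch, where stopSharing is a hypothetical helper and room, the audio element and the published track are passed in as arguments:

```javascript
// Hypothetical helper for a "stop" button: silence the element and
// unpublish the shared track. Pausing does not fire "ended", so an
// ended handler will not perform this cleanup.
function stopSharing(room, audioElement, localAudioTrack) {
  audioElement.pause();
  if (localAudioTrack) {
    room.localParticipant.unpublishTrack(localAudioTrack);
  }
}
```

For example, with a hypothetical stopButton element: stopButton.onclick = () => { stopSharing(room, bgAudio, bgAudioTrack); bgAudioTrack = null; };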

In the following example, a handler for the audio element’s ended event is used to perform the cleanup operations:

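bgAudio.onended = () => {
  // Nothing to clean up if no track is currently shared.
  if (!bgAudioTrack) return;
  // unpublishTrack() returns synchronously; remote participants are notified.
  room.localParticipant.unpublishTrack(bgAudioTrack);
  bgAudioTrack = null;
};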

When the track is unpublished, the other participants are notified and stop receiving it.
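On the receiving side, the secondary track arrives like any other remote audio track. If the application does not already handle this, a minimal sketch of attaching and detaching remote audio could look like the following, assuming participant is a Twilio Video RemoteParticipant and container is a DOM element; handleRemoteAudio is a hypothetical name:

```javascript
// Hypothetical helper: play every audio track a remote participant
// publishes, and remove the <audio> elements when a track goes away.
function handleRemoteAudio(participant, container) {
  participant.on('trackSubscribed', (track) => {
    if (track.kind === 'audio') {
      // attach() creates an <audio> element that plays this track.
      container.appendChild(track.attach());
    }
  });
  participant.on('trackUnsubscribed', (track) => {
    if (track.kind === 'audio') {
      // detach() returns the elements the track was attached to.
      track.detach().forEach((element) => element.remove());
    }
  });
}
```

Tracks subscribed before these listeners were added can be found through the track property of each publication in participant.audioTracks. If the publisher named its track, the handler can also check track.name, for example to show a music indicator.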

A working example

Are you interested in trying this out with a fully working application? I have implemented the techniques discussed in this article in the project I developed for my serverless video tutorial.

To try it out, follow these steps:

Clone the project’s repository with the following commands:

git clone https://github.com/miguelgrinberg/twilio-serverless-video
cd twilio-serverless-video
git checkout bgaudio

Note that the audio file playback support is in the bgaudio branch of the repository.

Create a file named .env in the project directory with the following contents:

ACCOUNT_SID=XXXXX
API_KEY_SID=XXXXX
API_KEY_SECRET=XXXXX

If you don’t know what to set these three variables to, see the original tutorial for detailed instructions.

Install the dependencies and run the project locally with the following commands, then navigate to the application in your browser at http://localhost:3000/index.html.

npm install
npm start

The command below deploys the application to the Twilio Serverless platform:

twilio serverless:deploy

Note that for this command to work, the Twilio CLI must have the Serverless plugin installed (twilio plugins:install @twilio-labs/plugin-serverless) and must be authenticated in advance. You can authenticate with the twilio login command, or by setting the TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN environment variables.

The deploy command prints the URLs of all the deployed assets and functions. You can open the URL for the index.html file, or just the domain on its own.

Once the application is running, connect to the video room in two or more browsers, and then drag and drop an audio file on the local video on any of the browsers to play the audio on the call.

Next steps

I hope this article gives you some ideas on how to work with audio files in your video calling application.

If you are wondering if there is a way to apply the same technique to video, the answer is yes! Users can share a secondary video track as well as audio. The code examples shown in this article can be adapted to work with a video element, which may expose secondary video and audio tracks to share media played in a local video element. Let me know what the results are if you attempt this.
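As an untested starting point for that experiment, a sketch for a <video> element might capture both kinds of tracks and wrap each one in the matching Twilio class before publishing them together with publishTracks(). It assumes a connected room and a video element that is ready to play; publishVideoElement is a hypothetical name, and it calls captureStream() directly, so the prefixed Firefox variant would need the same feature detection as audio.

```javascript
// Hypothetical helper: publish the picture and sound of a <video> element
// as secondary tracks. Either kind may be missing, e.g. a silent video.
async function publishVideoElement(room, videoElement) {
  const stream = videoElement.captureStream();
  const tracks = [];
  const [videoTrack] = stream.getVideoTracks();
  if (videoTrack) {
    tracks.push(new Twilio.Video.LocalVideoTrack(videoTrack));
  }
  const [audioTrack] = stream.getAudioTracks();
  if (audioTrack) {
    tracks.push(new Twilio.Video.LocalAudioTrack(audioTrack));
  }
  if (tracks.length === 0) {
    throw new Error('The video element has no tracks to share yet');
  }
  await room.localParticipant.publishTracks(tracks);
  // Keep these so they can be withdrawn later with unpublishTracks().
  return tracks;
}
```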

I can’t wait to see what you build with Twilio Video!

Miguel Grinberg is a Principal Software Engineer for Technical Content at Twilio. Reach out to him at mgrinberg [at] twilio [dot] com if you have a cool project you’d like to share on this blog!