Build vs. Buy: An Architect's Guide to Video & WebRTC

August 02, 2022
Written by
Reviewed by
Paul Kamp
Twilion
Lyssa Test
Twilion

This article is for reference only. We're not onboarding new customers to Programmable Video. Existing customers can continue to use the product until December 5, 2024.


We recommend migrating your application to the API provided by our preferred video partner, Zoom. We've prepared this migration guide to assist you in minimizing any service disruption.

The last few years have dramatically shifted the world of video. While this channel was steadily growing in popularity before 2020, the pandemic accelerated its growth as organizations scrambled to find new ways to bring employees and customers together. In fact, business and video conferencing apps experienced record downloads as people moved online to stay connected during lockdowns.

But even as restrictions ease, video has proved it’s here to stay. Customers now expect virtual, face-to-face interaction with their favorite brands and rely upon video to stay informed, remain connected, and do their jobs.

Behind these shifting customer expectations and boost in adoption is the Web Real-Time Communication (WebRTC) standard. WebRTC lowers the barrier to entry for video, providing teams a standard way to implement impactful video across a wide variety of browsers and platforms. But even with this standardization, implementing WebRTC can be a challenge. In fact, if your business makes poor strategic architectural decisions, you can quickly turn a successful implementation into a nightmare.

In this article, we’ll look at the challenges of implementing WebRTC in your application, the pros and cons of building a custom WebRTC solution versus using a cloud-based API, and how you might solve these challenges using Twilio.

A brief introduction to WebRTC

First introduced in 2011, WebRTC is an open-source project designed to embed audio and video functionality within browsers and mobile applications using a shared set of protocols. It allows peer-to-peer audio and video communication without the need for plugins or native apps. All major browsers support it natively, covering 96% of users, and quite a few major brands use it, like Facebook, Google Hangouts, and Houseparty.

When implementing WebRTC, there are 2 basic choices:

  1. Build and deploy a solution using open-source software (OSS) offerings such as Jitsi or Kurento
  2. Integrate a cloud-based video API such as Twilio

While modern development often leans toward cloud-based solutions, there are still situations where you might choose to build a custom solution. For example, a custom solution might be best if you face regulatory and compliance issues that require maintaining your own servers, or if the fixed operational cost of managing services is a better fit for your business case.

In most situations, however, choosing a cloud-based API is the better and more cost-effective choice.

Hosting your own solution

So what are the design and architectural challenges of hosting a solution versus using a cloud-based API? Let’s explore building a custom solution using an OSS offering like Kurento.

Kurento is a powerful open-source WebRTC media server with a set of client APIs (Java and JavaScript at publication). There are several major concerns you’ll need to address when hosting a solution with Kurento. Let’s look at each.

1. Deploying and maintaining a Kurento server

The first step when using Kurento is deploying and running your server. The server needs to be available via an internet-accessible IP address. Typically, this is accomplished by deploying to on-prem hardware, using an EC2 instance on AWS, or accessing a powerful virtual machine from another provider. Check out the Kurento documentation for information on deployment steps and alternatives.

2. Deploying and maintaining a STUN/TURN server

Once your Kurento server is up and running, the next challenge is that a Kurento deployment also requires separate servers that offer Session Traversal Utilities for NAT (STUN) and Traversal Using Relays around NAT (TURN) functionality. STUN lets an endpoint behind a NAT or firewall (a client or the media server itself) discover the public IP address it uses. TURN, on the other hand, acts as a relay for connections that can't be established directly across NATs or firewalls. One popular project that implements the complexities of a STUN/TURN server is coturn.

[Figure: WebRTC architecture, showing how WebRTC interacts with the TURN relay]

The above image shows how the WebRTC server works together with the TURN relay. First, users connect to the WebRTC server to access the signaling layer, where session information is exchanged, as opposed to the audio/video streams themselves. Clients then connect to the TURN relay to establish the audio/video streams.
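
To make the split between signaling and media concrete, here is a minimal in-memory sketch of what a signaling layer does: it forwards small session messages (offers, answers, ICE candidates) between the peers in a room and never touches audio or video. The `SignalingRelay` class and the message shapes are hypothetical and exist only for illustration; they are not Kurento's API, and a real deployment would carry these messages over WebSockets or a similar transport.

```javascript
// Hypothetical in-memory signaling relay: illustration only, not Kurento's API.
class SignalingRelay {
  constructor() {
    // room name -> Map(peerId -> callback that receives messages)
    this.rooms = new Map();
  }

  // Register a peer and the callback that receives messages addressed to it.
  join(roomName, peerId, onMessage) {
    if (!this.rooms.has(roomName)) this.rooms.set(roomName, new Map());
    this.rooms.get(roomName).set(peerId, onMessage);
  }

  // Forward a signaling message to every other peer in the room.
  // Returns the number of peers that received it.
  send(roomName, fromPeerId, message) {
    const peers = this.rooms.get(roomName) || new Map();
    let delivered = 0;
    for (const [peerId, onMessage] of peers) {
      if (peerId === fromPeerId) continue; // don't echo back to the sender
      onMessage({ from: fromPeerId, ...message });
      delivered += 1;
    }
    return delivered;
  }
}

// Usage: two peers exchange an offer and an answer (SDP bodies are placeholders).
const relay = new SignalingRelay();
const inbox = [];
relay.join('DailyStandup', 'alice', msg => inbox.push(['alice', msg]));
relay.join('DailyStandup', 'bob', msg => inbox.push(['bob', msg]));
relay.send('DailyStandup', 'alice', { type: 'offer', sdp: '<session description>' });
relay.send('DailyStandup', 'bob', { type: 'answer', sdp: '<session description>' });
```

Only these lightweight messages pass through the relay; once the peers have agreed on a session, the audio/video streams travel separately (directly or through the TURN relay).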

3. Dealing with scalability and performance

Next, you’ll need to consider day-2 issues, like scalability, performance, and availability.

For example, coturn ships with an SQLite database by default. While SQLite is performant for simple situations, it doesn’t operate well in complex situations or across multiple coturn instances. That means your team may need to switch to an alternative database, such as PostgreSQL or MongoDB. Your team will also need to consider codecs, network topology, receive side scaling, receive packet steering, and more.

4. Building or adapting a custom client library

A final challenge with the various open-source offerings is finding a client library that meets your project’s specific needs. Most projects ship with a built-in API. Kurento, for example, ships with Java and JavaScript APIs. However, if you use a different framework than what the project provides, you’ll have to utilize a generic WebRTC client. With a generic client, you’ll need to manage all the connectivity functions (such as connections via STUN/TURN) manually.
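
As a rough idea of what managing connectivity manually involves, the sketch below builds the ICE server configuration that a generic browser client passes to the standard `RTCPeerConnection` constructor. The `buildIceConfig` helper, the hostnames, and the credentials are all hypothetical placeholders; substitute the values for your own STUN/TURN deployment.

```javascript
// Build the configuration a generic WebRTC client passes to RTCPeerConnection.
// All hostnames and credentials used below are hypothetical placeholders.
function buildIceConfig({ stunHost, turnHost, username, credential }) {
  return {
    iceServers: [
      // STUN: lets the client discover its public address.
      { urls: `stun:${stunHost}:3478` },
      // TURN: relays media when a direct path can't be established.
      {
        urls: [
          `turn:${turnHost}:3478?transport=udp`,
          `turn:${turnHost}:3478?transport=tcp`,
        ],
        username,
        credential,
      },
    ],
  };
}

const iceConfig = buildIceConfig({
  stunHost: 'stun.example.org',
  turnHost: 'turn.example.org',
  username: 'demo-user',
  credential: 'demo-secret',
});

// In a browser, the config is handed to the standard WebRTC API:
//   const peerConnection = new RTCPeerConnection(iceConfig);
//   peerConnection.onicecandidate = event => {
//     /* send event.candidate to the other peer over your signaling channel */
//   };
```

This covers only server selection. A generic client still has to exchange offers, answers, and ICE candidates itself, which is the work a project-specific client library would otherwise hide.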

Now that we’ve walked through implementing your solution, let’s look at an alternate path using a cloud-based API offering, like Twilio, to see how it can make your life easier.

Integrating WebRTC video with Twilio

When implementing video using the Twilio APIs, you don’t need to worry about infrastructure, scalability, performance, availability, STUN/TURN, and so on, as you would when deploying a custom solution. Twilio handles all those concerns for you. Plus, once you’re ready to start using video, integrating with Twilio is easy—simply call the Twilio API from your existing code and you’ll be well on your way.

Because Twilio is a cloud-based API, the code will work on-prem or in the cloud at all stages of your development cycle, without the need for upgrades or continual enhancements. Plus, there's no special hardware to buy and no complex architecture to design. In fact, you can create and deploy a reference app for video collaboration in just minutes, without spending significant effort gathering requirements up front, and then adjust and tune as you go.

Getting started

Let’s look at the basic design needed to integrate video into your application using Twilio.

Twilio provides API support for iOS, Android, and JavaScript. For an easy overview of these concepts (regardless of platform), check out this tutorial and these reference apps for web, Android, and iOS. For our example, we’ll use JavaScript with a Node.js back end.

1. Create a Room. While the first client to connect can do this, creating a room ahead of time via a server-side REST API call from your Node.js code allows for better control over the room type, codecs, number of participants, and so on.

The Room will have a unique identifier called an SID—this is what clients use to join that specific Room. It takes just 2 lines of code to create a new Room. In this example, we’re creating a peer-to-peer room, but we can also use WebRTC Go (a quick-to-deploy, limited, free solution) or a group (the default type that scales to 50 individuals).

// server side
const client = require('twilio')(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);
client.video.rooms.create({uniqueName: 'DailyStandup', type: 'peer-to-peer',})
                 .then(room => console.log(room.sid));

2. Give clients access tokens for the Room. Twilio provides libraries for this purpose. In our case, the getAccessToken() function on the client side invokes a server-side API that generates an access token.

// server side
const express = require('express');
const app = express();

const AccessToken = require('twilio').jwt.AccessToken;
const VideoGrant = AccessToken.VideoGrant;

app.get('/token', function (req, res) {
  const identity = req.query.identity;
  const room = req.query.room;

  // Create an access token, which we will sign and return to the client.
  // The identity is passed in the options object, as recent versions of
  // the twilio helper library require it at construction time.
  const token = new AccessToken(
    process.env.TWILIO_ACCOUNT_SID,
    process.env.TWILIO_API_KEY,
    process.env.TWILIO_API_SECRET,
    { identity }
  );

  // Grant the access token Twilio Video capabilities for this Room
  const grant = new VideoGrant({ room });
  token.addGrant(grant);

  // Serialize the token to a JWT string
  res.send(token.toJwt());
});

// Start the server (the port is an assumption; use whatever your app requires)
app.listen(process.env.PORT || 3000);


// client side
import { connect } from 'twilio-video';

// getAccessToken() invokes the above via a REST API call, so it resolves asynchronously
const token = await getAccessToken();
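
The getAccessToken() function isn't defined in the snippet above. One possible shape for it is sketched below, assuming the /token route shown earlier is served from the same origin and returns the JWT as plain text. The parameter names and the injectable `fetchImpl` argument are assumptions made for illustration, not part of the Twilio SDK.

```javascript
// Hypothetical client-side helper: requests a token from the /token route.
// `fetchImpl` is injectable so the helper can be exercised without a network.
async function getAccessToken(identity, room, fetchImpl = globalThis.fetch) {
  // Encode the query parameters the server-side route reads from req.query.
  const params = new URLSearchParams({ identity, room });
  const response = await fetchImpl(`/token?${params}`);
  if (!response.ok) {
    throw new Error(`Token request failed with status ${response.status}`);
  }
  // The server sends the serialized JWT as the plain-text response body.
  return response.text();
}

// Possible usage in the browser:
//   const token = await getAccessToken('alice', 'DailyStandup');
```

With a helper like this, the identity and Room name are passed explicitly, and a failed token request surfaces as a rejected promise instead of an unusable token.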

3. Connect clients to the Room using the provided access token, which Twilio verifies to ensure access rights to that Room.

connect(token, { name: 'DailyStandup' }).then(room => {
  console.log(`Connected to Room "${room.name}"`);

  room.on('participantConnected', participant => {
    console.log(`Participant "${participant.identity}" connected`);
  });
}, error => {
  console.error(`Unable to connect to Room: ${error.message}`);
});

4. Allow the clients (now Participants) to publish/subscribe to media tracks and listen to events, assuming they have valid tokens.

// Attach the Participant's Media to a <div> element.
room.on('participantConnected', participant => {
  console.log(`Participant "${participant.identity}" connected`);

  participant.tracks.forEach(publication => {
    if (publication.isSubscribed) {
      const track = publication.track;
      document.getElementById('remote-media-div').appendChild(track.attach());
    }
  });

  participant.on('trackSubscribed', track => {
    document.getElementById('remote-media-div').appendChild(track.attach());
  });
});

That’s the gist of it, although there are additional details to consider if you want to handle special situations, such as reconnections, muting/unmuting, enabling/disabling the camera, screen sharing (in Node.js, iOS, or Android), and changing the view based on the dominant speaker. Handling these situations manually would be complicated, but the Twilio API addresses them with ease.
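
As a rough illustration of a few of these situations, the sketch below wires up reconnection and dominant-speaker events and toggles the local microphone. The helper names `watchRoom` and `setMicrophoneEnabled` are hypothetical; the event names and the track `enable()`/`disable()` calls follow the twilio-video JavaScript SDK, but should be checked against the SDK version you use.

```javascript
// Attach handlers for common Room events. `room` is expected to be a
// twilio-video Room; any EventEmitter-like object works for this sketch.
function watchRoom(room, log = console.log) {
  room.on('reconnecting', error => log(`Reconnecting: ${error.message}`));
  room.on('reconnected', () => log('Reconnected'));
  room.on('disconnected', () => log('Disconnected'));
  // Fires only when connect() was called with { dominantSpeaker: true }.
  room.on('dominantSpeakerChanged', participant =>
    log(`Dominant speaker: ${participant ? participant.identity : 'none'}`));
}

// Mute or unmute the local microphone by toggling each published audio track.
function setMicrophoneEnabled(room, enabled) {
  room.localParticipant.audioTracks.forEach(publication => {
    if (enabled) publication.track.enable();
    else publication.track.disable();
  });
}

// Possible usage once connect() resolves:
//   watchRoom(room);
//   setMicrophoneEnabled(room, false); // mute
```

The same pattern applies to the camera by iterating over `room.localParticipant.videoTracks` instead of the audio tracks.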

Group chats

For the most part, the design considerations for group chats are similar to the 1:1 video chats: users connect to the Room, then participants generate or act upon events.

When hosting your solution, group video chats come with several challenges (including but not limited to):

  • Recording the video
  • Maintaining the quality of multiple videos at once
  • Dealing with the dynamic quality of the video
  • Creating the appropriate type of group Room

Luckily, Twilio handles these challenges for you via the group chat APIs. 

Simply create the group Room via a server-side REST API call. This is very similar to the previous 1:1 code but with a few extra parameters:

client.video.rooms
           .create({
              recordParticipantsOnConnect: true,
              statusCallback: 'http://example.org',
              type: 'group',
              uniqueName: 'DailyStandup'
            })
           .then(room => console.log(room.sid));

Note that you can also create ad-hoc group Rooms from your Twilio Console.

Group Rooms allow you to allocate bandwidth depending on the purpose of your video room. Twilio supports 3 modes—grid mode, collaboration mode, and presentation mode—which are part of our Network Bandwidth Profile API. Group Rooms on Twilio also have additional APIs for network quality checks, recordings, and integration with voice.
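
On the client, a bandwidth mode is chosen through the options passed to connect(). The sketch below shows one way to assemble those options; the `buildGroupConnectOptions` helper is hypothetical, and the option names (`bandwidthProfile`, `networkQuality`, `dominantSpeaker`) follow the twilio-video JavaScript SDK and should be verified against the SDK version you use.

```javascript
// The three modes named in the text above.
const BANDWIDTH_MODES = ['grid', 'collaboration', 'presentation'];

// Hypothetical helper: builds connect() options for a group Room.
function buildGroupConnectOptions(roomName, mode) {
  if (!BANDWIDTH_MODES.includes(mode)) {
    throw new RangeError(`Unknown bandwidth mode: ${mode}`);
  }
  return {
    name: roomName,
    // Enables dominant speaker detection for this connection.
    dominantSpeaker: true,
    // Request network quality reports (verbosity level 1 for local and remote).
    networkQuality: { local: 1, remote: 1 },
    // Tell Twilio how to divide downlink bandwidth among video tracks.
    bandwidthProfile: { video: { mode } },
  };
}

const presentationOptions = buildGroupConnectOptions('DailyStandup', 'presentation');

// Possible usage in the browser:
//   connect(token, presentationOptions).then(room => { /* ... */ });
```

Validating the mode up front keeps a typo from silently falling back to a default profile.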

Integrate video into your WebRTC project with Twilio

Integrating video into your application can be challenging. For example, building a high-quality video application requires determining the signaling layer, the audio/video layer, multiple events, STUN/TURN, and more—and all of this across multiple connections. But understanding the differences and challenges of either a custom implementation or a cloud-based API, such as Twilio, is a solid first step to a successful project.

Ready to move forward with your WebRTC project? Download our Implementing WebRTC—Build Your Own vs. Twilio guide for an even more in-depth look at whether building or buying a WebRTC solution is the better choice for your business.

 

Sarah is a Developer Educator with Twilio, focusing on Twilio Video. Prior to her career in software, she studied psychology and French. She was a full-stack engineer and a site reliability engineer before joining the Developer Education team at Twilio. Outside of work, Sarah spends a lot of time knitting, sewing, and learning about other crafts.