
Set Phasers to STUN/TURN: Getting Started with WebRTC using Node.js, Socket.io and Twilio’s NAT Traversal Service


It’s been an exciting few weeks of launches for Twilio. My favourite was the launch of our Network Traversal Service. Whilst that may sound a bit dry it’s an important service for WebRTC applications as it removes the overhead of deploying your own network of STUN and TURN servers. I’ve been dying to find an excuse to get playing with WebRTC and this was a great reason to do so.

It would, of course, be remiss of me to keep the code and the process of putting together a WebRTC application to myself. Throughout this post I will share how I got started with it building out a video chat application with WebRTC. Then you can spend fewer late nights wondering which callback you missed or which message you haven’t implemented yet and more time waving at your friends and thinking of cool applications for this technology.

Let’s make some WebRTC happen!

What is WebRTC?

Let’s start with a few definitions just to make sure we all know what we’re talking about.

WebRTC is a set of JavaScript APIs that enable plugin-free, real time, peer to peer video, audio and data communication between two browsers. Simple, right? We’ll see the JavaScript APIs in the code later.

What isn’t WebRTC?

It is also important to talk about what WebRTC doesn’t do for us, since that is the part of the application we actually need to build. Whilst a WebRTC connection between two browsers is peer to peer, we still require servers to do some work for us. The three parts of the application that are required are as follows:

Network configuration

This refers to information about the public IP address and port number on which a browser can be reached. This is where the Twilio Network Traversal Service comes in. As explained in the announcement when firewalls and NAT get involved it is not trivial to discover how to access an endpoint publicly. STUN and TURN servers can be used to discover this information. The browser does a lot of the work here but we’ll see how to set it up with access to Twilio’s service later.

Presence

Browsers usually live a solitary life, blissfully unaware of other browsers that may want to contact them. In order to connect one browser to another, we are going to have to discover the presence of another browser somehow. It is up to us to build a way for the browsers to discover other browsers that are ready to take a video call.

Signalling

Finally, once a browser decides to contact another peer it needs to send and receive both the network information received from the STUN/TURN servers, as well as information about its own media capabilities. This is known as signalling and is the majority of the work that we need to do in this application.
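In practice, the signalling layer we'll build is nothing more exotic than relaying a few JSON-serialisable messages (an offer, an answer and some candidates) between the browsers. As a rough sketch, with illustrative payload shapes only (the real offer, answer and candidate objects are produced by the browser's WebRTC APIs later in this post):

```javascript
// A signalling message is just a type plus a JSON-serialisable payload.
// The payload below is made up for illustration; the real objects come
// from the browser's WebRTC APIs.
function serialize(type, payload) {
  return JSON.stringify({ type: type, payload: payload });
}

function deserialize(message) {
  return JSON.parse(message);
}

var wire = serialize('candidate', { sdpMLineIndex: 0 });
var received = deserialize(wire);
console.log(received.type); // 'candidate'
```

This round trip through JSON is exactly what happens every time one browser sends signalling data to the other via the server.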

For a much more in depth view on WebRTC and the surrounding technologies I highly recommend the HTML5 Rocks introduction to WebRTC and their more detailed article on STUN, TURN and signalling.

Tools

To build out our WebRTC “Hello World!” (which, excitingly enough, is a video chat application) we need a few tools. Since we are speaking JavaScript on the front end I decided to use JavaScript for the back end too, so we will be using Node.js. We need something to serve our application too and for this project I picked Hapi.js (though it doesn’t really matter, you could easily use Express or even node-static). For the presence and signalling any two way communication channel can be used. I picked WebSockets using Socket.io for the simplicity of the API.

All we need to get started is a Twilio account, a computer with a webcam and Node.js installed. Oh, and a browser that supports WebRTC; right now that means Firefox, Chrome or Opera. Got that? Good, let’s write some code.

Getting started

On the command line, prepare your app:

$ mkdir video-chat
$ cd video-chat
$ npm init

Enter the information that npm init asks for (you can mostly press enter here). Now, install your dependencies:

$ npm install hapi@8.0.0 socket.io twilio --save

(Edit: Hapi keeps updating after this post was published. This command pins Hapi to version 8.0.0, which works with this code.)

Create the files and directories you’re going to need too.

$ mkdir public
$ touch index.js public/index.html public/adapter.js public/app.js

For adapter.js we’re going to use Google’s adapter.js library, which they maintain to normalise the different implementations and vendor prefixed versions of the JavaScript APIs across browsers. Copy the file into your public/adapter.js file. If you have curl installed, you could do so with the following command:

$ curl https://webrtc.googlecode.com/svn-history/r4259/trunk/samples/js/base/adapter.js > public/adapter.js

Then open up public/index.html and enter the following bare bones HTML page:

<!doctype html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Video Chat</title>
</head>
<body>
  <h1>Video Chat</h1>
  <video id="local-video" height="150" autoplay></video>
  <video id="remote-video" height="150" autoplay></video>

  <div>
    <button id="get-video">Get Video</button>
    <button id="call" disabled="disabled">Call</button>
  </div>

  <script src="/socket.io/socket.io.js"></script>
  <script src="/adapter.js"></script>
  <script src="/app.js"></script>
</body>
</html>

As you can see, this includes two empty <video> elements, some buttons that we will be using to control our calls, and the JavaScript files we defined earlier alongside the Socket.io client library.

Finally, we’ll set up our server. Open up index.js and enter the following:

// index.js
var Hapi = require('hapi');
var server = new Hapi.Server();
server.connection({
  'host': 'localhost',
  'port': 3000
});
var socketio = require("socket.io");
var io = socketio(server.listener);
var twilio = require('twilio')(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

// Serve static assets
server.route({
  method: 'GET',
  path: '/{path*}',
  handler: {
    directory: { path: './public', listing: false, index: true }
  }
});

// Start the server
server.start(function () {
  console.log('Server running at:', server.info.uri);
});

This is a basic setup for Hapi; we’re not really doing anything special here except attaching the Socket.io process to the Hapi server object.

We’ve loaded the Twilio node library here too and you can see that I’m including the API credentials from the environment. Before we run the server, we should make sure we have those credentials in the environment.

$ export TWILIO_ACCOUNT_SID=ACXXXXXXXXXX
$ export TWILIO_AUTH_TOKEN=YYYYYYYYY

Now run the server and make sure everything is looking ok.

$ node index.js

Open up http://localhost:3000 and check to see that you have a title, some empty video elements and two buttons. Is that all there? Let’s continue.

Video and Audio Streams

We’re all set up, so the first thing we need to do to start the video calling process is get hold of the user’s video and audio streams. For this we will use the navigator.getUserMedia API. It’s vendor prefixed in Chrome, Opera and Firefox, so this is where adapter.js helps us out for the first time.

We’re going to listen for a click on the first <button> element we added to the page and request the streams from the user’s webcam and microphone. Open up public/app.js and enter the following:

// app.js
var VideoChat = {
  requestMediaStream: function(event){
    getUserMedia(
      {video: true, audio: true},
      VideoChat.onMediaStream,
      VideoChat.noMediaStream
    );
  },

  onMediaStream: function(stream){
    VideoChat.localVideo = document.getElementById('local-video');
    VideoChat.localVideo.volume = 0;
    VideoChat.localStream = stream;
    VideoChat.videoButton.setAttribute('disabled', 'disabled');
    var streamUrl = window.URL.createObjectURL(stream);
    VideoChat.localVideo.src = streamUrl;
  },

  noMediaStream: function(){
    console.log("No media stream for us.");
    // Sad trombone.
  }
};

VideoChat.videoButton = document.getElementById('get-video');

VideoChat.videoButton.addEventListener(
  'click',
  VideoChat.requestMediaStream,
  false
);

The code above does a few things, so let’s talk through it. I first set up a VideoChat object; this is to store a few objects and functions that we will be defining throughout the process. The first object we grab hold of is the video button, to which we attach a click event listener (no jQuery here I’m afraid, this is all vanilla DOM APIs). When the button is clicked we make the request to access the video and audio streams through the getUserMedia function. Usually this would be called on the navigator object, but adapter.js makes it available globally as getUserMedia.

The call to getUserMedia causes the browser to prompt the user to accept or deny the page’s request to use their media. In Firefox this looks like this:

getUserMedia permissions in Firefox

And in Chrome it looks like this:

getUserMedia permissions in Chrome

If you accept, the first callback to the getUserMedia method is called with the stream as an argument. If you deny the permissions, the second callback gets called. When we receive the stream we save it to our VideoChat object and add it to the video element so you can see yourself (we also turn the volume down to 0 to avoid echoes). To do so we need to turn the stream into a URL, which we do with the window.URL.createObjectURL function. We also disable the “Get Video” button as we don’t need it anymore.

Save that, reload the page and click “Get Video”. You should see the permissions popup; accept it and you should see yourself!

Me waving at the camera!

User presence

Next we need to build a way of knowing we have another user on the other end ready to make a call. By the end of this section we will have enabled the “Call” button when we know there is someone on the other end.

In order to start passing messages between browsers as part of our signalling we need to start using our WebSockets. Open up index.js again and copy and paste the following code before the server.start function.

// index.js
io.on('connection', function(socket){
  socket.on('join', function(room){
    var clients = io.sockets.adapter.rooms[room];
    var numClients = (typeof clients !== 'undefined') ? Object.keys(clients).length : 0;
    if(numClients == 0){
      socket.join(room);
    }else if(numClients == 1){
      socket.join(room);
      socket.emit('ready', room);
      socket.broadcast.emit('ready', room);
    }else{
      socket.emit('full', room);
    }
  });
});

This is a very basic idea of a room and presence. Only two users can join the room at any one time. When a client tries to join a room we count how many clients are in the room right now. If it is zero they can join, if it is one they join and the socket emits to both clients that they are ready. If there are already 2 clients in the room then it is full and no further clients can join for now.
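The decision in that handler boils down to a tiny pure function, which can be handy to see on its own, separated from Socket.io (the function name here is mine, not part of the app):

```javascript
// Mirrors the join handler above: with 0 occupants a client joins,
// with 1 occupant it joins and both sides are signalled ready,
// with 2 occupants the room is full.
function joinDecision(numClients) {
  if (numClients === 0) {
    return 'join';
  } else if (numClients === 1) {
    return 'join-and-ready';
  } else {
    return 'full';
  }
}

console.log(joinDecision(1)); // 'join-and-ready'
```

Keeping the room logic this simple is fine for a demo; a real application would want rooms with proper identifiers and better presence handling.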

Now we need to join the room from the client. We need to start a connection to the socket server, which we can do by simply calling io(). Assign that to our VideoChat object so we can use it later. Then at the end of the onMediaStream function add two more lines: one to join the room and one to listen for the ready event. We also need a callback function for when we hear that the room is ready; in that callback we will enable the “Call” button.

// app.js
var VideoChat = {
  socket: io(),
  //...
  onMediaStream: function(stream){
    VideoChat.localVideo = document.getElementById('local-video');
    VideoChat.localStream = stream;
    VideoChat.videoButton.setAttribute('disabled', 'disabled');
    VideoChat.localVideo.src = window.URL.createObjectURL(stream);
    VideoChat.socket.emit('join', 'test');
    VideoChat.socket.on('ready', VideoChat.readyToCall);
  },

  readyToCall: function(event){
    VideoChat.callButton.removeAttribute('disabled');
  },
  //...
};

We better get hold of that “Call” button too. At the bottom of the file where we grabbed the “Get Video” button, we’ll do the same for the “Call” button.

// app.js
VideoChat.callButton = document.getElementById('call');

VideoChat.callButton.addEventListener(
  'click',
  VideoChat.startCall,
  false
);

Let’s create a dummy startCall method in the VideoChat object to make sure things are going as planned.

// app.js
var VideoChat = {
  //...
  startCall: function(event){
    console.log("Things are going as planned!");
  }
};

Now, restart the node server (Ctrl + C to stop the process and $ node index.js to start again), open two browser windows to http://localhost:3000 and click “Get Video” in both. Once both videos are playing the “Call” buttons in each window should be live. And clicking on the “Call” button should log a nice message to your browser’s console.

Start the signalling

Our “Call” button is very important as this is going to kick off the rest of the WebRTC process. It’s the last bit of interaction the user needs to do to get the call started.

The “Call” button is going to set up a number of processes. It is going to create the RTCPeerConnection object that will manage creating the connection between the two browsers. This consists of producing information on the media capabilities of the browser and the network configuration. It is our job to send those to the other browser.

Signalling the network configuration

To set up the RTCPeerConnection object we need to give it details of the STUN and TURN servers that it will use to discover the network configuration. For this we will use the new Twilio STUN/TURN servers. The simplest method is to just use the STUN servers; they are free and don’t require any authorisation. iceServers (and the iceCandidates that you will see later) refer to the overall Interactive Connectivity Establishment protocol, which makes use of STUN and TURN servers.

// app.js
var VideoChat = {
  //...
  startCall: function(event){
    VideoChat.peerConnection = new RTCPeerConnection({
      iceServers: [{url: "stun:global.stun.twilio.com:3478?transport=udp" }]
    });
  }
};

To give ourselves the best possible chance of a connection we will want to use the TURN servers as well. To do this, we need to request an ephemeral token from Twilio using the new Tokens endpoint, which will give us access to the TURN servers from our front end JavaScript. We’ll have to request this token from our server and deliver the results back to the browser. Since we already have a WebSocket connection set up, we’ll use that. Here’s the flow we’ll be using in this next section:

The browser requests the token from the server over WebSockets, the server requests it from Twilio and when it gets it sends it back to the browser over the WebSocket.

Return to index.js and within the callback to the socket’s connection event place the following code:

// index.js
io.on('connection', function(socket){
  //...
  socket.on('token', function(){
    twilio.tokens.create(function(err, response){
      if(err){
        console.log(err);
      }else{
        socket.emit('token', response);
      }
    });
  });
});

Here, when the socket receives a token message it makes a request to the Twilio REST API. When it receives the token back in the callback to the request it emits the token back to the front end. Let’s build the front end part of that now.

Our startCall function now needs to use the socket to get a token, so we simply set up to listen for a token message from the server and emit one ourselves.

// app.js
var VideoChat = {
  //...
  startCall: function(event){
    VideoChat.socket.on('token', VideoChat.onToken);
    VideoChat.socket.emit('token');
  },
  //...
};

And now we need to define the onToken method to initialise our RTCPeerConnection with the iceServers returned from the API. This kicks off the process to get the network configuration so we need to add a callback function to the peerConnection to deal with the results of that. This is the onicecandidate callback and it is called every time the peerConnection generates a potential way of connecting to it from the outside world. As the developer, it is our job to share that candidate with the other browser, so right now we’ll send it down the WebSocket connection.

The callback receives a candidate and the caller shares the candidate with the other browser over the WebSocket.

// app.js
var VideoChat = {
  //...
  onToken: function(token){
    VideoChat.peerConnection = new RTCPeerConnection({
      iceServers: token.iceServers
    });

    VideoChat.peerConnection.onicecandidate = VideoChat.onIceCandidate;
  },

  onIceCandidate: function(event){
    if(event.candidate){
      console.log('Generated candidate!');
      VideoChat.socket.emit('candidate', JSON.stringify(event.candidate));
    }
  }
};

On the server, we need to send that candidate straight on to the other browser:

// index.js
io.on('connection', function(socket){
  //...
  socket.on('candidate', function(candidate){
    socket.broadcast.emit('candidate', candidate);
  });  
});

Then, we need to be able to receive those messages in the front end, this time on behalf of the other browser. We set up the listener for the socket within the onToken function, since that is when we create the peerConnection and are ready to deal with candidates.

// app.js
var VideoChat = {
  //...
  onToken: function(token){
    VideoChat.peerConnection = new RTCPeerConnection({
      iceServers: token.iceServers
    });
    VideoChat.peerConnection.onicecandidate = VideoChat.onIceCandidate;
    VideoChat.socket.on('candidate', VideoChat.onCandidate);
  },
  //...
  onCandidate: function(candidate){
    var rtcCandidate = new RTCIceCandidate(JSON.parse(candidate));
    VideoChat.peerConnection.addIceCandidate(rtcCandidate);
  }
};

The onCandidate method receives the stringify’d candidate over the socket, turns it into an RTCIceCandidate and adds it to the browser’s peerConnection. You may be wondering where the second browser got a peerConnection object from since we only created that object when the user clicked the “Call” button in the first browser. You’re right to wonder but don’t worry, that is coming up very soon.

We can’t test this just yet, as the peerConnection object doesn’t start generating candidates until the next part is complete as well. We’re doing well, but there’s more information we need to share between browsers.

Sharing media configuration

In the last section we set up how the call initiator starts sharing their network config. Now we need to sort out sharing media information. The peerConnection objects in each browser will need to generate descriptions of their media capabilities. The caller will create an offer detailing those capabilities and send it over the WebSocket connection. The other browser takes that offer and creates an answer containing its own capabilities and sends it back to the caller. We will implement this below, but here’s a diagram to show what should happen.

The caller creates an offer and sends over the WebSocket, the receiver creates an answer and sends it back.

Making the offer

We start with the offer. Once we have created the peerConnection object, we add our localStream to it and then call createOffer on the peerConnection. This generates the media configuration and calls back to the function passed in. In that callback, we call setLocalDescription with the offer on the peerConnection and send the offer over the socket to the other browser. We also need a callback for errors in case createOffer isn’t successful.

// app.js
var VideoChat = {
  //...
  onToken: function(token){
    VideoChat.peerConnection = new RTCPeerConnection({
      iceServers: token.iceServers
    });
    VideoChat.peerConnection.onicecandidate = VideoChat.onIceCandidate;
    VideoChat.socket.on('candidate', VideoChat.onCandidate);
    VideoChat.peerConnection.addStream(VideoChat.localStream);
    VideoChat.peerConnection.createOffer(
      function(offer){
        VideoChat.peerConnection.setLocalDescription(offer);
        VideoChat.socket.emit('offer', JSON.stringify(offer));
      },
      function(err){
        console.log(err);
      }
    );
  },
  //...
};

On the server, we need to pass this message along again.

// index.js
io.on('connection', function(socket){
  //...
  socket.on('offer', function(offer){
    socket.broadcast.emit('offer', offer);
  });
});

Receiving the offer

Then in the front end we need to receive the offer. We’re setting up the listener in the onMediaStream function this time, as it will trigger the creation of the peerConnection in the second browser.

// app.js
var VideoChat = {
  //...
  onMediaStream: function(stream){
    VideoChat.localVideo = document.getElementById('local-video');
    VideoChat.localStream = stream;
    VideoChat.videoButton.setAttribute('disabled', 'disabled');
    VideoChat.localVideo.src = window.URL.createObjectURL(stream);
    VideoChat.socket.emit('join', 'test');
    VideoChat.socket.on('ready', VideoChat.readyToCall);
    VideoChat.socket.on('offer', VideoChat.onOffer);
  },

  onOffer: function(offer){
    console.log('Got an offer');
    console.log(offer);
  },    
  //...
};

Let’s run this to make sure we’re on track so far. Restart the server and go back to your open browser windows. Refresh both, click “Get Video” in both and accept the permissions request. Open a developer console in one window and click “Call” in the other browser. You should see “Got an offer” printed to the console followed by a JSON string of the offer that was sent. One side of our signalling is working!

There’s a lot of information in the offer but thankfully we don’t need to look deeply into that right now. It just needs to be passed between the peerConnection objects in each browser. Let’s carry on building.

At this point, we could make a call to VideoChat.startCall but that’s eventually going to create an offer and send it over the socket to the first browser which will then go through that process again in a loop. What we actually want to do here is create an answer and return it to the first browser. I think we need a refactor at this point.

Refactoring

What we need is a way to create a peerConnection object for ourselves and set up the listeners but decide whether we create an offer or an answer to send to the other browser.

To do this, I’m going to update the onToken function to take a callback function that will allow us to describe what happens once the peerConnection is set up. Since onToken is also used as a callback the function definition will now return a function that will become the callback:

// app.js
var VideoChat = {
  //...
  onToken: function(callback){
    return function(token){
      VideoChat.peerConnection = new RTCPeerConnection({
        iceServers: token.iceServers
      });
      VideoChat.peerConnection.addStream(VideoChat.localStream);
      VideoChat.peerConnection.onicecandidate = VideoChat.onIceCandidate;
      VideoChat.socket.on('candidate', VideoChat.onCandidate);
      callback();
    }
  },
  //...
};

So the callback function replaces our original method of creating the offer, which we will need a new function for:

// app.js
var VideoChat = {
  //...
  createOffer: function(){
    VideoChat.peerConnection.createOffer(
      function(offer){
        VideoChat.peerConnection.setLocalDescription(offer);
        VideoChat.socket.emit('offer', JSON.stringify(offer));
      },
      function(err){
        console.log(err);
      }
    );
  },
  //...
};

Then we change startCall to set up the callbacks like this:

// app.js
var VideoChat = {
  //...
  startCall: function(event){
    VideoChat.socket.on('token', VideoChat.onToken(VideoChat.createOffer));
    VideoChat.socket.emit('token');
  },
  //...
};

Now we can start defining the functions for creating an answer.

// app.js
var VideoChat = {
  //...
  createAnswer: function(offer){
    return function(){
      var rtcOffer = new RTCSessionDescription(JSON.parse(offer));
      VideoChat.peerConnection.setRemoteDescription(rtcOffer);
      VideoChat.peerConnection.createAnswer(
        function(answer){
          VideoChat.peerConnection.setLocalDescription(answer);
          VideoChat.socket.emit('answer', JSON.stringify(answer));
        },
        function(err){
          console.log(err);
        }
      );
    }
  },
  //...
};

In this case we want to use createAnswer as the callback to the creation of the peerConnection but we also need to use the offer to set the remote description on the peerConnection. This time, we create a closure by calling the function with the offer and return a function to use as the callback. Now when the peerConnection is created we return to the inner function and turn the offer we received over the socket into a RTCSessionDescription object and set it as the remote description. We then create the answer on the peerConnection object, very much the same as we created the offer in the first place, and send it back over the socket.
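If that closure trick looks unfamiliar, the pattern is simply “call a function with the data you have now and get back a function to run later”. Here’s a stripped-down illustration, unrelated to WebRTC itself:

```javascript
// makeGreeter captures `name` when it is called and returns a new
// function that uses it later — the same shape as createAnswer(offer)
// returning the callback that eventually uses the offer.
function makeGreeter(name) {
  return function () {
    return 'Hello, ' + name + '!';
  };
}

var greet = makeGreeter('Alice'); // nothing greeted yet
console.log(greet()); // 'Hello, Alice!'
```

In createAnswer the captured value is the offer, and the returned function only runs once the peerConnection exists and it is time to use that offer.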

This is how we set up our onOffer function now:

// app.js
var VideoChat = {
  //...
  onOffer: function(offer){
    VideoChat.socket.on('token', VideoChat.onToken(VideoChat.createAnswer(offer)));
    VideoChat.socket.emit('token');
  },
  //...    
};

Making the final connection

Now that we are sending an answer back over the socket, all we need to do is pass that on to the original caller and then wait for the browser to do its magic.

Back in index.js let’s set up the relay for the answer.

// index.js
io.on('connection', function(socket){
  //...
  socket.on('answer', function(answer){
    socket.broadcast.emit('answer', answer);
  });
});

Then, we need to set up receiving the answer in the browser. We’ll add one more listener to the socket when the peerConnection is created and build the callback function to save the answer as the remote description of the peerConnection.

// app.js
var VideoChat = {
  //...
  onToken: function(callback){
    return function(token){
      VideoChat.peerConnection = new RTCPeerConnection({
        iceServers: token.iceServers
      });
      VideoChat.peerConnection.addStream(VideoChat.localStream);
      VideoChat.peerConnection.onicecandidate = VideoChat.onIceCandidate;
      VideoChat.peerConnection.onaddstream = VideoChat.onAddStream;
      VideoChat.socket.on('candidate', VideoChat.onCandidate);
      VideoChat.socket.on('answer', VideoChat.onAnswer);
      callback();
    }
  },

  onAnswer: function(answer){
    var rtcAnswer = new RTCSessionDescription(JSON.parse(answer));
    VideoChat.peerConnection.setRemoteDescription(rtcAnswer);
  },
  //...
};

The browsers are now passing media capabilities and connection information between them leaving one more thing to do. When there is a successful connection the peerConnection will receive an onaddstream event with the stream of the peer’s media. We just need to connect that to our other <video> element and video chat will be on. We’ll add the onaddstream callback in where we create the peerConnection.

// app.js
var VideoChat = {
  //...
  onToken: function(callback){
    return function(token){
      VideoChat.peerConnection = new RTCPeerConnection({
        iceServers: token.iceServers
      });
      VideoChat.peerConnection.addStream(VideoChat.localStream);
      VideoChat.peerConnection.onicecandidate = VideoChat.onIceCandidate;
      VideoChat.peerConnection.onaddstream = VideoChat.onAddStream;
      VideoChat.socket.on('candidate', VideoChat.onCandidate);
      VideoChat.socket.on('answer', VideoChat.onAnswer);      
      callback();
    }
  },

  onAddStream: function(event){
    VideoChat.remoteVideo = document.getElementById('remote-video');
    VideoChat.remoteVideo.src = window.URL.createObjectURL(event.stream);
  },
  //...
};

And that should be it! Load up a couple of browsers next to each other, open up your development URL, get the video stream in both browsers and then click “Call” from one of them. You should find yourself looking at yourself. Four times!

Me waving at myself, twice!

This is just the beginning

This is just step one to building out all sorts of potential WebRTC applications. Once you get your head around the process required to set up the connection between two browsers, then what you do with that connection is up to you. In this instance, creating a way for users to hang up might be a start, or making a lobby area with much better presence controls.

Then there’s more fun stuff you could try out. You can alter the video streams by passing them to a canvas and playing about with them there, you could use the Web Audio API to change the sound, and with the data channel (which I haven’t covered in this post) you could pass any data you want between peers.

You can see all the code from this post, fully commented, on GitHub.

I’d love to hear about the sorts of things you want to do or are already doing with WebRTC. Give me a shout on Twitter or drop me an email at philnash@twilio.com.
