How We Built an Interactive Live Streaming App with Programmable Video + Twilio Live + Sync

February 18, 2022
Written by
Tim Rozum
Twilion
Reviewed by
Mia Adjei
Twilion

This article is for reference only. We're not onboarding new customers to Programmable Video. Existing customers can continue to use the product until December 5, 2024.


We recommend migrating your application to the API provided by our preferred video partner, Zoom. We've prepared this migration guide to assist you in minimizing any service disruption.

Traditionally, live streaming has been defined as a one-to-many broadcast, with audiences passively consuming content. However, audience expectations have shifted: instead of only watching, viewers want to participate. From listening to our customers, we learned not only that they wanted to connect with their audiences through these immersive experiences, but also that such experiences are really hard to build themselves. That’s why we created Twilio Live. Twilio Live lets developers build immersive experiences that allow audience members to interact in near real time (2-3 seconds of latency). This low-latency environment is what makes interaction possible, from letting participants ask questions or comment in a chat to inviting them to the stage to become part of the show themselves.

So how can developers do this today? In addition to building Twilio Live, our product team created a reference app so you can get started in minutes and use its sample code to jump-start your own build. Today, we are going to show you how the reference app works, with a deep dive into how we used Programmable Video, Twilio Live, and Sync to build it. Soon you will be on your way to creating your own immersive experience like Twitch, Reddit Talk, or Clubhouse.

Screenshot of the completed reference application

This blog post will walk through the architecture and explain how we combined these products to go beyond simple broadcasting with live audience interaction, such as inviting audience members to join speakers on the stage. For implementation details, check out this repo that contains all of the source code for the reference backend, ReactJS app, and iOS app. The reference backend is a REST API deployed to Twilio Functions, a serverless environment. Since this project is open source, you have complete control to customize the experience and change how it works. Follow these steps to deploy the app yourself in minutes!

User experience

Let’s go over the user experience to understand how the app works.

The app has three user roles: host, speaker, and viewer. A host is really just a special speaker that has some extra permissions, such as muting or removing other speakers. The host and speakers can talk to each other on the stage, where many viewers can see them and hear them speak. This approach can accommodate thousands of users per stream. The matrix below describes what each role can do.

Capability                                                                  Host  Speaker  Viewer
Creates the stream                                                          Yes   No       No
Is seen and heard by others in the stream                                   Yes   Yes      No
Sees and hears speakers                                                     Yes   Yes      Yes
Can raise hand to request to become speaker                                 No    No       Yes
Send speaker invitation to viewer with raised hand to make them a speaker   Yes   No       No
Mute other speaker                                                          Yes   No       No
Force a speaker to move to viewers                                          Yes   No       No

In this reference app, each stream has exactly one host. When the host leaves the stream, the stream ends for all users.

The user roles are implemented in the reference app code so you have complete control to change the behavior.
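As a minimal sketch of how the role matrix could be expressed in code, the Python example below maps each capability to the roles that hold it. The action names and the `is_allowed` helper are hypothetical and chosen for illustration; they are not taken from the reference app.

```python
from enum import Enum


class Role(Enum):
    HOST = "host"
    SPEAKER = "speaker"
    VIEWER = "viewer"


# Each action from the role matrix maps to the set of roles allowed to perform it.
# The action names are hypothetical labels for the rows of the matrix.
PERMISSIONS = {
    "create_stream": {Role.HOST},
    "be_seen_and_heard": {Role.HOST, Role.SPEAKER},
    "see_and_hear_speakers": {Role.HOST, Role.SPEAKER, Role.VIEWER},
    "raise_hand": {Role.VIEWER},
    "send_speaker_invite": {Role.HOST},
    "mute_other_speaker": {Role.HOST},
    "move_speaker_to_viewers": {Role.HOST},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the role may perform the action; unknown actions are denied."""
    return role in PERMISSIONS.get(action, set())
```

Keeping the matrix in one table like this makes it straightforward to change a role's permissions without touching the code that checks them.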

Next we’ll look at some user flows.

Host creates video stream

The diagram below shows how a host creates a stream. After the stream is created, the host is the only speaker on stage until other users become speakers.

Diagram showing the Speaker Experience flow

Viewer joins video stream

After a host creates a stream, a user can join the stream as a viewer.

Diagram showing the Viewer Experience flow

For convenience, the reference app also allows users to immediately join the stream as a speaker. This may not make sense for a production app.

Viewer raises hand

If a viewer has something to say to the speakers, they can raise their hand to let the host know.

When a viewer taps the raise hand button, it changes color

Host invites viewer with raised hand to join speakers on stage

When a host sees that a viewer has raised their hand, they may send the viewer an invitation to join the speakers on stage.

Flow showing how a host can invite a viewer to speak on stage

Viewer accepts speaker invite and becomes speaker

After the host sends the speaker invite, the viewer is able to become a speaker.

Flow showing switch from viewer experience to speaker experience

Moderation features

The host can tap on other speakers to access a menu with moderation features. The host can mute other speakers. The host can also move a speaker back to viewers if they are misbehaving on stage.

Moving a speaker back to viewers list

Architecture overview

The app has a speaker experience for users that are on stage and a separate viewer experience for users in the audience. Below is a diagram that shows how the main pieces are connected.

Diagram showing how speakers and viewers are connected using Video, Sync, and Live

Twilio Programmable Video powers the video collaboration experience for speakers. Up to 50 speakers connect to a video room to communicate in real-time with each other.

Twilio Live uses video tracks and other speaker data from the video room to compose a single video stream for viewers. The stream has a latency of only 2 to 3 seconds, which is key to enabling interactive features. Millions of users can connect to a stream!

Twilio Sync is used to synchronize user state in real-time. For example, when a viewer raises their hand, Sync sends an update to all users. Speaker and viewer state is stored in Sync so that any user can always view the status of all other users in the stream.

We used Sync because it makes it quick and easy for you to deploy the app yourself. You can use a different state synchronization tool if you want to. Sync is a general purpose tool and is not specifically optimized for live streaming like Programmable Video and Twilio Live are.

The reference backend is deployed to Twilio Functions. Functions is a serverless environment that is convenient to deploy to.

Client SDKs for iOS, Android, and web are available for Programmable Video, Twilio Live, and Sync.

Sync

Let’s take a closer look at Sync. Sync is essential for the features below:

  • All users can see a list of viewers.
  • Viewers can raise their hand when they want to become a speaker.
  • The host can send a speaker invitation to a viewer.

The diagram below provides an overview of the Sync objects for this app and how they are modified and read. The backend is a gatekeeper for modifying Sync state. This is more secure than allowing clients to modify shared Sync state directly. With this strategy, the backend can detect misbehaving clients and react accordingly. It also reduces the opportunity for programming errors. For more on Sync security, see Securing your Sync App.

Diagram with overview of Sync objects
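The gatekeeper idea can be sketched as follows. This Python example uses an in-memory `StreamState` class as a hypothetical stand-in for one stream's Sync maps, and `handle_raise_hand` as a hypothetical backend endpoint; neither name comes from the reference backend. The point is only that the backend validates the caller before any shared state changes.

```python
class StreamState:
    """In-memory stand-in for the Sync maps of one stream (hypothetical)."""

    def __init__(self, host: str) -> None:
        self.speakers = {host: {"host": True}}
        self.viewers = {}
        self.raised_hands = {}


def handle_raise_hand(state: StreamState, identity: str, raised: bool) -> dict:
    """Backend endpoint logic: validate the caller, then write to shared state."""
    if identity not in state.viewers:
        # Speakers and unknown users cannot raise a hand, so the request is rejected
        # before any shared state is touched.
        return {"ok": False, "error": "only viewers can raise a hand"}
    if raised:
        state.raised_hands[identity] = {}
    else:
        state.raised_hands.pop(identity, None)
    return {"ok": True}
```

Because clients never write to the maps directly, a malformed or malicious request ends at the validation step instead of reaching every connected user.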

The backend uses a Sync webhook to listen for reachability events. This is very useful to reliably detect when a user leaves a stream. If a user closes a browser tab or force quits a mobile app, the backend receives an endpoint disconnected event and can update the Sync objects accordingly.

The backend creates a new Sync service for each live stream so that the Sync objects and reachability events are scoped to a single stream.

It is important to understand that Sync is not a database. Sync is a state synchronization tool. Only store short-term data in Sync. It will not scale well if used as a large, primary database. For more on Sync best practices, see Building Scalable Sync Apps: Best Practices & Use Cases.

Sync objects

Let's take a look at the structure of each Sync object.

Speakers map

The speakers map contains all speakers and indicates which speaker is the host.

For convenience, the example below uses JSON to represent the entire map and not just map item data.

{
  "Bob": {
    "host": true
  },
  "Alice": {
    "host": false
  }
}

Viewers map

The viewers map contains all users that are viewers. The map item data is always empty but could be used to store more information about a viewer.

For convenience, the example below uses JSON to represent the entire map and not just map item data.

{
  "Ron": {},
  "Jennifer": {},
  "Jason": {},
  "Sandy": {}
}

Raised hands map

The raised hands map contains all users that have their hand raised.

At first it may appear that the raised hand data could be stored in the viewers map, but this would not scale well. There could be many viewers, and it is not practical for clients to fetch all of them to determine the subset of raised hands. And so the raised hand data must be stored in a separate map.

For convenience, the example below uses JSON to represent the entire map and not just map item data.

{
  "Tim": {},
  "Alice": {}
}

User document

Each user has their own user document to receive private data. We use Sync permissions to make sure each user can only access their own document.

The user document is currently only used to inform a user that they have received an invitation from the host to join speakers.

{
  "speaker_invite": true
}

Use cases

Now we will walk through each use case to see exactly how each step is sequenced together.

Host creates video stream

When the host creates a stream, the client sends an HTTP request to the backend, which creates a video room, a live stream, and a new Sync service. The backend then returns a token with the host's identity and the grants that allow the host to connect to the video room and Sync service using the client SDKs for Programmable Video and Sync.
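The order of operations can be sketched in Python as below. `FakeTwilio` is a hypothetical stand-in that only records what would be created, and the dictionary returned by `create_stream` is an illustrative description of the token contents rather than a real access token. A real backend would call the Twilio REST API and sign the token.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTwilio:
    """Hypothetical stand-in for the Twilio REST calls the backend would make."""

    created: list = field(default_factory=list)

    def create(self, kind: str, name: str) -> str:
        # Return a made-up identifier; real resources would have Twilio SIDs.
        self.created.append((kind, name))
        return f"{kind}-{len(self.created)}"


def create_stream(twilio: FakeTwilio, host_identity: str, stream_name: str) -> dict:
    """Create the three resources for a stream and describe the host's token."""
    room_sid = twilio.create("video_room", stream_name)
    live_stream_sid = twilio.create("live_stream", stream_name)
    sync_service_sid = twilio.create("sync_service", stream_name)
    return {
        "identity": host_identity,
        # The grants scope the token to this stream's video room and Sync service.
        "grants": {
            "video": {"room": room_sid},
            "sync": {"service_sid": sync_service_sid},
        },
        "live_stream_sid": live_stream_sid,
    }
```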

Below is the sequence for a host creating a stream.

Sequence diagram for a host creating a stream

Viewer joins video stream

Below is the sequence for a user joining a stream as a viewer.

Sequence diagram for a viewer joining a stream

Notice that after the client connects to Twilio Live, the client makes a call to the backend to report that the viewer has connected to the stream. Then the backend adds the user to the viewers map. This extra call is needed because a viewer joining is the one membership change the backend cannot observe on its own. When a speaker connects to or disconnects from the video room, the backend can use a Programmable Video webhook to listen for changes and update the Sync objects. When a viewer leaves the stream, the backend can use a Sync webhook to detect the disconnect event and update Sync objects. Twilio Live does not have a webhook event for users connecting to the stream, so the client must explicitly report a successful connection to the backend.

View speakers and viewers

It is useful for a speaker or viewer to see information about other speakers and viewers in the stream. Let's take a look at how this information is made available to all users.

The set of speakers is stored in Sync for a few reasons. When there are a lot of speakers, only the most active speakers are visible in the video grid. Storing the speakers in Sync is the only way viewers that are not connected to the video room can access information for offscreen speakers. Also, the speakers map allows us to specify which speaker is the host and identify them in the UI.

The Rooms webhook is very useful for maintaining the speakers map. The diagram below shows how the webhook handles speaker connect and disconnect events.

Diagram showing the handling of connect and disconnect events
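A webhook handler for these events might look like the Python sketch below. It assumes the Programmable Video status callback delivers `StatusCallbackEvent` and `ParticipantIdentity` parameters with `participant-connected` and `participant-disconnected` event values; check the status callback documentation before relying on those names. The `speakers` dictionary stands in for the Sync speakers map, and `handle_room_status_callback` is a hypothetical function name.

```python
def handle_room_status_callback(speakers: dict, params: dict, host_identity: str) -> None:
    """Update the speakers map from a Programmable Video status callback."""
    event = params.get("StatusCallbackEvent")
    identity = params.get("ParticipantIdentity")
    if event == "participant-connected":
        # Mark the entry as host only when the identity matches the stream's host.
        speakers[identity] = {"host": identity == host_identity}
    elif event == "participant-disconnected":
        speakers.pop(identity, None)
    # Any other event is ignored because it does not change who is on stage.
```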

The set of viewers is stored in Sync so that all users can see who is viewing the stream. Twilio Live is responsible for making a video stream available to a very large number of viewers with little delay. Twilio Live does not provide detailed information about users that are viewing the stream, so Sync is used to keep track of viewer state. If we wanted to add more data for each viewer, such as profile image URL, the viewers map would enable this.

The Sync webhook is great for updating the viewers map when a viewer disconnects.

Diagram describing viewer disconnection event
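The disconnect handling can be sketched in Python as follows. It assumes the Sync webhook sends an `EventType` of `endpoint_disconnected` together with an `Identity` parameter; treat those names as assumptions to verify against the Sync webhooks documentation. The two dictionaries stand in for the viewers and raised hands maps, and clearing the raised hand on disconnect is a design assumption of this sketch.

```python
def handle_sync_webhook(viewers: dict, raised_hands: dict, params: dict) -> bool:
    """Remove a viewer's state when Sync reports that their endpoint disconnected.

    Returns True if the viewers map changed.
    """
    if params.get("EventType") != "endpoint_disconnected":
        return False
    identity = params.get("Identity")
    removed = viewers.pop(identity, None) is not None
    # A viewer who has left can no longer be invited, so their raised hand is cleared too.
    raised_hands.pop(identity, None)
    return removed
```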

Viewer moves to speakers

A viewer may raise their hand to inform the host that they have something to say and want to join the speakers on stage. The host can then send a speaker invitation to the viewer. If the viewer accepts the speaker invitation, they become a speaker.

Diagram showing how a viewer can become a speaker
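The invite and accept steps can be sketched in Python as below. The `state` dictionary is a hypothetical in-memory stand-in for the Sync maps and user documents, and both function names are illustrative. The sketch assumes the backend clears the viewer's entries when the invite is accepted; the speakers map is left alone because the Rooms webhook updates it once the user connects to the video room.

```python
def send_speaker_invite(state: dict, requester: str, viewer: str) -> bool:
    """Host invites a viewer; only allowed when the viewer has a raised hand."""
    is_host = state["speakers"].get(requester, {}).get("host", False)
    if not is_host or viewer not in state["raised_hands"]:
        return False
    # The invite is written to the viewer's private user document.
    state["user_documents"].setdefault(viewer, {})["speaker_invite"] = True
    return True


def accept_speaker_invite(state: dict, viewer: str) -> bool:
    """Viewer accepts; a real backend would then return a token for the video room."""
    if not state["user_documents"].get(viewer, {}).get("speaker_invite"):
        return False
    state["user_documents"][viewer]["speaker_invite"] = False
    state["raised_hands"].pop(viewer, None)
    state["viewers"].pop(viewer, None)
    return True
```

Checking the invite on the backend means a viewer cannot promote themselves by calling the accept endpoint without an invitation from the host.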

Host mutes speaker

A host can mute a speaker, which is useful when a speaker does not realize they are unmuted. A speaker always has the ability to unmute again if they want to speak. This feature is intended to provide convenience and not to prohibit a speaker from speaking.

This app uses a Programmable Video data track to send a mute message from the host to the speaker. Because all speakers are connected to a video room, it is convenient to send this state change with a data track instead of Sync.

Diagram showing how a host can mute a speaker

Below is an example mute message. We used JSON to format the mute message for easy encoding and decoding, but any data can be sent over a data track.

{
  "message_type": "mute",
  "to_participant_identity": "Bob"
}
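Encoding and handling this message could look like the Python sketch below. It assumes that every speaker subscribed to the host's data track receives the message, so each client checks whether the message is addressed to it. The function names are hypothetical, and the actual sending and receiving would go through the Programmable Video SDK.

```python
import json


def encode_mute_message(to_identity: str) -> str:
    """Build the JSON text the host would send over the data track."""
    return json.dumps({"message_type": "mute", "to_participant_identity": to_identity})


def should_mute(raw_message: str, local_identity: str) -> bool:
    """Return True if a received data track message asks the local user to mute."""
    try:
        message = json.loads(raw_message)
    except json.JSONDecodeError:
        # Ignore anything that is not valid JSON instead of crashing the client.
        return False
    if not isinstance(message, dict):
        return False
    return (
        message.get("message_type") == "mute"
        and message.get("to_participant_identity") == local_identity
    )
```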

Host moves speaker to viewers

If the host, Alice, decides that Bob, a speaker, is misbehaving, she can choose to remove him by sending an HTTP request to the reference backend. This will cause the backend to disconnect Bob from the video room using the Rooms API. Bob's app will recognize the disconnect event from the video SDK, at which point his app will automatically join the stream as a viewer after obtaining a viewer token from the reference backend.

Diagram showing how a host moves a speaker back to a viewer experience
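The backend side of this flow can be sketched in Python as below. `remove_speaker` is a hypothetical function, and `disconnect_participant` is an injected callable standing in for the Rooms API call so that the sketch runs without network access. The check that only the host can remove another speaker follows the role matrix.

```python
from typing import Callable


def remove_speaker(
    speakers: dict,
    requester: str,
    target: str,
    disconnect_participant: Callable[[str], None],
) -> bool:
    """Disconnect a speaker from the video room if the requester is the host."""
    if not speakers.get(requester, {}).get("host", False):
        return False
    if target == requester or target not in speakers:
        return False
    # In a real backend this callable would wrap the Rooms API request that
    # disconnects the participant from the video room.
    disconnect_participant(target)
    return True
```

The speakers map is not modified here because, in the flow described above, the disconnect produces a webhook event that updates it.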

Speaker chooses to move to viewers

A speaker may choose to move back to viewers after they are done speaking.

Diagram showing speaker moving themselves back to viewers

Going from reference app to production

In general, this reference app was architected to be secure and scale well. However, we compromised in a few areas in order to make it quick and easy for you to deploy. Here are a few things to consider when building a production live streaming app.

Security

Replace passcode authentication with a secure authentication system

The reference backend uses a simple passcode solution for authentication. The passcode solution is convenient for a reference app, but it should not be used in a production environment. You should use a secure authentication system in your production app.

Scalability

Below we have tried to outline the key scaling considerations with this reference app and how they could be addressed in a production app.

Programmable Video

Programmable Video should not present scaling issues for most live streaming apps. More on Programmable Video limits here.

Twilio Live

Twilio Live should not present scaling issues for most live streaming apps. More on Twilio Live limits here.

A very large number of concurrent authentication requests may result in request throttling. This guide describes how to generate playback grants at scale and avoid being throttled by the Twilio API.

Sync

Connection limit

This will probably be the biggest bottleneck for most live streaming apps. Make sure that Sync can provide enough connections for your app. If Sync cannot provide enough connections, you can swap Sync for another state synchronization tool that does.

You could also reduce concurrent Sync connections by only connecting viewers to Sync if they have their hand raised. A viewer must be connected to Sync after they raise their hand in order to receive a speaker invitation. Viewers that are not connected to Sync will not be able to see the status of other users, but that may be acceptable for some apps.

Write limit

Make sure that you won’t exceed Sync write limits. The main bottlenecks will probably be writing to the viewers map and the raised hands map. If a lot of viewers join or leave a single stream within a short period of time, it will cause a lot of writes to the viewers map. And if a lot of viewers in a single stream raise their hand at the same time, it will cause a lot of writes to the raised hands map. However, it should be straightforward to prevent issues by adding a little backend logic.

It is important to remember that Sync is a state synchronization tool and not a data store. You can decide what state is really useful to synchronize. Many popular live streaming apps only show a portion of the audience in the UI along with the total viewer count. For example, the backend could limit the viewer map to 100 viewers, and the backend could synchronize the total viewer count in another Sync document. Writes to the total viewer count would probably need to be throttled by the backend. The backend could also limit the raised hands map to 100 users and return an error when it is maxed out.
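A minimal Python sketch of the capping and throttling idea follows. `ViewerTracker` and its attributes are hypothetical; the dictionary and the published count stand in for the Sync viewers map and a Sync document. The five-second publish interval is an arbitrary assumption, and the sketch assumes each identity joins only once.

```python
import time

MAX_VISIBLE_VIEWERS = 100  # cap taken from the example in the text


class ViewerTracker:
    """Keeps a capped viewers map plus a throttled total count (hypothetical)."""

    def __init__(self, min_publish_interval: float = 5.0, clock=time.monotonic) -> None:
        self.visible_viewers = {}   # stand-in for the Sync viewers map
        self.total_viewers = 0      # exact count kept by the backend
        self.published_count = 0    # value last written to the Sync document
        self._min_interval = min_publish_interval
        self._clock = clock
        self._last_publish = None

    def add_viewer(self, identity: str) -> None:
        self.total_viewers += 1
        # Only the first MAX_VISIBLE_VIEWERS viewers are written to the shared map.
        if len(self.visible_viewers) < MAX_VISIBLE_VIEWERS:
            self.visible_viewers[identity] = {}
        self._maybe_publish()

    def _maybe_publish(self) -> None:
        # Write the count only if enough time has passed since the last write.
        now = self._clock()
        if self._last_publish is None or now - self._last_publish >= self._min_interval:
            self.published_count = self.total_viewers
            self._last_publish = now
```

With this approach, a burst of joins produces at most 100 map writes and one count write per interval. A production version would also need to flush the final count after a burst ends, which this sketch leaves out.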

Adding some logic to your backend to only add necessary state to Sync should prevent scaling issues without degrading the user experience. If for some reason Sync does not work well for your app, you can swap Sync for another state synchronization tool that does.

Twilio Functions

The reference backend is deployed to Twilio Functions. Functions is a serverless environment that is convenient to deploy to. Functions is limited to 30 concurrent invocations so you may run into issues if you have a lot of users joining or leaving streams within a short period of time. Use a backend environment that will meet your scaling requirements.

Wrapping up

At Twilio we strive to wear the customer’s shoes as a means to understand our customer challenges and guide us toward building a better platform. We believe this open source reference app will help accelerate developers as they build the next generation of interactive live streaming experiences. We look forward to collaborating, and we can’t wait to see what you build!