
Manually mixing Track Recording files


Warning

This documentation is for reference only. We are no longer onboarding new customers to Programmable Video. Existing customers can continue to use the product until December 5, 2026.

We recommend migrating your application to the API provided by our preferred video partner, Zoom. We've prepared a migration guide to assist you in minimizing any service disruption.

Warning

Twilio recommends Compositions to mix all track recordings from a single Room, as it is a refined product that handles the various edge cases of mixing recordings with different start times and offsets. Only use the approach described here if you have a business reason for not using Compositions. The transcoding tools described below are not Twilio products, and as such Twilio will not provide support for errors or problems that arise when composing Recordings with the following procedure.

When you record a Twilio Video Group Room, you will most likely end up with multiple individual track recordings once the Room is completed. By default, when you record a Group Room, Twilio records all participants' audio and video tracks and creates an individual recording for each track. This gives you the flexibility to compose the tracks into a single video output and control over which tracks appear in the final composition.

Info

If you don't want to automatically record the audio and video tracks of every participant in a Group Room, you can use Recording Rules to specify which tracks to record. You can also configure your Group Room default settings in the Twilio Console, including whether Rooms default to Group Rooms and whether recording is on or off. (You can't record peer-to-peer Rooms, because their media does not pass through Twilio's servers.)

Twilio offers Compositions, a service that creates playable files from Recordings and automatically takes into account the Recordings' timing variations. However, if your use case does not fit the Compositions product, you can manually mix recordings into a single playable file.

If you choose not to use Compositions, there are several factors to consider when manually mixing the recordings into a single output. In particular, you will need to take into account each recording's start_time and offset values, as these can differ for each participant and cause synchronization issues when mixed.

In this tutorial, you'll learn how to synchronize two participants' track recordings, each with different start_time and offset values. The output will be a single video in either webm or mp4 format, with the participants' videos side-by-side in a 2x1 grid.


Tutorial requirements

To follow this tutorial, you will need ffmpeg and ffprobe installed on your machine. If you want to produce an mp4 output file, you will also need an ffmpeg build that includes the libfdk_aac audio encoder, which usually means compiling ffmpeg yourself with --enable-libfdk-aac.

After you have recorded a Group Room, you might want to merge all of the recorded tracks into a single playable file so you can review the full contents of the Room. If you merge recorded tracks without considering their different start_time or offset values, the output will not be synchronized.

There are several reasons tracks from the same Room might have different start_time and offset values, such as:

  • Participants entering the Room at different points in time. You can see this when there are different start times in the recordings' metadata.
  • A Room crashing in the middle of a call. You can see this with the offset in the recordings' metadata. The offset is the time in milliseconds elapsed between an arbitrary point in time, common to all group rooms, and the moment when the source room of this track started.

The example for this tutorial will be a scenario in which you want to mix Recordings from a Group Room with two participants. In this scenario:

  • Alice and Bob were both participants in the same Group Room.
  • Alice joined the Group Room when it started, but Bob entered roughly 20 seconds after it started.
  • Both Alice and Bob have different offset values.
  • The video and audio tracks for both Bob and Alice were recorded, so there are four recordings for the Group Room once the Room has completed:
    • Alice's audio
    • Alice's video
    • Bob's audio
    • Bob's video

Mixing both Alice's and Bob's tracks together without taking into account the different start_time and offset values will result in a media file with synchronization issues, where Alice and Bob's tracks are not playing at the proper times.

The output file this tutorial produces will mix the two video and two audio tracks and ensure they are correctly synchronized. The video tracks will be placed side by side in a 2x1 grid layout, with a resolution of 1024x768.


1. Find the Recording SIDs for the Group Room


First, you will need to find the SID for each of the recordings you would like to mix. You can do this via the REST API; the call below retrieves the recording SIDs for a Room. (Note that you should pass the Room SID in the GroupingSid argument as an array with a single item.) You will need these recording SIDs in the next step.

Info

Click "Show Sample Response" in the bottom left corner of the code samples below to see the JSON response that would be returned from making the API calls. In this example, you should retrieve the sid for each of the recordings.

Retrieve a list of all Recordings for a Room

Node.js

// Download the helper library from https://www.twilio.com/docs/node/install
// Find your Account SID and Auth Token at twilio.com/console
// and set the environment variables. See http://twil.io/secure
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = require('twilio')(accountSid, authToken);

client.video.v1.recordings
  .list({
    groupingSid: ['RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'],
    limit: 20
  })
  .then(recordings => recordings.forEach(r => console.log(r.sid)));

Output

{
  "recordings": [
    {
      "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "status": "completed",
      "date_created": "2015-07-30T20:00:00Z",
      "date_updated": "2015-07-30T21:00:00Z",
      "date_deleted": "2015-07-30T22:00:00Z",
      "sid": "RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "source_sid": "MTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "size": 23,
      "type": "audio",
      "duration": 10,
      "container_format": "mka",
      "codec": "OPUS",
      "track_name": "A name",
      "offset": 10,
      "status_callback": "https://mycallbackurl.com",
      "status_callback_method": "POST",
      "grouping_sids": {
        "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
        "participant_sid": "PAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
      },
      "media_external_location": "https://my-super-duper-bucket.s3.amazonaws.com/my/path/",
      "encryption_key": "public_key",
      "url": "https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "links": {
        "media": "https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Media"
      }
    }
  ],
  "meta": {
    "page": 0,
    "page_size": 50,
    "first_page_url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&PageSize=50&Page=0",
    "previous_page_url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&PageSize=50&Page=0",
    "url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&PageSize=50&Page=0",
    "next_page_url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&PageSize=50&Page=1",
    "key": "recordings"
  }
}
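
Since all four SIDs come back in one list, it can help to label each recording by participant and track type as you collect them. Below is a minimal sketch, reusing the client from the sample above; it assumes the Node.js helper library exposes the response fields as camelCase properties that mirror the JSON shown here.

client.video.v1.recordings
  .list({ groupingSid: ['RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'] })
  .then(recordings => {
    for (const r of recordings) {
      // r.type is "audio" or "video"; grouping_sids.participant_sid
      // identifies the participant that published the track.
      console.log(`${r.groupingSids.participant_sid} ${r.type}: ${r.sid}`);
    }
  });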


2. Retrieve the offset for each recording


The next step is to extract the offset value for each of the four recordings via its metadata. You can do this using the REST API.

Keep track of these offsets, as you will need them in a later step.

Retrieve the offset for each track's Recording

Node.js

// Download the helper library from https://www.twilio.com/docs/node/install
// Find your Account SID and Auth Token at twilio.com/console
// and set the environment variables. See http://twil.io/secure
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = require('twilio')(accountSid, authToken);

client.video.v1.recordings('RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
  .fetch()
  .then(recording => console.log(recording.offset));

Output

{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "status": "processing",
  "date_created": "2015-07-30T20:00:00Z",
  "sid": "RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "source_sid": "MTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "size": 0,
  "url": "https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "type": "audio",
  "duration": 0,
  "container_format": "mka",
  "codec": "OPUS",
  "track_name": "A name",
  "offset": 10,
  "status_callback": "https://mycallbackurl.com",
  "status_callback_method": "POST",
  "grouping_sids": {
    "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  },
  "media_external_location": "https://my-super-duper-bucket.s3.amazonaws.com/my/path/",
  "links": {
    "media": "https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Media"
  }
}
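
If you are scripting this step, a small sketch like the following can fetch all four offsets in one pass. It reuses the client from the sample above; the SIDs are placeholders for the values you collected in step 1.

const sids = [
  'RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', // Alice's audio
  'RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', // Alice's video
  'RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', // Bob's audio
  'RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'  // Bob's video
];

// Fetch each recording resource and print its offset (in milliseconds).
Promise.all(sids.map(sid => client.video.v1.recordings(sid).fetch()))
  .then(recordings =>
    recordings.forEach(r => console.log(`${r.sid}: offset=${r.offset}`))
  );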


3. Download each recording


Next, you should download each recording. You can do this via the REST API. The following curl command retrieves the URL that you can use to download the media content of a Recording.


curl 'https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Media' \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN

You will get back a JSON response that contains a redirect_to URL, similar to the response below. Go to this URL to download the recording file.


{"redirect_to": "https://com-twilio-us1-video-recording..."}

The audio files you download will be in .mka format and the video files will be in .mkv format.
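
If you prefer to download the files programmatically, here is a minimal Node.js sketch (Node 18+, for the built-in fetch). It reads the redirect_to URL without following the redirect, so the Authorization header is not forwarded to the storage provider. The SID and file name are illustrative.

const fs = require('node:fs');

const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;

async function downloadRecording(recordingSid, fileName) {
  const auth = Buffer.from(`${accountSid}:${authToken}`).toString('base64');
  // Request the media location; redirect: 'manual' keeps the JSON body
  // containing redirect_to instead of following the redirect.
  const res = await fetch(
    `https://video.twilio.com/v1/Recordings/${recordingSid}/Media`,
    { headers: { Authorization: `Basic ${auth}` }, redirect: 'manual' }
  );
  const { redirect_to } = await res.json();
  // The redirect_to URL is pre-signed, so no credentials are needed here.
  const media = await fetch(redirect_to);
  fs.writeFileSync(fileName, Buffer.from(await media.arrayBuffer()));
}

downloadRecording('RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', 'alice.mka');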


4. Find the start time of the Recordings with ffprobe


At this point, you should have four recordings downloaded on your machine, as well as the offset values for each of these recordings.

This next step uses ffprobe to retrieve the start_time for each recording. You will need to perform this step on each recording.

For example, to get the start_time of Alice's audio track, run the following ffprobe command:


ffprobe -show_entries format=start_time alice.mka

The output will look similar to the output below, and it will include the start_time:


Input #0, matroska,webm, from 'alice.mka':
  Metadata:
    encoder         : GStreamer matroskamux version 1.8.1.1
    creation_time   : 2017-06-30T09:03:44.000000Z
  Duration: 00:13:09.36, start: 1.564000, bitrate: 48 kb/s
    Stream #0:0(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
    Metadata:
      title           : Audio
start_time=1.564000

After retrieving both the offset from the Recordings metadata and start_time from the ffprobe on the Recording media, you can create a table like the one below. (The creation_time will also be in the output of the ffprobe command above; we are referencing it below to demonstrate that it is not the correct value to use when mixing tracks. It is not needed in any of the following steps and will be removed from the table going forward.)

A table of each recording's offset and start time, along with its creation time:

Track Name    offset (in ms)    start_time (in ms)    creation_time
alice.mka     163481005731      1564                  2017-06-30T09:03:44.000000Z
alice.mkv     163481005731      1584                  2017-06-30T09:03:44.000000Z
bob.mka       163481005732      20789                 2017-06-30T09:04:03.000000Z
bob.mkv       163481005732      20814                 2017-06-30T09:04:03.000000Z

A participant's audio and video tracks are not guaranteed to share the same start_time and offset; they can differ, for example, after a Room recovery. You can also see the roughly 20 seconds that Alice was in the Room before Bob reflected in the start_time values of each participant's recordings.

It is important to use start_time as the reference, not creation_time. A recording's creation_time is the time the user joined the call, while start_time is the moment the first sample of data was received for the recording. Additionally, creation_time does not have millisecond precision, which could lead to synchronization issues.
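
If you are scripting this step, you can shell out to ffprobe and parse the value directly. Below is a minimal sketch; it assumes ffprobe is on your PATH and uses the -of default=noprint_wrappers=1:nokey=1 output format so that only the bare value is printed.

const { execSync } = require('node:child_process');

// Returns the start_time of a media file, converted to milliseconds.
function startTimeMs(file) {
  const seconds = execSync(
    `ffprobe -v error -show_entries format=start_time ` +
      `-of default=noprint_wrappers=1:nokey=1 "${file}"`
  )
    .toString()
    .trim();
  return Math.round(parseFloat(seconds) * 1000);
}

console.log(startTimeMs('alice.mka')); // e.g. 1564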


5. Set the reference and relative offsets


Next, you will need to calculate the relative offset of each track so that the tracks will be synchronized. To calculate the relative offset:

  1. Add the offset to the start_time. In the sample table below, this value is stored in the Addition column.
  2. Take the lowest Addition value of all tracks. In the sample, that's alice.mka, with an Addition value of 163481007295. This is the Reference Value, copied into every row of the table, as you will need it in the next step.
  3. Subtract the Reference Value from each recording's Addition value to get its relative_offset in milliseconds. You will need the relative_offset values when mixing the tracks together.

The following table shows the current values for our tracks, in which alice.mka is the reference value with 163481007295.

A table of each recording's offset, start time, and the calculated fields for determining the relative offset:

Track Name    offset (in ms)    start_time (in ms)    Addition        Reference Value    relative_offset (in ms)
alice.mka     163481005731      1564                  163481007295    163481007295       0
alice.mkv     163481005731      1584                  163481007315    163481007295       20
bob.mka       163481005732      20789                 163481026521    163481007295       19226
bob.mkv       163481005732      20814                 163481026546    163481007295       19251
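
The arithmetic above is easy to automate. Here is a minimal sketch using the values from the table; it prints each track's relative_offset in milliseconds and, for convenience, in seconds (the unit the video filter expects in the next step).

const tracks = {
  'alice.mka': { offset: 163481005731, startTime: 1564 },
  'alice.mkv': { offset: 163481005731, startTime: 1584 },
  'bob.mka':   { offset: 163481005732, startTime: 20789 },
  'bob.mkv':   { offset: 163481005732, startTime: 20814 }
};

// The Reference Value is the lowest offset + start_time sum of all tracks.
const reference = Math.min(
  ...Object.values(tracks).map(t => t.offset + t.startTime)
);

for (const [name, t] of Object.entries(tracks)) {
  const relativeOffsetMs = t.offset + t.startTime - reference;
  console.log(
    `${name}: ${relativeOffsetMs} ms (${(relativeOffsetMs / 1000).toFixed(3)} s)`
  );
}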

6. Mix the tracks with ffmpeg


The final step is to mix all the tracks in a single file. The command will:

  • Keep video and audio tracks in synchronization
  • Enable you to change the output video resolution
  • Pad the video tracks to keep the aspect ratio of the original videos

webm format


Below is the complete command to obtain the mixed file in webm format with a 1024x768 (width x height) resolution. It's a long command! You can see an explanation for each section below.


ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \
  -filter_complex " \
    [0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0], \
    color=black:size=512x768:duration=0.020[b0], \
    [b0][vs0]concat[r0c0]; \
    [1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1], \
    color=black:size=512x768:duration=19.251[b1], \
    [b1][vs1]concat[r0c1]; \
    [r0c0][r0c1]hstack=inputs=2[video]; \
    [2]aresample=async=1[a0]; \
    [3]aresample=async=1,adelay=19226.0|19226.0[a1]; \
    [a0][a1]amix=inputs=2[audio]" \
  -map '[video]' \
  -map '[audio]' \
  -acodec libopus \
  -vcodec libvpx \
  output.webm


1. Specify the input files

ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \

In the first line of the command, you specify the input files: the four recordings.

2. Apply scales for video and delay to video and audio


The following section breaks down each line of the filter operation.


-filter_complex <script>

a. This will perform the filter operation specified in the following string.


"[0]scale=<half of width>:-2,pad=<half of width>:<height>:(ow-iw)/2:(oh-ih)/2[vs0]

b. This section selects the first input file (here, Alice's video) and scales it to half the width of the desired resolution (512) while maintaining the original aspect ratio. It then pads the scaled video to 512x768, centering it, and tags the result [vs0].


color=black:size=<half of width>x<height>:duration=<relative offset in seconds>[b0],\

c. The next step is to generate black frames lasting the track's relative_offset (which you calculated in step 5), in seconds. This delays the track to keep it in sync with the other recordings.


[b0][vs0]concat[r0c0];\

d. This step concatenates the black stream [b0] with the padded video stream [vs0], so the black frames play before Alice's video, and tags the result [r0c0].


[1]scale=<half of width>:-2,pad=<half of width>:<height>:(ow-iw)/2:(oh-ih)/2[vs1],\

e. This step is the same as step b, repeated for the second input file (Bob's video). The output of this line is tagged as [vs1].


color=black:size=<half of width>x<height>:duration=<relative offset in seconds>[b1],

f. This step is the same as step c, except the duration is set to the relative_offset, in seconds, that you calculated for the second participant's video recording. In this example, it's 19.251. The output of this line is tagged [b1].


[b1][vs1]concat[r0c1]

g. This is the same as step d. It concatenates the black stream [b1] with the padded stream [vs1], and tags it as [r0c1].


[r0c0][r0c1]hstack=inputs=2[video]

h. This line stacks the two processed video streams horizontally, creating the 2x1 video grid. There are two video tracks, so inputs=2; the stacked output is tagged [video].


[2]aresample=async=1[a0];\

i. This line resamples the first audio input track (Alice's audio, which was the input at index [2] in the input list). aresample fills and trims the audio track if needed (see the ffmpeg resampler documentation for more information). The resampled audio is tagged [a0].


[3]aresample=async=1,adelay=19226.0|19226.0[a1];\

j. This line similarly resamples the second audio input track, which in this example is Bob's audio. Here, the relative offset was 19226 ms. adelay specifies the audio delay for both left and right channels in milliseconds. The resampled and delayed audio is tagged as [a1].


[a0][a1]amix=inputs=2[audio]" \

k. This configures the filter that performs the audio mixing. There are two audio tracks, so inputs=2; the mixed output is tagged [audio]. This is the final line of the filter script.

3. Produce the output

The remaining options select the mixed streams and encode the output file:


-map '[video]'

a. This selects the stream tagged [video] for the output.


-map '[audio]'

b. This selects the stream tagged [audio] for the output.


-acodec libopus

c. The audio codec to use. For mp4, use libfdk_aac. (See the note in Tutorial requirements about compiling ffmpeg with libfdk_aac if you want to create an mp4 output file.)


-vcodec libvpx

d. The video codec to use. For mp4, use libx264.


output.webm

e. The output file name.

mp4 format

The following command produces an output file in mp4 format. It follows the same structure as the webm command above, with a few alterations:

  • The audio codec for the output is libfdk_aac and the video codec is libx264.
  • There is an added -vsync 2 \ line immediately following the -map '[audio]' line. This line works with the libx264 video encoder.
  • The final output file is called output.mp4.

ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \
  -filter_complex "\
    [0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0],\
    color=black:size=512x768:duration=0.020[b0],\
    [b0][vs0]concat[r0c0];\
    [1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1],\
    color=black:size=512x768:duration=19.251[b1],\
    [b1][vs1]concat[r0c1];\
    [r0c0][r0c1]hstack=inputs=2[video];\
    [2]aresample=async=1[a0];\
    [3]aresample=async=1,adelay=19226.0|19226.0[a1];\
    [a0][a1]amix=inputs=2[audio]" \
  -map '[video]' \
  -map '[audio]' \
  -vsync 2 \
  -acodec libfdk_aac \
  -vcodec libx264 \
  output.mp4
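
If you mix Rooms regularly, you may not want to hand-edit the filter string for every new set of offsets. Below is a hypothetical Node.js generator for the two-participant filter graph used above; the layout (512x768 tiles in a 2x1 grid) and input order (two videos, then two audios) match this tutorial's commands.

// videoOffsetsSec: relative_offset of each video track, in seconds.
// audioOffsetsMs:  relative_offset of each audio track, in milliseconds.
function buildFilter(videoOffsetsSec, audioOffsetsMs) {
  const parts = [];
  videoOffsetsSec.forEach((off, i) => {
    // Scale and pad each video, then prepend black frames for its offset.
    // Note: a zero offset would create a zero-length black segment; in
    // that case, drop the color/concat pair for that input entirely.
    parts.push(
      `[${i}]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs${i}],` +
        `color=black:size=512x768:duration=${off.toFixed(3)}[b${i}],` +
        `[b${i}][vs${i}]concat[r0c${i}]`
    );
  });
  parts.push('[r0c0][r0c1]hstack=inputs=2[video]');
  audioOffsetsMs.forEach((off, i) => {
    // Audio inputs follow the two video inputs, hence the i + 2 index.
    const delay = off > 0 ? `,adelay=${off}|${off}` : '';
    parts.push(`[${i + 2}]aresample=async=1${delay}[a${i}]`);
  });
  parts.push('[a0][a1]amix=inputs=2[audio]');
  return parts.join(';');
}

// Offsets from step 5: alice.mkv = 0.020 s, bob.mkv = 19.251 s,
// alice.mka = 0 ms, bob.mka = 19226 ms.
console.log(buildFilter([0.02, 19.251], [0, 19226]));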


Additional resources to personalize the composed media


There are many situations where developers want to know the start, end, or duration of a track. For example, if you would like to concatenate black frames after the video track ends, you would need to know the start and end of the media track. In order to find these values, you can leverage ffprobe.

The examples below demonstrate how to use ffprobe to find the start time, end time, and duration of a track, using the example video track alice.mkv.

Find the start_time in milliseconds


ffprobe -i alice.mkv -show_frames 2>/dev/null | head -n 30 | grep -w pkt_dts | grep -Eo '[0-9]+'

This command outputs the start time of alice.mkv, which is 1564 ms.

Find the end_time in milliseconds


ffprobe -i alice.mkv -show_frames 2>/dev/null | tail -n 30 | grep -w pkt_dts | grep -Eo '[0-9]+'

This command outputs the end time of alice.mkv, which is 142242 ms.

Duration in milliseconds of video track


The duration of the track is the difference between the end_time (142242 ms) and start_time (1564 ms), which results in a duration of 140678 ms.

