
Manually mixing Track Recording files


Warning

This documentation is for reference only. We are no longer onboarding new customers to Programmable Video. Existing customers can continue to use the product until December 5, 2026.

We recommend migrating your application to the API provided by our preferred video partner, Zoom. We've prepared a migration guide to assist you in minimizing any service disruption.

Warning

Twilio recommends Compositions to mix all track recordings from a single Room, as it is a refined product that handles the various edge cases of mixing recordings with different start times and offsets. Only use the approach described here if you have a business reason for not using Compositions. The transcoding tools described below are not Twilio products, and as such Twilio will not provide support for errors or problems that arise when composing Recordings with the following procedure.

When you record a Twilio Video Group Room, you will most likely end up with multiple individual track recordings once the Room is completed. By default, when you record a Group Room, Twilio records all participants' audio and video tracks and creates an individual recording for each track. This gives you the flexibility to compose the tracks into a single video output and control over which tracks appear in the final composition.

Info

If you don't want to automatically record the audio and video tracks of every participant in a Group Room, you can use Recording Rules to specify which tracks to record. You can also configure your Group Room default settings in the Twilio Console, including whether Rooms default to Group Rooms and whether recording is on or off. (You can't record peer-to-peer Rooms, because their media does not pass through Twilio's servers.)

Twilio offers Compositions, a service that creates playable files from Recordings and automatically takes into account the Recordings' timing variations. However, if your use case does not fit the Compositions product, you can manually mix recordings into a single playable file.

If you choose not to use Compositions, there are several factors to consider when manually mixing the recordings into a single output. In particular, you will need to take into account each recording's start_time and offset values, as these can differ for each participant and cause synchronization issues when mixed.

In this tutorial, you'll learn how to synchronize two participants' track recordings, each with different start_time and offset values. The output will be a single video in either webm or mp4 format, with the participants' videos side-by-side in a 2x1 grid.


Tutorial requirements

To follow this tutorial, you will need ffmpeg and ffprobe installed on your machine. If you want to produce an mp4 output file, you will also need an ffmpeg build that includes the libfdk_aac audio encoder, which usually means compiling ffmpeg yourself with --enable-libfdk-aac.

After you have recorded a Group Room, you might want to merge all of the recorded tracks into a single playable file so you can review the full contents of the Room. If you merge recorded tracks without considering their different start_time or offset values, the output will not be synchronized.

There are several reasons tracks from the same Room might have different start_time and offset values, such as:

  • Participants entering the Room at different points in time. You can see this when there are different start times in the recordings' metadata.
  • A Room crashing in the middle of a call. You can see this with the offset in the recordings' metadata. The offset is the time in milliseconds elapsed between an arbitrary point in time, common to all group rooms, and the moment when the source room of this track started.

The example for this tutorial will be a scenario in which you want to mix Recordings from a Group Room with two participants. In this scenario:

  • Alice and Bob were both participants in the same Group Room.
  • Alice joined the Group Room when it started, but Bob entered roughly 20 seconds after it started.
  • Both Alice and Bob have different offset values.
  • The video and audio tracks for both Bob and Alice were recorded, so there are four recordings for the Group Room once the Room has completed:
    • Alice's audio
    • Alice's video
    • Bob's audio
    • Bob's video

Mixing both Alice's and Bob's tracks together without taking into account the different start_time and offset values will result in a media file with synchronization issues, where Alice and Bob's tracks are not playing at the proper times.

The output file this tutorial produces will mix the two video and two audio tracks and ensure they are correctly synchronized. The video tracks will be placed side by side in a 2x1 grid layout, with a resolution of 1024x768.


1. Find the Recording SIDs for the Group Room


First, you will need to find the SID for each of the recordings you would like to mix. You can do this via the REST API; the call below retrieves the recording SIDs for a Room. (Note that you should pass the Room SID in the GroupingSid argument as an array with a single item.) You will need these recording SIDs in the next step.

Info

Click "Show Sample Response" in the bottom left corner of the code samples below to see the JSON response that would be returned from making the API calls. In this example, you should retrieve the sid for each of the recordings.

Retrieve a list of all Recordings for a Room

Node.js

// Download the helper library from https://www.twilio.com/docs/node/install
// Find your Account SID and Auth Token at twilio.com/console
// and set the environment variables. See http://twil.io/secure
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = require('twilio')(accountSid, authToken);

client.video.v1.recordings
  .list({
    groupingSid: ['RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'],
    limit: 20
  })
  .then(recordings => recordings.forEach(r => console.log(r.sid)));

Output

{
  "recordings": [
    {
      "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "status": "completed",
      "date_created": "2015-07-30T20:00:00Z",
      "date_updated": "2015-07-30T21:00:00Z",
      "date_deleted": "2015-07-30T22:00:00Z",
      "sid": "RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "source_sid": "MTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "size": 23,
      "type": "audio",
      "duration": 10,
      "container_format": "mka",
      "codec": "OPUS",
      "track_name": "A name",
      "offset": 10,
      "status_callback": "https://mycallbackurl.com",
      "status_callback_method": "POST",
      "grouping_sids": {
        "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
        "participant_sid": "PAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
      },
      "media_external_location": "https://my-super-duper-bucket.s3.amazonaws.com/my/path/",
      "encryption_key": "public_key",
      "url": "https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "links": {
        "media": "https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Media"
      }
    }
  ],
  "meta": {
    "page": 0,
    "page_size": 50,
    "first_page_url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&PageSize=50&Page=0",
    "previous_page_url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&PageSize=50&Page=0",
    "url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&PageSize=50&Page=0",
    "next_page_url": "https://video.twilio.com/v1/Recordings?Status=completed&DateCreatedAfter=2017-01-01T00%3A00%3A01Z&DateCreatedBefore=2017-12-31T23%3A59%3A59Z&SourceSid=source_sid&MediaType=audio&GroupingSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&PageSize=50&Page=1",
    "key": "recordings"
  }
}
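
Since all four SIDs come back in one list, it can help to label each recording by participant and track type as you collect them. Below is a minimal sketch, reusing the client from the sample above; it assumes the Node.js helper library exposes the response fields as camelCase properties that mirror the JSON shown here.

client.video.v1.recordings
  .list({ groupingSid: ['RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'] })
  .then(recordings => {
    for (const r of recordings) {
      // r.type is "audio" or "video"; grouping_sids.participant_sid
      // identifies the participant that published the track.
      console.log(`${r.groupingSids.participant_sid} ${r.type}: ${r.sid}`);
    }
  });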


2. Retrieve the offset for each recording


The next step is to extract the offset value for each of the four recordings via its metadata. You can do this using the REST API.

Keep track of these offsets, as you will need them in a later step.

Retrieve the offset for each track's Recording

Node.js

// Download the helper library from https://www.twilio.com/docs/node/install
// Find your Account SID and Auth Token at twilio.com/console
// and set the environment variables. See http://twil.io/secure
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = require('twilio')(accountSid, authToken);

client.video.v1.recordings('RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
  .fetch()
  .then(recording => console.log(recording.offset));

Output

{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "status": "processing",
  "date_created": "2015-07-30T20:00:00Z",
  "sid": "RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "source_sid": "MTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "size": 0,
  "url": "https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "type": "audio",
  "duration": 0,
  "container_format": "mka",
  "codec": "OPUS",
  "track_name": "A name",
  "offset": 10,
  "status_callback": "https://mycallbackurl.com",
  "status_callback_method": "POST",
  "grouping_sids": {
    "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  },
  "media_external_location": "https://my-super-duper-bucket.s3.amazonaws.com/my/path/",
  "links": {
    "media": "https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Media"
  }
}
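
If you are scripting this step, a small sketch like the following can fetch all four offsets in one pass. It reuses the client from the sample above; the SIDs are placeholders for the values you collected in step 1.

const sids = [
  'RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', // Alice's audio
  'RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', // Alice's video
  'RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', // Bob's audio
  'RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'  // Bob's video
];

// Fetch each recording resource and print its offset (in milliseconds).
Promise.all(sids.map(sid => client.video.v1.recordings(sid).fetch()))
  .then(recordings =>
    recordings.forEach(r => console.log(`${r.sid}: offset=${r.offset}`))
  );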


3. Download each recording


Next, you should download each recording. You can do this via the REST API. The following curl command retrieves the URL that you can use to download the media content of a Recording.


curl 'https://video.twilio.com/v1/Recordings/RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Media' \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN

You will get back a JSON response that contains a redirect_to URL, similar to the response below. Go to this URL to download the recording file.


{"redirect_to": "https://com-twilio-us1-video-recording..."}

The audio files you download will be in .mka format and the video files will be in .mkv format.
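
If you prefer to download the files programmatically, here is a minimal Node.js sketch (Node 18+, for the built-in fetch). It reads the redirect_to URL without following the redirect, so the Authorization header is not forwarded to the storage provider. The SID and file name are illustrative.

const fs = require('node:fs');

const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;

async function downloadRecording(recordingSid, fileName) {
  const auth = Buffer.from(`${accountSid}:${authToken}`).toString('base64');
  // Request the media location; redirect: 'manual' keeps the JSON body
  // containing redirect_to instead of following the redirect.
  const res = await fetch(
    `https://video.twilio.com/v1/Recordings/${recordingSid}/Media`,
    { headers: { Authorization: `Basic ${auth}` }, redirect: 'manual' }
  );
  const { redirect_to } = await res.json();
  // The redirect_to URL is pre-signed, so no credentials are needed here.
  const media = await fetch(redirect_to);
  fs.writeFileSync(fileName, Buffer.from(await media.arrayBuffer()));
}

downloadRecording('RTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', 'alice.mka');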


4. Find the start time of the Recordings with ffprobe


At this point, you should have four recordings downloaded on your machine, as well as the offset values for each of these recordings.

This next step uses ffprobe to retrieve the start_time for each recording. You will need to perform this step on each recording.

For example, to get the start_time of Alice's audio track, run the following ffprobe command:


ffprobe -show_entries format=start_time alice.mka

The output will look similar to the output below, and it will include the start_time:


Input #0, matroska,webm, from 'alice.mka':
  Metadata:
    encoder         : GStreamer matroskamux version 1.8.1.1
    creation_time   : 2017-06-30T09:03:44.000000Z
  Duration: 00:13:09.36, start: 1.564000, bitrate: 48 kb/s
    Stream #0:0(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
    Metadata:
      title           : Audio
start_time=1.564000

After retrieving both the offset from the Recordings metadata and start_time from the ffprobe on the Recording media, you can create a table like the one below. (The creation_time will also be in the output of the ffprobe command above; we are referencing it below to demonstrate that it is not the correct value to use when mixing tracks. It is not needed in any of the following steps and will be removed from the table going forward.)

A table of each recording's offset and start time, along with its creation time:

Track Name    offset (in ms)    start_time (in ms)    creation_time
alice.mka     163481005731      1564                  2017-06-30T09:03:44.000000Z
alice.mkv     163481005731      1584                  2017-06-30T09:03:44.000000Z
bob.mka       163481005732      20789                 2017-06-30T09:04:03.000000Z
bob.mkv       163481005732      20814                 2017-06-30T09:04:03.000000Z

A participant's audio and video tracks are not guaranteed to share the same start_time and offset; they can differ, for example, after a Room recovery. You can also see the roughly 20 seconds that Alice was in the Room before Bob reflected in the start_time values of each participant's recordings.

It is important to use start_time as the reference, not creation_time. A recording's creation_time is the time the user joined the call, while start_time is the moment the first sample of data was received for the recording. Additionally, creation_time does not have millisecond precision, which could lead to synchronization issues.
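
If you are scripting this step, you can shell out to ffprobe and parse the value directly. Below is a minimal sketch; it assumes ffprobe is on your PATH and uses the -of default=noprint_wrappers=1:nokey=1 output format so that only the bare value is printed.

const { execSync } = require('node:child_process');

// Returns the start_time of a media file, converted to milliseconds.
function startTimeMs(file) {
  const seconds = execSync(
    `ffprobe -v error -show_entries format=start_time ` +
      `-of default=noprint_wrappers=1:nokey=1 "${file}"`
  )
    .toString()
    .trim();
  return Math.round(parseFloat(seconds) * 1000);
}

console.log(startTimeMs('alice.mka')); // e.g. 1564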


5. Set the reference and relative offsets


Next, you will need to calculate the relative offset of each track so that the tracks will be synchronized. To calculate the relative offset:

  1. Add the offset to the start_time. In the sample table below, this value is stored in the Addition column.
  2. Take the lowest Addition value of all tracks. In the sample, that's alice.mka, with an Addition value of 163481007295. This is the Reference Value, copied into every row of the table, as you will need it in the next step.
  3. Subtract the Reference Value from each recording's Addition value to get its relative_offset in milliseconds. You will need the relative_offset values when mixing the tracks together.

The following table shows the current values for our tracks, in which alice.mka is the reference value with 163481007295.

A table of each recording's offset, start time, and the calculated fields for determining the relative offset:

Track Name    offset (in ms)    start_time (in ms)    Addition        Reference Value    relative_offset (in ms)
alice.mka     163481005731      1564                  163481007295    163481007295       0
alice.mkv     163481005731      1584                  163481007315    163481007295       20
bob.mka       163481005732      20789                 163481026521    163481007295       19226
bob.mkv       163481005732      20814                 163481026546    163481007295       19251
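
The arithmetic above is easy to automate. Here is a minimal sketch using the values from the table; it prints each track's relative_offset in milliseconds and, for convenience, in seconds (the unit the video filter expects in the next step).

const tracks = {
  'alice.mka': { offset: 163481005731, startTime: 1564 },
  'alice.mkv': { offset: 163481005731, startTime: 1584 },
  'bob.mka':   { offset: 163481005732, startTime: 20789 },
  'bob.mkv':   { offset: 163481005732, startTime: 20814 }
};

// The Reference Value is the lowest offset + start_time sum of all tracks.
const reference = Math.min(
  ...Object.values(tracks).map(t => t.offset + t.startTime)
);

for (const [name, t] of Object.entries(tracks)) {
  const relativeOffsetMs = t.offset + t.startTime - reference;
  console.log(
    `${name}: ${relativeOffsetMs} ms (${(relativeOffsetMs / 1000).toFixed(3)} s)`
  );
}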

6. Mix the tracks with ffmpeg


The final step is to mix all the tracks in a single file. The command will:

  • Keep video and audio tracks in synchronization
  • Enable you to change the output video resolution
  • Pad the video tracks to keep the aspect ratio of the original videos

webm format


Below is the complete command to obtain the mixed file in webm format with a 1024x768 (width x height) resolution. It's a long command! You can see an explanation for each section below.


ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \
  -filter_complex " \
    [0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0], \
    color=black:size=512x768:duration=0.020[b0], \
    [b0][vs0]concat[r0c0]; \
    [1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1], \
    color=black:size=512x768:duration=19.251[b1], \
    [b1][vs1]concat[r0c1]; \
    [r0c0][r0c1]hstack=inputs=2[video]; \
    [2]aresample=async=1[a0]; \
    [3]aresample=async=1,adelay=19226.0|19226.0[a1]; \
    [a0][a1]amix=inputs=2[audio]" \
  -map '[video]' \
  -map '[audio]' \
  -acodec libopus \
  -vcodec libvpx \
  output.webm


1. Specify the input files

ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \

In the first line of the command, you specify the input files: the four recordings.

2. Apply scales for video and delay to video and audio


The following section breaks down each line of the filter operation.


-filter_complex <script>

a. This will perform the filter operation specified in the following string.


"[0]scale=<half of width>:-2,pad=<half of width>:<height>:(ow-iw)/2:(oh-ih)/2[vs0]

b. This section selects the first input file (here, Alice's video) and scales it to half the width of the desired resolution (512) while maintaining the original aspect ratio. It then pads the scaled video to 512x768, centering it, and tags the result [vs0].


color=black:size=<half of width>x<height>:duration=<relative offset in seconds>[b0],\

c. The next step is to generate black frames lasting the track's relative_offset (which you calculated in step 5), in seconds. This delays the track to keep it in sync with the other recordings.


[b0][vs0]concat[r0c0];\

d. This step concatenates the black stream [b0] with the padded video stream [vs0], so the black frames play before Alice's video, and tags the result [r0c0].


[1]scale=<half of width>:-2,pad=<half of width>:<height>:(ow-iw)/2:(oh-ih)/2[vs1],\

e. This step is the same as step b, repeated for the second input file (Bob's video). The output of this line is tagged as [vs1].


color=black:size=<half of width>x<height>:duration=<relative offset in seconds>[b1],

f. This step is the same as step c, except the duration is set to the relative_offset, in seconds, that you calculated for the second participant's video recording. In this example, it's 19.251. The output of this line is tagged [b1].


[b1][vs1]concat[r0c1]

g. This is the same as step d. It concatenates the black stream [b1] with the padded stream [vs1], and tags it as [r0c1].


[r0c0][r0c1]hstack=inputs=2[video]

h. This line stacks the two processed video streams horizontally, creating the 2x1 video grid. There are two video tracks, so inputs=2; the stacked output is tagged [video].


[2]aresample=async=1[a0];\

i. This line resamples the first audio input track (Alice's audio, which was the input at index [2] in the input list). aresample fills and trims the audio track if needed (see the ffmpeg resampler documentation for more information). The resampled audio is tagged [a0].


[3]aresample=async=1,adelay=19226.0|19226.0[a1];\

j. This line similarly resamples the second audio input track, which in this example is Bob's audio. Here, the relative offset was 19226 ms. adelay specifies the audio delay for both left and right channels in milliseconds. The resampled and delayed audio is tagged as [a1].


[a0][a1]amix=inputs=2[audio]" \

k. This configures the filter that performs the audio mixing. There are two audio tracks, so inputs=2; the mixed output is tagged [audio]. This is the final line of the filter script.

3. Produce the output

The remaining options select the mixed streams and encode the output file:


-map '[video]'

a. This selects the stream tagged [video] for the output.


-map '[audio]'

b. This selects the stream tagged [audio] for the output.


-acodec libopus

c. The audio codec to use. For mp4, use libfdk_aac. (See the note in Tutorial requirements about compiling ffmpeg with libfdk_aac if you want to create an mp4 output file.)


-vcodec libvpx

d. The video codec to use. For mp4, use libx264.


output.webm

e. The output file name.

mp4 format

The following command produces an output file in mp4 format. It follows the same structure as the webm command above, with a few alterations:

  • The audio codec for the output is libfdk_aac and the video codec is libx264.
  • There is an added -vsync 2 \ line immediately following the -map '[audio]' line. This line works with the libx264 video encoder.
  • The final output file is called output.mp4.

ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka \
  -filter_complex "\
    [0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0],\
    color=black:size=512x768:duration=0.020[b0],\
    [b0][vs0]concat[r0c0];\
    [1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1],\
    color=black:size=512x768:duration=19.251[b1],\
    [b1][vs1]concat[r0c1];\
    [r0c0][r0c1]hstack=inputs=2[video];\
    [2]aresample=async=1[a0];\
    [3]aresample=async=1,adelay=19226.0|19226.0[a1];\
    [a0][a1]amix=inputs=2[audio]" \
  -map '[video]' \
  -map '[audio]' \
  -vsync 2 \
  -acodec libfdk_aac \
  -vcodec libx264 \
  output.mp4
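
If you mix Rooms regularly, you may not want to hand-edit the filter string for every new set of offsets. Below is a hypothetical Node.js generator for the two-participant filter graph used above; the layout (512x768 tiles in a 2x1 grid) and input order (two videos, then two audios) match this tutorial's commands.

// videoOffsetsSec: relative_offset of each video track, in seconds.
// audioOffsetsMs:  relative_offset of each audio track, in milliseconds.
function buildFilter(videoOffsetsSec, audioOffsetsMs) {
  const parts = [];
  videoOffsetsSec.forEach((off, i) => {
    // Scale and pad each video, then prepend black frames for its offset.
    // Note: a zero offset would create a zero-length black segment; in
    // that case, drop the color/concat pair for that input entirely.
    parts.push(
      `[${i}]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs${i}],` +
        `color=black:size=512x768:duration=${off.toFixed(3)}[b${i}],` +
        `[b${i}][vs${i}]concat[r0c${i}]`
    );
  });
  parts.push('[r0c0][r0c1]hstack=inputs=2[video]');
  audioOffsetsMs.forEach((off, i) => {
    // Audio inputs follow the two video inputs, hence the i + 2 index.
    const delay = off > 0 ? `,adelay=${off}|${off}` : '';
    parts.push(`[${i + 2}]aresample=async=1${delay}[a${i}]`);
  });
  parts.push('[a0][a1]amix=inputs=2[audio]');
  return parts.join(';');
}

// Offsets from step 5: alice.mkv = 0.020 s, bob.mkv = 19.251 s,
// alice.mka = 0 ms, bob.mka = 19226 ms.
console.log(buildFilter([0.02, 19.251], [0, 19226]));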


Additional resources to personalize the composed media


There are many situations where developers want to know the start, end, or duration of a track. For example, if you would like to concatenate black frames after the video track ends, you would need to know the start and end of the media track. In order to find these values, you can leverage ffprobe.

The examples below demonstrate how to use ffprobe to find the start time, end time, and duration of a track, using the example video track alice.mkv.

Find the start_time in milliseconds


ffprobe -i alice.mkv -show_frames 2>/dev/null | head -n 30 | grep -w pkt_dts | grep -Eo '[0-9]+'

This command outputs the start time of alice.mkv, which is 1564 ms.

Find the end_time in milliseconds


ffprobe -i alice.mkv -show_frames 2>/dev/null | tail -n 30 | grep -w pkt_dts | grep -Eo '[0-9]+'

This command outputs the end time of alice.mkv, which is 142242 ms.

Duration in milliseconds of video track


The duration of the track is the difference between the end_time (142242 ms) and start_time (1564 ms), which results in a duration of 140678 ms.

