
Real-time Transcriptions for Video



Beta

Real-Time Transcriptions is in Public Beta. The information in this document could change. We might add or update features before the product becomes Generally Available. Beta products don't have a Service Level Agreement (SLA). Learn more about beta product support.

Real-Time Transcriptions Public Beta is not HIPAA eligible.


Legal notice

Real-Time Transcriptions uses artificial intelligence or machine learning technologies. By enabling or using any of the features or functionalities within Twilio Video that are identified as using artificial intelligence or machine learning technology, you acknowledge and agree that your use of these features or functionalities is subject to the terms of the Predictive and Generative AI/ML Features Addendum.


Overview


Real-Time Transcriptions converts speech from any participant in a Video Room into text and sends that text to the Video Client SDKs (JavaScript, iOS, and Android). Your application can render the text in any style and format. Twilio supports multiple speech models and you can choose the model that best fits your use case.

Your app can implement transcriptions in two ways:

  • Start automatically when your app creates a Video Room.
  • Start, stop, or restart on demand while the Room is active.

You enable Real-Time Transcriptions at the Video Room level, so every participant is transcribed. You configure the spoken language and speech model, and the settings remain in effect until the Room ends. You can also set a default configuration in the Twilio Console.

When transcription is active, Twilio delivers the transcribed text, along with the Participant SID, to every participant in the Room.

If you enable partial results, the transcription engine delivers interim results so that your app can refresh the UI in near real time.
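As an illustration of how an app might consume interim and final results, here is a plain JavaScript sketch. The helper and its names are hypothetical (not part of any Twilio SDK); the event fields mirror the delivery format described later on this page. The app keeps one in-progress caption per audio track, overwrites it on each interim result, and commits it when the final result arrives:

```javascript
// Hypothetical caption-state helpers; illustrative only, not a Twilio API.
function createCaptionState() {
  return { interim: new Map(), finals: [] };
}

function applyTranscriptionEvent(state, event) {
  if (event.partial_results) {
    // Interim result: replace the current caption for this track.
    state.interim.set(event.track, event.transcription);
  } else {
    // Final result: commit the utterance and clear the interim caption.
    state.finals.push(event.transcription);
    state.interim.delete(event.track);
  }
  return state;
}
```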


Transcription resource


Transcription is a subresource of the Room resource that represents a Room's transcript. The resource URI is:

/v1/Rooms/{RoomNameOrSid}/Transcriptions/{TranscriptionTtid}
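For illustration, a hypothetical helper (not part of any Twilio SDK) that interpolates this URI template into an absolute URL:

```javascript
// Hypothetical helper; builds the absolute resource URL from the
// URI template above. Illustrative only, not a Twilio API.
const VIDEO_BASE_URL = 'https://video.twilio.com';

function transcriptionUrl(roomNameOrSid, transcriptionTtid) {
  return `${VIDEO_BASE_URL}/v1/Rooms/${roomNameOrSid}/Transcriptions/${transcriptionTtid}`;
}
```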

| Property | Type | Description |
| --- | --- | --- |
| `ttid` | ttid | The Twilio Type ID of the Transcription resource, assigned at creation time in the format `video_transcriptions_{uuidv7-special-encoded}`. |
| `room_sid` | `SID<RM>` | The SID of the Room instance that is the parent of the Transcription resource. |
| `account_sid` | `SID<AC>` | The SID of the account that owns the Transcription resource. |
| `status` | `enum<string>` | The status of the Transcription resource: `started`, `stopped`, or `failed`. The resource is created with status `started` by default. |
| `configuration` | `object<string-map>` | Key-value map of configuration parameters applied to audio track transcriptions. The following keys are supported: `PartialResults`, `LanguageCode`, `TranscriptionEngine`, `ProfanityFilter`, `SpeechModel`, `Hints`, `EnableAutomaticPunctuation`. |
| `date_created` | `string<date-time>` | The date and time in GMT when the resource was created, in ISO 8601 format. |
| `date_updated` | `string<date-time>` | The date and time in GMT when the resource was last updated, in ISO 8601 format. |
| `start_time` | `string<date-time>` | The date and time in GMT when the resource last transitioned to status `started`, in ISO 8601 format. |
| `end_time` | `string<date-time>` | The date and time in GMT when the resource last transitioned to status `stopped`, in ISO 8601 format. |
| `duration` | `integer` | The time in seconds the transcription has been in status `started`. |
| `url` | `string<uri>` | The absolute URL of the resource. |

The Transcription resource transitions to status failed if an internal error is detected that prevents the transcriptions from being generated. The Twilio Console receives a debug event with the details of the failure. The resource can't be restarted once a failure is detected.
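The status rules above can be summarized in a small sketch (plain JavaScript, illustrative only, not a Twilio API): `started` and `stopped` can toggle, while `failed` is terminal.

```javascript
// Illustrative status-transition table: a started transcription can be
// stopped, a stopped one can be restarted, and a failed one is terminal.
const ALLOWED_TRANSITIONS = {
  started: ['stopped'],
  stopped: ['started'],
  failed: [],
};

function canTransition(from, to) {
  return (ALLOWED_TRANSITIONS[from] || []).includes(to);
}
```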

Transcription configuration properties

| Name | Type | Optional or Required | Description |
| --- | --- | --- | --- |
| `transcriptionEngine` | string | Optional | The transcription engine to use, among those supported by Twilio. Default is `"google"`. |
| `speechModel` | string | Optional | Recognition model used by the transcription engine, among those supported by the provider. Defaults to Google's `"telephony"`. |
| `languageCode` | string | Optional | Language code used by the transcription engine, in BCP-47 format. Default is `"en-US"`. This attribute helps the transcription engine correctly understand and process the spoken language. |
| `partialResults` | boolean | Optional | Whether to send partial results. Default is `false`. When enabled, the transcription engine sends interim results as the transcription progresses, providing more immediate feedback before the final result is available. |
| `profanityFilter` | boolean | Optional | Whether the server attempts to filter out profanities, replacing all but the initial character in each filtered word with asterisks (a Google feature). Default is `true`. |
| `hints` | string | Optional | A comma-separated list of words or phrases that the transcription provider can expect to encounter during the video call. Hints can improve the provider's recognition of expected words or phrases. Up to 500 words or phrases can be provided, each up to 100 characters. Separate the words in a phrase with spaces. |
| `enableAutomaticPunctuation` | boolean | Optional | Whether the provider adds punctuation to the transcribed text. Default is `true`. When enabled, the transcription engine automatically inserts punctuation marks such as periods, commas, and question marks, improving the readability of the transcribed text. |
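The hints limits can also be enforced client-side before sending the request. Here is a hypothetical JavaScript helper (`buildHints` is illustrative, not a Twilio API) that joins phrases with commas and checks the documented limits:

```javascript
// Hypothetical helper: builds the comma-separated hints string and
// enforces the documented limits of at most 500 entries, each at most
// 100 characters. Words within a phrase are separated by spaces.
function buildHints(phrases) {
  if (phrases.length > 500) {
    throw new Error('At most 500 hint words or phrases are allowed');
  }
  for (const phrase of phrases) {
    if (phrase.length > 100) {
      throw new Error(`Hint entry exceeds 100 characters: ${phrase}`);
    }
  }
  return phrases.join(',');
}
```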

Transcription engines and speech models


The following table lists the possible values for the transcriptionEngine and the associated speechModel properties.

| Transcription engine | Speech model | Description |
| --- | --- | --- |
| `google` | `telephony` | Use this model for audio that originated from an audio phone call, typically recorded at an 8 kHz sampling rate. |
| `google` | `medical_conversation` | Use this model for conversations between a medical provider (for example, a doctor or nurse) and a patient. |
| `google` | `long` | Use this model for any kind of long-form content, such as media or spontaneous speech and conversations. Consider using this model instead of the video or the default model, especially if they aren't available in your target language. |
| `google` | `short` | Use this model for short utterances a few seconds in length. It is useful for capturing commands or other single-shot directed speech. Consider using this model instead of the command and search model. |
| `google` | `telephony_short` | Dedicated version of the telephony model for short or even single-word utterances from audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Useful for utterances only a few seconds long in customer service, teleconferencing, and automated kiosk applications. |
| `google` | `medical_dictation` | Use this model to transcribe notes dictated by a medical professional, for example, a doctor dictating notes about a patient's blood test results. |
| `google` | `chirp_2` | The next generation of Google's Universal Speech Model (USM), powered by large language model technology. Supports streaming and batch transcription and translation across diverse linguistic content, with multilingual capabilities. |
| `google` | `chirp_telephony` | Universal Speech Model (USM) fine-tuned for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. |
| `google` | `chirp` | Universal Speech Model (USM) for state-of-the-art non-streaming transcription across diverse linguistic content, with multilingual capabilities. |
| `deepgram` | `nova-2` | Recommended for most use cases. |


Create a Room with transcriptions enabled


To create a Video Room with Real-Time Transcriptions automatically enabled, add the following two parameters to the Room POST request:

| Parameter | Type | Description |
| --- | --- | --- |
| `TranscribeParticipantsOnConnect` | boolean | Whether to start Real-Time Transcriptions when Participants connect. Default is `false`. |
| `TranscriptionsConfiguration` | object | Key-value configuration settings for the transcription engine. For more information, see Transcription configuration properties. |

To automatically enable transcriptions on the Video Room, set the TranscribeParticipantsOnConnect parameter to true.

Example:

```shell
curl -X POST "https://video.twilio.com/v1/Rooms" \
  --data-urlencode 'TranscriptionsConfiguration={"languageCode": "EN-us", "partialResults": true}' \
  --data-urlencode "TranscribeParticipantsOnConnect=true" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "audio_only": false,
  "date_created": "2025-06-12T15:42:32Z",
  "date_updated": "2025-06-12T15:42:32Z",
  "duration": null,
  "empty_room_timeout": 5,
  "enable_turn": true,
  "end_time": null,
  "large_room": false,
  "links": {
    "participants": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Participants",
    "recording_rules": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/RecordingRules",
    "recordings": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Recordings",
    "transcriptions": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions"
  },
  "max_concurrent_published_tracks": 170,
  "max_participant_duration": 14400,
  "max_participants": 50,
  "media_region": "us1",
  "record_participants_on_connect": false,
  "sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "status": "in-progress",
  "status_callback": null,
  "status_callback_method": "POST",
  "type": "group",
  "unique_name": "test",
  "unused_room_timeout": 5,
  "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "video_codecs": [
    "VP8",
    "H264"
  ]
}
```

Start transcriptions on an existing Room


Create a Transcription with a POST request to the resource URI:

/v1/Rooms/{RoomNameOrSid}/Transcriptions

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `RoomSid` | `SID<RM>` | The SID of the parent Room in which the Transcription resource is created. |

Request body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `Configuration` | `object<string-map>` | Object with key-value configurations. See the `configuration` property of the Transcription resource above for a description of supported keys. |

Example:

```shell
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions" \
  --data-urlencode 'Configuration={"languageCode": "EN-us", "partialResults": true, "profanityFilter": true, "speechModel": "long"}' \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "configuration": {
    "languageCode": "EN-us",
    "partialResults": "true",
    "profanityFilter": "true",
    "speechModel": "long"
  },
  "date_created": "2025-07-22T14:14:35Z",
  "date_updated": null,
  "duration": null,
  "end_time": null,
  "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "start_time": null,
  "status": "started",
  "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

Stop transcription on a Room


Update a Transcription resource with a POST request to the resource instance URI:

/v1/Rooms/{RoomSid}/Transcriptions/{ttid}

To stop transcriptions on a Room, set the Status to stopped.

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `ttid` | ttid | The TTID of the Transcription resource being updated. The current implementation supports a single Transcription resource, but this might change in future implementations. |
| `RoomSid` | `SID<RM>` | The SID of the parent Room in which the Transcription resource is updated. |

Request body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `Status` | `enum<string>` | New status of the Transcription resource. Can be: `started`, `stopped`. There is no state transition if the resource's `status` property already has the same value or if the parameter is missing. |
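Because posting an unchanged or missing status causes no state transition, a client can decide locally whether an update request is worthwhile. A minimal sketch (illustrative names, not a Twilio API):

```javascript
// Illustrative: posting an unchanged or missing Status is a no-op per the
// parameter description above, so such requests can be skipped.
function needsStatusUpdate(currentStatus, desiredStatus) {
  return desiredStatus !== undefined && desiredStatus !== currentStatus;
}
```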

Example:

```shell
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  --data-urlencode "Status=stopped" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "configuration": {
    "languageCode": "EN-us",
    "partialResults": "true"
  },
  "date_created": "2025-07-22T12:55:30Z",
  "date_updated": "2025-07-22T12:56:02Z",
  "duration": null,
  "end_time": null,
  "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "start_time": null,
  "status": "stopped",
  "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

Restart transcriptions on a Room


To restart transcriptions on a Room that has a Transcription resource with a stopped state, update it with a POST request to the resource instance URI:

/v1/Rooms/{RoomSid}/Transcriptions/{ttid}

To restart transcription, set the Status to started.

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `ttid` | ttid | The TTID of the Transcription resource being updated. The current implementation supports a single Transcription resource, but this might change in future implementations. |
| `RoomSid` | `SID<RM>` | The SID of the parent Room in which the Transcription resource is updated. |

Request body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `Status` | `enum<string>` | New status of the Transcription resource. Use `started` to restart transcriptions. There is no state transition if the resource's `status` property already has the same value or if the parameter is missing. |

Example:

```shell
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  --data-urlencode "Status=started" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "configuration": {
    "languageCode": "EN-us",
    "partialResults": "true"
  },
  "date_created": "2025-07-22T12:57:24Z",
  "date_updated": null,
  "duration": null,
  "end_time": null,
  "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "start_time": null,
  "status": "started",
  "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

Retrieve the Transcription resource for a Room


Retrieve the Transcription resource in a Room with a GET request to the resource URI:

/v1/Rooms/{RoomSid}/Transcriptions

Real-Time Transcriptions supports only a single instance of the Transcription resource per Room, so the list will always have a single item.
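Because the list always contains at most one item, client code can simply take the first element of the `transcriptions` array in the list response. A sketch (hypothetical helper, not a Twilio API):

```javascript
// Illustrative: a Room has at most one Transcription, so the first
// element of the list response (if any) is the active resource.
function singleTranscription(listResponse) {
  const items = listResponse.transcriptions || [];
  return items.length > 0 ? items[0] : null;
}
```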

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `RoomSid` | `SID<RM>` | The SID of the parent Room from which Transcription resources are retrieved. |

Example:

```shell
curl -X GET "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "meta": {
    "first_page_url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions?PageSize=50&Page=0",
    "key": "transcriptions",
    "next_page_url": null,
    "page": 0,
    "page_size": 50,
    "previous_page_url": null,
    "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions?PageSize=50&Page=0"
  },
  "transcriptions": [
    {
      "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "configuration": {},
      "date_created": "2025-07-22T11:05:41Z",
      "date_updated": null,
      "duration": null,
      "end_time": null,
      "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "start_time": null,
      "status": "started",
      "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
      "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  ]
}
```

Fetch a single Transcription resource


Retrieve a specific Transcription resource with a GET request to the instance resource URI:

/v1/Rooms/{RoomSid}/Transcriptions/{ttid}

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `ttid` | ttid | The TTID of the Transcription resource being requested. |
| `RoomSid` | `SID<RM>` | The SID of the parent Room from which the Transcription resource is retrieved. |

Example:

```shell
curl -X GET "https://video.twilio.com/v1/Rooms/$sid/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "configuration": {
    "LanguageCode": "EN-us",
    "ProfanityFilter": "true"
  },
  "date_created": null,
  "date_updated": null,
  "duration": null,
  "end_time": null,
  "links": {
    "transcriptions": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  },
  "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "start_time": null,
  "status": "started",
  "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

Transcribed text delivery


Twilio delivers transcribed text to the client SDKs through callback events.

The schema of the JSON delivery format contains a version number. Each event contains the transcription of a single utterance and details of the participant who generated the audio.

```yaml
properties:
  type:
    const: extension_transcriptions

  version:
    description: |
      Version of the transcriptions protocol used by this message. It is semver compliant.

  track:
    $ref: /Server/State/RemoteTrack
    description: |
      Audio track from which the transcription has been generated.

  participant:
    $ref: /Server/State/Participant
    description: |
      The participant who published the audio track from which the
      transcription has been generated.

  sequence_number:
    type: integer
    description: |
      Sequence number. Starts at one and increments monotonically. A sequence
      counter is defined for each track to allow the receiver to identify
      missing messages.

  timestamp:
    type: string
    description: |
      Absolute time from the real-time transcription. It is
      conformant with UTC ISO 8601.

  partial_results:
    type: boolean
    description: |
      Whether the transcription is a final or a partial result.

  language_code:
    type: string
    description: |
      Language code of the transcribed text. It is conformant with BCP-47.

  transcription:
    type: string
    description: |
      Utterance transcription.
```
Example:

```json
{
  "version": "1.0",
  "language_code": "en-US",
  "partial_results": false,
  "participant": "PA00000000000000000000000000000000",
  "sequence_number": 3,
  "timestamp": "2025-01-01T12:00:00.000000000Z",
  "track": "MT00000000000000000000000000000000",
  "transcription": "This is a test",
  "type": "extension_transcriptions"
}
```
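Since `sequence_number` starts at one and increments monotonically per track, a receiver can detect missed events by tracking the last number seen for each track. An illustrative sketch (hypothetical helper, not part of any Twilio SDK):

```javascript
// Illustrative gap detector: remembers the last sequence_number seen per
// audio track and reports how many events were missed in between.
function createGapDetector() {
  const lastSeen = new Map();
  return function countMissed(event) {
    const previous = lastSeen.get(event.track) || 0;
    lastSeen.set(event.track, event.sequence_number);
    return Math.max(0, event.sequence_number - previous - 1);
  };
}
```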

Starting and stopping transcript events with the JavaScript SDK


To enable the flow of transcript events, set the receiveTranscriptions parameter in connectOptions to true. The default value is false. Once it is set to true and Real-Time Transcriptions is enabled at the Room level, callback events containing the transcribed text start to flow.

Example:

```javascript
import { connect } from 'twilio-video';

const room = await connect(token, {
  name: 'my-room',
  receiveTranscriptions: true
});

room.on('transcription', (transcriptionEvent) => {
  console.log(`${transcriptionEvent.participant}: ${transcriptionEvent.transcription}`);
});
```

Starting and stopping transcript events with the iOS SDK


To enable the flow of transcript events, set the receiveTranscriptions parameter in TVIConnectOptions to true. The default value is false. The value can be retrieved using the isReceiveTranscriptionsEnabled getter. Once it is set to true and Real-Time Transcriptions is enabled at the Room level, callback events containing the transcribed text start to flow via the transcriptionReceived(room:transcription:) method in the RoomDelegate protocol.

Example:

```swift
let options = ConnectOptions(token: accessToken, block: { (builder) in
    builder.roomName = "test"
    builder.isReceiveTranscriptionsEnabled = true
})
```

Starting and stopping transcript events with the Android SDK


To receive transcription events, set the receiveTranscriptions parameter in ConnectOptions to true. The default value is false. To check the current setting, call isReceiveTranscriptionsEnabled().

Once set to true and Real-Time Transcriptions is enabled for the Room, callback events containing the transcribed text are delivered through the onTranscription(@NonNull Room room, @NonNull JSONObject json) method of the Room.Listener interface.

Example:

```java
ConnectOptions connectOptions = new ConnectOptions.Builder(accessToken)
    .receiveTranscriptions(true)
    .build();

Video.connect(context, connectOptions, roomListener);
```

Twilio Console configuration


To enable and configure Real-Time Transcriptions in the Twilio Console, navigate to the Video Room settings page.



AI Nutrition Facts

Real-Time Transcriptions for Video uses third-party artificial intelligence and machine learning technologies.

Twilio's AI Nutrition Facts provide an overview of the AI feature you're using, so you can better understand how the AI is working with your data. The following Speech to Text Transcriptions Nutrition Facts label outlines the AI qualities of Real-Time Transcriptions for Video. For more information, see Twilio's AI Nutrition Facts page.

AI Nutrition Facts

Speech to Text Transcriptions - Programmable Voice, Twilio Video, and Conversational Intelligence

Description
Generate speech to text voice transcriptions (real-time and post-call) in Programmable Voice, Twilio Video, and Conversational Intelligence.
Privacy Ladder Level
N/A
Feature is Optional
Yes
Model Type
Generative and Predictive - Automatic Speech Recognition
Base Model
Deepgram Speech-to-Text, Google Speech-to-Text, Amazon Transcribe

Trust Ingredients

Base Model Trained with Customer Data
No

Conversational Intelligence, Programmable Voice, and Twilio Video only use the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.

Customer Data is Shared with Model Vendor
No

Conversational Intelligence, Programmable Voice, and Twilio Video only use the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.

Training Data Anonymized
N/A

Base Model is not trained using any customer data.

Data Deletion
Yes

Transcriptions are deleted by the customer using the Conversational Intelligence API or when a customer account is deprovisioned.

Human in the Loop
Yes

The customer views output in the Conversational Intelligence API or Transcript Viewer.

Data Retention
Until the customer deletes

Compliance

Logging & Auditing
Yes

The customer can listen to the input (recording) and view the output (transcript).

Guardrails
Yes

The customer can listen to the input (recording) and view the output (transcript).

Input/Output Consistency
Yes

The customer is responsible for human review.

Other Resources
https://www.twilio.com/docs/conversational-intelligence