
Real-time Transcriptions for Video



Beta

Real-Time Transcriptions is in Public Beta. The information in this document could change. We might add or update features before the product becomes Generally Available. Beta products don't have a Service Level Agreement (SLA). Learn more about beta product support.

Real-Time Transcriptions Public Beta is not HIPAA eligible.


Legal notice

Real-Time Transcriptions uses artificial intelligence or machine learning technologies. By enabling or using any of the features or functionalities within Twilio Video that are identified as using artificial intelligence or machine learning technology, you acknowledge and agree that your use of these features or functionalities is subject to the terms of the Predictive and Generative AI/ML Features Addendum.


Overview


Real-Time Transcriptions converts speech from any participant in a Video Room into text and sends that text to the Video Client SDKs (JavaScript, iOS, and Android). Your application can render the text in any style and format. Twilio supports multiple speech models and you can choose the model that best fits your use case.

Your app can implement transcriptions in two ways:

  • Start automatically when your app creates a Video Room.
  • Start, stop, or restart on demand while the Room is active.

You enable Real-Time Transcriptions at the Video Room level, so every participant is transcribed. You configure the spoken language and speech model, and the settings remain in effect until the Room ends. You can also set a default configuration in the Twilio Console.

When transcription is active, Twilio delivers the transcribed text, along with the Participant SID, to every participant in the Room.

If you enable partial results, the transcription engine delivers interim results so that your app can refresh the UI in near real time.
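As an illustration of how an app might consume interim and final results, here is a plain JavaScript sketch. The helper and its names are hypothetical (not part of any Twilio SDK); the event fields mirror the delivery format described later on this page. The app keeps one in-progress caption per audio track, overwrites it on each interim result, and commits it when the final result arrives:

```javascript
// Hypothetical caption-state helpers; illustrative only, not a Twilio API.
function createCaptionState() {
  return { interim: new Map(), finals: [] };
}

function applyTranscriptionEvent(state, event) {
  if (event.partial_results) {
    // Interim result: replace the current caption for this track.
    state.interim.set(event.track, event.transcription);
  } else {
    // Final result: commit the utterance and clear the interim caption.
    state.finals.push(event.transcription);
    state.interim.delete(event.track);
  }
  return state;
}
```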


Transcription resource


Transcription is a subresource of the Room resource that represents a Room's transcript. The resource URI is:

/v1/Rooms/{RoomNameOrSid}/Transcriptions/{TranscriptionTtid}
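For illustration, a hypothetical helper (not part of any Twilio SDK) that interpolates this URI template into an absolute URL:

```javascript
// Hypothetical helper; builds the absolute resource URL from the
// URI template above. Illustrative only, not a Twilio API.
const VIDEO_BASE_URL = 'https://video.twilio.com';

function transcriptionUrl(roomNameOrSid, transcriptionTtid) {
  return `${VIDEO_BASE_URL}/v1/Rooms/${roomNameOrSid}/Transcriptions/${transcriptionTtid}`;
}
```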

| Property | Type | Description |
| --- | --- | --- |
| `ttid` | ttid | The Twilio Type ID of the Transcription resource, assigned at creation time in the format `video_transcriptions_{uuidv7-special-encoded}`. |
| `room_sid` | `SID<RM>` | The SID of the Room instance that is the parent of the Transcription resource. |
| `account_sid` | `SID<AC>` | The SID of the account that owns the Transcription resource. |
| `status` | `enum<string>` | The status of the Transcription resource: `started`, `stopped`, or `failed`. The resource is created with status `started` by default. |
| `configuration` | `object<string-map>` | Key-value map of configuration parameters applied to audio track transcriptions. The following keys are supported: `PartialResults`, `LanguageCode`, `TranscriptionEngine`, `ProfanityFilter`, `SpeechModel`, `Hints`, `EnableAutomaticPunctuation`. |
| `date_created` | `string<date-time>` | The date and time in GMT when the resource was created, in ISO 8601 format. |
| `date_updated` | `string<date-time>` | The date and time in GMT when the resource was last updated, in ISO 8601 format. |
| `start_time` | `string<date-time>` | The date and time in GMT when the resource last transitioned to status `started`, in ISO 8601 format. |
| `end_time` | `string<date-time>` | The date and time in GMT when the resource last transitioned to status `stopped`, in ISO 8601 format. |
| `duration` | `integer` | The time in seconds the transcription has been in status `started`. |
| `url` | `string<uri>` | The absolute URL of the resource. |

The Transcription resource transitions to status failed if an internal error is detected that prevents the transcriptions from being generated. The Twilio Console receives a debug event with the details of the failure. The resource can't be restarted once a failure is detected.
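The status rules above can be summarized in a small sketch (plain JavaScript, illustrative only, not a Twilio API): `started` and `stopped` can toggle, while `failed` is terminal.

```javascript
// Illustrative status-transition table: a started transcription can be
// stopped, a stopped one can be restarted, and a failed one is terminal.
const ALLOWED_TRANSITIONS = {
  started: ['stopped'],
  stopped: ['started'],
  failed: [],
};

function canTransition(from, to) {
  return (ALLOWED_TRANSITIONS[from] || []).includes(to);
}
```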

Transcription configuration properties

| Name | Type | Optional or Required | Description |
| --- | --- | --- | --- |
| `transcriptionEngine` | string | Optional | The transcription engine to use, among those supported by Twilio. Default is `"google"`. |
| `speechModel` | string | Optional | Recognition model used by the transcription engine, among those supported by the provider. Defaults to Google's `"telephony"`. |
| `languageCode` | string | Optional | Language code used by the transcription engine, in BCP-47 format. Default is `"en-US"`. This attribute helps the transcription engine correctly understand and process the spoken language. |
| `partialResults` | boolean | Optional | Whether to send partial results. Default is `false`. When enabled, the transcription engine sends interim results as the transcription progresses, providing more immediate feedback before the final result is available. |
| `profanityFilter` | boolean | Optional | Whether the server attempts to filter out profanities, replacing all but the initial character in each filtered word with asterisks (a Google feature). Default is `true`. |
| `hints` | string | Optional | A comma-separated list of words or phrases that the transcription provider can expect to encounter during the video call. Hints can improve the provider's recognition of expected words or phrases. Up to 500 words or phrases can be provided, each up to 100 characters. Separate the words in a phrase with spaces. |
| `enableAutomaticPunctuation` | boolean | Optional | Whether the provider adds punctuation to the transcribed text. Default is `true`. When enabled, the transcription engine automatically inserts punctuation marks such as periods, commas, and question marks, improving the readability of the transcribed text. |
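The hints limits can also be enforced client-side before sending the request. Here is a hypothetical JavaScript helper (`buildHints` is illustrative, not a Twilio API) that joins phrases with commas and checks the documented limits:

```javascript
// Hypothetical helper: builds the comma-separated hints string and
// enforces the documented limits of at most 500 entries, each at most
// 100 characters. Words within a phrase are separated by spaces.
function buildHints(phrases) {
  if (phrases.length > 500) {
    throw new Error('At most 500 hint words or phrases are allowed');
  }
  for (const phrase of phrases) {
    if (phrase.length > 100) {
      throw new Error(`Hint entry exceeds 100 characters: ${phrase}`);
    }
  }
  return phrases.join(',');
}
```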

Transcription engines and speech models


The following table lists the possible values for the transcriptionEngine and the associated speechModel properties.

| Transcription engine | Speech model | Description |
| --- | --- | --- |
| `google` | `telephony` | Use this model for audio that originated from an audio phone call, typically recorded at an 8 kHz sampling rate. |
| `google` | `medical_conversation` | Use this model for conversations between a medical provider (for example, a doctor or nurse) and a patient. |
| `google` | `long` | Use this model for any kind of long-form content, such as media or spontaneous speech and conversations. Consider using this model instead of the video or the default model, especially if they aren't available in your target language. |
| `google` | `short` | Use this model for short utterances a few seconds in length. It is useful for capturing commands or other single-shot directed speech. Consider using this model instead of the command and search model. |
| `google` | `telephony_short` | Dedicated version of the telephony model for short or even single-word utterances from audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Useful for utterances only a few seconds long in customer service, teleconferencing, and automated kiosk applications. |
| `google` | `medical_dictation` | Use this model to transcribe notes dictated by a medical professional, for example, a doctor dictating notes about a patient's blood test results. |
| `google` | `chirp_2` | The next generation of Google's Universal Speech Model (USM), powered by large language model technology. Supports streaming and batch transcription and translation across diverse linguistic content, with multilingual capabilities. |
| `google` | `chirp_telephony` | Universal Speech Model (USM) fine-tuned for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. |
| `google` | `chirp` | Universal Speech Model (USM) for state-of-the-art non-streaming transcription across diverse linguistic content, with multilingual capabilities. |
| `deepgram` | `nova-2` | Recommended for most use cases. |


Create a Room with transcriptions enabled


To create a Video Room with Real-Time Transcriptions automatically enabled, add the following two parameters to the Room POST request:

| Parameter | Type | Description |
| --- | --- | --- |
| `TranscribeParticipantsOnConnect` | boolean | Whether to start Real-Time Transcriptions when Participants connect. Default is `false`. |
| `TranscriptionsConfiguration` | object | Key-value configuration settings for the transcription engine. For more information, see Transcription configuration properties. |

To automatically enable transcriptions on the Video Room, set the TranscribeParticipantsOnConnect parameter to true.

Example:

```shell
curl -X POST "https://video.twilio.com/v1/Rooms" \
  --data-urlencode 'TranscriptionsConfiguration={"languageCode": "EN-us", "partialResults": true}' \
  --data-urlencode "TranscribeParticipantsOnConnect=true" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "audio_only": false,
  "date_created": "2025-06-12T15:42:32Z",
  "date_updated": "2025-06-12T15:42:32Z",
  "duration": null,
  "empty_room_timeout": 5,
  "enable_turn": true,
  "end_time": null,
  "large_room": false,
  "links": {
    "participants": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Participants",
    "recording_rules": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/RecordingRules",
    "recordings": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Recordings",
    "transcriptions": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions"
  },
  "max_concurrent_published_tracks": 170,
  "max_participant_duration": 14400,
  "max_participants": 50,
  "media_region": "us1",
  "record_participants_on_connect": false,
  "sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "status": "in-progress",
  "status_callback": null,
  "status_callback_method": "POST",
  "type": "group",
  "unique_name": "test",
  "unused_room_timeout": 5,
  "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "video_codecs": [
    "VP8",
    "H264"
  ]
}
```

Start transcriptions on an existing Room


Create a Transcription with a POST request to the resource URI:

/v1/Rooms/{RoomNameOrSid}/Transcriptions

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `RoomSid` | `SID<RM>` | The SID of the parent Room in which the Transcription resource is created. |

Request body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `Configuration` | `object<string-map>` | Object with key-value configurations. See the `configuration` property of the Transcription resource above for a description of supported keys. |

Example:

```shell
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions" \
  --data-urlencode 'Configuration={"languageCode": "EN-us", "partialResults": true, "profanityFilter": true, "speechModel": "long"}' \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "configuration": {
    "languageCode": "EN-us",
    "partialResults": "true",
    "profanityFilter": "true",
    "speechModel": "long"
  },
  "date_created": "2025-07-22T14:14:35Z",
  "date_updated": null,
  "duration": null,
  "end_time": null,
  "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "start_time": null,
  "status": "started",
  "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

Stop transcription on a Room


Update a Transcription resource with a POST request to the resource instance URI:

/v1/Rooms/{RoomSid}/Transcriptions/{ttid}

To stop transcriptions on a Room, set the Status to stopped.

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `ttid` | ttid | The TTID of the Transcription resource being updated. The current implementation supports a single Transcription resource, but this might change in future implementations. |
| `RoomSid` | `SID<RM>` | The SID of the parent Room in which the Transcription resource is updated. |

Request body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `Status` | `enum<string>` | New status of the Transcription resource. Can be: `started`, `stopped`. There is no state transition if the resource's `status` property already has the same value or if the parameter is missing. |
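Because posting an unchanged or missing status causes no state transition, a client can decide locally whether an update request is worthwhile. A minimal sketch (illustrative names, not a Twilio API):

```javascript
// Illustrative: posting an unchanged or missing Status is a no-op per the
// parameter description above, so such requests can be skipped.
function needsStatusUpdate(currentStatus, desiredStatus) {
  return desiredStatus !== undefined && desiredStatus !== currentStatus;
}
```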

Example:

```shell
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  --data-urlencode "Status=stopped" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "configuration": {
    "languageCode": "EN-us",
    "partialResults": "true"
  },
  "date_created": "2025-07-22T12:55:30Z",
  "date_updated": "2025-07-22T12:56:02Z",
  "duration": null,
  "end_time": null,
  "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "start_time": null,
  "status": "stopped",
  "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

Restart transcriptions on a Room


To restart transcriptions on a Room that has a Transcription resource with a stopped state, update it with a POST request to the resource instance URI:

/v1/Rooms/{RoomSid}/Transcriptions/{ttid}

To restart transcription, set the Status to started.

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `ttid` | ttid | The TTID of the Transcription resource being updated. The current implementation supports a single Transcription resource, but this might change in future implementations. |
| `RoomSid` | `SID<RM>` | The SID of the parent Room in which the Transcription resource is updated. |

Request body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `Status` | `enum<string>` | New status of the Transcription resource. Use `started` to restart transcriptions. There is no state transition if the resource's `status` property already has the same value or if the parameter is missing. |

Example:

```shell
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  --data-urlencode "Status=started" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "configuration": {
    "languageCode": "EN-us",
    "partialResults": "true"
  },
  "date_created": "2025-07-22T12:57:24Z",
  "date_updated": null,
  "duration": null,
  "end_time": null,
  "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "start_time": null,
  "status": "started",
  "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

Retrieve the Transcription resource for a Room


Retrieve the Transcription resource in a Room with a GET request to the resource URI:

/v1/Rooms/{RoomSid}/Transcriptions

Real-Time Transcriptions supports only a single instance of the Transcription resource per Room, so the list will always have a single item.
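Because the list always contains at most one item, client code can simply take the first element of the `transcriptions` array in the list response. A sketch (hypothetical helper, not a Twilio API):

```javascript
// Illustrative: a Room has at most one Transcription, so the first
// element of the list response (if any) is the active resource.
function singleTranscription(listResponse) {
  const items = listResponse.transcriptions || [];
  return items.length > 0 ? items[0] : null;
}
```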

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `RoomSid` | `SID<RM>` | The SID of the parent Room from which Transcription resources are retrieved. |

Example:

```shell
curl -X GET "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "meta": {
    "first_page_url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions?PageSize=50&Page=0",
    "key": "transcriptions",
    "next_page_url": null,
    "page": 0,
    "page_size": 50,
    "previous_page_url": null,
    "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions?PageSize=50&Page=0"
  },
  "transcriptions": [
    {
      "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "configuration": {},
      "date_created": "2025-07-22T11:05:41Z",
      "date_updated": null,
      "duration": null,
      "end_time": null,
      "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "start_time": null,
      "status": "started",
      "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
      "url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  ]
}
```

Fetch a single Transcription resource


Retrieve a specific Transcription resource with a GET request to the instance resource URI:

/v1/Rooms/{RoomSid}/Transcriptions/{ttid}

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `ttid` | ttid | The TTID of the Transcription resource being requested. |
| `RoomSid` | `SID<RM>` | The SID of the parent Room from which the Transcription resource is retrieved. |

Example:

```shell
curl -X GET "https://video.twilio.com/v1/Rooms/$sid/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  -u $API_Key_Sid:$API_Key_Secret
```

Response:

```json
{
  "account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "configuration": {
    "LanguageCode": "EN-us",
    "ProfanityFilter": "true"
  },
  "date_created": null,
  "date_updated": null,
  "duration": null,
  "end_time": null,
  "links": {
    "transcriptions": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  },
  "room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "start_time": null,
  "status": "started",
  "ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

Transcribed text delivery


Twilio delivers transcribed text to the client SDKs through callback events.

The schema of the JSON delivery format contains a version number. Each event contains the transcription of a single utterance and details of the participant who generated the audio.

```yaml
properties:
  type:
    const: extension_transcriptions

  version:
    description: |
      Version of the transcriptions protocol used by this message. It is semver compliant.

  track:
    $ref: /Server/State/RemoteTrack
    description: |
      Audio track from which the transcription has been generated.

  participant:
    $ref: /Server/State/Participant
    description: |
      The participant who published the audio track from which the
      transcription has been generated.

  sequence_number:
    type: integer
    description: |
      Sequence number. Starts at one and increments monotonically. A sequence
      counter is defined for each track to allow the receiver to identify
      missing messages.

  timestamp:
    type: string
    description: |
      Absolute time from the real-time transcription. It is
      conformant with UTC ISO 8601.

  partial_results:
    type: boolean
    description: |
      Whether the transcription is a final or a partial result.

  language_code:
    type: string
    description: |
      Language code of the transcribed text. It is conformant with BCP-47.

  transcription:
    type: string
    description: |
      Utterance transcription.
```
Example:

```json
{
  "version": "1.0",
  "language_code": "en-US",
  "partial_results": false,
  "participant": "PA00000000000000000000000000000000",
  "sequence_number": 3,
  "timestamp": "2025-01-01T12:00:00.000000000Z",
  "track": "MT00000000000000000000000000000000",
  "transcription": "This is a test",
  "type": "extension_transcriptions"
}
```
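Since `sequence_number` starts at one and increments monotonically per track, a receiver can detect missed events by tracking the last number seen for each track. An illustrative sketch (hypothetical helper, not part of any Twilio SDK):

```javascript
// Illustrative gap detector: remembers the last sequence_number seen per
// audio track and reports how many events were missed in between.
function createGapDetector() {
  const lastSeen = new Map();
  return function countMissed(event) {
    const previous = lastSeen.get(event.track) || 0;
    lastSeen.set(event.track, event.sequence_number);
    return Math.max(0, event.sequence_number - previous - 1);
  };
}
```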

Starting and stopping transcript events with the JavaScript SDK


To enable the flow of transcript events, set the receiveTranscriptions parameter in connectOptions to true. The default value is false. Once it is set to true and Real-Time Transcriptions is enabled at the Room level, callback events containing the transcribed text start to flow.

Example:

```javascript
import { connect } from 'twilio-video';

const room = await connect(token, {
  name: 'my-room',
  receiveTranscriptions: true
});

room.on('transcription', (transcriptionEvent) => {
  console.log(`${transcriptionEvent.participant}: ${transcriptionEvent.transcription}`);
});
```

Starting and stopping transcript events with the iOS SDK


To enable the flow of transcript events, set the receiveTranscriptions parameter in TVIConnectOptions to true. The default value is false. The value can be retrieved using the isReceiveTranscriptionsEnabled getter. Once it is set to true and Real-Time Transcriptions is enabled at the Room level, callback events containing the transcribed text start to flow via the transcriptionReceived(room:transcription:) method in the RoomDelegate protocol.

Example:

```swift
let options = ConnectOptions(token: accessToken, block: { (builder) in
    builder.roomName = "test"
    builder.isReceiveTranscriptionsEnabled = true
})
```

Starting and stopping transcript events with the Android SDK


To receive transcription events, set the receiveTranscriptions parameter in ConnectOptions to true. The default value is false. To check the current setting, call isReceiveTranscriptionsEnabled().

Once set to true and Real-Time Transcriptions is enabled for the Room, callback events containing the transcribed text are delivered through the onTranscription(@NonNull Room room, @NonNull JSONObject json) method of the Room.Listener interface.

Example:

```java
ConnectOptions connectOptions = new ConnectOptions.Builder(accessToken)
    .receiveTranscriptions(true)
    .build();

Video.connect(context, connectOptions, roomListener);
```

Twilio Console configuration


To enable and configure Real-Time Transcriptions in the Twilio Console, navigate to the Video Room settings page.



AI Nutrition Facts

Real-Time Transcriptions for Video uses third-party artificial intelligence and machine learning technologies.

Twilio's AI Nutrition Facts provide an overview of the AI feature you're using, so you can better understand how the AI is working with your data. The following Speech to Text Transcriptions Nutrition Facts label outlines the AI qualities of Real-Time Transcriptions for Video. For more information, see Twilio's AI Nutrition Facts page.

AI Nutrition Facts

Speech to Text Transcriptions - Programmable Voice, Twilio Video, and Conversational Intelligence

Description
Generate speech to text voice transcriptions (real-time and post-call) in Programmable Voice, Twilio Video, and Conversational Intelligence.
Privacy Ladder Level
N/A
Feature is Optional
Yes
Model Type
Generative and Predictive - Automatic Speech Recognition
Base Model
Deepgram Speech-to-Text, Google Speech-to-Text, Amazon Transcribe

Trust Ingredients

Base Model Trained with Customer Data
No

Conversational Intelligence, Programmable Voice, and Twilio Video only use the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.

Customer Data is Shared with Model Vendor
No

Conversational Intelligence, Programmable Voice, and Twilio Video only use the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.

Training Data Anonymized
N/A

Base Model is not trained using any customer data.

Data Deletion
Yes

Transcriptions are deleted by the customer using the Conversational Intelligence API or when a customer account is deprovisioned.

Human in the Loop
Yes

The customer views output in the Conversational Intelligence API or Transcript Viewer.

Data Retention
Until the customer deletes

Compliance

Logging & Auditing
Yes

The customer can listen to the input (recording) and view the output (transcript).

Guardrails
Yes

The customer can listen to the input (recording) and view the output (transcript).

Input/Output Consistency
Yes

The customer is responsible for human review.

Other Resources
https://www.twilio.com/docs/conversational-intelligence