TwiML™ Voice: <Transcription>
Legal notice
Real-Time Transcriptions, including the <Transcription> TwiML noun and API, use artificial intelligence or machine learning technologies. By enabling or using any of the features or functionalities within Programmable Voice that are identified as using artificial intelligence or machine learning technology, you acknowledge and agree that your use of these features or functionalities is subject to the terms of the Predictive and Generative AI/ML Features Addendum.
The <Transcription> TwiML noun allows you to transcribe live calls in near real-time. It is used in conjunction with <Start>. When Twilio executes the <Start><Transcription> instruction during a call, it forks the raw audio stream to a speech-to-text transcription engine that can provide streaming responses almost instantly.
This page covers <Transcription>'s supported attributes and provides sample code.
Important Notes
The <Transcription> TwiML noun is associated with Twilio's Real-Time Transcriptions product. It is not to be confused with Recording Transcriptions.
Consumers of <Transcription> should leverage the statusCallbackUrl webhook for live processing of conversation utterances in their applications.
Real-Time Transcription persistence and post-call language intelligence support come from integration with Conversational Intelligence. To store your transcripts with Twilio or run Language Operators after the call, add the intelligenceService attribute when starting a Real-Time Transcription session. Note: When using either Deepgram or Google as the transcriptionEngine value, Twilio supports persisted transcripts.
Below is a basic example of <Start><Transcription>:
```xml
<Start>
  <Transcription statusCallbackUrl="https://example.com/your-callback-url"/>
</Start>
```
The table below lists <Transcription>'s supported attributes, which modify the <Transcription> behavior. All attributes are optional.
| Attribute Name | Allowed Values | Default Value |
|---|---|---|
| name | Unique name for the Real-Time Transcription | None |
| statusCallbackUrl | An absolute URL | None |
| languageCode | A BCP-47 standard language code that identifies human languages (e.g., en-US) | en-US |
| track | inbound_track, outbound_track, both_tracks | both_tracks |
| inboundTrackLabel | An alphanumeric label to associate with the inbound track being transcribed | None |
| outboundTrackLabel | An alphanumeric label to associate with the outbound track being transcribed | None |
| transcriptionEngine | Name of the speech-to-text transcription provider, e.g., google or deepgram | google |
| speechModel | Any speechModel value from the list of Twilio-supported Google STT v2 speech models (except Chirp2 models and languages supported only by Chirp2), or nova-2 or nova-3 for Deepgram | telephony |
| profanityFilter | (Google only) true or false | true |
| partialResults | (Google only) true or false | false |
| hints | (Google, Deepgram nova-2, and Deepgram nova-3 monolingual variants only) Comma-separated list of expected phrases or keywords for recognition | None |
| enableAutomaticPunctuation | (Google only) true or false | true |
| intelligenceService | (Google only) The Intelligence Service SID or unique name for persisting transcripts and running Language Operators | None |
The name attribute is the user-specified name of this Real-Time Transcription. This name can be used to stop the Real-Time Transcription.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', name: 'Contact center transcription'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" name="Contact center transcription" />
  </Start>
</Response>
```
The statusCallbackUrl attribute is the absolute URL of an endpoint. Twilio sends Real-Time Transcription status updates and the call's transcript data to this URL.
Twilio sends a POST request to this URL whenever one of the following occurs:
- A Real-Time Transcription session starts. This is called the transcription-started event.
- Utterances (partial or final) of transcribed audio are available. This is called the transcription-content event.
- A Real-Time Transcription session stops. This is called the transcription-stopped event. This event occurs when a Real-Time Transcription session is stopped via API or TwiML, or when the call ends.
- An error occurs. This is called the transcription-error event.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url"/>
  </Start>
</Response>
```
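When Twilio sends these callbacks, it POSTs the parameters described in the event tables below as form-encoded data. The following is a minimal sketch of a statusCallbackUrl receiver, assuming an Express app (the route path mirrors the example URL above and is not prescribed by Twilio):

```js
const express = require('express');

const app = express();
// Twilio sends status callback parameters as form-encoded POST data
app.use(express.urlencoded({ extended: false }));

app.post('/your-callback-url', (req, res) => {
  const { TranscriptionEvent, TranscriptionSid } = req.body;

  switch (TranscriptionEvent) {
    case 'transcription-started':
      console.log(`Session ${TranscriptionSid} started`);
      break;
    case 'transcription-content': {
      // TranscriptionData arrives as a JSON string (see the tables below)
      const data = JSON.parse(req.body.TranscriptionData);
      console.log(`[${req.body.Track}] ${data.transcript}`);
      break;
    }
    case 'transcription-stopped':
      console.log(`Session ${TranscriptionSid} stopped`);
      break;
    case 'transcription-error':
      console.error(`Error ${req.body.TranscriptionErrorCode}: ${req.body.TranscriptionError}`);
      break;
  }

  // Acknowledge receipt so Twilio doesn't record a webhook failure
  res.sendStatus(200);
});

app.listen(3000);
```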
When a Real-Time Transcription is started and a session is created, Twilio sends an HTTP POST request to your statusCallbackUrl for the transcription-started event. This event provides initial details about the transcription session.
These HTTP requests contain the properties listed below.
| Property | Description | Example |
|---|---|---|
| AccountSid | Twilio Account SID | AC11b76cdc7d217e72a72be6422d46a7ca |
| CallSid | Twilio Call SID | CA57af2620f427810cb4e430371e8d6e0f |
| TranscriptionSid | Unique identifier for this Real-Time Transcription session | GT20dfa03c8cf8aa8d0c4aeccde5558b66 |
| Timestamp | Time of the event in UTC ISO 8601 timestamp | 2023-10-19T22:33:22.611Z |
| SequenceId | Integer sequence number of the event | 1 |
| TranscriptionEvent | The event type | transcription-started |
| ProviderConfiguration | JSON string of the transcription provider's configuration | {\"profanityFilter\":\"true\",\"speechModel\":\"telephony\",\"enableAutomaticPunctuation\":\"true\",\"hints\":\"Alice Johnson, Bob Martin, ACME Corp, XYZ Enterprises, product demo, sales inquiry, customer feedback\"} |
| TranscriptionEngine | The name of the transcription engine | google |
| Name | Friendly name of the Real-Time Transcription session | session1 |
| Track | The track being transcribed: inbound_track, outbound_track, or both_tracks | inbound_track |
| InboundTrackLabel | Label associated with the inbound track | customer |
| OutboundTrackLabel | Label associated with the outbound track | agent |
| PartialResults | Whether partial results are enabled (true or false) | true |
| LanguageCode | The language code for the transcription | en-US |
Example of a transcription-started event payload:
1{2"TranscriptionSid": "GT8fbf72a043b98407a3ce68331cd0030a",3"Timestamp": "2024-06-25T18:45:12.135751Z",4"AccountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",5"ProviderConfiguration": "{\"profanityFilter\":\"true\",\"speechModel\":\"telephony\",\"enableAutomaticPunctuation\":\"true\",\"hints\":\"Alice Johnson, Bob Martin, ACME Corp, XYZ Enterprises, product demo, sales inquiry, customer feedback\"}",6"Name": "Chris Transcription",7"OutboundTrackLabel": "agent",8"LanguageCode": "en-US",9"PartialResults": "false",10"InboundTrackLabel": "customer",11"TranscriptionEvent": "transcription-started",12"CallSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",13"TranscriptionEngine": "google",14"Track": "both_tracks",15"SequenceId": "1"16}
When an individual utterance (partial or final) of audio is transcribed, Twilio sends an HTTP POST request to your statusCallbackUrl for the transcription-content event. This event provides TranscriptionData results for the transcribed audio.
Stability and Confidence
Stability and Confidence depend on partialResults. For example, if partialResults is true, then the stability property will be included in the event payload, and confidence will not. However, if partialResults is false, the opposite will be true. Always refer to Google's specific documentation (examples) for more details on each of these properties.
These HTTP requests contain the properties listed below.
| Property | Description | Example |
|---|---|---|
| AccountSid | Twilio Account SID | AC11b76cdc7d217e72a72be6422d46a7ca |
| CallSid | Twilio Call SID | CA57af2620f427810cb4e430371e8d6e0f |
| TranscriptionSid | Unique identifier for this Real-Time Transcription session | GT20dfa03c8cf8aa8d0c4aeccde5558b66 |
| Timestamp | Time of the event in UTC ISO 8601 timestamp | 2023-10-19T22:33:22.611Z |
| SequenceId | Integer sequence number of the event | 2 |
| TranscriptionEvent | The event type | transcription-content |
| LanguageCode | A BCP-47 standard language code (e.g. "en-US") | en-US |
| Track | The track being transcribed: inbound_track or outbound_track | inbound_track |
| TranscriptionData | JSON string containing transcription content. Note that TranscriptionData.Confidence is a decimal number. | {\"Transcript\":\"to be or not to be\",\"Confidence\":0.96823084} |
| Stability | A string representing an estimate of the likelihood that Google will not change the guess it made about this partial result transcript. This property is only provided when partialResults is true. | Range between 0.0 (unstable) and 1.0 (stable). Example: 0.8 |
| Final | Boolean value indicating whether this event contains a final utterance (true) or a partial utterance (false) | false |
Example of a transcription-content event payload when partialResults is equal to false:
1{2"LanguageCode": "en-US",3"TranscriptionSid": "GT8fbf72a043b98407a3ce68331cd0030a",4"TranscriptionEvent": "transcription-content",5"CallSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",6"TranscriptionData": "{\"transcript\":\"Hello, this is Sam from Horizon Financial Services. Just letting you know this call may be recorded for quality purposes. How can I assist you today?\",\"confidence\":0.9956335}",7"Timestamp": "2024-06-25T18:45:21.454203Z",8"Final": "true",9"AccountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",10"Track": "outbound_track",11"SequenceId": "2"12}
Example of a transcription-content event payload when partialResults is equal to true:
1{2"LanguageCode": "en-US",3"TranscriptionSid": "GT6ebb54a123f0c86b70605a4925836f69",4"Stability": "0.9",5"TranscriptionEvent": "transcription-content",6"CallSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",7"TranscriptionData": "{\"transcript\":\"Hello, this is Sam from Horizon Financial Services. Just letting you know this call may be recorded for\"}",8"Timestamp": "2024-06-25T16:30:21.600697Z",9"Final": "false",10"AccountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",11"Track": "outbound_track",12"SequenceId": "70"13}
When a Real-Time Transcription session is stopped or ends, Twilio sends an HTTP POST request to your statusCallbackUrl for the transcription-stopped event. This event provides final details about the transcription session.
These HTTP requests contain the properties listed below.
| Property | Description | Example |
|---|---|---|
| AccountSid | Twilio Account SID | AC11b76cdc7d217e72a72be6422d46a7ca |
| CallSid | Twilio Call SID | CA57af2620f427810cb4e430371e8d6e0f |
| TranscriptionSid | Unique identifier for this Real-Time Transcription session | GT20dfa03c8cf8aa8d0c4aeccde5558b66 |
| Timestamp | Time of the event, in UTC ISO 8601 format | 2023-10-19T22:33:22.611Z |
| SequenceId | Integer sequence number of the event | 3 |
| TranscriptionEvent | The event type | transcription-stopped |
An example of the transcription-stopped event payload:
1{2"TranscriptionSid": "GT8fbf72a043b98407a3ce68331cd0030a",3"TranscriptionEvent": "transcription-stopped",4"CallSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",5"Timestamp": "2024-06-25T18:45:23.839266Z",6"AccountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",7"SequenceId": "3"8}
When an error occurs during a Real-Time Transcription session, Twilio sends an HTTP POST request to your statusCallbackUrl for the transcription-error event.
Error Documentation
Documentation on Real-Time Transcription errors can be found in the Error and Warning Dictionary; error codes range from 32650 to 32655. Errors are also viewable in the Twilio Console.
These HTTP requests contain the properties listed below.
| Property | Description | Example |
|---|---|---|
| AccountSid | Twilio Account SID | ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX |
| CallSid | Twilio Call SID | CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX |
| TranscriptionSid | Unique identifier for this Real-Time Transcription session | GT20dfa03c8cf8aa8d0c4aeccde5558b66 |
| Timestamp | Time of the event in UTC ISO 8601 timestamp | 2023-10-19T22:33:22.611Z |
| SequenceId | Integer sequence number of the event | 3 |
| TranscriptionEvent | The event type | transcription-error |
| TranscriptionErrorCode | Error code | 32655 |
| TranscriptionError | Error description | Provider Unavailable |
Example of a transcription-error event payload:
1{2"TranscriptionSid": "GT20dfa03c8cf8aa8d0c4aeccde5558b66",3"Timestamp": "2023-10-19T22:33:22.611Z",4"AccountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",5"SequenceId": "3",6"TranscriptionEvent": "transcription-error",7"CallSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",8"TranscriptionErrorCode": "32655",9"TranscriptionError": "Provider Unavailable"10}
The languageCode attribute specifies the language in which the transcription should be performed. It accepts a BCP-47 standard language code, such as en-US for American English. This attribute is useful for ensuring that the transcription engine correctly understands and processes the spoken language.
The following TwiML example demonstrates how to specify the languageCode attribute for a Mexican Spanish transcription. This ensures that the transcription is performed in the specified language, which is particularly useful for calls in languages other than English.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', languageCode: 'es-MX'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" languageCode="es-MX" />
  </Start>
</Response>
```
The track attribute specifies which audio track should be transcribed. It can take one of the following values: inbound_track, outbound_track, or both_tracks. This attribute is useful for determining whether to transcribe the audio coming from the caller, the callee, or both.
The following TwiML example demonstrates how to specify the track attribute for a transcription.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', track: 'inbound_track'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" track="inbound_track" />
  </Start>
</Response>
```
The inboundTrackLabel attribute allows you to associate an alphanumeric label with the inbound track being transcribed. This can be useful for identifying and differentiating the inbound audio stream in the transcription results. Using labels helps to clearly identify who is speaking, especially in multi-party conversations or call center scenarios.
Refer to the Track labels section below to understand the importance of using labels.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', inboundTrackLabel: 'agent', outboundTrackLabel: 'customer'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" inboundTrackLabel="agent" outboundTrackLabel="customer" />
  </Start>
</Response>
```
In an inbound call scenario, the call is initiated by the customer and received by the agent or business person. Here, the inbound audio track (agent's speech) is labeled for clarity in the transcription results.
```xml
<Response>
  <Start>
    <Transcription track="inbound_track" inboundTrackLabel="agent" />
  </Start>
</Response>
```
In this example, the inbound audio track is labeled as "agent". This is useful for scenarios like customer support calls, where distinguishing the agent's responses from the customer's speech is crucial for understanding the interaction.
In an outbound call scenario, the call is initiated by the agent or business person and received by the customer. Here, the inbound audio track (customer's speech) is labeled for clarity in the transcription results.
```xml
<Response>
  <Start>
    <Transcription track="inbound_track" inboundTrackLabel="customer" />
  </Start>
</Response>
```
In this example, the inbound audio track is labeled as "customer". This is useful for scenarios like sales calls, where distinguishing the customer's speech in the transcription can help in analyzing customer feedback and engagement.
The outboundTrackLabel attribute allows you to associate an alphanumeric label with the outbound track being transcribed. This can be useful for identifying and differentiating the outbound audio stream in the transcription results. Using labels helps to clearly identify who is speaking, especially in multi-party conversations or call center scenarios.
Refer to the Track labels section below to understand the importance of using labels.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', inboundTrackLabel: 'agent', outboundTrackLabel: 'customer'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" inboundTrackLabel="agent" outboundTrackLabel="customer" />
  </Start>
</Response>
```
In an inbound call scenario, the call is initiated by the customer and received by the agent or business person. Here, the outbound audio track (customer's speech) is labeled for clarity in the transcription results.
```xml
<Response>
  <Start>
    <Transcription track="outbound_track" outboundTrackLabel="customer" />
  </Start>
</Response>
```
In this example, the outbound audio track is labeled as "customer". This is useful for scenarios like customer support calls, where distinguishing the customer's speech from the agent's responses is crucial for understanding the interaction.
In an outbound call scenario, the call is initiated by the agent or business person and received by the customer. Here, the outbound audio track (agent's speech) is labeled for clarity in the transcription results.
```xml
<Response>
  <Start>
    <Transcription track="outbound_track" outboundTrackLabel="agent" />
  </Start>
</Response>
```
In this example, the outbound audio track is labeled as "agent". This is useful for scenarios like sales calls, where distinguishing the agent's speech in the transcription can help in analyzing the effectiveness of the sales pitch.
To leverage specific features or optimizations that different transcription engines offer, set the transcriptionEngine attribute. For details about each provider's speech models, see speechModel in the following section. Both the google and deepgram transcription engines support persisted Transcript resources.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', transcriptionEngine: 'google'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" transcriptionEngine="google" />
  </Start>
</Response>
```
The speechModel attribute allows you to specify which speech model to use for the transcription.
Different speech models can optimize for different use cases, such as phone calls, video, or enhanced models for higher accuracy.
If Google is used as the transcriptionEngine, this maps to Transcription Model in Google terminology. Refer to the Google documentation to understand each speech model's specific capabilities and configurations.
The telephony speech model is optimized for phone call audio and can provide better accuracy for this type of audio.
The long speech model is optimized for long-form audio, such as lectures or extended conversations, and can provide better accuracy for lengthy audio.
When you set transcriptionEngine to google, Twilio only supports speech models and languages available on Google's global STT API endpoints. For the list of supported languages, see the Google STT v2 API Language List. This list excludes Chirp2 models and languages that only those models support.
If you use Deepgram as the transcriptionEngine, Real-Time Transcriptions rely on the nova-2 speech models or the nova-3 monolingual speech models in their supported languages. Support for Nova-3's language-detecting multilingual models is in public beta. Public beta products are not covered by a Twilio Service Level Agreement.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', speechModel: 'telephony', transcriptionEngine: 'google'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" speechModel="telephony" transcriptionEngine="google" />
  </Start>
</Response>
```
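For comparison, here is a sketch of the same instruction using Deepgram's nova-2 speech model, per the Deepgram model support described above:

```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', speechModel: 'nova-2', transcriptionEngine: 'deepgram'});

console.log(response.toString());
```

Output

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" speechModel="nova-2" transcriptionEngine="deepgram" />
  </Start>
</Response>
```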
The profanityFilter attribute allows you to enable or disable the filtering of profane words in the transcription. When enabled, the transcription engine attempts to mask or omit any detected profanities in the transcription results.
Warning
By default, the Google Transcription Engine enables the profanityFilter for all calls. The medical_conversation speechModel doesn't support profanityFilter. When using the medical_conversation speechModel, set the profanityFilter attribute to false. Deepgram's profanity filter only works for some languages.
The example below demonstrates how to explicitly set the profanity filter for the transcription. When the filter is enabled, any profane language is masked or omitted in the transcription output; this example disables it.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', profanityFilter: false, transcriptionEngine: 'google'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" profanityFilter="false" transcriptionEngine="google" />
  </Start>
</Response>
```
The partialResults attribute maps to StreamingRecognitionResult in Google terminology, specifically when is_final is false. It allows you to enable or disable the delivery of interim transcription results. When enabled, the transcription engine sends partial (interim) results as the transcription progresses, providing more immediate feedback before the final result is available.
The example below demonstrates how to enable partial results for the transcription.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', partialResults: true, transcriptionEngine: 'google'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" partialResults="true" transcriptionEngine="google" />
  </Start>
</Response>
```
The hints attribute contains a list of words or phrases that the transcription provider can expect to encounter during a Real-Time Transcription. Using the hints attribute can improve the transcription provider's recognition of words or phrases you expect from your callers.
Hints are not supported with the Nova-3 multilingual speech model
Twilio doesn't support setting hints in real-time transcriptions with the Nova-3 multilingual speech model. You can set hints with other Nova-3 models.
You may provide up to 500 words or phrases in the list of hints, separating each entry with a comma. Each hint may be up to 100 characters long; separate each word in a phrase with a space. For example:
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', hints: 'Alice Johnson, Bob Martin, ACME Corp, XYZ Enterprises, product demo, sales inquiry, customer feedback'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" hints="Alice Johnson, Bob Martin, ACME Corp, XYZ Enterprises, product demo, sales inquiry, customer feedback" />
  </Start>
</Response>
```
The hints attribute also supports Google's class token list to improve recognition. You can pass a class token directly in the hints attribute, as shown in the example below.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', hints: '$OOV_CLASS_ALPHANUMERIC_SEQUENCE'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" hints="$OOV_CLASS_ALPHANUMERIC_SEQUENCE" />
  </Start>
</Response>
```
The enableAutomaticPunctuation attribute maps to Automatic Punctuation in Google terminology. It allows you to enable or disable automatic punctuation in the transcription. When enabled, the transcription engine automatically inserts punctuation marks such as periods, commas, and question marks, improving the readability of the transcribed text.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', enableAutomaticPunctuation: true, transcriptionEngine: 'google'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" enableAutomaticPunctuation="true" transcriptionEngine="google" />
  </Start>
</Response>
```
The intelligenceService attribute allows you to opt in to sending your Real-Time Transcript to Twilio Conversational Intelligence for integrated post-processing. By enabling storage and analysis of calls transcribed in real time, this feature helps you extract actionable insights from transcripts. It runs in parallel to the statusCallbackUrl webhook, which streams utterance-level data and other session lifecycle events to your app during the call.
When enabled, this feature performs the following functions:
- Persists Live Transcripts: Stores real-time transcriptions in Conversational Intelligence's historical log for future reference and analysis.
- Runs Post-Call Language Operators: Triggers Language Operators configured in the referenced Intelligence Service. After the call ends, the Intelligence Service generates AI-powered insights and performs actions.
To use this feature, you need to meet the following conditions.
- Have or create an Intelligence Service.
- Set the intelligenceService parameter to the Intelligence Service SID or unique name.
Important Notes:
- To transcribe a call without recording it, pass an intelligenceService parameter without passing a statusCallbackUrl parameter.
- Language Operators are executed after the Real-Time Transcription session concludes, either automatically through the call ending or manually by stopping the live transcription.
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({statusCallbackUrl: 'https://example.com/your-callback-url', intelligenceService: 'GAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription statusCallbackUrl="https://example.com/your-callback-url" intelligenceService="GAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" />
  </Start>
</Response>
```
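Following the notes above, the sketch below starts a session that only persists to Conversational Intelligence: because no statusCallbackUrl is set, no webhook events are streamed to your application (the SID is a placeholder):

```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const start = response.start();
start.transcription({intelligenceService: 'GAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'});

console.log(response.toString());
```

Output

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Transcription intelligenceService="GAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" />
  </Start>
</Response>
```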
Twilio's transcription service supports a variety of languages and models. The following examples are specific to Google Speech-to-Text. Depending on the language, certain attributes like speechModel, profanityFilter, and enableAutomaticPunctuation may have different levels of support.
To verify support for languages and speech models, see the following resources:
- Google speech-to-text supported languages
- Deepgram Nova-2 supported languages
- Deepgram Nova-3 supported (monolingual) language variants documentation.
Warning
These examples are subject to change. To verify support for languages and speech models, customers should always refer back to the Google Speech-to-Text Supported Languages, the Deepgram nova-2 Supported Languages, or the Deepgram nova-3 Supported (monolingual) Language variants documents, as appropriate.
This example demonstrates how to configure transcription for Chinese (Simplified, China) using the Chirp Model with support for automatic punctuation.
```xml
<Response>
  <Start>
    <Transcription
      transcriptionEngine="google"
      languageCode="cmn-Hans-CN"
      speechModel="chirp"
      enableAutomaticPunctuation="true" />
  </Start>
</Response>
```
In this configuration, the profanityFilter attribute, the hints attribute, and other advanced features are not supported.
This example demonstrates how to configure transcription for Spanish (Spain) using the telephony model with full support for all attributes.
```xml
<Response>
  <Start>
    <Transcription
      transcriptionEngine="google"
      languageCode="es-ES"
      speechModel="telephony"
      profanityFilter="true"
      enableAutomaticPunctuation="true" />
  </Start>
</Response>
```
In this example, the telephony model supports automatic punctuation and profanity filter, but not model adaptation (e.g., hints).
This example demonstrates how to configure transcription for Hindi (India) using the short model with support for specific attributes.
```xml
<Response>
  <Start>
    <Transcription
      transcriptionEngine="google"
      languageCode="hi-IN"
      speechModel="short"
      enableAutomaticPunctuation="true"
      profanityFilter="true"
      hints="संपर्क, सेवा, समर्थन, ग्राहक"
      modelAdaptation="true" />
  </Start>
</Response>
```
In this example, the short model supports automatic punctuation, profanity filter, model adaptation, and hints.
This example demonstrates how to configure transcription for French (Canada) using the long model with support for specific attributes.
```xml
<Response>
  <Start>
    <Transcription
      transcriptionEngine="google"
      languageCode="fr-CA"
      speechModel="long"
      hints="service à la clientèle, rendez-vous, commande" />
  </Start>
</Response>
```
In this example, the long model supports model adaptation through hints, but does not support automatic punctuation, profanity filter, or spoken punctuation.
If specifying inboundTrackLabel or outboundTrackLabel, the call direction mapping table below can be used as a guide.
| Track | Call Direction | Call Resource Mapping | TrackLabel |
|---|---|---|---|
| inbound_track | Outbound | TO # | Label for "who is being called" in an outbound call from Twilio (e.g., inboundTrackLabel="customer"). |
| outbound_track | Outbound | FROM # | Label for "who is calling" in an outbound call from Twilio (e.g., outboundTrackLabel="agent"). |
| inbound_track | Inbound | FROM # | Label for "who is being called" in an inbound call to Twilio (e.g., inboundTrackLabel="agent"). |
| outbound_track | Inbound | TO # | Label for "who is calling" in an inbound call to Twilio (e.g., outboundTrackLabel="customer"). |
Note: A call that has an "outbound" direction is a call that is outbound from Twilio, i.e., from Twilio to a customer.
If you provided a name attribute when starting a Real-Time Transcription session, you can stop a Real-Time Transcription using TwiML or via API.
Given a Real-Time Transcription that was started with the following TwiML instructions:
```xml
<Response>
  <Start>
    <Transcription name="Contact center transcription" />
  </Start>
</Response>
```
You can stop the Real-Time Transcription with the following TwiML instructions:
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const stop = response.stop();
stop.transcription({name: 'Contact center transcription'});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Stop>
    <Transcription name="Contact center transcription" />
  </Stop>
</Response>
```
If a name was not provided, you can stop an in-progress Real-Time Transcription via the API using the Transcription SID. Learn more about the Transcriptions subresource.
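As a sketch, stopping a session by SID with the Node.js helper library might look like the following (placeholder credentials and SIDs; this assumes a helper library version that exposes the Realtime Transcriptions subresource on calls):

```js
const twilio = require('twilio');

// Placeholder credentials: use your Account SID and Auth Token
const client = twilio('ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', 'your_auth_token');

client.calls('CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
  .transcriptions('GT20dfa03c8cf8aa8d0c4aeccde5558b66')
  .update({ status: 'stopped' })
  .then((transcription) => console.log(transcription.status));
```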
HIPAA eligibility and PCI compliance vary depending on your selected speech model and whether you use webhooks or persisted transcripts. To determine whether your implementation may be HIPAA eligible or PCI compliant, see the following table.
| Transcription engine | Speech model | Transcript destination | HIPAA eligible | PCI compliant |
|---|---|---|---|---|
| Google | Any supported model | Webhooks | Yes | Yes |
| Google | Any supported model | Persisted Transcript | Yes | No |
| Deepgram | nova-2 or nova-3 monolingual variants | Webhooks | Yes | Yes |
| Deepgram | nova-2 or nova-3 monolingual variants | Persisted Transcript | Yes | No |
| Deepgram | nova-3 multilingual | Webhooks or Persisted Transcript | No | No |
AI Nutrition Facts
Real-Time Transcription, including the <Transcription> TwiML noun and API, uses third-party artificial intelligence and machine learning technologies.
Twilio's AI Nutrition Facts provide an overview of the AI feature you're using, so you can better understand how the AI is working with your data. Real-Time Transcriptions AI qualities are outlined in the following Speech to Text Transcriptions - Programmable Voice Nutrition Facts label. For more information and the glossary regarding the AI Nutrition Facts Label, please refer to Twilio's AI Nutrition Facts page.
AI Nutrition Facts
Speech to Text Transcriptions - Programmable Voice, Twilio Video, and Conversational Intelligence
- Description
- Generate speech to text voice transcriptions (real-time and post-call) in Programmable Voice, Twilio Video, and Conversational Intelligence.
- Privacy Ladder Level
- N/A
- Feature is Optional
- Yes
- Model Type
- Generative and Predictive - Automatic Speech Recognition
- Base Model
- Deepgram Speech-to-Text, Google Speech-to-Text, Amazon Transcribe
- Base Model Trained with Customer Data
- No
- Customer Data is Shared with Model Vendor
- No
- Training Data Anonymized
- N/A
- Data Deletion
- Yes
- Human in the Loop
- Yes
- Data Retention
- Until the customer deletes
- Logging & Auditing
- Yes
- Guardrails
- Yes
- Input/Output Consistency
- Yes
- Other Resources
- https://www.twilio.com/docs/conversational-intelligence
Trust Ingredients
Conversational Intelligence, Programmable Voice, and Twilio Video only use the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.
Base Model is not trained using any customer data.
Transcriptions are deleted by the customer using the Conversational Intelligence API or when a customer account is deprovisioned.
The customer views output in the Conversational Intelligence API or Transcript Viewer.
Compliance
The customer can listen to the input (recording) and view the output (transcript).
The customer is responsible for human review.
Learn more about this label at nutrition-facts.ai