TwiML™ Voice: <ConversationRelay>
Legal notice
ConversationRelay, including the <ConversationRelay> TwiML noun and API, uses artificial intelligence or machine learning technologies. By enabling or using any features or functionalities within Programmable Voice that Twilio identifies as using artificial intelligence or machine learning technology, you acknowledge and agree to certain terms. Your use of these features or functionalities is subject to the terms of the Predictive and Generative AI or ML Features Addendum.
ConversationRelay isn't Payment Card Industry (PCI) compliant and doesn't support Voice workflows that are subject to PCI requirements.
Info
Before using ConversationRelay, you need to complete the onboarding steps and agree to the Predictive and Generative AI/ML Features Addendum. See the ConversationRelay Onboarding Guide for more details.
The <ConversationRelay> TwiML noun under the <Connect> verb routes a call to the ConversationRelay service, providing advanced AI-powered voice interactions. ConversationRelay handles the complexities of live, synchronous voice calls, such as speech-to-text (STT) and text-to-speech (TTS) conversions, session management, and low-latency communication with your application. This approach allows your system to focus on processing conversational AI logic and sending responses back effectively.
In a typical setup, <ConversationRelay> connects to your AI application through a WebSocket, allowing real-time, event-based interaction. Your application receives transcribed caller speech in structured messages and sends responses as text, which ConversationRelay converts to speech and plays back to the caller. This setup is commonly used for customer service, virtual assistants, and other scenarios that require real-time, AI-based voice interactions.
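The exchange described above can be sketched as a small message handler. This is a minimal, illustrative sketch: the message shapes (a `setup` message at session start, `prompt` messages carrying transcribed speech in `voicePrompt`, and `text` token replies) follow the ConversationRelay WebSocket schema, while `generateReply` is a stand-in for your own conversational AI logic.

```javascript
// Stand-in for your conversational AI logic (e.g., an LLM call).
function generateReply(callerText) {
  return `You said: ${callerText}`;
}

// Decide how to respond to one incoming ConversationRelay message.
// Returns a message object to send back, or null if no reply is needed.
function handleRelayMessage(message) {
  switch (message.type) {
    case 'setup':
      // First message of the session; carries callSid, sessionId,
      // and any <Parameter> values under customParameters.
      return null;
    case 'prompt':
      // Transcribed caller speech arrives in the voicePrompt field.
      return {
        type: 'text',
        token: generateReply(message.voicePrompt),
        last: true // no more tokens follow for this reply
      };
    default:
      // dtmf, interrupt, error, and other events are ignored in this sketch.
      return null;
  }
}

// Wiring it to a WebSocket server (illustrative, using the `ws` package):
// const { WebSocketServer } = require('ws');
// new WebSocketServer({ port: 8080 }).on('connection', (socket) => {
//   socket.on('message', (data) => {
//     const reply = handleRelayMessage(JSON.parse(data));
//     if (reply) socket.send(JSON.stringify(reply));
//   });
// });
```

In a real application the reply would usually be streamed as multiple `text` tokens with `last: false` until the final one, so TTS playback can begin before the full response is generated.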
The <ConversationRelay> noun supports the following attributes:
| Attribute name | Description | Default value | Required |
| --- | --- | --- | --- |
| `url` | The URL of your WebSocket server. The URL must begin with `wss://`. | | Required |
| `welcomeGreeting` | The message Twilio plays to the caller after answering a call. For example, `Hello! How can I help you today?` | | Optional |
| `welcomeGreetingInterruptible` | Controls which interruptions from the caller are permitted during the welcome greeting. The value can be `none`, `dtmf`, `speech`, or `any`. | `any` | Optional |
| `language` | The language of speech-to-text (STT) and text-to-speech (TTS). Setting this attribute is the same as setting both `ttsLanguage` and `transcriptionLanguage`. | `en-US` | Optional |
| `ttsLanguage` | The language code to use for TTS when the text token message doesn't specify a language. | | Optional |
| `ttsProvider` | The provider for TTS. Available choices are `Google`, `Amazon`, and `ElevenLabs`. | `ElevenLabs` | Optional |
| `voice` | The voice used for TTS. Options vary based on the `ttsProvider`. For details, refer to the Twilio TTS voices. Additional voices are available for ConversationRelay. | `UgBBYS2sOqTuMpoF3BR0` (ElevenLabs), `en-US-Journey-O` (Google), `Joanna-Neural` (Amazon) | Optional |
| `transcriptionLanguage` | The language to use for speech-to-text when the session starts. This overrides the `language` attribute. | | Optional |
| `transcriptionProvider` | The provider for STT (speech recognition). Available choices are `Google` and `Deepgram`. | `Deepgram` (`Google` for accounts that used ConversationRelay before September 12, 2025) | Optional |
| `speechModel` | The speech model used for STT. Choices vary based on the `transcriptionProvider`. Refer to the provider's documentation for an accurate list. | `telephony` (Google); `nova-3-general` (Deepgram, for languages that support it) or `nova-2-general` (Deepgram, other languages) | Optional |
| `interruptible` | Specifies whether caller speech can interrupt TTS playback. Values can be `none`, `dtmf`, `speech`, or `any`. For backward compatibility, Boolean values are also accepted: `true` = `any` and `false` = `none`. | `any` | Optional |
| `dtmfDetection` | Specifies whether the system sends dual-tone multi-frequency (DTMF) keypresses over the WebSocket. Set to `true` to turn on DTMF events. | | Optional |
| `reportInputDuringAgentSpeech` | Specifies whether your application receives prompts and DTMF events while the agent is speaking. Values can be `none`, `dtmf`, `speech`, or `any`. Note: The default value for this attribute has changed. The default was `any` before May 2025; it's now `none`. | `none` | Optional |
| `preemptible` | Specifies whether text tokens from the next talk cycle can interrupt the TTS playback of the current talk cycle. | `false` | Optional |
| `hints` | A comma-separated list of words or phrases that may appear in the speech. See Hints to learn more about this attribute. | | Optional |
| `debug` | A space-separated list of options that subscribe you to debugging messages. Options are `debugging`, `speaker-events`, and `tokens-played`. The `debugging` option provides general debugging information. `speaker-events` notifies your application about `agentSpeaking` and `clientSpeaking` events. `tokens-played` provides messages about what has just been played over TTS. | | Optional |
| `elevenlabsTextNormalization` | Specifies whether to apply text normalization while using the ElevenLabs TTS provider. Options are `on`, `auto`, or `off`. `auto` has the same effect as `off` for ConversationRelay voice calls. | `off` | Optional |
| `intelligenceService` | A Conversational Intelligence Service SID or unique name for persisting conversation transcripts and running Language Operators for virtual agent observability. See this guide for more details. | | Optional |
For more granular configuration, you can nest elements in <ConversationRelay>.
The <Language> element maps a language code to a set of text-to-speech and speech-to-text settings. Add one <Language> element for each language that the session may use.
Info
Adding the <Language> element doesn't set it as the text-to-speech or speech-to-text language. See Language settings to learn how to set or change the TTS or STT language in a session.
Attributes
| Attribute name | Description | Default value | Required |
| --- | --- | --- | --- |
| `code` | The language code (for example, `en-US`) that applies to both STT and TTS. | | Required |
| `ttsProvider` | The provider for TTS. Choices are `Google`, `Amazon`, and `ElevenLabs`. | Inherited from `<ConversationRelay>` | Optional |
| `voice` | The voice used for TTS. Choices vary based on the `ttsProvider`. | Inherited from `<ConversationRelay>` | Optional |
| `transcriptionProvider` | The provider for STT. Choices are `Google` and `Deepgram`. | Inherited from `<ConversationRelay>` | Optional |
| `speechModel` | The speech model used for STT. Choices vary based on the `transcriptionProvider`. | Inherited from `<ConversationRelay>` | Optional |
Example
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const connect = response.connect();
const conversationrelay = connect.conversationRelay({
  url: 'wss://mywebsocketserver.com/websocket'
});
conversationrelay.language({
  code: 'sv-SE',
  ttsProvider: 'amazon',
  voice: 'Elin-Neural',
  transcriptionProvider: 'google',
  speechModel: 'long'
});
conversationrelay.language({
  code: 'en-US',
  ttsProvider: 'google',
  voice: 'en-US-Journey-O'
});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay url="wss://mywebsocketserver.com/websocket">
      <Language code="sv-SE" ttsProvider="amazon" voice="Elin-Neural" transcriptionProvider="google" speechModel="long"/>
      <Language code="en-US" ttsProvider="google" voice="en-US-Journey-O"/>
    </ConversationRelay>
  </Connect>
</Response>
```
The <Parameter> element allows you to send custom parameters from the TwiML directly into the initial "setup" message sent over the WebSocket. These parameters appear under the customParameters field in the JSON message.
Example
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const connect = response.connect();
const conversationrelay = connect.conversationRelay({
  url: 'wss://mywebsocketserver.com/websocket'
});
conversationrelay.parameter({
  name: 'foo',
  value: 'bar'
});
conversationrelay.parameter({
  name: 'hint',
  value: 'Annoyed customer'
});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay url="wss://mywebsocketserver.com/websocket">
      <Parameter name="foo" value="bar"/>
      <Parameter name="hint" value="Annoyed customer"/>
    </ConversationRelay>
  </Connect>
</Response>
```
Resulting setup message
```json
{
  "type": "setup",
  "sessionId": "VX00000000000000000000000000000000",
  "callSid": "CA00000000000000000000000000000000",
  "...": "...",
  "customParameters": {
    "foo": "bar",
    "hint": "Annoyed customer"
  }
}
```
```js
const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const connect = response.connect({
  action: 'https://myhttpserver.com/connect_action'
});
connect.conversationRelay({
  url: 'wss://mywebsocketserver.com/websocket',
  welcomeGreeting: 'Hi! Ask me anything!'
});

console.log(response.toString());
```
Output
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect action="https://myhttpserver.com/connect_action">
    <ConversationRelay url="wss://mywebsocketserver.com/websocket" welcomeGreeting="Hi! Ask me anything!"/>
  </Connect>
</Response>
```
- `action` (optional): The URL that Twilio will request when the `<Connect>` verb ends.
- `url` (required): The URL of your WebSocket server (must use the `wss://` protocol).
- `welcomeGreeting` (optional): The message played to the caller after Twilio answers the call and establishes the WebSocket connection.
When the TwiML execution is complete, Twilio makes a callback to the action URL with call information and the return parameters from ConversationRelay.
You can set the text-to-speech language in four ways:

1. The value of the `language` attribute on the `<ConversationRelay>` noun
2. The value of the `ttsLanguage` attribute on the `<ConversationRelay>` noun
3. The `ttsLanguage` value in the last switch language message that your app sent to Twilio
4. The `lang` value in the text token message your app sent to Twilio

Later items on this list override earlier items. For example, if you set the text-to-speech language to `sv-SE` in the text token message and `en-US` in the `ttsLanguage` attribute of the `<ConversationRelay>` noun, the TTS language is `sv-SE` for that token.
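As a sketch of the highest-precedence option, a text token message can carry a per-token `lang` value. The message shape below follows the ConversationRelay WebSocket schema; the helper function and sample text are illustrative.

```javascript
// Build a text token message. When `lang` is set, it overrides the
// session's ttsLanguage for this token only; when omitted, the token
// falls back to the ttsLanguage/language attribute settings.
function buildTextToken(token, lang, last = true) {
  const message = { type: 'text', token, last };
  if (lang) message.lang = lang; // e.g. 'sv-SE'
  return message;
}

// Usage (illustrative):
// socket.send(JSON.stringify(buildTextToken('Hej! Hur kan jag hjälpa dig?', 'sv-SE')));
```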
You can set the speech-to-text language in three ways. Later items on this list override earlier items.

1. The value of the `language` attribute on the `<ConversationRelay>` noun
2. The value of the `transcriptionLanguage` attribute on the `<ConversationRelay>` noun
3. The `transcriptionLanguage` value in the last switch language message that your app sent to Twilio
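A switch language message sent over the WebSocket changes the session's languages mid-call. The message shape below follows the ConversationRelay WebSocket schema; the helper function is illustrative.

```javascript
// Build a switch language message. Either field may be omitted;
// the settings take effect for the rest of the session, or until
// another switch language message is sent.
function buildLanguageSwitch({ ttsLanguage, transcriptionLanguage }) {
  const message = { type: 'language' };
  if (ttsLanguage) message.ttsLanguage = ttsLanguage;
  if (transcriptionLanguage) message.transcriptionLanguage = transcriptionLanguage;
  return message;
}

// Usage (illustrative):
// socket.send(JSON.stringify(
//   buildLanguageSwitch({ ttsLanguage: 'sv-SE', transcriptionLanguage: 'sv-SE' })
// ));
```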
If your session could have more than one language, use the <Language> noun to configure the speech model, transcription provider, text-to-speech provider, and voice for each language. Add one <Language> noun for each language that you support.

The attributes of the <Language> noun override the attributes of the parent <ConversationRelay> noun.
ConversationRelay generally uses default values when you don't specify a speech model, voice, or provider. For example, if you set the ttsProvider attribute without the voice attribute, the system uses a default voice for that text-to-speech provider.
ConversationRelay sends an error message to your app and disconnects the call when you've specified an invalid combination of transcriptionProvider and speechModel, or of ttsProvider and voice.
When an action URL is specified in the <Connect> verb, ConversationRelay makes a request to that URL when the <Connect> verb ends. The request includes call information and session details.
Example payloads
```json
{
  "AccountSid": "AC00000000000000000000000000000000",
  "CallSid": "CA00000000000000000000000000000000",
  "CallStatus": "in-progress",
  "From": "client:caller",
  "To": "test:conversationrelay",
  "Direction": "inbound",
  "ApplicationSid": "AP00000000000000000000000000000000",
  "SessionId": "VX00000000000000000000000000000000",
  "SessionStatus": "ended",
  "SessionDuration": "25",
  "HandoffData": "{\"reason\": \"The caller requested to talk to a real person\"}"
}
```
```json
{
  "AccountSid": "AC00000000000000000000000000000000",
  "CallSid": "CA00000000000000000000000000000000",
  "CallStatus": "in-progress",
  "From": "client:caller",
  "To": "test:conversationrelay",
  "Direction": "inbound",
  "ApplicationSid": "AP00000000000000000000000000000000",
  "SessionId": "VX00000000000000000000000000000000",
  "SessionStatus": "failed",
  "SessionDuration": "10",
  "ErrorCode": "39001",
  "ErrorMessage": "Network connection to WebSocket server failed."
}
```
```json
{
  "AccountSid": "AC00000000000000000000000000000000",
  "CallSid": "CA00000000000000000000000000000000",
  "CallStatus": "completed",
  "From": "client:caller",
  "To": "test:conversationrelay",
  "Direction": "inbound",
  "ApplicationSid": "AP00000000000000000000000000000000",
  "SessionId": "VX00000000000000000000000000000000",
  "SessionStatus": "completed",
  "SessionDuration": "35"
}
```
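One way to consume these payloads is an action handler that branches on SessionStatus and parses HandoffData. The sketch below assumes an Express-style handler; the route, phone number, and HandoffData JSON shape are illustrative, and the SessionStatus values and fields come from the example payloads above.

```javascript
// Decide what TwiML to return from the <Connect> action callback,
// based on the ConversationRelay session outcome.
function actionTwiml(params) {
  if (params.SessionStatus === 'ended' && params.HandoffData) {
    // HandoffData arrives as a JSON string set by your application.
    const handoff = JSON.parse(params.HandoffData);
    if (handoff.reason && handoff.reason.includes('real person')) {
      // Hand the caller off to a human agent (number is illustrative).
      return '<Response><Dial>+15550100000</Dial></Response>';
    }
  }
  if (params.SessionStatus === 'failed') {
    return '<Response><Say>Something went wrong. Please call again later.</Say></Response>';
  }
  // Session completed normally; end the call.
  return '<Response><Hangup/></Response>';
}

// Express usage (illustrative):
// app.post('/connect_action', (req, res) => {
//   res.type('text/xml').send(actionTwiml(req.body));
// });
```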
Use the hints attribute of <ConversationRelay> to help accurately transcribe phrases that may appear in the speech. Hints could include brand names, industry-specific terms, or other expressions that you think the call is likely to contain. Adding a phrase to hints could increase the likelihood that your speech-to-text provider recognizes it. Separate each hint with a comma, and capitalize proper nouns (like brand names) in the way that they're normally written.
If Nova-3 is your speech-to-text model, the hints attribute uses Deepgram's keyterms feature. You can provide up to 100 hints, and the hints attribute can contain up to 500 tokens. See the Deepgram documentation to learn more about hints with Nova-3.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay
      url="wss://YOUR_SERVER_URL"
      interruptible="true"
      welcomeGreeting="Hi! How can I help you today?"
      hints="TwiML,ConversationRelay,JavaScript,XML,Code Exchange">
      <!-- other elements -->
    </ConversationRelay>
  </Connect>
</Response>
```
ConversationRelay, including the <ConversationRelay> TwiML noun and API, uses artificial intelligence or machine learning technologies.
Our AI Nutrition Facts for ConversationRelay provide an overview of the AI features you're using, so you can better understand how the AI works with your data. The AI Nutrition Facts labels below detail the AI qualities of ConversationRelay. For more information and a glossary for the AI Nutrition Facts label, refer to our AI Nutrition Facts page.
AI Nutrition Facts
ConversationRelay (STT and TTS) - Programmable Voice - Deepgram
- Description
- Generate speech to text in real-time through a WebSocket API in Programmable Voice.
- Privacy Ladder Level
- N/A
- Feature is Optional
- Yes
- Model Type
- Automatic Speech Recognition
- Base Model
- Deepgram Nova2
- Base Model Trained with Customer Data
- No
- Customer Data is Shared with Model Vendor
- No
- Training Data Anonymized
- N/A
- Data Deletion
- N/A
- Human in the Loop
- Yes
- Data Retention
- N/A
- Logging & Auditing
- Yes
- Guardrails
- Yes
- Input/Output Consistency
- Yes
- Other Resources
- Learn more about this label at nutrition-facts.ai
Trust Ingredients
ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.
Base Model is not trained using any Customer Data.
Customer Data is not stored or retained in the Base Model.
Customer can view and listen to the input and output in the customer's own terminal.
Compliance
Customer can view and listen to the input and output in the customer's own terminal.
Customer is responsible for human review.
Learn more about this label at nutrition-facts.ai
AI Nutrition Facts
ConversationRelay (STT and TTS) - Programmable Voice - Google AI
- Description
- Generate speech to text in real-time and convert text into natural-sounding speech through a WebSocket API in Programmable Voice.
- Privacy Ladder Level
- N/A
- Feature is Optional
- Yes
- Model Type
- Generative and Predictive - Automatic Speech Recognition and Text-to-Speech
- Base Model
- Google Speech-to-Text; Google Text-to-Speech
- Base Model Trained with Customer Data
- No
- Customer Data is Shared with Model Vendor
- No
- Training Data Anonymized
- N/A
- Data Deletion
- N/A
- Human in the Loop
- Yes
- Data Retention
- N/A
- Logging & Auditing
- Yes
- Guardrails
- Yes
- Input/Output Consistency
- Yes
- Other Resources
- Learn more about this label at nutrition-facts.ai
Trust Ingredients
ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.
Base Model is not trained using any Customer Data.
Customer Data is not stored or retained in the Base Model.
Customer can view and listen to the input and output in the customer's own terminal.
Compliance
Customer can view and listen to the input and output in the customer's own terminal.
Customer is responsible for human review.
Learn more about this label at nutrition-facts.ai
AI Nutrition Facts
ConversationRelay (STT and TTS) - Programmable Voice - Amazon AI
- Description
- Convert text into natural sounding speech through a websocket API in Programmable Voice.
- Privacy Ladder Level
- N/A
- Feature is Optional
- Yes
- Model Type
- Generative and Predictive
- Base Model
- Amazon Polly Text-to-Speech
- Base Model Trained with Customer Data
- No
- Customer Data is Shared with Model Vendor
- No
- Training Data Anonymized
- N/A
- Data Deletion
- N/A
- Human in the Loop
- Yes
- Data Retention
- N/A
- Logging & Auditing
- Yes
- Guardrails
- Yes
- Input/Output Consistency
- Yes
- Other Resources
- Learn more about this label at nutrition-facts.ai
Trust Ingredients
ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.
Base Model is not trained using any Customer Data.
Customer Data is not stored or retained in the Base Model.
Customer can view and listen to the input and output in the customer's own terminal.
Compliance
Customer can view and listen to the input and output in the customer's own terminal.
Customer is responsible for human review.
Learn more about this label at nutrition-facts.ai
AI Nutrition Facts
ConversationRelay (STT and TTS) - Programmable Voice - ElevenLabs
- Description
- Convert text into a human-sounding voice using speech synthesis technology from ElevenLabs.
- Privacy Ladder Level
- N/A
- Feature is Optional
- Yes
- Model Type
- Predictive
- Base Model
- ElevenLabs Text-To-Speech: Flash 2 and Flash 2.5
- Base Model Trained with Customer Data
- No
- Customer Data is Shared with Model Vendor
- No
- Training Data Anonymized
- N/A
- Data Deletion
- N/A
- Human in the Loop
- Yes
- Data Retention
- Customer can review TwiML logs, including <Say> Logs, to debug and troubleshoot for up to 30 days.
- Logging & Auditing
- Yes
- Guardrails
- Yes
- Input/Output Consistency
- Yes
- Other Resources
- Learn more about this label at nutrition-facts.ai
Trust Ingredients
The Base Model is not trained using any Customer Data.
Programmable Voice uses the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.
Base Model is not trained using any Customer Data.
Customers can view text input and listen to the audio output.
Compliance
Customers can view text input and listen to the audio output.
Customer is responsible for human review.
Learn more about this label at nutrition-facts.ai