ConversationRelay Architecture for Voice AI Applications Built on AWS using Fargate and Bedrock

July 23, 2025
Written by George Wolf, Contributor. Opinions expressed by Twilio contributors are their own.
Reviewed by Paul Kamp, Twilion

Enterprises have realized there is huge potential in voice-backed AI, or “agentic,” applications. Many new companies have sprung into existence offering Agentic-Applications-as-a-Service (has AAaaS been coined yet?), while other companies are experimenting and building their own.

While connecting a simple AI application to a voice channel is relatively straightforward, building a production-grade agentic application isn’t trivial. Achieving human-like latency and managing the conversation flow (e.g., handling interruptions or tool calls) are voice-specific optimizations required for an engaging, delightful experience. These experiences are – or soon will be – a primary business-to-customer interface, and it will make sense for many enterprises to control things end-to-end. To those enterprises, outsourcing this functionality will not be appealing.

The purpose of this post is to provide insights on how you could build an agentic voice experience using Twilio ConversationRelay and AWS Fargate. This blog post includes a repo and instructions on how to set up a proof-of-concept application.

What do you need for a production AI Agent?

Before we get to the code, let’s go over five key areas to focus on when building production scale agentic applications:

  1. ability to scale
  2. ease of development
  3. the right STT (speech-to-text) & TTS (text-to-speech) providers for the task
  4. the right LLM for the task
  5. acceptable latency

Ability to scale

The adoption of agentic experiences is increasing rapidly and shows no signs of slowing down. And as the experiences get better, the flywheel will keep turning – your customers are going to expect a new, higher level of service. Your application will need to be able to handle this new demand as adoption and use cases increase. Twilio and AWS can help.

[Image: Red Twilio logo with the company name and a circular icon with four dots.]

Twilio has been powering voice calls over APIs and software since 2008, and has a robust platform for enterprise voice applications.

Voice was Twilio’s first channel, and we now power billions of minutes of voice applications across the PSTN, SIP, and WebRTC with our Voice SDKs.

Read our 2024 Cyber Week Recap to get a sense of the voice traffic (as well as messaging and email traffic) Twilio handles.

[Image: Amazon Web Services (AWS) logo with text and an orange curved arrow underline.]

AWS pioneered the cloud computing revolution and powers applications all across the globe. Their Well-Architected Framework guides and influences modern application development.

There are clear benefits to building on platforms with proven track records of providing resilient and reliable services and APIs.

Using Twilio and AWS to build and power your voice AI applications would be a good choice!

Ease of development

The agentic stack has many components. While it is possible to build them all yourself, it can make sense to let platforms like AWS and Twilio handle the elements we are good at, so your development efforts can focus on the components that differentiate your business.

[Image: Text comparing ‘Hard Stuff’ and ‘Important Stuff’ in software development, highlighting tasks in each category.]

AWS and Twilio can help you offload the boilerplate so you can speed up the time to deployment and value when you are testing and building a solution. Both AWS and Twilio are known for their excellent documentation and developer-friendly APIs and design patterns.

User experience, business rules, prompt engineering, LLM choice, guardrails, and tool calling – the elements that shape your customer experience – are where you should spend the majority of your development effort.

The right STT and TTS for the task

Twilio’s ConversationRelay handles the orchestration of speech-to-text and text-to-speech, as well as key functionality like interruption handling. Because ConversationRelay handles STT and TTS, your application just needs to receive and process text, then stream its generated – or retrieved – text back to us.

[Image: Diagram showing the flow of interactions in a ConversationRelay WebSocket API system.]
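To make that exchange concrete, here is a minimal sketch of the WebSocket side using the ws package. The message shapes (“setup”, “prompt”, “interrupt”, and outbound “text” tokens) follow the ConversationRelay documentation; callYourLlm is a placeholder for your own streaming LLM call, and in the reference application the WebSocket shares the Express server rather than running standalone.

const { WebSocketServer } = require('ws');

// Placeholder for your streaming LLM call - see the Bedrock sketch later in this post.
async function* callYourLlm(promptText) {
  yield `You said: ${promptText}`;
}

const wss = new WebSocketServer({ port: 3000 }); // standalone here for brevity

wss.on('connection', (ws) => {
  ws.on('message', async (data) => {
    const message = JSON.parse(data);
    switch (message.type) {
      case 'setup':
        // Call metadata (callSid, from, to) arrives first - load any user context here.
        break;
      case 'prompt':
        // message.voicePrompt is the caller's speech, already transcribed for you.
        for await (const token of callYourLlm(message.voicePrompt)) {
          // Stream text chunks back; ConversationRelay converts them to speech.
          ws.send(JSON.stringify({ type: 'text', token, last: false }));
        }
        ws.send(JSON.stringify({ type: 'text', token: '', last: true }));
        break;
      case 'interrupt':
        // The caller spoke over the assistant - stop any in-flight generation.
        break;
    }
  });
});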

Twilio has partnered with world-class providers such as AWS, Google, Deepgram, and ElevenLabs, and has provisioned and tuned capacity with these providers to minimize latency.

As speech-to-text and text-to-speech providers rapidly improve and evolve, Twilio continually evaluates new releases from them and adds new providers alongside our existing ones. And best of all, we leave the choice of providers and configurations (voices and languages, for example) completely up to you… and configurable during each session.
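Those per-session choices are just attributes on the <ConversationRelay> TwiML element that starts the session. The attribute names below follow the ConversationRelay docs, but the provider and voice values are illustrative – substitute ones you have verified for your account:

<Connect>
  <ConversationRelay url="wss://<your application host>/"
                     transcriptionProvider="Deepgram"
                     ttsProvider="ElevenLabs"
                     voice="<voice ID from the Twilio docs>"
                     language="en-US" />
</Connect>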

The right LLM for the task

LLMs are also changing at a rapid pace, and your application needs to be able to switch models and providers to take advantage of improved features, latency, and cost as the ecosystem evolves.

AWS Bedrock is a compelling choice for many reasons. For starters, it is already available in your AWS account and can be provisioned within your company’s existing infrastructure.

AWS Bedrock lets you choose the model your application needs based on your analysis of the cost, performance, and functionality of the models in the catalog.

As this space matures, it is becoming clearer that while some agentic experiences may require the latest models, other experiences could be well served by lower cost models. Bedrock allows you the flexibility of trying different models via configuration so you can optimize your application for cost, latency, and function.

In addition, you can control model outputs using guardrails, and you can log those outputs both for observability and for model distillation and optimization.
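As a sketch of how little code model-switching requires, here is a streaming call using the Converse API from the AWS SDK for JavaScript v3. The reference application’s actual call lives in app/lib/prepare-and-call-llm.mjs and may differ; the commented-out guardrail configuration is an assumption you would adapt to a guardrail you have created:

const { BedrockRuntimeClient, ConverseStreamCommand } = require('@aws-sdk/client-bedrock-runtime');

const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION || 'us-east-1' });

// Streams response text for a single user turn; the model is pure configuration.
async function* streamReply(userText) {
  const response = await client.send(new ConverseStreamCommand({
    modelId: process.env.MODEL_IDENTIFIER, // swap models without code changes
    messages: [{ role: 'user', content: [{ text: userText }] }],
    // Optional: attach a pre-created Bedrock guardrail to filter model output.
    // guardrailConfig: { guardrailIdentifier: '<your guardrail ID>', guardrailVersion: '1' },
  }));
  for await (const event of response.stream) {
    const token = event.contentBlockDelta?.delta?.text;
    if (token) yield token;
  }
}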

Acceptable latency

None of the four previous points will matter if you cannot address the time it takes for your model to start speaking a response back to your user – indicated by the red arrow in the screenshot below.

[Image: Blue and red audio waveforms with a red arrow pointing to the middle.]

Closing this time gap is crucial to providing agentic experiences that achieve human-like latency.

The reference design enables you to build agentic applications that perform with human-like latency by leveraging Twilio’s Voice platform and ConversationRelay, combined with a highly available, performant Docker application hosted on AWS Fargate.

We have already covered how ConversationRelay manages and optimizes the speech-to-text and text-to-speech components for you. While Twilio also handles the key voice interfaces (WebRTC/PSTN/SIP), you still need to build the application itself (indicated by the orange box in the diagram below, labeled “You Provide these services”).

[Image: Flowchart showing a conversational API with voice and text inputs connecting to a custom assistant via WebSocket.]

This post’s application uses Docker and AWS Fargate with an Application Load Balancer to deliver a highly performant and resilient application.

Prerequisites

This is not a beginner-level build! You should be comfortable with the concepts presented above, and have the programming and command line skills needed to spin up this demo application. Concretely, the steps below assume you have:

  • A Twilio account with a voice-capable phone number
  • An AWS account with Bedrock access (and your chosen model enabled)
  • Node.js and Docker installed locally
  • The AWS CLI and AWS SAM CLI installed and configured
  • ngrok (or a similar tunneling tool) for local testing

ConversationRelay architecture with AWS Fargate and Bedrock

This application is intended to be straightforward to build and deploy, and serve as the foundation for something that you could take into production.

The application uses Node.js and Express.js, and can be run locally.

We use AWS DynamoDB as a persistence layer, and we use AWS Bedrock for LLM calls. These two services need to be configured for your build of this application, but you could certainly swap them out for different options.
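For illustration, here is a hypothetical read from that persistence layer using @aws-sdk/lib-dynamodb. The real table’s key schema is defined in the SAM template, so treat the key name here as an assumption:

const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, GetCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Look up a caller's profile by phone number (key name "pk" is assumed).
async function loadUserProfile(phoneNumber) {
  const { Item } = await ddb.send(new GetCommand({
    TableName: process.env.TABLE_NAME,
    Key: { pk: phoneNumber },
  }));
  return Item; // undefined if no profile exists for this caller
}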

The application is deployed using Docker and Fargate. Here is a high level overview of the entire reference architecture:

[Image: Diagram of Twilio services integrating with AWS using an Application Load Balancer and Amazon ECS.]
  1. Twilio Voice Infrastructure: Leverage Twilio’s CPaaS capabilities to connect to your customers. The inbound voice handler is routed to a REST API that establishes the ConversationRelay session using Programmable Voice and TwiML (the Twilio Markup Language).
  2. TwiML Response: Respond to a new inbound call with TwiML (see the handler sketch after this list). This initial step can use user context to personalize the session for the caller.
  3. Datasource: Production applications need a capable database to maintain state. The reference application uses DynamoDB.
  4. ConversationRelay: The TwiML from #2 establishes a unique WebSocket session that is ready to convert speech to text and send that text to your application. It’s also ready to receive inbound streams of text to reply to the caller using the specified text-to-speech provider.
  5. WebSocket API: ConversationRelay sends transcribed text from the caller to your application, where you can handle the converted text as events. LLMs will stream text responses back to your application where you can, in turn, stream the text “chunks” back to ConversationRelay to be converted into speech.
  6. Business Application: This reference application uses Node.js and Express.js deployed as Docker images to AWS Fargate.
  7. LLM: The reference application makes LLM calls using AWS Bedrock.
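To tie #1, #2, and #4 together, here is a minimal sketch of the inbound-call handler. The /twiml route matches the webhook URL you will configure in Step 7 below; the greeting text is illustrative, and the real application builds its TwiML from the use case configuration stored in DynamoDB:

const express = require('express');
const app = express();

// Twilio requests this route when a call comes in (configured in Step 7).
app.post('/twiml', (req, res) => {
  // WS_URL is this application's public wss:// endpoint (via ngrok locally).
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay url="${process.env.WS_URL}"
                       welcomeGreeting="Thanks for calling! How can I help you today?" />
  </Connect>
</Response>`;
  res.type('text/xml').send(twiml);
});

app.listen(3000);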

Let’s build it!

Download the code from this GitHub repo.

This application is designed to run locally using Docker or as a local Node application for development and testing, and it can be deployed as an x86 image on AWS Fargate for production use. We are providing this application as a sample application with no warranties or support for production.

The README.md in the GitHub repo outlines all of the ways you can manage running the application. We recommend testing locally first and then deploying to Fargate.

Local Development

This guide is designed to get you going quickly with a local deployment of the infrastructure needed for the app and minimal resources used in the cloud. After you have a local version working, the README.md in the repo suggests how you could deploy the application using Docker or in the cloud using Fargate.

Step 1

Run ngrok. This gives you a public internet URL that proxies to your local machine. Make a note of it for the next steps:

ngrok http http://localhost:3000

[Image: Screenshot showing ngrok’s session status, account info, and connection details.]

Step 2

Clone the Git repo from GitHub, if you have not already.

Ensure that the AWS CLI application is installed and can run from your terminal.

Ensure that the model you will be using in this demo is enabled for your AWS account for use in Bedrock.
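Model access is granted on the Model access page of the Bedrock console. If you want to confirm which models exist in your Region from the terminal – assuming a recent AWS CLI with Bedrock support – you can run:

aws bedrock list-foundation-models --region us-east-1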

Step 3

Deploy the cloud resources – the DynamoDB table, Bedrock access, and IAM permissions – using AWS SAM (sam build, then sam deploy, as shown in README.md).

Step 4

In the ConversationRelay-LiveChat-Bedrock-Fargate/app directory, install the Node dependencies needed to run the application:

npm install

Then, rename the run-node-local.sh.sample file in that directory and make it executable:

mv run-node-local.sh.sample run-node-local.sh
chmod 777 run-node-local.sh

Next, edit run-node-local.sh to point to your deployed resources and to ngrok on your machine by changing the following fields (an illustrative example follows the list):

[Image: Screenshot of a bash script configuring AWS settings and starting a local Node.js application.]

  • AWS_PROFILE: Optional – points to a local profile in your AWS client configuration.
  • WS_DOMAIN_NAME: The public URL for your application proxy from Step 1.
  • TABLE_NAME: The name of the DynamoDB table deployed in Step 3.
  • STACK_USE_CASE: Tells the application which use case to run. The default is set for you.
  • WS_URL: The WebSocket URL that Twilio will use to send data to your application. The format will be: wss://<your ngrok public URI>/
  • MODEL_IDENTIFIER: The model that your application will select from Bedrock.
  • timeZone: The time zone the application runs in; this is used to make sure dates and times reflect the correct time zone for the location of the service.
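Putting that together, an edited run-node-local.sh might look like the following sketch. Every value is illustrative – use the outputs from your own SAM deployment and ngrok session, keep the STACK_USE_CASE default from the sample file, and leave whatever start command the sample file already contains at the end:

export AWS_PROFILE=default                     # optional
export WS_DOMAIN_NAME=<your-subdomain>.ngrok-free.app
export TABLE_NAME=CR-AWS-BEDROCK-ConversationRelayAppDatabase
export STACK_USE_CASE=<keep the provided default>
export WS_URL=wss://<your-subdomain>.ngrok-free.app/
export MODEL_IDENTIFIER=us.anthropic.claude-3-5-haiku-20241022-v1:0   # illustrative
export timeZone=America/Los_Angeles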

Step 5

Set the default use case prompts in the DynamoDB table created by this stack. This enables the code to run one of the use cases we have predefined in the sample code. Here, we are going to deploy an application that simulates ordering from a pizza restaurant:

aws dynamodb put-item --table-name CR-AWS-BEDROCK-ConversationRelayAppDatabase --item "$(node ./configuration/dynamo-loaders/restaurantOrderingUseCase.mjs | cat)"

Step 6 (Optional)

Add a sample user profile to the application. This allows the application to address you by name, based on the phone number you are calling from.

aws dynamodb put-item --table-name CR-AWS-BEDROCK-ConversationRelayAppDatabase --item "$(node ./configuration/user-profile.js | cat)"

Step 7

Connect your application to your Twilio Voice phone number:

[Image: Twilio Voice configuration screen showing settings for call routing using webhooks and HTTP methods.]

Go to your phone number configuration screen and select Webhook under ‘A call comes in’ and then enter the URL:

https://<your ngrok proxy URI>/twiml

in the URL field.

Step 8

Run your application locally:

cd ./app
./run-node-local.sh

This will start the Express server and run the app locally on port 3000.

Step 9

Now, it’s time to test! Call your application by dialing your Twilio phone number.

You should see application logs as the pizza ordering app plays its welcome greeting. You can now interact with the LLM!

[Image: Two audio waveforms in blue and red, each with an icon indicating voice input.]

Try putting the app through its paces while watching the logs to get a feel for the build. Remember the elements that we discussed in the beginning of this post: you should get a feel for the speech-to-text and text-to-speech capabilities, along with a feel for the strengths and weaknesses of your LLM choice. And, certainly, you’ll naturally get a feel for the latency with your conversation pacing.

(Try ordering a pizza from the restaurant, too. Just don’t expect a food delivery.)

Key Architectural Considerations

We needed to make several important architectural choices when building out this application. We highlight and explain our thought process below so that you can be prepared for the same choices while building your application.

LLM Performance Choices

The choice of LLM matters enormously for the use case you are implementing. While the latest model may be attractive, other factors like cost and latency need to be considered.

This reference application is designed to be able to call any LLM via API. You can look at app/lib/prepare-and-call-llm.mjs if you want to change this application to call a different LLM. If you choose to continue to use Bedrock, keep the following in mind:

  • Understand capacity modes: reserved and provisioned capacity can offer significant cost savings over on-demand, and most voice applications will need provisioned capacity to optimize latency.
  • LLM choice is crucial (and it’s crucial enough to repeat!). The Bedrock model catalog makes it straightforward to experiment with different providers and models.
  • Use inference profiles or cross-region profiles to access available capacity across multiple Regions (see the example identifiers after this list).
  • Log LLM interactions for performance monitoring and auditability.
  • Use KMS to encrypt data at rest.
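On the inference profile point: cross-region inference profile IDs are ordinary model identifiers with a region-group prefix (us., eu., or apac.), so switching between them is pure configuration. The identifiers below are illustrative – confirm the exact IDs enabled in your account:

MODEL_IDENTIFIER=us.anthropic.claude-3-5-haiku-20241022-v1:0
MODEL_IDENTIFIER=us.amazon.nova-lite-v1:0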

Text-to-Speech voices

The voices that you choose for your voice AI applications are a central component of the user experience that your business provides to your customers. ConversationRelay lets you choose the text-to-speech provider and voice that best meet your requirements.

This application comes preloaded with English voices from three providers (Google, Amazon, and ElevenLabs).

The set of available voices is growing rapidly. Refer to the Twilio Docs for the latest voices.

Conclusion

With voice becoming such an important customer interface, most enterprises will want to own the experience end-to-end. Hopefully this blog post and reference architecture shed some light on the key points your enterprise will need to consider while giving you some foundational code to begin your project. When you choose to build these applications with Twilio and AWS you will be able to focus on the work that differentiates your business and provides true value to your customers.

Additional resources

Dan Bartlett has been building web applications since the first dotcom wave. The core principles from those days remain the same but these days you can build cooler things faster. He can be reached at dbartlett [at] twilio.com.

George Wolf was a Senior Partner Solutions Architect at AWS for over five years. Prior to AWS, George worked at Segment and was responsible for partner technical relationships. He has experience with Amazon Personalize, Clean Rooms, Redshift, and Lambda, and has designed high volume data collection architectures for customer data and machine learning use cases for partners such as Amplitude, Segment, ActionIQ, Tealium, Braze, and Reltio. In his spare time, George likes mountain biking around the San Francisco Bay Area. You can reach George on LinkedIn.