Integrate Google Gemini with Twilio Voice Using ConversationRelay and Python
Imagine having real-time, human-like conversations with an AI model over the phone. With Twilio ConversationRelay, you can make that a reality. It connects your voice call with any Large Language Model (LLM) via a fast, event-driven WebSocket, allowing seamless communication.
In this post, I'll show you how to set up Google Gemini with Twilio Voice using ConversationRelay. By the end, you'll have a simple Python server running, ready to let you dial into a Twilio number and talk with Gemini. It's a perfect starting point for building more interactive, feature-rich voice applications.
Prerequisites
To follow along with this tutorial, you'll need:
- Python 3.10+ installed
- A Twilio phone number (Sign up for Twilio here)
- The IDE of your choice (such as Visual Studio Code)
- Ngrok or a similar tunneling service
- A Google AI Studio account and API key
- A phone to place your outgoing call to Twilio
Set up the project
Let's start by creating a new directory for your project:
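The directory name below is just a suggestion; pick any name you like:

```shell
mkdir gemini-voice
cd gemini-voice
```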
Next, create a virtual environment and activate it:
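On macOS or Linux:

```shell
python3 -m venv venv
source venv/bin/activate
```

On Windows, activate with `venv\Scripts\activate` instead.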
Now, install the required dependencies using pip:
You'll use FastAPI as your framework to quickly spin up a server for both the WebSocket and the Twilio route.
Configure environment variables
You'll need to securely store your Google Gemini API key. Create a .env file in your project folder and add the following:
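For example (placeholder values only; `NGROK_URL` is the variable name assumed in this post's code sketches, and you can leave it empty until you start Ngrok later):

```
GOOGLE_API_KEY=your-google-ai-studio-api-key
NGROK_URL=your-ngrok-domain.ngrok.app
```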
Be sure to replace the placeholders with your actual API key and your Ngrok forwarding URL (you'll set this up later in the post).
Create the server
Now, create a new Python file called main.py. This is where you'll write the main logic of the application.
1. Import necessary libraries and set up the environment
The first step in setting up the server is importing the necessary libraries. These libraries will help you build the server, handle WebSocket connections, interact with Google Gemini, and load environment variables.
- FastAPI: This is the main web framework used for building the server. It allows you to easily define routes and handle HTTP requests. It also enables WebSocket communication between the server and Twilio's ConversationRelay.
- uvicorn: A fast ASGI server that runs the FastAPI application.
- genai: The Google Gen AI SDK allows you to interact with Google's Gemini models to generate responses.
- dotenv: This module helps load environment variables (like your Google API key and Ngrok URL) from a .env file.
After importing the libraries, add the following line to load environment variables from the .env file:
This ensures sensitive information (like API keys) is not hardcoded into the application.
2. Constants and configuration
Define some constants to use throughout the code to customize the behavior of the application.
- PORT: The port for the server to listen on (either from an environment variable or defaulting to 8080).
- DOMAIN: The Ngrok URL that will be used to establish a WebSocket connection with Twilio. The code validates it immediately and raises an error if it's not set.
- WS_URL: The full WebSocket URL for the endpoint where Twilio will connect.
- WELCOME_GREETING: The greeting that will be spoken when the call connects.
- SYSTEM_PROMPT: This is a special message sent to the Gemini model to instruct it on how to behave. It's important because it helps shape the tone and format of the AI's responses (e.g., spelling out numbers, avoiding emojis). Since this is a voice conversation, the rules help ensure clean, speakable output.
- sessions: A dictionary to store active chat session objects for each call (using the callSid as the key). Note: this implementation only cleans up on a clean disconnect; in production, consider adding session timeout handling.
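A sketch of these constants, assuming the Ngrok domain is stored in an NGROK_URL environment variable (the greeting and prompt wording are examples; adjust them to taste):

```python
import os

PORT = int(os.getenv("PORT", "8080"))

# The Ngrok domain without the scheme, e.g. "1234abcd.ngrok.app"
DOMAIN = os.getenv("NGROK_URL")
if not DOMAIN:
    raise ValueError("NGROK_URL environment variable is not set")
WS_URL = f"wss://{DOMAIN}/ws"

WELCOME_GREETING = "Hi! I am a voice assistant powered by Twilio and Google Gemini. Ask me anything!"
SYSTEM_PROMPT = (
    "You are a helpful assistant. This conversation is happening over a "
    "phone call, so your answers will be converted to speech. Keep responses "
    "short and clear, spell out numbers (say 'twenty' instead of '20'), and "
    "never use emojis or special characters."
)

# Active Gemini chat sessions, keyed by Twilio's callSid
sessions = {}
```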
3. Initializing FastAPI and the Gemini client
Initialize the FastAPI application and set up the Gemini client using the API key stored in the .env file.
- GOOGLE_API_KEY: The code loads the API key from environment variables and validates it's present before proceeding.
- client: Initialize the Google Gen AI client using genai.Client(). This is the new SDK pattern: if you've used the older google-generativeai package before, note that this is the newer google-genai SDK.
- FastAPI app: This creates the web server that will handle HTTP requests and WebSocket connections.
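Putting this together (a sketch, with os and genai imported at the top of main.py):

```python
# Validate the API key before creating the client
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set")

# The new google-genai SDK client
client = genai.Client(api_key=GOOGLE_API_KEY)

app = FastAPI()
```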
4. AI response function
Now create a function that will interact with the Gemini API and get the AI's response.
- gemini_response(): This function takes a Gemini chat session object and the user's prompt, sends the message, and returns the text response. The chat session object automatically maintains conversation history, so Gemini has full context of the conversation with each new message.
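A minimal version of this function might look like the following (send_message() and the .text property are part of the google-genai chat API):

```python
def gemini_response(chat, prompt: str) -> str:
    """Send the user's prompt to a Gemini chat session and return the reply text.

    The chat object (created with client.chats.create()) keeps the
    conversation history, so each call has full context.
    """
    response = chat.send_message(prompt)
    return response.text
```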
5. TwiML response for Twilio
The /twiml endpoint is designed to provide Twilio with instructions on how to connect a voice call to your WebSocket application. Twilio uses TwiML (Twilio Markup Language) to determine how to handle voice interactions.
- The TwiML response instructs Twilio to connect to the WebSocket server (via the WS_URL) and to greet the caller with the WELCOME_GREETING message.
- In this example, the code uses ElevenLabs as the text-to-speech provider with a specific voice ID (FGY2WhTYpPnrIDTdsKH5 is an example voice ID; replace it with your own if using ElevenLabs). This is an optional customization: you can remove the ttsProvider and voice attributes entirely to use Twilio's default TTS, or swap "ElevenLabs" for "Amazon" or "Google" if you prefer a different provider.
- Refer to the ConversationRelay Docs for more information on all the supported attributes, like language, ttsProvider, and voice.
- Note: This endpoint doesn't validate Twilio request signatures. For production use, see Twilio's request validation documentation.
6. WebSocket connection handling
The /ws WebSocket endpoint handles real-time communication between Twilio and the server. It receives messages from Twilio, processes them, and sends responses back.
- The while loop listens for incoming messages from Twilio. The server processes different message types:
- setup: Initializes a new Gemini chat session for this call. The code uses client.chats.create() with the gemini-2.5-flash model and passes the system prompt via the config parameter. This creates a stateful chat object that maintains conversation history automatically.
- prompt: Processes the user's voice input (transcribed to text by Twilio) and sends it to Gemini via the gemini_response() function. The response text is sent back to Twilio, which converts it to speech.
- interrupt: Handles any interruptions during the conversation (e.g., when the caller speaks over the AI).
- Unknown message: Logs any unrecognized message types.
- sessions: Active Gemini chat sessions are stored in the sessions dictionary, indexed by the call_sid. When the WebSocket disconnects, the session is cleaned up.
7. Run the FastAPI server
Finally, run the FastAPI server using Uvicorn:
- uvicorn.run(): This starts the server, listening on the specified port (default 8080). The server is now ready to handle both HTTP requests (for TwiML) and WebSocket connections (for real-time conversation).
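In code, this is the standard entry-point guard at the bottom of main.py (app and PORT being the names defined earlier):

```python
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=PORT)
```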
Run the server
To run the server, first, open a terminal and start an Ngrok tunnel:
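Assuming the server will run on the default port 8080:

```shell
ngrok http 8080
```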
Start Ngrok first because you will need its forwarding URL in two places: the Twilio Console and your .env file.
Take note of your Ngrok URL (e.g., https://1234abcd.ngrok.app) and add the domain part to your .env file:
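For instance, if your code reads the domain from an NGROK_URL variable, the entry for the example URL above would be:

```
NGROK_URL=1234abcd.ngrok.app
```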
Now you can start the server with:
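In a second terminal (with your virtual environment activated):

```shell
python main.py
```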
Configure Twilio
Go to the Twilio console and select your Twilio phone number. In the Configure tab, under Voice Configuration settings, set the webhook URL to your Ngrok URL (including /twiml), like so:
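Using the example Ngrok domain from earlier, the webhook URL would look like this (leave the HTTP method set to POST, the default):

```
https://1234abcd.ngrok.app/twiml
```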
Test the integration
With everything set up, dial your Twilio phone number from your mobile phone. You should hear the greeting message, and you can start asking questions to the AI.
What's next?
This Python app sets up a FastAPI server that connects Twilio's voice capabilities with ConversationRelay, allowing real-time communication with an AI assistant powered by Google Gemini. The server takes care of both HTTP requests (to give Twilio its instructions) and WebSocket connections (for live communication with the AI). As you dial in, ConversationRelay orchestrates transcription handling, communications with the LLM, and text-to-speech based on the LLM's response, to power a fluid conversation between you and a Gemini-powered AI assistant.
Now that you have a working real-time voice assistant, you can explore other customizations like swapping TTS providers, adding Function Calling to let Gemini check the weather or book appointments during the call, or experimenting with different Gemini models.
If you are looking for more examples and demos with ConversationRelay, check out these blog posts from other Twilions:
- Integrate OpenAI with Twilio Voice Using ConversationRelay and Python
- Integrate OpenAI with Twilio Voice Using ConversationRelay in NodeJS
Prefer to watch? Here's my walkthrough of the Twilio and Gemini Flash integration:
Rishab Kumar is a Developer Evangelist at Twilio and a cloud enthusiast. Get in touch with Rishab on Twitter @rishabincloud and follow his cloud, DevOps, and DevRel adventures at youtube.com/@rishabincloud