Building an Inbound Voice Agent with Twilio and Deepgram
Most voice agent starter code stops at "we have audio flowing." The hard parts get left as an exercise for the reader: How do you let a caller interrupt the agent mid-sentence? How does the agent know the caller is done talking and not just pausing? How do you connect voice conversations to a real backend? How do you secure the endpoint?
That's why my team at Deepgram shipped a reference implementation for building inbound telephony voice agents on Twilio that actually handles the production concerns. A caller dials your Twilio number, talks to an AI agent that can take actions on their behalf, and hangs up, with the whole thing deployed and secured out of the box.
Here's the architecture and what you get for free by forking it.
The architecture: one WebSocket bridge between Twilio and Deepgram
The core of the system is a class called VoiceAgentSession that bridges two WebSocket connections: one to Twilio, one to Deepgram's Voice Agent API. The Voice Agent API is a single endpoint that combines speech-to-text, LLM reasoning, and text-to-speech into one real-time loop. Instead of stitching together separate services for transcription, AI responses, and voice synthesis, you send audio in and get audio back. Your server just translates between Twilio's JSON-based audio protocol and Deepgram's binary audio protocol.
That's the whole bridge. Twilio sends base64-encoded mulaw audio (the standard encoding for telephony audio) as JSON. Deepgram wants raw mulaw bytes. Everything else (prompts, function definitions, backend integrations) is what you'll rewrite for your use case.
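The translation described above can be sketched as two small functions. This is a minimal illustration rather than the repo's actual code: the function names are hypothetical, and it assumes Twilio's Media Streams message shape, where each `media` event carries base64 audio under `media.payload`.

```python
import base64
import json
from typing import Optional


def twilio_media_to_mulaw(message: str) -> Optional[bytes]:
    """Return raw mu-law bytes from a Twilio 'media' message, else None."""
    event = json.loads(message)
    if event.get("event") != "media":
        # 'connected', 'start', 'stop' and similar events carry no audio.
        return None
    return base64.b64decode(event["media"]["payload"])


def mulaw_to_twilio_media(audio: bytes, stream_sid: str) -> str:
    """Wrap raw mu-law bytes from the agent in a Twilio 'media' message."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })
```

In a running server, the first function would sit on the Twilio-to-Deepgram path and the second on the return path, with the stream SID taken from Twilio's `start` event.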
The call flow: a caller dials your Twilio number → Twilio POSTs to your /incoming-call webhook → your server returns TwiML with <Connect><Stream> → Twilio opens a Media Stream → VoiceAgentSession takes over.
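As a rough sketch of the webhook's response, the TwiML can be built as a plain string. The `/media-stream` path and the `stream_twiml` helper are assumptions made for illustration; the repo may use a different route and build the response differently.

```python
from xml.sax.saxutils import quoteattr


def stream_twiml(host: str, path: str = "/media-stream") -> str:
    """Build TwiML telling Twilio to open a Media Stream to this server."""
    # Hypothetical WebSocket route; Twilio requires a wss:// URL here.
    ws_url = f"wss://{host}{path}"
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        f"<Stream url={quoteattr(ws_url)} />"
        "</Connect></Response>"
    )
```

The `/incoming-call` handler would return this string with an XML content type, after which Twilio connects to the WebSocket URL and starts sending audio.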
What the inbound voice agent includes out of the box
The example scenario is a dental office receptionist that can check appointment slots, book appointments, look them up, and cancel them, all through Voice Agent API function calls. The scenario is just a starting point. Swap the prompt and the functions and it becomes a support line, a reservations desk, or a lead qualifier.
Here's what I'd point out specifically for Twilio developers:
Twilio request signature validation is on by default. The /incoming-call webhook verifies every Twilio request so nobody can hit your endpoint and get a free voice agent session on your dime. This is the kind of thing that's easy to skip in a prototype and painful to bolt on later, so it ships in the template.
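To make the check concrete, here is a minimal standard-library sketch of Twilio's signature scheme for form-encoded POST webhooks: concatenate the full request URL with each parameter name and value in sorted order, HMAC-SHA1 the result with your auth token, base64-encode it, and compare against the `X-Twilio-Signature` header. The function names are hypothetical, and this is not necessarily how the repo does it; in practice the Twilio helper library's `RequestValidator` handles this for you.

```python
import base64
import hashlib
import hmac
from typing import Mapping


def compute_twilio_signature(auth_token: str, url: str,
                             params: Mapping[str, str]) -> str:
    """Compute the signature Twilio would send for this URL and form body."""
    # Full URL followed by each key and value, sorted by key, no separators.
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"),
                      payload.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(auth_token: str, url: str,
                            params: Mapping[str, str],
                            header_signature: str) -> bool:
    """Constant-time comparison against the X-Twilio-Signature header."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected.encode("utf-8"),
                               header_signature.encode("utf-8"))
```

The URL must be the exact public URL Twilio requested, which matters when the server sits behind a proxy that rewrites the scheme or host.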
Flux handles turn-taking natively. Flux is Deepgram's speech-to-text model built specifically for real-time voice agent conversations. It understands when the caller is actually done talking versus just pausing mid-sentence. For telephony this matters a lot. Without it, you're left tuning voice activity detection (VAD) thresholds and still getting the agent to talk over people.
Barge-in (caller interruption) is wired in. When the Voice Agent detects the caller started speaking, the server sends a Twilio clear event to flush the outbound audio buffer. The caller can cut in at any time without waiting for the agent to finish its sentence.
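A sketch of that barge-in hook might look like the following. The `clear` message follows Twilio's Media Streams format; the `UserStartedSpeaking` event name is an assumption about what the Voice Agent sends, and `barge_in_clear` is a hypothetical helper, not a function from the repo.

```python
import json
from typing import Optional


def barge_in_clear(agent_event: str, stream_sid: str) -> Optional[str]:
    """Return a Twilio 'clear' message if the caller started speaking."""
    event = json.loads(agent_event)
    # Assumed event name for "caller began talking"; adjust to the real API.
    if event.get("type") != "UserStartedSpeaking":
        return None
    # 'clear' asks Twilio to drop any agent audio it has buffered for playback.
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

The server would call this for every JSON message arriving from the agent and forward any non-`None` result to the Twilio WebSocket.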
Function calls connect the voice agent to your backend. The Deepgram Voice Agent API supports tool use. When the LLM decides to call a function (like checking appointment availability), Deepgram sends a function call event, the server executes it against the backend service, and sends the result back. The agent incorporates it into its next response naturally. The repo includes a mock scheduling backend. Keep the same method signatures, replace the bodies with HTTP calls to your real API, and the voice agent layer doesn't need to change.
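The request-and-response loop can be sketched as a small dispatcher over a mock backend. Everything here is illustrative: `MockScheduler`, its method names, and the simplified message shape (an `id`, a `name`, and JSON-encoded `arguments` in; a `FunctionCallResponse` out) are assumptions, not the repo's classes or the exact Voice Agent API schema.

```python
import json
from typing import Any, Dict


class MockScheduler:
    """Hypothetical in-memory stand-in for a real scheduling API."""

    def __init__(self) -> None:
        self._slots = {"2030-01-07": ["09:00", "14:30"]}

    def check_slots(self, date: str) -> Dict[str, Any]:
        return {"date": date, "available": list(self._slots.get(date, []))}

    def book(self, date: str, time: str, name: str) -> Dict[str, Any]:
        slots = self._slots.get(date, [])
        if time not in slots:
            return {"booked": False, "reason": "slot unavailable"}
        slots.remove(time)  # the slot is no longer offered to other callers
        return {"booked": True, "date": date, "time": time, "name": name}


def handle_function_call(request: Dict[str, Any],
                         backend: MockScheduler) -> Dict[str, Any]:
    """Run the requested function and build the reply for the agent."""
    handlers = {"check_slots": backend.check_slots, "book": backend.book}
    handler = handlers.get(request["name"])
    if handler is None:
        result: Dict[str, Any] = {"error": f"unknown function {request['name']}"}
    else:
        # Arguments are assumed to arrive as a JSON-encoded object.
        result = handler(**json.loads(request["arguments"]))
    return {
        "type": "FunctionCallResponse",
        "id": request["id"],
        "name": request["name"],
        "content": json.dumps(result),
    }
```

Because the dispatcher only depends on method names and return values, swapping the method bodies for HTTP calls to a real API leaves `handle_function_call` untouched.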
You can test without a Twilio account. The repo includes a dev_client.py that streams your microphone over the same WebSocket the server expects from Twilio. It exercises the same bridge, and no phone number is required. Iterate on prompts and functions at your desk before you touch a Twilio console.
Getting started with the inbound voice agent
The repo includes a setup wizard that configures Twilio and deploys to Fly.io (a platform for running app servers close to your users) in one command.
The wizard walks you through picking a Twilio number, deploying the server, and wiring the webhook. Deepgram gives you $200 in free credit to get started with no credit card required. Fork it, strip out the dental office, and drop in your own prompts and functions. The repo is MIT-licensed.
If you're looking for the outbound counterpart to this pattern, where your system initiates the call, handles answering machine detection, and posts structured outcomes back to a CRM, see Building an Outbound Voice Agent with Twilio and Deepgram.