Build an AI Voice Assistant with Twilio Voice, the OpenAI Realtime Agents SDK, and Node.js

August 28, 2025
Written by
Dominik Kundel
Contributor
Opinions expressed by Twilio contributors are their own
Reviewed by
Paul Kamp
Twilion

One of the most powerful ways to interact with an AI agent is to talk to it. With help from Twilio Media Streams, you can connect a Twilio phone number to the OpenAI Realtime API.

The kind folks at OpenAI have made this even easier by building the OpenAI Agents SDK for TypeScript. It powers the voice interaction and also makes helpful extensions like tool calling and guardrails possible with a few lines of JavaScript/TypeScript code.

This tutorial will show you how to quickly build an AI Agent that can:

  • Stream OpenAI Realtime responses over a voice call
  • Integrate tool calling for an appointment scheduling bot
  • Manage output guardrails like blocking certain words from the agent response

Let's get started!

Prerequisites

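To code along with this post, you will need:

  • Node.js and npm installed
  • A Twilio account with a voice-capable phone number
  • An OpenAI API key
  • ngrok, to expose your local server to Twilio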

Project setup

Create a new directory to host the code:

mkdir twilio-realtime-agent && cd twilio-realtime-agent

Then initialize a new Node.js project and install the required dependencies:

npm init -y
npm install dotenv fastify @fastify/formbody @fastify/websocket zod @openai/agents @openai/agents-extensions
npm pkg set type="module"

Fastify is a lightweight web server framework, zod handles schema validation, and the OpenAI dependencies make interacting with the Realtime API much easier.

Create a .env file and populate it with the following. Make sure to add .env to your .gitignore file if you plan on using source control:

PORT=5050
# https://platform.openai.com/api-keys
OPENAI_API_KEY="sk-proj…."

Create a new file named index.js and paste in the following code. This sets up a basic Twilio webhook that uses TwiML, Twilio's Markup Language, to tell the phone number how to respond to an incoming call. After an initial greeting, Twilio opens a WebSocket to your server, and the application connects that WebSocket to the OpenAI Realtime API. Using Twilio Voice Media Streams, the application then provides a seamless interface between the caller and OpenAI.

import Fastify from "fastify";
import dotenv from "dotenv";
import fastifyFormBody from "@fastify/formbody";
import fastifyWs from "@fastify/websocket";
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";
import { TwilioRealtimeTransportLayer } from "@openai/agents-extensions";
dotenv.config();
const { OPENAI_API_KEY } = process.env;
if (!OPENAI_API_KEY) {
 console.error("Missing OPENAI_API_KEY");
 process.exit(1);
}
const PORT = +(process.env.PORT || 5050);
const fastify = Fastify();
fastify.register(fastifyFormBody);
fastify.register(fastifyWs);
const agent = new RealtimeAgent({
 name: "Triage Agent",
 instructions: "You are a helpful assistant. Keep responses brief.",
});
fastify.get("/", async (req, reply) => reply.send({ ok: true }));
const WELCOME_GREETING = `Hello, I am a voice assistant powered by Twilio and OpenAI. Ask me anything!`
// Webhook for incoming phone call. Connect to Twilio phone number
fastify.all("/incoming-call", async (request, reply) => {
 const twimlResponse = `
<?xml version="1.0" encoding="UTF-8"?>
<Response>
 <Say voice="Polly.Joanna-Neural">${WELCOME_GREETING}</Say>
 <Connect>
   <Stream url="wss://${request.headers.host}/media-stream" />
 </Connect>
</Response>`.trim();
 reply.type("text/xml").send(twimlResponse);
});
// Twilio opens a WebSocket to this route for bidirectional audio+events
fastify.register(async (fastify) => {
 fastify.get("/media-stream", { websocket: true }, async (connection) => {
   try {
     const transport = new TwilioRealtimeTransportLayer({
       twilioWebSocket: connection,
     });
     const session = new RealtimeSession(agent, { transport });
     await session.connect({ apiKey: OPENAI_API_KEY });
     console.log("Connected to OpenAI Realtime API");
   } catch (err) {
     console.error("Realtime connection error", err);
     connection.close();
   }
 });
});
fastify.listen({ port: PORT }, (err) => {
 if (err) {
   console.error(err);
   process.exit(1);
 }
 console.log(`Server listening on ${PORT}`);
});
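The /incoming-call handler above interpolates the greeting and the Host header directly into the XML. That works for the fixed greeting used here, but if the greeting ever comes from configuration or user data, characters such as & or < would produce invalid TwiML. Here is a minimal sketch of one way to guard against that. Note that escapeXml and buildTwiml are hypothetical helper names introduced for illustration, not part of the Twilio or OpenAI libraries:

```javascript
// Hypothetical helper (not part of any SDK): escape XML special characters
// so that interpolated text cannot break the TwiML document.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: build the same TwiML as the /incoming-call handler,
// escaping both interpolated values.
function buildTwiml(host, greeting) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Say voice="Polly.Joanna-Neural">${escapeXml(greeting)}</Say>`,
    "  <Connect>",
    `    <Stream url="wss://${escapeXml(host)}/media-stream" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

With these helpers, the handler body could become reply.type("text/xml").send(buildTwiml(request.headers.host, WELCOME_GREETING)).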

Start the server with:

node index.js

Use ngrok to expose the server publicly so that Twilio can reach it:

ngrok http 5050

Copy the https://<subdomain>.ngrok.io URL to use in the next step.

Connect your Twilio phone number and test

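Head over to the Twilio Console to configure your phone number. Select your phone number, then under Voice Configuration > A Call Comes In add the webhook endpoint. This is your ngrok URL followed by the /incoming-call route defined in index.js, for example https://<subdomain>.ngrok.io/incoming-call: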

Twilio Voice Configuration screen showing webhook URL setup for incoming calls.

Save the configuration and call your Twilio number. You should hear "Hello, I am a voice assistant powered by Twilio and OpenAI. Ask me anything!" and be able to interact with the agent.

The OpenAI Realtime Agents SDK handles the API session, speech recognition, and audio responses over the Twilio Media Stream.
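You never have to parse the Media Stream yourself, because the TwilioRealtimeTransportLayer does it for you. Still, it can help to know what travels over the WebSocket. The sketch below is illustrative only and is not how the SDK is implemented. It assumes the JSON message shapes from Twilio's Media Streams documentation (start, media, and stop events, with base64-encoded 8 kHz mu-law audio), and describeTwilioMessage is a hypothetical name:

```javascript
// Illustrative only: the SDK's transport layer handles these messages for you.
// Assumes the message shapes described in Twilio's Media Streams documentation.
function describeTwilioMessage(rawMessage) {
  const message = JSON.parse(rawMessage);
  switch (message.event) {
    case "start":
      return `stream ${message.start.streamSid} started`;
    case "media": {
      // The payload is base64-encoded audio. At 8 kHz mu-law there is one
      // byte per sample, so 8 bytes correspond to one millisecond.
      const audio = Buffer.from(message.media.payload, "base64");
      const milliseconds = audio.length / 8;
      return `received ${audio.length} bytes (~${milliseconds} ms of audio)`;
    }
    case "stop":
      return "stream stopped";
    default:
      return `ignored event: ${message.event}`;
  }
}
```

Logging these descriptions from a throwaway WebSocket route is one way to debug a call that connects but stays silent.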

Extend your agent's capabilities with tool calling

This bot already does a lot, but it's too generic for most real use cases. Tools give your agents more power to take helpful action, and the Agents SDK will automatically decide when to call a tool and pass in the validated inputs.

Next, you’ll tailor the application to a veterinary office by adding a new greeting, new instructions, and appointment scheduling capabilities. In your index.js, you'll add a tool definition for scheduling appointments. This code uses zod for schema validation, which helps make sure the inputs are structured correctly.

To reflect the new use case, first update the welcome greeting to something you might hear when you call a veterinarian's office:

const WELCOME_GREETING = `Thank you for calling Dr. Vet's office! How can I help you today?`

Import zod and add tool as an import from the realtime SDK:

import { z } from 'zod';
import { RealtimeAgent, RealtimeSession, tool } from '@openai/agents/realtime';

Then add the tool definition. For this example, the tool's response is hardcoded, but the agent will still ask the caller for their preferred appointment date before calling the tool.

const scheduleAppointmentTool = tool({
 name: 'schedule_appointment',
 description: 'Schedule an appointment for a given date.',
 parameters: z.object({ date: z.string() }),
 execute: async (input) => {
   return `Appointment scheduled for ${input.date} at 10am`;
 },
});

Finally, update the agent configuration to add the tool:

const agent = new RealtimeAgent({
 name: "Triage Agent",
 instructions: "You are a helpful assistant at a veterinary office.",
 tools: [scheduleAppointmentTool],
});

Test it out by calling your Twilio phone number and asking "Can I schedule an appointment for Friday?" The agent should confirm the date, call your tool, and let you know the appointment has been scheduled for 10am. Notice that you didn't have to include instructions for how to call the tool. The SDK handles that automatically behind the scenes.
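The hardcoded response is enough for a demo, but a real office would need to check availability. Here is a minimal sketch of an execute function that does so, assuming an in-memory store. The names bookedDates and bookAppointment are hypothetical, and a production app would query a real scheduling system instead:

```javascript
// Hypothetical in-memory store; a real app would query a scheduling system.
const bookedDates = new Set();

// Could be passed as the tool's `execute` function: it receives the validated
// input ({ date }) and returns a string for the agent to relay to the caller.
async function bookAppointment({ date }) {
  // Normalize so that "Friday" and " friday " count as the same slot.
  const key = date.trim().toLowerCase();
  if (key === "") {
    return "No date was provided. Please ask the caller for a date.";
  }
  if (bookedDates.has(key)) {
    return `Sorry, ${date} is already booked. Please offer another day.`;
  }
  bookedDates.add(key);
  return `Appointment scheduled for ${date} at 10am`;
}
```

To try it, you could set execute: bookAppointment in the tool definition. Because the tool returns plain text, the agent can explain a conflict to the caller in its own words.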

Add output guardrails to your agent

Another neat built-in feature of the Agents SDK is the ability to add guardrails. An output guardrail checks what the agent is about to say and trips when the response contains something prohibited, whether that's specific words, content, or anything else you wish to define.

Let's add a block list of terms we don't want our agent to talk about, things like "cure" or "discount". Add a guardrails definition in your index.js:

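const guardrails = [
 {
   name: "Blocklist terms",
   async execute({ agentOutput }) {
     const blocklistTerms = ["diagnosis", "cure", "discount", "refund"];
     // Compare in lowercase so that "Refund" and "refund" both match
     const normalizedOutput = agentOutput.toLowerCase();
     const blocklistTermsInOutput = blocklistTerms.some((term) =>
       normalizedOutput.includes(term)
     );
     return {
       // Report whether any blocked term appeared in the agent's output
       tripwireTriggered: blocklistTermsInOutput,
       outputInfo: { blocklistTermsInOutput },
     };
   },
 },
];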

Then pass the guardrails into the RealtimeSession:

const session = new RealtimeSession(agent, {
  transport, // the TwilioRealtimeTransportLayer created earlier in the handler
  outputGuardrails: guardrails,
});

Restart the server and test by calling your Twilio phone number and asking for a refund. The agent will say something like "I'm sorry, but I'm not able to help with that request."

Guardrails are incredibly powerful for building agents safely. This example focuses on keywords to exclude, but you can customize this in many other ways, like using another agent to detect content. Learn more in this example that detects and prevents the agent from doing math.
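One limitation of the substring check used in the guardrail above is that it also flags harmless words that contain a blocked term, such as "secure", which contains "cure". Here is a minimal sketch of a whole-word variant that uses a regular expression. It keeps the same execute shape and return value as the guardrail above, and wholeWordGuardrail and blocklistPattern are hypothetical names:

```javascript
const blocklistTerms = ["diagnosis", "cure", "discount", "refund"];

// Match whole words only, case-insensitively: \b marks a word boundary.
const blocklistPattern = new RegExp(`\\b(${blocklistTerms.join("|")})\\b`, "gi");

// Same shape as the guardrail above: receives the agent output and reports
// whether the tripwire should fire, plus which terms were found.
const wholeWordGuardrail = {
  name: "Blocklist terms (whole words)",
  async execute({ agentOutput }) {
    const matches = agentOutput.match(blocklistPattern) ?? [];
    const matchedTerms = [...new Set(matches.map((m) => m.toLowerCase()))];
    return {
      tripwireTriggered: matchedTerms.length > 0,
      outputInfo: { matchedTerms },
    };
  },
};
```

You could add wholeWordGuardrail to the guardrails array in place of the original entry. The trade-off is that whole-word matching misses variants such as "refunds", so you would need to list those explicitly.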

Next steps for building realtime agents

There are so many more things that OpenAI’s Agents SDK enables, such as handoffs to more purpose-built agents, enablement for human-in-the-loop interactions, features to trace agent decisions, and more. Check out the examples to learn more and get inspired.

If you want to build more with Twilio Voice and OpenAI, check out these resources:


Kelley Robinson works on the developer relations team at Twilio and has over 10 years of experience as a software engineer in a variety of API and data engineering roles. Prior to working in software she traded live cattle futures, planned art fairs, established an endowment fund, and designed promotional posters for a regional beer distributor. She has delivered dozens of technical talks to large audiences, including live coding from the NYSE floor. She graduated from the University of Michigan and now lives in Upstate New York with her partner and a pit bull named Fish.

Dominik Kundel works on Developer Experience at OpenAI.