Perform a Warm Transfer to a Human Agent from the OpenAI Realtime API using Twilio Programmable SIP and TypeScript

September 08, 2025
Written by Margot Hughan
Reviewed by Paul Heath (Twilion) and Paul Kamp (Twilion)

We’re so excited for our friends at OpenAI, who just released their Realtime API to general availability. Their multilingual and multimodal gpt-realtime model unlocks the ability to directly analyze a caller’s pitch and tone to better understand their sentiment and intent.

With their GA, OpenAI has also released a new SIP connector to connect to the gpt-realtime model via SIP. My colleague Paul has a tutorial on how you can get started with the OpenAI Realtime SIP Connector using Twilio’s Elastic SIP Trunking. In this tutorial, I will show how to use Twilio Programmable SIP to dynamically manage the call, unlocking the ability to perform a warm transfer to a human agent.

Architecture for SIP human escalation with the OpenAI Realtime API and Twilio

In this tutorial, you’ll be building the following architecture in order to perform a warm transfer from the virtual agent – powered by the OpenAI gpt-realtime model – to a human agent.

Call flowchart depicting interactions among Caller, App, Twilio, OpenAI, Websocket, and Human Agent.

Prerequisites

Before you can build this demo with OpenAI’s Realtime API, you’ll need to check off a few prerequisites.

  • An upgraded Twilio account with a Primary Business Profile - Sign up using this link
  • A Twilio number with Voice capabilities. See instructions to purchase
  • An OpenAI API key on a premium plan or with available credits
  • Access to the OpenAI Realtime API. Check here for more information
  • Node.js/TypeScript (I used version 22.15.0 – you can download it from here)
  • A tunnelling solution like Ngrok (You can download ngrok here)

Get started with OpenAI and Ngrok

In order to connect to the OpenAI gpt-realtime model via Twilio Programmable SIP, you’ll first create a Webhook with OpenAI to direct SIP calls to. When OpenAI receives an incoming call to your OpenAI project, it will send a realtime.call.incoming event to the webhook. You’ll reply with the instructions for the session – in our example, the prompt instructions, voice, and available tools. You’ll also set up a WebSocket connection so you can exchange messages with the session, including receiving tool execution requests from the model.

If you have already completed my teammate Paul’s tutorial on how to Connect the OpenAI Realtime SIP Connector with Twilio Elastic SIP Trunking, you can reuse the webhook you’ve set up within OpenAI’s console and will just have to make some adjustments to your app. If you’re starting fresh, no sweat – we’ll step through it all below.

There is a repo for this build which you can find here.

Follow Paul’s Set up a static domain with Ngrok and the Get started with SIP and OpenAI Realtime steps, but come back before you set up the Elastic SIP Trunk.

Configure your Twilio Number

First, you will configure your Twilio number to get instructions for incoming calls from a webhook, /incoming-call, which you will set up later on. You will use the domain you created in the Set up a static domain with Ngrok step to build this URL.

You can add this URL either via the Console or the IncomingPhoneNumber API resource.

From the Console,

  1. Access the Active Numbers page in Console.
  2. Click the desired phone number to modify.
  3. Scroll to the Voice Configuration section.
  4. Add the URL https://{DOMAIN}/incoming-call for A call comes in, using the domain you created previously.
Twilio Voice Configuration page showing regional call routing and webhook setup options.
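If you prefer to script this rather than click through the Console, the same change can be made against the IncomingPhoneNumber REST resource. The sketch below assumes a hypothetical TWILIO_NUMBER_SID variable holding your number’s PN… SID, plus the same credentials and DOMAIN you’ll put in your .env shortly:

```typescript
// Sketch: update the number's voice webhook via Twilio's IncomingPhoneNumbers
// REST resource. TWILIO_NUMBER_SID is a hypothetical extra variable holding
// your number's PN... SID; the other variables match the .env set up later.
const { TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_NUMBER_SID, DOMAIN } = process.env;

const voiceUrl = `https://${DOMAIN}/incoming-call`;

async function updateVoiceUrl(): Promise<void> {
  // Skip quietly if credentials aren't configured
  if (!(TWILIO_ACCOUNT_SID && TWILIO_AUTH_TOKEN && TWILIO_NUMBER_SID)) return;
  const resp = await fetch(
    `https://api.twilio.com/2010-04-01/Accounts/${TWILIO_ACCOUNT_SID}/IncomingPhoneNumbers/${TWILIO_NUMBER_SID}.json`,
    {
      method: "POST",
      headers: {
        // Twilio's REST API uses HTTP Basic auth with your Account SID and Auth Token
        Authorization:
          "Basic " + Buffer.from(`${TWILIO_ACCOUNT_SID}:${TWILIO_AUTH_TOKEN}`).toString("base64"),
        "Content-Type": "application/x-www-form-urlencoded",
      },
      // Form parameters are PascalCase in Twilio's REST API
      body: new URLSearchParams({ VoiceUrl: voiceUrl, VoiceMethod: "POST" }),
    }
  );
  console.log("Update status:", resp.status);
}

updateVoiceUrl().catch(console.error);
```

Either path gets you to the same place: incoming calls to your number will hit /incoming-call on your domain.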

Build the App

This section handles adding an incoming call to a Twilio Conference, where the caller chats with the virtual agent while waiting for a human agent.

If you want to just start with the repo, you can find that here.

Conferences add a host of orchestration capabilities such as putting users on hold and muting, but most importantly for this use case, they let you keep the user engaged with the virtual agent until a human agent has accepted the call to join the conference. This workflow also ensures the caller is never disconnected – no dropped calls in our demo!

Set up the code base

Open up a terminal and run the following commands:

mkdir openai-programmable-sip && cd openai-programmable-sip
npm init -y
npm install twilio body-parser dotenv express openai ws
npm install --save-dev @types/express @types/node @types/ws tsx typescript
mkdir src
touch src/index.ts
mkdir dist
touch .env
touch tsconfig.json

This sets up the Node/TypeScript environment, installs the production and development packages, then creates a src directory, a dist build directory, a .env file to hold all your keys, and a tsconfig.json file to tell the TypeScript compiler what to do when compiling your code.

Next, let’s set up some Environment variables.

Set up Environment Variables

Open up the .env file in your favorite text editor or IDE. Then, follow these steps to configure your app:

  • Enter your OpenAI project API Key
  • Enter the OpenAI Webhook secret (which you copied earlier when setting up the webhook)
  • Paste your OpenAI project ID. You can find this in the URL when viewing your OpenAI project in the OpenAI platform, or by going to Settings > Project > General on the left hand side.
  • Set a PORT so your server knows where to listen
  • Also add the domain you created in the previous step, without the scheme (the ‘https://’), for example mhughan.ngrok.io.
  • Next, add your Twilio credentials in TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN. You can find these in your Twilio Console.
  • Finally, enter the number for the human agent you want to transfer the call to in E.164 format

If you don’t have a second phone to place the initial call and receive the transferred one, you can always use a second Twilio number with a voice URL that executes TwiML to simulate a human greeting for testing. (More on that later.)

OPENAI_API_KEY=sk-proj-
OPENAI_WEBHOOK_SECRET=whsec
OPENAI_PROJECT_ID=proj_
PORT=8000
DOMAIN=mhughan.ngrok.io
TWILIO_ACCOUNT_SID=
TWILIO_AUTH_TOKEN=
HUMAN_AGENT_NUMBER=+15551112222

Set up tsconfig.json

Next, enter the below into the tsconfig.json file. This tells the TypeScript compiler various options, for example what flavor of JavaScript/TypeScript you are working with, where your TypeScript files exist, and where the build files should be compiled to:

{
 "compilerOptions": {
   "target": "ES2020",
   "module": "ESNext",
   "moduleResolution": "Node",
   "strict": true,
   "esModuleInterop": true,
   "forceConsistentCasingInFileNames": true,
   "resolveJsonModule": true,
   "skipLibCheck": true,
   "outDir": "dist"
 },
 "include": ["src"]
}

Next, make sure the below text is included in the package.json file.

"scripts": {
   "test": "echo \"Error: no test specified\" && exit 1",
   "dev": "tsx watch src/index.ts",
   "build": "tsc -p .",
   "start": "node dist/index.js"
 }

I have no unit tests for this tutorial, but you may want to add some to yours (and you certainly will before you take an application to production!).

The dev script runs tsx in watch mode on your source (this has the nice upside of restarting when you make changes). build tells the TypeScript compiler to put the compiled JavaScript files in dist, and start tells Node where to find the compiled index JavaScript file, dist/index.js.

Lay the groundwork for the server

Nice work, now we’re cooking with gas. Next, I’ll help you write a basic Express server that can receive OpenAI webhooks, verify them, and then talk back to the Realtime API. To start, the app pulls in your environment variables and creates a client each for OpenAI and Twilio. The Express app is configured with a raw body parser because OpenAI’s signature verification needs the exact request bytes, untouched by JSON parsing.

To your src/index.ts file, add the following.

import express, { Request, Response } from "express";
import bodyParser from "body-parser";
import WebSocket from "ws";
import OpenAI from "openai";
import twilio from "twilio";
import "dotenv/config";
const DOMAIN = process.env.DOMAIN;
const PORT = Number(process.env.PORT ?? 8000);
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const OPENAI_PROJECT_ID = process.env.OPENAI_PROJECT_ID;
const WEBHOOK_SECRET = process.env.OPENAI_WEBHOOK_SECRET;
const openAiClient = new OpenAI({ apiKey: OPENAI_API_KEY });
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = twilio(accountSid, authToken);
const HUMAN_AGENT_NUMBER = process.env.HUMAN_AGENT_NUMBER;
if (!DOMAIN || !OPENAI_API_KEY || !OPENAI_PROJECT_ID || !WEBHOOK_SECRET || !accountSid || !authToken || !HUMAN_AGENT_NUMBER) {
 console.error("Missing some variables in your .env");
 process.exit(1);
}
const app = express();
app.use(bodyParser.raw({ type: "*/*" }));
const RealtimeIncomingCall = "realtime.call.incoming" as const;

Next add a variable to map the unique OpenAI call_id for each call to the Twilio Conference so you can add and remove participants from the call when you get tool execution requests from OpenAI.

You’ll also add two variables to map the conference name to the caller ID (From number) of the user who called in and the authentication callToken for that incoming call. Saving the callToken lets you reuse that caller’s From number for the outbound call to the human agent. This is nice because if for any reason the call drops (not that it should!), the agent can call back the user.

const callIDtoConferenceNameMapping: Record<string, string | undefined> = {};
const ConferenceNametoCallerIDMapping: Record<string, string | undefined> = {};
const ConferenceNametoCallTokenMapping: Record<string, string | undefined> = {};

Instruct the GPT Realtime model

Here you’ll define how you want the model to greet the user and any prompt instructions and available tools for the model. For this use case, you will define a function to addHumanAgent in the event that the user asks to speak to a real person.

const WELCOME_GREETING = "Hello, I'm an AI agent. How can I help you?";
const SYSTEM_PROMPT = "You are a support agent. Speak in English unless the user requests a different language. If the caller asks to speak to a real person, use the addHumanAgent function.";
const MODEL = "gpt-realtime";
const VOICE = "alloy";
const responseCreate = {
 type: "response.create",
 response: {
   instructions: `In English, say to the user: ${WELCOME_GREETING}`,
 },
} as const;
const callAccept = {
   instructions: SYSTEM_PROMPT,
   model: MODEL,
   audio: {
     output: { voice: VOICE },
   },
   type: "realtime",
   tools: [
     {
         type: 'function',
         name: 'addHumanAgent',
         description: 'Adds a human agent to the call with the user.',
         parameters: {"type": "object", "properties": {}, "required": []},
     }
 ]
} as const;

Receive incoming calls to your Twilio Number and orchestrate the conference

Next, create the webhook to receive incoming calls to your Twilio number.

When a user calls in, your app will make a call out to your OpenAI project’s SIP connector to add the virtual agent to the conference, powered by the Realtime API. Your app will then add the caller to that same conference.

You name the conference with the CallSid of the incoming call so it’s guaranteed to be unique, even if multiple users call your application simultaneously. You will also pass the conference name as a custom SIP header to OpenAI so that when OpenAI sends back that header in the webhook request for the incoming call, you can map the OpenAI call_id to the conference.

app.post("/incoming-call", (req: Request, res: Response) => {
console.log("Incoming call webhook received");
 const rawBody = req.body.toString("utf8");
 const parsedBody = Object.fromEntries(new URLSearchParams(rawBody));
 const conferenceName = `${parsedBody.CallSid}`;
 ConferenceNametoCallerIDMapping[conferenceName] = parsedBody.From;
 ConferenceNametoCallTokenMapping[conferenceName] = parsedBody.CallToken;
 async function createParticipant() {
     await client
         .conferences(conferenceName)
         .participants.create({
             from: parsedBody.From, // Use the from number from the call
             label: "virtual agent",
             to: `sip:${OPENAI_PROJECT_ID}@sip.api.openai.com;transport=tls?X-conferenceName=${conferenceName}`,
             earlyMedia: false,
             callToken: parsedBody.CallToken,
             conferenceStatusCallback: `https://${DOMAIN}/conference-events`,
             conferenceStatusCallbackEvent: ['join']
         });     
 }
 createParticipant().catch((err) => console.error("Error adding virtual agent:", err));
 const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
                       <Response>
                           <Dial>
                               <Conference
                                   startConferenceOnEnter="true"
                                   participantLabel="customer"
                                   endConferenceOnExit="true"
                                   statusCallback="https://${DOMAIN}/conference-events"
                                   statusCallbackEvent="join"
                               >
                                   ${conferenceName}
                               </Conference>
                           </Dial>
                       </Response>`;
 res.type('text/xml').send(twimlResponse);
});

In the previous step, you added a conferenceStatusCallback to the conference so you would be notified when a participant joins the conference. When that webhook is called, you will check whether the participant that joined was the “human agent” and, if so, end the call with the virtual agent so the user can speak directly with the agent without interruption.

app.post("/conference-events", (req: Request, res: Response) => {
 const rawBody = req.body.toString("utf8");
 const parsedBody = Object.fromEntries(new URLSearchParams(rawBody));
 if (parsedBody.ParticipantLabel === 'human agent' && parsedBody.StatusCallbackEvent === 'participant-join') {
     async function findVirtualAgentandDisconnect() {
         const participants = await client
           .conferences(parsedBody.ConferenceSid)
           .participants.list({
             limit: 20,
           });
         for (const participant of participants) {
             if (participant.label === 'virtual agent') {
                 // End the virtual agent call
                 await client.calls(participant.callSid).update({ status: 'completed' });
                 console.log('Virtual agent call ended.');
             }
         }
       }
      findVirtualAgentandDisconnect().catch((err) =>
          console.error("Error removing virtual agent:", err));
 }
 // Acknowledge the status callback so Twilio isn't left waiting on a response
 res.sendStatus(200);
});

Receive incoming calls to the OpenAI SIP Connector

As we discussed above, when your app sends the SIP call to the OpenAI SIP connector, you will get a webhook request from OpenAI indicating they received an incoming call. Here, you will set up the app to receive that webhook request and respond with the model instructions defined previously.

You grab the call_id from the request and the conference name from the custom SIP header and add that to the mapping for later. You also accept the call from OpenAI and set up a WebSocket so you can receive and send data to the session.

app.get("/health", async (req: Request, res: Response ) => {
 return res.status(200).send(`Health ok`);
});
app.post("/", async (req: Request, res: Response) => {
 try {
   const event = await openAiClient.webhooks.unwrap(
     req.body.toString("utf8"),
     req.headers as Record<string, string>,
     WEBHOOK_SECRET
   );
   const type = (event as any)?.type;
   if (type === RealtimeIncomingCall) {
     const callId: string = (event as any)?.data?.call_id;
     const sipHeaders = (event as any)?.data?.sip_headers;
     let foundConferenceName: string | undefined;
     if (Array.isArray(sipHeaders)) {
       const conferenceHeader = sipHeaders.find(
         (header: any) => header.name === "X-conferenceName"
       );
       foundConferenceName = conferenceHeader?.value;
     }
      callIDtoConferenceNameMapping[callId] = foundConferenceName;
     // Accept the Call
     const resp = await fetch(
       `https://api.openai.com/v1/realtime/calls/${encodeURIComponent(callId)}/accept`,
       {
         method: "POST",
         headers: {
           Authorization: `Bearer ${OPENAI_API_KEY}`,
           "Content-Type": "application/json",
         },
         body: JSON.stringify(callAccept),
       }
     );
     if (!resp.ok) {
       const text = await resp.text().catch(() => "");
       console.error("ACCEPT failed:", resp.status, resp.statusText, text);
       return res.status(500).send("Accept failed");
     }
     // Connect the web socket after a short delay
     const wssUrl = `wss://api.openai.com/v1/realtime?call_id=${callId}`
     await connectWithDelay(wssUrl, 0); // lengthen delay if needed
      // Acknowledge the webhook (no need to echo your API key back in the response)
      return res.sendStatus(200);
   }
   return res.sendStatus(200);
 } catch (e: any) {
   const msg = String(e?.message ?? "");
   if (e?.name === "InvalidWebhookSignatureError" || msg.toLowerCase().includes("invalid signature")) {
     return res.status(400).send("Invalid signature");
   }
   return res.status(500).send("Server error");
 }
});
app.listen(PORT, () => {
 console.log(`Listening on http://localhost:${PORT}`);
});

Manage the WebSocket

The final step (finally!) is to define how to create that WebSocket, handle incoming messages, and execute tool calls. When the user asks to speak to a real person, OpenAI will send a message to the WebSocket with type response.done. Here, you will create a handleFunctionCall function to check if that message was a function_call and further check whether it was a request to execute the addHumanAgent tool.

If the conditions match, you will get the call_id of the overall OpenAI session from the URL (not to be confused with the call_id for this WebSocket message!) and use that call_id to add the human agent.

You might recall the variables you created earlier to map the call_id to the conference name and the conference name to the From number and call_token. That was all for this payoff – you can now place an outbound call to the human agent, using the call_token from the inbound call to prove you have the right to use the same From number for this outbound call.

Label this participant as “human agent” so you can properly remove the virtual agent from the conference once you get the status callback that the human agent joined. PHEW!

const connectWithDelay = async (sipWssUrl: string, delay: number = 1000): Promise<void> => {
 try {
   setTimeout(async () => await websocketTask(sipWssUrl), delay);
 } catch (e) {
   console.error(`Error connecting web socket ${e}`);
 }
};
const websocketTask = async (uri: string): Promise<void> => {
 const ws = new WebSocket(uri, {
   headers: {
     Authorization: `Bearer ${OPENAI_API_KEY}`,
     origin: "https://api.openai.com",
   },
 });
 ws.on("open", () => {
   console.log(`WS OPEN ${uri}`);
   ws.send(JSON.stringify(responseCreate));
 });
 ws.on("message", (data) => {
   const text = typeof data === "string" ? data : data.toString("utf8");
   try {
     const response = JSON.parse(text);
     if (response.type === 'response.done') {
       const output = response.response?.output?.[0];
       if (output) {
         handleFunctionCall(output, uri);
       }
     }
   } catch (error) {
     console.error('Error processing OpenAI message:', error, 'Raw message:', data);
   }
 });
 function handleFunctionCall(output: { type: string; name: string; call_id: any; }, uri: string | URL) {
   if (output?.type === "function_call" &&
       output?.name === "addHumanAgent" &&
       output?.call_id
     ) {
       const url = new URL(uri);
       const extractedCallId = url.searchParams.get("call_id");
        if (extractedCallId) {
         addHuman(extractedCallId);
       } else {
         console.error("Call ID is null, cannot add human agent.");
       }
        const keepChatting = {
           type: 'conversation.item.create',
           item: {
               type: 'message',
               role: 'user',
               content: [
                   {
                       type: 'input_text',
                        text: "While we wait for the human agent, keep chatting with the user or ask if there's anything you can help with while they wait.",
                   }
               ]
           }
       };
        ws.send(JSON.stringify(keepChatting));
       ws.send(JSON.stringify({ type: 'response.create' }));
   }
 }
 async function addHuman(openAiCallId : string) {
   const conferenceName = callIDtoConferenceNameMapping[openAiCallId];
   if (!conferenceName) {
     console.error('Conference name is undefined for call ID:', openAiCallId);
     return;
   }
   console.log('Adding human to conference:', conferenceName);
   const callToken = ConferenceNametoCallTokenMapping[conferenceName];
   const callerID = ConferenceNametoCallerIDMapping[conferenceName];
    await client
       .conferences(conferenceName)
       .participants.create({
           from: callerID ?? (() => { throw new Error("CallerID is not defined"); })(),
           label: "human agent",
           to: HUMAN_AGENT_NUMBER ?? (() => { throw new Error("HUMAN_AGENT_NUMBER is not defined"); })(),
           earlyMedia: false,
           callToken: callToken ?? (() => { throw new Error("CallToken is not defined"); })(),
       });
 }
 ws.on("error", (e) => {
   console.error("WebSocket error:", JSON.stringify(e));
 });
 ws.on("close", (code, reason) => {
   console.log("WebSocket closed:", code, reason?.toString?.());
 });
}

Let’s run it!

Sweet, let’s boot back up our ngrok tunnel, then start the server.

In any terminal, execute:

ngrok http 8000 --url <ngrok domain endpoint>

Note that if your ngrok URL changes (for example, if you don’t have a custom ngrok domain, or if you restart ngrok), you need to update the DOMAIN environment variable in your .env before you start your server.

From your project directory, execute:

npm run dev

If all went well, you should see something like this:

> openai-programmable-sip@1.0.0 dev
> tsx watch src/index.ts

Listening on http://localhost:8000

Now, give your Twilio number a call. Feel free to chat with the agent for a bit and then make your request to talk to a real person. Following your request, you should find that the human agent is added to the call and the virtual agent is removed. You should now be happily chatting with a real person, not frustrated by extended hold times or archaic IVRs, ideally having already had some of your questions answered by your virtual agent. WIN.

(Optional) Simulate the human agent

If you don’t happen to have two phones to both place and receive calls, you can always purchase a second Twilio number and point it at a TwiML Bin that simulates greeting a user, so you can test that the virtual agent is, in fact, removed.

(I realize this does entirely defeat the purpose of warm transferring to a human, but try not to think about that!)

If you take that path, you can just create a TwiML Bin:

  1. Navigate to the TwiML Bin section of the Console.

  2. Click Create a new TwiML Bin.

  3. Give it a friendly name and populate the TwiML with the below code. We’ll add a five second pause so you can confirm the virtual agent doesn’t respond to the greeting – i.e., that it has truly been removed.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="Polly.Joanna-Generative">Hello this is a human!</Say>
    <Pause length="5"/>
    <Say voice="Polly.Joanna-Generative">Is the robot gone?</Say>
</Response>

4. Click Create.

5. Update the Voice Configuration for your second number to point to the TwiML Bin.

Voice configuration screen on account dashboard showing routing, configure options, and call handling.

Conclusion

In this tutorial we walked through how to perform a warm transfer from a virtual agent powered by the OpenAI gpt-realtime model to a human at the user’s request. 

By connecting to the OpenAI Realtime API using Twilio Programmable SIP, you can dynamically manage the call, seamlessly moving your users between virtual and human agents mid-conversation.

Margot Hughan is a Product Manager at Twilio and loves working on all things Voice. Her email address is mhughan [at] twilio.com