How to Call an AI Friend using GPT-3 with Twilio Voice and Functions

July 18, 2022
Written by Dhruv Patel
Reviewed by Dainyl Cua

How to call an AI friend Header

I always used to find myself getting into unwanted social situations that I could never get out of. As the engineer I am, I decided it’d be fun to build and publish a quick solution to this using my toolbelt here at Twilio: How to Escape Pesky Situations using Twilio Functions and TwiML Bins. This solution allowed me to text my “friend” Joanna – an Amazon Polly voice – a key phrase, which would trigger Joanna to call me and respond as if there were an emergency, using human-like synthesized speech.

This solution then made me wonder what other fun ways Joanna could help me out. Since it’s also National Anti-Boredom Month (July), I started focusing on ways to tackle boredom and pass time.

Gif of bored kid on couch

 

Whether it’s walking home from the subway or waiting in line at the store, I always find myself needing to kill time. I could call up a friend to pass the time, but I wanted to experiment with talking to a robot like Joanna and see whether it could converse as freely as a human. But how could I enable Joanna to chat like one?

Luckily, with OpenAI’s GPT-3 engine I can provide a prompt and generate text that is nearly indistinguishable from human writing. I can even personalize it to make Joanna sound witty and comedic, or even sound like my girlfriend🥲. Combining this with speech recognition from Twilio Programmable Voice, I’d be able to talk freely back and forth with Joanna, and to make sure our conversation sounds lifelike, Joanna’s Neural voice will be used.

This blog post will show you how to combine Twilio Programmable Voice with OpenAI’s GPT-3 engine to build a phone service that lets you converse with an AI friend. The entire service will be built on Twilio Functions.

Let’s get started!

Prerequisites

Here’s what you will need for this tutorial:

  • A free or paid Twilio account
  • A Twilio phone number with Voice capabilities
  • An OpenAI account and API key
  • A phone to call and test your AI friend with

Overview

Before you start building, let’s go over Twilio Functions, TwiML, and OpenAI’s GPT-3 engine and how they’ll be used to create your phone service.

Twilio Functions and TwiML

If you aren’t familiar with the service, Twilio Functions is a serverless Node.js environment that allows you to seamlessly create Twilio applications all through the Twilio Console; it's lightweight, scalable and secure. For this tutorial, you will be creating a Functions service that will handle incoming calls to your Twilio number.
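To give you an idea of the shape of the code you’ll write, here’s a bare-bones sketch of a Function handler (the greeting text is just a placeholder): request parameters arrive on the event object, environment variables on context, and the response is returned through callback.

exports.handler = function (context, event, callback) {
    // Build a TwiML voice response and hand it back to Twilio
    const twiml = new Twilio.twiml.VoiceResponse();
    twiml.say('Hello from a Twilio Function!'); // placeholder greeting
    return callback(null, twiml);
};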

The Twilio Markup Language (TwiML) is an XML-based language with a set of simple verbs and nouns that let you tell Twilio what to do with incoming and outgoing calls, SMS, and MMS messages. For this tutorial, you’ll be using and exploring TwiML’s <Say> and <Gather> verbs. The <Say> verb converts text to speech that is read back to the caller. It offers a variety of voices across different languages and genders, and also supports Amazon Polly voices. The <Gather> verb collects caller input during a voice call; in this tutorial, you’ll use it to transcribe the caller's speech.
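To give you a feel for these two verbs, here’s a minimal sketch using the TwiML helper that’s available inside a Function; the /respond action path mirrors the Function you’ll build later.

const twiml = new Twilio.twiml.VoiceResponse();
twiml.say({ voice: 'Polly.Joanna-Neural' }, 'Hey! How was your day?');
twiml.gather({ input: 'speech', action: '/respond' });

// twiml.toString() renders TwiML along these lines:
// <Response>
//   <Say voice="Polly.Joanna-Neural">Hey! How was your day?</Say>
//   <Gather input="speech" action="/respond"/>
// </Response>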

TwiML also supports Speech Synthesis Markup Language (SSML) inside the <Say> verb when using Amazon Polly voices, giving you finer control over the synthesized speech. For example, the <prosody> tag lets you control the volume, pitch, and rate of the spoken output.

Handle responses with GPT-3

The core of the OpenAI API is the completions endpoint: you input some text as a prompt, and the model will generate a text completion based on the context or the pattern of the prompt.

For example, if you give the API the prompt, “As Descartes said, I think, therefore”, it will likely return the completion “I am”. To get a better understanding of the completions endpoint, feel free to play around with OpenAI’s Playground or take a look at their examples.
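In code, a bare-bones completion request with the openai Node.js library (the same library you’ll add to your service later) might look like the sketch below; the model and token limit are just illustrative, and the output will vary from run to run.

const { OpenAI } = require("openai");

// Outside of Twilio Functions you'd typically read the key from an environment variable.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function tryCompletion() {
    const completion = await openai.completions.create({
        model: "text-curie-001", // the GPT-3 engine used later in this tutorial
        prompt: "As Descartes said, I think, therefore",
        max_tokens: 10,
    });
    console.log(completion.choices[0].text); // something along the lines of " I am"
}

tryCompletion();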

To better tune Joanna for your needs, you’ll be using the Chat example provided by OpenAI. This preset gives you an AI that is helpful, creative, clever, and very friendly.

Chat example from OpenAI's Examples section

The example above provides you with a prompt, along with the necessary parameters in the settings, to produce the intended output.
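The preset prompt in the Chat example reads roughly like the following (the exact wording in the Playground may differ slightly):

const chatPreset =
    "The following is a conversation with an AI assistant. " +
    "The assistant is helpful, creative, clever, and very friendly.\n\n" +
    "Human: Hello, who are you?\n" +
    "AI: I am an AI created by OpenAI. How can I help you today?\n" +
    "Human: ";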

The settings shown in the screenshot fine-tune GPT-3’s output even further; however, this tutorial won’t cover them in detail. For a deeper dive into GPT-3, check out The Ultimate Guide to OpenAI's GPT-3 Language Model.

Functions architecture

You will need to configure two Functions for this tutorial:

  • /transcribe - This function will be configured to your Twilio phone number and initiate the conversation with Joanna. It will use the <Gather> verb to listen to and transcribe your response. This transcription will then be passed on to the /respond function.
  • /respond - This function will receive the transcription and use the OpenAI API to generate a text response with the GPT-3 engine and a given preset. This response will be converted from text to speech with the <Say> verb. The call will then be redirected back to the /transcribe function.

The phone call will be passed back and forth between the two Functions: one listens to and transcribes the user’s speech, while the other generates and speaks a voice response.

To ensure you’ll have a proper conversation with Joanna, you’ll be caching both your and Joanna’s past responses and appending them to a conversation log. That log will be used in the input prompt to the GPT-3 engine so Joanna can remember past responses and respond appropriately.

You’re probably wondering how the conversation will be cached, since each Function execution is ephemeral and stateless. The conversation will be passed back and forth between the two Functions through the query parameters of each Function call.
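Concretely, each Function serializes the conversation into the query string of the next one, roughly like this (the sample convo string is made up):

// Serialize the conversation into the query string of the next Function...
const convo = 'Joanna: Hey!\nYou: Not much, how about you?';
const params = new URLSearchParams({ convo });
const actionUrl = `/respond?${params}`;
// -> /respond?convo=Joanna%3A+Hey%21%0AYou%3A+Not+much...
// ...and Twilio Functions expose the decoded value as event.convo on the other side.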

For the sake of simplicity and testing, you’ll store the data in the query parameters. Note that this leads to an unbounded query string length, since each call appends another response to the conversation query parameter.

In a production environment, it’s recommended to store cached data – like this conversation – within session cookies or a database.

Now that you’ve gone over how the phone service will work, you can begin building it!

Create a Twilio Function Service

Log in to your Twilio Console and navigate to the Functions and Assets > Services tab on the left side (or click here). Click the blue Create Service button, enter AI-Friend as the name for your service, and click the blue Next button. You will then be directed to your newly created service.

Front page of new twilio functions service

Configure the OpenAI API for the Functions service

In this section, you will be setting up the OpenAI API with your Functions service so it’s ready to be used when you write the code. You’ll first need to add your OpenAI API key as an environment variable and then add the OpenAI Node.js library as a dependency.

Twilio Functions offers the ability to store environment variables like phone numbers, API keys, and URLs rather than hardcoding them into your Functions code. Using environment variables ensures that your code is portable and that simple configuration changes can be made instantly.
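Inside a Function, these variables are read from the context object passed to your handler. For example, once you add the OpenAI key below, a handler could pick it up like this minimal sketch:

exports.handler = function (context, event, callback) {
    // Environment variables are exposed on the `context` object
    const openaiApiKey = context.OPENAI_API_KEY; // the key you add in this section
    // ... use the key, then complete the invocation
    return callback(null);
};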

Click on Environment Variables under the Settings tab at the bottom of your new Twilio Functions page. Here is where you will store your OpenAI API key.

Environment variables section of functions service

If you don’t have your OpenAI API key yet, sign up for an account and generate a key from your OpenAI Developer Quickstart profile.

Head back to the Environment Variables section of your Twilio Functions service, type OPENAI_API_KEY for the key, paste in your API key as the value, and click Add.

Now, under the Settings tab, click Dependencies. Here is where you add the dependencies for your Functions. You will need to add the openai npm module to interact with and call the OpenAI API.

Dependencies section of Twilio functions service

In the Module textbox enter openai and in the Version textbox enter latest and click Add.

Add the Functions

As mentioned before, you’ll need to add two Functions to your service: /transcribe, which listens to and transcribes the user’s speech, and /respond, which reads the transcription, feeds it into the GPT-3 engine, and speaks the generated text back to the caller.

/transcribe

Click on the blue Add + button, and then click Add Function from the dropdown to add a new Function and name it /transcribe. Replace the existing default code in the text editor with the following:

exports.handler = function(context, event, callback) {
    const twiml = new Twilio.twiml.VoiceResponse();
    let convo = event.convo || '';
    // If no previous conversation is present, start the conversation
    if (!convo) {
        twiml.say({
            voice: 'Polly.Joanna-Neural'
        }, 'Hey!');
        convo += 'Joanna: Hey!';
    }

    // Listen to the user's response and pass the input to /respond
    const params = new URLSearchParams({ convo: convo });
    twiml.gather({
        enhanced: "true",
        speechTimeout: 'auto',
        speechModel: "phone_call",
        input: 'speech',
        action: `/respond?${params}`,
    });

    return callback(null, twiml);
};

This Function starts off by creating a new VoiceResponse object, which will be used to tell Twilio what to do with the call. It then reads the incoming query parameters, which are found in the event object, and checks whether a conversation has already been initiated by looking at the convo variable.

If the /transcribe Function is being called for the first time in a call, convo will be empty, so Joanna greets the caller with “Hey!” using the <Say> verb and the greeting is appended to the convo variable.

The <Gather> verb is then used to capture the user's speech input and transcribe it. The enhanced attribute improves the accuracy of the transcription. The speechTimeout attribute is set to auto so that transcription stops when a pause in the caller's speech is detected. The speechModel attribute is set to phone_call to improve accuracy on phone audio. The input attribute is set to speech, which tells Twilio to listen for speech rather than DTMF (keypad) tones. Lastly, the action attribute is set to /respond?${params}, which tells Twilio to make an HTTP request (with the transcription data) to the /respond Function once the speech timeout has been reached.

Once copied, your /transcribe Function should look like the following:

/transcribe Function with completed code

Enabling the enhanced attribute uses a premium speech model that improves transcription accuracy. It costs $0.03 per 15 seconds of speech, 50% more than the standard phone_call model.

The premium speech model was built using thousands of hours of training data and produces 54% fewer errors when transcribing phone conversations compared to the basic model.

The params variable appended to the URL of the action attribute is constructed on line 13 and creates a query parameter with the convo variable. This allows you to pass the existing conversation to the /respond route.

Once your voice response has been set on the twiml object, it’s passed to the callback function to emit the response. Click the blue Save button below the text editor to finish configuring the /transcribe Function.

/respond

Click on the blue Add + button above the Functions list, and then click Add Function from the dropdown to add a new Function and name it /respond. Replace the existing default code in the text editor with the following:

const presetPrompt = `The following is a conversation with an AI friend named Joanna. Joanna is friendly, funny, creative, and very talkative.\n\n`;
const { OpenAI } = require("openai");

exports.handler = async function(context, event, callback) {
    // Initialize TwiML and OpenAI (the API key comes from the OPENAI_API_KEY environment variable)
    const openai = new OpenAI({ apiKey: context.OPENAI_API_KEY });
    const twiml = new Twilio.twiml.VoiceResponse();

    // Grab the previous conversation and the user's voice input from the request
    let convo = event.convo;
    const voiceInput = event.SpeechResult;

    // Format the input for GPT-3 and voice the response
    convo += `\nYou: ${voiceInput}\nJoanna:`;
    const aiResponse = await generateAIResponse(convo);
    convo += aiResponse;
    twiml.say({
        voice: 'Polly.Joanna-Neural'
    }, aiResponse);

    // Pass the updated convo back to /transcribe
    const params = new URLSearchParams({ convo: convo });
    twiml.redirect({
        method: 'POST'
    }, `/transcribe?${params}`);

    return callback(null, twiml);

    async function generateAIResponse(convo) {
        const apiResponse = await openai.completions.create({
            model: "text-curie-001",
            prompt: presetPrompt + convo,
            max_tokens: 60,
            temperature: 0.8,
            stop: ['\n', '\n\n'],
        });
        console.log(apiResponse);
        // Retry if the model returns an empty completion
        if (apiResponse.choices[0].text === '') return await generateAIResponse(convo);
        else return apiResponse.choices[0].text;
    }
};

The Function starts off by initializing the TwiML VoiceResponse object and the OpenAI package for generating the text response.

It then grabs the existing conversation data and the speech transcription sent from the /transcribe Function and places them into the convo and voiceInput variables respectively. Both variables are then formatted into a prompt and fed into the GPT-3 engine through the generateAIResponse() helper function. This prompt will be similar to the Chat preset mentioned in the overview section.

The generateAIResponse() helper function is placed at the bottom of the Function. The GPT-3 model used is Curie; you’ll use this engine over Davinci because it responds faster, so Joanna can reply without long, unnatural delays. The presetPrompt variable (found on line 1) is prepended to the prompt parameter to better model Joanna for your needs. Feel free to take a look at the function to see which other parameters are used with the completions endpoint of the OpenAI API.

Once the aiResponse is returned from the helper, the <Say> verb converts the text to speech in Joanna’s Neural voice, which is played back to the caller. The response is also appended to the convo variable, and the <Redirect> verb then carries the call and the updated convo back to the /transcribe Function, resulting in a back-and-forth conversation between the caller and Joanna.

Once copied, your /respond Function should look like the following:

/respond Function with completed code in the text editor

OpenAI's API usage is billed based on the number of input and output tokens; a token is approximately 4 characters. The input prompt also has a limit of 2048 tokens, so it's best to limit your input lengths.

The code above allows the prompt to grow without bound, since it appends voiceInput and aiResponse to the existing convo every time the Function is called; this is fine for testing but not recommended beyond that.
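If you do want to keep the prompt from growing indefinitely, one simple approach (a hypothetical helper, not part of the tutorial code) is to trim the conversation to its most recent portion before sending it to GPT-3:

// A rough sketch: keep only the tail end of the conversation.
// MAX_CONVO_CHARS is an arbitrary budget (~1,000 tokens at roughly 4 characters per token).
const MAX_CONVO_CHARS = 4000;

function trimConvo(convo) {
    return convo.length > MAX_CONVO_CHARS ? convo.slice(-MAX_CONVO_CHARS) : convo;
}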

Finally, click Save and then click the Deploy All button to deploy your completed Functions service.

Configure the Functions service to your Twilio number

Before you test out your Functions, you will need to hook up your Functions service to your Twilio number.

Head over to the Active Numbers section of the Twilio console and click on your Twilio number.

Scroll down to the Voice & Fax section and, under A CALL COMES IN, select Function. For SERVICE, select the Functions service you just created, which should be named AI-Friend. Then choose ui for ENVIRONMENT, and lastly choose /transcribe for FUNCTION PATH, since this is where your phone call should be routed first.

Your voice configuration should look like the image below:

Voice and fax settings of twilio number

Now, click Save and Joanna should be ready to talk with you!

Call your Twilio num-... I mean call Joanna and have a chat to see how it works! Feel free to experiment with a different preset prompt or with different completion settings.

Conclusion

In this tutorial we walked through how to use Twilio Programmable Voice with Twilio Functions, TwiML, and the OpenAI API to create an AI friend that’s able to keep up a conversation with you. I hope you had some fun with this tutorial and learned a few things along the way!

If you’re looking to explore more interesting projects that use Twilio and OpenAI’s GPT-3 engine, take a look at these tutorials:

Happy Building!

Dhruv Patel is a Developer on Twilio’s Developer Voices team. You can find Dhruv working in a coffee shop with a glass of cold brew, or reach him at dhrpatel [at] twilio.com or on LinkedIn.