AI Voice: Analyze your Pronunciation with Twilio Programmable Voice, OpenAI Realtime API, and Azure AI Speech

June 10, 2025
Written by
Danny Santino
Contributor
Opinions expressed by Twilio contributors are their own

Introduction

For millions of language learners worldwide, few options exist that provide both real-time voice conversations and an analysis of pronunciation skills. By combining the capabilities of Twilio, Azure AI Services, and OpenAI’s powerful Realtime API, you can build an intelligent voice coach that listens, responds naturally, and gives feedback on pronunciation skills.

In this tutorial, you will learn how to build an app that:

  • Answers incoming calls and lets the caller choose a language to practice.
  • Holds a natural, real-time voice conversation powered by the OpenAI Realtime API.
  • Assesses the caller’s pronunciation in the background with Azure AI Speech.
  • Sends the assessment results to the caller over WhatsApp at the end of the call.

Prerequisites

Before getting started, here’s a checklist of items required to follow this tutorial:

  • Python 3.12.4 installed on your machine.
  • A Twilio account with a voice-enabled phone number.
  • An OpenAI account, API key, and access to the OpenAI Realtime API.
  • An Azure account with an active subscription. Click here to get started with creating your account.
  • A WhatsApp-enabled phone number.
  • A code editor like VS Code and a terminal on your machine.
  • An ngrok account and an authtoken to make your local app accessible online.

Building the app

Let’s take a quick look at how the app works:

  • A user calls in and selects the language they’d like to practice.
  • The server initiates a voice practice session between the caller and a Realtime model.
  • With the help of Twilio Media Streams, the server receives the caller’s live audio over a WebSocket connection and forwards it to OpenAI and Azure simultaneously.
  • The Realtime model provides an audio response, while Azure AI Speech analyzes pronunciation in the background.
  • At the end of the call, Twilio sends the analyzed results to the user via WhatsApp.

You’ll be using the FastAPI web framework to build your server. It is lightweight and has native support for asynchronous WebSocket endpoints, making it an ideal choice for our real-time streaming requirements.

You’ll begin by setting up your development environment and installing the necessary packages and libraries. Next, you’ll configure ngrok, write the server code, and finally test the app to ensure everything works as expected.

Let’s get started.

Set up the development environment

When working with Python in development, it’s good practice to use a virtual environment. Think of this as an isolated sandbox to help avoid conflicts between dependencies used across different projects.

Open a terminal and enter the following commands:

mkdir pronunciation-analysis
cd pronunciation-analysis
python -m venv .venv
source .venv/bin/activate   # use . .venv/Scripts/activate for Git Bash on Windows
Depending on your operating system and Python configuration, you may have to run Python commands with python3 instead. Additionally, if you intend to push your code to a remote repository like GitHub, be sure to create a .gitignore file and add the .venv folder to avoid accidentally uploading it.
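If you do create a .gitignore, a minimal version could look like the sketch below, created straight from the terminal. The .env file, which you’ll add in the next step, holds secrets and shouldn’t be committed either:

printf ".venv/\n.env\n__pycache__/\n" > .gitignore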

Store the environment variables

Throughout this project, you’ll be working with sensitive keys, and you shouldn’t place these directly in your code. Create a .env file to store them:

touch .env

To get your Azure AI Services API key and service region details, you’ll first need to create a Resource group and a Speech service in the Azure portal.

  1. From the portal Home screen, search for Resource groups and click Create.
  2. Follow the on-screen instructions to create a group. A common naming convention for an app like the one we’re building would be spch-<app-name>-<environment>. For example: spch-aivoicebot-dev.
  3. Next, search for and create a Speech service.
  4. Once done, navigate to the Overview page, where you’ll find your key and region:
Screenshot of the Overview page of a Speech service on the Microsoft Azure portal.

To get your Twilio Number SID, open the console and navigate to:

Phone Numbers > Manage > Active numbers

Select your phone number and click the Properties tab to locate the Number SID.

You can find your account SID and auth token on the homepage of the console.

For OpenAI, log in to your account and navigate to Settings > API keys.

Your ngrok authtoken is available on your dashboard.

Now, open the .env file in your code editor and add your keys from their respective service:

TWILIO_ACCOUNT_SID=****************************
TWILIO_AUTH_TOKEN=****************************
TWILIO_NUMBER_SID=****************************
OPENAI_API_KEY=****************************
AZURE_SPEECH_KEY=****************************
AZURE_SERVICE_REGION=****************************
NGROK_AUTHTOKEN=****************************

Install packages and libraries

Open your terminal and install the required packages for this project:

pip install "fastapi [standard]" aiohttp twilio ngrok dotenv azure-cognitiveservices-speech

Write the server code

With the Python environment ready, you can begin writing the server code. The first order of business will be to expose the local server to the internet. Twilio needs a way to send requests to the app, and this is where ngrok comes in.

Create the file to house the main server code:

touch main.py

Import required modules

Open the file in your code editor and import the required modules and .env keys:

import os
import asyncio
import aiohttp
import json
import base64
from dotenv import load_dotenv
import ngrok
import uvicorn
from fastapi import FastAPI, Request, WebSocket
from fastapi.responses import HTMLResponse
from fastapi.websockets import WebSocketDisconnect
from contextlib import asynccontextmanager
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Gather, Connect
load_dotenv()
PORT = 5000
NGROK_AUTHTOKEN = os.getenv("NGROK_AUTHTOKEN")
TWILIO_ACCOUNT_SID = os.getenv("TWILIO_ACCOUNT_SID")
TWILIO_AUTH_TOKEN = os.getenv("TWILIO_AUTH_TOKEN")
TWILIO_NUMBER_SID = os.getenv("TWILIO_NUMBER_SID")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
TWILIO_SANDBOX_NUMBER = os.getenv("TWILIO_SANDBOX_NUMBER")
WHATSAPP_PHONE_NUMBER = os.getenv("WHATSAPP_PHONE_NUMBER")
language_mapping = {
    "1": {
        "code": "en-US",
        "initial_prompt": "Starting the interactive session... What would you like to talk about today?"
    },
    "2": {
        "code": "es-MX",
        "initial_prompt": "Iniciando la sesión interactiva... ¿De qué le gustaría hablar hoy?"
    },
    "3": {
        "code": "fr-FR",
        "initial_prompt": "Démarrage de la session interactive... De quoi aimeriez-vous parler aujourd'hui?"
    }
}

Take note of the dictionary labeled language_mapping, as well as the TWILIO_SANDBOX_NUMBER and WHATSAPP_PHONE_NUMBER environment variables. You’ll use them later in the tutorial.

Set up an ngrok tunnel

ngrok’s default functionality is to generate a random URL every time you start an agent. This is not ideal, as you’d have to update your Twilio webhook URL manually each time that happens. Instead, with the help of ngrok’s free static domain and a little extra code, you’ll set up some logic to handle all that automatically.

Head to the ngrok website to claim your free domain:

Screenshot from ngrok's website showing how to claim a free static domain.

Next, add the code below to your main.py while replacing your-ngrok-static-domain with your static domain:

client = Client(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN)
@asynccontextmanager
async def lifespan(app: FastAPI):
    print("Setting up ngrok tunnel...")
    ngrok.set_auth_token(NGROK_AUTHTOKEN)
    # establish connectivity
    listener = await ngrok.forward(
        addr=PORT,
        proto="http",
        domain="your-ngrok-static-domain" # do not include this line if you don’t have one
    )
    print(listener.url())
    # configure twilio webhook to use the generated url
    twilio_phone = client.incoming_phone_numbers(
        TWILIO_NUMBER_SID
    ).update(voice_url=listener.url() + "/gather")
    print("Twilio voice URL updated:", twilio_phone.voice_url)
    # error handling and cleanup
    try:
        yield
    except asyncio.CancelledError:
        print("Lifespan cancel received.")
    except KeyboardInterrupt:
        print("KeyboardInterrupt in lifespan")
    finally:
        print("Tearing down ngrok tunnel...")
        ngrok.disconnect()
app = FastAPI(lifespan=lifespan)
# route definitions go here
if __name__ == "__main__":
    uvicorn.run("main:app", host="127.0.0.1", port=PORT, reload=True)

The code above represents a Lifespan Event. This term refers to logic that runs once when the application starts up, and again when it shuts down.

Here’s how it works:

  • The server authenticates with ngrok and starts a listener using the configuration options.
  • This listener returns the URL, and the Twilio client configures a Twilio webhook for your phone number.
  • When the app shuts down, it stops the ngrok agent while handling two likely exceptions to allow for a graceful shutdown.

The last bit of code defines the server’s entry point and sets up uvicorn to launch the FastAPI app.

If you’re unable to use a reserved domain, remove the domain option from the ngrok.forward() configuration.

After running the above code from the terminal with:

python main.py

You should see the message “Twilio voice URL updated:” printed to the terminal, along with the listening URL. Confirm this by going to the voice configuration page on your Twilio console:

Screenshot of Twilio console showing voice configuration for a Twilio number.

Once confirmed, stop the server for now with Ctrl + C and comment out the Twilio configuration part of the code. This ensures you don’t send a redundant update request each time you restart the app. However, if you’re not using a reserved domain, leave it in place so the URL updates dynamically.

You can learn more about Lifespan Events from the FastAPI docs.

Handle incoming calls with Twilio Programmable Voice

With all the preliminary setup out of the way, you’re ready to begin writing the core app logic.

The next order of business is to define the routes for your server. Add the following code to main.py, in-between the app initialization and the server entry point:

@app.api_route("/gather", methods=["GET", "POST"])
def gather():
    response = VoiceResponse()
    gather = Gather(num_digits=1, action="/voice")
    gather.say("Welcome to the Language Assistant. For English, press 1.")
    gather.say("Para español, presione 2.", language="es-MX")
    gather.say("Pour le français, appuyez sur 3.", language="fr-FR")
    response.append(gather)
    # if caller fails to select an option, redirect them into a loop
    response.redirect("/gather")
    return HTMLResponse(content=str(response), media_type="application/xml")
@app.api_route("/voice", methods=["GET", "POST"])
async def voice_response(request: Request):
    response = VoiceResponse()
    form_data = await request.form()
    if "Digits" in form_data:
        choice = form_data["Digits"]
        if choice in language_mapping:
            language = language_mapping.get(choice, {}).get("code")
            prompt = language_mapping.get(choice, {}).get("initial_prompt")
            response.say(prompt, language=language)
            host = request.url.hostname
            connect = Connect()
            connect.stream(
                url=f"wss://{host}/audio-stream/{language}",
                status_callback=f"https://{host}/send-analysis"
            )
            response.append(connect)
            return HTMLResponse(content=str(response), media_type="application/xml")
    # if caller selected invalid choice, redirect them to /gather
    response.say("Sorry, you have made an invalid selection. Please choose a number between 1 and 3.")
    response.redirect("/gather")
    return HTMLResponse(content=str(response), media_type="application/xml")

The /gather route greets the caller, takes their language selection, and passes it on to the /voice route. When the caller makes a choice, Twilio forwards it using the action="/voice" attribute on the Gather verb.

Inside the /voice route, the app matches the caller’s input against the language_mapping dictionary to retrieve the corresponding language and prompt. Feel free to modify the options, but be sure to choose languages that the Pronunciation Assessment feature supports.

The Say verb speaks the prompt to the caller, while connect.stream() initiates a connection to the /audio-stream WebSocket endpoint. The Stream noun also specifies a status_callback URL, which Twilio will call when the stream starts and ends. This route will send the analyzed results to the caller via WhatsApp.
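If you’d like to sanity-check the TwiML this route produces before wiring everything up, here’s a small standalone sketch, with a placeholder host, that builds the same response for an English caller and prints the resulting XML:

# Standalone sketch: the TwiML the /voice route returns when the caller presses 1
from twilio.twiml.voice_response import VoiceResponse, Connect

response = VoiceResponse()
response.say(
    "Starting the interactive session... What would you like to talk about today?",
    language="en-US"
)
connect = Connect()
connect.stream(
    url="wss://your-ngrok-static-domain/audio-stream/en-US",          # placeholder host
    status_callback="https://your-ngrok-static-domain/send-analysis"  # placeholder host
)
response.append(connect)
print(response)  # roughly: <Response><Say language="en-US">...</Say><Connect><Stream url="..."/></Connect></Response>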

Integrate the Realtime API and Azure AI Services

OpenAI Realtime API

OpenAI provides two main approaches for building AI assistants that provide audio responses:

  • Chained architecture: Converts audio to text, generates a text response with a Large Language Model (LLM), and synthesizes the response into audio.
  • Speech-to-speech architecture: Directly processes audio input and output in real time, providing low-latency interactions and a more natural feel to the conversation.

While the chained architecture has its advantages, speech-to-speech is the better fit for our use case: the model hears and answers the caller directly, keeping latency low and the conversation feeling natural.

Let’s write the code to set this up. Begin by creating utility functions and a class to handle various tasks at different points during the conversation.

Create a file named speech_utils.py:

touch speech_utils.py

Import the required modules for this file and load your environment variables:

import os
import asyncio
import random
from dotenv import load_dotenv
import azure.cognitiveservices.speech as speechsdk
load_dotenv()
AZURE_SPEECH_KEY = os.getenv("AZURE_SPEECH_KEY")
AZURE_SERVICE_REGION = os.getenv("AZURE_SERVICE_REGION")

Add these two helper functions, and I’ll explain what they do right after:

async def send_to_openai(openai_ws, base64_audio):
    audio_byte = {
        "type": "input_audio_buffer.append",
        "audio": base64_audio
    }
    await openai_ws.send_json(audio_byte)
async def update_default_session(openai_ws):
    system_message = (
        "You are a helpful language learning voice assistant. "
        "Your job is to engage the caller in an interesting conversation to help them improve their pronunciation. "
        "If the caller speaks in a non-English language, you should respond in a standard accent or dialect they might find familiar. "
        "Keep your responses as short as possible."
    )
    voices = [
        "alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse"
    ]
    voice = random.choice(voices)
    print(voice)
    session_update = {
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "instructions": system_message,
            "voice": voice,
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
            "turn_detection": { "type": "server_vad" },
            "temperature": 0.8
        }
    }
    await openai_ws.send_json(session_update)

The send_to_openai function does exactly what it says. It takes a WebSocket connection to the Realtime API (which we’ll set up later), along with a chunk of the caller’s audio encoded in base64 format, and sends this data to the Realtime model. This function gets called iteratively, as you’ll find out when we set up the websocket connections. The type: input_audio_buffer.append field specifies the type of event we’re sending, which lets OpenAI know what to do with it.

The update_default_session function reconfigures the default session with our specifications. It sets a system message with behavioral instructions for the Realtime model, then picks a voice at random from a list of supported ones. This way, you hear a different AI voice on each call, and once you find one you prefer, you can hard-code it. Finally, it crafts the event object with the fields we want to update:

  • modalities: A list of options the model can respond with. Please note that “text” is required.
  • input_audio_format and output_audio_format: Specifies an audio format of g711_ulaw, which both Twilio and OpenAI Realtime support.
  • turn_detection: Enables server-side Voice Activity Detection (VAD). Setting this field to type: server_vad lets the model automatically detect when the caller starts and stops speaking, so it knows when to respond.

Azure AI Services

The next step is to define a class to configure Azure’s speech recognition and pronunciation assessment.

Add the following code to your speech_utils.py file:

class AzureSpeechRecognizer():
    def __init__(self):
        self.results = []
    def configure(self, language):
        self.language = language
        speech_config = speechsdk.SpeechConfig(subscription=AZURE_SPEECH_KEY, region=AZURE_SERVICE_REGION)
        # set up audio input stream
        audio_format = speechsdk.audio.AudioStreamFormat(
            samples_per_second=8000,
            bits_per_sample=16,
            channels=1,
            wave_stream_format=speechsdk.AudioStreamWaveFormat.MULAW
        )
        self.stream = speechsdk.audio.PushAudioInputStream(stream_format=audio_format)
        audio_config = speechsdk.audio.AudioConfig(stream=self.stream)
        # instantiate the speech recognizer
        self.speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, language=language, audio_config=audio_config)
        self.recognition_done = asyncio.Event()
        # configure pronunciation assessment
        pronunciation_config = speechsdk.PronunciationAssessmentConfig(
            grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
            granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme
        )
        # prosody scores only available if language/locale == "en-US":
        if language == "en-US":
            pronunciation_config.enable_prosody_assessment()
        pronunciation_config.apply_to(self.speech_recognizer)
        print("Azure speech recognizer configured:", language)
        # define callbacks to signal events fired by the speech recognizer
        self.speech_recognizer.recognizing.connect(lambda evt: print(f"Recognizing: {evt.result.text}"))
        self.speech_recognizer.recognized.connect(self.recognized_cb)
        self.speech_recognizer.session_started.connect(lambda evt: print(f"Azure session started: {evt.session_id}"))
        self.speech_recognizer.session_stopped.connect(self.stopped_cb)
        self.speech_recognizer.canceled.connect(self.canceled_cb)
    def start_recognition(self):
        result_future = self.speech_recognizer.start_continuous_recognition_async()
        result_future.get()
    def stop_recognition(self):
        self.speech_recognizer.stop_continuous_recognition_async()
    async def send_to_azure(self, audio):
        try:
            self.stream.write(audio)
        except Exception as e:
            print(f"Error writing to Azure stream: {e}")
    # callback if speech is recognized
    def recognized_cb(self, evt: speechsdk.RecognitionEventArgs):
        if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print(f"\nPronunciation assessment for: {evt.result.text}")
            pronunciation_result = speechsdk.PronunciationAssessmentResult(evt.result)
            print(
                f"Accuracy: {pronunciation_result.accuracy_score} \n\n"
                f"Pronunciation score: {pronunciation_result.pronunciation_score} \n\n"
                f"Completeness score: {pronunciation_result.completeness_score} \n\n"
                f"Fluency score: {pronunciation_result.fluency_score}\n\n"
                f"Prosody score: {pronunciation_result.prosody_score} \n\n"
            )
            # provide further analysis
            print("     Word-level details:")
            for idx, word in enumerate(pronunciation_result.words):
                print(f"     {idx + 1}. word: {word.word}\taccuracy: {word.accuracy_score}\terror type: {word.error_type}")
            # gather results to send to caller
            analysis = {
                "Assessment for:": evt.result.text,
                "Accuracy -": pronunciation_result.accuracy_score,
                "Pronunciation -": pronunciation_result.pronunciation_score,
                "Completeness -": pronunciation_result.completeness_score,
                "Fluency -": pronunciation_result.fluency_score
            }
            if self.language == "en-US":
                analysis["Prosody -"] = pronunciation_result.prosody_score
            self.results.append(analysis)
    def canceled_cb(self, evt: speechsdk.SpeechRecognitionCanceledEventArgs):
        if evt.result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = evt.result.cancellation_details
            print(f"Cancellation details: {cancellation_details.reason.CancellationReason}")
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print(f"Error details: {cancellation_details.error_details}")
                self.recognition_done.set()
    def stopped_cb(self, evt):
        print(f"Azure session stopped: {evt.session_id}")
        self.recognition_done.set()

The AzureSpeechRecognizer class bundles all the logic needed for speech recognition and assessment. Each instance starts out with a single piece of state: an empty results list that will eventually hold the final results of the speech analysis.

The configure() method sets up speech recognition and pronunciation assessment. When called iteratively, the send_to_azure() method writes incoming audio chunks to the configured input stream.

Meanwhile, start_recognition() and stop_recognition() run asynchronously to initiate and halt the continuous recognition process. This happens in the background so other tasks can run in parallel. Lastly, we define callback methods for specific events fired by the speech recognizer.
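To make the lifecycle concrete, here’s a minimal standalone sketch of how the class is driven over the course of a call. The audio chunk is a placeholder; in the real app, the bytes come from Twilio:

# Minimal usage sketch (assumes your Azure keys are set in .env)
import asyncio
from speech_utils import AzureSpeechRecognizer

async def demo():
    recognizer = AzureSpeechRecognizer()
    recognizer.configure("en-US")                    # pick a supported locale
    recognizer.start_recognition()                   # recognition runs in the background
    await recognizer.send_to_azure(b"\xff" * 160)    # placeholder 8 kHz mu-law chunk
    recognizer.stream.close()                        # signal that no more audio is coming
    recognizer.stop_recognition()
    print(recognizer.results)                        # accumulated assessment dictionaries, if any

asyncio.run(demo())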

And that wraps up the logic for speech_utils.py.

Bringing it all together

It’s time to set up the WebSocket endpoint to receive and process audio from Twilio Media Streams. For that, we go back to the route setup in the main.py file.

First, update the import statements and instantiate a speech_recognizer object:

from speech_utils import AzureSpeechRecognizer, send_to_openai, update_default_session
speech_recognizer = AzureSpeechRecognizer()

Now, add this block of code right below the /voice route:

@app.websocket("/audio-stream/{language}")
async def stream_audio(twilio_ws: WebSocket, language: str):
    await twilio_ws.accept()
    stream_sid = None
    speech_recognizer.configure(language)
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-mini-realtime-preview-2024-12-17"
    try:
        async with aiohttp.ClientSession() as session:
            async with session.ws_connect(
                url,
                headers={
                    "Authorization": f"Bearer {OPENAI_API_KEY}",
                    "OpenAI-Beta": "realtime=v1"
                }
            ) as openai_ws:
                await update_default_session(openai_ws)
                # receive and process Twilio audio
                async def receive_twilio_stream():
                    nonlocal stream_sid
                    try:
                        async for message in twilio_ws.iter_text():
                            data = json.loads(message)
                            match data["event"]:
                                case "connected":
                                    print("Connected to Twilio media stream")
                                    speech_recognizer.start_recognition()
                                case "start":
                                    stream_sid = data["start"]["streamSid"]
                                    print("Twilio stream started:", stream_sid)
                                case "media":
                                    base64_audio = data["media"]["payload"]
                                    mulaw_audio = base64.b64decode(base64_audio)
                                    azure_task = asyncio.create_task(speech_recognizer.send_to_azure(mulaw_audio))
                                    openai_task = asyncio.create_task(send_to_openai(openai_ws, base64_audio))
                                    await asyncio.gather(azure_task, openai_task)
                                case "stop":
                                    print("Twilio stream has stopped")
                    except WebSocketDisconnect:
                        print("Twilio webSocket disconnected")
                    finally:
                        speech_recognizer.stream.close()
                        speech_recognizer.stop_recognition()
                        if not openai_ws.closed:
                            await openai_ws.close()
                # send AI response to Twilio
                async def send_ai_response():
                    nonlocal stream_sid
                    try:
                        async for ws_message in openai_ws:
                            if ws_message.type == aiohttp.WSMsgType.TEXT:
                                openai_response = json.loads(ws_message.data)
                                match openai_response["type"]:
                                    case "error":
                                        print("Error in OpenAI response:", openai_response["error"])
                                    case "input_audio_buffer.speech_started":
                                        print("Speech detected")
                                        await twilio_ws.send_json({
                                            "event": "clear",
                                            "streamSid": stream_sid
                                        })
                                    case "response.audio.delta":
                                        try:
                                            audio_payload = base64.b64encode(base64.b64decode(openai_response["delta"])).decode("utf-8")
                                            audio_data = {
                                                "event": "media",
                                                "streamSid": stream_sid,
                                                "media": {
                                                    "payload": audio_payload
                                                }
                                            }
                                            await twilio_ws.send_json(audio_data)
                                            # send mark message to signal media playback complete
                                            mark_message = {
                                                "event": "mark",
                                                "streamSid": stream_sid,
                                                "mark": { "name": "ai response" }
                                            }
                                            await twilio_ws.send_json(mark_message)
                                        except Exception as e:
                                            print("Error sending Twilio audio:", e)
                    except Exception as e:
                        print(f"Error in send_ai_response: {e}")
                await asyncio.gather(receive_twilio_stream(), send_ai_response())
    except Exception as e:
        print("Error in aiohttp Websocket connection:", e)
        await twilio_ws.close()

This endpoint establishes two WebSocket connections: one with FastAPI’s WebSocket API (twilio_ws) and another with the aiohttp library (openai_ws). aiohttp integrates well with asyncio, allowing your application to keep everything non-blocking during a concurrent stream.

Upon successfully connecting to the Realtime API, the endpoint defines two coroutines:

  • receive_twilio_stream: This listens for incoming messages from Twilio and takes action based on the type of message:
      • connected: Starts the Azure speech recognition process.
      • start: Captures and sets the stream_sid, which we’ll use to send the AI response back to the caller.
      • media: Sends the audio to OpenAI and Azure in their supported formats (see the sample frame after this list).
      • stop: Logs a message to the console when the stream ends.
  • send_ai_response: This also listens and takes action, this time on specific events from OpenAI:
      • error: Logs errors to the console so they’re easier to debug.
      • input_audio_buffer.speech_started: Manages interruptions by sending a clear message to the Twilio stream. The Realtime server sends this event when it detects that the caller has started speaking.
      • response.audio.delta: Processes the audio chunks and forwards them to Twilio for playback, along with a mark message so Twilio can signal when each chunk has finished playing. The Realtime server sends this event alongside the model’s audio response.
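For orientation, here’s roughly the shape of the media frames Twilio sends over the stream. The values below are illustrative, and the payload is base64-encoded 8 kHz mu-law audio:

# Illustrative example of a Twilio Media Streams "media" message (values are made up)
twilio_media_frame = {
    "event": "media",
    "sequenceNumber": "42",
    "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "media": {
        "track": "inbound",
        "chunk": "41",
        "timestamp": "820",
        "payload": "f39+fn5+..."   # base64-encoded mu-law audio bytes
    }
}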

I know it’s been quite the journey, but you’re almost there! Just one final piece remains…

Send feedback via WhatsApp

To send a WhatsApp message from your server, you need to activate the Twilio Sandbox for WhatsApp.

From the Twilio console, under the Develop tab, go to:

Explore products > Messaging > Try WhatsApp

If a dialog box pops up, review and accept the terms, then click Confirm to continue.

On the next screen, follow the setup instructions to activate the Sandbox. You’ll need to send a “join” message to the provided Sandbox number before you can begin sending WhatsApp messages. Once the message is received, Twilio will reply confirming that the Sandbox is active.

You should see your To and From WhatsApp numbers on the Twilio console after you have connected to the sandbox. To avoid exposing phone numbers on a platform like GitHub, add them to your .env file:

WHATSAPP_PHONE_NUMBER=****************************
TWILIO_SANDBOX_NUMBER=****************************

Remember, main.py already reads these variables; you added the corresponding os.getenv() calls near the top of the file earlier in the tutorial.

Next, send another message (e.g., “Hello!”) to begin a user-initiated conversation. This allows you to send a message from your server without having to use one of Twilio’s pre-configured templates.

Once this is done, add the code below to main.py to send the results to the caller:

@app.api_route("/send-analysis", methods=["GET", "POST"])
async def send_analysis(request: Request):
    form_data = await request.form()
    stream_event = form_data["StreamEvent"]
    if stream_event == "stream-stopped":
        results = speech_recognizer.results
        if len(results) > 0:
            message_body = "\n----------------\n".join(
                "\n".join(f"{param} {result[param]}" for param in result)
                for result in results
            )
        else:
            message_body = "No assessment results available for your latest session."
        message = client.messages.create(
            body=message_body,
            from_=f"whatsapp:{TWILIO_SANDBOX_NUMBER}",
            to=f"whatsapp:{WHATSAPP_PHONE_NUMBER}",
            status_callback=f"https://{request.url.hostname}/message-status"
        )
    return "OK"
@app.api_route("/message-status", methods=["GET", "POST"])
async def message_status(request: Request):
    form_data = await request.form()
    print(
        f"Message SID: {form_data["MessageSid"]} \n"
        f"Message status: {form_data["MessageStatus"]}"
    )
    return "OK"

Remember the status_callback attribute we specified on connect.stream() earlier in the /voice route? Well, Twilio sends a request to the specified URL under two conditions: when the stream starts and when it stops. We only want to take action on one of them: stream-stopped.

When the request meets this condition, we loop through the results and format them into a readable message body. Make sure the TWILIO_SANDBOX_NUMBER and WHATSAPP_PHONE_NUMBER environment variables hold the correct values, since the from_ and to fields are built from them.

With all of the right information in place, we send the results to the caller. The /message-status route logs the message status, whether it was successfully delivered or encountered an error.
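To picture what lands in the caller’s WhatsApp chat, here’s a small illustration of the same formatting logic applied to a made-up result:

# Illustration only: the join logic above applied to a sample result (scores are made up)
sample_results = [
    {
        "Assessment for:": "Hello, how are you today?",
        "Accuracy -": 92.0,
        "Pronunciation -": 88.5,
        "Completeness -": 100.0,
        "Fluency -": 95.0,
    }
]
message_body = "\n----------------\n".join(
    "\n".join(f"{param} {result[param]}" for param in result)
    for result in sample_results
)
print(message_body)
# Assessment for: Hello, how are you today?
# Accuracy - 92.0
# Pronunciation - 88.5
# Completeness - 100.0
# Fluency - 95.0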

Test your app

With ngrok and your Twilio phone number already configured, it’s time to run the app. Ensure your virtual environment is active, and then start the server by running:

python main.py

If everything is set up correctly, you should see a log of the startup process in your terminal. Now, place a call to your Twilio phone number, and you should hear the greeting message from the /gather route, prompting you to select a language.

Once you do, the AI voice will respond in your selected language. As you speak, you will also see an assessment of your side of the conversation printed to the terminal in real time!

When you’re done talking, hang up the phone, and you should receive your analysis results on WhatsApp.

Troubleshooting

Twilio trial accounts come with certain limitations that may cause errors in your app. The Free Trial Limitations page provides more information on how this might affect your setup during testing.

If you’re running into issues not related to Twilio, try tweaking your setup at different points to isolate the problem. For instance, if you’re getting an AI response but no speech recognition output, temporarily disable the openai_task so only the Azure path runs, as sketched below.
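As a rough sketch, inside the media case of receive_twilio_stream you could comment out the OpenAI task; this isn’t a permanent change, just a way to narrow down where the problem lies:

# Temporary debugging tweak inside the "media" case (revert once recognition works)
azure_task = asyncio.create_task(speech_recognizer.send_to_azure(mulaw_audio))
# openai_task = asyncio.create_task(send_to_openai(openai_ws, base64_audio))
await asyncio.gather(azure_task)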

The server setup also includes a fair bit of error handling, so paying attention to exceptions logged to the console could help resolve some issues.

Conclusion

That’s it! You have successfully built an AI Voice app that evaluates your pronunciation skills and provides real-time feedback and post-call analysis.

Now that you have your very own AI Voice coach, you can extend your app’s functionality by adding a few extra features, like:

  • Speech emotion analysis to tailor feedback based on tone.
  • Detailed breakdown of assessment scores. Here’s a link to some of the key pronunciation assessment results to get you started.

This tutorial builds on Paul Kamp’s article: Build an AI Voice Assistant with Twilio Voice, the OpenAI Realtime API, and Python. Shout-out to the Twilio Developer Voices team for laying a solid foundation for building AI Voice apps with Twilio and the OpenAI Realtime API.

Danny Santino is a software developer and a language learning enthusiast. He enjoys building fun apps to help people improve their speaking skills. You can find him on GitHub and LinkedIn.