Voice Insights Conversation Relay Summary
Legal Notice
Voice Insights Conversation Relay Insights is neither a HIPAA Eligible Service nor PCI compliant, and should not be used in workflows that are subject to HIPAA or PCI.
The Call Summary page allows you to inspect the performance and configuration of an individual call. This view is essential for debugging specific interaction failures, latency issues, or customer complaints.

These top-level metrics provide a snapshot of the interaction's health and the "turn-taking" quality between the customer and the virtual agent.
- Interruptions by customer: Counts the number of times the customer began speaking before the virtual agent finished speaking. High counts might indicate the agent was too verbose or that latency caused the customer to become impatient.
- Interruptions by agent: Counts the number of times the virtual agent began speaking before the customer finished. This often signals that the agent incorrectly detected the end of the customer's speech or that background noise triggered a false start.
- Total turns: The total number of conversational exchanges in the call. A "turn" is defined as a sequence starting with the customer beginning to speak and ending with the agent beginning to speak.
  - Low turns: May indicate an unengaged user or a failure of the agent to prompt for input.
  - High turns: May indicate a complex issue or a struggling customer who isn't getting a timely resolution.
- Tokens per second (TPS): The average rate at which the virtual agent processes and generates tokens during speech generation. This indicates the application's throughput for this specific call.
- Silence detected: A boolean indicating whether either the customer or the agent was completely silent during the call. This helps identify broken audio paths, connection issues, or severe system delays.
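The turn and interruption definitions above can be sketched in a few lines of Python. This is an illustration only, not how Twilio computes the metrics: the `Segment` record and the `summarize_turn_taking` function are hypothetical names, and the input is assumed to be a list of speech segments with start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical record for one stretch of speech (not a Twilio type)."""
    speaker: str  # "customer" or "agent"
    start: float  # seconds from the start of the call
    end: float


def summarize_turn_taking(segments):
    """Count turns and interruptions using the definitions in the list above."""
    ordered = sorted(segments, key=lambda s: s.start)
    turns = 0
    interruptions = {"customer": 0, "agent": 0}
    customer_has_spoken = False  # True once the customer starts, until the agent starts

    for i, seg in enumerate(ordered):
        # An interruption: this speaker started while the other was still speaking.
        if any(p.speaker != seg.speaker and p.end > seg.start for p in ordered[:i]):
            interruptions[seg.speaker] += 1

        if seg.speaker == "customer":
            customer_has_spoken = True
        elif customer_has_spoken:
            # A turn starts when the customer begins speaking
            # and ends when the agent begins speaking.
            turns += 1
            customer_has_spoken = False

    return {
        "total_turns": turns,
        "interruptions_by_customer": interruptions["customer"],
        "interruptions_by_agent": interruptions["agent"],
    }


example = [
    Segment("agent", 0.0, 2.0),     # greeting, not part of a turn
    Segment("customer", 2.5, 4.0),
    Segment("agent", 4.5, 8.0),     # closes turn 1
    Segment("customer", 7.0, 9.0),  # starts before the agent finished
    Segment("agent", 8.5, 10.0),    # starts before the customer finished; closes turn 2
]
print(summarize_turn_taking(example))
```

In this sample call the agent's opening greeting does not count as a turn, because a turn begins only when the customer starts speaking.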

This section breaks down the Total average response time into the largest contributors to latency.
- Network: The round-trip time (RTT) between Twilio and your application via WebSocket.
- STT (Speech-to-text): The time elapsed between the end of a customer's utterance and the point where a full transcription is generated and ready to be processed by your application.
- Application: The time elapsed between sending the last block of speech-to-text (STT) data and receiving the first token back. Because this is measured from Twilio's side, it includes WebSocket network latency.
- TTS (Text-to-speech): The time elapsed between receiving the text response from the application and the start of audio playback to the customer. This includes the round-trip time between Twilio and the TTS provider.
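Because the Application figure includes WebSocket network latency, subtracting the Network round-trip time gives a rough estimate of the time spent inside your application. The sketch below shows that arithmetic and picks out the largest contributor. The function name and the sample values are hypothetical, and treating the overlap as exactly one round trip is an assumption rather than a documented formula.

```python
def latency_breakdown(network_ms, stt_ms, application_ms, tts_ms):
    """Summarize the latency components shown on the Call Summary page."""
    # Assumption: the Application figure contains one WebSocket round trip,
    # so subtracting the Network RTT approximates application-only time.
    app_only_ms = max(application_ms - network_ms, 0.0)

    # Network is not listed as its own contributor here because it is
    # already contained in the Application figure.
    contributors = {"stt": stt_ms, "application": application_ms, "tts": tts_ms}
    largest = max(contributors, key=contributors.get)

    return {
        "application_excluding_network_ms": app_only_ms,
        "largest_contributor": largest,
    }


# Sample values are made up for illustration.
print(latency_breakdown(network_ms=40, stt_ms=200, application_ms=640, tts_ms=150))
```

If the estimate excluding network time is still high, the delay most likely sits in your application or the model behind it rather than in the WebSocket connection.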
Scope of latency measurements
These metrics measure components from the perspective of Twilio's network. They do not include the last-mile latency between the end user and Twilio's media edge, which might include multiple carrier networks, cellular latency, or internet congestion. Last-mile latency depends on the type of call, quality of the connection, and geographic location of the end user.
Measuring speech-to-text latency
Speech-to-text (STT) latency is calculated from transcription metadata provided by the STT vendor, so its accuracy varies by model and language. Measurements for English are typically accurate to within 100 ms, while measurements for other languages can vary by up to 250 ms. The latency measurements described in the documentation are not guarantees of performance.
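When comparing STT latency across calls, it can help to treat each reading as a range rather than a single value. The sketch below applies the typical tolerances stated above. The function name is hypothetical, and it assumes the language is given as a code such as `en-US`.

```python
def stt_latency_range(measured_ms, language):
    """Return a (low, high) range in milliseconds for a reported STT latency.

    Uses the typical tolerances from the text: 100 ms for English and
    250 ms for other languages. These are typical figures, not guarantees.
    """
    # Assumption: English is identified by a language code starting with "en".
    tolerance_ms = 100.0 if language.lower().startswith("en") else 250.0
    low = max(measured_ms - tolerance_ms, 0.0)  # latency cannot be negative
    high = measured_ms + tolerance_ms
    return (low, high)


print(stt_latency_range(180, "en-US"))
print(stt_latency_range(180, "es-ES"))
```

Two readings whose ranges overlap should not be treated as meaningfully different.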

The Events section provides a visual and chronological log of the specific signals detected during the call. This timeline is a helpful diagnostic tool, allowing developers to reconstruct the exact conversational flow to troubleshoot latency, turn-taking issues, or technical failures.
At the top of the section, a color-coded waveform visualizes the entire call duration:
- Green waveforms: Represent detected customer speech.
- Blue waveforms: Represent virtual agent audio playback.
- Playback Control: Use the play button to listen to the call recording. Click anywhere on the waveforms to play from that spot.
The chronological timeline displays the precise order of operations for each turn in the conversation:
- First Token Received: The timestamp when Conversation Relay receives the first token from the WebSocket message. This marks the end of perceived application latency.
- Final Token Received: The point when the full textual response was received from the model.
- Start of Customer Speech: When the system first detected user audio (the beginning of an utterance).
- End of Customer Speech: When the user stopped talking, triggering the prompt to be sent to your application.
- Start of Agent Speech: When the virtual agent began playing audio to the customer.
- End of Agent Speech: When the virtual agent finished its current response.
- Prompt Sent: The exact time the transcription was sent to your WebSocket server to request a response.
- DTMF: Captured when a customer uses their phone's keypad.
- Digit: Logged when the application sends digits to programmatically navigate an IVR.
- Interrupt: Logged when the system detects the customer spoke while the virtual agent was active.
- Preempted: Occurs when a new talk cycle's text tokens arrive and interrupt the current agent speech playback.
- Call Wrap Up: The final event signaling the session has ended and the summary is finalized.
- Play Media: Tracks when the application requests a specific audio file (like an MP3) to be played instead of synthesized speech.
- Language Changed: Identifies the specific language currently in use and flags any shifts detected during the conversation.
- Error Event: Critical notification of technical failures encountered during the session, used to populate the "Virtual Agent Errors" chart.
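To reconstruct where time went in each turn, you can measure the gaps between consecutive timeline events. The sketch below is a minimal illustration that assumes the events have been copied out by hand as `(name, timestamp_ms)` pairs. That input shape and the function name are hypothetical, and the gaps only approximate the STT, Application, and TTS figures reported on the page.

```python
def per_turn_gaps(events):
    """Compute the gaps between key timeline events for each completed turn.

    `events` is a list of (event_name, timestamp_ms) tuples. The event names
    match the timeline labels above; the tuple format is an assumption.
    """
    turns = []
    current = {}
    for name, ts in sorted(events, key=lambda e: e[1]):
        if name == "End of Customer Speech":
            current = {name: ts}  # start tracking a new response cycle
        elif name in ("Prompt Sent", "First Token Received") and current:
            current.setdefault(name, ts)  # keep the first occurrence only
        elif name == "Start of Agent Speech" and len(current) == 3:
            end_speech = current["End of Customer Speech"]
            prompt = current["Prompt Sent"]
            first_token = current["First Token Received"]
            turns.append({
                "end_of_speech_to_prompt_ms": prompt - end_speech,
                "prompt_to_first_token_ms": first_token - prompt,
                "first_token_to_audio_ms": ts - first_token,
                "total_ms": ts - end_speech,
            })
            current = {}
    return turns


# Made-up timestamps for a single turn.
timeline = [
    ("Start of Customer Speech", 1000),
    ("End of Customer Speech", 3000),
    ("Prompt Sent", 3200),
    ("First Token Received", 3700),
    ("Start of Agent Speech", 3900),
    ("Final Token Received", 4100),
    ("End of Agent Speech", 6000),
]
print(per_turn_gaps(timeline))
```

Cycles that lack one of the four markers, such as a turn cut short by an Interrupt, are skipped rather than reported with partial data.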
The Properties section displays the specific configuration and TwiML parameters active during this call session, as received by Twilio from your application. These properties map to the TwiML elements and attributes for Conversation Relay.