This page covers custom configurations you can use with Conversational Intelligence. While the Onboarding Guide helps you set up the basic workflow for Conversational Intelligence using Twilio Voice recordings, you can also set up Conversational Intelligence with the following changes:
Conversational Intelligence supports third-party media recordings. If your call recordings aren't stored in Twilio and you want to use them with Conversational Intelligence, the recordings must be publicly accessible for the duration of transcription. You can host the recordings publicly or use a time-limited pre-signed URL.
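For example, if your recordings are stored in Amazon S3, one way to produce a time-limited URL is the AWS CLI's presign command. This is a minimal sketch; the bucket name, object key, and expiry are placeholders, and the expiry window must be long enough to cover the transcription job:

```
# Generate a pre-signed URL that stays valid for one hour (3600 seconds).
aws s3 presign s3://your-recordings-bucket/calls/recording.wav --expires-in 3600
```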
For example, to share a recording in an existing AWS S3 bucket, follow this guide. Then add the public recording URL as the media_url when you create a transcript with the API.
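A minimal sketch of that request, assuming the Channel parameter takes a JSON object with a media_properties.media_url field (check the API reference for the exact schema); the Service SID and recording URL are placeholders:

```
curl -X POST "https://intelligence.twilio.com/v2/Transcripts" \
  --data-urlencode "ServiceSid=GAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  --data-urlencode 'Channel={"media_properties": {"media_url": "https://your-host.example.com/recording.wav"}}' \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
```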
If you want to transcribe the audio of a Twilio Video recording, it needs additional processing to create an audio recording that you can submit for transcription.
To create a dual-channel audio recording, first transcode a separate audio-only composition for each participant in the Video Room.
```
curl -X POST "https://video.twilio.com/v1/Compositions" \
  --data-urlencode "AudioSources=PAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  --data-urlencode "StatusCallback=https://www.example.com/callbacks" \
  --data-urlencode "Format=mp4" \
  --data-urlencode "RoomSid=RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
```
Next, download the media from these compositions and merge them into a single stereo audio file.
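One way to fetch each composition's media is sketched below; the CJ... composition SIDs are placeholders, and this assumes the Media subresource responds with a redirect to a short-lived download URL, which curl's -L flag follows:

```
# Download each participant's audio-only composition.
curl -L -o speaker1.mp4 \
  "https://video.twilio.com/v1/Compositions/CJXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Media" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN

curl -L -o speaker2.mp4 \
  "https://video.twilio.com/v1/Compositions/CJYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY/Media" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
```

With both files downloaded, merge the two audio tracks: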
```
ffmpeg -i speaker1.mp4 -i speaker2.mp4 \
  -filter_complex "[0:a][1:a]amerge=inputs=2[a]" -map "[a]" \
  -f flac -bits_per_raw_sample 16 -ar 44100 output.flac
```
If the recording durations of the participants differ, you can avoid misaligned audio tracks by using ffmpeg to create a single stereo audio track that pads the shorter recording with silence. For example, if one audio track lasts 63 seconds and the other 67 seconds, use ffmpeg to create a stereo file that prepends four seconds of silence to the shorter track so both channels match the length of the longer one. In the following command, ${second_to_delay} is a shell variable that holds this difference in seconds (four in this example):
```
ffmpeg -i speaker1.wav -i speaker2.wav \
  -filter_complex "aevalsrc=0:d=${second_to_delay}[s1];[s1][1:a]concat=n=2:v=0:a=1[ac2];[0:a]apad[ac1];[ac1][ac2]amerge=2[a]" \
  -map "[a]" -f flac -bits_per_raw_sample 16 -ar 44100 output.flac
```
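If you don't know the track lengths ahead of time, one way to compute the delay is to measure each file's duration with ffprobe. A small sketch, assuming speaker1.wav is the longer track (the filter above prepends the silence to the second input):

```
# Measure each track's duration in seconds.
d1=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 speaker1.wav)
d2=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 speaker2.wav)

# The delay is the difference between the longer and the shorter track.
second_to_delay=$(echo "$d1 - $d2" | bc)
```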
Finally, send a CreateTranscript request to Conversational Intelligence, providing a publicly accessible URL for this audio file as the media_url in MediaSource.
By default, Conversational Intelligence assumes Participant one is on channel one and Participant two is on channel two. If your use case doesn't follow this channel mapping, you can provide optional Participant metadata that maps each participant to the correct audio channel when you create a transcript with the API. You can also use this field to attach other participant metadata to the transcript.
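A sketch of such a request, assuming participant field names like channel_participant, role, and full_name (check the API reference for the exact schema); the Service SID, URL, names, and channel numbers are placeholders:

```
curl -X POST "https://intelligence.twilio.com/v2/Transcripts" \
  --data-urlencode "ServiceSid=GAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  --data-urlencode 'Channel={
    "media_properties": {"media_url": "https://your-host.example.com/recording.wav"},
    "participants": [
      {"channel_participant": 2, "role": "Agent", "full_name": "Jane Doe"},
      {"channel_participant": 1, "role": "Customer", "full_name": "John Smith"}
    ]
  }' \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
```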
You can provide a CustomerKey when you create a transcript with the API, which allows you to map a Transcript to an internal identifier. This can be any unique identifier within your system that you use to track transcripts. The CustomerKey is also included in the webhook callback when the results for the Transcript and Operators become available. This field is optional, and you can't substitute the CustomerKey for the Transcript SID in the APIs.
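For example, to tag a transcript with an identifier from your own system (a sketch; the CustomerKey value and the other parameters are placeholders):

```
curl -X POST "https://intelligence.twilio.com/v2/Transcripts" \
  --data-urlencode "ServiceSid=GAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  --data-urlencode 'Channel={"media_properties": {"media_url": "https://your-host.example.com/recording.wav"}}' \
  --data-urlencode "CustomerKey=crm-ticket-48151623" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
```

When the webhook fires, you can match the CustomerKey in the payload back to the originating record in your system.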