SRTP and You: A Deep Dive into Encrypted VoIP Communications

July 27, 2022
Written by
Matt Coser
Twilion

RTP, or Real-time Transport Protocol, is used by Twilio (and others) for transmitting audio information for SIP calls. SRTP is Secure RTP, or RTP that has been encrypted. By design, no one can listen to, intercept, or replay the encrypted RTP media except the parties that originally negotiated the SIP session.

In this post, we will discuss:

  • How SRTP Works
  • Why encrypted media is cool
  • Overcoming potential obstacles and overhead
  • How to set up SRTP with Twilio
  • Implementation considerations

How does SRTP work?

If you understand HTTPS, then you will totally get SRTP. If not, let’s start by reviewing the basics.

SRTP relies on the same idea as TLS, and on TLS-protected signaling to exchange its keys. A TLS ‘handshake’ looks something like this:

TLS handshake ladder diagram

The client and server exchange keys, which are unique to the current session, and use them to encrypt/decrypt the data that is being transferred between them.

SRTP uses Advanced Encryption Standard (AES) as the default cipher, with two primary modes:

  • AES-CM (counter mode) - VoIP standard
  • AES-f8 - used in 3G data networks

For more details, you can check out Twilio’s glossary entry on TLS. But for now, all you need to know is that audio encrypted with these session keys can only be deciphered while the call is in progress, and only by the client and server that negotiated the call to begin with.

The sound of success

While diagrams are nice, it always helps to listen to call examples to fully understand what is happening on the wire.

This is an outbound call made using Twilio SIP Domains. The sound we hear is the RTP stream of the call.

Fun! We can hear the destination IVR, music on hold, the caller’s voice, jitter artifacts, and everything else that makes RTP what it is.

This is a recording of the same exact call flow, but with Secure Media enabled.

Unintelligible static. Of course, it didn’t sound like hot burning garbage while the call was in progress, as both parties were able to decrypt the traffic in real time.

It is only when an attacker tries to play back a captured SRTP stream that it sounds like melting ketchup packets.

Behind the MoH

Now that we can hear the power of encryption, we can look deeper at the technical differences between RTP and SRTP.

RTP packet

The RTP spec describes how to packetize digital audio for phone calls. When a SIP session is established, the client and server agree on the RTP packet time, amongst other things.

This ‘ptime’ is the duration in milliseconds that each digital audio packet represents.

In this example of a Twilio SIP Domain call, the ptime value of 20 can be found in the SDP of the 200 OK:

sip packet p time attribute

This means the digital audio data will be encoded, dissected, and transmitted in 20 millisecond chunks.

The timestamps for each RTP packet confirm this ptime.  We can also clearly see Wireshark correctly identifying each packet as RTP in the protocol column.

RTP packet timestamps
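The relationship between ptime and the RTP timestamps can be checked with a little arithmetic. This is a minimal sketch that assumes an 8 kHz clock rate (typical for G.711); the clock rate is an assumption here, not something read from the capture above.

```python
def samples_per_packet(clock_rate_hz: int, ptime_ms: int) -> int:
    """Number of audio samples carried in one RTP packet."""
    return clock_rate_hz * ptime_ms // 1000


def packets_per_second(ptime_ms: int) -> float:
    """How many RTP packets are sent each second for a given ptime."""
    return 1000 / ptime_ms


CLOCK_RATE_HZ = 8000  # assumed: G.711 clock rate
PTIME_MS = 20         # from the ptime attribute in the SDP

print(samples_per_packet(CLOCK_RATE_HZ, PTIME_MS))  # timestamp step per packet
print(packets_per_second(PTIME_MS))                 # packets sent per second
```

Under these assumptions, each packet's timestamp advances by 160 and the sender emits 50 packets every second.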

Looking closer at a single 20ms unencrypted RTP packet, we can see the header fields and payload in clear text:

Unencrypted RTP payload

The hexdump of the payload of this particular RTP packet is all ffs. In binary that is all 1s, which a G.711 μ-law (PCMU) decoder interprets as an amplitude of zero. In the context of a SIP call, all ffs indicates the 20ms slice of audio was digitally silent.

hexdump of silent audio
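To see why a payload full of ff bytes is silent, it helps to decode one. The sketch below assumes the call uses G.711 μ-law (PCMU), which is an assumption on my part, and expands a single encoded byte into a 16-bit linear sample using the standard μ-law expansion.

```python
def ulaw_decode(byte: int) -> int:
    """Expand one G.711 mu-law byte into a signed linear PCM sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


# Assumed payload: 160 bytes of 0xff, as seen in the hexdump.
silent_payload = bytes([0xFF]) * 160
print({ulaw_decode(b) for b in silent_payload})  # distinct sample values
```

Every byte in the silent payload decodes to the same zero-amplitude sample, which is the flat line in the waveform.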

Take a look at the waveform for this call. In the green section, audio information is present, and visualized by amplitude changes over time. The red section is flat, indicating the absence of audio information. Nothing to hear here.

waveform comparison

Digital silence is distinct from perceived silence. For example, if I simply stop speaking on a call, audio information will still be present, e.g. background noise from the room, line noise from the hardware, etc. On most SIP devices, the mute function will produce this digital silence.

tl;dr - we know this RTP packet is unencrypted, since we can visually decipher the payload and listen to the audio content of the packet.

SRTP packet

SRTP shows up in Wireshark as plain UDP, since Wireshark is not able to automatically identify the stream as RTP.

packet type is misinterpreted by wireshark as udp

Since we know what the packet is, we can instruct Wireshark to decode this UDP as RTP.

decoded RTP packet

The header is in cleartext as expected, but the payload is encrypted.

Encrypted SRTP payload

Most humans can’t even read unencrypted hex dumps. So, without listening, how can we tell if the RTP in this packet is actually encrypted?

This example was generated in a controlled environment, and the same mute function was used as in the above unencrypted RTP example. Based on the timestamp, this packet is known to be from a section of audio where the SIP device was producing digital silence.  However, since the ffs have been encrypted, we can’t visually interpret the payload.
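One rough way to put a number on "we can't visually interpret the payload" is byte entropy. This is only a sketch: the random bytes below are a stand-in for an encrypted payload, not real SRTP, and high entropy alone does not prove encryption, since compressed codecs can look random too.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty or constant data)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


silent_payload = bytes([0xFF]) * 160  # digital silence, as in the RTP example
random_payload = os.urandom(160)      # stand-in for an encrypted payload

print(shannon_entropy(silent_payload))
print(shannon_entropy(random_payload))
```

The silent payload scores zero because every byte is identical, while the stand-in for the encrypted payload lands close to the maximum possible for 160 bytes.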

Of course, the more obvious way to tell if SRTP is being used is to check SIP signaling. We can see the crypto attributes being offered for negotiation in the SDP of the SIP INVITE being sent from my device. These are the cipher suites my device supports.

Crypto attributes in the SDP of a SIP INVITE

The 200 OK sent by Twilio contains the cipher suite that was agreed upon, and will be used to encrypt the RTP.

Crypto attribute negotiated in the SIP 200 OK response
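The crypto attribute follows the SDP Security Descriptions format: a tag, a cipher suite name, and inline key material. The sketch below parses one such line. The key is hypothetical and generated purely for illustration, and the split into a 128-bit key and 112-bit salt assumes an AES_CM_128 suite.

```python
import base64
from typing import NamedTuple


class CryptoAttribute(NamedTuple):
    tag: int
    suite: str
    master_key: bytes
    master_salt: bytes


def parse_crypto_line(line: str) -> CryptoAttribute:
    """Parse 'a=crypto:<tag> <suite> inline:<base64 key+salt>[|params]'."""
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not a crypto attribute")
    tag, suite, key_params = line[len(prefix):].split()[:3]
    method, _, key_info = key_params.partition(":")
    if method != "inline":
        raise ValueError("unsupported key method")
    # Anything after '|' is optional lifetime / MKI information.
    key_salt = base64.b64decode(key_info.split("|")[0])
    # Assumes a 128-bit master key followed by a 112-bit master salt.
    return CryptoAttribute(int(tag), suite, key_salt[:16], key_salt[16:30])


# Hypothetical key material, not taken from any real call.
example_key = base64.b64encode(bytes(range(30))).decode()
line = f"a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:{example_key}"
attr = parse_crypto_line(line)
print(attr.suite, len(attr.master_key), len(attr.master_salt))
```

Because the key material travels inside the SDP, anyone who can read the signaling can decrypt the media, which is why the signaling itself needs to be protected.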

RTP header vs RTP payload

Now, SRTP specifically refers to the encryption of the RTP payload only. The payload is the part of an RTP packet that contains the digital audio information. With SRTP, the header is authenticated, but not actually encrypted, which means sensitive information could still potentially be exposed.

The main components of the RTP header are:

  • Payload Type - The encoding of the RTP packet (G.711 PCMU, Opus, etc.)
  • Sequence Number - An integer that increments with each RTP data packet sent. Can be utilized to detect and smooth jitter and packet loss.
  • Timestamp - Begins as a random starting number, and increments by the number of samples in each packet (sampling rate multiplied by packetization time).
  • Synchronization Source ID (SSRC) - A randomly chosen number that identifies the source of the RTP stream independently of its network address.
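Since these fields stay in cleartext even with SRTP, anyone holding a capture can read them. Here is a minimal sketch that unpacks the 12-byte fixed RTP header; the packet is hypothetical and built in the example itself rather than taken from the captures in this post.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,
        payload_type=second & 0x7F,  # low 7 bits; the top bit is the marker
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# Hypothetical packet: version 2, payload type 0 (PCMU), 160 bytes of silence.
packet = struct.pack("!BBHII", 0x80, 0, 4321, 160_000, 0x1234ABCD) + b"\xff" * 160
print(parse_rtp_header(packet))
```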

Keeping the header unencrypted is critical for proper routing, so SRTP only covers authenticating the header’s association with the payload, which aids in replay protection.

RTP vs SRTP diagram

In most cases the default header information is not considered sensitive, unlike the associated digital audio payload.

That said, Maria Haider, a researcher at KTH Royal Institute of Technology in Stockholm, argues that transmitting unencrypted RTP headers in cloud-based SIP environments still poses a significant risk.

“Since the server in the cloud belongs to a third party service provider… the endpoints do not want to risk their personal or corporate information... To achieve that, the double layer of security that is needed cannot be established without any meaningful modification of SRTP.”

When using variable bit rate (VBR) codecs, the rate information is not hidden by SRTP, since it can be read from the size of each packet. This can pose a risk, as a savvy eavesdropper could deduce the content of speech based on the rate information or speech level.

From RFC 6562: Guidelines for the Use of Variable Bit Rate Audio with Secure RTP:

“In the worst case, using the rate information to recognize a prerecorded message knowing the set of all possible messages would lead to near-perfect accuracy.”

Furthermore, RFC 6464 outlines a header extension which can contain audio level information, regardless of encoding.

“Such an attacker might be able to infer information about the conversation, possibly with phoneme-level resolution. In scenarios where this is a concern, additional mechanisms MUST be used to protect the confidentiality of the header extension.”

Telecom admins are advised to use padding in conjunction with VBR codecs, especially in use cases with structured conversations like with an IVR system.
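As a rough illustration of what padding buys you, the sketch below pads variable-size frames to one fixed length using RTP-style padding, where the last padding byte holds the padding count. The target length and frame sizes are hypothetical, and a real stack would also set the padding bit in the RTP header before encrypting.

```python
from typing import Tuple


def pad_payload(payload: bytes, target_len: int) -> Tuple[bytes, bool]:
    """Pad to target_len; returns the data and whether padding was added."""
    pad_len = target_len - len(payload)
    if not 0 <= pad_len <= 255:
        raise ValueError("payload does not fit the target length")
    if pad_len == 0:
        return payload, False
    # Zero filler, then one final byte counting all padding bytes.
    return payload + bytes(pad_len - 1) + bytes([pad_len]), True


def unpad_payload(data: bytes, padded: bool) -> bytes:
    """Strip RTP-style padding when the padding flag is set."""
    return data[:-data[-1]] if padded else data


TARGET_LEN = 80  # hypothetical fixed payload size
frames = [b"\x01" * 23, b"\x02" * 61, b"\x03" * 80]
sizes = {len(pad_payload(frame, TARGET_LEN)[0]) for frame in frames}
print(sizes)
```

With every payload the same length on the wire, packet size no longer tracks what the codec was encoding.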

RFC 6904 describes a future where these headers can be selectively encrypted based on their content or likelihood of containing sensitive info. This is not standard practice yet, though.

Why we encrypt

Eavesdropping

The above recordings were obtained by ARP Poisoning myself, and using Wireshark to capture and reconstruct the SIP/RTP packets.

Sniffing a SIP call in a test lab

Capturing and analyzing SIP traffic is an essential troubleshooting skill for any network engineer or VoIP technician.  However, bad actors also leverage packet capture tools in attempts to gain access to your data, or completely disable your infrastructure. Capturing network traffic for nefarious purposes is known as eavesdropping or packet sniffing.

The threat of eavesdropping is ever present, and difficult to avoid entirely, especially on wireless networks. It’s OK if you can’t install a Faraday cage at your home or office - using SRTP essentially renders sniffed packets useless, which mitigates the risk of data exposure that eavesdropping poses.

Replay attack

A replay attack occurs when a bad actor replays network packets that they have nefariously captured via eavesdropping.

In this video, David Bombal demonstrates how to specifically intercept and replay RTP packets.

If SRTP were enabled, an attacker could still eavesdrop, but would not be able to conduct a replay attack.
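Receivers typically block replays by remembering which packet indices they have already accepted. The following is a simplified sketch of such a sliding window. It ignores sequence number rollover, and a real SRTP stack only updates the window after the packet has passed authentication.

```python
class ReplayWindow:
    """Sliding-window replay check over packet indices."""

    def __init__(self, size: int = 64) -> None:
        self.size = size
        self.highest = -1  # highest index accepted so far
        self.seen = 0      # bitmask: bit i set means (highest - i) was seen

    def accept(self, index: int) -> bool:
        """Return True and record the index if it is fresh, else False."""
        if index > self.highest:
            shift = index - self.highest
            self.seen = ((self.seen << shift) | 1) & ((1 << self.size) - 1)
            self.highest = index
            return True
        offset = self.highest - index
        if offset >= self.size:
            return False   # too old to track
        if self.seen & (1 << offset):
            return False   # already received: a replay
        self.seen |= 1 << offset
        return True


window = ReplayWindow()
print([window.accept(i) for i in (0, 1, 1, 5, 3, 3)])
```

The second copy of packet 1 and the second copy of packet 3 are rejected, while the late but fresh packet 3 is still accepted the first time it arrives.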

Replay attacks should not necessarily be confined to the context of replaying network packets. Attackers can replay information they capture from eavesdropping. For example, a fraudster may conduct social engineering attacks, and pose as an authorized individual using information gleaned from eavesdropping on phone calls. It is important to consider enabling SRTP, and to follow anti-fraud best practices for calls where any remotely sensitive information is discussed.

Payload integrity

As mentioned above, SRTP headers are authenticated, but not encrypted. In other words, while the SRTP header is sent in the clear, the receiver can validate that the sender’s headers are actually associated with the encrypted payloads they precede.

The sender runs the SRTP header and encrypted payload through a keyed hash function (HMAC-SHA1 by default), using the session authentication key, which produces a digest called the auth tag. The sender then appends the auth tag to the end of the encrypted payload, and sends the fully constructed SRTP packet to the receiver.

The receiver chips off the auth tag sent to them, does the same HMAC-SHA1 digest generation, and compares the two values. If they match, the plaintext header is associated with the encrypted payload.

However, an attacker can modify SRTP packets without the receiver knowing when the sender uses a weak and/or vulnerable message authentication method. It is best to start with the defaults listed in the SRTP RFC (RFC 3711), and use HMAC-SHA1, with a session authentication key length of 160 bits, and a resulting authentication tag length of 80 bits.
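The tag generation and comparison described above can be sketched with Python's standard library. The key, header, and payload here are hypothetical placeholders. In real SRTP the session authentication key is derived from the master key, and the rollover counter is tracked by each endpoint rather than sent in the packet.

```python
import hashlib
import hmac
import struct

TAG_LENGTH = 10  # 80-bit authentication tag


def auth_tag(auth_key: bytes, header: bytes, encrypted_payload: bytes, roc: int) -> bytes:
    """HMAC-SHA1 over header, encrypted payload and rollover counter, truncated."""
    message = header + encrypted_payload + struct.pack("!I", roc)
    return hmac.new(auth_key, message, hashlib.sha1).digest()[:TAG_LENGTH]


def verify(auth_key: bytes, header: bytes, encrypted_payload: bytes,
           roc: int, received_tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = auth_tag(auth_key, header, encrypted_payload, roc)
    return hmac.compare_digest(expected, received_tag)


# Hypothetical values, for illustration only.
key = bytes(20)                      # 160-bit session authentication key
header = bytes.fromhex("800010e1000271001234abcd")
payload = bytes(range(160))          # stand-in for an encrypted payload

tag = auth_tag(key, header, payload, roc=0)
tampered = header[:2] + b"\xff\xff" + header[4:]
print(verify(key, header, payload, 0, tag))
print(verify(key, tampered, payload, 0, tag))
```

The untouched packet verifies, while changing even the sequence number in the cleartext header causes the check to fail.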

Needless to say, using a zero-length authentication tag should absolutely be avoided.

SRTP and the PSTN

The Public Switched Telephone Network (PSTN), by nature, does not support SRTP and in some cases is infamously unencrypted.

Bird sitting on a power line

SRTP is specific to SIP communications, which run over the Internet.

So, if your SIP calls hit the PSTN, the media will undoubtedly be unencrypted at some point, even with SRTP configured.

The PSTN’s main protocol, SS7, does utilize digital signatures, and the comms between cell phones and towers are (mostly) secure.

Whenever possible, be sure to work with your telecom provider to understand their security policies, their response to eavesdropping threats, and the risk of data exposure over their network. At SIGNAL in 2017, B Byrne, the Head of Product for Authy, discussed how exposed vulnerabilities in the SS7 network fundamentally changed how the telecom industry approaches security.

On the other hand, the media will stay encrypted as long as the call hops through SIP B2BUAs over the Internet with SRTP negotiated on each leg. So a business doing all of its calling through SIP infrastructure, even if not physically colocated, can still leverage SRTP for end-to-end encryption.

Too much overhead?

I know exactly what you are thinking. “The quality of a VoIP call is directly related to the conditions present while the call is in progress. Since encryption uses valuable computational resources, voice quality will surely suffer when SRTP is enabled!”

Some users choose to forgo SRTP due to resource constraints or implementation complexities, but this may be misguided.

Researchers at Towson University performed a study on the processing overhead of SRTP in various environments. The results “indicate that SRTP adds negligible overhead to VoIP processing and has no observable effect on VoIP quality.”

Twilio provides a Voice Insights dashboard and REST API so you can monitor voice quality in your voice application, and compare your own metrics when toggling Secure Media.

Voice Insights dashboard

Using SRTP with Twilio

Twilio Programmable Voice and Elastic SIP Trunking both support SRTP. Configuration on our end couldn’t be easier.

Let’s run through how to set it up!

Voice Trace is a feature that captures RTP on a call so Twilio Support can analyze the packet captures for calls with DTMF, Dialogflow, and/or certain audio quality issues. Encrypting the RTP renders these Voice Trace packet captures useless to Twilio Support, since they won’t be able to read them.  Please consider how you will deal with these types of issues before enabling SRTP on your SIP Trunk or Domain. For example, you might create an isolated unencrypted test trunk/domain that mimics your encrypted production config to use for testing/troubleshooting purposes.

Elastic SIP Trunking

https://www.twilio.com/docs/sip-trunking#securetrunking

Log into Console, and click on the Trunk you wish to secure.

Under Features, click the toggle to enable Secure Trunking.

SIP Trunking Secure Media switch

SIP Domains

https://www.twilio.com/docs/voice/api/secure-media

Log into Console, and click the SIP Domain you wish to secure.

Under Secure Media, click the toggle to enable.

SIP Domain Secure Media switch

Don’t forget to Save!

Save button

The same settings can also be applied with the REST API. For an Elastic SIP Trunk:

curl -X POST https://trunking.twilio.com/v1/Trunks/TKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
--data-urlencode "Secure=true" \
-u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN

And for a SIP Domain:

curl -X POST https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/SIP/Domains/SDXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.json \
--data-urlencode "Secure=true" \
-u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
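If you would rather script this in Python than curl, the Elastic SIP Trunking request can be assembled with the standard library alone. This is a sketch, not official Twilio sample code: the function name and the placeholder SIDs are made up, and the request is only built here, not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_secure_trunk_request(trunk_sid: str, account_sid: str,
                               auth_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that sets Secure=true on a trunk."""
    url = f"https://trunking.twilio.com/v1/Trunks/{trunk_sid}"
    body = urllib.parse.urlencode({"Secure": "true"}).encode()
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Basic {credentials}"},
    )


# Placeholder identifiers; substitute your own SIDs and auth token.
request = build_secure_trunk_request("TK" + "X" * 32, "AC" + "X" * 32, "your_auth_token")
print(request.get_method(), request.full_url)
print(request.data)
```

Passing the resulting object to `urllib.request.urlopen` would actually send it, so only do that with real credentials and a trunk you intend to change.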

Outbound Calls

On outbound calls, simply append ;transport=tls to the end of the SIP URI.

<Dial><Sip> TwiML

https://www.twilio.com/docs/voice/twiml/sip#transport

xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Dial>
        <Sip>
            sip:secure@sip.twilio.com;transport=tls
        </Sip>
    </Dial>
</Response>

The same outbound call can be placed through the REST API:

curl -X POST https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Calls.json \
--data-urlencode "Url=http://demo.twilio.com/docs/voice.xml" \
--data-urlencode "To=sip:secure@sip.twilio.com;transport=tls" \
--data-urlencode "From=+12345678901" \
-u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN

Signaling over TLS is also available, but is typically configured and negotiated separately. At Twilio, however, enabling Secure Media also requires SIP signaling to be sent over TLS. Packet capture files are not available in Twilio Console for encrypted calls like they are for unencrypted SIP calls.

That’s it on the Twilio side! Be sure to configure SRTP on your PBX according to the vendor’s instructions.

Implementation considerations

As seen above, enabling Secure Media on the Twilio side is very straightforward. However, you will need to do some configuration on your end as well.

Each SIP device is like a snowflake, with its own unique characteristics, interface, terminology, config schema, functionality, usability, supportability, limitations, etc. These complexities are compounded by the network conditions and overall environment the device is operating within.

In other words, there is no one singular way to enable SRTP that works for every user on every device.

As a disclaimer, this post is not meant to be a technical manual, so these configuration steps are neither exhaustive nor instructional. That said, a summary of my experience with a few different systems may serve as a curative aid to readers who are struggling with the idiosyncrasies of their own beautiful, unique SIP systems. This is a snapshot in time, and this post will not be maintained or updated.

pjsua

pjsua (pjsip user agent) is a command line SIP softphone that comes bundled with the pjsip install. It is great for testing, automation, embedded devices, and impressing your friends.

pjsua CLI

You won’t find a familiar telephone UI with pjsua. Instead, endpoints register via a config file, outbound calls can be scripted in almost any language, and SIP signaling can be generated via commands. For the most part, it ‘just works’, but to utilize SRTP, the source code must explicitly be built with TLS support enabled.

If OpenSSL libraries are not installed and/or in appropriate working order on the target machine, the build may succeed but SRTP will still not function.

Bria 5

Bria by CounterPath is a leading softphone application. SRTP is only available with the paid subscription models. However, with the proper access, enabling TLS and SRTP is only a few clicks away.

Bria 5 SRTP config menu

SIPp

SIPp is a great tool for SIP testing, and is especially useful for the telecom tech or network admin in charge of administering Twilio Elastic SIP Trunking.

SIPp CLI

Call flows and SIP signaling can be configured with XML, which brings a TwiML-esque feel to SIP Trunking call control.

However, the official build of SIPp does not support SRTP.  At all. Luckily, ankitonweb's forked version does.

Poly (previously Polycom)

Most Poly phones have the option to enable SRTP on every call to/from the device, or per Registration.

The most common method is to use the device’s UI. Again, every model is different, but the Polycom forums outline one way.

However, this setting can also be enabled by provisioning a configuration file.

FreePBX

Within FreePBX, SRTP must be enabled in both the General SIP Settings, and within the settings for the Extension(s) you wish to encrypt media on.

FreePBX SRTP config

The exact steps are described on the FreePBX wiki.

Conclusion

SRTP is awesome, yet woefully underutilized. There is almost no reason to NOT use SRTP with your Twilio SIP Domain and/or Elastic SIP Trunking setup. The added security and peace of mind far outweigh any potential overhead. Furthermore, any imperfections in the spec do not render encrypted payloads useless.

Even if you are leaving SIP Land and dancing with the PSTN, SRTP is your best mitigation against bad actors sniffing, modifying, or replaying your sensitive communications.  

Check out our docs, and let us know about your experiences in our Community Forums. We can’t wait to not see what you encrypt!

Matt Coser is a Senior Field Security Engineer at Twilio. His focus is on telecom security, and empowering Twilio’s customers to build safely. Hit him up on LinkedIn to connect and discuss more.