An Update on the Feb 26, 2021 Service Disruption

March 03, 2021
Written by Twilio

To our customer community,

Twilio's mission is to enable companies to reinvent how they engage with customers. It's the goal that every Twilion works towards, but we did not reach that bar when you experienced a service disruption on February 26, 2021. We understand that this may have disrupted your commerce, prevented conversations with your customers, and delayed launches.

One of our values is to "wear the customer's shoes," and we know we’ve let you down. To everyone who experienced a service disruption, we offer our sincere apologies and our commitment to make our services even more resilient and reliable for you moving forward.

The rest of this document explains what happened in this specific case and what we are doing about it.

What happened

On Friday, February 26, 2021, one of Twilio's internal services suffered a service disruption that impacted a broad set of Twilio products from 5:00am PST to 7:30am PST — a total duration of 2.5 hours, which far exceeds our targets for diagnosis and correction.

During the impact period, customers experienced increased latency, errors from our API, inaccessible web interfaces, and/or undelivered messages. Impacted products included SMS, Flex, Console, and others. Although the service disruption was detected and our on-call engineering team was notified within one minute, our Status Page was not updated for 25 minutes, which led to further customer uncertainty.

Root cause

The root cause of the service disruption was the overloading of a critical service that manages feature enablement for many Twilio products. Multiple Twilio products that rely on this feature-enablement service did not handle its failure gracefully and began to fail themselves, manifesting as customer-facing API errors and increased latency.

Resolution

To resolve the immediate issue, we increased server capacity and added caching to reduce the load on the service. This took longer than anticipated because our standard procedure for bringing additional capacity online did not fully account for the ongoing load generated as other services continuously retried their failed requests.
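
To illustrate the retry pressure described above, the following is a minimal sketch of one common way for a client to space out retries: capped exponential backoff with random jitter. It is an illustration only, not Twilio's internal code; the function names, parameters, and default values are all hypothetical.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.1, cap: float = 5.0) -> List[float]:
    """Upper bound of the wait before each retry: base * 2**n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying up to `retries` times with jittered backoff.

    Hypothetical helper: a real client would usually retry only on
    errors known to be transient rather than on every exception.
    """
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter: wait a random time up to the current bound so
            # that many clients do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

Compared with retrying immediately in a loop, this kind of spacing reduces the number of requests an already overloaded service receives while it is recovering.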

These changes will remain in place to prevent recurrence while we deploy additional, permanent protections and process improvements.

Our path forward 

During our review of this disruption, we identified several improvements that will prevent this specific issue from recurring. We will be making the following changes:

  • Reconfiguring the service with more aggressive auto-scaling behavior to better handle traffic spikes.
  • Removing this service from critical paths and making client-side caching the default behavior to prevent service unavailability.
  • Reducing the service’s request timeout and refactoring the service’s API to increase scalability.
  • Reconfiguring the service’s failover mechanism to increase resilience in the event of failures.
  • Refactoring the server’s approach to caching to decrease workloads.
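
As a rough illustration of the client-side caching mentioned in the list above, the sketch below caches feature-flag lookups for a short time and falls back to the last known value, or a default, when the lookup fails. It is an assumption-laden example rather than Twilio's implementation: the `FlagCache` class, its parameters, and the 30-second lifetime are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FlagCache:
    """Hypothetical client-side cache for feature-flag lookups."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # function that calls the remote service
        self._ttl = ttl_seconds  # how long a cached value counts as fresh
        self._clock = clock
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def get(self, flag: str, default: bool = False) -> bool:
        now = self._clock()
        entry = self._entries.get(flag)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh value: no request to the service
        try:
            value = self._fetch(flag)
        except Exception:
            # Service unavailable: serve the stale value if there is one,
            # otherwise the caller's default, instead of failing.
            return entry[0] if entry is not None else default
        self._entries[flag] = (value, now)
        return value
```

With a cache like this in front of each lookup, a client sends at most one request per flag per lifetime window and keeps working from its last known values if the service becomes unreachable.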

We are also reviewing our tooling and procedures for communicating with customers during disruptions, including via our status.twilio.com page, to ensure you have accurate and up-to-date information.

Our post-incident review process remains at a relatively early stage. To prevent similar issues with other services, we are taking the following steps across our engineering organization:

  • Conducting an audit of our codebase to identify services with similar risk characteristics and remediate as appropriate.
  • Instituting common architecture best practices for client services to degrade more gracefully.
  • Improving our deployment tooling and on-call runbooks to better manage server fleet capacity across all our services, eliminating manual steps and shortening future time-to-recovery.

Further action items will be identified and shared with you as this process progresses.  

Final note

We take our mission to empower you to engage with your customers very seriously. While the service disruption is obviously regrettable, we are using it as an opportunity to improve our processes, speed up our reaction time, increase our transparency, and live up to your understandably high expectations for customer support and service.

Once again, we’d like to apologize for the inconvenience we caused and thank you for being a valued customer.