Engineering Improvements to Prevent Service Disruptions

April 16, 2021
Written by Twilio


Twilio suffered a service disruption on Feb 26, 2021. 

When we fall short, as we did on February 26, it motivates us to learn and to make our services more resilient and reliable. Nothing is more important than regaining your trust. We want the opportunity to show that Twilio can and will be a reliable and consistent partner. 

We also want to reaffirm our commitment to you that when an incident occurs that disrupts your customer communications, we will always tell you about it. “No shenanigans” is our ethos. Striving to act in an honest, direct, and transparent way is a value every Twilion lives by, and in that spirit we want to share the improvements we’re making for both the short and long term. 

Recap of the Feb 26, 2021 service disruption: 

On Friday, February 26, 2021, one of Twilio's internal services suffered a disruption that impacted a broad set of Twilio products from 5:00am PST to 7:30am PST. The root cause was an overload of a critical service that manages feature enablement for many Twilio products. Although the disruption was detected and our on-call engineering team was notified within one minute, our Status Page did not update for 25 minutes, which led to further customer uncertainty. To resolve the immediate issue, we increased server capacity and added caching to reduce the load on the service. To read more about the service disruption, head over to this blog post.

We have identified a total of 37 technical improvements to the Feature Enablement Service, which was the cause of the failure on Feb 26, 2021. Of these 37 improvement opportunities, we have completed 24 to date. Examples of critical completed and in-flight technical improvements are below: 

Completed:

  • Reconfigured our Feature Service with more aggressive auto-scaling behavior to better handle traffic spikes.
  • Removed this service from critical paths and made client-side caching the default behavior, so that callers stay available even when the service is not. 
  • Reduced the service’s request timeout and refactored the service’s API to increase scalability (a minimal sketch of these client-side changes follows this list). 
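
Twilio hasn’t published the code behind these changes, but as a rough illustration of what client-side caching plus a short request timeout can look like, here is a minimal Python sketch. The client name, endpoint, timeout, and cache TTL are all assumptions made for this example, not Twilio’s actual implementation.

```python
import time

import requests

# Illustrative values only: the endpoint, timeout, and TTL below are
# assumptions for this sketch, not Twilio's actual configuration.
FEATURE_SERVICE_URL = "https://feature-service.internal/flags"
REQUEST_TIMEOUT_S = 0.25   # fail fast instead of blocking the caller
CACHE_TTL_S = 60           # serve cached flags for up to a minute


class FeatureFlagClient:
    """Hypothetical client that caches flag lookups and falls back to the
    last known value when the feature service is slow or unavailable."""

    def __init__(self):
        self._cache = {}  # flag name -> (value, fetched_at)

    def is_enabled(self, flag, default=False):
        cached = self._cache.get(flag)
        if cached and time.monotonic() - cached[1] < CACHE_TTL_S:
            return cached[0]  # fresh cache hit, no network call
        try:
            resp = requests.get(
                FEATURE_SERVICE_URL,
                params={"flag": flag},
                timeout=REQUEST_TIMEOUT_S,
            )
            resp.raise_for_status()
            value = bool(resp.json().get("enabled", default))
            self._cache[flag] = (value, time.monotonic())
            return value
        except requests.RequestException:
            # Feature service is down or slow: serve the last cached value
            # (even if stale) or the caller's default rather than failing.
            return cached[0] if cached else default
```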

In-flight with a completion ETA in Q2 2021:

  • Refactor the service’s API contract to increase scalability. 
  • Reconfigure the service’s failover mechanisms to be more resilient in the event of system failure (one common pattern, a circuit breaker, is sketched after this list). 
  • Refactor the service’s approach to caching to decrease workloads.
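
The failover work is described only at a high level; one common pattern for isolating a struggling dependency is a circuit breaker, sketched below in Python. The class, thresholds, and cool-down values are hypothetical and illustrate the general technique rather than Twilio’s actual design.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker: after repeated failures, stop calling
    the downstream service for a cool-down period and use a fallback."""

    def __init__(self, max_failures=5, reset_after_s=30):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        # While the breaker is open, skip the primary call entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None  # cool-down elapsed, probe again
            self.failures = 0
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

In practice, each call to the dependency would go through the breaker with a cheap fallback, such as a cached or default value, so that repeated failures stop adding load to an already struggling service.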

Mitigating the risk of similar issues across other services:

To improve our engineering operational processes and prevent similar failures from recurring, we’ve identified several holistic changes to be made across our engineering organization. Each of these changes is in flight with a completion ETA in Q2 2021.

  • Complete an audit of our production systems to identify services with similar risk characteristics and remediate as appropriate. 
  • Ensure all client services are configured to degrade more gracefully in the event of downstream failures (see the sketch after this list).
  • Improve our deployment tooling and on-call runbooks to better manage server fleet capacity across all our services, eliminating manual steps and shortening future time-to-recovery. 
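
What “degrade more gracefully” means will vary by service. As one illustration, the sketch below treats a downstream dependency as optional and substitutes a safe default instead of failing the whole request; the endpoint and default values are hypothetical.

```python
import logging

import requests

log = logging.getLogger(__name__)

# Hypothetical endpoint and defaults, for illustration only.
FEATURE_SERVICE_URL = "https://feature-service.internal/flags"
SAFE_DEFAULTS = {"enabled_features": []}


def get_optional(url, default, timeout_s=0.2):
    """Call an optional downstream dependency; on any failure, log a
    warning and return a safe default instead of propagating the error."""
    try:
        resp = requests.get(url, timeout=timeout_s)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException as exc:
        log.warning("optional dependency %s unavailable: %s", url, exc)
        return default


# The caller keeps serving requests with reduced functionality rather
# than failing outright when the dependency is down.
flags = get_optional(FEATURE_SERVICE_URL, SAFE_DEFAULTS)
```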

Additionally, we are procuring and introducing new standardized tooling, and we have defined a new role to coordinate technical efforts during disruptions such as this one.

Final note: 

Once again, we’d like to apologize for the inconvenience the disruption caused and to thank you for being a valued customer. We will continue to update you as we roll out additional improvements across our processes and functions.