Starting early this morning, Amazon Web Services experienced several service problems at one of its east coast datacenters. The outage impacted major sites across the Internet. The number of high profile sites affected by the issue shows both the amazing success of cloud services in enabling the current Internet ecosystem, and also the importance of solid distributed architectural design when building cloud services.
Twilio’s APIs and service were not affected by the AWS issues today. As we’ve grown and scaled Twilio on Amazon Web Services, we’ve followed a set of architectural design principles to minimize the impact of occasional, but inevitable issues in underlying infrastructure.
Unit-of-failure is a single host
Where possible, choose services and infrastructure that assume host failures happen. By building simple services composed of a single host, rather than multiple dependent hosts, one can create replicated service instances that can survive host failures.
For example, if we had an application that consisted of business logic components A, B, and C, each of which had to live on a separate host, we could compose service groups (A, B, C), (A, B, C)... or, we could create component pools (A, A, ...), (B, B, ...), (C, C, ...). With the composition (A, B, C), a single machine failure would result in the loss of a whole system group. By decomposing resources into independent pools, a single host failure only results in the loss of a single host’s worth of functionality. We’ll cover more on the benefits and drawbacks of this approach in another post.
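The difference between the two compositions can be sketched with a toy capacity calculation. This is purely illustrative (the three-host pools and the component names are assumptions, not a description of any real deployment):

```python
# Illustrative sketch: how much capacity each component retains after a
# single host failure, under the grouped vs. pooled compositions above.

GROUPS = [("A", "B", "C")] * 3          # three hosts, each running A, B, and C
POOLS = {"A": 3, "B": 3, "C": 3}        # three independent single-component hosts each

def grouped_impact(failed_hosts):
    # Losing one host takes an entire (A, B, C) group offline,
    # so every component loses capacity together.
    surviving = len(GROUPS) - failed_hosts
    return {c: surviving / len(GROUPS) for c in ("A", "B", "C")}

def pooled_impact(failed_a_hosts):
    # A failure in the A pool leaves the B and C pools at full capacity.
    return {
        "A": (POOLS["A"] - failed_a_hosts) / POOLS["A"],
        "B": 1.0,
        "C": 1.0,
    }

print(grouped_impact(1))  # every component drops to 2/3 capacity
print(pooled_impact(1))   # only A drops; B and C are untouched
```

One host failure in the grouped design degrades all three components at once; in the pooled design the blast radius is a single host's worth of one component.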
Short timeouts and quick retries
When failures happen, have software quickly identify those failures and retry requests. By running multiple redundant copies of each service, one can use quick timeouts and retries to route around failed or unreachable services.
- Make a request; if it returns a transient error or doesn’t return within a short period of time (the meaning of “short” depends on your application), treat it as failed.
- Retry the request against another instance of the service.
- Keep retrying, within the tolerance of the upstream service.
If you don’t fail fast and retry, distributed systems, especially those that are process or thread-based, can lock up as resources are consumed waiting on slow or dead services.
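The steps above can be sketched in a few lines. This is a minimal illustration, not Twilio code; the instance addresses and the `flaky_fetch` backend are made up for the example:

```python
import socket

def call_with_retries(instances, request_fn, timeout=0.5, max_tries=3):
    """Try each redundant instance in turn, failing fast on transient errors."""
    last_error = None
    for instance in instances[:max_tries]:
        try:
            return request_fn(instance, timeout=timeout)
        except (socket.timeout, ConnectionError) as exc:
            last_error = exc        # transient failure: move on quickly
    raise last_error                # exhausted the pool's tolerance

def flaky_fetch(instance, timeout):
    # Simulated backend: the first instance is down, the rest respond.
    if instance == "10.0.0.1":
        raise ConnectionError("instance unreachable")
    return f"ok from {instance}"

print(call_with_retries(["10.0.0.1", "10.0.0.2"], flaky_fetch))
# → ok from 10.0.0.2
```

The key property is that a dead instance costs the caller one short timeout rather than an indefinite wait, because the request is immediately re-routed to a healthy replica.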
Idempotent service interfaces
Build services that allow requests to be safely retried. If you aren’t familiar with the concept, go read up on the wonderful world of idempotency.
"In computer science, the term idempotent is used more comprehensively to describe an operation that will produce the same results if executed once or multiple times."
If the API of a dependent service is idempotent, that means it is safe to retry failed requests (see “Short timeouts and quick retries” above). For example, if a service provides the capability to add money to a user’s account, an idempotent interface to that service allows failed requests to be safely retried. There’s a lot to this topic; we’ll make a point of covering it in much more detail in the future.
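One common way to make a non-idempotent operation like “add money” safe to retry is to have the caller attach a unique request identifier, so a replayed request is applied at most once. The sketch below is a hypothetical in-memory service, not a description of any real billing system:

```python
# Hypothetical sketch of an idempotent "add money" interface: each request
# carries a caller-chosen request_id, so a retried request is applied once.

class AccountService:
    def __init__(self):
        self.balances = {}
        self.applied = set()   # request_ids already processed

    def credit(self, account, amount, request_id):
        if request_id in self.applied:
            return self.balances[account]   # safe no-op on retry
        self.applied.add(request_id)
        self.balances[account] = self.balances.get(account, 0) + amount
        return self.balances[account]

svc = AccountService()
svc.credit("alice", 100, request_id="req-1")
svc.credit("alice", 100, request_id="req-1")   # timed-out caller retries
print(svc.balances["alice"])  # → 100, not 200
```

With this interface, a caller that times out can blindly retry without risking a double credit, which is exactly what the fail-fast-and-retry pattern requires of its dependencies.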
Small stateless services
Separate business logic into small stateless services that can be organized in simple homogeneous pools. Twilio’s infrastructure contains many service pools that implement parts of our voice and SMS APIs. For example, when you make a recording using the <Record> verb in TwiML, the work of post-processing the recording to improve the audio quality and upload it to persistent storage is provided by a pool of recording servers. The pool of stateless recording services allows upstream services to retry failed requests on other instances of the recording service. In addition, the size of the recording server pool can easily be scaled up and down in real-time based on load.
Relax consistency requirements
When strict consistency is not required, create pools of replicated and redundant read data. One of the most important conceptual separations you can make at the application level is to partition the reading and writing of data. For example, if there is a large pool of data that is written infrequently, separate the reads and writes to that data. By making this separation, one can create redundant read copies that independently service requests. For example, by writing to a database master and reading from database slaves, you can scale up the number of read slaves to improve availability and performance.
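At the application level, the read/write split often comes down to a small routing decision in the data-access layer. The following is a simplified sketch under assumed names (`db-master`, `db-read-*` are placeholders, and real routers inspect queries far more carefully):

```python
import itertools

# Hypothetical sketch: send writes to the master, spread reads
# round-robin across redundant read replicas.

class ReadWriteRouter:
    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, query):
        # Writes must go to the master; reads can hit any replica.
        if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master
        return next(self._replicas)

router = ReadWriteRouter("db-master", ["db-read-1", "db-read-2"])
print(router.connection_for("SELECT * FROM users"))       # → db-read-1
print(router.connection_for("UPDATE users SET ..."))      # → db-master
```

Because the replicas are interchangeable, adding read capacity is just a matter of growing the replica list, and a dead replica can be skipped the same way a dead service instance is.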
The issues at AWS illustrate the need to carefully think through the design of cloud-hosted applications. We’ve highlighted several well-known best practices of distributed system design that have been helpful to us as we decide what software to build and what external services to integrate into Twilio.
[UPDATE] A central theme of the recent AWS issues has been the Amazon Elastic Block Storage (EBS) service. We use EBS at Twilio, but only for non-critical and non-latency-sensitive tasks. We’ve been a slow adopter of EBS for core parts of our persistence infrastructure because it doesn’t satisfy the “unit-of-failure is a single host” principle. If EBS were to experience a problem, all dependent services could also experience failures. Instead, we’ve focused on utilizing the ephemeral disks present on each EC2 host for persistence. If an ephemeral disk fails, that failure is scoped to that host. We are planning a follow-on post describing how we do RAID 0 striping across ephemeral disks to improve I/O performance.
This is one of the first posts to our new Twilio engineering blog. Our team is excited to share our experiences in building and scaling Twilio to utilize the capabilities of cloud platforms like Amazon AWS. If you are interested in this topic, there are several great blogs that cover problems building distributed systems.