As more and more of the services we use every day move online and into the cloud, the design and operation of highly available infrastructure is becoming increasingly critical. However, downtime is common even at the biggest sites on the Internet. Why is high availability so hard to achieve?
Availability is the time a system is actually operating, expressed as a percentage of the total time it should be operating. It is the total uptime divided by the sum of total uptime and total downtime. "Five nines," or 99.999%, is a common goal in availability-critical environments. 99.999% availability translates to 5.26 minutes of downtime per year, or 25.9 seconds of downtime per 30-day month.
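To make the arithmetic concrete, here is a minimal sketch of the downtime budget for a given availability target. The function and constant names are illustrative assumptions, as are the choices of a 365.25-day year and a 30-day month.

```python
# Assumed period lengths, in seconds (not from any standard library).
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 3600


def availability_pct(uptime_s: float, downtime_s: float) -> float:
    """Availability as a percentage: uptime / (uptime + downtime)."""
    return 100.0 * uptime_s / (uptime_s + downtime_s)


def downtime_budget_s(target_pct: float, period_s: float) -> float:
    """Seconds of downtime allowed in a period at the given availability target."""
    return period_s * (1.0 - target_pct / 100.0)


for target in (99.9, 99.99, 99.999, 99.9999):
    per_year_min = downtime_budget_s(target, SECONDS_PER_YEAR) / 60
    per_month_s = downtime_budget_s(target, SECONDS_PER_30_DAY_MONTH)
    print(f"{target}%: {per_year_min:.2f} min/year, {per_month_s:.1f} s/month")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the jump from four nines to five changes how failures have to be handled.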
For services that must achieve five or six nines, human intervention in failure detection and mitigation simply isn't practical. Imagine you have a server and a correctly configured monitoring system that can detect any and all service-impacting failures. In addition, all failures correctly trigger a pager service that notifies an on-call team. When a failure is detected and handled by a human process, the on-call team must be able to hop on a computer, access the Internet, diagnose the failure, and fix the issue -- all within a window of five minutes, which is the entire downtime budget for the year. This process simply isn't realistic. High-availability infrastructure must therefore be able to automatically recover from failures without human intervention.
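As a rough illustration of what recovery without a human in the loop can look like, the sketch below routes to the first backend that passes a health check. It is a simplified assumption, not a description of any particular system: `pick_healthy`, the backend names, and the health-check callable are all hypothetical. It also only covers interchangeable, stateless backends.

```python
from typing import Callable, Optional, Sequence


def pick_healthy(
    backends: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first backend that passes its health check, or None."""
    for backend in backends:
        try:
            if is_healthy(backend):
                return backend
        except Exception:
            # A health check that raises is treated as a failed check.
            continue
    return None


# Hypothetical usage: "app-1" is down, so traffic should move to "app-2".
down = {"app-1"}
active = pick_healthy(["app-1", "app-2"], lambda name: name not in down)
print(active)
```

A loop like this can react in seconds rather than minutes, but it assumes any backend can serve any request, which is exactly what stateful components break.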
It's the database
One of the hardest infrastructure components to automate is the database. Databases and other stateful components are complex and have traditionally required significant human intervention to configure and fail over. This makes data stores difficult to maintain and to make highly available. This is especially true in cloud computing environments, where the traditional approach of running data nodes on expensive special-purpose hardware that automatically handles failover isn't possible. To put it in more dramatic language,
Data persistence is the hardest technical problem most scalable SaaS businesses face.
We recently gave a talk at Web 2.0 Expo in New York that explores the root causes of downtime and the architectural choices we've made at Twilio in handling stateful components to provide high availability.
Key points from the talk:
- Data persistence issues and change control mistakes are two key causes of downtime
- Data persistence is hard
  - hard to change schema or structure due to the need to rewrite data and indices
  - hard to recover from failures due to complex split-brain situations and the time it takes to restore or recover state from another node or backup
  - hard to scale due to poor I/O performance -- lower I/O bandwidth and higher latency in the cloud than on dedicated hardware
  - hard to manage due to the incredible complexity of modern data management systems
- When building high-availability cloud applications, one should consider
  - clearly differentiating stateful and stateless components in your infrastructure
  - avoiding the storage of data where possible
  - using unstructured storage instead of structured storage where possible
  - aggressively archiving or deleting data that doesn't need to be online
  - defining clear human-mediated processes for change control
  - pragmatically picking the data storage technology for the problem -- SQL is best for certain tasks, whereas a key-value store or cloud storage like S3 is better for others