Lazarus: Twilio’s Cloud Scale Automated Microservice Remediation System
Is there anyone who likes being on call, waking up in the middle of the night to a buzzing pager, then having to execute a remediation runbook while half-asleep?
Most startups operate like this during the early phases of their company growth. As a company grows, this model of operation is neither sustainable nor an optimal use of engineering resources. To truly scale, you will need to look at automating frequent – and potentially risky – operational tasks. At Twilio, we built Lazarus, a command and control system to respond to events and automate runbooks.
Automating Frequent Remediation Tasks
At Twilio, we run thousands of microservices and over tens of thousands of instances. These microservices are owned and operated by many small teams. We strongly believe this is the best way to design and operate a large scale distributed system. An individual team is closest to the problem space and they can ...