
Twilio engineers are constantly working on improving our core services to meet or exceed 5 nines of availability. A system’s capacity for self-healing when a fault occurs is a key measure of achieving high availability. Recently, Twilio used Chaos Engineering to close the gap and eliminate the need for human intervention for common faults involving our core queueing-and-rate-limiting system, Ratequeue.
Background
Twilio empowers customers to make 100’s of concurrent API requests. However, your messages may be unexpectedly rate-limited, such as carrier regulations (US long codes), contractual limitations (US short codes), or technological (e.g. capacity). As a safeguard, Twilio automatically queues outbound messages for delivery. Each Twilio phone number gets a separate queue and messages are dequeued at rate depending on the contract.
Ratequeue
Ratequeue is an internally-developed distributed queueing system designed for simultaneously rate-limiting dequeue rates for many ephemeral queues. Each sender ID (e.g. phone number, short code, Alphanumeric …