Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause

Twilio experienced an incident with its billing system on July 18, 2013. Although we’ve shared how the incident unfolded, and the impact on our customers, we’d like to detail the root cause, how we fixed it, and what we’re doing to ensure this doesn’t happen in the future.

Incident Summary and Impact

Twilio experienced an incident with its billing system on July 18, 2013. This incident affected 1.4% of Twilio’s customers in up to three ways:

  • If a customer made a credit card payment to Twilio during the incident, the account balance was not updated to reflect the payment.
  • If the credit card payment was triggered by an auto-recharge, the recharge was attempted multiple times as a result of the account balance not being updated to reflect the payment.
  • Some accounts were suspended as a result of a recharge attempted against a credit card deactivated by the repeated billing.

Additionally, Twilio usage reports were delayed in reflecting the prices of billable items for all customers while the billing system was offline. Voice and SMS message services were not impacted during this incident. However, prices were not calculated for calls and messages made during the incident.

Timeline

July 18, 2013

  • 1:35 AM PDT / 8:35 UTC: We experienced a loss of network connectivity between all of our billing redis-slaves and our redis-master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time.
  • 2:39 AM PDT / 9:39 UTC: Services relying on the redis-master began to fail due to the load generated by the slave synchronization.
  • 2:42 AM PDT / 9:42 UTC: Our on-call engineers restarted the redis-master to address the high load.
  • 3:28 AM PDT / 10:28 UTC: Our monitoring systems detected an anomaly in our billing systems, which resulted in erroneous credit card charges and in some cases account suspensions. 1.1% of all Twilio customers were impacted. The on-call team immediately began our incident response.
  • 4:10 AM PDT / 11:10 UTC: Our billing system was taken offline to avoid additional credit card charges.
  • 6:24 AM PDT / 13:24 UTC: Service restored to all suspended accounts. The billing system remained offline, while our engineers investigated the root cause.
  • 11:58 AM PDT / 18:58 UTC: Billing system was brought back online.
  • 12:36 PM PDT / 19:36 UTC: Monitoring detected a recurrence of the original billing anomalies, affecting another 0.3% of all Twilio customers. The billing system was taken offline again. Suspended accounts were immediately reactivated.
  • 2:57 PM PDT / 21:57 UTC: We began processing refunds for the erroneous credit card charges. This work continued for the following 24 hours.

July 19, 2013

  • 3:00 PM PDT / 22:00 UTC: We finished processing refunds for all erroneous credit card charges.
  • 3:30 PM PDT / 22:30 UTC: All impacted accounts were given a credit equaling 10% of their last 30 days of Twilio spend.
  • 7:10 PM PDT / July 20 02:30 UTC: The billing system was activated progressively for groups of customer accounts.
  • 8:14 PM PDT / July 20 03:14 UTC: The billing system was activated for all customers.
  • 9:15 PM PDT / July 20 04:15 UTC: We gave the all-clear message, and all systems were restored.

Root Cause

Twilio’s billing system uses an in-memory Redis cluster to store in-flight account balances. This cluster is configured with a single master and multiple slaves distributed across data-centers for resiliency in the event of a host or data-center failure.

At 1:35 AM PDT on July 18, a loss of network connectivity caused all billing redis-slaves to simultaneously disconnect from the master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.

By 2:39 AM PDT, the load on the host had become so extreme that services relying on redis-master began to fail. At 2:42 AM PDT, our monitoring system alerted our on-call engineering team to a failure in the Redis cluster. Observing extreme load on the host, the team misdiagnosed the redis process on redis-master as requiring a restart to recover, and restarted it. On restart, redis-master read an incorrect configuration file, which caused Redis to attempt to recover from a non-existent AOF file instead of the binary snapshot. As a result of that failed recovery, redis-master dropped all balance data. The incorrect configuration also caused redis-master to boot as a slave of itself, putting it in read-only mode and preventing the billing system from updating account balances.
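
Both configuration problems described above can be caught before a restart. The following is a minimal sketch in Python, not the tooling used during the incident. It assumes the standard redis.conf directives `appendonly`, `appendfilename`, `dir`, `port` and `slaveof`; the function names and the `own_addresses` parameter are hypothetical.

```python
import os


def parse_redis_conf(text):
    """Parse 'directive arg ...' lines into a dict, skipping blanks and comments."""
    conf = {}
    for line in text.splitlines():
        parts = line.strip().split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        conf[parts[0].lower()] = parts[1:]
    return conf


def preflight_problems(conf_text, own_addresses, file_exists=os.path.exists):
    """Return a list of problems that should block a restart (hypothetical check)."""
    conf = parse_redis_conf(conf_text)
    problems = []

    # With `appendonly yes`, Redis loads the AOF at startup instead of the
    # binary snapshot, so a missing AOF means starting with no data.
    if conf.get("appendonly", ["no"])[0].lower() == "yes":
        data_dir = conf.get("dir", ["."])[0]
        aof_name = conf.get("appendfilename", ["appendonly.aof"])[0].strip('"')
        aof_path = os.path.join(data_dir, aof_name)
        if not file_exists(aof_path):
            problems.append(f"appendonly is enabled but {aof_path} does not exist")

    # A master must not be configured as a slave of itself.
    if len(conf.get("slaveof", [])) >= 2:
        host, port = conf["slaveof"][0], conf["slaveof"][1]
        own_port = conf.get("port", ["6379"])[0]
        if host in own_addresses and port == own_port:
            problems.append(f"slaveof {host} {port} points at this instance")

    return problems
```

A restart script could refuse to proceed whenever `preflight_problems` returns a non-empty list.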

With all account balances at zero and read-only, Twilio usage that resulted in a billing transaction (e.g. 1 cent for an SMS message or a phone call) triggered the billing system to attempt a recharge using the credit card associated with the customer’s account. This only affected accounts with auto-recharge enabled.

Consequently, the billing system charged customer credit cards to increase account balances without being able to update the balances themselves. Each new billing transaction therefore triggered another recharge, which is how customer credit cards came to be charged repeatedly. At 3:28 AM PDT, the billing system monitoring reported the anomalous activity. At 4:10 AM PDT, on-call engineers responding to the incident shut down the billing system to prevent further charges.
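
The loop can be reproduced with a small Python sketch. It is an illustration only, not the billing system's code: the class, the function, the account ID and the 2000-cent recharge amount are all hypothetical.

```python
class ReadOnlyBalanceStore:
    """Stand-in for the degraded redis-master: every balance reads as zero
    and every write is rejected."""

    def get(self, account_id):
        return 0

    def add(self, account_id, amount_cents):
        raise RuntimeError("store is read-only")


def naive_bill(store, charge_card, account_id, cost_cents, recharge_cents):
    """Handle one usage event the unsafe way: charge first, then try to record."""
    if store.get(account_id) < cost_cents:
        charge_card(account_id, recharge_cents)      # the card is charged
        try:
            store.add(account_id, recharge_cents)    # the credit is never recorded
        except RuntimeError:
            pass


charges = []
store = ReadOnlyBalanceStore()
for _ in range(3):  # three 1-cent usage events
    naive_bill(store, lambda acct, cents: charges.append((acct, cents)),
               "AC123", 1, 2000)
```

Because the balance never rises above zero, each of the three usage events produces its own card charge, so `charges` ends up with three entries.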

The billing system maintains independent double-bookkeeping for all balance data in a separate relational datastore. Following the shutdown of the billing system, this independent record was used to restore the account balances lost in the failed recovery of the redis-master. Once these balances were properly restored, the billing system was turned on at 11:58 AM PDT. Observing further anomalous behavior, we shut the billing system back down and focused engineering work on restoring service and refunding customer accounts.
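
Restoring balances from an independent record can be sketched as summing ledger entries per account. The example below is a minimal Python illustration using an in-memory SQLite database; the `ledger` table, its columns and the sample rows are assumptions, since the actual schema is not described here.

```python
import sqlite3


def rebuild_balances(conn):
    """Recompute each account's balance as the sum of its ledger entries."""
    rows = conn.execute(
        "SELECT account_id, SUM(amount_cents) FROM ledger GROUP BY account_id"
    )
    return {account_id: total for account_id, total in rows}


# Demo: an in-memory database stands in for the relational datastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (account_id TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO ledger VALUES (?, ?)",
    [("AC1", 2000), ("AC1", -1), ("AC1", -1), ("AC2", 500)],
)
restored = rebuild_balances(conn)
```

Here `restored` maps each account to its net balance in cents, which could then be written back to the in-memory store.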

That work was completed the following day, and the billing service was restored gradually across the customer base, reaching full restoration at 9:15 PM PDT.

What We’re Doing About It

In the process of resolving the incident, we replaced the original Redis cluster that triggered the incident. The incorrect configuration for redis-master was identified and corrected. As a further preventative measure, Redis restarts on redis-master are disabled and future redis-master recoveries will be accomplished by pivoting to a slave.

The simultaneous loss of in-flight balance data and the ability to update balances also exposed a critical flaw in our auto-recharge system. It failed dangerously, exposing customer accounts to incorrect charges and suspensions. We are now introducing robust fail-safes, so that if billing balances don’t exist or cannot be written, the system will not suspend accounts or charge credit cards. Finally, we will be updating the billing system to validate against our double-bookkeeping databases in real time.
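
The fail-safe rule can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the production implementation: the store interface, the `BalanceUnavailable` exception and the zero-amount write probe are all hypothetical.

```python
class BalanceUnavailable(Exception):
    """The balance is missing or the store cannot currently be written."""


class InMemoryStore:
    """Hypothetical balance store used only to exercise the sketch."""

    def __init__(self, balances, writable=True):
        self.balances = dict(balances)
        self.writable = writable

    def get(self, account_id):
        if account_id not in self.balances:
            raise BalanceUnavailable(f"no balance for {account_id}")
        return self.balances[account_id]

    def add(self, account_id, amount_cents):
        if not self.writable:
            raise BalanceUnavailable("store is read-only")
        self.balances[account_id] += amount_cents


def safe_auto_recharge(store, charge_card, account_id, threshold_cents, recharge_cents):
    """Return 'ok', 'recharged' or 'skipped'; never charge when state is uncertain."""
    try:
        balance = store.get(account_id)
        store.add(account_id, 0)   # write probe: confirm the store accepts writes
    except BalanceUnavailable:
        return "skipped"           # fail safe: no charge and no suspension
    if balance >= threshold_cents:
        return "ok"
    charge_card(account_id, recharge_cents)
    # A real system would also need to handle a write failure at this point,
    # for example by reconciling against the independent ledger.
    store.add(account_id, recharge_cents)
    return "recharged"


charges = []
record = lambda acct, cents: charges.append((acct, cents))
broken = InMemoryStore({"AC1": 0}, writable=False)
healthy = InMemoryStore({"AC1": 0})
broken_result = safe_auto_recharge(broken, record, "AC1", 100, 2000)
healthy_result = safe_auto_recharge(healthy, record, "AC1", 100, 2000)
```

With the read-only store the call is skipped and no card is charged. With the healthy store a single charge is made and the balance is updated.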

All of us here at Twilio apologize for the impact this had on you, your business and your customers. We hope these steps, and all the steps that follow, will earn back your trust.