Cloud Orchestration, Lazarus and Interning at Twilio

July 11, 2018

There are a few technical interview strategies everyone knows: communicate your thoughts, give yourself time to think, and keep calm. Absent among these, of course, is coming up entirely blank for a question. So when in the interview for an engineering internship my manager asked, “What experience do you have with distributed systems?” I took a deep breath and, cringing a little, replied, “None, yet. But I’m smart and I learn fast, and I’m interested to learn by working in one.” This was true. I’d been curious about the subject for a while, and figuring it wouldn’t hurt to try, had worked up the courage to apply to a role for which I didn’t hit every “ideal qualification” checkbox. (Take that, tech gender gap!) And to my surprise and relief, I moved forward – all the way to San Francisco, to a spot on Twilio’s Cloud Orchestration team.

Twilio’s Lazarus System

The system I’m working with is called Lazarus, and it handles Twilio’s automated host remediation. It’s complex, it’s massive, and it’s my first time venturing into site reliability, so for the next few weeks I’ll be in what I call the Duckling Stage. In much the same way baby ducks imprint and follow around the first creature they see who looks like they’ve got a better idea of how the world works, I’m learning all I can from shadowing my mentor. Luckily for me, it’s been a great experience so far. My mentor Mitch is one of those wonderful teachers who takes the time to get into the nuts and bolts of how a concept fits into the bigger picture, so even though there’s a lot to learn, I’ve been able to make the most of my first small assignments. Every cross-team meeting is an enthusiastic teachable moment, and each time I understand the avalanche of new information a little more clearly. Toward the end of my first week, I even pushed out my first couple of pull requests (which made for a very happy manager and a slightly daunting set of expectations to continue meeting).


Not sure who snapped this photo of me at my first team meeting, but I like it a lot.

 

Since then, my projects have gotten a lot more interesting – and a lot more challenging. Lazarus’s fundamental purpose is to replace failing hosts and scale the number of active nodes as demand rises and falls throughout the day. It works just fine most of the time, but occasionally it breaks in fun and exciting ways – like a few weeks ago, when it scaled up by forty hosts instead of down. Fortunately, having too many hosts won’t cause any immediate disasters, but scaling down at a busy time would be another matter entirely, so we decided it would be prudent to have a full-system kill switch. This way, if Lazarus has a serious breakage, it can be put on pause while the underlying problem is found and patched.

Lazarus’s Breakage Remediation

The remediation takes place in two phases: the rule processing engine and the workflow engine. When an event comes in, Lazarus must first check a set of rules to verify that the event is eligible for automated remediation. If any of these conditions fail, the incident is immediately escalated to the engineer on call. If the event passes through the rule engine successfully, however, the workflow engine carries out scaling or host replacement operations.
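
To make the two phases concrete, here is a minimal sketch of the flow described above. It is not Lazarus's actual code: the class name, the `Event` fields, the two example rules, and the category strings are all hypothetical, and the workflow step is reduced to a return value.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a two-phase remediation flow; none of these names come from Lazarus.
final class RemediationSketch {
    enum Outcome { REMEDIATED, ESCALATED }

    // Minimal stand-in for an incoming event.
    static final class Event {
        final String host;
        final String category;

        Event(String host, String category) {
            this.host = host;
            this.category = category;
        }
    }

    // Two made-up eligibility rules, for illustration only.
    static final Predicate<Event> HAS_HOST =
        e -> e.host != null && !e.host.isEmpty();
    static final Predicate<Event> KNOWN_CATEGORY =
        e -> "HOST_REPLACEMENT".equals(e.category) || "AUTO_SCALING".equals(e.category);

    // Phase 1: the event is eligible only if every rule passes.
    static boolean passesRules(Event event, List<Predicate<Event>> rules) {
        return rules.stream().allMatch(rule -> rule.test(event));
    }

    // Phase 2 runs only when phase 1 succeeds; otherwise the event is escalated.
    static Outcome handle(Event event, List<Predicate<Event>> rules) {
        if (!passesRules(event, rules)) {
            return Outcome.ESCALATED; // hand off to the engineer on call
        }
        // A real workflow engine would replace the host or scale here.
        return Outcome.REMEDIATED;
    }

    public static void main(String[] args) {
        List<Predicate<Event>> rules = List.of(HAS_HOST, KNOWN_CATEGORY);
        System.out.println(handle(new Event("host-1", "HOST_REPLACEMENT"), rules));
        System.out.println(handle(new Event("host-2", "UNKNOWN"), rules));
    }
}
```

The point of the shape is that a single failed rule short-circuits straight to escalation, so the workflow phase never sees an event it should not act on.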

Ideally, checks for kill switch activation would be carried out in the rule processing engine, never proceeding to workflow execution if the system were to be disabled. Every second of an incident is a potential failure for an end customer, so if Lazarus isn’t equipped to handle it, it’s best to escalate to engineers as quickly as possible. Unfortunately, physical constraints make this an impractical solution. Lazarus runs across many AWS regions, only one of which contains the databases with the information concerning the status of the kill switch, as well as the downstream dependencies that actually boot up new hosts. In order to move an event forward from the rule engine to the workflow engine, it would be necessary to send HTTP requests across regions to check the kill switch status and relay that data. In the case that automated remediation is enabled, yet another cross-region trip would be necessary to carry out remediation actions, which would entail high latency and risk of packet loss.

The check against kill switch activation status is instead carried out in the workflow engine immediately before executing the workflow, reducing the number of cross-region requests down to just one. I added a small table to the Lazarus database to log the activation status of host replacement and auto scaling operations. In the service activation check, the appProperties object wraps the results of the query to this table after Lazarus checks the details of the event it’s been asked to handle.

private boolean checkServiceActivation(ActionProcessorContext actionProcessorContext) {
    // Refuse to act when running outside of the control realm.
    if (!controlRealm) {
        String message = "Attempted to execute action outside of control realm.";
        return notifyKillSwitchFailure(message, actionProcessorContext);
    }

    Event event = actionProcessorContext.getEvent();
    if (event != null) {
        String eventCategory = event.getEventCategory();
        // Query the kill switch table and check the status for this event category.
        AppProperties appProperties = propertiesDao.getAppProperties();
        if (appProperties.isWorkflowDisabled(eventCategory)) {
            String message = String.format("%s currently suspended. Escalating the incident.", eventCategory);
            return notifyKillSwitchFailure(message, actionProcessorContext);
        }
    }
    return true;
}
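
The `AppProperties` wrapper itself isn't shown in this post. As a rough illustration of what such a wrapper could look like, here is a minimal sketch that holds one enabled/disabled flag per event category. The class name, the map-based storage, and the category strings are assumptions made for the example, as is the choice to treat a category with no row as enabled.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the AppProperties wrapper; the real class is not shown in this post.
final class AppPropertiesSketch {
    // Category name -> true when that workflow is disabled (one entry per table row).
    private final Map<String, Boolean> disabledByCategory;

    AppPropertiesSketch(Map<String, Boolean> rows) {
        // Defensive copy so later changes to the caller's map don't leak in.
        this.disabledByCategory = Collections.unmodifiableMap(new HashMap<>(rows));
    }

    // A category with no row (or a null flag) is treated as enabled: an assumption of this sketch.
    boolean isWorkflowDisabled(String eventCategory) {
        return Boolean.TRUE.equals(disabledByCategory.get(eventCategory));
    }

    public static void main(String[] args) {
        Map<String, Boolean> rows = new HashMap<>();
        rows.put("HOST_REPLACEMENT", true);
        rows.put("AUTO_SCALING", false);

        AppPropertiesSketch props = new AppPropertiesSketch(rows);
        System.out.println(props.isWorkflowDisabled("HOST_REPLACEMENT"));
        System.out.println(props.isWorkflowDisabled("AUTO_SCALING"));
    }
}
```

Keeping a separate flag per category means host replacement and auto scaling can be suspended independently, which matches the table described above.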

The event category specifies whether an event requires a host replacement or a scaling action; when the kill switch is activated and these actions are disabled, the only thing Lazarus has the power to do is hand the event over to the engineers on call. A message is sent back to the rule processing engine that the event has been rejected, the event is escalated to the engineer on call, and the remainder of the auto remediation process is aborted.

private boolean notifyKillSwitchFailure(String actionSuspensionMessage, ActionProcessorContext actionProcessorContext) {
    // Record that no action was taken; the final argument is the actionTriggered flag.
    updateEventStatus(actionProcessorContext.getParentUUID(), actionProcessorContext.getActionName(), EventStatus.Reason.NO_ACTION, actionSuspensionMessage, false);
    // Escalate the event to the engineer on call.
    notifyUser(actionProcessorContext.getInput(), NotificationType.ESCALATION, actionSuspensionMessage);
    // Tell the sender that the action was canceled.
    notifyFailureToSender(actionProcessorContext, ExecutionStatus.CANCELED, ActionResultConstants.CANCEL_MESSAGE_KEY, actionSuspensionMessage);
    // Returning false aborts the rest of the workflow.
    return false;
}

Needless to say, building a feature with the power to shut off Twilio’s entire remediation system was an intimidating... daunting... terrifying... exciting challenge. I was never the slightest bit nervous about deploying it – or at least, I never showed it.

Just kidding. As with any internship, I’m often tap dancing on the edge of my comfort zone – take intern introductions, for example. Expectation: Shake a couple hands at standup, maybe get introduced to a few people by my mentor. Reality: The entire intern class storms the stage at the company all hands meeting as the CEO runs to high five everyone. Final Countdown blasts in the background. Reviewing and merging code is an adventure too; since my team builds tools for use by our internal developers, we have to treat Twilio’s development environment as production. This sometimes makes for more exciting deployments than I’d care for.

It’s a little uncomfortable, of course, but that’s how growth happens, and because I’ve been included as a full member of the team and encouraged to question and contribute as much as I can, I’m growing quickly. From my perspective, that isn’t a coincidence: when a workplace fosters a sense of belonging and support, its employees are empowered to take the risks necessary to continue developing their skills. I’m looking forward to doing so for the rest of the summer. After all, there’s a lot to learn… who’s got the time to wait till you’re an expert to get started?