“This Is Fine”… Or Not: Get Server Outage Alerts via SMS with AWS Lambda

March 07, 2017

Nick Malcolm used to panic scroll through emails every morning. He’d scroll hoping not to find that email — the alert email telling him about a server that tipped over.

The fact that he had to scroll down through a litany of emails to find the server outage alert meant it was already too late to escape scot-free. But there was still time to minimize damage. Nick jumped on any server outages, which were (luckily) few and far between.

 

As if a black-and-white infomercial scene had come to life, Nick thought, “There has to be a better way!” Nick’s “better way” did not include buying an alert service that was too large (and pricey) for his lean team at ThisData. Instead, he built himself an outage alert service using AWS Lambda and Twilio.

“I was confident I could do it quickly because I’ve used Twilio before and I knew it would be straightforward,” says Nick.
Nick’s tutorial below is straightforward. And better yet, it’s serverless. So you don’t have to worry about your server outage alert system going down because of another party’s server outage. We call that “serverception”.

Read Nick’s blog post originally published on the ThisData blog
 
Getting a phone call in the middle of the night when your servers are on fire is a necessary evil for many developers and network administrators. If your site is being used around the world, then it needs to be available 24/7. I thought it’d be fun to see how easy it’d be to get a simple incident alarm going using Twilio and AWS CloudWatch, SNS, and Lambda. Hint: it’s very easy. In this post I’ll walk you through how to achieve this yourself. Best of all, it’s serverless, so there is nothing to maintain. You don’t have to worry about your incident response server going down!

Of course, paid incident management tools exist, like PagerDuty, OpsGenie, and VictorOps. Cabot and OpenDuty are open source alternatives you can host yourself. They’ll handle escalating incidents through your team, notification via multiple channels, and more. But that’s no fun!

Before we jump in, here’s what you’ll need: an AWS account and a Twilio account.

What you’ll get: A voice call and a text message when your service is down / degraded.

The TL;DR: create an SNS topic, create a Lambda using the gist below which is triggered by that topic, and notify the SNS topic with CloudWatch.

Set Up A New SNS Topic

Simple Notification Service, or SNS, is Amazon’s push messaging service. It is here that we create a “Topic” which describes how to notify us.

  1. Open up the AWS console and head to SNS.
  2. Click “Topics”, then the “Create new Topic” button.
  3. Name it “Incident_Response” and give it a description like “Notifies the CTO via Lambda and text message”.
  4. Click on your newly created Topic.
  5. Click the “Create Subscription” button.
  6. Click the “Protocol” dropdown, and choose “SMS”
  7. Type in your phone number, including area code.

Easy! You have an SNS topic which, when triggered, will send you a text message. The text message will always contain the name of the CloudWatch alarm, which provides some context to tell you what’s on fire.
Fun note: your message might even be powered by Twilio, as AWS use them as one of their delivery partners!
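
If you’d rather script these console steps, the same topic and subscription can be created through the SNS API. Here’s a minimal sketch in Python, assuming the boto3 SDK. The helper name `create_incident_topic` and the phone number are placeholders of mine, not part of the original setup.

```python
def create_incident_topic(sns, phone_number, name="Incident_Response"):
    """Create the topic and subscribe a phone number to it via SMS.

    `sns` is expected to be a boto3 SNS client, e.g. boto3.client("sns").
    """
    # create_topic is idempotent: it returns the existing ARN if you already
    # own a topic with this name.
    topic_arn = sns.create_topic(Name=name)["TopicArn"]
    # Phone numbers should be in E.164 format, e.g. "+15555550100".
    sns.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint=phone_number)
    return topic_arn


# With boto3 installed and AWS credentials configured, usage could look like:
#   import boto3
#   topic_arn = create_incident_topic(boto3.client("sns"), "+15555550100")
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to try out with a stub before pointing it at a real account.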

We could leave it at SMS messages, but they aren’t enough to wake me up at night. I need a phone call buzzing.

Create The Voice Call Lambda

Let’s head over to AWS Lambda so we can trigger voice calls.

  1. Open Lambda in the AWS console.
  2. Click the “Create a Lambda function” button.
  3. Choose the “Blank template” blueprint.
  4. Configure an SNS trigger by clicking the grey box outline, and selecting “SNS” from the bottom of the dropdown menu.
  5. Select your Incident_Response topic.
  6. Tick the “Enable trigger” checkbox, which will configure all the necessary Lambda permissions and create the “subscription” in your SNS topic.
  7. Click Next.
  8. Call your function notifyCTOWithVoiceCall.

Time for some code! Copy and paste this into your Lambda:
 

 
You’ll need to update the toNumber variable in the code above, and add three environment variables which contain your Twilio credentials. You can also encrypt these variables using the encryption helpers.
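
For reference, here is a minimal sketch of what such a function could look like. It is written in Python with only the standard library rather than taken from the original gist, and the environment variable names (`TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, `TWILIO_FROM_NUMBER`), the `TO_NUMBER` placeholder, and the spoken message are all my assumptions. It asks Twilio’s REST API to place a call and passes inline TwiML telling Twilio what to say.

```python
import base64
import os
import urllib.parse
import urllib.request

TO_NUMBER = "+15555550100"  # placeholder: the number to call, in E.164 format
TWIML = "<Response><Say>Wake up! A CloudWatch alarm has fired.</Say></Response>"


def build_call_request(account_sid, auth_token, from_number, to_number, twiml):
    """Build the HTTP request that asks Twilio to place a voice call."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"
    body = urllib.parse.urlencode(
        {"To": to_number, "From": from_number, "Twiml": twiml}
    ).encode("utf-8")
    # Twilio uses HTTP basic auth: account SID as username, auth token as password.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Authorization": f"Basic {credentials}"},
        method="POST",
    )


def handler(event, context):
    """Lambda entry point, invoked by the SNS trigger."""
    request = build_call_request(
        os.environ["TWILIO_ACCOUNT_SID"],   # assumed variable name
        os.environ["TWILIO_AUTH_TOKEN"],    # assumed variable name
        os.environ["TWILIO_FROM_NUMBER"],   # assumed variable name
        TO_NUMBER,
        TWIML,
    )
    with urllib.request.urlopen(request) as response:
        # Print Twilio's response so it shows up in the Lambda log output.
        print(response.read().decode("utf-8"))
```

Splitting the request building from the sending means you can inspect exactly what would be sent to Twilio without placing a real call.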

At the top of the page, hit the “Save and Test” button. You should get a voice call which talks to you, then plays a song! If not, take a look at the “Log output” area. It should have completed successfully and show Twilio’s response, or any error messages. If that looks OK, log in to Twilio and check the debugger there.

At this point you have an SNS topic which will send you an SMS message and trigger a Lambda function which calls you. Now it’s time to put it to use!

Configure your CloudWatch alarms

The easiest way I found to do an end-to-end test is to create an alarm you know will fail. Head over to AWS CloudWatch.

  • Click “Alarms” in the sidebar, then the blue “Create Alarm” button.
  • Search for a “CPU” metric, and choose one of your instances’ “CPUUtilization” metrics.
  • Click “Next”.
  • Call it “TestAlarm”.
  • Configure a Threshold “whenever CPUUtilization is <= 100 for 1 consecutive period(s)”.
  • In the alarm’s actions, choose your Incident_Response topic as the notification target.
  • Click “Create Alarm”.

 


 

Since your CPU usage will be below 100% (hopefully!), as soon as you click “Create Alarm” you’ll get a phone call and a text. The phone call will wake you up, and the text message will contain a little bit of context before you get to your emails. Too easy!
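
The same test alarm can also be described in code. The sketch below builds the arguments for CloudWatch’s `PutMetricAlarm` call, assuming the boto3 SDK. The helper name `build_test_alarm_params`, the instance ID, the `Average` statistic, and the 300-second period are illustrative choices of mine rather than values from the steps above.

```python
def build_test_alarm_params(instance_id, topic_arn):
    """Keyword arguments for PutMetricAlarm, mirroring the console steps."""
    return {
        "AlarmName": "TestAlarm",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",   # assumed statistic
        "Period": 300,            # seconds; assumed period
        "EvaluationPeriods": 1,   # "for 1 consecutive period(s)"
        "Threshold": 100.0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",  # "<= 100"
        "AlarmActions": [topic_arn],  # notify the Incident_Response topic
    }


# With boto3 installed and AWS credentials configured, this could be applied as:
#   import boto3
#   params = build_test_alarm_params("i-0123456789abcdef0", topic_arn)
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Remember to delete the test alarm afterwards, or it will keep firing.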

Congratulations! You’ve successfully set up a simple incident response tool which is effectively free, and you don’t have any extra servers to maintain. Now you can create new alarms, or update existing ones, for those critical times when you need to be woken up.

Where To From Here

To take this further, an easy win should be getting more context from CloudWatch into the voice call. It’d involve looking up how CloudWatch passes those attributes into the Lambda, and then using that in our dynamically generated TwiML.
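
As a starting point: when an alarm notifies SNS, the Lambda receives the alarm details as a JSON string in the `Message` field of the first SNS record. Here’s a rough Python sketch that turns that into TwiML. The function name `twiml_for_alarm` and the wording of the sentence are my own.

```python
import json
from xml.sax.saxutils import escape


def twiml_for_alarm(event):
    """Turn an SNS-wrapped CloudWatch alarm notification into spoken TwiML."""
    # SNS delivers the alarm as a JSON string inside the first record.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    sentence = (
        f"Alarm {alarm['AlarmName']} is now {alarm['NewStateValue']}. "
        f"{alarm['NewStateReason']}"
    )
    # Escape &, < and > so the reason text cannot break the XML.
    return f"<Response><Say>{escape(sentence)}</Say></Response>"
```

The escaping matters because alarm reasons often contain comparison operators like `<=`.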

You could look at storing “on call” information in DynamoDB, and get Lambda to look up who it should call based on the day. You could also use the Twilio API to make sure the call is answered / acknowledged, and escalate to another developer when the first line of defense doesn’t respond.
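
The lookup itself can be tiny. In this sketch a plain dictionary stands in for the DynamoDB table, and the roster, phone numbers, and the name `number_to_call` are all hypothetical.

```python
from datetime import date

# Hypothetical roster keyed by weekday (Monday is 0), standing in for a
# DynamoDB table.
ON_CALL = {
    0: "+15555550101",
    1: "+15555550102",
    2: "+15555550103",
}
DEFAULT_NUMBER = "+15555550100"  # fallback when nobody is rostered


def number_to_call(today, roster=ON_CALL, default=DEFAULT_NUMBER):
    """Return the phone number of whoever is on call for the given date."""
    return roster.get(today.weekday(), default)
```

In the Lambda you would call `number_to_call(date.today())` and use the result in place of a hard-coded number.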

You could create an API Gateway endpoint which other services (like external uptime monitoring) can POST to and trigger your Lambda.
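
With API Gateway’s Lambda proxy integration, the POST body arrives as a string in `event["body"]`, and the function returns a dictionary with a `statusCode` and a string `body`. Here’s a hedged sketch of such a handler. The `reason` field, the 202 response, and the `place_call` parameter are assumptions of mine, with `place_call` standing in for whatever function actually triggers the voice call.

```python
import json


def api_handler(event, context, place_call=print):
    """Handle an API Gateway proxy request by triggering the incident call."""
    # The body may be missing or empty, so fall back to an empty JSON object.
    payload = json.loads(event.get("body") or "{}")
    reason = payload.get("reason", "External monitor reported a problem")
    place_call(reason)
    return {
        "statusCode": 202,
        "body": json.dumps({"status": "calling", "reason": reason}),
    }
```

You’d want some form of authentication on the endpoint, so strangers can’t wake you up at will.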

Lots of room for improvement, but I hope this has given you a taste of how easy it is to use Lambdas. Let me know how you get on!