It's a rare website or company today that doesn't have to make HTTP requests to another party. Phone apps need to communicate with a server. Servers have to communicate with 3rd party tools like Facebook, Twitter, Sendgrid, (Twilio!) and others. Most companies that implement a service-oriented architecture use HTTP to communicate between servers in the cluster.
In happy times, your HTTP requests return 200, they do so relatively quickly, and your health dashboards are green. But who are we kidding - this is the Internet, where things fail all the time. In our ongoing quest for five nines of reliability, let's learn about some ways HTTP requests can fail and how you can write a production-ready, robust HTTP client to work around these.
Sometimes your server will be unable to connect to a remote machine. This happens for a number of reasons - the server may be down and unresponsive, a DNS lookup might fail, or you may fall victim to the popular and undiagnosable "network blip". In all of these cases, code that looks like this:
response = request('GET', 'exampleurl.com')
will not give you a response object in the way that you expect, because the HTTP request never completes! Instead, your library will raise a ConnectionError, or the response will be some kind of error object instead of an HTTP response.
There are two actions you should take to handle this error. First, you need to set a timeout on the connection request. The only thing worse than a failed connection is a failed connection that hangs for 30 seconds before returning an error. In general, a connect() attempt will either succeed very quickly or never succeed. At Twilio, we usually allow 3100 milliseconds for a connect timeout - 3 seconds for the default TCP retransmission window, plus a small amount of time for the request to succeed. Most of our servers are in the same Amazon region, so 100 milliseconds is more than plenty - if you are making a request across an ocean you may want to bump this up a little.
Second, requests that fail with a connection error are always safe to retry, because they imply the request never made it to the server. Consider wrapping your HTTP request in a try/catch block that detects connection errors and retries them several times, potentially with a sleep() call beforehand.
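Here is a minimal sketch of both steps, using only the Python standard library. The function name, the retry count, and the half-second delay are assumptions for illustration, not values we use at Twilio. It only covers the connect step, so any error it catches happened before a single byte of the request was sent.

```python
import socket
import time

CONNECT_TIMEOUT = 3.1  # seconds: 3s retransmission window plus 100ms


def connect_with_retry(host, port, attempts=3, delay=0.5,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying if the connect step itself fails.

    `connect` and `sleep` are injectable so the retry loop can be
    exercised without a real network.
    """
    last_error = None
    for attempt in range(attempts):
        if attempt:
            sleep(delay)  # pause before every retry, not before the first try
        try:
            return connect((host, port), timeout=CONNECT_TIMEOUT)
        except OSError as exc:
            # Refused connections, DNS failures, and connect timeouts are
            # all OSError subclasses. Nothing has been sent yet, so it is
            # safe to try again.
            last_error = exc
    raise last_error
```

If every attempt fails, the last error is re-raised so the caller can run whatever fallback logic makes sense.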
Ok, so you've established a connection to the server and sent over your HTTP request. Sometimes the server is naughty and does not send back a valid HTTP response. The most common of these errors is a closed connection - it may manifest itself in error messages as "the server unexpectedly closed the connection" or as "EOFError" or "Bad status line". Be careful, because most clients lump closed connection errors together with connection errors by default. Unlike connection errors, these imply the request made it to the server, so they may not be safe to retry. Catch this error separately and treat it like a 500-level error.
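Python's standard library is a good illustration of the lumping problem: http.client.RemoteDisconnected inherits from ConnectionResetError, so a bare `except ConnectionError` catches it along with refused connections. The sketch below separates the two cases. The function name and the category labels are made up for this example, and the exception list is a starting point rather than an exhaustive one.

```python
import http.client
import socket


def classify_failure(exc):
    """Sort a failed request into 'server_error', 'connect_error' or 'unknown'."""
    # Check the closed-connection family first. RemoteDisconnected is also
    # a ConnectionError, so testing for ConnectionError first would
    # mislabel it as a harmless connect failure.
    closed_connection = (
        http.client.BadStatusLine,   # includes RemoteDisconnected
        http.client.IncompleteRead,  # body cut off part way through
        ConnectionResetError,
        EOFError,
    )
    if isinstance(exc, closed_connection):
        return "server_error"        # treat like a 500-level response
    if isinstance(exc, (ConnectionRefusedError, socket.gaierror)):
        return "connect_error"       # request never reached the server
    return "unknown"
```

Anything that comes back as "server_error" should go through the same "is this safe to retry?" check as a real 500.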
The most common cause of a closed connection is a problem in your app's timeout chain; be sure to check what happens if your app decides to sleep forever. PHP in particular sometimes ignores max_execution_time, so you may want to set this limit in nginx or Apache.
3rd party server taking too long to respond
A common scenario: users sign up for your application and you make an API call to Sendgrid or Mailgun to send them an email that says, "Thanks for signing up! Here are all the cool things you can do" etc. And your signup code may look like this:
@app.route('/signup', methods=['POST'])
def signup():
    # This makes a HTTP request to a third party email provider
    send_welcome_email(request.form['email_address'])
    return "Thanks for signing up!"
Now imagine Sendgrid is having a bad day and taking 25 seconds to respond to every request. Your application thread will wait patiently for Sendgrid to return a response, and all of your users will see a beach ball in their browser for 25 seconds while trying to sign up for your fancy service. I guarantee you that this is not good for your conversion rate.
One solution here is to offload tasks and process them asynchronously where possible, but that's outside the scope of this post. The other solution is to assign a timeout to the external request. If the 3rd party request takes too long, bail on it and execute some fallback logic. Every external request in your system should have a timeout value attached to it.
Note: this timeout is different from the connect timeout - it only kicks in once you've connected to the server and are waiting for a response. Your HTTP client may refer to this as the "read" or the "request" timeout. If your client has just a single value for "timeout", it most likely applies the same timeout to the connect and the read. This is bad!! If your client does this, urge the maintainer to offer separate timeouts, write code to split the timeout values, or write your own, better client.
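As one example of splitting the values yourself, here is a rough sketch on top of Python's http.client, which only takes a single timeout. The function name and the 5 second read timeout are assumptions; pick a read timeout that fits the endpoint you are calling. Note that a socket timeout bounds each individual read, not the total time for the whole response.

```python
import http.client

CONNECT_TIMEOUT = 3.1  # seconds allowed for the TCP connect
READ_TIMEOUT = 5.0     # seconds allowed for each read; an assumed value


def get_with_timeouts(host, path="/", port=80,
                      connect_timeout=CONNECT_TIMEOUT,
                      read_timeout=READ_TIMEOUT):
    """GET a URL with separate connect and read timeouts."""
    conn = http.client.HTTPConnection(host, port, timeout=connect_timeout)
    try:
        conn.connect()                      # bounded by connect_timeout
        conn.sock.settimeout(read_timeout)  # applies to everything after
        conn.request("GET", path)
        response = conn.getresponse()       # raises socket.timeout if slow
        return response.status, response.read()
    finally:
        conn.close()
```

The caller catches socket.timeout and runs its fallback logic, for example queueing the welcome email for later instead of making the user wait.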
HTTP level errors
Finally we get to the most common type of error - a non-200 level HTTP response. This generally means you tried to do something and the server barfed, or you sent a bad request to the server. The most important question to ask now is whether the request is safe to retry. I must introduce the concept of idempotence to answer this.
A request is idempotent if the system ends in the same state whether you make the request one time or several times. A request is not idempotent if making the request multiple times means that the system can change each time. Idempotence is valuable because it means requests are always safe to retry. Consider the following examples of idempotent actions:
Hanging up a call. If the hangup request fails, just try it again.
Downloading your email. No matter how many times you click "Download", it will always just show the latest email that you have.
Washing dirty dishes. They will be in a "clean" state whether you wash them one time or ten separate times.
And some examples of things that are not idempotent:
Sending a text message. Typing a message and clicking "Send" multiple times means the message will be sent multiple times.
Clicking "Purchase" in most online shopping carts. If you've ever seen the frantic "Don't click the purchase button twice!" message in an online shopping cart, you have seen an example of a non-idempotent transaction.
Great, but what does this have to do with my HTTP Client?
Glad you asked! Idempotence is really nice for production systems because it means that when an idempotent request fails, you can just try it again. That's so important I will repeat it: when an idempotent request fails, you can just try it again, and not worry about charging someone's card twice, or sending the same text message multiple times, or similar. Due to ACID guarantees, this idempotency constraint is best enforced at the database layer.
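One way to get that guarantee from the database is a unique key per logical request, so a retried request collides with the original instead of creating a second row. Below is a small sketch using SQLite. The table, the column names, and the idea of a client-supplied request ID are all hypothetical, chosen just to show the shape of the technique.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a uniqueness constraint."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE charges ("
        " request_id TEXT PRIMARY KEY,"  # one row per logical request
        " amount_cents INTEGER NOT NULL)"
    )
    return db


def record_charge(db, request_id, amount_cents):
    """Return True if this call created the charge, False if it is a repeat."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO charges (request_id, amount_cents) VALUES (?, ?)",
                (request_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists, so this is a retry of a request
        # we have already processed. Do not charge again.
        return False
```

Calling record_charge twice with the same request ID leaves exactly one row in the table, no matter how many times the client retries.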
Some HTTP methods are designed to be idempotent. In particular, in a system that follows the HTTP specification, a request using the GET, PUT, DELETE, HEAD, or OPTIONS method implies that the request is safe to retry. POST requests are sometimes safe to retry, but usually not.
So when you make a request and it fails (with closed connection, or a read timeout error, or a 5xx server error), you should retry it if the request is idempotent. If requests to other methods like POST fail, it may not be safe to retry.
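Put together, the retry decision might look something like the sketch below. The failure labels are names I made up for this example; map your own client's exceptions and status codes onto them however fits.

```python
# Methods that the HTTP specification defines as idempotent.
IDEMPOTENT_METHODS = frozenset({"GET", "PUT", "DELETE", "HEAD", "OPTIONS"})


def should_retry(method, failure):
    """Decide whether a failed request can be sent again.

    `failure` is one of the hypothetical labels 'connect_error',
    'closed_connection', 'read_timeout' or 'server_error'.
    """
    if failure == "connect_error":
        # The request never reached the server, so any method is safe.
        return True
    if failure in ("closed_connection", "read_timeout", "server_error"):
        # The server may have acted on the request already.
        return method.upper() in IDEMPOTENT_METHODS
    return False  # unknown failure: do not guess
```

For a POST to an endpoint you know to be idempotent, you would override this default on a per-request basis.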
Some servers limit the number of requests you can make to their API in a period of time, so sometimes you will make a request and the server will tell you to "enhance your calm." The HTTP status code 429 Too Many Requests has also been allocated for this use case. In general, if you receive this status code, the request is also safe to retry, no matter which HTTP method you used, because the server hasn't done any processing.
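Retrying a 429 immediately tends to earn you another 429, so it helps to wait first. The sketch below is one possible approach and goes a little beyond what is described above: it honors the standard Retry-After header when the server sends it as a number of seconds, and otherwise falls back to exponential backoff. The function name, the base delay, and the cap are assumptions, and `headers` is assumed to be a plain dict keyed exactly as shown.

```python
def delay_before_retry(status, headers, attempt, base=0.5, cap=30.0):
    """Seconds to wait before retrying a 429, or None for other statuses.

    `attempt` counts retries already made, starting from 0.
    """
    if status != 429:
        return None
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit():
        # The server told us how long to wait, in whole seconds.
        return min(float(retry_after), cap)
    # No usable hint (missing, or sent as an HTTP date): back off
    # exponentially, doubling the wait on each attempt up to the cap.
    return min(base * (2 ** attempt), cap)
```

Pass the result to sleep() before sending the request again.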
Testing your client
I've been working on a small library that simulates various types of network failures. By connecting to different HTTP ports on the server, you can simulate different types of failures and ensure your client handles them properly. Check it out here.
HTTP requests can fail in several different ways - the server can be unreachable, the server can misbehave, the server can be slow, or it can fail to process your request. The "out of the box" behavior in most HTTP library implementations will not be suitable for a production-ready client - just as you would tune a database server, you need to tune your HTTP client for every request that you make.
We're also trying to hire smart people who want to learn how to write fault-tolerant apps. Apply via our jobs page, or contact me directly if you have questions about the role.