When you build a product for technical users, documentation is extremely important. If people can't figure out how to use your product you won't have any users no matter how amazing your product actually is.
Poor documentation will increase the number of confused
developers and subsequent support requests that you receive. Just
like quick-and-dirty coding decisions can lead to technical
debt, poor documentation can lead to operational debt.
The repercussions of sloppy documentation are twofold: some fraction of
developers who would otherwise have used your product won't use it, and those
that do will require more attention from your support team.
Users are busy
Imagine you need to complete a math assignment using LaTeX and
someone suggests you read TeX: The
Program. Though it is a great book,
your goal is to produce a PDF with your homework properly formatted, not
to read a novel. The vast majority of people who visit your documentation
are there not to learn more about your fascinating product. They are
there to find out how to get something done, like write a 2-Factor auth
library, or figure out how to receive calls
from a fake girlfriend.
You can't expect that users will actually read any of your
documentation. Many users
will try to copy and paste the exact contents of a code box into a text
file or command prompt and expect it to run, without reading any of the
surrounding comments. You cannot assume your users have any context
besides what you put in a code sample and maybe one or two sentences above
or below. Let's walk through an example, using an old snippet from the
twilio-python API library.
# How to make a call
client = TwilioRestClient()
calls = client.calls.list(statsus=Call.IN_PROGRESS)
call = client.calls.create(to="9991231234", from_="9991231234", url="http://foo.com/call.xml")
This will cause a number of problems for the first-time Twilio developer.
Their script will choke unless they've remembered to add a from twilio.rest
import TwilioRestClient line at the top of the file.
Initializing the TwilioRestClient with no parameters means that the
TwilioRestClient constructor will look in the user's environment for variables
named TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN. The documentation for these
parameters is in a different location. Better to be explicit and pass the
credentials directly to the constructor.
The argument to the list method, statsus, is spelled incorrectly, so the
parameter will not be passed correctly.
Unless the user knows about TwiML and
changes the url parameter, the first thing the user will hear when the call
connects is "An application error has occurred." There's no context in that
error to learn about TwiML or understand the error. Instead of linking to a URL
that will 404, use one that actually demonstrates your product in action, such as
http://demo.twilio.com/welcome/.
Our example is now more robust and has much needed context:
from twilio.rest import TwilioRestClient

# How to make a call
ACCOUNT_SID = "ACXXXXXXXXXXX"
AUTH_TOKEN = "dddddddddddd"
client = TwilioRestClient(ACCOUNT_SID, AUTH_TOKEN)
calls = client.calls.list(status=Call.IN_PROGRESS)
call = client.calls.create(to="9991231234", from_="9991231234", url="http://demo.twilio.com/welcome/")
Better error messages
When things are broken, it's important to tell people what exactly went wrong
and how to fix it. Some things you can't control very well, such as a user
running pip install twilio and getting a -bash: pip: command not found
error.
That message gives no hint at how to fix the problem, and you can't change
that. But you should anticipate, and provide explicit messages for, errors
where you can control the output shown to the user. Continuing with the example
from above, here's what the error message used to look like:
>>> from twilio.rest import TwilioRestClient
>>> client= TwilioRestClient()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/twilio/rest/__init__.py", line 110, in __init__
twilio.TwilioException:
Twilio could not find your account credentials.
That doesn't explain where you should actually put your credentials. A better
message:
>>> from twilio.rest import TwilioRestClient
>>> client= TwilioRestClient()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/twilio/rest/__init__.py", line 110, in __init__
twilio.TwilioException:
Twilio could not find your account credentials. Pass them into the
TwilioRestClient constructor like this:
client= TwilioRestClient(account='ACCOUNTS_SID', token='AUTH_TOKEN')
Or add your credentials to your shell environment. On OSX or Linux, add
the following to your .bashrc file
export TWILIO_ACCOUNT_SID=AC3813535560204085626521
export TWILIO_AUTH_TOKEN=2flnf5tdp7so0lmfdu3d7wod
Replace the values for the Account SID and auth token with the values
from your Twilio Account at https://www.twilio.com/user/account.
Your error messages should always explain how to solve the problem raised by
the error. Otherwise, they'll lead to support tickets (an increased expense),
and/or frustration (lost revenue).
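To make this concrete, here is a minimal sketch of a credential lookup that fails with an actionable message. It is not the twilio-python implementation: the CredentialsError class and the find_credentials function are hypothetical names chosen for illustration, and only the two environment variable names come from the example above.

```python
import os


class CredentialsError(Exception):
    """Raised when no account credentials can be found (hypothetical name)."""


def find_credentials(account=None, token=None, environ=None):
    """Return (account, token), or raise an error that explains the fix."""
    environ = os.environ if environ is None else environ
    account = account or environ.get("TWILIO_ACCOUNT_SID")
    token = token or environ.get("TWILIO_AUTH_TOKEN")
    if not account or not token:
        # Say what went wrong AND what the user should do next.
        raise CredentialsError(
            "Could not find your account credentials. Either pass them in "
            "directly:\n\n"
            "    find_credentials(account='ACXXXXXXXXXXX', token='YOUR_AUTH_TOKEN')\n\n"
            "or set the TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN environment "
            "variables before running your script."
        )
    return account, token
```

The point of the sketch is the shape of the message: it names the missing thing, shows a line the user can copy, and offers the alternative fix.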
Users are coming from Google
Over 50% of the visitors to Twilio's documentation come straight from Google,
and your documentation probably has a similarly high number. This has several
interesting implications.
First, you need to think about SEO for everything that you write. Specifically,
you should make sure that:
documentation pages have exactly one <h1> tag.
the <h1> tag fully describes the content of the page.
the page's <meta name="description"> tag describes the content of the page.
you are linking to other pages of your documentation often, using relevant
keywords. Instead of linking like this: "Click here to read our SMS sending
documentation", link on the keyword, like this: "For more information, read our
documentation on sending SMS messages."
all anchors have title attributes (extra text when you hover over the link,
like this: <a href="http://www.example.com" title="Hello!">). This is good both for SEO
and for accessibility/usability.
all images have alt text (needed for accessibility, and so Google will know
what's on the page).
Furthermore, users are not "discovering" your documentation by going to the
homepage and navigating through the tree. They are landing on the specific
page that they want to get to, via Google. So you can't expect that users will
see anything outside of the page they land on; each page needs to be a
self-contained unit of awesomeness.
If you don't think about SEO for your content, you may be getting outranked for
key terms by content farms.
You are busy
Let's face it, you've got a million things on your plate, and if you leave the
documentation until the end, it won't be as awesome as it should be. That's why
you should write your documentation first, before you start coding. This means
you're doing it while you're still really excited about the project. It also
forces you to think through some of the decisions you will make, before you
start writing code for things that you're not going to implement.
Bottom Line: Writing effective documentation requires you to understand
who your user is, how they behave and how they are finding answers to their
questions. You also need to dedicate time in your product development cycle to
writing documentation good enough so that users can figure out your product.
As more and more of the services we use every day move online and into the cloud, the design and operation of highly-available infrastructure is becoming increasingly critical. However, downtime is common even at the biggest sites on the Internet. Why is high-availability so hard to achieve?
Five 9's
Availability is the amount of time a device is actually operating, expressed as a percentage of the total time it should be operating: total uptime divided by the sum of uptime and downtime. "Five nines," or 99.999%, is a common goal in availability-critical environments. 99.999% availability translates to 5.26 minutes of downtime per year or 25.9 seconds of downtime per month.
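The arithmetic behind those figures is easy to check. The short sketch below assumes a 365-day year and a 30-day month, which is what reproduces the numbers quoted above; the function names are made up for illustration.

```python
def availability(uptime: float, downtime: float) -> float:
    """Uptime as a fraction of total (uptime + downtime) time."""
    return uptime / (uptime + downtime)


def allowed_downtime_seconds(target: float, period_days: float) -> float:
    """Seconds of downtime permitted in a period at the target availability."""
    period_seconds = period_days * 24 * 60 * 60
    return (1.0 - target) * period_seconds


five_nines = 0.99999
per_year_minutes = allowed_downtime_seconds(five_nines, 365) / 60
per_month_seconds = allowed_downtime_seconds(five_nines, 30)
print(f"{per_year_minutes:.2f} minutes per year")    # 5.26 minutes per year
print(f"{per_month_seconds:.1f} seconds per month")  # 25.9 seconds per month
```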
For services that must achieve five or six nines, human intervention in failure detection and mitigation simply isn't practical. Imagine you have a server and a correctly configured monitoring system that can detect any and all service-impacting failures. In addition, all failures correctly trigger a pager service that notifies an oncall team. When a failure is detected and handled by a human process, the oncall team must be able to hop on a computer, access the Internet, determine the failure, and fix the issue -- all within a window of five minutes. This process simply isn't realistic. High-availability infrastructure must therefore be able to automatically recover from failures without human intervention.
It's the database
One of the most difficult infrastructure components to automate is the database. Databases and other stateful components are complex and have traditionally required significant human intervention to configure and fail over. This makes data stores difficult to maintain and make highly available. This is especially true in cloud computing environments where the traditional approach of running data nodes on expensive special-purpose hardware that automatically handles failover isn't possible. To put it in more dramatic language,
Data persistence is the hardest technical problem most scalable SaaS businesses face.
We recently gave a talk at Web 2.0 Expo in New York that explores the root causes of downtime and the architectural choices we've made at Twilio to handle stateful components to provide for high-availability.
Key points from the talk:
Data persistence issues and change control mistakes are two key causes of downtime
Data persistence is hard,
hard to change schema or structure due to the need to rewrite data and indices,
hard to recover from failures due to complex split-brain situations and the time it takes to restore or recover state from another node or backup,
and hard to manage due to the incredible complexity of modern data management systems.
When building high-availability cloud applications, one should consider
clearly differentiating stateful and stateless components in your infrastructure,
avoiding the storage of data where possible,
using unstructured storage instead of structured storage where possible,
aggressively archiving or deleting data that doesn't need to be online,
defining clear human-mediated processes for change control,
and pragmatically picking the data storage technology for the problem -- SQL is best for certain tasks whereas a key-value store or cloud storage like S3 is better for others.
Computers don’t care about API design. Convenient serialization formats, sane parameter names, and a RESTful interface mean nothing to a robot. These aspects of an API are not meant for a machine, they’re meant for you – the developer. APIs should be for human beings first and computers second.
Write, rewrite, ask, re-rewrite
Don’t settle on a design until as many eyes and minds have seen it as is reasonably possible. If you were given the task “Design a way for customers to search for phone numbers which they can purchase,” what would you end up with? There’s no single design that satisfies this requirement and, as a result, you can build any number of reasonable APIs.
For this reason, the AvailablePhoneNumbers resource was challenging to design. A major rule we follow in API design is make the common case easy. When searching US numbers most users end up wanting one of two things: A number from a particular area code, or a number containing some string of digits. These are the first two filters we document and the ones we expect most people to care about.
Why even include an area code parameter? You can accomplish the same with the ‘Contains’ parameter and wildcards, e.g., +1555*******. The first iteration of the design only included a Contains parameter. It was simple and powerful. It was also frustrating when all you needed was an area code lookup as you had to remember the exact filter syntax. If you have to think this much to complete a dead simple task, then we’ve failed as API designers.
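To see the difference from the caller's side, here is a small sketch that builds the query string for both approaches. It does not call the Twilio API; the helper names are hypothetical, and AreaCode is assumed to be the name of the area code parameter. The Contains pattern follows the wildcard example above.

```python
from urllib.parse import urlencode


def area_code_query(area_code: str) -> str:
    """The common case: one obvious parameter, nothing to remember."""
    return urlencode({"AreaCode": area_code})


def contains_pattern(area_code: str) -> str:
    """The filter syntax the caller would otherwise have to remember:
    country code, area code, then one '*' per remaining digit."""
    return "+1" + area_code + "*" * 7


def contains_query_for_area_code(area_code: str) -> str:
    """The same search expressed through the general-purpose filter."""
    return urlencode({"Contains": contains_pattern(area_code)})


print(area_code_query("555"))
print(contains_query_for_area_code("555"))
```

Both queries describe the same search, but only the second requires the caller to know the pattern rules, and it has to be URL-encoded on top of that.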
Simplify the surface area of your API
Let’s take a look at the IncomingPhoneNumbers resource. A POST to this resource will let you buy a number, and a GET to this same resource will let you see the numbers you’ve bought. When designing this functionality, the first instinct of many developers would be to create a separate resource for purchasing phone numbers. POST /ProvisionPhoneNumber to buy, and then GET /IncomingPhoneNumbers to return a list of numbers you’ve bought. However, we now need an entirely new resource to provision phone numbers that supports only one method. When you take this design and multiply it by the amount of functionality you wish to provide, you end up with an explosion of complexity.
GET /Calls
POST /PlaceCall
GET /SMS/Messages
POST /SendSMS
GET /IncomingPhoneNumbers
POST /ProvisionPhoneNumber
By choosing a consistent convention we’re able to simplify the surface area of the API. When we add things like Subaccounts, it’s a natural extension of the API. POST /Accounts to create a subaccount, GET /Accounts to see your accounts. In addition, every unique parameter name adds to the weight of your API. Be consistent and try to reduce the number of resources a client of your API has to learn. Put yourself in the shoes of your users.
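As a rough illustration of why the convention keeps things small, consider this toy in-memory dispatcher. It is only a sketch: the Collection class, the handle function, and the sid field are invented for the example and say nothing about how Twilio implements its API.

```python
class Collection:
    """An in-memory list resource: GET lists items, POST creates one."""

    def __init__(self):
        self.items = []

    def get(self):
        return list(self.items)

    def post(self, **params):
        item = dict(params, sid=len(self.items) + 1)
        self.items.append(item)
        return item


# One resource per noun; every resource understands the same two verbs.
RESOURCES = {
    "/Calls": Collection(),
    "/SMS/Messages": Collection(),
    "/IncomingPhoneNumbers": Collection(),
}


def handle(method: str, path: str, **params):
    resource = RESOURCES[path]
    if method == "GET":
        return resource.get()
    if method == "POST":
        return resource.post(**params)
    raise ValueError(f"unsupported method: {method}")


handle("POST", "/IncomingPhoneNumbers", PhoneNumber="+15555550100")
print(handle("GET", "/IncomingPhoneNumbers"))
```

Adding a new noun means adding one entry to the table, and a client who has learned one resource already knows how to use the next.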
Easy to add, Hard to take away
There is a fine line to walk when adding new features. If you have a particularly contentious feature design, don’t be afraid to leave it out. After you launch with it in your public API, it is extremely difficult to remove features without a) bumping the API version or b) breaking customers’ applications. This means that every addition should be reviewed with the knowledge that removing or changing the API later could be more difficult for customers than not having it in the first place. At Twilio we’ll often add extra parameters, resources, or properties as we see customers adopting and using an API instead of attempting to predict all usage patterns from the beginning.
Even after all of that, mistakes will happen.
Don't be afraid to make backwards incompatible changes
In the long run, it’s better for your user base and it’s better for you. Sometimes your assumptions are wrong, sometimes a parameter name isn’t the clearest, sometimes your response format is less than ideal.
However, once you’ve released a new API version, it’s very important to maintain access to the old API version for some period of time. It’s very difficult to get customers who have built against a certain API version to upgrade. You can provide incentives: new features, better performance, etc. Even then, many customers who’ve already built a working integration are reticent to incur the cost of upgrading.
One way to reduce the cost is to make version changes as transparent as possible. By writing smart client libraries, you can abstract away certain aspects of API versioning. For example, a client library could turn requests like this:
Now, let’s say you add support for a JSON representation in addition to XML. You can beg all of your customers to switch, but almost no one will if they’ve implemented the first case. In the second case, everyone can get the upgrade for free. In fact, if your client library was smart enough to send the Accept: application/json, text/xml header, the server could decide at request time which representation was most efficient for this particular request and dynamically send the better Content-Type. The user of the library wouldn’t ever need to know which response type he’s receiving and he wouldn’t ever need to upgrade his library to handle one or the other.
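One way a client library might hide the representation from its users is sketched below. This is an assumption-laden illustration, not the Twilio helper library: parse_response is a hypothetical function, and the XML handling only covers a flat element with simple child elements.

```python
import json
import xml.etree.ElementTree as ET

# Sent with every request so the server is free to pick a representation.
ACCEPT_HEADER = "application/json, text/xml"


def parse_response(content_type: str, body: str) -> dict:
    """Turn a JSON or flat XML response body into a plain dict."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    if media_type in ("text/xml", "application/xml"):
        root = ET.fromstring(body)
        return {child.tag: child.text for child in root}
    raise ValueError(f"unsupported Content-Type: {content_type}")


xml_body = "<Call><Sid>CA123</Sid><Status>queued</Status></Call>"
json_body = '{"Sid": "CA123", "Status": "queued"}'

# The caller gets the same dict no matter what the server chose to send.
assert parse_response("text/xml; charset=utf-8", xml_body) == parse_response(
    "application/json", json_body
)
```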
Human-centric API Design
API design is as much an art as a science. There isn’t a single ‘right way’ of doing it. Different APIs feel and behave in different ways. We’ve grown the Twilio API from only a couple of features to over a dozen, and we continue to add new features and functionality all the time. At the end of the day, we build our APIs to serve you: the human being who needs to get stuff done.
On Sept 22nd, 2011 at the first Twilio <Conference> we had the opportunity to share details on how the engineering team has scaled up the Twilio infrastructure and organization over the past three years. Embedded below are the slides from the talk.
Over the past few years the company has grown from 3 people to more than 70. We’ve expanded from phone service to long code SMS, short code SMS, browser VoIP, and mobile VoIP. We’ve extended the Twilio cloud beyond the 5-verbs to conferencing, transcription, presence, and app billing with Twilio Connect. We’ve scaled traffic by more than 100x over the past year, and expanded our server infrastructure from a few servers to hundreds running in the cloud.
The presentation also touches on core technologies used by the team, which include PHP, Python, Twisted/gevent, Java, Asterisk/FreeSwitch/JSR289, MySQL, and Redis.
Beyond technology, the presentation also covers the core Twilio engineering cultural values we’ve emphasized to help foster sustainable engineering culture, process and technology:
Simplicity - continuously iterate toward simpler designs for internal and external processes, infrastructure and APIs
Automation - focus on building tools to augment human processes not necessarily replace them
Shipping - build systems and processes that let you ship high quality products more rapidly
Empiricism - enable data-based decision making by aggressively measuring infrastructure and business metrics
Humbleness - make post-mortems part of your team process and constantly re-evaluate how you and your team work, seeking ways to continuously improve
Hope you enjoy the talk. We’ll be sharing more presentations on Twilio infrastructure and technology over the next few weeks.
This week we thought we’d share some background on localtunnel, a project I wrote outside of Twilio to help deal with the challenges of developing against webhook-based APIs (such as Twilio’s) when coding behind a NAT.
These days it’s fairly common to run a local environment for web development. Whether you’re running Apache, Mongrel, or the App Engine SDK, we’re all starting to see the benefits of having a production-like environment right there on your laptop so you can iteratively code and debug your app without deploying live, or even needing the Internet.
The Problem
With the growing popularity of HTTP callbacks, or webhooks, there are cases where you can really only debug your app while live and on the Internet. Webhooks aside, there are other cases where you might need to make local web servers public, such as testing or public demos. Demos are a surprisingly common case, especially for multi-user systems (“Man, I wish I could have you join this chat room app I’m working on, but it’s only running on my laptop”).
The Solution?
To some, the solution is obvious: SSH tunneling! That is, use a magical set of options with SSH on a hosted box to set up a tunnel from that machine to your local machine. When people connect to a port on your public machine, it gets forwarded to a local port on your machine, looking as if that port was on a public IP.
The idea is great, but it’s a hassle to set up. It’s not just a large, unwieldy command you have to do every time, but you have to make sure sshd is set up properly in order to make a public tunnel on the remote machine. Otherwise you need to set up two tunnels, one from your machine to a private port on the remote machine, and then another on the remote machine from a public port to the private port (that forwards to your machine).
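For reference, the single-tunnel case boils down to a command like the one this sketch assembles. The user and host values are placeholders. For the forwarded port to accept connections from other machines, the remote sshd generally needs its GatewayPorts option enabled, which is presumably the kind of server-side setup referred to above.

```python
def reverse_tunnel_command(user, host, remote_port, local_port):
    """Build the argv for an SSH reverse tunnel (remote port -> local port)."""
    return [
        "ssh",
        "-N",  # do not run a remote command, just forward ports
        "-R", f"{remote_port}:localhost:{local_port}",
        f"{user}@{host}",
    ]


argv = reverse_tunnel_command("me", "example.com", 8080, 8080)
print(" ".join(argv))  # ssh -N -R 8080:localhost:8080 me@example.com
```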
An Easier Solution
I think it’s too much of a hassle to consider SSH tunneling as “a quick and easy option.” Especially if you’re trying to help somebody else set up a tunnel. This is when I started to envision a simple command and service to make this dead simple. Here is the quick and easy way that localtunnel provides:
$ localtunnel 8080
And you’re done! With localtunnel, it’s so simple to set this up, it’s almost fun to do. What’s more is that the publicly accessible URL has a nice hostname and uses port 80, no matter what port it’s on locally. And it tells you what this URL is when you start localtunnel:
$ localtunnel 8080
Port 8080 is now accessible from http://8bv2.localtunnel.com ...
This URL can now be shared with others, used for webhook callbacks, etc. We assume the tunnel will be fairly short-lived, although we don’t actively close long-lived tunnels. The tunnel will generally be available until you stop the process.
Here’s another example of using localtunnel to make the built-in Python HTTP server public for quickly sharing files over the web:
$ python -m SimpleHTTPServer 8000 &
$ localtunnel 8000
Port 8000 is now accessible from http://hy51.localtunnel.com ...
Now you’ll get a directory listing of files in the current directory at that URL. Share the URL and now people can access those files until you close the tunnel.
How It Works
To be clear, this is still at its core SSH tunneling, but wrapped up in a nice package involving a simple client and server. But let’s see what all is happening.
The localtunnel command is written in Ruby and uses an SSH library to open the actual tunnel, but first it hits a tunnel registration API. The API is on the same server that you tunnel through, provided by the server component.
The server component provides two services: a reverse proxy to the forwarded port, and the tunnel registration API. You can see the registration API for yourself, just browse to http://open.localtunnel.com. This simple API allocates an unused port to tunnel on, and gives the localtunnel client the information it needs to set up an SSH tunnel for you.
Of course, there’s also authentication. As a free and public service, I don’t want to just give everybody SSH access to this machine (as it may seem). Instead, since we’re wrapping SSH, we use public keys and per key options to lock down access while still allowing normal SSH access for anything else on the box.
The server runs as a user with no shell. It only has a home directory with an authorized_keys file. The first time you use localtunnel, you have to use the -k option specifying a public key to pass along when we register a tunnel. We verify it’s a valid key and then add it to the authorized_keys file with a bunch of options preventing pretty much any use of SSH other than tunneling to a private port on the server. After that, assuming SSH can find your private key, it just works.
We only allow private port tunneling because we don’t want arbitrary public port forwarding from the server. Public access to this private port on the server comes through our reverse proxy listening on port 80 for any *.localtunnel.com requests. We keep the mapping of randomly generated hostnames for each active port in memory and use that to determine the backend for our reverse proxy. This backend, however, is actually your local server thanks to the SSH tunnel.
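The hostname bookkeeping can be pictured with a few lines of code. The sketch below is not the localtunnel server: TunnelRegistry and its methods are hypothetical, and the four-character names simply mirror the example URLs shown earlier.

```python
import random
import string


class TunnelRegistry:
    """In-memory map from generated hostnames to private tunnel ports."""

    def __init__(self, domain="localtunnel.com", rng=None):
        self.domain = domain
        self.rng = rng or random.Random()
        self.ports_by_host = {}

    def register(self, port: int) -> str:
        """Allocate a random, unused hostname for a tunnel port."""
        alphabet = string.ascii_lowercase + string.digits
        while True:
            name = "".join(self.rng.choice(alphabet) for _ in range(4))
            host = f"{name}.{self.domain}"
            if host not in self.ports_by_host:
                self.ports_by_host[host] = port
                return host

    def backend_for(self, host_header: str):
        """Return the (address, port) to proxy to, or None if unknown."""
        host = host_header.split(":")[0].lower()
        port = self.ports_by_host.get(host)
        return ("127.0.0.1", port) if port is not None else None

    def unregister(self, host: str) -> None:
        """Forget a tunnel once its client disconnects."""
        self.ports_by_host.pop(host, None)
```

A reverse proxy would call backend_for with the Host header of each incoming request and forward the request to the returned private port, which the SSH tunnel then carries to the developer's machine.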
It’s important to consider that right now there are no real privacy guarantees and currently no HTTPS support. This is on the roadmap, but for now this means you may not want to share highly sensitive data over localtunnel.
That’s pretty much it. You can explore further by looking at the code on GitHub. The server component is written in Twisted Python and is just over 100 lines of code.
What Now?
You can start using it immediately if you have Ruby and Rubygems installed. You just need to run:
$ gem install localtunnel
Although it currently depends on Ruby, one contributor is working on a Java client. Speaking of contributors, there are a few other things on the roadmap if anybody wants to help:
CNAME support for long-lived or "reserved" tunnels
HTTPS support for fully secure tunnels
Automatic key generation, eliminating the need to initially specify a key
Otherwise, enjoy the free service and feel free to stand up your own instance!
Starting early this morning, Amazon Web Services experienced several service problems at one of its east coast datacenters. The outage impacted major sites across the Internet. The number of high profile sites affected by the issue shows both the amazing success of cloud services in enabling the current Internet ecosystem, and also the importance of solid distributed architectural design when building cloud services.
Twilio’s APIs and service were not affected by the AWS issues today. As we’ve grown and scaled Twilio on Amazon Web Services, we’ve followed a set of architectural design principles to minimize the impact of occasional, but inevitable issues in underlying infrastructure.
Unit-of-failure is a single host
Where possible, choose services and infrastructure that assume host failures happen. By building simple services composed of a single host, rather than multiple dependent hosts, one can create replicated service instances that can survive host failures.
For example, if we had an application that consisted of business logic components A, B, and C, each of which had to live on a separate host, we could compose service groups (A, B, C), (A, B, C)… or we could create component pools (A, A, …), (B, B, …), (C, C, …). With the composition (A, B, C), a single machine failure would result in the loss of a whole system group. By decomposing resources into independent pools, a single host failure only results in the loss of a single host’s worth of functionality. We’ll cover more on the benefits and drawbacks of this approach in another post.
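A toy calculation makes the difference visible. The sketch below simply counts how many hosts can still do useful work after one failure under each layout; the host names and both functions are made up for the example.

```python
def usable_hosts_grouped(groups, failed):
    """Hosts still serving when a group needs all of its members alive."""
    return sum(len(g) for g in groups if not any(h in failed for h in g))


def usable_hosts_pooled(pools, failed):
    """Hosts still serving when every host in a pool works independently."""
    return sum(1 for pool in pools for h in pool if h not in failed)


groups = [("a1", "b1", "c1"), ("a2", "b2", "c2")]
pools = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]
failed = {"a1"}

print(usable_hosts_grouped(groups, failed))  # 3: the whole first group is lost
print(usable_hosts_pooled(pools, failed))    # 5: only the failed host is lost
```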
Short timeouts and quick retries
When failures happen, have software quickly identify those failures and retry requests. By running multiple redundant copies of each service, one can use quick timeouts and retries to route around failed or unreachable services.
Make a request. If that request returns a transient error or doesn’t return within a short period of time (the meaning of short depends on your application), treat it as failed.
Retry the request against another instance of the service.
Keep retrying within the tolerance of the upstream service.
If you don’t fail fast and retry, distributed systems, especially those that are process or thread-based, can lock up as resources are consumed waiting on slow or dead services.
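The loop described above might look something like the following sketch. The TransientError type, the send callback, and the default values are assumptions made for illustration; a real client would plug in its own transport and pick timeouts that suit the application.

```python
class TransientError(Exception):
    """A failure worth retrying, e.g. a timeout (hypothetical type)."""


def call_with_retries(instances, send, timeout=0.5, max_attempts=3):
    """Try the request on successive instances until one succeeds.

    `send(instance, timeout)` is assumed to raise TransientError when the
    instance fails or does not answer within `timeout` seconds.
    """
    if max_attempts < 1 or not instances:
        raise ValueError("need at least one attempt and one instance")
    last_error = None
    for attempt in range(max_attempts):
        instance = instances[attempt % len(instances)]
        try:
            return send(instance, timeout)
        except TransientError as exc:
            last_error = exc  # fail fast, then move on to the next instance
    raise last_error


def fake_send(instance, timeout):
    """Stand-in transport: the first recorder is down, the second is fine."""
    if instance == "recorder-1":
        raise TransientError(f"{instance} timed out after {timeout}s")
    return f"handled by {instance}"


print(call_with_retries(["recorder-1", "recorder-2"], fake_send))
# handled by recorder-2
```

Capping max_attempts is what keeps the retries "within the tolerance of the upstream service": the caller gives up and reports the last error instead of holding resources indefinitely.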
Idempotent service interfaces
Build services that allow requests to be safely retried. If you aren’t familiar with the concept, go read up on the wonderful world of idempotency.
"In computer science, the term idempotent is used more comprehensively to describe an operation that will produce the same results if executed once or multiple times."
If the API of a dependent service is idempotent, that means it is safe to retry failed requests (see “Short timeouts and quick retries” above). For example, if a service provides the capability to add money to a user’s account, an idempotent interface to that service allows failed requests to that service to be safely retried. There’s a lot to this topic; we’ll make a point of covering it in much more detail in the future.
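One common way to make the add-money example idempotent is to have the caller attach a unique request ID and have the service remember which IDs it has already applied. The sketch below shows that idea with an in-memory ledger; the Ledger class and the request ID scheme are illustrative assumptions, not a description of any Twilio service.

```python
class Ledger:
    """Account balances with an idempotent credit operation."""

    def __init__(self):
        self.balances = {}
        self.applied = {}  # request_id -> balance returned the first time

    def credit(self, request_id: str, account: str, amount: int) -> int:
        """Add `amount` to `account` at most once per `request_id`."""
        if request_id in self.applied:
            return self.applied[request_id]  # a retry: do not add again
        balance = self.balances.get(account, 0) + amount
        self.balances[account] = balance
        self.applied[request_id] = balance
        return balance


ledger = Ledger()
ledger.credit("req-1", "alice", 500)
ledger.credit("req-1", "alice", 500)  # retried after a timeout
print(ledger.balances["alice"])  # 500, not 1000
```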
Small stateless services
Separate business logic into small stateless services that can be organized in simple homogeneous pools. Twilio’s infrastructure contains many service pools that implement parts of our voice and SMS APIs. For example, when you make a recording using the <Record> verb in TwiML, the work of post-processing the recording to improve the audio quality and upload it to persistent storage is provided by a pool of recording servers. The pool of stateless recording services allows upstream services to retry failed requests on other instances of the recording service. In addition, the size of the recording server pool can easily be scaled up and down in real-time based on load.
Relax consistency requirements
When strict consistency is not required, create pools of replicated and redundant read data. One of the most important conceptual separations you can do at an application level is to partition the reading and writing of data. For example, if there is a large pool of data that is written infrequently, separate the reads and writes to that data. By making this separation, one can create redundant read copies that independently service requests. For example, by writing to a database master and reading from database slaves, you can scale up the number of read slaves to improve availability and performance.
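Here is a deliberately simplified picture of that separation, using plain dictionaries in place of databases. The ReplicatedStore class is hypothetical, and replication is triggered by hand so that the window of stale reads is easy to see.

```python
import itertools


class ReplicatedStore:
    """Writes go to one master; reads rotate across read-only copies.

    Assumes at least one replica.
    """

    def __init__(self, replica_count: int):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        self.master[key] = value

    def replicate(self):
        """Copy the master's state to every replica. A real system does this
        asynchronously, which is where the relaxed consistency comes from."""
        for replica in self.replicas:
            replica.clear()
            replica.update(self.master)

    def read(self, key):
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)


store = ReplicatedStore(replica_count=2)
store.write("greeting", "ahoy")
print(store.read("greeting"))  # None: replicas have not caught up yet
store.replicate()
print(store.read("greeting"))  # ahoy
```

Adding read capacity is then a matter of raising the replica count, at the price of reads that can briefly lag behind the latest write.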
The issues at AWS illustrate the need to carefully think through the design of cloud-hosted applications. We’ve highlighted several well-known best practices of distributed system design that have been helpful to us as we decide what software to build and what external services to integrate into Twilio.
[UPDATE] A central theme of the recent AWS issues has been the Amazon Elastic Block Storage (EBS) service. We use EBS at Twilio but only for non-critical and non-latency sensitive tasks. We’ve been a slow adopter of EBS for core parts of our persistence infrastructure because it doesn’t satisfy the “unit-of-failure is a single host” principle. If EBS were to experience a problem, all dependent services could also experience failures. Instead, we’ve focused on utilizing the ephemeral disks present on each EC2 host for persistence. If an ephemeral disk fails, that failure is scoped to that host. We are planning a follow-on post describing how we do RAID0 striping across ephemeral disks to improve I/O performance.
This is one of the first posts to our new Twilio engineering blog. Our team is excited to share our experiences in building and scaling Twilio to utilize the capabilities of cloud platforms like Amazon AWS. If you are interested in this topic, there are several great blogs that cover problems building distributed systems.
Ahoy hoy and welcome to the newly minted Twilio engineering blog! We the Twilio engineering team will be sharing some of the unique challenges we face bridging the 100-year-old world of realtime telecom with the world of HTTP and the web. Using cloud infrastructure to implement a communications platform has required us to build a highly automated, self-healing distributed platform that can be deployed across thousands of servers. We are extremely excited to share details on building this infrastructure and how we are developing an engineering organization and processes for telecom-grade scale and availability. A few of the topics we are looking forward to covering include:
No-downtime philosophy
Deployments that can’t lose or fail a single request
Asynchronous programming at scale
High availability in the cloud
Building scalable engineering process
Automated deployment of thousands of servers
Designing great APIs
Software development lifecycle for the cloud
DevOps and software-based automation
Continuous integration and deployment for APIs
Twilio is an engineering-driven company. Our founders are engineers, our customers are engineers, and our company, well yeah. We’ll be scheduling a "live chat" with the members of the Twilio engineering team over the coming weeks. Stay on the line.