Adventures in Unicode SMS

NOV 08

I remember that during one of my first weeks as an engineer at Twilio, I tried to send Unicode chess pieces to my phone. I was disappointed to see that the characters did not come through to the handset. Perhaps just as bad, when I sent chess pieces to Twilio from my phone, they arrived at my TwiML endpoint incorrectly encoded.

One of my first major projects on the Messaging team at Twilio was to bring International SMS out of beta. A requirement for sending messages internationally is to support the characters of the languages of those countries. Learning about character encoding and SMS network protocols proved to be an extremely educational and interesting task.

SMS Character Encoding Background

Historically, our SMS API claimed support for 160 ASCII Characters. This turned out to be oversimplified in two ways: the size limitation is actually more complicated, and ASCII is not the encoding of interest. What can actually fit into a single SMS (at least for GSM networks) is limited to 140 bytes.

The 160 maximum actually comes from the fact that you can encode 160 7-bit characters into 140 bytes. But even if all your characters are in ASCII, you're not guaranteed to fit in 160 characters. This is because the character encoding used in SMS is not ASCII, but GSM 03.38. Consequently,

  • Certain characters in GSM 03.38 require an escape character. This means they take 2 septets (14 bits) to encode. These characters include: |, ^, €, {, }, [, ~, ] and \.

    In short, an SMS made up entirely of escaped characters can hold only 80 characters instead of 160. Perhaps more devastating, a message truncated at exactly 160 characters won't fit into a single SMS if it contains even one of these characters.

  • Not every ASCII character is encodable in GSM. For example, the backtick (`) and tab are missing.
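To make the septet arithmetic concrete, here is a minimal Erlang sketch. The module and function names are made up for illustration, and it assumes the text has already been checked to contain only GSM 03.38 characters; it only accounts for the escaped characters costing two septets.

```erlang
-module(gsm_len).
-export([septets/1, fits_single_sms/1]).

%% 140 bytes of payload hold (140 * 8) div 7 = 160 seven-bit septets.
-define(MAX_SEPTETS, ((140 * 8) div 7)).

%% Printable characters from the GSM 03.38 extension table.
%% The euro sign is written as its code point, U+20AC.
escaped() ->
    [$|, $^, ${, $}, $[, $], $~, $\\, 16#20AC].

%% An escaped character is sent as ESC + character, so it costs 2 septets.
%% Assumption: every other character is in the basic GSM table (1 septet).
cost(C) ->
    case lists:member(C, escaped()) of
        true -> 2;
        false -> 1
    end.

%% Text is a list of Unicode code points (a plain Erlang string).
septets(Text) ->
    lists:sum([cost(C) || C <- Text]).

fits_single_sms(Text) ->
    septets(Text) =< ?MAX_SEPTETS.
```

With this, a 160-character message containing a single [ comes out at 161 septets, so it no longer fits in one SMS.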

Other Character Encoding Options

SMS Gateway protocols tend to support more character encodings than just GSM 03.38. But, the bad news is that very few of these encodings actually matter. It depends on the carrier network of the handset, but most phones (especially internationally) will receive messages in either GSM or UCS-2 character encoding. This means that even though, for example, SMPP 3.4 supports character encodings like ISO 8859-1, ISO 8859-8, ISO 8859-5, IA5, and a mysterious unspecified Pictogram encoding, the message is probably going to get converted into GSM 03.38 anyway, so any characters in these encodings that are missing in GSM will be lost in the conversion.

Thus, for any message that cannot be encoded in GSM, the most reliable option is to encode it using UCS-2. UCS-2 is a now-defunct character encoding that has since been replaced by UTF-16. There are two main differences between these encodings:

  • UCS-2, as used in SMS, doesn't use a Byte Order Mark, so it's always big endian.
  • UCS-2 does not have explicit support for "surrogate pairs", meaning it's limited to always being 2 bytes per character, whereas UTF-16 is 2 or 4 bytes per character. This means you cannot encode characters outside of the Basic Multilingual Plane (non-BMP characters).

These differences turn out not to matter in practice: because modern programming languages lack support for the UCS-2 encoding, smartphones tend to just decode UCS-2 messages as UTF-16 Big Endian. This is good news, because it means in practice we can send non-BMP characters, such as Emoji characters, over SMS, despite the fact that the spec doesn't strictly allow it.

The drawback of UTF-16 messages is that since each character is 2 or 4 bytes, at most 70 characters fit into a 140-byte SMS message. (This is true even if only one character isn't in GSM, since the encoding applies to every character in the message.) Note that if we're only sending Emoji or other non-BMP characters, then the limit is actually 35. This ignores combining characters (e.g. diacritic marks), which could bring the count down further, depending on how you count the length of your text.
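These limits are easy to check with Erlang's standard unicode module. The sketch below is only an illustration (the module name and the choice of test characters are mine): it encodes a list of code points as UTF-16 big endian and compares the byte size against the 140-byte payload.

```erlang
-module(ucs2_len).
-export([utf16_bytes/1, fits_single_sms/1]).

%% Payload size of a single (non-concatenated) SMS, in bytes.
-define(MAX_BYTES, 140).

%% Text is a list of Unicode code points.
%% Returns its size in bytes once encoded as UTF-16 big endian,
%% which is how handsets tend to treat UCS-2 in practice.
utf16_bytes(Text) ->
    case unicode:characters_to_binary(Text, unicode, {utf16, big}) of
        Bin when is_binary(Bin) -> byte_size(Bin);
        _ -> erlang:error(badarg)   % Text held an invalid code point
    end.

fits_single_sms(Text) ->
    utf16_bytes(Text) =< ?MAX_BYTES.
```

A white king (U+2654) takes 2 bytes, so 70 of them fit; an Emoji such as U+1F600 needs a surrogate pair and takes 4 bytes, so only 35 fit.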

Fixing the API

Once we knew what was possible, it was clear our API had some limitations, so it was time to start correcting these issues. Most of the bugs fell into just a few categories. To avoid our mistakes, the following concepts are important to keep in mind:

  • Make sure your language or framework is interpreting percent encoded HTTP parameters as UTF-8 and not Latin-1.
  • Make sure your database connection is set to UTF-8.
  • If you're on MySQL 5.5 or better, make sure your UTF-8 columns are of type utf8mb4 and not utf8. The latter does not have support for non-BMP characters as it has a maximum of 3 bytes per character.

    If you're on MySQL 5.1 and you can't upgrade, you're going to need to find your own solution for non-BMP characters. Some options include: using a BLOB type and handling encoding/decoding yourself or escaping non-BMP characters (whatever scheme you come up with must be sure to produce valid UTF-8).

  • In general, do not confuse "strings" and "bytes." If you have a collection of bytes, you must know the encoding in order to understand it as a string of text.
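As a small illustration of the last two points (this is not how our API does it, just a sketch with made-up names): the same two bytes decode to different text depending on the encoding you assume, and any code point above U+FFFF is one that MySQL's 3-byte utf8 type cannot store.

```erlang
-module(bytes_vs_text).
-export([decode/2, needs_utf8mb4/1]).

%% Bytes only become text once an encoding is chosen.
%% Encoding is utf8 or latin1 here; the result is a list of code points.
decode(Bytes, Encoding) when is_binary(Bytes) ->
    unicode:characters_to_list(Bytes, Encoding).

%% True if Text (a list of code points) contains a non-BMP character,
%% i.e. one that needs 4 bytes in UTF-8.
needs_utf8mb4(Text) ->
    lists:any(fun(C) -> C > 16#FFFF end, Text).
```

For example, decode(<<195,169>>, utf8) gives the single code point 233 (é), while decode(<<195,169>>, latin1) gives two code points, 195 and 169, which is the classic "Ã©" mojibake.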

Unicode SMS: Not just for International Characters!

With an SMS API with robust character encoding support, we now had the opportunity to fully realize the vision of a chess game played entirely over SMS! Using Unicode characters we can model the pieces on a chessboard and build a Chess over SMS (ChesSMS) game. Moves are made using Long Algebraic Notation (e.g. e2e4), and the opponent is currently limited to a chess engine.

Encoding a Chessboard

To send a chessboard to a phone, there are basically two challenges: getting the pieces to line up and getting the whole board to fit into one SMS message.

Display Problems

There are a few tricky things about displaying chessboards using only text. When I was searching for Unicode codepoints that would make good empty black and white squares, I was hoping to find characters as close to the width of the chess piece characters as possible. In general, phones do not use fixed-width fonts. Worse, since phones use different fonts, it may prove difficult or impossible to find black and white square characters that will make the board align nicely across all handsets. In fact, in many fonts, the chess pieces themselves don't even have the same width!

Of course, all this assumes the phone's font even has the characters for the chess pieces. If it doesn't, you'll likely see a substitute character like a Unicode block (▯). Luckily, iPhone, Android, and Windows Phone all use fonts with chess pieces. For other phones, we'll have to fall back to alphabetical characters.

After some trial and error, I found Unicode shaded block characters and fixed-width whitespace characters that worked pretty well in my iPhone's Helvetica Neue font. Since we can't tell what kind of phone we're sending the SMS to, it's up to the end user to change their "settings" for which characters are used to render the board. But in general, due to all these problems, rendering a board is very much "best effort."

Here's how it renders on iOS:

Since I was using whitespace characters for the white squares, I ran into a problem that if the top left square was empty (the top left happens to be a white square, when you're playing as white) the first character in the message was whitespace. It turned out that our API actually strips leading whitespace, and so it was stripping my square! Perhaps this is the most literal "corner case" in the history of software. I changed the API to only strip the "standard" types of whitespace, under the assumption that anyone who is using special whitespace characters is probably doing so on purpose.
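The idea behind the fix can be sketched in a few lines of Erlang. Which characters count as "standard" whitespace is my assumption here (space, tab, newline, carriage return), and U+2003 EM SPACE merely stands in for a special whitespace character; neither is taken from the actual API code.

```erlang
-module(sms_trim).
-export([strip_leading/1]).

%% Drop leading space, tab, newline and carriage return only.
%% Any other whitespace (e.g. U+2003 EM SPACE) is assumed to be intentional
%% and is left in place.
strip_leading([C | Rest]) when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
    strip_leading(Rest);
strip_leading(Text) ->
    Text.
```

So strip_leading("  hi") returns "hi", but a message that starts with U+2003 keeps its first square.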

Fitting a Board

If you recall from the character encoding background section, we're allowed 140 bytes in an SMS message. Luckily, the chess pieces are in the Basic Multilingual Plane (BMP), so each of them takes only 2 bytes. Thus, in each message we can send 70 characters.

A chessboard is 8 by 8, so 64 of our characters are used by the board itself. We need to split the ranks so we need 7 newlines to do that (one after each rank except the last).

64 + 7 = 71. We need one too many characters, so close! I happen to have footage of the exact moment Grand Master Garry Kasparov read the SMS specifications:

The solution for fitting the board is to omit the final whitespace character on the ranks that end with an empty white square. This works until each of those 4 ranks has a piece sitting on its last square. For those positions, I haven't come up with a workaround, and the board will just be sent in two messages.
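Here is roughly what that trimming looks like, as a hedged sketch rather than the actual ChesSMS code. The board representation (a list of 8 ranks, each a list of 8 code points, top rank first) and the Blank argument (whatever whitespace code point is used for empty white squares) are assumptions for the example.

```erlang
-module(board_fit).
-export([render/2, fits/2]).

%% 140 bytes at 2 bytes per BMP character.
-define(MAX_CHARS, 70).

%% Join the ranks with newlines, dropping the last character of any
%% rank that ends in an empty white square (Blank).
render(Ranks, Blank) ->
    Trimmed = [drop_trailing(Rank, Blank) || Rank <- Ranks],
    lists:append(lists:join("\n", Trimmed)).

%% Only the final character is dropped, and only if it is Blank.
drop_trailing(Rank, Blank) ->
    case lists:last(Rank) of
        Blank -> lists:droplast(Rank);
        _ -> Rank
    end.

%% Assumes every character on the board is in the BMP.
fits(Ranks, Blank) ->
    length(render(Ranks, Blank)) =< ?MAX_CHARS.
```

An empty board renders to 67 characters (64 + 7 - 4), and each of the four white end squares that gets occupied adds one back, so the board stops fitting only when all four are occupied.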

Modeling a Chess Service in Erlang/OTP

For implementing ChesSMS, I chose Erlang/OTP. I knew that the requirements of ChesSMS would expose me to several features of Erlang that I hadn't gotten to play with yet, and I knew that for a service as mission critical as Chess over SMS, no other system could give me the reliability and scalability that I needed. Well, actually despite not having great reasons to choose Erlang, it did end up being an excellent and fun choice, as I will explain without getting too detailed on the architecture.

Communicating With the Chess Engines

Developing a chess engine was outside of the scope of this project. Luckily, there are dozens of great open source chess engines, and two main protocols for communicating between engines and interfaces. There is the XBoard protocol, which gradually evolved with the XBoard and Winboard user interfaces, and there's UCI (Universal Chess Interface), which was designed from scratch by engine authors. UCI is a much simpler interface, but it puts more of the complexity on the UI side. I chose to implement UCI because of its stateless design and simpler command list.

The chess engine and UI communicate over stdin/stdout, which is a great match for Erlang's Port feature. A Port is an external program that your Erlang program spawns and then exchanges messages with as if it were another Erlang actor. This made it extremely simple to command chess engines from Erlang. Since I want to give the engine a few seconds to decide on a move, I spawn 8 engines (by default; it's configurable, of course) and add them to a common resource pool. I'm currently using the Stockfish chess engine, but in theory, any UCI engine should plug in and work.
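To give a feel for how little code a Port needs, here is a stripped-down sketch of one request to a UCI engine. It is not the ChesSMS implementation: the module name, the timeouts, and EnginePath (a full path to whatever UCI engine binary you have installed) are assumptions, and it starts a fresh engine per call instead of using a pool.

```erlang
-module(uci_port).
-export([best_move/3, position_cmd/1, parse_bestmove/1]).

%% Ask the engine at EnginePath for a reply to Moves (e.g. ["e2e4"]),
%% letting it think for MoveTimeMs milliseconds.
best_move(EnginePath, Moves, MoveTimeMs) ->
    Port = open_port({spawn_executable, EnginePath},
                     [binary, {line, 4096}, use_stdio]),
    send(Port, "uci"),
    ok = wait_for(Port, <<"uciok">>),      % crashes on timeout
    send(Port, position_cmd(Moves)),
    send(Port, ["go movetime ", integer_to_list(MoveTimeMs)]),
    Result = wait_for_bestmove(Port, MoveTimeMs + 5000),
    send(Port, "quit"),
    Result.

%% UCI commands are plain text lines on the engine's stdin.
send(Port, Line) ->
    port_command(Port, [Line, "\n"]).

%% Skip engine output until the expected line shows up.
wait_for(Port, Line) ->
    receive
        {Port, {data, {eol, Line}}} -> ok;
        {Port, {data, _}} -> wait_for(Port, Line)
    after 5000 -> {error, timeout}
    end.

wait_for_bestmove(Port, Timeout) ->
    receive
        {Port, {data, {eol, <<"bestmove ", _/binary>> = Line}}} ->
            parse_bestmove(Line);
        {Port, {data, _}} ->
            wait_for_bestmove(Port, Timeout)
    after Timeout -> {error, timeout}
    end.

%% Build "position startpos moves e2e4 e7e5" as an iolist.
position_cmd([]) -> "position startpos";
position_cmd(Moves) -> ["position startpos moves " | lists:join(" ", Moves)].

%% "bestmove e7e5 ponder g1f3" -> {ok, <<"e7e5">>}
parse_bestmove(Line) ->
    case binary:split(Line, <<" ">>, [global]) of
        [<<"bestmove">>, Move | _] -> {ok, Move};
        _ -> {error, Line}
    end.
```

Because the whole position is resent with every request, any idle engine in a pool can answer any game, which is what makes the stateless design convenient.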

Representing a Game

Implementing the chessboard state and updating the board on moves was a surprisingly good fit for Erlang. The pattern matching of Erlang in particular made parts of the code very readable that would have otherwise been nasty.

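%% Classify a move as enpassant, promotion, castle, or normal.
%% The castling clauses imply squares numbered 1..64 starting at a1,
%% so 5 is e1 and 61 is e8.
special_kind({_, pawn}, _From, To, undefined, EnPassantSquare)
        when To == EnPassantSquare -> enpassant;
special_kind({_, pawn}, _, _, Promoted, _)
        when Promoted =/= undefined -> promotion;
%% A king moving two files from its home square: e1g1, e1c1, e8g8, e8c8.
special_kind({white, king}, 5, 7, undefined, _) -> castle;
special_kind({white, king}, 5, 3, undefined, _) -> castle;
special_kind({black, king}, 61, 63, undefined, _) -> castle;
special_kind({black, king}, 61, 59, undefined, _) -> castle;
%% Anything else without a promotion is an ordinary move.
special_kind(_, _, _, undefined, _) -> normal.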

The code above defines a function called special_kind, which determines whether a move is a special move. It takes a {Color, PieceType} tuple, the From square, the To square, the promotion argument (undefined unless the move promotes a pawn), and the en passant square, and returns castle, normal, enpassant, or promotion.

Each chess game is running as an Erlang "process" (not an operating system process) inside of the Erlang VM. I use the mnesia database to store a mapping of Players to Pids (process IDs). When the web interface (implemented in the Cowboy HTTP framework) receives an SMS command from Twilio, it looks up which game process the From phone number is currently involved in. Then, it sends the move as a message to the process (since we're in Erlang, the process could be running locally or on another Erlang node without changes to the code). If the move is valid, the process makes the changes to its local state, asks the chess engine pool for a response move, serializes the chessboard, and sends it back to the actor that returns TwiML.
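A heavily simplified version of one game process might look like the gen_server below. This is a sketch under my own assumptions, not the ChesSMS source: the state is just the list of moves so far, the validity check only looks at the shape of a move like "e2e4" (so promotions such as e7e8q are rejected), and the Mnesia lookup and engine call are left out.

```erlang
-module(game_proc).
-behaviour(gen_server).
-export([start_link/0, move/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Send a move to a game. The call is the same whether Pid lives on
%% this node or on another Erlang node.
move(Pid, Move) ->
    gen_server:call(Pid, {move, Move}).

%% State: the moves played so far, oldest first.
init([]) ->
    {ok, []}.

handle_call({move, Move}, _From, Moves) ->
    case valid(Move) of
        true ->
            NewMoves = Moves ++ [Move],
            {reply, {ok, NewMoves}, NewMoves};
        false ->
            {reply, {error, invalid_move}, Moves}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Placeholder check: file a-h, rank 1-8, twice. Not chess legality.
valid([F1, R1, F2, R2]) ->
    lists:all(fun(F) -> F >= $a andalso F =< $h end, [F1, F2]) andalso
    lists:all(fun(R) -> R >= $1 andalso R =< $8 end, [R1, R2]);
valid(_) ->
    false.
```

In the real service, the HTTP handler would first map the From phone number to a Pid and then make a call like move(Pid, "e2e4").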

Final Thoughts

ChesSMS was a nice work-related side project in terms of learning Erlang/OTP, dog-fooding our API, and being a neat demo of what you can do with Unicode support besides sending international characters. I think Erlang is a great choice for building SMS applications on Twilio.

You can play ChesSMS by texting "PLAY" to +1 (415) 494-8454.

Or, you can fork the source on GitHub.

Posted by Chad Selph on November 08, 2012