Saturday, September 15, 2007

When Unicode just works itself out

I love college; there are so many smart people here. One smart person I met, named Paul, designed his own CMS, Really Cool Stuff (it started as a forum), in PHP and MySQL while in high school. It has tons of features, looks pretty well designed, and he has real, non-programmer users. (He just upgraded and wiped the site, so it's a little empty right now.) I'm jealous.

I've heard many bad things about PHP's Unicode handling, so I asked Paul what he thought about it. He told me that he didn't write any special code for it: in PHP, the strings are plain 8-bit strings, and in MySQL, 8-bit strings are used too. Convinced I had found a bug, I opened up a new forum post and entered a bunch of special characters, expecting to see a mess of misrendered partial code points from the lack of explicit handling.

Disappointingly, the program worked properly. All of the characters displayed as they were entered.

What happened?

It was too simple for me to realize at first: it's UTF-8 all the way through.

UTF-8 is the encoding overwhelmingly used for transmission over the internet, due to its backwards compatibility with ASCII and its efficiency in representing ASCII. Like nearly all Unicode encodings, UTF-8 represents a string as a sequence of bytes; each character takes 1, 2, 3, or 4 of them. In this case, the only thing the website needed to do was get the string from input, store it in the database, and then serve it back out on a different page. PHP generally works with 8-bit strings, but here that doesn't really matter. Modern web browsers treat the whole exchange as UTF-8 by default. Within PHP and MySQL, it's all just bytes with no inherent meaning. It really doesn't matter whether the bytes are characters or UTF-8 octets, because the web browser controls the presentation.
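
As a concrete illustration, here's a minimal PHP sketch (mine, not Paul's code; it assumes the source file itself is saved as UTF-8):

    <?php
    // "é" (U+00E9) is two octets in UTF-8: 0xC3 0xA9.
    $s = "café";
    echo strlen($s);    // 5 -- PHP counts bytes, not characters
    echo bin2hex($s);   // "636166c3a9" -- the trailing c3a9 is the é
    // Echoing the raw bytes back out is all the pass-through requires;
    // a browser interpreting the page as UTF-8 renders them correctly.
    echo $s;            // café
    ?>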

So these two basic operations, storage and retrieval, happen to work just fine here. The input is UTF-8, and the output is interpreted as UTF-8, so everybody's happy. You could call this the Unicode-blind approach. This is basically all Paul's program does with strings: no fancy manipulation, just input and output. But there's a third operation going on here, implicit in lookup. The titles, usernames, etc. are similarly stored in 8-bit strings, and looking them up in the database ultimately involves testing two strings for equality.

Wait a minute!

This should raise a red flag. A simple byte-equality comparison is being used when the strings may be in different, but canonical-equivalent, forms! (I discussed normalization and equivalence in detail earlier. Two strings are canonical-equivalent if they are the same after being put in NFD (or NFC).) That is, two strings might differ in which code points they use but still represent the same glyphs. A simple example is the difference between a lowercase a with an acute accent (one character) and a lowercase a followed by a combining acute accent (two characters). Both represent the same piece of text, so they should be treated the same, and the Unicode standard suggests they be treated this way.
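
Here is the problem in miniature, as a PHP sketch (the escape sequences are the standard UTF-8 encodings of the two forms):

    <?php
    // Two canonical-equivalent spellings of "café":
    $precomposed = "caf\xC3\xA9";   // ends in U+00E9: é as one code point
    $decomposed  = "cafe\xCC\x81";  // ends in U+0065 U+0301: e + combining acute
    // A browser renders both identically, but byte-wise they differ,
    // so the equality test the database lookup relies on says "no":
    var_dump($precomposed === $decomposed);  // bool(false)
    ?>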

However, in the real world, this is rarely an issue, and it will probably never come up among Paul's users. Text from browser input usually arrives in something close to Normalization Form C (NFC). In any case, the form is nearly always consistent among the people who use non-ASCII characters, so it will rarely create a problem.

It's disappointing, but many popular programs and libraries, including many Unicode fonts, don't treat canonical-equivalent strings the same. The Unicode standard only says that no process shall presume that another process treats two canonical-equivalent strings differently, and suggests that it'd be nice if canonical-equivalent strings were always treated as the same. One practical manifestation is that, in many Unicode fonts, a with an acute and a with a separate combining acute are displayed differently. In the case of Paul's program, a user with an accent mark in his name will probably only be able to log in if he enters his username in the right form. The username might look the same on his screen, but if he types, say, combining accent marks where the stored name uses precomposed characters, the database might not find him. But, again, this will rarely occur.

So, what about other operations?

Beyond the simple operations of storage, lookup, and equality comparison, concatenation should also work fine, since joining two valid UTF-8 strings byte-wise yields valid UTF-8. According to Paul, these four operations are all the website needs to do with strings, so everything should work out OK. But there are four classes of string operations that can cause problems in the know-nothing Unicode approach described above, and they may become important as Paul continues to develop his CMS. (A short sketch of each follows the list.)

  1. Dealing with individual characters. There are two layers of problems here. First, the string is in UTF-8, so it takes a bit of work to extract characters. (I previously wrote about the format of UTF-8, halfway down that page.) Accessing a character by byte offset just won't work, the default substring and length functions count bytes instead of characters, and you can't loop through the indexes of a string the way you may be accustomed to. Second, in many situations it's more appropriate to think in graphemes than characters. (Case study with another Carleton programmer.) Processing individual characters may not always do what you want. (See the first sketch after this list.)

    An immediate consequence of this is that byte-oriented regular expressions do not work on multi-byte characters. Additionally, when using Unicode, regular expressions must not use old patterns like [a-zA-Z] to match alphabetic characters, as this fails once more diverse text is considered. More subtly, many regular expressions implicitly depend on assumptions like 'words are separated by blanks or punctuation', which are inappropriate in general.

  2. Alphabetical order. I have yet to write about this in detail, but alphabetical order is much more complicated than you may think. Simple programs which are only used with ASCII often get away with sorting in code point order, after converting the string to a consistent capitalization. But in general, this doesn't work: in this scheme, a with an accent mark sorts after z. Due to the complicated, locale-dependent tie-breaking rules governing accent marks, ligatures, contractions, capitalization, and punctuation, the comparison operators most programming languages provide by default (usually code point order) are not appropriate for international applications. For example, in American English, an a-e ligature sorts like an a followed by an e, and 'café' sorts in between 'cafe' and 'cafes'. The Unicode Collation Algorithm (UCA) sorts this all out. (See the second sketch below.)

  3. Word breaking and search. When considering all of Unicode, it is inappropriate to assume that words are separated by blanks. For example, Japanese is written without spaces between words, and adjacent Kanji should be treated as separate words. This has implications when searching by whole-word match. For insensitive search, it often makes sense to use part of the collation key, which I'll describe later. (See the third sketch below.)

  4. Uppercase, lowercase, and titlecase. With a totally naive approach, capitalizing a UTF-8 string the way you capitalize an ASCII string, only the letters in the ranges A-Z and a-z are transformed properly, and all other characters are ignored. (This is half lucky accident, half by design of UTF-8.) Apparently, PHP has some support for UTF-8 in some of its string functions, including case transformations, but by default only the ASCII transformation is done. (It's possible to do everything except the Turkish, Azeri, and Lithuanian case operations using the same function, but PHP requires an explicit change of locale.) (See the fourth sketch below.)
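
First, a sketch of the character-extraction and regex problems from item 1. The mb_* functions come from the mbstring extension, which is an assumption here; it isn't enabled in every PHP install:

    <?php
    $s = "café";                        // 4 characters, 5 bytes in UTF-8
    echo strlen($s);                    // 5 -- bytes, not characters
    echo substr($s, 0, 4);              // "caf\xC3" -- slices the é in half
    echo mb_strlen($s, 'UTF-8');        // 4 -- character-aware
    echo mb_substr($s, 0, 4, 'UTF-8');  // "café"
    // Byte-oriented regexes have the same problem. PCRE's /u modifier
    // switches to UTF-8 mode, and \p{L} matches any Unicode letter
    // where [a-zA-Z] matches only ASCII:
    var_dump(preg_match('/^[a-zA-Z]+$/', $s));  // int(0) -- é rejected
    var_dump(preg_match('/^\p{L}+$/u',  $s));   // int(1)
    ?>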
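
Second, code point order versus collation, from item 2. The Collator class is ICU's collator as exposed by the intl extension, which, again, is an assumption and not part of a default install:

    <?php
    $words = array('cafes', 'café', 'cafe');
    sort($words);       // byte (code point) order
    print_r($words);    // cafe, cafes, café -- é lands after z
    // A UCA-based collator gives the order a reader expects:
    $coll = new Collator('en_US');
    $coll->sort($words);
    print_r($words);    // cafe, café, cafes
    ?>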
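
Third, word breaking, from item 3: any logic built on "split at blanks" finds nothing to split in Japanese.

    <?php
    // "Japanese-language text", written, as usual, without spaces:
    $ja = "日本語のテキスト";
    print_r(explode(' ', $ja));  // one "word": the whole string
    // A plain substring search still hits, but a whole-word matcher
    // that looks for surrounding blanks never can:
    var_dump(strpos($ja, 'テキスト') !== false);  // bool(true)
    ?>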
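
And fourth, case transformation, from item 4. mb_strtoupper again assumes mbstring, and it applies the Unicode default mappings, so the Turkish, Azeri, and Lithuanian special cases still need separate, locale-specific handling:

    <?php
    $s = "café";
    echo strtoupper($s);              // "CAFé" -- under the default locale,
                                      // only the ASCII letters change
    echo mb_strtoupper($s, 'UTF-8');  // "CAFÉ"
    ?>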


But if you don't use any of these aspects of text manipulation, your program should, for the most part, be fine without explicit Unicode support. (Of course, if you do something like write a rich AJAXy text editor, more problems will show up.)

This isn't to say that it's OK to ignore Unicode, though; most complex programs will soon run into one of these issues. But I still think this is an interesting case study. I feel like I've said this a million times, but the best way to avoid running into any of these issues is to have them solved correctly in the standard library. That way, all string manipulations are Unicode-aware, and more complicated things like breaking properties are easy to access on all systems.


Note about databases: Most modern SQL databases support Unicode in some way or another, by which I mean they can store strings in particular Unicode encodings. MySQL 4.1 looks like it has reasonably good Unicode support, including several encodings (with some non-Unicode ones, too) and collation based on the UCA, with support for tailoring the order to specific locales. Unfortunately, it looks like only characters in the Basic Multilingual Plane (BMP) are supported. Note that when storing UTF-8 text in a database Unicode-blind, a string might take up more characters of column width than it displays, because some characters occupy multiple octets.
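
A sketch of what opting in looks like with the mysql_* API of the day, assuming an open connection in $link (the charset and collation names are real MySQL 4.1 identifiers):

    <?php
    // Tell MySQL the connection sends and expects UTF-8...
    mysql_query("SET NAMES 'utf8'", $link);
    // ...and declare a column with a UCA-based collation, so the
    // database compares characters rather than raw bytes:
    mysql_query("CREATE TABLE users (
                     name VARCHAR(32)
                          CHARACTER SET utf8
                          COLLATE utf8_unicode_ci
                 )", $link);
    // MySQL's utf8 stores at most three octets per character: BMP only.
    ?>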
