Tuesday, September 4, 2007

Strings are now complicated

It's too late to turn back. Now that the Unicode floodgates have been opened, and we're trying to support every script by default, string processing has gotten a bit more complicated.

On my pre-frosh orientation trip, I met another kid who enjoys programming. He told me about a little PHP script he wrote called Synesthesia. From his description, I imagine it goes something like this. First, generate a CSS file with classes named c0 through c255, each with a random text color. Then, take an input text file and generate an HTML document from it. Convert each character to lower case, and wrap it in a span whose class is "c" plus the character's ISO Latin-1 (ISO 8859-1) number. The lowercasing is done so that "t" and "T" get the same color. The effect of this program is really cool: each character is assigned a random color, making interesting designs and patterns out of plain text.
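
Here is a rough sketch of how I imagine it, in Python rather than PHP. I haven't seen the original script, so the function names (make_css, colorize_latin1) and the choice to map characters outside Latin-1 to "?" are my own assumptions.

```python
import html
import random

def make_css(seed=None):
    """Build CSS rules for classes c0 through c255, each with a random color."""
    rng = random.Random(seed)
    rules = []
    for n in range(256):
        rules.append(".c%d { color: #%06x; }" % (n, rng.randrange(0x1000000)))
    return "\n".join(rules)

def colorize_latin1(text):
    """Wrap each lowercased character in a span keyed by its Latin-1 byte."""
    spans = []
    for ch in text.lower():
        # Assumption: characters with no Latin-1 encoding become "?" (63).
        code = ch.encode("latin-1", errors="replace")[0]
        spans.append('<span class="c%d">%s</span>' % (code, html.escape(ch)))
    return "".join(spans)
```

Calling colorize_latin1("Tt") gives two spans with the same class, c116, which is the whole point of the lowercasing.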

Now, how can we make this work with every language? It gets just a little bit more difficult. First of all, before doing anything, the string should be normalized. The Unicode standard says that a conformant process must not assume that two canonically equivalent strings are different, so in practice they should all be treated the same way. The easiest way to do this is to put the string in a consistent form by normalizing it, for example to NFC or NFD.
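
To make that concrete, here's a small sketch using Python's standard unicodedata module. Picking NFC is just my choice for the example; any normalization form works as long as it's applied consistently.

```python
import unicodedata

def normalize(text):
    """Put text into one consistent form (NFC here, an arbitrary choice)."""
    return unicodedata.normalize("NFC", text)

composed = "\u00e9"      # e with acute as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The raw strings differ, but they are canonically equivalent,
# so they compare equal once both are normalized.
assert composed != decomposed
assert normalize(composed) == normalize(decomposed)
```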

When coloring a letter, we should be thinking about graphemes, not code points. Code points are the 21-bit basic units of Unicode text, whereas a grapheme is a unit of display, for example, an e with an accent mark over it, or a Hangul syllable. So we want to color units of display, not, say, just an accent mark. Iteration through graphemes is a feature of all sufficiently useful Unicode libraries. We can put the span around the whole grapheme, not just one code point.
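
Python's standard library doesn't have a grapheme iterator, so the sketch below is only an approximation: it normalizes to NFC and then attaches any combining mark to the character before it. Real grapheme segmentation follows the rules in Unicode Standard Annex #29 and handles more cases than this; the function name approximate_graphemes is my own.

```python
import unicodedata

def approximate_graphemes(text):
    """Split text into rough display units: a character plus trailing marks.

    This is a simplification, not the full UAX #29 algorithm.
    """
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        # General categories Mn, Mc and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

With this, "x" followed by a combining acute accent comes out as one unit, so the span would wrap both code points together, and three conjoining Hangul jamo that NFC composes into one syllable come out as a single unit too.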

When all of Unicode is considered, conversion to lower case isn't enough to cancel out all of the unimportant differences between similar graphemes. A better way to do it is to take the first primary collation weight in the grapheme (skipping ignorable characters) and use that as the identity for coloration. Just one little problem: there are, potentially, 65536 primary collation weights. That's a pretty big CSS file, so, where possible, CSS classes should be generated only for the weights that are actually used.
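
Real primary weights come from the Unicode Collation Algorithm's tables, which Python's standard library doesn't ship. As a stand-in, the sketch below uses a much cruder key: casefold the grapheme, decompose it with NFKD, and take the first code point that isn't a combining mark. This is not a collation weight, just something that ignores case and accents in a similar spirit. The names rough_primary_key and css_for_used_keys, and the color_for callback, are made up for this example.

```python
import unicodedata

def rough_primary_key(grapheme):
    """Crude stand-in for a primary collation weight (NOT the real UCA value)."""
    decomposed = unicodedata.normalize("NFKD", grapheme.casefold())
    for ch in decomposed:
        if not unicodedata.category(ch).startswith("M"):
            return ord(ch)
    return None  # nothing but combining marks

def css_for_used_keys(graphemes, color_for):
    """Emit CSS rules only for the keys that occur in the input."""
    keys = (rough_primary_key(g) for g in graphemes)
    used = sorted({k for k in keys if k is not None})
    return "\n".join(".c%d { color: %s; }" % (k, color_for(k)) for k in used)
```

Here "e" and "É" both map to the same key, so they get the same color, and a text that uses only a few dozen distinct letters produces only a few dozen CSS rules.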

If you don't follow this (which is likely, as I haven't discussed collation in detail yet, and Unicode is confusing), I don't blame you. Unicode-aware text processing is complicated. While the Unicode Consortium has made sure that less intelligent programs can still be called conformant, those programs might not be as good or as readily internationalizable (is that a word?) as they could be.

Of course, it's ridiculous to expect programmers to be aware of all of this. That's why it's extremely important to have a good, easy-to-use standard library for Unicode, which abstracts as many things as possible away from the programmer. The optimal case, which I'm working towards, is to have nearly everything be Unicode-aware by default.


Anonymous said...


You are supposed to be off for a month!!!!!!


Bruce Rennie
(God's Own Country Downunder)

Slava Pestov said...

Bruce: he's even been popping onto IRC!