Tuesday, June 26, 2007

The incomplete, would-be Unicode implementer's guide

If you're crazy, you might think, sometime in your life, "Hey, I need Unicode for something. How about I write it myself?" That's what I thought when I was almost finished with an XML parser and realized it would not be conformant unless it allowed UTF-8 input. Bad idea. You should know that nearly everyone (except for Microsoft, Squeak, and Larceny Scheme) uses a library called International Components for Unicode (ICU). ICU, a project started by IBM, stores strings in UTF-16 in memory, supports about a billion encodings for I/O, and implements all of the Unicode algorithms, as far as I can tell. ICU's code is written in C and C++, is very easy to read, and is under a liberal BSD-style license.

But Factor's just too special for ICU. We like our strings to be manipulable as random access arrays. This lets them conform to the sequence protocol. Most programming languages using ICU (Python, Java, etc.) allow this, but it is erroneous. When you treat a UTF-16 string as an array, code points above 16 bits will inevitably be treated wrong. As the Java documentation for String.charAt() states, "If the char value specified by the index is a surrogate, the surrogate value is returned." Programmers should not treat strings as arrays. Instead, they should iterate through them using special functions which are aware of high-bit characters. In my opinion, this breaks abstraction. Programmers shouldn't have to care whether they're using Unicode or not; that's the Unicode library's concern. In practice, many programmers will only support the Basic Multilingual Plane (the first 2^16 code points) properly, as the other planes come up far less frequently and rarely cause bugs.

But we're doing Unicode right. All strings which can't be contained in ASCII will be stored in UTF-32 in memory, in Normalization Form D. This is the approach Squeak takes, and it allows correct random access in strings. Anyway, your Unicode implementation doesn't have to take this path, but mine does. As ICU supports only UTF-16 for internal representation, I had no choice but to write my own implementation. I don't need to do any display, but there are a few things that I do need: UTF-8/16 I/O, normalization, case conversion, grapheme traversal, collation, name to character conversion, character classes, and, maybe in the future, BIDI. I couldn't find any introductory articles about this stuff, so hopefully this will serve as one.

The first thing you have to do, when implementing these Unicode algorithms, is be willing to read the standard. For XML, you can get a good deal done without reading the standard, going on your existing understanding of XML. And the XML standard is about the most boring thing I've ever read, filled with legalistic pronouncements of what an XML parser SHOULD and MUST do, with helpful EBNF line noise. Compared to the XML standard, the Unicode standard is Shakespeare. It has lots of informative introductions, explanations and algorithm descriptions. The most important parts of it, for us implementors, are chapters 2, 4 and 5, Unicode Standard Annexes #15 and #29, and Unicode Technical Standard #10. There are many, many more documents, and it can be hard to sort through them all to find which ones are relevant. But those are some good starting points.

In order from easy to hard, here are the main Unicode things that I needed/need to implement.

To start out: UTF-8, UTF-16 and UTF-32 are the main ways that Unicode is stored in files and memory and transmitted between computers. There are other techniques, but very few people use them. They are called "encodings" because they encode code points (characters) in ones and zeroes, each with different bit packing mechanisms. UTF-8 stores code points in one of the following ways, where xxxxx... are the binary digits of the code point:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

As you can see, this is a variable width encoding. Depending on how large the code point is, it takes a different number of bytes. This is nice for European languages, but the format is somewhat complicated to manipulate directly and inefficient for East Asian languages. In UTF-16, the format is one of the following:

xxxxxxxx xxxxxxxx
110110yy yyyyyyyy 110111yy yyyyyyyy

where yyyy... is xxxx-2^16, that is, the code point minus 2^16. The y value always fits in 20 bits, because Unicode code points go from 0 to 17*2^16-1. UTF-32 is much simpler, with only one format:

00000000 000xxxxx xxxxxxxx xxxxxxxx

But it's not really used for anything except in-memory storage, and even there it's relatively rare. But it is what I'm using in Factor.

Chapter 2 of the Unicode Standard goes into more detail about what's acceptable and correct here, but this is basically all you need to know. If you haven't done anything with bitshifts and masks before (like me), it can be a bit scary, but it's not too bad.
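Just to pin down the bit twiddling, here's a sketch of encoding a single code point in UTF-8 and UTF-16. It's in Python rather than Factor, and it's my own illustration of the bit patterns above, not anyone's production code:

```python
def utf8_encode(cp):
    # Pack a code point into 1-4 bytes following the patterns above.
    if cp < 0x80:
        return bytes([cp])
    elif cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    elif cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

def utf16_encode(cp):
    # Code points above the BMP become a surrogate pair: subtract
    # 2^16, then split the remaining 20 bits across two 16-bit units.
    if cp < 0x10000:
        return [cp]
    y = cp - 0x10000
    return [0xD800 | (y >> 10), 0xDC00 | (y & 0x3FF)]
```

Decoding is just these steps run backwards, plus the error checking that chapter 2 requires (rejecting overlong forms, stray surrogates, and so on).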

To do anything beyond plain old encoding and decoding UTF-8/16, you need to look at some of the Unicode resource files. The most important of these files is called UnicodeData.txt, whose format is described here. This huge text file (the Unicode folks refer to it as a database) contains a line of information about every single Unicode code point. It is extremely simple to parse (it's pure ASCII, with values separated by semicolons), but processing is another matter. The first field of every line is the code point in hex, the second is the name, the third is the category, and so on. The ICU folks apparently compiled the database into a few huge header files, but my strategy was to parse it at compile time, building a number of tables from the included data.
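For illustration, the parsing amounts to something like this Python sketch; the tuple layout is my own choice, and I'm only pulling out the first four fields:

```python
def parse_line(line):
    # Fields: code point (hex), name, general category,
    # canonical combining class, and more we ignore here.
    fields = line.split(";")
    return int(fields[0], 16), fields[1], fields[2], int(fields[3])

def parse_unicode_data(lines):
    # One entry per line of UnicodeData.txt, keyed by code point.
    return {cp: (name, cat, cc)
            for cp, name, cat, cc in map(parse_line, lines)}
```

Everything below (names, categories, combining classes, case mappings) is just a different table built from these same lines.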

Using this data table, the simplest thing to make is a conversion from Unicode name to character. This allows something like CHAR: black-four-pointed-star to be equivalent to HEX: 2726 or "\{limbu letter ra}bc" to be equivalent to "\u1916bc". This makes it easier to understand certain string or character literals which would otherwise be unreadable numbers. The way to do this is just to look at the UnicodeData.txt file, pick out the first two fields, parse the first field as a hex number, and construct an associative array from the second field (the name) to the first field (the code point). Then, code points can be looked up quickly given their names. For my library, I applied a little transformation to the names (putting them in lower case and replacing spaces with dashes) to give a more Factor-ish feel. More commonplace is to simply replace the spaces with underscores. This table takes up a good deal of memory, but it will not be included in most programs because it's only needed at parse time.
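The whole table amounts to something like this Python sketch, where factorify is my name for the lower-case-and-dashes transformation described above:

```python
def factorify(name):
    # Lower case, spaces to dashes, for a more Factor-ish feel.
    return name.lower().replace(" ", "-")

def name_table(lines):
    # Map transformed name -> code point, from raw UnicodeData.txt lines.
    table = {}
    for line in lines:
        fields = line.split(";")
        table[factorify(fields[1])] = int(fields[0], 16)
    return table
```

After that, resolving CHAR: black-four-pointed-star is a single dictionary lookup.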

Just slightly higher on the complexity scale is character classes. I discussed the high-level interface to these in a previous post, but I did not discuss how I stored them. There are only 29 character classes (or 30, if you count Cn, the non-class), so they can be stored in 5 bits. However, it's a lot easier and faster to store them in a byte array, at 8 bits each, and not too space-inefficient. A byte array long enough to hold the first three planes (the BMP, the Supplementary Multilingual Plane (SMP) and the Supplementary Ideographic Plane (SIP), the only ones defined yet) would take about 200kb. On a modern computer, this is hardly significant, and it should be possible to compress it further for more limited settings.
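The byte-array scheme looks roughly like this in Python. The ordering of the CATEGORIES list (with index 0 reserved for Cn, so unassigned code points fall out naturally) is my own convention, not Factor's actual encoding:

```python
CATEGORIES = ["Cn", "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me",
              "Nd", "Nl", "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf",
              "Po", "Sm", "Sc", "Sk", "So", "Zs", "Zl", "Zp",
              "Cc", "Cf", "Co", "Cs"]

def build_category_table(pairs):
    # pairs: (code point, category) from UnicodeData.txt.
    # One byte per code point for the first three planes, ~192kb;
    # unlisted code points stay 0, i.e. Cn.
    table = bytearray(3 * 0x10000)
    for cp, cat in pairs:
        if cp < len(table):
            table[cp] = CATEGORIES.index(cat)
    return table

def category(table, cp):
    return CATEGORIES[table[cp]]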

Grapheme traversal is just a little harder, but still well into the easy category. You might be confused by the word "grapheme." (Is that even a word? I'm not sure.) By "grapheme" I mean "a character plus all combining [accent] marks following it." It is easy to confuse characters with graphemes, and the Unicode standard has a lot of clarification and explanation if my definition didn't make enough sense. But sometimes we have to divide text up into these graphemes. As a simple, useless example, reversing a string should reverse the graphemes, not the characters. If you reverse the characters individually, the combining marks will go to the wrong character. For something more useful, imagine moving the cursor around a text input box. The cursor should move by graphemes, not characters. So, to implement this, we need to know about something called the combining class of characters. The combining class is in the fourth field of UnicodeData.txt. If that is non-zero, then the character is a combining mark. To iterate through graphemes, just keep moving through the string until you get to a combining class of zero--the start of the next grapheme. To make things easier, I wrote a word to convert a string into an array of strings, each of which is a grapheme. This takes more memory, but is easier to process. Grapheme traversal, as well as word separation and other similar things, is discussed in UAX #29 (which I haven't read yet).

[Update: Well, this is what I thought when I wrote it. But, when I went back and actually read UAX 29, I realized that it is actually a bit more involved. So I blogged about it later in much more depth.]
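For the record, the naive combining-class rule can be sketched like this in Python, with unicodedata.combining standing in for the fourth field of UnicodeData.txt. As the update says, full UAX #29 clustering has more rules than this (for instance, never breaking between CR and LF):

```python
import unicodedata

def graphemes(s):
    # Naive rule: a character with non-zero combining class
    # attaches to the grapheme before it.
    out = []
    for ch in s:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out

def reverse_graphemes(s):
    # Reversing graphemes keeps each accent on its base character.
    return "".join(reversed(graphemes(s)))
```

So "e" + U+0301 (combining acute) stays together as one grapheme, and reversing a string never strands the accent on the wrong letter.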

With case conversion, it starts to get a little messy. Upper, lower and title case of all applicable Unicode code points are listed in the UnicodeData.txt file in fields 13-15. But in Unicode, upper case is not as simple as

: >upper ( str -- upper )
[ ch>upper ] map ;

as the Factor strings library currently defines. Some characters, when converted to another case, become two characters. For example, the German eszett (ß) becomes SS when capitalized. There are actually several cases like this. This means that any ch>upper function must be at least partly wrong, unless the character set is severely limited. That's not all: there are also context-dependent casing modifications for Greek and Lithuanian, as well as locale-dependent modifications for Turkish and Azeri. These special cases are listed in another data file, SpecialCasing.txt. This file has a similarly simple format, which is documented in the same place UnicodeData.txt is. Case conversion is specified in the Unicode Standard, chapter 5. A confession: though it shouldn't be hard, I still haven't gotten around to properly implementing case conversion for Lithuanian, Turkish or Azeri. But there have been no complaints so far!
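The shape of a fixed implementation is a per-character map through a SpecialCasing-style table, falling back to the simple one-to-one mappings. This Python sketch uses a tiny hand-picked excerpt of SpecialCasing.txt, and leans on the fact that the result string may be longer than the input:

```python
# Two of the one-to-many mappings from SpecialCasing.txt
# (a tiny excerpt; the real file has many more, plus the
# context- and locale-dependent rules).
SPECIAL_UPPER = {"\u00DF": "SS",   # eszett
                 "\uFB01": "FI"}   # fi ligature

def full_upper(s, simple_upper):
    # simple_upper: the one-to-one mapping from UnicodeData.txt.
    return "".join(SPECIAL_UPPER.get(ch, simple_upper(ch)) for ch in s)
```

Because the output can grow, this has to be a string-to-string operation; no character-to-character ch>upper can express it.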

Hopefully, in a follow up, I'll describe where it gets harder: normalization, collation, and BIDI. But I've written enough for today.

Update: I described normalization forms here.


Michael said...

Great post (along with your other one about Unicode)! I never thought this topic could be so interesting, but you've inspired me to go out and start reading all the Unicode docs I can get my hands on, along with your code (though I'm very new to Factor).

I've found one minor bug that you may not be aware of as you haven't read UAX #29 yet. It seems that there should never be a grapheme cluster break between a CR (U+000D) and LF (U+000A).

I hope you've found this helpful, and keep up the good work (I can't wait to read your thoughts on collation, etc.). Perhaps after I get more experience under my belt I'll be able to contribute to your implementation.

Daniel Ehrenberg said...

Thanks for commenting and pointing that out; I didn't realize I made such a careless oversight. I'll change that soon.

If you feel like contributing, it'd be great if you read up on BIDI, since I haven't started looking at that in detail yet, and it seems mostly orthogonal to what I've done so far. But in general, the code I've written so far isn't high enough quality to collaborate with someone on it...