Saturday, February 10, 2007

Doing Unicode right, part 1

When I found out that Factor didn't really support Unicode, I decided to implement it myself. One goal is to do Unicode right, where other programming languages have terrible bugs and expose programmers to surrogate pairs, casing complications and other quirks of Unicode. My other goal is to spend absolutely no money doing it. Unicode is a free standard, so it should not cost anything to get the necessary information about it. I started reading Unicode Explained every time I went to Barnes and Noble, so I could figure out how to accomplish it. This shouldn't be too hard, should it? It's only a character encoding...

It wasn't hard to get a finite state machine working to decode UTF-8 and UTF-16, and it wasn't hard to write an encoder either, using make (though I may refactor that out). But then I decided to do case conversion, opening a can of worms. After a bit of research (there is really nothing on the internet about this outside of the standard itself) I found out that unicode.org hosts a web-accessible text database of Unicode characters in a file called UnicodeData.txt, including their upper and lower case forms. Since I was doing Unicode right and everything, I decided to use Unicode 5.0, not 2.0 like most things do. And to make future transitions easy, I read all the data directly from these files at load time, without any code generation (which is what I did for libs/xml/char-classes.factor) and with minimal hard coding.
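As a rough illustration of reading the simple case mappings straight out of UnicodeData.txt, here is a minimal sketch in Python rather than Factor. It is not the code in libs/unicode. It assumes the semicolon-separated layout of UnicodeData.txt, where field 0 is the code point and fields 12 and 13 are the simple uppercase and lowercase mappings; the function names and the three sample lines are made up for the example.

```python
# Sample lines in the UnicodeData.txt layout (15 semicolon-separated fields).
SAMPLE_UNICODE_DATA = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;",
]


def parse_simple_case_mappings(lines):
    """Build code point -> code point tables for simple upper/lower casing."""
    upper, lower = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        if fields[12]:  # simple uppercase mapping, empty if none
            upper[code] = int(fields[12], 16)
        if fields[13]:  # simple lowercase mapping, empty if none
            lower[code] = int(fields[13], 16)
    return upper, lower


def simple_upper(text, upper):
    """Apply the 1:1 uppercase table; unmapped characters pass through."""
    return "".join(chr(upper.get(ord(ch), ord(ch))) for ch in text)


upper_table, lower_table = parse_simple_case_mappings(SAMPLE_UNICODE_DATA)
print(simple_upper("a", upper_table))
```

Note that the sharp s has no entry in the 1:1 table at all, which is exactly where the trouble starts.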

For most characters, this maps cleanly 1:1, but for just a few (maybe 100) there are multi-letter translations for upper or lower case. These are defined in a separate file called SpecialCasing.txt. For example, in German, "Fuß" >upper ==> "FUSS". Though this is a little difficult, it ultimately worked out fine. This is about as far as I've gotten. For a small subset of these (around 15) there are conditional mappings: a word ending in a sigma, when made into lower case, must end in a final sigma; in Lithuanian, Azeri and Turkish, there are oddities with dots on I (and sometimes J). Some of the mappings are duplicated between the Turkish/Azeri section and the main UnicodeData.txt, and I have no idea why. Additionally, the file format of SpecialCasing.txt changed between 4.0 and 5.0, and I can't figure out what the new semantics are. The strangest thing, actually, is that the Lithuanian, Turkish and Azeri mappings are locale-dependent; that is, the language of the text must be known. But the whole point of Unicode is to allow multilingual texts. How is this compatible?
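To make the multi-letter and locale-dependent cases concrete, here is a hedged Python sketch (again, not the Factor code). It assumes the SpecialCasing.txt line layout of code; lower; title; upper; optional condition list; followed by a # comment. The sample lines, the function names, and the ASCII-only SIMPLE_UPPER fallback (a stand-in for the table built from UnicodeData.txt) are all assumptions for illustration. Context conditions such as Final_Sigma are recognized but deliberately skipped, since only language conditions are handled here.

```python
import string

# Sample lines in the SpecialCasing.txt layout.
SAMPLE_SPECIAL_CASING = """\
# <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
"""

# Stand-in for the 1:1 table from UnicodeData.txt (ASCII only).
SIMPLE_UPPER = {ch: ch.upper() for ch in string.ascii_lowercase}


def hex_seq(field):
    """Turn a field like '0053 0053' into the string 'SS'."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(text):
    """Return a list of (char, lower, title, upper, conditions) tuples."""
    entries = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        conditions = fields[4].split() if len(fields) > 4 else []
        entries.append((hex_seq(fields[0]), hex_seq(fields[1]),
                        hex_seq(fields[2]), hex_seq(fields[3]), conditions))
    return entries


def full_upper(text, entries, lang=None):
    """Uppercase using special mappings first, then the 1:1 fallback."""
    out = []
    for ch in text:
        replacement = None
        for code, _lower, _title, upper, conditions in entries:
            if code != ch:
                continue
            # Context conditions (Final_Sigma etc.) start with a capital
            # letter; this sketch does not evaluate them, so skip the entry.
            if any(cond[0].isupper() for cond in conditions):
                continue
            # Remaining conditions are language tags; all must match lang.
            # An empty condition list means the mapping is unconditional.
            if all(cond == lang for cond in conditions):
                replacement = upper
        out.append(SIMPLE_UPPER.get(ch, ch) if replacement is None
                   else replacement)
    return "".join(out)


entries = parse_special_casing(SAMPLE_SPECIAL_CASING)
print(full_upper("Fuß", entries))
print(full_upper("i", entries, lang="tr"))
```

The lang parameter is where the locale problem shows up: the same string uppercases differently depending on a language tag that the text itself does not carry.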

I would look at the Unicode standard itself to resolve these issues (there's not exactly a big Unicode developer community I can tap into), but the most recent version available online is 4.0, and the format of SpecialCasing.txt is incompatible. The website lists the table of contents and says that links to PDFs will be made as soon as the standard is put up on the website. But it's been two months since the book was released, and these are computer people: they should be capable of getting it on the website. Buying the book (or having someone else buy it for me) is out of the question; that would not be in keeping with my second goal.

To see my progress so far, look in libs/unicode in a Factor distribution. In the next update, I'll discuss Unicode equivalence forms and why they make no sense at all.
