Sunday, May 25, 2008

Unicode collation works now

This morning I got Unicode collation to pass all of the 130,000+ unit tests. It was a bit more difficult than I imagined, and it's still far from complete for a number of reasons. The whole thing is less than 200 lines (the core algorithm in about 100) in the unicode.collation vocab in the working version of Factor in git. Here's what I figured out:
  • Weird format of UnicodeData.txt It's not documented anywhere that I can find, but the UnicodeData.txt resource file has a special range format for assigning certain properties, including combining classes, which are used in collation. A range looks just like two regular lines, but the character names are special markers that have to be parsed. For example, lines that look like
    100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
    10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
    mean that all of the characters in the range U+100000 to U+10FFFD have the category Co, the combining class 0, and so on.
  • My normalization bugs Working on this uncovered a bunch of bugs in older code, including that my conjoining Jamo behavior inserted nonexistent terminator characters.
  • Triple contractions The UCA specifies that collation graphemes should be found incrementally, by checking whether the next adjacent character (or non-blocked combining character) forms a contraction with what has already been matched. But this incremental approach breaks on four of the contractions listed in the DUCET, which consist of three elements, not two, where the first two elements don't themselves form a contraction. So a simple identity contraction for the first two elements of each of those must be added.
  • Combining character contractions Apparently, two combining marks can form a contraction on their own. A straight reading of the UCA wouldn't predict this, but not all of the UCA tests pass unless you check for non-adjacent combining marks forming a contraction together, without a noncombining mark to start it off.
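To make the range format concrete, here's a sketch in Python (not the Factor code) of how those First/Last pairs can be expanded while parsing; the parsing logic is my own illustration, not the unicode.collation implementation:

```python
# Sketch (Python, not the Factor code) of expanding the First/Last
# range pairs in UnicodeData.txt. Each yielded record is
# (first, last, fields); single code points have first == last.

def parse_unicode_data(lines):
    pending = None  # fields of a "<..., First>" line, awaiting its "Last"
    for line in lines:
        fields = line.strip().split(";")
        code = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            pending = (code, fields[2:])
        elif name.endswith(", Last>"):
            first, first_fields = pending
            pending = None
            yield (first, code, first_fields)
        else:
            yield (code, code, fields[2:])

lines = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;0061;;0061",
    "100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;",
    "10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;",
]
records = list(parse_unicode_data(lines))
# records[1] covers the whole range U+100000..U+10FFFD with category Co.
```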
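And here's a rough Python sketch (again, not the Factor code) of the non-blocked contraction matching described in the last two points; the contraction table and combining classes are toy data for illustration, not the DUCET or the real UCD values:

```python
# Rough sketch of finding a collation grapheme by extending a starter
# with non-blocked combining marks. Toy data, not the DUCET.

CONTRACTIONS = {("a", "\u0301")}          # pretend a + acute contract
COMBINING_CLASS = {"\u0300": 230, "\u0301": 230, "\u0327": 202}

def is_blocked(text, consumed, j):
    # text[j] is blocked if some unconsumed character between the starter
    # and j is itself a starter or has the same or higher combining class.
    ccc_j = COMBINING_CLASS.get(text[j], 0)
    for b in range(consumed[0] + 1, j):
        if b in consumed:
            continue
        ccc_b = COMBINING_CLASS.get(text[b], 0)
        if ccc_b == 0 or ccc_b >= ccc_j:
            return True
    return False

def find_contraction(text, i):
    # Greedily extend the starter at text[i] with any later, non-blocked
    # combining mark that continues a contraction in the table.
    matched = (text[i],)
    consumed = [i]
    for j in range(i + 1, len(text)):
        if COMBINING_CLASS.get(text[j], 0) == 0:
            break  # a new starter ends the search
        if is_blocked(text, consumed, j):
            continue
        if matched + (text[j],) in CONTRACTIONS:
            matched += (text[j],)
            consumed.append(j)
    return matched

# a + cedilla + acute: the acute still contracts with a, because the
# intervening cedilla has a lower combining class and doesn't block it.
# a + grave + acute: the grave (same class as the acute) blocks it.
```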

And here's what I still have to do:
  • Korean stuff Because of some disagreement with the ISO people, the DUCET doesn't actually decide the best way to sort Korean. Instead, they provide three methods, all of which require modifying the table. I don't really understand the issue right now.
  • Tailoring for locales Actually, heh, the default algorithm is inaccurate for any specific locale you might be in. And for human interfaces, it's actually pretty important that the sort order corresponds to expectations. So if you want sorting that's correct for a particular locale, you have to modify the data. Unfortunately, the standard doesn't go into the specific algorithms for tailoring the table, though tailoring data is available through the Common Locale Data Repository (CLDR).
  • Efficiency My implementation is both time- and space-inefficient, because I paid absolutely no attention to either; solving the basic problem is hard enough (for me). Collation keys should be made shorter, and they should be built in fewer passes over the string.

Update: Here's an overview of the words that are defined in the vocabulary. It's about the minimum that any UCA implementation should have, in my opinion:
  • sort-strings This word takes a sequence of strings and sorts them according to the UCA, using code point order as a tie-breaker.
  • collation-key This takes a string and gives a representation of the collation key, which can be compared with <=>.
  • string<=> This word takes two strings and compares them using the UCA with the DUCET, using code point order as a tie-breaker.
  • primary= This checks whether the first level of collation is identical. This is the least specific kind of equality test. In Latin script, it can be understood as ignoring case, punctuation and accent marks.
  • secondary= This checks whether the first two levels of collation are equal. For Latin script, this means accent marks are significant again.
  • tertiary= Along the same lines as secondary=, but case is significant.
  • quaternary= This is a little less typical (and, unlike the others, definitely non-essential), and the name is my own nomenclature, but it makes punctuation significant again, while still ignoring things like null bytes and Hebrew vowel marks, which mean absolutely nothing in collation.
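To illustrate how the level-based equality words fall out of the collation key structure, here's a Python sketch (not the Factor vocabulary) with made-up weights; the three-level structure follows the UCA, but the weight values and table are purely illustrative. sort-strings and string<=> then amount to comparing these keys, with code point order as a tie-breaker.

```python
# Sketch of level-based collation equality. Weights are made up for
# illustration; a real implementation uses the DUCET.

# char -> (primary, secondary, tertiary) collation weights
WEIGHTS = {
    "a":      (0x15EF, 0x0020, 0x0002),
    "A":      (0x15EF, 0x0020, 0x0008),  # same letter, different case
    "\u00e1": (0x15EF, 0x0032, 0x0002),  # á: same letter, extra accent weight
    "b":      (0x1605, 0x0020, 0x0002),
}

def collation_key(s, levels=3):
    # All level-1 weights, then all level-2, then all level-3, with a 0
    # terminating each level so shorter strings sort first within a level.
    key = []
    for level in range(levels):
        key.extend(WEIGHTS[c][level] for c in s)
        key.append(0)
    return tuple(key)

def primary_eq(s, t):
    return collation_key(s, 1) == collation_key(t, 1)

def secondary_eq(s, t):
    return collation_key(s, 2) == collation_key(t, 2)

def tertiary_eq(s, t):
    return collation_key(s, 3) == collation_key(t, 3)

# Sorting by key compares all primaries before any secondary or tertiary:
sorted(["b", "\u00e1", "A", "a"], key=collation_key)
# → ["a", "A", "\u00e1", "b"]
```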


Doug Felt said...

I just ran across this by accident, and I was interested that you're working on implementing Unicode support. Have you looked at the ICU project? It implements the Unicode algorithms in C and uses the most current Unicode data. But it might be too large for what you want to do.

Daniel Ehrenberg said...

Yes, I'm aware of ICU. The reason I started reimplementing what's in ICU is to use a fixed-width encoding internally rather than UTF-16, since that makes the user's view of strings simpler. Also, this is partly an experiment to see whether all of this can be done cleanly (ICU is not very clean or readable code, in my opinion).