Thursday, December 13, 2007

Alphanum sort and Unicode

So, I'll just be the Unicode troll here and say, all of your solutions to the problem of sorting alphanumeric file names are insufficient. To recap, if you haven't been following Reddit: it's annoying how, when looking at a directory, the file named z10.txt is sorted between z1.txt and z2.txt, and we programmers should come up with a solution. The gist of the solution is that you split up the number into a list of alphabetical and numeric parts, parse the numbers, and then sort it all.

What's wrong with that?

Every one of these implementations, including a slightly internationalized version (just for some Scandanavian locale, apparently), does not produce a sort the way humans expect it. Remember, in ASCII, capital Z comes before lower-case a. Jeff Attwood hinted at internationalization problems, and that's just part of the problem. We should use the Unicode Collation Algorithm (which I previously blogged about). The UCA provides a locale-dependent, mostly linguistically accurate collation for most locales (basically all that don't use Han ideographs).

The basic idea that we need here is that the UCA is based on a number of levels of comparison. This is useful not only in "international" locales, but also in United States English. First, we compare two strings with most features removed, then we add back in accent marks, and then capitalization, and then punctuation/spacing. So, even if we want "B" to come before "b", we can have "bench" come before "Bundle"--before comparing for capitalization, we compare for the letters themselves, and "bench" obviously precedes "bundle". For a better description of the UCA, see my past blog post.

So, the most basic way to put the UCA together with the Alphanum algorithm is to just use the same algorithm, except with the UCA to compare the strings rather than an ASCII comparison like most other implementations have done. But this would completely destroy the UCA's subtle level effect: even if "A" is sorted before "a", we would want "a1" to come before "A2". We also want "A1" to come before "a2", and we also don't want to ignore case completely ignore case; it's still significant.

The UCA solves this problem when we're working with "a1" and "A2", but not when we're comparing "a2" and "a10". For this, we need a slightly more complicated solution based on something like the Alphanum algorithm. But breaking the string up by itself won't allow for the layering kind of behavior that the UCA depends on.

What's a solution for this?

One way to fix this is to break the string up into its segments, right-pad the alphabetical strings and left-pad the numbers. We can just break things up into segments of consecutive digits and non-digits. The length that we'll pad to for the nth subsequence is the maximum of the lengths of the nth subsequences for all of the strings. For numbers, we can left-pad with plain old 0, but for strings, we want to
right-pad with something that's lower than any other character. This sounds hard to do, but by tailoring the DUCET (data table for the UCA), we can just define a character in a private use space as a new, special padding character. This padding character will be completely non-ignorable, fixed weight and have a lower primary, secondary and tertiary weight than any other character by definition.

OK, that sounds really complicated. Let's step back a minute and see how this padding could work out to provide functionality equivalent to the existing Alphanum algorithm. For this, we'll assume access to an ASCII-order sorting algorithm, and just figure out how to pad the string in the right way to come up with a suitable sort key. Instead of some new character, we can just use a null byte. So, if we have the strings "apple1b" and "the2345thing1", the fully padded strings should look like "apple0001b\0\0\0\00" and "the\0\02345thing1", where '\0' is the null byte.

I can make a simple Factor implementation by stealing Doug's cut-all code from his description of a solution to this problem.

: pad-seq ( seq quot -- newseq )
>r dup [ length ] map supremum r>
curry map ; inline

: pad-quote ( seq -- newseq )
[ "" pad-right ] pad-seq ;

: pad-number ( str -- newstr )
[ CHAR: 0 pad-left ] pad-seq ;

: pad-string ( str -- newstr )
[ 0 pad-right ] pad-seq ;

: pieces ( strings -- pieces )
[ [ digit? ] [ ] cut-all ] map pad-quote flip ;

: pad ( pieces ? -- ? newpieces )
[ pad-string f ] [ pad-number t ] if swap ;

: pad-strings ( strings -- newstrings )
pieces t swap [ swap pad ] map nip flip [ concat ] map ;

: alphanum-sort ( strings -- sorted )
dup pad-strings [ 2array ] 2map sort-values keys ;

This, by far, isn't the most concise implementation, but the advantage is that it can easily be adapted for a better collation algorithm.

But where's the real Unicode implementation of this?

I'm not sure where to implement this in terms of tailoring and hooking it up with the UCA. I'd either have to (a) go into C++ and do this with ICU (b) write my own Unicode library which has tailoring, and do it there. I'm too scared to do the first, and the second isn't done yet. I've looked at Java's tailoring, but I don't see how I could define a character that's lower than all other characters. I feel a little bit lame/trollish putting this blog post out there without a real implementation of the solution, but I hope I was able to expand on people's knowledge of the concerns of internationalization (yes, I'll write that word all out; I don't need to write i18n) in text-based algorithms.

1 comment:

Manfred said...

Most of the times the Unicode standard provides a reasonable general solution like the Collation Algorithm, I think those algorithms should be implemented in the standard library. It gives programmers a relatively usable quick solution.

Anything beyond that is so dependent on the requirements of the application that there is not much you can predict in my opinion. It all depends way to much on what you're sorting, in what language, in what specific dialect of the language and even on the wishes of the current user.

So basically: educate and enable.