Friday, May 23, 2008

Little things I've been working on

I've been working on a few different things that, individually, are too insignificant for a blog post, so I'll put them together.

One is, I expanded my previous survey of algorithms for XML diffing, and the result is here [PDF].

I've been working on a few libraries in Factor. One is extra/delegate, which now interacts properly with reloading. For example, if you define a protocol, then say that a class consults something for that protocol, and then redefine the protocol to include more generic words, the consulting class will be automatically updated. Unfortunately, this doubled the size of the code, give or take. Slava changed duplex-streams to use extra/delegate, and the code is much simpler now, as it previously amounted to manual delegation. I got rid of mimic because it's unsafe and violates encapsulation in unexpected ways.

Another little thing I made was extra/lcs, a library for calculating Levenshtein distance between two strings, the longest common subsequence of two strings, and an edit script between two strings. Because the LCS problem and Levenshtein distance are duals, I was able to share most of the code between them. I used the simple quadratic time and space algorithm that Wikipedia describes rather than the better O(nd) algorithm [PDF] commonly called the GNU diff algorithm. I'll upgrade it to this once I understand the algorithm. This replaces extra/levenshtein. I expected it to be significantly faster, because the old code used dynamically scoped variables and this uses statically scoped locals, but the speed improvement turned out to be just around 4% in small informal benchmarks on short strings.

Now, I'm working on the Unicode Collation Algorithm. The basics were simple, but I'm still unsure how to recognize collation graphemes efficiently in general. Either way, I discovered a bug in normalization: my insertion sort, used for canonical ordering of combining marks, wasn't a stable sort as required for normalization. It was actually an anti-stable sort: it reversed subsequences which were of the same sort key. That was really stupid of me. I'm going to work on incorporating existing test suites for things like this. For the test suite for collation, all but 8000 of 130,000 tests pass, making it far from ready.

3 comments:

Slava Pestov said...

Dan, the link to the PDF survey is broken.

Barry Kelly said...

Worth knowing: Windows' Unicode collation support isn't even always self-consistent, in that a < b, b < c, and c < a for some values of a, b and c. Random data => bad for testing Unicode sorts.

Manfred said...

I've used:
http://www.unicode.org/Public/5.1.0/ucd/NormalizationTest.txt
and
http://www.unicode.org/Public/5.1.0/ucd/NormalizationCorrections.txt

To test my normalization implementation. I'm pretty sure that when those tests succeed your implementation is up to par. For grapheme cluster support I defined tests similar to the normalization tests. For added confidence you can always test against another implementation.