Wednesday, October 31, 2007

Post removed

The post that used to be here was wrong and pointless. I've removed it, but if you're really curious, it's just commented-out as HTML, so you could still read it.

Saturday, October 27, 2007

FactorCon MN 2007

You may have already seen this on Slava's blog, but for those of you who didn't: on November 18-21, we're having a Factor-related get-together in Minneapolis, Minnesota. All levels of experience/use of Factor are encouraged to attend. Confirmed guests include Slava Pestov, Doug Coleman and me. If you think you can come, please contact me (or someone else who's already going). I haven't been able to do much Factor hacking lately, due to school constraints, so this'll be a great opportunity for me, personally. For all of you: if you don't know Factor well, this would be a great time to learn more; if you do know Factor well, this would be a good chance to meet your fellow developers. I hope to see you there!

Friday, October 26, 2007

Factor's object system

In my FAQ, I wrote that Factor was object-oriented. That might have confused some of you: how is Factor object-oriented when code isn't organized in terms of classes like in Java? When there is no system for message passing*? When tuples can't inherit from each other?


Originally, Factor didn't have an object system. It was said that it's not necessary; we can get by with just words, conditionals and lists. But as the library grew more complicated, it became clear that we needed a cleaner method of dispatch. This took the form of generic words, or words that can be overloaded by the class of one of their arguments. Initially, we only had predicate classes, and generic words were in fact no more than syntactic sugar for cond. But we also needed a good way to define data types, so that they wouldn't be confused with each other. Later, we realized the need for a system of 'inheritance' from tuples to tuples, flexible constructors, extensible unions and other things. Piece by piece, features were added to form a emerging coherent whole which could be used to organize some Factor code.

It's important to note that most words are not methods and not all data is stored in tuples. Groups of Factor code are put in vocabularies, not in classes. It's common to write a lot of code without using any object-oriented features (aside from generic words defined in libraries). Nevertheless, object orientation, when used appropriately, makes code much better organized.

The class hierarchy

In Factor, everything is an object. Well, of course everything is an object; everything is something! But I mean everything is an instance of the class object. Under that class, there are a number of built-in classes, representing basic things like tuple, vector, array, fixnum, float, bignum, ratio, quotation, curry, etc. You can test if an object is in a particular class with words of the scheme object?. To test if an object is a tuple, use the word tuple?.

On top of these builtin classes, there are a number of abstractions. For example, the class number is the union of all numeric classes. Under this is a numerical tower of subclasses, similar to Scheme. An example of a couple unions of this kind:

UNION: integer fixnum bignum ;
UNION: rational ratio integer;

Factor also supports extensible unions, called mixins. Whereas a union is closed, later code is allowed to add things to a mixin. The above code can be translated into mixin syntax as:

MIXIN: rational
MIXIN: integer
INSTANCE: rational ratio
INSTANCE: rational integer
INSTANCE: integer fixnum
INSTANCE: integer bignum

This is useful for things like sequences and assocs, which form mixin classes. Unions are inappropriate here, as new sequences and assocs can be made in the library.

Going in the other direction, classes can be narrowed down using predicate classes. A predicate class is a subclass of some other class consisting of instances which satisfy a conditional. For example, the natural numbers (defined as integers 0 or greater) could be defined as

PREDICATE: integer natural 0 >= ;

Generic words

So what use are all these classes? We can use predicates on them to test membership, but that's not all that useful. The centerpiece of the object system is generic words, or words which dispatch on the class of their arguments. Typically, only single dispatch is needed, so multiple dispatch is not in the core**. A generic word which dispatches on the top of the stack is defined with the parsing word GENERIC:. Methods, or implementations of the generic word, are written with the parsing word M:.Take the following useless example:

GENERIC: process ( thing -- processed )
M: string process
"foo" append ;
M: integer process
5 + ;

Here, there is a generic word process with two methods: one on strings (which appends "foo") and one on integers (which adds 5). This is equivalent to:

: process ( thing -- processed )
{ [ dup string? ] [ "foo" append ] }
{ [ dup integer? ] [ 5 + ] }
} cond ;

But the first version of process has the advantage that it is extensible later, in other files. Additionally, generic words are sensitive to the class hierarchy and will test things in an appropriate order, and dispatch is more optimized than with a simple cond expression.

There are other ways to dispatch besides GENERIC:. Other method combinations include MATH:, for words which dispatch on builtin math types for two arguments, HOOK:, to dispatch on the class of the value of a variable, and GENERIC#, which allows dispatch on items further down on the stack. You can also make your own method combinations.

Tuples, delegation and constructors

It's possible to construct your own data types using just arrays and predicate classes, but that can get awkward and ambiguous. To make new data structures, Factor offers a tuples, basically another name for structs or records. Here is an example of a tuple class called foo, with slots bar and baz, together with a constructor called <foo>:

TUPLE: foo bar baz ;
C: <foo> foo

The first line creates the tuple class, and the second line makes the constructor. (Slava described the details and rationale for the constructor system.) It is simple to use this. To make a new foo tuple, just get the values of bar and baz on the stack and call <foo. With that object, the bar slot can be read with the word foo-bar the baz slot can be read with the word foo-baz. To write values to those slots, use the words set-foo-bar and set-foo-baz. And, of course, to test if an object is a foo tuple, use the word foo?. All in all, TUPLE: is a wildly unhygenic macro, and we like it that way!

But wouldn't it be useful if we could let tuples inherit from each other? We've already seen that Factor allows for general subtypes and supertypes, but this does not extend to tuples. The way that tuples can be based off each other is through delegation: one tuple delegates to another. Each tuple has a slot, called the delegate (accessed with delegate and set-delegate). When a generic word dispatches on a tuple, but it doesn't have a method implementation for that tuple, it looks in the delegate slot for what to do next. If there is a delegate, then the generic word is again invoked on that delegate, with the hope that generic word itself has an implementation for that class. If not, it's on to the next delegate until there is no delegate anymore--the delegate is f.***

Instead of inheriting from a class, tuples delegate to an instance. That is, if I have my tuple TUPLE: foo bar baz and I want it to inherit from TUPLE: bing quux ;, all I have to do is make an instance of foo and an instance of bing, and set the delegate of the foo to the bing. Slot accessors like bing-quux and set-bing-quux are generic words implemented, by default, only on that one tuple class that they were made for. So, if you call bing-quux on a tuple of class foo that delegates to a bing, the foo instance will fail to respond and the generic word will be called again on the delegate, bing. This delegate will know what to do, and it'll return its quux value.

There's a subtle problem here with delegates. Look at the following code:

TUPLE: foo bar baz ;
TUPLE: bing quux ;
M: bing who how ;
M: bing how drop "bing" print ;
M: foo how drop "foo" print ;

You might expect that when who is called on a foo instance, "foo" is printed, but in fact, "bing" is printed. This is called the slicing problem: the object is "sliced" when using delegates, so only part of the object is represented in further method calls. On one hand, the behavior of M: bing who is more predictable, but on the other hand, information is lost. It's not a horrible thing, just something to watch out for.

Use cases

As I said before, a lot of Factor code has no need for object orientation. Going by my own experience, some complicated libraries, such as Unicode and inverse, don't yet use the object system at all. The XML-RPC vocabulary relies heavily on generic words to perform dispatch in turning Factor data into XML-RPC format, though no new classes are defined. The XML library, however, uses tuples to represent XML at various stages of processing, and dispatches on the class of these tuples to provide complex functionality. Several tuples are defined in the XML library to represent exceptions, though any object can be thrown as an exception.

In general, the object system should be used when it's useful and not used when it's not useful. That seems simple enough, but it is a radical departure from the philosophy of other languages where all procedures are methods. It's good to use whichever strategy you think will produce the most maintainable, clear code.

The future

For a while, I hesitated to write about the object system because it changes so frequently. Tuple constructors, for example, changed very recently to eliminate the default constructor and replace it with a more flexible system. Mixins are also a recent addition, yet they significantly change the way default methods are implemented. Soon, Slava plans to replace delegation with tuple inheritance.

Since the Factor object system is constructed completely in Factor, rather than being built in to the virtual machine, any changes to it can be done within Factor itself. This is a huge advantage. One change that I proposed could easily be written in Factor and run without any special care. Fundamental changes to the object system can even be interactively tested and debugged!

* Eduardo Cavazos has written a message passing object system in Factor, and he uses it for some of his code. Factor allows other object systems to be created, but with his system, not everything is an 'object' that can receive messages. Still, it's an interesting (and surprisingly terse) example of the flexibility of Factor, and of useful syntax extension. It is not widely used, however.

** I implemented double dispatch as a library called visitor (since it's implemented with the visitor pattern, yielding a simple lexicographic method order). Maybe in the future we'll have double dispatch in the Factor core, to better implement words like like, which end up approximating double dispatch using conditionals. Though it makes sense in theory, I can't imagine any use for triple dispatch in Factor.

*** As you can see, if you make an infinite (that is, recursive) delegate chain, method resolution may not halt. But why would you ever want to do that? Anyway, since we have predicate classes, it's already not guaranteed to halt!

Blogging is much more fun than writing essays about the history of South Africa.

Wednesday, October 17, 2007

Tuesday, October 16, 2007

Unicode implementer's guide part 5: Collation

What could be simpler than alphabetical order1? It's a simple lexicographic comparison of two strings, comparing the first character to see which is greater, then the second as a tie breaker, then the third, and so on. Unicode makes a lot of things harder, but this shouldn't be any different. Unicode doesn't prohibit this kind of alphabetical order; you can have a 'Unicode-conformant' process which uses this order.

Great idea, but...

This kind of order, which we could call code point order, is useless. Well, it's not completely useless; it's fine if you need an arbitrary, consistent order. But for humans, it's useless. It puts Z before a. In Normalization Form D (which I'll assume strings are in for the rest of the article), it puts the "äa" after "az". It puts "aa" after "a-z". It puts "æ" after "af". All of these are backwards for users in the American English locale and most other locales.

It's not just that the letters are in the wrong order. It also won't be sufficient to put all characters in a consistent case, as the R6RS Unicode module authors seem to think, as this will still give unnatural behavior for punctuation, accent marks, ligatures, and other things. Removing all of these features won't do it either; they can't be completely ignored, becase "foot ball" should come consistently after "football", and "Apple" should come consistently either before or after "apple", depending on the needs of the user.

It's crucial that things are sorted correctly; if they are in a different order than the user expects, then a user, looking at a long alphabetized list, may conclude that an item on the list doesn't exist.

A better way for people

To get things right, the way humans expect them, you need another few levels of tiebreakers. When comparing two strings to see which one comes first, there's a number of steps: First, compare two strings with all of the features stripped. If they're equal, compare the accent marks of the strings, as a tie breaker. If those are equal again, compare the capitalization as a second tie breaker. And if those are still equal, compare the punctuation.

I left out one thing--ligatures like æ. This can be expanded to something that looks like "ae" in the first stage, but looks different in the accent mark tiebreaker. We'll call them expansions. In some languages, like Danish, there are also contractions like "aa" with their own special alphabetization; these contract into a single value for the initial sort, if the Danish locale is used. In Thai and Lao scripts, there is an additional problem of certain vowels and consonants reversing order for collation purposes. This can be achieved with a combination of the expansion and contraction patterns. There's also a special problem for Korean collation2.

UTS #10

Believe it or not, I didn't come up with this all by myself. It's part of the Unicode Standard, in Unicode Techinical Standard #10: Collation. UTS #10 isn't something that's required of Unicode implementations, but it is specifically standardized and codified. It does basically what I described in the section above, with a slight change for efficiency. Because the folks at the Unicode Consortium did all this work, we don't have to! They've even given us a convenient table of data that makes it easier to implement the above tiebreakers.

UTS #10 suggests a simple way for implementing these tiebreakers that works for efficiently for sorting large lists of strings: calculate a collation key, a bit array, for each string, and compare these collation keys with a simple bit comparison. This collation key can be calculated in linear time and space with respect to the string, so a human-appropriate comparison is asymptotically no worse than a naive one and suffers only linear overhead.

Collation keys and the DUCET

But how do we make the collation key? Well, that table I mentioned earlier is called DUCET, the Default Unicode Collation Element Table. It describes a neutral locale which is linguistically inaccurate everyone. For most3 characters in NFD, there is a three-tuple4 of the primary, secondary, and tertiary collation weights. In assembling the collation key from a string, go through the string, first appending the primary weights of each character5, then a separator of 0, then all of the secondary weights, followed by another separator, and finally the tertiary weights. If one of these weights for a particular character is 0, leave it out. This is all followed by a representation of the punctuation6.

A simple binary comparison of the collation keys ends up being equivalent to what I described before. How? When the primary weights are appended, they form something equivalent to the characters all put in a consistent case, with accent marks and punctuation removed. In a binary comparison of collation keys, these are compared first. They are followed by a separator of 0 so that, if the strings' primary weights are equivalent up to the end of the shorter one, the longer string will be sorted second. This is because 0 is less than any collation weight. The secondary collation weight represents the accent marks, and the tertiary collation weight represents the capitalization. (See note 5 about punctuation.)

Implementation and tailoring

I wish I could give hints on implementation, but, umm... I haven't written a working one yet. It's quite an involved algorithm, at least compared to things I've done before. Luckily, the UTS gives lots of hits for how to do it. One thing that makes implementation easier is that the collation weights of many characters, mostly Han ideographs, can be derived algorithmically through a formula specified in UTS #10. In fact, this is somewhat unfortunate, as it is a symptom of the fact that the ordering of Han ideographs requires a more involved process, using some sort of dictionary lookup for pronunciation or radical index, depending on the context.

But implementation comes with an additional challenge: tailoring to specific locales. The DUCET isn't linguistically accurate for any locale, and must always be modified somewhat. Different contexts such as language, nation, context and organizational preference can demand slightly different collation orders that users depend on. Modifying the collation table to suit this is known as tailoring. I'm not yet sure what algorithm should be used to reshuffle the collation weights in the right way. But there is a ton of locale data in the Common Locale Data Repository (CLDR), a project run by the Unicode Consortium. This includes information about collation.


Collation isn't easy, but it's necessary to get right. UTS #10 lays out an algorithm for many aspects of collation, but still leaves some things out: there's no specified process for ignoring words like 'the' in the beginning of a book title, or padding numbers with zeros so that 2 comes before 10. Nevertheless, with this algorithm and appropriate locale data, it's possible to give most users a linguistically accurate alphabetical order which makes sense to humans.

1 Actually, we should probably call it collation, since alphabetical order implies we're working with an alphabet. With Unicode, some scripts don't use an alphabet.

2 There are a number of different ways to sort Korean Hangul syllables appropriately. The precomposed Hangul syllables are already in the correct collation order, but a problem arises in the representation of old Hangul texts, where multiple initial, medial or final jamo can be used. (In modern Korean, only one is used per syllable.) There is more than one way to sort this. Because of an apparent bureaucratic argument between the Unicode Consortium and the ISO (who publishes the parallel standard ISO 10646), none of these are encoded in the DUCET. The one that sounds the simplest to implement is to decompose all syllables and place a special terminator weight after the end of each Hangul syllable. More details are in UTS #10.

3 I say 'most' because Han characters and Hangul syllables are left out, as their weights can be derived algorithmically. Maybe I shouldn't say 'most', as most characters are in that group, though.

4 Actually, it's a four-tuple, but the last entry isn't very useful for anything. I think it's the same as the shifted strategy for variable weighting, described in note 5.

5 Actually, to make expansions and contractions work, you shouldn't be iterating over characters; you should iterate over collation elements. For example, in Slovak, 'ch' forms a collation element with a distinct single primary weight.

6 There are many different ways to represent the punctuation, but that is beyond the scope of this article. Generally, we don't refer to it as 'punctuation' but rather 'variable-weighted'. The most useful variant (they all have applications, however) is called 'shifted', adding an algorithmically-derived fourth level of weight to compare variable-weighted characters.