Monday, July 30, 2007

How to make ad-hoc polymorphism less ad hoc in Factor

Note: Originally, I used the words type, class and instance where I now use class, group and implementation, respectively, but I changed this because it was too confusing. This proposal may sound sketchy, which is because I haven't implemented it yet. But I'll have a working draft in a few days.

A few days ago, Slava told me that he had plans to radically change the Factor object system. At first, this alarmed me. I'm afraid we might lose users if we keep changing the language so much, with no end in sight. It's amazing how quickly Factor code undergoes bitrot. One of the changes he suggested included getting rid of class names in slot accessors, which I think is both useless and somewhat obfuscating. But after some thought I realized there was a potential for great changes here.

A proposal: what if Factor had type classes, like Haskell? No, it's impossible to port classes directly from Haskell in their current form. But type classes form a good model to build a dynamically typed object system around. For the rest of this blog posting, I'll use the word "group" to refer to my extended notion of a type class. I call it this because it refers to a group of generic words, as well as a group of classes that implement them. I'll continue referring to Factor classes as classes. This may confuse Haskellers, but it makes more sense, since it fits in with current Factor concepts. A good introduction to type classes is Philip Wadler's original paper, "How to make ad-hoc polymorphism less ad hoc." (Note that I'm not doing any of the dictionary translation part, really just the first part. Also, this does not include multiparameter type classes, but may be extended to that by Factor 2.0 to support homogeneous sequences when Factor is optionally statically typed. Really, the link to Haskell is more metaphorical than anything else. You can also think of groups as abstract classes.)

The sequence protocol, one of the first big uses of Factor's object system, is a good example of what's wrong with what we have right now. Here's some code from core/sequences/sequences.factor:

GENERIC: length ( seq -- n ) flushable
GENERIC: set-length ( n seq -- )
GENERIC: nth ( n seq -- elt ) flushable
GENERIC: set-nth ( elt n seq -- )
GENERIC: new ( len seq -- newseq ) flushable
GENERIC: new-resizable ( len seq -- newseq ) flushable
GENERIC: like ( seq prototype -- newseq ) flushable

GENERIC: nth-unsafe ( n seq -- elt ) flushable
GENERIC: set-nth-unsafe ( elt n seq -- )

Those are just a bunch of generic words, all detached from one another. The only thing connecting them is documentation, and the understanding that all sequences should implement all of them. (Or at least enough of them so that it can function; some of these methods have implementations on object and some don't.) Some methods have a convenient property that they can defer missing methods to the delegate of a tuple, but this only works if no default implementation on object exists.

The first thing to do is to provide a structure for these generic words. In documentation, they form a protocol, but I'll use them to form a group, called sequence. (Note that this code won't work, if you try to run it. It's just an idea. Hopefully I'll have an implementation in a month or so, but I have a lot to do right now.)

GROUP: sequence
GENERIC: length ( seq -- n ) flushable
GENERIC: nth ( n seq -- elt ) flushable
GENERIC: nth-unsafe ( n seq -- elt ) flushable
GENERIC: new ( len seq -- newseq ) flushable
GENERIC: new-resizable ( len seq -- newseq ) flushable
GENERIC: like ( seq prototype -- newseq ) flushable

GROUP: mutable FROM: sequence
GENERIC: set-nth ( elt n seq -- )
GENERIC: set-nth-unsafe ( elt n seq -- )

GROUP: resizable FROM: mutable
GENERIC: set-length ( n seq -- )
GENERIC: underlying ( seq -- underlying ) ! Note: defined in core/growable
GENERIC: set-underlying ( underlying seq -- ) ! ditto

Immediately, you can see that I've added some structure to this. The generic words are each inside a GROUP: declaration. Some of the group declarations have FROM: following the group name, meaning that every implementation of that group must also be an instance of the group following the colon. (Multiple classes could be placed in brackets.) I've separated this out into three type classes, because that is the actual structure of the sequence protocol: all sequences are indexable, and some of those are mutable, and some of those are growable.

In each of the above examples, GENERIC: is used, but it's possible to use other method combinations like GENERIC#, MATH: and HOOK:. This proposal doesn't imply any significant change in dispatch mechanisms, except for the removal of delegates as they are now.

Each group forms a class which is the (initially empty) union of the types implementing the group. At first, this sounds useless, but it allows two things we haven't had in Factor that Haskellers take for granted: the ability to do proper default method implementations and create type class instances of forms like instance Foo a => Bar a where ... (this isn't Haskell98, but it's still useful). Here's how some default methods look:

M: sequence nth bounds-check nth-unsafe ;
M: mutable set-nth bounds-check set-nth-unsafe ;
M: resizable set-nth ensure set-nth-unsafe ;

Currently, each of these definitions is repeated multiple times throughout the Factor code base. Definition on object, the current mechanism for default methods, is avoided to make room for delegation. Other default methods, which are currently implemented on object, could be

M: sequence new-resizable drop ;
M: sequence new drop f ;
M: sequence like drop ;

It is good that these are implemented on sequence rather than object because that is a more precise way of indicating what the code is actually doing. I always found it confusing that sequence methods would be implemented on object when they would clearly fail on many types of objects.

So, we have this group. How do we make an implementation of it, or a class which belongs to it? If we just started implementing methods without any declaration, it'd be hard to decide whether a type belongs to a particular class. So, here's an example of how the syntax could look, showing the instance of sequences on integers. Note that integers belong only to the class sequence, not mutable or resizable.

JOIN: sequence integer
M: integer length ;
M: integer nth-unsafe drop ;

The JOIN: declaration ensures that integers will be included in the sequence group. (They're joining the group, hence the name.) We may or may not want an error to be triggered if someone attempts to implement the sequence methods without that declaration. I'm not sure. It might also be an error-triggering condition if an JOIN: declaration exists in a file without a complete implementation of the group's methods.

Slava suggested that we may want a behavior, or set of implementations of methods, without adding any slots or new generic words. That could be done easily with this system:

GROUP: foo FROM: sequence ;
M: foo nth-unsafe ... ;

Now, here's an innovation where we go beyond Haskell, sorta. It could, potentially, be included back into Haskell; there's nothing stopping them, but it might not be as useful there. (I should note that this so far doesn't incorporate nearly all of the functionality of Haskell type classes, but that's OK.) I'm talking about a generalization a Haskell extension called generalized deriving for newtype. Basically, if you make a newtype in Haskell, you can use the deriving construct to allow a few classes to implement all methods by looking up the object stored "in" the object of the newtype. (In reality, there's nothing in it; the difference is only at the type level.) That wasn't a very good description, so let me explain how it works in Factor.

My first real programming project was writing an XML parser. Recently, I found a way to make some operations on tag objects much simpler: a tag can act like its name, its children, and its attributes all at the same time! That probably sounds confusing, but it is actually quite simple. The main way names are used is querying the name-url, name-tag and name-space, the three tuple fields. The way a tag's children are used is as a sequence. And attributes are used as assocs. So really, the concept we want to express here is that there are three different delegations, each for a different type class. (Let's imagine here that the tuple type TUPLE: name url space tag ; defines a type class name-slots containing the generic words name-url, name-tag and name-space. I'm also not sure if this is a good idea.) Then we can specify these delegations using:

TUPLE: tag name attrs children ;
DELEGATE: name-slots tag tag-name
DELEGATE: growable-seq tag tag-children
DELEGATE: assoc tag tag-attrs

Currently, my code for this is a mess of delegation using the delegate slot (which is set awkwardly in a manual constructor definition to the name) and this:

M: tag at* tag-attrs at* ;
M: tag set-at tag-attrs set-at ;
M: tag new-assoc tag-attrs new-assoc ;
M: tag assoc-find >r tag-attrs r> assoc-find ;
M: tag delete-at tag-attrs delete-at ;
M: tag clear-assoc tag-attrs clear-assoc ;
M: tag assoc-size tag-attrs assoc-size ;
M: tag assoc-like tag-attrs assoc-like ;

M: tag nth tag-children nth ;
M: tag nth-unsafe tag-children nth-unsafe ;
M: tag set-nth tag-children set-nth ;
M: tag set-nth-unsafe tag-children set-nth-unsafe ;
M: tag length tag-children length ;

which is just plain unnecessary duplication. The simpler preceding code would basically generate the code underneath. Assuming tuple slot accessors are kept the way they are, this would allow a tuple to "inherit" from more than one tuple type. So all classes have the capability to inherit from one or more "abstract classes" (groups) and "concrete classes" (normal classes). For those of you worried about multiple inheritance conflicts, let's take a simple approach and just say the last DELEGATE: declaration takes precedence. There's no reason to have complicated graph algorithms here. If your code depends on something more complicated than that, then there's probably something wrong with it. In general, the classes you implement or delegate to shouldn't overlap too much.

If you've noticed, I have a notion of a class which is a bit broader than the Haskell concept. Now let me make it even broader: every generic word forms a group (and therefore also a class, since every group forms a class which is the union of the types that implement it). This way, every generic word is in some sort of type class. So, by itself, the code GENERIC: foo would be equivalent to

GROUP: foo

In most Factor code, it wouldn't make sense to put most generic words into classes. They're not part of a protocol, and it's just annoying. When a type implements a method, it is automatically made an instance of the class corresponding to that method. Here's a nice useless example of that:

GENERIC: nothing? ( object -- ? )
M: f nothing? drop t ;
M: sequence nothing? empty? ;
M: zero? nothing zero? ;

The final method works on all objects that have an implementation for zero?. Again, I'm not sure whether this is a good idea or not.

So, there's my long-winded proposal for something like type classes in Factor. The terminology is probably confusing, so feel free to ask for clarification (or correct my errors!). Ed Cavazos made another interesting proposal for a radically changed object system in Factor, which he has already implemented and made use of in extra/mortar. I could implement this system in Factor, and may in the future, but haven't yet. Comments?

Update: I made some changes that Slava suggested. I should've noted better that ad-hoc polymorphism, with or without this change, will still be a lot more ad hoc than in Haskell, if I didn't make that clear.

Update 2: I have a working implementation of this at the wee-url pastebin. There are only two problems: it's slow and it'll hang if you give it a cyclic group structure. (Sorry, that statement sounded too abstract algebra-like but it actually has nothing to do with that, AFAIK.) Oh, and a couple more problems: I can't include this in my repository, because loading it affects the whole system, which is bad. And also, it might be a bad idea to make every generic word and set of tuple slots a class, because that'd add 1199 classes to the core as of Factor .89, most of which will never be used. In .89, there were 296 classes in the core. Another problem is that I haven't fully tested the newly modified class< needed because of the new class structure. I'm not sure if it'll work properly, since this includes multiple inheritance. (But then again, so do overlapping unions, which, in my tests, silently and arbitrarily chose one when there was any ambiguity about which method to run.) When writing this, I realized that it's actually really easy to infer by context when a group is joining a class, so an explicit JOIN: declaration is only needed in certain edge cases, or when a group does not have any directly associated methods.

Update 3: Actually, Slava implemented basically the same thing at the same time as I wrote it, but I didn't have internet at the time, so I didn't know it. Huh.


Slava Pestov said...

Dan, in Factor, a 'type' is an internal representation of an object. There is only a fixed number of them and corresponding to each type is a built-in class. Classes can be defined by the user. So the two concepts are not interchangeable.

Ed's object system can be found in extra/mortar in darcs.

Daniel Ehrenberg said...

Oh yeah, I forgot. Well, that meaning of 'type' is really low-level and the programmer doesn't have to worry about it. We could call it "primitive type" or something else instead, if this sort of terminology is adopted. Thanks for the link.