Saturday, April 7, 2007

Assoc protocol

Still in my break from Unicode (I'll get back to it eventually...) I'm working on something really cool right now: a generic protocol for associative arrays! [For non-Factorers, a protocol is our word for an informal collection of generic words that, together, constitute something that is useful for a class to implement.]

I hear what you're saying. "That's not cool; that sounds really boring and technical. Also, why do you need a generic protocol in the first place." To your first objection: remember when you first learned how to implement map on linked lists? Wasn't that so simple, yet useful and interesting? But then you realized that thousands of other beginning programmers wrote the exact same snippet of Scheme, and yours was pretty much useless. Well, I'm writing all of those things again, except in a way that no one else (to my knowledge) has done before. More on that later.

The reason it's useful is a bit more obvious. Imagine you have a piece of code which uses a hashtable as an associative array. Then, after writing 1000 lines of code, you realize (oops) that, fairly often, you have to iterate over your data in alphabetical order. Also, the data is all in strings, and there are many strings starting with the same word. A hashtable works, but it's definitely not the best option. So trie to the rescue! But now you have to go back and change all of your code from using the hash library to the trie library. It shouldn't be so much work to do that. With a generic protocol, all you have to change is the code initializing the trie/hashtable to use the appropriate type. As another example, imagine you're implementing a database. I have no idea how databases are implemented, but let's imagine you use an AVL tree if there are 1024 or fewer entries, and if there are more, you use a B-tree to represent the data. This way, the switch can be made internally without other functions having to care. A third use, probably the most immediate benefit, is giving the Factor core a really solid set of words for manipulating alists without polluting the namespace, complete with all the derived operations you could want.

Well, those may be a little contrived, but back to the fun part, doing map all over again! Well, here's the current incarnation of the protocol that all assocs (our abbreviation for associative array) use:

GENERIC: assoc* ( key assoc -- value/f ? )
GENERIC: set-assoc ( value key assoc -- )
GENERIC: key? ( key assoc -- ? )
GENERIC: new-assoc ( size exemplar -- newassoc )
G: assoc-find ( assoc quot -- key value ? )
1 standard-combination ; inline ! quot: key value -- ?
GENERIC: remove-assoc ( key assoc -- )
GENERIC: clear-assoc ( assoc -- )
GENERIC: assoc-size ( assoc -- n )
! Additionally, clone must be implemented properly

Most of those have a clear analogue in the current hashtable implementation. key? is hash-member? and H{ } new-assoc is <hashtable>. The only really different operation is assoc-find, which at first looks like a rather curious choice for the protocol.

Probably the most confusing part of assoc-find is that it doesn't use the standard GENERIC: declaration, instead using G: assoc-find 1 standard-combination ;. The purpose of this is to make it dispatch off the second item on the stack instead of the first one. G: some-generic-word 0 standard-combination ; is equivalent to the normal GENERIC: some-generic-word.

The idea of assoc-find is to be analogous to find. It traverses through the assoc in order (if an order exists), returning the first key/value pair that the given predicate succeeds on. It also returns a boolean of whether or not anything was found, which is necessary in the unlikely event was a key of f and a value of f. This probably seems strange because there currently is no hash-find in the Factor library, and nothing has needed it. But on top of assoc-find, any other useful combinator can easily and efficiently be built.

Think about it this way: what combinator could be used that would give you the following features:

  • The ability to execute a quotation on each key/value pair of an assoc (needed for assoc-each)

  • The ability to test, in a short-circuited way, if any or all quotations succeed on a particular predicate (needed for assoc-all?)

  • The ability to return the key/value pair associated with the success of a predicate (well, this is the exact function of assoc-find in the first place, but it's needed for rassoc)


To all these problems, assoc-find is the most natural solution. The strategy for implementing assoc-each? Just call the quotation given, and then return f to assoc-find so it keeps going. At the end, drop all three outputs. Here's the implementation:

: assoc-with 2swap [ >r -rot r> call ] 2keep ; inline

: assoc-find-with ( obj assoc quot -- key value ? )
swap [ assoc-with rot ] assoc-find
>r >r 2nip r> r> ; inline

: assoc-each ( assoc quot -- )
swap [ rot call f ] assoc-find-with 3drop ;

(Yes, I did steal assoc-with from the existing hash-with code verbatim.)

Inspired by the sequences protocol that already exists in Factor, in this style, I can implement all other operations just once and have them work on all other types of assocs. I'm unaware of any similarly designed system. In Java, there's a Map interface, but since interfaces and abstract classes are distinct in Java, and Map is not an abstract class, there are no shared method bodies between Map implementations. In Ruby and Python, you can overload the [] or __getitem__ methods (and many other languages have similar operator overloadings) but there's no organized protocol or shared code around this. Common Lisp uses different functions for different data structures. The two I'm still wondering about are Smalltalk and Haskell, because an assoc superclass or typeclass would make a huge amount of sense in those languages especially, though the Haskell one would be limited to some degree. If you know anything about these or other things, I'd be really interested to hear your comments. Even if another language as a similar assoc protocol, I'd be surprised if they made the same weird choice to use assoc-find as their fundamental traversal operation.

Another thing I'm wondering about is the possibility of "virtual assocs," analogous to the virtual sequences already present in Factor. The basic idea is, to have a sequence, all you really need are nth and length operations, maybe also a set-nth. This allows for cool things like, say, a reversed virtual sequence, which merely modifies indexing so that you are looking from the end instead of the beginning. There is no need to actually copy the underlying sequence. So why not do the same for assocs? It'd definitely be possible, but I don't know if it would ever be useful. I can't really think of a use, though it'd always be easy to implement one once someone thinks of one.

Anyway, this isn't much of a new idea, just a recombination of a bunch of old ones. It's really a great example of the benefits of generic words. It's interesting to note that this would be just as easy to implement in a class-based system as it would be in a typeclass system, though for initial rough development, it's always easier to use generic words or typeclasses which can be defined after the fact.

Update: It turns out that, in Haskell, a library called Edison implements the same sort of thing that I've done with type classes. It's made by (who else?) Chris Okasaki, author of Purely Functional Data Structures. Unfortunately, I can't fully understand either that book or Edison (at least from the associated documentation). I'm sure the library makes mine look extremely crude by comparison, but I think it's a good thing to have your library be relatively simple and comprehensible to someone who doesn't care about a typeclass hierarchy, just about a couple functions. At the same time, I see how the type classes make sense to allow for fine-grained functionality differences.

No comments: