Sunday, January 13, 2008

until: The Ultimate Enumerator

Recently, via programming.reddit, I found this paper by Oleg Kiselyov about the best way to build a generic collection protocol for traversing a sequence. The goal is to implement operations on collections once, generically, so that they work on any kind of collection.

Oleg's idea

There are two main ways to go about this: one is a cursor, which is called an iterator in Java and C#: basically, you have an object which represents a particular point in a collection, and there are operations to get the current value and the next cursor position, if one exists. The alternative is to have a higher order function called an enumerator which represents, basically, a left fold. Oleg's conclusion is that the left fold operator is the best way to go about things, and he proceeds to demonstrate how this works out in languages with and without call/cc. He shows how cursors and enumerators are isomorphic, defining each in terms of the other.

Oleg's function is something like this: you have a coll-fold-left function which takes a collection, a function, and an unbounded number of seeds, and returns the final values of those seeds once the left fold is done. The function takes an element and the current seed values, and returns an indicator along with new seed values. The indicator determines whether to keep going or to stop now and return the current seed values from coll-fold-left.

An enumeration function like this, which is fully general, is much better than a cursor. Cursors depend on mutable state, and they are much more difficult to implement in some cases, say, when traversing a tree with no links to parent nodes. Enumerators also handle the end of the collection much better than cursors, where you must either manually check whether the collection is exhausted or catch an exception when reading past the end. There are also efficiency problems with cursors, and there is some interesting research into compiler optimizations like deforestation based on enumerators.

Concatenativity and accumulators

I'm going to talk about how this can all work in Factor. Factor does have call/cc, so we can use the simpler enumerator and still recover full cursor-like control when we need it. (I won't go into how to stop and restart an iteration in the middle in this article; Oleg's example can be translated directly into Factor.) But we also have something else to take advantage of, which neither Scheme nor Haskell has at its disposal: a stack.

Remember, Factor is a concatenative language, and that means that whenever we make a function call, it operates on a huge built-in accumulator. You could think of the stack as one accumulator ("all words are functions from stack to stack") or as an unbounded number of accumulators (which makes more sense thinking about things from a dataflow perspective, that the stack is used to glue together the inputs and outputs of various words).

Here's an example. You know the regular old left fold? As in, taking a list like [1, 2, 3, 4], an initial value (called the identity) 0, and a function like (+), and transforming this into ((((0+1)2)+3)+4)? Well, in Factor we call that reduce, and here's how it's defined:

: reduce ( sequence identity quotation -- value )
    ! quotation: accumulator element -- new-value
    swapd each ; inline

swapd is a stack shuffler with the effect ( x y z -- y x z ) and each is a word which takes a sequence and a quotation, and calls the quotation on each element of the sequence in order. each normally takes a quotation with stack effect ( element -- ).

But how does this work? each, like nearly every combinator, is defined so that its quotation can access values further down on the stack, and we can mess with that stuff all we want as long as the result is balanced. Here, the quotation we supply updates the value just under the element (the accumulator) each time it's called. So, by the end, the top of the stack holds our answer. Keep in mind that all of this happens completely without mutable state.
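For instance, the running example above works out like this, using the reduce already in Factor's library:

    { 1 2 3 4 } 0 [ + ] reduce . ! prints 10, i.e. ((((0+1)+2)+3)+4)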


So, that's the great simplification you get from concatenative languages: there is no need to maintain the accumulators explicitly. Here's all we need to define the perfect enumeration protocol: a word which I call until ( collection quotation -- ). It takes a collection and a quotation from the stack and leaves nothing. The quotation has the effect ( element -- ? ). If the quotation returns true, the iteration through the collection stops there.

until is significantly easier to define than coll-fold-left because we don't have to worry about the accumulators. Here's a simple definition for linked lists, with f used as the empty list:

TUPLE: cons car cdr ;

: until ( list quot -- )
    over [
        [ >r cons-car r> call ] 2keep
        rot [ 2drop ] [ >r cons-cdr r> until ] if
    ] [ 2drop ] if ; inline
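As a quick sketch of how this might be used (hedged: the post doesn't define a constructor for cons, so I'm assuming one built with C: <cons> cons, the boa-constructor syntax of this era of Factor):

    C: <cons> cons                ! assumed constructor: ( car cdr -- cons )
    1 2 3 f <cons> <cons> <cons>  ! builds the list ( 1 2 3 )
    [ dup . 2 = ] until           ! prints elements, stopping once one equals 2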

Of course, we'd probably want to define it as a generic word, if we're doing this in Factor and will use it for more than one thing.

(until is kinda like the generic iteration protocol I've defined for assocs, using assoc-find, except until should be easier to implement.)

Derived operations

Using until as a base, we can easily define other basic operations like each, find and assoc-each as they currently exist in Factor.

: each ( seq quot -- )
    [ f ] compose until ; inline
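To see why this works: [ f ] compose turns a quotation with effect ( element -- ) into one with effect ( element -- f ), so until never stops early and visits every element. A hypothetical usage, assuming the cons lists above and a constructor defined with C: <cons> cons:

    C: <cons> cons
    1 2 3 f <cons> <cons> <cons> ! the list ( 1 2 3 )
    [ . ] each                   ! prints each element in order, via until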

: find ( seq quot -- i elt ) ! This is hideous, and should be factored
    >r >r -1 f r> r> [                 ! i f elt quot
        2swap >r >r keep r> 1+ r> drop ! ? elt i
        rot [ rot and ] keep           ! i f/elt ?
    ] curry until dup [ nip f swap ] unless ; inline

: assoc-each ( assoc quot -- )
    ! This assumes that the assoc implements until, yielding a 2array for each entry
    [ first2 ] swap compose each ; inline


You might notice that find involves a little redundant arithmetic on integer-indexed sequences, which already track an index during traversal. I'm not sure how to avoid this without complicating until and leaking implementation details. You might also notice that there's no obvious way to implement find*, which takes a starting index as an argument, without doing a bunch of redundant iteration. This is because find* can only really be implemented efficiently on collections with random-access indexing. There's also no find-last.

until isn't everything. As I just pointed out, some operations, like find, may have to do redundant arithmetic to maintain an index on collections whose traversal already computes one. Another limitation is that there's no obvious way to implement map. (I'd still suggest going with Factor's current scheme of the words new, new-resizable and like, but that scheme doesn't work with, say, linked lists.) until could also be used in place of the assoc iteration protocol, but that might result in unnecessary allocation of 2arrays when iterating through things like hashtables.

Nevertheless, something like until could definitely help us make our sequence protocol more generic, allowing for (lazy) lists, assocs, and what we currently call sequences to be used all through the same protocol.

If you don't understand the title allusion, you should read some of the original Scheme papers.

1 comment:

Slava Pestov said...

Dan, you could significantly simplify 'find' if you got rid of the code that maintains the index. Indices make no sense with abstract iterators anyway.