Monday, July 16, 2007

Eww, that's the dirtiest macro I've ever seen!

I've shown a few parsing words on this blog that create words that aren't mentioned directly in the text. An example of this is TUPLE:, which makes a class with a number of slots, a constructor, a predicate, accessors and setters. It's very useful, even if not all words it creates are given literally in the code. Now, what would you say if I told you I have an unhygenic macro that creates O(n^m) words? (I'll tell you what m and n are later.) In reality, it makes more than n^m words. With 7 as m and n, it creates 1419760 words, and with 8 as both of them, Factor crashes because it runs out of memory after expanding the heap a number of times.

You're probably thinking, "Why would I ever want such a stupid unhygenic macro? And what does it do? Your introduction was so boring and pointless, Dan." Well, I'll tell you: it's a new stack shuffling syntax. dmpk2k on #concatenative asked me about alternative shuffle word syntax. For example, why not make dup into x-xx, swap xy-yx, etc? It'd be much easier to remember, and there would be no need for "stack idioms" like swap rot for xyz-zyx. Even complicated stack shuffling would be simple.

Believe it or not, it's not all to hard to do in Factor. It'd be prohibitively difficult to extend the parser to make this work for all possible shuffles, so I took a halfway approach: provide a way to define as many as you want. The syntax I came up with was
SHUFFLE: abcd 6

which defines shuffle words for up to 4 inputs labeled a, b, c and d (in order), and up to 6 outputs, defining a paltry 6686 shuffle operators, everything from abcd- to ab-baabaa. (If you want, you can alias them, as in : tuckd xyz-yxyz ;, but there's generally little point.) Some of the shufflers do the same thing as others, such as ab-a and a-, but this constitutes a very small fraction of the total. Going with the variables at the beginning of the article, n is the length of the alphabet (here, abcd) and m is the maximum output length (here, 6). The total number of transformations is higher than 4^6 (4096) because that count only includes the longest transforms, from four elements on the stack to six. However, it is still an accurate asymptotic approximation.

So, there are two difficult programming problems here, in addition to some easy ones. The hard ones are (a) generating all possible stack transformations and (b) turning a stack transformation into a quotation.

Generating all stack transformations

(a) may look simple, but it's a bit more complex than it seems. It is easy to get the possible inputs: it's just somewhere between the first element of the alphabet and the entire alphabet. But the outputs can be any number of the inputs, in any order, duplicated or not.

This can be reduced to a simpler problem. Let the length of the alphabet be A, and the length of the output be O. Given an alphabet length and array length, generate an array of arrays consisting of all the possible combinations of numbers 0 to A-1. The length of the array generated will be O^A because there are O spaces, each of which can be filled with A different items. So if we can find a bijection between the numbers from 0 to (O^A)-1 (let's call one of these numbers N) and the arrays we need, the problem is easy.

As it happens, the structure of this problem is remarkably similar to converting numbers to different bases. We can treat our number N as a base A number. Then we create an array of length O consisting of the digits of the number. Each digit should be between 0 and A-1. If we convert each number N into this base, we'll get every possible combination for the output array. The code for this is below, where translate converts a number into an array with a given length and alphabet size, and (combinations) creates all possible combinations for the given length and alphabet size:

: translate ( alphabet out-len n -- seq )
-rot [ drop /mod ] map-with nip ;

: (combinations) ( alphabet out-len -- seq[seq] )
2dup ^ [ translate ] map-with2 ;

To make an array of all the possible stack shuffles with input length from 1 to max-in and output length from 1 to max-out, we need a couple more words. To represent a stack shuffle, I used a 2array whose first element is the number of inputs and whose second element is an array of the output indices.

: combinations ( n max-out -- seq[seq] )
1+ [ (combinations) ] map-with concat ;

: make-shuffles ( max-out max-in -- shuffles )
[ 1+ dup rot combinations [ 2array ] map-with ] map-with concat ;

(Going counter to the Forth/Factor mantra, I actually wrote this code in a top-down fashion. First, I wrote make-shuffles, and then realized I'd need combinations, and then realized I need (combinations) and translate. This is generally not a good tactic, but it works for small problems that take a lot of reasoning, like this.)

Generating quotations
It surprised me, but (b) turned out to be the easy one. At first, to transform the stack, I stored the top few elements in an array, then accessed the elements used for output and dropped the array. But that isn't optimal, because an array is unnecessarily allocated. It's slightly more efficient if we just pass things around on the stack.

To make this easier, I used Chris Double's extra/shuffle library. It defines words like [npick] and [ndrop] which generate quotations for arbitrary depth pick and drop. Using this, I can take the strategy of reaching down and pulling up the last element of the output array, then dipping beneath it. After iterating backwards through the entire output array, I drop the inputs and come back up. That wasn't too good an explanation, so just have a look at the code:

: shuffle>quot ( shuffle -- quot )
first2 2dup [ - ] map-with
reverse [ [npick] % \ >r , ] each
swap [ndrop] % length [ \ r> , ] times
] [ ] make ;

As an example, the code for ab-ba is over >r dup >r 2drop r> r>.

Wrapping it all up

To make this work, I also need a few trivial operators. There are words for turning a stack shuffle datum into a string:
: shuffle>string ( names shuffle -- string )
[ [ swap nth ] map-with ] map-with
first2 "-" swap 3append >string ;

For attaching a stack effect to a shuffle word:

: put-effect ( word -- )
dup word-name "-" split1
[ >array [ 1string ] map ] 2apply
<effect> "declared-effect" set-word-prop ;

And for defining a set of shuffle operators (in-shuffle and out-shuffle make sure that the operators get defined in their own minivocabulary, so they're not exported by default. This follows the example of <PRIVATE and PRIVATE>):

: in-shuffle ( -- ) in get ".shuffle" append set-in ;
: out-shuffle ( -- ) in get ".shuffle" ?tail drop set-in ;

: define-shuffles ( names max-out -- )
in-shuffle over length make-shuffles [
[ shuffle>string create-in ] keep
shuffle>quot dupd define-compound put-effect
] each-with out-shuffle ;

And finally, the dirty macro itself, the greatest sin to hygiene Factor has ever seen:

scan scan string>number define-shuffles ; parsing

Enjoy! I'm not sure whether it would be ethical for me to release this evil into the Factor distribution, so I haven't pushed it on to my repository.


Anonymous said...

it reminds constraints programming that can give you all the options for a given range. I always thought it can be a very useful abstraction.
but you have generated everything yourself.

interesting work, but a thought occurs - as factor gets refactored all the time, there are many new words that build on top of other words.
how would a new programmer know what to choose and use? is documentation a good enough solution for that?

Daniel Ehrenberg said...

Sorry for the delayed response. I never really thought about constraint-based programming, though that probably would provide a more elegant solution than the algorithmic one I made. I should read up more on that; I have no idea how a constraint-based solution would be implemented (in the backend). It would be interesting to see something like that in Factor, or just in general.

It's true that, right now, Factor goes through a major change every few months that makes a lot of code incompatible with the current version. Recent examples include the module system change, the parser rewrite and the assoc protocol. It's true that these changes sometimes make it hard for programmers to keep up. But generally, Factor is growing more and more stable. Once version 1.0 comes out, stability and backwards compatibility will be a goal. But until then, Factor might have major changes if they are deemed beneficial. Factor is still a young language, and this isn't a problem specific to it. But I see how it's a problem.