Tuesday, January 16, 2007

Factor's make

One of the more unique features of Factor is a word called make. It's used for generating sequences easily where it wouldn't work to use a literal, a sequence initializer, or something that operates on an existing sequence. In its simplest form, make can be used like

[ 1 , 2 , 3 , ] { } make

which is equivalent to { 1 2 3 }. The word , appends the top of the stack to an invisible array, which is outputted at the end. Of course, , can be used in a loop, and you can even call another word to add elements to the loop. For example,

: add-all ( seq -- ) [ , ] each ;
[ 3 add-all 4 add-all ] { } make

which generates { 0 1 2 0 1 2 3 }. [Note: this takes advantage of a curious property of Factor, that numbers themselves are sequences representing the range 0..n-1] add-all turns out to be a pretty useful word, so it's actually included in the Factor standard library under the name % (with a more efficient implementation). So the previous code could be rewritten [ 3 % 4 % ] { } make. These names may seem obscure at first, but you'll get used to them quickly.

The importance of this is that, with make, you can do many complex things in constructing sequences that would be much more annoying when explicitly passed around. An example of this is take-until, defined with the imperative parsing code in previous entries and make as:

: take-until ( quot -- string )
#! Take the substring of a string starting at spot
#! from code until the quotation given is true and
#! advance spot to after the substring.
[ [
dup slip swap dup [ get-char , ] unless
] skip-until ] "" make nip ;

Here's another example: say you want to write a function that takes a number and outputs the string "The answer to [whatever the number is] squared is [the number squared].". Factor doesn't require complicated string interpolation or inefficient multiple concatenations as other languages would; instead, it uses make. The word # converts a number to a string and then appends it to the current product of make. The code is below:

: square-description ( num -- )
"The answer to " %
dup #
" squared is " %
sq #
"." %
] "" make ;

When used in this way, "" make turns out rather like the C function sprintf. In other cases, like most uses of { } make, it ends up acting as a substitute for Lisp's ` (quasiquote).

make isn't any sort of magical builtin; it's actually implemented in very simple Factor code. It is in the vocabulary namespaces and defined as such:

: make ( quot exemplar -- seq )
>r [
V{ } clone building set
building get
] with-scope
r> like ; inline

In English, all that does is make a new empty vector, place it in the building variable (which is dynamically scoped), call the given quotation, retrieve the variable, and convert the contents to the given exemplar. It operates inside a new scope so any outside definition of building is unchanged, and make's value for it is invisible after the word is run.

Obviously, something's missing: how do elements get on this new sequence? The answer is that , puts them on. Its implementation is even simpler than that of make:

: , ( elt -- ) building get push ;

and that is basically all there is to make

Update: That's not all, folks! There's also a way to push a whole sequence, all at once, onto the building. How? Using a word cryptically called %:

: % ( seq -- ) building get push-all ;

The push-all word is an in-place append.

Tuesday, January 9, 2007

Imperative parsing with abstractions

Here's somewhat of a confession: after writing a non-trivial library centered around parsing, I still know nothing about parsing strings. My attempts to wrap my head around what exactly an LALR(1) parser is and any little basic aspect of the theory of parsing have been in vain. I still struggle with using simple parser combinators (like the ones doublec has created).

So for my XML library, I created my own system of parsing which may or may not be original or useful. All of the below examples cah be run, in Factor, by requiring the module libs/xml and USEing state-parser. The source code can be found in the Factor distribution in libs/xml/state-parser.factor.

This way of parsing evolved very slowly. At first, it was just something like this:

SYMBOL: code
SYMBOL: spot
: get-char ( -- char )
spot get code get nth ;
: next ( -- )
spot inc ;

I used variables instead of the stack because a lot of the code, which processed this, didn't care about the code or spot. Really, only those two words care. This is the basis for a bunch of abstraction that I can build on it later, for example the combinator skip-until, the basic form of which is below:

: skip-until ( quot -- ) ! quot: ( -- ? )
[ call ] keep swap [ drop ] [
next skip-until
] if ; inline

This calls next until the quotation returns something other than f. It turned out to be really useful in parsing, allowing further devices to be built on top of it easily, like take-until, which does the same thing but records the characters skipped. From this, something like take-char is simple to write. This word skips the parser until the specified character is reached, and returns the string passed:

: take-char ( char -- string )
[ dup get-char = ] take-until nip ;

After building abstractions like this, I was able to build a parser for XML relatively painlessly. The parser itself is in libs/xml/tokenize.factor, and it does no string operations directly at all. I think it is a type of recursive decent parser, but I'm not sure.

Because of this abstraction, I was able to change the inner workings of next and get-char twice, the first time to add information about the line and column of any errors encountered, and the second time to operate on streams instead of strings. This didn't require any changes at all to tokenize.factor. I'm planning on changing it again, but still, only a few of these basic, underlying words will be affected, and higher-level code will not be changed.

I've heard people say that it's tough to write a parser by hand. This was probably simpler than it could have been because of the relative simplicity of XML's grammar. This isn't really a Factor-specific solution, just something in general: with significant abstraction, and the ability to write higher-order functions, I don't think anything is too dificult to do "by hand" in an adequate programming language.

PS. To those of you who commented on this blog and I never responded: Sorry; I didn't really expect anyone to read this blog so I never checked it for comments. To Sam, about the combinators: Your implementation might work, but it won't compile in Factor's current implementation. In designing the XML parser and associated utilities, speed was a concern, and that means letting as much as possible compile. This particular thing doesn't work because you are manipulating quotations at runtime. In general, the word curry is good for that, and should probably be used in this case.

To Ray Cromwell, I know Google didn't really drop its API, but it's not issuing any new keys for the SOAP API, so as far as I'm concerned, that's gone. As far as the new AJAX API, I was under the impression that that can only be accessed client-side, so for all applications that want to do something like remove the ads or analyze rankings, it doesn't work.

Tuesday, January 2, 2007

XML combinators

I really need to work on making my writing make sense, because I think my last post failed at that. The XML library that I've written has a lot of combinators defined for processing, and in this blog post, I'm going to try to describe them as well as I can.

The basic idea behind the XML combinators is to let XML be processed the same way Lispers (and Factorers) have always processed lists and other sequences: with map each find subset and now,inject. These operations are slightly more complicated when used with a tree structure like XML, rather than a flat sequence as the operations were originally designed for. Fortunately, Factor's generic word system makes this relatively easy.

There's one consideration necessary in designing these that I wasn't happy with: since I had to iterate over all nodes, not just the leaf nodes. In all combinators, the result can be affected by which order is used. I made a relatively arbitrary decision and said that first the parent nodes were processed and then the leaves.

A note before the code of xml-each, the simplest of the combinators: this uses Factor's generic words, which dispatch only on the top of the stack. xml-each must dispatch on the class of the XML tag, not the quotation. However, the Factor convention for combinators is that the quotation is on the top of the stack, because this looks better syntactically. So I need to do a swap before entering the generic word. That said, here's the combinator:

GENERIC: (xml-each) ( quot tag -- ) inline
M: tag (xml-each)
[ swap call ] 2keep
tag-children [ (xml-each) ] each-with ;
M: object (xml-each)
swap call ;
M: xml-doc (xml-each)
delegate (xml-each) ;
: xml-each ( tag quot -- ) ! quot: tag --
swap (xml-each) ; inline

The rest of the combinators are implemented in much the same way, with only trivial changes to make different ones. Unlike some combinators (see my previous post) this is very simple to implement, not in small part because I have existing combinators to help. (Unfortunately, it doesn't seem to be possible to abstract a general "lift" combinator (analogous to liftM in Haskell for monads) to turn each into xml-each and map into xml-map, despite the similarities between them).

A really simple usage of xml-each is [ write-item ] xml-each, which prints out each element of the given XML tag, document, or other type of element.

As a more in-depth example, xml-find (which works just like find, but without returning an index) is used to implement the equivalent of .getElementByID(), called get-id:

: get-id ( tag id -- elem )
swap [
dup tag? [
"id" prop-name-tag
[ string? ] subset concat
over =
] [ drop f ] if
] xml-find nip ;

(Note: one weakness of the XML utilities that currently exist is that it's annoying to deal with XML tag attributes, though this should be fixed when I do more real work using them)

Monday, January 1, 2007

Designing combinators in Factor

Factor's a great programming lanugage, but one of the more annoying points of Factor is that it can be difficult to write complicated combinators, the Factor/Joy term for higher order functions. Here's one that I wrote recently. See if you can tell what it does:

: map-replace ( seq pred map -- )
pick dup length [ ! ( seq pred map elt i )
>r swap >r rot >r ! ( pred elt | seq map i )
[ swap call ] 2keep rot ! ( pred elt ? | seq map i )
r> r> swap >r rot >r ! ( elt map | pred seq i )
[ call ] keep ! ( new-elt map | pred seq i )
r> r> r> rot >r rot >r ! ( new-elt seq i | map pred )
swap [ set-nth ] keep ! ( seq | map pred )
r> r> swap ! ( seq pred map )
] [
drop r> swap r> r> drop ! ( seq map pred )
] if
] 2each 3drop ; inline

I showed this code to Slava Pestov, the creator of Factor, and he didn't understand it. This is ridiculously illegible, and the funny part is, it only has five words that actually *do* anything, and the rest just move stuff around the stack. I put these words in bold and italics.

This code looks complicated but it actually does a pretty simple thing. Given a mutable sequence, this code takes two quotations, a predicate and a map. On each element, the predicate is applied to the element. If the result is not f, the map quotation is run with the element, and that value is put in the place where the element used to be. An example of this is

{ 1 2 3 } dup [ 2 = ] [ 3 + ] map-replace .

which outputs { 1 5 3 }. I wrote this up as part of some code to change XML code in place, as a supplement to xml-map, which makes a new tree entirely.

To simplify this, I wrote a new combinator which takes only one quotation. This quotation is run on the element. If the result is f, nothing happens to the element. But if it is something else, the item is set to the resulting value. This can be written as:

: map-replace2 ( seq quot -- )
over dup length [ ! ( seq quot elt i )
>r rot >r swap [ call ] keep ! ( result quot | seq i )
swap [ ! ( quot result | seq i )
r> r> swap [ set-nth ] keep ! ( quot seq )
] [ ! ( quot | seq i )
r> r> drop ! ( quot seq )
] if* swap
] 2each 2drop ; inline

In the end, it turned out all of this was almost completely pointless. Apparently, the Factor standard library already has a word called inject, which is like map, except it works in place. This is basically what I wanted all along, I was just making the mistake of premature optimization, shying away from setting each element individually because I thought it'd be too costly. inject is defined as

: inject ( seq quot -- )
over length [
[ -rot change-nth ] 3keep
] repeat 2drop ; inline

Using inject, Slava implemented my first map-replace as

: (map-replace) ( pred quot elt -- newelt )
[ -rot slip ] keep rot [ swap call ] [ nip ] if ;

: map-replace ( seq pred quot -- )
[ pick pick >r >r (map-replace) r> r> rot ] inject
2drop ;

but at this point, it looks like like inject will be enough by itself. Now, going back to the original purpose, it is really simple to lift this to operate on XML trees rather than just sequences. xml-inject now joins the existing combinators xml-each xml-map and xml-find, all of which were very simple to implement given the existing parallel combinators on sequences:

GENERIC: (xml-inject) ( quot tag -- ) inline
M: tag (xml-inject)
tag-children [
swap [ call ] keep
swap [ (xml-inject) ] keep
] inject ;
M: object (xml-inject) 2drop ;
M: xml-doc (xml-inject) delegate (xml-inject) ;
: xml-inject ( tag quot -- ) ! quot: tag -- tag
swap (xml-inject) ; inline

Update: Note that in the current version of factor, inject is called change-each. Also, when writing new combinators now, it's very often useful to include a new word called curry, which I described here.