Tuesday, January 9, 2007

Imperative parsing with abstractions

Here's somewhat of a confession: after writing a non-trivial library centered around parsing, I still know nothing about parsing strings. My attempts to wrap my head around what exactly an LALR(1) parser is and any little basic aspect of the theory of parsing have been in vain. I still struggle with using simple parser combinators (like the ones doublec has created).

So for my XML library, I created my own system of parsing which may or may not be original or useful. All of the below examples cah be run, in Factor, by requiring the module libs/xml and USEing state-parser. The source code can be found in the Factor distribution in libs/xml/state-parser.factor.

This way of parsing evolved very slowly. At first, it was just something like this:

SYMBOL: code
SYMBOL: spot
: get-char ( -- char )
spot get code get nth ;
: next ( -- )
spot inc ;

I used variables instead of the stack because a lot of the code, which processed this, didn't care about the code or spot. Really, only those two words care. This is the basis for a bunch of abstraction that I can build on it later, for example the combinator skip-until, the basic form of which is below:

: skip-until ( quot -- ) ! quot: ( -- ? )
[ call ] keep swap [ drop ] [
next skip-until
] if ; inline

This calls next until the quotation returns something other than f. It turned out to be really useful in parsing, allowing further devices to be built on top of it easily, like take-until, which does the same thing but records the characters skipped. From this, something like take-char is simple to write. This word skips the parser until the specified character is reached, and returns the string passed:

: take-char ( char -- string )
[ dup get-char = ] take-until nip ;

After building abstractions like this, I was able to build a parser for XML relatively painlessly. The parser itself is in libs/xml/tokenize.factor, and it does no string operations directly at all. I think it is a type of recursive decent parser, but I'm not sure.

Because of this abstraction, I was able to change the inner workings of next and get-char twice, the first time to add information about the line and column of any errors encountered, and the second time to operate on streams instead of strings. This didn't require any changes at all to tokenize.factor. I'm planning on changing it again, but still, only a few of these basic, underlying words will be affected, and higher-level code will not be changed.

I've heard people say that it's tough to write a parser by hand. This was probably simpler than it could have been because of the relative simplicity of XML's grammar. This isn't really a Factor-specific solution, just something in general: with significant abstraction, and the ability to write higher-order functions, I don't think anything is too dificult to do "by hand" in an adequate programming language.

PS. To those of you who commented on this blog and I never responded: Sorry; I didn't really expect anyone to read this blog so I never checked it for comments. To Sam, about the combinators: Your implementation might work, but it won't compile in Factor's current implementation. In designing the XML parser and associated utilities, speed was a concern, and that means letting as much as possible compile. This particular thing doesn't work because you are manipulating quotations at runtime. In general, the word curry is good for that, and should probably be used in this case.

To Ray Cromwell, I know Google didn't really drop its API, but it's not issuing any new keys for the SOAP API, so as far as I'm concerned, that's gone. As far as the new AJAX API, I was under the impression that that can only be accessed client-side, so for all applications that want to do something like remove the ads or analyze rankings, it doesn't work.

No comments: