So for my XML library, I created my own system of parsing which may or may not be original or useful. All of the below examples cah be run, in Factor, by requiring the module libs/xml and USEing state-parser. The source code can be found in the Factor distribution in
This way of parsing evolved very slowly. At first, it was just something like this:
: get-char ( -- char )
spot get code get nth ;
: next ( -- )
spot inc ;
I used variables instead of the stack because a lot of the code, which processed this, didn't care about the code or spot. Really, only those two words care. This is the basis for a bunch of abstraction that I can build on it later, for example the combinator
skip-until, the basic form of which is below:
: skip-until ( quot -- ) ! quot: ( -- ? )
[ call ] keep swap [ drop ] [
] if ; inline
nextuntil the quotation returns something other than f. It turned out to be really useful in parsing, allowing further devices to be built on top of it easily, like
take-until, which does the same thing but records the characters skipped. From this, something like
take-charis simple to write. This word skips the parser until the specified character is reached, and returns the string passed:
: take-char ( char -- string )
[ dup get-char = ] take-until nip ;
After building abstractions like this, I was able to build a parser for XML relatively painlessly. The parser itself is in
libs/xml/tokenize.factor, and it does no string operations directly at all. I think it is a type of recursive decent parser, but I'm not sure.
Because of this abstraction, I was able to change the inner workings of
get-chartwice, the first time to add information about the line and column of any errors encountered, and the second time to operate on streams instead of strings. This didn't require any changes at all to
tokenize.factor. I'm planning on changing it again, but still, only a few of these basic, underlying words will be affected, and higher-level code will not be changed.
I've heard people say that it's tough to write a parser by hand. This was probably simpler than it could have been because of the relative simplicity of XML's grammar. This isn't really a Factor-specific solution, just something in general: with significant abstraction, and the ability to write higher-order functions, I don't think anything is too dificult to do "by hand" in an adequate programming language.
PS. To those of you who commented on this blog and I never responded: Sorry; I didn't really expect anyone to read this blog so I never checked it for comments. To Sam, about the combinators: Your implementation might work, but it won't compile in Factor's current implementation. In designing the XML parser and associated utilities, speed was a concern, and that means letting as much as possible compile. This particular thing doesn't work because you are manipulating quotations at runtime. In general, the word
curryis good for that, and should probably be used in this case.
To Ray Cromwell, I know Google didn't really drop its API, but it's not issuing any new keys for the SOAP API, so as far as I'm concerned, that's gone. As far as the new AJAX API, I was under the impression that that can only be accessed client-side, so for all applications that want to do something like remove the ads or analyze rankings, it doesn't work.