
Thursday, February 5, 2009

XML pattern matching

I've implemented syntax for pattern matching on interpolated XML literals. This is a Scala-inspired feature which may or may not be useful, but is definitely cool-looking. Here's a sample of code:
: dispatch ( xml -- string )
    {
        { [ [XML <a><-></a> XML] ] [ "a" prepend ] }
        { [ [XML <b><-></b> XML] ] [ "b" prepend ] }
        { [ [XML <b val='yes'/> XML] ] [ "yes" ] }
        { [ [XML <b val=<->/> XML] ] [ "no" prepend ] }
    } switch ;

And here are some examples of what it does:
[XML <a>pple</a> XML] dispatch
=> "apple"
[XML <b>anana</b> XML] dispatch
=> "banana"
[XML <b val="yes"/> XML] dispatch
=> "yes"
[XML <b val="where"/> XML] dispatch
=> "nowhere"

The pattern matching here is based on my inverse library. Hopefully you get the high-level idea of how XML pattern matching works. A caveat is that namespaces are ignored, as in Scala's XML pattern matching, and I haven't thought of a good way to incorporate namespaces.

Jeff Atwood recently wrote a blog post about XML literal syntax*, and how it's a big help. In part, I agree with his point that having XML literals mixed in with everything lets you write cleaner code. That's why I implemented them in Factor.

But I think it's insane to have them built in the way that Scala and, apparently, VB.NET have. Just think if the designers of Java had had this idea 15 years ago. They would have made SGML literal syntax rather than XML, and there would be no way to remove it because of backwards compatibility concerns. Every implementation of Java would have to contain a full-featured SGML parser. Generating HTML would be easier and allow for very pretty code, though.

XML won't be so prominent forever. It'll always be there, of course, just like SGML is still there in non-XML HTML and DocBook files which refuse to die. But I wouldn't be surprised if, in 20 years, what we want is a literal syntax for something other than XML. We'll still be using a lot of the same programming languages that exist today, though.

Factor's XML literal syntax and XML pattern matching are just libraries, and I wouldn't want it any other way. Factor's parser needed no adjustments at all to allow for XML literals. If an XML literal is written when the vocabulary xml.literals isn't in scope, it's a syntax error. Sure, you need delimiters [XML XML] around the XML literals, but this is a small price to pay.

In 20 years, if XML isn't useful any longer, then Factor programmers will be just as well off as if it still were. If there's a new data format, it'll just be a new library to download, and you'll have literal syntax for that data format. If you were using VB.NET, though, you'd have to upgrade to the most recent version of the language, with a parser that's been vastly complicated.


*He actually talked about HTML literal syntax, but that's such a ridiculous idea that I know he just mistyped. There's no way to tell where an HTML fragment ends, for one, and the HTML DTD would have to be part of the core grammar of the language. You would need delimiters around where the HTML starts and ends, and none of his code fragments have those delimiters. The X key is right next to the letters H and T, so it's an honest typo. His pseudocode sample just before the VB.NET fragment must have also been a simple mistake, as it would seem to imply that either XML literals get printed immediately or that there is some implicit way of collecting them.

Sunday, January 25, 2009

Factor supports XML literal syntax

Factor can now express XML as literals in code. There's a new library, xml.interpolate, which lets you create an XML chunk or document using interpolation, either with locals or with a fry-like syntax. Here's a taste, from the syndication vocab, which creates an Atom file:
: feed>xml ( feed -- xml )
    [ title>> ]
    [ url>> present ]
    [ entries>> [ entry>xml ] map ] tri
    <XML
        <feed xmlns="http://www.w3.org/2005/Atom">
            <title><-></title>
            <link href=<-> />
            <->
        </feed>
    XML> ;
This could also be written with locals:
:: feed>xml ( feed -- xml )
    feed title>> :> title
    feed url>> present :> url
    feed entries>> [ entry>xml ] map :> entries
    <XML
        <feed xmlns="http://www.w3.org/2005/Atom">
            <title><-title-></title>
            <link href=<-url-> />
            <-entries->
        </feed>
    XML> ;
Here's an example with more complicated logic:
"one two three" " " split
[ [XML <item><-></item> XML] ] map
<XML <doc><-></doc> XML>
whose prettyprinted output (using xml.writer:pprint-xml) is
<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <item>
    one
  </item>
  <item>
    two
  </item>
  <item>
    three
  </item>
</doc>
The word <XML starts a literal XML document, and [XML starts a literal XML chunk. (A document has a prolog and exactly one tag, whereas a chunk can be any kind of snippet, as long as tags are balanced.) Things are spliced in using a tag of the form <-foo->. This syntax is a strict superset of XML, as a tag name in XML 1.0 is not allowed to start with -.

It took me just an evening to hack this up. It's less than 15 lines of code modifying the XML parser, and all the interpolation is less than 75 lines of code. Best of all, unlike Scala's XML literals, this doesn't affect the core of the language or make anything else more complicated. It's just nicely tucked away in its own little vocabulary. I plan on replacing usages of html.components with this, once I make some tweaks.

Thursday, January 15, 2009

XML encoding auto-detection

The Factor XML parser now auto-detects the encodings of XML documents. This works for all of the encodings that are implemented in Factor. To see how it's implemented, look at the XML standard, because it explains it much better than my blog post below.

I was mystified myself when I first read that XML documents can specify what encoding they use, in the document itself. The encoding, if it's not UTF-8, must be specified in the prolog like this:
<?xml version="1.0" encoding="ISO-8859-1"?>

The idea of the algorithm is simple. It just goes by cases. First, check for a byte order mark (BOM), which would indicate UTF-16 and a particular endianness, or UTF-8. If there's no BOM, then the first character must be < or whitespace. If it's <, we can differentiate between UTF-16BE, UTF-16LE (without BOMs) and an 8-bit encoding. If it's one of the first two, we can tell by the fact that there's a null byte before or after the <. If it's an 8-bit encoding, we can be sure that there won't be any non-ASCII in the prolog, so just read the prolog as if it's UTF-8, and if an encoding is declared, use that.
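For concreteness, here's a rough sketch of that case analysis in Python rather than Factor. This is just an illustration of the cases described above, not the code in the Factor library; it ignores UTF-32, the whitespace-first case, and error handling.

import re

def sniff_xml_encoding(data: bytes) -> str:
    # Byte order marks settle things immediately.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # No BOM: a document starting with '<' betrays BOM-less UTF-16
    # by where the null byte sits relative to the '<'.
    if data.startswith(b"\x00<"):
        return "utf-16-be"
    if data.startswith(b"<\x00"):
        return "utf-16-le"
    # Otherwise assume an 8-bit, ASCII-compatible encoding: the prolog is
    # plain ASCII, so read it as UTF-8 and honor any declared encoding.
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return match.group(1).decode("ascii") if match else "utf-8"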

To implement it, I just read byte by byte and have a case statement for each level. After just two octets, it's possible to differentiate between UTF-8, UTF-16 (with a BOM, for both endiannesses), UTF-16BE and UTF-16LE. A similar process could also identify UTF-32 and friends after 4 octets. In my implementation, I had to do a little bit of hacking inside the XML code itself to get this integrated properly. Altogether, it's about 40 or 50 lines of code. It's available now in the Factor git repository.

[Update: Thanks for pointing out my error, Subbu Allamaraju. Fixed a typo, see comments.]

Thursday, January 8, 2009

I'm back

Hi again! I've been gone for the past few months because in the fall I was away in Budapest studying math, and it was really hard. Now, I've decided to take three months off from school to work on Factor full-time. My plan is to first work on finishing up Unicode and XML, and then to try to improve Factor's garbage collector.

For Unicode, over the past couple days, I fixed bugs in normalization and grapheme breaking, and I implemented word breaking. This was all made a lot easier by the test suites that Unicode includes, and these are now used as unit tests. I also wrote docs for everything. I still need to optimize everything and clean up the code, though. There are a lot more algorithms that I could implement, and plan on implementing eventually, but I'm just going to do this for now.

For XML, I plan on hooking up the XML conformance test suite and fixing any conformance issues that come up, for starters. I've been terribly informal about testing in the past, and I'll try to change this. Instead of optimizing the current XML code directly, I plan on replacing it with a generated lexer and parser: I'll try to make a system like lex/yacc in Factor. Chris Double has already made some pretty cool stuff for parsing with parser combinators and parsing expression grammars, but these both have efficiency problems. I know that a traditional system like lex/yacc can solve the problem of parsing XML, and while it's not as flexible as PEGs, it still might be possible to add high-level OMeta-like features. Parsing is something that I don't know a lot about, so I'll be doing some reading for this.

For garbage collection, I want to actually implement something after studying it for so long! I plan on first making Factor's GC generational mark-sweep, and investigating changes to the generational policy to reduce thrashing. Then, there are two things that I want to do: make it incremental, and make some kind of compaction. Maybe this will take the form of mark-copy, or maybe it'll just be an incremental mark-sweep collector with occasional stop-the-world compaction (which could be disabled for certain applications). Either way, expect the footprint of Factor programs in the future to be reduced by almost half, and hopefully GC pauses will be shorter most of the time.

Wednesday, February 6, 2008

XML and its alternatives

I started writing Factor's XML parser thinking that its purpose was to interface with legacy protocols which made the mistake of choosing XML, and that the people at the W3C were a bunch of idiots for pushing such a bad, unoriginal format on innocent programmers who would do better without it. At this point, though, I think it might not actually be that bad. Let's look at the alternatives for representing human-readable structured information for standardized protocols.

Flat text files

In the recent past, many protocols and file formats were written as flat files, binary or human-readable, each requiring an individually specialized parser. Many are still written this way. Does it make any sense to impose a tree structure on something as simple as blog syndication, documents or remote procedure calls? Or was it a wrong turn to put all of that in the same verbose, complicated syntax?

I think it was a good idea to specify these formats in terms of a common tree-based human-readable format. Maybe for some low-level network protocols, a flat text or binary file makes sense, but many other things work out well using a tree structure. For example, the Atom syndication format is a way to store things like blog feeds in XML. The structure is pretty simple: there's a bunch of metadata about the blog, and then there are a bunch of nodes corresponding to items, with roughly the same fields as the feed itself has. (I'm oversimplifying, here.) Atom uses a tree structure to store this, and the tree is in XML syntax. A tree structure makes sense, because there are a couple different sub-levels: there's the level of items, and then underneath that, the level of data about each item. These can be cleanly separated in a tree model.

Using a pre-existing XML parser, Atom is fairly easy to parse and generate. I wrote a simple library for Atom parsing and generation here in not much code.

An additional benefit of a tree structure in a standard syntax is that standard tools can be used on it. On the most basic level, you can use a parsing library. But is this really necessary if the format is simple enough anyway? When there is a large amount of information, parsing inevitably becomes harder, and a consistent encoding of hierarchical structure makes this easier.

A new alternative to Atom is Zed Shaw's ZSFF, where information is in a simple flat-file format. (Update: Zed says this should be taken as a joke.) Originally, this only had basic information about the blog overall, and the URL of each post in chronological order. But when things were extended to show the article contents, Zed's solution was to have the flat file link to a special source format he uses to generate HTML. He didn't provide anything to get the date things are posted, which, in his case, can be deduced from the URL.

I don't mean to criticize Zed, but this new format will actually be more difficult for programmers to process than regular Atom feeds, if they want to build a "river of news"-style aggregator. A ZSFF aggregator (as Planet Factor is an Atom aggregator) would have to parse the URLs to figure out the dates to get a correct ordering, and follow the URLs with a .page extension to get content. Those pages must also be parsed to get the title and content in HTML form. Is it easier to write a ZSFF generator? Yes, but it's much harder to read, and that must be taken into consideration just as much.

S-expressions

Many smart Lispers have complained about XML. They claim s-expressions (sexprs), the basis for Lisp syntax, are better for most human-readable data serialization purposes, and they have a bunch of good reasons. Some of these are:
  • Sexprs are simpler to parse—just use read.
  • They're easier to process, since they're just nested lists.
  • Sexprs encode real data types in a direct way, not just strings but also integers and floats.
  • XML is unnecessarily complicated, including things like the distinction between attributes and children and unnecessary redundancy in closing tags.
  • Sexprs came first and they do everything XML does, so why should we use XML?

It's hard to defend XML against the Lispers' criticisms, largely because at least half of the criticisms are correct: XML is completely isomorphic to sexprs but with a more complicated and verbose syntax. Still, it has a few advantages besides its legacy status. XML is very well-specified and is guaranteed not to differ between conforming implementations. The general idea of s-expressions is fairly universal, but it differs between implementations what characters can be included in a symbol, what characters and escapes can be used in strings, the precision of floats and the identity of symbols. Within a well-specified Lisp (I'm using Lisp in the general sense here), there's not much ambiguity, but in communicating between computers programmed with different parser implementations, XML is more robust. XML supports Unicode very consistently, with explicit mention of it in the standard. Sexprs can't be depended on for this.

This might not be helpful in general, but one really great thing about XML is that you can embed other XML documents inside of it very cleanly. I never thought I'd use XML for anything but others' protocols until it turned out to be the right tool for a quick hackish job: a simple resource file to generate the Factor FAQ. Previously, I maintained the FAQ in HTML directly, but it became tedious to update the table of contents and the start values for ordered lists. So in a couple hours I came up with and implemented a simple XML schema to represent the questions and answers consisting of XHTML fragments. I could parse the HTML I was already using for the FAQ and convert it to this XML format, and convert it back.

Could I have used s-expressions, or an equivalent using the Factor reader? Sure, but it would have taken more effort to construct the HTML, and I'm lazy. One reason is really superficial: I would have had to backslash string literals. Another reason is that I had an efficient XML parser lying around. But one thing which came in handy which I didn't expect was that, because of XML's redundancy, the parser caught my missing closing tags and pointed them out at the location where they occurred. Anyway, these are all small things, but they added up to XML being an easier-to-hack-up solution than s-expressions or ad-hoc parsing.

(JSON: I have nothing to say about JSON, but I don't think anyone's ever tried to use it for standardized protocols, just for AJAX. And I don't really know much about it. But I feel like I should mention it because it's out there, and it's definitely a reasonable option because it's strongly standardized. The same goes for YAML, basically. It's important to note, though, that JSON and YAML are in a way more complicated (though arguably more useful) because they include explicit data types.)

Conclusion

While XML isn't perfect, it is the right tool for some jobs, even some things it isn't currently used for. Of course, there are definitely cases where flat files or s-expressions are more appropriate; it would be stupid to reflexively use XML for everything where you want a human-readable data format. But format standards, while annoying, are great when someone else takes the time to implement them for you. This way, you don't have to worry about things like character encodings or parsing more complicated grammars as the format grows. The biggest benefits of XML are the tree structure and its standardization.

Update: For a more complete look at the alternatives to XML, check out this page which William Tanksley pointed out (unfortunately only on the Internet Archive).

Tuesday, January 22, 2008

Matching, diffing and merging XML

Update: A newer, more complete version is here.

I've said bad things about my job working on Carleton College's website, but fundamentally it's a really sound work environment we have. Just before winter break, one of the full-time employees came to me and asked if I could make a diff between two XHTML documents for use in Carleton's CMS, Reason. This would be useful for (a) comparing versions of a document in the CMS and (b) merging documents, in case two people edit the same document at the same time, so as to avoid locks and the need for manual merges. They came to me because I told them I'd written an XML parser.

I may know about XML, but I know (or rather knew) nothing about the algorithms required for such a diff and merge. I looked on Wikipedia, but there was no article about this kind of stuff. So for the past three weeks, I've been paid to read academic papers, mostly from Citeseer, my favorite website, about matching, diffing, and merging trees. Here's some of what I've found.

A naive algorithm

Here's something that won't work: a line-by-line diff and merge, along the lines of the Unix utilities diff and patch. You can split things up so that each tag is on a different line, as many have suggested, but it still won't work. Everyone who I've mentioned this problem to, including the boss who gave the assignment, thought of doing something like this, but it's unworkable.

Most obviously, this can easily break the tree structure. Here's one example of a failing three-way merge. (By a three-way merge, I mean a merge of two documents where you have the original document which the other two are derived from. diff3 does a line-by-line three-way merge, which I'm using for this example.)

Original:

<p>
This is
some
text
</p>

Part bolded:

<p>
<b>
This is
some
</b>
text
</p>

Part italicized:

<p>
This is
<i>
some
text
</i>
</p>

Line-by-line merge:

<p>
<b>
This is
<i>
some
</b>
text
</i>
</p>

This XML is not well-formed! This is unacceptable. I'm not saying diff3 is a bad tool, it's just not what we need.

Another problem is that a standard Unix diff only takes two things into account: insertions and deletions. In document diffing and merging, it'd be helpful if we supported another operation: moves. Say we take two paragraphs and swap them. In a model with only deletions and insertions, the paragraph would have to be deleted from one place and inserted in another. This can lead to sub-optimal diffs as presented to the user. It can also lead to bad merges: say in one branch, a paragraph was edited, and in another branch, that paragraph was moved to a different location. An optimal merge would put the edited paragraph in the new location, which requires tracking these moves directly.

Background to a better solution

XML can be viewed as ordered trees, where each node has an unbounded number of children and each internal node has a label. (In some situations, it acts more like an unordered tree, but in XHTML it's definitely ordered. Also, finding the optimal matching for unordered trees is known to be NP-hard, so we don't want to go there.) So we can solve these problems of diffing and merging XML by solving a more general problem on ordered trees. There are some specific aspects of XML which deserve mention (attributes, which are guaranteed to have unique names within a node; IDs, which are guaranteed to be unique within a document), but these are minor aspects which we can ignore for most of the time.

To avoid confusion, I'll define some terms I've been using or will soon start using. When I talk about "diffing", what I mean, formally, is generating an "edit script", or list of changes between two documents that can be used to get the modified document from the original. Sometimes, these edit scripts are invertible, but not always. When I talk about a "merge", I mean a way to reconcile the changes between documents to incorporate both of these changes. A merge can be an operation on edit scripts or it can be done directly on a tree matching. A "matching" is a set of correspondences between nodes in different trees; it is the basis for doing either a diff or a merge, and it's difficult to do efficiently.

(It might help to have a basic understanding of Levenshtein distance and the implied string diff algorithm. The longest common subsequence problem will also crop up here from time to time.)
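As a quick refresher, here's the textbook dynamic program for LCS length, written in Python rather than Factor. This is background only; none of the tree algorithms below reduce to exactly this, but line-based diffs are built on it.

def lcs_length(a, b):
    # table[i][j] = length of the longest common subsequence of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            table[i + 1][j + 1] = (table[i][j] + 1 if x == y
                                   else max(table[i][j + 1], table[i + 1][j]))
    return table[len(a)][len(b)]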

The Zhang-Shasha algorithm and extensions

The Zhang-Shasha algorithm is the basic starting point when thinking about tree matching, diffing and merging. Except it isn't that basic. Dennis Shasha and Kaizhong Zhang created an algorithm to solve the approximate tree matching problem, which they described in the book Pattern Matching Algorithms. Their chapter on trees is available from Citeseer. Here's the basic idea: we want to see how similar two trees are, by a weighted edit distance metric. The edit script has three operations, similar to Levenshtein distance: add a node (optionally taking over a contiguous subsequence of the parent's children), delete a node (splicing its children into the parent node), and relabel a node.

With this, they were able to come up with an algorithm of complexity (basically) O((n log n)^2), where n is the number of nodes in the tree. So this can get you a matching and an edit script, or diff between two trees. This isn't great, but it's much better than previous algorithms. The two also worked on the more difficult problem of supporting "Variable-length don't-cares", or the equivalent of * in Google/Unix file globbing, in tree matching, but we don't care about that here.

But this doesn't describe all of the changes that might take place in a tree structure that we might want to record. For example, in Zhang-Shasha proper, moving a node from one place to another is recorded as deleting the node from one place and inserting it into another place. Another issue is that inserting or deleting a subtree is recorded as inserting the node, then inserting each of its children, or deleting the leaf nodes recursively up until you delete their parent. This all leads to counterintuitive diffs, as far as human readability goes, as well as inflated edit distances.

So David Barnard, Gwen Clarke and Nicholas Duncan got together to create a modified algorithm that accommodated this modified definition of edit distance in this paper. It adds three additional operations: insertTree, deleteTree, and swap. Unfortunately, this doesn't account for copying nodes, or for moves that aren't within the same parent.

Some tree matching heuristics

So, it's not very good that the Zhang-Shasha algorithm is quadratic in most cases. In fact, for my case, where I might sometimes have to compare XHTML documents which are very long, it's unacceptable. But there are some algorithms which run in a time which is dominated by the number of nodes multiplied by the edit distance, or O(ne).

One algorithm called FastMatch, which has insert leaf, delete leaf, update and general move operations is presented by a bunch of people from Stanford in this paper. They work on getting an edit script and matching at the same time, but the algorithm starts by matching as much as possible, top-down, before proceeding to calculate the differences. This yields a complexity of O(ne+e^2). A related algorithm, described in Chapter 7 of Tancred Lindholm's master's thesis [PDF] incorporates tree insertion and deletion operations for a complexity of O(ne log n).

It's important to note that both of these will be O(n^2) in the complete worst case. A different XML matching algorithm was described by Grégory Cobéna in his master's thesis (and also in this paper, which is basically the relevant segment). Cobéna calls his algorithm BULD, which stands for bottom-up lazy-down. The key to the algorithm is that, for each node, there is a hash value and a weight, both calculated bottom-up. Exact equivalence between subtrees can be approximated by equal hash values, and candidate matches are found by pulling nodes off a max-heap ordered by weight and looking for equal hashes. In Cobéna's full thesis [PDF], he goes into more depth about his invertible edit script format. This algorithm doesn't necessarily generate the optimal diff, but in experiments it generates a very good one, and with a worst-case time of O(n log n).
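To make the hash-and-weight idea concrete, here's a heavily simplified Python sketch of just that candidate-search phase. The node representation and the greedy loop are my own illustration, not Cobéna's actual algorithm, which also propagates matches up and down the trees after candidates are found.

import hashlib, heapq

def annotate(node):
    # Bottom-up: a node's hash covers its label and its children's hashes,
    # and its weight is the size of its subtree.
    digest = hashlib.sha1(node["label"].encode())
    weight = 1
    for child in node["children"]:
        child_hash, child_weight = annotate(child)
        digest.update(child_hash.encode())
        weight += child_weight
    node["hash"], node["weight"] = digest.hexdigest(), weight
    return node["hash"], node["weight"]

def match(old_root, new_root):
    annotate(old_root)
    annotate(new_root)
    # Index the old tree's subtrees by hash.
    by_hash = {}
    stack = [old_root]
    while stack:
        node = stack.pop()
        by_hash.setdefault(node["hash"], []).append(node)
        stack.extend(node["children"])
    # Pull new subtrees off a max-heap by weight; match whole subtrees on
    # equal hashes, otherwise descend into the children.
    matches = []
    heap = [(-new_root["weight"], id(new_root), new_root)]
    while heap:
        _, _, node = heapq.heappop(heap)
        candidates = by_hash.get(node["hash"])
        if candidates:
            matches.append((candidates.pop(), node))
        else:
            for child in node["children"]:
                heapq.heappush(heap, (-child["weight"], id(child), child))
    return matches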

Three-document merge

Creating an edit script is all well and good, but it's only half of the problem: the merge. Remember that a three-document merge is one where we have the original document and two modified versions, and we want to create a fourth version with both modifications together. Here was my idea: create an edit script for both modified versions with respect to the original, then do one followed by another, with repeated modifications done only once. We know there's a conflict if the order matters, in terms of which comes first in applying to the original document.

But this will come up with more conflicts than actually exist. For example, say some node A has four children, B C D and E. In one change, we insert a new node X after B as a child of A, and in another change, we insert a node Y after D as a child of A. So a sensible merge would have A's new children be B X C D Y E, in that order. But with the model described above, there would be an edit conflict!

One solution to this is the more general strategy of operational transformation. The basic idea for this technique as applied here is that, if we insert Y after inserting X, we have to add 1 to the index at which Y is inserted. If, on the other hand, Y is inserted first, we don't have to adjust the index at which X is inserted. This technique leads to fewer conflicting merges, or in OT lingo, it converges in more cases. There are a few formal properties of an operational transformation that have only recently been proven correct in the best-known algorithms. Pascal Molli used operational transformation, together with Cobéna's diff algorithm and format, in his So6 synchronization framework, which he described in a deceptively short paper [PDF].
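Here's the B/X/Y example from above as a tiny Python sketch. The (parent, index, node) operation format and the transform function are my own hypothetical illustration of the insert-insert case, not the So6 format.

def transform_insert(op, applied):
    # Shift an insert's index past same-parent inserts that have already
    # been applied at an earlier (or equal) position.
    parent, index, node = op
    for other_parent, other_index, _ in applied:
        if other_parent == parent and other_index <= index:
            index += 1
    return parent, index, node

# A's children are B C D E (indices 0-3).
# One branch inserts X after B (index 1); the other inserts Y after D (index 3).
branch1 = ("A", 1, "X")
branch2 = ("A", 3, "Y")
merged = [branch1, transform_insert(branch2, [branch1])]
# => [("A", 1, "X"), ("A", 4, "Y")], i.e. A's children become B X C D Y E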

Tancred Lindholm went a different route altogether in creating a three-way merge, throwing out the edit script and basing it on a tree matching. His algorithm is described in a paper that I don't understand well enough to faithfully summarize here.

Conclusion

At this point, you might be thinking, "I read all that and he didn't even tell me how the algorithms work?!" I would tell you all about that, but there are two problems. First, that'd take thousands more words, and this post already exceeds 2000 words. This is just an introduction to the problem and the literature. But more importantly, I don't fully understand these algorithms. They're very complicated to follow and even harder to implement.

So, overall, it looks like this is a much harder problem than I initially thought. Each time I read a paper on this, I think, "Oh, so this is what I needed all along. And I wasted so much time looking at the other papers before it!" But there is always another more complicated layer. Before I get to each new layer, I think, "No one's ever done this before!" and then I suddenly find a paper about it, published increasingly recently.

The problem I'm looking at right now is the last step, the interface. In particular, I'm not sure how to explain a diff to non-technical users and ask questions about how to resolve a merge conflict. Though I haven't found information about these things yet, I'm sure I will in the near future.

(There's also the problem of doing a three-way merge on the text mentioned earlier, which should come out bolding one part and italicizing another. But this problem, in the general case, is significantly harder to solve because it requires knowledge of XHTML. Imagine how the desired behavior would differ if one of the tags were an anchor. It also makes the matching task more complex as text nodes can be expected to be matched when they are all together in one version, and split up in subnodes in another version. I'm not sure if anyone has researched this yet. Another difficult problem is to merge edits within a text node, which can be done with a standard LCS procedure and operational transformation but requires a fuzzy matching of text nodes.)

Maybe, continuing in the trend of increasing dates, the answer to these questions hasn't come yet. In that case, I might have to solve this problem myself (and publish the results) so that later people don't have to solve it again. I've already started meeting with a Carleton computer science professor to talk about this problem in general.

Academic computer science is amazing.

Update: I've been getting lots of comments suggesting certain tools for XML diffing or merging. I should have made it more clear that this is a solved problem, or rather, that all current XML diff/merge tools do roughly the same thing. What isn't a solved problem is a schema-sensitive diff for XHTML which doesn't treat text nodes as atomic, and no comment given so far has anything like that. Most of the research going on in XML diffing is about improving heuristics to give faster algorithms, but my main concern is diff quality: a diff over XHTML that's so good even an untrained person could understand it.

If you're trying to solve the first problem of a general XML diff or merge, you shouldn't be using a closed-source implementation, because these don't generally tell you what algorithm they're using. Three good open-source implementations are XyDiff (and its Java clone jXyDiff), which implements Cobéna's algorithm in C++; Logilab's xmldiff, which implements the FastMatch algorithm in Python; and the 3DM project, which implements Lindholm's algorithms. Unfortunately, none of these seems to have been updated since 2006.

Originally, my description of edit distances in the Zhang-Shasha algorithm was incorrect (I described Selkow's operations), but it's now fixed.

Wednesday, October 31, 2007

Post removed

The post that used to be here was wrong and pointless. I've removed it, but if you're really curious, it's just commented-out as HTML, so you could still read it.

Tuesday, June 12, 2007

No more high school ever!

Yesterday didn't feel like much of an eventful day until the end. As I was taking the bus home, I realized there was no more high school, ever, for the rest of my life. It's amazing. I just have one final (Calc III) and then I graduate. Now, when you deride my opinions based on my age, you can't say "He's just a little high schooler." You have to say "He's just a little college student." The downside is, if you ever felt like talking about how amazing it is that "a high schooler can do that," (if I ever did anything amazing, which I didn't) it just doesn't work the same with a college student, because college students are generally recognized as full citizens who are capable of doing cool things.

Anyway, I've been inactive for the past week or so because I've had final projects to do. But here's the latest stuff I've been coming up with. Remember all that qualified naming code I posted here before? Well, I have a third version, which is probably much better. It's based on Slava's original idea for qualified naming. Instead of having a word like math or a prefix like << to mark what vocab something is in, what if you prepend it onto the word name itself? That is, what if you used math:+ instead of math: + or << math + >>? The advantage of that, besides saving one or four characters, is that it is integrated into the language better. For example, you can do \ math:+ and all will work properly; the word + from the vocab math will be pushed on the stack. The other solutions don't do this right. So, here's the surprisingly short code needed to achieve this. Note that this code only works with the latest CVS because it uses the new module system, which treats vocabs differently.

USING: kernel sequences assocs parser vocabs namespaces ;

: define-qualified ( vocab-name -- )
    [ CHAR: : add ] keep vocab vocab-words
    [ >r append r> ] assoc-map-with use get push ;

: QUALIFIED:
    scan define-qualified ; parsing


So the things I'm working on now are (a) Unicode, which should be ready for .90, or if not, .91, and (b) improvements on libs/xml. You probably thought libs/xml was done after so much work. Well, Matthew Willis is writing a Jabber client in Factor. To do that, he has to parse XML without exhausting the stream. Now, because the XML library is stream-based, it can handle streams just fine. The only problem is that, with the DOM-style view (which is reached using the words read-xml or xml-chunk, among others), the stream is exhausted. So, in its current state, it is unusable for this purpose. These structures are strict, not lazy (a path I may explore in the future, along the lines of HaXML, which I still have to read about).

The solution is to extend the pull-xml view. When I implemented it, I never thought it would be that useful, but now it turns out to be very useful for these incremental situations. To extend the DOM view to pull-xml, all I have to do is write a word which will pull the next DOM-view element--a tag, string, or other similar thing. Except, if the next element is a closing tag, I have to return that. The implementation of this is almost done, and when it is, Yuuki should be able to make the Jabber client relatively easily, without having to reinvent the wheel of the tree structure that XML follows.

Tuesday, January 9, 2007

Imperative parsing with abstractions

Here's somewhat of a confession: after writing a non-trivial library centered around parsing, I still know nothing about parsing strings. My attempts to wrap my head around what exactly an LALR(1) parser is and any little basic aspect of the theory of parsing have been in vain. I still struggle with using simple parser combinators (like the ones doublec has created).

So for my XML library, I created my own system of parsing which may or may not be original or useful. All of the below examples can be run, in Factor, by requiring the module libs/xml and USEing state-parser. The source code can be found in the Factor distribution in libs/xml/state-parser.factor.

This way of parsing evolved very slowly. At first, it was just something like this:

SYMBOL: code
SYMBOL: spot

: get-char ( -- char )
    spot get code get nth ;

: next ( -- )
    spot inc ;

I used variables instead of the stack because a lot of the code that processes this doesn't care about the code or spot. Really, only those two words care. This is the basis for a bunch of abstractions that I can build on later, for example the combinator skip-until, the basic form of which is below:

: skip-until ( quot -- ) ! quot: ( -- ? )
    [ call ] keep swap [ drop ] [
        next skip-until
    ] if ; inline

This calls next until the quotation returns something other than f. It turned out to be really useful in parsing, allowing further devices to be built on top of it easily, like take-until, which does the same thing but records the characters skipped. From this, something like take-char is simple to write. This word advances the parser until the specified character is reached, and returns the string passed over:

: take-char ( char -- string )
    [ dup get-char = ] take-until nip ;


After building abstractions like this, I was able to build a parser for XML relatively painlessly. The parser itself is in libs/xml/tokenize.factor, and it does no string operations directly at all. I think it is a type of recursive descent parser, but I'm not sure.

Because of this abstraction, I was able to change the inner workings of next and get-char twice, the first time to add information about the line and column of any errors encountered, and the second time to operate on streams instead of strings. This didn't require any changes at all to tokenize.factor. I'm planning on changing it again, but still, only a few of these basic, underlying words will be affected, and higher-level code will not be changed.

I've heard people say that it's tough to write a parser by hand. This was probably simpler than it could have been because of the relative simplicity of XML's grammar. This isn't really a Factor-specific solution, just something in general: with significant abstraction, and the ability to write higher-order functions, I don't think anything is too difficult to do "by hand" in an adequate programming language.



------
PS. To those of you who commented on this blog and I never responded: Sorry; I didn't really expect anyone to read this blog so I never checked it for comments. To Sam, about the combinators: Your implementation might work, but it won't compile in Factor's current implementation. In designing the XML parser and associated utilities, speed was a concern, and that means letting as much as possible compile. This particular thing doesn't work because you are manipulating quotations at runtime. In general, the word curry is good for that, and should probably be used in this case.

To Ray Cromwell, I know Google didn't really drop its API, but it's not issuing any new keys for the SOAP API, so as far as I'm concerned, that's gone. As far as the new AJAX API, I was under the impression that that can only be accessed client-side, so for all applications that want to do something like remove the ads or analyze rankings, it doesn't work.

Tuesday, January 2, 2007

XML combinators

I really need to work on making my writing make sense, because I think my last post failed at that. The XML library that I've written has a lot of combinators defined for processing, and in this blog post, I'm going to try to describe them as well as I can.

The basic idea behind the XML combinators is to let XML be processed the same way Lispers (and Factorers) have always processed lists and other sequences: with map, each, find, subset, and now inject. These operations are slightly more complicated when used with a tree structure like XML, rather than a flat sequence as the operations were originally designed for. Fortunately, Factor's generic word system makes this relatively easy.

There's one consideration in designing these that I wasn't happy with: since I had to iterate over all nodes, not just the leaf nodes, the result of every combinator can be affected by the order of traversal. I made a relatively arbitrary decision and said that first the parent nodes are processed and then the leaves.

A note before the code of xml-each, the simplest of the combinators: this uses Factor's generic words, which dispatch only on the top of the stack. xml-each must dispatch on the class of the XML tag, not the quotation. However, the Factor convention for combinators is that the quotation is on the top of the stack, because this looks better syntactically. So I need to do a swap before entering the generic word. That said, here's the combinator:

GENERIC: (xml-each) ( quot tag -- ) inline

M: tag (xml-each)
    [ swap call ] 2keep
    tag-children [ (xml-each) ] each-with ;

M: object (xml-each)
    swap call ;

M: xml-doc (xml-each)
    delegate (xml-each) ;

: xml-each ( tag quot -- ) ! quot: tag --
    swap (xml-each) ; inline

The rest of the combinators are implemented in much the same way, with only trivial changes to make different ones. Unlike some combinators (see my previous post) this is very simple to implement, in no small part because I have existing combinators to help. (Unfortunately, it doesn't seem to be possible to abstract a general "lift" combinator (analogous to liftM in Haskell for monads) to turn each into xml-each and map into xml-map, despite the similarities between them.)

A really simple usage of xml-each is [ write-item ] xml-each, which prints out each element of the given XML tag, document, or other type of element.

As a more in-depth example, xml-find (which works just like find, but without returning an index) is used to implement the equivalent of .getElementById(), called get-id:

: get-id ( tag id -- elem )
    swap [
        dup tag? [
            "id" prop-name-tag
            [ string? ] subset concat
            over =
        ] [ drop f ] if
    ] xml-find nip ;

(Note: one weakness of the XML utilities that currently exist is that it's annoying to deal with XML tag attributes, though this should be fixed when I do more real work using them)