Tuesday, January 22, 2008

Matching, diffing and merging XML

Update: A newer, more complete version is here.

I've said bad things about my job working on Carleton College's website, but fundamentally it's a really sound work environment we have. Just before winter break, one of the full-time employees came to me and asked if I could make a diff between two XHTML documents for use in Carleton's CMS, Reason. This would be useful for (a) comparing versions of a document in the CMS (b) merging documents, in case two people edit the same document at the same time, so as to avoid locks and the need for manual merges. They came to me because I told them I'd written an XML parser.

I may know about XML, but I know (or rather knew) nothing about the algorithms required for such a diff and merge. I looked on Wikipedia, but there was no article about this kind of stuff. So for the past three weeks, I've been paid to read academic papers, mostly from Citeseer, my favorite website, about matching, diffing, and merging trees. Here's some of what I've found.

A naive algorithm

Here's something that won't work: a line-by-line diff and merge, along the lines of the Unix utilities diff and patch. You can split things up so that each tag is on a different line, as many have suggested, but it still won't work. Everyone who I've mentioned this problem to, including the boss who gave the assignment, thought of doing something like this, but it's unworkable.

Most obviously, this can easily break the tree structure. Here's one example of a failing three-way merge. (By a three-way merge, I mean a merge of two documents where you have the original document which the other two are derived from. diff3 does a line-by-line three-way merge, which I'm using for this example.)

<p>
This is
some
text
</p>

<p>
<b>
This is
some
</b>
text
</p>

<p>
This is
<i>
some
text
</i>
</p>

<p>
<b>
This is
<i>
some
</b>
text
</i>
</p>
OriginalPart boldedPart italicizedLine-by-line merge

This XML is not well-formed! This is unacceptable. I'm not saying diff3 is a bad tool, it's just not what we need.

Another problem is that a standard Unix diff only takes two things into account: insertions and deletions. In document diffing and merging, it'd be helpful if we supported another operation: moves. Say we take two paragraphs and swap them. In a model with only deletions and insertions, the paragraph would have to be deleted from one place and inserted in another. This can lead to sub-optimal diffs as presented to the user. It can also lead to bad merges: say in one branch, a paragraph was edited, and in another branch, that paragraph was moved to a different location. An optimal merge would put the edited paragraph in the new location, which requires tracking these moves directly.

Background to a better solution

XML can be viewed as ordered trees, where each node has an unbounded number of children and each internal node has a label. (In some situations, it acts more like an unordered tree, but in XHTML it's definitely ordered. Also, finding the optimal matching for unordered trees is known to be NP-hard, so we don't want to go there.) So we can solve these problems of diffing and merging XML by solving a more general problem on ordered trees. There are some specific aspects of XML which deserve mention (attributes, which are guaranteed to have unique names within a node; IDs, which are guaranteed to be unique within a document), but these are minor aspects which we can ignore for most of the time.

To avoid confusing, I'll define some terms I've been using or will soon start using. When I talk about "diffing", what I mean, formally, is generating an "edit script", or list of changes between two documents that can be used to get the modified document from the original. Sometimes, these edit scripts are invertible, but not always. When I talk about a "merge", I mean a way to reconcile the changes between documents to incorporate both of these changes. A merge can be an operation on edit scripts or it can be done directly on a tree matching. A "matching" is a set of correspondences between nodes in different trees; it is the basis for doing either a diff or a merge, and it's difficult to do efficiently.

(It might help to have a basic understanding of Levenshtein distance and the implied string diff algorithm. The longest common subsequence problem will also crop up here from time to time.)

The Zhang-Shasha algorithm and extensions

The Zhang-Shasha algorithm is the basic starting point when thinking about tree matching, diffing and merging. Except it isn't that basic. Dennis Shasha and Kaizhong Zhang created an algorithm to solve the approximate tree matching problem, which they described in the book Pattern Matching Algorithms. Their chapter on trees is available from Citeseer. Here's the basic idea: we want to see how similar two trees are, by a weighted edit distance metric. The edit script has three operations, similar to Levenshtein distance: add a node (optionally including a contiguous subsequence of the parent node), delete a node (putting children in the parent node), and relabel a node.

With this, they were able to come up with an algorithm of complexity (basically) O((n log n)^2), where n is the number of nodes in the tree. So this can get you a matching and an edit script, or diff between two trees. This isn't great, but it's much better than previous algorithms. The two also worked on the more difficult problem of supporting "Variable-length don't-cares", or the equivalent of * in Google/Unix file globbing, in tree matching, but we don't care about that here.

But this doesn't describe all of the changes that might take place in a tree structure that we might want to record. For example, in Zhang-Shasha proper, moving a node from one place to another is recorded as deleting the node from one place and inserting it into another place. Another issue is that inserting or deleting a subtree is recorded as inserting the node, then inserting each of its children, or deleting the leaf nodes recursively up until you delete their parent. This all leads to counterintuitive diffs, as far as human readability goes, as well as inflated edit distances.

So David Barnard, Gwen Clarke and Nicholas Duncan got together to create a modified algorithm that accommodated this modified definition of edit distance in this paper. It adds three additional operations: insertTree, deleteTree, and swap. Unfortunately, this doesn't account for copying nodes, or for moves that aren't within the same parent.

Some tree matching heuristics

So, it's not very good that the Zhang-Shasha algorithm is quadratic in most cases. In fact, in many cases, it's unacceptable. For example, in my case, where I might sometimes have to compare XHTML documents which are very long, it's unacceptable. But there are some algorithms which run in a time which is dominated by the number of nodes multiplied by the edit distance, or O(ne).

One algorithm called FastMatch, which has insert leaf, delete leaf, update and general move operations is presented by a bunch of people from Stanford in this paper. They work on getting an edit script and matching at the same time, but the algorithm starts by matching as much as possible, top-down, before proceeding to calculate the differences. This yields a complexity of O(ne+e^2). A related algorithm, described in Chapter 7 of Tancred Lindholm's master's thesis [PDF] incorporates tree insertion and deletion operations for a complexity of O(ne log n).

It's important to note that both of these will be O(n^2) in the complete worst case. A different XML matching algorithm was described by Grégory Cobéna in his master's thesis (and also in this paper, which is basically the relevant segment). Cobéna calls his algorithm BULD, which stands for bottom-up lazy-down. The key to the algorithm is that, for each node, there is a hash value and a weight, both calculated bottom-up. Exact equivalence between nodes can be approximated by equal hash values, and you search for the equal hash values of nodes that have been inserted on a maxheap by weight. In Cobéna's full thesis [PDF], he goes into more depth about his invertible edit script format. This algorithm doesn't necessarily generate the optimal diff, but in experiments it generates a very good one, and with a worst-case time of O(n log n).

Three-document merge

Creating an edit script is all well and good, but it's only half of the problem: the merge. Remember that a three-document merge is one where we have the original document and two modified versions, and we want to create a fourth version with both modifications together. Here was my idea: create an edit script for both modified versions with respect to the original, then do one followed by another, with repeated modifications done only once. We know there's a conflict if the order matters, in terms of which comes first in applying to the original document.

But this will come up with more conflicts than actually exist. For example, say some node A has four children, B C D and E. In one change, we insert a new node X after B as a child of A, and in another change, we insert a node Y after D as a child of A. So a sensible merge would have A's new children be B X C D Y E, in that order. But with the model described above, there would be an edit conflict!

One solution to this is the more general strategy of operational transformation. The basic idea for this technique as applied here is that, if we insert Y after inserting X, we have to add 1 to the index that Y is being inserted. If, on the other hand, Y is inserted first, we don't have to add one to the index that X is inserted on. This technique leads to fewer conflicting merges, or in OT lingo, it converges in more cases. There are a few formal properties of an operational transformation that have only recently been proven correct in the best-known algorithms. Pascal Molli used operational transformation, together with Cobéna's diff algorithm and format, in his So6 synchronization framework, which he described in a deceptively short paper [PDF].

Tancred Lindholm went a different route altogether in creating a three-way merge, throwing out the edit script and basing it on a tree matching. His algorithm is described in a paper that I don't understand well enough to faithfully summarize here.

Conclusion

At this point, you might be thinking, "I read all that and he didn't even tell me how the algorithms work?!" I would tell you all about that, but there are two problems. First, that'd take thousands more words, and this post already exceeds 2000 words. This is just an introduction to the problem and the literature. But more importantly, I don't fully understand these algorithms. They're very complicated to follow and even harder to implement.

So, overall, it looks like this is a much harder problem than I initially thought. Each time I read a paper on this, I think, "Oh, so this is what I needed all along. And I wasted so much time looking at the other papers before it!" But there is always another more complicated layer. Before I get to each new layer, I think, "No one's ever done this before!" and then I suddenly find a paper about it, published increasingly recently.

The problem I'm looking at right now is the last step, the interface. In particular, I'm not sure how to explain a diff to non-technical users and ask questions about how to resolve a merge conflict. Though I haven't found information about these things yet, I'm sure I will in the near future.

(There's also the problem of doing a three-way merge on the text mentioned earlier, which should come out bolding one part and italicizing another. But this problem, in the general case, is significantly harder to solve because it requires knowledge of XHTML. Imagine how the desired behavior would differ if one of the tags were an anchor. It also makes the matching task more complex as text nodes can be expected to be matched when they are all together in one version, and split up in subnodes in another version. I'm not sure if anyone has researched this yet. Another difficult problem is to merge edits within a text node, which can be done with a standard LCS procedure and operational transformation but requires a fuzzy matching of text nodes.)

Maybe, continuing in the trend of increasing dates, the answer to these questions hasn't come yet. In that case, I might have to solve this problem myself (and publish the results) so that later people don't have to solve it again. I've already started meeting with a Carleton computer science professor to talk about this problem in general.

Academic computer science is amazing.

Update: I've been getting lots of comments suggesting certain tools for XML diffing or merging. I should have made it more clear that this is a solved problem, or rather, that all current XML diff/merge tools do roughly the same thing. What isn't a solved problem is a schema-sensitive diff for XHTML which doesn't treat text nodes as atomic, and no comment given so far has anything like that. Most of the research going on in XML diffing is about improving heuristics to give faster algorithms, but my main concern is diff quality: a diff over XHTML that's so good even an untrained person could understand it.

If you're trying to solve the first problem of a general XML diff or merge, you shouldn't be using a closed-source implementation, because these don't generally tell you what algorithm they're using. Three good open-source implementations are XyDiff (and its Java clone jXyDiff) implementing Cobena's algorithm in C++, Logilab's xmldiff implementing the FastMatch algorithm in Python, and the 3DM project implementing Lindholm's algorithms. Unfortunately, none of these seems to have been updated since 2006.

Originally, my description of edit distances in the Zhang-Shasha algorithm was incorrect (I described Selkow's operations), but it's now fixed.

27 comments:

jan said...

You may be interested in the work by Benjamin C. Pierce [1] and his coworkers.

Specifically the Harmony framework [2] seems relevant.

[1] http://www.cis.upenn.edu/~bcpierce/

[2] http://www.seas.upenn.edu/~harmony/

Daniel Ehrenberg said...

Thanks, Jan. I saw the Harmony framework before, and I should have mentioned it here, but it looks to me like it's unsuited for this task, merging XHTML fragments. It looks like it's more for synchronization and conversion between different hierarchical (and usually non-cyclic) file formats. The idea of reversible lenses is really interesting, related to (but much better developed and more useful than) my separate idea of inverting concatenative quotations. I hope that someday my inverse library can do most of the things Harmony can.

stesch said...

There's a HTML diff on http://www.aaronsw.com/2002/diff (with source).

A Lisp version can be found here: http://www.cliki.net/CL-HTML-DIFF

Daniel Ehrenberg said...

Stesch,
Thanks for the link, but unless I misunderstand the code, that looks like a regular old line-by-line diff which is vaguely aware of HTML syntax, to the degree my naive algorithm at the start of the article is. It looks like it only processes insertions and deletions. Maybe this works for HTML, where you typically have a very loose parser, and in a situation where there are no moves and you don't need to record updates or changes to attributes. But for my purposes, personally, it wouldn't work very well.

Baczek said...

Python has a diff lib and it can diff any objects you tell it to. Worth a look.

michaelw said...

diff-sexp diffs two s-expressions, via a very naive Levenshtein-inspired algorithm that I cooked up, probably in less time that it would have taken me to do a literature review on the topic.

Amazingly enough, recently somebody told me that he used it for HTML diffs, and it did its job well enough, despite its less-than-ideal worst case-complexity and other short-comings. :)

NorthStar said...

Microsoft has solved this problem long time ago. One of their .NET Utilities is called XmlDiff which can show the differences between 2 xml files, construct a diff file so when you merge xml file 1 and diff file, you get the original xml file 2.

Jakub Narebski said...

NorthStar, simple Google search for xmldiff returns many, many results; programs in Python, in Perl, in Java,... and there is Microsoft one. Have you used it in practice?

Jakub Narebski said...

Daniel Ehrenberg, I think you can either try to convert diff3 / 3-way merge algorithm from line based to tree (and line) based instead of relying on using diffs to ancestor in some order, or try to use Darcs' theory of patches to make a merge if you have patches.

Shameless plugin: Git, distributed version control system, allows custon diffs and file level merges, so I guess it would be right tool to use to use as version control for XML documents.

Daniel Ehrenberg said...

I'm happy to be redditted and have so many thoughtful comments on this post! I'll respond to them all now.

Baczek, I looked at difflib (stesch also pointed me to it, indirectly) but this only handles flat sequences, and it not aware of the tree structure of XML or anything else, as far as I can tell. As such, it is unsuitable for a three-way XML merge, or even accurate diffing as presented to the user.

Michaelw, congratulations, you've reinvented Selkow's tree diffing algorithm! (You can read about this in either of the Zhang-Shasha-related papers.) This is reasonably efficient on your problem domain, operating in something like quadratic time. (The FastMatch algorithm, I think, would work in a fairly straightforward way on s-exprs, especially since you only have node labels.) Unfortunately, you've encountered the same limitation he did, namely that you can only account for atoms in this way. Also, to construct a three-way merge out of this, you'd need something like operational transformation to make the changes converge a reasonable amount of the time (though with Lisp code this isn't as necessary, as you might want to highlight conflicts that OT would resolve).

NorthStar, I can't figure out what algorithm Microsoft's XmlDiff uses, but I'd be surprised if it wasn't one described in the above article. The diff it generates looks significantly more difficult to merge than other XML diff formats, since it would require a tree matching itself. There are other pieces of free software that generate more sensible diff formats, support merging and don't require .NET. I'd feel a little bit funny using XmlDiff not knowing exactly what algorithm it uses. I would be using one of the existing pieces of software, but they don't solve the problem outlined in the end of the post.

Jakub Narebski, I do plan on using a three-way merge in the style of diff3 as you suggest. I'm not sure if a version control system like git or darcs would be applicable in this situation, though, as these things will be going on in the context of a larger, already existing CMS.

Jakub Narebski said...

Daniel, I was mentioning Darcs and Git SCMs in two different aspects.

Darcs uses its theory of patches to make a merge using diffs (or rather series of diffs) or other operations, and not, as far as I know, 3-way merge. The approach used by Darcs might be used with XML-aware diff you have chosen, without need for XML equivalent of diff3 / merge utility. But this might affect performance.

On the other hand Git, thanks to being snapshot based, and thanks to ability of using custom diff and file-level merge drivers, can be used as SCM for XML documents, or XML derivs like OpenDocument (ODF) files.

In short: Darcs as an inspiration, for merge algorithm, Git to use as SCMs.

Daniel Ehrenberg said...

Jarub, Darcs doesn't seem to use a very unusual diff/merge algorithm on individual files, and by the documentation I can find, it looks like a plain old port of diff and diff3 to Haskell.

There are two packages, LibreSource and Harmony/Unison, which use the same algorithm for merging XML as merging file system changes. Libresource uses operational transformation, with Cobena's hashing XyDiff algorithm, and Unison uses Harmony's tree merging algorithm (which unfortunately doesn't scale well to ordered trees, as it creates unnecessary conflicts which OT can resolve).

yason said...

I don't know about merging but I once had to diff XML trees and wrote a Python program to convert XML files into leaf-node-per-line based encoding, which I then diffed with the standard Unix diff utilities.

It was years ago but it went along the lines:

html/head/title|This is a cdata node
html/body/h1[class="heading"]|Title
html/body/img[src="foobar"][border="1"]

with special notation for other esoteric SGML/XML constructs (comment declarations, white-space preserved text, dtd subsets, etc.)

I also wrote a decoder feature that regenerated the XML (sub)trees by parsing the above notation and serializing the data structure back into plain old XML. I used that to produce human-readable XML output for presenting the diffs.

Jakub Narebski said...

Daniel, what I thought would help is that Darcs on the conceptual level deals with patches (diffs), and the fact that they are ordinary textual diffs is just details. In Darcs revisions are specified as a sequence of patches (changes), and merge between two branches in Darcs is replacing parallel seqies of patches by sequential one, using operations on patches like reverting a patch, and commuting independent patches. There was even idea (I don't know if implemented) about other operations than simple patches, e.g. variable rename operation (refactoring).

See Wikibook Understanding Darcs, chapter(s) "Patch theory and conflictors".

But after reading about it a bit I don't think that writing conflictors and mergers for XML diffs would be much easier than writing XML-aware 3-way merge.

Daniel Ehrenberg said...

Jakub, darcs' patch theory looks very closely related, though not identical to, operational transformation, so I'll look into it, though I suspect one won't be more useful than the other. On the subject of inverting a diff, a chapter of Cobena's thesis describes his invertible, normalized XyDiff format with many interesting mathematical properties.

Yason, that looks really interesting. Do you have the source code for that program?

Jakub Narebski said...

By the way, Git git-blame tool, which is used to annotate file with line-wise file history, can track code (contents) movement and code copying not only between files but also within a file. During the discussion on git mailing list which led to finally implementing this feature, Junio C Hamano (if I remember correctly) proposed [internal] mdiff / cdiff diff extensions which shows in some format contents movement. This was primarly meant to help understand contents movement detection in git-blame. You can try to search git mailing list archives for this discussion; this might, or might be not helpful in better implementation of code movement operator in XML-aware diff.

If I understand correctly one of solutions is to use directory tree merger for XML structure tree. Git tree merge uses (can use) file rename detection, and I think it can also detect code movement like splitting a file (or at least can be tweaked to do that). So if the documents have large enough contents (text inside tags) to formatting ratio (number of tags) it might be useful to try to adapt this to generate XML aware diffs, and XML aware 3-way merge (based on 3-way tree merge in Git).

Finally, if you ever want to use this research for XML-aware version control of XML doucments, I'd like to mention that Git uses recursive merge strategy to deal with some situations, like criss-cross merge, where ordinary 3-way merge with some common ancestor might be not enough. I think that recursive merge strategy can work with XML-aware 3-way file-level diff; other algorighms described at revctrl.org as far as I can see are much line-change based.

Anonymous said...

Altova's diffdog tool works well for 2 way diffs

Anonymous said...

How can an article this long on XML Matching/Diffing not mention canonical form?

http://www.w3.org/TR/xml-c14n

yason said...

Daniel, I can try to look up the source code somewhere in my archives although I don't know if I have it.

However, I remember that the program was really trivial -- I wouldn't be surprised to rewrite it from scratch in an hour (supporting plain&simple XML without any special cases that you won't see today anyway).

Basically it was just a simple SAX parser sink that kept track of the current path in the XML document and printed lines such as I posted as an example in my first comment.

Diffing two versions of the same file with diff -uBb gave rather understandable results. I could also do simple queries into the XML using plain grep and mangle the files using standard Unix tools! For example, "All html elements containing code elements containing 'foo'" would be something like egrep '/html.+/code.+foo' xmlfile.txt.

Guy Van den Broeck said...

Thanks for the interesting review of Daisy Diff.

Your analysis on diffing algorithms is very good. I had some of the same insights when researching for Daisy Diff. If you think you're able to improve on the current heuristic then feel free to dive into the code. If you have any specific questions (sorry for the Dutch:-) then I'd be very glad to answer them.

Quick link:
http://code.google.com/p/daisydiff/

Thorsten said...

Hi Daniel,

could you check the link to your PDF? It seems to be broken.

Thanks a lot,
Thorsten

Daniel Ehrenberg said...

Thorsten,

Thanks for pointing that out. It's fixed now.

Markus Schlegel said...

The link to the pdf seems to be broken again

Daniel Ehrenberg said...

Markus,

Thanks. I fixed it again.

Alex said...

XML diff / merge is somehow use-case specific and therefore the available tools are more or less suitable for your XML based use-case.

two commercial opportunities

- https://community.emc.com/docs/DOC-5042

- http://www.deltaxml.com/dxml/index.html

ericw said...

Thanks for the article, it was a helpful overview for me of current XML diff/merge research.

I like your example where two users italicize and bold overlapping sections of "This is some text", and the desired merge result involves splitting the < i > tag.
In other words, a good merge algorithm needs to be flexible enough to take a rule about the problem domain which says, "the < i > tag can be split, so < i> some text < / i > is equivalent to < i >some< / i >< i >text< / i >"

Another recent algorithm which looks interesting is jndiff. I'm still reading about it, but it is designed around detecting a "natural" diff on XML documents representing literary content.

Here's an article: http://www.dis.uniroma1.it/~midlab/articoli/ICEIS09.pdf
LGPL source code is on sourceforge: http://sourceforge.net/projects/jndiff/

Alex said...

I'm not sure whether it speaks to your use case, or even whether you still care, but I've gone down a similar path of meta-research recently, and found An optimal decomposition algorithm for tree edit distance by Demaine, Mozes, Rossman and Weimann (PDF).