I've said bad things about my job working on Carleton College's website, but fundamentally it's a really sound work environment we have. Just before winter break, one of the full-time employees came to me and asked if I could make a diff between two XHTML documents for use in Carleton's CMS, Reason. This would be useful for (a) comparing versions of a document in the CMS (b) merging documents, in case two people edit the same document at the same time, so as to avoid locks and the need for manual merges. They came to me because I told them I'd written an XML parser.
I may know about XML, but I know (or rather knew) nothing about the algorithms required for such a diff and merge. I looked on Wikipedia, but there was no article about this kind of stuff. So for the past three weeks, I've been paid to read academic papers, mostly from Citeseer, my favorite website, about matching, diffing, and merging trees. Here's some of what I've found.
A naive algorithm
Here's something that won't work: a line-by-line diff and merge, along the lines of the Unix utilities
patch. You can split things up so that each tag is on a different line, as many have suggested, but it still won't work. Everyone who I've mentioned this problem to, including the boss who gave the assignment, thought of doing something like this, but it's unworkable.
Most obviously, this can easily break the tree structure. Here's one example of a failing three-way merge. (By a three-way merge, I mean a merge of two documents where you have the original document which the other two are derived from.
diff3 does a line-by-line three-way merge, which I'm using for this example.)
This XML is not well-formed! This is unacceptable. I'm not saying
diff3 is a bad tool, it's just not what we need.
Another problem is that a standard Unix diff only takes two things into account: insertions and deletions. In document diffing and merging, it'd be helpful if we supported another operation: moves. Say we take two paragraphs and swap them. In a model with only deletions and insertions, the paragraph would have to be deleted from one place and inserted in another. This can lead to sub-optimal diffs as presented to the user. It can also lead to bad merges: say in one branch, a paragraph was edited, and in another branch, that paragraph was moved to a different location. An optimal merge would put the edited paragraph in the new location, which requires tracking these moves directly.
Background to a better solution
XML can be viewed as ordered trees, where each node has an unbounded number of children and each internal node has a label. (In some situations, it acts more like an unordered tree, but in XHTML it's definitely ordered. Also, finding the optimal matching for unordered trees is known to be NP-hard, so we don't want to go there.) So we can solve these problems of diffing and merging XML by solving a more general problem on ordered trees. There are some specific aspects of XML which deserve mention (attributes, which are guaranteed to have unique names within a node; IDs, which are guaranteed to be unique within a document), but these are minor aspects which we can ignore for most of the time.
To avoid confusing, I'll define some terms I've been using or will soon start using. When I talk about "diffing", what I mean, formally, is generating an "edit script", or list of changes between two documents that can be used to get the modified document from the original. Sometimes, these edit scripts are invertible, but not always. When I talk about a "merge", I mean a way to reconcile the changes between documents to incorporate both of these changes. A merge can be an operation on edit scripts or it can be done directly on a tree matching. A "matching" is a set of correspondences between nodes in different trees; it is the basis for doing either a diff or a merge, and it's difficult to do efficiently.
(It might help to have a basic understanding of Levenshtein distance and the implied string diff algorithm. The longest common subsequence problem will also crop up here from time to time.)
The Zhang-Shasha algorithm and extensions
The Zhang-Shasha algorithm is the basic starting point when thinking about tree matching, diffing and merging. Except it isn't that basic. Dennis Shasha and Kaizhong Zhang created an algorithm to solve the approximate tree matching problem, which they described in the book Pattern Matching Algorithms. Their chapter on trees is available from Citeseer. Here's the basic idea: we want to see how similar two trees are, by a weighted edit distance metric. The edit script has three operations, similar to Levenshtein distance: add a node (optionally including a contiguous subsequence of the parent node), delete a node (putting children in the parent node), and relabel a node.
With this, they were able to come up with an algorithm of complexity (basically) O((n log n)^2), where n is the number of nodes in the tree. So this can get you a matching and an edit script, or diff between two trees. This isn't great, but it's much better than previous algorithms. The two also worked on the more difficult problem of supporting "Variable-length don't-cares", or the equivalent of * in Google/Unix file globbing, in tree matching, but we don't care about that here.
But this doesn't describe all of the changes that might take place in a tree structure that we might want to record. For example, in Zhang-Shasha proper, moving a node from one place to another is recorded as deleting the node from one place and inserting it into another place. Another issue is that inserting or deleting a subtree is recorded as inserting the node, then inserting each of its children, or deleting the leaf nodes recursively up until you delete their parent. This all leads to counterintuitive diffs, as far as human readability goes, as well as inflated edit distances.
So David Barnard, Gwen Clarke and Nicholas Duncan got together to create a modified algorithm that accommodated this modified definition of edit distance in this paper. It adds three additional operations: insertTree, deleteTree, and swap. Unfortunately, this doesn't account for copying nodes, or for moves that aren't within the same parent.
Some tree matching heuristics
So, it's not very good that the Zhang-Shasha algorithm is quadratic in most cases. In fact, in many cases, it's unacceptable. For example, in my case, where I might sometimes have to compare XHTML documents which are very long, it's unacceptable. But there are some algorithms which run in a time which is dominated by the number of nodes multiplied by the edit distance, or O(ne).
One algorithm called FastMatch, which has insert leaf, delete leaf, update and general move operations is presented by a bunch of people from Stanford in this paper. They work on getting an edit script and matching at the same time, but the algorithm starts by matching as much as possible, top-down, before proceeding to calculate the differences. This yields a complexity of O(ne+e^2). A related algorithm, described in Chapter 7 of Tancred Lindholm's master's thesis [PDF] incorporates tree insertion and deletion operations for a complexity of O(ne log n).
It's important to note that both of these will be O(n^2) in the complete worst case. A different XML matching algorithm was described by Grégory Cobéna in his master's thesis (and also in this paper, which is basically the relevant segment). Cobéna calls his algorithm BULD, which stands for bottom-up lazy-down. The key to the algorithm is that, for each node, there is a hash value and a weight, both calculated bottom-up. Exact equivalence between nodes can be approximated by equal hash values, and you search for the equal hash values of nodes that have been inserted on a maxheap by weight. In Cobéna's full thesis [PDF], he goes into more depth about his invertible edit script format. This algorithm doesn't necessarily generate the optimal diff, but in experiments it generates a very good one, and with a worst-case time of O(n log n).
Creating an edit script is all well and good, but it's only half of the problem: the merge. Remember that a three-document merge is one where we have the original document and two modified versions, and we want to create a fourth version with both modifications together. Here was my idea: create an edit script for both modified versions with respect to the original, then do one followed by another, with repeated modifications done only once. We know there's a conflict if the order matters, in terms of which comes first in applying to the original document.
But this will come up with more conflicts than actually exist. For example, say some node A has four children, B C D and E. In one change, we insert a new node X after B as a child of A, and in another change, we insert a node Y after D as a child of A. So a sensible merge would have A's new children be B X C D Y E, in that order. But with the model described above, there would be an edit conflict!
One solution to this is the more general strategy of operational transformation. The basic idea for this technique as applied here is that, if we insert Y after inserting X, we have to add 1 to the index that Y is being inserted. If, on the other hand, Y is inserted first, we don't have to add one to the index that X is inserted on. This technique leads to fewer conflicting merges, or in OT lingo, it converges in more cases. There are a few formal properties of an operational transformation that have only recently been proven correct in the best-known algorithms. Pascal Molli used operational transformation, together with Cobéna's diff algorithm and format, in his So6 synchronization framework, which he described in a deceptively short paper [PDF].
Tancred Lindholm went a different route altogether in creating a three-way merge, throwing out the edit script and basing it on a tree matching. His algorithm is described in a paper that I don't understand well enough to faithfully summarize here.
At this point, you might be thinking, "I read all that and he didn't even tell me how the algorithms work?!" I would tell you all about that, but there are two problems. First, that'd take thousands more words, and this post already exceeds 2000 words. This is just an introduction to the problem and the literature. But more importantly, I don't fully understand these algorithms. They're very complicated to follow and even harder to implement.
So, overall, it looks like this is a much harder problem than I initially thought. Each time I read a paper on this, I think, "Oh, so this is what I needed all along. And I wasted so much time looking at the other papers before it!" But there is always another more complicated layer. Before I get to each new layer, I think, "No one's ever done this before!" and then I suddenly find a paper about it, published increasingly recently.
The problem I'm looking at right now is the last step, the interface. In particular, I'm not sure how to explain a diff to non-technical users and ask questions about how to resolve a merge conflict. Though I haven't found information about these things yet, I'm sure I will in the near future.
(There's also the problem of doing a three-way merge on the text mentioned earlier, which should come out bolding one part and italicizing another. But this problem, in the general case, is significantly harder to solve because it requires knowledge of XHTML. Imagine how the desired behavior would differ if one of the tags were an anchor. It also makes the matching task more complex as text nodes can be expected to be matched when they are all together in one version, and split up in subnodes in another version. I'm not sure if anyone has researched this yet. Another difficult problem is to merge edits within a text node, which can be done with a standard LCS procedure and operational transformation but requires a fuzzy matching of text nodes.)
Maybe, continuing in the trend of increasing dates, the answer to these questions hasn't come yet. In that case, I might have to solve this problem myself (and publish the results) so that later people don't have to solve it again. I've already started meeting with a Carleton computer science professor to talk about this problem in general.
Academic computer science is amazing.
Update: I've been getting lots of comments suggesting certain tools for XML diffing or merging. I should have made it more clear that this is a solved problem, or rather, that all current XML diff/merge tools do roughly the same thing. What isn't a solved problem is a schema-sensitive diff for XHTML which doesn't treat text nodes as atomic, and no comment given so far has anything like that. Most of the research going on in XML diffing is about improving heuristics to give faster algorithms, but my main concern is diff quality: a diff over XHTML that's so good even an untrained person could understand it.
If you're trying to solve the first problem of a general XML diff or merge, you shouldn't be using a closed-source implementation, because these don't generally tell you what algorithm they're using. Three good open-source implementations are XyDiff (and its Java clone jXyDiff) implementing Cobena's algorithm in C++, Logilab's xmldiff implementing the FastMatch algorithm in Python, and the 3DM project implementing Lindholm's algorithms. Unfortunately, none of these seems to have been updated since 2006.
Originally, my description of edit distances in the Zhang-Shasha algorithm was incorrect (I described Selkow's operations), but it's now fixed.