Comments on Useless Factor: Matching, diffing and merging XML

I'm not sure whether it speaks to your use cas...

2011-05-04T17:40:34.140-07:00

I'm not sure whether it speaks to your use case, or even whether you still care, but I've gone down a similar path of meta-research recently, and found An optimal decomposition algorithm for tree edit distance by Demaine, Mozes, Rossman and Weimann (PDF).

Thanks for the article, it was a helpful overview ...

2010-03-10T15:01:35.470-08:00

Thanks for the article, it was a helpful overview for me of current XML diff/merge research.

I like your example where two users italicize and bold overlapping sections of "This is some text", and the desired merge result involves splitting the tag.
In other words, a good merge algorithm needs to be flexible enough to take a rule about the problem domain which says, "the tag can be split, so some text is equivalent to sometext"

Another recent algorithm which looks interesting is jndiff. I'm still reading about it, but it is designed around detecting a "natural" diff on XML documents representing literary content.

Here's an article: http://www.dis.uniroma1.it/~midlab/articoli/ICEIS09.pdf
LGPL source code is on sourceforge: http://sourceforge.net/projects/jndiff/

XML diff / merge is somehow use-case specific and ...

2009-12-28T10:59:10.091-08:00

XML diff / merge is somehow use-case specific and therefore the available tools are more or less suitable for your XML based use-case.

two commercial opportunities

- https://community.emc.com/docs/DOC-5042

- http://www.deltaxml.com/dxml/index.html

Markus, Thanks. I fixed it again.

2009-04-21T07:28:00.000-07:00

Markus,

Thanks. I fixed it again.

The link to the pdf seems to be broken again

2009-04-21T06:51:00.000-07:00

The link to the pdf seems to be broken again

Thorsten,Thanks for pointing that out. It's fixed ...

2008-07-26T12:26:00.000-07:00

Thorsten,

Thanks for pointing that out. It's fixed now.

Hi Daniel,could you check the link to your PDF? It...

2008-07-26T01:05:00.000-07:00

Hi Daniel,

could you check the link to your PDF? It seems to be broken.

Thanks a lot,
Thorsten

Thanks for the interesting review of Daisy Diff.Yo...

2008-05-28T14:52:00.000-07:00

Thanks for the interesting review of Daisy Diff.

Your analysis on diffing algorithms is very good. I had some of the same insights when researching for Daisy Diff. If you think you're able to improve on the current heuristic then feel free to dive into the code. If you have any specific questions (sorry for the Dutch:-) then I'd be very glad to answer them.

Quick link:
http://code.google.com/p/daisydiff/

Daniel, I can try to look up the source code somew...

2008-02-01T15:55:00.000-08:00

Daniel, I can try to look up the source code somewhere in my archives although I don't know if I have it.

However, I remember that the program was really trivial -- I wouldn't be surprised to rewrite it from scratch in an hour (supporting plain&simple XML without any special cases that you won't see today anyway).

Basically it was just a simple SAX parser sink that kept track of the current path in the XML document and printed lines such as I posted as an example in my first comment.

Diffing two versions of the same file with diff -uBb gave rather understandable results. I could also do simple queries into the XML using plain grep and mangle the files using standard Unix tools! For example, "All html elements containing code elements containing 'foo'" would be something like egrep '/html.+/code.+foo' xmlfile.txt.

How can an article this long on XML Matching/Diffi...

2008-01-30T13:04:00.000-08:00

How can an article this long on XML Matching/Diffing not mention canonical form?

http://www.w3.org/TR/xml-c14n

Altova's diffdog tool works well for 2 way diffs

2008-01-30T05:45:00.000-08:00

Altova's diffdog tool works well for 2 way diffs

By the way, Git git-blame tool, which is used to a...

2008-01-28T04:55:00.000-08:00

By the way, Git git-blame tool, which is used to annotate file with line-wise file history, can track code (contents) movement and code copying not only between files but also within a file. During the discussion on git mailing list which led to finally implementing this feature, Junio C Hamano (if I remember correctly) proposed [internal] mdiff / cdiff diff extensions which shows in some format contents movement. This was primarly meant to help understand contents movement detection in git-blame. You can try to search git mailing list archives for this discussion; this might, or might be not helpful in better implementation of code movement operator in XML-aware diff.

If I understand correctly one of solutions is to use directory tree merger for XML structure tree. Git tree merge uses (can use) file rename detection, and I think it can also detect code movement like splitting a file (or at least can be tweaked to do that). So if the documents have large enough contents (text inside tags) to formatting ratio (number of tags) it might be useful to try to adapt this to generate XML aware diffs, and XML aware 3-way merge (based on 3-way tree merge in Git).

Finally, if you ever want to use this research for XML-aware version control of XML doucments, I'd like to mention that Git uses recursive merge strategy to deal with some situations, like criss-cross merge, where ordinary 3-way merge with some common ancestor might be not enough. I think that recursive merge strategy can work with XML-aware 3-way file-level diff; other algorighms described at revctrl.org as far as I can see are much line-change based.

Jakub, darcs' patch theory looks very closely rela...

2008-01-27T14:33:00.000-08:00

Jakub, darcs' patch theory looks very closely related, though not identical to, operational transformation, so I'll look into it, though I suspect one won't be more useful than the other. On the subject of inverting a diff, a chapter of Cobena's thesis describes his invertible, normalized XyDiff format with many interesting mathematical properties.

Yason, that looks really interesting. Do you have the source code for that program?

Daniel, what I thought would help is that Darcs on...

2008-01-27T14:27:00.000-08:00

Daniel, what I thought would help is that Darcs on the conceptual level deals with patches (diffs), and the fact that they are ordinary textual diffs is just details. In Darcs revisions are specified as a sequence of patches (changes), and merge between two branches in Darcs is replacing parallel seqies of patches by sequential one, using operations on patches like reverting a patch, and commuting independent patches. There was even idea (I don't know if implemented) about other operations than simple patches, e.g. variable rename operation (refactoring).

See Wikibook Understanding Darcs, chapter(s) "Patch theory and conflictors".

But after reading about it a bit I don't think that writing conflictors and mergers for XML diffs would be much easier than writing XML-aware 3-way merge.

I don't know about merging but I once had to diff ...

2008-01-27T13:59:00.000-08:00

I don't know about merging but I once had to diff XML trees and wrote a Python program to convert XML files into leaf-node-per-line based encoding, which I then diffed with the standard Unix diff utilities.

It was years ago but it went along the lines:

html/head/title|This is a cdata node
html/body/h1[class="heading"]|Title
html/body/img[src="foobar"][border="1"]

with special notation for other esoteric SGML/XML constructs (comment declarations, white-space preserved text, dtd subsets, etc.)

I also wrote a decoder feature that regenerated the XML (sub)trees by parsing the above notation and serializing the data structure back into plain old XML. I used that to produce human-readable XML output for presenting the diffs.

Jarub, Darcs doesn't seem to use a very unusual di...

2008-01-27T13:33:00.000-08:00

Jarub, Darcs doesn't seem to use a very unusual diff/merge algorithm on individual files, and by the documentation I can find, it looks like a plain old port of diff and diff3 to Haskell.

There are two packages, LibreSource and Harmony/Unison, which use the same algorithm for merging XML as merging file system changes. Libresource uses operational transformation, with Cobena's hashing XyDiff algorithm, and Unison uses Harmony's tree merging algorithm (which unfortunately doesn't scale well to ordered trees, as it creates unnecessary conflicts which OT can resolve).

Daniel, I was mentioning Darcs and Git SCMs in two...

2008-01-27T12:28:00.000-08:00

Daniel, I was mentioning Darcs and Git SCMs in two different aspects.

Darcs uses its theory of patches to make a merge using diffs (or rather series of diffs) or other operations, and not, as far as I know, 3-way merge. The approach used by Darcs might be used with XML-aware diff you have chosen, without need for XML equivalent of diff3 / merge utility. But this might affect performance.

On the other hand Git, thanks to being snapshot based, and thanks to ability of using custom diff and file-level merge drivers, can be used as SCM for XML documents, or XML derivs like OpenDocument (ODF) files.

In short: Darcs as an inspiration, for merge algorithm, Git to use as SCMs.

I'm happy to be redditted and have so many thought...

2008-01-27T11:35:00.000-08:00

I'm happy to be redditted and have so many thoughtful comments on this post! I'll respond to them all now.

Baczek, I looked at difflib (stesch also pointed me to it, indirectly) but this only handles flat sequences, and it not aware of the tree structure of XML or anything else, as far as I can tell. As such, it is unsuitable for a three-way XML merge, or even accurate diffing as presented to the user.

Michaelw, congratulations, you've reinvented Selkow's tree diffing algorithm! (You can read about this in either of the Zhang-Shasha-related papers.) This is reasonably efficient on your problem domain, operating in something like quadratic time. (The FastMatch algorithm, I think, would work in a fairly straightforward way on s-exprs, especially since you only have node labels.) Unfortunately, you've encountered the same limitation he did, namely that you can only account for atoms in this way. Also, to construct a three-way merge out of this, you'd need something like operational transformation to make the changes converge a reasonable amount of the time (though with Lisp code this isn't as necessary, as you might want to highlight conflicts that OT would resolve).

NorthStar, I can't figure out what algorithm Microsoft's XmlDiff uses, but I'd be surprised if it wasn't one described in the above article. The diff it generates looks significantly more difficult to merge than other XML diff formats, since it would require a tree matching itself. There are other pieces of free software that generate more sensible diff formats, support merging and don't require .NET. I'd feel a little bit funny using XmlDiff not knowing exactly what algorithm it uses. I would be using one of the existing pieces of software, but they don't solve the problem outlined in the end of the post.

Jakub Narebski, I do plan on using a three-way merge in the style of diff3 as you suggest. I'm not sure if a version control system like git or darcs would be applicable in this situation, though, as these things will be going on in the context of a larger, already existing CMS.

Daniel Ehrenberg, I think you can either try to co...

2008-01-27T11:05:00.000-08:00

Daniel Ehrenberg, I think you can either try to convert diff3 / 3-way merge algorithm from line based to tree (and line) based instead of relying on using diffs to ancestor in some order, or try to use Darcs' theory of patches to make a merge if you have patches.

Shameless plugin: Git, distributed version control system, allows custon diffs and file level merges, so I guess it would be right tool to use to use as version control for XML documents.

NorthStar, simple Google search for xmldiff return...

2008-01-27T10:55:00.000-08:00

NorthStar, simple Google search for xmldiff returns many, many results; programs in Python, in Perl, in Java,... and there is Microsoft one. Have you used it in practice?

Microsoft has solved this problem long time ago. O...

2008-01-27T09:38:00.000-08:00

Microsoft has solved this problem long time ago. One of their .NET Utilities is called XmlDiff which can show the differences between 2 xml files, construct a diff file so when you merge xml file 1 and diff file, you get the original xml file 2.

diff-sexp diffs two s-expressions, via a very naiv...

2008-01-27T08:54:00.000-08:00

diff-sexp diffs two s-expressions, via a very naive Levenshtein-inspired algorithm that I cooked up, probably in less time that it would have taken me to do a literature review on the topic.

Amazingly enough, recently somebody told me that he used it for HTML diffs, and it did its job well enough, despite its less-than-ideal worst case-complexity and other short-comings. :)

Python has a diff lib and it can diff any objects ...

2008-01-27T05:46:00.000-08:00

Python has a diff lib and it can diff any objects you tell it to. Worth a look.

Stesch,Thanks for the link, but unless I misunders...

2008-01-23T13:05:00.000-08:00

Stesch,
Thanks for the link, but unless I misunderstand the code, that looks like a regular old line-by-line diff which is vaguely aware of HTML syntax, to the degree my naive algorithm at the start of the article is. It looks like it only processes insertions and deletions. Maybe this works for HTML, where you typically have a very loose parser, and in a situation where there are no moves and you don't need to record updates or changes to attributes. But for my purposes, personally, it wouldn't work very well.

There's a HTML diff on http://www.aaronsw.com/2002...

2008-01-22T23:33:00.000-08:00

There's a HTML diff on http://www.aaronsw.com/2002/diff (with source).

A Lisp version can be found here: http://www.cliki.net/CL-HTML-DIFF