Wednesday, February 6, 2008

XML and its alternatives

I started writing Factor's XML parser thinking that its purpose was to interface with legacy protocols which made the mistake of choosing XML, and that the people at the W3C were a bunch of idiots for pushing such a bad, unoriginal format on innocent programmers who would do better without it. At this point, though, I think it might not actually be that bad. Let's look at the alternatives for representing human-readable structured information for standardized protocols.

Flat text files

In the recent past, many protocols and file formats were defined as flat files, binary or human-readable, each requiring its own specialized parser. Many are still written this way. Does it make any sense to impose a tree structure on something as simple as blog syndication, documents or remote procedure calls? Or was it a wrong turn to put all of that in the same verbose, complicated syntax?

I think it was a good idea to specify these formats in terms of a common tree-based human-readable format. Maybe for some low-level network protocols, a flat text or binary file makes sense, but many other things work out well using a tree structure. For example, the Atom syndication format is a way to store things like blog feeds in XML. The structure is pretty simple: there's a bunch of metadata about the blog, and then there are a bunch of nodes corresponding to items, with roughly the same fields as the feed itself has. (I'm oversimplifying, here.) Atom uses a tree structure to store this, and the tree is in XML syntax. A tree structure makes sense, because there are a couple different sub-levels: there's the level of items, and then underneath that, the level of data about each item. These can be cleanly separated in a tree model.

Using a pre-existing XML parser, Atom is fairly easy to parse and generate. I wrote a simple library for Atom parsing and generation here in not much code.
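To make the "metadata, then a bunch of items" shape concrete, here is a minimal sketch in Python rather than Factor, using the standard library's ElementTree parser. The feed text, the example.org URLs and the parse_feed name are invented for illustration; this is not the library mentioned above, and it reads only a few of the fields Atom defines.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# A made-up, minimal feed used only for illustration.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <updated>2008-02-06T12:00:00Z</updated>
  <entry>
    <title>Second post</title>
    <link href="http://example.org/2"/>
    <updated>2008-02-06T12:00:00Z</updated>
  </entry>
  <entry>
    <title>First post</title>
    <link href="http://example.org/1"/>
    <updated>2008-02-01T09:30:00Z</updated>
  </entry>
</feed>"""


def parse_feed(text):
    """Return (feed_title, [(entry_title, href, updated), ...])."""
    root = ET.fromstring(text)
    feed_title = root.findtext("atom:title", namespaces=ATOM_NS)
    entries = []
    # Each <entry> is one level down the tree from the feed metadata.
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        href = link.get("href") if link is not None else None
        updated = entry.findtext("atom:updated", namespaces=ATOM_NS)
        entries.append((title, href, updated))
    return feed_title, entries
```

The two levels of the tree map directly onto the two levels of the result: one value for the feed, and one tuple per item.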

An additional benefit of a tree structure in a standard syntax is that standard tools can be used on it. On the most basic level, you can use a parsing library. But is this really necessary if the format is simple enough anyway? When there is a large amount of information, parsing inevitably becomes harder, and a consistent encoding of hierarchical structure makes this easier.

A new alternative to Atom is Zed Shaw's ZSFF, where information is stored in a simple flat-file format. (Update: Zed says this should be taken as a joke.) Originally, this only had basic information about the blog overall, and the URL of each post in chronological order. But when things were extended to show the article contents, Zed's solution was to have the flat file link to a special source format he uses to generate HTML. He didn't provide anything to get the date things are posted, which, in his case, can be deduced from the URL.

I don't mean to criticize Zed, but this new format will actually be more difficult for programmers to process than regular Atom feeds if they want a "river of news" aggregator. A ZSFF aggregator (as Planet Factor is an Atom aggregator) would have to parse the URLs to figure out the dates and get a correct ordering, and follow the URLs with a .page extension to get content. Those pages must also be parsed to get the title and the content in HTML form. Is it easier to write a ZSFF generator? Yes, but the format is much harder to read, and that must be taken into consideration just as much.
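As a rough idea of the extra work on the reading side, here is a sketch in Python of recovering post dates from URLs and sorting by them. The URL layout (a YYYY-MM-DD date somewhere in the path) is purely an assumption for illustration, as are the function names; the real ZSFF URLs may look different.

```python
import re
from datetime import date

# Hypothetical layout: the post date appears in the URL as YYYY-MM-DD.
DATE_IN_URL = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def date_from_url(url):
    """Return the date embedded in url, or None if there isn't one."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def newest_first(urls):
    """Order URLs by embedded date, newest first; undated URLs are dropped."""
    dated = [(date_from_url(u), u) for u in urls]
    dated = [(d, u) for d, u in dated if d is not None]
    return [u for d, u in sorted(dated, reverse=True)]
```

Every aggregator would need its own version of this guesswork, whereas Atom puts the date in a field of its own.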

S-expressions

Many smart Lispers have complained about XML. They claim s-expressions (sexprs), the basis for Lisp syntax, are better for most human-readable data serialization purposes, and they give a bunch of good reasons. Some of these are:
  • Sexprs are simpler to parse—just use read.
  • They're easier to process, since they're just nested lists.
  • Sexprs encode real data types in a direct way, not just strings but also integers and floats.
  • XML is unnecessarily complicated, including things like the distinction between attributes and children and unnecessary redundancy in closing tags.
  • Sexprs came first and they do everything XML does, so why should we use XML?

Explaining XML's utility in response to Lispers' criticisms is difficult, largely because at least half of the criticisms are correct: XML is completely isomorphic to sexprs but with a more complicated and verbose syntax. Still, it has a few advantages besides its legacy status. XML is very well-specified and is guaranteed not to differ between conforming implementations. The general idea of s-expressions is fairly universal, but it differs between implementations what characters can be included in a symbol, what characters and escapes can be used in strings, the precision of floats and the identity of symbols. Within a well-specified Lisp (I'm using Lisp in the general sense here), there's not much ambiguity, but in communicating between computers programmed with different parser implementations, XML is more robust. XML supports Unicode very consistently, with explicit mention of it in the standard. Sexprs can't be depended on for this.
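The correspondence between XML and nested lists is easy to see in code. This is a sketch in Python, with to_sexpr being a name I made up: each element becomes a list of its tag, its attributes and its children. It strips surrounding whitespace from text and ignores comments, processing instructions and namespaces, so it shows the idea rather than a lossless conversion.

```python
import xml.etree.ElementTree as ET


def to_sexpr(elem):
    """Convert an element to [tag, attributes, child, ...].

    Text nodes become plain strings; child elements become nested lists.
    """
    node = [elem.tag, dict(elem.attrib)]
    if elem.text and elem.text.strip():
        node.append(elem.text.strip())
    for child in elem:
        node.append(to_sexpr(child))
        # Text that follows a child belongs to the parent's list.
        if child.tail and child.tail.strip():
            node.append(child.tail.strip())
    return node
```

For example, `<item id="1"><title>Hi</title></item>` comes out as `["item", {"id": "1"}, ["title", {}, "Hi"]]`, which is more or less what a Lisper would have written in the first place.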

This might not be helpful in general, but one really great thing about XML is that you can embed other XML documents inside of it very cleanly. I never thought I'd use XML for anything but others' protocols until it turned out to be the right tool for a quick hackish job: a simple resource file to generate the Factor FAQ. Previously, I maintained the FAQ in HTML directly, but it became tedious to update the table of contents and the start values for ordered lists. So in a couple hours I came up with and implemented a simple XML schema to represent the questions and answers consisting of XHTML fragments. I could parse the HTML I was already using for the FAQ and convert it to this XML format, and convert it back.
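Here is roughly what that kind of resource file and generator can look like, sketched in Python instead of Factor. The element names (faq, qa, question, answer) and the output layout are hypothetical, not the actual format used for the Factor FAQ. The point is that the XHTML inside each answer is just more XML, so it passes through untouched while the numbering and table of contents are generated.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical resource file: answers hold XHTML fragments directly.
FAQ = """<faq>
  <qa>
    <question>What is Factor?</question>
    <answer><p>A <em>concatenative</em> language.</p></answer>
  </qa>
  <qa>
    <question>Is it fast?</question>
    <answer><p>It has an optimizing compiler.</p></answer>
  </qa>
</faq>"""


def render(faq_xml):
    """Build a table of contents plus numbered sections from the FAQ file."""
    root = ET.fromstring(faq_xml)
    toc, body = [], []
    for n, qa in enumerate(root.findall("qa"), start=1):
        question = escape(qa.findtext("question"))
        # Serialize the child elements of <answer> unchanged.
        # (Bare text directly inside <answer> is ignored in this sketch.)
        answer = "".join(
            ET.tostring(child, encoding="unicode") for child in qa.find("answer")
        )
        toc.append(f'<li><a href="#q{n}">{question}</a></li>')
        body.append(f'<h2 id="q{n}">{n}. {question}</h2>{answer}')
    return "<ol>" + "".join(toc) + "</ol>" + "".join(body)
```

Adding a question in the middle of the file renumbers everything after it automatically, which was the tedious part of maintaining the HTML by hand.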

Could I have used s-expressions, or an equivalent using the Factor reader? Sure, but it would have taken more effort to construct the HTML, and I'm lazy. One reason is really superficial: I would have had to backslash string literals. Another reason is that I had an efficient XML parser lying around. But one thing which came in handy which I didn't expect was that because of XML's redundancy, the parser caught my mistakes in missing closing tags and pointed them out at the location that they occurred. Anyway, these are all small things, but they added up to XML being an easier-to-hack-up solution than s-expressions or ad-hoc parsing.
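The error-catching is something any conforming XML parser gives you. As a sketch, using Python's ElementTree rather than Factor's parser (check_well_formed and the sample text are made up for illustration), a missing closing tag is rejected with a line and column:

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text parses, else the (line, column) of the error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None


# <question> is never closed, so </faq> does not match the open tag.
BROKEN = "<faq>\n  <question>What is Factor?\n</faq>"
GOOD = "<faq>\n  <question>What is Factor?</question>\n</faq>"
```

Here check_well_formed(GOOD) returns None, while check_well_formed(BROKEN) returns a position on the line holding the mismatched closing tag. With an unbalanced parenthesis in an s-expression, a reader usually can't tell you more than that it ran out of input.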

(JSON: I have nothing to say about JSON, but I don't think anyone's ever tried to use it for standardized protocols, just for AJAX. And I don't really know much about it. But I feel like I should mention it because it's out there, and it's definitely a reasonable option because it's strongly standardized. The same goes for YAML, basically. It's important to note, though, that JSON and YAML are in a way more complicated (though arguably more useful) because they include explicit data types.)
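To illustrate the point about data types, here is a small Python comparison with made-up documents. The JSON parser hands back numbers and booleans directly, while the XML parser hands back text and leaves the conversion to whoever reads it.

```python
import json
import xml.etree.ElementTree as ET

json_doc = json.loads(
    '{"title": "Post", "views": 42, "rating": 4.5, "draft": false}'
)
xml_doc = ET.fromstring("<post><title>Post</title><views>42</views></post>")

views_json = json_doc["views"]         # already an int
views_xml = xml_doc.findtext("views")  # a string; the reader must convert it
views_xml_as_int = int(views_xml)      # the conversion is the reader's job
```

Whether that is a feature or extra complexity depends on whether both ends agree on what a number is, which is the same kind of problem s-expressions have with floats.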

Conclusion

While XML isn't perfect, it is the right tool for some jobs, even some things it isn't currently used for. Of course, there are definitely cases where flat files or s-expressions are more appropriate; it would be stupid to reflexively use XML for everything where you want a human-readable data format. But format standards, while annoying, are great when someone else takes the time to implement them for you. This way, you don't have to worry about things like character encodings or parsing more complicated grammars as the format grows. The biggest benefits of XML are the tree structure and its standardization.

Update: For a more complete look at the alternatives to XML, check out this page which William Tanksley pointed out (unfortunately only on the Internet Archive).

12 comments:

Zed said...

Hey Dan, ZSFF is hardly an XML alternative, more a joke. But, the river of news thing is pretty easy to implement. See, when the feed is updated, you show the links you haven't seen yet. Done! Disk is cheap anyway.

Now, if you want to see my *real* XML alternative, check out Stackish (http://www.zedshaw.com/essays/stackish_xml_alternative.html)
which is used quite easily in Utu: http://savingtheinternetwithhate.com/design.html

Stackish is basically s-expressions inverted for stack ordering and then given the addition of a BLOB type for storing binary strings of arbitrary length. With it you can support any data structure you want, even ones that XML and s-expressions can't do easily.

Bernd Haug said...

Are you sure that XML should be considered a human-readable format?

I see it more as a format that's entirely for consumption by computers that has text as its primary representation. That's nice because it's easier to debug, but as XML it's not a "human readable" format in any usual sense.

I see it much more like HTTP on the protocol side; that it's textual is great as a debugging and testing tool, but it's not made for direct human consumption in production, ever.

Daniel Ehrenberg said...

Zed, oops, sorry if I misrepresented ZSFF. But for the river of news, you sometimes might want more than just links. You might want the content on the same page. Look at Planet Factor (planet.factorcode.org). How would you do that with ZSFF? Sometimes more metadata than a simple URL is useful, and that's the basic rationale for Atom.

Stackish looks really interesting, but at an extremely brief glance, some things appear to be unspecified, like character encoding. This would be a good thing to explicitly add in preparation for when there's more than one implementation. (You could just say, "everything's in UTF-8, except for binary blobs which are binary").

Bernd, you're right, XML shouldn't really be called, as a blanket statement, human-readable, since untrained humans can't read it. But unlike a binary format, trained humans can and do write it directly, and that's what I meant to indicate. Things differ by XML schema, of course; people aren't expected to read or write SOAP directly, but XHTML is designed to be written by humans directly. The ability to do this was an explicit design goal of XML, IIRC.

Mathieu said...

Sure XML is human-readable, open the file in Firefox or IE and you can read it very well, and see the structure.

Smelly said...

Also consider YAML

http://www.yaml.org

Anonymous said...

My biggest problem really is the human-readable part of XML.
I think it was a huge mistake to market XML as human readable. Yes, XML in essence, if you look at one, two, few tags, is quite readable.
But you always have to make the end quote, and this is NOT good for storing data. I used to maintain a huge XML file for my video files, then i switched to Yaml and saved about 40% of the size.
Another problem with XML is that, it tends to really become needlessly complex and complicated. I really dont want to worry that much about storing sub-tags onto sub-tags tagged with attributes ...

I have maintained large XML files but I came to the point that XML is way too ugly.

YAML is not the be-all-end-all solution but as far as I am concerned, it beats XML downright, gives me full data structs for use in ruby quickly and transparently, is much smaller than XML.
I use YAML as backend for Unix files in /etc too, the ruby files generate the configuration stored in the yaml file.
So these days I mostly use YAML (and ruby) to generate XML datasets requires.
This way I can use a format I like, and export the ugly XML crap if need be.

By the way, what I wrote about XML is partially valid about HTML too. I dont write html anymore, I write in a pseudo DSL that comes close to html but has no need for '<' '>' and also allows for a much more flexible solution on web-related issues.

Bernd Haug said...

Daniel,

In that sense of course I agree. I think there's a lot of wriggle room there because XML is just a meta-format. It's basically like discussing whether Perl "is readable" or not.

Mathieu,

No offense, but in that sense Word .doc is very human-readable as it can be read with special software that makes it palatable, more or less.

smelly,

YAML is yummy, but as an interchange format, most people seem to just use YAML parsers for reading the JSON subset of the syntax anyway. YMMV, hopefully.

zoobert said...

If there was only XML, I could live with it even if I still have a problem to understand the necessity of the attributes: as you said, the tree structure node-children is sufficient.

No my real problem comes with the XML Stack because when you adopt XML nowadays, you need to understand xsd,
xslt, xquery, relax-ng, .... etc.
This is incredibly complex and transform your human readable files in non-human, non-machine understandable files.
It also destroy the idea of being interoperable between platforms, or even between xml parsers.

Now, I don't even start to talk about the ugly object model to xml transformation such as in used in the XML ISO metadata standard (ISO 19139). In few words, XML is not meant to represent object models.

We should never forget about the first axiom in computer science: Let's keep it simple.

wtanksley said...

A lot of people (naturally) link to pault's "XML Alternatives" page, which he took down a couple of years ago (ouch). Check it out at the Internet Archive: http://web.archive.org/web/20060325012720/www.pault.com/xmlalternatives.html.

I'm going to read up on all of them later today. I do recommend UBF -- it seems much more appropriate for interactive protocols than XML is (have you SEEN how the Jabber XML protocol abuses XML???).

...anyhow.

wtanksley said...

I completely forgot to mention ASN.1; no discussion of alternatives to XML should be considered complete without mentioning that.

Anonymous said...

VTD-XML may also be interesting to you...
http://vtd-xml.sf.net

Anonymous said...

RFC 2822 makes the best structured data format.

Here's what you get:

1. Metadata header tags, standard
2. extensible mechanism for optional headers, domain-specific headers
3. delineated data body
4. super ease of parsing.

By far, XML is a giant leap backward in computing.

All of the other "fixes" and "workarounds", while better than XML, fall far short of RFC 2822, e.g., YAML, JSON.