Flat text files
In the recent past, many protocols and file formats were specified as flat files, binary or human-readable, each requiring its own specialized parser. Many are still written this way. Does it make any sense to impose a tree structure on something as simple as blog syndication, documents or remote procedure calls? Or was it a wrong turn to put all of that in the same verbose, complicated syntax?
I think it was a good idea to specify these formats in terms of a common tree-based human-readable format. Maybe for some low-level network protocols a flat text or binary file makes sense, but many other things work out well with a tree structure. For example, the Atom syndication format is a way to store things like blog feeds in XML. The structure is pretty simple: there's some metadata about the feed as a whole, and then a series of nodes corresponding to items, each with roughly the same fields as the feed itself. (I'm oversimplifying here.) Atom stores this as a tree, and the tree is in XML syntax. A tree structure makes sense because there are a couple of different sub-levels: the level of items, and underneath that, the level of data about each item. These can be cleanly separated in a tree model.
Using a pre-existing XML parser, Atom is fairly easy to parse and generate; I wrote a simple library for Atom parsing and generation in not much code.
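To give a feel for how little work a pre-existing parser leaves you, here's a minimal sketch using Python's standard library (not my library; the trimmed feed below is illustrative, not a complete Atom document):

```python
# A minimal sketch of Atom parsing with a pre-existing XML parser.
import xml.etree.ElementTree as ET

FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <updated>2007-09-01T12:00:00Z</updated>
  </entry>
</feed>"""

NS = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(FEED)
print(root.find("atom:title", NS).text)        # feed-level metadata
for entry in root.findall("atom:entry", NS):   # one node per item
    print(entry.find("atom:title", NS).text,
          entry.find("atom:updated", NS).text)
```

The two sub-levels (feed metadata, then per-item data) fall out of the tree directly; no hand-written parsing is involved.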
An additional benefit of a tree structure in a standard syntax is that standard tools can be used on it. At the most basic level, you can use an existing parsing library. Is that really necessary if the format is simple enough? Maybe not at first, but as a format accumulates more information, ad-hoc parsing inevitably gets harder, and a consistent encoding of hierarchical structure keeps it manageable.
A new alternative to Atom is Zed Shaw's ZSFF, where information is stored in a simple flat-file format. (Update: Zed says this should be taken as a joke.) Originally, it held only basic information about the blog overall and the URL of each post in chronological order. But when the format was extended to include article contents, Zed's solution was to have the flat file link to a special source format he uses to generate HTML. He didn't provide anything to get the date a post was published, which, in his case, can be deduced from the URL.
I don't mean to criticize Zed, but this format is actually more difficult for programmers to process than regular Atom feeds if they want to build a "river of news" aggregator. A ZSFF aggregator (in the way that Planet Factor is an Atom aggregator) would have to parse the URLs to figure out the dates for a correct ordering, and follow the URLs with a .page extension to get content. Those pages, in turn, must be parsed to extract the title and content in HTML form. Is it easier to write a ZSFF generator? Yes, but the format is much harder to consume, and that has to be weighed just as heavily.
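To see what just the ordering step involves, here's a hedged Python sketch; the /YYYY/MM/DD/ URL layout and the example URLs are assumptions for illustration, not ZSFF's actual scheme:

```python
# Recovering post dates from URLs, as a ZSFF-style aggregator would
# have to. The date-in-path layout is an assumed convention.
import re
from datetime import date

DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

def post_date(url):
    m = DATE_IN_URL.search(url)
    if m is None:
        raise ValueError("no date in URL: " + url)
    return date(*map(int, m.groups()))

urls = ["http://example.com/2007/09/13/second.page",
        "http://example.com/2007/09/01/first.page"]
for url in sorted(urls, key=post_date, reverse=True):
    print(post_date(url), url)   # newest first, "river of news" order
```

With Atom, the date is just a field in the tree; here it has to be reverse-engineered from a URL convention, and the content still requires a second fetch and parse.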
Many smart Lispers have complained about XML. They claim s-expressions (sexprs), the basis for Lisp syntax, are better for most human-readable data serialization purposes, for a number of good reasons (a side-by-side example follows the list). Some of these are:
- Sexprs are simpler to parse: in Lisp, you can just call the built-in read function.
- They're easier to process, since they're just nested lists.
- Sexprs encode real data types in a direct way, not just strings but also integers and floats.
- XML is unnecessarily complicated, including things like the distinction between attributes and children and unnecessary redundancy in closing tags.
- Sexprs came first and they do everything XML does, so why should we use XML?
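To make the comparison concrete, here is the same record written both ways; the sexpr field layout is one plausible rendering, not a standard:

```
;; as an s-expression (one plausible rendering):
(entry
  (title "First post")
  (updated "2007-09-01T12:00:00Z"))

<!-- as XML: -->
<entry>
  <title>First post</title>
  <updated>2007-09-01T12:00:00Z</updated>
</entry>
```

The content is identical; only the notation differs, which is exactly the isomorphism discussed below.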
Responding to these criticisms is difficult, largely because at least half of them are correct: XML is completely isomorphic to sexprs, just with a more complicated and verbose syntax. Still, it has a few advantages besides its legacy status. XML is very well-specified and is guaranteed not to differ between conforming implementations. The general idea of s-expressions is fairly universal, but implementations differ on which characters can appear in a symbol, which characters and escapes can be used in strings, the precision of floats and the identity of symbols. Within a single well-specified Lisp (I'm using Lisp in the general sense here), there's not much ambiguity, but when communicating between computers running different parser implementations, XML is more robust. XML also supports Unicode very consistently, with explicit mention of it in the standard; sexprs can't be depended on for this.
This might not be helpful in general, but one really great thing about XML is that you can embed other XML documents inside of it very cleanly. I never thought I'd use XML for anything but others' protocols until it turned out to be the right tool for a quick hackish job: a simple resource file to generate the Factor FAQ. Previously, I maintained the FAQ in HTML directly, but it became tedious to update the table of contents and the start values for ordered lists. So in a couple of hours I designed and implemented a simple XML schema to represent the questions and answers as XHTML fragments. I could parse the HTML I was already using for the FAQ, convert it to this XML format, and convert it back.
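A hypothetical sketch of what such a schema could look like; the element names here are invented for illustration, not the actual format:

```xml
<!-- Invented element names, not the real FAQ schema. The point is
     that XHTML fragments embed directly inside the answer element. -->
<faq>
  <section name="General">
    <question>
      <q>What is Factor?</q>
      <a><p>A <em>concatenative</em> programming language.</p></a>
    </question>
  </section>
</faq>
```

From a structure like this, the table of contents and the list numbering can be regenerated mechanically instead of maintained by hand.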
Could I have used s-expressions, or an equivalent using the Factor reader? Sure, but it would have taken more effort to construct the HTML, and I'm lazy. One reason is really superficial: I would have had to backslash-escape quotes inside string literals. Another reason is that I had an efficient XML parser lying around. But one thing I didn't expect to come in handy was XML's redundancy: the parser caught my missing closing tags and pointed them out at the location where they occurred. Anyway, these are all small things, but they added up to XML being an easier-to-hack-up solution than s-expressions or ad-hoc parsing.
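To illustrate the error-reporting point, here is what a strict parser does with a missing closing tag (a Python sketch using the standard library, not the parser I actually used):

```python
# XML's mandatory closing tags let the parser pinpoint mistakes.
import xml.etree.ElementTree as ET

broken = "<faq>\n  <question>\n    <q>Missing closing tags\n</faq>"
try:
    ET.fromstring(broken)
except ET.ParseError as e:
    # Prints something like "mismatched tag: line 4, column 2",
    # pointing at the spot where the nesting went wrong.
    print(e)
```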
(JSON: I don't have much to say about JSON; I don't think anyone's tried to use it for standardized protocols yet, just for AJAX, and I don't know much about it. But I should mention it because it's out there, and it's definitely a reasonable option because it's strongly standardized. The same goes for YAML, basically. It's worth noting, though, that JSON and YAML are in a way more complicated (though arguably more useful) because they include explicit data types.)
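That last point about explicit data types is easy to see in miniature; a Python sketch with made-up field names:

```python
# JSON distinguishes types in the syntax itself, where XML leaves
# everything as strings for the application to interpret.
import json

record = json.loads('{"title": "First post", "comments": 3, "draft": false}')
print(type(record["comments"]), type(record["draft"]))  # int, bool
```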
While XML isn't perfect, it is the right tool for some jobs, even some it isn't currently used for. Of course, there are definitely cases where flat files or s-expressions are more appropriate; it would be stupid to reflexively use XML everywhere you want a human-readable data format. But format standards, while annoying, are great when someone else has taken the time to implement them for you: you don't have to worry about things like character encodings, or about parsing a grammar that gets more complicated as the format grows. The biggest benefits of XML are its tree structure and its standardization.
Update: For a more complete look at the alternatives to XML, check out this page which William Tanksley pointed out (unfortunately only on the Internet Archive).