Tuesday, February 26, 2008

A quick intro to garbage collection

I've been reading about garbage collection algorithms to potentially create a better one for Factor, so I thought I'd share the basics of what I've learned so far. Garbage collection, by the way, is the automatic reclamation of unreferenced heap-allocated memory. This means you can malloc without worrying about free, since the runtime system does it all for you.

Modern programming languages like Lisp, Java, Python, Prolog, Haskell and Factor all have some form of garbage collection built-in, whereas C, C++ and Forth require explicit freeing of unused memory.* Because all heap-allocated data is handled by the garbage collector (and stack allocation is rarely, if ever, used), garbage collector performance has an impact on all programs written in a language with GC.

For this reason, a lot of effort has been put into maximizing throughput (the proportion of time spent not doing garbage collection), and more recently, minimizing the length of individual pauses, possibly at the expense of some throughput. A garbage collector can also improve locality (for both cache and virtual memory paging purposes), reduce memory fragmentation, and provide fast allocation. When everything is considered, a well-constructed garbage collector can actually give a program greater throughput than manual memory management, and when allocation is efficient, programmers are free to use higher-level techniques. First, let's look at the simplest garbage collection algorithms.

Reference counting—the naive algorithm

Here's a pretty easy way to automatically collect unused memory. Put a counter on each object, signifying how many outside references there are to this object. When it gets to zero, you know there are no references to it, so it can be deleted. When another object starts referencing that object, it needs to increment the reference count by one. When it stops referencing it, the referrer needs to decrement the reference count.
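
Reference counting is normally implemented below the language, so here's a minimal sketch of the bookkeeping in C. The object layout and helper names (new_object, incref, decref, set_field) are made up for illustration, and each object gets just one pointer field to keep things short:

#include <stdlib.h>

typedef struct object {
    size_t refcount;
    struct object *field;   /* one pointer to another object, for brevity */
} object;

object *new_object(void) {
    object *obj = malloc(sizeof *obj);
    obj->refcount = 1;      /* the creating reference */
    obj->field = NULL;
    return obj;
}

void incref(object *obj) {
    if (obj) obj->refcount++;
}

void decref(object *obj) {
    if (obj && --obj->refcount == 0) {
        decref(obj->field); /* release whatever this object referenced */
        free(obj);          /* nothing references it anymore: reclaim */
    }
}

/* overwriting a pointer field adjusts both counts */
void set_field(object **field, object *value) {
    incref(value);          /* increment first, in case *field == value */
    decref(*field);
    *field = value;
}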

This adds just a little overhead: each time a pointer is copied or overwritten, the reference counts have to be adjusted. But this cost is distributed throughout the program, and it's not that great; no noticeable pause is caused. Another piece of overhead is that each reference-counted object has to have a counter which is as big as a pointer, since potentially, everything in memory points to a particular object. In practice, it doesn't need to be made this big for most applications, but in the case of overflow, you need some other mechanism to clean up the mess.

There's a bigger problem, though. Reference counting fails to collect cyclic data structures. Say I have a doubly linked list, or a tree with parent pointers. Then simple reference counting will fail to delete it when there are no outside references. Here's why: say we have three objects, A, B and C. A points to B. B points to C. C points to B. There are no other pointers to B and C. We'll assume A is a local variable, so its reference count is 1. B's reference count is two, since A and C point to it. C's reference count is 1, since B points to it. When A goes out of scope and stops pointing to B, B's reference count gets decremented. So B and C will each have a count of 1, and they'll never be deleted!
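
With the sketch above, the leak is easy to reproduce; here's just the two-object cycle between B and C:

object *b = new_object();   /* B: refcount 1, from the variable b */
object *c = new_object();   /* C: refcount 1, from the variable c */
set_field(&b->field, c);    /* B -> C: C's count becomes 2 */
set_field(&c->field, b);    /* C -> B: B's count becomes 2 */
decref(b);                  /* drop both local references... */
decref(c);
/* B and C each still have refcount 1 -- from each other --
   so neither is ever freed: the cycle leaks */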

There are a couple of different ways to deal with this. One is to run a cycle detector from time to time. This isn't so good because cycles might be long, and you don't want to go chasing pointers around unless you know where you're going; there's no cheap way to do it. Another way is to use a hybrid technique, where cycles are collected by an algorithm which doesn't share this weakness. The easiest way to do that is with partial mark-sweep collection.

Ensuring soundness with mark-sweep

Another way to go about this is to just allocate memory as normal, without worrying about freeing it, until we run out of memory. At that time, you can scan the heap and free everything that's not being referenced.

Say that each object has a mark bit, recording whether it's referenced. At the beginning of the heap scan, every mark bit is set to zero. Then, we do a depth-first search of the heap, starting at the things that we know have references to them, and at each object we visit, we set the mark bit to 1. After the marking process is done, we can scan through the heap again and add everything whose mark bit is still zero to the free list.
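
Here's a sketch of the two phases in C, assuming a hypothetical heap layout where each object knows how many pointer fields it has, and where the allocator provides an add_to_free_list:

#include <stddef.h>

#define MAX_FIELDS 8

typedef struct object {
    int marked;                        /* the mark bit */
    size_t num_fields;
    struct object *fields[MAX_FIELDS]; /* pointers to other objects */
} object;

void add_to_free_list(object *obj);    /* provided by the allocator */

/* mark phase: depth-first search starting from one root */
void mark(object *obj) {
    if (obj == NULL || obj->marked) return;
    obj->marked = 1;
    for (size_t i = 0; i < obj->num_fields; i++)
        mark(obj->fields[i]);          /* recursion: a hidden stack! */
}

/* sweep phase: walk every object in the heap */
void sweep(object **heap, size_t heap_size) {
    for (size_t i = 0; i < heap_size; i++) {
        if (heap[i]->marked)
            heap[i]->marked = 0;       /* clear for the next collection */
        else
            add_to_free_list(heap[i]); /* never reached: it's garbage */
    }
}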

One problem with this is that the stack space needed for the depth-first search is, potentially, as big as all of the memory. Even if you don't use the host programming language's stack and instead use a manually-maintained stack, things might not always work out. So the Deutsch-Schorr-Waite algorithm uses the space inside the pointers themselves to store what to do next: during the depth-first search, the pointer currently being followed is temporarily reversed to point back at its parent, and a little per-object state records how far the traversal has gotten, so the search can find its way back up without a stack. Pointer reversal is somewhat expensive, but it comes with a constant space guarantee, which is valuable.
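
Here's a sketch of the pointer-reversal traversal in C, replacing the recursive mark above. It assumes the object struct gains one extra slot recording how many of its children are finished (for two-field cons cells, this collapses to the single flag bit just mentioned):

/* assume the object struct above gains one more slot:
   size_t next_field;   -- how many of this object's children are done */

void dsw_mark(object *root) {
    object *prev = NULL, *cur = root;
    while (cur != NULL) {
        if (!cur->marked) {
            cur->marked = 1;
            cur->next_field = 0;
        }
        if (cur->next_field < cur->num_fields) {
            object *child = cur->fields[cur->next_field];
            if (child != NULL && !child->marked) {
                /* descend: reverse the pointer so it points at the parent */
                cur->fields[cur->next_field] = prev;
                prev = cur;
                cur = child;
            } else {
                cur->next_field++;     /* child is null or already marked */
            }
        } else {
            /* all children done: retreat, restoring the reversed pointer */
            object *parent = prev;
            if (parent == NULL) break; /* back at the root: finished */
            prev = parent->fields[parent->next_field];
            parent->fields[parent->next_field] = cur;
            parent->next_field++;
            cur = parent;
        }
    }
}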

The free list: not as efficient as it sounds

Any good C programmer will tell you: don't allocate memory in an inner loop. Why is this? It's because malloc'ing memory isn't a simple process, and in the worst case it can become as slow, asymptotically, as traversing the entire heap. Here's how it works: when the program starts, there's a big free area called the heap where memory is available. There's something called a free list, a circular singly linked list holding all of the free blocks of memory. Initially, it contains just this one big block. When you allocate memory, the free list is searched to find a block that's big enough to hold what you want. Then the free list is updated to record that the memory is taken: the block is split, with the remainder staying on the list, so this operation preserves the length of the free list.
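
Here's a sketch of that search in C, using first fit (more on that below) over a NULL-terminated rather than circular list, and ignoring alignment, to keep the code short:

#include <stddef.h>

typedef struct block {
    size_t size;            /* payload size in bytes */
    struct block *next;
} block;

block *free_list;           /* head of the free list */

void *allocate(size_t size) {
    block **prev = &free_list;
    for (block *b = free_list; b != NULL; prev = &b->next, b = b->next) {
        if (b->size >= size) {
            if (b->size >= size + sizeof(block)) {
                /* split: the tail of the block stays on the free list */
                block *rest = (block *)((char *)b + sizeof(block) + size);
                rest->size = b->size - size - sizeof(block);
                rest->next = b->next;
                *prev = rest;
                b->size = size;
            } else {
                *prev = b->next;    /* close fit: unlink the whole block */
            }
            return b + 1;           /* payload starts after the header */
        }
    }
    return NULL;                    /* nothing big enough */
}

/* freeing just pushes the block back on the list; this sketch doesn't
   coalesce adjacent neighbors, so the list grows and the heap fragments */
void deallocate(void *ptr) {
    block *b = (block *)ptr - 1;    /* recover the header */
    b->next = free_list;
    free_list = b;
}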

Freeing memory is where the trouble starts. When you free memory, the old location is added to the free list. If memory is allocated in the same order that it's freed, this is no problem: adjacent free blocks of memory in the free list are joined, so there will only ever be two things in the free list. But this generally doesn't happen. Generally, as the program runs, things get freed in different orders, so the free list grows longer and longer, with pieces in it getting smaller and less usable. When the free list grows, it takes longer to allocate more memory. This is called heap fragmentation, and it's one reason why memory allocation is avoided in C.

There are ways around it. The simplest way to allocate memory is to look for the first place in the free list where the requested block size will fit, but you can also look for the best fit. Another strategy is to partition the heap into areas for different sizes of allocated memory. But, fundamentally, heap fragmentation is unavoidable because the allocator isn't free to move things around. Once an object is allocated, it's supposed to stay in the same place forever. Reference-counting and mark-sweep collectors retain this limitation. More advanced compacting collectors, like copying collectors and mark-compact collectors, don't have it, making allocation cheaper.

Copying collection with Cheney's algorithm

Instead of dealing with a free list, we could just increment a pointer for allocation, and when memory fills up, move all the data that's referenced someplace else. This is the basic idea behind copying collection. We have two spaces, Fromspace and Tospace. All of the memory is allocated in Fromspace by incrementing a pointer, since everything above that pointer is free. When it fills up, copy each thing in Fromspace to Tospace. We can preserve the structure of the pointers by making sure that if something gets moved to Tospace, its old location in Fromspace is overwritten with a special value (a forwarding pointer) that indicates where it's already been copied to. When we're all done, swap the roles of Fromspace and Tospace.

This has the same basic problem as mark-sweep collection: we can't just do a depth-first search of the heap. The easiest way to solve this is by instead using breadth-first search. No, we don't want an external queue to store this stuff in; we can use Tospace itself.

Here's how it works with Cheney's algorithm. First, copy the things we know are referenced (these are called the roots, by the way) into Tospace. Maintain two pointers into Tospace: one called "scan", initially at the beginning, and another called "free", initially just after the copied roots. "Scan" represents where we're looking for new things to copy, and "free" represents the start of unallocated memory in Tospace. Look at the object at "scan"; for each pointer it contains, copy the referent to "free" (unless it's already been copied), update the pointer, and advance "free" by the size of what was copied. Then advance "scan" past the object and repeat. Continue until "scan" catches up with "free", that is, until nothing that's referenced remains uncopied. To me, this is a very elegant instance of breadth-first search, where a queue emerges naturally between the "scan" and "free" pointers.
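
Here's a sketch of the whole collector in C, under some simplifying assumptions: every field of an object is a pointer, each object carries its size, and the first header word is reserved for the forwarding pointer (NULL until the object has been copied):

#include <stddef.h>
#include <string.h>

typedef struct object {
    struct object *forward;   /* the "special value": where it was copied */
    size_t size;              /* total size in bytes, header included */
    size_t num_fields;
    struct object *fields[];  /* assume all payload slots are pointers */
} object;

static char *tospace, *free_ptr;  /* Tospace is set up elsewhere */

/* copy one object to Tospace, or return where it already went */
static object *forward(object *obj) {
    if (obj == NULL) return NULL;
    if (obj->forward) return obj->forward;  /* already copied */
    object *copy = (object *)free_ptr;
    memcpy(copy, obj, obj->size);
    copy->forward = NULL;
    free_ptr += obj->size;
    obj->forward = copy;          /* leave a forwarding pointer behind */
    return copy;
}

void collect(object **roots, size_t num_roots) {
    char *scan = free_ptr = tospace;
    for (size_t i = 0; i < num_roots; i++)
        roots[i] = forward(roots[i]);   /* copy the roots first */
    while (scan < free_ptr) {           /* the queue between the pointers */
        object *obj = (object *)scan;
        for (size_t i = 0; i < obj->num_fields; i++)
            obj->fields[i] = forward(obj->fields[i]);
        scan += obj->size;
    }
    /* all done: swap the roles of Fromspace and Tospace (not shown) */
}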

Sliding mark-compact using less space

Copying is pretty good because everything is compacted, and allocation is as simple as incrementing a pointer. But there's a cost: it takes twice as much space as the earlier algorithms because half of the space just isn't used at any given time. Another technique is to use something like mark-sweep (with the same marking phase), but instead of putting things on a free list, to move memory back towards the beginning. This is called mark-(sweep-)compact collection.

One goal in compaction is to preserve locality. The breadth-first search of copying collection doesn't do this well for most applications, but mark-compact collection can. The key to modern techniques for this, called "table-based", is to store information about free space that's been discovered in tables which are located in the free space itself. Once this information has been compiled, new locations for each piece of data can be calculated, pointers updated, and finally objects moved. The details are a little obscure, though.

Further ideas

This is only the beginning. There's a big problem with the last two methods discussed: they can cause long pauses, since they require a scan of the whole heap or of all the data that's in use. To address this, there are a number of possible improvements. The most basic one is generational garbage collection, and there are also incremental and concurrent methods. These get a lot more complicated, though, and I can't discuss them in this blog post. But before it's possible to understand those, you need the basics.



* Caveat: Linear Lisp (a Lisp dialect designed by Henry Baker) and JHC (a Haskell implementation) don't use garbage collection, instead using only stack allocation (in Linear Lisp) or region inference (in JHC). Mercury, a Prolog derivative, also uses region inference. But this is for another blog post which I'm not qualified to write. Also, you can define a garbage collector for C, C++ or Forth, but it has to treat each word of memory with the precaution that it might be a pointer, even if it isn't (this is called conservative collection).

If any of this was hard to understand (especially for lack of diagrams), I strongly recommend reading Garbage Collection: Algorithms for Automatic Dynamic Memory Management by Richard Jones and Rafael Lins.

Update: Minor changes in wording in response to comments.

Sunday, February 17, 2008

Designing an API for encoded streams

When I started looking at Unicode to design a good library for Factor, I wanted to make an API such that the programmer never needed to think about Unicode at all. I now see that that's impossible, for a number of reasons. One thing that the programmer needs to explicitly think about is the encoding of files. For this reason, I'm in the middle of changing lots of words which deal with streams to take an extra mandatory parameter specifying the encoding. The encodings supported so far are binary, ascii, latin1, utf8, utf16 and some more. In the library, we'll eventually put Shift JIS, more 8-bit encodings like MacRoman, Windows 1252 and the other ISO-8859s, UTF-32, etc. Internally, all strings are already in Unicode; this is only for external communication.

Mandatory?!

Some people objected to this. Why should there be a new mandatory parameter when everything worked already? This makes code longer, rather than shorter! The rationale is that this expands the functionality of streams. With the old functionality, everything is treated as if it's encoded in Latin 1. But in 99% of cases, this is just wrong. When a text file isn't in plain old ASCII, it's almost always in UTF-8 (though occasionally it's in UTF-16 or Shift JIS). Things are rarely in an 8-bit encoding because of its ambiguity; 8-bit non-ASCII encodings can safely be labeled "legacy" except in specialized low-resource applications. Right now, Factor acts as if Latin 1 were the encoding for all streams, but if we want to do much actual text processing, things will come out wrong.

Even if UTF-8 is used most of the time, could we use a heuristic to determine what encoding things are in? If we know the file is in either ASCII, UTF-8, UTF-16 or UTF-32, it's not too hard to come up with some kind of heuristic that works in almost all cases. But once things get generalized to Shift JIS and 8-bit encodings, it's basically impossible to determine how things are encoded. And it's completely impossible if there are binary streams, or for output streams altogether.

So let's make UTF-8 the default encoding. Any stream which doesn't want to use UTF-8 should have its instantiation followed by some set-encoding word. But what about binary streams? These aren't uncommon and are needed for things like audio, video, compressed data and Microsoft Word documents. If UTF-8 is the default encoding, it'd be easy to open a file for reading or writing, forgetting that it's in UTF-8, and then writing stuff to it as if it's a binary stream. But if we make the encoding a mandatory explicit parameter, then nobody will forget: if you want to open a stream reading it as UTF-8, you can do utf8 <file-reader>, and if you want to open it as binary, you can do binary <file-reader>. Writing utf8 or binary isn't just boilerplate: it actually indicates some information about how things should work. And for those situations where you want some other encoding, that can be specified just as easily.

Now, do we really want to prefix each stream constructor with an encoding, or could it be determined from the context somehow? There are two ways to scope this, lexical and dynamic, and they both fail. With dynamic scoping, composability is broken: if one piece of code makes some assumption about the encoding—say, that the encoding is UTF-8, which could be the default global encoding—and then the caller sets it to something else, the code silently breaks. So the encoding would have to be set lexically. But when I looked at actual code samples, I saw it'd be more trouble than it's worth to have a lexically scoped encoding: nearly all words which open streams need only one or two encodings. You're at most writing the same encoding twice, and that's still fewer words than a whole scope declaration (which needs, at a minimum, brackets, the encoding name and something declaring that this is a scope for encoding purposes). What about vocab-level scoping? It could work, but it'd have to be overridden in too many cases to be useful, since it's not unrealistic to have a vocab which uses UTF-8 for half of its streams and binary for the other half.

One other thing that's useful, and not particularly common in these sorts of libraries, is that the encoding can be changed after the stream is initialized. This is useful for things like XML, where the encoding can be declared in the prolog, and for certain network protocols like HTTP and SMTP, which allow the encoding to be specified in the header, so the encoding needs to change on the fly. I can only assume that previous implementations of this took everything as binary and used string processing routines to get things in and out of the right encoding.

You might think of this as a standard library cruelly forcing everyone to specify every little detail, but I think of it a little differently: the file I/O API encourages programmers to think about the encodings of their files. We could still go the other way and use UTF-8 as the default, but it'd create some strange, hard-to-trace bugs. Any default is bad. All the other stream APIs I've looked at make the encoding optional, but no matter which way you go, an optional encoding leads programmers into misleading assumptions.

Specifics

<file-reader>, <file-writer>, <file-appender>, <client> and <server> will now take an extra argument of an encoding descriptor, making them have the stack effect ( path/addrspec encoding -- stream ). file-contents and file-lines also take an encoding from the top of the stack. process-stream's encodings are in the descriptor, as a possible value for stdin, stdout or stderr, indicating that those values will be sent to/from Factor as a stream of the given encoding. If you're dealing with files, the process you call should handle all encoding issues. Some streams, like HTML streams and pane streams, don't need changes, since their encoding is unambiguous. You also don't need to specify the encodings of file and process names, since those are OS-specific and handled by the Factor library.

In addition to <string-reader>s and <string-writer>s that already exist and remain unchanged (they don't need an encoding since everything that goes on there is in Factor's internal Unicode encoding), there are now also <byte-reader>s and <byte-writer>s which do have an encoding as a parameter. Byte readers and writers work on an underlying byte vector, and provide the same encodable interface that files do, because an array of bytes, unlike a string, can take multiple interpretations as to the code points it contains.

I renamed with-file-out to with-file-writer, with-file-in to with-file-reader, string-in to with-string-reader and string-out to with-string-writer for consistency. Additionally, there are now also words with-byte-reader and with-byte-writer. Since byte and file readers and writers need an encoding, in these combinators I've put the encoding before the quotation. It could be the other way around; really, this was an arbitrary choice. Conceptually, the file name or byte array and the encoding form a sort of unit, so they're consistently adjacent in the words which use them.

I've made all the updates to everyone's software in my local branch, so you don't have to worry about implementing these changes. You might want to go back and look at your code to make sure the encoding I chose was sane. 90% of the time it's binary or UTF-8, occasionally ASCII. It's usually clear-cut. Also, I never had to make more than 3 or 4 updates in a single file.

It'd be nice if things were simpler, and nobody had to consider encodings at all except for Unicode library writers. Theoretically, this could be solved by a standard way to denote, inside the file, what encoding the rest of the file is in. But if we did that, then multiple competing conventions for declaring the encoding might emerge, and we'd have to explicitly choose among them! It'd be even better if the filesystem had metadata on this, but it doesn't. Maybe, on the Factor end, there's a place for having an abstraction over the locations of resources grouped with a description of their type (either encoding or filetype). But either way, encodings just aren't simple enough to allow programmers not to think about them.

Update: Added more info about specifics. It's been taking me a little longer than I initially thought to get this whole thing working with Factor, so this stuff still isn't in the main branch, though you can see the progress in the unicode branch of my repository. Bootstrapping will take a little work, though.

Update 2: The changes have been integrated into Factor! Thanks, Slava, for making it all work.

Monday, February 11, 2008

Entscheidungsproblem gelöst! New answer and FAQ

Alan Turing wasn't quite right: anything that can be executed in the physical universe runs in constant time, and it's simple to design a constant-time mechanism which tests whether an algorithm will halt (assuming the algorithm takes less than half of the universe to run).

Q: How can you make such a bold, stupid claim?
A: Because Turing assumed an infinite space to solve problems in. In the real world, there is a finite number of bits we can work with. Let's call this n. If a problem can be solved in the physical universe, it must be solvable using n or fewer bits of memory. There are 2^n possible states the bits can be in, so after 2^n memory writes, either an answer must be produced or the bits have revisited a state they've been in before, which means an infinite loop. Since n is a constant, 2^n is a constant, so we know everything runs in worst-case O(1) time. For a more practical simulation, take n to be the number of bits in addressable memory, including virtual memory.

Q: How can you check if an algorithm will halt or not in constant time?
A: Divide the available memory in two. Half will be used for running the algorithm, and half will be a counter. Increment the counter on each memory write in the algorithm's memory space. If this counter overflows before the algorithm halts, we know there is an infinite loop. (Note: this can only handle algorithms that use at most half of the available memory, which is a smaller class of programs.)
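
Here's the idea as a toy in C: represent a "program" as a transition function over a fixed block of state bits, and bound the number of steps. Everything here is made up for illustration, and the counter is assumed to fit in 64 bits (state_bytes < 8):

#include <stdint.h>
#include <stddef.h>

/* one transition over the state bits; returns 1 once there's an answer */
typedef int (*step_fn)(uint8_t *state, size_t state_bytes);

/* returns 1 if the program halts, 0 if it must loop forever */
int halts(step_fn step, uint8_t *state, size_t state_bytes) {
    uint64_t limit = UINT64_C(1) << (8 * state_bytes);
    for (uint64_t count = 0; count < limit; count++)
        if (step(state, state_bytes))
            return 1;   /* halted with an answer */
    /* with n state bits there are only 2^n distinct states, so a run of
       2^n steps must have revisited one: it's in an infinite loop */
    return 0;
}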

Remember the big thing that Turing proved: we can't have a program which tests whether other programs halt, because if we did have that program, we could run it on a program which halts if the test says it doesn't halt, and loops if the test says it halts, giving a contradiction either way. This implicitly rests on the idea that program memory is unbounded. Running that second program using the halting-test method described above would require an unbounded amount of memory, because it would result in an unbounded recursion. We can just say that the program crashes, since it runs out of memory. In this world, not everything that's computable can be computed, but everything halts (or repeats a previous memory state exactly, which can be caught, as explained above).

Q: What do you mean by unbounded? And halt?
A: By "unbounded" I mean countable (meaning isomorphic as a set to the naturals). Basically, there is no limit, but no element itself is infinite. By "halt" I mean that the algorithm will stop processing and come to an answer. In this theoretical world of algorithms, note that there's no addtional external input once the algorithm begins.

Q: But the system you describe isn't Turing-complete!
A: Yes. The finite universe isn't strictly Turing-complete, either, and I think this is interesting to note. For example, there's no way we could model another universe larger than our own, down to quantum states, from this universe.

Q: So my computer science professor was lying to me all along! Computational complexity is useless. Why didn't they tell you this?
A: Not quite. Since this constant is so large (even when we shrink it down to 2^(number of bits of available memory)), we can actually get a tighter upper bound than O(1) if we describe the same algorithm as, say, O(m) in the input size. This is why you sometimes have to be careful what your constants are in analysis; sometimes something which looks asymptotically faster is actually slower in not just small but all cases. Anyway, your professor probably didn't tell you this because it's both obvious and vacuous.

Q: So, if you were just tricking me, does this idea mean anything?
A: Actually, it comes up fairly often. Say you're doing arithmetic on integers. Each operation takes constant time, right? Well, it does if you're using machine integers. That's because machine integers are of a certain fixed maximum size. The reason we can say it takes constant time is the same reason that the universe can calculate everything in constant time! In fact, even if we use bignums, if we say "this is only going on in numbers that can be represented in 256 bits," it still makes some sense to say that things go on in constant time. It's only when things are unbounded that it makes total sense. And sometimes you have to look at very large data points before an asymptotic comparison settles anything: n^(log₄ 3) is asymptotically smaller than n log₂ n, for example, but with constant factors attached, the crossover can be very far out. If the size of the input is bounded at 10 billion, there's no clear winner.

Think about sorting algorithms. If we have k possible items in an array of length n, we can sort the array in O(n + k) time and O(k) space using counting sort. Say we're sorting an array of machine integers, and we have no idea what the distribution is. Great! Just make an array which has one counter for each possible machine integer. Now iterate through the list and increment the array at the integer's index when you encounter that integer. What's the problem? Oh yeah, we just used up more than all of the addressable memory. So just because we can, theoretically, construct something that will solve our problem in linear (or, as in the original example, constant) time doesn't mean it'll work.
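
Here's the sketch in C, for values known to fall in a small range [0, k). The punchline above is what happens when k is 2^32: the counts array alone, at eight bytes per counter, is 32 GB:

#include <stdlib.h>

/* counting sort: O(n + k) time, O(k) extra space */
void counting_sort(unsigned *array, size_t n, unsigned k) {
    size_t *counts = calloc(k, sizeof *counts);
    for (size_t i = 0; i < n; i++)
        counts[array[i]]++;               /* tally each value */
    size_t out = 0;
    for (unsigned v = 0; v < k; v++)      /* write values back in order */
        for (size_t c = 0; c < counts[v]; c++)
            array[out++] = v;
    free(counts);
}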

Update: Added another question. Note to readers: The headline and first paragraph aren't to be taken seriously or directly! This is a joke to demonstrate a point, not a claim of mathematical discovery, and the original solution to the halting problem is an eternal mathematical truth.

Saturday, February 9, 2008

Factor[ial] 102

Previously, I wrote an introduction to Factor through factorial called Factor[ial] 101. If you saw that, you probably thought that's all you'd see of that totally un-compelling example. Well, you're wrong! We can actually implement factorial more simply, and more efficiently for large integers.

Let's look at the solution from last time:

: factorial ( n -- n! )
    1 [ 1+ * ] reduce ;

When you saw this, you might have been thinking, "Why not just get a list of the numbers from 1 to n and take their product? Why bother with 1+? In Haskell, it's just product [1..n]." We can actually use this strategy in Factor using the math.ranges library, which has a word called [1,b] that creates a virtual sequence containing the integers 1 through its argument. We can also use the word product from math.vectors to get the product of a sequence. So the new factorial is just

: factorial ( n -- n! )
    [1,b] product ;

Efficiency and bignums

I previously talked about how to make some simple mathematical functions work well with bignums. The example I used there was a string>number conversion procedure, but it applies equally to getting the product of a list. In short: when multiplying two bignums of size (in bits) n and m, there's a lower bound of Ω(n+m), since a bignum of up to n+m bits must be constructed. So if we go about finding the product of a sequence by starting with 1, then multiplying that by the first element, then the second, and so on for the entire sequence, this will take Ω(n²) time, where n is the size of the resulting product: each step costs at least the size of the partial product built so far, and those sizes grow steadily toward n.

We can do better, and the better strategy is binary recursion, similar to mergesort: split the sequence in half, find the products of both halves, and multiply them. Then the same easy lower-bound argument gives only Ω(n log n). (Note: the upper bounds for these algorithms are something like the lower bound times log n, with a reasonably efficient multiplication algorithm.)
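
To make the gap concrete, here's the rough arithmetic, under the assumptions that there are N elements of similar size, the final product has n bits, and each multiplication costs at least the number of bits in its output:

$$\text{sequential: } \sum_{i=1}^{N} \frac{i \cdot n}{N} = \Theta(nN) \qquad\qquad \text{balanced: } \sum_{j=0}^{\lg N} 2^j \cdot \frac{n}{2^j} = \Theta(n \log N)$$

When the elements are small, N is close to n (up to log factors), so these are the Ω(n²) and Ω(n log n) bounds above.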

So here's a better implementation of product, using this strategy:

: halves ( seq -- beginning end )
    dup length 2 /i cut-slice ;

: product ( seq -- num )
    dup length {
        { 0 [ drop 1 ] }
        { 1 [ first ] }
        [ drop halves [ product ] bi@ * ]
    } case ;

Abstraction and combinators

If we want to implement a word to sum an array, we'll be repeating a lot of code. So let's abstract the basic idea of this kind of recursion into a combinator, or higher order function, that we can supply the starting value and combining function to. With this, we should be able to write

: product ( seq -- num ) 1 [ * ] split-reduce ;
: sum ( seq -- num ) 0 [ + ] split-reduce ;

where split-reduce is this new combinator. Now, it's no harder to write code using the binary-recursive strategy than the original naive strategy, if split-reduce is somewhere in the library. Here's how you can implement it:

: split-reduce ( seq start quot -- value )
    pick length {
        { 0 [ drop nip ] }
        { 1 [ 2drop first ] }
        [ drop [ halves ] 2dip [ [ split-reduce ] 2curry bi@ ] keep call ]
    } case ; inline

This looks a little messy, as combinators sometimes get. Let's see how it looks using local variables: (I'm using the locals vocab, which allows the syntax :: word-name ( input variables -- output ) code... ; for lexical scoping. I usually don't use this very often, but in implementing combinators it can make things much cleaner.)

:: split-reduce ( seq start quot -- seq' )
    seq empty? [ start ] [
        seq singleton? [ seq first ]
        [ seq halves [ start quot split-reduce ] bi@ quot call ] if
    ] if ; inline

Which one of these you prefer is purely a matter of taste.

A tangent to mergesort

What if we wanted to use split-reduce to implement mergesort? It might look like you can do this:

: mergesort ( seq -- sorted )
    { } [ merge ] split-reduce ;

However, there's a problem here: in the base case, if we have { 1 }, it'll be changed into 1. But we need the base case to output sequences! (Ignore the fact that 1 is a sequence; it's of the wrong type.) So the cleanest way to do this is to make a new word, split-process, which does the same thing as split-reduce but takes a new parameter specifying what to do in the base case. With this we're able to do

: split-reduce ( seq start quot -- value )
    [ first ] split-process ; inline

To implement this, we just need to modify split-reduce, factoring out the base case code:

:: split-process ( seq start quot base-quot -- seq' )
    seq empty? [ start ] [
        seq singleton? [ seq base-quot call ] [
            seq halves
            [ start quot base-quot split-process ] bi@
            quot call
        ] if
    ] if ; inline

Now mergesort can be implemented as

: mergesort ( seq -- sorted )
    { } [ merge ] [ ] split-process ;

for some suitable implementation of merge.

To binrec and beyond

What if we took this even further: why restrict this to binary recursion on sequences? We can do binary recursion on everything that needs binary recursion! So let's make a combinator out of this, calling it binrec. binrec takes four—four!—quotations. The first one specifies the termination (base case) condition. The second specifies what to do in the base case. The third specifies how to split up the data in the inductive case, and the fourth specifies how to put the two pieces back together after the recursion takes place. Here's how we can implement binrec for a totally general binary recursion combinator:

:: binrec ( data test end split rejoin -- value )
    data test call [ data end call ] [
        data split call
        [ test end split rejoin binrec ] bi@
        rejoin call
    ] if ; inline

In the abstract, this isn't too bad. But how can you read code that uses binrec? You have to remember four quotations, their intended stack effects and their role in calculating this. For me, this is too difficult to do in most cases.

Look at how we can define split-process in terms of binrec:

:: split-process ( seq start rejoin-quot end-quot -- value )
    [ dup singleton? swap empty? or ]
    [ dup singleton? [ end-quot call ] [ drop start ] if ]
    [ halves ] rejoin-quot binrec ; inline

This isn't actually easier than defining split-process directly, and you can argue that it's worse than the original version. Still, it provides an interesting way to avoid explicit recursion.

Pulling it all together

Complicated combinators like binrec can be useful, sometimes, as long as you don't use them directly. One of the great things about Factor is that it's so easy to specialize these things. So why not? Almost every use of binrec follows a particular pattern.

We can tell everyone more loudly about split-reduce, which is much easier to use, and have binrec be hidden in the library for advanced users who want to implement their own similar combinators without repeating the code that's already written in binrec. It's not that recursion is difficult to do, it's just that there's no reason to write this code more than once.

So that's how you implement factorial in Factor. Except once all this is in the library, all you have to worry about is [1,b] product.

(BTW If you actually want to use factorial for something practical, where it'll be called multiple times, a memoizing table-based approach might be faster. Or maybe Stirling's approximation is appropriate, depending on the use case. Or maybe one of these algorithms. But that's a topic for another blog post!)

Update: I fixed some typos, found by Slava. Also, Slava added split-reduce into the core as binary-reduce and implemented sum and product with it.

Update 2: Updated the code samples for the current version of Factor, as of late March 2009.

Wednesday, February 6, 2008

XML and its alternatives

I started writing Factor's XML parser thinking that its purpose was to interface with legacy protocols which made the mistake of choosing XML, and that the people at the W3C were a bunch of idiots for pushing such a bad, unoriginal format on innocent programmers who would do better without it. At this point, though, I think it might not actually be that bad. Let's look at the alternatives for representing human-readable structured information for standardized protocols.

Flat text files

In the recent past, many protocols and file formats were written as flat files, binary or human-readable, each requiring an individually specialized parser. Many are still written this way. Does it make any sense to impose a tree structure on something as simple as blog syndication, documents or remote procedure calls? Or was it a wrong turn to put all of that in the same verbose, complicated syntax?

I think it was a good idea to specify these formats in terms of a common tree-based human-readable format. Maybe for some low-level network protocols, a flat text or binary file makes sense, but many other things work out well using a tree structure. For example, the Atom syndication format is a way to store things like blog feeds in XML. The structure is pretty simple: there's a bunch of metadata about the blog, and then there are a bunch of nodes corresponding to items, with roughly the same fields as the feed itself has. (I'm oversimplifying, here.) Atom uses a tree structure to store this, and the tree is in XML syntax. A tree structure makes sense, because there are a couple different sub-levels: there's the level of items, and then underneath that, the level of data about each item. These can be cleanly separated in a tree model.

Using a pre-existing XML parser, Atom is fairly easy to parse and generate. I wrote a simple library for Atom parsing and generation here in not much code.

An additional benefit of a tree structure in a standard syntax is that standard tools can be used on it. On the most basic level, you can use a parsing library. But is this really necessary if the format is simple enough anyway? When there is a large amount of information, parsing inevitably becomes harder, and a consistent encoding of hierarchical structure makes this easier.

A new alternative to Atom is Zed Shaw's ZSFF, where the information is in a simple flat-file format. (Update: Zed says this should be taken as a joke.) Originally, this only had basic information about the blog overall, and the URL of each post in chronological order. But when things were extended to show the article contents, Zed's solution was to have the flat file link to a special source format he uses to generate HTML. He didn't provide anything to get the date things were posted, which, in his case, can be deduced from the URL.

I don't mean to criticize Zed, but this new format will actually be more difficult for programmers to process than regular Atom feeds, if they want to build a "river of news"-style aggregator. A ZSFF aggregator (as Planet Factor is an Atom aggregator) would have to parse the URLs to figure out the dates to get a correct ordering, and follow the URLs with a .page extension to get content. Those pages must then be parsed to get the title and content in HTML form. Is it easier to write a ZSFF generator? Yes, but it's much harder to read, and that must be taken into consideration just as much.

S-expressions

Many smart Lispers have complained about XML. They claim s-expressions (sexprs), the basis for Lisp syntax, are better for most human-readable data serialization purposes with a bunch of good reasons. Some of these are,
  • Sexprs are simpler to parse—just use read.
  • They're easier to process, since they're just nested lists.
  • Sexprs encode real data types in a direct way, not just strings but also integers and floats.
  • XML is unnecessarily complicated, including things like the distinction between attributes and children and unnecessary redundancy in closing tags.
  • Sexprs came first and they do everything XML does, so why should we use XML?

It's difficult to defend XML's utility against the Lispers' criticisms, largely because at least half of the criticisms are correct: XML is completely isomorphic to sexprs but with a more complicated and verbose syntax. Still, it has a few advantages besides its legacy status. XML is very well-specified and is guaranteed not to differ between conforming implementations. The general idea of s-expressions is fairly universal, but what characters can be included in a symbol, what characters and escapes can be used in strings, the precision of floats and the identity of symbols all differ between implementations. Within a well-specified Lisp (I'm using Lisp in the general sense here), there's not much ambiguity, but in communicating between computers programmed with different parser implementations, XML is more robust. XML supports Unicode very consistently, with explicit mention of it in the standard. Sexprs can't be depended on for this.

This might not be helpful in general, but one really great thing about XML is that you can embed other XML documents inside of it very cleanly. I never thought I'd use XML for anything but others' protocols until it turned out to be the right tool for a quick hackish job: a simple resource file to generate the Factor FAQ. Previously, I maintained the FAQ in HTML directly, but it became tedious to update the table of contents and the start values for ordered lists. So in a couple hours I came up with and implemented a simple XML schema to represent the questions and answers consisting of XHTML fragments. I could parse the HTML I was already using for the FAQ and convert it to this XML format, and convert it back.

Could I have used s-expressions, or an equivalent using the Factor reader? Sure, but it would have taken more effort to construct the HTML, and I'm lazy. One reason is really superficial: I would have had to backslash string literals. Another reason is that I had an efficient XML parser lying around. But one thing which came in handy which I didn't expect was that because of XML's redundancy, the parser caught my mistakes in missing closing tags and pointed them out at the location that they occurred. Anyway, these are all small things, but they added up to XML being an easier-to-hack-up solution than s-expressions or ad-hoc parsing.

(JSON: I have nothing to say about JSON, but I don't think anyone's ever tried to use it for standardized protocols, just for AJAX. And I don't really know much about it. But I feel like I should mention it because it's out there, and it's definitely a reasonable option because it's strongly standardized. The same goes for YAML, basically. It's important to note, though, that JSON and YAML are in a way more complicated (though arguably more useful) because they include explicit data types.)

Conclusion

While XML isn't perfect, it is the right tool for some jobs, even some things it isn't currently used for. Of course, there are definitely cases where flat files or s-expressions are more appropriate; it would be stupid to reflexively use XML for everything where you want a human-readable data format. But format standards, while annoying, are great when someone else takes the time to implement them for you. This way, you don't have to worry about things like character encodings or parsing more complicated grammars as the format grows. The biggest benefits of XML are the tree structure and its standardization.

Update: For a more complete look at the alternatives to XML, check out this page which William Tanksley pointed out (unfortunately only on the Internet Archive).