I've been reading about garbage collection algorithms to potentially create a better one for Factor, so I thought I'd share the basics of what I've learned so far. Garbage collection, by the way, is the automatic reclamation of unreferenced heap-allocated memory. This means you can malloc without worrying about free, since the runtime system does it all for you.
Modern programming languages like Lisp, Java, Python, Prolog, Haskell and Factor all have some form of garbage collection built-in, whereas C, C++ and Forth require explicit freeing of unused memory.* Because all heap-allocated data is handled by the garbage collector (and stack allocation is rarely, if ever, used), garbage collector performance has an impact on all programs written in a language with GC.
For this reason, a lot of effort has been put into maximizing throughput (the proportion of time spent not doing garbage collection), and more recently, minimizing the length of individual pauses, possibly at the expense of some throughput. Other good things a garbage collector can do are improving locality (for both cache and virtual memory paging purposes), reducing memory fragmentation, and providing fast allocation. All things considered, a well-constructed garbage collector can actually give a program greater throughput than manual memory management, and when allocation is efficient, programmers are free to use higher-level techniques. First, let's look at the simplest garbage collection algorithms.
Reference counting—the naive algorithm
Here's a pretty easy way to automatically collect unused memory. Put a counter on each object, signifying how many outside references there are to this object. When it gets to zero, you know there are no references to it, so it can be deleted. When another object starts referencing that object, it needs to increment the reference count by one. When it stops referencing it, the referrer needs to decrement the reference count.
This adds just a little overhead: each time a pointer is copied or overwritten, a reference count may have to be adjusted. But this cost is distributed throughout the program, and it's not that great; no noticeable pause is caused. Another piece of overhead is that each reference-counted object has to have a counter as big as a pointer, since potentially, everything in memory points to a particular object. In practice, it doesn't need to be that big for most applications, but in the case of overflow, you need some other mechanism to clean up the mess.
There's a bigger problem, though: reference counting fails to collect cyclic data structures. Say I have a doubly linked list, or a tree with parent pointers. Then simple reference counting will fail to delete it when there are no outside references. Here's why: say we have three objects, A, B and C. A points to B. B points to C. C points to B. There are no other pointers to B and C. We'll assume A is a local variable, so its reference count is 1. B's reference count is 2, since A and C point to it. C's reference count is 1, since B points to it. When A goes out of scope and stops pointing to B, B's reference count will be decremented to 1. So B and C will each have a count of 1 and they won't be deleted!
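To make the leak concrete, here's a toy sketch of reference counting in Python. The `Obj` class and its fields are made up for illustration (this is not how any real runtime stores counts), but the arithmetic matches the A/B/C story above:

```python
# A minimal reference-counting sketch: each object tracks how many
# references point at it, and "frees" itself when the count hits zero.
class Obj:
    def __init__(self, name):
        self.name = name
        self.refcount = 0
        self.refs = []        # outgoing references
        self.freed = False

    def add_ref(self, target):
        # Start referencing `target`: bump its count.
        self.refs.append(target)
        target.refcount += 1

    def release(self):
        # One incoming reference went away: drop the count,
        # and if it reaches zero, free this object and release
        # everything it was pointing at.
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True
            for t in self.refs:
                t.release()

# Build the cycle from the text: A -> B, B -> C, C -> B.
a, b, c = Obj("A"), Obj("B"), Obj("C")
a.refcount = 1            # A is a local variable (one outside reference)
a.add_ref(b)
b.add_ref(c)
c.add_ref(b)

a.release()               # A goes out of scope
print(a.freed, b.freed, c.freed)   # prints: True False False
```

A is freed, but B and C are stranded with a count of 1 each, exactly the failure described above.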
There are a couple of different ways to deal with this. One is to run a cycle detector every now and then. This isn't so good because cycles can be long, and you don't want to go chasing pointers around unless you know where you're going; there's no cheap way to do it. Another way is to use a hybrid technique, where cycles are collected by something which doesn't have this weakness. The easiest way to do this is with partial mark-sweep collection.
Ensuring soundness with mark-sweep
Another way to go about this is to just allocate memory as normal, without worrying about freeing it, until we run out of memory. At that time, you can scan the heap and free everything that's not being referenced.
Say that each object has a mark bit, recording whether it's being referenced. At the beginning of the heap scan, every mark bit is set to zero. Then, we do a depth-first search of the heap, starting at things that we know are referenced, and at each item we visit, we set the mark bit to 1. After the marking process is done, we can scan through the heap again and add everything that's not marked to the free list.
One problem with this is that the stack space needed for the depth-first search is, potentially, as big as all of memory. Even if you don't use the host programming language's stack and instead use a manually maintained stack, things might not always work out. So the Deutsch-Schorr-Waite algorithm uses the space inside the pointers themselves to record what to do next: during the depth-first search, the pointer being followed is temporarily reversed to point back at the object it was reached from, and a flag bit on each object records whether its pointer is currently reversed, so the traversal can find its way back without a stack. Pointer reversal is somewhat expensive, but it comes with a constant space guarantee, which is valuable.
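Here's how the mark and sweep phases might look in a Python sketch. The `Node` class and `collect` function are hypothetical names, and the mark phase uses an explicit worklist rather than Deutsch-Schorr-Waite pointer reversal, so it doesn't have the constant-space guarantee:

```python
# A mark-sweep sketch over a toy heap of Node objects.
class Node:
    def __init__(self):
        self.marked = False
        self.children = []    # outgoing pointers

def collect(heap, roots):
    # Mark phase: set the mark bit on everything reachable from the
    # roots, using an explicit worklist for the depth-first search.
    worklist = list(roots)
    while worklist:
        node = worklist.pop()
        if not node.marked:
            node.marked = True
            worklist.extend(node.children)
    # Sweep phase: everything unmarked goes back to the free list;
    # survivors get their mark bit reset for the next collection.
    live = [n for n in heap if n.marked]
    freed = [n for n in heap if not n.marked]
    for n in live:
        n.marked = False
    return live, freed

a, b, garbage = Node(), Node(), Node()
a.children.append(b)          # a -> b is reachable; `garbage` is not
live, freed = collect([a, b, garbage], roots=[a])
print(len(live), len(freed))  # prints: 2 1
```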
The free list: not as efficient as it sounds
Any good C programmer will tell you: don't allocate memory in an inner loop. Why is this? It's because malloc'ing memory isn't a simple process, and in the worst case can become as slow, asymptotically, as traversing the entire heap. Here's how it works: when the program starts, there's a big free area called the heap where memory is available. There's something called a free list, a circular singly linked list holding all of the free blocks of memory. Initially, it contains just one big block. When you allocate memory, the free list is searched to find a block that's big enough to hold what you want. Then the free list is altered to say that that memory is taken, usually by shrinking the chosen block in place, which preserves the length of the free list.
Freeing memory is where the trouble starts. When you free memory, the old location is added to the free list. If memory is allocated in the same order that it's freed, this is no problem: adjacent free blocks of memory in the free list are joined, so there will only ever be two things in the free list. But this generally doesn't happen. Generally, as the program runs, things get freed in different orders, so the free list grows longer and longer, with pieces in it getting smaller and less usable. When the free list grows, it takes longer to allocate more memory. This is called heap fragmentation, and it's one reason why memory allocation is avoided in C.
There are ways around it. The simplest way to allocate memory is to look for the first place in the free list where the requested block will fit, but you can also look for the best fit. Another strategy is to partition the heap into areas for different sizes of allocated memory. But, fundamentally, heap fragmentation is unavoidable because the allocator isn't free to move things around: once an object is allocated, it's supposed to stay in the same place forever. This limitation is shared by refcounting and mark-sweep collectors. More advanced compacting collectors, like copying collectors and mark-compact collectors, don't have this limitation, making allocation cheaper.
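A toy first-fit allocator shows how the free list degrades. This is just a sketch with integer offsets standing in for addresses, and it deliberately skips coalescing so the fragmentation is visible:

```python
# A first-fit free-list sketch: the free list holds (offset, size)
# blocks, allocation takes the first block big enough, and freeing
# pushes the block back without coalescing.
free_list = [(0, 100)]            # one big block at program start

def alloc(size):
    for i, (off, blk) in enumerate(free_list):
        if blk >= size:
            if blk == size:
                free_list.pop(i)  # exact fit: remove the block
            else:
                # Shrink the block in place; list length is preserved.
                free_list[i] = (off + size, blk - size)
            return off
    raise MemoryError("no block big enough")

def free(off, size):
    # No coalescing with neighbors, so the list grows over time.
    free_list.append((off, size))

p1 = alloc(10); p2 = alloc(10); p3 = alloc(10)
free(p1, 10); free(p3, 10)        # freed out of allocation order
print(free_list)                  # prints: [(30, 70), (0, 10), (20, 10)]
```

After just three allocations and two out-of-order frees, the list already has three separate blocks, and every future allocation has to walk past them.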
Copying collection with Cheney's algorithm
Instead of dealing with a free list, we could just increment a pointer for allocation, and when memory fills up, move all the data that's referenced someplace else. This is the basic idea behind copying collection. We have two spaces, Fromspace and Tospace. All of the memory is allocated in Fromspace by incrementing a pointer, since everything above that pointer is free. When it fills up, copy everything in Fromspace that's still referenced over to Tospace. We can preserve the structure of the pointers by making sure that if something gets moved to Tospace, its old location in Fromspace is overwritten with a special forwarding value that indicates where it's already been copied to. When we're all done, swap the roles of Fromspace and Tospace.
This has the same basic problem as mark-sweep collection: we can't just do a depth-first search of the heap. The easiest way to solve this is by instead using breadth-first search. No, we don't want an external queue to store this stuff in; we can use Tospace itself.
Here's how it works with Cheney's algorithm. First, copy the stuff we know is referenced into Tospace. (These are called the roots, by the way.) Maintain one pointer, initially at the beginning of Tospace, called "scan", and another, initially just after the copied roots, called "free". "Scan" represents where we're looking for new things to copy, and "free" represents the start of unallocated memory in Tospace. Look at the object at "scan"; for each pointer it contains, copy the referent to "free" (unless it's already been copied, in which case just follow the forwarding value) and advance "free" by the size of each copy. Then advance "scan" past the object. Continue until "scan" and "free" are equal, that is, until nothing more is referenced that hasn't already been copied. To me, this is a very elegant instance of breadth-first search, where a queue emerges naturally between the "scan" and "free" pointers.
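Here's a small Python sketch of Cheney's algorithm (the `Cell` class and field names are invented for illustration). Tospace is a plain list, `scan` is an index into it, and `free` is implicitly the end of the list, so the queue lives between `scan` and the list's end:

```python
# A Cheney-style copying collector over toy Cell objects.
class Cell:
    def __init__(self, value, refs=()):
        self.value = value
        self.refs = list(refs)
        self.forward = None      # set once this cell is copied to Tospace

def copy(cell, tospace):
    # Copy a Fromspace cell to Tospace, unless it already has a
    # forwarding pointer, in which case just follow it.
    if cell.forward is None:
        clone = Cell(cell.value, cell.refs)
        cell.forward = clone     # forwarding value left in Fromspace
        tospace.append(clone)    # "free" advances
    return cell.forward

def cheney_collect(roots):
    tospace = []
    roots = [copy(r, tospace) for r in roots]   # copy the roots first
    scan = 0
    while scan < len(tospace):                  # until scan catches free
        cell = tospace[scan]
        # Rewrite each pointer to the referent's Tospace copy,
        # copying the referent if it hasn't been copied yet.
        cell.refs = [copy(r, tospace) for r in cell.refs]
        scan += 1
    return roots, tospace

x = Cell("x"); y = Cell("y", [x]); x.refs.append(y)   # a cycle: no problem
dead = Cell("dead")                                    # unreferenced
roots, heap = cheney_collect([y])
print(len(heap))       # prints: 2  (only x and y survive)
```

Note that the cycle between `x` and `y` is copied exactly once, because the forwarding pointer short-circuits the second visit, and `dead` is never touched at all.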
Sliding mark-compact using less space
Copying is pretty good because everything is compacted, and allocation is as simple as incrementing a pointer. But there's a cost: it takes twice as much space as the earlier algorithms because half of the space just isn't used at any given time. Another technique is to use something like mark-sweep (with the same marking phase), but instead of putting things on a free list, to move memory back towards the beginning. This is called mark-(sweep-)compact collection.
One goal in compaction is to preserve locality. The breadth-first search of copying collection doesn't do this well for most applications, but mark-compact collection can. The key to modern techniques for this, called "table-based", is to store information about free space that's been discovered in tables which are located in the free space itself. Once this information has been compiled, new locations for each piece of data can be calculated, pointers updated, and finally objects moved. The details are a little obscure, though.
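Since the table-based details are obscure, here's a sketch of the simpler three-pass "Lisp 2"-style sliding compaction instead (a toy model: slot indices stand in for addresses, and each object is a dict with hypothetical fields). It shows the essential point that live objects keep their relative order, so locality is preserved:

```python
# Sliding mark-compact in three passes, after marking is already done.
def compact(heap):
    # heap: list of objects, each {"marked": bool, "refs": [slot indices]};
    # a slot index stands in for an address in this toy model.
    # Pass 1: compute each live object's new slot by sliding down.
    new_addr = {}
    for i, obj in enumerate(heap):
        if obj["marked"]:
            new_addr[i] = len(new_addr)
    # Pass 2: rewrite every pointer to the referent's new slot.
    for obj in heap:
        if obj["marked"]:
            obj["refs"] = [new_addr[r] for r in obj["refs"]]
    # Pass 3: slide live objects to the front; allocation order
    # (and hence locality) is preserved, unlike a breadth-first copy.
    return [obj for obj in heap if obj["marked"]]

heap = [
    {"marked": True,  "refs": [2]},   # slot 0 points at slot 2
    {"marked": False, "refs": []},    # garbage between the live objects
    {"marked": True,  "refs": [0]},   # slot 2 points back at slot 0
]
compacted = compact(heap)
print(len(compacted), compacted[0]["refs"])   # prints: 2 [1]
```

The real table-based schemes get the same result without the per-object forwarding storage that pass 1 needs here, which is exactly what makes their details fiddly.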
This is only the beginning. There's a big problem with the last two methods discussed: they can cause long pauses, since they require a scan of the whole heap or at least all of the data that's in use. To address this, there are a number of possible improvements. The most basic one is generational garbage collection, and there are also incremental and concurrent methods. These get a lot more complicated, though, and I can't discuss them in this blog post. But before it's possible to understand those, you need the basics.
* Caveat: Linear Lisp (a Lisp dialect designed by Henry Baker) and JHC (a Haskell implementation) don't use garbage collection, instead using only stack allocation (in Linear Lisp) or region inference (in JHC). Mercury, a Prolog derivative, also uses region inference. But this is for another blog post which I'm not qualified to write. Also, you can define a garbage collector for C, C++ or Forth, but it has to be conservative, treating each word as if it might be a pointer, even when it isn't.
If any of this was hard to understand (especially for lack of diagrams), I strongly recommend reading Garbage Collection: Algorithms for Automatic Dynamic Memory Management by Richard Jones and Rafael Lins.
Update: Minor changes in wording in response to comments.