The train algorithm
The idea of the train algorithm is to break the heap (or, really, just the oldest generation) into small chunks that can be collected individually. These chunks need to contain a list, or remembered set, of things from the outside that point into it. (Card marking doesn't work so well in the presence of many separate chunks.) Then, crucially, cyclic data structures need to be put in the same chunk, or at least the same group of chunks which get collected together. You can think of the chunks as cars and the groups of chunks as trains.
It's a bit difficult to get this whole thing sound, though. The precise strategy is described really well in this term Würthinger. That PDF has tons of great diagrams. Java used to use the train algorithm optionally, but it was deprecated because the train algorithm has high overhead in terms of throughput and it can take several GC cycles to delete a cyclic data structure: as many as there are elements in the cycle.
Another thing we can try is making mark-sweep incremental. Mark-sweep isn't that good for general collection, since it can cause memory fragmentation and make allocation slow. However, for just the oldest generation (aka tenured space) in a generational system, it's not completely horrible. Since the oldest generation can be pretty large, compaction takes a long time, since everything has to be copied. (This is true whether you're using mark-compact or copying collection.)
So, can we base something off mark-sweep that eliminates long pauses? Well, going by the heading, I guess we could try to make it incremental. There are two pieces to this: incremental mark and incremental sweep. Actually, instead of incremental sweep, we can do either lazy sweep (sweep as much as we need whenever there's an allocation) or concurrent sweep (sweep in a concurrent thread, and have allocating threads block on allocation until there's enough space swept).
The marking phase is more difficult because of a consistency problem. Imagine this situation with objects A, B and C. A and C are pointed to by something that we know we're keeping. When marking starts, A points to B, and B and C points to null. The marker visits C and marks it as already visited. Then, before the marker visits A or B, C is changed to point to B and A is changed to point to null. Then, the marker visits A, and A gets marked. But then B is never marked, and it is lost!
The easiest solution is to trap all pointer writes to watch out for cases like this, making sure that B gets marked when A is changed. This is called a snapshot-at-the-beginning write barrier. But this makes it so that, if A and C both point to null, B still won't be collected until the next time around. That phenomenon is called floating garbage, and more subtle strategies remedy it. Most of these incremental algorithms can be parallelized with a little bit of work.
Aside from the book I recommended before, a good resource on incremental techniques is this ACM survey on garbage collection. [Update: There's also this great blog post which I forgot to mention before. It has lots of pretty diagrams.]
The people at Sun have come up with a new garbage collection strategy for Java called garbage-first garbage collection (G1GC). The idea is somewhat similar to the train algorithm: the heap is split up into small chunks which can be collected separately, maintaining a remembered set of inward references. But the G1GC uses all kinds of crazy heuristics to figure out what chunks are most likely to have a small remembered set. This works so well that this "youngness heuristic" can completely replace the generational mechanism. The whole thing is led by user-specified parameters about the maximum allowable pause time and throughput goals.
There's a paper describing G1GC [Update: link no longer requires ACM access], but I can't really understand it. A more intelligible source is FAQ #4 of the most recent blog post on Jon Masamitsu's blog. (Jon works in Java's GC group.)
Reference counting with concurrency and cycles
In a nondeterministically multithreaded environment, reference counting has problems. Increment and decrement operations need to be atomic, or else there will be consistency issues. For example, if two concurrent threads try to increment the reference count of a single variable at the same time, and it works out that they both read and then both write, then the reference count will only increase by one. This might mean that the memory is freed while there are still references to it! In the same way, a decrement could be missed.
A bad solution is to put a lock on each reference count. This is bad because it's slow: every time there's a new reference, you need to not only increment the refcount but also acquire and then free a lock. Another solution is to gave a worker thread which handles all increments and decrements; all other threads send messages to it.
To handle cycles, you could use a hybrid approach, to use mark-sweep when memory runs out in order to collect cycles. But there are other approaches. In an explicit refcounting system (where increments and decrements are manual), the user could be expected to insert a "weak reference", one which doesn't increase the refcount, whenever completing a cycle. Another way is to perform a small local marking trace when refcounts are decremented but not set to zero, to make sure there isn't an unreferenced cycle. That's described in this recent paper, which also handles concurrency. Here's an optimization on that with a proof of correctness.
Hard real-time GC
So far, I've been talking about minimizing pauses in a vague, general sense. We just want them to be a fraction of how long it takes to do a full tracing collection. But this isn't enough for some applications. Say you're making a video game where a 50ms GC pause (as the best incremental mark-sweep collectors benchmark at, I've heard) means a skipped frame or two. That can be noticeable, and it presents a real disadvantage compared to explicit allocation. Even refcounting doesn't always give really short pauses, since it causes memory fragmentation (making allocation take longer) and deallocation is not scheduled incrementally. That is, if you have a long linked list with just one reference to the head, and that reference ends, then the entire linked list gets deallocated in a chain, with no incremental scheduling.
What this situation needs is a garbage collector follow hard real-time guarantees. One way that this guarantee could be phrased is that pauses are at most 1ms, and that at least 7 out of 10 milliseconds are spent running the main program. This guarantee will be met even if the main program is acting "adversarially", bringing out the worst-case behavior in the collector. It's possible to specify a requirement like this that's unachievable for a particular application, but this requirement works for most things. Different applications can specify different requirements based on, say, their frame rate and how long it takes to render a frame. For applications like this, absolute throughput (which is basically maximized by a well-tuned generational collector) can be sacrificed in favor of better scheduling.
This sounds like an impossible dream, but it's actually been implemented in the Metronome system, implemented for Jikes by IBM. Metronome has been written about in the ACM Queue and there's also a paper which is harder to understand but explains more. The goal of the Metronome project is to allow high-level languages to be used for real-time applications on uniprocessor machines. While Java isn't what I'd choose, the GC seems to be the biggest barrier, and it's great that this research is being done.
The idea is to have an incremental mark-sweep collector (not generational) which segregates the heap into chunks (just for allocation purposes) of roughly the same size data. This minimizes fragmentation. However, fragmentation can still occur, and when a heap segment is too fragmented, it is incrementally copied and compacted to a different piece of memory. Large objects are split up into chunks called arraylets. By all of these techniques, garbage collection can be broken up into small tasks, and an innovative scheduler makes it satisfy the hard real-time guarantees.
Because the collector isn't generational, and because of the overhead of the scheduler and the floating garbage that's left by the incremental collector, this is far from optimal for applications that don't really need things to be so predictably well-spaced. But maybe, if there were more knobs on this algorithm (eg, the scheduler can be turned off, and more generations can be added), this could be a general-purpose GC that's really useful.
GC and language features
In the most basic sense, a garbage collection system consists of one exposed function,
allocate, which takes a number of bytes and allocates a region of memory that's that big. But there are some other things that can be useful. For example, for tracing collectors, a
collect-garbagefunction can be used to do a major collection when the program knows it's idle.
Another useful feature is finalizers. For most things, it's sufficient to just deallocate memory in when it's collected. But think about files. You should always explicitly close a file when you're done with it, but if the programmer makes an error, the file should still be closed once it is unreachable. With a reference counting or mark-sweep collector, this is relatively easy: just have a generic function
finalizethat gets called on everything that's collected. With copying collection, the collector maintains a list of objects that have finalizers, and on each copying cycle, this list is traversed and it is checked whether objects have forwarding pointers in fromspace. If an object with a finalizer doesn't have a forwarding pointer, it has been deleted and the finalizer should be invoked. This avoids a full traversal of the heap.
Actually invoking so simple, because now the object might contain some dead pointers. With a reference counting collector, if you're not collecting a cycle, you can call the finalizers in top-down order (also called topological order), and then the dead pointer issue doesn't exist. But this breaks down in the presence of cycles, and is difficult to calculate with a tracing collector. An easier-to-implement strategy is to call the finalizers in arbitrary order, but call them all before garbage is actually collected. Alternatively, everything the finalizer references can be considered a root. But in this situation, programmers have to be very careful not to retain the objects forever.
This summer, I hope to join this research community in a small way by participating in a Harvey Mudd REU (summer undergraduate research project) in garbage collection. In previous summers, an idea of blobs was developed, a generalization of a concept called ephemerons to make weak hashtables without memory leaks. They wrote and published a paper about it. They also researched and wrote about garbage collection techniques for certain hard-to-collect persistent data structures.
Another project of the leader of this group is called a simplifier, which is something that gets invoked in a copying collector to simplify a datastructure when it gets copied. This is a technique that is used in an ad-hoc way in graph-reduction runtimes in purely functional languages: you don't want to copy the whole graph if there's an easy way to simplify it without allocating any new nodes. It should be really fun to research these more advanced techniques for making garbage collection more correct and efficient.
Where to now for more research?
Academia has been working on this since the '60s. But recently, big companies like Sun, IBM and Microsoft have been doing more here in order to augment their Java and .NET platforms. Some academic resources to look at to learn more about GC are at UT Austin's website (especially bigsurv.ps). There are conferences which discuss memory management, like the International Symposium on Memory Management and SPACE.
When implementing a garbage collector for research purposes, you probably don't want to build a whole programming language runtime yourself. Jikes RVM provides an open-source research virtual machine that you can easily plug in different garbage collectors into. Jikes RVM's MMTk (Memory Manager Toolkit) makes this possible. There are visualization tools, and the heap can be split up into different segments which use different collectors.
These advanced garbage collection algorithms haven't been implemented many times and aren't well-understood by many people. There also hasn't been much work in formally analyzing and comparing these algorithms. This is partly because they're hard to analyze; constant factors have a gigantic effect. Someone came up with a unified theory of garbage collection, though, which analyzes all garbage collection strategies as some combination between marking and reference counting, which can be seen as duals. Just like with differential equations, there's no general solution which meets all of our requirements (maximizing throughput and locality, minimizing and organizing pause times, making allocation cheap) at once, though our understanding is always improving.
* You may be wondering why I keep saying that these posts are short when they're something like four pages long printed out. Mainly, it's because the things that I'm reading are much longer. It'd be kinda hard for me to describe anything meaningful in 500 words or less.