Saturday, May 3, 2008

A couple GC algorithms in more detail

In previous posts on garbage collection, I've given a pretty cursory overview as to how things actually work. In this post, I hope to give a somewhat more specific explanation of two incremental (and potentially concurrent or parallel, but we'll ignore that for now) GC algorithms: Yuasa's snapshot-at-the-beginning incremental mark-sweep collector, and the MC2 algorithm. Yuasa's collector is very widely used, for example in Java 5 when an incremental collector is requested. MC2 is a more recent algorithm designed to reduce the fragmentation that mark-sweep creates, and appears to get great performance, though it isn't used much yet. In their practical implementation, both collectors are generational.

Yuasa's mark-sweep collector

The idea is pretty simple: take a mark-sweep collector and split up the work, doing a little bit on each allocation. When the heap occupancy passes a certain threshold, say 80%, switch into "mark phase", and on each allocation, mark the right amount of the heap so that everything's marked by the time the heap is full. (You can ensure this by making the amount of marking proportional to the amount of memory allocated.) Then, switch into sweep phase, and on each allocation sweep the heap by a certain amount. If a big object is allocated, sweeping continues until there's enough room. Once sweeping is done, the collector returns to a neutral state and allocation takes place without any special collection actions until the free space dips below the threshold.

Making this work

This is a neat little way to specify a GC algorithm. The implementor has three knobs at their disposal: the threshold to begin collection, the speed of marking, and the speed of sweeping. But there's a problem: the algorithm, as I described it, doesn't work. See, the graph of interconnections in the heap may change during the course of marking, and that's a problem. As I described in a previous post, if a pointer gets moved to another location, it might evade marking and get swept, causing memory corruption.

In a snapshot-at-the-beginning incremental marking GC, the technique to save this is to trap all pointer writes and execute a little bit of code: if the collector is in the marking phase, and if the old pointer value isn't marked, it needs to get marked and get pushed on the marking stack so that its children get marked. (The marking stack is the explicit stack used for depth-first traversal of the heap, to mark everything it reaches.) This piece of code is called the write barrier, and it goes on in addition to the generational write barrier, if one is necessary.


One more thing: objects are allocated as marked, if an object is allocated during a GC cycle. This means that they can't be collected until the next time around. Unfortunately, this means that any generational GC will be ineffective while marking is going on: everything is effectively allocated in the oldest generation. Nevertheless, generations still provide a significant performance advantage, since most time is spent in the neural non-GC state.

This is called snapshot-at-the-beginning not because an actual snapshot is made, but because everything is saved that had something referring to it at the beginning of the marking phase. (Everything that gets a reference to it during the cycle is also saved.) Of all incremental mark-sweep GC algorithms, a snapshot-at-the-beginning collector is the most conservative, causing the most floating garbage to lie around and wait, uncollected, until the next cycle. Other algorithms have techniques to avoid this, but it often comes at other costs.


Unfortunately, no matter what strategy is used to minimize fragmentation, there is a program which will cause bad fragmentation of the heap, making it less usable and allocation more expensive. For this reason, a compaction strategy is helpful, and the MC2 algorithm (Memory-Constrained Compaction), created by Narendran Sachindran, provides one within an incremental and generational system. The details are somewhat complicated, and in this blog post I'll offer a simplified view. You can also look at the full paper.


The idea is based on the Mark-Copy (MC) algorithm. The heap is divided up into a number of equally sized windows, say 40. One of these is the nursery, and the others act as tenured space. (I don't know why, but the papers about this seem to use a two-generation rather than three-generation model. I think it could easily be updated to use three generations, but I'll stick with this for now.) Each window has a logical number, with the nursery having the highest number.

Nursery collections go on as I've described in a previous post. A tenured space collection is triggered when there is only one (non-nursery) window left free. At this point, the heap is fully marked. During marking, remembered sets of pointers into each window are made. In turn, each window is copied (using Cheney's copying collector) to the open space that exists, starting in the one free window. The remembered sets can be used to update pointers that go to things that were moved. If the lowest number window is copied first, the remembered sets only need to contain pointers from higher windows to lower windows.

New modifications

MC2 adds a few things to this, to make the algorithm incremental and give low upper bounds on space overhead. The first change is that incremental marking is done. This is similar to the incremental snapshot-at-the-beginning marker described above, though the creators of MC2 opted for a version called incremental update, which is less conservative and more complicated but equally sound. The next change is in the copying technique. If a window is determined to have high occupancy (like more than 95%), it is left as it is without copying. Otherwise, windows are collected into groups whose remaining data can fit into one window. Those groups are incrementally copied into a new window.

Other changes make sure that the space overhead is bounded. The size of remembered sets is limited by switching to a card marking system in the event of an overflow. Objects with many references to them are put in semi-permanent storage in the lowest possible window number, minimizing the size of remembered set that they need.

In a benchmark included in the MC2 paper, it is demonstrated that MC2 has the same or slightly better performance compared to non-incremental generational mark-sweep or generational mark-compact, the alternatives for the domain of memory-constrained systems. Pauses more than 30ms are rare, and performance appears to be consistent over a wide range of Java programs.


dmpk2k said...

> I don't know why, but the papers about this seem to use a two-generation rather than three-generation model.

Generations don't come free. For the same amount of memory requested from the OS you need to run minor collections more often (since the nursery space(s) need to be smaller).

Also, objects that survive the first two generations will have been copied more often. So the question is if the second copy is worth it, particularly when you consider that it requires more frequent minor collections.

Daniel Ehrenberg said...


You're right. I don't have empirical evidence one way or the other, which is very unfortunate. But the number of generations is orthogonal from the choice of nursery size. I've just heard, from several places, that three generations happens to turn out well most of the time, and that's not a very strong argument.

dmpk2k said...

It'd be great if it was orthogonal, but unfortunately I doubt it is. Let's say you have a live set (x bytes) and memory provided from the OS (y bytes).

If you manage to completely avoid fragmentation and overhead, you're left with y-x bytes to use for your younger generations.

So now the question is how you'll divide up those bytes.

I used to think three generation was ideal, but then I noticed that none of the benchmarks were using three, just two. Obviously this also varies with the behaviour of the workload.