In my last post about garbage collection, I mentioned that that there are more complicated and efficient algorithms to do GC better. A few people asked me for a followup describing those, so here's my best shot. For inspiration, let's look at an excellent runtime system for a horrible programming language.
Java's garbage collection
Sun's HotSpot JVM gives us a good example of how to go about garbage collection. It may not be the best programming language, but for more than 10 years they've been working on a good implementation of a garbage collector. What they have now is pretty good, and programs can expect 95-99% throughput with few noticeable pauses. Everything can be tweaked by advanced users through command-line options.
Originally, the JVM used a very inefficient, naive GC algorithm. (I think it was mark-sweep, but I can't say for sure because the docs are fairly sketchy.) By version 1.2, Java switched to a generational scheme where copying (originally mark-compact) is used for the younger generations and mark-compact is used for the oldest generation. In version 1.3, there was an optional implementation of the train algorithm for applications which needed to avoid a pause, but this has since been discontinued. In version 1.4, an incremental mark-sweep algorithm was introduced for programs with harder real-time constraints, along with "heap ergonomics"--smart runtime heap and generation resizing. In Java 1.5, the main maximum-throughput generational collector was upgraded to use multiple threads in minor GC collections. Java 7 will have a new collector, invented at Sun, called Garbage First, which can be seen as an extremely complicated generalization of generational GC which I don't completely understand. A listing of the collectors in Java 6, along with information about G1GC, is here
Java uses some interesting concepts in its garbage collector, and in this post I hope to describe enough to understand the ideas behind these things, though implementation details will vary wildly. I haven't described the incremental algorithms here (train, G1, concurrent mark-sweep), but hope to do so in a later post.
Increasing throughput with generational collection
We want to make our garbage collector such that programmers don't really have to worry about memory allocation costs. In copying and mark-compact collection systems, the cost of allocation is just incrementing a pointer, so this is great. Or is it? In truth, the cost is higher: each time we allocate memory, we move closer to having to do a full garbage collection, which results in either a traversal of the heap or copying of all the data on the heap. This isn't so good if we have to do it a lot, so memory allocation isn't so cheap.
With a quick realization, we can make this more efficient: most objects die young. This is called the weak generational hypothesis. (There's a stronger variant of this, but it's not really true, so we'll ignore it.) Anyway, once we have this realization, we can use generational garbage collection, or garbage collection with objects segregated by age.
Here's the idea, in somewhat better-thought-out form: let's take the heap and split it into three generations. The first generation is called the nursery, the second is called aging space and the last tenured space. Aging space is bisected into a "fromspace" and a "tospace" in the way that copying garbage collection works. When objects get allocated, they are first put in the nursery. When the nursery fills up, there is a minor garbage collection, where everything in the nursery which is referenced from the roots is copied to the next generation up, the place in aging space where allocation happens. When aging space fills up, we copy to the other aging space, back and forth in the style that copying GC normally works. A minor GC needs to happen whenever this takes place. When aging space is full (or almost full) after one of these back-and-forth aging space GC cycles, an intermediate collection takes place, where things in aging space are moved to tenured space, and things in the nursery are moved to aging space. Tenured space can be managed by a mark-compact algorithm, or potentially other things instead. When all of tenured space (and everything below it) collects garbage, it's called a major garbage collection.
There are variations to this, but that's the basic idea. Except what I wrote above is unsound, potentially: in a minor or intermediate GC, it's insufficient to only consider the roots; we also have to treat pointers from older generations to younger generations as roots. The idea is that we can figure out where these roots are without traversing the entire heap.
There are two ways to do this, using a remembered set or a card marking system. Both of these require a write barrier, or a small piece of code that's executed when writing a pointer. With a remembered set, there's a list of memory addresses stored which contain old-to-young pointers. The write barrier for this strategy must check each write operation to see if a pointer is being stored, and if so, if it's an old to young pointer. A more efficient strategy is card marking: divide the heap for older generations into cards, say, 128 bytes each, and when an object is modified, unconditionally set this bit on. Then, when a garbage collection occurs, old-to-young pointers only need to be scanned among cards which are marked. If no such pointer is found, the card can be unmarked.
A few more spaces to put stuff
What I've described so far interacts terribly with concurrency. This is because there has to be a global lock on allocation, which can form a bottleneck if there are many threads and allocation is relatively frequent. To fix this, just make a separate nursery for each thread. This is called a thread-local allocation buffer. The only thing you have to watch out for is that, whenever there's a GC going on from one TLAB, other allocation must stop or there is a risk that more than one TLAB could try to move to aging space at a time. Also, if something's referenced from more than one thread, it has to be moved outside of a TLAB (or else the write barrier has to be accommodated to deal with inter-thread pointers).
It's fairly inefficient to copy large objects, so a collection strategy more like mark-sweep might be more appropriate. To have this, we can make a large object area where the bodies of large objects are put immediately when they are allocated. Though it's managed by a free list, fragmentation won't be so severe (though it still happens) because of the size of the objects involved. Headers for these objects aren't in the LOA itself but rather passed around the heap like regular objects, promoted through various age groups in the normal way.
Alternatively, large objects could just be immediately placed in the oldest generation. This strategy works best when the oldest generation isn't compacting.
How big should things be?
Counterintuitively, it can actually speed things up for the nursery and aging space to be relatively small. It's good for the nursery to be small because, if it's small enough, it can fit entirely in the CPU's L2 cache for faster access. Also, since most of it is expected to be garbage (since most objects die young), there's no benefit to having it be large. It's good for the aging space to be relatively small, too, since if it's too big, then the same long-lived objects will bounce back and forth inside of it, needlessly copied and increasing pause times. It'd be better if these objects were promoted to tenured space earlier. Also, aging space represents objects allocated at around the same time, so if they're moved to tenured space at the same time, they'll be close together in memory, improving locality.
When the tenured space fills up and stays filled up (or almost filled up) after a major garbage collection, the heap needs more space. You could increase the size of all of the generations, but because of the reasons in the previous paragraph, it's usually best just to increase the size of the tenured space, and maybe the large object area if you're using one.
So, say the heap size required was very high, but now not as much is needed. How can we go back and make the heap smaller? It wouldn't be good to immediately free part of the heap when it's not being used, because it might be needed soon again. A good strategy is to set a soft goal for throughput: shrink the heap as long as the total throughput, or percentage of time spent not collecting garbage, is below a certain bound. This will avoid a cycle of growing and shrinking the heap while allowing the heap to shrink as long as heap shrinking doesn't occupy too much time.
Why this is still insufficient
Is this the end of what we need to know about garbage collection? When I started writing this post, I thought it would be. Generational collection, along with TLABs and a LOA, can significantly improve the performance of garbage collection, but there's still the risk of long, noticeable pauses caused by a full traversal of the heap.
So I still need to do a bit more research before being able to implement a perfect garbage collection algorithm for Factor which can avoid noticeable pauses. Maybe G1 is the answer here (but it's really complicated), or maybe something simpler like the train algorithm or concurrent mark-sweep (but these both have significant performance disadvantages). Until then, heap ergonomics, mark-compact or mark-sweep in the oldest generation and maybe a large object area would be good places to start in improving Factor's memory management system.
Right now, Factor's system consists of a three-generational copying collector which grows all generations by a particular factor when the heap is exhausted. The heap never shrinks. There is a separate section of the heap for compiled code, which uses a separate mark-sweep system, which can potentially make the system run out of memory when it doesn't have to. I hope this can be fixed by Factor 1.0, but if not, the GC will definitely be redone in Factor 2.0 when multiple system threads are supported.