Friday, March 7, 2008

A little more about garbage collection

In my last post about garbage collection, I mentioned that that there are more complicated and efficient algorithms to do GC better. A few people asked me for a followup describing those, so here's my best shot. For inspiration, let's look at an excellent runtime system for a horrible programming language.

Java's garbage collection

Sun's HotSpot JVM gives us a good example of how to go about garbage collection. It may not be the best programming language, but for more than 10 years they've been working on a good implementation of a garbage collector. What they have now is pretty good, and programs can expect 95-99% throughput with few noticeable pauses. Everything can be tweaked by advanced users through command-line options.

Originally, the JVM used a very inefficient, naive GC algorithm. (I think it was mark-sweep, but I can't say for sure because the docs are fairly sketchy.) By version 1.2, Java switched to a generational scheme where copying (originally mark-compact) is used for the younger generations and mark-compact is used for the oldest generation. In version 1.3, there was an optional implementation of the train algorithm for applications which needed to avoid a pause, but this has since been discontinued. In version 1.4, an incremental mark-sweep algorithm was introduced for programs with harder real-time constraints, along with "heap ergonomics"--smart runtime heap and generation resizing. In Java 1.5, the main maximum-throughput generational collector was upgraded to use multiple threads in minor GC collections. Java 7 will have a new collector, invented at Sun, called Garbage First, which can be seen as an extremely complicated generalization of generational GC which I don't completely understand. A listing of the collectors in Java 6, along with information about G1GC, is here

Java uses some interesting concepts in its garbage collector, and in this post I hope to describe enough to understand the ideas behind these things, though implementation details will vary wildly. I haven't described the incremental algorithms here (train, G1, concurrent mark-sweep), but hope to do so in a later post.

Increasing throughput with generational collection

We want to make our garbage collector such that programmers don't really have to worry about memory allocation costs. In copying and mark-compact collection systems, the cost of allocation is just incrementing a pointer, so this is great. Or is it? In truth, the cost is higher: each time we allocate memory, we move closer to having to do a full garbage collection, which results in either a traversal of the heap or copying of all the data on the heap. This isn't so good if we have to do it a lot, so memory allocation isn't so cheap.

With a quick realization, we can make this more efficient: most objects die young. This is called the weak generational hypothesis. (There's a stronger variant of this, but it's not really true, so we'll ignore it.) Anyway, once we have this realization, we can use generational garbage collection, or garbage collection with objects segregated by age.

Here's the idea, in somewhat better-thought-out form: let's take the heap and split it into three generations. The first generation is called the nursery, the second is called aging space and the last tenured space. Aging space is bisected into a "fromspace" and a "tospace" in the way that copying garbage collection works. When objects get allocated, they are first put in the nursery. When the nursery fills up, there is a minor garbage collection, where everything in the nursery which is referenced from the roots is copied to the next generation up, the place in aging space where allocation happens. When aging space fills up, we copy to the other aging space, back and forth in the style that copying GC normally works. A minor GC needs to happen whenever this takes place. When aging space is full (or almost full) after one of these back-and-forth aging space GC cycles, an intermediate collection takes place, where things in aging space are moved to tenured space, and things in the nursery are moved to aging space. Tenured space can be managed by a mark-compact algorithm, or potentially other things instead. When all of tenured space (and everything below it) collects garbage, it's called a major garbage collection.

There are variations to this, but that's the basic idea. Except what I wrote above is unsound, potentially: in a minor or intermediate GC, it's insufficient to only consider the roots; we also have to treat pointers from older generations to younger generations as roots. The idea is that we can figure out where these roots are without traversing the entire heap.

There are two ways to do this, using a remembered set or a card marking system. Both of these require a write barrier, or a small piece of code that's executed when writing a pointer. With a remembered set, there's a list of memory addresses stored which contain old-to-young pointers. The write barrier for this strategy must check each write operation to see if a pointer is being stored, and if so, if it's an old to young pointer. A more efficient strategy is card marking: divide the heap for older generations into cards, say, 128 bytes each, and when an object is modified, unconditionally set this bit on. Then, when a garbage collection occurs, old-to-young pointers only need to be scanned among cards which are marked. If no such pointer is found, the card can be unmarked.

A few more spaces to put stuff

What I've described so far interacts terribly with concurrency. This is because there has to be a global lock on allocation, which can form a bottleneck if there are many threads and allocation is relatively frequent. To fix this, just make a separate nursery for each thread. This is called a thread-local allocation buffer. The only thing you have to watch out for is that, whenever there's a GC going on from one TLAB, other allocation must stop or there is a risk that more than one TLAB could try to move to aging space at a time. Also, if something's referenced from more than one thread, it has to be moved outside of a TLAB (or else the write barrier has to be accommodated to deal with inter-thread pointers).

It's fairly inefficient to copy large objects, so a collection strategy more like mark-sweep might be more appropriate. To have this, we can make a large object area where the bodies of large objects are put immediately when they are allocated. Though it's managed by a free list, fragmentation won't be so severe (though it still happens) because of the size of the objects involved. Headers for these objects aren't in the LOA itself but rather passed around the heap like regular objects, promoted through various age groups in the normal way.

Alternatively, large objects could just be immediately placed in the oldest generation. This strategy works best when the oldest generation isn't compacting.

How big should things be?

Counterintuitively, it can actually speed things up for the nursery and aging space to be relatively small. It's good for the nursery to be small because, if it's small enough, it can fit entirely in the CPU's L2 cache for faster access. Also, since most of it is expected to be garbage (since most objects die young), there's no benefit to having it be large. It's good for the aging space to be relatively small, too, since if it's too big, then the same long-lived objects will bounce back and forth inside of it, needlessly copied and increasing pause times. It'd be better if these objects were promoted to tenured space earlier. Also, aging space represents objects allocated at around the same time, so if they're moved to tenured space at the same time, they'll be close together in memory, improving locality.

When the tenured space fills up and stays filled up (or almost filled up) after a major garbage collection, the heap needs more space. You could increase the size of all of the generations, but because of the reasons in the previous paragraph, it's usually best just to increase the size of the tenured space, and maybe the large object area if you're using one.

So, say the heap size required was very high, but now not as much is needed. How can we go back and make the heap smaller? It wouldn't be good to immediately free part of the heap when it's not being used, because it might be needed soon again. A good strategy is to set a soft goal for throughput: shrink the heap as long as the total throughput, or percentage of time spent not collecting garbage, is below a certain bound. This will avoid a cycle of growing and shrinking the heap while allowing the heap to shrink as long as heap shrinking doesn't occupy too much time.

Why this is still insufficient

Is this the end of what we need to know about garbage collection? When I started writing this post, I thought it would be. Generational collection, along with TLABs and a LOA, can significantly improve the performance of garbage collection, but there's still the risk of long, noticeable pauses caused by a full traversal of the heap.

So I still need to do a bit more research before being able to implement a perfect garbage collection algorithm for Factor which can avoid noticeable pauses. Maybe G1 is the answer here (but it's really complicated), or maybe something simpler like the train algorithm or concurrent mark-sweep (but these both have significant performance disadvantages). Until then, heap ergonomics, mark-compact or mark-sweep in the oldest generation and maybe a large object area would be good places to start in improving Factor's memory management system.

Right now, Factor's system consists of a three-generational copying collector which grows all generations by a particular factor when the heap is exhausted. The heap never shrinks. There is a separate section of the heap for compiled code, which uses a separate mark-sweep system, which can potentially make the system run out of memory when it doesn't have to. I hope this can be fixed by Factor 1.0, but if not, the GC will definitely be redone in Factor 2.0 when multiple system threads are supported.

13 comments:

wtanksley said...

Did you see the Unified View of Garbage Collection paper? Interesting. Also cites a lot of work on many different types of GC systems, which would seem useful to your work here.

-Wm

Unknown said...

wtanksley,

Yeah, I did. Haven't read it all, though. At this point I'm still having trouble understanding those more advanced techniques.

Slava Pestov said...

Dan, Factor's GC will definitely see some changes before 1.0. I plan on using mark/sweep for the oldest generation and adding heap shrinking. As for a large object area, I'm not convinced its really necessary; allocating large objects directly in tenured space seems to have the same effect. Incremental collection will come later, although I'm not sure which algorithm I will use.

Anonymous said...

An idea I heard somewhere or other: do a fork and run the mark and sweep in the child and communicate back the data to be freed. Parent doesn't need to stop and anything free in the child is free in the parent. Don't know how it would work on an OS that doesn't have a unix style fork though.

Alan

Unknown said...

Alan,

Sure, you can do that, but things get more complicated. To do concurrent marking (which is what you're talking about), you have to make sure that this situation doesn't occur: A points to B, and C doesn't point to anything when the situation begins. The marker scans C, and then the program thread makes a C->B pointer, and deletes the C->A pointer. So when the marker gets to A, it sees no reference to C, and C never gets marked. The techniques to avoid this add a little overhead.

Also, you need to try really hard to avoid getting to a state where the program thread is waiting for the collector to finish up what it's doing; that'd cause a pause.

About forking: I don't think that's a problem, unless you're using a very limited system (which Factor will probably never be ported to), though on single-processor systems a needless concurrent thread like this can decrease throughput somewhat.

Anonymous said...

Is there a reason most(?) Copy Collectors use a flip/flopping to/from partition opposed to sliding everything down?

Unknown said...

Anonymous, I don't really understand what you mean by "moving everything down." Sounds kinda like mark-compact collection, which I described in my last blog post. It saves space, but it takes a bit longer to do the collection because the whole heap, not just all of the objects that are referenced.

Anonymous said...

It sounds like you want real time garbage collection. Some researchers at IBM wrote about the Metronome system, which is a real time (concurrent?) collector for Java.

-Stu

Unknown said...

Stu,

Yeah, I saw that stuff. It looks interesting, but I don't think Factor needs hard real-time guarantees for most purposes. Anyway, if you see somewhat of a disconnect between the goals of the post and the conclusion, it's because I initially intended to write about incremental and concurrent techniques but later decided to save that for a later post...

Anonymous said...

re: fork then mark and sweep.

I think you missed the crucial bit. In a unix style fork the memory gets copied (well, copy-on-write), so you don't need to worry about what the program thread is doing. Anything that's free at the fork is free if nothing points to a bit of memory the program won't suddenly find a pointer to it. If the program stops pointing to something after the fork, that block will be caught the next time around.

Anonymous said...

With gencon, what would the garbage collector do if the tenured space is filled up and the nursery is not? Do we get an OOM the first instance when some aged objects in the nursery can't be moved to the tenured space or do we get it only if the whole nursery is filled up?

Unknown said...

To the most recent Anonymous,

Things can only move into tenured space by way of a nursery collection, so this problem doesn't exist. If a nursery collection would overfill tenured space, then a global collection is done. I apologize if my explanation wasn't clear. I don't know what you mean by OOM.

Anonymous said...

What do you do about non-memory garbage? That can only be solved correctly by integration with the high-level algorithm (which can't know whether non-memory resources are held by lower-level abstractions). Once a programmer has solved that for non-memory resources, as they must do to write a correct program, they've also solved it for memory resources which are a special case. At that point do you don't need to do any garbage collection - so the remaining issues are immediately solved.