Sunday, March 23, 2008

Some more advanced GC techniques

After my last two posts about garbage collection, some people suggested that more advanced techniques could be used to solve the pausing problem. Here's a quick* overview of some more advanced techniques, some of which can eliminate noticeable pauses and some of which can solve other problems in GC.

The train algorithm

The idea of the train algorithm is to break the heap (or, really, just the oldest generation) into small chunks that can be collected individually. Each chunk needs to maintain a list, or remembered set, of things from the outside that point into it. (Card marking doesn't work so well in the presence of many separate chunks.) Then, crucially, cyclic data structures need to be put in the same chunk, or at least in the same group of chunks that gets collected together. You can think of the chunks as cars and the groups of chunks as trains.
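As a rough illustration of just the remembered-set bookkeeping (not the full train algorithm; all the names here are invented), a write barrier on pointer stores could record cross-chunk references like this:

    // Hypothetical sketch: recording cross-chunk references in remembered sets.
    import java.util.HashSet;
    import java.util.Set;

    class Chunk {
        // Locations outside this chunk that point into it.
        final Set<Obj> rememberedSet = new HashSet<>();
    }

    class Obj {
        Chunk home;   // which chunk this object lives in
        Obj field;    // a single pointer field, for simplicity

        // All pointer stores go through this write barrier.
        void setField(Obj target) {
            if (target != null && target.home != this.home) {
                // Cross-chunk pointer: remember the source, so the target's
                // chunk can be collected without scanning the whole heap.
                target.home.rememberedSet.add(this);
            }
            this.field = target;
        }
    }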

It's a bit difficult to get this whole thing sound, though. The precise strategy is described really well in this term paper by Würthinger. That PDF has tons of great diagrams. Java used to offer the train algorithm as an option, but it was deprecated because the train algorithm has high overhead in terms of throughput, and it can take several GC cycles to delete a cyclic data structure: as many as there are elements in the cycle.

Incremental mark-sweep

Another thing we can try is making mark-sweep incremental. Mark-sweep isn't that good for general collection, since it can cause memory fragmentation and make allocation slow. However, for just the oldest generation (aka tenured space) in a generational system, it's not completely horrible. The oldest generation can be pretty large, so compaction takes a long time: everything has to be copied. (This is true whether you're using mark-compact or copying collection.)

So, can we base something off mark-sweep that eliminates long pauses? Well, going by the heading, I guess we could try to make it incremental. There are two pieces to this: incremental mark and incremental sweep. Actually, instead of incremental sweep, we can do either lazy sweep (sweep as much as we need whenever there's an allocation) or concurrent sweep (sweep in a concurrent thread, and have allocating threads block on allocation until there's enough space swept).
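Here's a minimal sketch of the lazy-sweep idea, with an invented block structure: allocation reclaims unmarked blocks on demand instead of sweeping the whole heap at once.

    // Hypothetical lazy sweeping: sweep only as much as allocation needs.
    class LazySweepHeap {
        static class Block {
            boolean marked;   // set by the most recent mark phase
        }

        private final Block[] blocks;
        private int sweepCursor = 0;  // how far sweeping has progressed

        LazySweepHeap(Block[] blocks) { this.blocks = blocks; }

        // Returns a free block, sweeping just enough to find one.
        Block allocate() {
            while (sweepCursor < blocks.length) {
                Block b = blocks[sweepCursor++];
                if (!b.marked) {
                    return b;         // unmarked means garbage: reuse it now
                }
                b.marked = false;     // live block: reset for the next mark phase
            }
            return null;  // swept everything: time for another mark phase
        }
    }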

The marking phase is more difficult because of a consistency problem. Imagine this situation with objects A, B and C. A and C are pointed to by something that we know we're keeping. When marking starts, A points to B, and B and C point to null. The marker visits C and marks it as already visited. Then, before the marker visits A or B, C is changed to point to B and A is changed to point to null. Then, the marker visits A, and A gets marked. But B is never marked, and it is lost!

The easiest solution is to trap all pointer writes to watch out for cases like this, making sure that B gets marked when A's pointer to it is overwritten. This is called a snapshot-at-the-beginning write barrier. But this makes it so that, if A and C are both changed to point to null, B still won't be collected until the next time around. That phenomenon is called floating garbage, and more subtle strategies remedy it. Most of these incremental algorithms can be parallelized with a little bit of work.
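A minimal sketch of that kind of barrier (a Yuasa-style deletion barrier; the class and field names are invented for illustration):

    // Hypothetical snapshot-at-the-beginning (deletion) write barrier.
    import java.util.ArrayDeque;
    import java.util.Deque;

    class SatbCollector {
        static final Deque<Node> markStack = new ArrayDeque<>();
        static volatile boolean markingInProgress = false;

        static class Node {
            boolean marked;
            Node field;

            // Every pointer overwrite goes through here while marking runs.
            void setField(Node newValue) {
                if (markingInProgress) {
                    Node old = this.field;
                    if (old != null && !old.marked) {
                        // Shade the old target so it can't be "hidden" from
                        // the marker, the way B was in the example above.
                        old.marked = true;
                        markStack.push(old);  // the marker traces from it later
                    }
                }
                this.field = newValue;
            }
        }
    }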

Aside from the book I recommended before, a good resource on incremental techniques is this ACM survey on garbage collection. [Update: There's also this great blog post which I forgot to mention before. It has lots of pretty diagrams.]

Garbage-first

The people at Sun have come up with a new garbage collection strategy for Java called garbage-first garbage collection (G1GC). The idea is somewhat similar to the train algorithm: the heap is split up into small chunks which can be collected separately, each maintaining a remembered set of inward references. But G1GC uses all kinds of crazy heuristics to figure out which chunks are most likely to have a small remembered set. This works so well that this "youngness heuristic" can completely replace the generational mechanism. The whole thing is driven by user-specified parameters about the maximum allowable pause time and throughput goals.
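Very roughly, the "garbage-first" part (picking which chunks to collect within a pause budget) might look like this sketch; the cost model here is entirely invented and much simpler than the real heuristics:

    // Hypothetical region selection: greedily pick the regions with the most
    // garbage, while the predicted evacuation time fits the pause budget.
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class RegionPicker {
        static class Region {
            double garbageBytes;     // estimated dead data in this region
            double predictedCostMs;  // estimated time to evacuate it
        }

        static List<Region> pick(List<Region> regions, double pauseBudgetMs) {
            List<Region> chosen = new ArrayList<>();
            regions.sort(Comparator.comparingDouble((Region r) -> r.garbageBytes).reversed());
            double spent = 0;
            for (Region r : regions) {
                if (spent + r.predictedCostMs <= pauseBudgetMs) {
                    chosen.add(r);
                    spent += r.predictedCostMs;
                }
            }
            return chosen;
        }
    }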

There's a paper describing G1GC [Update: link no longer requires ACM access], but I can't really understand it. A more intelligible source is FAQ #4 of the most recent blog post on Jon Masamitsu's blog. (Jon works in Java's GC group.)

Reference counting with concurrency and cycles

In a nondeterministically multithreaded environment, reference counting has problems. Increment and decrement operations need to be atomic, or else there will be consistency issues. For example, if two concurrent threads try to increment the reference count of a single variable at the same time, and it works out that they both read and then both write, then the reference count will only increase by one. This might mean that the memory is freed while there are still references to it! In the same way, a decrement could be missed.
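The standard fix is to make the updates atomic, so no increment or decrement can be lost. A sketch (the free() hook is invented):

    // Hypothetical atomic reference counting; AtomicInteger's operations
    // are atomic, so concurrent updates can't be lost.
    import java.util.concurrent.atomic.AtomicInteger;

    class RefCounted {
        private final AtomicInteger refCount = new AtomicInteger(1);

        void retain() {
            refCount.incrementAndGet();  // atomic: no lost increments
        }

        void release() {
            if (refCount.decrementAndGet() == 0) {
                free();  // exactly one thread sees the count hit zero
            }
        }

        void free() { /* return the memory to the allocator */ }
    }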

A bad solution is to put a lock on each reference count. This is bad because it's slow: every time there's a new reference, you need to not only increment the refcount but also acquire and then release a lock. Another solution is to have a worker thread which handles all increments and decrements; all other threads send messages to it.
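The worker-thread variant might look roughly like this, with all refcount traffic funneled through a queue (everything here, including the message type, is invented for illustration):

    // Hypothetical dedicated refcounting thread: mutator threads just enqueue
    // +1/-1 messages, so the counts themselves need no locks or atomics.
    import java.util.IdentityHashMap;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class RefCountWorker implements Runnable {
        // A +1 or -1 refcount message for some object.
        record Op(Object target, int delta) {}

        final BlockingQueue<Op> queue = new LinkedBlockingQueue<>();
        // Identity-based, since refcounts belong to objects, not values.
        private final Map<Object, Integer> counts = new IdentityHashMap<>();

        public void run() {
            try {
                while (true) {
                    Op op = queue.take();  // only this thread touches counts
                    int rc = counts.getOrDefault(op.target(), 0) + op.delta();
                    if (rc <= 0) {
                        counts.remove(op.target());
                        // the object is dead: free it here
                    } else {
                        counts.put(op.target(), rc);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // shut down cleanly
            }
        }
    }

Mutator threads just call queue.add(new Op(obj, +1)) and never block on a lock, at the cost of making deallocation asynchronous.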

To handle cycles, you could use a hybrid approach, falling back to mark-sweep when memory runs out in order to collect cycles. But there are other approaches. In an explicit refcounting system (where increments and decrements are manual), the user could be expected to insert a "weak reference", one which doesn't increase the refcount, whenever completing a cycle. Another way is to perform a small local marking trace when a refcount is decremented but doesn't reach zero, to make sure there isn't an unreferenced cycle. That's described in this recent paper, which also handles concurrency. Here's an optimization on that with a proof of correctness.
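That local-trace idea is often called trial deletion. A sketch in the style of Bacon and Rajan's synchronous cycle collector, single-threaded and without the root-buffering details:

    // Hypothetical trial deletion: tentatively subtract the refcounts due to
    // internal pointers; anything left at zero is an unreachable cycle.
    import java.util.ArrayList;
    import java.util.List;

    class CycleNode {
        int rc;                       // reference count
        char color = 'B';             // B=live, G=tentatively dead, W=garbage
        final List<CycleNode> children = new ArrayList<>();

        // Called when this node's refcount was decremented but didn't hit zero.
        static void tryCollectCycle(CycleNode root) {
            markGray(root);      // subtract counts due to internal pointers
            scan(root);          // restore anything still externally referenced
            collectWhite(root);  // whatever is left is an unreachable cycle
        }

        static void markGray(CycleNode s) {
            if (s.color == 'G') return;
            s.color = 'G';
            for (CycleNode t : s.children) { t.rc--; markGray(t); }
        }

        static void scan(CycleNode s) {
            if (s.color != 'G') return;
            if (s.rc > 0) { scanBlack(s); return; }  // external reference found
            s.color = 'W';
            for (CycleNode t : s.children) scan(t);
        }

        static void scanBlack(CycleNode s) {
            s.color = 'B';
            for (CycleNode t : s.children) {
                t.rc++;                              // undo the trial decrement
                if (t.color != 'B') scanBlack(t);
            }
        }

        static void collectWhite(CycleNode s) {
            if (s.color != 'W') return;
            s.color = 'B';                           // avoid revisiting
            for (CycleNode t : s.children) collectWhite(t);
            // free(s) would happen here
        }
    }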

Hard real-time GC

So far, I've been talking about minimizing pauses in a vague, general sense. We just want them to be a fraction of how long it takes to do a full tracing collection. But this isn't enough for some applications. Say you're making a video game where a 50ms GC pause (as the best incremental mark-sweep collectors benchmark at, I've heard) means a skipped frame or two. That can be noticeable, and it presents a real disadvantage compared to explicit allocation. Even refcounting doesn't always give really short pauses, since it causes memory fragmentation (making allocation take longer) and deallocation is not scheduled incrementally. That is, if you have a long linked list with just one reference to the head, and that reference ends, then the entire linked list gets deallocated in a chain, with no incremental scheduling.

What this situation needs is a garbage collector that follows hard real-time guarantees. One way that this guarantee could be phrased is that pauses are at most 1ms, and that at least 7 out of every 10 milliseconds are spent running the main program. This guarantee will be met even if the main program is acting "adversarially", bringing out the worst-case behavior in the collector. It's possible to specify a requirement like this that's unachievable for a particular application, but this requirement works for most things. Different applications can specify different requirements based on, say, their frame rate and how long it takes to render a frame. For applications like this, absolute throughput (which is basically maximized by a well-tuned generational collector) can be sacrificed in favor of better scheduling.
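To make the arithmetic concrete, here's a toy checker for a guarantee like that, run over a log of GC pauses. Everything here is invented for illustration, and a real check would slide the window continuously rather than in coarse 1ms steps:

    // Hypothetical checker: no pause over 1 ms, and in every 10 ms window
    // at least 7 ms goes to the main program (so at most 3 ms to GC).
    class RealTimeCheck {
        // pauses[i] = {startMillis, durationMillis} of the i-th GC pause
        static boolean meetsGuarantee(double[][] pauses, double endTime) {
            for (double[] p : pauses)
                if (p[1] > 1.0) return false;  // a single pause is too long
            for (double start = 0; start + 10.0 <= endTime; start += 1.0) {
                double gcTime = 0;
                for (double[] p : pauses) {
                    double lo = Math.max(start, p[0]);
                    double hi = Math.min(start + 10.0, p[0] + p[1]);
                    if (hi > lo) gcTime += hi - lo;  // overlap with this window
                }
                if (gcTime > 3.0) return false;  // under 7 ms of mutator time
            }
            return true;
        }
    }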

This sounds like an impossible dream, but it's actually been implemented in the Metronome system, built for Jikes by IBM. Metronome has been written about in the ACM Queue, and there's also a paper which is harder to understand but explains more. The goal of the Metronome project is to allow high-level languages to be used for real-time applications on uniprocessor machines. While Java isn't what I'd choose, the GC seems to be the biggest barrier, and it's great that this research is being done.

The idea is to have an incremental mark-sweep collector (not generational) which segregates the heap into chunks (just for allocation purposes), each holding data of roughly the same size. This minimizes fragmentation. However, fragmentation can still occur, and when a heap segment is too fragmented, it is incrementally copied and compacted to a different piece of memory. Large objects are split up into chunks called arraylets. With all of these techniques, garbage collection can be broken up into small tasks, and an innovative scheduler makes it satisfy the hard real-time guarantees.
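Arraylets make large-array access slightly indirect but keep every individual allocation small. A sketch of the indexing (the chunk size and names are invented):

    // Hypothetical arraylet-style array: a spine of fixed-size chunks, so no
    // single contiguous allocation ever exceeds ARRAYLET_SIZE slots.
    class ArrayletArray {
        static final int ARRAYLET_SIZE = 1024;  // invented chunk size
        private final Object[][] spine;

        ArrayletArray(int length) {
            int chunks = (length + ARRAYLET_SIZE - 1) / ARRAYLET_SIZE;
            spine = new Object[chunks][];
            for (int i = 0; i < chunks; i++)
                spine[i] = new Object[ARRAYLET_SIZE];
        }

        Object get(int i) {
            return spine[i / ARRAYLET_SIZE][i % ARRAYLET_SIZE];
        }

        void set(int i, Object value) {
            spine[i / ARRAYLET_SIZE][i % ARRAYLET_SIZE] = value;
        }
    }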

Because the collector isn't generational, and because of the overhead of the scheduler and the floating garbage that's left by the incremental collector, this is far from optimal for applications that don't really need things to be so predictably well-spaced. But maybe, if there were more knobs on this algorithm (e.g., the scheduler could be turned off, and more generations could be added), this could be a general-purpose GC that's really useful.

GC and language features

In the most basic sense, a garbage collection system consists of one exposed function, allocate, which takes a number of bytes and allocates a region of memory that's that big. But there are some other things that can be useful. For example, for tracing collectors, a collect-garbage function can be used to do a major collection when the program knows it's idle.
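As an interface, that could be as small as this (the names are invented for illustration):

    // Hypothetical minimal GC interface, as described above.
    interface GarbageCollectedHeap {
        // Returns a handle to a freshly allocated region of `bytes` bytes.
        long allocate(int bytes);

        // Hint from the program: now is a good time for a major collection.
        void collectGarbage();
    }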

Another useful feature is finalizers. For most things, it's sufficient to just deallocate memory when it's collected. But think about files. You should always explicitly close a file when you're done with it, but if the programmer makes an error, the file should still be closed once it is unreachable. With a reference counting or mark-sweep collector, this is relatively easy: just have a generic function finalize that gets called on everything that's collected. With copying collection, the collector maintains a list of objects that have finalizers; on each copying cycle, this list is traversed, checking whether each object has a forwarding pointer in fromspace. If an object with a finalizer doesn't have a forwarding pointer, it has been deleted and the finalizer should be invoked. This avoids a full traversal of the heap.
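With a copying collector, that check might look like this sketch (the forwarding-pointer field and the table are invented for illustration):

    // Hypothetical post-copy finalization pass: any registered object that
    // wasn't forwarded out of fromspace is dead, so its finalizer runs.
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    class FinalizerTable {
        static class FObj {
            FObj forwardedTo;  // set by the copier if the object survived
            void finalizeObject() { /* e.g. close a file */ }
        }

        private final List<FObj> registered = new ArrayList<>();

        void register(FObj o) { registered.add(o); }

        // Called once per copying cycle, after all live objects are copied.
        void runFinalizers() {
            Iterator<FObj> it = registered.iterator();
            while (it.hasNext()) {
                FObj o = it.next();
                if (o.forwardedTo == null) {
                    o.finalizeObject();  // dead: no forwarding pointer
                    it.remove();
                }
                // else: it survived; a real table would swap in the new copy
            }
        }
    }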

Actually invoking the finalizers isn't so simple, because by then an object might contain some dead pointers. With a reference counting collector, if you're not collecting a cycle, you can call the finalizers in top-down order (also called topological order), and then the dead pointer issue doesn't exist. But this breaks down in the presence of cycles, and the order is difficult to calculate with a tracing collector. An easier-to-implement strategy is to call the finalizers in arbitrary order, but call them all before the garbage is actually collected. Alternatively, everything the finalizer references can be considered a root. But in this situation, programmers have to be very careful not to retain the objects forever.

This summer, I hope to join this research community in a small way by participating in a Harvey Mudd REU (a summer undergraduate research program) in garbage collection. In previous summers, the group developed an idea called blobs, a generalization of a concept called ephemerons, to make weak hashtables without memory leaks. They wrote and published a paper about it. They also researched and wrote about garbage collection techniques for certain hard-to-collect persistent data structures.

Another project of the group's leader is called a simplifier, which is something that gets invoked in a copying collector to simplify a data structure when it gets copied. This is a technique that is used in an ad-hoc way in graph-reduction runtimes for purely functional languages: you don't want to copy the whole graph if there's an easy way to simplify it without allocating any new nodes. It should be really fun to research these more advanced techniques for making garbage collection more correct and efficient.

Where to now for more research?

Academia has been working on this since the '60s. But recently, big companies like Sun, IBM and Microsoft have been doing more here in order to augment their Java and .NET platforms. Some academic resources to look at to learn more about GC are at UT Austin's website (especially bigsurv.ps). There are conferences which discuss memory management, like the International Symposium on Memory Management and SPACE.

When implementing a garbage collector for research purposes, you probably don't want to build a whole programming language runtime yourself. Jikes RVM provides an open-source research virtual machine into which you can easily plug different garbage collectors. Jikes RVM's MMTk (Memory Management Toolkit) makes this possible. There are visualization tools, and the heap can be split up into different segments which use different collectors.

These advanced garbage collection algorithms haven't been implemented many times and aren't well-understood by many people. There also hasn't been much work in formally analyzing and comparing these algorithms. This is partly because they're hard to analyze: constant factors have a gigantic effect. Someone came up with a unified theory of garbage collection, though, which analyzes all garbage collection strategies as some combination of marking and reference counting, which can be seen as duals. Just like with differential equations, there's no general solution which meets all of our requirements (maximizing throughput and locality, minimizing and organizing pause times, making allocation cheap) at once, though our understanding is always improving.



* You may be wondering why I keep saying that these posts are short when they're something like four pages long printed out. Mainly, it's because the things that I'm reading are much longer. It'd be kinda hard for me to describe anything meaningful in 500 words or less.

7 comments:

Berlin Brown Discussions said...

The real question: how do you find time to hack on Factor and write these blog posts?

I am impressed with the factor team.

Anonymous said...

How about regions?

Region inference is a memory management method for computer programming. It is an alternative to manual memory management and garbage collection.

Region inference involves associating variables or objects with a "region" in a stack-like construct. Older regions are lower on the stack, and younger regions are higher. Regions are analyzed statically to determine when they "die" - and all the variables or objects associated with that region die with the region. Whenever an older region is determined to be "dead", all younger regions die with it.

See:

http://citeseer.ist.psu.edu/tofte97regionbased.html

This paper describes a memory management discipline for programs that perform dynamic memory allocation and de-allocation. At runtime, all values are put into regions. The store consists of a stack of regions. All points of region allocation and deallocation are inferred automatically, using a type and effect based program analysis. The scheme does not assume the presence of a garbage collector.

Also:

http://berkeley.intel-research.net/dgay/rc/

RC is a dialect of C that adds safe, region-based memory management to C:

Region-based memory management allocates objects in a program-specified region. Objects cannot be freed individually; instead regions are deleted with all their contained objects. Region-based memory management is a fairly common programming style, for instance it is used in the Apache web server (where the regions are called pools and also used to manage other resources).
RC is safe: RC maintains for each region r a reference count of the number of external pointers to objects in r, i.e., of pointers not stored within r. Deleting a region with a non-zero reference count causes a runtime error (abort). Thus RC programs cannot access freed memory. The overhead for safety in RC is low (less than 11% on a group of realistic benchmarks).

And:

http://citeseer.ist.psu.edu/pereira02dynamic.html

We present a garbage collection scheme based on reference counting and region inference which, unlike the standard reference counting algorithm, handles cycles correctly. In our algorithm, the fundamental operations of region inference are performed dynamically. No assistance is required from the programmer or the compiler, making our algorithm particularly well-suited for use in dynamically-typed languages such as scripting languages

Unknown said...

figg, Thanks for the pointers. I don't know enough about region inference to write about it, but that stuff looks interesting. I assumed it was fundamentally flawed because memory usage patterns can't, in general, be completely known before runtime, but I guess I'm wrong.

Sandro Magi said...

I posted the reference to the Unified Theory of GC on your first blog post.

You don't do ref-counting justice in your post. In my original post, I had already provided a reference to a state of the art concurrent, ref-counting garbage collector, which also provides a proof of correctness. I'll provide it again: Efficient On-the-Fly Cycle Collection.

As far as I know, there are proofs of correctness only for concurrent ref-counting algorithms, which is a pretty damning criticism of other concurrent GCs given how difficult concurrent algorithms are.

figg, nice reference to RC. First I'd heard of it.

As for region-based memory management: it's unclear whether there is an optimal, general region-inference algorithm. The paper introducing Makholm-Niss region inference (Makholm's thesis too) provides some evidence that region-inference will always need a little guidance to infer optimal object lifetimes, and this shouldn't be too surprising as all typing schemes require a little information in order to derive the rest. What form this information takes is an open question however.

Regions can be manually managed however. See the Cyclone language for instance. Fluett showed that linear typing is all you really need to enable safe region-based manual memory management.

Unknown said...

Sandro, Thanks. I should have credited you with the link you gave me. It's true that I didn't do refcounting justice, but I didn't really do any other technique justice either, since it was all a short summary. To everything that's relevant justice would mean writing a book. I hope no one misinterprets this as an attempt to cover everything.

I forgot about the cycle collection paper you listed, so thanks for providing it again. Many many cycle-collecting refcounting GCs have been made, and I'm nowhere near up-to-date on the research on them (or any other aspect of this field).

Sandro Magi said...

Wasn't looking for credit, just wanted to identify myself so you know who you were talking to. :-)

Re: justice, it just seemed like you covered ref-counting as "it's hard to get right in a concurrent setting", which doesn't reflect the state of the art since the turn of the century. Ref-counting seems to be one of the only techniques we know how to get right! Mark-sweep is another that also uses the efficient "sliding views" technique used in Bacon's ref-counting collector.

As for cycle-collection algorithms, Bacon's is the most efficient that I know of (the most recent I've read at least), and his on-the-fly age-oriented garbage collection is pretty compelling compared to other concurrent GCs. Bacon has pointed out many problems with previous supposedly "concurrent" GC algorithms, and his algos suffer no such problems. Highly recommended if you have the time.

Anonymous said...

Sun's Garbage First paper is available here.

http://research.sun.com/jtech/pubs/04-g1-paper-ismm.pdf