Sunday, August 3, 2008

New mark-compact algorithms

It'd be good for a garbage collector to have low space overhead, good cache behavior and low fragmentation. Low space overhead is useful wherever there's a limited amount of RAM. Eliminating fragmentation is important, since fragmentation reduces the usable amount of memory and can make allocation slower. Improving cache behavior of ordinary programs is always good for performance, and surprisingly a garbage collector can help. These three goals point to mark-compact garbage collection as a good way to go.

The idea

First, mark all of the reachable data by traversing the heap. Then, slide the live (reachable) data as far over to one side as possible. This will (hopefully) free up some space on the other side, so we have a big contiguous free area. With this free space, we can allocate in by just keeping a pointer that shows where free space begins, and incrementing it when allocation happens. Pointer-bumping allocation eliminates fragmentation and creates good cache behavior, since all new data is allocated close together. In a practical implementation, you'd want to use a generational system, where the oldest generation is maintained by mark-compact.

Many recently developed GC algorithms fit into this class in a more general way. The heap is marked, and then through some means, the live data is clumped together. There are many different ways to do this, but to me they all feel fundamentally similar. The new algorithms have a greater space overhead, getting a speed improvement or incrementality (short pause times) as a tradeoff. The resulting overhead is much less than semispace (copying) collection, which takes more than two times as much space as the program uses.

Basic implementation

The marking part of this isn't too hard. You can perform depth-first search on the heap using an explicit stack. For locality reasons, a bit vector can be maintained off to the side to the side to hold the information about what's been marked. If the mark stack overflows (a rare condition), additional pushes to the stack can be ignored and the heap can be rescanned afterwards for marked objects with unmarked children. Marking can be done in parallel with a load balancing technique known as work-stealing queues, and it can be done incrementally or concurrently using a certain write barrier.

The hard part is the compacting. Traditional mark-compact algorithms, like the ones discussed in the Jones book on GC try to use little or no additional space for the process. The easiest way to do it is to have an extra word per object which is used to store forwarding addresses. After marking, there are three passes over the heap. In the first pass, the live data is iterated over in address order, and new addresses for all objects are calculated and stored in the forwarding addresses. On the second pass of the heap, every pointer is replaced with the new forwarding address. On the third pass, objects are moved to this new forwarding address.

Other methods (table-based methods and threading methods) eliminate the word per object overhead and reduce the number of passes to two, but they remain relatively slow. It's not obvious how to make them incremental, concurrent or parallel in the compacting stage, though incremental, concurrent and parallel marking is well-studied. The difficulty comes from the fact that, in all of these methods, pointers temporarily point to locations that aren't filled with the right thing. Unlike with concurrent/incremental copying, it's difficult to design even a read barrier for this. Additionally, sliding compaction is difficult to parallelize because of the way pointers are calculated from an accumulator, and the sliding itself must be done in address order, otherwise uncopied things could be overwritten.

New ideas

There are a number of ways to go about changing mark-compact. Here are some of them that I've read papers about:
  • Reserve some additional space off to the side to calculate forwarding addresses from the mark bitmap, so only one pass over the heap is required This is the idea developed in The Compressor [PDF], by Haim Kermany. Rather than compact the heap onto itself, it is copied in page-size chunks. Since the to-pointers are calculated off to the side beforehand, pointers can be updated at the same time as copying happens, and only one pass over the heap is required. Once a page is copied, the old page can be reused, making space overhead bounded. It can even run in parallel, copying to multiple pages at once without needing any synchronization beyond handing out the pages to copy. The paper also provides a concurrent version of this.

  • Divide the heap up into chunks, copying over by chunks, keeping track of some of the pointers between chunks. This is the approach taken by the mark-copy [PS] algorithm by Narendran Sachindran, further developed as MC2 [PDF]. In order to allow the copying to go on independently, there are remembered sets that each chunk has of pointers going into the chunk. Then, copying can proceed without looking at anything else, with the remsets going and updating just those parts afterward. The remsets are built during marking. To reduce the size of these remsets, the chunks are ordered, and pointers only need to be maintained going in a certain direction. This algorithm can be made incremental because the mutator can run in between copying windows, as long as a write barrier maintains the remsets. The MC2 paper describes a specific implementation where space overhead is bounded to 12.5%.

  • Switch from pointers to something else temporarily during compaction to let things go incrementally The grandiosely titled paper Mark-Sweep or Copying? A "Best of Both Worlds" Algorithm and a Hardware-Supported Real-Time Implementation [ACM] does this to implement a very concurrent mark-compact algorithm on specialized hardware using a read barrier. On one pass, all live objects are given "handles" in a separate part of the heap. These handles point back to the original object, and the original object points to the handle. Next, all pointers are replaced with handles. Objects are slid as far as possible towards the beginning of the heap, and their handles are updated appropriately. Finally, all objects have their pointers in the form of handles replaced with what those handles point to. This is given a concurrent implementation on specialized "object-oriented" hardware using multiple read barriers at different points in execution. During compaction, the read barrier makes sure that the mutator sees only handles, and afterwards, the read barrier makes sure that the mutator sees only pointers.

  • Use mark-sweep and a free list-based allocator, but occasionally compact areas that are extremely fragmented. This is perhaps the most widely studied idea, developed in a number of places. I can't find a good reference, but it's pretty clear that you can use mark-sweep over several GC cycles until the heap gets a certain amount fragmented, and then compact the entire heap. This can be done by compacting every n cycles, or by compacting when the free list gets to a certain length. But it seems like it'd be better to compact just the areas that are particularly fragmented. This technique is used in the hard real-time Metronome garbage collector as well as the mark-region Immix [PDF] collector. The compaction in both of these is actually done by evacuation, that is, copying to a different place. This involves some space overhead. Any old way of copying will do, including any incremental, concurrent or parallel method. A few [both ACM, sorry] concurrent copying schemes have been developed especially for this purpose. All of these deserve a full explanation, because they're really interesting, but I'll leave that for another blog post. One possibility related to this would be reference counting (with something to collect cycles) with occasional compaction, but I've never seen anything written about this. In these systems, the allocator has to be more complicated than pointer bumping, but it can still be very efficient.

Some of these ideas also use sliding compaction order, but this isn't really a necessary trait of mark-compact GC. Earlier algorithms (the two finger algorithm, linearizing compaction) used different orders. Nevertheless, the creators of these algorithms often don't refer to their own work as "mark-compact" when it doesn't use this order. I think it makes sense to put them in one category, since they all share the quality that the heap is marked and then it is compacted, at least partly.

When I tell people who know about CS that I've been trying to research garbage collection this summer, the reaction is often something like, "Hasn't that been pretty much solved?" There may be solutions, but interesting improvements in terms of time and space overhead are being developed all the time. All of the papers I've referenced are from the past decade.

[A historical note: Imagine if everywhere I said "compacting" I instead said "compactifying." People actually did this in the 70s and 80s in the GC literature. I'm very glad they came to their senses. An honesty note: I don't actually understand the concurrent algorithms very well.]

6 comments:

Rich Dougherty said...

Interesting read, thanks. I can imagine some of these algorithms are quite tricky to actually implement and debug!

Matt said...

A fully parallel compactor

Sandro Magi said...

A fully generalized "mark-sweep with compaction" GC was described in Using Page Residency to Balance Tradeoffs in Tracing Garbage Collection. This GC combines copying and mark-sweep algorithms via two tunable parameters, allocation and evacuation thresholds, which determine when objects should be copied.

Unknown said...

Thanks for the interesting paper links, guys! If you have any other cool papers about GC that you're just holding back until there's some relevant place to link to them, feel free to do that now.

Anonymous said...

Hey Dan, its me again, great overview of some of the more relevant mark-compact algs.

In regards to the Compressor paper, I just skimmed it (very interesting, must digest later)...unless I'm mistaken, it actually has to do a pass of the heap once to mark, and then once to copy, no? At first I thought it would mark-*AND* copy all in one pass. It also isn't as space efficient as some of the other algorithms it seems.

"Additionally, sliding compaction is difficult to parallelize because of the way pointers are calculated from an accumulator, and the sliding itself must be done in address order, otherwise uncopied things could be overwritten."
I was just thinking about that this past weekend. The compressor does solve this problem, but at the cost of space.

The interesting thing about GC algorithms is wasted space can negatively impact performance due to smaller 'generations'. The smaller the generation, the less time it takes to fill up, the less time objects have to 'die', and you end up copying things you don't need to. It would be interesting to see if the ability to parallelize copying (like the compressor) is worth the performance hit you'd take from this. What are your thoughts?

-V

Unknown said...

V, I completely agree with your analysis. More space overhead causes more frequent collections on the same amount of memory, that's a fact, and the Compressor has higher space overhead (as well as two passes in total; maybe I should have worded that more clearly). In personal communication with one of the authors of the compressor paper, I learned that they used a granularity of 512 bytes for their offset and first object tables.

But from what I've seen, the tradeoff of space overhead can be worth it. In many papers, there's a graph that shows speed vs memory given, and it tends to show that, after a certain amount of memory, performance doesn't really improve much more. So it's possible that this point is around 1.4 times the maximum size of the live heap for mark-sweep, and 1.8 times the size of the live heap for the compressor. It's really a lot of tradeoffs.