Tuesday, June 10, 2008

An idea for garbage collection research

This summer, I'm participating in an REU (Research Experience for Undergraduates) at Harvey Mudd College in southern California. For the past week or so, we've been trying to come up with ideas for what to research. I think I've stumbled on a reasonable one, though it might take a long time to implement. Hopefully, I'll spend the summer implementing it with a co-conspirator, Tony Leguia.

The idea is to combine MC2 and Sapphire. MC2 a copying algorithm which can be incremental and which has much lower memory overhead than the normal semispace algorithm, and Sapphire is an algorithm for concurrent copying GC. Together, they could form an efficient algorithm for concurrent compacting garbage collection with minimal space overhead. This has been done before, but also this idea allows pointer-bumping allocation and no reliance on hardware for things like memory page protection. I haven't found a paper which does all of these things at once.

Motivation


For me, but of course not for anyone else except maybe Tony, the motivation is to see how far this mark-copy algorithm can go. As far as I know (and as far as the co-author of the paper that I contacted knows), there have only been two papers published on it, the original one and MC2, none of which mention multiple processors.

One idea is that this would be useful for large servers with many cores that are under memory pressure. Existing algorithms which depend on page faults can be slow when those faults occur. Normal semispace collectors do poorly under memory pressure, usually. With a concurrent collector like this, lower pause times could be achieved, giving faster response times to clients. But, for servers, other techniques might actually be more relevant, like making separately collected nurseries for each thread.

In theory, a concurrent collector would increase throughput, too, especially for programs that have fewer threads than the machine they're running on has cores. But this is a somewhat different use case than the server. Picture a single-threaded media player running on a two-core desktop machine. Really, on a desktop, all programs should be as small a footprint as possible and have as short pause times as possible, and this idea can hopefully provide that. The lack of dependence on page protections should also help performance as compared to other algorithms.

The third use case is an embedded system with multiple processors, which is apparently increasingly common. A concurrent version of MC2 would be useful for the same reasons as MC2 is on uniprocessor embedded systems.

I'm not positive that this is the best way to go about it. Maybe I really want a parallel collector, along the lines of GHC's recent parallel copying collector. As that paper discusses, it's pretty tricky to actually get a performance improvement, but it ends up being worth the effort throughput-wise. Maybe I should be making mark-copy parallel. This is a different research project, but it might be one that we end up looking into more. Maybe we could even make a parallel concurrent collector this way! But definitely not over the next 9 weeks, which is how much time I have left.

Details


I wrote about MC2 earlier, but not about Sapphire. Sapphire is a mostly concurrent copying collector that works without a read barrier of any kind. I say mostly concurrent because some synchronization needs to be done surrounding the program stack, but not very much. Previous concurrent copying collectors worked with a read barrier that preserved the invariant that the mutator can only see things that have been copied into tospace. Sapphire, on the other hand, uses a write barrier to preserve the invariant that, roughly, during copying, fromspace and tospace are consistent.

The collector consists of four phases: (1) mark (2) allocate (3) copy (4) flip. In the mark phase, a concurrent snapshot-at-the-beginning mark is done. In the allocate phase, space in tospace is set up for each object and non-clobbering pointers are made to tospace from everything in fromspace. They can be non-clobbering because they take up some space in the header, and the old header can be put in tospace. Next, copy fromspace to tospace. The allocate step laid things out in breadth-first order, so Cheney's algorithm works here. In the flip step, the stack and other things outside of the space that's being copied has its pointers changed from pointing to fromspace to pointing to tospace. Throughout this, the write barrier assures that modifications to fromspace are propagated to tospace.

As far as I can tell, it takes nothing special idea-wise to combine these two things. The real work would be in the implementation details, and the benchmarks proving that this is possible and advantageous. One simplification we could make (though I'm not sure if it's a good idea in terms of locality) is that forwarding pointers and "reverse forwarding pointers" could be held each in window-sized blocks. So, overall, the collector would consist of
  1. A mark phase, run concurrently with the mutator, either the incremental update or snapshot variants. This would collect the additional data which
  2. A grouping phase as in MC2
  3. For each group:
    1. Allocate tospace pointers in the empty window
    2. Copy fromspace to tospace
    3. Flip all of the pointers recorded in the remembered set for fromspace

The Sapphire paper discusses condensing mark, allocate and copy phases into a single replicate phase. Within this framework, we could only combine the allocate and copy phases.

Related work


The number of other garbage collection systems trying to achieve the same goals is so dense and diverse that I feel hesitant to join in. But here are a bunch of related things, aside from what's already been mentioned.
  • One of the first incremental (and extendible to concurrent) copying collectors is that of Henry Baker. But this uses a read barrier, a little bit of code inserted before every pointer read, which ensures that it's not looking at fromspace.
  • An Appel-Ellis-Li-style copying collector (yes, they actually call it that) uses memory page permissions to make, in effect, a free-unless-it-gets-triggered read barrier maintaining the invariant that everything the mutator sees is in to-space. If it gets triggered, it's expensive, but the hope is that it won't get triggered very often because the concurrent copying will go fast enough.
  • The Compressor is a garbage collector which, I believe, maintains invariants similar to the Appel-Ellis-Li collector but uses this for a unique type of mark-compact collector which operates in just one pass. The algorithm is both parallel and concurrent.
  • There's been a lot of work in concurrent mark-sweep, but this leads to fragmentation over time. So a system for occasional mostly concurrent compaction has been developed.

In all but the last of these systems, a read may be expensive, due either to an explicit read barrier or the possibility of a page protection fault. The last one makes allocation (into the generation that is mark-swept) expensive due to a free list. Nevertheless, it might be reasonable to look into using an Appel-Ellis-Li-style barrier in place of Sapphire's write barrier.

Our plan


The plan is to implement a basic mark-copy collector, maybe on the Factor runtime, and then make it concurrent (or maybe parallel) somehow. At each step, we'll try to do the most basic thing possible. If we could get a concurrent or parallel version of mark-copy done by the end of this summer, I'll be happy. This'll be done while writing the paper (which this could be considered the first draft for). Optimizations along the line of MC2 and the finishing touches on the paper can be done after the summer, as long as we get most things done over the next 9 weeks.

It's an exciting time to be involved in academic computer science because so many basic results have been elaborated by now. The only problem is the minefield of patents (especially dense in the practical field of garbage collection) and the fact that everyone else has thought of your idea before you.

6 comments:

Anonymous said...

Daniel,

It would appear that you're focused on GC for shared memory environments. Do you think shared mem environments will maintain the same relevance for future software development?

James

Daniel Ehrenberg said...

I'm sure they'll maintain relevant for a good while. I'm not sure exactly what you mean by "shared" but if you mean uniform access, I don't think I'm smart enough right now to make an efficient garbage collector for NUMA architectures which takes into account the non-uniformity of memory access times. I'd imagine it'd take a shape similar to distributed memory designs. This is my first project. Anyway, I don't have a NUMA machine to play around with; people don't really use them yet.

Anonymous said...

I'm only using "shared" memory in the context of shared versus lock-free thread-specific memory as in Erlang. Also in Dos days we had a couple of simple but very useful tools for memory isolation - mark/release in Turbo Pascal, and multiple subheap capability in TS Modula-2.

James

Daniel Ehrenberg said...

James,

Oops, I suppose that when you said "shared memory" you meant "shared memory". My bad. Anyway, adapting GC algorithms to non-shared memory is orthogonal to the problem I'm trying to solve, which is to make a fast concurrent compacting GC with minimal space overhead. It could be used in an environment with non-shared memory. But maintaining lots of little separate heaps isn't always a good idea, as it can cause external fragmentation if done badly.

Anonymous said...

First of all, I really appreciate you writing this blog!

Now, I have a few questions:

When you said
"One idea is that this would be useful for large servers with many cores that are under memory pressure. Existing algorithms which depend on page faults can be slow when those faults occur."

Were those two seperate thoughts, or are you implying that GC's that use page faults perform slower on large heap/multicore boxeS?

My second question is regarding the choice of read or write barriers vs. page faults for concurrent copying.

Though a page fault is much less performant then a barrier when it occurs, how does it compare to the overhead of having barriers on *all* pointer/reference/object read/writes (Which I believe is neccessary for concurrent collection, no?)

I would love to hear your thoughts on this or if you have any suggestions on papers which adderss this question, I'd love to read those too.

Daniel Ehrenberg said...

Anonymous,

To your first question, I didn't mean to imply that. I should change the wording.

To your second question, it seems that, in practice, virtual read barriers through temporary memory protection is much, much, much more efficient than actual read barriers. This is only true if traps are sprung rarely, which should be true in well-designed collectors. Write barriers for things like keeping track of old to young pointers, or for maintaining the tricolor invariant during incremental/concurrent marking, don't seem so bad. Sapphire hasn't been benchmarked enough, though, to know if it's actually faster than Appel's concurrent copying collector using memory protection. (There is no publicly available implementation, unfortunately.)