The idea is to combine MC2 and Sapphire. MC2 is a copying algorithm which can be made incremental and which has much lower memory overhead than the normal semispace algorithm, and Sapphire is an algorithm for concurrent copying GC. Together, they could form an efficient algorithm for concurrent compacting garbage collection with minimal space overhead. Concurrent compacting collection has been done before, but this combination additionally allows pointer-bumping allocation and doesn't rely on hardware features like memory page protection. I haven't found a paper which does all of these things at once.
For me, but of course not for anyone else except maybe Tony, the motivation is to see how far this mark-copy algorithm can go. As far as I know (and as far as the co-author of the paper that I contacted knows), only two papers have been published on it, the original one and MC2, and neither of them mentions multiple processors.
One idea is that this would be useful for large servers with many cores that are under memory pressure. Existing algorithms that depend on page faults can be slow when those faults occur, and normal semispace collectors usually do poorly under memory pressure. With a concurrent collector like this, lower pause times could be achieved, giving faster response times to clients. But, for servers, other techniques might actually be more relevant, like making separately collected nurseries for each thread.
In theory, a concurrent collector would increase throughput, too, especially for programs that have fewer threads than the machine they're running on has cores. But this is a somewhat different use case than the server. Picture a single-threaded media player running on a two-core desktop machine. Really, on a desktop, all programs should have as small a footprint and as short pause times as possible, and this idea can hopefully provide that. The lack of dependence on page protection should also help performance compared to other algorithms.
The third use case is an embedded system with multiple processors, which is apparently increasingly common. A concurrent version of MC2 would be useful for the same reasons as MC2 is on uniprocessor embedded systems.
I'm not positive that this is the best way to go about it. Maybe I really want a parallel collector, along the lines of GHC's recent parallel copying collector. As that paper discusses, it's pretty tricky to actually get a performance improvement, but it ends up being worth the effort throughput-wise. Maybe I should be making mark-copy parallel. This is a different research project, but it might be one that we end up looking into more. Maybe we could even make a parallel concurrent collector this way! But definitely not over the next 9 weeks, which is how much time I have left.
I wrote about MC2 earlier, but not about Sapphire. Sapphire is a mostly concurrent copying collector that works without a read barrier of any kind. I say mostly concurrent because some synchronization needs to be done surrounding the program stack, but not very much. Previous concurrent copying collectors worked with a read barrier that preserved the invariant that the mutator can only see things that have been copied into tospace. Sapphire, on the other hand, uses a write barrier to preserve the invariant that, roughly, during copying, fromspace and tospace are consistent.
The collector consists of four phases: (1) mark, (2) allocate, (3) copy, and (4) flip. In the mark phase, a concurrent snapshot-at-the-beginning mark is done. In the allocate phase, space in tospace is set up for each object, and non-clobbering pointers to tospace are installed in everything in fromspace. They can be non-clobbering because they take up some space in the header, and the old header can be put in tospace. In the copy phase, fromspace is copied to tospace; since the allocate step laid things out in breadth-first order, Cheney's algorithm works here. In the flip phase, the stack and other things outside the space being copied have their pointers changed from pointing to fromspace to pointing to tospace. Throughout this, the write barrier ensures that modifications to fromspace are propagated to tospace.
As far as I can tell, it takes nothing special idea-wise to combine these two things. The real work would be in the implementation details, and in the benchmarks proving that this is possible and advantageous. One simplification we could make (though I'm not sure if it's a good idea in terms of locality) is that forwarding pointers and "reverse forwarding pointers" could each be held in window-sized blocks. So, overall, the collector would consist of
- A mark phase, run concurrently with the mutator, in either the incremental-update or snapshot-at-the-beginning variant. This would also collect the additional per-window data that the grouping phase needs, like live-data counts and remembered sets.
- A grouping phase as in MC2
- For each group:
  - Flip all of the pointers recorded in the remembered set for fromspace
  - Allocate tospace pointers in the empty window
  - Copy fromspace to tospace
The Sapphire paper discusses condensing mark, allocate and copy phases into a single replicate phase. Within this framework, we could only combine the allocate and copy phases.
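As a stop-the-world toy model of the per-group steps above (no concurrency, a made-up object representation, and all names invented), the phases might look like the following. Note that in this sketch the remembered-set flip happens after allocation, so that forwarding pointers exist to flip to; a real concurrent version would interleave these steps quite differently.

```python
# Toy model of the proposed phase structure: mark, then per group
# allocate / copy / flip. Concurrency is omitted entirely; this only
# shows the order of operations over a hypothetical object model.

class Obj:
    def __init__(self, name, refs=None):
        self.name = name
        self.refs = refs or []   # outgoing pointers
        self.marked = False
        self.forward = None      # tospace replica, once allocated

def mark(roots):
    """Phase 1: trace from the roots, returning the live objects."""
    stack, live = list(roots), []
    while stack:
        o = stack.pop()
        if not o.marked:
            o.marked = True
            live.append(o)
            stack.extend(o.refs)
    return live

def copy_group(group, remset):
    """Per-group phases: allocate tospace slots, copy, then flip the
    external pointers recorded in the remembered set.

    remset is a list of (holder, index) pairs naming fields outside
    the group that point into it."""
    # Allocate: reserve a tospace replica for each live object.
    for o in group:
        o.forward = Obj(o.name)
    # Copy: fill in the replicas, forwarding intra-group pointers.
    for o in group:
        o.forward.refs = [r.forward if r.forward else r for r in o.refs]
    # Flip: redirect remembered-set pointers into tospace.
    for holder, i in remset:
        if holder.refs[i].forward is not None:
            holder.refs[i] = holder.refs[i].forward
```

The MC2-style trick is that each group is sized to fit one empty window, so only a window's worth of extra space is ever needed during copying.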
The field of garbage collection systems trying to achieve the same goals is so dense and diverse that I feel hesitant to join in. But here are a bunch of related things, aside from what's already been mentioned.
- One of the first incremental (and extensible to concurrent) copying collectors is that of Henry Baker. But it uses a read barrier, a little bit of code inserted before every pointer read, which ensures that the mutator never looks at fromspace.
- An Appel-Ellis-Li-style copying collector (yes, they actually call it that) uses memory page permissions to make, in effect, a free-unless-it-gets-triggered read barrier maintaining the invariant that everything the mutator sees is in to-space. If it gets triggered, it's expensive, but the hope is that it won't get triggered very often because the concurrent copying will go fast enough.
- The Compressor is a garbage collector which, I believe, maintains invariants similar to the Appel-Ellis-Li collector but uses this for a unique type of mark-compact collector which operates in just one pass. The algorithm is both parallel and concurrent.
- There's been a lot of work in concurrent mark-sweep, but this leads to fragmentation over time. So a system for occasional mostly concurrent compaction has been developed.
In all but the last of these systems, a read may be expensive, due either to an explicit read barrier or the possibility of a page protection fault. The last one makes allocation (into the generation that is mark-swept) expensive due to a free list. Nevertheless, it might be reasonable to look into using an Appel-Ellis-Li-style barrier in place of Sapphire's write barrier.
The plan is to implement a basic mark-copy collector, maybe on the Factor runtime, and then make it concurrent (or maybe parallel) somehow. At each step, we'll try to do the most basic thing possible. If we could get a concurrent or parallel version of mark-copy done by the end of this summer, I'll be happy. This'll be done while writing the paper (which this could be considered the first draft for). Optimizations along the line of MC2 and the finishing touches on the paper can be done after the summer, as long as we get most things done over the next 9 weeks.
It's an exciting time to be involved in academic computer science because so many basic results have been elaborated by now. The only problem is the minefield of patents (especially dense in the practical field of garbage collection) and the fact that everyone else has thought of your idea before you.