Friday, October 31, 2008

Garbage collection camp post-mortem

This summer, I went to Harvey Mudd's REU (an NSF-funded research program) in computer science to study garbage collection. My advisor was Melissa O'Neill. I didn't really accomplish anything, but I guess I learned a lot.

My focus was on the memory behavior of garbage collection algorithms. Simple analyses of GC algorithms, like those of most algorithms, assume a random access machine (RAM) model, in which every memory access has constant cost. In truth, different orders of access have different costs in a more or less predictable way, based on how memory is actually implemented.

Locality

I won't go into the details of the memory hierarchy, the reason for this difference, in this post. A good but very long reference is What Every Programmer Should Know About Memory [PDF] by Ulrich Drepper; it goes well beyond what every programmer actually needs to know. Basically, there are two properties a program should have to use the memory hierarchy efficiently: spatial locality and temporal locality.

If you've just accessed a particular location in memory, it is fast to access it again if you do so soon enough. A program whose access pattern reuses the same locations close together in time is said to have good temporal locality. Most of the time, this isn't something that can easily be changed in a program.

A more flexible property is spatial locality. If you've just accessed a particular piece of memory, it is also fast to access nearby locations in memory. Programs that take advantage of this have good spatial locality. There are many ways to modify programs to exploit this property, so this is usually what people (myself included) mean when they talk about improving a program's locality.
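To make spatial locality concrete, here's a tiny toy example in Java (Java only because that's what Jikes RVM is written in; the class name and sizes are made up, and this is a sketch rather than a proper benchmark, since there's no warmup and the JIT will interfere). Both loops touch exactly the same array elements, but the row-by-row version walks memory mostly sequentially, while the column-by-column version jumps between rows on every access, so it usually runs noticeably slower on a large array.

// LocalityDemo.java -- a toy illustration of spatial locality, not a real benchmark.
public class LocalityDemo {
    static final int N = 4096;

    // Good spatial locality: the inner loop walks each row of the array sequentially,
    // so consecutive accesses are adjacent in memory.
    static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    // Poor spatial locality: consecutive accesses land in different rows, which are
    // separate arrays that may be far apart in memory.
    static long sumColumnMajor(int[][] a) {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    public static void main(String[] args) {
        int[][] a = new int[N][N];
        long t0 = System.nanoTime();
        long s1 = sumRowMajor(a);
        long t1 = System.nanoTime();
        long s2 = sumColumnMajor(a);
        long t2 = System.nanoTime();
        System.out.println("row-major:    " + (t1 - t0) / 1e6 + " ms (sum " + s1 + ")");
        System.out.println("column-major: " + (t2 - t1) / 1e6 + " ms (sum " + s2 + ")");
    }
}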

Memory is implemented this way because, naturally, most programs already have pretty good spatial and temporal locality. Only a small percentage of memory reads and writes actually reach main memory; faster mechanisms usually suffice. But even though most reads and writes use those faster mechanisms, a good portion of a program's running time is still taken up by the slow ones, so improvements here can have a very significant effect on performance.

A note: sometimes, people refer to giving programs "good cache behavior." The CPU cache is the part of the memory hierarchy responsible for much of the benefit that spatial and temporal locality provide.

Copying orders

There are a couple of ways that the garbage collector can improve locality. One is that the traversal the GC itself performs can be optimized for good spatial locality. Another is that the garbage collector can reorganize the data during copying so that the mutator ends up with good spatial locality. Both of these have been researched in the past, to varying degrees.

My research was on the second strategy. There are a few simple copying orders that we already know about: breadth-first copying, depth-first copying, sliding mark-compact (which leaves things in the exact order of their creation time), and the control: not copying at all. My advisor for the summer noticed that, on some example graphs, it appeared to improve locality to put things in a sort of "topological sort" order (with cycles arbitrarily broken). There's also a technique called hierarchical decomposition, which attempts to decompose a tree into a number of subtrees with as few edges between them as possible. (That's what we want! But it might not be the best if we don't have trees.) For a little while, I was even trying to figure out the "optimal" compaction using a genetic algorithm, but it turned out to be extremely computationally expensive.
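Since I'm not drawing the graph, here's a rough sketch in Java of what I mean by the first two orders. This isn't real collector code, and the names are mine; it just records the order in which a breadth-first copy (a Cheney-style queue) versus a depth-first copy (a stack) would lay objects out in to-space, starting from the roots.

import java.util.*;

// CopyOrderDemo.java -- toy illustration of breadth-first vs. depth-first copy order.
// It only records the order a copying traversal would place objects in to-space.
public class CopyOrderDemo {
    static class Obj {
        final String name;
        final List<Obj> fields = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    // Breadth-first: process the work list as a queue. (Cheney's algorithm does this
    // implicitly by scanning to-space left to right.)
    static List<String> breadthFirstOrder(List<Obj> roots) {
        List<String> order = new ArrayList<>();
        Set<Obj> copied = new HashSet<>(roots);
        Deque<Obj> queue = new ArrayDeque<>(roots);
        while (!queue.isEmpty()) {
            Obj o = queue.removeFirst();
            order.add(o.name);
            for (Obj child : o.fields)
                if (copied.add(child)) queue.addLast(child);
        }
        return order;
    }

    // Depth-first: same idea with a stack, so an object's descendants tend to be
    // placed right after it.
    static List<String> depthFirstOrder(List<Obj> roots) {
        List<String> order = new ArrayList<>();
        Set<Obj> copied = new HashSet<>();
        Deque<Obj> stack = new ArrayDeque<>();
        for (int i = roots.size() - 1; i >= 0; i--) stack.push(roots.get(i));
        while (!stack.isEmpty()) {
            Obj o = stack.pop();
            if (!copied.add(o)) continue;
            order.add(o.name);
            for (int i = o.fields.size() - 1; i >= 0; i--)
                if (!copied.contains(o.fields.get(i))) stack.push(o.fields.get(i));
        }
        return order;
    }

    public static void main(String[] args) {
        Obj r = new Obj("r"), a = new Obj("a"), b = new Obj("b"),
            c = new Obj("c"), d = new Obj("d"), e = new Obj("e");
        r.fields.addAll(List.of(a, b));
        a.fields.addAll(List.of(c, d));
        b.fields.add(e);
        List<Obj> roots = List.of(r);
        System.out.println("breadth-first: " + breadthFirstOrder(roots)); // [r, a, b, c, d, e]
        System.out.println("depth-first:   " + depthFirstOrder(roots));   // [r, a, c, d, b, e]
    }
}

On a tree like the one in main, depth-first keeps each subtree contiguous, which is roughly what hierarchical decomposition is after as well; on graphs with sharing and cycles the orders diverge more, and it's much less obvious which one the mutator will like best.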

If I were less lazy, this would be the point where I draw a graph and show you the copying order produced by each of these. Most of it is described in the Jones and Lins book, Garbage Collection: Algorithms for Automatic Dynamic Memory Management.

The results

My goal was to implement all of these and collect empirical data on their performance. I implemented several of them, but I found out very close to the end of the summer that the system I was using (Jikes RVM) actually puts data in a different order than I had assumed. Apparently, it uses some kind of hybrid between breadth-first and depth-first order that one paper calls "approximate depth-first order," but my code had been based on the assumption that it used depth-first order, and I couldn't figure out how to fix it.

Nevertheless, I collected a bunch of data on the performance of the DaCapo benchmark suite under non-generational copying (in approximate depth-first order), mark-compact, and mark-sweep. I used perfctr to measure the cache misses that the mutator encountered. I expected that mark-compact would have the fewest misses, then copying, then mark-sweep.

I assumed this because many papers I read said things like, "It is widely assumed that sliding mark-compact is the optimal compaction order, so we'll take that for granted without further comment; that's all the justification we need to keep going with this design."

But, as it turns out, mark-compact does worse than copying collection, by a margin greater than published results for the difference between depth-first and breadth-first copying. Well, it does on a bunch of the DaCapo benchmarks, at least. And, um, I didn't do all that many trials, and I set them up wrong, so the running time was probably dominated by the JIT rather than by the program actually running. But still, I got a new result! It contradicts what was assumed in papers published just a few months earlier! And with a little more work and experience with Jikes, I could make the results rigorous and conference-worthy, right?

Oh, wait, no. Someone at a conference that June had already published a paper with a little graph in the introduction showing the same thing, except with a good methodology and a bunch of tiny, disjoint 99% confidence intervals. Oops! Better luck next time, I guess.

[A little note: I haven't been writing very much, or programming Factor very much, because this semester I'm in Budapest in an intense math study abroad program for American undergraduates. I'm taking 5 math classes, the hardest of which is a group theory course whose midterm I probably just failed this morning. (A random fun question: if G is a group and H is a subgroup with |G:H| = 55, and a is an element of G whose order is 89, prove that a is in H. One of the few questions I could actually answer on the test...) Most of my time is spent studying with people, and though I might physically have time for programming, I've been feeling a little burned out. But I'm planning on devoting a lot of time next semester to working on Factor's garbage collector, first making an incremental generational mark-sweep collector (using snapshot-at-the-beginning) and then deciding where to go from there.]
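[For the curious, here's a sketch of one way that argument can go, using the action on cosets; this is my reconstruction, not necessarily the intended exam solution. Let $G$ act on the 55 left cosets of $H$ by left multiplication, giving a homomorphism $\phi : G \to S_{55}$. The order of $\phi(a)$ divides $\mathrm{ord}(a) = 89$, which is prime, so it is 1 or 89. Every cycle length of the permutation $\phi(a)$ divides its order and is at most 55, so if the order were 89 some cycle would have length $89 > 55$, which is impossible. Hence $\phi(a)$ is the identity; in particular $aH = H$, so $a \in H$.]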

[Update: Thank you all for correcting my typos. Spatial, not spacial, and I meant that spatial locality is more flexible, not temporal. Sorry if I confused anyone!]

7 comments:

Unknown said...

In the spirit of "green", why not rename GC to RR (i.e. Resource Recycling) :-)

Anonymous said...

"A more flexible property is temporal locality. If you've just accessed a particular piece of memory, it is also fast to access nearby locations in memory. Programs that take advantage of this have good spacial locality."

Do you mean "A more flexible property is spacial locality" in the beginning of that bit?

Anonymous said...

I think you meant to type "spatial" rather than "temporal" as the 6th word in the 5th paragraph.

Matt Donahoe said...

"this is said to have good temporal locality. This isn't something that can be easily changed in programs, most of the time.

A more flexible property is temporal locality."

Did you mean spatial locality for that last one??

Barry Kelly said...

One totally gratuitous spelling Nazi point: spatial, not spacial.

Otherwise, interesting stuff!

Anonymous said...

So what did your research find in terms of the best spatial locality? Off the top of my head, I would say depth-first copying would have much better locality vs. breadth first or allocation order.

-V

Unknown said...

anonymous,

The basic result was that depth-first or breadth-first copying did better than sliding compaction, a result which was already found in the Immix paper. I tried to compare depth-first and breadth-first copying on the DaCapo benchmarks, but my implementation was flawed and I wasn't able to fix it before the end of the REU. But earlier studies showed that depth-first and breadth-first copying differed by a very small margin, smaller than the margin that appears to exist between either of those and sliding compaction.

I've decided not to continue the project. Further serious attempts to improve the spatial locality of programs through garbage collection should probably take advantage of data collected during program execution, maybe gathered as part of the JIT's normal runtime profiling. Such a system has actually been built, but I forget which paper it was in. It's possible to do this so that the overhead is lower than the benefit, I think. These kinds of optimizations are amazing, in my opinion, and I wish I could create one myself.