Basic Factor object demographics
To figure out what we need in a garbage collector, it's useful to know some things about the data usage patterns in Factor code. A very detailed study of this has been done for Java. I hypothesize that these results, crucially the ones about allocation site predicting lifespan, are not entirely accurate for functional languages like Factor. But that's a topic for another time.
I want to find something really basic: what's the distribution of object sizes in Factor? This can be done easily without delving into C code:
USING: kernel math.statistics memory sequences ;

: object-data ( -- mean median std-dev )
    gc                                  ! full collection, so dead objects are not counted
    [ drop t ] instances [ size ] map   ! size in bytes of every object in the heap
    [ mean ] [ median ] [ std ] tri ;
When I ran that code on the current version of Factor, on a 32-bit x86 processor, at the command line with an unmodified image (you have to run it with an extra-big tenured space), I got a mean of about 64 bytes, a median of 24 bytes and a standard deviation of 4514 bytes. Roughly, this means that a typical Factor object takes up six words (24 bytes at 4 bytes per word), but that some objects are far bigger. The distribution is skewed far to the right.
Only 242 objects in the heap are larger than a page (4kB), totaling 10MB out of the heap's 37MB of live data. The biggest object is about 3MB. Of the 242 large objects, two are byte arrays, three are regular arrays, and the rest are hashtables.
So, in the Factor GC, we're dealing with lots of little objects and a few big ones, and both need to be handled efficiently. The numbers also tell us that, in terms of space overhead, it wouldn't be completely out of the question for the GC to use, say, an extra word per object, since this would take up only a relatively small proportion of the heap. The median object size is actually a bit bigger than I expected.
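To put a rough number on the "extra word per object" idea, here is a back-of-the-envelope calculation in Python using only the figures quoted above. It assumes the 37MB total and the 64-byte mean describe the same heap snapshot; the function name `header_overhead` is made up for this sketch. Note that the heap-wide overhead depends on the mean object size, while the median tells you how much a typical small object grows.

```python
# Back-of-the-envelope estimate from the measurements quoted in the text.
# Assumption: the 37MB total and the 64-byte mean come from the same snapshot.
WORD_BYTES = 4            # word size on 32-bit x86
MEAN_OBJECT_BYTES = 64    # measured mean object size
MEDIAN_OBJECT_BYTES = 24  # measured median object size
LIVE_DATA_MB = 37         # live data in the heap


def header_overhead(object_bytes, extra_words=1, word_bytes=WORD_BYTES):
    """Fraction by which an object grows when extra header words are added."""
    return extra_words * word_bytes / object_bytes


# Total overhead is (object count * word size) / (object count * mean size),
# so the object count cancels and only the mean matters heap-wide.
mean_overhead = header_overhead(MEAN_OBJECT_BYTES)
median_overhead = header_overhead(MEDIAN_OBJECT_BYTES)
extra_mb = LIVE_DATA_MB * mean_overhead

print(f"heap-wide overhead: {mean_overhead:.2%} (about {extra_mb:.1f}MB)")
print(f"a median-sized object grows by {median_overhead:.1%}")
```

Under these assumptions the whole heap grows by a little over 6%, even though a median-sized object grows by about a sixth, because the few large objects dominate the total.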
(This data might not be representative, because it consists of things in tenured space, recorded while nothing useful was going on. Further study would be useful.)
The structure of Factor's heap
Factor's heap consists of two parts: the code heap and the data heap. The data heap is maintained by a three-generational semispace (aka copying) collector, and the code heap is maintained by a mark-sweep collector. Separation here isn't unusual. In MMTk, there are several (maybe 7 or 8, IIRC) different heap subspaces.
There are a few reasons for this separation. For a platform like x86, with separate data and instruction caches, it is beneficial to keep these two things separate. If data and code are mixed together, the icache will be frequently and unnecessarily cleared due to modifications of the data. This will mess up pipelining and all kinds of other hardware optimizations that I don't really understand. Additionally, on other platforms like PowerPC, jumps can only be within 32 megabytes of code space, to keep instruction width fixed. Keeping the code heap separate makes it so that code is all together, so no extra jumps have to be inserted.
Within the data heap, the structure is relatively simple. The nursery is a single semispace, while the aging and tenured spaces each consist of two semispaces. Card marking is used to track old-to-young pointers. A little optimization called decks, which Slava apparently invented, makes it so that fewer cards have to be scanned. They're basically cards of cards. The generational write barrier used to just mark the card corresponding to the pointer modified; it now makes two marks: the small card, corresponding to less than a kilobyte of memory, and the deck, corresponding to a subsection of the cards. This makes nursery collection faster, since it's easier to scan the cards for roots.
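Here is a small Python model of how a two-level card/deck scheme like this can work. It is only a sketch of the idea, not Factor's VM code: the class `CardTable`, its methods, and the card and deck sizes (256 bytes and 256kB) are all assumptions chosen for illustration.

```python
# Toy model of card marking with decks. Sizes are illustrative assumptions,
# not the values used by Factor's VM.
CARD_BITS = 8    # hypothetical: one card covers 2**8 = 256 bytes
DECK_BITS = 18   # hypothetical: one deck covers 2**18 bytes = 1024 cards


class CardTable:
    def __init__(self, heap_bytes):
        self.cards = bytearray((heap_bytes >> CARD_BITS) + 1)
        self.decks = bytearray((heap_bytes >> DECK_BITS) + 1)
        self.cards_examined = 0  # counts work done while scanning

    def write_barrier(self, address):
        """Record that a pointer slot at `address` was modified."""
        self.cards[address >> CARD_BITS] = 1
        self.decks[address >> DECK_BITS] = 1

    def dirty_cards(self):
        """Return indices of marked cards, skipping decks that are clean."""
        cards_per_deck = 1 << (DECK_BITS - CARD_BITS)
        found = []
        for deck_index, deck_marked in enumerate(self.decks):
            if not deck_marked:
                continue  # clean deck: none of its cards need a look
            start = deck_index * cards_per_deck
            end = min(start + cards_per_deck, len(self.cards))
            for card_index in range(start, end):
                self.cards_examined += 1
                if self.cards[card_index]:
                    found.append(card_index)
        return found


table = CardTable(heap_bytes=16 * 1024 * 1024)
table.write_barrier(0x1234)      # store into an old object near the heap start
table.write_barrier(0x300000)    # store into an old object 3MB in
dirty = table.dirty_cards()
print(dirty, "found after examining", table.cards_examined,
      "of", len(table.cards), "cards")
```

In this model the barrier costs one extra store per pointer write, and in exchange the scan only visits the cards inside the two marked decks rather than the whole card table.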
Originally, collections of the two heaps were not completely coordinated. The data heap could be collected without the code heap being collected. But this is insufficient: since the data and code heaps can both point to each other, a collection of one has to take place at the same time as a collection of the other. If this weren't the case, then, for example, the data heap could be prematurely exhausted: imagine that 1000 quotations are made and immediately discarded, each of which points to a 10MB array. The data heap will fill up but the code heap won't be collected, so none of the arrays can be freed.
Future plans
I plan to keep working on Factor's garbage collector, both in small optimizations and in bigger changes to policy. By "policy" I mean the strategies the GC uses to be efficient. Policy ideas that I'm looking at include:
- Reexamining generational policy
Factor uses three generations to make sure that not just anything gets into tenured space. This idea isn't new, but most modern collectors use two generations, combined with a more subtle strategy to make sure objects have to survive a few collections before being promoted to the old generation.
The JVM's (Java) nursery is, in effect, like a nursery and an aging space, except that when the nursery fills up, there is an aging space collection. This simplifies the cards needed, and the write barrier can be made more efficient, possibly. But when I made a naive implementation of this in Factor, I got mixed results on benchmarks, so I didn't commit the changes.
Another possibility is to say that an object has to survive n nursery collections to be promoted to tenured space. This can be done most easily by having n nursery-sized spaces that things are copied through. On each nursery collection, objects are promoted one level, and objects on the highest nursery level are promoted to tenured space. Ideally there's some way to take up less space than that. GHC's (Haskell) runtime uses this strategy, and the heap's structure allows the higher levels to only use as many pages of memory as they need.
- Bounded variable-sized nursery
Right now, Factor's nursery and aging space each have a fixed size. But it's been known for some time that this isn't optimal: a tenured collection can be delayed slightly if the nursery shrinks when tenured space starts to fill up, allowing more of the heap to be used at once. A slight delay in tenured space collection translates into higher throughput, which we want.
- Lower space overhead for tenured space
Right now, with a semispace copying collector managing tenured space, Factor needs about three times as much memory to run as there is live data. I say three times rather than two because, with less than that, copying becomes too frequent and performance degrades. Mark-sweep and mark-compact are alternatives for managing the old generation with much lower space overhead, but both have disadvantages. Mark-sweep can cause memory fragmentation, and allocation from a free list usually isn't very fast (though it doesn't have to be slow). And it's difficult to make mark-compact collection efficient.
Nevertheless, they'd be in some ways an improvement over the existing collector, because a relatively small footprint is very important for the desktop and embedded domains. Once one of these two collectors is implemented (or maybe some simple combination of the two), it could be used as a basis for implementing something like MC2, Immix, the Compressor or a parallel or concurrent mark-sweep collector.
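Going back to the "survive n nursery collections" idea under the generational policy item, the promotion rule can be modeled in a few lines of Python. This is a toy sketch, not how Factor or GHC implement it: objects are plain strings, liveness is a set lookup, and the names `collect_nursery` and `N_LEVELS` are made up here. Level 0 stands for the allocation space and the last level for the highest nursery level.

```python
# Toy model: an object must survive N_LEVELS nursery collections
# before it reaches tenured space.
N_LEVELS = 3  # hypothetical choice of n


def collect_nursery(levels, is_live):
    """Run one nursery collection.

    Live objects on the highest level are promoted to tenured space; live
    objects on every other level move up one level. Dead objects are dropped.
    Returns (new_levels, promoted_to_tenured).
    """
    promoted = [obj for obj in levels[-1] if is_live(obj)]
    survivors = [[obj for obj in level if is_live(obj)] for level in levels[:-1]]
    return [[]] + survivors, promoted  # level 0 starts empty again


live = {"a", "c"}            # "b" is garbage by the first collection


def is_live(obj):
    return obj in live


levels = [[] for _ in range(N_LEVELS)]
tenured = []

levels[0] += ["a", "b"]                       # allocate two objects
levels, promoted = collect_nursery(levels, is_live)
tenured += promoted                           # collection 1: "a" moves to level 1

levels[0].append("c")                         # allocate a third object
levels, promoted = collect_nursery(levels, is_live)
tenured += promoted                           # collection 2: "a" moves to level 2

levels, promoted = collect_nursery(levels, is_live)
tenured += promoted                           # collection 3: "a" is tenured

print("levels:", levels)
print("tenured:", tenured)
```

The model makes the space cost visible: every level is a separate list, which corresponds to the n nursery-sized spaces mentioned above, and is why a layout where higher levels only take the pages they need is attractive.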