Monday, June 23, 2008

Messing around with Factor's GC

Unfortunately, my summer project is going ahead using the MMTk component of Jikes RVM. At this point, it looks like we're going to focus on an attempt to parallelize MC2. So, work on Factor's GC is just taking place in my free time until after the project is over. I haven't gotten much done yet, just a couple micro-optimizations and minor doc fixes. But I've gotten an understanding of the structure, which I'll try to explain to you.

Basic Factor object demographics

To figure out what we need in a garbage collector, it's useful to know some things about the data usage patterns in Factor code. A very detailed study of this has been done for Java. I hypothesize that these results, crucially the ones about allocation site predicting lifespan, are not entirely accurate for functional languages like Factor. But that's a topic for another time.

I want to find something really basic: what's the distribution of object sizes in Factor? This can be done easily without delving into C code:

USING: kernel math.statistics memory sequences ;
: object-data ( -- mean median std-dev )
    gc
    [ drop t ] instances [ size ] map
    [ mean ] [ median ] [ std ] tri ;

When I ran that code in the current version of Factor on a 32-bit x86 processor at the command line with an unmodified image (you have to run it with an extra-big tenured space), I got a mean of about 64 bytes, a median of 24 bytes, and a standard deviation of about 4514. Roughly, this means that a typical Factor object takes up six words, but that some objects are far bigger. The distribution is skewed far to the right.

There are only 242 objects in the heap which are larger than a page (4kB) in size, totaling 10MB out of the heap's total 37MB of live data. The biggest object is about 3MB. Two of these are byte arrays, three are regular arrays, and the rest are hashtables.

So, in the Factor GC, we're dealing with lots of little objects, and a few big ones. Both of these need to be dealt with efficiently. This also gives us the information that, in terms of space overhead, it wouldn't be completely out of the question for the GC to use, say, an extra word per object, as this would take up only a relatively small proportion of the heap space. The median object size, actually, is a bit bigger than I expected.

(This data might not be representative, because it consists of things in tenured space, recorded while nothing useful was going on. Further study would be useful.)

The structure of Factor's heap

Factor's heap consists of two parts: the code heap and the data heap. The data heap is maintained by a three-generational semispace (aka copying) collector, and the code heap is maintained by a mark-sweep collector. Separation here isn't unusual. In MMTk, there are several (maybe 7 or 8, IIRC) different heap subspaces.

There are a few reasons for this separation. On a platform like x86, with separate data and instruction caches, it is beneficial to keep the two apart: if data and code are mixed together, the instruction cache will be frequently and unnecessarily invalidated by modifications to the data, which interferes with pipelining and all kinds of other hardware optimizations that I don't really understand. Additionally, on other platforms like PowerPC, jumps can only reach within 32 megabytes, because the instruction width is fixed. Keeping the code heap separate keeps all the code close together, so no extra jumps have to be inserted.

Within the data heap, the structure is relatively simple. The nursery is a single semispace, while the aging and tenured spaces each consist of two semispaces. Card marking is used to track old-to-young pointers. A little optimization called decks, which Slava apparently invented, reduces the number of cards that have to be scanned; decks are basically cards of cards. The generational write barrier used to mark just the card corresponding to the modified pointer; it now marks two things: the small card covering the modified address (less than a kilobyte of memory) and the deck covering that card's group of cards. This makes nursery collection faster, since it's easier to scan the cards for roots.
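
To make the deck idea concrete, here's a minimal sketch of a two-level card-marking write barrier in Python-style pseudocode. The card and deck sizes, the field names, and the scanning loop are all assumptions for illustration; they don't correspond to the actual constants or C code in the Factor VM.

# Hypothetical two-level card marking. Each card covers 2^CARD_BITS bytes of heap,
# and each deck covers a contiguous group of cards; the constants are invented.
CARD_BITS = 8                       # assume 256-byte cards
DECK_BITS = CARD_BITS + 10          # assume each deck covers 1024 cards

class CardTable:
    def __init__(self, heap_size):
        self.cards = bytearray(heap_size >> CARD_BITS)
        self.decks = bytearray(heap_size >> DECK_BITS)

    def write_barrier(self, slot_address):
        # Mark the card for the written slot, and the deck containing that card.
        self.cards[slot_address >> CARD_BITS] = 1
        self.decks[slot_address >> DECK_BITS] = 1

    def scan_dirty_cards(self, visit_card):
        # Nursery collection: skip whole decks that are still clean.
        cards_per_deck = 1 << (DECK_BITS - CARD_BITS)
        for deck_index, deck_dirty in enumerate(self.decks):
            if not deck_dirty:
                continue
            first = deck_index * cards_per_deck
            last = min(first + cards_per_deck, len(self.cards))
            for card_index in range(first, last):
                if self.cards[card_index]:
                    visit_card(card_index)
                    self.cards[card_index] = 0
            self.decks[deck_index] = 0

The point is just that a nursery collection reads one byte per deck to decide whether a whole group of cards can be skipped, instead of scanning every card.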

Originally, collections of the two heaps were not completely coordinated: the data heap could be collected without the code heap being collected. But this is insufficient. Since the data heap and the code heap can point into each other, a collection of one has to take place at the same time as a collection of the other. If this weren't the case, then, for example, the data heap could be prematurely exhausted: imagine that 1000 quotations are made and immediately discarded, each of which points to a 10MB array. The data heap fills up, but because the code heap is never collected, none of the arrays can be freed.

Future plans

I plan on continuing to work on Factor's garbage collector, both on small optimizations and on bigger changes to policy. By "policy" I mean the strategies the GC uses to be efficient. Policy ideas that I'm looking at include:
  • Reexamining generational policy
    Factor uses three generations to make sure that not just anything gets into tenured space. The idea isn't new, but most modern collectors use two generations, combined with a more subtle aging strategy that makes objects survive a few collections before being promoted.

    The JVM's (Java) young generation is, in effect, a nursery plus an aging space, except that when the nursery fills up, the aging space is collected along with it. This simplifies the card marking needed, and possibly lets the write barrier be made more efficient. But when I made a naive implementation of this in Factor, I got mixed results on benchmarks, so I didn't commit the changes.

    Another possibility is to require that an object survive n nursery collections before being promoted to tenured space. This can be done most easily by having n nursery-sized spaces that objects are copied through: on each nursery collection, objects are promoted one level, and objects on the highest level are promoted to tenured space. Ideally there's some way to take up less space than that. GHC's (Haskell) runtime uses this strategy, and its heap structure allows the higher levels to use only as many pages of memory as they actually need. (A minimal sketch of age-based promotion follows this list.)

  • Bounded variable-sized nursery
    Right now, Factor's nursery (and aging space) is fixed in size. But it's been known for some time that this isn't optimal: a tenured collection can be delayed slightly if the nursery shrinks as tenured space starts to fill up, allowing more of the heap to be used at once. A slight delay in tenured space collection translates into higher throughput, which we want.

  • Lower space overhead for tenured space
    Right now, with a semispace copying collector managing tenured space, Factor needs about three times as much memory to run as there is live data. I say three times rather than two because, with less headroom than that, copying becomes too frequent and performance degrades. Mark-sweep and mark-compact are alternatives for managing the old generation with much lower space overhead, but they both have disadvantages: mark-sweep can cause memory fragmentation, and allocating from a free list isn't very fast (though this isn't necessarily true in every design), while mark-compact collection is difficult to make efficient.

    Nevertheless, they'd be in some ways an improvement over the existing collector, because a relatively small footprint is very important for the desktop and embedded domains. Once one of these two collectors is implemented (or maybe some simple combination of the two), it could be used as a basis for implementing something like MC2, Immix, the Compressor, or a parallel or concurrent mark-sweep collector.
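
As a concrete illustration of the first policy idea above, here is a minimal sketch of survival-count promotion in Python-style pseudocode. It assumes a per-object age counter in the header rather than GHC's chain of step spaces, and the threshold, object fields, and space interfaces are all invented for the example.

PROMOTION_AGE = 3   # assumed threshold: minor collections survived before tenuring

def evacuate_young_object(obj, aging_space, tenured_space):
    # Called for each live object discovered during a minor collection.
    obj.age += 1
    if obj.age >= PROMOTION_AGE:
        return tenured_space.copy_in(obj)   # old enough: promote to tenured space
    return aging_space.copy_in(obj)         # not yet: keep it in the young generation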

Tuesday, June 10, 2008

An idea for garbage collection research

This summer, I'm participating in an REU (Research Experience for Undergraduates) at Harvey Mudd College in southern California. For the past week or so, we've been trying to come up with ideas for what to research. I think I've stumbled on a reasonable one, though it might take a long time to implement. Hopefully, I'll spend the summer implementing it with a co-conspirator, Tony Leguia.

The idea is to combine MC2 and Sapphire. MC2 is a copying algorithm which can be incremental and which has much lower memory overhead than the normal semispace algorithm, and Sapphire is an algorithm for concurrent copying GC. Together, they could form an efficient algorithm for concurrent compacting garbage collection with minimal space overhead. This has been done before, but this combination also allows pointer-bumping allocation with no reliance on hardware features like memory page protection. I haven't found a paper which does all of these things at once.

Motivation


For me, but of course not for anyone else except maybe Tony, the motivation is to see how far this mark-copy algorithm can go. As far as I know (and as far as the co-author of the paper that I contacted knows), there have only been two papers published on it, the original one and MC2, neither of which mentions multiple processors.

One idea is that this would be useful for large servers with many cores that are under memory pressure. Existing algorithms which depend on page faults can be slow when those faults occur, and normal semispace collectors usually do poorly under memory pressure. With a concurrent collector like this, lower pause times could be achieved, giving faster response times to clients. But, for servers, other techniques might actually be more relevant, like giving each thread a separately collected nursery.

In theory, a concurrent collector would increase throughput too, especially for programs that have fewer threads than the machine they're running on has cores. But this is a somewhat different use case than the server; picture a single-threaded media player running on a two-core desktop machine. Really, on a desktop, all programs should have as small a footprint as possible and as short pause times as possible, and this idea can hopefully provide that. The lack of dependence on page protection should also help performance compared to other algorithms.

The third use case is an embedded system with multiple processors, which is apparently increasingly common. A concurrent version of MC2 would be useful for the same reasons as MC2 is on uniprocessor embedded systems.

I'm not positive that this is the best way to go about it. Maybe I really want a parallel collector, along the lines of GHC's recent parallel copying collector. As that paper discusses, it's pretty tricky to actually get a performance improvement, but it ends up being worth the effort throughput-wise. Maybe I should be making mark-copy parallel. This is a different research project, but it might be one that we end up looking into more. Maybe we could even make a parallel concurrent collector this way! But definitely not over the next 9 weeks, which is how much time I have left.

Details


I wrote about MC2 earlier, but not about Sapphire. Sapphire is a mostly concurrent copying collector that works without a read barrier of any kind. I say mostly concurrent because some synchronization needs to be done surrounding the program stack, but not very much. Previous concurrent copying collectors worked with a read barrier that preserved the invariant that the mutator can only see things that have been copied into tospace. Sapphire, on the other hand, uses a write barrier to preserve the invariant that, roughly, during copying, fromspace and tospace are consistent.

The collector consists of four phases: (1) mark, (2) allocate, (3) copy, (4) flip. In the mark phase, a concurrent snapshot-at-the-beginning mark is done. In the allocate phase, space in tospace is set up for each object, and non-clobbering pointers to tospace are installed in everything in fromspace. They can be non-clobbering because they take up some space in the header, and the old header can be stored in tospace. Next, the copy phase copies fromspace to tospace; the allocate step laid things out in breadth-first order, so Cheney's algorithm works here. In the flip step, the stack and everything else outside the space being copied have their pointers changed from pointing to fromspace to pointing to tospace. Throughout all of this, the write barrier ensures that modifications to fromspace are propagated to tospace.
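
A rough sketch of how that write barrier could look, in Python-style pseudocode: it assumes each fromspace object gains a forwarding field during the allocate phase, and it ignores the memory-ordering and synchronization issues that the Sapphire paper spends most of its time on. All names are invented.

def write_field(obj, index, value):
    obj.fields[index] = value        # ordinary store into the fromspace object
    copy = obj.forwarding            # set up by the allocate phase, None before that
    if copy is not None:
        copy.fields[index] = value   # keep the tospace replica consistent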

As far as I can tell, it takes nothing special idea-wise to combine these two things. The real work would be in the implementation details, and in the benchmarks proving that this is possible and advantageous. One simplification we could make (though I'm not sure if it's a good idea in terms of locality) is that forwarding pointers and "reverse forwarding pointers" could each be held in window-sized blocks. So, overall, the collector would consist of the following (sketched in code after the list):
  1. A mark phase, run concurrently with the mutator, using either the incremental-update or the snapshot-at-the-beginning variant. This would also collect the additional per-window liveness data that the grouping phase needs.
  2. A grouping phase as in MC2
  3. For each group:
    1. Allocate tospace pointers in the empty window
    2. Copy fromspace to tospace
    3. Flip all of the pointers recorded in the remembered set for fromspace
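
Here is that outline as a schematic in Python-style pseudocode, purely to show the shape of the loop. Every function and field here is invented, and the concurrency details are exactly the part being glossed over.

def collect(heap):
    live_bytes = concurrent_mark(heap)              # 1. concurrent mark; also records
                                                    #    how much live data each window holds
    groups = plan_groups(heap.windows, live_bytes)  # 2. MC2-style grouping of windows
    for group in groups:                            # 3. evacuate one group at a time
        allocate_tospace(group, heap.free_window)   #    3.1 reserve tospace slots
        copy_objects(group)                         #    3.2 copy fromspace to tospace
        flip_pointers(heap.remembered_set[group])   #    3.3 update recorded references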

The Sapphire paper discusses condensing mark, allocate and copy phases into a single replicate phase. Within this framework, we could only combine the allocate and copy phases.

Related work


The field of garbage collection systems trying to achieve the same goals is so dense and diverse that I feel hesitant to join in. But here are a bunch of related things, aside from what's already been mentioned.
  • One of the first incremental (and extendible to concurrent) copying collectors is that of Henry Baker. But this uses a read barrier, a little bit of code inserted before every pointer read, which ensures that it's not looking at fromspace.
  • An Appel-Ellis-Li-style copying collector (yes, they actually call it that) uses memory page permissions to make, in effect, a free-unless-it-gets-triggered read barrier maintaining the invariant that everything the mutator sees is in to-space. If it gets triggered, it's expensive, but the hope is that it won't get triggered very often because the concurrent copying will go fast enough.
  • The Compressor is a garbage collector which, I believe, maintains invariants similar to the Appel-Ellis-Li collector but uses this for a unique type of mark-compact collector which operates in just one pass. The algorithm is both parallel and concurrent.
  • There's been a lot of work in concurrent mark-sweep, but this leads to fragmentation over time. So a system for occasional mostly concurrent compaction has been developed.

In all but the last of these systems, a read may be expensive, due either to an explicit read barrier or the possibility of a page protection fault. The last one makes allocation (into the generation that is mark-swept) expensive due to a free list. Nevertheless, it might be reasonable to look into using an Appel-Ellis-Li-style barrier in place of Sapphire's write barrier.

Our plan


The plan is to implement a basic mark-copy collector, maybe on the Factor runtime, and then make it concurrent (or maybe parallel) somehow. At each step, we'll try to do the most basic thing possible. If we can get a concurrent or parallel version of mark-copy done by the end of the summer, I'll be happy. This will be done while writing the paper (which this post could be considered a first draft of). Optimizations along the lines of MC2 and the finishing touches on the paper can be done after the summer, as long as we get most things done over the next 9 weeks.

It's an exciting time to be involved in academic computer science because so many basic results have been elaborated by now. The only problem is the minefield of patents (especially dense in the practical field of garbage collection) and the fact that everyone else has thought of your idea before you.

Saturday, June 7, 2008

A second introduction to Unicode

If you're a normal programmer, you probably really don't want to have to think about Unicode. In what you're doing, text processing probably isn't a very important aspect, and most of your users will be using English. Nevertheless, text has a tendency to creep its way into almost everything, as the basic computer/human interface. So it might be a little beneficial to know about the basics of text processing and the Unicode character set.

A lot of people have written blog posts introducing Unicode, and I didn't want to write another one with no new information in it. A popular one is Joel (on Software)'s, which describes what Unicode is and why it's important. You've likely already read an introduction to Unicode, so I'll just summarize the most important points:
  • You can't assume text is in ASCII anymore. This isn't just about being nice to non-English speakers. Even Americans enjoy their “curly quotes”, their cafés—and their em dashes. User input might come with these non-ASCII characters, and it must be handled properly by robust applications.
  • Unicode is the character set to use internally. A bunch of character sets have been developed over the years for different purposes, but Unicode can represent more scripts than any other single character set; it was designed to be able to include the characters from basically all other character sets in use. If you're using C or C++, wchar_t rather than char for strings works for most cases. If you're using a higher-level language, then strings should already be stored in some representation that allows for Unicode uniformly.
  • There are several text encodings. Not all text is in ASCII, and very little text is in the most common 8-bit extension, Latin 1 (ISO-8859-1). Lots of input is in UTF-8, which can represent all of Unicode, but there are other Unicode encodings, as well as specific East Asian encodings like GB 2312 and Shift JIS, in active use. Generally, UTF-8 should be used for output, and it's on the rise in terms of usage. Depending on the programming language or library used, you might have to account for the encoding when doing text processing internally; UTF-16 and UTF-8 are the most common, and careless programming can produce meaningless results in non-ASCII or non-BMP cases if the encoding is ignored. (There's a small example of this after the list.)
  • Unicode encodes characters, not glyphs. Unicode can be seen as a mapping between numbers, called code points, and the basic units of text. It's been decided that this basic unit is the character, like a letter or a space, rather than a specific presentation form, which is referred to as a glyph. Glyphs are something that only font designers and people who work on text rendering have to care about.
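
As a small illustration of why the encoding can't be ignored, here's a Python snippet; the string is an arbitrary example.

text = "café"                        # 'é' is not ASCII
data = text.encode("utf-8")          # b'caf\xc3\xa9'
print(data.decode("utf-8"))          # café   (round-trips correctly)
print(data.decode("latin-1"))        # cafÃ©  (decoded with the wrong encoding: mojibake)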

But there's a little bit more that programmers have to know about. Unicode is part of a bigger program of internationalization within a single framework of encodings and algorithms. The Unicode standard includes several important algorithms that programmers should be aware of. They don't have to be able to implement them, just to figure out where in the library they are.
  • Normalization. Because of complications in the design, some Unicode strings have more than one possible representation that is actually equivalent. A number of normalization forms have been defined to get rid of these differences, and the one you should use is probably NFC. Usually, you should normalize before doing something like comparing for equality. This is independent of locale. (See the short example after this list.)
  • Grapheme, word, sentence and line breaks. It's not true anymore that a single character forms a single unit for screen display. If you have a q with an umlaut over it, it has to be represented as two characters, yet it is one grapheme. If you're dealing with screen units (imagine an internationalized Hangman), a library should be used for grapheme breaks. Similarly, you can't identify words as things separated by spaces or punctuation, or line break opportunities by looking for whitespace, or sentence breaks by looking just at punctuation marks. It's easy to write a regular expression that tries to do one of these things but does it wrong for English, and it's even easier to do it wrong for other languages, which use other conventions. So use a Unicode library for this. The locale affects how these breaks happen.
  • Bidirectional text. When displaying text on a screen, it doesn't always go left to right as in most languages. Some scripts, like Hebrew and Arabic, go right to left. To account for this, use the Unicode Bidirectional Algorithm (bidi), which should be implemented in your Unicode library. Locale doesn't matter here.
  • Case conversion. Putting a string in lowercase is more complicated than replacing [A-Z] with [a-z]. Accent marks and other scripts have to be taken into account, as well as a few weird cases like the character ß going to SS in upper case. The locale is also relevant in case conversion, to handle certain dots in Turkish, Azeri and Lithuanian.
  • Collation. There's an algorithm for Unicode collation that works much better than sorting by ASCII value and works reasonably for most languages; depending on the locale, it should be modified. Even in English, the Unicode Collation Algorithm produces much more natural results. Parts of the collation key can be used for insensitive comparisons, e.g. ignoring case.
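
A couple of these points can be illustrated with Python's standard library, which covers normalization and case conversion; grapheme breaks, bidi, and full collation need a dedicated library such as ICU and aren't shown here.

import unicodedata

nfc = "caf\u00e9"                    # é as a single precomposed code point
nfd = "cafe\u0301"                   # e followed by a combining acute accent
print(nfc == nfd)                                  # False: equivalent, but not equal
print(unicodedata.normalize("NFC", nfd) == nfc)    # True: normalize before comparing

print("straße".upper())                            # STRASSE: ß uppercases to SS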

For further reading, you can look at this much more in-depth article from SIL, or the Unicode 5.1 standard itself, which isn't that bad. Most programmers can't be expected to know all of this stuff, and they shouldn't have to. But it'd be nice if everyone used the appropriate library for text processing when needed, so that applications could be more easily internationalized.