Friday, October 31, 2008

Garbage collection camp post-mortem

This summer, I went to Harvey Mudd's REU (an NSF-funded research program) in computer science to study garbage collection. My adviser was Melissa O'Neill. I didn't really accomplish anything, but I guess I learned a lot.

My focus was on the memory behavior of garbage collection algorithms. Simple analyses of GC algorithms, like those of most algorithms, assume a random access machine (RAM) model: that all memory accesses have constant cost. In truth, different orders of access have different costs in a more or less predictable way, based on the way that memory is implemented.

Locality

I won't go into the details of the memory hierarchy, the reason for this difference, in this post. A good but very long reference is What Every Programmer Should Know about Memory [PDF] by Ulrich Drepper. It goes well beyond what every programmer needs to know. Basically, there are two properties which a program should have to use the memory hierarchy efficiently: spatial locality and temporal locality.

If you've just accessed a particular location in memory, it is fast to access it again if you do so soon enough. A program whose access pattern revisits the same locations while they are still cheap to reach is said to have good temporal locality. Most of the time, this isn't something that can easily be changed in a program.

A more flexible property is spatial locality. If you've just accessed a particular piece of memory, it is also fast to access nearby locations in memory. Programs that take advantage of this have good spatial locality. There are many ways to modify programs to exploit this property, so spatial locality is usually what people mean when they talk about improving a program's locality.

Memory is implemented this way because, naturally, most programs have pretty good spatial and temporal locality. Only a small percentage of memory reads and writes actually interact with the main memory, and usually faster mechanisms can be used. But, still, even though most reads and writes use this faster mechanism, a good portion of programs' running time is taken up by the slower method. Improvements on this could potentially have a very significant effect on performance.

A note: sometimes, people refer to giving programs "good cache behavior." The CPU cache is the part of the memory hierarchy most responsible for the effects described above as spatial and temporal locality.

Copying orders

There are a couple of ways that the garbage collector can improve locality. One way is that the traversal that the GC does can be optimized for good spatial locality. Another is that the garbage collector can reorganize the data during copying so that the mutator has good spatial locality. Both of these have been researched in the past, to varying degrees.

My research was on the second strategy. There are a few simple copying orders that we already know about: breadth-first copying, depth-first copying, sliding mark-compact (which leaves things in the exact order of their creation time), and the control: not copying. My advisor for the summer noticed that, on some example graphs, it appeared to improve locality to put things in a sort of "topological sort" order (with cycles arbitrarily broken). There's also a technique called hierarchical decomposition, which attempts to decompose a tree into a number of subtrees with as few edges between them as possible. (That's what we want! But it might not be the best if we don't have trees.) For a little while, I was even trying to figure out the "optimal" compaction using a genetic algorithm, but it turned out to be extremely computationally expensive.

If I were less lazy, this would be the point where I draw a graph and show you the copying orders produced by each of these. Most of them are described in the Jones and Lins book, Garbage Collection: Algorithms for Automatic Dynamic Memory Management.

The results

My goal was to implement all of these and collect empirical data on their performance. I implemented several of them, but I found out very close to the end of the summer that the system I was using (Jikes RVM) puts data in a different order than I had assumed. Apparently, it uses some kind of hybrid between breadth-first and depth-first order that one paper calls "approximate depth-first order", but my code had been based on the assumption that it used depth-first order, and I couldn't figure out how to fix it.

Nevertheless, I collected a bunch of data on the performance of the DaCapo benchmark suite on non-generational copying (in approximate depth-first order), mark-compact and mark-sweep. I used perfctr to measure the cache misses that the mutator encountered. I expected that mark-compact would have the fewest misses, and then copying and then mark-sweep.

I assumed this because many papers I read said things like, "It is widely assumed that sliding mark-compact is the optimal compaction order, so we'll assume without further comment that it's true, and that's all the justification we need to keep going with this design."

But, as it turns out, mark-compact does worse than copying collection, by a margin that's greater than published results for the difference between depth-first and breadth-first copying. Well, it does this for a bunch of the DaCapo benchmarks, at least. And, umm, I didn't do all that many trials, and I set them up wrong so the running time was probably dominated by the JIT rather than the program actually running. But still, I got a new result! This contradicts what's been assumed in certain papers published just a few months earlier! And with a little more work and experience with Jikes, I could make the results rigorous and conference-worthy, right?

Oh, wait, no. Someone at a conference that June published a paper with a little graph in the introduction that showed the same thing. Except with a good methodology, and with a bunch of tiny, disjoint 99% confidence intervals. Oops! Better luck next time, I guess.




[A little note: I haven't been writing very much, or programming Factor very much, because this semester I'm in Budapest in an intense math study abroad program for American undergraduates. I'm taking 5 math classes, the hardest of which is a group theory course whose midterm I probably just failed this morning. (A random fun question: If G is a group and H is a subgroup, where |G:H| = 55, and a is an element of G such that the order of a is 89, prove that a is in H. One of the few questions I could actually answer on the test...) Most of my time is spent studying with people, and though I might physically have time for programming, I've been feeling a little burned out. But I'm planning on devoting a lot of time next semester to working on Factor's garbage collector, first making an incremental generational mark-sweep collector (by snapshot-at-the-beginning) and then deciding where to go from there.]
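(A sketch of one way to see it, for the curious, though it may not be the intended proof: let the cyclic subgroup generated by a act on the 55 left cosets of H by left multiplication. Since the order of a is the prime 89, every orbit of this action has size 1 or 89; with only 55 cosets, every orbit must have size 1. In particular the orbit of H itself is just {H}, so aH = H, which means a is in H.)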

[Update: Thank you all for correcting my typos. Spatial, not spacial, and I meant that spatial locality is more flexible, not temporal. Sorry if I confused anyone!]

Sunday, August 3, 2008

New mark-compact algorithms

It'd be good for a garbage collector to have low space overhead, good cache behavior and low fragmentation. Low space overhead is useful wherever there's a limited amount of RAM. Eliminating fragmentation is important, since fragmentation reduces the usable amount of memory and can make allocation slower. Improving cache behavior of ordinary programs is always good for performance, and surprisingly a garbage collector can help. These three goals point to mark-compact garbage collection as a good way to go.

The idea

First, mark all of the reachable data by traversing the heap. Then, slide the live (reachable) data as far over to one side as possible. This will (hopefully) free up some space on the other side, so we have a big contiguous free area. With this free space, we can allocate by just keeping a pointer that marks where free space begins and incrementing it when allocation happens. Pointer-bumping allocation eliminates fragmentation and creates good cache behavior, since all new data is allocated close together. In a practical implementation, you'd want to use a generational system, where the oldest generation is maintained by mark-compact.
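To make the appeal of pointer bumping concrete, here is a minimal sketch (Python as pseudocode; the names are made up, and a real allocator does this in a handful of machine instructions):

class BumpAllocator:
    def __init__(self, start, end):
        self.free = start   # where the contiguous free area begins
        self.end = end      # where it ends

    def allocate(self, size):
        if self.free + size > self.end:
            return None     # out of space; a real runtime would trigger a collection
        address = self.free
        self.free += size   # "bump" the pointer past the newly allocated object
        return address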

Many recently developed GC algorithms fit into this class in a more general way. The heap is marked, and then, through some means, the live data is clumped together. There are many different ways to do this, but to me they all feel fundamentally similar. The new algorithms have a greater space overhead than traditional mark-compact, getting a speed improvement or incrementality (short pause times) in exchange. The resulting overhead is still much less than semispace (copying) collection, which takes more than two times as much space as the program uses.

Basic implementation

The marking part of this isn't too hard. You can perform depth-first search on the heap using an explicit stack. For locality reasons, a bit vector can be maintained off to the side to hold the information about what's been marked. If the mark stack overflows (a rare condition), additional pushes to the stack can be ignored and the heap can be rescanned afterwards for marked objects with unmarked children. Marking can be done in parallel with a load balancing technique known as work-stealing queues, and it can be done incrementally or concurrently using a certain write barrier.
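Here's a rough sketch of that marking loop (Python as pseudocode; roots and children are assumed to be supplied by the runtime, and a set stands in for the mark bit vector):

def mark(roots, children, stack_limit=10000):
    marked = set()          # stands in for the side mark bit vector
    stack = list(roots)     # the explicit mark stack
    overflowed = False
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        for child in children(obj):
            if child not in marked:
                if len(stack) < stack_limit:
                    stack.append(child)
                else:
                    overflowed = True   # drop the push; rescan the heap afterwards
    return marked, overflowed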

The hard part is the compacting. Traditional mark-compact algorithms, like the ones discussed in the Jones book on GC, try to use little or no additional space for the process. The easiest way to do it is to have an extra word per object which is used to store a forwarding address. After marking, there are three passes over the heap. In the first pass, the live data is iterated over in address order, and new addresses for all objects are calculated and stored in the forwarding addresses. In the second pass over the heap, every pointer is replaced with its target's forwarding address. In the third pass, objects are moved to their new forwarding addresses.
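In Python-flavored pseudocode over a toy heap (a dict from address to objects of the form {'size': ..., 'marked': ..., 'fields': [addresses]}; this is just to show the structure of the passes), it looks something like this:

def compact(heap):
    addresses = sorted(heap)
    # Pass 1: compute forwarding addresses by sliding live objects to the left.
    free = 0
    for addr in addresses:
        obj = heap[addr]
        if obj['marked']:
            obj['forward'] = free
            free += obj['size']
    # Pass 2: replace every pointer field with the target's forwarding address.
    for addr in addresses:
        obj = heap[addr]
        if obj['marked']:
            obj['fields'] = [heap[t]['forward'] for t in obj['fields']]
    # Pass 3: move the objects, in address order so that nothing live is
    # overwritten before it has been moved.
    new_heap = {}
    for addr in addresses:
        obj = heap[addr]
        if obj['marked']:
            new_heap[obj.pop('forward')] = obj
    return new_heap, free   # 'free' is where allocation can resume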

Other methods (table-based methods and threading methods) eliminate the word per object overhead and reduce the number of passes to two, but they remain relatively slow. It's not obvious how to make them incremental, concurrent or parallel in the compacting stage, though incremental, concurrent and parallel marking is well-studied. The difficulty comes from the fact that, in all of these methods, pointers temporarily point to locations that aren't filled with the right thing. Unlike with concurrent/incremental copying, it's difficult to design even a read barrier for this. Additionally, sliding compaction is difficult to parallelize because of the way pointers are calculated from an accumulator, and the sliding itself must be done in address order, otherwise uncopied things could be overwritten.

New ideas

There are a number of ways to go about changing mark-compact. Here are some of them that I've read papers about:
  • Reserve some additional space off to the side to calculate forwarding addresses from the mark bitmap, so only one pass over the heap is required. This is the idea developed in The Compressor [PDF], by Haim Kermany. Rather than compacting the heap onto itself, it is copied in page-size chunks. Since the to-pointers are calculated off to the side beforehand, pointers can be updated at the same time as copying happens, and only one pass over the heap is required. Once a page is copied, the old page can be reused, making the space overhead bounded. It can even run in parallel, copying to multiple pages at once without needing any synchronization beyond handing out the pages to copy. The paper also provides a concurrent version of this. (A tiny sketch of the offset-table idea appears after this list.)

  • Divide the heap up into chunks, copying over by chunks, keeping track of some of the pointers between chunks. This is the approach taken by the mark-copy [PS] algorithm by Narendran Sachindran, further developed as MC2 [PDF]. In order to allow the copying to go on independently, each chunk has a remembered set of the pointers going into it. Then, copying can proceed without looking at anything else, and the remembered sets are used to update just those incoming pointers afterward. The remsets are built during marking. To reduce the size of these remsets, the chunks are ordered, and pointers only need to be remembered going in one direction. This algorithm can be made incremental because the mutator can run in between copying windows, as long as a write barrier maintains the remsets. The MC2 paper describes a specific implementation where space overhead is bounded at 12.5%.

  • Switch from pointers to something else temporarily during compaction, to let things go incrementally. The grandiosely titled paper Mark-Sweep or Copying? A "Best of Both Worlds" Algorithm and a Hardware-Supported Real-Time Implementation [ACM] does this to implement a very concurrent mark-compact algorithm on specialized hardware using a read barrier. In one pass, all live objects are given "handles" in a separate part of the heap. These handles point back to the original object, and the original object points to the handle. Next, all pointers are replaced with handles. Objects are slid as far as possible towards the beginning of the heap, and their handles are updated appropriately. Finally, all objects have their pointers in the form of handles replaced with what those handles point to. This is given a concurrent implementation on specialized "object-oriented" hardware using multiple read barriers at different points in execution. During compaction, the read barrier makes sure that the mutator sees only handles, and afterwards, the read barrier makes sure that the mutator sees only pointers.

  • Use mark-sweep and a free list-based allocator, but occasionally compact areas that are extremely fragmented. This is perhaps the most widely studied idea, developed in a number of places. I can't find a good reference, but it's pretty clear that you can use mark-sweep over several GC cycles until the heap gets a certain amount fragmented, and then compact the entire heap. This can be done by compacting every n cycles, or by compacting when the free list gets to a certain length. But it seems like it'd be better to compact just the areas that are particularly fragmented. This technique is used in the hard real-time Metronome garbage collector as well as the mark-region Immix [PDF] collector. The compaction in both of these is actually done by evacuation, that is, copying to a different place. This involves some space overhead. Any old way of copying will do, including any incremental, concurrent or parallel method. A few [both ACM, sorry] concurrent copying schemes have been developed especially for this purpose. All of these deserve a full explanation, because they're really interesting, but I'll leave that for another blog post. One possibility related to this would be reference counting (with something to collect cycles) with occasional compaction, but I've never seen anything written about this. In these systems, the allocator has to be more complicated than pointer bumping, but it can still be very efficient.
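As promised above, here's a tiny sketch of the Compressor-style trick of computing forwarding information off to the side from the mark bitmap, so that the heap itself only needs one pass (this is a simplification of my reading of the paper, with made-up names; one offset is kept per fixed-size block of heap):

def build_offset_table(mark_bits, live_words, block_words=256):
    # offset[b] = compacted address (in words) of the first live word of block b.
    offset = []
    total = 0
    for block_start in range(0, len(mark_bits), block_words):
        offset.append(total)
        block_end = min(block_start + block_words, len(mark_bits))
        for i in range(block_start, block_end):
            if mark_bits[i]:             # word i starts a live object...
                total += live_words[i]   # ...of live_words[i] words
    return offset

A pointer into block b then forwards to offset[b] plus the number of live words that precede it within the block, which can be counted from the bitmap, so pointers can be updated at the same time as objects are copied.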

Some of these ideas also use sliding compaction order, but this isn't really a necessary trait of mark-compact GC. Earlier algorithms (the two finger algorithm, linearizing compaction) used different orders. Nevertheless, the creators of these algorithms often don't refer to their own work as "mark-compact" when it doesn't use this order. I think it makes sense to put them in one category, since they all share the quality that the heap is marked and then it is compacted, at least partly.

When I tell people who know about CS that I've been trying to research garbage collection this summer, the reaction is often something like, "Hasn't that been pretty much solved?" There may be solutions, but interesting improvements in terms of time and space overhead are being developed all the time. All of the papers I've referenced are from the past decade.

[A historical note: Imagine if everywhere I said "compacting" I instead said "compactifying." People actually did this in the 70s and 80s in the GC literature. I'm very glad they came to their senses. An honesty note: I don't actually understand the concurrent algorithms very well.]

Thursday, July 10, 2008

Persistent heaps in Factor

Inspired by an email where Slava proposed that Factor begin to use more persistent data structures, I implemented a functional priority queue (minheap). It's in the main Factor repository in extra/persistent-heaps. Strangely, the Haskell Hierarchical Libraries don't seem to have a priority queue implementation, but there's a nice literate program by Michael Richter implementing them in Haskell, and my implementation is generally similar to that. (Neither one really takes advantage of laziness.)

They're based on a really cool idea that's described in the book ML for the Working Programmer by L. C. Paulson. Instead of pushing onto a heap by putting an element at the end of the array and percolating up, we can add it going down from the top. This frees us from the use of the array, allowing a functional pointer-based structure to be used. The heap can be balanced if we alternate back and forth which side we push onto. This can be done easily if we always push on to the right side and then swap the left and the right sides (unless the priority is the least, in which case we relegate the old top value to go to the right side and swap the sides).
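Here's the push operation in miniature (a Python sketch of the same idea, not the Factor code; a heap is either None or an immutable (value, left, right) triple, and nothing is ever mutated):

def insert(heap, value):
    if heap is None:
        return (value, None, None)
    top, left, right = heap
    if value <= top:
        # The new value takes over the top; the old top is pushed down the
        # right side, and the sides are swapped to keep the tree balanced.
        return (value, insert(right, top), left)
    else:
        # Otherwise push the new value down the right side and swap the sides.
        return (top, insert(right, value), left)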

Monday, June 23, 2008

Messing around with Factor's GC

Unfortunately, my summer project is going ahead using the MMTk component of Jikes RVM. At this point, it looks like we're going to focus on an attempt to parallelize MC2. So, work on Factor's GC is just taking place in my free time until after the project is over. I haven't gotten much done yet, just a couple micro-optimizations and minor doc fixes. But I've gotten an understanding of the structure, which I'll try to explain to you.

Basic Factor object demographics

To figure out what we need in a garbage collector, it's useful to know some things about the data usage patterns in Factor code. A very detailed study of this has been done for Java. I hypothesize that these results, crucially the ones about allocation site predicting lifespan, are not entirely accurate for functional languages like Factor. But that's a topic for another time.

I want to find something really basic: what's the distribution of object sizes in Factor? This can be done easily without delving into C code:

USING: kernel math.statistics memory sequences ;
: object-data ( -- mean median std-dev )
gc
[ drop t ] instances [ size ] map
[ mean ] [ median ] [ std ] tri ;

When I ran that code on the current version of Factor on a 32-bit x86 processor at the command line with an unmodified image (you have to run it with an extra-big tenured space), I got a mean of about 64 bytes, a median of 24 and a standard deviation of 4514. This should be interpreted to mean, roughly, that a typical Factor object takes up six words, but that some objects are far bigger. The distribution is skewed far to the right.

There are only 242 objects in the heap which are larger than a page (4kB) in size, totaling 10MB out of the heap's total 37MB of live data. The biggest object is about 3MB. Two of these are byte arrays, three are regular arrays, and the rest are hashtables.

So, in the Factor GC, we're dealing with lots of little objects, and a few big ones. Both of these need to be dealt with efficiently. This also gives us the information that, in terms of space overhead, it wouldn't be completely out of the question for the GC to use, say, an extra word per object, as this would take up only a relatively small proportion of the heap space. The median object size, actually, is a bit bigger than I expected.

(This data might not be representative, because it consists of things in tenured space, recorded while nothing useful was going on. Further study would be useful.)

The structure of Factor's heap

Factor's heap consists of two parts: the code heap and the data heap. The data heap is maintained by a three-generational semispace (aka copying) collector, and the code heap is maintained by a mark-sweep collector. Separation here isn't unusual. In MMTk, there are several (maybe 7 or 8, IIRC) different heap subspaces.

There are a few reasons for this separation. For a platform like x86, with separate data and instruction caches, it is beneficial to keep these two things separate. If data and code are mixed together, the icache will be frequently and unnecessarily cleared due to modifications of the data. This will mess up pipelining and all kinds of other hardware optimizations that I don't really understand. Additionally, on other platforms like PowerPC, jumps can only target code within 32 megabytes, because the fixed-width instruction encoding limits the branch offset. Keeping the code heap separate means that code is all together, so no extra jumps have to be inserted.

Within the data heap, the structure is relatively simple. The nursery is a single semispace and the aging and tenured spaces consist of two semispaces. Card marking is used to track old-to-young pointers. A little optimization called decks, which Slava apparently invented, makes it so that fewer cards have to be scanned. They're basically cards of cards. The generational write barrier used to just mark a card corresponding to the pointer modified; it now marks two cards: the small card corresponding to less than a kilobyte of memory, and the deck corresponding to a subsection of the cards. This makes nursery collection faster, since it's easier to scan the cards for roots.
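In sketch form, the write barrier does something like the following (Python as pseudocode; the constants are illustrative, not Factor's actual sizes):

HEAP_SIZE = 64 * 1024 * 1024   # illustrative

CARD_BITS = 8    # one card covers 2**8 = 256 bytes (less than a kilobyte)
DECK_BITS = 18   # one deck covers a whole group of cards

cards = bytearray(HEAP_SIZE >> CARD_BITS)
decks = bytearray(HEAP_SIZE >> DECK_BITS)

def write_barrier(slot_address):
    # Run after every pointer store into the heap: mark both levels dirty,
    # so the nursery collector can skip whole decks of clean cards.
    cards[slot_address >> CARD_BITS] = 1
    decks[slot_address >> DECK_BITS] = 1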

Originally, the collections of the two spaces were not completely coordinated. The data heap could be collected without the code heap being collected. But this is insufficient: since the data and code heaps can both point to each other, a collection of one has to take place at the same time as a collection of the other. If this weren't the case, then, for example, the data heap could be prematurely exhausted: imagine that 1000 quotations are made and immediately discarded, each of which points to a 10MB array. The data heap will fill up, but the code heap won't be collected, so none of the arrays can be freed.

Future plans

I plan on continuing to try to work on Factor's garbage collector, both in small optimizations and bigger changes to policy. By "policy" I mean the strategies the GC uses to be efficient. Policy ideas that I'm looking at include:
  • Reexamining generational policy
    Factor uses three generations to make sure that not just anything gets into tenured space. This idea isn't new, but most modern collectors use two generations, combined with a more subtle strategy to make sure objects have to survive a few collections before being promoted.

    The JVM's (Java) nursery is, in effect, like a nursery and an aging space, except that when the nursery fills up, there is an aging space collection. This simplifies the cards needed, and the write barrier can be made more efficient, possibly. But when I made a naive implementation of this in Factor, I got mixed results on benchmarks, so I didn't commit the changes.

    Another possibility is to say that an object has to survive n nursery collections to be promoted to tenured space. This can be done most easily by having n nursery-sized spaces that things are copied through. On each nursery collection, objects are promoted one level, and objects on the highest nursery level are promoted to tenured space. Ideally there's some way to take up less space than that. GHC's (Haskell) runtime uses this strategy, and the heap's structure allows the higher levels to only use as many pages of memory as they need.

  • Bounded variable-sized nursery
    Right now, Factor's nursery (and aging space) is fixed in size. But it's been known for some time that this isn't optimal: a tenured collection can be delayed slightly if the nursery shrinks when tenured space starts to fill up, allowing more of the heap to be used at once. A slight delay in tenured space collection translates into higher throughput, which we want.

  • Lower space overhead for tenured space
    Right now, with a semispace copying collector managing tenured space, Factor needs about three times as much memory to run as there is live data. I say three times rather than two because, with less than that, copying becomes too frequent and performance degrades. Mark-sweep and mark-compact are alternatives for managing the old generation with much lower space overhead, but they both have disadvantages. Mark-sweep can cause memory fragmentation, and allocation from a free list is usually slower than pointer bumping (though this isn't necessarily true). And it's difficult to make mark-compact collection efficient.

    Nevertheless, they'd be in some ways an improvement over the existing collector, because a relatively small footprint is very important for the desktop and embedded domains. Once one of these two collectors is implemented (or maybe some simple combination of the two), it could be used as a basis for implementing something like MC2, Immix, the Compressor or a parallel or concurrent mark-sweep collector.

Tuesday, June 10, 2008

An idea for garbage collection research

This summer, I'm participating in an REU (Research Experience for Undergraduates) at Harvey Mudd College in southern California. For the past week or so, we've been trying to come up with ideas for what to research. I think I've stumbled on a reasonable one, though it might take a long time to implement. Hopefully, I'll spend the summer implementing it with a co-conspirator, Tony Leguia.

The idea is to combine MC2 and Sapphire. MC2 is a copying algorithm which can be incremental and which has much lower memory overhead than the normal semispace algorithm, and Sapphire is an algorithm for concurrent copying GC. Together, they could form an efficient algorithm for concurrent compacting garbage collection with minimal space overhead. This has been done before, but this idea also allows pointer-bumping allocation and requires no reliance on hardware for things like memory page protection. I haven't found a paper which does all of these things at once.

Motivation


For me, but of course not for anyone else except maybe Tony, the motivation is to see how far this mark-copy algorithm can go. As far as I know (and as far as the co-author of the paper that I contacted knows), there have only been two papers published on it, the original one and MC2, neither of which mentions multiple processors.

One idea is that this would be useful for large servers with many cores that are under memory pressure. Existing algorithms which depend on page faults can be slow when those faults occur. Normal semispace collectors do poorly under memory pressure, usually. With a concurrent collector like this, lower pause times could be achieved, giving faster response times to clients. But, for servers, other techniques might actually be more relevant, like making separately collected nurseries for each thread.

In theory, a concurrent collector would increase throughput, too, especially for programs that have fewer threads than the machine they're running on has cores. But this is a somewhat different use case than the server. Picture a single-threaded media player running on a two-core desktop machine. Really, on a desktop, all programs should have as small a footprint as possible and as short pause times as possible, and this idea can hopefully provide that. The lack of dependence on page protections should also help performance as compared to other algorithms.

The third use case is an embedded system with multiple processors, which is apparently increasingly common. A concurrent version of MC2 would be useful for the same reasons as MC2 is on uniprocessor embedded systems.

I'm not positive that this is the best way to go about it. Maybe I really want a parallel collector, along the lines of GHC's recent parallel copying collector. As that paper discusses, it's pretty tricky to actually get a performance improvement, but it ends up being worth the effort throughput-wise. Maybe I should be making mark-copy parallel. This is a different research project, but it might be one that we end up looking into more. Maybe we could even make a parallel concurrent collector this way! But definitely not over the next 9 weeks, which is how much time I have left.

Details


I wrote about MC2 earlier, but not about Sapphire. Sapphire is a mostly concurrent copying collector that works without a read barrier of any kind. I say mostly concurrent because some synchronization needs to be done surrounding the program stack, but not very much. Previous concurrent copying collectors worked with a read barrier that preserved the invariant that the mutator can only see things that have been copied into tospace. Sapphire, on the other hand, uses a write barrier to preserve the invariant that, roughly, during copying, fromspace and tospace are consistent.

The collector consists of four phases: (1) mark, (2) allocate, (3) copy, (4) flip. In the mark phase, a concurrent snapshot-at-the-beginning mark is done. In the allocate phase, space in tospace is set up for each object, and non-clobbering pointers are made to tospace from everything in fromspace. They can be non-clobbering because they take up some space in the header, and the old header can be put in tospace. Next, copy fromspace to tospace. The allocate step laid things out in breadth-first order, so Cheney's algorithm works here. In the flip step, the stack and other things outside of the space being copied have their pointers changed from pointing to fromspace to pointing to tospace. Throughout this, the write barrier ensures that modifications to fromspace are propagated to tospace.

As far as I can tell, it takes nothing special idea-wise to combine these two things. The real work would be in the implementation details, and the benchmarks proving that this is possible and advantageous. One simplification we could make (though I'm not sure if it's a good idea in terms of locality) is that forwarding pointers and "reverse forwarding pointers" could each be held in window-sized blocks. So, overall, the collector would consist of
  1. A mark phase, run concurrently with the mutator, using either the incremental update or snapshot variant. This would also collect the additional data that the later phases need, such as the remembered sets and the liveness information used for grouping.
  2. A grouping phase as in MC2
  3. For each group:
    1. Allocate tospace pointers in the empty window
    2. Copy fromspace to tospace
    3. Flip all of the pointers recorded in the remembered set for fromspace

The Sapphire paper discusses condensing mark, allocate and copy phases into a single replicate phase. Within this framework, we could only combine the allocate and copy phases.

Related work


The number of other garbage collection systems trying to achieve the same goals is so dense and diverse that I feel hesitant to join in. But here are a bunch of related things, aside from what's already been mentioned.
  • One of the first incremental (and extendible to concurrent) copying collectors is that of Henry Baker. But this uses a read barrier, a little bit of code inserted before every pointer read, which ensures that it's not looking at fromspace.
  • An Appel-Ellis-Li-style copying collector (yes, they actually call it that) uses memory page permissions to make, in effect, a free-unless-it-gets-triggered read barrier maintaining the invariant that everything the mutator sees is in to-space. If it gets triggered, it's expensive, but the hope is that it won't get triggered very often because the concurrent copying will go fast enough.
  • The Compressor is a garbage collector which, I believe, maintains invariants similar to the Appel-Ellis-Li collector but uses this for a unique type of mark-compact collector which operates in just one pass. The algorithm is both parallel and concurrent.
  • There's been a lot of work in concurrent mark-sweep, but this leads to fragmentation over time. So a system for occasional mostly concurrent compaction has been developed.

In all but the last of these systems, a read may be expensive, due either to an explicit read barrier or the possibility of a page protection fault. The last one makes allocation (into the generation that is mark-swept) expensive due to a free list. Nevertheless, it might be reasonable to look into using an Appel-Ellis-Li-style barrier in place of Sapphire's write barrier.

Our plan


The plan is to implement a basic mark-copy collector, maybe on the Factor runtime, and then make it concurrent (or maybe parallel) somehow. At each step, we'll try to do the most basic thing possible. If we could get a concurrent or parallel version of mark-copy done by the end of this summer, I'll be happy. This'll be done while writing the paper (which this could be considered the first draft for). Optimizations along the line of MC2 and the finishing touches on the paper can be done after the summer, as long as we get most things done over the next 9 weeks.

It's an exciting time to be involved in academic computer science because so many basic results have been elaborated by now. The only problem is the minefield of patents (especially dense in the practical field of garbage collection) and the fact that everyone else has thought of your idea before you.

Saturday, June 7, 2008

A second introduction to Unicode

If you're a normal programmer, you probably really don't want to have to think about Unicode. In what you're doing, text processing probably isn't a very important aspect, and most of your users will be using English. Nevertheless, text has a tendency to creep its way into almost everything, as the basic computer/human interface. So it might be a little beneficial to know about the basics of text processing and the Unicode character set.

A lot of people have written blog posts which are introductions to Unicode, and I didn't want to write another one with no new information in it. A popular one is Joel Spolsky's (of Joel on Software), which describes what Unicode is and why it's important. You've likely already read an introduction to Unicode, so I'll just summarize the most important points:
  • You can't assume text is in ASCII anymore This isn't just about being nice to non-English speakers. Even Americans enjoy their “curly quotes”, their cafés—and their em dashes. User input might come with these non-ASCII characters, and it must be handled properly by robust applications.
  • Unicode is the character set to use internally A bunch of character sets have been developed over the years for different purposes, but Unicode can represent more scripts than any one other character set. Unicode was designed to be able to include the characters from basically all other character sets in use. If you're using C or C++, wchar_t rather than char for strings works for most cases. If you're using a higher level language, then strings should already be stored in some representation that allows for Unicode uniformly.
  • There are several text encodings Not all text is in ASCII, and very little text is in the most common 8-bit extension, Latin 1 (ISO-8859-1). Lots of input is in UTF-8, which can represent all of Unicode, but there are other Unicode encodings, as well as specific East Asian encodings like GB 2312 and Shift JIS, in active use. Generally, UTF-8 should be used for output, and it's on the rise in terms of usage. Depending on the programming language or library used, you might have to account for the encoding when doing text processing internally. UTF-16 and UTF-8 are the most common, and careless programming can get meaningless results in non-ASCII or non-BMP cases if the encoding is ignored.
  • Unicode encodes characters, not glyphs Unicode can be seen as a mapping between numbers, called code points, and the things they stand for. It's been decided that this basic unit is the character, like letters and spaces, rather than specific presentation forms, which are referred to as glyphs. Glyphs are something that only font designers and people who work on text rendering have to care about.

But there's a little bit more that programmers have to know about. Unicode is part of a bigger program of internationalization within a single framework of encodings and algorithms. The Unicode standard includes several important algorithms that programmers should be aware of. They don't have to be able to implement them, just to figure out where in the library they are.
  • Normalization Because of complications in the design, some Unicode strings have more than one possible form that are actually equivalent. There are a number of normalization forms that have been defined to get rid of these differences, and the one you should use is probably NFC. Usually, you should normalize before doing something like comparing for equality. This is independent of locale. (A tiny example of normalization and case conversion in practice follows this list.)
  • Grapheme, word, sentence and line breaks It's not true, anymore, that a single character forms a single unit for screen display. If you have a q with an umlaut over it, this needs to be represented as two characters, yet it is one grapheme. If you're dealing with screen units (imagine an internationalized Hangman), a library should be used for grapheme breaks. Similarly, you can't identify words as things separated by spaces or punctuation, or line break opportunities by looking for whitespace, or sentence breaks by looking just at punctuation marks. It's easy to write a regular expression which tries to do one of these things but does it wrong for English, and it's even easier to do it wrong for other languages, which use other conventions. So use a Unicode library for this. The locale affects how these breaks happen.
  • Bidirectional text When displaying text on a screen, it doesn't always go left to right as in most languages. Some scripts, like Hebrew and Arabic, go right to left. To account for this, use the Unicode Bidirectional Text Algorithm (BIDI), which should be implemented in your Unicode library. Locale doesn't matter here.
  • Case conversion Putting a string in lowercase is more complicated than replacing [A-Z] with [a-z]. Accent marks and other scripts should be taken into account, as well as a few weird cases like the character ß going to SS in upper case. The locale is also relevant in case conversion, to handle certain dots in Turkish, Azeri and Lithuanian.
  • Collation There's an algorithm for Unicode collation that works much better than sorting by ASCII value, and works reasonably for most languages. Depending on the locale, it should be modified. Even in English, the Unicode Collation Algorithm produces much more natural results. Parts of the collation key can be used for insensitive comparisons, eg. ignoring case.
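As a tiny illustration of the normalization and case conversion points (using Python's standard library here, but any decent Unicode library has equivalents):

import unicodedata

def loosely_equal(a, b):
    # Normalize to NFC first, then compare with proper case folding,
    # not a naive [A-Z] -> [a-z] mapping.
    na = unicodedata.normalize('NFC', a)
    nb = unicodedata.normalize('NFC', b)
    return na.casefold() == nb.casefold()

print(loosely_equal('Caf\u00e9', 'Cafe\u0301'))  # True: composed vs. decomposed e-acute
print('stra\u00dfe'.casefold() == 'strasse')     # True: the ß case mentioned above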

For further reading, you can look at this much more in-depth article from SIL, or the Unicode 5.1 standard itself, which isn't that bad. Most programmers can't be expected to know all of this stuff, and they shouldn't. But it'd be nice if everyone used the appropriate library for text processing when needed, so that applications could be more easily internationalized.

Sunday, May 25, 2008

Unicode collation works now

This morning I got Unicode collation to pass all of the 130,000+ unit tests. It was a bit more difficult than I imagined, and it's still far from complete for a number of reasons. The whole thing is less than 200 lines (the core algorithm in about 100) in the unicode.collation vocab in the working version of Factor in git. Here's what I figured out:
  • Weird format of UnicodeData.txt It's not documented anywhere that I can find, but the UnicodeData.txt resource file has a weird range format for specifying certain properties, including character classes, which are used in collation. It looks just like two regular lines, but they have weird names for the characters that apparently need to be parsed. For example, lines that look like
    100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
    10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
    mean that all of the characters in the range U+100000 to U+10FFFD have the category Co, the combining class 0, etc. (A sketch of parsing this range convention appears after this list.)
  • My normalization bugs Working on this uncovered a bunch of bugs in older code, including that my conjoining Jamo behavior inserted nonexistent terminator characters.
  • Triple contractions The UCA specifies that collation graphemes should be found by checking whether an adjacent character or non-blocked combining character forms a contraction with the previous characters. But this incremental approach doesn't work with four of the contractions listed in the DUCET, which consist of three elements rather than two, and whose first two elements don't themselves form a contraction. So a simple identity contraction for the first two elements of each of those must be added.
  • Combining character contractions Apparently, two combining marks can form a contraction. A straight reading of the UCA wouldn't predict this, but not all of the UCA tests pass unless you check for non-adjacent combining marks being in a contraction together, without a noncombining mark to start it off.
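For what it's worth, here's roughly how that range convention can be handled when parsing the file (a Python sketch that keeps just the fields after the name; not tested against every corner of the real file):

def parse_unicode_data(lines):
    properties = {}
    range_start = None
    for line in lines:
        fields = line.strip().split(';')
        code, name = int(fields[0], 16), fields[1]
        if name.endswith(', First>'):
            range_start = (code, fields[2:])    # remember where the range starts
        elif name.endswith(', Last>'):
            first_code, data = range_start
            for c in range(first_code, code + 1):
                properties[c] = data            # the whole range shares the properties
            range_start = None
        else:
            properties[code] = fields[2:]
    return properties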

And here's what I still have to do:
  • Korean stuff Because of some disagreement with the ISO people, the DUCET doesn't actually decide the best way to sort Korean. Instead, they provide three methods, all of which require modifying the table. I don't really understand the issue right now.
  • Tailoring for locales Actually, heh, the default algorithm is inaccurate for any specific locale you might be in. And, for human interfaces, it's actually pretty important that the sort order corresponds to expectations. So, if you want to do sorting that's correct, you have to modify the data. Unfortunately, the standard doesn't go into the specific algorithms for tailoring the table, though there is data available through the Common Locale Data Repository (CLDR).
  • Efficiency My implementation is both time and space inefficient, because I paid absolutely no attention to those, because solving the basic problem is hard enough (for me). Collation keys should be made shorter, and they should be made in fewer passes over the string.


Update: Here's an overview of the words that are defined in the vocabulary. It's about the minimum that any UCA implementation should have, in my opinion:
  • sort-strings This word takes a sequence of strings and sorts them according to the UCA, using code point order as a tie-breaker.
  • collation-key This takes a string and gives a representation of the collation key, which can be compared with <=>.
  • string<=> This word takes two strings and compares them using the UCA with the DUCET, using code point order as a tie-breaker.
  • primary= This checks whether the first level of collation is identical. This is the least specific kind of equality test. In Latin script, it can be understood as ignoring case, punctuation and accent marks.
  • secondary= This checks whether the first two levels of collation are equal. For Latin script, this means accent marks are significant again.
  • tertiary= Along the same lines as secondary=, but case is significant.
  • quaternary= This is a little less typical (and definitely non-essential, unlike the other things), and it's my own nomenclature, but it makes punctuation significant again, while still leaving out things like null bytes and Hebrew vowel marks, which mean absolutely nothing in collation.

Friday, May 23, 2008

Little things I've been working on

I've been working on a few different things that, individually, are too insignificant for a blog post, so I'll put them together.

One is, I expanded my previous survey of algorithms for XML diffing, and the result is here [PDF].

I've been working on a few libraries in Factor. One is extra/delegate, which now interacts properly with reloading. For example, if you define a protocol, then say that a class consults something for that protocol, and then redefine the protocol to include more generic words, the consulting class will be automatically updated. Unfortunately, this doubled the size of the code, give or take. Slava changed duplex-streams to use extra/delegate, and the code is much simpler now, as it previously amounted to manual delegation. I got rid of mimic because it's unsafe and violates encapsulation in unexpected ways.

Another little thing I made was extra/lcs, a library for calculating Levenshtein distance between two strings, the longest common subsequence of two strings, and an edit script between two strings. Because the LCS problem and Levenshtein distance are duals, I was able to share most of the code between them. I used the simple quadratic time and space algorithm that Wikipedia describes rather than the better O(nd) algorithm [PDF] commonly called the GNU diff algorithm. I'll upgrade it to this once I understand the algorithm. This replaces extra/levenshtein. I expected it to be significantly faster, because the old code used dynamically scoped variables and this uses statically scoped locals, but the speed improvement turned out to be just around 4% in small informal benchmarks on short strings.
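For the curious, the shared structure is easy to see in the quadratic dynamic program itself (sketched here in Python rather than Factor; this isn't the extra/lcs code, just the same idea):

def lcs_length(a, b):
    # table[i][j] = length of the longest common subsequence of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def levenshtein(a, b):
    # Same table shape; only the base cases and the costs in the recurrence change.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = i
    for j in range(len(b) + 1):
        table[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,        # deletion
                              table[i][j - 1] + 1,        # insertion
                              table[i - 1][j - 1] + cost) # substitution
    return table[len(a)][len(b)]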

Now, I'm working on the Unicode Collation Algorithm. The basics were simple, but I'm still unsure how to recognize collation graphemes efficiently in general. Either way, I discovered a bug in normalization: my insertion sort, used for canonical ordering of combining marks, wasn't a stable sort as required for normalization. It was actually an anti-stable sort: it reversed subsequences which were of the same sort key. That was really stupid of me. I'm going to work on incorporating existing test suites for things like this. For the test suite for collation, all but 8000 of 130,000 tests pass, making it far from ready.

Sunday, May 18, 2008

Writings on regexp group capture

So, in researching regular expression group capture, I had a little bit of trouble. It turns out that some people call it "capture groups", others call it "submatch extraction" and some people call it "subexpression match". In Google, it looks like "submatch extraction" gets the most research hits, and "subexpression match" is the most broadly used.

With that behind me: I'm not the first one to come up with an algorithm for group capture in regular expressions in linear time and space. (My algorithm was, basically: annotate the NFA states which lie on a group boundary, then turn this into a DFA which marks a location in the string when that state could be entered. Run this, and then run the same thing on the reverse regular expression, putting the string in backwards, and find the intersection between the possible points of group boundary. Then, take the first possible group boundary point for each one, or the last. This can be proven correct easily in the case of one boundary point: if a proposed boundary is in the set marked for the forward pass and the backward pass, then the part before the boundary matches the first part of the regexp, and the part after the boundary matches the second part.)

Actually, there's been a bit of research here over the past 20 years. I haven't read the following papers very closely (though I plan to), but for anyone interested in understanding how to process regular expressions efficiently to get a parse tree, here are a few interesting papers:

All of these papers go about submatch extraction in somewhat difficult ways. I hope I helped someone avoid a difficult literature search like I had.

Update: It seems the best way to do a literature search is to blog about something, and have commenters give you relevant papers. Here's one by Burak Emir describing how to get the shortest match (think non-greedy, but globally optimal) with group capture, taking advantage of transformations of regexes. Thanks, Alain Frisch!

Saturday, May 10, 2008

Parsing with regular expressions and group capture

Update: This idea is completely not new. See Ville Laurikari's master's thesis, Efficient Submatch Addressing for Regular Expressions, especially chapter 2.

Though I'd rather avoid it, string parsing is a crucial part of programming. Since we're more than 60 years into the use of modern computers, it seems like we should have a pretty good handle on how to build abstractions over parsing. Indeed, there are tons of great tools out there, like GNU Yacc, Haskell's Parsec, newer Packrat-based parsers like Factor's EBNF syntax for PEGs, and a bunch of other high level parsing libraries. These libraries are relatively easy to use once you understand the underlying structure (each one parses a different subset of context-free grammars), because they expose the programmer to a tree-like view of the string.

However, these incur too much overhead to be used for certain domains, like the parsing that goes on in an HTTP server or client. They're really overkill when, as in the case of HTTP interchanges, what you're dealing with is a regular language and processing can be done on-line. (I'll get back later to what I mean by those two things.) The main tools that exist to deal with this are Lex and Ragel.

Ragel seems like a really interesting solution for this domain. The entire description of parsing is eventually put in one regular expression, which is compiled to a DFA, where states and transitions can be annotated by actions. Fine-grained control is given to limit non-determinism. But the user must be acutely aware of how regular expressions correspond to DFAs in order to use the abstraction. So it is somewhat leaky. Also, it's difficult to get a tree-like view on things: actions are used purely for their side effect.

So, here's an idea: let's find a middle ground. Let's try to use regular expressions, with all of their efficiency advantages, but get an abstract tree-like view of the grammar and an ability to use parsing actions like high-level abstractions allow. Ideally, the user won't have to know about the implementation beyond two simple facts: regular languages can't use general recursion, and nondeterminism should be minimized.

This isn't something that I've implemented, but I have a pretty good idea for the design of such a system, and I wanted to share it with you. First, a little background.

DFAs and regular expressions


I'm taking a computer science class about this right now, so I'm gonna be totally pedantic. When I say regular expression, I mean an expression that describes a regular language. Perl regexes aren't regular expressions (and Larry Wall knows this). If you don't feel like putting on your theoretician's hat, this blog post will be mostly meaningless.

What's a regular language? First off, a language is a set of strings. We care about infinite sets of strings, since finite sets are trivial to represent. If a string is in the language, that means that the language matches the string, intuitively. A regular language is one which can be represented by a deterministic finite automaton (DFA) without extensions, also called a finite state machine (FSM) for some reason. Many useful languages are regular, and many are not.

The idea of a DFA is a finite number of states and a transition function, which takes the current state and a character of a string and returns the next state. The transition function is defined on all states and all characters in the alphabet. There is a set of final states, and if the string runs out when the machine is in a final state, then the string is accepted. The language of the DFA is the set of strings accepted by the DFA. For a given DFA, that DFA can be run in linear time with respect to the length of the input string and constant space. It can also be run "on-line", that is, without the whole string in advance, going incrementally.

A related construction is an NFA, or nondeterministic finite automaton. Imagine the previous idea, but instead of a transition function, there is a transition relation. That is, for any character and current state, there are zero or more next states to go to, and the NFA always picks the right one. This is called nondeterminism (at least that's what it means here). Amazingly, NFAs can accept only regular languages and nothing more, because NFAs can be translated into DFAs. Basically, you build a DFA which picks all possible states at once, given all possible paths through the NFA. Potentially, though, there's an exponential blowup in the number of states.
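For concreteness, here's the subset construction in miniature (Python; epsilon transitions are left out for brevity, the NFA is given as a dict from (state, character) to a set of next states, and accepting is its set of accepting states):

from itertools import chain

def nfa_to_dfa(delta, start, accepting, alphabet):
    start_subset = frozenset([start])
    transitions = {}   # DFA state (a frozenset of NFA states) -> {character: DFA state}
    work = [start_subset]
    while work:
        subset = work.pop()
        if subset in transitions:
            continue
        transitions[subset] = {}
        for ch in alphabet:
            nxt = frozenset(chain.from_iterable(delta.get((s, ch), ()) for s in subset))
            transitions[subset][ch] = nxt
            work.append(nxt)
    # (The empty frozenset becomes the dead state, looping to itself.)
    dfa_accepting = {subset for subset in transitions if subset & accepting}
    return transitions, start_subset, dfa_accepting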

Every regular expression can be converted into an equivalent NFA, which can be converted into a DFA, which can then be converted back into a regular expression. They're all equivalent. So then what's a regular expression? There are different ways to define it. One is that you can build up a regular expression from the following elements: the epsilon regex (matching the empty string), the empty regex (matching nothing), single character regexes (matching just a single character), concatenation (one followed by another), disjunction (or) and the Kleene star (0 or more copies of something). Counterintuitively, it's possible to construct regexes which support negation, conjunction, lookahead and other interesting things.

The most important distinction from Perl regexes is that regular expressions cannot contain backreferences, because these are provably impossible to express in a DFA. It's impossible to parse something with backreferences in the same linear time and constant space that you get from regexes which are regular. In fact, parsing patterns with backreferences is NP-hard and not believed possible in polynomial time (with respect to the length of the input string). Since regular expressions which are regular give us such nice properties, I'm going to stick to them.

Regular expressions in practice in parsing today


The formal study of regular languages is a basically solved problem within the formalism itself: they are equivalent to DFAs, and satisfy a convenient set of properties summarized by the pumping lemma and the Myhill-Nerode theorem. The problem is just: is a given string a member of the language? Which languages are regular?

This was solved in the 1950s and 1960s, and the basic results are in most introductory compiler books. Those books use the solution to build lexers, like Lex. Lex basically takes a list of regular expressions, each associated with an action, finds one of them to match maximally with the current input, executes the associated action on the portion that matches, and then repeats with the rest of the string. This is useful to build lexers, but the programmer has very little context for things, so it's difficult to use for much else.

More recently, Ragel has been used as a way to parse more complicated things using regular expressions. Its strategy is to turn its compile-time input into one big regular expression, annotated with actions on certain states or transitions. The actions are fragments of C code, and they form the processing power of the machine. However, their behavior can get a little unintuitive if too much nondeterminism is used, so Ragel provides a bunch of tools to limit that. Also, Ragel lets you explicitly specify a DFA through transitions, which seems useful but low-level.

Group capture with regexes


One of the most useful features of Perl's regular expressions is group capture. By this, I mean how you can do something like s/^(1*)(0*)$/$2$1/ to swap the ones and zeros in a string. This is different from backreferences (like the non-regular expression /(.*)\1/) because it's only used in subsequent code, to figure out what got matched to which part of the regex. It doesn't parse any languages which aren't regular, but it's a useful tool for processing.

Curiously, this has been ignored both by academics and DFA implementors so far. I hypothesize that it's been ignored by theorists for two reasons: (1) It's easy to confuse with backreferences, which make the language non-regular, which is totally uninteresting to theorists (2) They're not part of the formalism of regular expressions as previously expressed.

Implementors of (non-Perl-based) regular expression-based parsing mechanisms tend to avoid group capture because, in the general case, it's not fast enough and can't be done on-line. Also, as far as I can tell, it hasn't been implemented any other way than by interpreting an NFA, using backtracking, and keeping track of where the parser is within the regex to determine group boundaries. This would be terrible for the domain of Lex and Ragel. By "on-line" I don't mean on the internet, but rather an algorithm that can be performed incrementally, getting little pieces (characters, say) of the input and doing computation incrementally, as the input is received, without storing the whole thing and running the algorithm all at once.

So how can we do group capture on-line? Well, in general, we can't. Consider the regular expression (1*)1, where you're trying to capture the group 1*. As the input is being processed, we don't know when we've gotten to the end of the group until the entire input is over, since if there are two more 1's, then the first one must be in the group. However, in many cases group capture can in fact be done on-line, as in (0*)(1*), where the groups captured are 0* and 1*. As the regex processes the string, it knows that, if there is a match, the group boundary is just before the first 1. This can be formalized as a "boundary of determinism": a point where the subset construction used to form a DFA from the NFA yields a subset containing exactly one state.
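
To illustrate, here's a toy Python sketch (mine, not a real implementation) that simulates the NFA for (0*)(1*) by subset construction on the fly and reports where the reachable subset collapses to a single state; the hand-built transition table is just for this one regex:

EPSILON = None

# States: 'A' is inside group (0*), 'B' is inside group (1*).
TRANSITIONS = {
    ('A', '0'): {'A'},
    ('A', EPSILON): {'B'},
    ('B', '1'): {'B'},
}

def epsilon_closure(states):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in TRANSITIONS.get((s, EPSILON), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def boundaries(input_string, start='A'):
    """Yield (characters consumed, state) wherever the subset is a singleton."""
    current = epsilon_closure({start})
    for i, ch in enumerate(input_string):
        nexts = set()
        for s in current:
            nexts |= TRANSITIONS.get((s, ch), set())
        current = epsilon_closure(nexts)
        if len(current) == 1:
            yield i + 1, next(iter(current))

# The subset first collapses to {'B'} right after the first 1 is read,
# which is exactly when the boundary of group (0*) is known.
print(list(boundaries("00011")))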

I believe this can handle most cases of group capture in practice, if the regular expression is well-written, but surely not all of them. I have an idea for how to do group capture in the few remaining circumstances, but unfortunately it takes linear space and it's not online. I'll blog about it once I have a proof of correctness.

Hierarchical parsing using group capture


Using this group capture mechanism, we can build a hierarchical parsing mechanism with actions on different things, which can be used to parse regular languages in a higher-level way. Regular expressions can't use arbitrary recursion like context-free grammars can, so the parse tree will be of bounded depth, but it could still be useful. In designing this, I'm thinking specifically about making a SAX-like XML parser. It'd be awkward to write everything out as one big regular expression, but split into smaller pieces, each with its own little step in processing, it could be much more elegant. My goal is an EBNF-like syntax, similar to what Chris Double's PEGs library in Factor uses. Here's some future pseudocode for how it could look in parsing an XML tag, simplified. (In this code, > is used like Ragel's :>>, to indicate that as soon as the expression after it can be matched, it is matched (basically).)

REG: tag
chars = "&" entity:any* > ";" [[ entity lookup-entity ]]
| any
string = "\"" str:chars > "\"" [[ str ]]
| "'" str:chars > "'" [[ str ]]
xml-name = name-start-char name-char*
attribute = name:xml-name ws "=" ws str:string [[ name str 2array ]]
tag = "<" ws closer?:("/"?) name:xml-name attrs:(attribute*) ws contained?:("/"?) ws ">" [[ ... ]]

Conclusion


Though I haven't implemented this yet, and probably shouldn't even be talking about it, I'm really excited about this idea. I even came up with a stupid little name for it: Hegel, both for High-level Ragel and because it represents the synthesis of the dialectic (as described by Hegel) of slower, higher-level parsing and fast, low-level parsing into fast, high-level parsing of regular languages. I hope it works.

Wednesday, May 7, 2008

Interval maps in Factor

Recently, I wrote a little library in Factor to get the script of a Unicode code point. It's in the Factor git repository in the vocab unicode.script. Initially, I used a relatively simple representation of the data: there was a byte array, where the index was the code point and the elements were bytes corresponding to scripts. (It's possible to use a byte array because there are only seventy-some scripts to care about.) Lookup consisted of char>num-table nth num>name-table nth. But this was pretty inefficient. The largest code point (that I wanted to represent here) was something around number 195,000, meaning that the byte array took up almost 200Kb. Even if I somehow got rid of that empty space (and I don't see an obvious way how, without a bunch of overhead), there are still 100,000 code points whose script I wanted to encode.

But we can do better than taking up 100Kb. The thing about this data is that scripts are in a bunch of contiguous ranges. That is, two characters that are next to each other in code point order are very likely to have the same script. The file in the Unicode Character Database encoding this information actually uses special syntax to denote a range, rather than write out each one individually. So what if we store these intervals directly rather than store each element of the intervals?

A data structure to hold intervals with O(log n) lookup and insertion has already been developed: interval trees. They're described in Chapter 14 of Introduction to Algorithms starting on page 311, but I won't describe them here. At first, I tried to implement these, but I realized that, for my purposes, they're overkill. They're really easy to get wrong: if you implement them on top of another kind of balanced binary tree, you have to make sure that balancing preserves certain invariants about annotations on the tree. Still, if you need fast insertion and deletion, they make the most sense.

A much simpler solution is to just have a sorted array of intervals, each associated with a value. The right interval, and then the corresponding value, can be found by simple binary search. I don't even need to know how to do binary search, because it's already in the Factor library! This is efficient as long as the interval map is constructed all at once, which it is in this case. By a high constant factor, this is also more space-efficient than using binary trees. The whole solution takes less than 30 lines of code.

(Note: the intervals here are closed and must be disjoint, and <=> must be defined on the keys. I don't use the intervals in math.intervals, both to save space and because they're overkill here. Interval maps don't follow the assoc protocol because intervals aren't discrete; for example, floats are acceptable as keys.)

First, the tuples we'll be using: an interval-map is the whole associative structure, containing a single slot for the underlying array.

TUPLE: interval-map array ;

That array consists of interval-nodes, which have a beginning, end and corresponding value.

TUPLE: interval-node from to value ;

Let's assume we already have the sorted array of intervals. Given a key and an interval map, find-interval will give the index of the interval which might contain the given key.

: find-interval ( key interval-map -- i )
[ from>> <=> ] binsearch ;

interval-contains? tests if a node contains a given key.

: interval-contains? ( object interval-node -- ? )
[ from>> ] [ to>> ] bi between? ;

Finally, interval-at* searches an interval map to find a key, finding the correct interval and returning its value only if the interval contains the key.

: fixup-value ( value ? -- value/f ? )
[ drop f f ] unless* ;

: interval-at* ( key map -- value ? )
array>> [ find-interval ] 2keep swapd nth
[ nip value>> ] [ interval-contains? ] 2bi
fixup-value ;

A few convenience words, analogous to those for assocs:

: interval-at ( key map -- value ) interval-at* drop ;
: interval-key? ( key map -- ? ) interval-at* nip ;

So, to construct an interval map, there are a few things that have to be done. The input is an abstract specification, consisting of an assoc where the keys are either (1) 2arrays, where the first element is the beginning of an interval and the second is the end, or (2) numbers, representing an interval of the form [a,a]. This can be converted into a form where everything is of form (1) with the following:

: all-intervals ( sequence -- intervals )
[ >r dup number? [ dup 2array ] when r> ] assoc-map
{ } assoc-like ;

Once that is done, the objects should be converted to intervals:

: >intervals ( specification -- intervals )
[ >r first2 r> interval-node boa ] { } assoc>map ;

After that, and after the intervals are sorted, we need to make sure that all the intervals are disjoint. For this, we can use the monotonic? combinator, which checks that all adjacent pairs in a sequence satisfy a predicate. (This is more useful than it sounds at first.)

: disjoint? ( node1 node2 -- ? )
[ to>> ] [ from>> ] bi* < ;

: ensure-disjoint ( intervals -- intervals )
dup [ disjoint? ] monotonic?
[ "Intervals are not disjoint" throw ] unless ;

And, to put it all together, using a tuple array for improved space efficiency:

: <interval-map> ( specification -- map )
all-intervals [ [ first second ] compare ] sort
>intervals ensure-disjoint >tuple-array
interval-map boa ;

All in all, in the case of representing the table of scripts, a table which was previously 200KB is now 20KB. Yay!

Saturday, May 3, 2008

A couple GC algorithms in more detail

In previous posts on garbage collection, I've given a pretty cursory overview as to how things actually work. In this post, I hope to give a somewhat more specific explanation of two incremental (and potentially concurrent or parallel, but we'll ignore that for now) GC algorithms: Yuasa's snapshot-at-the-beginning incremental mark-sweep collector, and the MC2 algorithm. Yuasa's collector is very widely used, for example in Java 5 when an incremental collector is requested. MC2 is a more recent algorithm designed to reduce the fragmentation that mark-sweep creates, and appears to get great performance, though it isn't used much yet. In their practical implementation, both collectors are generational.

Yuasa's mark-sweep collector


The idea is pretty simple: take a mark-sweep collector and split up the work, doing a little bit on each allocation. When the heap occupancy passes a certain threshold, say 80%, switch into the mark phase, and on each allocation, mark enough of the heap that everything is marked by the time the heap is full. (You can ensure this by making the amount of marking proportional to the amount of memory allocated.) Then, switch into the sweep phase, and on each allocation sweep the heap by a certain amount. If a big object is allocated, sweeping continues until there's enough room. Once sweeping is done, the collector returns to a neutral state, and allocation takes place without any special collection actions until heap occupancy again passes the threshold.
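
Here's a rough Python sketch (not Factor, and not from Yuasa's paper) of just this bookkeeping: how the phases and the per-allocation work increments fit together. The constants, the names, and the stand-in predicates at the bottom are all made up:

HEAP_SIZE = 1000          # all of these numbers are made-up knobs
START_THRESHOLD = 0.8     # begin marking at 80% occupancy
MARK_RATIO = 4            # marking work done per unit allocated
SWEEP_RATIO = 4           # sweeping work done per unit allocated

class YuasaSketch:
    def __init__(self):
        self.phase = 'idle'      # 'idle' -> 'marking' -> 'sweeping' -> 'idle'
        self.occupancy = 0

    def allocate(self, size):
        if self.phase == 'idle' and self.occupancy + size > HEAP_SIZE * START_THRESHOLD:
            self.phase = 'marking'
        if self.phase == 'marking':
            self.do_marking_work(size * MARK_RATIO)
        elif self.phase == 'sweeping':
            self.do_sweeping_work(size * SWEEP_RATIO)
        self.occupancy += size
        # a real allocator would hand out `size` bytes from the free list here

    def do_marking_work(self, budget):
        # a real collector would pop grey objects off the mark stack and mark
        # their children until `budget` is used up; this stand-in only handles
        # the phase change once the mark stack runs dry
        if self.mark_stack_empty():
            self.phase = 'sweeping'

    def do_sweeping_work(self, budget):
        # a real collector would scan the next `budget` worth of the heap,
        # returning unmarked objects to the free list; once the whole heap is
        # swept, marks are cleared and the collector goes back to 'idle'
        if self.heap_fully_swept():
            self.phase = 'idle'

    # stand-ins for real collector state, just so the sketch runs:
    def mark_stack_empty(self):
        return True

    def heap_fully_swept(self):
        return True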

Making this work


This is a neat little way to specify a GC algorithm. The implementor has three knobs at their disposal: the threshold to begin collection, the speed of marking, and the speed of sweeping. But there's a problem: the algorithm, as I described it, doesn't work. See, the graph of interconnections in the heap may change during the course of marking, and that's a problem. As I described in a previous post, if a pointer gets moved to another location, the object it references might evade marking and get swept, causing memory corruption.

In a snapshot-at-the-beginning incremental marking GC, the technique to handle this is to trap all pointer writes and execute a little bit of code: if the collector is in the marking phase, and the old pointer value isn't marked, it gets marked and pushed on the marking stack so that its children get marked too. (The marking stack is the explicit stack used for depth-first traversal of the heap, to mark everything it reaches.) This piece of code is called the write barrier, and it runs in addition to the generational write barrier, if one is necessary.
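
A minimal sketch in Python (invented names, not any real runtime's API) of what that barrier does:

class Obj:
    def __init__(self, *children):
        self.marked = False
        self.slots = list(children)

class Collector:
    def __init__(self):
        self.phase = 'idle'        # or 'marking' / 'sweeping'
        self.mark_stack = []

def write_slot(collector, obj, index, new_value):
    """Every pointer store in the mutator goes through this barrier."""
    old_value = obj.slots[index]
    if collector.phase == 'marking' and old_value is not None and not old_value.marked:
        # The old target had a reference at the start of the cycle, so it
        # must survive: mark it and push it so its children get marked too.
        old_value.marked = True
        collector.mark_stack.append(old_value)
    obj.slots[index] = new_value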

Conservativeness


One more thing: objects allocated during a GC cycle are allocated as marked. This means that they can't be collected until the next time around. Unfortunately, it also means that any generational GC will be ineffective while marking is going on: everything is effectively allocated in the oldest generation. Nevertheless, generations still provide a significant performance advantage, since most time is spent in the neutral non-GC state.

This is called snapshot-at-the-beginning not because an actual snapshot is made, but because everything is saved that had something referring to it at the beginning of the marking phase. (Everything that gets a reference to it during the cycle is also saved.) Of all incremental mark-sweep GC algorithms, a snapshot-at-the-beginning collector is the most conservative, causing the most floating garbage to lie around and wait, uncollected, until the next cycle. Other algorithms have techniques to avoid this, but it often comes at other costs.

MC2


Unfortunately, no matter what strategy is used to minimize fragmentation, there is some program which will cause bad fragmentation of the heap, making it less usable and allocation more expensive. For this reason, a compaction strategy is helpful, and the MC2 algorithm (Memory-Constrained Copying), created by Narendran Sachindran, provides one within an incremental and generational system. The details are somewhat complicated, and in this blog post I'll offer a simplified view. You can also look at the full paper.

MC


The idea is based on the Mark-Copy (MC) algorithm. The heap is divided up into a number of equally sized windows, say 40. One of these is the nursery, and the others act as tenured space. (I don't know why, but the papers about this seem to use a two-generation rather than three-generation model. I think it could easily be updated to use three generations, but I'll stick with this for now.) Each window has a logical number, with the nursery having the highest number.

Nursery collections go on as I've described in a previous post. A tenured-space collection is triggered when there is only one (non-nursery) window left free. At that point, the whole heap is marked. During marking, remembered sets of pointers into each window are built. Then, in turn, each window is copied (using Cheney's copying collector) into the free space that exists, starting with the one free window. The remembered sets are used to update pointers to things that were moved. If the lowest-numbered window is copied first, the remembered sets only need to contain pointers from higher-numbered windows to lower-numbered ones.

New modifications


MC2 adds a few things to this, to make the algorithm incremental and give low upper bounds on space overhead. The first change is that incremental marking is done. This is similar to the incremental snapshot-at-the-beginning marker described above, though the creators of MC2 opted for a version called incremental update, which is less conservative and more complicated but equally sound. The next change is in the copying technique. If a window is determined to have high occupancy (like more than 95%), it is left as it is without copying. Otherwise, windows are collected into groups whose remaining data can fit into one window. Those groups are incrementally copied into a new window.
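
Here's a toy Python sketch of the grouping step as I understand it (a simplification of my own, not code from the MC2 paper): nearly-full windows are pinned in place, and the rest are greedily grouped so that each group's live data fits into one window:

WINDOW_SIZE = 1.0
HIGH_OCCUPANCY = 0.95      # windows this full aren't worth copying

def plan_copying(occupancies):
    """occupancies: live fraction of each window, e.g. [0.97, 0.4, 0.3, 0.2]"""
    pinned, groups, current, current_live = [], [], [], 0.0
    for window, live in enumerate(occupancies):
        if live >= HIGH_OCCUPANCY:
            pinned.append(window)                  # left where it is
        elif current_live + live <= WINDOW_SIZE:
            current.append(window)                 # fits in the current group
            current_live += live
        else:
            groups.append(current)                 # start a new group
            current, current_live = [window], live
    if current:
        groups.append(current)
    # each group gets evacuated, incrementally, into one free window
    return pinned, groups

print(plan_copying([0.97, 0.4, 0.3, 0.2, 0.6]))    # ([0], [[1, 2, 3], [4]])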

Other changes make sure that the space overhead is bounded. The size of remembered sets is limited by switching to a card marking system in the event of an overflow. Objects with many references to them are put in semi-permanent storage in the lowest possible window number, minimizing the size of remembered set that they need.

In a benchmark included in the MC2 paper, MC2 is shown to have the same or slightly better performance compared to non-incremental generational mark-sweep or generational mark-compact, the alternatives for the domain of memory-constrained systems. Pauses of more than 30ms are rare, and performance appears to be consistent over a wide range of Java programs.

Tuesday, April 29, 2008

Potential ideas to explore

I haven't written in a while, and it's a little hard to get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent...

Possible extensions to Inverse

I've been thinking about possible ways to generalize my system for concatenative pattern matching, currently in extra/inverse. There are two ways to go about it: making a more general constraint solving system, and giving access to the old input when inverting something, as in the Harmony project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem like they're coming from nowhere, but they're actually very closely related. (To someone not familiar with it, see my previous blog post describing Inverse.)

Constraint solving

The idea of constraint solving is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even if enough facts are known to, potentially, figure out what everything is. For example, a fully general constraint-solving system would have to solve Diophantine equations, which are known to be undecidable in general.

So what can constraint solving get you in Inverse? Well, imagine an inverse to bi. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting [ first ] [ second ] bi. Inverting this should get the same result as first2 (which has a hard-coded inverse right now, inverting to 2array). But it won't work.

A way for [ first ] [ second ] bi to work would be using the following steps:
  1. Initialize a logic variable X as unbound
  2. Unify X with the information, "the first element is what's second from the top of the stack (at runtime)". Now it's known that X is a sequence of length at least 1.
  3. Unify X with the information, "the second element is what's on the top of the stack (at runtime)". Now it's known that X is a sequence of length at least two.
  4. From the information we have about X, produce a canonical representation, since the inverted quotation is over: an array of the minimum possible length.

This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out.
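
As a toy illustration of those steps (in Python, not the actual extra/inverse code), here's one way to accumulate "element i is v" constraints about an unknown sequence X and then build the shortest array consistent with them:

class UnboundSlot:
    def __repr__(self):
        return "?"

class SeqVar:
    def __init__(self):
        self.known = {}            # index -> value constraints about X

    def unify_element(self, index, value):
        if index in self.known and self.known[index] != value:
            raise ValueError("unification failed")
        self.known[index] = value

    def canonical(self):
        # the minimum-length array satisfying every constraint so far
        length = max(self.known) + 1 if self.known else 0
        return [self.known.get(i, UnboundSlot()) for i in range(length)]

x = SeqVar()
x.unify_element(0, "top-1")   # from inverting [ first ]
x.unify_element(1, "top")     # from inverting [ second ]
print(x.canonical())          # ['top-1', 'top']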

Formally, you can think of Inverse as already a reasonable constraint solving system, for a limited problem domain. Given [ f ], and the statement about stacks A and B that f(A) = B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But, a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor.

The Harmony Project


The Harmony Project, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. Everything operates on edge-labeled trees.

Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.)

So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now.

Backtracking

In Mark Tullsen's paper on first-class patterns, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. It's simplest to use the Maybe monad, which corresponds to how pattern matching systems normally work. But if the List monad is used instead, you easily get backtracking. This could be ported to Factor either by using monads or, maybe more easily, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant to things like switch statements, rather than to undo itself. I'm not sure if it'd actually be useful in practice.

Garbage collection research ideas

Because the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:
  • Figure out how efficient garbage collection on Non-Uniform Memory Access systems can work. The problem (if it is a problem) is that plain old garbage collection on multiprocessor NUMA systems isn't as fast as it could be, because a chunk of memory allocated for a thread may be far away from where it's used. One way to ensure locality is to give each processor (at least) its own heap, where the heap is guaranteed to be stored in the closest memory. But if data needs to be shared between processors, this can be too limiting. A piece of data can be kept on the RAM closest to the processor which made the allocating call, but maybe it'd be beneficial to collect data on which processor is using which data, and dynamically move data around to different places in RAM to put it closest to where it's used. A related issue is maximizing locality when actually performing the tracing in the GC, which I have no ideas about.

  • Run a real benchmark comparing several GC algorithms. Probably the most annoying thing for programming language implementors trying to pick a good GC algorithm is that there's no comprehensive benchmark to refer to. No one really knows which algorithm is the fastest, so there are two strategies remaining: pick the one that sounds the fastest, or do trial and error among just a few. Each paper about a new algorithm reports speed improvements—over significantly older algorithms. It'd be a big project, but I think it's possible to make a good benchmark suite and test how long it takes for these algorithms to run, in terms of absolute throughput and pause length and frequency, given different allocation strategies. If it's possible, it'd be nice to know what kind of GC performs best given a particular memory use pattern.

  • Garbage collector implementation in proof-carrying code. There are a couple invariants that garbage collectors have, that must be preserved. For example, the user can't be exposed to any forwarding pointers, and a new garbage collection can't be started when forwarding pointers exist. The idea of proof-carrying code (an explicit proof, which is type-checked to be accurate, is given with the code) isn't new; it's mostly been used to prove memory consistency safety given untrusted code. But maybe it could be used to prove that a GC implementation is correct.

These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with.

Ragel-style state machines in Factor

In my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be written as a deterministic finite automaton (finite state machine). Hopefully, I'll explain more about this in a future blog post.
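
As a tiny illustration (in Python, nothing to do with Ragel or Factor), here's what "can be written as a DFA" buys you: a fixed transition table recognizes the language in one pass over the input, using constant space. This hypothetical DFA accepts strings matching 0*1*:

DFA = {
    ('zeros', '0'): 'zeros',
    ('zeros', '1'): 'ones',
    ('ones',  '1'): 'ones',
}
ACCEPTING = {'zeros', 'ones'}

def matches(s, state='zeros'):
    for ch in s:
        state = DFA.get((state, ch))
        if state is None:          # no transition: reject immediately
            return False
    return state in ACCEPTING

print(matches("000111"), matches("0101"))   # True False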

Anyway, if you've heard of Ragel, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Pacrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same.

I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once.

(By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.)

Sunday, April 6, 2008

Programming in a series of trivial one-liners

Among Perl programmers, a one-line program is considered a useful piece of hackage, something to show off to your friends as a surprisingly simple way to do a particular Unix or text-processing task. Outsiders tend to deride these one-liners as line noise, but there's a certain virtue to it: in just one line, in certain programming languages, it's possible to create meaningful functionality.

APL, which lives on in its derivatives like K, Q, J and Dyalog, pioneered the concept of writing entire programs as a bunch of one-liners. Because the syntax is so terse, and because of the powerful, high-level constructs of array processing, you can pack a lot into just 80 characters. In most K programs I've seen, each line does something non-trivial, though this isn't always the case. It can take some time to decode just a single line. Reading Perl one-liners is the same way.

Factor continues the one-line tradition. In general, it's considered good style to write your words in one, or sometimes two or three, lines each. But this isn't because we like to pack a lot into each line. Rather, each word is rather trivial, using the words defined before it. After enough simple things are combined, something non-trivial can result, but each step is easy to understand.

Because Factor is concatenative (concatenation of programs denotes composition), it's easier to split things into these trivial one-liners. It can be done by copy and paste after the initial code is already written; there are no local variables whose names have to be changed. One-liners in Factor aren't exceptional or an eccentric trait of the community. They're the norm, and programs written otherwise are considered bad style.

Enough philosophizing. How does this work in practice? I'm working on encodings right now, so I'll break down how this worked out in implementing 8-bit encodings like ISO-8859 and Windows-1252. These encodings are just a mapping of bytes to characters. Conveniently, a bunch of resource files describing these mappings, all in exactly the same format, already exists on the Unicode website.

The first thing to do in implementing this is to parse and process the resource file, turning it into two tables for fast lookup in either direction. Instead of putting this in one word, it's defined in five, each one or two lines long. First, tail-if is a utility word which works like tail but leaves the sequence as it is if it's shorter.

: tail-if ( seq n -- newseq )
2dup swap length <= [ tail ] [ drop ] if ;

Using that, process-contents takes an array of lines and turns it into an associative mapping (in the form of an array of pairs) from octets to code points.

: process-contents ( lines -- assoc )
[ "#" split1 drop ] map [ empty? not ] subset
[ "\t" split 2 head [ 2 tail-if hex> ] map ] map ;

byte>ch takes this assoc, the product of process-contents, and produces an array which can be used to get the code point corresponding to a byte.

: byte>ch ( assoc -- array )
256 replacement-char <array>
[ [ swapd set-nth ] curry assoc-each ] keep ;

ch>byte is the opposite, taking the original assoc and producing an efficiently indexable mapping from code points to octets.

: ch>byte ( assoc -- newassoc )
[ swap ] assoc-map >hashtable ;

Finally, parse-file puts these all together and makes both mappings, given a stream for the resource file.

: parse-file ( stream -- byte>ch ch>byte )
lines process-contents
[ byte>ch ] [ ch>byte ] bi ;

Next, the structure of the encoding itself is defined. A single tuple class named 8-bit is used to represent all 8-bit encodings. It contains the encoding and decoding tables, as well as the name of the encoding. The encode-8-bit and decode-8-bit words just take some encoding or decoding information and look the code point or octet up in the given table.

TUPLE: 8-bit name decode encode ;

: encode-8-bit ( char stream assoc -- )
swapd at* [ encode-error ] unless swap stream-write1 ;

M: 8-bit encode-char
encode>> encode-8-bit ;

: decode-8-bit ( stream array -- char/f )
swap stream-read1 dup
[ swap nth [ replacement-char ] unless* ] [ nip ] if ;

M: 8-bit decode-char
decode>> decode-8-bit ;

I wanted to design this, like existing Unicode functionality, to read resource files at parsetime rather than to generate Factor source code. Though I don't expect these encodings to change, the result is still more maintainable as it leaves a lower volume of code. If I were implementing this in C or Java or R5RS Scheme or Haskell98, this wouldn't be possible. So make-8-bit defines an encoding given a word and the lookup tables to use:

: make-8-bit ( word byte>ch ch>byte -- )
[ 8-bit construct-boa ] 2curry dupd curry define ;

define-8-bit-encoding puts everything together. It takes a string for the name of an encoding to be defined and a stream, reads the appropriate resource file and defines an 8-bit encoding.

: define-8-bit-encoding ( name stream -- )
>r in get create r> parse-file make-8-bit ;

To top it all off, here's what's needed to define all the 8-bit encodings we want:

: mappings {
{ "latin1" "8859-1" }
{ "latin2" "8859-2" }
! ...
} ;

: encoding-file ( file-name -- stream )
"extra/io/encodings/8-bit/" ".TXT"
swapd 3append resource-path ascii ;

[
"io.encodings.8-bit" in [
mappings [ encoding-file define-8-bit-encoding ] assoc-each
] with-variable
] with-compilation-unit

So by combining these trivial one-liners or two-liners, you can make something that's not as trivial. The end product is that hard things are made easy, which is the goal of every practical programming language. The point of this isn't to say that this code is perfect (it's very far from that), but just to demonstrate how clear things become when they're broken down in this way.

When I first started programming Factor, I thought that it only made sense to define things separately when it was conceivable that something else would use them, or that it'd be individually useful for testing, or something like that. But actually, it's useful for more than that: for just making your program clear. In a way, the hardest thing to do when programming in Factor once you have the basics is to name each of these pieces and factor them out properly from your program. The result is far more maintainable and readable than if the factoring process had not been done.

Saturday, March 29, 2008

How the Factor meeting went in New York

I invited all of you, at the very last minute, to come meet me in New York to talk about Factor and stuff, and at least two people asked me to post in detail about what happened... so here's my best shot. Dan McCarthy was the brave soul who attended, and we had a really interesting conversation about various aspects of programming. One thing we discussed was the inverse pattern matching library. I showed Dan how it works, and he found it really interesting that quotations were sequences at runtime—similar to s-expressions, but directly executed. Dan works as a programmer/sysadmin at a company that provides closed captioning services for media companies, and it seems like a more interesting task than I would have thought. There are some text encoding issues there (the HD captioning standard, if I understand it correctly, actually has encoding left unspecified for characters outside of Windows-1252, though it leaves room for two-byte and three-byte characters) and Dan has been researching them for a project for a Korean client. I explained the encoding definition protocol to Dan, and I'm going to try to get him to implement East Asian encodings, which there seem to be quite a few of in use (Shift-JIS, ISO 2022-JP, GB 2312, Big5, EUC-JP, EUC-KR, GB 18030). These all need big tables for encoding and decoding, and some require state to decode. Many have multiple possible representations of the same string for output, which complicates things somewhat. So, there's not much to report, but I've definitely learned my lesson about organizing things: I need to announce things more than 11 days in advance, and I need to advertise them better.

Sunday, March 23, 2008

Some more advanced GC techniques

After my last two posts about garbage collection, some people suggested that more advanced techniques be used to solve the pausing problem. Here's a quick* overview of some more advanced techniques, some of which can eliminate noticeable pauses and some of which can solve other problems in GC.

The train algorithm

The idea of the train algorithm is to break the heap (or, really, just the oldest generation) into small chunks that can be collected individually. These chunks need to contain a list, or remembered set, of things from the outside that point into it. (Card marking doesn't work so well in the presence of many separate chunks.) Then, crucially, cyclic data structures need to be put in the same chunk, or at least the same group of chunks which get collected together. You can think of the chunks as cars and the groups of chunks as trains.

It's a bit difficult to get this whole thing sound, though. The precise strategy is described really well in this term paper by Würthinger. That PDF has tons of great diagrams. Java used to offer the train algorithm as an option, but it was deprecated because the train algorithm has high overhead in terms of throughput, and it can take several GC cycles to delete a cyclic data structure: as many as there are elements in the cycle.

Incremental mark-sweep

Another thing we can try is making mark-sweep incremental. Mark-sweep isn't that good for general collection, since it can cause memory fragmentation and make allocation slow. However, for just the oldest generation (aka tenured space) in a generational system, it's not completely horrible. Since the oldest generation can be pretty large, compaction takes a long time, since everything has to be copied. (This is true whether you're using mark-compact or copying collection.)

So, can we base something off mark-sweep that eliminates long pauses? Well, going by the heading, I guess we could try to make it incremental. There are two pieces to this: incremental mark and incremental sweep. Actually, instead of incremental sweep, we can do either lazy sweep (sweep as much as we need whenever there's an allocation) or concurrent sweep (sweep in a concurrent thread, and have allocating threads block on allocation until there's enough space swept).

The marking phase is more difficult because of a consistency problem. Imagine this situation with objects A, B and C, where A and C are pointed to by something that we know we're keeping. When marking starts, A points to B, and B and C point to null. The marker visits C and marks it as visited. Then, before the marker visits A or B, C is changed to point to B, and A is changed to point to null. Then the marker visits A, and A gets marked. But B is never marked, and it is lost!

The easiest solution is to trap all pointer writes to watch out for cases like this, making sure that B gets marked when A is changed. This is called a snapshot-at-the-beginning write barrier. But this makes it so that, if A and C both point to null, B still won't be collected until the next time around. That phenomenon is called floating garbage, and more subtle strategies remedy it. Most of these incremental algorithms can be parallelized with a little bit of work.

Aside from the book I recommended before, a good resource on incremental techniques is this ACM survey on garbage collection. [Update: There's also this great blog post which I forgot to mention before. It has lots of pretty diagrams.]

Garbage-first

The people at Sun have come up with a new garbage collection strategy for Java called garbage-first garbage collection (G1GC). The idea is somewhat similar to the train algorithm: the heap is split up into small chunks which can be collected separately, maintaining a remembered set of inward references. But the G1GC uses all kinds of crazy heuristics to figure out what chunks are most likely to have a small remembered set. This works so well that this "youngness heuristic" can completely replace the generational mechanism. The whole thing is led by user-specified parameters about the maximum allowable pause time and throughput goals.

There's a paper describing G1GC [Update: link no longer requires ACM access], but I can't really understand it. A more intelligible source is FAQ #4 of the most recent blog post on Jon Masamitsu's blog. (Jon works in Java's GC group.)

Reference counting with concurrency and cycles

In a nondeterministically multithreaded environment, reference counting has problems. Increment and decrement operations need to be atomic, or else there will be consistency issues. For example, if two concurrent threads try to increment the reference count of a single variable at the same time, and it works out that they both read and then both write, then the reference count will only increase by one. This might mean that the memory is freed while there are still references to it! In the same way, a decrement could be missed.

A bad solution is to put a lock on each reference count. This is bad because it's slow: every time there's a new reference, you need to not only increment the refcount but also acquire and then release a lock. Another solution is to have a worker thread which handles all increments and decrements; all other threads send messages to it.
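
Here's a quick Python demonstration (not tied to any particular runtime) of why the updates have to be atomic: two threads doing a plain read-then-write increment can lose updates.

import threading

class Cell:
    def __init__(self):
        self.refcount = 0

def bump(cell, times):
    for _ in range(times):
        value = cell.refcount        # read...
        cell.refcount = value + 1    # ...write; another thread may have bumped in between

cell = Cell()
threads = [threading.Thread(target=bump, args=(cell, 1_000_000)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.refcount)   # often less than 2000000: some increments were lost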

To handle cycles, you could use a hybrid approach, to use mark-sweep when memory runs out in order to collect cycles. But there are other approaches. In an explicit refcounting system (where increments and decrements are manual), the user could be expected to insert a "weak reference", one which doesn't increase the refcount, whenever completing a cycle. Another way is to perform a small local marking trace when refcounts are decremented but not set to zero, to make sure there isn't an unreferenced cycle. That's described in this recent paper, which also handles concurrency. Here's an optimization on that with a proof of correctness.

Hard real-time GC

So far, I've been talking about minimizing pauses in a vague, general sense. We just want them to be a fraction of how long it takes to do a full tracing collection. But this isn't enough for some applications. Say you're making a video game where a 50ms GC pause (as the best incremental mark-sweep collectors benchmark at, I've heard) means a skipped frame or two. That can be noticeable, and it presents a real disadvantage compared to explicit allocation. Even refcounting doesn't always give really short pauses, since it causes memory fragmentation (making allocation take longer) and deallocation is not scheduled incrementally. That is, if you have a long linked list with just one reference to the head, and that reference ends, then the entire linked list gets deallocated in a chain, with no incremental scheduling.

What this situation needs is a garbage collector that follows hard real-time guarantees. One way that this guarantee could be phrased is that pauses are at most 1ms, and that at least 7 out of every 10 milliseconds are spent running the main program. This guarantee will be met even if the main program acts "adversarially", bringing out the worst-case behavior in the collector. It's possible to specify a requirement like this that's unachievable for a particular application, but this requirement works for most things. Different applications can specify different requirements based on, say, their frame rate and how long it takes to render a frame. For applications like this, absolute throughput (which is basically maximized by a well-tuned generational collector) can be sacrificed in favor of better scheduling.
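
As a made-up Python sketch of the kind of check such a scheduler might make (the real Metronome scheduler is considerably more sophisticated than this):

WINDOW = 0.010         # a 10 ms window...
MIN_MUTATOR = 0.007    # ...of which the mutator must get at least 7 ms
MAX_PAUSE = 0.001      # no single collector increment may exceed 1 ms

def collector_may_run(gc_time_this_window):
    # the collector's total budget per window is 10 ms - 7 ms = 3 ms;
    # another increment is allowed only if it still fits in that budget
    gc_budget = WINDOW - MIN_MUTATOR
    return gc_time_this_window + MAX_PAUSE <= gc_budget

print(collector_may_run(0.0015))   # True: 1.5 ms used, another 1 ms still fits
print(collector_may_run(0.0025))   # False: another 1 ms increment would not fit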

This sounds like an impossible dream, but it's actually been implemented in the Metronome system, implemented for Jikes by IBM. Metronome has been written about in the ACM Queue and there's also a paper which is harder to understand but explains more. The goal of the Metronome project is to allow high-level languages to be used for real-time applications on uniprocessor machines. While Java isn't what I'd choose, the GC seems to be the biggest barrier, and it's great that this research is being done.

The idea is to have an incremental mark-sweep collector (not generational) which segregates the heap into chunks (just for allocation purposes) of roughly the same size data. This minimizes fragmentation. However, fragmentation can still occur, and when a heap segment is too fragmented, it is incrementally copied and compacted to a different piece of memory. Large objects are split up into chunks called arraylets. By all of these techniques, garbage collection can be broken up into small tasks, and an innovative scheduler makes it satisfy the hard real-time guarantees.

Because the collector isn't generational, and because of the overhead of the scheduler and the floating garbage that's left by the incremental collector, this is far from optimal for applications that don't really need things to be so predictably well-spaced. But maybe, if there were more knobs on this algorithm (eg, the scheduler can be turned off, and more generations can be added), this could be a general-purpose GC that's really useful.

GC and language features

In the most basic sense, a garbage collection system consists of one exposed function, allocate, which takes a number of bytes and allocates a region of memory that's that big. But there are some other things that can be useful. For example, for tracing collectors, a collect-garbage function can be used to do a major collection when the program knows it's idle.

Another useful feature is finalizers. For most things, it's sufficient to just deallocate memory when it's collected. But think about files. You should always explicitly close a file when you're done with it, but if the programmer makes an error, the file should still be closed once it is unreachable. With a reference counting or mark-sweep collector, this is relatively easy: just have a generic function finalize that gets called on everything that's collected. With copying collection, the collector maintains a list of objects that have finalizers, and on each copying cycle, this list is traversed and it is checked whether objects have forwarding pointers in fromspace. If an object with a finalizer doesn't have a forwarding pointer, it has been deleted and the finalizer should be invoked. This avoids a full traversal of the heap.

Actually invoking the finalizer isn't so simple, because by that point the object might contain some dead pointers. With a reference counting collector, if you're not collecting a cycle, you can call the finalizers in top-down order (also called topological order), and then the dead-pointer issue doesn't exist. But this breaks down in the presence of cycles, and is difficult to calculate with a tracing collector. An easier-to-implement strategy is to call the finalizers in arbitrary order, but call them all before the garbage is actually collected. Alternatively, everything the finalizer references can be considered a root. But in this situation, programmers have to be very careful not to retain the objects forever.
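
Here's a simplified Python sketch (invented names, not any real collector's API) of the forwarding-pointer check described above:

class Obj:
    def __init__(self, finalizer=None):
        self.forwarding = None      # set when the object is copied to tospace
        self.finalizer = finalizer

def run_pending_finalizers(finalizable):
    """Called once per copying cycle, after all live objects have been copied."""
    survivors = []
    for obj in finalizable:
        if obj.forwarding is None:
            obj.finalizer(obj)                     # unreachable: invoke its finalizer
        else:
            survivors.append(obj.forwarding)       # track the tospace copy instead
    return survivors                               # the finalizer list for the next cycle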

This summer, I hope to join this research community in a small way by participating in a Harvey Mudd REU (summer undergraduate research project) in garbage collection. In previous summers, an idea of blobs was developed, a generalization of a concept called ephemerons to make weak hashtables without memory leaks. They wrote and published a paper about it. They also researched and wrote about garbage collection techniques for certain hard-to-collect persistent data structures.

Another project of this group's leader is called a simplifier, which is something that gets invoked in a copying collector to simplify a data structure when it gets copied. This is a technique that is used in an ad-hoc way in graph-reduction runtimes for purely functional languages: you don't want to copy the whole graph if there's an easy way to simplify it without allocating any new nodes. It should be really fun to research these more advanced techniques for making garbage collection more correct and efficient.

Where to now for more research?

Academia has been working on this since the '60s. But recently, big companies like Sun, IBM and Microsoft have been doing more here in order to augment their Java and .NET platforms. Some academic resources to look at to learn more about GC are at UT Austin's website (especially bigsurv.ps). There are conferences which discuss memory management, like the International Symposium on Memory Management and SPACE.

When implementing a garbage collector for research purposes, you probably don't want to build a whole programming language runtime yourself. Jikes RVM provides an open-source research virtual machine that you can easily plug different garbage collectors into. Jikes RVM's MMTk (Memory Management Toolkit) makes this possible. There are visualization tools, and the heap can be split up into different segments which use different collectors.

These advanced garbage collection algorithms haven't been implemented many times and aren't well-understood by many people. There also hasn't been much work in formally analyzing and comparing these algorithms. This is partly because they're hard to analyze; constant factors have a gigantic effect. Someone came up with a unified theory of garbage collection, though, which analyzes all garbage collection strategies as some combination between marking and reference counting, which can be seen as duals. Just like with differential equations, there's no general solution which meets all of our requirements (maximizing throughput and locality, minimizing and organizing pause times, making allocation cheap) at once, though our understanding is always improving.



* You may be wondering why I keep saying that these posts are short when they're something like four pages long printed out. Mainly, it's because the things that I'm reading are much longer. It'd be kinda hard for me to describe anything meaningful in 500 words or less.