Saturday, March 29, 2008

How the Factor meeting went in New York

I invited all of you, at the very last minute, to come meet me in New York to talk about Factor and stuff, and at least two people asked me to post in detail about what happened... so here's my best shot. Dan McCarthy was the brave soul who attended, and we had a really interesting conversation about various aspects of programming. One thing we discussed was the inverse pattern matching library. I showed Dan how it works, and he found it really interesting that quotations are sequences at runtime, similar to s-expressions, but directly executed.

Dan works as a programmer/sysadmin at a company that provides closed captioning services for media companies, and it seems like a more interesting job than I would have thought. There are some text encoding issues there (the HD captioning standard, if I understand it correctly, actually leaves the encoding unspecified for characters outside of Windows-1252, though it leaves room for two-byte and three-byte characters), and Dan has been researching them for a project for a Korean client. I explained the encoding definition protocol to Dan, and I'm going to try to get him to implement East Asian encodings, of which there seem to be quite a few in use (Shift-JIS, ISO 2022-JP, GB 2312, Big5, EUC-JP, EUC-KR, GB 18030). These all need big tables for encoding and decoding, and some require state to decode. Many have multiple possible representations of the same string for output, which complicates things somewhat.

So, there's not much to report, but I've definitely learned my lesson about organizing things: I need to announce things more than 11 days in advance, and I need to advertise them better.

Sunday, March 23, 2008

Some more advanced GC techniques

After my last two posts about garbage collection, some people suggested that more advanced techniques be used to solve the pausing problem. Here's a quick* overview of some more advanced techniques, some of which can eliminate noticeable pauses and some of which can solve other problems in GC.

The train algorithm

The idea of the train algorithm is to break the heap (or, really, just the oldest generation) into small chunks that can be collected individually. These chunks need to contain a list, or remembered set, of things from the outside that point into them. (Card marking doesn't work so well in the presence of many separate chunks.) Then, crucially, cyclic data structures need to be put in the same chunk, or at least the same group of chunks which get collected together. You can think of the chunks as cars and the groups of chunks as trains.
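
To make the layout concrete, here's a little data-structure sketch in Java (the names are mine, and the evacuation step is only stubbed out in comments, so this is a shape, not an implementation):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TrainHeap {
    static class Car {                       // a small, separately collectable chunk
        final List<Object> objects = new ArrayList<>();
        final Set<Object> rememberedSet = new HashSet<>();  // external refs into this car
    }

    static class Train {                     // a group of cars collected together
        final ArrayDeque<Car> cars = new ArrayDeque<>();
    }

    final ArrayDeque<Train> trains = new ArrayDeque<>();

    // One increment of collection: take the first car of the first train.
    void collectOneCar() {
        Train first = trains.peekFirst();
        if (first == null) return;
        Car car = first.cars.pollFirst();
        // Objects in `car` still reachable through its remembered set get
        // evacuated toward the trains that reference them (this is how a
        // cycle eventually gathers into a single train); the rest is garbage.
        if (first.cars.isEmpty()) {
            trains.pollFirst();              // the whole train has been collected
        }
    }
}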

It's a bit difficult to get this whole thing sound, though. The precise strategy is described really well in this term paper by Würthinger. That PDF has tons of great diagrams. Java used to use the train algorithm optionally, but it was deprecated because the train algorithm has high overhead in terms of throughput and it can take several GC cycles to delete a cyclic data structure: as many as there are elements in the cycle.

Incremental mark-sweep

Another thing we can try is making mark-sweep incremental. Mark-sweep isn't that good for general collection, since it can cause memory fragmentation and make allocation slow. However, for just the oldest generation (aka tenured space) in a generational system, it's not completely horrible, and it avoids a real cost: since the oldest generation can be pretty large, compaction takes a long time, because everything has to be copied. (This is true whether you're using mark-compact or copying collection.)

So, can we base something off mark-sweep that eliminates long pauses? Well, going by the heading, I guess we could try to make it incremental. There are two pieces to this: incremental mark and incremental sweep. Actually, instead of incremental sweep, we can do either lazy sweep (sweep as much as we need whenever there's an allocation) or concurrent sweep (sweep in a concurrent thread, and have allocating threads block on allocation until there's enough space swept).
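
To make lazy sweeping concrete, here's a minimal Java sketch (a made-up free-list heap, not any real collector's code): each allocation resumes the sweep wherever it last stopped and returns the first unmarked block it finds.

class LazySweepHeap {
    static class Block {
        Block next;       // blocks of the heap, in address order
        boolean marked;   // set by the most recent mark phase if reachable
    }

    private Block sweepCursor;   // where the incremental sweep left off

    // Allocation sweeps just far enough to find one reclaimable block.
    Block allocate() {
        while (sweepCursor != null) {
            Block b = sweepCursor;
            sweepCursor = sweepCursor.next;
            if (!b.marked) {
                return b;        // unmarked after marking finished: garbage
            }
            b.marked = false;    // live block: clear the bit for the next cycle
        }
        return null;             // whole heap swept: time for another mark phase
    }

    // After a mark phase completes, sweeping restarts from the first block.
    void markFinished(Block heapStart) {
        sweepCursor = heapStart;
    }
}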

The marking phase is more difficult because of a consistency problem. Imagine this situation with objects A, B and C, where A and C are pointed to by something that we know we're keeping. When marking starts, A points to B, and B and C point to null. The marker visits C and marks it as visited. Then, before the marker visits A or B, C is changed to point to B and A is changed to point to null. The marker then visits A, and A gets marked. But B is never marked, and it is lost!

The easiest solution is to trap all pointer writes to watch out for cases like this, making sure that B gets marked when A is changed. This is called a snapshot-at-the-beginning write barrier. But this makes it so that, if A and C both point to null, B still won't be collected until the next time around. That phenomenon is called floating garbage, and more subtle strategies remedy it. Most of these incremental algorithms can be parallelized with a little bit of work.
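
Here's roughly what a snapshot-at-the-beginning barrier looks like, as a Java sketch with hypothetical names (a real VM emits this inline as a couple of instructions rather than a method call). Before a pointer field is overwritten during marking, the old target is pushed onto the marker's work list, so the B from the example above can't be lost:

import java.util.ArrayDeque;
import java.util.Deque;

class GcObject {
    GcObject field;     // the one pointer slot this toy object has
    boolean marked;
}

class SatbBarrier {
    // Grey set shared with the incremental marker.
    static final Deque<GcObject> greySet = new ArrayDeque<>();
    static volatile boolean markingInProgress = false;

    // Every pointer store in the running program goes through this barrier.
    static void writeField(GcObject holder, GcObject newValue) {
        if (markingInProgress) {
            GcObject old = holder.field;
            if (old != null && !old.marked) {
                greySet.push(old);   // remember B before A stops pointing to it
            }
        }
        holder.field = newValue;
    }
}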

Aside from the book I recommended before, a good resource on incremental techniques is this ACM survey on garbage collection. [Update: There's also this great blog post which I forgot to mention before. It has lots of pretty diagrams.]

Garbage-first

The people at Sun have come up with a new garbage collection strategy for Java called garbage-first garbage collection (G1GC). The idea is somewhat similar to the train algorithm: the heap is split up into small chunks which can be collected separately, each maintaining a remembered set of inward references. But G1GC uses all kinds of crazy heuristics to figure out which chunks are most likely to have a small remembered set. This works so well that this "youngness heuristic" can completely replace the generational mechanism. The whole thing is driven by user-specified parameters for the maximum allowable pause time and throughput goals.

There's a paper describing G1GC [Update: link no longer requires ACM access], but I can't really understand it. A more intelligible source is FAQ #4 of the most recent blog post on Jon Masamitsu's blog. (Jon works in Java's GC group.)

Reference counting with concurrency and cycles

In a nondeterministically multithreaded environment, reference counting has problems. Increment and decrement operations need to be atomic, or else there will be consistency issues. For example, if two concurrent threads try to increment the reference count of a single variable at the same time, and it works out that they both read and then both write, then the reference count will only increase by one. This might mean that the memory is freed while there are still references to it! In the same way, a decrement could be missed.

A bad solution is to put a lock on each reference count. This is bad because it's slow: every time there's a new reference, you need to not only increment the refcount but also acquire and then release a lock. Another solution is to have a worker thread which handles all increments and decrements; all other threads send messages to it.
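
A third option, and the one the atomicity requirement above points toward, is to use the processor's atomic read-modify-write instructions directly. Here's a toy sketch in Java (hypothetical, not any real collector's code) using AtomicInteger, whose getAndIncrement and decrementAndGet compile down to such instructions, so the lost-update scenario above can't happen:

import java.util.concurrent.atomic.AtomicInteger;

class RefCounted {
    private final AtomicInteger refCount = new AtomicInteger(1);

    void acquire() {
        refCount.getAndIncrement();      // atomic: concurrent increments can't be lost
    }

    void release() {
        if (refCount.decrementAndGet() == 0) {
            free();                      // the last reference just went away
        }
    }

    private void free() {
        // return this object's memory and release everything it refers to
    }
}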

To handle cycles, you could use a hybrid approach: fall back on mark-sweep when memory runs out in order to collect cycles. But there are other approaches. In an explicit refcounting system (where increments and decrements are manual), the user could be expected to insert a "weak reference", one which doesn't increase the refcount, whenever completing a cycle. Another way is to perform a small local marking trace when a refcount is decremented but doesn't reach zero, to make sure there isn't an unreferenced cycle. That's described in this recent paper, which also handles concurrency. Here's an optimization on that with a proof of correctness.

Hard real-time GC

So far, I've been talking about minimizing pauses in a vague, general sense. We just want them to be a fraction of how long it takes to do a full tracing collection. But this isn't enough for some applications. Say you're making a video game where a 50ms GC pause (which is about what the best incremental mark-sweep collectors benchmark at, I've heard) means a skipped frame or two. That can be noticeable, and it presents a real disadvantage compared to explicit allocation. Even refcounting doesn't always give really short pauses, since it causes memory fragmentation (making allocation take longer) and deallocation is not scheduled incrementally. That is, if you have a long linked list with just one reference to the head, and that reference goes away, then the entire linked list gets deallocated in a chain, with no incremental scheduling.

What this situation needs is a garbage collector that follows hard real-time guarantees. One way that this guarantee could be phrased is that pauses are at most 1ms, and that at least 7 out of every 10 milliseconds are spent running the main program. This guarantee will be met even if the main program is acting "adversarially", bringing out the worst-case behavior in the collector. It's possible to specify a requirement like this that's unachievable for a particular application, but this requirement works for most things. Different applications can specify different requirements based on, say, their frame rate and how long it takes to render a frame. For applications like this, absolute throughput (which is basically maximized by a well-tuned generational collector) can be sacrificed in favor of better scheduling.
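
To make the arithmetic concrete, here's a toy Java sketch of that exact guarantee: 1ms quanta, and at most 3 out of every 10 milliseconds for the collector. I use a simple resetting window here; a real scheduler like Metronome's reasons over a sliding window, which is stricter:

class RealTimeSchedule {
    static final long WINDOW_NS    = 10_000_000;  // 10ms window
    static final long QUANTUM_NS   =  1_000_000;  // 1ms maximum pause
    static final long GC_BUDGET_NS =  3_000_000;  // at most 3 of every 10ms for GC

    private long windowStart = System.nanoTime();
    private long gcTimeInWindow = 0;

    // The collector asks this before running one quantum of work.
    boolean mayCollectNow() {
        long now = System.nanoTime();
        if (now - windowStart >= WINDOW_NS) {  // window over: reset the budget
            windowStart = now;
            gcTimeInWindow = 0;
        }
        return gcTimeInWindow + QUANTUM_NS <= GC_BUDGET_NS;
    }

    void chargeQuantum() {
        gcTimeInWindow += QUANTUM_NS;
    }
}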

This sounds like an impossible dream, but it's actually been achieved in the Metronome system, implemented for Jikes by IBM. Metronome has been written about in the ACM Queue and there's also a paper which is harder to understand but explains more. The goal of the Metronome project is to allow high-level languages to be used for real-time applications on uniprocessor machines. While Java isn't what I'd choose, the GC seems to be the biggest barrier, and it's great that this research is being done.

The idea is to have an incremental mark-sweep collector (not generational) which segregates the heap into chunks (just for allocation purposes), each holding objects of roughly the same size. This minimizes fragmentation. However, fragmentation can still occur, and when a heap segment is too fragmented, it is incrementally copied and compacted to a different piece of memory. Large objects are split up into chunks called arraylets. With all of these techniques, garbage collection can be broken up into small tasks, and an innovative scheduler makes it satisfy the hard real-time guarantees.

Because the collector isn't generational, and because of the overhead of the scheduler and the floating garbage left behind by the incremental collector, this is far from optimal for applications that don't need pauses to be so predictably well-spaced. But maybe, if there were more knobs on this algorithm (e.g., the scheduler could be turned off, and more generations could be added), this could be a general-purpose GC that's really useful.

GC and language features

In the most basic sense, a garbage collection system consists of one exposed function, allocate, which takes a number of bytes and allocates a region of memory that's that big. But there are some other things that can be useful. For example, for tracing collectors, a collect-garbage function can be used to do a major collection when the program knows it's idle.
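
So the exposed surface can be as small as this (a hypothetical Java rendering of the two functions just mentioned; the names are mine):

interface GcHeap {
    long allocate(int bytes);   // returns the address of a fresh region that big
    void collectGarbage();      // hint: do a major collection now, while idle
}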

Another useful feature is finalizers. For most things, it's sufficient to just deallocate memory when it's collected. But think about files. You should always explicitly close a file when you're done with it, but if the programmer makes an error, the file should still be closed once it is unreachable. With a reference counting or mark-sweep collector, this is relatively easy: just have a generic function finalize that gets called on everything that's collected. With copying collection, the collector maintains a list of objects that have finalizers, and on each copying cycle, this list is traversed and it is checked whether each object has a forwarding pointer in fromspace. If an object with a finalizer doesn't have a forwarding pointer, it has been deleted and the finalizer should be invoked. This avoids a full traversal of the heap.
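
Here's that scan as a short Java sketch (the object layout is made up): after a copy cycle, walk the list of finalizable objects, run the finalizer of anything that was never forwarded, and re-register the survivors under their new addresses:

import java.util.ArrayList;
import java.util.List;

class CopyingFinalizers {
    static class Header {
        Header forwardingPointer;   // set only if the object was copied to tospace
        Runnable finalizer;
    }

    private final List<Header> finalizable = new ArrayList<>();

    void register(Header obj) { finalizable.add(obj); }

    // Called once per copying cycle, after live objects have been copied.
    void runDeadFinalizers() {
        List<Header> survivors = new ArrayList<>();
        for (Header obj : finalizable) {
            if (obj.forwardingPointer == null) {
                obj.finalizer.run();                    // never copied: it's dead
            } else {
                survivors.add(obj.forwardingPointer);   // track the tospace copy
            }
        }
        finalizable.clear();
        finalizable.addAll(survivors);
    }
}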

Actually invoking the finalizer isn't so simple, because by the time it runs, the object might contain some dead pointers. With a reference counting collector, if you're not collecting a cycle, you can call the finalizers in top-down order (also called topological order), and then the dead pointer issue doesn't exist. But this breaks down in the presence of cycles, and the order is difficult to calculate with a tracing collector. An easier-to-implement strategy is to call the finalizers in arbitrary order, but call them all before the garbage is actually collected. Alternatively, everything the finalizer references can be considered a root. But in this situation, programmers have to be very careful not to retain the objects forever.

This summer, I hope to join this research community in a small way by participating in a Harvey Mudd REU (summer undergraduate research project) in garbage collection. In previous summers, the group developed an idea called blobs, a generalization of a concept called ephemerons, to make weak hashtables without memory leaks. They wrote and published a paper about it. They also researched and wrote about garbage collection techniques for certain hard-to-collect persistent data structures.

Another project of the leader of this group is called a simplifier: something that gets invoked in a copying collector to simplify a data structure when it gets copied. This is a technique that is used in an ad-hoc way in graph-reduction runtimes for purely functional languages: you don't want to copy the whole graph if there's an easy way to simplify it without allocating any new nodes. It should be really fun to research these more advanced techniques for making garbage collection more correct and efficient.

Where to now for more research?

Academia has been working on this since the '60s. But recently, big companies like Sun, IBM and Microsoft have been doing more here in order to augment their Java and .NET platforms. Some academic resources to look at to learn more about GC are at UT Austin's website (especially bigsurv.ps). There are conferences which discuss memory management, like the International Symposium on Memory Management and SPACE.

When implementing a garbage collector for research purposes, you probably don't want to build a whole programming language runtime yourself. Jikes RVM provides an open-source research virtual machine into which you can easily plug different garbage collectors; its MMTk (Memory Manager Toolkit) makes this possible. There are visualization tools, and the heap can be split up into different segments which use different collectors.

These advanced garbage collection algorithms haven't been implemented many times and aren't well-understood by many people. There also hasn't been much work in formally analyzing and comparing these algorithms. This is partly because they're hard to analyze; constant factors have a gigantic effect. Someone came up with a unified theory of garbage collection, though, which analyzes all garbage collection strategies as some combination between marking and reference counting, which can be seen as duals. Just like with differential equations, there's no general solution which meets all of our requirements (maximizing throughput and locality, minimizing and organizing pause times, making allocation cheap) at once, though our understanding is always improving.



* You may be wondering why I keep saying that these posts are short when they're something like four pages long printed out. Mainly, it's because the things that I'm reading are much longer. It'd be kinda hard for me to describe anything meaningful in 500 words or less.

Wednesday, March 19, 2008

Three open language design problems in Factor

Factor is progressing at a surprisingly rapid pace for its small developer base. The compiler and virtual machine make code run relatively fast, and it's getting faster, and there's a large and growing standard library. As far as language design methods go, Factor is fairly unusual. With most languages, once they reach a certain state of maturity, code gets written outside the project, and backwards compatibility needs to be maintained so that that code base keeps working.

In Factor, nearly all code written is distributed in extra/ with Factor itself. We can keep making changes to the language and update the entire code base whenever we want. It's pretty amazing, and allows us to remove unused things and make major changes in the API like the recent one with encodings. So the resulting language design and standard library can be as clean as some abstract, mostly a priori language standard like Haskell98 or R5RS while being as useful as we want. (Of course, after 1.0, stability will be a major goal.)

The library is progressing amazingly. But within this context, there are a few things that need to be worked out language-design-wise.

Mixing currying with macros

Macros (by which I mean the things you define with MACRO:, described here) are really an optimization. Instead of using MACRO:, you could write a normal word definition with call at the end: a macro just takes a number of things from the stack and generates a quotation, which is then called. The body of the definition is everything but the call. What macros do is move this calculation of the quotation as early as possible. If possible, the quotation is calculated when the optimizing compiler passes through the code; otherwise, it's calculated at runtime and the result is memoized. Either way, it's only calculated once for any given input.

Macros work great as a way to describe things like undo, cond and case. But what if we want to use locals in a cond? Or curry something onto the quotation used by undo? These are both instances of the same problem, since locals boil down to currying and a quotation transformation and so does undo.

Consider the following code:

:: foo ( bar -- baz )
{
{ [ bar 1 = ] [ 2 ] }
{ [ t ] [ bar 1 + ] }
} cond ;

The double colon (::) indicates that this definition uses locals, and that bar is a lexically scoped local variable taken from the stack. The code I wrote won't work, because the locals implementation doesn't know how to deal with cond. When dealing with ordinary quotations, it can curry the locals that get used onto the front, and do the typical locals transformation inside the quotation. But cond takes an array from the stack. What should it do there?

Similarly, consider this code:

: second-if-tagged ( pair first-tag -- second )
[ swap 2array ] curry undo ;

This code will work as expected, but the inverse of the quotation has to be compiled separately on each run. Even worse, there's a memory leak as macros memoize their input: the first-tag would never be freed.

There should be some way for undo and cond to specify how to deal with currying and with the transformation that locals perform. But how should this be structured?

Formalizing protocols

Right now, mixins are used for two things. Some, like plain-writer (in the latest development sources), implement a bunch of generic words. Others, like immutable-sequence, have a bunch of generic words defined around them which instances are expected to implement. Some, like virtual-sequence, do both.

For things like sequence, there's only an implicit relationship with the generic words that are associated with them. In extra/delegate, there's a formally-defined set of generic words, but it's not associated with the sequence mixin.

It'd be even nicer if we could make some static verification tools based on this: just to throw some kind of error, some time before runtime, if a necessary method isn't defined. But it's a little harder to do than it sounds. For example, for sequences, there is a default method for both nth and nth-unsafe. These are mutually recursive, so only one has to be implemented. But every sequence has to implement at least one of them to be valid. How should this be represented? It couldn't be done automatically. There are also some words, like like, whose implementation is completely optional, and length, whose implementation is mandatory.

A soft type system

Factor is dynamically typed, and to some degree, this is a strength. Inherent in any fully static type system is a limitation preventing certain programs from being run. It's much easier to have a dynamic interactive development environment that uses dynamic typing. If Factor had a static type system, it's unlikely that it could have had the gradual development of programming concepts that it did.

But, at the same time, there are some advantages to static typing. For one, it opens up a bunch of optimization opportunities for the compiler. The Factor optimizing compiler already infers some things about types. This happens in two ways. The compiler infers the stack effect of arbitrary quotations, that is, how many things a block of code takes and leaves on the stack. This is exposed to the user through the infer. word. A weakness of this is that stack effects can only be inferred if all quotations can be inlined at compiletime, and deeper knowledge about quotations could fix this.

The other way type inference happens is that the compiler eliminates type checks and method dispatch where it can prove that something is of a particular class. This happens in a more ad-hoc way and is not exposed to the user because it doesn't infer which types are required. If there were a stronger type system, then it might be possible to remove even more type checks, and it might be possible to automatically insert what's now done by the hints mechanism.

Another advantage is that a user-visible static type system can reject programs that won't work, and can provide useful metadata about programs based on their type. Factor can currently reject programs whose stack effect comment doesn't match the stack effect that's inferred. Usually, once you get the hang of stack programming, you don't need this, but occasionally it still comes in handy. Type checking does the same thing.

In Haskell, for example, you can have a function which dispatches (at compiletime) off its expected return value. This is seriously cool, and without it, monads would be a bit more awkward. I'm not sure if Factor will ever have a type system that expressive.

Anyway, none of these things represents an urgent need, but there's also something to consider: if we make a static type system for Factor, we don't want to restrict the flexibility of Factor at all. There would need to be a number of things. First, a staged approach similar to the way metaprogramming currently works. Second, the type system would have to be a soft typing system, which can infer many things but doesn't reject everything that it can't infer. It can warn the user if it can tell that there'll be a type mismatch, though. Neither of these things has been attempted in a concatenative language yet.

[Update: A commenter pointed out that it's also possible to have explicit declarations and no inference. A version of this is planned to happen before 1.0, in the form of a better syntax on top of multimethods. Basically, you have a generic word with only one method, and that method is for the types of arguments that you specify. This allows for earlier failure and some generic dispatch/type check eliminations. It's still not clear how more general, before-runtime type checking (with or without inference) can be done, though.]

Conclusion

There are a bunch more language design problems to come, but these three seem to be the biggest right now. It's possible that these won't be solved, and for the last one, I'm not sure if that'd be too horrible. I hope that the currying problem can be resolved before Factor 1.0, and I expect everything to have some sort of resolution by Factor 2.0. The sooner we solve the problems, the easier it will be to update the code to conform with the solution. I just wish I had good enough ideas to figure them out right now.

Monday, March 17, 2008

Another FactorCon in NYC

Zed Shaw suggested to me recently that we hold a Factor-related meeting this month in New York, since I'll be there on March 28th. Since he hasn't done anything to organize it, I thought I'd take the lead: let's meet at

Earth Matters
177 Ludlow St, Manhattan, NY, USA

This'll happen at 7 PM, on Friday the 28th. The idea of the event is to talk about Factor, meaning (depending on who comes) sharing current projects in Factor, or a Factor tutorial for beginners, or a little of both. It'll all be very informal. You can come whether or not you know Factor.

Unlike the recent PyCon, this Factor convention will not be beholden to our sponsors' interests! But if we had any sponsors we might be... it's a little too late to solicit them.

If you can come, please send me an email so I can get an idea of how many people are going to show up. (Sorry about the short notice.) I hope to see you there!

Friday, March 14, 2008

A protocol for creating encodings

I previously wrote about the API that I designed for creating streams with encodings in Factor. I'm not sure if that's going to stick around permanently in this form, due to concerns about easily changing stream encodings and grouping encodings with a pathname as one object on the stack.

Either way, I wanted to describe the protocol I'm developing for actually defining new encodings in Factor. This code isn't completely debugged but it should be done and in the main Factor development repository very soon. There are four words in the encoding protocol:

GENERIC: <encoder> ( stream encoding -- encoder-stream )
GENERIC: <decoder> ( stream decoding -- decoder-stream )
GENERIC: encode-char ( char stream encoding -- )
GENERIC: decode-char ( stream decoding -- char/f )

Let's go through these. First, the constructors <encoder> and <decoder>. These are very rarely called directly by the library user, more often by stream constructors. For example, when you do "filename" utf8 <file-reader>, what's going on underneath is "filename" (file-reader) utf8 <decoder>. (file-reader) is a low-level constructor that gives you a binary stream, and <decoder> wraps it in a decoded stream using the specified encoding descriptor, utf8.

I have some slightly funny methods on <encoder> and <decoder>. See, right now, all encodings are tuples, and their abstract descriptors are tuple classes. All tuple class symbols are in the class tuple-class, and all tuples are in the class tuple. So we can define two methods on each constructor word: one on tuple-class, which makes an empty instance of the encoding tuple class and calls the constructor again, and one on tuple, which actually puts together the instance of the physical encoder or decoder tuple. Here's how it looks:

M: tuple-class <decoder> construct-empty <decoder> ;
M: tuple <decoder> f decoder construct-boa ;

M: tuple-class <encoder> construct-empty <encoder> ;
M: tuple <encoder> encoder construct-boa ;

One reason these need to be generic is for things like binary streams, where methods on these generic words are implemented as dummies: a binary encoding is just the lack of encoding.

TUPLE: binary ;
M: binary <encoder> drop ;
M: binary <decoder> drop ;

Another reason is that certain encodings require processing at the beginning. For example, UTF-16 should write a byte order mark (BOM) immediately when it's initialized for writing, and read a BOM immediately when it's initialized for reading.

M: utf16 <decoder> ( stream utf16 -- decoder )
drop 2 over stream-read bom>le/be <decoder> ;

M: utf16 <encoder> ( stream utf16 -- encoder )
drop bom-le over stream-write utf16le <encoder> ;

Now, let's look at the other words. The idea of encode-char and decode-char is that it's simpler for encodings to encode or decode one character at a time than to implement all the relevant parts of the stream protocol. encode-char takes a character, an underlying stream and an encoding, and writes the encoded character to that underlying stream.

The inverse, decode-char, takes an underlying stream and an encoding and uses the encoding to pull a character from the stream. For everything I've implemented so far, the encoding is dropped after method dispatch, but when things like Shift JIS, which require state in decoding, are implemented, the state will be stored in the tuple.

This is all much simpler than my previous design, which required looping to decode a single character and forced encodings to adopt a complicated state-machine-based model. This is something like the third iteration of the encoding protocol I've made, and the code is finally starting to look good.

In Factor, it takes a little bit of work to make certain things, like encodings, have clean code. The appropriate abstractions aren't immediately obvious, but eventually they're found, and the result is far more maintainable and clean. I'm not sure what this implies for big projects with bad programmers. (But I plan to never work in an environment like that; better to be an academic if good work in a small company can't be found.)

Anyway, future pie-in-the-sky plans for encodings include treating cryptographic protocols and compression as encodings (under different protocols, of course). This is really cool: there are five orthogonal layers: stream, cryptography, compression, text encoding and usage. It'll be possible to compose them and factor out their compositions in any way you want! But this doesn't exist, so I probably shouldn't even be talking about it.

Update: Fixed stupid typos. Thanks Slava!

Sunday, March 9, 2008

Explaining garbage collection

Imagine you have a lot of garbage in your house. You have so much garbage that there's no room to put any new stuff. This isn't garbage that you've thrown in the garbage can, and it's a little unclear what's garbage and what's not.

See, you've built a bunch of Rube Goldberg contraptions in your house, and some machines share parts. Your washing machine might use a pencil somewhere to write down how much longer it has until it's done, and your drier might use the same pencil on a different pad of paper. Even if you decide you don't want the washing machine, you can't just throw out the pencil (though you can throw out the washer's pad of paper.) So how do you determine what you can safely throw away?

It'd be really tedious to do this yourself, so you bring out your handy dandy garbage collection robot. You give the robot a list of Rube Goldberg machines that you use, and the robot will look at what uses what. It'll throw out everything that is unused once it figures everything out.

One strategy the robot could use is to write down a list of all of the objects in your house. The robot will then go down the list of machines that you use and put a mark next to each one. Then it goes to each marked object and marks everything that that object uses, repeating until everything with a mark has been visited. When it's done, it can go back to all of the unmarked objects and throw them out. This is known as "mark and sweep" garbage collection.

Another strategy is, when there's no more room, to build a new house, and then put in it a copy of everything that you use. Then, you look at everything that those Rube Goldberg machines use and make a copy of that, too, for use in the new house. This doesn't have to be so inefficient, since you can actually reuse the first house again the next time garbage collection happens. This is known as "copying" garbage collection.

The last strategy is to just tell the robot, when you're building your machines, what uses what, and when you start and stop using things. The robot will keep a count of how many things use what. If only one machine is using something, or if you're using it directly and nothing else is, and then that usage stops, then you tell the robot and it throws out that thing. When it throws that piece of garbage out, it has to remember that each thing that that object uses is now used by one fewer thing. This is called "reference counting", because you count how many things refer to a particular object.

Now imagine all of this in a computer. Computer programs often amount to Rube Goldberg machines on a much larger scale. Different pieces of RAM refer to other pieces of RAM, and it's hard for the programmer to tell what she can safely throw out. By using a garbage collector, this can all be handled automatically.

Friday, March 7, 2008

A little more about garbage collection

In my last post about garbage collection, I mentioned that there are more complicated and efficient algorithms to do GC better. A few people asked me for a followup describing those, so here's my best shot. For inspiration, let's look at an excellent runtime system for a horrible programming language.

Java's garbage collection

Sun's HotSpot JVM gives us a good example of how to go about garbage collection. It may not be the best programming language, but for more than 10 years they've been working on a good implementation of a garbage collector. What they have now is pretty good, and programs can expect 95-99% throughput with few noticeable pauses. Everything can be tweaked by advanced users through command-line options.

Originally, the JVM used a very inefficient, naive GC algorithm. (I think it was mark-sweep, but I can't say for sure because the docs are fairly sketchy.) By version 1.2, Java switched to a generational scheme where copying (originally mark-compact) is used for the younger generations and mark-compact is used for the oldest generation. In version 1.3, there was an optional implementation of the train algorithm for applications which needed to avoid a pause, but this has since been discontinued. In version 1.4, an incremental mark-sweep algorithm was introduced for programs with harder real-time constraints, along with "heap ergonomics"--smart runtime heap and generation resizing. In Java 1.5, the main maximum-throughput generational collector was upgraded to use multiple threads in minor GC collections. Java 7 will have a new collector, invented at Sun, called Garbage First, which can be seen as an extremely complicated generalization of generational GC which I don't completely understand. A listing of the collectors in Java 6, along with information about G1GC, is here.

Java uses some interesting concepts in its garbage collector, and in this post I hope to describe enough to understand the ideas behind these things, though implementation details will vary wildly. I haven't described the incremental algorithms here (train, G1, concurrent mark-sweep), but hope to do so in a later post.

Increasing throughput with generational collection

We want to make our garbage collector such that programmers don't really have to worry about memory allocation costs. In copying and mark-compact collection systems, the cost of allocation is just incrementing a pointer, so this is great. Or is it? In truth, the cost is higher: each time we allocate memory, we move closer to having to do a full garbage collection, which results in either a traversal of the heap or copying of all the data on the heap. This isn't so good if we have to do it a lot, so memory allocation isn't so cheap.

With a quick realization, we can make this more efficient: most objects die young. This is called the weak generational hypothesis. (There's a stronger variant of this, but it's not really true, so we'll ignore it.) Anyway, once we have this realization, we can use generational garbage collection, or garbage collection with objects segregated by age.

Here's the idea, in somewhat better-thought-out form: let's take the heap and split it into three generations. The first generation is called the nursery, the second is called aging space and the last tenured space. Aging space is bisected into a "fromspace" and a "tospace" in the way that copying garbage collection works. When objects get allocated, they are first put in the nursery. When the nursery fills up, there is a minor garbage collection, where everything in the nursery which is referenced from the roots is copied to the next generation up, the place in aging space where allocation currently happens. When aging space fills up, we copy to the other half of aging space, back and forth in the style that copying GC normally works; a minor GC needs to happen whenever this takes place. When aging space is full (or almost full) after one of these back-and-forth GC cycles, an intermediate collection takes place, where things in aging space are moved to tenured space, and things in the nursery are moved to aging space. Tenured space can be managed by a mark-compact algorithm, or potentially something else. When tenured space (and everything below it) gets collected, it's called a major garbage collection.
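
In code, the nursery fast path and the minor-GC trigger might look something like this Java sketch (the sizes and names are made up, and the actual copying is stubbed out in a comment):

class GenerationalHeap {
    static final int NURSERY_SIZE = 1 << 20;   // a deliberately small nursery
    final byte[] nursery = new byte[NURSERY_SIZE];
    int nurseryTop = 0;

    // Allocation is just a pointer bump; assumes bytes <= NURSERY_SIZE.
    int allocate(int bytes) {
        if (nurseryTop + bytes > NURSERY_SIZE) {
            minorGc();
        }
        int address = nurseryTop;
        nurseryTop += bytes;
        return address;
    }

    void minorGc() {
        // Copy every nursery object reachable from the roots (and from
        // old-to-young pointers) up into aging space, then reset: whatever
        // wasn't copied was garbage.
        nurseryTop = 0;
    }
}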

There are variations to this, but that's the basic idea. Except what I wrote above is unsound, potentially: in a minor or intermediate GC, it's insufficient to only consider the roots; we also have to treat pointers from older generations to younger generations as roots. The idea is that we can figure out where these roots are without traversing the entire heap.

There are two ways to do this, using a remembered set or a card marking system. Both of these require a write barrier: a small piece of code that's executed on every pointer write. With a remembered set, there's a list of memory addresses stored which contain old-to-young pointers. The write barrier for this strategy must check each write operation to see if a pointer is being stored, and if so, whether it's an old-to-young pointer. A more efficient strategy is card marking: divide the heap for the older generations into cards of, say, 128 bytes each, and when an object is modified, unconditionally mark the card it lives on. Then, when a garbage collection occurs, old-to-young pointers only need to be scanned for among the cards which are marked. If no such pointer is found, the card can be unmarked.
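
A card-marking barrier is short enough to sketch in full. Here it is in Java with a made-up heap layout; a real VM would emit the barrier inline as a shift and a byte store:

class CardTable {
    static final int CARD_SIZE = 128;               // bytes per card
    static final long HEAP_BASE = 0;                // assumed start of the old heap
    static final byte[] cards = new byte[1 << 20];  // one dirty byte per card

    // Run on every pointer store; address is where the store landed.
    static void writeBarrier(long address) {
        cards[(int) ((address - HEAP_BASE) / CARD_SIZE)] = 1;   // unconditional
    }

    // At minor-GC time, only dirty cards are scanned for old-to-young pointers.
    static void scanDirtyCards() {
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] == 1) {
                long cardStart = HEAP_BASE + (long) i * CARD_SIZE;
                // scan the objects in [cardStart, cardStart + CARD_SIZE) for
                // pointers into the nursery, treating any found as roots
                cards[i] = 0;   // clean the card if no such pointer remains
            }
        }
    }
}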

A few more spaces to put stuff

What I've described so far interacts terribly with concurrency. This is because there has to be a global lock on allocation, which can form a bottleneck if there are many threads and allocation is relatively frequent. To fix this, just make a separate nursery for each thread. This is called a thread-local allocation buffer. The only thing you have to watch out for is that, whenever there's a GC going on from one TLAB, other allocation must stop or there is a risk that more than one TLAB could try to move to aging space at a time. Also, if something's referenced from more than one thread, it has to be moved outside of a TLAB (or else the write barrier has to be accommodated to deal with inter-thread pointers).
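
Here's the shape of TLAB allocation as a Java sketch (the sizes and names are mine): the fast path is a bump within the thread's own buffer with no locking at all, and the global lock only shows up on the refill path.

class TlabAllocator {
    static final int TLAB_SIZE = 64 * 1024;
    private static final Object heapLock = new Object();
    private static final ThreadLocal<TlabAllocator> current =
        ThreadLocal.withInitial(TlabAllocator::new);

    private byte[] buffer = new byte[TLAB_SIZE];
    private int top = 0;

    // Fast path: a bump inside this thread's own buffer, no locking at all.
    static int allocate(int bytes) {
        TlabAllocator tlab = current.get();
        if (tlab.top + bytes > TLAB_SIZE) {
            tlab.refill();   // assumes bytes <= TLAB_SIZE
        }
        int offset = tlab.top;
        tlab.top += bytes;
        return offset;
    }

    // Slow path: only here do threads contend for the shared heap.
    private void refill() {
        synchronized (heapLock) {
            // grab a fresh chunk from the shared nursery; a real system
            // might trigger a minor GC here if it's exhausted
            buffer = new byte[TLAB_SIZE];
            top = 0;
        }
    }
}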

It's fairly inefficient to copy large objects, so a collection strategy more like mark-sweep might be more appropriate. To have this, we can make a large object area where the bodies of large objects are put immediately when they are allocated. Though it's managed by a free list, fragmentation won't be so severe (though it still happens) because of the size of the objects involved. Headers for these objects aren't in the LOA itself but rather passed around the heap like regular objects, promoted through various age groups in the normal way.

Alternatively, large objects could just be immediately placed in the oldest generation. This strategy works best when the oldest generation isn't compacting.

How big should things be?

Counterintuitively, it can actually speed things up for the nursery and aging space to be relatively small. It's good for the nursery to be small because, if it's small enough, it can fit entirely in the CPU's L2 cache for faster access. Also, since most of it is expected to be garbage (since most objects die young), there's no benefit to having it be large. It's good for the aging space to be relatively small, too, since if it's too big, then the same long-lived objects will bounce back and forth inside of it, needlessly copied and increasing pause times. It'd be better if these objects were promoted to tenured space earlier. Also, aging space represents objects allocated at around the same time, so if they're moved to tenured space at the same time, they'll be close together in memory, improving locality.

When the tenured space fills up and stays filled up (or almost filled up) after a major garbage collection, the heap needs more space. You could increase the size of all of the generations, but because of the reasons in the previous paragraph, it's usually best just to increase the size of the tenured space, and maybe the large object area if you're using one.

So, say the heap size required was very high at some point, but now not as much is needed. How can we go back and make the heap smaller? It wouldn't be good to immediately free part of the heap when it's not being used, because it might be needed again soon. A good strategy is to set a soft goal for throughput: shrink the heap as long as the total throughput, or percentage of time spent not collecting garbage, stays above a certain bound. This will avoid a cycle of growing and shrinking the heap while allowing the heap to shrink as long as shrinking doesn't occupy too much time.
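
As a toy Java sketch (with made-up numbers), the policy is just a comparison after each major collection:

class HeapErgonomics {
    static final double MAX_GC_FRACTION = 0.05;   // soft goal: at most 5% of time in GC
    static final double STEP = 0.9;               // resize by 10% at a time

    long tenuredBytes = 256L * 1024 * 1024;

    void afterMajorGc(long gcTimeNs, long totalTimeNs) {
        double gcFraction = (double) gcTimeNs / totalTimeNs;
        if (gcFraction < MAX_GC_FRACTION) {
            tenuredBytes = (long) (tenuredBytes * STEP);   // throughput to spare: shrink
        } else {
            tenuredBytes = (long) (tenuredBytes / STEP);   // too much GC time: grow back
        }
    }
}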

Why this is still insufficient

Is this the end of what we need to know about garbage collection? When I started writing this post, I thought it would be. Generational collection, along with TLABs and a LOA, can significantly improve the performance of garbage collection, but there's still the risk of long, noticeable pauses caused by a full traversal of the heap.

So I still need to do a bit more research before being able to implement a perfect garbage collection algorithm for Factor which can avoid noticeable pauses. Maybe G1 is the answer here (but it's really complicated), or maybe something simpler like the train algorithm or concurrent mark-sweep (but these both have significant performance disadvantages). Until then, heap ergonomics, mark-compact or mark-sweep in the oldest generation and maybe a large object area would be good places to start in improving Factor's memory management system.

Right now, Factor's system consists of a three-generational copying collector which grows all generations by a particular factor when the heap is exhausted. The heap never shrinks. There is a separate section of the heap for compiled code, which uses a separate mark-sweep system, which can potentially make the system run out of memory when it doesn't have to. I hope this can be fixed by Factor 1.0, but if not, the GC will definitely be redone in Factor 2.0 when multiple system threads are supported.