Tuesday, February 26, 2008

A quick intro to garbage collection

I've been reading about garbage collection algorithms to potentially create a better one for Factor, so I thought I'd share the basics of what I've learned so far. Garbage collection, by the way, is the automatic reclamation of unreferenced heap-allocated memory. This means you can malloc without worrying about free, since the runtime system does it all for you.

Modern programming languages like Lisp, Java, Python, Prolog, Haskell and Factor all have some form of garbage collection built-in, whereas C, C++ and Forth require explicit freeing of unused memory.* Because all heap-allocated data is handled by the garbage collector (and stack allocation is rarely, if ever, used), garbage collector performance has an impact on all programs written in a language with GC.

For this reason, a lot of effort has been put into maximizing throughput (the amount of time spent not doing garbage collection), and more recently, minimizing the length of individual pauses, possibly at the expense of some throughput. Other good things a garbage collector can do is improve locality (for both cache and virtual memory paging purposes), reduce memory fragmentation, and provide for fast allocation. When everything is considered, a well-constructed garbage collector can actually make a program have greater throughput than manual memory management. When there's efficient allocation, then programmers can be free to use higher-level techniques. First, let's look at the simplest garbage collection algorithms.

Reference counting—the naive algorithm

Here's a pretty easy way to automatically collect unused memory. Put a counter on each object, signifying how many outside references there are to this object. When it gets to zero, you know there are no references to it, so it can be deleted. When another object starts referencing that object, it needs to increment the reference count by one. When it stops referencing it, the referrer needs to decrement the reference count.

This adds just a little overhead: at each read of a pointer, you may have to increment the reference count. But this cost is distributed throughout the program, and it's not that great; no noticeable pause is caused. Another piece of overhead is that each reference-counted object has to have a counter which is as big as a a pointer, since potentially, everything in memory points to a particular object. In practice, it doesn't need to be made this big for most applications, but in the case of overflow, you need some other mechanism to clean up the mess.

There's a bigger problem, though. Reference counting fails to collect cyclic data structures. Say I have a doubly linked list, or a tree with parent pointers. Then simple reference counting will fail to delete it when there are no outside references. Here's why: say we have three objects, A, B and C. A points to B. B points to C. C points to B. There are no other pointers to B and C. We'll assume A is a local variable, so its reference count is 1. B's reference count is two, since A and C point to it. C's reference count is 1 since A points to it. When A goes out of scope and stops pointing to B, then B's reference count will be decremented. So B and C will have a count of 1 and they won't be deleted!

There are a couple different ways to deal with this. One is to detect cycles some time. This isn't so good because cycles might be kinda long, and you don't want to go chasing pointers around unless you know where you're going. There's no good way to do this. Another way is to use a hybrid technique, where cycles are collected with something which doesn't have this weirdness. The easiest way to do this is with partial mark-sweep collection.

Ensuring soundness with mark-sweep

Another way to go about this is to just allocate memory as normal, without worrying about freeing it, until we run out of memory. At that time, you can scan the heap and free everything that's not being referenced.

Say that each object has a mark bit, determining whether it's being referenced. At the beginning of the heap scan, every mark bit is set to zero. Then, we do a depth-first search of the heap, starting at things that we know have references to then, and at each item we visit, we set the mark bit to 1. After the marking process is done, we can scan through the heap again and add everything that's not referenced to the free list.

One problem with this is that the stack space needed for the depth-first search is, potentially, as big as all of the memory. Even if you don't use the host programming language's stack and instead use a manually-maintained stack, things might not always work out. So the Deutsch-Schorr-Waite algorithm uses the space inside the pointers themselves to store what to do next. It uses pointer reversal during the depth-first search: each object has a flag bit to determine whether it's reversed or not, and when it's visited, it is changed to a pointer to its referent after reading its value and deciding to recurse on it. Pointer reversal is somewhat expensive, but it comes with a constant space guarantee, which is valuable.

The free list: not as efficient as it sounds

Any good C programmer will tell you, don't allocate memory in an inner loop. Why is this? It's because malloc'ing memory isn't a simple process, and in the worst case can become as slow, asymptotically, as traversing the entire heap. Here's how it works: when the program starts, there's this big free area called the heap where memory is available. There's something called a free list, a circular singly linked list holding all of the free blocks of memory. Initially, it contains just this one big block. When you allocate memory, the free list is searched to find a block that's big enough to hold what you want. Then the free list is altered to say that that memory is taken. This operation preserves the length of the free list.

Freeing memory is where the trouble starts. When you free memory, the old location is added to the free list. If memory is allocated in the same order that it's freed, this is no problem: adjacent free blocks of memory in the free list are joined, so there will only ever be two things in the free list. But this generally doesn't happen. Generally, as the program runs, things get freed in different orders, so the free list grows longer and longer, with pieces in it getting smaller and less usable. When the free list grows, it takes longer to allocate more memory. This is called heap fragmentation, and it's one reason why memory allocation is avoided in C.

There are ways around it. The simplest way to allocate memory is to look for the first place in the free list that the requested block size will fit, but you can also look for the best fit. Another strategy is to partition the heap into areas for different sizes of allocated memory. But, fundamentally, heap fragmentation in general is unavoidable because the allocator isn't free to move things around. Once a pointer is allocated, it's supposed to stay in the same place for ever. This limitation is maintained by refcounting and mark-sweep collectors. More advanced compacting collectors, like copying collectors and mark-compact collectors, don't have this limitation, making allocation cheaper.

Copying collection with Cheney's algorithm

Instead of dealing with a free list, we could just increment a pointer for allocation, and when memory fills up, move all the data that's referenced someplace else. This is the basic idea behind copying collection. We have two spaces, Fromspace and Tospace. All of the memory is allocated in Fromspace by incrementing a pointer, since everything in it above that pointer is free. When it fills up, copy each thing in Fromspace to Tospace. We can preserve the structure of the pointers by making sure that if something gets moved to Tospace, then its old value in Fromspace is set to a special value that indicates where it's already been copied to. When we're all done, swap the titles of Fromspace and Tospace.

This has the same basic problem as mark-sweep collection: we can't just do a depth-first search of the heap. The easiest way to solve this is by instead using breadth-first search. No, we don't want an external queue to store this stuff in; we can use Tospace itself.

Here's how it works with Cheney's algorithm. First, copy the stuff we know is referenced into Tospace. (These are called the roots, by the way.) Maintain one pointer, initially at the beginning of the heap, called "scan", and another, initially after the copied roots, called "free". "Scan" represents where we're looking for new things to copy, and "free" represents unalllocated memory in Tospace. Look at the thing at "scan", and copy it to right after "free". Then increment "scan", and increment "free" by the size of what you just copied. Continue this until scan and free are equal, that is, until nothing more is referenced that hasn't already been copied. To me, this is a very elegant instance of breadth-first search, where a queue emerges naturally between the "scan" and "free" pointers.

Sliding mark-compact using less space

Copying is pretty good because everything is compacted, and allocation is as simple as incrementing a pointer. But there's a cost: it takes twice as much space as the earlier algorithms because half of the space just isn't used at any given time. Another technique is to use something like mark-sweep (with the same marking phase), but instead of putting things on a free list, to move memory back towards the beginning. This is called mark-(sweep-)compact collection.

One goal in compaction is to preserve locality. The breadth-first search of copying collection doesn't do this well for most applications, but mark-compact collection can. The key to modern techniques for this, called "table-based", is to store information about free space that's been discovered in tables which are located in the free space itself. Once this information has been compiled, new locations for each piece of data can be calculated, pointers updated, and finally objects moved. The details are a little obscure, though.

Further ideas

This is only the beginning. There's a big problem with the last two methods discussed, that they can cause long pauses since they require a scan of the whole heap or all of the data that's in use. To do this, there are a number of possible improvements. The most basic one is generational garbage collection, and there are also incremental and concurrent methods. These get a lot more complicated, though, and I can't discuss them in this blog post. But before it's possible to understand those, you need the basics.



* Caveat: Linear Lisp (a Lisp dialect designed by Henry Baker) and JHC (a Haskell implementation) don't use garbage collection, instead using only stack allocation (in Linear Lisp) or region inference (in JHC). Mercury, a Prolog derivative, also uses region inference. But this is for another blog post which I'm not qualified to write. Also, you can define a garbage collector for C, C++ or Forth, but it has to treat each word with the precaution that it might be a pointer, even if it isn't.

If any of this was hard to understand (especially for lack of diagrams), I strongly recommend reading Garbage Collection: Algorithms for Automatic Dynamic Memory Management by Richard Jones and Rafael Lins.

Update: Minor changes in wording in response to comments.

10 comments:

Anonymous said...

Hey dan, actually when a program is garbage collected, it's usually faster than a malloced and freed, or perceived to be anyway.
that's because there aren't the extra operations of freeing when in the critical or main code, and it (the gc) is saved for later.
Also, I know that erlang's GC happens very fast, simply because each process is very small, and the gc happens within the process. so this is a way of working around inventing some super intelligent GC. hth, kobi

Gwenhwyfaer said...

dan: Re conservative GC: it's not just that you have to assume any word "is a pointer", it's worse - any word might be a pointer. So non-pointers cause incorrect retention, and any GC method which mutates pointers (eg. mark & compact) cannot be used. Static typing can help, but only if type information is available to the GC process, which requires compiler co-operation (or really good debugging metadata).

kobi: Each of erlang's processes might be very small, but there are thousands of them, and they might be distributed across many nodes - and unfortunately complete distributed garbage collection is impossible in the presence of patchily-reliable distributed systems (see the GC FAQ).

Daniel Ehrenberg said...

anonymous,
Some GC mechanisms are faster and others are slower. Historically, some scanning GC algorithms took multi-second pauses and had significantly less throughput than manual memory management. It's not completely clear-cut. It's hard to eliminate these pauses, which in the worst case take more time than any single malloc/free call, without some very complicated incremental or concurrent technique. (One concurrent technique is to take Erlang's route and just have a bunch of small heaps with no sharing.) Anyway, even if GC were always slower, the benefits of automatic memory management are so great that it'd be worth it.

Gwenhwyfaer,
Yeah, I know, I guess i just explained this badly, so I reworded that.

Eric said...

My guess is that you're already aware of it, but just in case, I thought I'd mention libgarbagecollector as it is "a plug-in garbage collector library designed to be used by high level language implementors".

http://tinyurl.com/339y5z

Out of curiousity: what sort of gc algorithm does Factor currently use?

Daniel Ehrenberg said...

Eric,
I didn't know about libgarbagecollector; that might be something to look into for using in Factor. Right now, Factor uses (for data) a three-generation copying collector and (for code) a mark-sweep collector. The GC isn't sound because a full data GC doesn't happen at the same time as a full code GC, so if there's code that nothing refers to, but doesn't fill up the code heap, and that code refers to data which exausts the data memory, the code won't be collected. But, more importantly, an incremental algorithm like libgarbagecollector's would eliminate noticeable pauses, which would be very good. (I'd be surprised if using libgarbagecollector didn't require rewriting large parts of it, though.)

Slava Pestov said...

I've looked at Io's libgarbagecollector in the past but decided it was not good enough for Factor. First it is not generational so it is only suitable for small heaps. It also sits on top of the C library's malloc so heap fragmentation and free list scanning are a concern.

Slava Pestov said...

I should also mention libgarbagecollector's use of libc malloc precludes scanning the entire heap or saving the heap to a file (Factor is image based).

None of this should be taken as bashing Dekote's work; his garbage collector library is small and easy to understand, and it works very well in the context in which it is used -- lightweight scripting.

Eric said...

Dan,

Thanks for posting this stuff and answering my question. As a new Factor tinkerer and not-so-new dynamic languages fan, it makes for interesting reading.

Slava,

Are you referring to the library's use of malloc to allocate it's own book-keeping data structures? In it's current version, the library gives the user control over how his objects are allocated and freed -- you don't have to use malloc if you don't want to.

Also, reading the Io OOPSLA 2005 slides, I read that Io has a generational collector. Maybe Io's implementation builds a generational collector on top of the library by clever use of the mark and free callbacks?

Eric said...

I did some more reading. Now I think I understand Slava's objections better.

Although libgarbagecollector allows you to specify how memory is allocated and freed, since the collector doesn't move objects, you run into the fragmentation problem whether you use malloc or not.

Sandro Magi said...

libGarbageCollector has serious flaws:

1. it's not really incremental; it has degenerate cases incurring long pauses, which kind of defeats the purpose of a heavyweight incremental algorithm.

2. it incurs three words of overhead per allocated object, not including per-object headers reserved by malloc/header, which can be up to 2 or 3 words on its own! That means up to 6 words per object.

That's simply unacceptable overhead for functional-type languages which typically allocate many small objects. Better to allocate large regions using malloc to create small custom GC heaps.

As for this blog post, reference-counting isn't really a "naive" algorithm. Bacon demonstrated that all garbage collection algorithms are actually hybrids of tracing and ref-counting. There has been significant research in reducing the cost of ref-counting as well, and ref-counting is now within 15% of the throughput of tracing GCs. Bacon also published a local cycle-collection algorithm which outperforms tracing in tight heaps, such as embedded devices, or for programs with large live data sets.

The cost of ref-counting is proportional to the cost of tracing dead objects, whereas the cost of mark-sweep is the cost of tracing live objects. If you have a large old generation with long-lived objects, ref-counting is thus more suitable than mark-sweep. Mark-sweep is more suitable for the young generation, where there's a high death rate. So combining mark-sweep for the young generation and ref-counting for the old gen is a good match. Thus was born age-oriented GC, a concurrent GC that almost matches concurrent mark-sweep in throughput with maximum pause times on the order of 2.5ms.

Bacon et al. then refined their cycle-collection algorithm in Efficient On-the-Fly Cycle Collection, and the age-oriented collector is now even more efficient.

If Factor needs a concurrent, incremental GC suitable for code and data collection, check out "Efficient On-the-Fly Cycle Collection". I'll probably end up using it as well.

Long live ref-counting! ;-)