Saturday, May 10, 2008

Parsing with regular expressions and group capture

Though I'd rather avoid it, string parsing is a crucial part of programming. Since we're more than 60 years into the use of modern computers, it seems like we should have a pretty good handle on how to build abstractions over parsing. Indeed, there are tons of great tools out there, like GNU Yacc, Haskell's Parsec, newer Packrat-based parsers like Factor's EBNF syntax for PEGs, and a bunch of other high level parsing libraries. These libraries are relatively easy to use once you understand the underlying structure (each one parses a different subset of context-free grammars), because they expose the programmer to a tree-like view of the string.

However, these incur too much overhead to be used for certain domains, like the parsing that goes on in an HTTP server or client. They're really overkill when, as in the case of HTTP interchanges, what you're dealing with is a regular language and processing can be done on-line. (I'll get back later to what I mean by those two things.) The main tools that exist to deal with this are Lex and Ragel.

Ragel seems like a really interesting solution for this domain. The entire description of parsing is eventually put in one regular expression, which is compiled to a DFA, where states and transitions can be annotated by actions. Fine-grained control is given to limit non-determinism. But the user must be acutely aware of how regular expressions correspond to DFAs in order to use the abstraction. So it is somewhat leaky. Also, it's difficult to get a tree-like view on things: actions are used purely for their side effect.

So, here's an idea: let's find a middle ground. Let's try to use regular expressions, with all of their efficiency advantages, but get an abstract tree-like view of the grammar and an ability to use parsing actions like high-level abstractions allow. Ideally, the user won't have to know about the implementation beyond two simple facts: regular languages can't use general recursion, and nondeterminism should be minimized.

This isn't something that I've implemented, but I have a pretty good idea for the design of such as system, and I wanted to share it with you. First, a little background.

DFAs and regular expressions


I'm taking a computer science class about this right now, so I'm gonna be totally pedantic. When I say regular expression, I mean an expression that describes a regular language. Perl regexes aren't regular expressions (and Larry Wall knows this). If you don't feel like putting on your theoretician's hat, this blog post will be mostly meaningless.

What's a regular language? First off, a language is a set of strings. We care about infinite sets of strings, since finite sets are trivial to represent. If a string is in the language, that means that the language matches the string, intuitively. A regular language is one which can be represented by a deterministic finite automaton (DFA) without extensions, also called a finite state machine (FSM) for some reason. Many useful languages are regular, and many are not.

The idea of a DFA is a finite number of states and a transition function, which takes the current state and a character of a string and returns the next state. The transition function is defined on all states and all characters in the alphabet. There is a set of final states, and if the string runs out when the machine is in a final state, then the string is accepted. The language of the DFA is the set of strings accepted by the DFA. For a given DFA, that DFA can be run in linear time with respect to the length of the input string and constant space. It can also be run "on-line", that is, without the whole string in advance, going incrementally.

A related construction is an NFA, or nondeterministic finite automaton. Imagine the previous idea, but instead of a transition function, there is a transition relation. That is, for any character and current state, there are zero or more next states to go to, and the NFA always picks the right one. This is called nondeterminism (at least that's what it means here). Amazingly, NFAs can accept only regular languages and nothing more, because NFAs can be translated into DFAs. Basically, you build a DFA which picks all possible states at once, given all possible paths through the NFA. Potentially, though, there's an exponential blowup in the number of states.

Every regular expression can be converted into an equivalent NFA, which can be converted into a DFA, which can then be converted back into a regular expression. They're all equivalent. So then what's a regular expression? There are different ways to define it. One is that you can build up a regular expression from the following elements: the epsilon regex (matching the empty string), the empty regex (matching nothing), single character regexes (matching just a single character), concatenation (one followed by another), disjunction (or) and the Kleene star (0 or more copies of something). Counterintuitively, it's possible to construct regexes which support negation, conjunction, lookahead and other interesting things.

The most important distinction from Perl regexes is that regular expressions cannot contain backreferences, because these are provably impossible to express in a DFA. It's impossible to parse something with backreferences in the same linear time and constant space that you get from regexes which are regular. In fact, parsing patterns with backreferences is NP-hard and not believed possible in polynomial time (with respect to the length of the input string). Since regular expressions which are regular give us such nice properties, I'm going to stick to them.

Regular expressions in practice in parsing today


The formal study of regular languages is a basically solved problem within the formalism itself: they are equivalent to DFAs, and satisfy a convenient set of properties summarized by the pumping lemmaand the Myhill-Nerode theorem. The problem is just, is the given string a member of the language? What languages are regular?

This was solved in the 1950s and 1960s, and the basic results are in most introductory compiler books. Those books use the solution to build lexers, like Lex. Lex basically takes a list of regular expressions, each associated with an action, finds one of them to match maximally with the current input, executes the associated action on the portion that matches, and then repeats with the rest of the string. This is useful to build lexers, but the programmer has very little context for things, so it's difficult to use for much else.

More recently, Ragel has been used as a way to parse more complicated things using regular expressions. Its strategy is to turn its compile-time input into one big regular expression, annotated with actions on certain states or transitions. The actions are fragments of C code, and they form the processing power of the machine. However, their behavior can get a little unintuitive if too much nondeterminism is used, so Ragel provides a bunch of tools to limit that. Also, Ragel lets you explicitly specify a DFA through transitions, which seems useful but low-level.

Group capture with regexes


One of the most useful features of Perl's regular expressions is group capture. By this, I mean how you can do something like s/^(1*)(0*)$/$2$1/ to swap ones and zeros in a string. This is different from backreferences (like the non-regular expression /(.*)$1/) because it's only used in subsequent code, to figure out what got matched to what part of the regex. It doesn't parse any languages which aren't regular, but it's a useful tool for processing.

Curiously, this has been ignored both by academics and DFA implementors so far. I hypothesize that it's been ignored by theorists for two reasons: (1) It's easy to confuse with backreferences, which make the language non-regular, which is totally uninteresting to theorists (2) They're not part of the formalism of regular expressions as previously expressed.

Implementors of (non-Perl-based) regular expression-based parsing mechanisms tend to avoid group capture because, in the general case, it's not fast enough and can't be done on-line. Also, as far as I can tell, it hasn't been implemented any other way than interpreting an NFA, using backtracking, and keeping track of where the parser is within the regex to determine group boundaries. This would be terrible for the domain of Lex and Ragel. By "on-line" I don't mean on the internet but rather that an algorithm that can be performed incrementally, getting little pieces (characters, say) of the input and doing computation incrementally, as the input is received, without storing the whole thing and running the algorithm all at once.

So how can we do group capture on-line? Well, in general, we can't. Consider the regular expression (1*)1 where you're trying to capture the group 1*. As the input is being processed, we don't know when we've gotten to the end of the group until the entire input is over, since if there are two more 1's, then the first one must be in the group. However, in many cases group capture can in fact be done on-line, as in (0*)(1*), where the groups captured are 0* and 1*. As the regex is processing on the string, it knows that, if there is a match, the group boundary is just before the first 1. This can be formalized as a "boundary of determinism": a point where, in the subset construction to form a DFA from an NFA gets a subset of exactly one state.

I believe this can handle most cases of group capture in practice, if the regular expression is well-written, but surely not all of them. I have an idea for how to do group capture in the few remaining circumstances, but unfortunately it takes linear space and it's not online. I'll blog about it once I have a proof of correctness.

Hierarchical parsing using group capture


Using this group capture mechanism, we can build a hierarchical parsing mechanism with actions on different things, which can be built to parse regular languages in a higher-level way. Regular expressions can't use arbitrary recursion like context-free grammars can, so the parse tree will be of fixed size, but it could still be useful. In designing this, I'm thinking specifically about making a SAX-like XML parser. It'd be awkward to write everything out as one big regular expression, but split into smaller things, each with their own little steps in processing, it could be much more elegant. My goal for syntax is something like EBNF syntax, as Chris Double's PEGs library in Factor does. Here's some future pseudocode for how it could look in parsing an XML tag, simplified. (In this code, > is used like Ragel :>>, to indicate that when the expression afterwards can be matched by the regex, it is, as soon as possible (basically).)

REG: tag
chars = "&" entity:any* > ";" [[ entity lookup-entity ]]
| any
string = "\"" str:chars > "\"" [[ str ]]
| "'" str:chars > "'" [[ str ]]
xml-name = name-start-char name-char*
attribute = name:xml-name ws "=" ws str:string [[ name str 2array ]]
tag = "<" ws closer?:("/"?) name:xml-name attrs:(attribute*) ws contained?:("/"?) ws ">" [[ ... ]]

Conclusion


Though I haven't implemented this yet, and probably shouldn't even be talking about it, I'm really excited about this idea. I even came up with a stupid little name with it: Hegel, both for High-level Ragel and because it represents the synthesis of the dialectic (as described by Hegel) of slower, higher-level parsing and fast low-level parsing into fast, high-level parsing of regular languages. I hope it works.

Wednesday, May 7, 2008

Interval maps in Factor

Recently, I wrote a little library in Factor to get the script of a Unicode code point. It's in the Factor git repository in the vocab unicode.script. Initially, I relatively simple representation of the data: there was a byte array, where the index was the code point and the elements were bytes corresponding to scripts. (It's possible to use a byte array because there are only seventy-some scripts to care about.) Lookup consisted of char>num-table nth num>name-table nth. But this was pretty inefficient. The largest code point (that I wanted to represent here) was something around number 195,000, meaning that the byte array took up almost 200Kb. Even if I somehow got rid of that empty space (and I don't see an obvious way how, without a bunch of overhead), there are 100,000 code points whose script I wanted to encode.

But we can do better than taking up 100Kb. The thing about this data is that scripts are in a bunch of contiguous ranges. That is, two characters that are next to each other in code point order are very likely to have the same script. The file in the Unicode Character Database encoding this information actually uses special syntax to denote a range, rather than write out each one individually. So what if we store these intervals directly rather than store each element of the intervals?

A data structure to hold intervals with O(log n) lookup and insertion has already been developed: interval trees. They're described in Chapter 14 of Introduction to Algorithms starting on page 311, but I won't describe them here. At first, I tried to implement these, but I realized that, for my purposes, they're overkill. They're really easy to get wrong: if you implement them on top of another kind of balanced binary tree, you have to make sure that balancing preserves certain invariants about annotations on the tree. Still, if you need fast insertion and deletion, they make the most sense.

A much simpler solution is to just have a sorted array of intervals, each associated with a value. The right interval, and then the corresponding value, can be found by simple binary search. I don't even need to know how to do binary search, because it's already in the Factor library! This is efficient as long as the interval map is constructed all at once, which it is in this case. By a high constant factor, this is also more space-efficient than using binary trees. The whole solution takes less than 30 lines of code.

(Note: the intervals here are closed and must be disjoint. <=> must be defined on them. They don't use the intervals in math.intervals to save space, and since they're overkill. Interval maps don't follow the assoc protocol because intervals aren't discrete, eg floats are acceptable as keys.)

First, the tuples we'll be using: an interval-map is the whole associative structure, containing a single slot for the underlying array.


TUPLE: interval-map array ;

That array consists of interval-nodes, which have a beginning, end and corresponding value.

TUPLE: interval-node from to value ;

Let's assume we already have the sorted interval maps. Given a key and an interval map, find-interval will give the index of the interval which might contain the given key.

: find-interval ( key interval-map -- i )
[ from>> <=> ] binsearch ;

interval-contains? tests if a node contains a given key.

: interval-contains? ( object interval-node -- ? )
[ from>> ] [ to>> ] bi between? ;

Finally, interval-at* searches an interval map to find a key, finding the correct interval and returning its value only if the interval contains the key.

: fixup-value ( value ? -- value/f ? )
[ drop f f ] unless* ;

: interval-at* ( key map -- value ? )
array>> [ find-interval ] 2keep swapd nth
[ nip value>> ] [ interval-contains? ] 2bi
fixup-value ;

A few convenience words, analogous to those for assocs:

: interval-at ( key map -- value ) interval-at* drop ;
: interval-key? ( key map -- ? ) interval-at* nip ;

So, to construct an interval map, there are a fewi things that have to be done. The input is an abstract specification, consisting of an assoc where the keys are either (1) 2arrays, where the first is the beginning of an interval and the second is the end (2) numbers, representing an interval of the form [a,a]. This can be converted into a form of all (1) with the following:

: all-intervals ( sequence -- intervals )
[ >r dup number? [ dup 2array ] when r> ] assoc-map
{ } assoc-like ;

Once that is done, the objects should be converted to intervals:

: >intervals ( specification -- intervals )
[ >r first2 r> interval-node boa ] { } assoc>map ;

After that, and after the intervals are sorted, it needs to be assured that all intervals are disjoint. For this, we can use the monotonic? combinator, which checks to make sure that all adjacent pairs in a sequence satisfy a predicate. (This is more useful than it sounds at first.)

: disjoint? ( node1 node2 -- ? )
[ to>> ] [ from>> ] bi* < ;

: ensure-disjoint ( intervals -- intervals )
dup [ disjoint? ] monotonic?
[ "Intervals are not disjoint" throw ] unless ;

And, to put it all together, using a tuple array for improved space efficiency:

: <interval-map> ( specification -- map )
all-intervals [ [ first second ] compare ] sort
>intervals ensure-disjoint >tuple-array
interval-map boa ;

All in all, in the case of representing the table of scripts, a table which was previously 200KB is now 20KB. Yay!

Saturday, May 3, 2008

A couple GC algorithms in more detail

In previous posts on garbage collection, I've given a pretty cursory overview as to how things actually work. In this post, I hope to give a somewhat more specific explanation of two incremental (and potentially concurrent or parallel, but we'll ignore that for now) GC algorithms: Yuasa's snapshot-at-the-beginning incremental mark-sweep collector, and the MC2 algorithm. Yuasa's collector is very widely used, for example in Java 5 when an incremental collector is requested. MC2 is a more recent algorithm designed to reduce the fragmentation that mark-sweep creates, and appears to get great performance, though it isn't used much yet. In their practical implementation, both collectors are generational.

Yuasa's mark-sweep collector


The idea is pretty simple: take a mark-sweep collector and split up the work, doing a little bit on each allocation. When the heap occupancy passes a certain threshold, say 80%, switch into "mark phase", and on each allocation, mark the right amount of the heap so that everything's marked by the time the heap is full. (You can ensure this by making the amount of marking proportional to the amount of memory allocated.) Then, switch into sweep phase, and on each allocation sweep the heap by a certain amount. If a big object is allocated, sweeping continues until there's enough room. Once sweeping is done, the collector returns to a neutral state and allocation takes place without any special collection actions until the free space dips below the threshold.

Making this work


This is a neat little way to specify a GC algorithm. The implementor has three knobs at their disposal: the threshold to begin collection, the speed of marking, and the speed of sweeping. But there's a problem: the algorithm, as I described it, doesn't work. See, the graph of interconnections in the heap may change during the course of marking, and that's a problem. As I described in a previous post, if a pointer gets moved to another location, it might evade marking and get swept, causing memory corruption.

In a snapshot-at-the-beginning incremental marking GC, the technique to save this is to trap all pointer writes and execute a little bit of code: if the collector is in the marking phase, and if the old pointer value isn't marked, it needs to get marked and get pushed on the marking stack so that its children get marked. (The marking stack is the explicit stack used for depth-first traversal of the heap, to mark everything it reaches.) This piece of code is called the write barrier, and it goes on in addition to the generational write barrier, if one is necessary.

Conservativeness


One more thing: objects are allocated as marked, if an object is allocated during a GC cycle. This means that they can't be collected until the next time around. Unfortunately, this means that any generational GC will be ineffective while marking is going on: everything is effectively allocated in the oldest generation. Nevertheless, generations still provide a significant performance advantage, since most time is spent in the neural non-GC state.

This is called snapshot-at-the-beginning not because an actual snapshot is made, but because everything is saved that had something referring to it at the beginning of the marking phase. (Everything that gets a reference to it during the cycle is also saved.) Of all incremental mark-sweep GC algorithms, a snapshot-at-the-beginning collector is the most conservative, causing the most floating garbage to lie around and wait, uncollected, until the next cycle. Other algorithms have techniques to avoid this, but it often comes at other costs.

MC2


Unfortunately, no matter what strategy is used to minimize fragmentation, there is a program which will cause bad fragmentation of the heap, making it less usable and allocation more expensive. For this reason, a compaction strategy is helpful, and the MC2 algorithm (Memory-Constrained Compaction), created by Narendran Sachindran, provides one within an incremental and generational system. The details are somewhat complicated, and in this blog post I'll offer a simplified view. You can also look at the full paper.

MC


The idea is based on the Mark-Copy (MC) algorithm. The heap is divided up into a number of equally sized windows, say 40. One of these is the nursery, and the others act as tenured space. (I don't know why, but the papers about this seem to use a two-generation rather than three-generation model. I think it could easily be updated to use three generations, but I'll stick with this for now.) Each window has a logical number, with the nursery having the highest number.

Nursery collections go on as I've described in a previous post. A tenured space collection is triggered when there is only one (non-nursery) window left free. At this point, the heap is fully marked. During marking, remembered sets of pointers into each window are made. In turn, each window is copied (using Cheney's copying collector) to the open space that exists, starting in the one free window. The remembered sets can be used to update pointers that go to things that were moved. If the lowest number window is copied first, the remembered sets only need to contain pointers from higher windows to lower windows.

New modifications


MC2 adds a few things to this, to make the algorithm incremental and give low upper bounds on space overhead. The first change is that incremental marking is done. This is similar to the incremental snapshot-at-the-beginning marker described above, though the creators of MC2 opted for a version called incremental update, which is less conservative and more complicated but equally sound. The next change is in the copying technique. If a window is determined to have high occupancy (like more than 95%), it is left as it is without copying. Otherwise, windows are collected into groups whose remaining data can fit into one window. Those groups are incrementally copied into a new window.

Other changes make sure that the space overhead is bounded. The size of remembered sets is limited by switching to a card marking system in the event of an overflow. Objects with many references to them are put in semi-permanent storage in the lowest possible window number, minimizing the size of remembered set that they need.

In a benchmark included in the MC2 paper, it is demonstrated that MC2 has the same or slightly better performance compared to non-incremental generational mark-sweep or generational mark-compact, the alternatives for the domain of memory-constrained systems. Pauses more than 30ms are rare, and performance appears to be consistent over a wide range of Java programs.

Tuesday, April 29, 2008

Potential ideas to explore

I haven't written in a while, and it's a little hard to get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent...

Possible extensions to Inverse

I've been thinking about possible ways to generalize my system for concatenative pattern matching, currently in extra/inverse. There are two ways to go about it: making a more general constraint solving system, and giving access to the old input when inverting something, as in the Harmony project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem like they're coming from nowhere, but they're actually very closely related. (To someone not familiar with it, see my previous blog post describing Inverse.)

Constraint solving

The idea of resolving constraints is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even if enough facts are known to, potentially, figure out what everything is. For example, Diophantine equations can be solved by a fully general constraint-solving system, but they're known to be undecidable in general.

So what can constraint solving get you in Inverse? Well, imagine an inverse to bi. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting [ first ] [ second ] bi. Inverting this should get the same result as first2 (which has a hard-coded inverse right now, inverting to 2array). But it won't work.

A way for [ first ] [ second ] bi to work would be using the following steps:
  1. Initialize a logic variable X as unbound
  2. Unify X with the information, "the first element is what's second from the top of the stack (at runtime)". Now it's known that X is a sequence of length at least 1.
  3. Unify X with the information, "the second element is what's on the top of the stack (at runtime)". Now it's know that X is a sequence of length at least two.
  4. From the information we have about X, produce a canonical representation, since the inverted quotation is over: an array of the minimum possible length.

This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out.

Formally, you can think of Inverse as already a reasonable constraint solving system, for a limited problem domain. Given [ f ], and the statement about stacks A and B that f(A) = B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But, a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor.

The Harmony Project


The Harmony Project, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. Everything operates on edge-labeled trees.

Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.)

So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now.

Backtracking

In Mark Tullsen's paper on first-class patterns, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. It's the simplest to use the Maybe monad, and that corresponds to how pattern matching systems normally work. But if the List monad is used instead, then you easily get backtracking. This could be ported to Factor either by using monads or, maybe easier, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant in things like switch statements, rather than undo itself. I'm not sure if it'd actually be useful in practice.

Garbage collection research ideas

Because the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:
  • Figure out how efficient garbage collection on Non-Uniform Memory Access systems can work. The problem (if it is a problem) is that plain old garbage collection on multiprocessor NUMA systems isn't as fast as it could be, because a chunk of memory allocated for a thread may be far away from where it's used. One way to ensure locality is to give each processor (at least) its own heap, where the heap is guaranteed to be stored in the closest memory. But if data needs to be shared between processors, this can be too limiting. A piece of data can be kept on the RAM closest the processor which made the allocating call, but maybe it'd be beneficial to collect data on which processor is using which data, and dynamically move data around to different places in RAM to put it closest to where it's used. A related issue is maximizing locality when actually performing the tracing in the GC, which I have no ideas about.

  • Run a real benchmark comparing several GC algorithms. Probably the most annoying thing for programming language implementors trying to pick a good GC algorithm is that there's no comprehensive benchmark to refer to. No one really knows which algorithm is the fastest, so there are two strategies remaining: pick the one that sounds the fastest, or do trial and error among just a few. Each paper about a new algorithm reports speed improvements—over significantly older algorithms. It'd be a big project, but I think it's possible to make a good benchmark suite and test how long it takes for these algorithms to run, in terms of absolute throughput and pause length and frequency, given different allocation strategies. If it's possible, it'd be nice to know what kind of GC performs best given a particular memory use pattern.

  • Garbage collector implementation in proof-carrying code. There are a couple invariants that garbage collectors have, that must be preserved. For example, the user can't be exposed to any forwarding pointers, and a new garbage collection can't be started when forwarding pointers exist. The idea of proof-carrying code (an explicit proof, which is type-checked to be accurate, is given with the code) isn't new; it's mostly been used to prove memory consistency safety given untrusted code. But maybe it could be used to prove that a GC implementation is correct.

These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with.

Ragel-style state machines in Factor

In my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be written as a deterministic finite automaton (finite state machine). Hopefully, I'll explain more about this in a future blog post.

Anyway, if you've heard of Ragel, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Pacrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same.

I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once.

(By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.)

Sunday, April 6, 2008

Programming in a series of trivial one-liners

Among Perl programmers, a one-line program is considered a useful piece of hackage, something to show off to your friends as a surprisingly simple way to do a particular Unix or text-processing task. Outsiders tend to deride these one-liners as line noise, but there's a certain virtue to it: in just one line, in certain programming languages, it's possible to create meaningful functionality.

APL, lived on by its derivatives like K, Q, J and Dyalog pioneered the concept of writing entire programs in a bunch of one-liners. Because their syntax is so terse and because of the powerful and high-level constructs of array processing, you can pack a lot into just 80 characters. In most K programs I've seen, each one does something non-trivial, though this isn't always the case. It can take some time to decode just a single line. Reading Perl one-liners is the same way.

Factor continues the one-line tradition. In general, it's considered good style to write your words in one, or sometimes two or three, lines each. But this isn't because we like to pack a lot into each line. Rather, each word is rather trivial, using the words defined before it. After enough simple things are combined, something non-trivial can result, but each step is easy to understand.

Because Factor is concatenative (concatenation of programs denotes composition) it's easier to split things into these trivial one-liners. It can be done by copy and paste after the initial code is already written; there are no local variables whose name has to be changed. One liners in Factor aren't exceptional or an eccentric trait of the community. They're the norm and programs written otherwise are considered in bad style.

Enough philosophizing. How does this work in practice? I'm working on encodings right now, so I'll break down how this worked out in implementing 8-bit encodings like ISO-8859 and Windows-1252. These encodings are just a mapping of bytes to characters. Conveniently, a bunch of resource files describing these mappings which are all in exactly the same format is already exists on the Unicode website.

The first thing to do in implementing this is to parse and process the resource file, turning it into two tables for fast lookup in either direction. Instead of putting this in one word, it's defined in five, each one or two lines long. First, tail-if is a utility word which works like tail but leaves the sequence as it is if it's shorter.


: tail-if ( seq n -- newseq )
2dup swap length <= [ tail ] [ drop ] if ;

Using that, process-contents an array of lines and turns it into an associative mapping (in the form of an array of pairs) from octets to code points.

: process-contents ( lines -- assoc )
[ "#" split1 drop ] map [ empty? not ] subset
[ "\t" split 2 head [ 2 tail-if hex> ] map ] map ;

byte>ch takes this assoc, the product of process-contents and produces an array which can be used to get the code point corresponding to a byte.

: byte>ch ( assoc -- array )
256 replacement-char <array>
[ [ swapd set-nth ] curry assoc-each ] keep ;

ch>byte is the opposite, taking the original assoc and producing an efficiently indexable mapping from code points to octets.

: ch>byte ( assoc -- newassoc )
[ swap ] assoc-map >hashtable ;

Finally, parse-file puts these all together and makes both mappings, given a stream for the resource file.

: parse-file ( stream -- byte>ch ch>byte )
lines process-contents
[ byte>ch ] [ ch>byte ] bi ;

Next, the structure of the encoding itself is defined. A single tuple named 8-bit is used to represent all 8-bit encodings. It contains the encoding and decoding table, as well as the name of the encoding. The encode-8-bit and decode-8-bit words just take some encoding or decoding information and look the code point or octet up in the given table.

TUPLE: 8-bit name decode encode ;

: encode-8-bit ( char stream assoc -- )
swapd at* [ encode-error ] unless swap stream-write1 ;

M: 8-bit encode-char
encode>> encode-8-bit ;

: decode-8-bit ( stream array -- char/f )
swap stream-read1 dup
[ swap nth [ replacement-char ] unless* ] [ nip ] if ;

M: 8-bit decode-char
decode>> decode-8-bit ;

I wanted to design this, like existing Unicode functionality, to read resource files at parsetime rather than to generate Factor source code. Though I don't expect these encodings to change, the result is still more maintainable as it leaves a lower volume of code. If I were implementing this in C or Java or R5RS Scheme or Haskell98, this wouldn't be possible. So make-8-bit defines an encoding given a word and the lookup tables to use:

: make-8-bit ( word byte>ch ch>byte -- )
[ 8-bit construct-boa ] 2curry dupd curry define ;

define-8-bit-encoding puts everything together. It takes a string for the name of an encoding to be defined and a stream, reads the appropriate resource file and defines an 8-bit encoding.

: define-8-bit-encoding ( name stream -- )
>r in get create r> parse-file make-8-bit ;

To top it all off, here's what's needed to define all the 8-bit encodings we want:

: mappings {
{ "latin1" "8859-1" }
{ "latin2" "8859-2" }
! ...
} ;

: encoding-file ( file-name -- stream )
"extra/io/encodings/8-bit/" ".TXT"
swapd 3append resource-path ascii ;

[
"io.encodings.8-bit" in [
mappings [ encoding-file define-8-bit-encoding ] assoc-each
] with-variable
] with-compilation-unit

So by combining these trivial one-liners or two-liners, you can make something that's not as trivial. The end product is that hard things are made easy, which is the goal of every practical programming language. The point of this isn't to say that this code is perfect (it's very far from that), but just to demonstrate how clear things become when they're broken down in this way.

When I first started programming Factor, I thought that it only made sense to define things separately when it was conceivable that something else would use them, or that it'd be individually useful for testing, or something like that. But actually, it's useful for more than that: for just making your program clear. In a way, the hardest thing to do when programming in Factor once you have the basics is to name each of these pieces and factor them out properly from your program. The result is far more maintainable and readable than if the factoring process has not been done.

Saturday, March 29, 2008

How the Factor meeting went in New York

I invited all of you, at the very last minute, to come meet me in New York to talk about Factor and stuff, and at least two people asked me to post in detail about what happened... so here's my best shot. Dan McCarthy was the brave soul who attended, and we had a really interesting conversation about various aspects of programming. One thing we discussed was the inverse pattern matching library. I showed Dan how it works, and he found it really interesting that quotations were sequences at runtime—similar to s-expressions, but directly executed. Dan works as a programmer/sysadmin at a company that provides closed captioning services for media companies, and it seems like a more interesting task than I would have thought. There are some text encoding issues there (the HD captioning standard, if I understand it correctly, actually has encoding left unspecified for characters outside of Windows-1252, though it leaves room for two-byte and three-byte characters) and Dan has been researching them for a project for a Korean client. I explained the encoding definition protocol to Dan, and I'm going to try to get him to implement East Asian encodings, which there seem to be quite a few of in use (Shift-JIS, ISO 2022-JP, GB 2312, Big5, EUC-JP, EUC-KR, GB 18030). These all need big tables for encoding and decoding, and some require state to decode. Many have multiple possible representations of the same string for output, which complicates things somewhat. So, there's not much to report, but I've definitely learned my lesson about organizing things: I need to announce things more than 11 days in advance, and I need to advertise them better.

Sunday, March 23, 2008

Some more advanced GC techniques

After my last two posts about garbage collection, some people people suggested some more advanced techniques be used to solve the pausing problem. Here's a quick* overview of some more advanced techniques, some of which can eliminate noticeable pauses and some of which can solve other problems in GC.

The train algorithm

The idea of the train algorithm is to break the heap (or, really, just the oldest generation) into small chunks that can be collected individually. These chunks need to contain a list, or remembered set, of things from the outside that point into it. (Card marking doesn't work so well in the presence of many separate chunks.) Then, crucially, cyclic data structures need to be put in the same chunk, or at least the same group of chunks which get collected together. You can think of the chunks as cars and the groups of chunks as trains.

It's a bit difficult to get this whole thing sound, though. The precise strategy is described really well in this term Würthinger. That PDF has tons of great diagrams. Java used to use the train algorithm optionally, but it was deprecated because the train algorithm has high overhead in terms of throughput and it can take several GC cycles to delete a cyclic data structure: as many as there are elements in the cycle.

Incremental mark-sweep

Another thing we can try is making mark-sweep incremental. Mark-sweep isn't that good for general collection, since it can cause memory fragmentation and make allocation slow. However, for just the oldest generation (aka tenured space) in a generational system, it's not completely horrible. Since the oldest generation can be pretty large, compaction takes a long time, since everything has to be copied. (This is true whether you're using mark-compact or copying collection.)

So, can we base something off mark-sweep that eliminates long pauses? Well, going by the heading, I guess we could try to make it incremental. There are two pieces to this: incremental mark and incremental sweep. Actually, instead of incremental sweep, we can do either lazy sweep (sweep as much as we need whenever there's an allocation) or concurrent sweep (sweep in a concurrent thread, and have allocating threads block on allocation until there's enough space swept).

The marking phase is more difficult because of a consistency problem. Imagine this situation with objects A, B and C. A and C are pointed to by something that we know we're keeping. When marking starts, A points to B, and B and C points to null. The marker visits C and marks it as already visited. Then, before the marker visits A or B, C is changed to point to B and A is changed to point to null. Then, the marker visits A, and A gets marked. But then B is never marked, and it is lost!

The easiest solution is to trap all pointer writes to watch out for cases like this, making sure that B gets marked when A is changed. This is called a snapshot-at-the-beginning write barrier. But this makes it so that, if A and C both point to null, B still won't be collected until the next time around. That phenomenon is called floating garbage, and more subtle strategies remedy it. Most of these incremental algorithms can be parallelized with a little bit of work.

Aside from the book I recommended before, a good resource on incremental techniques is this ACM survey on garbage collection. [Update: There's also this great blog post which I forgot to mention before. It has lots of pretty diagrams.]

Garbage-first

The people at Sun have come up with a new garbage collection strategy for Java called garbage-first garbage collection (G1GC). The idea is somewhat similar to the train algorithm: the heap is split up into small chunks which can be collected separately, maintaining a remembered set of inward references. But the G1GC uses all kinds of crazy heuristics to figure out what chunks are most likely to have a small remembered set. This works so well that this "youngness heuristic" can completely replace the generational mechanism. The whole thing is led by user-specified parameters about the maximum allowable pause time and throughput goals.

There's a paper describing G1GC [Update: link no longer requires ACM access], but I can't really understand it. A more intelligible source is FAQ #4 of the most recent blog post on Jon Masamitsu's blog. (Jon works in Java's GC group.)

Reference counting with concurrency and cycles

In a nondeterministically multithreaded environment, reference counting has problems. Increment and decrement operations need to be atomic, or else there will be consistency issues. For example, if two concurrent threads try to increment the reference count of a single variable at the same time, and it works out that they both read and then both write, then the reference count will only increase by one. This might mean that the memory is freed while there are still references to it! In the same way, a decrement could be missed.

A bad solution is to put a lock on each reference count. This is bad because it's slow: every time there's a new reference, you need to not only increment the refcount but also acquire and then free a lock. Another solution is to gave a worker thread which handles all increments and decrements; all other threads send messages to it.

To handle cycles, you could use a hybrid approach, to use mark-sweep when memory runs out in order to collect cycles. But there are other approaches. In an explicit refcounting system (where increments and decrements are manual), the user could be expected to insert a "weak reference", one which doesn't increase the refcount, whenever completing a cycle. Another way is to perform a small local marking trace when refcounts are decremented but not set to zero, to make sure there isn't an unreferenced cycle. That's described in this recent paper, which also handles concurrency. Here's an optimization on that with a proof of correctness.

Hard real-time GC

So far, I've been talking about minimizing pauses in a vague, general sense. We just want them to be a fraction of how long it takes to do a full tracing collection. But this isn't enough for some applications. Say you're making a video game where a 50ms GC pause (as the best incremental mark-sweep collectors benchmark at, I've heard) means a skipped frame or two. That can be noticeable, and it presents a real disadvantage compared to explicit allocation. Even refcounting doesn't always give really short pauses, since it causes memory fragmentation (making allocation take longer) and deallocation is not scheduled incrementally. That is, if you have a long linked list with just one reference to the head, and that reference ends, then the entire linked list gets deallocated in a chain, with no incremental scheduling.

What this situation needs is a garbage collector follow hard real-time guarantees. One way that this guarantee could be phrased is that pauses are at most 1ms, and that at least 7 out of 10 milliseconds are spent running the main program. This guarantee will be met even if the main program is acting "adversarially", bringing out the worst-case behavior in the collector. It's possible to specify a requirement like this that's unachievable for a particular application, but this requirement works for most things. Different applications can specify different requirements based on, say, their frame rate and how long it takes to render a frame. For applications like this, absolute throughput (which is basically maximized by a well-tuned generational collector) can be sacrificed in favor of better scheduling.

This sounds like an impossible dream, but it's actually been implemented in the Metronome system, implemented for Jikes by IBM. Metronome has been written about in the ACM Queue and there's also a paper which is harder to understand but explains more. The goal of the Metronome project is to allow high-level languages to be used for real-time applications on uniprocessor machines. While Java isn't what I'd choose, the GC seems to be the biggest barrier, and it's great that this research is being done.

The idea is to have an incremental mark-sweep collector (not generational) which segregates the heap into chunks (just for allocation purposes) of roughly the same size data. This minimizes fragmentation. However, fragmentation can still occur, and when a heap segment is too fragmented, it is incrementally copied and compacted to a different piece of memory. Large objects are split up into chunks called arraylets. By all of these techniques, garbage collection can be broken up into small tasks, and an innovative scheduler makes it satisfy the hard real-time guarantees.

Because the collector isn't generational, and because of the overhead of the scheduler and the floating garbage that's left by the incremental collector, this is far from optimal for applications that don't really need things to be so predictably well-spaced. But maybe, if there were more knobs on this algorithm (eg, the scheduler can be turned off, and more generations can be added), this could be a general-purpose GC that's really useful.

GC and language features

In the most basic sense, a garbage collection system consists of one exposed function, allocate, which takes a number of bytes and allocates a region of memory that's that big. But there are some other things that can be useful. For example, for tracing collectors, a collect-garbage function can be used to do a major collection when the program knows it's idle.

Another useful feature is finalizers. For most things, it's sufficient to just deallocate memory in when it's collected. But think about files. You should always explicitly close a file when you're done with it, but if the programmer makes an error, the file should still be closed once it is unreachable. With a reference counting or mark-sweep collector, this is relatively easy: just have a generic function finalize that gets called on everything that's collected. With copying collection, the collector maintains a list of objects that have finalizers, and on each copying cycle, this list is traversed and it is checked whether objects have forwarding pointers in fromspace. If an object with a finalizer doesn't have a forwarding pointer, it has been deleted and the finalizer should be invoked. This avoids a full traversal of the heap.

Actually invoking so simple, because now the object might contain some dead pointers. With a reference counting collector, if you're not collecting a cycle, you can call the finalizers in top-down order (also called topological order), and then the dead pointer issue doesn't exist. But this breaks down in the presence of cycles, and is difficult to calculate with a tracing collector. An easier-to-implement strategy is to call the finalizers in arbitrary order, but call them all before garbage is actually collected. Alternatively, everything the finalizer references can be considered a root. But in this situation, programmers have to be very careful not to retain the objects forever.

This summer, I hope to join this research community in a small way by participating in a Harvey Mudd REU (summer undergraduate research project) in garbage collection. In previous summers, an idea of blobs was developed, a generalization of a concept called ephemerons to make weak hashtables without memory leaks. They wrote and published a paper about it. They also researched and wrote about garbage collection techniques for certain hard-to-collect persistent data structures.

Another project of the leader of this group is called a simplifier, which is something that gets invoked in a copying collector to simplify a datastructure when it gets copied. This is a technique that is used in an ad-hoc way in graph-reduction runtimes in purely functional languages: you don't want to copy the whole graph if there's an easy way to simplify it without allocating any new nodes. It should be really fun to research these more advanced techniques for making garbage collection more correct and efficient.

Where to now for more research?

Academia has been working on this since the '60s. But recently, big companies like Sun, IBM and Microsoft have been doing more here in order to augment their Java and .NET platforms. Some academic resources to look at to learn more about GC are at UT Austin's website (especially bigsurv.ps). There are conferences which discuss memory management, like the International Symposium on Memory Management and SPACE.

When implementing a garbage collector for research purposes, you probably don't want to build a whole programming language runtime yourself. Jikes RVM provides an open-source research virtual machine that you can easily plug in different garbage collectors into. Jikes RVM's MMTk (Memory Manager Toolkit) makes this possible. There are visualization tools, and the heap can be split up into different segments which use different collectors.

These advanced garbage collection algorithms haven't been implemented many times and aren't well-understood by many people. There also hasn't been much work in formally analyzing and comparing these algorithms. This is partly because they're hard to analyze; constant factors have a gigantic effect. Someone came up with a unified theory of garbage collection, though, which analyzes all garbage collection strategies as some combination between marking and reference counting, which can be seen as duals. Just like with differential equations, there's no general solution which meets all of our requirements (maximizing throughput and locality, minimizing and organizing pause times, making allocation cheap) at once, though our understanding is always improving.



* You may be wondering why I keep saying that these posts are short when they're something like four pages long printed out. Mainly, it's because the things that I'm reading are much longer. It'd be kinda hard for me to describe anything meaningful in 500 words or less.