Saturday, October 10, 2009

Bitfields in Factor structs and the special style

Sometimes, it's useful to pack multiple small numbers into a small amount of memory. Say you have a tuple like
`TUPLE: nums a b c ;`
where you know that `a` is always between 0 and 31, `b` is always between -2 and 1, and `c` is always either 0 or 1. It should be possible to store this in a single byte, since `a` could be represented in 5 bits, `b` can be represented in 2 bits, and `c` can be represented in 1 bit.

One way to implement this is to define words to store in a single fixnum what a `nums` tuple stores. These words would consist of shifts, masks and maybe calls to a sign extension routine. But this code would be boring and annoying to write.

I've automated this process by allowing Factor structs to have bit fields, like C structs can. Here's how the implementation of `nums` would look:
`STRUCT: nums   { a uint bits: 5 }   { b int bits: 2 }   { c uint bits: 1 } ;`
Thanks to Joe Groff's changes on how structs work, this is a class, with prettyprinting and accessors just like tuple classes. Structs were originally a feature of Factor for the foreign function interface, but they're actually useful in pure Factor programs too: they're an efficient way to manipulate binary data.

Bitfields deviate from the FFI origins of structs because they don't follow any standard convention on bitfield layout. In C, unlike for structs, there is no single standard ABI for bitfields, even on a single OS/architecture platform. Different compilers act differently. So why not make up my own convention? I store everything little-endian, and the first bitfield stores in the least significant bits of the byte. Because bitfield access only reads the underlying memory one byte at a time in the current implementation, this is just as efficient on big-endian hardware as little-endian hardware.

Factor supports efficient homogeneous arrays of structs, allowing lots of data to be packed together efficiently. Because I extended structs for bitfields, rather than creating a new type of class, struct arrays can immediately be used with structs that have bitfields. This worked immediately; I didn't have to modify struct arrays.

The actual code for the setters and getters is implemented in a high-level style. There are no type declarations, and all math is done with generic arithmetic calls. Factor's compiler is smart enough to translate this into fixnum arithmetic with no overflow checks in the case that the bitfields are small enough. If you make a field 100 bits wide, it'll use generic integer arithmetic, to take into account possible overflow into bignums. But this will only be used for certain calculations, the ones that could overflow.

The Factor code generated for accessors is naive, not only in how it doesn't declare types, but it also does some things that could be special-cased for more efficient code. For example, in every byte read of the array, a mask is applied to the result with bitwise and, so that the relevant bits can be extracted. Sometimes, that mask will be 255, so it won't actually mask away any bits, and does nothing. Rather than special-casing a mask of 255 and generating code that doesn't call bitand, I extended the compiler to understand the algebraic identity that, in the range of numbers from 0 to 255, `255 bitand` is the identity function. (This works for (2^n)-1, for any n.)

Not all code can be compiled as efficiently for the compiler as the bitfield accessors. There is a special style of code that the compiler can compile more efficiently. As the compiler becomes more advanced, the subset grows. Really, there's a continuum of programs that the compiler can compile more or less efficiently. Originally, the special style consisted of code whose stack effect could be inferred by the compiler, allowing optimizations to take place. Now, all valid Factor code has an inferrable stack effect, but the compiler is advanced enough that it can do further optimizations when more information is available.

Code written in the special style has to have something indicating all of this information about the code. In the case of bitfield accessors, we want the compiler to be able to infer that it can use fixnum arithmetic without overflow checks if the bitfield is small enough that the result won't ever overflow a bignum. The compiler can figure this out in the case of bitfield getters because it is only doing shifts, bitands and bitors on the result of `alien-unsigned-1`, a word to read one byte of memory. The shifts are all constants.

In the sparse conditional constant propagation pass, the compiler tracks the possible interval of values that a virtual register could hold. The compiler understands how bitands, bitors and shifts by a literal relate the interval of their output to the interval or their input. Another pass, the modular arithmetic pass, propagates information backwards about the modulus that earlier calculations could be done in, based on how the data is used later. This also allows overflow checks to be eliminated.

This aspect of the special style allows overflow checks to be eliminated, allowing the use of more direct machine arithmetic. Other aspects of special style allow other efficient code to be generated. On code where types can infer, and a float or SIMD type is used, the compiler can allocate float or SIMD registers, and inline efficient code for arithmetic operations. Code using immutable tuples get the benefit of tuple unboxing, so repeated lookup of slots is cheap, and sometimes allocation can be eliminated. Code that uses mutable datastructures gets unboxed at another point in the compiler, but since it is harder to reason about, unboxing right now has only been implemented within a restricted piece of code. When the class of the input to a generic word is known, the correct method is called directly, and sometimes inlined. There are many other optimizations that have been implemented over the years, too many to describe in this blog post.

Not all code could be written in the special style, and it would be unreasonable to expect most Factor programmers to learn about it. The compiler and the UI framework, for example, would be difficult to translate into a form taking heavy advantage of style. But a lot of the code in the standard library could be written this way. For example, the code for appending sequences is written in a style that allows the inner loop to have no subroutine calls inside of it and register allocation to be performed, when operating on certain types of sequences. There is only one piece of code that implements append, but using the hints feature lets specialized versions be generated for certain types which are often appended together. Append is used in many places, so code that's written in a dynamic style will benefit from this, even if the dynamic style code isn't optimized better.

Special style is an extremely important feature of Factor, and without it, Factor would be as slow as many scripting languages. Without special style, many libraries would have to be implemented in C, as scripting languages do. Because of special style, Factor can be self-hosting and use many libraries written in Factor without an overwhelming performance penalty. Without special style, to implement bitfields as fast as they are right now, it would have been necessary to generate machine code on each platform.

Update: I guess I forgot to tell you exactly what special style is, and how to write code in it. Well, it's complicated. I'll write about it in the future. Special style grows as the compiler becomes more advanced, but I can't describe how it is right now.

Thursday, August 13, 2009

Hoisting write barriers out of loops

In a generational garbage collector, pointers from the old generation to the young generation need to be tracked. In every minor collection, these need to be considered as roots. A small piece of code called a write barrier is run on each pointer write. The write barrier records the object as modified. Each minor collection considers modified objects as potential roots into the youngest generation.

Write barriers don't have to be run on every single write, actually. There are two cases where they don't have to be run:
• If a small enough object has been allocated, and no GC could have been run since the allocation, the it must be in the nursery.
• If a write barrier has been run on an object and the GC hasn't been run after that, then the write barrier does not need to run on further writes to the object.

These things don't work across subroutine calls, since there might be a garbage collection there. They're also invalid across GC checks. But there's still a lot of code that can be improved with these observations.

For example, the word `reverse`, if specialized for arrays, doesn't have any subroutine calls or allocations in its inner loop, after enough compiler passes have run. But a naive code generator would put a write barrier call on each loop iteration. It's enough to just call the write barrier once, outside of the loop, and doing this gives you a 15% speedup.

I implemented this as a compiler optimization on Factor's low-level IR, extending Slava's local write barrier elimination pass, described here. Slava's pass eliminates redundant write barriers within a basic block, based on the two facts I just mentioned. For the local case, Slava's implementation is optimal, but with control flow we can do much better.

Here's the idea: first, insert a call to the write barrier outside of any loop that calls the write barrier on each iteration. Next, delete redundant write barriers using a dataflow analysis. With Factor's new dataflow analysis and loop analysis frameworks, both of these tasks are pretty easy.

Inserting write barriers outside of loops

Slava just implemented loop detection in low-level IR. For each loop, I want to find the set of registers that each iteration will definitely call the write barrier on. Once I find this set, I just place the write barriers right before the start of the loop. Deleting the ones inside the loop comes as the next step.

The output of loop detection is a list of loops, and each loop has a header (the entry point), a list of ends (sources for jumps back to the header) and a list of basic blocks contained in the loop. If a basic block dominates each end, then it must be run on each iteration of the loop. So a conservative approximation of the list of write barriers that must be run on each iteration is the list of write barriers contained in basic blocks that dominate each end of the loop. It turns out this is enough to get all of the meaningful, practical cases like `append` and `reverse`.

We have to be a little bit careful, though. You can't always insert a write barrier outside of a loop, because you can't run the write barrier on something like a fixnum. If you do, the VM might crash. Because type information isn't available in low-level IR, I reconstruct what can have the write barrier called on it by seeing what has had a slot lookup. This is a simple dataflow analysis.

Removing redundant write barriers

Write barriers can be removed with another dataflow analysis. Here, for each basic block, we want to calculate the registers where the write barrier does not need to be called. Once we have this set, we can run the old write barrier removal algorithm.

This is a forward analysis. I call the sets of registers where the write barrier does not need to be called again the "safe" set. Safe-in for a basic block is the intersection of the safe-outs of the predecessors if the current block has no allocation, and it is the empty set if it does have allocation. Safe-out is safe-in plus all registers that have been allocated in the block, and those that have had the write barrier run on them. Factor's dataflow framework handles this perfectly.

Conclusion

Hoisting write barriers out of loops was easier than I expected, just two evenings of work. Unfortunately, it isn't as general as it should be. The type reconstruction by looking at slot calls doesn't work as well as it should, since there is a subroutine call between the call to slot and the loop in the case of `append` and `push-all`. I think the solution to this would be for high-level IR to pass type information down to low-level IR. It would be useful here, and I bet as the low-level optimizer becomes more advanced, it will become useful in other places too. The code implementing the optimization is in the Factor pastebin and should be in the main Factor repository soon.

Thursday, July 16, 2009

A simple interprocedural optimization

An important feature of a compiler is that it compile code fast enough so that you don't feel like you're waiting forever. For this reason, most optimizations stop at the boundary between words.

If one word calls another word, then the callee, in general, doesn't get the benefit of any information collected about what its arguments look like. And the caller doesn't get any information about what might be returned. There are exceptions to this, for specific words that have special optimization behavior in the compiler. For example, class predicates interact specially with the propagation pass to fold to constants whenever the compiler can prove that they'll always evaluate to true or false.

One optimization the compiler works hard to do is eliminating type checks and runtime generic dispatch. It likes to turn virtual method calls into direct jumps, both because this is faster and because it enables further optimizations. Type inference in the SCCP pass is what drives the elimination of dispatch.

But type inference has to stop at procedure boundaries, in general. We can't know all of the possible inputs to a word, since it can be called from anywhere, including the listener. And it would take too much time for callers to trace through every procedure they call to see what they can deduce about the output from what they know about the input.

On the other hand, there would sometimes be a lot of benefit for callers and callees interact to perform optimizations. It'd be especially helpful for things words like `append` which would benefit the most from type inference. `append` consists of many generic calls (to `length` and `nth-unsafe`), and the dispatch can be eliminated if the types of the inputs are known. Additionally, the type of the output follows from the types of the inputs, and

Maybe interprocedural analysis is too much work in general, but for something like `append`, it would be helpful to have versions specialized for several different types, which are used when the type of the inputs is known. I implemented in Factor a simple system where words can be annotated to do this. The code is in the Factor pastebin; this is just a prototype and needs some changes before it's fully read to use.

With this system, to make the word `append` automatically create specialized versions based on the types of its two inputs, you can use the declaration
`SPECIALIZED: append 2`

This doesn't immediately compile a ton of different versions of the word. Instead, it compiles them "on demand", whenever the propagation pass finds that append is used with certain types.

When I applied this to the nbody benchmark, part of the Programming Language Shootout, by making certain words in `math.vectors`, the running time went from around 4.3 seconds to around 4.0 seconds. This is a modest gain, but it's on top of something already highly optimized--there is some code which gives the compiler special knowledge of how to run vector operations on the kind of array used in the benchmark. I hope that this technique can make most of that code go away.

Tuesday, July 14, 2009

Some new compiler optimizations

I've been working on Factor's optimizing compiler, adding a few new simple optimizations. I've made `call(` and `execute(` do more inlining, extended dead code elimination, increased the number of cases where overflow checks can be eliminated, and made object instantiation fast in more cases. Here, I'll explain what the optimizations are and how they're implemented.

Inlining with `call(` and `execute(`

`call( -- )` and `execute( -- )` are words which let you call quotations or execute words. Slava explained them in a blog post. They differ from `call` and `execute` in that they don't require that the word or quotation is available through combinator inlining. But they require an explicit stack effect to be given, to ensure that it takes and returns the right number of parameters. This is nice, because the versions with a stack effect have an additional safety property: they'll only run with if the code has the right stack effect.

Until now, these combinators carried with them a perfomance penalty over using `call` or `execute` with known quotations. The penalty is that the stack effect of the quotation must be checked, at runtime, to match the stack effect of the callsite. With an optimization I implemented, the performance is the same when the quotation is known. With matching performance in this case (they're both completely free in the case where either would work), it should be easier to write code that uses the checked versions.

For example, the following implementation of an absolute value function compiles down to the same code that you'd write with the normal `if` combinator.
`:: iff ( x cond true-quot false-quot -- x' )    cond    [ x true-quot call( x -- x' ) ]    [ x false-quot call( x -- x' ) ] if ; inline: abs ( n -- abs(n) )    dup 0 < [ neg ] [ ] iff ;`

This is implemented as part of the sparse conditional constant propagation (SCCP) pass in Factor's compiler. `call-effect` and `execute-effect` have custom inlining behavior there, which takes advantage of information collected by the propagation pass. If enough is known about the quotation at this time (if it is a literal quotation, or a literal quotation with something (literal or non-literal) curried on to it, or two literal quotations composed together) so that the stack effect can be inferred and the code can be inlined, then it is inlined.

I could have implemented this as a transform in the stack checker, but this strategy gives a stronger optimization, since it can interact with everything in constant propagation. For example, it interacts with method inlining. This will help improve performance in the Factor Smalltalk implementation, where previously combinator inlining would have been impossible without special support from the Smalltalk-to-Factor translator.

Eliminating overflow checks

In Factor, unlike C and Java, calculations on integers never overflow. Instead, numbers that are too big are converted to a representation that scales to arbitrary size. The smaller integers which are faster to calculate on are called "fixnums" and the larger ones, which are slower to use, are called "bignums"

Factor's compiler does a lot of work to convert general arithmetic to arithmetic on fixnums when possible. One thing the compiler does is try to infer that integers will be in a small enough range that no overflow can happen. This is part of the SCCP pass.

Another compiler pass checks if a value is only used in places where overflowing doesn't matter. For example, if the code `+ >fixnum` is run in a context where it's known that its arguments will both be fixnums, then it doesn't matter if an overflow check is done because of how modular arithmetic works.

I extended this pass to take into account a couple other words that have this property that `>fixnum` has. Specifically, the words which set memory at a given pointer address, like `set-alien-unsigned-1` and take integers as arguments. Depending on the word size of the platform (32 vs 64 bits), this optimization is valid for a different set of words.

One time that this comes up is in `v+n` when called with a byte array and a fixnum. Adding two fixnums gives back something that can be either a fixnum or a bignum, but the form of addition without an overflow check can be used since the result is going to be stored back into a byte array. Storing into a byte array is implemented with `set-alien-unsigned-1`, so the optimization applies.

Inlining for `new`

The word `new` instantiates a tuple with default values for all slots, given a tuple class. By default, this is done dynamically: the tuple class does not need to be known before runtime. But usually it is known ahead of time. If it is known ahead of time, then code can be inlined which instantiates the specific tuple that is there.

This was previously implemented as part of the stack checker. I moved it to the propagation pass, which makes the optimization stronger. I plan on moving more transforms like this (for example, the one for `member?`) to the propagation pass.

[Update: I did this, adding a utility called `define-partial-eval`. Its interface is identical to `define-transform`, but it operates on the propagation pass IR. Transformations which don't need to interact with stack checking should use `define-partial-eval` rather than `define-transform`, since it creates a stronger optimization.]

Better dead code elimination

I extended dead code elimination in the low-level IR to be more accurate. Now, it realizes that an allocation is dead if it is only written to, and never read from. With this, together with alias analysis, the code `2array first2` compiles into a no-op, and no allocation takes place. (This isn't optimized out by tuple unboxing in the high-level IR, because arrays are mutable.)

Tuesday, March 17, 2009

The implementation of Factor's regexp library

I've been working on Factor's regular expression library, initially written by Doug Coleman, for the past few weeks. Recently, the library became good enough that I've pushed it to Factor's main repository. The latest Factor binaries have this new library.

The library uses an standard algorithm of converting a regular expression into an NFA, and that into a DFA which can be executed. This is a tradeoff: the code generated will be faster than you would get from a backtracking search or an NFA interpreter, but it takes exponential time, in the worst case, to generate the DFA. I might revisit this later.

The main features missing now are
• Possessive and reluctant matching
• Group capture
• Unicode support, in the form of UTS 18 level 1 compliance with some level 2 features

Right now, I'm working on Unicode support. Regexps already use Unicode strings, because all strings in Factor represent a sequence of Unicode code points, but not many Unicode properties are exposed now. I plan on working on this, and implementing more Unicode algorithms and properties, to reach level 1 compliance.

The rest of this article is an overview of how the regexp engine works. It is implemented as a series of passes, where the first pass takes a string as input, and the last pass outputs a word which runs the code of the regexp. In this way, it is rather like any other compiler, where the parse tree, DFA table and NFA table are just intermediate representations.

The parser

The parser is implemented with Chris Double's packrat parsing library. This makes it not very efficient, but the time spent in parsing is much less than the time in later processing stages, so the cost isn't very large. Things like `/a{2,4}/` are expanded into the equivalent, but simpler, form `/aa|aaa|aaaa/`.

If I were working only with ASCII, then ranges would be expanded into disjunctions as well, but the alphabet is far too big for that. Instead, something like `/[ac-z]/` is represented in the syntax tree as a item, a character class object, representing the information that it matches the character a, or something in the class which is the range c-z. For a character class like `/[^\dc]/`, an object is created which represents a character which is not a digit or c.

Constructing an NFA

From the syntax tree, a nondeterministic finite-state automaton is built. The algorithm is described here, and there is nothing special about the implementation.

Lookahead, lookbehind and anchors (like \$ and ^) expand to entries in the syntax tree called tagged epsilons. When these are encountered in building the NFA, an epsilon transition is created which is annotated with this information.

Negation is implemented here. If a negation syntax node is encountered, then the NFA builder constructs an NFA for the enclosed term, disambiguates it, converts it to a DFA, minimizes it, and attaches it back to the larger NFA that is being constructed.

Disambiguation

As I previously described, since the implementation doesn't iterate over every element of the alphabet, there needs to be a procedure to make transitions over characters have disjoint labels. Transitions are labeled by sets, and the output from creating an NFA might have intersecting outgoing sets from a transition.

The best way I've thought of doing this is to get all of the intersections of all of the edge labels, basically forming a Venn diagram. This is, unfortunately, exponential time and space to do. But I see no way of avoiding it when compiling a regular expression like `/\p{letter}a|[0-9a-z]b|\{script=latin}c|`...`/` where there are a large number of incomparable character classes used.

I implemented a small optimization for this: numbers (ie literal characters) are set aside at the beginning and treated specially, so work isn't wasted intersecting them with other classes. The complexity of the algorithm stays exponential, but instead of being exponential in the total number of character classes in the regexp, it becomes exponential in just the non-literal classes.

Constructing a DFA

This is also a standard algorithm. My only modification is to support the tagged epsilon transitions created by lookaround and anchors. I described the modification in a previous blog post.

Minimization

Next, the resulting DFA is minimized. I wrote about regexp minimization before. The algorithm had to be modified slightly to allow for the conditional transitions introduced by processing lookaround in the previous step.

Compilation to Factor

At this point, we have a nice minimal DFA with disjoint outward transitions. Translating it into Factor code is actually quite easy. For each state, we make a gensym. The gensym takes as arguments a string and an index. If the index is at the end of the string, the word returns a boolean, indicating whether the current state is an accepting state. If the index is not at the end of the string, the current character is found, and the word figures out which transition to take. A transition is taken by incrementing the index and then making a tail call to another state word.

The strategy for finding the right transition is somewhat complicated. First, the literal transitions (over constant characters) are partitioned out from the non-literal transitions. The literal transitions are formed into a case statement, where the default case handles non-literal transitions.

Non-literal transitions are all boolean expressions, built with the class algebra described below. They have a number of logic variables (classes of characters). So we can build a truth table over the logic variables, and test each condition exactly once to figure out which transition to take. For example, in the regexp `/\p{script=latin}b|\p{lower}c/`, the DFA will have three transitions from the start: one over characters which are Latin script and lower-cased, one over characters which are lower-cased but not in Latin script, and one over characters which are in Latin script but not lower-cased. Rather than having the compiled DFA check if the character is in the composite classes directly (which would duplicate cost, since it would be looked up multiple times if a character is lower-cased or Latin script), the compiler builds nested if statements that figure out the composite class while testing each property only once. This leads directly to the transition.

Class algebra

There is a system for simplifying the intersections built in disambiguation, as well as character class literals. It is built off simplifying logical expressions built with and, or, not. The things contained are true (the whole set of characters), false (the empty set), and all possible character classes.

There are constructors for these three logical operations, and then a few simple tactics are used to reduce them. Reducing the expression to simplest form is equivalent to circuit minimization. A friend told me that this is on the second level of the polynomial hierarchy. So I'll just live with the heuristics.

The not constructor is simple. If it's given true, it outputs false. False to true. If it's given a negation as input, it returns the contents. If it's given an and class, it uses De Morgan's law and negates each entry, returning an or. And vice versa.

The and/or constructors are slightly more complicated. I will describe how the and constructor works; the or constructor can be easily derived using De Morgan's law. The input is a sequence of classes, and we want to get their intersection. First, if the input contains intersection (and) classes, these are flattened into the larger sequence. Next, the sequence is sorted into categories: integers, negations of integers, simple classes (like the class digits), negations of those, union classes (ors), and booleans. Delete true from the booleans list, if it's there, as it cannot affect the outcome. If there is a false in the booleans list, then the answer is false. If there is more than one integer, the answer is immediately false. If there is exactly one integer, then the answer is that integer if it is contained in all of the other classes, otherwise false. Now, we are working with a sequence which does not have integer literals, or true or false. If there is a simple class and a not-simple class for the same class, we know that their intersection is false, so the entire expression is false. We can remove not-integers where the integer is contained in an existing not-simple class, as these are redundant. Finally, the or classes within the and class can be simplified in the case where they have logic variables overlapping with other things in the and class: these can all be substituted with true. For example, if you have and(lowercase, or(lowercase, latin)), this can be simplified to and(lowercase, latin). This is because true is substituted for lowercase in the or expression, and or(true, latin) simplifies to latin.

Previously, class algebra was not this strong in what reductions it did. This caused problems. For example, nested negations (useful in implementing conjunction) would result in multiple nested disambiguations, which would cause a very fast blowup in the size of the DFA. Now, running disambiguation twice gives the same results as running it once. (In other words, disambiguation is idempotent.) At least I think so.

Conclusion

Implementing regular expressions took a lot more thinking than I expected, but I'm satisfied with the result so far. Unlike traditional regular expressions, where pathologies might make regexp matching slow, in this system pathologies make regexp compilation time slow. That seems more acceptable to me. Factor's extensible syntax allows me to make regexp literals, which compile before the program runs, even though regexps are a library rather than built in.

If I were starting from scratch, I might instead use the algorithm of constructing a DFA directly from a regular expression using regexp derivatives. It's described in this paper [PDF]. I'm not sure how performance in practice compares, but it implements negation and conjunction in a much cleaner way, and basically eliminates the need for minimization. Most importantly, it allows for a variation of disambiguation which avoids exponential behavior in many cases.

In a later blog post, I'll talk about the API that I expose for regexps.

I haven't actually implemented `\p{script=foo}` yet, but I will very soon.

Sunday, March 8, 2009

Text encodings and API design

This blog post is about a problem that I haven't figured out the answer to: How should vendor extensions to encodings be exposed to programmers?

It seems like basically all pre-Unicode text encodings have proprietary Microsoft extensions that become, in practice, the next version of the standard. Sometimes these are basically supersets, but sometimes the extensions are backwards-incompatible in particular tiny places. One example of this that should be familiar to Westerners is the distinction between Latin 1 (ISO 8859-1) and Windows 1252. The only difference is in the range 0x80-0x9F, where Latin 1 has a set of basically useless control characters.

A similar situation exists with many East Asian character sets (eg. KS X 1001/1003 for Korean, JIS X 208 for Japanese). In these cases, the backslash (\ 0x5C) is mapped to a national currency symbol in the official standard. But in the Microsoft extension, 0x5C is mapped to backslash to maintain compatibility with ASCII, and another character represents the national currency symbol.

Some websites mark themselves as ISO 8859-1, but are in fact encoded in Windows 1252, and many web browsers take this interpretation into account. Similarly, the Microsoft versions of East Asian text encodings are often used in contexts where the standard versions are declared. In some cases, the Microsoft versions are registered with IANA separately for use in internet protocols (eg Shift-JIS/Windows_31J, and Latin1/Windows-1252), but in other cases there is only one registered encoding (eg EUC-KR).

So, there are two questions.
1. When interpreting an HTTP response, or receiving an email, where the encoding is declared in the header, should the text be interpreted with the standard encoding or the Microsoft extension?
2. In the encodings API for a programming language, if a particular encoding is used to read or write a file, should the standard be used or the Microsoft extension?

When I started reading about encodings, I assumed that everything could be done reasonably by following the standards precisely. Now I'm not sure what to do.

Thursday, March 5, 2009

Naive lookaround in a DFA

Note: If you don't understand what's below, read this for background on implementing regular expressions with DFAs.

I just implemented lookahead and lookbehind in the Factor regexp module. This lets you write regular expressions like `/(?<=foo)bar/`, to match "bar" when it's preceded by "foo". I couldn't think of a better way to do it for the general case, so I just did it the naive way. There should be a better way, though, because lookahead and lookbehind don't extend the set of possible strings matched. This algorithm isn't so good because it makes things worse than linear time in the general case.

For a lookahead or lookbehind clause, there is a little regular expression compiled. This regular expression annotates an epsilon transition. If I had an NFA interpreter, then the interpreter would just run the regular expression on the input string starting at the current point when it wants to know if it can use the epsilon transition. I'm using a DFA, so I needed to modify the subset construction to make this work.

What needs to be modified is the procedure to get the epsilon closure of an NFA state. Instead of returning a set of NFA states, this procedure should return a set of states and the conditions for reaching each state. Usually, there will be no conditions, but sometimes there will be a conjunction or disjunction of lookarounds. It could be the conjunction in regular expression like `/(?<=foo)(?=...bar)baz/`, and it could be the disjunction in a regular expression like `/((?<=foo)|(?=...bar))bar/`, and since I'm trying to be fully general, all of this is supported.

The epsilon closure procedure is usually something like this*. Start with your state on a list, and look at all of the epsilon transitions outwards. Add these to your list. Now, if anything on your list has more epsilon transitions outwards, add these to your list. Keep going until there's nothing to add.

With the modification: Start with your state on a list, with the associated information that there are no conditions that have to be met for it. Then, look at all of the outward epsilon transitions from all the states on your list. For each new transition that you're considering, the requirement to get from the original state to the target of the transition is to meet the requirement of the source, plus the requirement of the transition. If you already had a way to get to the target of the transition, then now there are two conditions, and either one can be used. Keep repeating this until examining epsilon transitions doesn't change anything.

Now, a little post-processing can turn the result of this into a bunch of nested conditionals, which can quickly tell you what states you can get to given which conditions are met. In the usual case, where there are no conditions, this tree is just a single leaf, a list of states. From here, transitions in the DFA go not to sets of states but to trees of conditions and states. The start state is also one of these trees.

The DFA interpreter** needs a little bit of extra logic to test the conditions and traverse the tree. The interpreter gives the condition the input string and index, and the condition can do whatever it wants with that information. I've implemented anchors (eg. \$ and ^) this way. They could be lookaround, but it's easier to just implement them directly as a predicate on the string and index.

*No, I don't implement it this way, because it would be a lot of repeated work, but this expresses the idea.
** Actually, everything compiles down to Factor code, which is compiled with the optimizing compiler if you want.

Tuesday, February 24, 2009

Draft paper about inverse

I recently submitted an abstract about inverse to an undergraduate computer science conference, and I just got word that it was accepted. I've written a draft paper, available on factorcode.org, and I'd like to hear your comments on it. I've written it for a completely general audience, not assuming knowledge of functional programming or stack-based programming languages. I didn't get into formal semantics or anything like that, partly because it'd make the paper too long and partly because I don't completely understand the point.

Speaking of undergraduate research, if you're interested in doing research in practical computer science this summer, I recommend applying to Harvey Mudd's CS REU. Applications are due Sunday. But only if you're a US citizen or permanent resident! The funding agencies don't like foreigners. (This is actually a serious problem for several of my friends, who have difficulty finding the same summer academic and internship opportunities because they are actively discriminated against as international students. I hope that, one day, discrimination based on citizenship is as illegal as discrimination against racial minorities or women, but we are very far away from that today. Now I'm getting off-topic.)

Monday, February 23, 2009

Regular languages with large alphabets

In implementing regular expressions, the formal theory of regular languages can be useful, especially when compiling regexps to DFAs. But the standard definition of a regular language is a little inconvenient when working with Unicode. In a standard DFA, a transition is over a single character. But for the range `/[\0-\u0f423f]/`, I don't really want to compile a DFA with a million transition edges!

A slightly modified formalism is useful here, where transitions in NFAs and DFAs happen over sets of letters, rather than individual letters. Then, things like ranges and character classes (eg `\p{Lower}`) are represented as sets which annotate transition edges.

It's perfectly straightforward to translate such a regular expression into an NFA with the standard construction, and the typical algorithm for executing an NFA by running all possible states in parallel works. For DFAs, though, there's a small complication: we have to remove ambiguity about where to go next.

Say you have the regexp `/\p{InLatin}b|\p{Lower}c/` This matches strings like "ab", "ac", "Ab", "Ï€c" but not "Ac" or "Ï€b". The simple textbook algorithm for regular expressions would have me expand `/\p{InLatin}/` out to `/a|A|b|B|.../`, and expand `/\p{Lower}/` to `/a|Ï€|.../`. This strategy would work, but the size of the resulting NFA and DFA would be gigantic.

What we actually want to do is change the two outward transitions from the start state--transitions over the sets `InLatin` and `Lower`--to transitions over `InLatin ∩ Lower`, `InLatin - Lower` and `Lower - InLatin`. Since these are disjoint, there's no ambiguity about which one to take. In general, for a state with n outward transitions, you have to look at 2n possibilities, since for each subset, you have to make a transition for characters which are in each of those transition groups, but not in any of the others.

Implemented naively, this would make the size of DFAs blow up. For the regexp `/ab|cd/`, you'd have a transition on characters that are a but not c, characters that are c but not a, and characters that are both c and a. Fortunately, it's simple to work out a system which recognizes that a - c = a, c - a = c and c ∩ a = ∅. With this in place, the resulting DFA (which didn't have any ambiguity in the first place) is just like what it would be without the system, but the DFA for `ab|\p{Lower}c` has two transitions from the start state: one over a, and one over lower cased letters that aren't a.

I've implemented all of this in the Factor regexp library. If you want to see the code, it's in the main repository, in the `regexp` branch.

PS. When you're considering transitions over sets, it's possible to consider "regular" languages over an infinite alphabet. It might be convenient to think of Unicode as infinite, since it's so large. But it's easy to prove that for any such "regular" language, there is a string homomorphism to a finite-alphabet regular language where a string is in the original language if and only if its homomorphic image is in the smaller regular language. So, in this way, it's a mathematically boring formalism to study. Other people have studied regular language formalisms with infinite alphabets that actually do have more power--they have the power to compare characters for equality in certain contexts. But that's completely different.

Wednesday, February 18, 2009

DFA minimization

I've been working on the `regexp` vocabulary by Doug Coleman, cleaning it up and adding new features with the goal of making it usable for the `xmode` syntax highlighter. This means compatibility with most of Java's regexp syntax.

The implementation strategy Doug and I are using is standard textbook stuff, the way Lex works: from the regular expression, a nondeterministic finite automaton (NFA) is constructed, and this is converted to a deterministic finite automaton (DFA). This can be used to efficiently match the string in linear time. Russ Cox wrote a good introduction to all of this.

There's a limitation: backreferences (like `/(.*)\1/`) are not supported, since they're incompatible with the NFA/DFA model. But there's no good way to implement backreferences, anyway: parsing with them is NP-complete. Perl uses a backtracking model where backreferences are easy to implement, but in certain cases the backtracking gets out of hand and performance is worse than linear.

Today, I worked out the code to minimize DFAs. The DFA minimization algorithm is really nice, so I thought I would share it with you. The implementation is just 65 lines of Factor code, which is in the Factor pastebin.

The issue is that sometimes, naively constructing a DFA gives you duplicate states which are really the same thing. For example, if you have two final states which each have no outward transitions, they can be consolidated. If you have the regular expression `/ac|bc/`, then there's a DFA for this in just 3 states. The naive way, however, would give you 5 states.

What we want is to partition the states into sets of states that are all the same, and then use only one of these. In mathematical language, we want to create an equivalence relation and quotient out by it. Here's how we figure out what states are the same.
1. Assume that all states are the same*.
2. Separate the final states from the non-final states.
3. Repeat the following until it doesn't make a change in the partitioning:
• Separate any two states which have a transition with the same label to two states which are separated.

Once this doesn't change anything, the states have been divided into the sets that we want.

Interestingly, this algorithm is pretty similar to optimistic global value numbering, a compiler optimization. Optimistic value numbering is a technique for eliminating duplicated computations. It works by assuming that all registers hold the same value, and there is an iteration until fixpoint tries to separate registers which are actually different from each other. When faced with loops, this can catch more than so-called pessimistic value numbering, which first assumes that everything is different, until it can prove that two registers hold the same value.

In the simple, controlled environment of a DFA, it's been proved that this minimization actually produces the smallest possible DFA matching the same set of strings. It's even been shown that, for a given regular language, the minimal DFA recognizing it is unique up to isomorphism. Such a nice analysis isn't possible in the more complicated world of compiler optimizations, however, where better optimizations than GVN exist.

*Actually, we assume that any two states with the same labeled outward transitions are the same. For example, if a state A goes to state B on character x and state C on character y, and state D goes to E on either x or y, then we'll assume at the beginning that A and E are the same. This is a simple modification I came up with to deal with the effectively infinite alphabet of Unicode, since it would be impossible to compare the transition on each Unicode code point.

Thursday, February 5, 2009

XML pattern matching

I've implemented syntax for pattern matching on interpolated XML literals. This is a Scala-inspired feature which may or may not be useful, but is definitely cool-looking. Here's a sample of code:
`: dispatch ( xml -- string )    {        { [ [XML <a><-></a> XML] ] [ "a" prepend ] }        { [ [XML <b><-></b> XML] ] [ "b" prepend ] }        { [ [XML <b val='yes'/> XML] ] [ "yes" ] }        { [ [XML <b val=<->/> XML] ] [ "no" prepend ] }    } switch ;`

And here's some examples of what it does:
`[XML <a>pple</a> XML] dispatch         => "apple"[XML <b>anana</b> XML] dispatch        => "banana"[XML <b val="yes"/> XML] dispatch        => "yes"[XML <b val="where"/> XML] dispatch        => "nowhere"`

The pattern matching here is based on my `inverse` library. Hopefully you get the high-level idea of how XML pattern matching works. A caveat is that namespaces are ignored, as in Scala's XML pattern matching, and I haven't thought of a good way to incorporate namespaces.

Jeff Attwood recently wrote a blog post about XML literal syntax*, and how it's a big help. In part, I agree with his statement that it lets you write cleaner code to have XML literals mixed in with everything. That's why I implemented them in Factor.

But I think it's insane to have them built in the way that Scala and, apparently, VB.NET have. Just think if the designers of Java had this idea 15 years ago. They would have made SGML literal syntax rather than XML, and there would be no way to remove this for backwards compatibility concerns. Every implementation of Java would have to contain a full-featured SGML parser. Generating HTML would be easier and allow for very pretty code, though.

XML won't be so prominent forever. It'll always be there, of course, just like SGML is still there in non-XML HTML and DocBook files which refuse to die. But I wouldn't be surprised if, in 20 years, what we want is a literal syntax for something other than XML. We'll still be using a lot of the same programming languages as were invented today, though.

Factor's XML literal syntax and XML pattern matching are just libraries, and I wouldn't want it any other way. Factor's parser needed no adjustments at all to allow for XML literals. If an XML literal is written when the vocabulary `xml.literals` isn't in scope, it's a syntax error. Sure, you need delimiters `[XML XML]` around the XML literals, but this is a small price to pay.

In 20 years, if XML isn't useful any longer, then Factor programmers will be just as well off as if they are. If there's a new data format, it'll just be a new library to download, and you'll have literal syntax for that data format. If you were using VB.NET, though, you'd have to upgrade to the most recent version of the language, with a parser that's been vastly complicated.

*He actually talked about HTML literal syntax, but that's such a ridiculous idea that I know he just mistyped. There's no way to tell where an HTML fragment ends, for one, and the HTML DTD would have to be part of the core grammar of the language. You would need delimiters around where the HTML starts and ends, and none of his code fragments have those delimiters. The X key is right next to the letters H and T, so it's an honest typo. His pseudocode sample just before the VB.NET fragment must have also been a simple mistake, as it would seem to imply that either XML literals get printed immediately or that there is some implicit way of collecting them.

Sunday, January 25, 2009

Factor supports XML literal syntax

Factor can now express XML as literals in code. There's a new library, `xml.interpolate`, which lets you create an XML chunk or document using interpolating either by locals or using a fry-like syntax. Here's a taste, from the `syndication` vocab, to create an Atom file:
`: feed>xml ( feed -- xml )    [ title>> ]    [ url>> present ]    [ entries>> [ entry>xml ] map ] tri    <XML        <feed xmlns="http://www.w3.org/2005/Atom">            <title><-></title>            <link href=<-> />            <->        </feed>     XML> ;`
This could also be written with locals:
`:: feed>xml ( feed -- xml )    feed title>> :> title    feed url>> present :> url    feed entries>> [ entry>xml ] map :> entries    <XML        <feed xmlns="http://www.w3.org/2005/Atom">            <title><-title-></title>            <link href=<-url-> />            <-entries->         </feed>    XML> ;`
Here's an example with more complicated logic:
`"one two three" " " split[ [XML <item><-></item> XML] ] map<XML <doc><-></doc> XML>`
whose prettyprinted output (using `xml.writer:pprint-xml`) is
`<?xml version="1.0" encoding="UTF-8"?><doc>  <item>    one  </item>  <item>    two  </item>  <item>    three  </item></doc>`
The word `<XML` starts a literal XML document, and `[XML` starts a literal XML chunk. (A document has a prolog and exactly one tag, whereas a chunk can be any kind of snippet, as long as tags are balanced.) The syntax for splicing things in is using a tag like `<-foo->`. This syntax is a strict superset of XML, as a tag name in XML 1.0 is not allowed to start with `-`.

It took me just an evening to hack this up. It's less than 15 lines of code modifying the XML parser, and all the interpolation is less than 75 lines of code. Best of all, unlike Scala's XML literals, this doesn't affect the core of the language or make anything else more complicated. It's just nicely tucked away in its own little vocabulary. I plan on replacing usages of `html.components` with this, once I make some tweaks.

Saturday, January 17, 2009

Quote of the day

It should be obvious that while garbage collection (or compaction) is going on, the execution of the user's program must be suspended. Hence out of common courtesy to the user, we must provide for some notification--perhaps a flashing message in the corner of the screen--that garbage collection is in progress. Otherwise when the program stops running, the user may suspect a bug, and an inexperienced user may panic.

--Introduction to Compiler Construction by Thomas Parsons, Chapter 8.2, "Runtime memory management"
(The book spends about half a page on garbage collection, mostly to describe the mark-sweep algorithm.)

With Lisp machines, it used to be that people would talk about things like running the garbage collector overnight, when the machine was not in use. Virtual memory in use was typically several times the size of the physical memory. Within its context, the quote is accurate (or at least 10 years earlier it was): running mark-sweep on mostly virtual memory with a slow disc will take a lot of time.

I'm happy that this concern is very outdated. If it weren't outdated, Factor (or Java) could never be a reasonable choice of programming language. There are two main reasons for the improvement.
1. Memory size has increased. Programs that would use tons of swap space 15 years ago now fit comfortably on the RAM.
2. Algorithms have improved. Not only do we have generational garbage collection to improve throughput, but there are also incremental techniques to spread out the pause, and various methods to take advantage of multiple processors.

So now, the idea of a flashing message for garbage collection in progress is hilariously quaint. If I were going to write a compiler textbook, I'd spend a whole chapter on garbage collection. It is essential for any modern programming language, and a good algorithm is essential.

Thursday, January 15, 2009

XML encoding auto-detection

The Factor XML parser now auto-detects the encodings of XML documents. This is implemented for all of the encodings that are implemented in Factor. To see how it's implemented, look at the XML standard, because it explains it much better than my blog post, which was below.

I was mystified myself when I first read that XML documents can specify what encoding they are, in the document itself. The encoding, if it's not UTF-8, must specified in the prolog like this:
`<?xml version="1.0" encoding="ISO-8859-1"?>`

The idea of the algorithm is simple. It just goes by cases. First, check for a byte order mark (BOM), which would indicate UTF-16 and a particular endianness, or UTF-8. If there's no BOM, then the first character must be < or whitespace. If it's <, we can differentiate between UTF-16BE, UTF-16LE (without BOMs) and an 8-bit encoding. If it's one of the first two, we can tell by the fact that there's a null byte before or after the <. If it's an 8-bit encoding, we can be sure that there won't be any non-ASCII in the prolog, so just read the prolog as if it's UTF-8, and if an encoding is declared, use that.

To implement it, I just read byte by byte and have a case statement for each level. After two just octets, it's possible to differentiate between UTF-8, UTF-16 (with a BOM, for both endiannesses), UTF-16BE and UTF-16LE. A similar process could also identify UTF-32 and friends after 4 octets. In my implementation, I had to do a little bit of hacking inside the XML code itself to get this integrated properly. All together, it's about 40 or 50 lines of code. It's available now in the Factor git repository.

[Update: Thanks for pointing out my error, Subbu Allamaraju. Fixed a typo, see comments.]

Thursday, January 8, 2009

I'm back

Hi again! I've been gone for the past few months because in the fall I was away in Budapest studying math, and it was really hard. Now, I've decided to take three months off from school to work on Factor full-time. My plan is first work on finishing up Unicode and XML, and then try to improve Factor's garbage collector.

For Unicode, over the past couple days, I fixed bugs in normalization and grapheme breaking, and I implemented word breaking. This was all made a lot easier by the test suites that Unicode includes, and these are now used as unit tests. I also wrote docs for everything. I still need to optimize everything and clean up the code, though. There are a lot more algorithms that I could implement, and plan on implementing eventually, but I'm just going to do this for now.

For XML, I plan on hooking up the XML conformance test suite and fixing any conformance issues that come up, for starters. I've been terribly informal about testing in the past, and I'll try to change this. Instead of optimizing the current XML code directly, I plan on replacing it with a generated lexer and parser: I'll try to make a system like lex/yacc in Factor. Previously, Chris Double has made some pretty cool stuff for parsing with parser combinators and parsing expression grammars, but these both have efficiency problems. I know that a traditional system like lex/yacc can solve the problem of parsing XML, and while it's not as flexible as PEGs, it still might be possible to add high-level OMeta-like features. Parsing is something that I don't know a lot about, so I'll be doing some reading for this.

For garbage collection, I want to actually implement something after studying it for so long! I plan on first making Factor's GC generational mark-sweep, and investigating changes to the generational policy to reduce thrashing. Then, there are two things that I want to do: make it incremental, and make some kind of compaction. Maybe this will take the form of mark-copy, or maybe it'll just be an incremental mark-sweep collector with occasional stop-the-world compaction (which could be disabled for certain applications). Either way, expect the footprint of Factor programs in the future to be reduced by almost half, and hopefully GC pauses will be shorter most of the time.