Thursday, July 16, 2009

A simple interprocedural optimization

An important feature of a compiler is that it compiles code fast enough that you don't feel like you're waiting forever. For this reason, most optimizations stop at the boundaries between words.

If one word calls another word, then the callee, in general, doesn't get the benefit of any information collected about what its arguments look like. And the caller doesn't get any information about what might be returned. There are exceptions to this, for specific words that have special optimization behavior in the compiler. For example, class predicates interact specially with the propagation pass to fold to constants whenever the compiler can prove that they'll always evaluate to true or false.

One optimization the compiler works hard to do is eliminating type checks and runtime generic dispatch. It likes to turn virtual method calls into direct jumps, both because this is faster and because it enables further optimizations. Type inference in the SCCP pass is what drives the elimination of dispatch.

But type inference has to stop at procedure boundaries, in general. We can't know all of the possible inputs to a word, since it can be called from anywhere, including the listener. And it would take too much time for callers to trace through every procedure they call to see what they can deduce about the output from what they know about the input.

On the other hand, there would sometimes be a lot of benefit if callers and callees interacted to perform optimizations. It'd be especially helpful for words like append, which benefit the most from type inference. append consists of many generic calls (to length and nth-unsafe), and the dispatch can be eliminated if the types of the inputs are known. Additionally, the type of the output follows from the types of the inputs, so callers could use that knowledge to eliminate further dispatch downstream.
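
To make this concrete, here is a minimal sketch in Python (not Factor's actual implementation; the type names and rules are invented for illustration) of the two facts the propagation pass would like to exploit about an append-like word: known input types let the generic calls inside it be resolved to direct implementations, and they determine the output type.

```python
# Direct implementations that generic dispatch would normally select at
# runtime; knowing the input type lets the compiler pick one statically.
LENGTH_IMPLS = {"array": len, "string": len}

def infer_append_type(type1, type2):
    """Output-type rule (invented for this sketch): appending two sequences
    of the same known type yields that type; otherwise fall back to a
    generic 'sequence' type."""
    if type1 == type2:
        return type1
    return "sequence"

def can_eliminate_dispatch(type1, type2):
    # Dispatch inside append can be resolved iff both input types are known.
    return type1 in LENGTH_IMPLS and type2 in LENGTH_IMPLS

print(infer_append_type("array", "array"))        # array
print(can_eliminate_dispatch("array", "string"))  # True
print(can_eliminate_dispatch("array", None))      # False
```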

Maybe interprocedural analysis is too much work in general, but for something like append, it would be helpful to have versions specialized for several different types, which are used when the types of the inputs are known. I implemented a simple system in Factor where words can be annotated to do this. The code is in the Factor pastebin; this is just a prototype and needs some changes before it's fully ready to use.

With this system, to make the word append automatically create specialized versions based on the types of its two inputs, you can use the declaration

This doesn't immediately compile a ton of different versions of the word. Instead, it compiles them "on demand", whenever the propagation pass finds that append is used with certain types.
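The on-demand mechanism can be sketched in Python (names and structure invented; this stands in for real code generation keyed on input types): a specialized version is "compiled" the first time a word is seen with a given tuple of types, then reused from a cache.

```python
compiled = {}       # (word name, input types) -> specialized version
compile_count = 0   # how many times we actually "compiled" something

def specialize(word, types):
    """Return a version of `word` specialized for `types`, compiling it
    on demand the first time this combination is requested."""
    global compile_count
    key = (word.__name__, types)
    if key not in compiled:
        compile_count += 1                         # compile on demand, once
        compiled[key] = lambda *args: word(*args)  # stand-in for codegen
    return compiled[key]

def append(a, b):
    return a + b

# First use with ("array", "array") triggers compilation; later uses reuse it.
f = specialize(append, ("array", "array"))
g = specialize(append, ("array", "array"))
h = specialize(append, ("string", "string"))
print(f is g, compile_count)   # True 2
print(f([1], [2]))             # [1, 2]
```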

When I applied this to the nbody benchmark, part of the Programming Language Shootout, by annotating certain words in math.vectors, the running time went from around 4.3 seconds to around 4.0 seconds. This is a modest gain, but it's on top of something already highly optimized: there is some code which gives the compiler special knowledge of how to run vector operations on the kind of array used in the benchmark. I hope that this technique can make most of that code go away.

Tuesday, July 14, 2009

Some new compiler optimizations

I've been working on Factor's optimizing compiler, adding a few new simple optimizations. I've made call( and execute( do more inlining, extended dead code elimination, increased the number of cases where overflow checks can be eliminated, and made object instantiation fast in more cases. Here, I'll explain what the optimizations are and how they're implemented.

Inlining with call( and execute(

call( -- ) and execute( -- ) are words which let you call quotations or execute words. Slava explained them in a blog post. They differ from call and execute in that they don't require that the word or quotation be available for combinator inlining. But they do require an explicit stack effect, to ensure that the callee takes and returns the right number of parameters. This is nice, because the versions with a stack effect have an additional safety property: they'll only run the quotation if it has the declared stack effect.
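
That safety property can be modeled in Python (an invented analogue, not Factor code; Python functions return a single value, so only the input side of the effect is checkable here): before invoking an arbitrary callable, verify that its arity matches what the call site supplies.

```python
import inspect

def call_checked(quot, *args):
    """Invoke quot only if its arity matches the arguments supplied,
    analogous to call( checking the quotation's stack effect at runtime."""
    arity = len(inspect.signature(quot).parameters)
    if arity != len(args):
        raise TypeError(f"stack effect mismatch: quotation takes {arity} "
                        f"inputs, call site supplies {len(args)}")
    return quot(*args)

print(call_checked(lambda x, y: x + y, 3, 4))  # 7
try:
    call_checked(lambda x: -x, 3, 4)           # wrong "effect": rejected
except TypeError as e:
    print("rejected:", e)
```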

Until now, these combinators carried a performance penalty over using call or execute with known quotations: the stack effect of the quotation had to be checked at runtime against the stack effect of the callsite. With an optimization I implemented, the performance is the same when the quotation is known. Since both are now completely free in the case where either would work, it should be easier to write code that uses the checked versions.

For example, the following implementation of an absolute value function compiles down to the same code that you'd write with the normal if combinator.

:: iff ( x cond true-quot false-quot -- x' )
    cond
    [ x true-quot call( x -- x' ) ]
    [ x false-quot call( x -- x' ) ] if ; inline

: abs ( n -- abs(n) )
    dup 0 < [ neg ] [ ] iff ;

This is implemented as part of the sparse conditional constant propagation (SCCP) pass in Factor's compiler. call-effect and execute-effect have custom inlining behavior there, which takes advantage of information collected by the propagation pass. If enough is known about the quotation at this point to infer its stack effect and inline the code (that is, if it is a literal quotation, a literal quotation with something curried onto it (literal or not), or two literal quotations composed together), then it is inlined.
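
The propagation rule can be sketched in Python (an invented model, using closures to stand in for quotations): a call site is inlinable when the quotation operand is a literal, a literal with a value curried on, or a composition of literals; anything else stays a dynamic, checked call.

```python
from functools import partial

UNKNOWN = object()  # stands for "not a compile-time literal"

def curry(value, quot):
    # Currying a value (literal or not) onto a known quotation keeps it known.
    return partial(quot, value) if quot is not UNKNOWN else UNKNOWN

def compose(q1, q2):
    # Composing two known quotations yields a known quotation.
    if q1 is UNKNOWN or q2 is UNKNOWN:
        return UNKNOWN
    return lambda x: q2(q1(x))

def inlinable(quot):
    return quot is not UNKNOWN

inc = lambda x: x + 1
add = lambda x, y: x + y

print(inlinable(compose(inc, inc)))      # True
print(inlinable(curry(5, add)))          # True
print(inlinable(compose(inc, UNKNOWN)))  # False
print(compose(inc, inc)(3))              # 5
```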

I could have implemented this as a transform in the stack checker, but this strategy gives a stronger optimization, since it can interact with everything in constant propagation. For example, it interacts with method inlining. This will help improve performance in the Factor Smalltalk implementation, where previously combinator inlining would have been impossible without special support from the Smalltalk-to-Factor translator.

Eliminating overflow checks

In Factor, unlike C and Java, calculations on integers never overflow. Instead, numbers that are too big are converted to a representation that scales to arbitrary size. The smaller integers, which are faster to calculate with, are called "fixnums", and the larger ones, which are slower to use, are called "bignums".

Factor's compiler does a lot of work to convert general arithmetic to arithmetic on fixnums when possible. One thing the compiler does is try to infer that integers will be in a small enough range that no overflow can happen. This is part of the SCCP pass.

Another compiler pass checks whether a value is only used in places where overflow doesn't matter. For example, if the code + >fixnum is run in a context where both arguments are known to be fixnums, then it doesn't matter whether the addition overflows: >fixnum truncates the result anyway, and by the rules of modular arithmetic, truncating a wrapped sum gives the same answer as truncating the exact sum.
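
A quick Python demonstration of that modular-arithmetic fact (illustrative only; FIXNUM_BITS is an assumed width, and Factor's real >fixnum also deals with tagging and signs): addition with no overflow check wraps at the fixnum width, and truncating afterwards matches truncating the exact sum.

```python
FIXNUM_BITS = 60  # assumed fixnum width for this sketch; the real value varies

def wrap_add(a, b):
    """Addition with no overflow check: the result wraps at the fixnum width."""
    return (a + b) & ((1 << FIXNUM_BITS) - 1)

def to_fixnum(n):
    """Simplified model of >fixnum: keep only the low fixnum-width bits."""
    return n & ((1 << FIXNUM_BITS) - 1)

a = (1 << FIXNUM_BITS) - 1   # largest value: the exact sum a + b overflows
b = 12345

# Checked-then-truncated and unchecked-wrapping give the same final answer,
# so the overflow check is unobservable and can be eliminated.
print(to_fixnum(a + b) == to_fixnum(wrap_add(a, b)))  # True
```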

I extended this pass to take into account a couple of other words that have the same property as >fixnum: specifically, the words which set memory at a given pointer address and take integers as arguments, like set-alien-unsigned-1. Depending on the word size of the platform (32 vs. 64 bits), this optimization is valid for a different set of words.

One place this comes up is in v+n when called with a byte array and a fixnum. Adding two fixnums gives back something that can be either a fixnum or a bignum, but the form of addition without an overflow check can be used, since the result is going to be stored back into a byte array. Storing into a byte array is implemented with set-alien-unsigned-1, so the optimization applies.

Inlining for new

The word new instantiates a tuple with default values for all slots, given a tuple class. By default, this is done dynamically: the tuple class does not need to be known until runtime. But usually it is known ahead of time, and in that case the compiler can inline code which instantiates that specific tuple class.

This was previously implemented as part of the stack checker. I moved it to the propagation pass, which makes the optimization stronger. I plan on moving more transforms like this (for example, the one for member?) to the propagation pass.
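
The two paths can be modeled in Python (a loose analogy using classes for tuple classes; the names are invented): a dynamic path that receives the class as a runtime value, and the direct instantiation the pass can substitute when the class is a compile-time literal.

```python
class Point:
    """Stand-in for a tuple class with two slots, both defaulting to 0."""
    def __init__(self):
        self.x = 0
        self.y = 0

def new_dynamic(cls):
    # General path: the class is an ordinary runtime value.
    return cls()

def new_inlined_point():
    # What the propagation pass can expand `Point new` into when the class
    # is a known literal: direct allocation of that class, no runtime lookup.
    return Point()

p = new_dynamic(Point)
q = new_inlined_point()
print(type(p) is type(q), q.x, q.y)  # True 0 0
```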

[Update: I did this, adding a utility called define-partial-eval. Its interface is identical to define-transform, but it operates on the propagation pass IR. Transformations which don't need to interact with stack checking should use define-partial-eval rather than define-transform, since it creates a stronger optimization.]

Better dead code elimination

I extended dead code elimination in the low-level IR to be more accurate. Now, it realizes that an allocation is dead if it is only written to, and never read from. With this, together with alias analysis, the code 2array first2 compiles into a no-op, and no allocation takes place. (This isn't optimized out by tuple unboxing in the high-level IR, because arrays are mutable.)
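
A toy version of this pass in Python (invented IR, and deliberately simplified: it doesn't track values escaping through stores into live objects): an allocation is dead when its result is only stored into and never read, so the allocation and the stores into it can all be dropped.

```python
def eliminate_dead_allocations(code):
    """code: list of (op, dest, operands) tuples. Removes each alloc whose
    result is never read (only stored into), along with the stores into it."""
    read = set()
    for op, dest, operands in code:
        if op == "load":
            read.add(operands[0])        # the object being read from
        elif op == "return":
            read.update(operands)        # returned values escape
    dead = {dest for op, dest, _ in code
            if op == "alloc" and dest not in read}
    return [(op, dest, operands) for op, dest, operands in code
            if not (op == "alloc" and dest in dead)
            and not (op == "store" and operands[0] in dead)]

# Model of `2array first2` after alias analysis has forwarded the loads to
# the stored values: the array is now write-only, so everything disappears.
code = [("alloc", "arr", ()),
        ("store", None, ("arr", "x")),
        ("store", None, ("arr", "y")),
        ("return", None, ("x", "y"))]
print(eliminate_dead_allocations(code))  # [('return', None, ('x', 'y'))]
```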