Sunday, April 25, 2010

Guarded method inlining for Factor

As a compiler optimization to make code faster, Factor's compiler tries to eliminate calls to generic words, replacing them by direct calls to the appropriate method. If the method is declared inline, then its definition will also be inlined. Together, these two optimizations are informally called 'method inlining'. Method inlining is essential for making object-oriented code in Factor fast. And since basic operations like accessing elements of a sequence or adding two numbers are implemented by method calls, this is needed for any Factor code to be fast.

How things worked

Here's how the current algorithm works. Method inlining takes place within Factor's sparse conditional constant propagation pass. SCCP infers upper bounds for the classes of values that the code manipulates. When SCCP processes a generic word call, it examines the class of the receiver as well as the list of classes that have methods on the generic word. If SCCP can tell that a particular method will always be called, it can select that method. Below is some pseudocode for how it detects that.

For each method on the generic word:
If the class for that method intersects the receiver class:
If the class for that method is a superclass of the receiver class:
Put it on a list
Else:
We don't know whether this method will be called at runtime
or not, so bail out and fail to inline a method
Inline the method for the smallest class on the list

There's an additional complication. It might be that a word is compiled, with the method inlining optimization applied, but then after this, more vocabs get loaded that add additional methods to the generic word. This might invalidate the correctness of the method inlining, and the word should get recompiled to fix this problem. Factor uses a simple system to track these dependencies, in stack-checker.dependencies.

My new addition

A few days ago, another Factor developer told me about a hack he'd added to SCCP to make a particular benchmark faster. The hack was, if >fixnum is called on an object that SCCP knows is either a fixnum or f, then the >fixnum call is replaced with dup [ \ >fixnum no-method ] unless. This works because >fixnum doesn't have any methods on f, and >fixnum on fixnums is a no-op.

My immediate instinct here was to generalize this solution. The first step is to convert that code into dup fixnum? [ M\ fixnum >fixnum ] [ \ >fixnum no-method ] if, which we can do since >fixnum doesn't have any other methods on the union of f and fixnum. Unlike the kind of method inlining described earlier, this requires the insertion of a guard. Later code will know (through SCCP) that the object is a fixnum, even after the conditional exits, since it can tell that an exception would be thrown otherwise.

The second step is to convert fixnum? into >boolean, which we can do because we know that the value on the top of the stack is either f or a fixnum. With these two transformations, we should generate the same code as the hack generated, but the transformation should also work on other things.

The second change was easy. I just added custom inlining for the instance? word to detect basically this exact case, and convert the test into >boolean when this is valid.

The first change was a lot more work. First, I had to come up with the algorithm for choosing the right method (even if it seems obvious now in retrospect).

Make a list of methods on the generic word whose class intersects the receiver class
If this list consists of one element, then return it
If the list consists of multiple or zero elements, then there is no method to inline

Once this class is found, then propagation should generate code like this:

dup method's-class instance? [
M\ method's-class generic-word execute
] [
\ generic-word no-method
] if

This is valid because we know that, if the receiver fails the test, then it must be of some class that has no method on the generic word.

An extra requirement for the correct implementation of this compiler optimization is to track dependencies. For this, I made two new types of dependencies. One corresponds to the test that the generic word only has one particular method intersecting the class that's on the stack. The other tracks if a method that's inlined is overwritten. I wasn't able to reuse the dependency tracking for the other kind of method inlining, but it all fits into the same framework and didn't take much code.

All of this was pretty hard for me to debug, but it was a fun thing to work on. Among the code loaded in a basic development image, over 2400 methods are inlined using this technique, which were previously impossible to inline. There was probably also more method inlining done following this, due to improved type information, though I haven't collected statistics. And most importantly, I was able to eliminate the hack with >fixnum with no regression in performance.

Once I get a couple kinks worked out, this should be merged into mainline Factor. For now, it's in the propagation branch of my repository.

2 comments:

Jules said...

At compile time, you know the list of methods for a generic word. So suppose you have a generic word tostring. Couldn't you compile that like this (either you sort classes in topological order, or you emit a tree of conditionals instead of a linear list):

def tostring(it):
if it is int:
return inttostring(it)
if it is array:
return arraytostring(it)
if it is string:
return string
...
else:
nomethoderror(...)

Now wouldn't the ordinary optimizer for conditionals handle this optimization automatically, and much more? It is easy to generalize this to multiple and even predicate dispatch.

You could recover polymorphic inline caching with profile directed optimization for conditionals: you obtain profiles for the dispatch code which tell you how often each conditional got executed. Then you reorder the tests so that the most likely comes first. It has the advantage that the dispatch code require for polymorphic inline caching is removed, the disadvantages are (1) that the relative frequencies of the classes of the objects in the profile are perhaps not the same as the actual frequencies at runtime and (2) the code doesn't adapt to changing frequencies. For benchmarks (1) isn't a problem because you would of course obtain perfect profiles by running the benchmark ;)

This optimization would probably be useful for other conditionals too.

Daniel Ehrenberg said...

Jules,

This is an interesting idea, but it wouldn't be at all straightforward to implement within Factor right now. First of all, you'd want this definition to exist only for compiler analysis purposes; generic words are implemented in a much more efficient way in practice. Remember, the common case is that we have almost no type information. Second, it'd be a lot slower for the current compiler to traverse that code than to have a direct algorithm (like the one I've implemented) that traverses the data structures for generic words. Third, polymorphic inline caches operate at runtime, whereas Factor's compiler is mostly used as an ahead-of-time compiler, so unless we want to change that, we couldn't use your technique. Profile-guided optimizations are interesting, but it's problematic to collect a profile (which is one reason why JITs have become so popular). Fourth, making a compiler rearrange conditionals is hard in Factor's IR, and regardless of the IR you'd have to prove certain properties about the conditions being disjoint, which could be difficult to do efficiently.

Anyway, these concerns that I have aren't fundamental ones about the nature of object-oriented languages, but superficial ones about the current Factor implementation. It could be interesting to explore ideas like this in the future, either once Factor's implementation has evolved more or in the context of another system.