Thursday, December 23, 2010

Work at RethinkDB

I've been working at RethinkDB for the past month. My project has been to convert the code base from using callbacks for asynchronous I/O—essentially continuation-passing style—to a more natural style of concurrency using coroutines. You can read about my progress on the RethinkDB blog.

Sunday, May 30, 2010

Paper for DLS 2010

Slava Pestov, Joe Groff and I are writing a paper about Factor for the Dynamic Languages Symposium, part of OOPSLA. It's a survey discussing both the language design and its implementation, and you can read the draft on the Factor website. The paper is due on Tuesday, and we would really appreciate any comments you have on it.

Update: I've submitted the paper. Thanks, everyone, for your helpful suggestions! I'll hear back on July 15, and I'll report here when I do.

Sunday, April 25, 2010

Guarded method inlining for Factor

To make code faster, Factor's compiler tries to eliminate calls to generic words, replacing them with direct calls to the appropriate method. If the method is declared inline, its definition will also be inlined. Together, these two optimizations are informally called 'method inlining'. Method inlining is essential for making object-oriented code in Factor fast, and since basic operations like accessing elements of a sequence or adding two numbers are implemented as method calls, it is needed for any Factor code to be fast.

How things worked

Here's how the current algorithm works. Method inlining takes place within Factor's sparse conditional constant propagation pass. SCCP infers upper bounds for the classes of values that the code manipulates. When SCCP processes a generic word call, it examines the class of the receiver as well as the list of classes that have methods on the generic word. If SCCP can tell that a particular method will always be called, it can select that method. Below is some pseudocode for how it detects that.

For each method on the generic word:
    If the class for that method intersects the receiver class:
        If the class for that method is a superclass of the receiver class:
            Put it on a list
        Otherwise:
            We don't know whether this method will be called at runtime
            or not, so bail out and fail to inline a method
Inline the method for the smallest class on the list
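To make the selection step concrete, here's a rough sketch in Python (the real implementation is in Factor, inside SCCP). Modeling classes as frozensets of concrete type names is my simplification for illustration, not how Factor's class algebra actually works:

```python
def select_method_to_inline(methods, receiver_class):
    # Classes are modeled as frozensets of concrete type names, so
    # "A is a superclass of B" is just "A is a superset of B".
    candidates = []
    for cls, method in methods.items():
        if cls & receiver_class:        # the classes intersect
            if cls >= receiver_class:   # cls covers every possible receiver
                candidates.append((cls, method))
            else:
                # this method might or might not be called at runtime,
                # so bail out and fail to inline
                return None
    if not candidates:
        return None
    # inline the method for the smallest (most specific) class on the list
    return min(candidates, key=lambda cm: len(cm[0]))[1]

sequence = frozenset({"array", "string"})
array = frozenset({"array"})
methods = {sequence: "M\\ sequence like", array: "M\\ array like"}

print(select_method_to_inline(methods, frozenset({"array"})))
# → M\ array like  (both methods apply; the smallest class wins)
print(select_method_to_inline(methods, frozenset({"array", "string"})))
# → None  (the array method only *might* be called, so we bail out)
```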

There's an additional complication. It might be that a word is compiled, with the method inlining optimization applied, but then after this, more vocabs get loaded that add additional methods to the generic word. This might invalidate the correctness of the method inlining, and the word should get recompiled to fix this problem. Factor uses a simple system to track these dependencies, in stack-checker.dependencies.

My new addition

A few days ago, another Factor developer told me about a hack he'd added to SCCP to make a particular benchmark faster. The hack was, if >fixnum is called on an object that SCCP knows is either a fixnum or f, then the >fixnum call is replaced with dup [ \ >fixnum no-method ] unless. This works because >fixnum doesn't have any methods on f, and >fixnum on fixnums is a no-op.

My immediate instinct here was to generalize this solution. The first step is to convert that code into dup fixnum? [ M\ fixnum >fixnum ] [ \ >fixnum no-method ] if, which we can do since >fixnum doesn't have any other methods on the union of f and fixnum. Unlike the kind of method inlining described earlier, this requires the insertion of a guard. Later code will know (through SCCP) that the object is a fixnum, even after the conditional exits, since it can tell that an exception would be thrown otherwise.

The second step is to convert fixnum? into >boolean, which we can do because we know that the value on the top of the stack is either f or a fixnum. With these two transformations, we should generate the same code as the hack generated, but the transformation should also work on other things.

The second change was easy. I just added custom inlining for the instance? word to detect basically this exact case, and convert the test into >boolean when this is valid.

The first change was a lot more work. First, I had to come up with the algorithm for choosing the right method (even if it seems obvious in retrospect).

  1. Make a list of the methods on the generic word whose class intersects the receiver class
  2. If this list consists of exactly one element, return it
  3. If the list has zero or multiple elements, there is no method to inline

Once this method and its class are found, propagation should generate code like this:

dup method's-class instance? [
M\ method's-class generic-word execute
] [
\ generic-word no-method
] if

This is valid because we know that, if the receiver fails the test, then it must be of some class that has no method on the generic word.
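Here's a hedged Python sketch of that selection rule, again modeling classes as sets of concrete type names for illustration:

```python
def select_guarded_method(methods, receiver_class):
    # Classes modeled as frozensets of concrete type names. Unlike plain
    # method inlining, the method's class need not cover the whole
    # receiver class, because a runtime instance? guard is inserted and
    # anything that fails the guard has no method at all.
    hits = [(cls, m) for cls, m in methods.items() if cls & receiver_class]
    if len(hits) == 1:
        return hits[0]
    return None  # zero or several intersecting methods: nothing to inline

# >fixnum has a method on fixnum and none on f, so for a receiver known
# to be "fixnum or f" there is exactly one intersecting method.
methods = {frozenset({"fixnum"}): "M\\ fixnum >fixnum"}
cls, method = select_guarded_method(methods, frozenset({"fixnum", "f"}))
print(method)  # → M\ fixnum >fixnum
# Propagation would then emit:
#   dup fixnum instance? [ M\ fixnum >fixnum execute ] [ \ >fixnum no-method ] if
```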

An extra requirement for the correct implementation of this compiler optimization is to track dependencies. For this, I made two new types of dependencies. One corresponds to the test that the generic word only has one particular method intersecting the class that's on the stack. The other tracks if a method that's inlined is overwritten. I wasn't able to reuse the dependency tracking for the other kind of method inlining, but it all fits into the same framework and didn't take much code.

All of this was pretty hard for me to debug, but it was fun to work on. Among the code loaded in a basic development image, over 2400 methods that previously could not be inlined are now inlined using this technique. There was probably also more method inlining downstream of this, due to improved type information, though I haven't collected statistics. Most importantly, I was able to eliminate the >fixnum hack with no regression in performance.

Once I get a couple kinks worked out, this should be merged into mainline Factor. For now, it's in the propagation branch of my repository.

Tuesday, April 6, 2010

A couple language design ideas

I know I shouldn't really write about something I haven't implemented yet, but here are a couple ideas that I've been talking about with some of the other Factor developers that I'd like to share with you. Credit for these ideas goes mostly to Slava Pestov and Joe Groff.

Multimethods in Factor

Factor should have multimethods; they'd clean up a lot of code. There already are multimethods, in extra/multimethods, but they are extremely slow (in the average case, you have to search through half of the method list to find the right one, which is unacceptable), and their syntax could be cleaned up a bit. As already implemented, these multimethods combine dispatching on values on the stack with dispatching on the values of dynamically scoped variables. The latter is called 'hooks' in Factor, and is useful for what other languages do using conditional compilation.

Here's a sample of what the syntax might look like:
! Before the -- is upper bounds on the allowable methods
! after the -- is a type declaration that is checked.
GENERIC: like ( seq: sequence exemplar: sequence -- newseq: sequence )

! Parameters with specific types in the parentheses--this is not a stack effect
M: like ( array array ) drop ;

! The single parameter refers to the top of the stack
! and the second element isn't used for dispatch
! so it defaults to sequence
M: like ( array ) >array ;

GENERIC: generate-insn ( instruction: insn -- | cpu ) ! after the | is the hook variables

! Before the | comes from the stack, after is variables. x86 is a singleton class.
M: generate-insn ( ##add-float | cpu: x86 )
[ dst>> ] [ src1>> ] [ src2>> ] tri
double-rep two-operand ADDSD ;

I've implemented some of this syntax in the multimethods branch of my git repository. Another cool part, language-design-wise, is that multimethods can replace a few different other language features.

For one, they can replace Joe Groff's TYPED:. That lets you declare the types of arguments of a word, and have those be automatically checked, with checks removed if the compiler can prove they're unnecessary. In the ideal design, : will replace TYPED:, and if any parameters have type declarations, then the colon definition is macro-expanded into a generic word with only one method. This implements the type checking.

Multimethods could also be used to implement 'hints' better than they are right now. Hints are a feature of the compiler where a programmer can instruct the compiler to create specialized versions of a function for certain parameter types. Currently, hints use a very inefficient dispatch mechanism: when you call a word with hints, it goes down the list of specialized implementations of the word, testing the stack to see if the contents of the stack are members of those classes. But if efficient multimethod dispatch were worked out, this could be much faster. Also, method inlining would make it so that the hints dispatch would be eliminated if types are known by the compiler. This isn't done right now.

Callsite specialization could also be implemented with the help of multimethods. The difference between this and hints is just that a new hint would be generated if type inference found certain types for the arguments of a function, but there wasn't already a hint for that type. The basic idea for callsite specialization is that the specialized versions would only be used when the compiler can prove that they're appropriate. But if efficient multimethod dispatch were used, then callsite specialization could be used to generate hints that are also available on values whose types are unknown until runtime. Runtime type feedback could work this way too.

There are some implementation challenges in making multimethods fast. From what I've read, the best approach seems to be the algorithm by Chambers and Chen used in the Cecil compiler. It generates a dispatch DAG for each multiple-dispatch generic word, where the interior nodes are single dispatches and the leaves are methods. The algorithm as described directly in the paper looks a little impractical, since it iterates over all classes, but I think iterating only over the classes that appear as method arguments, together with their intersections, would be enough.
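To make the dispatch-DAG idea concrete, here is a toy Python sketch of the kind of structure such an algorithm might produce—a tree whose interior nodes each dispatch on one argument position—not the Chambers–Chen construction itself:

```python
class Leaf:
    def __init__(self, method):
        self.method = method
    def dispatch(self, args):
        return self.method

class Node:
    def __init__(self, arg_index, cases, default):
        self.arg_index = arg_index  # which argument this node tests
        self.cases = cases          # class name -> subtree
        self.default = default      # subtree for everything else
    def dispatch(self, args):
        cls = type(args[self.arg_index]).__name__
        return self.cases.get(cls, self.default).dispatch(args)

# like ( seq exemplar ) with a method on ( array array ) and a
# fallback method on ( array ); Python lists stand in for arrays.
tree = Node(1, {"list": Leaf("M: like ( array array )")},
            Leaf("M: like ( array )"))
print(tree.dispatch(([1, 2], [3])))   # → M: like ( array array )
print(tree.dispatch(([1, 2], "ab")))  # → M: like ( array )
```

A real implementation would choose the test order and share subtrees to keep the DAG small; the point here is just that each runtime dispatch is a single-argument test.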

There's a problem, though, which is that Factor's object system is complicated, and it's not obvious how to get the intersection of classes. The current implementation is probably buggy; if it's not, then I can't understand why it works. So class algebra will have to be fixed first, before multimethods can work. I hope I can figure out how to do this. I bet it has something to do with Binary Decision Diagrams, since we're talking about intersections, unions and negation. We might want to make things simpler by factoring predicate classes out in the way that the Cecil paper does. Either way, things are complicated.

A protocol for variables

I never would have expected it, but the stack-based programming language Factor uses tons of variables. And lots of different kinds, too.
  • Lexically scoped variables in the locals vocab
  • Dynamically scoped variables in the namespaces vocab
  • Two types of globals, using VALUE: and using words for dynamically scoped variables in the global scope
  • Thread-local variables in the thread vocab
  • For web programming with Furnace, there are session variables and conversation variables and scope variables
These are all useful and necessary (though maybe we don't need two kinds of globals), but it's not necessary that they all use different syntactic conventions. They could all use the same syntax.

Some of these variables always correspond to words (locals and values) and some of them don't. Probably it'd be better if all variables corresponded to words, to reduce the risk of typos and to make it possible to have terser syntax to use them, the way we have terse syntax to use locals and values.

Here's my proposal for syntax. Whenever you want to use a variable (unless it's a local), you have to declare it beforehand in some vocab; a declaration might introduce, say, a dynamically scoped variable foo or a thread-local variable bar. Currently, you wouldn't declare these ahead of time, and instead you would just use symbols for these purposes.
Reading a variable should just be done by executing the word that is the variable; this is how it works in almost all programming languages. Since the variable has been declared ahead of time, we know what kind of variable it is and how to read it. There could be a single parsing word for writing to variables, such as to: (this is what values uses right now), as in the following.
4 to: foo
Again, since the variable is declared ahead of time, we know how to write to it, and the right code is generated at parsetime. We could also have combinators like change, written in a way that works with all variable types:
[ 1 + ] change: foo
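As an illustration of how one protocol could back all of these variable kinds, here's a rough Python sketch; the class names and the modeling of to: and change: as methods are my inventions, not the proposed Factor design:

```python
import threading

class Variable:
    """Common protocol: every variable kind knows how to read and write
    itself, so reader words, to: and change: can share one syntax."""
    def get(self): raise NotImplementedError
    def set(self, value): raise NotImplementedError  # 4 to: foo
    def change(self, quot):                          # [ 1 + ] change: foo
        self.set(quot(self.get()))

class Global(Variable):
    def __init__(self, value=None):
        self.value = value
    def get(self): return self.value
    def set(self, value): self.value = value

class ThreadLocal(Variable):
    def __init__(self):
        self.store = threading.local()
    def get(self): return getattr(self.store, "value", None)
    def set(self, value): self.store.value = value

foo = Global(3)
foo.change(lambda x: x + 1)  # [ 1 + ] change: foo
print(foo.get())             # → 4
```

Because every variable implements the same two operations, the parsing words can generate the right reads and writes at parse time without caring which kind of variable they're dealing with.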

Another change I'm interested in is the scoping rules for dynamically scoped variables. Right now, when with-variable is invoked, a totally new scope is made such that if any dynamic variable is set, that setting is only kept as long as the innermost with-variable call. These semantics were made back when the fundamental construct for variables was bind, but in modern Factor code, this isn't so important anymore.

The current semantics are a frequent source of bugs. Basically, it's the funarg problem all over again: if you have a higher order function which includes a with-variable call, and if the quotation sets any variables, then these variable settings don't escape the higher order function call. This is usually not what the programmer intended! There are various hacks, euphemistically called idioms, used in Factor to get around this.

The semantics should really be that set modifies the binding in the scope that defined the variable, with the original binding always coming from with-scope. But this change would break tons of Factor code that already exists, so I don't expect it to be used any time soon. Also, it would make things less efficient if each variable binding had its own scope, since right now things are often grouped together.
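A minimal Python model may make the difference between the two semantics clearer; set_innermost mimics the current behavior and set_defining the proposed one (the names are mine):

```python
class DynamicScope:
    def __init__(self):
        self.scopes = [{}]  # stack of namespaces; last is innermost

    def with_scope(self, bindings, body):
        self.scopes.append(dict(bindings))
        try:
            return body()
        finally:
            self.scopes.pop()

    def get(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise KeyError(name)

    # current semantics: set writes into the innermost scope, so the
    # write is thrown away when the enclosing with-variable exits
    def set_innermost(self, name, value):
        self.scopes[-1][name] = value

    # proposed semantics: set mutates the scope that defined the name
    def set_defining(self, name, value):
        for scope in reversed(self.scopes):
            if name in scope:
                scope[name] = value
                return
        raise KeyError(name)

ns = DynamicScope()
ns.scopes[0]["x"] = 1
ns.with_scope({"y": 0}, lambda: ns.set_innermost("x", 2))
print(ns.get("x"))  # → 1: the write was silently lost (the funarg-style bug)
ns.with_scope({"y": 0}, lambda: ns.set_defining("x", 2))
print(ns.get("x"))  # → 2: the write survives the inner scope
```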

One of my far-out hopes with variables is that maybe dynamically scoped variables could be optimized so that they would be as efficient as lexically scoped variables. But this would require lots of changes to Factor's implementation. For example, right now no level of the Factor compiler knows anything about any kind of variables; they're either subroutine calls or stack shuffling by the time the compiler sees them. Still, a sufficiently smart compiler could translate a usage of make into a version that stores the partially built sequence on the stack, and there have been compilers that do the equivalent of this.

Wednesday, March 17, 2010

Expressing joint behavior of modules

Sometimes, special code is needed to make two modules interact together. This seems to come up all the time in Factor. For example, there are two language features, locals and prettyprinting, implemented in the Factor library. If they are both loaded, then there is special code to make locals prettyprint. But neither one depends on the other.

The old way to handle this situation was to have code that looked like this, as a top-level form in the locals vocab:
"prettyprint" vocab [
"locals.prettyprint" require
] when
But this is problematic. What if locals is loaded first, and then prettyprint? Well, this actually happened, due to a change in some other code. Then locals don't prettyprint correctly! You could fix this by adding top-level statements in both vocabs, as mirrors and specialized arrays did, but then this single joint dependency is duplicated in two places. Maintenance is more difficult than it should be.

To fix this problem, I added a new word, require-when. The code above, as a top-level form in the locals vocab, would be replaced with
"prettyprint" "locals.prettyprint" require-when
The logic is: if the prettyprint vocab is already loaded, then locals.prettyprint will be loaded. Otherwise, the dependency is registered with the module system, so if prettyprint is loaded later by something else, then locals.prettyprint will be as well.
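The registration logic behind require-when can be sketched in a few lines of Python (an illustration of the idea, not Factor's actual module-system code):

```python
loaded = set()   # vocabs already loaded
pending = {}     # trigger vocab -> joint vocabs to load along with it

def require(vocab):
    if vocab in loaded:
        return
    loaded.add(vocab)
    # loading a vocab fires any joint dependencies registered on it
    for dep in pending.pop(vocab, []):
        require(dep)

def require_when(trigger, joint):
    """Load `joint` now if `trigger` is loaded; otherwise remember to
    load it whenever `trigger` gets loaded later."""
    if trigger in loaded:
        require(joint)
    else:
        pending.setdefault(trigger, []).append(joint)

# locals is loaded first; prettyprint only arrives later
require("locals")
require_when("prettyprint", "locals.prettyprint")
require("prettyprint")
print("locals.prettyprint" in loaded)  # → True
```

Either load order works: if prettyprint had already been loaded, require_when would have loaded locals.prettyprint immediately, and the joint dependency is stated in exactly one place.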

I'm pretty satisfied with this solution to a somewhat long-standing problem in Factor. I wonder how other module systems solve the same problem. I have this implemented in a branch in my repository and it should be incorporated into mainline Factor soon.

Monday, March 8, 2010

Paper about call( -- ) inlining

Since I think my method of inlining closures in Factor is stronger than previous similar optimizations, and since I wouldn't mind a potentially free trip to Toronto, I wrote an abstract on what I did for the PLDI student research contest. It's called Closure Elimination as Constant Propagation (not sure if that's the best title). It's supposed to be under 800 words, and it's about double that now. Readers, I'd really appreciate your help in fixing errors, figuring out what to cut, and finding what is unclear and what I left out. Send me any suggestions as comments here or by email. Thanks!

Update: I edited the paper, and changed the link above.

Update 2: There's also a severely abridged version, which is more like what I'll submit, here.

Update 3: I got accepted to PLDI! If you're coming, I'll see you there!

Thursday, March 4, 2010

A protocol for sets

It's bothered me for some time that Factor has a protocol (an informal abstract class) for sequences and associative mappings, but nothing for sets. In core and basis, there are three set implementations: you can use a hashtable as a set, a bit array, or a plain old sequence. Different words are used to manipulate these, and each set type has a rather incomplete set of operations.

In a branch in my repository, I've fixed this omission. Now all three are under a common protocol, and more can easily be added. If you're interested in reading about the details of the protocol, you can pull from this repository, bootstrap and read the documentation included.

There are a number of benefits to this.
  1. All sets now support all operations using names that correspond to the intuitive meanings of the operations
  2. It is much easier to change set representations within a piece of code: just change the code that initializes the set
  3. It's easier to implement new types of sets because you get many operations 'for free' and a nice common API.

The design I went with had a few conflicting goals:
  • The protocol should be clean and simple, and therefore easy to implement
  • The new protocol should be mostly compatible with the existing sets vocabulary.
  • There should be no performance overhead (besides method dispatch) for using this protocol

Each of these had to be sacrificed a little. For performance, I had to make the protocol include many different operations, since different set implementations might have a way to override them for efficiency. To keep things simple and easy to implement, I did this only for operations that were currently in use in the Factor code base, and I included default methods for as many operations as I could. For compatibility, I made unordered sequences act as sets in a way that's very similar to the old sets vocabulary, and the major operations are generalizations of the operations on sequences.
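As a sketch of the design, here's a toy Python analogue: an abstract set protocol with a few required operations and a default method that implementations may override for efficiency. The names and operations are illustrative, not Factor's actual words:

```python
from abc import ABC, abstractmethod

class SetProtocol(ABC):
    """Implementations provide membership, adjoin and members;
    other operations come for free as default methods."""
    @abstractmethod
    def contains(self, x): ...   # analogue of in?
    @abstractmethod
    def adjoin(self, x): ...
    @abstractmethod
    def members(self): ...       # the elements, one copy of each

    # default method, overridable where a representation can do better
    def union(self, other):
        for x in other.members():
            self.adjoin(x)
        return self

class HashSet(SetProtocol):      # analogue of HS{ 1 2 3 }
    def __init__(self, elts=()):
        self.table = {x: True for x in elts}
    def contains(self, x): return x in self.table
    def adjoin(self, x): self.table[x] = True
    def members(self): return list(self.table)

class SequenceSet(SetProtocol):  # a plain sequence acting as a set
    def __init__(self, elts=()):
        self.elts = list(elts)
    def contains(self, x): return x in self.elts
    def adjoin(self, x):
        if x not in self.elts:
            self.elts.append(x)
    def members(self): return list(self.elts)

s = HashSet([1, 2]).union(SequenceSet([2, 3]))
print(sorted(s.members()))  # → [1, 2, 3]
```

Switching representations means changing only the constructor call, and a new set type gets union (and anything else with a default method) for free.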

There is not total backwards compatibility, though. If that had been a design requirement (say, if this were a change made on Factor 1.0), the design would be much cruftier. One change is that the word prune is gone, subsumed by members, which generates a sequence of the members of a set. Given a sequence, this gives a sequence with the same elements but only one copy of each.

A bigger change is that hashtables are no longer meant to be used as sets. In their place is a new data structure, the hash set. Hash sets have literal syntax like HS{ 1 2 3 }, similar to other collections. In their current implementation, they use a hashtable underneath, but in a future implementation a more memory-efficient construction may be used. Hash sets implement set operations and hashtables do not. Previously, words like conjoin, conjoin-at and key? were used to query hashtables as sets, but now these are subsumed by the set words adjoin, adjoin-at and in?. There is a lot of code that uses hashtables as sets, and it's not easy to sort out the set uses of hashtables from the non-set uses for someone who hasn't written the code. So for now, words to manipulate hashtables as sets are still present. conjoin and conjoin-at will be eliminated when all code in the Factor repository is updated to use hash sets instead.

It's somewhat unusual to have a language evolution process where the language is not guaranteed to be compatible from one version to the next, as Factor's is right now. There will continue to be incompatibilities until version 1.0 is released, for the sake of clean organization of the language. There is a tradeoff here: incompatibilities make Factor harder to use and prevent adoption today, but the resulting system will be better organized and easier to use. Factor would probably be in much worse shape today if a policy of backwards compatibility had been adopted a few years ago, and it's a little too soon to start freezing the language now.

Update: By the way, I forgot to mention in the original post: many other high-level programming languages don't have such a nice generic collection of data structures. Factor's isn't complete, but in many ways it's more advanced than those of many popular programming languages. Java, C++ and C# have generic data structure libraries, but using them is more verbose because of missing language features, like syntax for literal hashtables. Popular scripting languages like Python, Ruby and Perl tend to privilege hashtables and resizable arrays over other data structures. In Python or Ruby it's possible to create data types whose interfaces look just like arrays or hashtables, but the higher-order methods on those data structures only work for the built-in types, and there's no way to make them work for your own data type except by reimplementing them. Functional languages like Scheme and Haskell have libraries for lists and arrays, but the interfaces differ between data types; even though Haskell has type classes, both languages' standard libraries are written with lists in mind for the most common operations. Factor used to resemble scripting languages in its support for data structures, but experience writing large programs with them has led to a better thought-out, more object-oriented model.

Wednesday, February 10, 2010

Instruction scheduling for register pressure

Today, I finished the first version of a compiler optimization for Factor's compiler that reorders instructions in order to reduce the number of spills and reloads that the register allocator has to emit. I've been working on this since last fall, so I'm really happy to have a working version. In the current version, it eliminates 20% of those operations in a full development image on the 32-bit x86 architecture. 32-bit x86 is the main target for this optimization because it has so few registers. Scheduling for register allocation isn't an optimization most compilers do, but it looks like it's a profitable one.

The algorithm I used is from the paper Register-sensitive selection, duplication, and sequencing of instructions by Vivek Sarkar et al. (I only did the sequencing part.) The main idea of this paper is to adapt the Sethi-Ullman algorithm, which finds the instruction ordering with minimum register pressure for a tree, to the more general case of a DAG where some edges represent data dependence and other edges represent control dependence. It uses a backwards list scheduling pass to order the instructions based on a heuristic derived from the Sethi-Ullman numbering.

The whole implementation took around 300 lines of high-level Factor code. It's in two vocabs, compiler.cfg.dependence and compiler.cfg.scheduling. The first vocab defines procedures for constructing the dependence graph of a basic block, as well as partitioning that graph into fan-in trees and calculating the Sethi-Ullman numbering for these trees. The second vocab defines a word schedule-instructions which schedules the instructions of each basic block of a CFG. This word is called after write barrier elimination and before representation selection. To save time, only blocks which might cause register spilling are scheduled.
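The heuristic is derived from the Sethi-Ullman numbering, which for a pure expression tree gives the minimum number of registers needed to evaluate it without spilling. A minimal Python sketch of the classic tree case (not the DAG generalization that the scheduling pass actually handles):

```python
def sethi_ullman(node):
    """Minimum registers needed to evaluate an expression tree without
    spilling: leaves need one register; at an interior node, evaluate
    the child that needs the most registers first."""
    if not node.children:
        return 1
    needs = sorted((sethi_ullman(c) for c in node.children), reverse=True)
    # while evaluating the (i+1)-th child, i earlier results stay live
    return max(n + i for i, n in enumerate(needs))

class N:
    def __init__(self, *children):
        self.children = children

# (a + b) + (c + d): leaves need 1 register, each subtree needs 2, and
# the root needs 3 because one subtree's result stays live while the
# other subtree is evaluated.
print(sethi_ullman(N(N(N(), N()), N(N(), N()))))  # → 3
```

Evaluating the more register-hungry subtree first is what keeps the count minimal, and it's this per-tree numbering that the backwards list scheduler uses as its priority.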

Update: The code is now up in a branch of my git repository.

On Reddit, Slava Pestov explains why scheduling instructions can reduce register pressure.

This is a really cool optimization, but Dan doesn't explain why re-ordering instructions can reduce the number of spills and reloads. If this doesn't make sense to you, here's a really basic example. Suppose you have this sequence of operations in a hypothetical SSA IR:

x = 1
y = 2
z[0] = x
z[1] = y

This needs three registers to compile without spilling, one for each of x, y, and z. If your CPU only has two registers, there will be additional memory accesses generated.
If we re-arrange the instructions like so, and assuming that x and y are not referenced anywhere else, then x and y can share a register, since their live ranges no longer overlap:

x = 1
z[0] = x
y = 2
z[1] = y

This code no longer generates spills on a two-register CPU.

Update 2: The scheduling doesn't actually produce much of a speedup yet. It makes the fasta benchmark about 6% faster, and other benchmarks are unchanged as far as I can tell. Oops! I should have benchmarked before blogging. Anyway, I think I might be able to improve this by making a more accurate dependency graph, so there aren't so many false control dependences. At this point, the optimization isn't very useful, but I'm just happy that I managed to get any speedup at all.

Update 3: With a couple minor modifications, it actually makes the fasta benchmark 9% faster, and removes 29% of spills and 26% of reloads, but this still isn't so great. Also, the total size of compiled code in the development image decreased by 1.5% with this change.

Update 4: I modified the dependence graph construction code to reorder stack instructions (peek and replace). To do this without extra false dependences, I revived the old stack height normalization code, which is necessary because stack heights may change multiple times within a basic block due to block joining. Stack height normalization was removed from the compiler when the new DCN was added. Now, around 35% of spills and reloads are eliminated, compared to the version without scheduling.

Saturday, January 30, 2010

Runtime type feedback in Factor

I realized today that, thanks to all the great machinery that Slava Pestov has built, Factor already has everything needed to support feedback-directed optimizations. To demonstrate this, I implemented a simple version of type feedback. The whole thing is less than 30 lines of code. It could definitely use a bit of improvement, but in a simple benchmark, I recorded a speedup of more than two times compared to the purely statically optimized version.

The idea underlying runtime type feedback is that we can produce specialized versions of code based on the types it is used with. Consider, for example, a word like append. On each iteration of its main loop, it calls a method on one of the two inputs to read from the input sequence, and a method on the output to write to it. In general, none of these types are known until the program runs. However, if we know that the inputs are both arrays, then the output will also be an array, and all of these method calls can be inlined using the static analysis of the compiler's propagation pass. This makes the program run significantly faster.

The existing hints mechanism capitalizes on this. Hints make it possible to declare, "this word is likely to be called with the following input from the following classes". With this declaration, the compiler can make a few specialized versions of a word annotated with hints, one for each predicted combination of classes. When the word gets called, it does a series of type checks to choose which version to use, based on the types of the arguments. If none of the types match, it calls the default version.

The problem with this is that you have to declare, in advance, what types you expect to come up. Sometimes, you don't really know what types to expect until the program runs. This is where runtime type feedback comes in. In this system, while the program runs, the language runtime collects profile information about what types words are being called with. If a word is called with a particular type enough times, then hints are added to the word to make it specialized on that type. The word is recompiled with these hints, and future calls to the word have access to this specialized version.

When I first heard about type feedback, it sounded really hard to implement. But it turns out, because Factor already has the hints mechanism and supports runtime redefinition of words, that type feedback is easy. All I did was use a hashtable to store the frequencies of types that a word was called with, and add hints based on this.
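The whole mechanism can be caricatured in a few lines of Python; the threshold value, the profile table, and recompile_with_hints are stand-ins for the real Factor machinery, not its actual API:

```python
from collections import Counter

THRESHOLD = 100
profiles = {}  # word name -> Counter of argument-class tuples
hints = {}     # word name -> set of class tuples already specialized on

def recompile_with_hints(word, hint_set):
    # stand-in for Factor recompiling the word with its hints
    print("recompiling", word, "specialized on", sorted(hint_set))

def record_call(word, arg_classes):
    """Profile one call; once a class combination has been seen often
    enough, add it as a hint and recompile the word."""
    counts = profiles.setdefault(word, Counter())
    counts[arg_classes] += 1
    specialized = hints.setdefault(word, set())
    if counts[arg_classes] >= THRESHOLD and arg_classes not in specialized:
        specialized.add(arg_classes)
        recompile_with_hints(word, specialized)

for _ in range(THRESHOLD):
    record_call("append", ("array", "array"))
# the 100th call crosses the threshold and triggers one recompilation
```

Since Factor supports runtime word redefinition, the recompilation step really is just "add a hint and recompile", which is why the whole feature fits in so little code.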

There are some disadvantages to this approach. The first is that the compiler must be packaged with any program that wants to use type feedback at runtime. This takes up memory at runtime and makes the program take longer to start up because of the bigger executable. Additionally, any runtime optimization makes programs run slower at the beginning. The system has a certain degree of "warmup time" where it is not running at full speed because runtime optimizations have not yet been applied. An extreme case of this is in a full just-in-time (JIT) compiler, where all compilation takes place at runtime, and the program is initially running in an interpreter or something similar. This would not be quite as big an issue here, where runtime type feedback is added to an ahead-of-time (AOT) compiler, but it's still a problem.

My implementation of type feedback is far from perfect, and should be considered a proof of concept at this point. It won't be included as a default feature of Factor any time soon. First, the profiling code might be faster if it were written in a lower-level style; on some benchmarks, the cost of profiling in this implementation outweighs the benefits of the type feedback. Second, the way I use hints is pretty inefficient: I add a hint and then recompile the entire word with all the hints, so if n type combinations are added, it spends O(n²) time compiling where it should only spend O(n). Third, hints are not implemented as well as they could be in terms of runtime performance; a better implementation, based on multimethods, would probably improve the performance of type feedback. Finally, the Factor compiler itself isn't as fast as it could be. Compilation speed matters much more in the presence of runtime compilation than for AOT compilation, so it hasn't been the top priority so far.

Despite these flaws, runtime type feedback, even implemented naively, creates a significant speedup. For dynamically typed, object-oriented languages like Factor, dynamic optimizations seem to offer a benefit that static optimizations alone can't completely match.