Write barriers don't actually have to be run on every single write. There are two cases where they can be skipped:
- If a small enough object has been allocated, and no GC could have run since the allocation, then it must still be in the nursery.
- If a write barrier has been run on an object and the GC hasn't run since then, the write barrier does not need to run on further writes to that object.
Neither fact holds across subroutine calls, since a garbage collection might happen there, and both are also invalidated by GC checks. But there's still a lot of code that can be improved with these observations.
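To make these concrete, here's a minimal Python sketch of applying both facts to straight-line code. The (op, register) instruction encoding is invented for the example; it isn't Factor's IR.

```python
def eliminate_local(instructions):
    safe = set()   # registers that provably need no write barrier right now
    kept = []
    for op, reg in instructions:
        if op == "allocate":
            safe.add(reg)                  # fact 1: a fresh object is in the nursery
        elif op == "write-barrier" and reg in safe:
            continue                       # fact 2: already covered, drop the barrier
        elif op == "write-barrier":
            safe.add(reg)
        elif op in ("call", "gc-check"):
            safe.clear()                   # a GC may happen here, so forget everything
        kept.append((op, reg))
    return kept
```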
For example, the word reverse, if specialized for arrays, has no subroutine calls or allocations in its inner loop once enough compiler passes have run. But a naive code generator would emit a write barrier call on each loop iteration. It's enough to call the write barrier once, outside the loop, and doing this gives a 15% speedup.
I implemented this as a compiler optimization on Factor's low-level IR, extending Slava's local write barrier elimination pass, described here. Slava's pass eliminates redundant write barriers within a basic block, based on the two facts above. For the local case, Slava's implementation is optimal, but with control flow we can do much better.
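Concretely, the transformation on reverse's inner loop has roughly this shape, sketched as ordinary code rather than low-level IR (write_barrier is a stand-in stub, not Factor's actual primitive):

```python
def write_barrier(obj):
    pass   # stand-in for the VM's real write barrier

def reverse_naive(src, dst):
    n = len(src)
    for i in range(n):
        dst[i] = src[n - 1 - i]
        write_barrier(dst)   # naive codegen: one barrier per iteration

def reverse_hoisted(src, dst):
    write_barrier(dst)       # no call, allocation, or GC check in the loop,
    n = len(src)             # so a single barrier before it suffices
    for i in range(n):
        dst[i] = src[n - 1 - i]
```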
Here's the idea: first, insert a call to the write barrier outside of any loop that calls the write barrier on each iteration. Next, delete redundant write barriers using a dataflow analysis. With Factor's new dataflow analysis and loop analysis frameworks, both of these tasks are pretty easy.
Inserting write barriers outside of loops
Slava just implemented loop detection in low-level IR. For each loop, I want to find the set of registers that each iteration will definitely call the write barrier on. Once I find this set, I just place the write barriers right before the start of the loop. Deleting the ones inside the loop comes as the next step.
The output of loop detection is a list of loops, where each loop has a header (the entry point), a list of ends (sources of jumps back to the header), and a list of basic blocks contained in the loop. If a basic block dominates each end, then it must run on every iteration of the loop. So a conservative approximation of the write barriers that must run on each iteration is the set of write barriers contained in basic blocks that dominate every end of the loop. It turns out this is enough to get all of the meaningful, practical cases, like append and reverse.
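In sketch form, with invented Block and Loop records and an assumed precomputed dominator map dom (dom[b] is the set of blocks dominating b), the candidate set falls out directly:

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Block:
    insns: list = field(default_factory=list)   # (op, register) pairs

@dataclass
class Loop:
    header: Block
    ends: list     # blocks with back edges to the header
    blocks: list   # every basic block in the loop

def barriers_to_hoist(loop, dom):
    # Blocks that dominate every end of the loop run on every iteration.
    every_iter = set.intersection(*(dom[end] for end in loop.ends))
    every_iter &= set(loop.blocks)
    return {reg
            for block in every_iter
            for op, reg in block.insns
            if op == "write-barrier"}   # run these once, before loop.header
```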
We have to be a little bit careful, though. You can't always insert a write barrier outside of a loop, because you can't run the write barrier on something like a fixnum; if you do, the VM might crash. Because type information isn't available in low-level IR, I reconstruct what can have the write barrier called on it by seeing what has had a slot lookup. This is a simple dataflow analysis.
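A deliberately flow-insensitive simplification of that reconstruction looks like this (the real pass is a proper dataflow analysis; register redefinition is ignored here):

```python
def known_heap_objects(blocks):
    # A register that was the object operand of a slot lookup must hold
    # a heap object, so the write barrier is safe to run on it.
    return {reg
            for block in blocks
            for op, reg in block.insns
            if op == "slot"}

# Only hoist barriers on registers proven to hold heap objects:
#   barriers_to_hoist(loop, dom) & known_heap_objects(blocks)
```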
Removing redundant write barriers
Write barriers can be removed with another dataflow analysis. Here, for each basic block, we want to calculate the set of registers on which the write barrier does not need to be called. Once we have this set, we can run the old write barrier removal algorithm.
This is a forward analysis. I call the set of registers on which the write barrier does not need to be called the "safe" set. Safe-in for a basic block is the intersection of the safe-outs of its predecessors if the block contains no allocation, and the empty set if it does allocate (since an allocation can trigger a GC). Safe-out is safe-in plus all registers allocated in the block and all registers that have had the write barrier run on them there. Factor's dataflow framework handles this perfectly.
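A sketch of that analysis, reusing the toy (op, register) instruction lists and Block records from earlier (preds is an assumed predecessor map; the pessimistic empty-set initialization keeps the fixpoint sound):

```python
def allocates(block):
    return any(op == "allocate" for op, _ in block.insns)

def gen(block):
    # registers allocated in the block, or barriered there
    return {reg for op, reg in block.insns
            if op in ("allocate", "write-barrier")}

def compute_safe_sets(blocks, preds):
    safe_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if allocates(b) or not preds[b]:
                safe_in = set()             # an allocation may trigger a GC
            else:
                safe_in = set.intersection(
                    *(safe_out[p] for p in preds[b]))
            new_out = safe_in | gen(b)
            if new_out != safe_out[b]:
                safe_out[b] = new_out
                changed = True
    return safe_out   # fed to the per-block elimination pass
```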
Conclusion
Hoisting write barriers out of loops was easier than I expected, just two evenings of work. Unfortunately, it isn't as general as it should be. The type reconstruction from slot calls doesn't work as well as it should, since there is a subroutine call between the call to slot and the loop in the case of append and push-all. I think the solution would be for high-level IR to pass type information down to low-level IR. It would be useful here, and I bet as the low-level optimizer becomes more advanced, it will become useful in other places too. The code implementing the optimization is in the Factor pastebin and should be in the main Factor repository soon.