The algorithm I used is from the paper "Register-sensitive selection, duplication, and sequencing of instructions" by Vivek Sarkar et al. (I only did the sequencing part.) The main idea of this paper is to adapt the Sethi-Ullman algorithm, which finds the instruction ordering with minimum register pressure for a tree, to the more general case of a DAG where some edges represent data dependence and others represent control dependence. It uses a backwards list scheduling pass to order the instructions, based on a heuristic derived from the Sethi-Ullman numbering.
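For background, here's a minimal sketch of the classic Sethi-Ullman numbering on an expression tree. I'm writing it in Python rather than Factor just for illustration, and the Node class is invented for the example, not taken from my code. The number of a node is how many registers it takes to evaluate it without spilling:

class Node:
    """A hypothetical expression-tree node; leaves have no children."""
    def __init__(self, *children):
        self.children = children

def sethi_ullman(node):
    """Registers needed to evaluate `node` without spilling."""
    if not node.children:                      # leaf: load into one register
        return 1
    # Evaluate the most demanding child first; each already-computed
    # sibling result then occupies one register while the rest run.
    needs = sorted((sethi_ullman(c) for c in node.children), reverse=True)
    return max(need + i for i, need in enumerate(needs))

# (a + b) * (c + d): each operand needs 2 registers, and one result
# stays live while the other side is evaluated, so the total is 3.
print(sethi_ullman(Node(Node(Node(), Node()), Node(Node(), Node()))))  # => 3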
The whole implementation took around 300 lines of high-level Factor code. It's in two vocabs, compiler.cfg.dependence and compiler.cfg.scheduling. The first vocab defines words for constructing the dependence graph of a basic block, as well as for partitioning that graph into fan-in trees and calculating the Sethi-Ullman numbering for those trees. The second vocab defines a word, schedule-instructions, which schedules the instructions of each basic block of a CFG. This word is called after write barrier elimination and before representation selection. To save time, only blocks which might cause register spilling are scheduled.

Update: The code is now up in a branch of my git repository.
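To give a flavor of the sequencing pass itself, here's a rough sketch of backwards list scheduling, again in Python for illustration. The priority function stands in for the Sethi-Ullman-derived heuristic, and none of these names come from my actual code:

def schedule_backwards(instructions, succs, priority):
    # succs[i] holds the instructions that must run after i, via either
    # a data or a control dependence edge in the dependence graph.
    unscheduled = set(instructions)
    order = []
    while unscheduled:
        # An instruction is "ready" when everything depending on it has
        # already been placed, since we fill the block back to front.
        ready = [i for i in unscheduled
                 if all(s not in unscheduled for s in succs[i])]
        # The heuristic picks which ready instruction to emit next;
        # higher priority means it should sit later in the block.
        best = max(ready, key=priority)
        order.append(best)
        unscheduled.remove(best)
    order.reverse()  # we built the block backwards
    return order

# Example: a diamond dependence 0 -> {1, 2} -> 3.
print(schedule_backwards([0, 1, 2, 3],
                         {0: {1, 2}, 1: {3}, 2: {3}, 3: set()},
                         priority=lambda i: -i))  # => [0, 2, 1, 3]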
On Reddit, Slava Pestov explains why scheduling instructions can reduce register pressure.
This is a really cool optimization, but Dan doesn't explain why re-ordering instructions can reduce the number of spills and reloads. If this doesn't make sense to you, here's a really basic example. Suppose you have this sequence of operations in a hypothetical SSA IR:
x = 1
y = 2
z[0] = x
z[1] = y
This needs three registers to compile without spilling, one for each of x, y, and z. If your CPU only has two registers, there will be additional memory accesses generated.
If we re-arrange the instructions like so, and assuming that x and y are not referenced anywhere else, then x and y can share a register, since their live ranges no longer overlap:
x = 1
z[0] = x
y = 2
z[1] = y
This code no longer generates spills on a two-register CPU.
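To make that concrete, a quick liveness simulation confirms the pressure drops from three to two. This is another Python toy: each instruction is modeled as a pair of def/use sets, and z is assumed live-in since it holds the array address in both versions:

def max_pressure(block):
    # Walk backwards, maintaining the live set; the peak size of the
    # live set across any instruction is the register pressure.
    live, peak = set(), 0
    for defs, uses in reversed(block):
        live = (live - defs) | uses         # live-in = (live-out - defs) + uses
        peak = max(peak, len(live | defs))  # values alive across this point
    return peak

before = [({"x"}, set()), ({"y"}, set()),
          (set(), {"z", "x"}), (set(), {"z", "y"})]
after  = [({"x"}, set()), (set(), {"z", "x"}),
          ({"y"}, set()), (set(), {"z", "y"})]
print(max_pressure(before), max_pressure(after))  # => 3 2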
Update 2: The scheduling doesn't actually produce much of a speedup yet. It makes the fasta benchmark about 6% faster, and other benchmarks are unchanged as far as I can tell. Oops! I should have benchmarked before blogging. Anyway, I think I might be able to improve this by making a more accurate dependence graph, so there aren't so many false control dependences. At this point, the optimization isn't very useful, but I'm just happy that I managed to get any speedup at all.
Update 3: With a couple of minor modifications, it actually makes the fasta benchmark 9% faster, and removes 29% of spills and 26% of reloads, but this still isn't so great. Also, the total size of compiled code in the development image decreased by 1.5% with this change.
Update 4: I modified the dependence graph construction code to reorder stack instructions (peek and replace). To do this without extra false dependences, I revived the old stack height normalization code, which is necessary because stack heights may change multiple times within a basic block due to block joining. Stack height normalization was removed from the compiler when the new DCN was added. Now, around 35% of spills and reloads are eliminated, compared to the version without scheduling.
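To sketch why normalization matters, here's a simplified Python model; Factor's ##peek and ##replace don't literally look like this, and the representation is made up for the example. A stack access at an unknown height has to be treated as potentially aliasing every other one, while normalized heights let the scheduler prove two accesses touch different cells and skip the dependence edge:

def depends(i1, i2):
    # Each instruction is (op, slot, height): `slot` is relative to the
    # stack height at that point; `height` is None when unnormalized.
    (op1, slot1, h1), (op2, slot2, h2) = i1, i2
    if "replace" not in (op1, op2):
        return False                 # two reads never need an edge
    if h1 is None or h2 is None:
        return True                  # unknown heights: assume aliasing
    return h1 + slot1 == h2 + slot2  # conflict only on the same cell

# With normalized heights, these two hit different cells -- no edge:
print(depends(("replace", 0, 2), ("peek", 0, 3)))         # => False
# Without normalization, we must conservatively order them:
print(depends(("replace", 0, None), ("peek", 0, None)))   # => True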