Wednesday, February 10, 2010

Instruction scheduling for register pressure

Today, I finished the first version of a compiler optimization for Factor's compiler that reorders instructions in order to reduce the number of spills and reloads that the register allocator has to emit. I've been working on this since last fall, so I'm really happy to have a working version. In the current version, it eliminates 20% of those operations in a full development image on the 32-bit x86 architecture. 32-bit x86 is the main target for this optimization because it has so few registers. Scheduling for register allocation isn't an optimization most compilers do, but it looks like it's a profitable one.

The algorithm I used is from the paper Register-sensitive selection, duplication, and sequencing of instructions by Vivek Sarkar et al. (I only did the sequencing part.) The main idea of this paper is to adapt the Sethi-Ullman algorithm, which finds the instruction ordering with minimum register pressure for a tree, to the more general case of a DAG where some edges represent data dependence and other edges represent control dependence. It uses a backwards list scheduling pass to order the instructions based on a heuristic derived from the Sethi-Ullman numbering.

The whole implementation took around 300 lines of high-level Factor code. It's in two vocabs compiler.cfg.dependence and compiler.cfg.scheduling. The first vocab defines procedures for constructing the dependence graph of a basic block, as well as partitioning that graph into fan-in trees and calculating the Sethi-Ullman numbering for these trees. The second vocab defines a word schedule-instructions which schedules the instructions of each basic block of a CFG. This word is called after write barrier elimination and before representation selection. To save time, only blocks which might cause register spilling are scheduled.

Update: The code is now up in a branch of my git repository.

On Reddit, Slava Pestov explains why scheduling instructions can reduce register pressure.

This is a really cool optimization, but Dan doesn't explain why re-ordering instructions can reduce the number of spills and reloads. If this doesn't make sense to you, here's a really basic example. Suppose you have this sequence of operations in a hypothetical SSA IR:

x = 1
y = 2
z[0] = x
z[1] = y

This needs three registers to compile without spilling, one for each one of x y and z. If your CPU only has two registers, there will be additional memory accesses generated.
If we re-arrange the instructions like so, and assuming that x and y are not referenced anywhere else, then x and y can share a register, since their live ranges no longer overlap:

x = 1
z[0] = x
y = 2
z[1] = y

This code no longer generates spills on a two-register CPU.


Update 2: The scheduling doesn't actually produce much of a speedup yet. It makes the fasta benchmark about 6% faster, and other benchmarks are unchanged as far as I can tell. Oops! I should have benchmarked before blogging. Anyway, I think I might be able to improve this by making a more accurate dependency graph, so there aren't so many false control dependences. At this point, the optimization isn't very useful, but I'm just happy that I managed to get any speedup at all.

Update 3: With a couple minor modifications, it actually makes the fasta benchmark 9% faster, and removes 29% of spills and 26% of reloads, but this still isn't so great. Also, the total size of compiled code in the development image decreased by 1.5% with this change.

Update 4: I modified the dependence graph construction code to reorder stack instructions (peek and replace). To do this without extra false dependences, I revived the old stack height normalization code, which is necessary because stack heights may change multiple times within a basic block due block joining. Stack height normalization was removed from the compiler when the new DCN was added. Now, around 35% of spills and reloads are eliminated, compared to the version without scheduling.

3 comments:

Anonymous said...

What? When the eax register has up to 100 shadow register?

Daniel Ehrenberg said...

Sure, but these registers aren't available to the register allocator when it's generating code. It still needs to make spilling decisions based on a smaller register set.

Anonymous said...

Memory is still slow. Register spills cost a load and possibly a store. Those kinds of instructions have a minimum latency of 3-6 cycles, and on a core that can execute three ALU ops per cycle that's far more significant than pressure on hidden rename registers.

This is why instruction scheduling is mostly just a huge pile of wank outside critical inner loops. And that kind of critical inner loop doesn't occur in the vast majority of applications.