regexpvocabulary by Doug Coleman, cleaning it up and adding new features with the goal of making it usable for the
xmodesyntax highlighter. This means compatibility with most of Java's regexp syntax.
The implementation strategy Doug and I are using is standard textbook stuff, the way Lex works: from the regular expression, a nondeterministic finite automaton (NFA) is constructed, and this is converted to a deterministic finite automaton (DFA). This can be used to efficiently match the string in linear time. Russ Cox wrote a good introduction to all of this.
There's a limitation: backreferences (like
/(.*)\1/) are not supported, since they're incompatible with the NFA/DFA model. But there's no good way to implement backreferences, anyway: parsing with them is NP-complete. Perl uses a backtracking model where backreferences are easy to implement, but in certain cases the backtracking gets out of hand and performance is worse than linear.
Today, I worked out the code to minimize DFAs. The DFA minimization algorithm is really nice, so I thought I would share it with you. The implementation is just 65 lines of Factor code, which is in the Factor pastebin.
The issue is that sometimes, naively constructing a DFA gives you duplicate states which are really the same thing. For example, if you have two final states which each have no outward transitions, they can be consolidated. If you have the regular expression
/ac|bc/, then there's a DFA for this in just 3 states. The naive way, however, would give you 5 states.
What we want is to partition the states into sets of states that are all the same, and then use only one of these. In mathematical language, we want to create an equivalence relation and quotient out by it. Here's how we figure out what states are the same.
- Assume that all states are the same*.
- Separate the final states from the non-final states.
- Repeat the following until it doesn't make a change in the partitioning:
- Separate any two states which have a transition with the same label to two states which are separated.
Once this doesn't change anything, the states have been divided into the sets that we want.
Interestingly, this algorithm is pretty similar to optimistic global value numbering, a compiler optimization. Optimistic value numbering is a technique for eliminating duplicated computations. It works by assuming that all registers hold the same value, and there is an iteration until fixpoint tries to separate registers which are actually different from each other. When faced with loops, this can catch more than so-called pessimistic value numbering, which first assumes that everything is different, until it can prove that two registers hold the same value.
In the simple, controlled environment of a DFA, it's been proved that this minimization actually produces the smallest possible DFA matching the same set of strings. It's even been shown that, for a given regular language, the minimal DFA recognizing it is unique up to isomorphism. Such a nice analysis isn't possible in the more complicated world of compiler optimizations, however, where better optimizations than GVN exist.
*Actually, we assume that any two states with the same labeled outward transitions are the same. For example, if a state A goes to state B on character x and state C on character y, and state D goes to E on either x or y, then we'll assume at the beginning that A and E are the same. This is a simple modification I came up with to deal with the effectively infinite alphabet of Unicode, since it would be impossible to compare the transition on each Unicode code point.