I've been working on the `regexp` vocabulary by Doug Coleman, cleaning it up and adding new features with the goal of making it usable for the `xmode` syntax highlighter. This means compatibility with most of Java's regexp syntax.

The implementation strategy Doug and I are using is standard textbook stuff, the way Lex works: from the regular expression, a nondeterministic finite automaton (NFA) is constructed, and this is converted to a deterministic finite automaton (DFA). The DFA can then match a string in time linear in its length. Russ Cox wrote a good introduction to all of this.
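As a rough illustration of the determinization step, here's a sketch of the textbook subset construction in Python (the actual implementation is in Factor; the NFA representation and all names below are my own invention, not the real code):

```python
# Sketch of the NFA -> DFA subset construction. The NFA is represented as
# transitions: (state, char) -> set of states, and epsilon: state -> set of
# states. Each DFA state is a frozenset of NFA states.

def epsilon_closure(states, epsilon):
    """All NFA states reachable from `states` via epsilon transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in epsilon.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(start, transitions, epsilon):
    start_set = epsilon_closure({start}, epsilon)
    dfa = {}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        # Collect the labels leaving any NFA state in this subset.
        labels = {c for (s, c) in transitions if s in current}
        for c in labels:
            targets = set()
            for s in current:
                targets |= transitions.get((s, c), set())
            target = epsilon_closure(targets, epsilon)
            dfa[current][c] = target
            worklist.append(target)
    return start_set, dfa

# NFA for /ab/ built by hand: 0 -a-> 1 -b-> 2, no epsilon transitions.
nfa = {(0, 'a'): {1}, (1, 'b'): {2}}
start, dfa = subset_construction(0, nfa, {})
# The resulting DFA has three states: {0}, {1}, {2}.
```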

There's a limitation: backreferences (like `/(.*)\1/`) are not supported, since they're incompatible with the NFA/DFA model. But there's no good way to implement backreferences anyway: matching with them is NP-complete. Perl uses a backtracking model where backreferences are easy to implement, but in certain cases the backtracking gets out of hand and matching takes exponential time.

Today, I worked out the code to minimize DFAs. The DFA minimization algorithm is really nice, so I thought I would share it with you. The implementation is just 65 lines of Factor code, which is in the Factor pastebin.

The issue is that naively constructing a DFA sometimes gives you duplicate states which are really the same thing. For example, two final states which each have no outward transitions can be consolidated. If you have the regular expression `/ac|bc/`, there's a DFA for it with just 3 states; the naive construction, however, gives you 5.

What we want is to partition the states into sets of states that all behave the same, and then use only one state from each set. In mathematical language, we want to create an equivalence relation and quotient out by it. Here's how we figure out which states are the same:

- Assume that all states are the same.*
- Separate the final states from the non-final states.
- Repeat the following until it no longer changes the partitioning:
  - Separate any two states which have transitions with the same label to two states which are already separated.

Once this doesn't change anything, the states have been divided into the sets that we want.
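That refinement loop can be sketched in Python (the real implementation is 65 lines of Factor; the state and transition representations below are illustrative, not the actual code):

```python
# Sketch of DFA minimization by partition refinement. The DFA is given as a
# set of states, a set of final states, and delta: (state, label) -> state.

def minimize(states, finals, delta):
    labels = {l for (_, l) in delta}
    # Start by separating final states from non-final states.
    partition = [p for p in (set(finals), set(states) - set(finals)) if p]

    def block_of(state):
        for i, block in enumerate(partition):
            if state in block:
                return i
        return None  # missing transition

    changed = True
    while changed:
        changed = False
        new_partition = []
        for block in partition:
            # Group this block's states by which block each label leads to.
            groups = {}
            for s in block:
                key = tuple(block_of(delta.get((s, l))) for l in sorted(labels))
                groups.setdefault(key, set()).add(s)
            new_partition.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = new_partition
    return partition

# The naive 5-state DFA for /ac|bc/: 0 -a-> 1, 0 -b-> 2, 1 -c-> 3, 2 -c-> 4,
# with 3 and 4 final. Minimization merges {1, 2} and {3, 4}, giving 3 blocks.
states = {0, 1, 2, 3, 4}
finals = {3, 4}
delta = {(0, 'a'): 1, (0, 'b'): 2, (1, 'c'): 3, (2, 'c'): 4}
blocks = minimize(states, finals, delta)
```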

Interestingly, this algorithm is pretty similar to optimistic global value numbering, a compiler optimization. Optimistic value numbering is a technique for eliminating duplicated computations. It works by assuming that all registers hold the same value, then iterating to a fixpoint, separating registers which are actually different from each other. When faced with loops, this can catch more duplicates than so-called pessimistic value numbering, which starts by assuming that everything is different and only merges registers once it can prove that they hold the same value.

In the simple, controlled environment of a DFA, it's been proved that this minimization actually produces the smallest possible DFA matching the same set of strings. It's even been shown that, for a given regular language, the minimal DFA recognizing it is unique up to isomorphism. Such a nice analysis isn't possible in the more complicated world of compiler optimizations, however, where better optimizations than GVN exist.

*Actually, we start by assuming that any two states with the same set of labels on their outward transitions are the same. For example, if state A goes to state B on character x and to state C on character y, and state D goes to E on either x or y, then we'll assume at the beginning that A and D are the same. This is a simple modification I came up with to deal with the effectively infinite alphabet of Unicode, since it would be infeasible to compare the transition on each Unicode code point.
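That initial grouping by outward transition labels can be sketched like this (again a Python sketch with an illustrative transition-table representation):

```python
# Sketch: the initial partition groups states by the set of labels on their
# outgoing transitions, so we never enumerate the whole Unicode alphabet.

def initial_partition(states, delta):
    """delta maps (state, label) -> target state."""
    groups = {}
    for s in states:
        signature = frozenset(l for (t, l) in delta if t == s)
        groups.setdefault(signature, set()).add(s)
    return list(groups.values())

# A goes to B on x and to C on y; D goes to E on either x or y. A and D have
# the same label set {x, y}, so they start out in the same group; B, C, and E
# have no outgoing transitions and form another group.
states = {'A', 'B', 'C', 'D', 'E'}
delta = {('A', 'x'): 'B', ('A', 'y'): 'C', ('D', 'x'): 'E', ('D', 'y'): 'E'}
partition = initial_partition(states, delta)
```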

## 5 comments:

Building the minimized DFA might take exponential space (and therefore time), since the number of reachable combinations of `x` NFA states can be up to `2^x`. At least the construction cost is only paid once.

The minimization algorithm you describe is O(n^2). Hopcroft's algorithm is more complicated, but has a worst case of O(n lg n).

You can find Hopcroft's original paper online: "An n log n Algorithm for Minimizing States in a Finite Automaton". There's a followup article by David Gries called "Describing an algorithm by Hopcroft", but it is not available online.

The algorithm's not easy going, but could be worth it if the asymptotic complexity is important to you. I have implemented the algorithm here for my project Gazelle, but I have not rigorously verified my implementation (I have been using it for a while though).

TuringTest, you're right, it is pretty slow to construct a DFA. For example, the minimal DFA for /(0|...*)(0|...*)(0|...*)(0|...*)/ has 30 states (assuming my minimization code has no bugs), and adding another (0|...*) seems to make the number of states grow faster than linearly, though I'm not sure how much faster. But I don't really care, since the regular expressions are usually small, their DFAs usually don't take that much space, and they're usually available before runtime. If I run into problems with this approach, I might later switch to a caching NFA approach, where the DFA is lazily constructed while a string is matched. But I think this would be slower than precomputing the DFA, when there's time and space for that.
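That caching idea might look roughly like this (a Python sketch of the general technique, not Factor's actual code):

```python
# Sketch of the lazy "caching NFA" approach: DFA states are subsets of NFA
# states, built on demand during matching and memoized in `cache`.

def lazy_match(string, start, transitions, finals):
    """NFA without epsilon transitions: transitions maps (state, char) -> set."""
    cache = {}  # (frozenset of NFA states, char) -> frozenset of NFA states
    current = frozenset({start})
    for c in string:
        key = (current, c)
        if key not in cache:
            nxt = set()
            for s in current:
                nxt |= transitions.get((s, c), set())
            cache[key] = frozenset(nxt)
        current = cache[key]
        if not current:
            return False  # dead state: no NFA state is still alive
    return bool(current & finals)

# NFA for /ab|ac/: 0 -a-> 1, 1 -b-> 2, 1 -c-> 2, with state 2 final.
transitions = {(0, 'a'): {1}, (1, 'b'): {2}, (1, 'c'): {2}}
print(lazy_match("ab", 0, transitions, {2}))  # True
print(lazy_match("ad", 0, transitions, {2}))  # False
```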

Josh, since I'm using this exponential algorithm, the difference between O(n^2) and O(n log n) isn't so important to me. (Even if the minimization runs on exponential-sized input, where the difference would matter, I'm already pretty screwed performance-wise.) But maybe I'll come back to this later if it ends up being an issue.

Josh, you can (now?) find David Gries's article online (as a scanned PDF/PS) at http://hdl.handle.net/1813/6002.

According to some papers (http://www.dcc.fc.up.pt/dcc/Pubs/TReports/TR07/dcc-2007-03.pdf), there are algorithms that outperform Hopcroft's algorithm. They state that for medium-to-large alphabets (more than 20 letters), Hopcroft's algorithm is not efficient.
