Tuesday, March 17, 2009

The implementation of Factor's regexp library

I've been working on Factor's regular expression library, initially written by Doug Coleman, for the past few weeks. Recently, the library became good enough that I've pushed it to Factor's main repository. The latest Factor binaries have this new library.

The library uses an standard algorithm of converting a regular expression into an NFA, and that into a DFA which can be executed. This is a tradeoff: the code generated will be faster than you would get from a backtracking search or an NFA interpreter, but it takes exponential time, in the worst case, to generate the DFA. I might revisit this later.

The main features missing now are
  • Possessive and reluctant matching
  • Group capture
  • Unicode support, in the form of UTS 18 level 1 compliance with some level 2 features

Right now, I'm working on Unicode support. Regexps already use Unicode strings, because all strings in Factor represent a sequence of Unicode code points, but not many Unicode properties are exposed now. I plan on working on this, and implementing more Unicode algorithms and properties, to reach level 1 compliance.

The rest of this article is an overview of how the regexp engine works. It is implemented as a series of passes, where the first pass takes a string as input, and the last pass outputs a word which runs the code of the regexp. In this way, it is rather like any other compiler, where the parse tree, DFA table and NFA table are just intermediate representations.

The parser

The parser is implemented with Chris Double's packrat parsing library. This makes it not very efficient, but the time spent in parsing is much less than the time in later processing stages, so the cost isn't very large. Things like /a{2,4}/ are expanded into the equivalent, but simpler, form /aa|aaa|aaaa/.

If I were working only with ASCII, then ranges would be expanded into disjunctions as well, but the alphabet is far too big for that. Instead, something like /[ac-z]/ is represented in the syntax tree as a item, a character class object, representing the information that it matches the character a, or something in the class which is the range c-z. For a character class like /[^\dc]/, an object is created which represents a character which is not a digit or c.

Constructing an NFA

From the syntax tree, a nondeterministic finite-state automaton is built. The algorithm is described here, and there is nothing special about the implementation.

Lookahead, lookbehind and anchors (like $ and ^) expand to entries in the syntax tree called tagged epsilons. When these are encountered in building the NFA, an epsilon transition is created which is annotated with this information.

Negation is implemented here. If a negation syntax node is encountered, then the NFA builder constructs an NFA for the enclosed term, disambiguates it, converts it to a DFA, minimizes it, and attaches it back to the larger NFA that is being constructed.

Disambiguation

As I previously described, since the implementation doesn't iterate over every element of the alphabet, there needs to be a procedure to make transitions over characters have disjoint labels. Transitions are labeled by sets, and the output from creating an NFA might have intersecting outgoing sets from a transition.

The best way I've thought of doing this is to get all of the intersections of all of the edge labels, basically forming a Venn diagram. This is, unfortunately, exponential time and space to do. But I see no way of avoiding it when compiling a regular expression like /\p{letter}a|[0-9a-z]b|\{script=latin}c|.../ where there are a large number of incomparable character classes used.

I implemented a small optimization for this: numbers (ie literal characters) are set aside at the beginning and treated specially, so work isn't wasted intersecting them with other classes. The complexity of the algorithm stays exponential, but instead of being exponential in the total number of character classes in the regexp, it becomes exponential in just the non-literal classes.

Constructing a DFA

This is also a standard algorithm. My only modification is to support the tagged epsilon transitions created by lookaround and anchors. I described the modification in a previous blog post.

Minimization

Next, the resulting DFA is minimized. I wrote about regexp minimization before. The algorithm had to be modified slightly to allow for the conditional transitions introduced by processing lookaround in the previous step.

Compilation to Factor

At this point, we have a nice minimal DFA with disjoint outward transitions. Translating it into Factor code is actually quite easy. For each state, we make a gensym. The gensym takes as arguments a string and an index. If the index is at the end of the string, the word returns a boolean, indicating whether the current state is an accepting state. If the index is not at the end of the string, the current character is found, and the word figures out which transition to take. A transition is taken by incrementing the index and then making a tail call to another state word.

The strategy for finding the right transition is somewhat complicated. First, the literal transitions (over constant characters) are partitioned out from the non-literal transitions. The literal transitions are formed into a case statement, where the default case handles non-literal transitions.

Non-literal transitions are all boolean expressions, built with the class algebra described below. They have a number of logic variables (classes of characters). So we can build a truth table over the logic variables, and test each condition exactly once to figure out which transition to take. For example, in the regexp /\p{script=latin}b|\p{lower}c/, the DFA will have three transitions from the start: one over characters which are Latin script and lower-cased, one over characters which are lower-cased but not in Latin script, and one over characters which are in Latin script but not lower-cased. Rather than having the compiled DFA check if the character is in the composite classes directly (which would duplicate cost, since it would be looked up multiple times if a character is lower-cased or Latin script), the compiler builds nested if statements that figure out the composite class while testing each property only once. This leads directly to the transition.

Class algebra

There is a system for simplifying the intersections built in disambiguation, as well as character class literals. It is built off simplifying logical expressions built with and, or, not. The things contained are true (the whole set of characters), false (the empty set), and all possible character classes.

There are constructors for these three logical operations, and then a few simple tactics are used to reduce them. Reducing the expression to simplest form is equivalent to circuit minimization. A friend told me that this is on the second level of the polynomial hierarchy. So I'll just live with the heuristics.

The not constructor is simple. If it's given true, it outputs false. False to true. If it's given a negation as input, it returns the contents. If it's given an and class, it uses De Morgan's law and negates each entry, returning an or. And vice versa.

The and/or constructors are slightly more complicated. I will describe how the and constructor works; the or constructor can be easily derived using De Morgan's law. The input is a sequence of classes, and we want to get their intersection. First, if the input contains intersection (and) classes, these are flattened into the larger sequence. Next, the sequence is sorted into categories: integers, negations of integers, simple classes (like the class digits), negations of those, union classes (ors), and booleans. Delete true from the booleans list, if it's there, as it cannot affect the outcome. If there is a false in the booleans list, then the answer is false. If there is more than one integer, the answer is immediately false. If there is exactly one integer, then the answer is that integer if it is contained in all of the other classes, otherwise false. Now, we are working with a sequence which does not have integer literals, or true or false. If there is a simple class and a not-simple class for the same class, we know that their intersection is false, so the entire expression is false. We can remove not-integers where the integer is contained in an existing not-simple class, as these are redundant. Finally, the or classes within the and class can be simplified in the case where they have logic variables overlapping with other things in the and class: these can all be substituted with true. For example, if you have and(lowercase, or(lowercase, latin)), this can be simplified to and(lowercase, latin). This is because true is substituted for lowercase in the or expression, and or(true, latin) simplifies to latin.

Previously, class algebra was not this strong in what reductions it did. This caused problems. For example, nested negations (useful in implementing conjunction) would result in multiple nested disambiguations, which would cause a very fast blowup in the size of the DFA. Now, running disambiguation twice gives the same results as running it once. (In other words, disambiguation is idempotent.) At least I think so.

Conclusion

Implementing regular expressions took a lot more thinking than I expected, but I'm satisfied with the result so far. Unlike traditional regular expressions, where pathologies might make regexp matching slow, in this system pathologies make regexp compilation time slow. That seems more acceptable to me. Factor's extensible syntax allows me to make regexp literals, which compile before the program runs, even though regexps are a library rather than built in.

If I were starting from scratch, I might instead use the algorithm of constructing a DFA directly from a regular expression using regexp derivatives. It's described in this paper [PDF]. I'm not sure how performance in practice compares, but it implements negation and conjunction in a much cleaner way, and basically eliminates the need for minimization. Most importantly, it allows for a variation of disambiguation which avoids exponential behavior in many cases.

In a later blog post, I'll talk about the API that I expose for regexps.


I haven't actually implemented \p{script=foo} yet, but I will very soon.

5 comments:

Slava Pestov said...

You did a good job with the regexp library, Dan. I'm looking forward to the group capture and Unicode support too.

TuringTest said...

Thanks for publishing these regex design blog posts.

I wrote the regex-tdfa library for Haskell, and I am curious about your group capture support.

Are you finding the leftmost-longest overall match?

Are you trying to follow the POSIX standard for capturing parenthesized subexpressions?

Also: I expect that converting a{2,4} into aa|aaa|aaa is quadratic in size, but converting into something like aa(aa?)? is only linear in size.

Finally: I found testing tricky, and these unit tests invaluable.

Daniel Ehrenberg said...

TuringTest,

I haven't implemented group capture yet. I was planning on using the TDFA algorithm, but I wasn't planning on capturing all subexpressions. Or maybe I'll use one of the other algorithms capable of building a parse tree, which would handle kleene star better. My plan was to capture only specially named subexpressions, maybe with a syntax like foo(*bar:baz)bing to capture baz under the name bar inside the regexp foobazbing.

Oh, that's a clever trick for a{n,m}. I should use that. Compile time for those kinds of expressions is a big problem now.

Overall, I find the leftmost longest match, but I do it using a really stupid algorithm: I start matching the longest match at each index, until I find a match. There are a bunch of better algorithms for this, and I plan to fix it soon, to make finding all matches linear time in the worst case.

Thanks for linking me to those tests. I didn't think any testing package like that existed.

Dirk Moebius said...

Nitpick: a{2,4} should be expanded to aaa?a? instead of aa(aa?)? (and a{3,} could be expanded to aaa+ btw.)

Btw.: Plan 9's libregexp9 is a NFA based regex matcher with Unicode support. It is missing \p character classes but it features [a-z] and [^a-z] groups, and I think character classes could be transformed into [...] groups in linear time.

Anonymous said...

PJ
批發

seo

網路行銷