Tuesday, March 17, 2009

The implementation of Factor's regexp library

I've been working on Factor's regular expression library, initially written by Doug Coleman, for the past few weeks. Recently, the library became good enough that I've pushed it to Factor's main repository. The latest Factor binaries have this new library.

The library uses the standard algorithm of converting a regular expression into an NFA, and then converting that into a DFA which can be executed. This is a tradeoff: the generated code will be faster than you would get from a backtracking search or an NFA interpreter, but in the worst case it takes exponential time to generate the DFA. I might revisit this later.

The main features missing now are
  • Possessive and reluctant matching
  • Group capture
  • Unicode support, in the form of UTS 18 level 1 compliance with some level 2 features

Right now, I'm working on Unicode support. Regexps already use Unicode strings, because all strings in Factor represent a sequence of Unicode code points, but not many Unicode properties are exposed now. I plan on working on this, and implementing more Unicode algorithms and properties, to reach level 1 compliance.

The rest of this article is an overview of how the regexp engine works. It is implemented as a series of passes, where the first pass takes a string as input and the last pass outputs a word which runs the code of the regexp. In this way, it is rather like any other compiler, where the parse tree, NFA table and DFA table are just intermediate representations.

The parser

The parser is implemented with Chris Double's packrat parsing library. This makes it not very efficient, but the time spent in parsing is much less than the time in later processing stages, so the cost isn't very large. Things like /a{2,4}/ are expanded into the equivalent, but simpler, form /aa|aaa|aaaa/.
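To illustrate that expansion concretely, here is a rough Python sketch of the idea. The real parser works on syntax trees rather than strings, and the function name is invented:

    # Sketch: expand a bounded repetition term{m,n} into an alternation,
    # mirroring how /a{2,4}/ becomes /aa|aaa|aaaa/.
    def expand_repetition(term, low, high):
        # term is the source text of the repeated item, e.g. "a"
        return "|".join(term * count for count in range(low, high + 1))

    print(expand_repetition("a", 2, 4))  # prints aa|aaa|aaaa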

If I were working only with ASCII, then ranges would be expanded into disjunctions as well, but the alphabet is far too big for that. Instead, something like /[ac-z]/ is represented in the syntax tree as a character class object, recording that it matches either the character a or something in the range c-z. For a character class like /[^\dc]/, an object is created representing any character which is neither a digit nor c.
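Here is a small Python sketch of what such character class objects might look like. The class names and representation are invented for illustration and are not the library's actual ones:

    # Sketch of character-class objects like the ones described above.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Range:
        low: str
        high: str
        def matches(self, ch):
            return self.low <= ch <= self.high

    @dataclass(frozen=True)
    class Union:
        members: tuple                     # single characters or other classes
        def matches(self, ch):
            return any(m == ch if isinstance(m, str) else m.matches(ch)
                       for m in self.members)

    @dataclass(frozen=True)
    class Not:
        inner: object
        def matches(self, ch):
            return not self.inner.matches(ch)

    ac_to_z = Union(("a", Range("c", "z")))          # /[ac-z]/
    digit = Range("0", "9")                          # stand-in for \d
    not_digit_or_c = Not(Union(("c", digit)))        # /[^\dc]/

    print(ac_to_z.matches("d"), not_digit_or_c.matches("5"))   # True False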

Constructing an NFA

From the syntax tree, a nondeterministic finite-state automaton is built. The algorithm is described here, and there is nothing special about the implementation.
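For concreteness, here is a minimal Python sketch of the textbook construction. The representation (integer states, a transition dictionary, None as the epsilon label) is invented for illustration and is not the library's:

    # Sketch: build NFA fragments for literals, concatenation and alternation.
    # A fragment is (start state, end state, transitions), where transitions
    # maps (state, label) to a set of states and the label None means epsilon.
    import itertools

    fresh = itertools.count()

    def literal(ch):
        start, end = next(fresh), next(fresh)
        return start, end, {(start, ch): {end}}

    def concatenate(a, b):
        a_start, a_end, a_trans = a
        b_start, b_end, b_trans = b
        trans = {**a_trans, **b_trans}
        trans.setdefault((a_end, None), set()).add(b_start)
        return a_start, b_end, trans

    def alternate(a, b):
        a_start, a_end, a_trans = a
        b_start, b_end, b_trans = b
        start, end = next(fresh), next(fresh)
        trans = {**a_trans, **b_trans}
        trans[(start, None)] = {a_start, b_start}
        trans.setdefault((a_end, None), set()).add(end)
        trans.setdefault((b_end, None), set()).add(end)
        return start, end, trans

    nfa = alternate(concatenate(literal("a"), literal("b")), literal("c"))   # /ab|c/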

Lookahead, lookbehind and anchors (like $ and ^) expand to entries in the syntax tree called tagged epsilons. When these are encountered in building the NFA, an epsilon transition is created which is annotated with this information.

Negation is implemented here. If a negation syntax node is encountered, then the NFA builder constructs an NFA for the enclosed term, disambiguates it, converts it to a DFA, minimizes it, and attaches it back to the larger NFA that is being constructed.
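The complement step itself isn't spelled out above; the standard way to negate a DFA, sketched below in Python with invented names, is to complete it with a dead state and then swap which states accept. In the library's setting, the "alphabet" would be the disjoint character classes produced by disambiguation rather than individual characters:

    # Sketch: complement a DFA by completing it and flipping accepting states.
    def complement_dfa(states, alphabet, delta, start, accepting):
        dead = "dead"                              # assumed not already a state name
        new_states = set(states) | {dead}
        new_delta = {}
        for state in new_states:
            for symbol in alphabet:
                # missing transitions go to the dead state, making the DFA total
                new_delta[(state, symbol)] = delta.get((state, symbol), dead)
        new_accepting = new_states - set(accepting)
        return new_states, alphabet, new_delta, start, new_accepting

    # Tiny example: a DFA accepting only "a", complemented.
    states = {"start", "saw_a"}
    delta = {("start", "a"): "saw_a"}
    print(complement_dfa(states, {"a", "b"}, delta, "start", {"saw_a"})[4])
    # accepting states of the complement: {'start', 'dead'}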

Disambiguation

As I previously described, since the implementation doesn't iterate over every element of the alphabet, there needs to be a procedure to make the transitions over characters have disjoint labels. Transitions are labeled by sets of characters, and in the NFA that comes out of the previous pass, the labels on the outgoing transitions of a state may intersect.

The best way I've thought of doing this is to take all of the intersections of all of the edge labels, basically forming a Venn diagram. This unfortunately takes exponential time and space. But I see no way of avoiding it when compiling a regular expression like /\p{letter}a|[0-9a-z]b|\p{script=latin}c|.../ where a large number of incomparable character classes is used.
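Here's one way to picture the refinement, as a Python sketch: each character belongs to exactly one "cell" of the Venn diagram, determined by which of the original labels contain it. The sketch uses explicit finite sets purely to show the idea; the real code has to do this symbolically with the class algebra described below, since the alphabet is far too large to enumerate.

    # Sketch: refine overlapping transition labels into disjoint cells.
    def refine(labels):
        cells = {}
        for label in labels:
            for ch in label:
                # the cell a character lands in is determined by exactly
                # which of the original labels contain it
                key = frozenset(i for i, l in enumerate(labels) if ch in l)
                cells.setdefault(key, set()).add(ch)
        return list(cells.values())

    print(refine([set("abcde"), set("cdefg")]))
    # three disjoint cells: {a,b}, {c,d,e}, {f,g}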

I implemented a small optimization for this: numbers (i.e. literal characters) are set aside at the beginning and treated specially, so no work is wasted intersecting them with other classes. The complexity of the algorithm stays exponential, but instead of being exponential in the total number of character classes in the regexp, it is exponential only in the number of non-literal classes.

Constructing a DFA

This is also a standard algorithm. My only modification is to support the tagged epsilon transitions created by lookaround and anchors. I described the modification in a previous blog post.

Minimization

Next, the resulting DFA is minimized. I wrote about regexp minimization before. The algorithm had to be modified slightly to allow for the conditional transitions introduced by processing lookaround in the previous step.

Compilation to Factor

At this point, we have a nice minimal DFA with disjoint outward transitions. Translating it into Factor code is actually quite easy. For each state, we make a gensym. The gensym takes as arguments a string and an index. If the index is at the end of the string, the word returns a boolean, indicating whether the current state is an accepting state. If the index is not at the end of the string, the current character is found, and the word figures out which transition to take. A transition is taken by incrementing the index and then making a tail call to another state word.
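To show the shape of the generated code, here is what the matcher for something like /ab*/ would look like if it were written out by hand in Python. The real output is one Factor word per state, and Factor's tail calls don't consume stack the way Python's recursion would:

    # Sketch: one function per DFA state, taking the string and current index.
    def state0(string, index):                   # start state, not accepting
        if index == len(string):
            return False
        ch = string[index]
        if ch == "a":
            return state1(string, index + 1)     # increment the index, then "tail call"
        return False                             # no transition on this character

    def state1(string, index):                   # accepting state, loops on b
        if index == len(string):
            return True
        ch = string[index]
        if ch == "b":
            return state1(string, index + 1)
        return False

    print(state0("abbb", 0), state0("ba", 0))    # True False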

The strategy for finding the right transition is somewhat complicated. First, the literal transitions (over constant characters) are partitioned out from the non-literal transitions. The literal transitions are formed into a case statement, where the default case handles non-literal transitions.

Non-literal transitions are all boolean expressions, built with the class algebra described below. They have a number of logic variables (classes of characters). So we can build a truth table over the logic variables, and test each condition exactly once to figure out which transition to take. For example, in the regexp /\p{script=latin}b|\p{lower}c/, the DFA will have three transitions from the start: one over characters which are Latin script and lower-cased, one over characters which are lower-cased but not in Latin script, and one over characters which are in Latin script but not lower-cased. Rather than having the compiled DFA check if the character is in the composite classes directly (which would duplicate cost, since it would be looked up multiple times if a character is lower-cased or Latin script), the compiler builds nested if statements that figure out the composite class while testing each property only once. This leads directly to the transition.
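A Python sketch of that dispatch for the /\p{script=latin}b|\p{lower}c/ example is below. The two property tests are crude stand-ins I made up for illustration, and the state names are invented; the point is just that each property is checked once and the nested branches select the composite class.

    # Sketch: pick one of the three composite transitions while testing
    # each underlying property only once.
    import unicodedata

    def is_lower(ch):
        return ch.islower()                           # stand-in for \p{lower}

    def is_latin(ch):
        return "LATIN" in unicodedata.name(ch, "")    # stand-in for \p{script=latin}

    def start_transition(ch):
        if is_lower(ch):
            return "to_lower_and_latin" if is_latin(ch) else "to_lower_not_latin"
        return "to_latin_not_lower" if is_latin(ch) else None

    print(start_transition("a"), start_transition("λ"), start_transition("A"))
    # a: lower-case and Latin; λ: lower-case, not Latin; A: Latin, not lower-case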

Class algebra

There is a system for simplifying the intersections built during disambiguation, as well as character class literals. It is built on simplifying logical expressions formed with and, or and not. The atoms are true (the whole set of characters), false (the empty set), and all of the possible character classes.

There are constructors for these three logical operations, and then a few simple tactics are used to reduce them. Reducing the expression to simplest form is equivalent to circuit minimization. A friend told me that this is on the second level of the polynomial hierarchy. So I'll just live with the heuristics.

The not constructor is simple. Given true, it returns false, and given false, true. Given a negation as input, it returns the contents. Given an and class, it uses De Morgan's law, negating each entry and returning an or, and vice versa for an or class.

The and/or constructors are slightly more complicated. I will describe how the and constructor works; the or constructor can easily be derived from it using De Morgan's law. The input is a sequence of classes, and we want to get their intersection. The reduction goes roughly like this:
  • If the input contains intersection (and) classes, these are flattened into the larger sequence.
  • The sequence is sorted into categories: integers (literal characters), negations of integers, simple classes (like the class of digits), negations of those, union (or) classes, and booleans.
  • Delete true from the booleans list, if it's there, since it cannot affect the outcome. If there is a false in the booleans list, then the answer is false.
  • If there is more than one integer, the answer is immediately false. If there is exactly one integer, then the answer is that integer if it is contained in all of the other classes, and false otherwise.
  • Now we are working with a sequence which has no integer literals and no true or false. If there is a simple class and a negation of that same class, their intersection is empty, so the entire expression is false.
  • A negated integer can be removed as redundant when some negated simple class already excludes that integer (that is, when the integer is contained in the class being negated).
  • Finally, the or classes within the and class can be simplified when their logic variables overlap with other terms of the and class: those variables can be substituted with true. For example, and(lowercase, or(not(lowercase), latin)) simplifies to and(lowercase, latin), because substituting true for lowercase makes the inner expression or(false, latin), which is just latin.
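A partial Python sketch of the and constructor is below. The representation (integers for literal characters, strings for simple classes, tuples for not/and/or) and the helper are invented for illustration, and only a few of the reductions above are shown.

    # Partial sketch of the "and" constructor.
    def class_contains(cls, code_point):
        # does a simple (or negated simple) class contain this code point?
        ch = chr(code_point)
        if cls == "digit":
            return ch.isdigit()
        if cls == "lowercase":
            return ch.islower()
        if isinstance(cls, tuple) and cls[0] == "not":
            return not class_contains(cls[1], code_point)
        raise ValueError(cls)

    def make_and(classes):
        # flatten nested ands and drop true, which cannot affect the outcome
        terms = []
        for c in classes:
            if isinstance(c, tuple) and c[0] == "and":
                terms.extend(c[1:])
            elif c is not True:
                terms.append(c)
        # a false anywhere makes the whole intersection false
        if any(c is False for c in terms):
            return False
        # at most one literal character can survive an intersection
        literals = [c for c in terms if isinstance(c, int)]
        others = [c for c in terms if not isinstance(c, int)]
        if len(set(literals)) > 1:
            return False
        if literals:
            ch = literals[0]
            return ch if all(class_contains(c, ch) for c in others) else False
        # a simple class and its negation are contradictory
        for c in terms:
            if ("not", c) in terms:
                return False
        # (removing redundant negated literals and simplifying nested ors
        #  against the other terms are omitted from this sketch)
        return terms[0] if len(terms) == 1 else ("and", *terms)

    print(make_and(["lowercase", ord("a")]))        # 97, since 'a' is lower-case
    print(make_and(["digit", ("not", "digit")]))    # False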

Previously, class algebra was not this strong in what reductions it did. This caused problems. For example, nested negations (useful in implementing conjunction) would result in multiple nested disambiguations, which would cause a very fast blowup in the size of the DFA. Now, running disambiguation twice gives the same results as running it once. (In other words, disambiguation is idempotent.) At least I think so.

Conclusion

Implementing regular expressions took a lot more thinking than I expected, but I'm satisfied with the result so far. Unlike traditional backtracking implementations, where pathological regexps can make matching slow, in this system pathological regexps make compilation slow. That seems more acceptable to me. Factor's extensible syntax allows me to make regexp literals, which compile before the program runs, even though regexps are a library rather than built in.

If I were starting from scratch, I might instead use the algorithm of constructing a DFA directly from a regular expression using regexp derivatives. It's described in this paper [PDF]. I'm not sure how performance in practice compares, but it implements negation and conjunction in a much cleaner way, and basically eliminates the need for minimization. Most importantly, it allows for a variation of disambiguation which avoids exponential behavior in many cases.

In a later blog post, I'll talk about the API that I expose for regexps.


I haven't actually implemented \p{script=foo} yet, but I will very soon.

Sunday, March 8, 2009

Text encodings and API design

This blog post is about a problem that I haven't figured out the answer to: How should vendor extensions to encodings be exposed to programmers?

It seems like basically all pre-Unicode text encodings have proprietary Microsoft extensions that become, in practice, the next version of the standard. Sometimes these are straightforward supersets, but sometimes the extensions are backwards-incompatible in particular tiny places. One example of this that should be familiar to Westerners is the distinction between Latin 1 (ISO 8859-1) and Windows 1252. The only difference is in the range 0x80-0x9F, where Latin 1 has a set of basically useless control characters and Windows 1252 has printable characters such as curly quotes and the euro sign.

A similar situation exists with many East Asian character sets (e.g. KS X 1001/1003 for Korean, JIS X 0208 for Japanese). In these cases, the backslash position (0x5C) is mapped to a national currency symbol in the official standard. But in the Microsoft extension, 0x5C is mapped to backslash to maintain compatibility with ASCII, and another character represents the national currency symbol.

Some websites mark themselves as ISO 8859-1, but are in fact encoded in Windows 1252, and many web browsers take this interpretation into account. Similarly, the Microsoft versions of East Asian text encodings are often used in contexts where the standard versions are declared. In some cases, the Microsoft versions are registered with IANA separately for use in internet protocols (eg Shift-JIS/Windows_31J, and Latin1/Windows-1252), but in other cases there is only one registered encoding (eg EUC-KR).

So, there are two questions.
  1. When interpreting an HTTP response, or receiving an email, where the encoding is declared in the header, should the text be interpreted with the standard encoding or the Microsoft extension?
  2. In the encodings API for a programming language, if a particular encoding is used to read or write a file, should the standard be used or the Microsoft extension?

When I started reading about encodings, I assumed that everything could be done reasonably by following the standards precisely. Now I'm not sure what to do.

Thursday, March 5, 2009

Naive lookaround in a DFA

Note: If you don't understand what's below, read this for background on implementing regular expressions with DFAs.

I just implemented lookahead and lookbehind in the Factor regexp module. This lets you write regular expressions like /(?<=foo)bar/, to match "bar" when it's preceded by "foo". I couldn't think of a better way to do it for the general case, so I just did it the naive way. There should be a better way, though, because lookahead and lookbehind don't extend the class of languages that can be matched. This algorithm isn't so good because, in the general case, it makes matching worse than linear time.

For each lookahead or lookbehind clause, a little regular expression is compiled, and it annotates an epsilon transition. If I had an NFA interpreter, the interpreter would just run that regular expression on the input string, starting at the current point, whenever it wants to know if it can use the epsilon transition. I'm using a DFA, so I needed to modify the subset construction to make this work.

What needs to be modified is the procedure to get the epsilon closure of an NFA state. Instead of returning a set of NFA states, this procedure should return a set of states along with the conditions for reaching each state. Usually there will be no conditions, but sometimes there will be a conjunction or disjunction of lookarounds. It could be a conjunction in a regular expression like /(?<=foo)(?=...bar)baz/, and a disjunction in one like /((?<=foo)|(?=...bar))bar/, and since I'm trying to be fully general, all of this is supported.

The epsilon closure procedure is usually something like this*. Start with your state on a list, and look at all of its outward epsilon transitions; add their targets to the list. Now, if anything on your list has more outward epsilon transitions, add those targets too. Keep going until there's nothing to add.

With the modification: Start with your state on a list, with the associated information that there are no conditions that have to be met for it. Then, look at all of the outward epsilon transitions from all the states on your list. For each new transition that you're considering, the requirement to get from the original state to the target of the transition is to meet the requirement of the source, plus the requirement of the transition. If you already had a way to get to the target of the transition, then now there are two conditions, and either one can be used. Keep repeating this until examining epsilon transitions doesn't change anything.
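Here is a Python sketch of that modified closure. Each reachable state is paired with a set of alternative condition-sets: to reach the state you have to satisfy every condition in at least one of them. The representation is invented for illustration and isn't the actual Factor code.

    # Sketch: epsilon closure that tracks the conditions for reaching each state.
    # epsilons maps a state to a list of (target, condition) pairs;
    # condition is None for an ordinary epsilon transition.
    def epsilon_closure(state, epsilons):
        closure = {state: {frozenset()}}       # the start state needs no conditions
        changed = True
        while changed:
            changed = False
            for source, ways in list(closure.items()):
                for target, condition in epsilons.get(source, []):
                    extra = frozenset() if condition is None else frozenset([condition])
                    new_ways = {way | extra for way in ways}
                    if not new_ways <= closure.get(target, set()):
                        closure.setdefault(target, set()).update(new_ways)
                        changed = True
        return closure

    # A /((?<=foo)|(?=...bar))bar/ style situation, with conditions as strings:
    eps = {0: [(1, "lookbehind foo"), (1, "lookahead ...bar")], 1: [(2, None)]}
    print(epsilon_closure(0, eps))
    # state 0 is reached unconditionally; states 1 and 2 via either lookaround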

Now, a little post-processing can turn the result of this into a bunch of nested conditionals, which can quickly tell you what states you can get to given which conditions are met. In the usual case, where there are no conditions, this tree is just a single leaf, a list of states. From here, transitions in the DFA go not to sets of states but to trees of conditions and states. The start state is also one of these trees.

The DFA interpreter** needs a little bit of extra logic to test the conditions and traverse the tree. The interpreter gives the condition the input string and the current index, and the condition can do whatever it wants with that information. I've implemented anchors (e.g. $ and ^) this way. They could be expressed as lookarounds, but it's easier to implement them directly as predicates on the string and index.
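For example, the anchor predicates are as simple as this (Python sketch):

    # Sketch: anchors as predicates over the string and index.
    def caret(string, index):      # ^ matches at the start of the string
        return index == 0

    def dollar(string, index):     # $ matches at the end of the string
        return index == len(string)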



*No, I don't implement it this way, because it would be a lot of repeated work, but this expresses the idea.
** Actually, everything compiles down to Factor code, which is compiled with the optimizing compiler if you want.