Tuesday, March 17, 2009

The implementation of Factor's regexp library

I've been working on Factor's regular expression library, initially written by Doug Coleman, for the past few weeks. Recently, the library became good enough that I've pushed it to Factor's main repository. The latest Factor binaries have this new library.

The library uses an standard algorithm of converting a regular expression into an NFA, and that into a DFA which can be executed. This is a tradeoff: the code generated will be faster than you would get from a backtracking search or an NFA interpreter, but it takes exponential time, in the worst case, to generate the DFA. I might revisit this later.

The main features missing now are

  • Possessive and reluctant matching
  • Group capture
  • Unicode support, in the form of UTS 18 level 1 compliance with some level 2 features

Right now, I'm working on Unicode support. Regexps already use Unicode strings, because all strings in Factor represent a sequence of Unicode code points, but not many Unicode properties are exposed now. I plan on working on this, and implementing more Unicode algorithms and properties, to reach level 1 compliance.

The rest of this article is an overview of how the regexp engine works. It is implemented as a series of passes, where the first pass takes a string as input, and the last pass outputs a word which runs the code of the regexp. In this way, it is rather like any other compiler, where the parse tree, DFA table and NFA table are just intermediate representations.

The parser

The parser is implemented with Chris Double's packrat parsing library. This makes it not very efficient, but the time spent in parsing is much less than the time in later processing stages, so the cost isn't very large. Things like /a{2,4}/ are expanded into the equivalent, but simpler, form /aa|aaa|aaaa/.

If I were working only with ASCII, then ranges would be expanded into disjunctions as well, but the alphabet is far too big for that. Instead, something like /[ac-z]/ is represented in the syntax tree as a item, a character class object, representing the information that it matches the character a, or something in the class which is the range c-z. For a character class like /[^\dc]/, an object is created which represents a character which is not a digit or c.

Constructing an NFA

From the syntax tree, a nondeterministic finite-state automaton is built. The algorithm is described here, and there is nothing special about the implementation.

Lookahead, lookbehind and anchors (like $ and ^) expand to entries in the syntax tree called tagged epsilons. When these are encountered in building the NFA, an epsilon transition is created which is annotated with this information.

Negation is implemented here. If a negation syntax node is encountered, then the NFA builder constructs an NFA for the enclosed term, disambiguates it, converts it to a DFA, minimizes it, and attaches it back to the larger NFA that is being constructed.

Disambiguation

As I previously described, since the implementation doesn't iterate over every element of the alphabet, there needs to be a procedure to make transitions over characters have disjoint labels. Transitions are labeled by sets, and the output from creating an NFA might have intersecting outgoing sets from a transition.

The best way I've thought of doing this is to get all of the intersections of all of the edge labels, basically forming a Venn diagram. This is, unfortunately, exponential time and space to do. But I see no way of avoiding it when compiling a regular expression like /\p{letter}a|[0-9a-z]b|\{script=latin}c|.../ where there are a large number of incomparable character classes used.

I implemented a small optimization for this: numbers (ie literal characters) are set aside at the beginning and treated specially, so work isn't wasted intersecting them with other classes. The complexity of the algorithm stays exponential, but instead of being exponential in the total number of character classes in the regexp, it becomes exponential in just the non-literal classes.

Constructing a DFA

This is also a standard algorithm. My only modification is to support the tagged epsilon transitions created by lookaround and anchors. I described the modification in a previous blog post.

Minimization

Next, the resulting DFA is minimized. I wrote about regexp minimization before. The algorithm had to be modified slightly to allow for the conditional transitions introduced by processing lookaround in the previous step.

Compilation to Factor

At this point, we have a nice minimal DFA with disjoint outward transitions. Translating it into Factor code is actually quite easy. For each state, we make a gensym. The gensym takes as arguments a string and an index. If the index is at the end of the string, the word returns a boolean, indicating whether the current state is an accepting state. If the index is not at the end of the string, the current character is found, and the word figures out which transition to take. A transition is taken by incrementing the index and then making a tail call to another state word.

The strategy for finding the right transition is somewhat complicated. First, the literal transitions (over constant characters) are partitioned out from the non-literal transitions. The literal transitions are formed into a case statement, where the default case handles non-literal transitions.

Non-literal transitions are all boolean expressions, built with the class algebra described below. They have a number of logic variables (classes of characters). So we can build a truth table over the logic variables, and test each condition exactly once to figure out which transition to take. For example, in the regexp /\p{script=latin}b|\p{lower}c/, the DFA will have three transitions from the start: one over characters which are Latin script and lower-cased, one over characters which are lower-cased but not in Latin script, and one over characters which are in Latin script but not lower-cased. Rather than having the compiled DFA check if the character is in the composite classes directly (which would duplicate cost, since it would be looked up multiple times if a character is lower-cased or Latin script), the compiler builds nested if statements that figure out the composite class while testing each property only once. This leads directly to the transition.

Class algebra

There is a system for simplifying the intersections built in disambiguation, as well as character class literals. It is built off simplifying logical expressions built with and, or, not. The things contained are true (the whole set of characters), false (the empty set), and all possible character classes.

There are constructors for these three logical operations, and then a few simple tactics are used to reduce them. Reducing the expression to simplest form is equivalent to circuit minimization. A friend told me that this is on the second level of the polynomial hierarchy. So I'll just live with the heuristics.

The not constructor is simple. If it's given true, it outputs false. False to true. If it's given a negation as input, it returns the contents. If it's given an and class, it uses De Morgan's law and negates each entry, returning an or. And vice versa.

The and/or constructors are slightly more complicated. I will describe how the and constructor works; the or constructor can be easily derived using De Morgan's law. The input is a sequence of classes, and we want to get their intersection. First, if the input contains intersection (and) classes, these are flattened into the larger sequence. Next, the sequence is sorted into categories: integers, negations of integers, simple classes (like the class digits), negations of those, union classes (ors), and booleans. Delete true from the booleans list, if it's there, as it cannot affect the outcome. If there is a false in the booleans list, then the answer is false. If there is more than one integer, the answer is immediately false. If there is exactly one integer, then the answer is that integer if it is contained in all of the other classes, otherwise false. Now, we are working with a sequence which does not have integer literals, or true or false. If there is a simple class and a not-simple class for the same class, we know that their intersection is false, so the entire expression is false. We can remove not-integers where the integer is contained in an existing not-simple class, as these are redundant. Finally, the or classes within the and class can be simplified in the case where they have logic variables overlapping with other things in the and class: these can all be substituted with true. For example, if you have and(lowercase, or(lowercase, latin)), this can be simplified to and(lowercase, latin). This is because true is substituted for lowercase in the or expression, and or(true, latin) simplifies to latin.

Previously, class algebra was not this strong in what reductions it did. This caused problems. For example, nested negations (useful in implementing conjunction) would result in multiple nested disambiguations, which would cause a very fast blowup in the size of the DFA. Now, running disambiguation twice gives the same results as running it once. (In other words, disambiguation is idempotent.) At least I think so.

Conclusion

Implementing regular expressions took a lot more thinking than I expected, but I'm satisfied with the result so far. Unlike traditional regular expressions, where pathologies might make regexp matching slow, in this system pathologies make regexp compilation time slow. That seems more acceptable to me. Factor's extensible syntax allows me to make regexp literals, which compile before the program runs, even though regexps are a library rather than built in.

If I were starting from scratch, I might instead use the algorithm of constructing a DFA directly from a regular expression using regexp derivatives. It's described in this paper [PDF]. I'm not sure how performance in practice compares, but it implements negation and conjunction in a much cleaner way, and basically eliminates the need for minimization. Most importantly, it allows for a variation of disambiguation which avoids exponential behavior in many cases.

In a later blog post, I'll talk about the API that I expose for regexps.


I haven't actually implemented \p{script=foo} yet, but I will very soon.

Sunday, March 8, 2009

Text encodings and API design

This blog post is about a problem that I haven't figured out the answer to: How should vendor extensions to encodings be exposed to programmers?

It seems like basically all pre-Unicode text encodings have proprietary Microsoft extensions that become, in practice, the next version of the standard. Sometimes these are basically supersets, but sometimes the extensions are backwards-incompatible in particular tiny places. One example of this that should be familiar to Westerners is the distinction between Latin 1 (ISO 8859-1) and Windows 1252. The only difference is in the range 0x80-0x9F, where Latin 1 has a set of basically useless control characters.

A similar situation exists with many East Asian character sets (eg. KS X 1001/1003 for Korean, JIS X 208 for Japanese). In these cases, the backslash (\ 0x5C) is mapped to a national currency symbol in the official standard. But in the Microsoft extension, 0x5C is mapped to backslash to maintain compatibility with ASCII, and another character represents the national currency symbol.

Some websites mark themselves as ISO 8859-1, but are in fact encoded in Windows 1252, and many web browsers take this interpretation into account. Similarly, the Microsoft versions of East Asian text encodings are often used in contexts where the standard versions are declared. In some cases, the Microsoft versions are registered with IANA separately for use in internet protocols (eg Shift-JIS/Windows_31J, and Latin1/Windows-1252), but in other cases there is only one registered encoding (eg EUC-KR).

So, there are two questions.

  1. When interpreting an HTTP response, or receiving an email, where the encoding is declared in the header, should the text be interpreted with the standard encoding or the Microsoft extension?
  2. In the encodings API for a programming language, if a particular encoding is used to read or write a file, should the standard be used or the Microsoft extension?

When I started reading about encodings, I assumed that everything could be done reasonably by following the standards precisely. Now I'm not sure what to do.

Thursday, March 5, 2009

Naive lookaround in a DFA

Note: If you don't understand what's below, read this for background on implementing regular expressions with DFAs.

I just implemented lookahead and lookbehind in the Factor regexp module. This lets you write regular expressions like /(?<=foo)bar/, to match "bar" when it's preceded by "foo". I couldn't think of a better way to do it for the general case, so I just did it the naive way. There should be a better way, though, because lookahead and lookbehind don't extend the set of possible strings matched. This algorithm isn't so good because it makes things worse than linear time in the general case.

For a lookahead or lookbehind clause, there is a little regular expression compiled. This regular expression annotates an epsilon transition. If I had an NFA interpreter, then the interpreter would just run the regular expression on the input string starting at the current point when it wants to know if it can use the epsilon transition. I'm using a DFA, so I needed to modify the subset construction to make this work.

What needs to be modified is the procedure to get the epsilon closure of an NFA state. Instead of returning a set of NFA states, this procedure should return a set of states and the conditions for reaching each state. Usually, there will be no conditions, but sometimes there will be a conjunction or disjunction of lookarounds. It could be the conjunction in regular expression like /(?<=foo)(?=...bar)baz/, and it could be the disjunction in a regular expression like /((?<=foo)|(?=...bar))bar/, and since I'm trying to be fully general, all of this is supported.

The epsilon closure procedure is usually something like this*. Start with your state on a list, and look at all of the epsilon transitions outwards. Add these to your list. Now, if anything on your list has more epsilon transitions outwards, add these to your list. Keep going until there's nothing to add.

With the modification: Start with your state on a list, with the associated information that there are no conditions that have to be met for it. Then, look at all of the outward epsilon transitions from all the states on your list. For each new transition that you're considering, the requirement to get from the original state to the target of the transition is to meet the requirement of the source, plus the requirement of the transition. If you already had a way to get to the target of the transition, then now there are two conditions, and either one can be used. Keep repeating this until examining epsilon transitions doesn't change anything.

Now, a little post-processing can turn the result of this into a bunch of nested conditionals, which can quickly tell you what states you can get to given which conditions are met. In the usual case, where there are no conditions, this tree is just a single leaf, a list of states. From here, transitions in the DFA go not to sets of states but to trees of conditions and states. The start state is also one of these trees.

The DFA interpreter** needs a little bit of extra logic to test the conditions and traverse the tree. The interpreter gives the condition the input string and index, and the condition can do whatever it wants with that information. I've implemented anchors (eg. $ and ^) this way. They could be lookaround, but it's easier to just implement them directly as a predicate on the string and index.



*No, I don't implement it this way, because it would be a lot of repeated work, but this expresses the idea.
** Actually, everything compiles down to Factor code, which is compiled with the optimizing compiler if you want.

Tuesday, February 24, 2009

Draft paper about inverse

I recently submitted an abstract about inverse to an undergraduate computer science conference, and I just got word that it was accepted. I've written a draft paper, available through Scribd, and I'd like to hear your comments on it. I've written it for a completely general audience, not assuming knowledge of functional programming or stack-based programming languages. I didn't get into formal semantics or anything like that, partly because it'd make the paper too long and partly because I don't completely understand the point.

Speaking of undergraduate research, if you're interested in doing research in practical computer science this summer, I recommend applying to Harvey Mudd's CS REU. Applications are due Sunday. But only if you're a US citizen or permanent resident! The funding agencies don't like foreigners. (This is actually a serious problem for several of my friends, who have difficulty finding the same summer academic and internship opportunities because they are actively discriminated against as international students. I hope that, one day, discrimination based on citizenship is as illegal as discrimination against racial minorities or women, but we are very far away from that today. Now I'm getting off-topic.)

Monday, February 23, 2009

Regular languages with large alphabets

In implementing regular expressions, the formal theory of regular languages can be useful, especially when compiling regexps to DFAs. But the standard definition of a regular language is a little inconvenient when working with Unicode. In a standard DFA, a transition is over a single character. But for the range /[\0-\u0f423f]/, I don't really want to compile a DFA with a million transition edges!

A slightly modified formalism is useful here, where transitions in NFAs and DFAs happen over sets of letters, rather than individual letters. Then, things like ranges and character classes (eg \p{Lower}) are represented as sets which annotate transition edges.

It's perfectly straightforward to translate such a regular expression into an NFA with the standard construction, and the typical algorithm for executing an NFA by running all possible states in parallel works. For DFAs, though, there's a small complication: we have to remove ambiguity about where to go next.

Say you have the regexp /\p{InLatin}b|\p{Lower}c/ This matches strings like "ab", "ac", "Ab", "πc" but not "Ac" or "πb". The simple textbook algorithm for regular expressions would have me expand /\p{InLatin}/ out to /a|A|b|B|.../, and expand /\p{Lower}/ to /a|π|.../. This strategy would work, but the size of the resulting NFA and DFA would be gigantic.

What we actually want to do is change the two outward transitions from the start state--transitions over the sets InLatin and Lower--to transitions over InLatin ∩ Lower, InLatin - Lower and Lower - InLatin. Since these are disjoint, there's no ambiguity about which one to take. In general, for a state with n outward transitions, you have to look at 2n possibilities, since for each subset, you have to make a transition for characters which are in each of those transition groups, but not in any of the others.

Implemented naively, this would make the size of DFAs blow up. For the regexp /ab|cd/, you'd have a transition on characters that are a but not c, characters that are c but not a, and characters that are both c and a. Fortunately, it's simple to work out a system which recognizes that a - c = a, c - a = c and c ∩ a = ∅. With this in place, the resulting DFA (which didn't have any ambiguity in the first place) is just like what it would be without the system, but the DFA for ab|\p{Lower}c has two transitions from the start state: one over a, and one over lower cased letters that aren't a.

I've implemented all of this in the Factor regexp library. If you want to see the code, it's in the main repository, in the regexp branch.

PS. When you're considering transitions over sets, it's possible to consider "regular" languages over an infinite alphabet. It might be convenient to think of Unicode as infinite, since it's so large. But it's easy to prove that for any such "regular" language, there is a string homomorphism to a finite-alphabet regular language where a string is in the original language if and only if its homomorphic image is in the smaller regular language. So, in this way, it's a mathematically boring formalism to study. Other people have studied regular language formalisms with infinite alphabets that actually do have more power--they have the power to compare characters for equality in certain contexts. But that's completely different.

Wednesday, February 18, 2009

DFA minimization

I've been working on the regexp vocabulary by Doug Coleman, cleaning it up and adding new features with the goal of making it usable for the xmode syntax highlighter. This means compatibility with most of Java's regexp syntax.

The implementation strategy Doug and I are using is standard textbook stuff, the way Lex works: from the regular expression, a nondeterministic finite automaton (NFA) is constructed, and this is converted to a deterministic finite automaton (DFA). This can be used to efficiently match the string in linear time. Russ Cox wrote a good introduction to all of this.

There's a limitation: backreferences (like /(.*)\1/) are not supported, since they're incompatible with the NFA/DFA model. But there's no good way to implement backreferences, anyway: parsing with them is NP-complete. Perl uses a backtracking model where backreferences are easy to implement, but in certain cases the backtracking gets out of hand and performance is worse than linear.

Today, I worked out the code to minimize DFAs. The DFA minimization algorithm is really nice, so I thought I would share it with you. The implementation is just 65 lines of Factor code, which is in the Factor pastebin.

The issue is that sometimes, naively constructing a DFA gives you duplicate states which are really the same thing. For example, if you have two final states which each have no outward transitions, they can be consolidated. If you have the regular expression /ac|bc/, then there's a DFA for this in just 3 states. The naive way, however, would give you 5 states.

What we want is to partition the states into sets of states that are all the same, and then use only one of these. In mathematical language, we want to create an equivalence relation and quotient out by it. Here's how we figure out what states are the same.

  1. Assume that all states are the same*.
  2. Separate the final states from the non-final states.
  3. Repeat the following until it doesn't make a change in the partitioning:
    • Separate any two states which have a transition with the same label to two states which are separated.

Once this doesn't change anything, the states have been divided into the sets that we want.

Interestingly, this algorithm is pretty similar to optimistic global value numbering, a compiler optimization. Optimistic value numbering is a technique for eliminating duplicated computations. It works by assuming that all registers hold the same value, and there is an iteration until fixpoint tries to separate registers which are actually different from each other. When faced with loops, this can catch more than so-called pessimistic value numbering, which first assumes that everything is different, until it can prove that two registers hold the same value.

In the simple, controlled environment of a DFA, it's been proved that this minimization actually produces the smallest possible DFA matching the same set of strings. It's even been shown that, for a given regular language, the minimal DFA recognizing it is unique up to isomorphism. Such a nice analysis isn't possible in the more complicated world of compiler optimizations, however, where better optimizations than GVN exist.


*Actually, we assume that any two states with the same labeled outward transitions are the same. For example, if a state A goes to state B on character x and state C on character y, and state D goes to E on either x or y, then we'll assume at the beginning that A and E are the same. This is a simple modification I came up with to deal with the effectively infinite alphabet of Unicode, since it would be impossible to compare the transition on each Unicode code point.

Thursday, February 5, 2009

XML pattern matching

I've implemented syntax for pattern matching on interpolated XML literals. This is a Scala-inspired feature which may or may not be useful, but is definitely cool-looking. Here's a sample of code:

: dispatch ( xml -- string )
{
{ [ [XML <a><-></a> XML] ] [ "a" prepend ] }
{ [ [XML <b><-></b> XML] ] [ "b" prepend ] }
{ [ [XML <b val='yes'/> XML] ] [ "yes" ] }
{ [ [XML <b val=<->/> XML] ] [ "no" prepend ] }
} switch ;

And here's some examples of what it does:
[XML <a>pple</a> XML] dispatch
=> "apple"
[XML <b>anana</b> XML] dispatch
=> "banana"
[XML <b val="yes"/> XML] dispatch
=> "yes"
[XML <b val="where"/> XML] dispatch
=> "nowhere"

The pattern matching here is based on my inverse library. Hopefully you get the high-level idea of how XML pattern matching works. A caveat is that namespaces are ignored, as in Scala's XML pattern matching, and I haven't thought of a good way to incorporate namespaces.

Jeff Attwood recently wrote a blog post about XML literal syntax*, and how it's a big help. In part, I agree with his statement that it lets you write cleaner code to have XML literals mixed in with everything. That's why I implemented them in Factor.

But I think it's insane to have them built in the way that Scala and, apparently, VB.NET have. Just think if the designers of Java had this idea 15 years ago. They would have made SGML literal syntax rather than XML, and there would be no way to remove this for backwards compatibility concerns. Every implementation of Java would have to contain a full-featured SGML parser. Generating HTML would be easier and allow for very pretty code, though.

XML won't be so prominent forever. It'll always be there, of course, just like SGML is still there in non-XML HTML and DocBook files which refuse to die. But I wouldn't be surprised if, in 20 years, what we want is a literal syntax for something other than XML. We'll still be using a lot of the same programming languages as were invented today, though.

Factor's XML literal syntax and XML pattern matching are just libraries, and I wouldn't want it any other way. Factor's parser needed no adjustments at all to allow for XML literals. If an XML literal is written when the vocabulary xml.literals isn't in scope, it's a syntax error. Sure, you need delimiters [XML XML] around the XML literals, but this is a small price to pay.

In 20 years, if XML isn't useful any longer, then Factor programmers will be just as well off as if they are. If there's a new data format, it'll just be a new library to download, and you'll have literal syntax for that data format. If you were using VB.NET, though, you'd have to upgrade to the most recent version of the language, with a parser that's been vastly complicated.


*He actually talked about HTML literal syntax, but that's such a ridiculous idea that I know he just mistyped. There's no way to tell where an HTML fragment ends, for one, and the HTML DTD would have to be part of the core grammar of the language. You would need delimiters around where the HTML starts and ends, and none of his code fragments have those delimiters. The X key is right next to the letters H and T, so it's an honest typo. His pseudocode sample just before the VB.NET fragment must have also been a simple mistake, as it would seem to imply that either XML literals get printed immediately or that there is some implicit way of collecting them.