Tuesday, February 24, 2009

Draft paper about inverse

I recently submitted an abstract about inverse to an undergraduate computer science conference, and I just got word that it was accepted. I've written a draft paper, available on factorcode.org, and I'd like to hear your comments on it. I've written it for a completely general audience, not assuming knowledge of functional programming or stack-based programming languages. I didn't get into formal semantics or anything like that, partly because it'd make the paper too long and partly because I don't completely understand the point.

Speaking of undergraduate research, if you're interested in doing research in practical computer science this summer, I recommend applying to Harvey Mudd's CS REU. Applications are due Sunday. But only if you're a US citizen or permanent resident! The funding agencies don't like foreigners. (This is actually a serious problem for several of my friends, who have difficulty finding the same summer academic and internship opportunities because they are actively discriminated against as international students. I hope that, one day, discrimination based on citizenship is as illegal as discrimination against racial minorities or women, but we are very far away from that today. Now I'm getting off-topic.)

Monday, February 23, 2009

Regular languages with large alphabets

In implementing regular expressions, the formal theory of regular languages can be useful, especially when compiling regexps to DFAs. But the standard definition of a regular language is a little inconvenient when working with Unicode. In a standard DFA, a transition is over a single character. But for the range /[\0-\u0f423f]/, I don't really want to compile a DFA with a million transition edges!

A slightly modified formalism is useful here, where transitions in NFAs and DFAs happen over sets of letters rather than individual letters. Then things like ranges and character classes (e.g. \p{Lower}) are represented as sets which annotate transition edges.
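To make this concrete, here's a minimal sketch in Python (a toy, not the Factor implementation): a transition label is a set of code point ranges, so a single edge can cover an enormous range like \0-\u0f423f.

```python
def contains(ranges, ch):
    """Test whether a character falls inside any of the code point ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

# One edge label covering the whole range \0-\u0f423f -- no need for
# a million separate transitions.
big_range = frozenset([(0x0, 0x0F423F)])
# A toy stand-in for \p{Lower}: just ASCII a-z.
lower = frozenset([(ord('a'), ord('z'))])

assert contains(big_range, 'A')
assert contains(lower, 'q')
assert not contains(lower, 'Q')
```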

It's perfectly straightforward to translate such a regular expression into an NFA with the standard construction, and the typical algorithm for executing an NFA by running all possible states in parallel works. For DFAs, though, there's a small complication: we have to remove ambiguity about where to go next.
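Here's a toy Python sketch of that parallel execution (ignoring ε-transitions, and using plain character sets as edge labels): keep the set of live states, and on each input character follow every edge whose label set contains it.

```python
def run_nfa(edges, start, finals, string):
    """Run an NFA with set-labeled edges by tracking all live states.

    edges: list of (from_state, label_set, to_state) triples."""
    states = {start}
    for ch in string:
        states = {dst for (src, label, dst) in edges
                  if src in states and ch in label}
    # Accept if any live state is final.
    return bool(states & finals)

# Toy NFA for /[ab]c/: state 0 --[ab]--> 1 --[c]--> 2
edges = [(0, {'a', 'b'}, 1), (1, {'c'}, 2)]
assert run_nfa(edges, 0, {2}, "ac")
assert run_nfa(edges, 0, {2}, "bc")
assert not run_nfa(edges, 0, {2}, "cc")
```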

Say you have the regexp /\p{InLatin}b|\p{Lower}c/. This matches strings like "ab", "ac", "Ab", and "πc", but not "Ac" or "πb". The simple textbook algorithm for regular expressions would have me expand /\p{InLatin}/ out to /a|A|b|B|.../, and expand /\p{Lower}/ to /a|π|.../. This strategy would work, but the size of the resulting NFA and DFA would be gigantic.

What we actually want to do is change the two outward transitions from the start state--transitions over the sets InLatin and Lower--to transitions over InLatin ∩ Lower, InLatin - Lower and Lower - InLatin. Since these are disjoint, there's no ambiguity about which one to take. In general, for a state with n outward transitions, you have to look at 2^n possibilities, since for each subset of the transitions, you have to make a transition for characters which are in each of those transition groups but not in any of the others.
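The disjointification step can be sketched like this in Python (a toy version, not the Factor implementation): for each of the 2^n − 1 nonempty subsets of the labels, take the characters that are in every label of the subset and in none of the others, keeping the nonempty pieces.

```python
from itertools import combinations

def disjointify(labels):
    """Split overlapping label sets into disjoint pieces.

    Returns a list of (subset_of_indices, piece) pairs, where piece is
    the set of characters in exactly the labels named by the subset."""
    n = len(labels)
    pieces = []
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            # Intersect all labels in the subset...
            piece = set(labels[subset[0]])
            for i in subset[1:]:
                piece &= labels[i]
            # ...and subtract all labels outside it.
            for j in range(n):
                if j not in subset:
                    piece -= labels[j]
            if piece:
                pieces.append((subset, piece))
    return pieces

latin = {'a', 'A', 'b', 'B'}   # toy stand-in for InLatin
lower = {'a', 'b', 'p'}        # toy stand-in for Lower ('p' for π)
result = dict(disjointify([latin, lower]))
assert result[(0,)] == {'A', 'B'}     # InLatin - Lower
assert result[(1,)] == {'p'}          # Lower - InLatin
assert result[(0, 1)] == {'a', 'b'}   # InLatin ∩ Lower
```

Since the pieces are pairwise disjoint, a DFA can pick its next state unambiguously.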

Implemented naively, this would make the size of DFAs blow up. For the regexp /ab|cd/, you'd have a transition on characters that are a but not c, characters that are c but not a, and characters that are both c and a. Fortunately, it's simple to work out a system which recognizes that a - c = a, c - a = c and c ∩ a = ∅. With this in place, the resulting DFA (which didn't have any ambiguity in the first place) is just like what it would be without the system, but the DFA for ab|\p{Lower}c has two transitions from the start state: one over a, and one over lower cased letters that aren't a.

I've implemented all of this in the Factor regexp library. If you want to see the code, it's in the main repository, in the regexp branch.

PS. When you're considering transitions over sets, it's possible to consider "regular" languages over an infinite alphabet. It might be convenient to think of Unicode as infinite, since it's so large. But it's easy to prove that for any such "regular" language, there is a string homomorphism to a finite-alphabet regular language where a string is in the original language if and only if its homomorphic image is in the smaller regular language. So, in this way, it's a mathematically boring formalism to study. Other people have studied regular language formalisms with infinite alphabets that actually do have more power--they have the power to compare characters for equality in certain contexts. But that's completely different.
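The homomorphism argument can be sketched as follows (my paraphrase, not a formal proof):

```latex
Let $S_1, \dots, S_k$ be the finitely many label sets appearing in the
automaton. They partition the alphabet $\Sigma$ into at most $2^k$
classes: $x \sim y$ iff $x$ and $y$ belong to exactly the same $S_i$.
Let $h(x) = [x]$, the class of $x$, extended to strings letterwise, so
$h$ maps $\Sigma^*$ to strings over a finite alphabet of at most $2^k$
symbols. Every transition treats all members of a class identically,
hence
\[
  w \in L \iff h(w) \in h(L),
\]
and $h(L)$ is an ordinary finite-alphabet regular language, recognized
by the same automaton with each label $S_i$ replaced by the set of
classes it contains.
```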

Wednesday, February 18, 2009

DFA minimization

I've been working on the regexp vocabulary by Doug Coleman, cleaning it up and adding new features with the goal of making it usable for the xmode syntax highlighter. This means compatibility with most of Java's regexp syntax.

The implementation strategy Doug and I are using is standard textbook stuff, the way Lex works: from the regular expression, a nondeterministic finite automaton (NFA) is constructed, and this is converted to a deterministic finite automaton (DFA). This can be used to efficiently match the string in linear time. Russ Cox wrote a good introduction to all of this.

There's a limitation: backreferences (like /(.*)\1/) are not supported, since they're incompatible with the NFA/DFA model. But there's no good way to implement backreferences anyway: matching with them is NP-complete. Perl uses a backtracking model where backreferences are easy to implement, but in certain cases the backtracking gets out of hand and performance degrades to exponential time.

Today, I worked out the code to minimize DFAs. The DFA minimization algorithm is really nice, so I thought I would share it with you. The implementation is just 65 lines of Factor code, which is in the Factor pastebin.

The issue is that sometimes, naively constructing a DFA gives you duplicate states which are really the same thing. For example, if you have two final states which each have no outward transitions, they can be consolidated. If you have the regular expression /ac|bc/, then there's a DFA for this in just 3 states. The naive way, however, would give you 5 states.
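To illustrate where those duplicate states come from, here's a toy Python sketch of the textbook subset construction (not the Factor code), run on a simplified NFA for /ac|bc/ with two separate c-paths. It produces the 5 naive states.

```python
def nfa_to_dfa(edges, start, finals):
    """Textbook subset construction: each DFA state is a frozenset of
    NFA states. edges: list of (from_state, char, to_state); no ε-edges."""
    alphabet = {c for (_, c, _) in edges}
    start_set = frozenset([start])
    dfa, todo = {}, [start_set]
    while todo:
        state = todo.pop()
        if state in dfa:
            continue
        dfa[state] = {}
        for ch in sorted(alphabet):
            dest = frozenset(d for (s, c, d) in edges
                             if s in state and c == ch)
            if dest:
                dfa[state][ch] = dest
                todo.append(dest)
    dfa_finals = {s for s in dfa if s & finals}
    return dfa, start_set, dfa_finals

# Naive NFA for /ac|bc/: 0 -a-> 1, 0 -b-> 2, 1 -c-> 3, 2 -c-> 4
edges = [(0, 'a', 1), (0, 'b', 2), (1, 'c', 3), (2, 'c', 4)]
dfa, s0, fin = nfa_to_dfa(edges, 0, {3, 4})
assert len(dfa) == 5   # five states, though three would suffice
```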

What we want is to partition the states into sets of states that are all the same, and then use only one of these. In mathematical language, we want to create an equivalence relation and quotient out by it. Here's how we figure out what states are the same.
  1. Assume that all states are the same*.
  2. Separate the final states from the non-final states.
  3. Repeat the following until it doesn't make a change in the partitioning:
    • Separate any two states which, on some transition label, go to states which have already been separated.

Once this doesn't change anything, the states have been divided into the sets that we want.
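The refinement loop above can be sketched in Python (a toy Moore-style version, not the 65-line Factor implementation): group the states of each block by which block each label sends them to, and repeat until nothing splits.

```python
def minimize(states, finals, delta):
    """Partition DFA states into equivalence classes.

    delta: dict mapping (state, label) -> state."""
    labels = sorted({lab for (_, lab) in delta})
    # Steps 1-2: start from finals vs. non-finals.
    partition = [b for b in (frozenset(finals), frozenset(states - finals)) if b]
    while True:
        block_of = {s: i for i, b in enumerate(partition) for s in b}
        def signature(s):
            # Which block does each label lead to? (None = no edge.)
            return tuple(block_of.get(delta.get((s, lab))) for lab in labels)
        # Step 3: split blocks whose members disagree.
        new_partition = []
        for block in partition:
            groups = {}
            for s in block:
                groups.setdefault(signature(s), set()).add(s)
            new_partition.extend(frozenset(g) for g in groups.values())
        if len(new_partition) == len(partition):
            return new_partition
        partition = new_partition

# Naive 5-state DFA for /ac|bc/: 0 -a-> 1, 0 -b-> 2, 1 -c-> 3, 2 -c-> 4
delta = {(0, 'a'): 1, (0, 'b'): 2, (1, 'c'): 3, (2, 'c'): 4}
blocks = minimize({0, 1, 2, 3, 4}, {3, 4}, delta)
assert len(blocks) == 3   # {0}, {1,2}, {3,4}: the 3-state minimum
```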

Interestingly, this algorithm is pretty similar to optimistic global value numbering, a compiler optimization. Optimistic value numbering is a technique for eliminating duplicated computations. It works by assuming that all registers hold the same value, then iterating to a fixpoint, separating registers which are actually different from each other. When faced with loops, this can catch more than so-called pessimistic value numbering, which first assumes that everything is different, until it can prove that two registers hold the same value.

In the simple, controlled environment of a DFA, it's been proved that this minimization actually produces the smallest possible DFA matching the same set of strings. It's even been shown that, for a given regular language, the minimal DFA recognizing it is unique up to isomorphism. Such a nice analysis isn't possible in the more complicated world of compiler optimizations, however, where better optimizations than GVN exist.

*Actually, we assume that any two states with the same labeled outward transitions are the same. For example, if a state A goes to state B on character x and state C on character y, and state D goes to E on either x or y, then we'll assume at the beginning that A and E are the same. This is a simple modification I came up with to deal with the effectively infinite alphabet of Unicode, since it would be impossible to compare the transition on each Unicode code point.

Thursday, February 5, 2009

XML pattern matching

I've implemented syntax for pattern matching on interpolated XML literals. This is a Scala-inspired feature which may or may not be useful, but is definitely cool-looking. Here's a sample of code:
: dispatch ( xml -- string )
    {
        { [ [XML <a><-></a> XML] ] [ "a" prepend ] }
        { [ [XML <b><-></b> XML] ] [ "b" prepend ] }
        { [ [XML <b val='yes'/> XML] ] [ "yes" ] }
        { [ [XML <b val=<->/> XML] ] [ "no" prepend ] }
    } switch ;

And here are some examples of what it does:
[XML <a>pple</a> XML] dispatch
=> "apple"
[XML <b>anana</b> XML] dispatch
=> "banana"
[XML <b val="yes"/> XML] dispatch
=> "yes"
[XML <b val="where"/> XML] dispatch
=> "nowhere"

The pattern matching here is based on my inverse library. Hopefully you get the high-level idea of how XML pattern matching works. A caveat is that namespaces are ignored, as in Scala's XML pattern matching, and I haven't thought of a good way to incorporate namespaces.

Jeff Atwood recently wrote a blog post about XML literal syntax*, and how it's a big help. In part, I agree with his statement that it lets you write cleaner code to have XML literals mixed in with everything. That's why I implemented them in Factor.

But I think it's insane to have them built in the way that Scala and, apparently, VB.NET have. Just think if the designers of Java had this idea 15 years ago. They would have made SGML literal syntax rather than XML, and there would be no way to remove this for backwards compatibility concerns. Every implementation of Java would have to contain a full-featured SGML parser. Generating HTML would be easier and allow for very pretty code, though.

XML won't be so prominent forever. It'll always be there, of course, just like SGML is still there in non-XML HTML and DocBook files which refuse to die. But I wouldn't be surprised if, in 20 years, what we want is a literal syntax for something other than XML. We'll still be using a lot of the same programming languages that are in use today, though.

Factor's XML literal syntax and XML pattern matching are just libraries, and I wouldn't want it any other way. Factor's parser needed no adjustments at all to allow for XML literals. If an XML literal is written when the vocabulary xml.literals isn't in scope, it's a syntax error. Sure, you need delimiters [XML XML] around the XML literals, but this is a small price to pay.

In 20 years, if XML isn't useful any longer, then Factor programmers will be just as well off as if it still were. If there's a new data format, it'll just be a new library to download, and you'll have literal syntax for that data format. If you were using VB.NET, though, you'd have to upgrade to the most recent version of the language, with a parser that's been vastly complicated.

*He actually talked about HTML literal syntax, but that's such a ridiculous idea that I know he just mistyped. There's no way to tell where an HTML fragment ends, for one, and the HTML DTD would have to be part of the core grammar of the language. You would need delimiters around where the HTML starts and ends, and none of his code fragments have those delimiters. The X key is right next to the letters H and T, so it's an honest typo. His pseudocode sample just before the VB.NET fragment must have also been a simple mistake, as it would seem to imply that either XML literals get printed immediately or that there is some implicit way of collecting them.