I just implemented lookahead and lookbehind in the Factor regexp module. This lets you write regular expressions like
/(?<=foo)bar/
, to match "bar" when it's preceded by "foo". I couldn't think of a better way to do it for the general case, so I just did it the naive way. There should be a better way, though, because lookahead and lookbehind don't extend the set of possible strings matched. This algorithm isn't so good because it makes things worse than linear time in the general case.For a lookahead or lookbehind clause, there is a little regular expression compiled. This regular expression annotates an epsilon transition. If I had an NFA interpreter, then the interpreter would just run the regular expression on the input string starting at the current point when it wants to know if it can use the epsilon transition. I'm using a DFA, so I needed to modify the subset construction to make this work.
What needs to be modified is the procedure to get the epsilon closure of an NFA state. Instead of returning a set of NFA states, this procedure should return a set of states and the conditions for reaching each state. Usually, there will be no conditions, but sometimes there will be a conjunction or disjunction of lookarounds. It could be the conjunction in regular expression like
/(?<=foo)(?=...bar)baz/
, and it could be the disjunction in a regular expression like /((?<=foo)|(?=...bar))bar/
, and since I'm trying to be fully general, all of this is supported.The epsilon closure procedure is usually something like this*. Start with your state on a list, and look at all of the epsilon transitions outwards. Add these to your list. Now, if anything on your list has more epsilon transitions outwards, add these to your list. Keep going until there's nothing to add.
With the modification: Start with your state on a list, with the associated information that there are no conditions that have to be met for it. Then, look at all of the outward epsilon transitions from all the states on your list. For each new transition that you're considering, the requirement to get from the original state to the target of the transition is to meet the requirement of the source, plus the requirement of the transition. If you already had a way to get to the target of the transition, then now there are two conditions, and either one can be used. Keep repeating this until examining epsilon transitions doesn't change anything.
Now, a little post-processing can turn the result of this into a bunch of nested conditionals, which can quickly tell you what states you can get to given which conditions are met. In the usual case, where there are no conditions, this tree is just a single leaf, a list of states. From here, transitions in the DFA go not to sets of states but to trees of conditions and states. The start state is also one of these trees.
The DFA interpreter** needs a little bit of extra logic to test the conditions and traverse the tree. The interpreter gives the condition the input string and index, and the condition can do whatever it wants with that information. I've implemented anchors (eg. $ and ^) this way. They could be lookaround, but it's easier to just implement them directly as a predicate on the string and index.
*No, I don't implement it this way, because it would be a lot of repeated work, but this expresses the idea.
** Actually, everything compiles down to Factor code, which is compiled with the optimizing compiler if you want.
1 comment:
Daniel,
Instead of just looking at DFA/NFA's for string scanning, have a look at the string scanning facilities of ICON/UnIcon. There is no doubt that regular expressions are useful in their own right but the scanning expressions of ICON (which are a modification of the scanning expressions of SNOBOL4) allow a much more generalised way of processing strings.
Adapting it to Unicode strings will be interesting.
regards
Bruce Rennie
(God's Own Country Downunder)
Post a Comment