Sunday, May 18, 2008

Writings on regexp group capture

So, in researching regular expression group capture, I had a little bit of trouble. It turns out that the terminology isn't settled: some people call it "capture groups", others call it "submatch extraction" and some people call it "subexpression match". Judging by Google, "submatch extraction" turns up the most research hits, and "subexpression match" is the most broadly used.

With that behind me, it turns out I'm not the first one to come up with an algorithm for group capture in regular expressions in linear time and space. (My algorithm was, basically: annotate the NFA states which lie on a group boundary, then turn this into a DFA which marks a location in the string whenever one of those states could be entered. Run this on the string, then run the same construction for the reversed regular expression on the reversed string, and intersect the two sets of possible positions for each group boundary. Then, for each group boundary, take either the first or the last position in the intersection. This can be proven correct easily in the case of one boundary point: if a proposed boundary is in the set marked by both the forward pass and the backward pass, then the part of the string before the boundary matches the first part of the regexp, and the part after the boundary matches the second part.)
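To make the one-boundary case concrete, here is a minimal sketch in Python. It is not the DFA construction described above: the function name boundary_candidates is made up for illustration, and Python's re module stands in for the two passes, so it is nowhere near linear time. It only shows the intersection step, assuming the regexp has already been split by hand into the part before the boundary and the part after it.

```python
import re

def boundary_candidates(first, second, text):
    """Positions i where text[:i] matches `first` and text[i:] matches `second`.

    Hypothetical helper: re.fullmatch stands in for the forward and
    backward DFA passes, so this illustrates the idea, not the cost.
    """
    a = re.compile(first)
    b = re.compile(second)
    n = len(text)
    # Stand-in for the forward pass: positions where the first part can end.
    forward = {i for i in range(n + 1) if a.fullmatch(text, 0, i)}
    # Stand-in for the backward pass: positions where the second part can begin.
    backward = {i for i in range(n + 1) if b.fullmatch(text, i, n)}
    # A position in both sets splits the string into two matching halves.
    return sorted(forward & backward)

candidates = boundary_candidates(r"a*", r"a*b", "aaab")
# Taking the last candidate gives the greedy answer, the first the non-greedy one.
greedy = candidates[-1] if candidates else None
lazy = candidates[0] if candidates else None
```

If the intersection is empty, the string as a whole doesn't match. Otherwise any position in it is a valid boundary, and choosing the first or the last one is what decides between non-greedy and greedy behavior for that group.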

Actually, there's been a bit of research here over the past 20 years. I haven't read the following papers very closely (though I plan to), but for anyone interested in understanding how to process regular expressions efficiently to get a parse tree, here are a few interesting papers:

All of these papers go about submatch extraction in somewhat difficult ways. I hope I've helped someone avoid a difficult literature search like the one I had.

Update: It seems the best way to do a literature search is to blog about something, and have commenters give you relevant papers. Here's one by Burak Emir describing how to get the shortest match (think non-greedy, but globally optimal) with group capture, taking advantage of transformations of regexes. Thanks, Alain Frisch!

1 comment:

Alain Frisch said...

You might also want to have a look at this technical report by Burak Emir:

Compiling Regular Patterns to Sequential Machines