That behind me, I'm not the first one to come up with an algorithm for group capture in regular expressions in linear time and space. (My algorithm was, basically: annotate the NFA states that lie on a group boundary, then turn this into a DFA which marks a location in the string whenever one of those states could be entered. Run this, then run the same thing on the reversed regular expression, feeding the string in backwards, and take the intersection of the two sets of possible group boundary points. Then, for each group boundary, pick either the first possible point or the last. This can be proven correct easily in the case of one boundary point: if a proposed boundary is in the set marked by both the forward pass and the backward pass, then the part of the string before the boundary matches the first part of the regexp, and the part after the boundary matches the second part.)
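To make the one-boundary case concrete, here's a minimal Python sketch. It is not the DFA construction described above: it simulates the NFA's state sets directly (subset construction on the fly), which gives the same marks for a fixed regexp. The `NFA` class, the helper names, and the hand-built automata for `a*` and `a*b` are all made up for illustration, and the NFAs are assumed to have no epsilon transitions.

```python
class NFA:
    """A tiny epsilon-free NFA (hypothetical helper, for illustration only)."""

    def __init__(self, starts, accepts, edges):
        self.starts = frozenset(starts)
        self.accepts = frozenset(accepts)
        self.edge_list = list(edges)          # (source, char, target) triples
        self.edges = {}
        for src, ch, dst in self.edge_list:
            self.edges.setdefault((src, ch), set()).add(dst)

    def reverse(self):
        # Flip every edge and swap the start and accept sets.
        flipped = [(dst, ch, src) for src, ch, dst in self.edge_list]
        return NFA(self.accepts, self.starts, flipped)


def accepting_prefix_lengths(nfa, text):
    """Return every i such that text[:i] is accepted, in one pass over text."""
    current = set(nfa.starts)
    marks = {0} if current & nfa.accepts else set()
    for i, ch in enumerate(text, start=1):
        current = {dst for s in current for dst in nfa.edges.get((s, ch), ())}
        if current & nfa.accepts:
            marks.add(i)
    return marks


def boundary_points(first, second, text):
    """Positions i where text[:i] matches `first` and text[i:] matches `second`."""
    n = len(text)
    forward = accepting_prefix_lengths(first, text)
    # Backward pass: reversed automaton on the reversed string. A reversed
    # prefix of length j corresponds to a boundary at position n - j.
    backward = {n - j for j in accepting_prefix_lengths(second.reverse(), text[::-1])}
    return sorted(forward & backward)


# Hand-built automata for the regexp (a*)(a*b), split at the group boundary.
a_star = NFA(starts={0}, accepts={0}, edges=[(0, "a", 0)])
a_star_b = NFA(starts={0}, accepts={1}, edges=[(0, "a", 0), (0, "b", 1)])

points = boundary_points(a_star, a_star_b, "aaab")
print(points)                  # every valid boundary for the first group
print(points[0], points[-1])   # first (shortest) and last (greedy) choice
```

Taking `points[-1]` corresponds to a greedy first group and `points[0]` to a non-greedy one. With more than one group boundary the choices interact, which this sketch doesn't attempt to handle.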
Actually, there's been a bit of research here over the past 20 years. I haven't read the following papers very closely (though I plan to), but for anyone interested in understanding how to process regular expressions efficiently to get a parse tree, here are a few interesting papers:
- Extending Regular Expressions with Context Operators and Parse Extraction by Steven Kearns, 1991. This does something like the algorithm I was developing, but it's more fully thought out.
- Efficiently building a parse tree from a regular expression by Danny Dubé and Marc Feeley, 2000. This goes into more depth on building parse trees, but their algorithm is apparently less efficient than the one just below.
- Efficient submatch addressing for regular expressions [PDF] by Ville Laurikari, 2001. This is a Master's thesis, so it's easier to read and presents background information. It introduces the formal model of a tagged NFA, and it provides benchmarks showing the system to be much faster than other widely used libraries.
- Greedy Regular Expression Matching by Alain Frisch and Luca Cardelli, 2004. This takes an interesting axiomatic approach to the issue, and develops a different way to resolve ambiguity.
All of these papers go about submatch extraction in somewhat difficult ways. I hope I've helped someone avoid a difficult literature search like the one I had.
Update: It seems the best way to do a literature search is to blog about something, and have commenters give you relevant papers. Here's one by Burak Emir describing how to get the shortest match (think non-greedy, but globally optimal) with group capture, taking advantage of transformations of regexes. Thanks, Alain Frisch!