tag:blogger.com,1999:blog-273593670040001243.post238500383383454362..comments2022-03-28T05:51:26.366-07:00Comments on Useless Factor: Parsing with regular expressions and group captureAnonymoushttp://www.blogger.com/profile/00902922561603041049noreply@blogger.comBlogger2125tag:blogger.com,1999:blog-273593670040001243.post-82618785529613018562008-05-10T15:38:00.000-07:002008-05-10T15:38:00.000-07:00Josh, thanks for the pointers to your project and ...Josh, thanks for the pointers to your project and that paper about regex implementations. I'm not sure; you may be right about regular languages being insufficient; I probably haven't done enough research into it. (Anyway, this isn't suggested for parsing everything in general.)Anonymoushttps://www.blogger.com/profile/00902922561603041049noreply@blogger.comtag:blogger.com,1999:blog-273593670040001243.post-24643038301008099052008-05-10T15:18:00.000-07:002008-05-10T15:18:00.000-07:00I think you will be interested in my project Gazel...I think you will be interested in my project <A HREF="http://www.reverberate.org/gazelle/" REL="nofollow">Gazelle</A>. It aims to solve a lot of the issues you raise with existing parser generators (specifically, it aims to be extremely low overhead).<BR/><BR/>In my opinion, <I>the</I> major issue to resolve in parser space is how to reuse parsers. For example, with your scenario of using a parser in an HTTP server or client, someone implementing one of these should be able to pull parsers that implement the relevant RFCs off the shelf and use them in their application, without having to re-implement the RFC again. This is the major focus of Gazelle.<BR/><BR/>My opinion is that regular languages will always leave you wanting for more power. You'll never be able to "natively" parse anything that is tree structured.<BR/><BR/>You will be interested in <A HREF="http://swtch.com/~rsc/regexp/regexp1.html" REL="nofollow">this article</A> if you haven't read it already. It has a little discussion and some sample code for doing capture groups with an NFA-based implementation.<BR/><BR/>I have devised an algorithm (which I haven't seen published anywhere) for parsing regular languages with capture groups that can handle any case where the bounds of the capture groups are nonambiguous. It's not on-line (as you note, it can't be), but it's linear time and linear space.<BR/><BR/>For a sub-class of regular expressions, you can parse capture groups on-line with constant space. The sub-class is kind of self-evident: it's the class where there's never any ambiguity about whether you are entering a capture group or not. I think that for parsing well-designed formal languages (eg. programming languages and text-based protocols) it should cover most things you would encounter.Anonymoushttps://www.blogger.com/profile/06784200909094825965noreply@blogger.com