Thursday, January 8, 2009

I'm back

Hi again! I've been gone for the past few months because in the fall I was away in Budapest studying math, and it was really hard. Now, I've decided to take three months off from school to work on Factor full-time. My plan is first work on finishing up Unicode and XML, and then try to improve Factor's garbage collector.

For Unicode, over the past couple days, I fixed bugs in normalization and grapheme breaking, and I implemented word breaking. This was all made a lot easier by the test suites that Unicode includes, and these are now used as unit tests. I also wrote docs for everything. I still need to optimize everything and clean up the code, though. There are a lot more algorithms that I could implement, and plan on implementing eventually, but I'm just going to do this for now.

For XML, I plan on hooking up the XML conformance test suite and fixing any conformance issues that come up, for starters. I've been terribly informal about testing in the past, and I'll try to change this. Instead of optimizing the current XML code directly, I plan on replacing it with a generated lexer and parser: I'll try to make a system like lex/yacc in Factor. Previously, Chris Double has made some pretty cool stuff for parsing with parser combinators and parsing expression grammars, but these both have efficiency problems. I know that a traditional system like lex/yacc can solve the problem of parsing XML, and while it's not as flexible as PEGs, it still might be possible to add high-level OMeta-like features. Parsing is something that I don't know a lot about, so I'll be doing some reading for this.

For garbage collection, I want to actually implement something after studying it for so long! I plan on first making Factor's GC generational mark-sweep, and investigating changes to the generational policy to reduce thrashing. Then, there are two things that I want to do: make it incremental, and make some kind of compaction. Maybe this will take the form of mark-copy, or maybe it'll just be an incremental mark-sweep collector with occasional stop-the-world compaction (which could be disabled for certain applications). Either way, expect the footprint of Factor programs in the future to be reduced by almost half, and hopefully GC pauses will be shorter most of the time.

9 comments:

Aaron Schaefer said...

Very cool Dan...great to have you back!

Anonymous said...

I can't believe you're dropping out of school to work on this silly toy stack language.

wtanksley said...

Dan, I didn't know the Factor PEG was slow... But I've seen a few blog posts elsewhere about linear-time PEG parsers, and since PEGs are a more powerful than LR grammars it's probably a good idea to figure them out rather than implement an inferior and vastly more complex parser.

P.S. Bison/flex are LALR parsers. You don't want to implement an LALR parser. It's not pleasant. I've done it as part of a class.

Adam said...

Welcome back! I hope the compiler book helps in your integration of the GC. Oh, and ignore Anon there... A little angry today ;-)

Daniel Ehrenberg said...

Hi William,

Factor's PEG implementation is linear time, which is great, but it's also linear space. You might say, "well, the input is linear size, so how could you complain about linear space?" One problem is that for some things (like XML), linear space is unacceptable because documents are often parsed on-line as the contents are interpreted. Another problem is that this constant factor is high. I hypothesize that LALR and LR(1) and similar things have been used so much in the past because they're the best solution in most cases. They're definitely suitable for XML.

wtanksley said...

There are nicer things than LALR... Its main reason for existence is its speed relative to LR. But LR is a lot easier to write real grammars for, and PEG is still easier. You're right about memory consumption, although I didn't know the constant factor was so high (so you're even MORE right).

As for the specific example of XML parsing, I don't think I'd use a parser generator to handle that; it screams out for custom coding. Or, of course, a parser generator customized for XML parser generation...

By the way, have you seen VTD-XML? It's an interesting API... It's an indexer rather than (strictly speaking) a parser, so it can handle some of the tasks of a parser much better than a real parser can. It's also a good example of a really bad OO API.

Chris Double said...

wtanksley, my benchmarking from a while back showed that Factor PEG's are slow mainly due to garbage collection costs. I have grammars that parse some things where 75% of the time taken in the parse is spent in the garbage collector.

I'm wondering if it's the large amount of slice's it creates.

I have a partial implementation of peg's that are faster due to reduciing use of memory allocation during parsing, and also allows parsing of streams, but haven't had a chance to complete it. The implementation is partial in that it doesn't yet support left recursion in grammars or the peg.ebnf syntax.

Qi Li said...

Hi Dan, good to hear update from you.
I'm looking forward to seeing you back to Carleton some time soon.

Anonymous said...

VTD-XML is a parser first and foremost, it has indexing features, but that doesn't diminish its roles as a parser...