Sunday, January 25, 2009

Factor supports XML literal syntax

Factor can now express XML as literals in code. There's a new library, xml.interpolate, which lets you create an XML chunk or document using interpolating either by locals or using a fry-like syntax. Here's a taste, from the syndication vocab, to create an Atom file:
: feed>xml ( feed -- xml )
[ title>> ]
[ url>> present ]
[ entries>> [ entry>xml ] map ] tri
<feed xmlns="">
<link href=<-> />
XML> ;
This could also be written with locals:
:: feed>xml ( feed -- xml )
feed title>> :> title
feed url>> present :> url
feed entries>> [ entry>xml ] map :> entries
<feed xmlns="">
<link href=<-url-> />
XML> ;
Here's an example with more complicated logic:
"one two three" " " split
[ [XML <item><-></item> XML] ] map
<XML <doc><-></doc> XML>
whose prettyprinted output (using xml.writer:pprint-xml) is
<?xml version="1.0" encoding="UTF-8"?>
The word <XML starts a literal XML document, and [XML starts a literal XML chunk. (A document has a prolog and exactly one tag, whereas a chunk can be any kind of snippet, as long as tags are balanced.) The syntax for splicing things in is using a tag like <-foo->. This syntax is a strict superset of XML, as a tag name in XML 1.0 is not allowed to start with -.

It took me just an evening to hack this up. It's less than 15 lines of code modifying the XML parser, and all the interpolation is less than 75 lines of code. Best of all, unlike Scala's XML literals, this doesn't affect the core of the language or make anything else more complicated. It's just nicely tucked away in its own little vocabulary. I plan on replacing usages of html.components with this, once I make some tweaks.

Saturday, January 17, 2009

Quote of the day

It should be obvious that while garbage collection (or compaction) is going on, the execution of the user's program must be suspended. Hence out of common courtesy to the user, we must provide for some notification--perhaps a flashing message in the corner of the screen--that garbage collection is in progress. Otherwise when the program stops running, the user may suspect a bug, and an inexperienced user may panic.

--Introduction to Compiler Construction by Thomas Parsons, Chapter 8.2, "Runtime memory management"
(The book spends about half a page on garbage collection, mostly to describe the mark-sweep algorithm.)

With Lisp machines, it used to be that people would talk about things like running the garbage collector overnight, when the machine was not in use. Virtual memory in use was typically several times the size of the physical memory. Within its context, the quote is accurate (or at least 10 years earlier it was): running mark-sweep on mostly virtual memory with a slow disc will take a lot of time.

I'm happy that this concern is very outdated. If it weren't outdated, Factor (or Java) could never be a reasonable choice of programming language. There are two main reasons for the improvement.
  1. Memory size has increased. Programs that would use tons of swap space 15 years ago now fit comfortably on the RAM.
  2. Algorithms have improved. Not only do we have generational garbage collection to improve throughput, but there are also incremental techniques to spread out the pause, and various methods to take advantage of multiple processors.

So now, the idea of a flashing message for garbage collection in progress is hilariously quaint. If I were going to write a compiler textbook, I'd spend a whole chapter on garbage collection. It is essential for any modern programming language, and a good algorithm is essential.

Thursday, January 15, 2009

XML encoding auto-detection

The Factor XML parser now auto-detects the encodings of XML documents. This is implemented for all of the encodings that are implemented in Factor. To see how it's implemented, look at the XML standard, because it explains it much better than my blog post, which was below.

I was mystified myself when I first read that XML documents can specify what encoding they are, in the document itself. The encoding, if it's not UTF-8, must specified in the prolog like this:
<?xml version="1.0" encoding="ISO-8859-1"?>

The idea of the algorithm is simple. It just goes by cases. First, check for a byte order mark (BOM), which would indicate UTF-16 and a particular endianness, or UTF-8. If there's no BOM, then the first character must be < or whitespace. If it's <, we can differentiate between UTF-16BE, UTF-16LE (without BOMs) and an 8-bit encoding. If it's one of the first two, we can tell by the fact that there's a null byte before or after the <. If it's an 8-bit encoding, we can be sure that there won't be any non-ASCII in the prolog, so just read the prolog as if it's UTF-8, and if an encoding is declared, use that.

To implement it, I just read byte by byte and have a case statement for each level. After two just octets, it's possible to differentiate between UTF-8, UTF-16 (with a BOM, for both endiannesses), UTF-16BE and UTF-16LE. A similar process could also identify UTF-32 and friends after 4 octets. In my implementation, I had to do a little bit of hacking inside the XML code itself to get this integrated properly. All together, it's about 40 or 50 lines of code. It's available now in the Factor git repository.

[Update: Thanks for pointing out my error, Subbu Allamaraju. Fixed a typo, see comments.]

Thursday, January 8, 2009

I'm back

Hi again! I've been gone for the past few months because in the fall I was away in Budapest studying math, and it was really hard. Now, I've decided to take three months off from school to work on Factor full-time. My plan is first work on finishing up Unicode and XML, and then try to improve Factor's garbage collector.

For Unicode, over the past couple days, I fixed bugs in normalization and grapheme breaking, and I implemented word breaking. This was all made a lot easier by the test suites that Unicode includes, and these are now used as unit tests. I also wrote docs for everything. I still need to optimize everything and clean up the code, though. There are a lot more algorithms that I could implement, and plan on implementing eventually, but I'm just going to do this for now.

For XML, I plan on hooking up the XML conformance test suite and fixing any conformance issues that come up, for starters. I've been terribly informal about testing in the past, and I'll try to change this. Instead of optimizing the current XML code directly, I plan on replacing it with a generated lexer and parser: I'll try to make a system like lex/yacc in Factor. Previously, Chris Double has made some pretty cool stuff for parsing with parser combinators and parsing expression grammars, but these both have efficiency problems. I know that a traditional system like lex/yacc can solve the problem of parsing XML, and while it's not as flexible as PEGs, it still might be possible to add high-level OMeta-like features. Parsing is something that I don't know a lot about, so I'll be doing some reading for this.

For garbage collection, I want to actually implement something after studying it for so long! I plan on first making Factor's GC generational mark-sweep, and investigating changes to the generational policy to reduce thrashing. Then, there are two things that I want to do: make it incremental, and make some kind of compaction. Maybe this will take the form of mark-copy, or maybe it'll just be an incremental mark-sweep collector with occasional stop-the-world compaction (which could be disabled for certain applications). Either way, expect the footprint of Factor programs in the future to be reduced by almost half, and hopefully GC pauses will be shorter most of the time.