Tuesday, June 19, 2007

Macrology never gets old

Right now, Factor's predicates like digit?, letter?, printable? etc. work only on ASCII--the first 127 characters of Unicode. But since I'm trying to add Unicode support to Factor, these should instead work for all Unicode code points.

Unicode's character classes make this fairly easy to do. The Unicode standard splits all characters up into 29 character classes. These character classes can be used to implement predicates listed above. All of Factor's existing character predicates can be expressed as the union of one or more Unicode type classes.

At first, I had a lot of pieces of code that looked like this:

: uncased? ( ch -- ? )
category { "Lu" "Ll" "Lt" "Lm" "Mn" "Me" } member? ;

There were ten words defined this way, and four more in extra/xml/char-classes. Additionally, to "optimize" some of the code, I had things like

: Letter? ( ch -- ? )
category first CHAR: L = ;

since Letter? is the union of all character classes beginning with L. But wouldn't it all be clearer if the code were written like

CATEGORY: uncased Lu Ll Lt Lm Mn Me ;
CATEGORY: Letter Lu Ll Lt Lm Lo ;

This makes predicate classes on fixnums that do the same as the above code. So it's basically equivalent to

PREDICATE: fixnum uncased
category { "Lu" "Ll" "Lt" "Lm" "Mn" "Me" } member? ;

You can read more about predicate classes in the Factor documentation, but basically, it creates a new class, uncased which is a subclass of fixnum consisting of fixnums which satisfy the given predicate. A fixnum is a number 0-2^29, which can be represented by a single cell in Factor. In the end result, the word uncased? is defined in basically the same way, but we also have a class to use.

Factor parsing words work differently from Lisp or Scheme macros. Instead of providing a source to source transformation, parsing words are directly part of the parser, and use the same tools that the parser itself does. In fact, the parser consists mostly of parsing words, with just a little glue to hold it together. But this ends up easier than it sounds, because all of the internals are exposed. Here's how CATEGORY: is defined, skipping the word [category] which takes a list of character classes and other strings and generates an appropriate test quotation:

: define-category ( word categories -- )
[category] fixnum -rot define-predicate-class ;

CREATE ";" parse-tokens define-category ; parsing

The word define-predicate-class defines a word as a predicate class, given the test quotation and the superclass. It is the runtime equivalent of PREDICATE:, just as define-category is the runtime equivalent of CATEGORY:. By convention, these runtime equivalents should be defined for the sake of future parsing words which might want to use them.

Now, looking into the word CATEGORY:, the first word used, CREATE scans the next word and creates it, as a new word, in the current vocabulary. Next, ";" parse-tokens lets the parser read until the word ";" is encountered. The blank-separated "tokens" in between are returned as an array of strings. This array of strings is next processed by define-category, which gives the word its definition.

For efficiency, I changed the old test using member? to a bit array mapping category numbers to whether they are included. (In the implementation, the categories are all given numbers, which can then be translated into names.) I also added a feature for XML, that you can include any other characters in this, too, just writing them in as if they were character classes, with all string escapes allowed. So after all that, the automatically generated definition of uncased ended up being

: uncased? ( object -- object )
dup fixnum? [
dup category# ?{
f f f f t f t f t t t t t t t t t t t t t t t t t t
t t t t
} nth [ drop t ] [ "" member? ] if
] [ drop f ] if ;

This code should be really quick to execute. Since category# is implemented as a byte-array nth too, everything should run in a constant number of steps. So Unicode here doesn't give a significant speed penalty at all. On my computer, unicode:Letter? runs about 20% slower than strings:Letter? (the current definition), but given how fast they both run, that difference is negligible.

Update: After more careful consideration of the benchmarks, it actually looks like the new Letter? is five times slower than the old one. Still, it's really fast.


Slava Pestov said...

Dan, I'm puzzled by this piece of code:

[ drop t ] [ "" member? ] if

"" member? is always false.

Is this a typo?

Daniel Ehrenberg said...

No, I should have explained that more clearly. "" member? is automatically generated in case there's some extra letters added, as there are in XML char classes. I benchmarked it and it takes almost no time to run, so why bother removing it? Well, maybe I should put in a test for that when generating the quotation.