digit?
, letter?
, printable?
etc. work only on ASCII--the first 127 characters of Unicode. But since I'm trying to add Unicode support to Factor, these should instead work for all Unicode code points.Unicode's character classes make this fairly easy to do. The Unicode standard splits all characters up into 29 character classes. These character classes can be used to implement predicates listed above. All of Factor's existing character predicates can be expressed as the union of one or more Unicode type classes.
At first, I had a lot of pieces of code that looked like this:
: uncased? ( ch -- ? )
category { "Lu" "Ll" "Lt" "Lm" "Mn" "Me" } member? ;
There were ten words defined this way, and four more in
extra/xml/char-classes
. Additionally, to "optimize" some of the code, I had things like
: Letter? ( ch -- ? )
category first CHAR: L = ;
since
Letter?
is the union of all character classes beginning with L. But wouldn't it all be clearer if the code were written like
CATEGORY: uncased Lu Ll Lt Lm Mn Me ;
CATEGORY: Letter Lu Ll Lt Lm Lo ;
This makes predicate classes on fixnums that do the same as the above code. So it's basically equivalent to
PREDICATE: fixnum uncased
category { "Lu" "Ll" "Lt" "Lm" "Mn" "Me" } member? ;
You can read more about predicate classes in the Factor documentation, but basically, it creates a new class,
uncased
which is a subclass of fixnum
consisting of fixnum
s which satisfy the given predicate. A fixnum
is a number 0-2^29, which can be represented by a single cell in Factor. In the end result, the word uncased?
is defined in basically the same way, but we also have a class to use.Factor parsing words work differently from Lisp or Scheme macros. Instead of providing a source to source transformation, parsing words are directly part of the parser, and use the same tools that the parser itself does. In fact, the parser consists mostly of parsing words, with just a little glue to hold it together. But this ends up easier than it sounds, because all of the internals are exposed. Here's how
CATEGORY:
is defined, skipping the word [category]
which takes a list of character classes and other strings and generates an appropriate test quotation:
: define-category ( word categories -- )
[category] fixnum -rot define-predicate-class ;
: CATEGORY:
CREATE ";" parse-tokens define-category ; parsing
The word
define-predicate-class
defines a word as a predicate class, given the test quotation and the superclass. It is the runtime equivalent of PREDICATE:
, just as define-category
is the runtime equivalent of CATEGORY:
. By convention, these runtime equivalents should be defined for the sake of future parsing words which might want to use them.Now, looking into the word
CATEGORY:
, the first word used, CREATE
scans the next word and creates it, as a new word, in the current vocabulary. Next, ";" parse-tokens
lets the parser read until the word ";" is encountered. The blank-separated "tokens" in between are returned as an array of strings. This array of strings is next processed by define-category
, which gives the word its definition.For efficiency, I changed the old test using
member?
to a bit array mapping category numbers to whether they are included. (In the implementation, the categories are all given numbers, which can then be translated into names.) I also added a feature for XML, that you can include any other characters in this, too, just writing them in as if they were character classes, with all string escapes allowed. So after all that, the automatically generated definition of uncased
ended up being
: uncased? ( object -- object )
dup fixnum? [
dup category# ?{
f f f f t f t f t t t t t t t t t t t t t t t t t t
t t t t
} nth [ drop t ] [ "" member? ] if
] [ drop f ] if ;
This code should be really quick to execute. Since
category#
is implemented as a byte-array nth too, everything should run in a constant number of steps. So Unicode here doesn't give a significant speed penalty at all. On my computer, unicode:Letter? runs about 20% slower than strings:Letter? (the current definition), but given how fast they both run, that difference is negligible.Update: After more careful consideration of the benchmarks, it actually looks like the new
Letter?
is five times slower than the old one. Still, it's really fast.
2 comments:
Dan, I'm puzzled by this piece of code:
[ drop t ] [ "" member? ] if
"" member? is always false.
Is this a typo?
No, I should have explained that more clearly. "" member? is automatically generated in case there's some extra letters added, as there are in XML char classes. I benchmarked it and it takes almost no time to run, so why bother removing it? Well, maybe I should put in a test for that when generating the quotation.
Post a Comment