
Friday, May 23, 2008

Little things I've been working on

I've been working on a few different things that, individually, are too insignificant for a blog post, so I'll put them together.

One is that I expanded my previous survey of algorithms for XML diffing; the result is here [PDF].

I've been working on a few libraries in Factor. One is extra/delegate, which now interacts properly with reloading. For example, if you define a protocol, then say that a class consults something for that protocol, and then redefine the protocol to include more generic words, the consulting class will be automatically updated. Unfortunately, this doubled the size of the code, give or take. Slava changed duplex-streams to use extra/delegate, and the code is much simpler now, as it previously amounted to manual delegation. I got rid of mimic because it's unsafe and violates encapsulation in unexpected ways.

Another little thing I made was extra/lcs, a library for calculating the Levenshtein distance, the longest common subsequence, and an edit script between two strings. Because the LCS problem and Levenshtein distance are duals, I was able to share most of the code between them. I used the simple quadratic time and space algorithm that Wikipedia describes rather than the better O(nd) algorithm [PDF] commonly called the GNU diff algorithm. I'll upgrade it to that once I understand the algorithm. This replaces extra/levenshtein. I expected it to be significantly faster, because the old code used dynamically scoped variables and this uses statically scoped locals, but the speed improvement turned out to be just around 4% in small informal benchmarks on short strings.
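For example, in a listener session (word names as I gave them in extra/lcs; they may differ in later versions):

USING: lcs prettyprint ;
"kitten" "sitting" levenshtein . ! => 3
"kitten" "sitting" lcs .         ! => "ittn"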

Now, I'm working on the Unicode Collation Algorithm. The basics were simple, but I'm still unsure how to recognize collation graphemes efficiently in the general case. In any case, I discovered a bug in normalization: my insertion sort, used for the canonical ordering of combining marks, wasn't a stable sort as normalization requires. It was actually an anti-stable sort: it reversed subsequences which had the same sort key. That was really stupid of me. I'm going to work on incorporating existing test suites for things like this. On the collation test suite, all but 8,000 of 130,000 tests pass, so it's far from ready.
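To illustrate the stability requirement: combining marks are sorted by combining class, and marks with equal classes must keep their original relative order. A small made-up example (the class/mark pairing here is hypothetical):

! marks tagged with their combining classes, in original order:
{ { 230 "first" } { 220 "below" } { 230 "second" } }
! stable sort by class, as normalization requires:
!   { { 220 "below" } { 230 "first" } { 230 "second" } }
! what my anti-stable insertion sort produced:
!   { { 220 "below" } { 230 "second" } { 230 "first" } }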

Monday, December 10, 2007

Multiline string literals in Factor

It's always annoyed me somewhat that Factor strings can only span one line and that there was no mechanism for anything like the "here documents" Perl has. So I decided to write one myself. At this point, I don't really need it and have forgotten what I wanted it for, but it was still a fun exercise.

I started out thinking I should do something similar to the way some other languages do it: write a word, maybe called <<<, followed by some string which delimits a multiline string literal expression. But I realized this wouldn't be the most idiomatic way in Factor. First, if you're making a multiline string literal, why would you ever have it be within a word? For constants like this, it's considered best practice to put them in their own separate words. Second, why do I need to give the option of choosing your own ending? What's wrong with just using a semicolon, like other Factor definitions?

Putting this together, I came up with the following syntax:

STRING: something
Hey, I can type all the text I want here
And it can be over multiple lines
But without the backslash-n escape sequence!
;

The final line, the semicolon, has to have no whitespace before or after it. This allows for practically any useful string to be written this way. The syntax will compile into something like this:

: something
    "Hey, I can type all the text I want here\nAnd it can be over multiple lines\nBut without the backslash-n escape sequence!" ;

With a previous parser design, multiline string literals like this were impossible, but now they can be done in 11 lines. I packaged this up and put it in my repository under extra/multiline so others can use it.


Using the internals of the parser, the first word advances the parser state one line and returns the text of the new line.

: next-line-text ( -- str )
    lexer get dup next-line line-text ;

The next two words do the bulk of the parsing. They advance the parser's current line until reaching a line consisting only of ";", and then advance the line one more time. While moving forward, a string is compiled consisting of all of the lines, interspersed with newlines.

: (parse-here) ( -- )
    next-line-text dup ";" =
    [ drop lexer get next-line ] [ % "\n" % (parse-here) ] if ;

: parse-here ( -- str )
    [ (parse-here) ] "" make 1 head*
    lexer get next-line ;

Finally, the word STRING: puts it all together, defining a new word using a string gathered with parse-here.

: STRING:
    CREATE dup reset-generic
    parse-here 1quotation define-compound ; parsing

There are downsides to having an extremely flexible syntax like Factor's. Things can be less predictable when libraries can alter the fundamentals of syntax. It'd be impossible to create a fixed BNF description of Factor syntax. Additionally, the particular structure of Factor sometimes encourages syntax extension that's heavily dependent on the details of the current implementation. But the upside is that we can do things like this. I think it's worth it.

Monday, August 27, 2007

Symbolic constants and enumeration

For grapheme classes in Unicode, I needed a bunch of symbolic constants 0..n to describe a number of things. (More on graphemes in the next post.) At first, I used regular old words corresponding to regular old numbers, in a typical Factor idiom:

: Any 0 ;
: Extend 1 ;
: L 2 ;
! ...

But later, in constructing the table (discussed in more detail below), I found the need to write arrays like { Control CR LF }. But in Factor, this doesn't work the way you might think it does; it's basically equivalent to, in Lisp, '(Control CR LF) (a list of the symbols) when we actually want `(,Control ,CR ,LF) (a list of the values that the symbols refer to).

How can we get that quasiquotation in Factor? The most obvious way is using make: [ Control , CR , LF , ] { } make. But that solution is pretty wordy, and not nearly fun enough. Here's another idea: what if you make the grapheme class words expand directly, at parsetime, to their value? This can be done if you use

: Any 0 parsed ; parsing
: Extend 1 parsed ; parsing
! ...

But who wants to write this eight times? I don't! We can abstract it into a parsing word CONST: to do this for us:

: define-const ( word value -- )
    [ parsed ] curry dupd define-compound
    t "parsing" set-word-prop ;

: CONST:
    CREATE scan-word dup parsing?
    [ execute dup pop ] when define-const ; parsing
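With that, the earlier definitions shrink to, for example:

CONST: Any 0
CONST: Extend 1
! now { Any Extend } parses directly as { 0 1 }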

But this code follows a recurring pattern, similar to C's enum. Why not abstract it into our own ENUM:?

: define-enum ( words -- )
    dup length [ define-const ] 2each ;

: ENUM:
    ";" parse-tokens [ create-in ] map define-enum ; parsing

Going back to the original problem, the ENUM: parsing word allows me to write

ENUM: Any L V T Extend Control CR LF graphemes ;

to specify the grapheme classes, without having to care about which number they correspond to.
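And since each constant expands to its value at parsetime, the table literals from before now just work. Given the ordering above (Any = 0, L = 1, ..., LF = 7):

{ Control CR LF } ! parses as { 5 6 7 }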

This solution isn't perfect. The problem here is that this completely ruins homoiconic syntax for all uses of the constants. By "homoiconic syntax," I mean that see can print out the original source code, with the only real difference being whitespace. A macro using compiler transformations, which would preserve homoiconic syntax by using a different strategy, might be preferable. But I still wanted to share this little thing with you all.

Note: At the time I wrote this, this strategy made sense. But now, I'm thinking it'd be better to just go with a regular C-ENUM:, which should be renamed ENUM:. This is partly because of changes in the parser which make it more difficult to use parsing words defined in the same source file.

Monday, July 16, 2007

Eww, that's the dirtiest macro I've ever seen!

I've shown a few parsing words on this blog that create words that aren't mentioned directly in the text. An example of this is TUPLE:, which makes a class with a number of slots, a constructor, a predicate, accessors and setters. It's very useful, even if not all the words it creates are given literally in the code. Now, what would you say if I told you I have an unhygienic macro that creates O(n^m) words? (I'll tell you what m and n are later.) In reality, it makes more than n^m words. With 7 as both m and n, it creates 1419760 words, and with 8 as both of them, Factor crashes because it runs out of memory after expanding the heap a number of times.

You're probably thinking, "Why would I ever want such a stupid unhygienic macro? And what does it do? Your introduction was so boring and pointless, Dan." Well, I'll tell you: it's a new stack shuffling syntax. dmpk2k on #concatenative asked me about alternative shuffle word syntax. For example, why not make dup into x-xx, swap into xy-yx, etc.? It'd be much easier to remember, and there would be no need for "stack idioms" like swap rot for xyz-zyx. Even complicated stack shuffling would be simple.

Believe it or not, it's not all that hard to do in Factor. It'd be prohibitively difficult to extend the parser to make this work for all possible shuffles, so I took a halfway approach: provide a way to define as many as you want. The syntax I came up with was

SHUFFLE: abcd 6

which defines shuffle words for up to 4 inputs labeled a, b, c and d (in order), and up to 6 outputs, defining a paltry 6686 shuffle operators, everything from abcd- to ab-baabaa. (If you want, you can alias them, as in : tuckd xyz-yxyz ;, but there's generally little point.) Some of the shufflers do the same thing as others, such as ab-a and a-, but this constitutes a very small fraction of the total. Going with the variables at the beginning of the article, n is the length of the alphabet (here, abcd) and m is the maximum output length (here, 6). The total number of transformations is higher than 4^6 (4096) because that count only includes the longest transforms, from four elements on the stack to six. However, it is still an accurate asymptotic approximation.
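To get a feel for the naming, here's how a few of the generated words line up with the classic shufflers (my own pairing, derived from the definitions above):

! a-aa    is dup
! ab-ba   is swap
! ab-aba  is over
! abc-bca is rot
! ab-a    is drop (and so is a-)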

So, there are two difficult programming problems here, in addition to some easy ones. The hard ones are (a) generating all possible stack transformations and (b) turning a stack transformation into a quotation.

Generating all stack transformations

(a) may look simple, but it's a bit more complex than it seems. The possible inputs are easy: each is just a prefix of the alphabet, anywhere from the first element alone to the entire alphabet. But the outputs can be any number of the inputs, in any order, duplicated or not.

This can be reduced to a simpler problem. Let the length of the alphabet be A, and the length of the output be O. Given an alphabet length and an array length, generate an array of arrays consisting of all the possible combinations of the numbers 0 to A-1. The length of the array generated will be A^O, because there are O spaces, each of which can be filled with A different items. So if we can find a bijection between the numbers from 0 to (A^O)-1 (let's call one of these numbers N) and the arrays we need, the problem is easy.

As it happens, the structure of this problem is remarkably similar to converting numbers to different bases. We can treat our number N as a base A number. Then we create an array of length O consisting of the digits of the number. Each digit should be between 0 and A-1. If we convert each number N into this base, we'll get every possible combination for the output array. (For example, with A = 4 and O = 3, N = 27 is 123 in base 4, so it corresponds to the digit array { 1 2 3 }.) The code for this is below, where translate converts a number into an array with a given length and alphabet size, and (combinations) creates all possible combinations for the given length and alphabet size:

: translate ( alphabet out-len n -- seq )
    -rot [ drop /mod ] map-with nip ;

: (combinations) ( alphabet out-len -- seq[seq] )
    2dup ^ [ translate ] map-with2 ;

To make an array of all the possible stack shuffles with input length from 1 to max-in and output length from 1 to max-out, we need a couple more words. To represent a stack shuffle, I used a 2array whose first element is the number of inputs and whose second element is an array of the output indices.

: combinations ( n max-out -- seq[seq] )
    1+ [ (combinations) ] map-with concat ;

: make-shuffles ( max-out max-in -- shuffles )
    [ 1+ dup rot combinations [ 2array ] map-with ] map-with concat ;
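For instance, with one input and up to two outputs, the result should be something like this (assuming I've traced the code correctly; the ordering may differ):

2 1 make-shuffles .
! => { { 1 { } } { 1 { 0 } } { 1 { 0 0 } } }
! i.e. the shuffles a- , a-a and a-aa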

(Going counter to the Forth/Factor mantra, I actually wrote this code in a top-down fashion. First, I wrote make-shuffles, and then realized I'd need combinations, and then realized I need (combinations) and translate. This is generally not a good tactic, but it works for small problems that take a lot of reasoning, like this.)

Generating quotations

It surprised me, but (b) turned out to be the easy one. At first, to transform the stack, I stored the top few elements in an array, then accessed the elements used for output and dropped the array. But that isn't optimal, because an array is unnecessarily allocated. It's slightly more efficient if we just pass things around on the stack.

To make this easier, I used Chris Double's extra/shuffle library. It defines words like [npick] and [ndrop] which generate quotations for arbitrary depth pick and drop. Using this, I can take the strategy of reaching down and pulling up the last element of the output array, then dipping beneath it. After iterating backwards through the entire output array, I drop the inputs and come back up. That wasn't too good an explanation, so just have a look at the code:

: shuffle>quot ( shuffle -- quot )
    [
        first2 2dup [ - ] map-with
        reverse [ [npick] % \ >r , ] each
        swap [ndrop] % length [ \ r> , ] times
    ] [ ] make ;

As an example, the code for ab-ba is over >r dup >r 2drop r> r>.
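If that explanation didn't do it for you, here's the same ab-ba code traced step by step (data stack on the left of the comment, retain stack on the right):

! start:  a b
over    ! a b a
>r      ! a b          a
dup     ! a b b        a
>r      ! a b          a b
2drop   !              a b
r>      ! b            a
r>      ! b a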

Wrapping it all up

To make this work, I also need a few trivial operators. First, a word for turning a stack shuffle datum into a string:
: shuffle>string ( names shuffle -- string )
    [ [ swap nth ] map-with ] map-with
    first2 "-" swap 3append >string ;

For attaching a stack effect to a shuffle word:

: put-effect ( word -- )
    dup word-name "-" split1
    [ >array [ 1string ] map ] 2apply
    <effect> "declared-effect" set-word-prop ;

And for defining a set of shuffle operators (in-shuffle and out-shuffle make sure that the operators get defined in their own minivocabulary, so they're not exported by default. This follows the example of <PRIVATE and PRIVATE>):

: in-shuffle ( -- ) in get ".shuffle" append set-in ;
: out-shuffle ( -- ) in get ".shuffle" ?tail drop set-in ;

: define-shuffles ( names max-out -- )
    in-shuffle over length make-shuffles [
        [ shuffle>string create-in ] keep
        shuffle>quot dupd define-compound put-effect
    ] each-with out-shuffle ;

And finally, the dirty macro itself, the greatest sin to hygiene Factor has ever seen:

: SHUFFLE:
    scan scan string>number define-shuffles ; parsing

Enjoy! I'm not sure whether it would be ethical for me to release this evil into the Factor distribution, so I haven't pushed it on to my repository.

Monday, June 25, 2007

The joy of unhygienic macros

When working on the Unicode Collation Algorithm (alphabetical order; I'm still nowhere near done) I needed an efficient way to store a bunch of data in one fixnum--a bitfield. I needed this to hold the weights of various Unicode code points without taking an obscene amount of memory. For an introduction to this concept of bitfields in C/C++, see Wikipedia's article on them.

I could have done this the simple way, by manually defining a constructor which does all the left shifts and bitors, and accessors with right shifts and bitands. But I chose to do it a different way: since similar situations come up in other places (UTF-8/16 encoding and decoding, the assemblers of various platforms, and in the FFI), why not make a general library for it? In its current, incomplete state, it's in my repos. As I later found out, one already existed in the core, in the vocabulary math.bitfields. It's very simple and elegant, but somewhat low-level and lacking features I needed, like accessors and bounds checking.

math.bitfields uses a simple array-based signature to specify bitfields. By contrast, what I wrote uses a new definition syntax inspired by tuples. To define a bitfield which has three fields, each with 5 bits of data in them (a number from 0 to 31) and three bits of 0-padding between them, you would use


BITFIELD: foo bar:5 000 baz:5 000 bing:5 ;


This defines several words. First, it defines <foo>, which is the constructor for the bitfield. Second, it defines foo-bar, foo-baz and foo-bing which access the various fields. Third, it defines with-foo-bar, with-foo-baz and with-foo-bing which take a bitfield and a field value and return a new bitfield with the new field value. This is meant to be analogous to set-foo-bar of tuples, though it is somewhat different, as bitfields are immutable.
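Concretely, the generated words amount to something like this hand-written sketch (my assumption for illustration: the first field occupies the most significant bits, so bing sits in bits 0-4, baz in bits 8-12 and bar in bits 16-20):

: <foo> ( bar baz bing -- foo )
    rot 16 shift rot 8 shift bitor bitor ;
: foo-bar ( foo -- bar ) -16 shift HEX: 1f bitand ;
: foo-baz ( foo -- baz ) -8 shift HEX: 1f bitand ;
: foo-bing ( foo -- bing ) HEX: 1f bitand ;
: with-foo-baz ( foo baz -- foo' )
    8 shift swap HEX: 1f00 bitnot bitand bitor ;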

(An analogous parsing word, SAFE-BITFIELD:, defines the same thing, but includes bounds checks in the constructor and setters. This is useful for debugging, and in less certain situations, at a minimal performance penalty.)

All of these words are simply operations on integers. Yet through the macro, they create an abstraction allowing regular old unboxed fixnums to be viewed as complex objects. In the future, I may extend the bitfield system to create another word, foo, which contains metadata about the bitfield itself. I might also define foo?, to test if an integer can be interpreted as a well-formed foo. But I'm not sure how useful that would be.

To Lispers (and Factorers), this kind of thing elicits a big yawn. Of course you can do that. In Lisp, the invocation might look a little more like this:

(defbitfield foo :bar 5 "000" :baz 5 "000" :bing 5)

but it could do basically the same thing. [Note: the reason the padding is in quotes in Lisp but not in Factor is that, in Factor, it is easy to set things up such that 000 isn't parsed as a number, whereas this is a bit more difficult in Lisp.] I don't even have to mention the fact that this would be impossible in the languages most people use. (Don't you love that construction?) But there are some macro systems, most notably Scheme's (but also Dylan's and many others), which would not allow this. Scheme has what's called a hygienic macro system. The "hygiene" there refers to the fact that the macro doesn't come in contact with any variables which were not given as arguments to it. Basically, it's like putting lexical scope in macros. More formally put, it prevents inadvertent name capture, or name "overlap" between different definitions of the same thing. A better introduction is here.

This prevents a lot of subtle bugs for beginner macro writers*, but it also adds limitations. Most importantly, it is impossible to make new symbols. That means that, to have the same thing in Scheme, an invocation would have to look more like

(define-bitfield foo foo? (foo-bar with-foo-bar 5) "000" (foo-baz with-foo-baz 5) "000" (foo-bing with-foo-bing 5))

Discounting elaborate hacks, runtime penalties or a reduction in functionality, this is as simple as it can get. The Schemer may argue (as I once did) that it is clearer when all variables and function names are explicitly mentioned. However, when the functions defined are known to the programmer through macro conventions, there is no loss of clarity. In this case, I think it's a bit clearer what's going on without the redundant function name listings.



* Factor, due to its lack of lexically scoped local variables and the way it handles symbols, isn't actually subject to most of these bugs.

Tuesday, June 19, 2007

Macrology never gets old

Right now, Factor's predicates like digit?, letter?, printable? etc. work only on ASCII--the first 128 characters of Unicode. But since I'm trying to add Unicode support to Factor, these should instead work for all Unicode code points.

Unicode's character classes make this fairly easy to do. The Unicode standard splits all characters up into 29 character classes, and all of Factor's existing character predicates can be expressed as the union of one or more of these classes.

At first, I had a lot of pieces of code that looked like this:

: uncased? ( ch -- ? )
    category { "Lu" "Ll" "Lt" "Lm" "Mn" "Me" } member? ;

There were ten words defined this way, and four more in extra/xml/char-classes. Additionally, to "optimize" some of the code, I had things like

: Letter? ( ch -- ? )
    category first CHAR: L = ;

since Letter? is the union of all character classes beginning with L. But wouldn't it all be clearer if the code were written like

CATEGORY: uncased Lu Ll Lt Lm Mn Me ;
CATEGORY: Letter Lu Ll Lt Lm Lo ;

This makes predicate classes on fixnums that do the same as the above code. So it's basically equivalent to

PREDICATE: fixnum uncased
    category { "Lu" "Ll" "Lt" "Lm" "Mn" "Me" } member? ;

You can read more about predicate classes in the Factor documentation, but basically, it creates a new class, uncased, which is a subclass of fixnum consisting of the fixnums that satisfy the given predicate. A fixnum is an integer small enough to fit in a single cell in Factor (up to about 2^29 on a 32-bit system). In the end result, the word uncased? is defined in basically the same way, but we also have a class to use.
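With the class defined, the predicate works like any other word (A is in category Lu, which is in the list above; + is in Sm, which isn't):

CHAR: A uncased? . ! => t
CHAR: + uncased? . ! => f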

Factor parsing words work differently from Lisp or Scheme macros. Instead of providing a source to source transformation, parsing words are directly part of the parser, and use the same tools that the parser itself does. In fact, the parser consists mostly of parsing words, with just a little glue to hold it together. But this ends up easier than it sounds, because all of the internals are exposed. Here's how CATEGORY: is defined, skipping the word [category] which takes a list of character classes and other strings and generates an appropriate test quotation:

: define-category ( word categories -- )
    [category] fixnum -rot define-predicate-class ;

: CATEGORY:
    CREATE ";" parse-tokens define-category ; parsing

The word define-predicate-class defines a word as a predicate class, given the test quotation and the superclass. It is the runtime equivalent of PREDICATE:, just as define-category is the runtime equivalent of CATEGORY:. By convention, these runtime equivalents should be defined for the sake of future parsing words which might want to use them.

Now, looking into the word CATEGORY:, the first word used, CREATE, scans the next token and creates it, as a new word, in the current vocabulary. Next, ";" parse-tokens lets the parser read until the word ";" is encountered. The blank-separated "tokens" in between are returned as an array of strings. This array of strings is then processed by define-category, which gives the word its definition.

For efficiency, I changed the old test using member? to a bit array mapping category numbers to whether they are included. (In the implementation, the categories are all given numbers, which can then be translated into names.) I also added a feature for XML: you can include individual extra characters in a class, too, written as if they were character classes, with all string escapes allowed. So after all that, the automatically generated definition of uncased ended up being

: uncased? ( object -- ? )
    dup fixnum? [
        dup category# ?{
            f f f f t f t f t t t t t t t
            t t t t t t t t t t t t t t t
        } nth [ drop t ] [ "" member? ] if
    ] [ drop f ] if ;

This code should be really quick to execute. Since category# is implemented as a byte-array nth too, everything should run in a constant number of steps. So Unicode here doesn't give a significant speed penalty at all. On my computer, unicode:Letter? runs about 20% slower than strings:Letter? (the current definition), but given how fast they both run, that difference is negligible.

Update: After more careful consideration of the benchmarks, it actually looks like the new Letter? is five times slower than the old one. Still, it's really fast.