Saturday, June 30, 2007

Concatenative pattern matching

Finally, after more than two years of flopping around in my brain, I've implemented a kind of concatenative pattern matching in Factor. Not sure how useful it is, but it was fun. And this blog is called useless Factor for a reason: because I explain useless stuff. I'm not giving software engineering lectures here.

Anyway, it's not really specific to Factor; it'd be fine to implement in Joy too, or basically any other dynamically typed concatentive language where quotations can be manipulated as lists/arrays (though the outcome probably wouldn't be as flexible in Joh, because of Factor's word properties). The basic concept is simple, just a requiring a rethinking of pattern matching. Maybe I misunderstood it, but I always thought of ML-style pattern matching as doing two things: matching numeric literals for equality and examining instances of algebraic datatypes. Based on these principles, Chris Double made a great pattern matching library, currently residing in extra/match. But I've always had a problem with it, due to its heavy reliance on named variables. That doesn't detract from its utility; it just bothers me.

A more concatenative solution requires a reanalysis of the problem. Viewed more generally, pattern matching is the inverse of computation. What do I mean by that? Well, take a simple example in Haskell:

data List a = Nil | Cons a (List a)

sum :: Num a => List a -> a
sum Nil = 0
sum (Cons hd tl) = hd + sum tl

example :: Int
example = sum (Cons 4 (Cons 2 (Cons 5 Nil)))

(If this doesn't make sense, you should read this terse but good introduction to Haskell.)

So, you can think of sum as "undoing" the Cons function, when was called, and yielding the arguments. But when Nil was used, it "undoes" that (returning no arguments). When you think of pattern matching in this way, it makes it easier to translate things into a useful form for concatenative languages.

Here's a simple, useless Haskell example of the simplest form of pattern matching, just plain old undoing:

Cons 1 x = Cons 1 2

This sets the variable x to 2. It does this in a number of steps: first, it constructs the Cons 1 2, evaluating the right hand side. Then, it moves on to the left hand side. It checks that the value is a Cons, otherwise it fails. Next, it destructures the Cons, giving the values 1 and 2. With these values, it tries to match the 1 to 1 (succeeding) and the 2 to x (setting 2 to x). With slightly altered code, the match would fail, yielding a runtime error:

Cons 5 x = Cons 1 2

Obviously, 1 does not match 5, so this fails.

So it looks like the operation here is an unconditional undo. How could we implement this in a concatenative language? It's simple: just reverse the quotation and undo each word in order. Let's take the simple example above and translate it into Factor:

TUPLE: cons hd tl ;
1 2 <cons> [ 1 swap <cons> ] undo

This constructs a cons cell, and then undoes the construction operation. It matches the cons-hd to 1 and the cons-tl is left on the stack at the end of the undone operation. If, for some reason, we didn't have a cons cell, or if the cons-hd wasn't 1, there would be a runtime error, something along the lines of "Unification failed".

So, the inverse of [ 1 swap <cons> ] is the inverse of <cons> followed by the inverse of swap followed by the inverse of 1. The inverse of <cons> just checks to see if the value is a cons. If yes, then return the values of the slots (except the delegate). If no, fail. The inverse if swap is, simply, swap. And the inverse of 1 is to pop the top of the stack, and fail if it is not 1. So, all put together, the inverse code could be written as

: undo-cons-swap-1 ( cons -- tl )
dup cons? [ fail ] unless [ cons-hd ] keep cons-tl ! inverse of <cons>
swap ! inverse of swap
1 = [ fail ] unless ; ! inverse of 1

(In reality, it's a bit more complicated when you include unbound variables, so you could undo, for example, dup drop, but let's not get into that just now)

A second operation is testing if match succeeds or fails, returning a boolean. Imagine the Haskell code

cons1 :: List -> Bool
cons1 Cons 1 _ = True
cons1 _ = False

Below, I'll discuss another way to implement this, but one thing that's possible is to just have [ 1 swap <cons> ] matches?. This attempts to match the item at the top of the stack to the undone operation. If it succeeds, it returns true, otherwise it fails. Regardless of the success or failure, everything else left on the stack by the undo operation is dropped.

The only real challenge here is what to do with errors. A unification failure should be caught and returned as f, while, say, a stack underflow error should be propagated. In Factor, recover catches all exceptions indiscriminately. To have this behavior, I had to explicitly rethrow errors which weren't unification failures.

Imagine this syntax in Factor for the original Haskell sum function:

TUPLE: nil ;
TUPLE: cons hd tl ;

: sum ( list -- sum )
{ [ <nil> ] [ 0 ] }
{ [ <cons> ] [ sum + ] }
} which ;

: example ( -- num )
4 2 5 <nil> <cons> <cons> <cons> sum ;

which is the name I chose for a pattern-matching case: it chooses which of the options matches the data. It's like a Haskell case statement. For each 2array, it tries to undo the first part. If it succeeds, the associated quotation is executed. Otherwise, it moves on to the next 2array. The code it compiles to looks something like this:

: sum ( list -- sum )
[ [ <nil> ] undo 0 ] [
[ [ <cons> ] undo sum + ]
[ no-match ] recover-fail
] recover-fail ;

where recover-fail is a word which is like recover but propagates all errors which are not unification failures.

This should all generate relatively efficient code. But even if it weren't for that, I'm more glad that I got this idea written out, because it's just conceptually interesting.

The code is at extra/inverse if you want to look at it. I haven't written documentation yet.

Note: If you are a Haskeller, you were probably horribly offended by my explanations of Haskell pattern matching. For a correct description of what's going on, see this. I oversimplified and made it sound a bit more imperative than it is.

PS. The reason I say this is concatenative pattern matching is that patterns can be concatenated (backwards). For example, [ 1 swap <cons> ] undo is the same as [ <cons> ] undo [ 1 swap ] undo, which is the same as [ <cons> ] undo [ swap ] undo [ 1 ] undo. This grouping is associative. So it follows the laws of concatenative languages, as outlined by Manfred von Thun. By comparison, pattern matching along the lines of extra/match is applicative, applying constructor functions to arguments, which are constants or variables. My goal here was really more to show that concatenative pattern matching is possible rather than to build something useful.

Tuesday, June 26, 2007

The incomplete, would-be Unicode implementer's guide

If you're crazy, you might be think, sometime in your life, "Hey, I need Unicode for something. How about I write it myself?" That's what I thought when I was almost finished with an XML parser and realized it would not be conformant unless it allowed UTF-8 input. Bad idea. You should know that nearly everyone (except for Microsoft, Squeak, and Larceny Scheme) uses a library called International Components for Unicode (ICU). ICU, a project started by IBM, stores strings in UTF-16 in memory, supports about a billion encodings for I/O, and implements all of the Unicode algorithms, as far as I can tell. ICU code is written in C and C++ very easy to read and under a liberal BSD-style license.

But Factor's just too special for ICU. We like our strings to be manipulable as random access arrays. This lets them conform to the sequence protocol. Most programming languages using ICU (Python, Java, etc) allow this, but it is erroneous. When you treat a UTF-16 string as an array, code points above 16 bits will inevitably be treated wrong. As the Java documentation for String.charAt() states, "If the char value specified by the index is a surrogate, the surrogate value is returned." Programmers should not treat strings as arrays. Instead, they should iterate through them using special functions which are aware of high-bit characters. In my opinion, this breaks abstraction. Programmers shouldn't have to care whether they're using Unicode or not; that's the Unicode library's concern. In practice, many programmers will only support the only the Basic Multilingual Plane (the first approximately 2^16 characters) properly, as the other planes come up far less frequently and rarely cause bugs.

But we're doing Unicode right. All strings which can't be contained in ASCII will be stored in UTF-32 in memory, in Normalization Form D. This is the approach Squeak takes, and it allows correct random access in strings. Anyway, your Unicode implementation doesn't have to take this path, but mine does. As ICU supports only UTF-16 for internal representation, I had no choice but to write my own implementation. I don't need to do any display, but there are a few things that I do need: UTF-8/16 I/O, normalization, case conversion, grapheme traversal, collation, name to character conversion, character classes, and, maybe in the future, BIDI. I couldn't find any introductory articles about this stuff, so hopefully this will serve as one.

The first thing you have to be willing to do, when implementing these Unicode algorithms, is to be willing to read the standard. For XML, you can get a good deal done without reading the standard, going on your understanding of XML. And the XML standard is about the most boring thing I've ever read, filled with legalistic pronouncements of what an XML parser SHOULD and MUST do, with helpful EBNF line noise. Compared to the XML standard, the Unicode standard is Shakespeare. It has lots of informative introductions, explanations and algorithm descriptions. The most important parts of it, for us implementors, are chapters 2, 4 and 5 and Unicode Standard Annexes #15 and #29, and Unicode Technical Standard #10. There are many, many more documents, and it can be hard to sort through them all to find which ones are relevant. But those are some good starting points.

In order from easy to hard, here are the main Unicode things that I needed/need to implement.

To start out: UTF-8, UTF-16 and UTF-32 are the main ways that Unicode is stored in files and memory and transmitted between computers. There are other techniques, but very few people use them. They are called "encodings" because they encode code points (characters) in ones and zeroes, each with different bit packing mechanisms. UTF-8 stores code points in one of the following ways, where xxxxx... are the binary digits of the code point:

110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

As you can see, this is a variable width encoding. Depending on how large the code point is, it takes a different number of bytes. This is nice for European languages, but the format is somewhat complicated to manipulate directly and inefficient for east Asian languages. In UTF-16, the format is one of the following:

xxxxxxxx xxxxxxxx
110110yy yyyyyyyy 110111yy yyyyyyyy

where yyyy... is xxxx-2^16. The y value always fits in 20 bits, because Unicode code points go from 1 to 17*2^16. UTF-32 is much simpler, with only one format:

00000000 000xxxxx xxxxxxxx xxxxxxxx

But it's not really used for anything except in-memory storage, and even there it's relatively rare. But it is what I'm using in Factor.

Chapter 2 of the Unicode Standard goes into more detail about what's acceptable and correct here, but this is basically all you need to know. If you haven't done anything with bitshifts and masks before (like me), it can be a bit scary, but it's not too bad.

To do anything beyond plain old encoding and decoding UTF-8/16, you need to look at some of the Unicode resource files. The most important of these files is called UnicodeData.txt, whose format is described here. This huge text file (the Unicode folks refer to it as database) contains a line of information about every single Unicode code point. It is extremely simple to parse (in pure ASCII, values separated by semicolons), but processing is another matter. The first field of every line is the code point in hex, the second is the name, the third is the category, and so on. The ICU folks apparently compiled the database into a few huge header files, but my strategy was to parse them at compiletime, building a number of tables from the included data.

Using this data table, the simplest thing to make is a conversion from Unicode name to character. This allows something like CHAR: black-four-pointed-star to be equivalent to HEX: 2726 or "\{limbu letter ra}bc" to be equivalent to "\u1916bc". This is makes it easier to understand certain string or character literals which would otherwise be unreadable numbers. The way to do this is just to look at the UnicodeData.txt file, pick out the first two fields, parse the first field as a hex number, and construct an associative array from the second field (the name) to the first field (the code point). Then, code points can be looked up quickly given their names. For my library, I applied a little transformation to the names (putting them in lower case and replacing spaces with dashes) to give a more Factor-ish feel. More commonplace is to simply replace the spaces with underscores. This table takes up a good deal of memory, but it will not be included in most programs because it's only needed at parsetime.

Just slightly higher on the complexity scale is character classes. I discussed the high-level interface to these in a previous post but I did not not discuss how I stored them. There are only 29 character classes (or 30, if you count Cn, the non-class), so they can be stored in 5 bits. However, it's a lot easier and faster to store them in a byte array, or 8 bits, and not too space-inefficent. A byte array long enough to hold the first three planes (the BMP, Supplementary Ideographic Plane (SIP) and Supplementary Multilingual Plane (SMP), the only ones defined yet) would take about 200kb. On a modern computer, this is hardly significant, and it should be possible to compress it further for more limited settings.

Grapheme traversal is just a little harder, but still well into the easy category. You might be confused by the word "grapheme." (Is that even a word? I'm not sure.) By "grapheme" I mean "a character plus all combining [accent] marks following it." It is easy to confuse characters with graphemes, and the Unicode standard has a lot of clarification and explanation if my definition didn't make enough sense. But sometimes we have to divide text up into these graphemes. As a simple, useless example, reversing a string should reverse the graphemes, not the characters. If you reverse the characters individually, the combing marks will go to the wrong character. For something more useful, imagine moving the cursor around a text input box. The cursor should move by graphemes, not characters. So, to implement this, we need to know about something called the combing class of characters. The combining class is in the fourth field of UnicodeData.txt. If that is non-zero, then the character is a combining mark. To iterate through glyphs, just keep moving through the string until you get to a combining class of zero--the next glyph. To make things easier, I wrote a word to convert a string into an array of strings, each of which is a glyph. This takes more memory, but is easier to process. Grapheme traversal, as well as word separation and other similar things, is discussed in UAX #29 (which I haven't read yet).

[Update: Well, this is what I thought when I wrote it. But, when I went back and actually read UAX 29, I realized that it is actually a bit more involved. So I blogged about it later in much more depth.]

With case conversion, it starts to get a little messy. Upper, lower and title case of all applicable Unicode code points is listed in the UnicodeData.txt file in fields 14-16. But in Unicode, upper case is not as simple as

: >upper ( str -- upper )
[ ch>upper ] map ;

as the Factor strings library currently defines. Some characters, when converted to another case, become two characters. For example, the German es-zed (ß) becomes SS when capitalized. There are actually several cases like this. This means that any ch>upper function must be at least partly wrong, unless the character set is severely limited. That's not all: there are also context-depending casing modifications for Greek and Lithuanian, as well as locale-dependent modifications for Turkish and Azeri. These special cases are listed in another data file, SpecialCasing.txtThis file has a similarly simple format, which is documented in the same place UnicodeData.txt is. Case conversion is specified in the Unicode Standard, chapter 5. A confession: though it shouldn't be hard, I still haven't gotten around to properly implementing case conversion for Lithuanian, Turkish or Azeri. But there have been no complaints so far!

Hopefully, in a follow up, I'll describe where it gets harder: normalization, collation, and BIDI. But I've written enough for today.

Update: I described normalization forms here.

Monday, June 25, 2007

The joy of unhygienic macros

When working on the Unicode Collation Algorithm (alphabetical order; I'm still nowhere near done) I needed an efficient way to store a bunch of data in one fixnum--a bitfield. I needed this to hold the weights of various Unicode code points without taking an obscene amount of memory. For an introduction to this concept of bitfields in C/C++, see Wikipedia's article on them.

I could have done this the simple way, by manually defining a constructor which does all the left shifts and bitors, and accessors with right shifts and bitands. But I chose to do it a different way: since similar situations come up in other places (UTF-8/16 encoding and decoding, the assemblers of various platforms, and in the FFI), why not make a general library for it? In its current, incomplete state, it's in my repos. As I later found out, one already existed in the core, in the vocabulary math.bitfields. It's very simple and elegant, but somewhat low-level and lacking features I needed, like accessors and bounds checking.

math.bitfields uses a simple array-based signature to specify bitfields. By contrast, what I wrote uses a new definition syntax inspired by tuples. To define a bitfield which has three fields, each with 5 bits of data in them (a number from 0 to 31) and three bits of 0-padding between them, you would use

BITFIELD: foo bar:5 000 baz:5 000 bing:5 ;

This defines several words. First, it defines <foo>, which is the constructor for the bitfield. Second, it defines foo-bar, foo-baz and foo-bing which access the various fields. Third, it defines with-foo-bar, with-foo-baz and with-foo-bing which take a bitfield and a field value and return a new bitfield with the new field value. This is meant to be analogous to set-foo-bar of tuples, though it is somewhat different, as bitfields are immutable.

(An analogous parsing word, SAFE-BITFIELD:, defines the same thing, but includes bounds checks in the constructor and setters. This is useful for debugging, and in less certain situations, at a minimal performance penalty.)

All of these words are simply operations on integers. Yet through the macro, they create an abstraction allowing regular old unboxed fixnums to be viewed as complex objects. In the future, I may extend the bitfield system to create another word, foo, which contains metadata about the bitfield itself. I might also define foo?, to test if an integer can be interpreted as a well-formed foo. But I'm not sure how useful that would be.

To Lispers (and Factorers), this kind of thing elicits a big yawn. Of course you can do that. In Lisp, the invocation might look a little more like this:

(defbitfield foo :bar 5 "000" :baz 5 "000" :bing 5)

but it could do basically the same thing. [Note: the reason the padding is in quotes in Lisp but not in Factor is that, in Factor, it is easy to set things up such that 000 isn't parsed as a number, whereas this is a bit more difficult in Lisp.] I don't even have to mention the fact that this would be impossible in the languages most people use. (Don't you love that construction?) But there are some macro systems, most notably Scheme's (but also Dylan's and many others), which would not allow this. Scheme has what's called a hygienic macro system. The "hygiene" there refers to the fact that the macro doesn't come in contact with any variables which were not given as arguments to it. Basically, it's like putting lexical scope in macros. More formally put, it prevents inadvertent name capture, or name "overlap" between different definitions of the same thing. A better introduction is here.

This prevents a lot of subtle bugs for beginner macro writers*, but it also adds limitations. Most importantly, it is impossible to make new symbols. That means that, to have the same thing in Scheme, an invocation would have to look more like

(define-bitfield foo foo? (foo-bar with-foo-bar 5) "000" (foo-baz with-foo-baz 5) "000" (foo-bing with-foo-bing 5))

Discounting elaborate hacks, runtime penalties or a reduction in functionality, this is as simple as it can get. The Schemer may argue (as I once did) that it is clearer when all variables and function names are explicitly mentioned. However, when the functions defined are known to the programmer though macro conventions, there is no loss of clarity. In this case, I think it's a bit more clear what's going on without the redundant function name listings.

* Factor, due to its lack of lexically scoped local variables and the way it handles symbols, isn't actually subject to most of these bugs.

Tuesday, June 19, 2007

A look back

Slava just found my old infix arithmetic code that I started writing two or three years ago. It is forever preserved in the Sourceforge CVS attic here. Overall, it's pretty bad and incoherent, not to mention pointless. But still, it was the first non-trivial program I wrote. At the time I wrote it, I thought "600 lines? Wow, that's huge! I must have written something big and original and important." But that's obviously false on all counts. Anyway, it's a peek into how Factor used to be less than two years ago, and it shows how much things have changed. Note: this is before there was a module system of any kind. Everything had to be loaded explicitly. I'm really not sure at all which one of these files I thought was the main one, though the README.TXT explains the overall intention.

Macrology never gets old

Right now, Factor's predicates like digit?, letter?, printable? etc. work only on ASCII--the first 127 characters of Unicode. But since I'm trying to add Unicode support to Factor, these should instead work for all Unicode code points.

Unicode's character classes make this fairly easy to do. The Unicode standard splits all characters up into 29 character classes. These character classes can be used to implement predicates listed above. All of Factor's existing character predicates can be expressed as the union of one or more Unicode type classes.

At first, I had a lot of pieces of code that looked like this:

: uncased? ( ch -- ? )
category { "Lu" "Ll" "Lt" "Lm" "Mn" "Me" } member? ;

There were ten words defined this way, and four more in extra/xml/char-classes. Additionally, to "optimize" some of the code, I had things like

: Letter? ( ch -- ? )
category first CHAR: L = ;

since Letter? is the union of all character classes beginning with L. But wouldn't it all be clearer if the code were written like

CATEGORY: uncased Lu Ll Lt Lm Mn Me ;
CATEGORY: Letter Lu Ll Lt Lm Lo ;

This makes predicate classes on fixnums that do the same as the above code. So it's basically equivalent to

PREDICATE: fixnum uncased
category { "Lu" "Ll" "Lt" "Lm" "Mn" "Me" } member? ;

You can read more about predicate classes in the Factor documentation, but basically, it creates a new class, uncased which is a subclass of fixnum consisting of fixnums which satisfy the given predicate. A fixnum is a number 0-2^29, which can be represented by a single cell in Factor. In the end result, the word uncased? is defined in basically the same way, but we also have a class to use.

Factor parsing words work differently from Lisp or Scheme macros. Instead of providing a source to source transformation, parsing words are directly part of the parser, and use the same tools that the parser itself does. In fact, the parser consists mostly of parsing words, with just a little glue to hold it together. But this ends up easier than it sounds, because all of the internals are exposed. Here's how CATEGORY: is defined, skipping the word [category] which takes a list of character classes and other strings and generates an appropriate test quotation:

: define-category ( word categories -- )
[category] fixnum -rot define-predicate-class ;

CREATE ";" parse-tokens define-category ; parsing

The word define-predicate-class defines a word as a predicate class, given the test quotation and the superclass. It is the runtime equivalent of PREDICATE:, just as define-category is the runtime equivalent of CATEGORY:. By convention, these runtime equivalents should be defined for the sake of future parsing words which might want to use them.

Now, looking into the word CATEGORY:, the first word used, CREATE scans the next word and creates it, as a new word, in the current vocabulary. Next, ";" parse-tokens lets the parser read until the word ";" is encountered. The blank-separated "tokens" in between are returned as an array of strings. This array of strings is next processed by define-category, which gives the word its definition.

For efficiency, I changed the old test using member? to a bit array mapping category numbers to whether they are included. (In the implementation, the categories are all given numbers, which can then be translated into names.) I also added a feature for XML, that you can include any other characters in this, too, just writing them in as if they were character classes, with all string escapes allowed. So after all that, the automatically generated definition of uncased ended up being

: uncased? ( object -- object )
dup fixnum? [
dup category# ?{
f f f f t f t f t t t t t t t t t t t t t t t t t t
t t t t
} nth [ drop t ] [ "" member? ] if
] [ drop f ] if ;

This code should be really quick to execute. Since category# is implemented as a byte-array nth too, everything should run in a constant number of steps. So Unicode here doesn't give a significant speed penalty at all. On my computer, unicode:Letter? runs about 20% slower than strings:Letter? (the current definition), but given how fast they both run, that difference is negligible.

Update: After more careful consideration of the benchmarks, it actually looks like the new Letter? is five times slower than the old one. Still, it's really fast.

Friday, June 15, 2007

Matrices and packed tuple arrays

In Factor, it's possible to make homogeneous arrays of unboxed tuples pretty easily and efficiently. The advantage of this is space. On the tuple TUPLE: point x y z ; where the delegate is known to always be f, a tuple array yields a 3:1 space savings. With a smaller tuple, the savings could be even more. Additionally, fewer long-lived objects are created, which could help GC performance.

OK, some background. In Factor, when you want your own type/record/"class" (for some of your classes, not all), you make a tuple. A tuple is basically an array with a class tag as the first element, a delegate as the second element, and record fields in the rest of it. You can define a tuple, as previously mentioned, like TUPLE: point x y z ;. That makes a tuple class called point with three fields, point-x, point-y and point-z. Those words can be used to access the fields, and prepending them with set- sets the fields destructively. The word <point> is defined (by default) to create a point object, given an x, y and z. For object orientation purposes, it's possible to define methods on the point class, and the word point? tests if an object is a point. Additionally, all tuples have a delegate, but this is often just f, which does nothing. Delegates are another post, though...

So, in the point tuple, there's a lot of metadata that each instance needs to hold. First, it needs an object header, which says that it's a tuple with 5 slots. The first slot always contains the word point. The second slot, since no delegate is needed, will always contain f. After that, there's the three numbers. If you have an array of 1000 points, there's no real reason why each point should store all this information; it's always the same! So, why not have an array which holds a bunch of x, y, z instead?

The naive approach would be to have an array of arrays. Each array within the array would have length 3 and contain x, y and z. To get the point out of the array, you just index the main array and convert the inner 3-array into a point tuple. This approach would work fine, but it would still hold unnecessary extra data. For one, there is still some metadata remaining--you still have all these arrays, whose length of three must be stored. Why bother? What if, instead, you stored an array of { x y z x y z ... } and the tuples could simply be implied in the position?

There are two aspects of this. The first aspect is creating an implied structure over a flat array--the interpretation of { x y z x y z ... } as { { x y z } { x y z } ... }. This is provided by the groups class, defined in the splitting vocab in the core. The second step is interpreting { x y z } (the individual inner arrays) as a point. [Note: this involves a sort of lazy virtual map, which could be generalized as a virtual sequence (which wouldn't compile yet). This might be something interesting to investigate in the future.] So there will be two virtual sequences: a matrix (like groups, but with using slices for efficiency and set-nth to allow mutation) and a tuple map, which interprets the matrix slices as tuples. The latter will delegate to the former.

The implementation of all of this is fairly straightforward and simple; it's around 60 lines of code. In Factor, there was no special introspection going on with respect to tuples. There were just a few things I needed: in the classes vocabulary, the word class gives the class of any object, including tuples. In the tuples vocabulary, the word >tuple converts any sequence into a tuple, using the first element of the sequence as the class, the second element as the delegate, and the rest as the slots. Note that, because we're using >tuple to create the tuple, the normal constructor is never called. In the same vocab, the word tuple>array converts a tuple into an array, with the structure previously described.

As I mentioned before, if there is no delegate (that is, if the delegate is known to be consistently f) then there's no reason to store it. So when making a packed tuple array, the delegate should be skipped only if there's no delegate. This means there are three things that need to be known about the tuple: the tuple class, length, and whether a delegate needs to be stored. The length is stored in the underlying matrix, but the the other information must be stored somewhere in the packed array itself. I decided to store them together in what I called an example tuple. Basically, the example tuple is a tuple of the sort that the tuple array will store. It is an instance of the matching class, and it has a delegate if the delegate should be stored. For the point tuple, an example could be T{ point }, which is equivalent to f f f <point>.

One important difference to notice the mutation semantics. By mutation I mean changing a tuple in place. If you index the tuple array and then mutate the resulting tuple, the mutation will not be reflected in the array unless it is explicitly written back with set-nth. This means that tuple arrays are not completely equivalent to regular old arrays which happen to all have the same kind of tuple in them. Fortunately, this isn't much of a problem in practice. Factor is an impure functional language, so oftentimes, tuples aren't mutated after they're created. Of course, if you mutate something that's within a tuple slot, that mutation will be propagated, since that object actually exists in the array.

So, to sum it up, it's really cool that Factor can do this so easily. My explanation might have been confusing, but the code itself is pretty simple. Additionally, both the matrix and the homogeneous tuple array follow the sequence protocol, so immediately, all sequence combinators (like map) can be used on them. One major caveat is that this is a little slower than a regular array: on my computer, it takes around 40 microseconds longer to index than a normal array to index, or about three times as long as it takes to allocate the point tuple itself. One reason for this may be redundant bounds checking in the creation of slices in the matrix.

But I hope this library useful. If you've seen anything like this in any other programming language, I'd be really interested to hear about it. It could be possible in Common Lisp with vector-backed structs, but then you can only use those structs. In scheme, Scheme, you could always make your own record system, as in Lisp. And in C/C++, it's sorta builtin; just make an array of structs (rather than pointers to structs). But otherwise, how could you make this in another language?

The code is in my repository in tuple-arrays.factor. It can be loaded with the line USE: tuple-arrays.

Update: I previously wrote extra/2d to have the two dimensional arrays, but Slava later extended groups to have all of this functionality (though it's not perfect).

Tuesday, June 12, 2007

No more high school ever!

Yesterday didn't feel like much of an eventful day until the end. As I was taking the bus home, I realized there was no more high school, ever, for the rest of my life. It's amazing. I just have one final (Calc III) and then I graduate. Now, when you deride my opinions based on my age, you can't say "He's just a little high schooler." You have to say "He's just a little college student." The downside is, if you ever felt like talking about how amazing it is that "a high schooler can do that," (if I ever did anything amazing, which I didn't) it just doesn't work the same with a college student, because college students are generally recognized as full citizens who are capable of doing cool things.

Anyway, I've been inactive for the past week or so because I've had final projects to do. But here's the latest stuff I've been coming up with. Remember all that qualified naming code I posted here before? Well, I have a third version, which is probably much better. It's based on Slava's original idea for qualified naming. Instead of having a word like math or a prefix like << to mark what vocab something is in, what if you prepend it onto the word name itself? That is, what if you used math:+ instead of math: + or << math + >> The advantage of that, besides saving one or four characters, is that it is integrated into the language better. For example, you can do \ math:+ and all will work properly; the word + from the vocab math will be pushed on the stack. The other solutions don't do this right. So, here's the surprisingly short code needed to achieve this. Note that this code only works with the latest CVS because it uses the new module system, which treats vocabs differently.

USING: kernel sequences assocs parser vocabs namespaces ;

: define-qualified ( vocab-name -- )
[ CHAR: : add ] keep vocab vocab-words
[ >r append r> ] assoc-map-with use get push ;

scan define-qualified ; parsing

So the things I'm working on now are (a) Unicode, which should be ready for .90, or if not, .91, and (b) improvements on libs/xml. You probably thought libs/xml was done after so much work. Well, Matthew Willis is writing a Jabber client in Factor. To do that, he has to parse XML without exhausting the stream. Now, because the XML library is stream-based, it can handle streams just fine. The only problem is that, with the DOM-style view (which is reached using the words read-xml or xml-chunk, among others), the stream is exhausted. So, in their current state, it is unusable for this purpose. These structures are strict, not lazy (a path I may explore in the future, along the lines of HaXML, which I still have to read about).

The solution is to extend the pull-xml view. When I implemented it, I never thought it would be that useful, but now it turns out to be very useful for these incremental situations. To extend the DOM view to pull-xml, all I have to do is write a word which will pull the next DOM-view element--a tag, string, or other similar thing. Except, if the next element is a closing tag, I have to return that. The implementation of this is almost done, and when it is, Yuuki should be able to make the Jabber client relatively easily, without having to reinvent the wheel of the tree structure that XML follows.