Tuesday, July 31, 2007

Factor[ial] 101

If you're starting out with Factor, or any other stack-based language, you'll probably be a little confused about how the dataflow works. A good first exercise, which probably every successful Factor programmer has done at least once, is a word to calculate the factorial of a number. If you haven't done that yet, I strongly recommend that you stop and implement it now, in any way you feel like. Afterwards, you can continue reading this blog post. If you've already implemented factorial before, skip down past the second section to the third. In case you forgot, the algorithm for calculating factorial can be described as:

0! = 1
(n+1)! = (n!)(n+1)

It is only defined on nonnegative integers, and it's OK if your implementation has weird behavior, such as infinite recursion, for input outside of this domain. (Mine does too.)


Chances are, you wrote something like this:

: factorial ( n -- n! )
dup zero? [ drop 1 ] [ dup 1- factorial * ] if ;

If you didn't, that's OK, because I'm going to explain this in excruciating detail. Hopefully, this will serve as a good introduction to non-Factorers or beginning Factorers. It should be readable even if you're not running Factor, but you might find it easier if you open up a listener to play around with at the same time.

So, let's start at the beginning. The colon : indicates the start of a word (this is the Factor term for a function), which is called factorial. After that, the stack comment ( n -- n! ) describes the stack effect of the code. Informally, the stack comment here says that factorial takes n off the stack and pushes n! on. For example, if the stack consists of 3 6 3 1 5, then after invoking factorial, the stack will be 3 6 3 1 120. (The stack here is written bottom to top, left to right.)

Now let's look at the actual code. The implementation of the factorial word is everything after the stack comment and before the semicolon ;, which terminates the definition. So the definition is dup zero? [ drop 1 ] [ dup 1- factorial * ] if. I'll go through this word by word. dup is a so-called "stack shuffler" whose only purpose is to manipulate the data stack. This word duplicates the top item on the stack, so there are two copies of it. (Note that it doesn't actually copy the datum, it just copies a reference to it, except for the few objects that are stored on the stack by value.) So if the stack is 1 2 3, then after executing the dup word, the stack is 1 2 3 3. Other stack shufflers include swap, drop, nip, tuck, 2dup and over. These can be found in Factor's built-in documentation system. In general, stack shufflers, like other words, only deal with the top few items on the stack. Underneath what is known to be on the stack, words by convention leave the stack as it is.
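
To get a feel for the other shufflers mentioned above, here they are applied to the same example stack, in the same bottom-to-top notation (these are easy to verify in the listener):

1 2 3 swap ==> 1 3 2
1 2 3 drop ==> 1 2
1 2 3 nip ==> 1 3
1 2 3 over ==> 1 2 3 2
1 2 3 tuck ==> 1 3 2 3
1 2 3 2dup ==> 1 2 3 2 3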

The second word in the definition of factorial is zero?. Looking at that word's definition using \ zero? see, we can see that its stack effect is ( x -- ? ). In informal notation, ? in a stack effect means that that item is a boolean, either t or f. The fact that the word zero? returns a boolean is also indicated by the fact that the name ends with a question mark. (Note: this convention only applies when the word returns a single value. When it returns multiple values, one of which is a boolean, the question mark precedes the word, as in ?head.) So, the word zero? takes the object on the top of the stack and returns a boolean indicating whether or not that value is zero.

Let's look at what we have so far: dup zero?. Applied to a few inputs, here's what it yields:

1 2 3 dup zero? ==> 1 2 3 f
0 dup zero? ==> 0 t
dup zero? ==> Datastack underflow error

The last error happened because there wasn't enough on the stack to proceed. It was triggered by dup, which expects at least one item on the stack. Most stack underflow errors can be detected by the compiler beforehand.

As you might have guessed, the purpose of the dup before the zero? was so that we could save the value of the number and use it later, while also using it for the purpose of testing its zero-ness. Now, we'll enter a conditional, dependent on whether that value was zero. Following the code we've analyzed are two quotations enclosed in square brackets. These are like code blocks or anonymous functions in other programming languages, but at the same time, they can function as sequences. In their literal form, they are simply pushed onto the stack like other data literals. After these blocks is if, forming a conditional. if works like this: first, the top three items on the stack are popped. If the former third-from-the-top item is f, the top quotation is executed. Otherwise, the second-from-the-top quotation is executed. I've written extensively about conditionals in Factor in a previous blog post.
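
On its own, if behaves like this in the listener (any value other than f counts as true):

t [ "yes" ] [ "no" ] if ==> "yes"
f [ "yes" ] [ "no" ] if ==> "no"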

So we still have to look at the code within the quotations. The first branch, [ drop 1 ] is executed when the condition is true, that is, when the input to factorial is zero. That branch drops the item on the top of the stack (the original input) and pushes 1. So, in total, 0 factorial returns 1.

The second branch, [ dup 1- factorial * ], is a bit more complicated. First, the item on the top of the stack is duplicated, and 1 is subtracted from the result. So, if the stack before that operation was 1, then the stack afterwards should be 1 0. After that, factorial is called recursively. This should only affect the top of the stack. The stack now is 1 1. The word * takes two numbers off the top of the stack and pushes their product. So, with an input of 1, the end result of this quotation is 1.
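
To make the recursion concrete, here's a hand trace of 2 factorial, showing the stack after each word (I've left the two quotation literals out of the picture; if consumes them along with the boolean):

2        ! starting stack
2 2      ! dup
2 f      ! zero? -- not zero, so if runs the second quotation
2 2      ! dup
2 1      ! 1-
2 1      ! factorial -- recursively, 1 factorial gives 1
2        ! *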

I'm reading The Art of Computer Programming right now, so in Knuth's style, I'll announce this: Exercise 1 proves this code correct.


Now here's a funny piece of code which does the same thing:

: factorial ( n -- n! )
1 [ 1+ * ] reduce ;

Exercise 2 explains this code and proves it correct.


1. [HM50] Prove the first Factor code listing a correct implementation of factorial.

2. [HM50] (a) Explain the workings of the second code listing. (b) Prove the second code listing a correct implementation of factorial.

(In case you're wondering why these exercises are rated so difficult, you should know that nobody's ever formally described the semantics of Factor, and that is a prerequisite for a rigorous proof of the correctness of any algorithm implementation. No, you can't look in the back of the book to find the answer.)


OK, OK, I'll give an answer to exercise 2(a), since it may be a bit confusing to those unacquainted with some of the more esoteric features of Factor. In Factor, all nonnegative integers are sequences. The length of an integer, viewed as a sequence, is the integer itself, and indexing into an integer just returns the index. In other words, an integer n represents the range of integers from 0 to n-1, inclusive. So, to get factorial, all we need to do is add 1 to each of these, and find the product of the whole thing. A direct implementation of that would be

: factorial ( n -- n! )
[ 1+ ] map product ;
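
Before going on, you can see integers acting as sequences right in the listener (same notation as before): the length of an integer is the integer itself, nth gives back the index, and >array spells out the range.

4 length ==> 4
2 4 nth ==> 2
4 >array ==> { 0 1 2 3 }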

map is a combinator (a word which takes a quotation as an argument) which, like Common Lisp's mapcar, applies a function to each element of a sequence (or, in Lisp, a list) and returns the resulting sequence (or list). And product simply calculates the product of all the elements of a sequence. This implementation works as expected. However, it's a little bit more efficient if we don't calculate the whole new sequence, with one added to each element, and instead incrementally calculate the product as we range through the numbers 0 to n-1. To do that, we can use another combinator called reduce. reduce takes a sequence (such as an integer), a start value (such as 1), and a quotation. Then, reduce executes the quotation with each element of the sequence on the top of the stack, and the accumulated value beneath the top. The quotation returns a new accumulated value, which is ultimately the return value of the reduce. So, here, the quotation [ 1+ * ] adds one to the sequence element and then multiplies it into the accumulator.
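
To see how [ 1+ * ] builds up the factorial, here are the steps for 3 factorial; the sequence 3 contains the elements 0, 1 and 2, and the accumulator starts at 1:

accumulator 1, element 0: 0 1+ gives 1, then 1 1 * ==> 1
accumulator 1, element 1: 1 1+ gives 2, then 1 2 * ==> 2
accumulator 2, element 2: 2 1+ gives 3, then 2 3 * ==> 6

The final accumulator, 6, is 3!, as expected.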

Now, I'll expect the answers to the other two exercises to be on my desk Monday morning. You are dismissed.

Monday, July 30, 2007

How to make ad-hoc polymorphism less ad hoc in Factor

Note: Originally, I used the words type, class and instance where I now use class, group and implementation, respectively, but I changed this because it was too confusing. This proposal may sound sketchy, which is because I haven't implemented it yet. But I'll have a working draft in a few days.

A few days ago, Slava told me that he had plans to radically change the Factor object system. At first, this alarmed me. I'm afraid we might lose users if we keep changing the language so much, with no end in sight. It's amazing how quickly Factor code undergoes bitrot. One of the changes he suggested included getting rid of class names in slot accessors, which I think is both useless and somewhat obfuscating. But after some thought I realized there was a potential for great changes here.

A proposal: what if Factor had type classes, like Haskell? No, it's impossible to port classes directly from Haskell in their current form. But type classes form a good model to build a dynamically typed object system around. For the rest of this blog posting, I'll use the word "group" to refer to my extended notion of a type class. I call it this because it refers to a group of generic words, as well as a group of classes that implement them. I'll continue referring to Factor classes as classes. This may confuse Haskellers, but it makes more sense, since it fits in with current Factor concepts. A good introduction to type classes is Philip Wadler's original paper, "How to make ad-hoc polymorphism less ad hoc." (Note that I'm not doing any of the dictionary translation part, really just the first part. Also, this does not include multiparameter type classes, but may be extended to that by Factor 2.0 to support homogeneous sequences when Factor is optionally statically typed. Really, the link to Haskell is more metaphorical than anything else. You can also think of groups as abstract classes.)

The sequence protocol, one of the first big uses of Factor's object system, is a good example of what's wrong with what we have right now. Here's some code from core/sequences/sequences.factor:


GENERIC: length ( seq -- n ) flushable
GENERIC: set-length ( n seq -- )
GENERIC: nth ( n seq -- elt ) flushable
GENERIC: set-nth ( elt n seq -- )
GENERIC: new ( len seq -- newseq ) flushable
GENERIC: new-resizable ( len seq -- newseq ) flushable
GENERIC: like ( seq prototype -- newseq ) flushable

<PRIVATE
GENERIC: nth-unsafe ( n seq -- elt ) flushable
GENERIC: set-nth-unsafe ( elt n seq -- )
PRIVATE>


Those are just a bunch of generic words, all detached from one another. The only thing connecting them is documentation, and the understanding that all sequences should implement all of them. (Or at least enough of them so that it can function; some of these methods have implementations on object and some don't.) Some methods have a convenient property that they can defer missing methods to the delegate of a tuple, but this only works if no default implementation on object exists.

The first thing to do is to provide a structure for these generic words. In documentation, they form a protocol, but I'll use them to form a group, called sequence. (Note that this code won't work, if you try to run it. It's just an idea. Hopefully I'll have an implementation in a month or so, but I have a lot to do right now.)


GROUP: sequence
GENERIC: length ( seq -- n ) flushable
GENERIC: nth ( n seq -- elt ) flushable
<PRIVATE
GENERIC: nth-unsafe ( n seq -- elt ) flushable
PRIVATE>
GENERIC: new ( len seq -- newseq ) flushable
GENERIC: new-resizable ( len seq -- newseq ) flushable
GENERIC: like ( seq prototype -- newseq ) flushable
;

GROUP: mutable FROM: sequence
GENERIC: set-nth ( elt n seq -- )
<PRIVATE
GENERIC: set-nth-unsafe ( elt n seq -- )
PRIVATE>
;

GROUP: resizable FROM: mutable
GENERIC: set-length ( n seq -- )
GENERIC: underlying ( seq -- underlying ) ! Note: defined in core/growable
GENERIC: set-underlying ( underlying seq -- ) ! ditto
;


Immediately, you can see that I've added some structure to this. The generic words are each inside a GROUP: declaration. Some of the group declarations have FROM: following the group name, meaning that every implementation of that group must also be an implementation of the group following the colon. (Multiple groups could be placed in brackets.) I've separated this out into three groups, because that is the actual structure of the sequence protocol: all sequences are indexable, some of those are mutable, and some of those are resizable.

In each of the above examples, GENERIC: is used, but it's possible to use other method combinations like GENERIC#, MATH: and HOOK:. This proposal doesn't imply any significant change in dispatch mechanisms, except for the removal of delegates as they are now.

Each group forms a class which is the (initially empty) union of the types implementing the group. At first, this sounds useless, but it allows two things we haven't had in Factor that Haskellers take for granted: the ability to do proper default method implementations and create type class instances of forms like instance Foo a => Bar a where ... (this isn't Haskell98, but it's still useful). Here's how some default methods look:

M: sequence nth bounds-check nth-unsafe ;
M: mutable set-nth bounds-check set-nth-unsafe ;
M: resizable set-nth ensure set-nth-unsafe ;

Currently, each of these definitions is repeated multiple times throughout the Factor code base. Definition on object, the current mechanism for default methods, is avoided to make room for delegation. Other default methods, which are currently implemented on object, could be

M: sequence new-resizable drop ;
M: sequence new drop f ;
M: sequence like drop ;

It is good that these are implemented on sequence rather than object because that is a more precise way of indicating what the code is actually doing. I always found it confusing that sequence methods would be implemented on object when they would clearly fail on many types of objects.

So, we have this group. How do we make an implementation of it, that is, a class which belongs to it? If we just started implementing methods without any declaration, it'd be hard to decide whether a class belongs to a particular group. So, here's an example of how the syntax could look, showing the implementation of sequence on integers. Note that integers belong only to the group sequence, not mutable or resizable.

JOIN: sequence integer
M: integer length ;
M: integer nth-unsafe drop ;

The JOIN: declaration ensures that integers will be included in the sequence group. (They're joining the group, hence the name.) We may or may not want an error to be triggered if someone attempts to implement the sequence methods without that declaration. I'm not sure. It might also be an error-triggering condition if a JOIN: declaration exists in a file without a complete implementation of the group's methods.
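
To illustrate the declaration in use, here's how a hypothetical user-defined class might join the group under this proposal. Like the rest of the code in this post, it's a sketch of syntax that doesn't exist yet, and repetition (a virtual sequence of len copies of elt) is just a made-up example class:

TUPLE: repetition len elt ;
JOIN: sequence repetition
M: repetition length repetition-len ;
M: repetition nth-unsafe nip repetition-elt ;

Since nth-unsafe is defined, the default M: sequence nth above would cover bounds-checked indexing for free.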

Slava suggested that we may want a behavior, or set of implementations of methods, without adding any slots or new generic words. That could be done easily with this system:

GROUP: foo FROM: sequence ;
M: foo nth-unsafe ... ;


Now, here's an innovation where we go beyond Haskell, sorta. It could, potentially, be included back into Haskell; there's nothing stopping them, but it might not be as useful there. (I should note that this so far doesn't incorporate nearly all of the functionality of Haskell type classes, but that's OK.) I'm talking about a generalization of a Haskell extension called generalized newtype deriving. Basically, if you make a newtype in Haskell, you can use the deriving construct to allow a few classes to implement all methods by looking up the object stored "in" the object of the newtype. (In reality, there's nothing in it; the difference is only at the type level.) That wasn't a very good description, so let me explain how it works in Factor.

My first real programming project was writing an XML parser. Recently, I found a way to make some operations on tag objects much simpler: a tag can act like its name, its children, and its attributes all at the same time! That probably sounds confusing, but it is actually quite simple. The main way names are used is querying the name-url, name-tag and name-space, the three tuple fields. The way a tag's children are used is as a sequence. And attributes are used as assocs. So really, the concept we want to express here is that there are three different delegations, each for a different group. (Let's imagine here that the tuple type TUPLE: name url space tag ; defines a group name-slots containing the generic words name-url, name-tag and name-space. I'm also not sure if this is a good idea.) Then we can specify these delegations using:

TUPLE: tag name attrs children ;
DELEGATE: name-slots tag tag-name
DELEGATE: growable-seq tag tag-children
DELEGATE: assoc tag tag-attrs

Currently, my code for this is a mess of delegation using the delegate slot (which is set awkwardly in a manual constructor definition to the name) and this:

M: tag at* tag-attrs at* ;
M: tag set-at tag-attrs set-at ;
M: tag new-assoc tag-attrs new-assoc ;
M: tag assoc-find >r tag-attrs r> assoc-find ;
M: tag delete-at tag-attrs delete-at ;
M: tag clear-assoc tag-attrs clear-assoc ;
M: tag assoc-size tag-attrs assoc-size ;
M: tag assoc-like tag-attrs assoc-like ;

M: tag nth tag-children nth ;
M: tag nth-unsafe tag-children nth-unsafe ;
M: tag set-nth tag-children set-nth ;
M: tag set-nth-unsafe tag-children set-nth-unsafe ;
M: tag length tag-children length ;

which is just plain unnecessary duplication. The simpler DELEGATE: declarations above would basically generate the code I just showed. Assuming tuple slot accessors are kept the way they are, this would allow a tuple to "inherit" from more than one tuple type. So all classes have the capability to inherit from one or more "abstract classes" (groups) and "concrete classes" (normal classes). For those of you worried about multiple inheritance conflicts, let's take a simple approach and just say the last DELEGATE: declaration takes precedence. There's no reason to have complicated graph algorithms here. If your code depends on something more complicated than that, then there's probably something wrong with it. In general, the classes you implement or delegate to shouldn't overlap too much.

If you've noticed, I have a notion of a class which is a bit broader than the Haskell concept. Now let me make it even broader: every generic word forms a group (and therefore also a class, since every group forms a class which is the union of the types that implement it). This way, every generic word is in some sort of type class. So, by itself, the code GENERIC: foo would be equivalent to

GROUP: foo
GENERIC: foo
;

In most Factor code, it wouldn't make sense to put generic words into explicit groups. They're not part of a protocol, and declaring one would just be annoying. When a class implements a method, it is automatically made a member of the group corresponding to that method. Here's a nice useless example of that:

GENERIC: nothing? ( object -- ? )
M: f nothing? drop t ;
M: sequence nothing? empty? ;
M: zero? nothing? zero? ;

The final method works on all objects that have an implementation for zero?. Again, I'm not sure whether this is a good idea or not.

So, there's my long-winded proposal for something like type classes in Factor. The terminology is probably confusing, so feel free to ask for clarification (or correct my errors!). Ed Cavazos made another interesting proposal for a radically changed object system in Factor, which he has already implemented and made use of in extra/mortar. I could implement this system in Factor, and may in the future, but haven't yet. Comments?

Update: I made some changes that Slava suggested. I should've noted better that ad-hoc polymorphism, with or without this change, will still be a lot more ad hoc than in Haskell, if I didn't make that clear.

Update 2: I have a working implementation of this at the wee-url pastebin. There are only two problems: it's slow and it'll hang if you give it a cyclic group structure. (Sorry, that statement sounded too abstract algebra-like but it actually has nothing to do with that, AFAIK.) Oh, and a couple more problems: I can't include this in my repository, because loading it affects the whole system, which is bad. And also, it might be a bad idea to make every generic word and set of tuple slots a class, because that'd add 1199 classes to the core as of Factor .89, most of which will never be used. In .89, there were 296 classes in the core. Another problem is that I haven't fully tested the newly modified class< needed because of the new class structure. I'm not sure if it'll work properly, since this includes multiple inheritance. (But then again, so do overlapping unions, which, in my tests, silently and arbitrarily chose one when there was any ambiguity about which method to run.) When writing this, I realized that it's actually really easy to infer by context when a class is joining a group, so an explicit JOIN: declaration is only needed in certain edge cases, or when a group does not have any directly associated methods.

Update 3: Actually, Slava implemented basically the same thing at the same time as I wrote it, but I didn't have internet at the time, so I didn't know it. Huh.

Friday, July 27, 2007

Prior art

A few weeks ago, I outlined an implementation of pattern matching in Factor. But just recently, I found a paper with a remarkably similar pattern matching system in Haskell. It's entitled "First Class Patterns" by Mark Tullsen, published in 2000 with a grant from the NSF.

On the surface, it looks completely different, but underneath, it's the same. This Haskell pattern matching system defines inverses for constructor words, with the versions Constructor for the normal constructor and Constructor# for the inverse. The inverse is failable, and can be represented as a -> Maybe b. Just x indicates a successful inverse, with the data which it yields, and Nothing indicates failure. This is analogous to what I did with exceptions. You can define your own inverse for functions as well as constructors, much like what I did with define-inverse.

Tullsen did something really cool when he generalized pattern matching to MonadPlus m => a -> m b, allowing more complex things like backtracking using the list monad. You could also give your own semantics for pattern sequencing. I haven't investigated this fully, but it may be possible to do backtracking pattern matching in Factor, also, using call/cc and, from that, amb. Another extension that we both may be able to use is more involved unification semantics along the lines of Prolog or Erlang.

A couple things about it annoyed me, though. First, it didn't really provide a good-looking syntax and wouldn't work as a drop-in replacement for existing Haskell pattern matching. Second, the implementation required a language extension for the syntax to work properly. This is a symptom of a general problem with Haskell: the language isn't as extensible as people like to think it is. It is much more common to see a paper which proposes a language extension than one which proposes an idea and provides a Template Haskell implementation of it. My last problem with the paper is that it skipped over multiargument (curried) constructors, saying "Curried constructors are allowed, but the pattern law is much more straightforward when constructors are uncurried."

But if you're interested in models of pattern matching, the paper is definitely worth a read. It's rigorous where my blog post made blind implicit assertions that something works correctly. And it provides several combinators for processing and composing patterns in a manner somewhat broader than existing pattern matching techniques. It's also interesting to look at how the type system of Haskell interacts with pattern matching, and how pattern matching based on inverses works out in an applicative language.

Anyway, I just thought that it was necessary for people to know that I wasn't the first person to come up with that version of pattern matching.

Update: I kinda rushed this blog post out initially, so I went back and edited it.

Thursday, July 26, 2007

Who are you people and what are you doing reading my blog?

Ever since I got an invisible hit counter for this blog in mid-March, I've been surprised to see that people actually read this thing. Of course, I'm talking about you. At my last count, there were 10-15 active Factor programmers in the universe, not counting the ones who do it in secret. Yet, for some reason, my blog gets around 25 unique hits a day, with a standard deviation of around 12 and a distribution skewed strongly to the right. (That's ignoring the outliers of the times I get Redditted, getting 400-500 hits in a day; if I include that, the standard deviation is higher than the mean!)

So, I'm just wondering: among those of you who aren't Factor programmers who I sorta know, what motivates you to read this? Where/how did you find it? I know you're out there, so I'd really like it if you commented. (In fact, I always like it when people comment, even stupid comments.) This isn't meant as interrogation. I'm just really curious and confused.

Update: Thanks for all of your comments! But it's never too late to write another.

Sunday, July 22, 2007

ICFP results

A bunch of people have asked me how the ICFP programming contest is going, so I thought I'd just write a blog post about how it all went. Just to give you a spoiler, in case you don't want to read the whole thing: we lost.

But if you're going to keep reading, let me describe the task to you. It was described in a 22 page document, so if you weren't in the contest, chances are you didn't read the whole thing. The premise is, an alien crash-landed on earth, and we have to prefix his DNA with some more DNA in order to help him survive in our conditions. His DNA works differently than ours, consisting of the letters I, C, F and P. It is converted into alien RNA through a slow and potentially non-halting algorithmic process. Then, the RNA (in chunks of 7 bases) is executed, with each chunk interpreted as one of 20 commands, or ignored. These commands are interpreted to produce a 600x600 picture. The specification for the DNA>RNA conversion and RNA interpretation is really complicated; I remember falling asleep at 2:00 AM on Friday (Saturday?) reading it. The default DNA produces a particular source picture, and the goal of the contest is to come up with a prefix for that DNA which will produce a given target picture, or at least a good approximation of it, in as few bases as possible.

So, obviously, Gavin (gmh33) and I thought, the first step is to produce a system which converts DNA into the image. By the end of the second day, we were mostly done. Some teams got this done in the middle of the first day, but (a) we're not perfect programmers and (b) we still had lives going on, so it took us a bit longer. Gavin wrote most of the DNA>RNA converter, and I wrote the RNA interpreter with lots of help from dmpk2k from #concatenative. For me, it was a nice opportunity to learn about basic graphics algorithms like flood fill, alpha blending and line drawing. (I might put these in a coherent, efficient library some time later, but right now, they're pretty incoherent and inefficient.) For Gavin, it was a good chance to learn about Factor programming and all of its oddities and idioms.

I asked Gavin to put the alien's DNA into his code to convert it to RNA, so I could then render it into a picture, hopefully the source picture. Eight hours later, it was still going. Eventually, Gavin figured it out. If I'd been working alone, I'd probably never have realized it. "I kind of have a feeling we approached this the wrong way... I think maybe going backwards would have been the right way." Indeed, according to the spec itself, the DNA goes through more than a million iterations before fully transforming into RNA. So, most likely, the DNA>RNA conversion wouldn't've been done by the time the contest was over!

Our new strategy was simple: I'd write the code to convert the target picture into corresponding RNA instructions, and then Gavin would convert that into a DNA prefix. Within a couple hours, Gavin had his code working--it only needs to add three I's before each RNA instruction, and then skip all the following DNA code. But mine was much harder. I wrote code for moving the cursor around, setting the bucket color to any RGB value, and drawing lines. I thought there was just one more step: code to look at the target picture and figure out where lines should be drawn so I can flood fill the solid patches of color. Dmpk2k (a 2D graphics genius with infinite tolerance for people who don't understand anything) and I discussed the possible algorithms for this for over an hour, until we both simultaneously realized, in horror, that there were no solid regions of color!

At this point, Gavin and I decided to give up. In total, we'd written about 500 lines of code, and it all worked, though it didn't solve the problem. So we weren't going to just not submit anything--we were going to lose in style. Gavin came up with the idea of drawing a big green F for Factor, and before long I sent him the RNA, which he converted to DNA and submitted. Surprisingly, we didn't come in last place. Maybe it was the white background, but we apparently got a few pixels right, somehow. At the end of the contest, team Raptors is in position 354, or fourth to last of those who submitted something.

All in all, a horrible failure which I'll be sure to take part in next year. For both Gavin and me, it was a very humbling experience. It'll be a little while until I'm a good enough programmer, but eventually the world will know that Factor is the programming language of choice for discriminating hackers!

Update: I should have mentioned this in the original post. After doing everything, I realized that our approach was far too simplistic. A good solution would not have taken the segmented approach we did, separating the image to RNA processing from the RNA to prefix processing from reading the DNA. To have a good solution, we would have had to do all of those things at once. We would have had to analyze the DNA to see how it related to the picture, and then produce DNA which, over several iterations, would make the correct RNA to generate the picture. I have no idea how to do this. To test things, we'd need a DNA interpreter that was fast, and for that, we'd need a highly optimized, well-thought-out implementation in a fast, low-level language, like C. It might be possible in a higher level functional language, but only if many low-level features are taken advantage of.

Update 2: After reading a bunch of other post-mortems, I realize that, in fact, the organizers of the ICFP made a whole library and programming system in DNA, which turns out to be a Turing-complete (though not terribly efficient) programming language. So you get a series of puzzles, which lead to more puzzles, and so on. Oh, and to represent DNA efficiently, you can't use plain old strings, you need an obscure functional data structure developed at Xerox PARC called a rope (like heavyweight strings, get it?). It's a specialized tree of characters that does concatenation in constant time and most other operations (indexing, substring, etc.) in logarithmic time. For searching through a string, the Knuth-Morris-Pratt (KMP) algorithm is needed, which detects and takes advantage of patterns to get the complexity down from O(mn) (where m is the length of the pattern and n is the length of the string) to O(m+n). The DNA contains a lot of repetitions, and the conversion to RNA uses a lot of searching, so this is necessary.

Since someone asked, I uploaded Gavin's and my code to onigirihouse (thanks Matt!) for all to see the horrors. The icfp.factor may be a little outdated, though, and it doesn't contain the code used to convert RNA into DNA. We just communicated our code by email, and that wasn't always the most efficient or effective way to do things. The code isn't terrible, but it's really inefficient and there are bugs in it. Also, I had some messy stack handling in a few parts of image.factor because I didn't think things through very well.

Monday, July 16, 2007

ICFP Contest

Is anyone else interested in entering the ICFP contest, from this Friday to Sunday? gmh33 from #concatenative and I are planning on having a team using Factor, and we'd like it if anyone else who's interested joined in. There's no limit to how many people can be on a team, but everyone has to register beforehand. Just contact me if you're interested.

Eww, that's the dirtiest macro I've ever seen!

I've shown a few parsing words on this blog that create words that aren't mentioned directly in the text. An example of this is TUPLE:, which makes a class with a number of slots, a constructor, a predicate, accessors and setters. It's very useful, even if not all words it creates are given literally in the code. Now, what would you say if I told you I have an unhygienic macro that creates O(n^m) words? (I'll tell you what m and n are later.) In reality, it makes more than n^m words. With 7 as m and n, it creates 1419760 words, and with 8 as both of them, Factor crashes because it runs out of memory after expanding the heap a number of times.

You're probably thinking, "Why would I ever want such a stupid unhygienic macro? And what does it do? Your introduction was so boring and pointless, Dan." Well, I'll tell you: it's a new stack shuffling syntax. dmpk2k on #concatenative asked me about alternative shuffle word syntax. For example, why not make dup into x-xx, swap into xy-yx, etc.? It'd be much easier to remember, and there would be no need for "stack idioms" like swap rot for xyz-zyx. Even complicated stack shuffling would be simple.

Believe it or not, it's not all that hard to do in Factor. It'd be prohibitively difficult to extend the parser to make this work for all possible shuffles, so I took a halfway approach: provide a way to define as many as you want. The syntax I came up with was
SHUFFLE: abcd 6

which defines shuffle words for up to 4 inputs labeled a, b, c and d (in order), and up to 6 outputs, defining a paltry 6686 shuffle operators, everything from abcd- to ab-baabaa. (If you want, you can alias them, as in : tuckd xyz-yxyz ;, but there's generally little point.) Some of the shufflers do the same thing as others, such as ab-a and a-, but this constitutes a very small fraction of the total. Going with the variables at the beginning of the article, n is the length of the alphabet (here, abcd) and m is the maximum output length (here, 6). The total number of transformations is higher than 4^6 (4096) because that count only includes the longest transforms, from four elements on the stack to six. However, it is still an accurate asymptotic approximation.
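
Assuming the SHUFFLE: abcd 6 line above has been run, the generated words read exactly like their names; a few samples in the listener:

1 2 3 abc-cba ==> 3 2 1
1 2 ab-abab ==> 1 2 1 2
1 2 3 4 abcd-d ==> 4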

So, there are two difficult programming problems here, in addition to some easy ones. The hard ones are (a) generating all possible stack transformations and (b) turning a stack transformation into a quotation.

Generating all stack transformations

(a) may look simple, but it's a bit more complex than it seems. It is easy to get the possible inputs: each input is just a prefix of the alphabet, anywhere from the first element alone to the entire alphabet. But the outputs can be any number of the inputs, in any order, duplicated or not.

This can be reduced to a simpler problem. Let the length of the alphabet be A, and the length of the output be O. Given an alphabet length and array length, generate an array of arrays consisting of all the possible combinations of numbers 0 to A-1. The length of the array generated will be A^O because there are O spaces, each of which can be filled with A different items. So if we can find a bijection between the numbers from 0 to (A^O)-1 (let's call one of these numbers N) and the arrays we need, the problem is easy.

As it happens, the structure of this problem is remarkably similar to converting numbers to different bases. We can treat our number N as a base A number. Then we create an array of length O consisting of the digits of the number. Each digit should be between 0 and A-1. If we convert each number N into this base, we'll get every possible combination for the output array. The code for this is below, where translate converts a number into an array with a given length and alphabet size, and (combinations) creates all possible combinations for the given length and alphabet size:

: translate ( alphabet out-len n -- seq )
-rot [ drop /mod ] map-with nip ;

: (combinations) ( alphabet out-len -- seq[seq] )
2dup ^ [ translate ] map-with2 ;
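
To make the bijection concrete (putting aside the exact digit order the code happens to produce): with the abcd alphabet, A = 4, and an output length of O = 3, there are 4^3 = 64 possible outputs, and each N from 0 to 63 names exactly one of them. For instance:

N = 27 ==> base-4 digits { 1 2 3 } ==> letters b c d (since 27 = 1*16 + 2*4 + 3)
N = 0 ==> base-4 digits { 0 0 0 } ==> letters a a a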

To make an array of all the possible stack shuffles with input length from 1 to max-in and output length from 1 to max-out, we need a couple more words. To represent a stack shuffle, I used a 2array whose first element is the number of inputs and whose second element is an array of the output indices.

: combinations ( n max-out -- seq[seq] )
1+ [ (combinations) ] map-with concat ;

: make-shuffles ( max-out max-in -- shuffles )
[ 1+ dup rot combinations [ 2array ] map-with ] map-with concat ;

(Going counter to the Forth/Factor mantra, I actually wrote this code in a top-down fashion. First, I wrote make-shuffles, and then realized I'd need combinations, and then realized I need (combinations) and translate. This is generally not a good tactic, but it works for small problems that take a lot of reasoning, like this.)

Generating quotations
It surprised me, but (b) turned out to be the easy one. At first, to transform the stack, I stored the top few elements in an array, then accessed the elements used for output and dropped the array. But that isn't optimal, because an array is unnecessarily allocated. It's slightly more efficient if we just pass things around on the stack.

To make this easier, I used Chris Double's extra/shuffle library. It defines words like [npick] and [ndrop] which generate quotations for arbitrary depth pick and drop. Using this, I can take the strategy of reaching down and pulling up the last element of the output array, then dipping beneath it. After iterating backwards through the entire output array, I drop the inputs and come back up. That wasn't too good an explanation, so just have a look at the code:

: shuffle>quot ( shuffle -- quot )
[
first2 2dup [ - ] map-with
reverse [ [npick] % \ >r , ] each
swap [ndrop] % length [ \ r> , ] times
] [ ] make ;

As an example, the code for ab-ba is over >r dup >r 2drop r> r>.
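
Tracing that quotation on a stack holding a b shows why it works; the right-hand column is the retain stack that >r pushes to and r> pops from:

a b       ! starting stack      retained:
a b a     ! over                retained:
a b       ! >r                  retained: a
a b b     ! dup                 retained: a
a b       ! >r                  retained: a b
          ! 2drop               retained: a b
b         ! r>                  retained: a
b a       ! r>                  retained: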

Wrapping it all up

To make this work, I also need a few trivial operators. There are words for turning a stack shuffle datum into a string:
: shuffle>string ( names shuffle -- string )
[ [ swap nth ] map-with ] map-with
first2 "-" swap 3append >string ;

For attaching a stack effect to a shuffle word:

: put-effect ( word -- )
dup word-name "-" split1
[ >array [ 1string ] map ] 2apply
<effect> "declared-effect" set-word-prop ;

And for defining a set of shuffle operators (in-shuffle and out-shuffle make sure that the operators get defined in their own minivocabulary, so they're not exported by default. This follows the example of <PRIVATE and PRIVATE>):

: in-shuffle ( -- ) in get ".shuffle" append set-in ;
: out-shuffle ( -- ) in get ".shuffle" ?tail drop set-in ;

: define-shuffles ( names max-out -- )
in-shuffle over length make-shuffles [
[ shuffle>string create-in ] keep
shuffle>quot dupd define-compound put-effect
] each-with out-shuffle ;

And finally, the dirty macro itself, the greatest sin to hygiene Factor has ever seen:

: SHUFFLE:
scan scan string>number define-shuffles ; parsing

Enjoy! I'm not sure whether it would be ethical for me to release this evil into the Factor distribution, so I haven't pushed it on to my repository.

Sunday, July 15, 2007

Messing around at the Factor REPL

Read Eval Print Loops, or REPLs, are really useful, I've found. One of my favorite uses, beyond prototyping, is playing around and solving problems. Often these are math problems, but for slightly more complicated things, I need string processing. Simple string processing things like this also make an easy example for explaining Factor to beginners like you. (If you're not a Factor beginner, you may want to stop reading.)

My younger brother is learning how to type right now. My dad found Tux Typing, a GPL typing program for Windows and Linux. So far, it's great; there's just one problem: when you use the word mode (where you have to type words before they fall to the bottom of the screen, creating horrible crashing sounds) there are only about 35 words in the long word setting. My dad asked me to fix this, and since it was a simple little task, I agreed.

I started by figuring out the file format for the dictionaries. It was easy: the first line was the title (I chose "Huge words") and after that, the words were listed in allcaps, separated by Unix line endings. Next, I copied the text of the Wikipedia entry Economy of the United States into Wordpad and saved it to the desktop. Then I downloaded Factor to the computer I was working on and fired up the REPL.

The first thing in making the word list is getting a list of space-separated things. So I made a file reader object and got an array of all the lines. I joined these lines with a space, then split everything separated by a space (separating both lines and words on the same line).
"article.txt" <file-reader> lines " " join " " split

Now an array of words is lying on top of the stack. There are a bunch of operations we need to do to manipulate this, and Factor's sequence combinators help make it easier. So I made sure that each word had at least three letters in it:
[ length 3 >= ] subset

And I put all the words in upper case:
[ >upper ] map

And I made sure that each character of each word was an upper case letter, to filter out apostrophes, numbers, and other similar things:
[ [ LETTER? ] all? ] subset

And finally, I made sure there were no duplicates:
prune

So, to join this together in the correct file format and add a title, I used
"Huge words" add* "\n" join

yielding a string. To write this string back to the original file, all I needed was
"article.txt" <file-writer> [ print ] with-stream

And that's it! All in all, 10-15 minutes work and I got the game working with a good word list. But the REPL made it a lot easier.

Update: I didn't have the <file-reader> and <file-writer> displayed properly before, but it's fixed now. Thanks Sam!

Monday, July 9, 2007

The Unicode implementer's guide, part 2: Normalization

In a reasonable encoding system, it'd make sense if every string had only one encoding, wouldn't it? I'm not talking about UTF-8/16/32; I'm talking about the logical sequence of characters. If you have a capital C with an acute accent on top and a cedilla on the bottom, there should be only one sequence of characters to represent it, and any two strings in the same encoding which represent that glyph should have the same bits.

Well, that's not Unicode. There are actually several ways to represent that. At least four, by my count. You could have C-acute followed by cedilla, a C-cedilla followed by an acute, a C followed by an acute followed by a cedilla, or a C followed by a cedilla followed by an acute. There is a difference between "C-acute" and "C followed by an acute" because Unicode has what's called precomposed characters, or characters which already have accent marks on them. However, they are directly equivalent to a corresponding decomposition, specifically, the base character followed by a combining accent mark.

Equivalence and normalization forms

All of the above sequences represent the same thing, but they're represented differently in the computer. We call them canonically equivalent. There's another type of equivalence, called compatibility equivalence, where, say, the composed fraction ½ is equivalent to 1/2. These aren't strictly equivalent, though, so Unicode maintains a distinction to a stronger degree than the distinction between C-acute and C followed by an acute.

Chapter 3 of the Unicode standard states, right near the beginning,

C6 A process shall not assume that the interpretations of two canonical-equivalent sequences are distinct.

  • ...a process is never required to give different interpretations to two different, but canonically equivalent character sequences...no process can assume that another process will make a distinction between...canonical-equivalent character sequences.

  • Ideally, an implementation would always interpret two canonical-equivalent character sequences identically...




To me, it seems apparent that one single form needs to be chosen. In this form, which is efficient to create, any two canonical-equivalent strings will be represented by the same sequence of characters. There should also be a form for compatibility equivalence.

Luckily, the Unicode consortium thought of this first. They came up with not two but four forms, with the imaginative names Normalization Form C (NFC), Normalization Form D (NFD), Normalization Form KD (NFKD) and Normalization Form KC (NFKC). Officially, they stand for nothing, but I know better! They do mean something! C stands for "composed", D stands for "decomposed" and K stands for "kompatibility". (If they had abbreviations like NFCC, it'd be confusing, wouldn't it?)

Decomposition

Let me explain. In NFD, everything that can be decomposed is decomposed. In NFC, everything that can be composed is composed, after first decomposing everything. In NFKD, everything that has a compatibility or canonical decomposition is decomposed that way. And in NFKC, everything is decomposed by NFKD and then recomposed by their canonical composition. These decompositions are listed in UnicodeData.txt, and have to be recursively applied until a fully decomposed/recomposed form is achieved.

Still doesn't make sense? Here are some details of my implementation. For both forms of decomposition (NFD and NFKD), I constructed tables* associating characters with their fully-calculated decomposition, decomposing the decomposition repeatedly until no more steps are possible. For example, the musical symbol eighth note character can be decomposed into a black notehead, a combining stem, and a single combining flag**. From there, all you have to do to construct a decomposed normalization form is map characters to their decomposed forms (or, if there's no decomposition, to the character itself) and concatenate the results.
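
For concreteness, here are a few entries of the kind these tables end up holding, with codepoints for reference (the first two are canonical decompositions, the last is compatibility-only):

é U+00E9 ==> e U+0065, combining acute accent U+0301
Å U+00C5 ==> A U+0041, combining ring above U+030A
½ U+00BD ==> 1 U+0031, fraction slash U+2044, 2 U+0032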

Composition

To recompose afterwards, there is just one table for both NFC and NFKC, because only the canonical decomposition can be recomposed; the compatibility decomposition is never recomposed. To do this, you need some sort of associative mapping from two characters to one character. There is an entry in this table for each canonical decomposition from one character to two characters. (There are a few characters, called singletons, that map in NFD from one character to one other character. This is never undone. They exist purely for round-trip compatibility with other character encodings.) Most people suggested using a custom-built trie for it, but I got pretty good performance with a simpler custom-built hashtable. I didn't want to use the standard library hashtable because the keys consist of two integers. With a standard hashtable, I'd have to construct a 2array or similar object for each key, and then all lookups would require consing. This is unnecessarily inefficient, especially considering how many lookups there are. The hashtable I came up with is in extra/hash2.

The recomposition algorithm is rather complicated once you consider that there may be multiple combining marks following a character, and their order matters only sometimes (see below). I asked about it on the public Unicode mailing list, and Ken Whistler gave me a very helpful response in a private email. The algorithms are also described in the Unicode standard itself.

Canonical ordering

OK, so this handles the separation between the "C followed by acute" and "C-acute" forms. But what about the ordering of the combining marks? How do we make sure that an acute followed by a cedilla is treated the same as a cedilla followed by an acute? It gets even more complicated when you take into account the fact that order does matter for certain accent marks. For example, a C followed by a diaeresis followed by an acute accent is actually different than a C followed by an acute accent followed by a diaeresis. Which one is displayed on the bottom depends on which comes first. So there are a few groups of accent marks where the order matters, and for the rest, it's equivalent.

The folks behind Unicode came up with a great idea called canonical ordering. Each group of combining marks (in the way I just described, where a group is the few accent marks where the order matters) is assigned an arbitrary number. For example, the group containing the acute accent and the diaeresis is numbered 230, and the group containing the combining cedilla is numbered 202. Then, for each grapheme followed by multiple combining marks, the marks are sorted by combining class (the number they are assigned). The sort is a stable one, so that if two marks have the same combining class, their original order is preserved.
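
So, using the classes just mentioned (acute and diaeresis in class 230, cedilla in class 202), canonical ordering works out like this:

C, acute (230), cedilla (202) ==> C, cedilla (202), acute (230) -- different classes, so they get reordered
C, diaeresis (230), acute (230) ==> C, diaeresis (230), acute (230) -- same class, so the original order is kept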

But it was a little tricky to find the right abstractions and appropriate sequence combinators to process strings in this way. To implement canonical ordering, I first tried one big iteration through the string elements using each, but that made control flow really difficult. The main loop was one big cond with 5 clauses, each of which had to deal with 5 items on the stack. I ended up using find* to locate the starts and ends of combining mark clusters, allowing for more flexible control flow and better factoring. For my implementation, I used an in-place insertion sort to order the combining mark clusters. It's simple and efficient on small sequences.

It should be possible to do the sorting in the same pass as the cluster location, but for my implementation, it's a bit premature to optimize that much.

Design considerations

One of the design decisions I've made is that all strings in Factor will be in NFD at all times. For a number of reasons, this requires that strings are immutable. But the advantage is that Unicode conformance clause C6 will always be followed by all Factor programs. Nothing (unless it engineers a huge workaround, using binary rather than UTF-8 I/O and string buffers rather than strings) will violate this. If two canonical-equivalent strings are put in NFD, they will be equal, so there's no risk of any Factor process treating canonical-equivalent strings differently, nor expecting any other process to treat them differently.

A little side note: I've already mentioned this, but many programs and libraries do Unicode wrong. In particular, many programs can't properly handle things that aren't in Normalization Form C. So, for all input and output (except for Mac OS X pathnames, which must be NFD), strings will first have to be converted to NFC. But that's no big deal, once you have everything else figured out.

In a perfect world, we wouldn't have to worry about normalization when communicating with other programs. If every process just treated all canonical-equivalent sequences equally, we wouldn't need to convert strings into normalized form. But we don't live in that world, so normalization must exist. Keeping text in a consistently normalized form within a program is a simple but surprisingly uncommon way to achieve this.



* I constructed all the tables--for normalization and the things I discussed earlier--at parsetime, before compiletime and runtime. I did this by using words of the style

: load-stuff
\ table-word read-resource-file process-data
1quotation define-compound ; parsing
load-stuff

ICU, the Unicode library used by everyone but me, generates C header files containing very large arrays for tables. Some of the arrays are organized in a trie format, so they can be used as associative arrays. I used hashtables for associative arrays instead.

** Wondering how I got that? Here's some code you could enter into the Factor listener:

USE: unicode
canonical-map [ nip length 3 = ] assoc-find
drop >r name-map value-at r> [ name-map value-at ] map
swap "Name: " write print "Components: " print [ print ] each

In total, there are 229 characters with this sort of decomposition. A small variation to that program (changing 3 to 4) shows me that there's a Greek character, named Greek small letter alpha with psili and varia and ypogegrammeni which decomposes to 4 characters. There are actually 36 of these kinds of characters, which decompose to four. They're all Greek. There's nothing that's five characters long, though. Isn't Unicode fun?

Saturday, July 7, 2007

A tangent

Everybody else is doing it, so why not me?

Take this hypothetical piece of code:

int maxSalary = 0;
for (Employee employee : employees) {
    if (Role.PROGRAMMER.equals(employee.getRole())) {
        int salary = employee.getSalary();
        if (salary > maxSalary) {
            maxSalary = salary;
        }
    }
}

Now translate it into Factor (assuming the employees sequence is already on the stack):

[ employee-role programmer = ] subset [ employee-salary ] map supremum

Note: this will fail if there are no programmers, because that's how supremum works. A simple workaround would be to precede supremum with 0 add, which puts a 0 on the end of the sequence returned by map.
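
Spelling that workaround out, a version that doesn't choke on an empty selection would be:

[ employee-role programmer = ] subset [ employee-salary ] map 0 add supremum

An office with no programmers then yields 0, which happens to match what the Java version's initial maxSalary = 0 gives you.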

Whoa! Functional programming! (By which I mean programming using functions as first class values, Steve Dekorte, which is what the name means.) I've never seen that in my life before! Let's make a big deal out of this, and submit at least three articles to Reddit about it! I need to find something better to do than read Reddit, don't I? I promise, I'll do a substantive post on Unicode normalization forms soon...

Well, something I did notice here. The Factor version looks remarkably similar to the Io version:

employees select(role == "programmer") max(salary)

In Io, certain methods like select, max, map, and others, allow a number of forms, for convenience. One form looks like list(1, 2, 3) map(elt, elt+1), but for cases where the map is simply a message passed to each object, map(+1) will suffice. There is no currying going on here; +1 is merely a message which can be passed to an object. So something like select(role == "programmer") is equivalent to select(p, p role == "programmer"), and max(salary) is equivalent to max(p, p salary).

Messages can easily be composed by tacking them on in sequence, separated in Io only by whitespace. It often works out well to chain a whole bunch of messages on to one object, where each message is sent to the object returned by the last method call. After a while, these chained postfix method calls start to look like Forth or Factor. You don't actually need as many named variables when you can do this.

I learned Io before I learned Factor. I got very comfortable with the system of composing messages, but felt like there was something missing. Imagine you have a List created by foo bar baz and you want to make that list, but with 0 appended on the end. So, you can have foo bar baz append(0). This is just an expression, and doesn't affect the variables in the environment at all. But, say you decide it shouldn't add 0 to the end; instead it should add the last element. To do this, your code would have to look more like (x := foo bar baz) append(x last). You need to introduce a new variable into the environment just because it's referenced more than once. There's a good chance that that variable will never need to be referenced again, yet it needs to be named.

I felt like there should be a way to reference something twice without having to name it. And the only way to do that is if it's possible to store more than one object on the postfix chain. This is how I learned to understand stacks and reverse Polish notation: it's like composing operations the way Io does, but there can be more than one object. You can't do this in Io, but in Factor, you could always do foo bar baz dup peek add (where peek is like Io's last and add is like Io's append).

I don't mean to criticize Io, especially as this property of dataflow applies to all languages with application- or mutation-style dataflow. I have some issues with it, but I have no interest in starting any sort of flame war. Still, the correspondence here is really interesting. But I like being able to program where I don't have to name my values, whether I use them once or more than once, no matter where the data flows. It's not very important, though, and occasionally causes problems, particularly in combinators. But it's still good.

Tuesday, July 3, 2007

Licensing

I don't understand why some people who aren't making a profit by coding/writing want to restrict the ways other people use their work. To me, it makes the most sense to let anyone do anything they want with (the very small, useless amount of) things that I've made--code, essays, etc--even not acknowledge me. So, I hereby declare that anything I have written in the past (I can't talk about the future) can be used for any purpose, but without any warranty, without even the warranty of merchantability or fitness for a particular purpose. This, as far as I can tell, is more permissive than the Factor license (a BSD license), which requires that the license and a note of authorship be distributed with it. Just do whatever you want.

Realistically, I don't think anyone will sue me because I allowed my software to be redistributed without its license and they found it to be unmerchantable or unfit for a particular purpose. So I'm not worried. If you ask me in the future, I'll probably license any future writing/software the same way.