Useless Factor: April 2008

Tuesday, April 29, 2008

Potential ideas to explore

I haven't written in a while, and it's a little hard to get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent...

Possible extensions to Inverse

I've been thinking about possible ways to generalize my system for concatenative pattern matching, currently in extra/inverse. There are two ways to go about it: making a more general constraint solving system, and giving access to the old input when inverting something, as in the Harmony project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem like they're coming from nowhere, but they're actually very closely related. (To someone not familiar with it, see my previous blog post describing Inverse.)

Constraint solving

The idea of resolving constraints is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even if enough facts are known to, potentially, figure out what everything is. For example, Diophantine equations can be solved by a fully general constraint-solving system, but they're known to be undecidable in general.

So what can constraint solving get you in Inverse? Well, imagine an inverse to bi. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting [ first ] [ second ] bi. Inverting this should get the same result as first2 (which has a hard-coded inverse right now, inverting to 2array). But it won't work.

A way for [ first ] [ second ] bi to work would be using the following steps:

Initialize a logic variable X as unbound
Unify X with the information, "the first element is what's second from the top of the stack (at runtime)". Now it's known that X is a sequence of length at least 1.
Unify X with the information, "the second element is what's on the top of the stack (at runtime)". Now it's know that X is a sequence of length at least two.
From the information we have about X, produce a canonical representation, since the inverted quotation is over: an array of the minimum possible length.

This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out.

Formally, you can think of Inverse as already a reasonable constraint solving system, for a limited problem domain. Given [ f ], and the statement about stacks A and B that f(A) = B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But, a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor.

The Harmony Project

The Harmony Project, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. Everything operates on edge-labeled trees.

Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.)

So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now.

Backtracking

In Mark Tullsen's paper on first-class patterns, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. It's the simplest to use the Maybe monad, and that corresponds to how pattern matching systems normally work. But if the List monad is used instead, then you easily get backtracking. This could be ported to Factor either by using monads or, maybe easier, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant in things like switch statements, rather than undo itself. I'm not sure if it'd actually be useful in practice.

Garbage collection research ideas

Because the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:

Figure out how efficient garbage collection on Non-Uniform Memory Access systems can work. The problem (if it is a problem) is that plain old garbage collection on multiprocessor NUMA systems isn't as fast as it could be, because a chunk of memory allocated for a thread may be far away from where it's used. One way to ensure locality is to give each processor (at least) its own heap, where the heap is guaranteed to be stored in the closest memory. But if data needs to be shared between processors, this can be too limiting. A piece of data can be kept on the RAM closest the processor which made the allocating call, but maybe it'd be beneficial to collect data on which processor is using which data, and dynamically move data around to different places in RAM to put it closest to where it's used. A related issue is maximizing locality when actually performing the tracing in the GC, which I have no ideas about.

Run a real benchmark comparing several GC algorithms. Probably the most annoying thing for programming language implementors trying to pick a good GC algorithm is that there's no comprehensive benchmark to refer to. No one really knows which algorithm is the fastest, so there are two strategies remaining: pick the one that sounds the fastest, or do trial and error among just a few. Each paper about a new algorithm reports speed improvements—over significantly older algorithms. It'd be a big project, but I think it's possible to make a good benchmark suite and test how long it takes for these algorithms to run, in terms of absolute throughput and pause length and frequency, given different allocation strategies. If it's possible, it'd be nice to know what kind of GC performs best given a particular memory use pattern.

Garbage collector implementation in proof-carrying code. There are a couple invariants that garbage collectors have, that must be preserved. For example, the user can't be exposed to any forwarding pointers, and a new garbage collection can't be started when forwarding pointers exist. The idea of proof-carrying code (an explicit proof, which is type-checked to be accurate, is given with the code) isn't new; it's mostly been used to prove memory consistency safety given untrusted code. But maybe it could be used to prove that a GC implementation is correct.

These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with.

Ragel-style state machines in Factor

In my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be written as a deterministic finite automaton (finite state machine). Hopefully, I'll explain more about this in a future blog post.

Anyway, if you've heard of Ragel, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Pacrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same.

I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once.

(By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.)

Sunday, April 6, 2008

Programming in a series of trivial one-liners

Among Perl programmers, a one-line program is considered a useful piece of hackage, something to show off to your friends as a surprisingly simple way to do a particular Unix or text-processing task. Outsiders tend to deride these one-liners as line noise, but there's a certain virtue to it: in just one line, in certain programming languages, it's possible to create meaningful functionality.

APL, lived on by its derivatives like K, Q, J and Dyalog pioneered the concept of writing entire programs in a bunch of one-liners. Because their syntax is so terse and because of the powerful and high-level constructs of array processing, you can pack a lot into just 80 characters. In most K programs I've seen, each one does something non-trivial, though this isn't always the case. It can take some time to decode just a single line. Reading Perl one-liners is the same way.

Factor continues the one-line tradition. In general, it's considered good style to write your words in one, or sometimes two or three, lines each. But this isn't because we like to pack a lot into each line. Rather, each word is rather trivial, using the words defined before it. After enough simple things are combined, something non-trivial can result, but each step is easy to understand.

Because Factor is concatenative (concatenation of programs denotes composition) it's easier to split things into these trivial one-liners. It can be done by copy and paste after the initial code is already written; there are no local variables whose name has to be changed. One liners in Factor aren't exceptional or an eccentric trait of the community. They're the norm and programs written otherwise are considered in bad style.

Enough philosophizing. How does this work in practice? I'm working on encodings right now, so I'll break down how this worked out in implementing 8-bit encodings like ISO-8859 and Windows-1252. These encodings are just a mapping of bytes to characters. Conveniently, a bunch of resource files describing these mappings which are all in exactly the same format is already exists on the Unicode website.

The first thing to do in implementing this is to parse and process the resource file, turning it into two tables for fast lookup in either direction. Instead of putting this in one word, it's defined in five, each one or two lines long. First, tail-if is a utility word which works like tail but leaves the sequence as it is if it's shorter.


: tail-if ( seq n -- newseq )
    2dup swap length <= [ tail ] [ drop ] if ;

Using that, process-contents an array of lines and turns it into an associative mapping (in the form of an array of pairs) from octets to code points.


: process-contents ( lines -- assoc )
    [ "#" split1 drop ] map [ empty? not ] subset
    [ "\t" split 2 head [ 2 tail-if hex> ] map ] map ;

byte>ch takes this assoc, the product of process-contents and produces an array which can be used to get the code point corresponding to a byte.


: byte>ch ( assoc -- array )
    256 replacement-char <array>
    [ [ swapd set-nth ] curry assoc-each ] keep ;

ch>byte is the opposite, taking the original assoc and producing an efficiently indexable mapping from code points to octets.


: ch>byte ( assoc -- newassoc )
    [ swap ] assoc-map >hashtable ;

Finally, parse-file puts these all together and makes both mappings, given a stream for the resource file.


: parse-file ( stream -- byte>ch ch>byte )
    lines process-contents
    [ byte>ch ] [ ch>byte ] bi ;

Next, the structure of the encoding itself is defined. A single tuple named 8-bit is used to represent all 8-bit encodings. It contains the encoding and decoding table, as well as the name of the encoding. The encode-8-bit and decode-8-bit words just take some encoding or decoding information and look the code point or octet up in the given table.


TUPLE: 8-bit name decode encode ;

: encode-8-bit ( char stream assoc -- )
    swapd at* [ encode-error ] unless swap stream-write1 ;

M: 8-bit encode-char
    encode>> encode-8-bit ;

: decode-8-bit ( stream array -- char/f )
    swap stream-read1 dup
    [ swap nth [ replacement-char ] unless* ] [ nip ] if ;

M: 8-bit decode-char
    decode>> decode-8-bit ;

I wanted to design this, like existing Unicode functionality, to read resource files at parsetime rather than to generate Factor source code. Though I don't expect these encodings to change, the result is still more maintainable as it leaves a lower volume of code. If I were implementing this in C or Java or R5RS Scheme or Haskell98, this wouldn't be possible. So make-8-bit defines an encoding given a word and the lookup tables to use:


: make-8-bit ( word byte>ch ch>byte -- )
    [ 8-bit construct-boa ] 2curry dupd curry define ;

define-8-bit-encoding puts everything together. It takes a string for the name of an encoding to be defined and a stream, reads the appropriate resource file and defines an 8-bit encoding.


: define-8-bit-encoding ( name stream -- )
    >r in get create r> parse-file make-8-bit ;

To top it all off, here's what's needed to define all the 8-bit encodings we want:


: mappings {
    { "latin1" "8859-1" }
    { "latin2" "8859-2" }
    ! ...
} ;

: encoding-file ( file-name -- stream )
    "extra/io/encodings/8-bit/" ".TXT"
    swapd 3append resource-path ascii  ;

[
    "io.encodings.8-bit" in [
        mappings [ encoding-file define-8-bit-encoding ] assoc-each
    ] with-variable
] with-compilation-unit

So by combining these trivial one-liners or two-liners, you can make something that's not as trivial. The end product is that hard things are made easy, which is the goal of every practical programming language. The point of this isn't to say that this code is perfect (it's very far from that), but just to demonstrate how clear things become when they're broken down in this way.

When I first started programming Factor, I thought that it only made sense to define things separately when it was conceivable that something else would use them, or that it'd be individually useful for testing, or something like that. But actually, it's useful for more than that: for just making your program clear. In a way, the hardest thing to do when programming in Factor once you have the basics is to name each of these pieces and factor them out properly from your program. The result is far more maintainable and readable than if the factoring process has not been done.