Useless Factor: Potential ideas to explore

I haven't written in a while, and it's a little hard to get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent...

Possible extensions to Inverse

I've been thinking about possible ways to generalize my system for concatenative pattern matching, currently in extra/inverse. There are two ways to go about it: making a more general constraint solving system, and giving access to the old input when inverting something, as in the Harmony project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem like they're coming from nowhere, but they're actually very closely related. (To someone not familiar with it, see my previous blog post describing Inverse.)

Constraint solving

The idea of resolving constraints is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even if enough facts are known to, potentially, figure out what everything is. For example, Diophantine equations can be solved by a fully general constraint-solving system, but they're known to be undecidable in general.

So what can constraint solving get you in Inverse? Well, imagine an inverse to bi. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting [ first ] [ second ] bi. Inverting this should get the same result as first2 (which has a hard-coded inverse right now, inverting to 2array). But it won't work.

A way for [ first ] [ second ] bi to work would be using the following steps:

Initialize a logic variable X as unbound
Unify X with the information, "the first element is what's second from the top of the stack (at runtime)". Now it's known that X is a sequence of length at least 1.
Unify X with the information, "the second element is what's on the top of the stack (at runtime)". Now it's know that X is a sequence of length at least two.
From the information we have about X, produce a canonical representation, since the inverted quotation is over: an array of the minimum possible length.

This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out.

Formally, you can think of Inverse as already a reasonable constraint solving system, for a limited problem domain. Given [ f ], and the statement about stacks A and B that f(A) = B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But, a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor.

The Harmony Project

The Harmony Project, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. Everything operates on edge-labeled trees.

Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.)

So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now.

Backtracking

In Mark Tullsen's paper on first-class patterns, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. It's the simplest to use the Maybe monad, and that corresponds to how pattern matching systems normally work. But if the List monad is used instead, then you easily get backtracking. This could be ported to Factor either by using monads or, maybe easier, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant in things like switch statements, rather than undo itself. I'm not sure if it'd actually be useful in practice.

Garbage collection research ideas

Because the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:

Figure out how efficient garbage collection on Non-Uniform Memory Access systems can work. The problem (if it is a problem) is that plain old garbage collection on multiprocessor NUMA systems isn't as fast as it could be, because a chunk of memory allocated for a thread may be far away from where it's used. One way to ensure locality is to give each processor (at least) its own heap, where the heap is guaranteed to be stored in the closest memory. But if data needs to be shared between processors, this can be too limiting. A piece of data can be kept on the RAM closest the processor which made the allocating call, but maybe it'd be beneficial to collect data on which processor is using which data, and dynamically move data around to different places in RAM to put it closest to where it's used. A related issue is maximizing locality when actually performing the tracing in the GC, which I have no ideas about.

Run a real benchmark comparing several GC algorithms. Probably the most annoying thing for programming language implementors trying to pick a good GC algorithm is that there's no comprehensive benchmark to refer to. No one really knows which algorithm is the fastest, so there are two strategies remaining: pick the one that sounds the fastest, or do trial and error among just a few. Each paper about a new algorithm reports speed improvements—over significantly older algorithms. It'd be a big project, but I think it's possible to make a good benchmark suite and test how long it takes for these algorithms to run, in terms of absolute throughput and pause length and frequency, given different allocation strategies. If it's possible, it'd be nice to know what kind of GC performs best given a particular memory use pattern.

Garbage collector implementation in proof-carrying code. There are a couple invariants that garbage collectors have, that must be preserved. For example, the user can't be exposed to any forwarding pointers, and a new garbage collection can't be started when forwarding pointers exist. The idea of proof-carrying code (an explicit proof, which is type-checked to be accurate, is given with the code) isn't new; it's mostly been used to prove memory consistency safety given untrusted code. But maybe it could be used to prove that a GC implementation is correct.

These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with.

Ragel-style state machines in Factor

In my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be written as a deterministic finite automaton (finite state machine). Hopefully, I'll explain more about this in a future blog post.

Anyway, if you've heard of Ragel, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Pacrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same.

I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once.

(By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.)

Useless Factor

Tuesday, April 29, 2008

Potential ideas to explore