Useless Factor: November 2007

Thursday, November 22, 2007

FactorCon MN 2007 post-mortem

FactorCon 2007 in Minneapolis was awesome, despite low attendance: it was just me, Doug, and Slava. Besides coding, we went to one of Doug's girlfriend's concerts, visited Loring Park with the big sculpture of the cherry on a spoon, made mole twice, out of guacs, and successfully avoided bro rape. I don't have pictures, but Doug or Slava probably will have them soon. Unfortunately, unlike the last FactorCon which I missed, we only stayed up until 3:00 or so coding most nights.

Overall, it was maybe a bit too short, but here's what came out of it:

Slava had a project to get Windows I/O working better, and make deployment work for Windows the way it does for Mac OS X. The deployment tool can spit out a directory with the proper DLLs, a minimal image, and an executable, but it's not finished yet. Slava blogged about this.

Doug's project was to get his regex engine in Factor to work better. Instead of refurbishing the old one, he's writing a new one using parser combinators which generate a parser which uses parser combinators itself. This should be more efficient once Chris Double's packrat parser in Factor is done.

My project, which I really didn't intend to be my project, was getting Factor to work on Mac OS X x86-64 with Leopard. This wasn't going to be that hard; by the first day, I was done with all the compilation stuff. The only thing that made it difficult was the fact that Apple significantly changed the Objective-C API in their new ObjC 2.0, included in Leopard. In 32-bit mode, there's backwards compatibility, but not in 64-bit mode. This is going to be a bit more work.

Looks like most future FactorCons will be in Austin, since that's where most people are, for some reason. I hope to see many of you (my readers) there!

Because of my college's weird trimester system, which conjures up images of pregnancy, we have a six-week break from before Thanksgiving until after the new year. (Next term, I'll be taking Abstract Algebra, Algorithms, Social Dance and Syntax Theory, all of which should be fun.) So, over the next month or so, during the break, you can expect more regular posts, and maybe I'll even be able to get some programming done.

Wednesday, November 21, 2007

When Unicode doesn't just work itself out

So, in my job as a student worker at Carleton College, I'm working on programming the website. Just a week or so ago, I basically completed a classified housing ads module, which I'm fairly happy about, even if it was pretty trivial. They've made their own in-house CMS, now open-source, written in PHP and called Reason. Being a very respectful employee, I won't allude to any possible irony in this name. Not at all.

Anyway, a couple of days ago, one of the non-student workers, Margaret (the names of those involved have been swapped with others' to protect their identity) came in with a question: can anybody, who's not doing something important, come up with a regular expression to detect bad UTF-8? Since I was basically unoccupied, reading a book from the library in preparation for my next super-secret project, and since I know a little about Unicode, I volunteered.

I sat down with Margaret as she explained the problem: somehow, a bunch of bad Unicode had gotten into Carleton's database. This isn't a bug that we can reproduce, but a number of times in the past, there have been these question marks, like �, appearing in the content. Whenever it comes up, we delete it, but we need a regular expression to find all of the bad parts.

When this situation occurs, there is malformed UTF-8, where UTF-8 is the encoding used internally in Reason. Reason uses the Unicode-blind approach I discussed earlier. The Unicode-blind approach failed here, though. Most likely it's because of strange, invalid input from browsers which can't easily be replicated, except by artificially constructing queries. All good web designers know that you can't trust the browser (a fact which I internalized only recently) and that apparently extends down to the validity of UTF-8 itself.

Margaret's idea was to find characters whose most significant bit was on, surrounded by characters whose most significant bit was off. However, there are other places where Unicode could be malformed and checking for this specific case isn't enough. With a bit of cursory Googling, we found phputf8, a pure-PHP library which has functions for validating, cleaning, encoding and decoding UTF-8. We could use this library to clean all input, to ensure that it is in valid UTF-8 and will appear without � all the time. This would be a bit computationally intensive, though.
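
To give a feel for what a validation pass has to check beyond Margaret's high-bit test, here is a minimal sketch in Haskell. It is not phputf8's code or API, just the standard UTF-8 byte ranges written out by hand; the names isValidUtf8 and cont are made up for this illustration. It rejects truncated sequences, stray continuation bytes, overlong forms, surrogates, and anything above U+10FFFF.

```haskell
import Data.Word (Word8)

-- True if the octets form well-formed UTF-8 (standard byte ranges).
isValidUtf8 :: [Word8] -> Bool
isValidUtf8 [] = True
isValidUtf8 (b : bs)
  | b <= 0x7F              = isValidUtf8 bs       -- ASCII
  | b >= 0xC2 && b <= 0xDF = cont 1 0x80 0xBF bs  -- 2 octets (0xC0, 0xC1 are overlong)
  | b == 0xE0              = cont 2 0xA0 0xBF bs  -- 3 octets, excluding overlong forms
  | b == 0xED              = cont 2 0x80 0x9F bs  -- 3 octets, excluding surrogates
  | b >= 0xE1 && b <= 0xEF = cont 2 0x80 0xBF bs  -- remaining 3-octet leads
  | b == 0xF0              = cont 3 0x90 0xBF bs  -- 4 octets, excluding overlong forms
  | b >= 0xF1 && b <= 0xF3 = cont 3 0x80 0xBF bs
  | b == 0xF4              = cont 3 0x80 0x8F bs  -- nothing above U+10FFFF
  | otherwise              = False                -- stray continuation or invalid lead

-- Expect n continuation octets: the first within [lo, hi], the rest within
-- [0x80, 0xBF]; then carry on validating whatever follows.
cont :: Int -> Word8 -> Word8 -> [Word8] -> Bool
cont n lo hi bs =
  case splitAt n bs of
    (c : cs, rest)
      | length cs == n - 1
      , c >= lo, c <= hi
      , all isCont cs -> isValidUtf8 rest
    _ -> False
  where
    isCont x = x >= 0x80 && x <= 0xBF
```

A lone Latin-1 byte between ASCII characters, such as [0x41, 0xE9, 0x42], fails this check, but so does a sequence that is merely cut short, which the "high bit surrounded by low bits" pattern would not always catch.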

Another approach is to take advantage of the database itself. Reason works on MySQL, but what character encoding does it use? Since Margaret didn't know, I asked Eric, our omnipotent server admin. Carleton actually has one SQL server instance where there are many different databases. Some of these databases use UTF-8, Eric told me, but the main Reason database is encoded in ISO Latin 1, as far as SQL is concerned. The content is, in truth, encoded in UTF-8; when Reason was started, MySQL was at version 3, which didn't support Unicode, and we continue to use the same database. Eric guided me through the migration in a wonderfully hackish way: serialize the whole database, then in that serialization, replace all instances of "latin1" with "UTF8", then deserialize the database.

There are still a couple potential problems: for one, it's possible that there could be invalid UTF-8 in the database that MySQL doesn't check for. The best way to fix this would be to do a one-time cleanup with something like phputf8 before the migration. Otherwise, things will be sort of corrupted from the start. After this, it shouldn't be possible to put invalid UTF-8 in the database. The second thing is that there are still some remaining Unicode issues, most notably normalization. Though we could trust the browser to give us text in NFC, that's generally a bad idea. It'd be better to normalize all text, and MediaWiki has code for that.

Once all strings are considered UTF-8 by the database, things should move much more smoothly. Not only will invalid UTF-8 not be accepted, but collation will work properly, since MySQL supports Unicode collation consistent with the Unicode Collation Algorithm (UCA) and Default Unicode Collation Element Table (DUCET). But there's a minor caveat: MySQL 5.0 and lower only support three octets of UTF-8. In 2001, Unicode 3.1 assigned the first characters in the supplementary planes, which need a fourth octet in UTF-8; these are used for obscure Han characters, math symbols, old scripts, etc. MySQL 5.0 doesn't allow any of these to be used in UTF-8 mode, though the Unicode-blind approach does allow it.
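
To make the three-octet limit concrete, here is a small Haskell sketch that counts how many octets UTF-8 needs for a character. The names utf8Length and fitsThreeOctetUtf8 are invented for this example and have nothing to do with MySQL's own code; the thresholds are the standard UTF-8 ones.

```haskell
import Data.Char (ord)

-- Number of octets UTF-8 needs to encode one character.
utf8Length :: Char -> Int
utf8Length ch
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3  -- the whole BMP fits in at most three octets
  | otherwise   = 4  -- supplementary planes (U+10000 and up)
  where
    n = ord ch

-- Hypothetical helper: does every character stay within three octets,
-- i.e. would the text fit a three-octet-only UTF-8 column?
fitsThreeOctetUtf8 :: String -> Bool
fitsThreeOctetUtf8 = all ((<= 3) . utf8Length)
```

For instance, utf8Length '\x1D11E' (a musical symbol outside the BMP) comes out as 4, so a string containing it would not pass fitsThreeOctetUtf8.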

Nevertheless, it's good to be aware of Unicode. The fact that you can sometimes get away without noticing Unicode and its encodings is not a reason to ignore it.

Update: Alexander Barkov, a MySQL developer, pointed out to me that MySQL 6.0 supports three encodings, utf8, utf16 and utf32, including support for code points outside the BMP. Previous versions of MySQL had only three octets of utf8, or the fixed-width ucs2, which only allows the BMP. I'm very happy about this, but of course it takes some time to upgrade versions.

Wednesday, November 14, 2007

Programming Language Consciousness

If Nelson Mandela is South Africa's Martin Luther King Jr., then Steve Biko is its Malcolm X*. Steve Biko, under the pen name Frank Talk, developed the philosophy of Black Consciousness. This is the idea that Black people in South Africa need to work together with each other to liberate themselves from Apartheid. Liberation would not come as a gift from White liberals, but would necessarily come exclusively from the efforts of the oppressed themselves. Before integration is necessary, before it is even helpful, Blacks need to be proud of their own culture and conscious of their oppression.

Some language communities (Tcl, for example) have a shared belief that it's best to write many things in C, and then bind that library to the higher-level language. C has some advantages: it can be very fast, and it can integrate well with external libraries, which are usually also written in C or C++. So, when users of these languages need something like a GUI library, or a complex data structure, or a function which could sensibly be included as a primitive, or just something that they feel should run a little faster, they go and implement that thing in C. But one of the sources of inspiration for these languages was always to have something better to program in than C. Once the programming language exists, not as much repetitive, low-level coding has to be done in that language.

If we programmers are going to be liberated from this oppressive world of C and other imperative, low-level languages with little sense of abstraction, we have to take charge. A new language could be written in C, but then, once the core is there, libraries should be written in that language. This is the only way that liberation can be achieved.** It's fine to write a binding for existing C libraries, but for new libraries Steve Biko, Alan Turing and I are here to tell you that you do have the power to write them in that language itself! Don't fall into the incrementalist trap of using the new language for glue and C for the "heavy-duty" stuff, as this will only prolong and damage the struggle. Code written in the new, better language will be easier to maintain and extend. Paraphrasing Biko speaking about not relying on one's oppressor: Programmer, you are on your own.

* That is, if MLK had founded a guerrilla organization and later become president, and if Malcolm X had been viciously murdered by the government. I'm referring to the old Malcolm X, of course, not after he started being more polite in rhetoric to White people. As it turned out, Mandela's movement (the African National Congress) was far more successful than the movement that Biko inspired (the Black Consciousness Movement, and later the Azanian People's Organization). Also, in the early '90s, the ANC was basically at war with the Inkatha Freedom Party, another liberation movement, in what's today KwaZulu-Natal***, and they also had significant conflicts with Azapo during the run-up to the elections. But who's counting? It's called an analogy, people. By the way, for those of you who are aware that there are more than just Africans and Europeans in South Africa, Biko extended the definition of Black to include everyone who's oppressed and acknowledges it, basically. (This way, Indians and Colored people could join the movement.)

** OK, in reality, as the above footnote notes, the multiracialist ANC beat out Azapo, but it's the idea that matters...

*** KwaZulu-Natal is my favorite province ever. Who would have thought of a CamelCase name? (Well, must've been those missionaries who created the current writing system for isiZulu.)

If my relatively intense (for me), intro-level History of Southern Africa course is good for anything, it's good for drawing spurious historical analogies for slightly tongue-in-cheek essays, right? This started out being an essay explaining why I'm a little uncomfortable with Io, because of their philosophy of implementing everything in C, but I ended up not mentioning it, except in this footnote. Sadly, this will be my most-read essay on South Africa.

Update: My wonderful history professor, Jamie Monson, read this post and told me that I was a little wrong about the historical fact about the ANC fighting with Azapo, because they mostly fought with the Inkatha Freedom Party. How could I have made such an error? It's fixed now.

Monday, November 5, 2007

The currying theorem

My favorite class in college right now is Introduction to Mathematical Structures, which explains the basics of naive set theory and theorem proving. Today, we were studying cardinal arithmetic when a startling proposition appeared on the board:

Is there a bijection h: {M -> {L -> K}} -> {(M × L) -> K}

The notation in this post is described at the end

Immediately, I shouted out "That's just like currying!" (I blogged about currying a little while ago.) My professor, always mildly annoyed when I shouted out such random things, said "what?" to which I mumbled "never mind."

Some set theory

Since my class uses only naive set theory, a simple definition of a set suffices: a set is something with one well-defined operation: a membership test. If A is a set, and a is something, then we can tell if a is a member of A or a is not a member of A.

One thing you can build up using sets is a Cartesian product. I won't go into the details, but it's well-defined to have an ordered pair (a, b). With sets A and B, the Cartesian product of A and B is the set of all (a, b) for any a in A and b in B.

A function f: A -> B can be defined as a subset of this Cartesian product A × B, where f(a) = b if (a, b) is in f. There are additional requirements on the function f, though: it must be total (meaning, for each a in A, there is an (a, b) for some b in B) and it must be a function (meaning, for any a in A, there is only one b in B such that (a, b) is in f).
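
For finite sets, the two requirements can be spelled out directly. This is only a sketch: sets are modelled as Haskell lists without duplicates, the names Relation, isTotal, isSingleValued and isFunctionOn are made up here, and it doesn't bother checking that every pair really lies in A × B.

```haskell
-- A relation between A and B, as a list of (a, b) pairs (assumed duplicate-free).
type Relation a b = [(a, b)]

-- Total: every a in the domain appears in at least one pair.
isTotal :: Eq a => [a] -> Relation a b -> Bool
isTotal dom rel = all (\a -> any ((== a) . fst) rel) dom

-- Single-valued: no a is paired with two different b's.
isSingleValued :: (Eq a, Eq b) => Relation a b -> Bool
isSingleValued rel =
  and [b1 == b2 | (a1, b1) <- rel, (a2, b2) <- rel, a1 == a2]

-- The relation is a function on dom when both conditions hold.
isFunctionOn :: (Eq a, Eq b) => [a] -> Relation a b -> Bool
isFunctionOn dom rel = isTotal dom rel && isSingleValued rel
```

With A containing 1, 2 and 3, the pairs (1, 'a'), (2, 'b'), (3, 'a') pass, while dropping the pair for 3 breaks totality and adding (1, 'b') breaks single-valuedness.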

Two sets A and B are considered equinumerous if there is a one-to-one correspondence between them. This means, for each element a in A, there is exactly one b in B which can be paired up to it, and vice versa. We can put this one-to-one correspondence into a function, called a bijection. Since functions are just subsets of A × B, a bijective function is a set of ordered pairs, one for each element of A, pairing it with a unique element of B and covering all of the elements of B. If the two sets can be paired up like this, we can consider them the same size.

So, what kind of properties does a bijection f: A -> B have? For one, it has to be a function defined on all of A, with each element of A corresponding to an element of B. Another property is that no two elements of A correspond to the same element of B; this is called injectivity. A third property is that every element of B has to have at least one element of A corresponding to it; this is called surjectivity.

This is all a little complicated, so let's take a simple example. Let's show that the natural numbers N (this is the set containing 1, 2, 3, 4, and so on) are equinumerous to the even natural numbers E (2, 4, 6, 8, ...). If this is true, that means there are just as many natural numbers as even numbers, a very counterintuitive proposition. But there's a very simple function f: N -> E which demonstrates a pairing of them:

f(x) = 2x

All this does is pair a natural number with two times its value, which will be an even number. We can visualize f's pairing as the set containing (1, 2), (2, 4), (3, 6), (4, 8) and so on. To prove mathematically that f is a bijection, there are three things we need to show:

f is a total function: for any natural number n, there is obviously only one e such that e = 2n. And, obviously, for any n, it's well-defined to talk about 2n.

f is injective: This means that, for any e, there is no more than one n which corresponds to it. Mathematically, it's easy to show this by demonstrating, if f(n₁) = f(n₂), then n₁ = n₂. In this case, f(n₁) = f(n₂) implies 2⋅n₁ = 2⋅n₂, meaning that n₁ = n₂.

f is surjective: For every even number e, we need to demonstrate that there is a corresponding natural number n such that f(n) = e. This shows that f covers all of E. In this case, it's very easy to show that: all you need to do is divide an even number by two to get a natural number that, when doubled, yields that same even number.
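
The three steps above can be spot-checked on a finite prefix of N. This is only a sanity check, not a proof, since no program can run through all the naturals; fInv and roundTripsUpTo are names invented for this sketch.

```haskell
-- f pairs each natural number with an even number.
f :: Integer -> Integer
f n = 2 * n

-- Candidate inverse: halve an even number.
fInv :: Integer -> Integer
fInv e = e `div` 2

-- Both round trips agree on the first k naturals and the first k evens.
roundTripsUpTo :: Integer -> Bool
roundTripsUpTo k =
  all (\n -> fInv (f n) == n) [1 .. k]
    && all (\e -> f (fInv e) == e) [2, 4 .. 2 * k]
```

Having an inverse that undoes f in both directions is exactly what injectivity and surjectivity together buy you, which is why the check is phrased as two round trips.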

What?

So then what does that have to do with currying? Let's look at that original proposition again:

Is there a bijection h: {M -> {L -> K}} -> {(M × L) -> K}

If this is true, it means that the set of functions {M -> {L -> K}} is equinumerous to the set of functions {(M × L) -> K}, which means, basically, that they are equivalent and represent each other.

Let's think about what these two sets represent. {M -> {L -> K}} is the set of functions from M to the set of functions from L to K. Wait a minute: that's just like a curried function! That's just like f(m)(l) = k: a function which returns a function to take the second argument. {(M × L) -> K} is the set of functions which take an ordered pair (m, l) and return an element of K, basically f(m, l) = k.

So, if this bijection h exists, it could be the uncurry function: it takes a curried function and makes a normal one out of it. Well, by what we know, there are other things it could do, but this is what would make the most sense. Let's define this uncurry function:

h(f)(m, l) = f(m)(l)

Now, is this a bijection? If it is, that means that curried functions are essentially equivalent to uncurried functions, meaning that it's valid to curry things. This is what makes it possible—easy, even—to have some languages be auto-currying (eg OCaml) and some languages not be (eg SML), and despite that use the same basic programming style. To prove that h is a bijection, we can use the same proof outline as the above demonstration that the even numbers are equinumerous to the natural numbers:

h is a total function: for any function f: M -> {L -> K}, there is only one g such that g(m, l) = f(m)(l); if there were another function which satisfied this property, it would equal g. And for any f: M -> {L -> K}, it is valid to call that function on m, where m is in M, and call that result on l, where l is in L.

h is injective: Another way we can show injectivity is by showing that if f ≠ g, then h(f) ≠ h(g). This is equivalent to what was shown before. Remember, f and g: M -> {L -> K}. So, if f ≠ g, then there is some m in M such that f(m) ≠ g(m). This implies that there is some l in L such that f(m)(l) ≠ g(m)(l). We know that f(m)(l) = h(f)(m, l), and that g(m)(l) = h(g)(m, l), by the definition of h. Therefore, h(f)(m, l) ≠ h(g)(m, l), meaning h(f) ≠ h(g). This demonstrates that h is injective: each different curried function is paired up with a different non-curried function.

h is surjective: For h to be surjective, there has to be a curried function corresponding to every uncurried function. Let's call the curried function f and the uncurried function u. If we have an uncurried function, we can easily pull the curried one out with the currying operation c:

c(u)(m)(l) = u(m, l)

So, for every uncurried function u, the corresponding curried function is c(u). If there is always an f such that h(f) = u for any u: (M × L) -> K, then h is surjective. Let's show that c(u) is that f. h(c(u))(m, l) = c(u)(m)(l) = u(m, l), so h(c(u)) = u. Since c(u) is defined for any u of the appropriate function type, h is surjective.

In case you lost track, we just proved that currying makes sense.

Exercise to the reader: Prove that the function h defined above is the only bijection between {M -> {L -> K}} and {(M × L) -> K} which works for any sets M, L and K with no additional knowledge. Hint: I don't know how to do this, but I'm pretty sure it's true.

Cardinal arithmetic

When you get into all this complex mathematics, you can do something really cool: arithmetic. Let's redefine numbers completely, ignoring what we know already. A number, like 1, 2, 3, etc, is the set of all sets of that size. (Actually, scratch that—we can't really do that. Let's stipulate that these sets are subsets of some unspecified, but really big, U, and then keep going as if nothing ever happened. This helps us avert a gigantic contradiction.) More specifically, 1 is the set of sets which are equinumerous (remember, bijection) to the set containing just 1; 2 is the set of sets which are equinumerous to the set containing 1 and 2; etc. We use the notation |A| to refer to the cardinality, or this weird number system's value for the size, of A. So, if A contains the numbers 42, π, 1984 and nothing else, |A| = 3 since there is a bijection between A and the set containing 1, 2, 3, as well as a whole bunch of other sets: they all have the same number of elements.

It's easy to build up some basic operations for these cardinal numbers. If sets A and B are disjoint, |A| + |B| = |the set of anything in either A or B, or both|. For any sets A and B, |A|⋅|B| = |A × B|. For any sets A and B, |A|^|B| = |{B -> A}|. It's useful to imagine these with some small finite sets to verify that they make sense.
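
Here is one way to try those small finite sets by machine rather than by imagination. It's a sketch under some assumptions: sets are Haskell lists without duplicates, a function from B to A is represented as a list of (input, output) pairs, and the names functions and cartesian are made up for the occasion.

```haskell
-- Every function from a finite domain to a finite codomain, each represented
-- as a list of (input, output) pairs: one choice of output per input.
functions :: [b] -> [a] -> [[(b, a)]]
functions dom cod = mapM (\x -> [(x, y) | y <- cod]) dom

-- Cartesian product of two finite sets.
cartesian :: [a] -> [b] -> [(a, b)]
cartesian xs ys = [(x, y) | x <- xs, y <- ys]
```

With A = "xyz" and B = [True, False], length (functions B A) is 9 = 3^2 and length (cartesian A B) is 6 = 3⋅2. An empty domain gives exactly one function, the empty one, matching |A|^0 = 1.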

Now, we can show a whole bunch of basic identities we already knew about positive integers make sense here over all cardinal numbers. Here's one: for any cardinal numbers κ, λ, and μ: (κ^λ)^μ = κ^(λ⋅μ). So, what's the equivalent statement when translated into a fact about sets?

How does this look in programming?

Going back to the practical(ish) side, currying and uncurrying actually come in handy when programming Haskell. Here's an excerpt from the Haskell Prelude:


-- curry converts an uncurried function to a curried function;
-- uncurry converts a curried function to a function on pairs.

curry            :: ((a, b) -> c) -> a -> b -> c
curry f x y      =  f (x, y)

uncurry          :: (a -> b -> c) -> ((a, b) -> c)
uncurry f p      =  f (fst p) (snd p)
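
To tie this back to the proof, the h and c from earlier can be written out under their own names and checked against each other on a handful of inputs. This is a spot check on sample pairs, not a proof, and samples, hAfterC and cAfterH are names invented for this sketch.

```haskell
-- h from the proof: turn a curried function into one on pairs.
h :: (m -> l -> k) -> ((m, l) -> k)
h f (m, l) = f m l

-- c from the proof: turn a function on pairs into a curried one.
c :: ((m, l) -> k) -> (m -> l -> k)
c u m l = u (m, l)

-- Sample inputs standing in for M × L (an arbitrary choice for this check).
samples :: [(Int, Int)]
samples = [(m, l) | m <- [0 .. 5], l <- [0 .. 5]]

-- h (c u) agrees with u on every sampled pair.
hAfterC :: ((Int, Int) -> Int) -> Bool
hAfterC u = all (\p -> h (c u) p == u p) samples

-- c (h f) agrees with f on every sampled pair.
cAfterH :: (Int -> Int -> Int) -> Bool
cAfterH f = all (\(m, l) -> c (h f) m l == f m l) samples
```

Here h behaves like the Prelude's uncurry and c like curry, so hAfterC and cAfterH are checking that the two undo each other in both directions.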

As an example of the use of uncurry, here's an implementation of zipWith (not from the prelude) which doesn't duplicate the logic of map, assuming an existing implementation of zip:


import Prelude hiding (zipWith)  -- avoid clashing with the Prelude's own zipWith

zipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
zipWith f a b = map (uncurry f) $ zip a b

Overview of notation

Capital letters (eg K, L, M) are sets.
Lower case letters are functions (eg f, g, h) or members of sets (k, l, m).
Greek lowercase letters (eg κ, λ, μ) are cardinal numbers.
A × B is the Cartesian product of A and B.
f: A -> B means f is a total function from A to B.
{A -> B} is the set of functions from A to B. (I made this notation up)

Maybe I'll never be the Good Math, Bad Math guy, but I still enjoy the chance to explain stuff. Maybe this will make up for the stupidity of my last post.