Friday, September 28, 2007

Metablog entry

Here's how to make a good blog: Just write something. There's no real secret to blog writing; you just have to write something that no one has written before. (And make it sort of interesting and intelligible, but those are more optional.)

I bet all of you reading this entry have a new idea in your head. Why not write about it?

Update: Removed irrelevant rambling

Tuesday, September 25, 2007

My little blog is finally growing up

I'm so happy! I had my first angry anonymous blog comment! Here it is; it was to the article Factor with a dash of curry.

WTF are you on? "More fashionable"? How about "correct"?

I had heard people talking about the shit people put on blog comments, but finally I get part of the action too! Thank you, Anonymous! This implies that I must have some kind of readership of some sort... strange.

If you don't understand that comment at first, I didn't either. I searched my article for 'more fashionable' and found that I wrote "Factor's curry is, by the more fashionable nomenclature of those at Lambda the Ultimate, partial application..." This is an example of what I would call interesting diction (or at least an attempt at that); I wasn't trying to say anything bad about LtU or misrepresent what currying 'technically' is.

Anyway, Haskell Curry was just a guy. His last name doesn't have any inherent meaning. In fact, no word has any inherent meaning, but rather only the meaning given to it by the people who speak a language. Pick up some Wittgenstein, Anonymous!

Sunday, September 23, 2007

Using bignums efficiently

The other day, Doug Coleman got on the #concatenative IRC channel complaining of a horrible bug: when he put a 100,000-digit prime number in a file and then tried to load the file, Factor hung. Doug speculated that this was a compiler bug, but I had another idea: the parser wasn't processing bignums efficiently. First, a little background. This article presumes some basic knowledge of computational complexity and big-O notation, which you should read up on if you don't already know about them.

Bignums and performance

A 'bignum' is Factor's term (grabbed from Lisp terminology) for an arbitrary-size integer too big to fit in a standard machine word. Integers which do fit in a word (actually, a word minus 3 bits for the header) are called 'fixnums'. On any platform that Factor currently supports, you can count on the fact that a number smaller than 2^29 will be a fixnum, and a number bigger than 2^61-1 will be a bignum.
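
You can check this in the listener with the class word, which reports the type of the object on top of the stack. (A minimal sketch; I'm assuming class and ^ are in scope in the listener, which varies a bit between Factor versions.)

100 class .
! fixnum
2 100 ^ class .
! bignum, since 2^100 doesn't fit in a word on any platform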

In most situations, this is an irrelevant implementation detail. In Factor, bignums are used with the same functions as fixnums (and all other builtin numeric types). But there is a subtle performance issue. On fixnums, it takes (basically) constant time--O(1)--to do (basically) any simple math operation. This includes addition, multiplication, division, exponentiation, square roots, etc.: all of these operations take basically the same amount of time on any fixnum. We can make this claim because all fixnums are fairly small, so there's a small constant bound on the time these operations take, even if it varies a little bit. In designing algorithms, programmers take advantage of this frequently.

However, with bignums, math operations take O(n) or more time, where n is the number of digits (bits) in the larger number. If you have two integers of arbitrary length, the only way to add them is, basically, the addition algorithm you learned in first grade, iterating through the bits from least significant to most significant. The best possible time for this kind of iteration is proportional to the number of bits--O(n). Multiplication, division and other operations take even more time. For purposes of analysis, let's say multiplication is O(n log n), where n is the number of digits in the bigger number, and exponentiation is O(d log d), where d is the number of digits in the result. (These are slightly faster than the real times, but they give us a good enough estimate while keeping the math mostly simple.)

To be efficient in processing bignums, this additional time for processing must be taken into account. It's very easy to write something which works instantly on fixnums but hangs almost indefinitely on large enough bignums, but there is usually a better way.

string>number

So, I suspected that Doug's code was slow because of a naive implementation of string>number, one which is not optimized for bignums. Looking recursively through the code, I could see that the conversion from strings to numbers bottoms out in the word digits>integer:

: string>number ( str -- n/f ) 10 base> ;

: base> ( str radix -- n/f )
    {
        ! 47 is '/' (a ratio), 46 is '.' (a float)
        { [ 47 pick member? ] [ string>ratio ] }
        { [ 46 pick member? ] [ drop string>float ] }
        { [ t ] [ string>integer ] }
    } cond ;

: string>integer ( str radix -- n/f )
    swap "-" ?head
    >r string>digits 2dup valid-digits?
    [ digits>integer r> [ neg ] when ] [ r> 3drop f ] if ;

: digits>integer ( radix seq -- n )
    0 rot [ swapd * + ] curry reduce ;
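
To make these entry points concrete, here's roughly how they behave in the listener (my examples, not from the original listing):

"1234" string>number .  ! 1234
"-42" string>number .   ! -42
"3/4" string>number .   ! 3/4
"1.5" string>number .   ! 1.5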

Basically, what digits>integer does is: for each item in the given sequence, the accumulator (starting at 0) is multiplied by the radix, and then the item is added to it. Here's an example invocation of digits>integer, which returns the number 1234:

10 { 1 2 3 4 } digits>integer
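
Tracing the reduce step by step (my annotation, not part of the original code):

10 { 1 2 3 4 } digits>integer
! accumulator: 0
! 0 * 10 + 1   = 1
! 1 * 10 + 2   = 12
! 12 * 10 + 3  = 123
! 123 * 10 + 4 = 1234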

Let's look at the computational complexity of running digits>integer, in a world where only fixnums exist. In this world, * and + run in constant time. Running digits>integer with a d-digit number will do d additions and d multiplications, for a total of d*(O(1)+O(1)) = O(d) time.

O(d) time is optimal, since the string's length is d in the first place, and we need to iterate through all of its characters.

But, if we assume that all arithmetic takes place in bignums, the calculation gets a little more complicated, and the time a bit worse. All together, O(d(d+1)/2) = O(d^2) time is spent in addition, and O((d log d)(d log d + 1)/2) = O(d^2 log^2 d) time is spent in multiplication. The latter dominates, so the total complexity is O(d^2 log^2 d). This is even worse than quadratic! There must be a better way.

Minimizing intermediate bignums

The problem here is that the intermediate bignums are too big. In parsing "1234", the accumulator first contains 0, then 1, then 12, then 123 and finally 1234. So the sum of the intermediate number lengths is d(d+1)/2 = O(d^2).

But here's another method: split the string into two equal parts, parse each of them individually, then combine the results. To combine the results, the number parsed from the front part has to be shifted left by the length of the back part (using the appropriate radix!). You can write base cases for strings of length 0 and 1, which shouldn't be split. (The value of _ { } digits>integer is 0 and _ { n } digits>integer is n.)

An example: 10 { 1 2 3 4 } digits>integer splits into 10 { 1 2 } digits>integer and 10 { 3 4 } digits>integer. By induction, let's assume that those intermediate calculations produce 12 and 34. Now, the value 12 must be multiplied by 10^2 = 100, since { 3 4 } is two digits long. Finally, add 1200 and 34, and you get 1234!
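
Written out as a trace (my annotation, using the words defined in the implementation below):

! 10 { 1 2 3 4 } digits>integer splits into:
!   10 { 1 2 } digits>integer  => 12
!   10 { 3 4 } digits>integer  => 34
! shift: 12 10 2 ^ *           => 1200
! add:   1200 34 +             => 1234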

The analysis for this is almost identical to that of mergesort or quicksort. For a string holding an 8-digit number, there are four main steps: the step processing the individual digits (really, 8 steps of constant time), then the step combining pairs of 1-digit numbers (4 steps of 2x time), then the step combining pairs of those numbers, 2 digits each (2 steps of 4x time), and the final step of adding the two 4-digit numbers together (1 step of 8x time). If you generalize this, there's a total of log_2(d)+1 levels, each taking O(d) time, yielding a total of O(d log d).
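
Stated as a recurrence (my restatement of the argument above, temporarily pretending the combining step is linear):

T(d) = 2 T(d/2) + O(d), with T(1) = O(1)

which solves to T(d) = O(d log d)--the same recurrence, and the same solution, as mergesort.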

Actually...

It's a little bit more complicated than that. O(d log d) is something like the complexity for summing a list, resulting in a bignum. But I ignored the other, more expensive operation: the left shift. A base two shift would be O(s+d), where s is the amount shifted over, and d is the number of digits in the thing being shifted. With a base two shift, the complexity O(d log d) would still be valid.

But this shift has an arbitrary radix (usually 10). This is done by calculating the radix raised to the power of the shift, and then multiplying that by the number which needs to be shifted. This takes a bit more time. Counting up the time taken for multiplication and exponentiation in the same manner as addition, we get a total time of O(d log(d) log(d log d)).

Implementation

In our implementation of this function, it'd be best if we could just go into math.parser, the vocabulary that defines digits>integer, and redefine just that word. The redefinition would then propagate to all the words that use it, all the way up to the Factor parser. Fortunately, Factor explicitly allows this kind of invasion. Just make sure that, after loading the code, everything is recompiled! Otherwise, the change might not be propagated. Here's the code you need:

USING: math kernel math.functions sequences combinators ;
IN: digits

! Split a sequence into two roughly equal halves.
: partition ( seq -- front end )
    [ length 2 /i ] keep cut ;

IN: math.parser

DEFER: digits>integer

! Parse each half separately, shift the front result left by the
! length of the back half (in the given radix), and add.
: split-parse ( radix seq -- n )
    partition [
        rot [ swap digits>integer ] curry 2apply
    ] 3keep nip
    length ^ swapd * + ;

: digits>integer ( radix seq -- n )
    dup length {
        { 0 [ 2drop 0 ] }
        { 1 [ nip first ] }
        [ drop split-parse ]
    } case ;

Loading this code makes parsing large bignums dramatically faster, though parsing smaller numbers gets a little bit slower. The easiest way to load the code is to put it at the path extra/digits/digits.factor, and then run the line USE: digits in the listener.
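
After loading it, a quick sanity check in the listener (my example, not from the original post):

USE: digits
10 { 1 2 3 4 } digits>integer .
! 1234
"100000000000000000000" string>number .
! 100000000000000000000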

So remember, boys and girls

Whenever doing mathematical calculations that might involve bignums, it's always important to remember the computational complexity of various mathematical operations. If you forget them, a very doable problem can suddenly become intractable.



A technical note about complexity: (for the nit-picky readers among you)

In truth, the best known algorithm for bignum multiplication takes O(n log(n) log(log(n))) time, using fast Fourier transforms, which I don't yet understand. (Well, actually there's one of time O(n log(n) 2^(log*(n))), which is even faster, but no one uses that yet.) Therefore, exponentiation should take O(d log(d) log(log(d))) time, where d is the size of the final result. This is because the algorithm's time is dominated by the final doubling of the digit count--that is, the last multiplication.

I felt that it was appropriate to use O(d log(d)) as an approximation of O(d log(d) log(log(d))), since the double log function grows very slowly, and it clutters up the math with no tangible result. For this analysis, it doesn't hurt anyone to elide it. If I were publishing this in some respectable academic thing (hah! as if that makes sense!), I'd change it at the expense of clarity.

Saturday, September 15, 2007

When Unicode just works itself out

I love college; there are so many smart people here. One smart person I met, named Paul, designed his own CMS called Really Cool Stuff (it started as a forum) using PHP and MySQL while in high school. It has tons of features, looks pretty well-designed, and he has real, non-programmer users. (He just upgraded and wiped the site, so it's a little empty right now.) I'm jealous.

I've heard many bad things about PHP's Unicode handling, so I asked Paul what he thought about it. He told me that he didn't write any special code for it. In PHP, strings are composed of plain 8-bit characters, and in MySQL, an 8-bit string type is also used. Convinced I had found a bug, I opened up a new forum post and entered a bunch of special characters, expecting to see a mess of misrendered partial code points from the lack of explicit handling.

Disappointingly, the program worked properly. All of the characters displayed as they were entered.

What happened?

It was too simple for me to realize first: it's UTF-8 all the way through.

UTF-8 is the encoding which is overwhelmingly used for transmission over the internet, due to its backwards compatibility with ASCII and its efficiency in representing ASCII. Like nearly all Unicode encodings, UTF-8 is represented as a string of bytes, with each character taking 1, 2, 3 or 4 bytes. In this case, the only thing the website needed to do was to get the string from input, store it in the database, and then serve it back out on a different page. PHP generally works with 8-bit strings, but here, it doesn't really matter. Modern web browsers treat everything as UTF-8 by default. Within PHP and MySQL, it's all just bytes that have no inherent meaning. It really doesn't matter whether the bytes are characters or UTF-8 octets, because the web browser controls the presentation.
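
For example, the four-character string "café" occupies five bytes in UTF-8: 63 61 66 for "caf", then the two-byte sequence C3 A9 for "é" (U+00E9). As long as every layer passes those five bytes through untouched, and the browser decodes them as UTF-8, nothing in between needs to know where the character boundaries are.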

So these two basic operations, storage and retrieval, happen to work just fine here. The input is UTF-8, and the output is interpreted as UTF-8, so everybody's happy. You could call this the Unicode-blind approach. This is basically all Paul's program does with strings: no fancy manipulation, just input and output. But there's a third operation going on here, implicit in lookup. The titles, usernames, etc. are similarly stored in 8-bit strings, and looking them up in the database involves, ultimately, testing two strings for equality.

Wait a minute!

This should raise a red flag here. A simple byte-equality comparison is being used when the strings may be in different, but canonical-equivalent, forms! (I discussed normalization and equivalence in detail earlier. Two strings are considered canonical-equivalent if they are the same after being put in NFD (or NFC).) That is, there might be strings which differ in which code points are used, but represent the same text. A simple example of this is the difference between a lower-case a with an acute accent (one character), and a lower-case a followed by a combining acute accent (two characters). Both of these represent the same piece of text, so they should be treated the same. The Unicode standard suggests that they be treated this way.

However, in the real world, this is rarely an issue. Probably it will never come up among Paul's users. From browser input, text is usually in a form that's mostly kinda like Normalization Form C (NFC). In any case, it's nearly always consistent among the people who use non-ASCII characters, so it will rarely create a problem.

It's disappointing, but many popular programs and libraries, including many Unicode fonts, don't treat canonical-equivalent strings the same. The Unicode standard only says that no process shall presume that another process treats any two canonical-equivalent strings differently, and suggests that it'd be nice if canonical-equivalent strings were always treated as the same. The practical manifestation of this is that, in many Unicode fonts, a with an acute and a with a separate combining acute are displayed differently. In the case of Paul's program, a user with an accent mark in his name will probably only be able to log in if he enters his user name in the right form. The username might show up on his screen as the same, but if the user uses, say, combining accent marks instead of precomposed characters, the database might not find him. But, again, this will rarely occur.

So, what about other operations?

Storage, lookup, and equality comparison work out, and concatenation should be OK too. According to Paul, these four operations are all the website needs to do with strings, so everything should work out OK. But if you go beyond them, you'll run into internationalization problems: there are four classes of string operations that can cause trouble in the Unicode-blind approach described above. These may become important as Paul continues to develop his CMS.

  1. Dealing with individual characters. There are two layers of problems here. First, the string is in UTF-8, so it takes a bit of work to extract characters. (I previously wrote about the format of UTF-8, halfway down that page.) Accessing a character in the normal byte offset way just won't work. The default substring and length functions also don't work, and you can't loop through indexes of a string the way you may be accustomed to. Second, in many situations, it's more appropriate to think of graphemes than characters. (Case study with another Carleton programmer.) Individual processing by characters may not always work.

    An immediate consequence of the lack of individual character processing is that regular expressions do not work. Additionally, when using Unicode, regular expressions must not use old patterns like [a-zA-Z] to match alphabetic characters, as this fails when more diverse texts are considered. More subtly, many regular expressions implicitly depend on assumptions like 'words are separated by blanks or punctuation', which are inappropriate in general.

  2. Alphabetical order. I still have yet to write about this in detail, but alphabetical order is much more complicated than you may think. Simple programs which are only used with ASCII often get away with sorting in code point order, after converting the string to a consistent capitalization. But in general, this doesn't work. In this scheme, a with an accent mark will sort after z. Due to the complicated tie-breaking rules governing locale, accent marks, ligatures, contractions, capitalization and punctuation in proper sorting, the comparison operators included in most programming languages by default (usually code point order) will not be appropriate for international applications. For example, in American English, an a-e ligature sorts like an a followed by an e, and 'café' sorts in between 'cafe' and 'cafes'. The Unicode Collation Algorithm (UCA) sorts this all out.

  3. Word breaking and search. When considering all of Unicode, it is inappropriate to assume that words are separated by blanks. For example, in Japanese, adjacent Kanji should be treated as separate words. This has implications when searching by whole word match. For insensitive search, it often makes sense to use part of the collation key, which I'll describe later.

  4. Uppercase, lowercase, and titlecase. With a totally naive approach, capitalizing a UTF-8 string the way you capitalize an ASCII string, you would only see the letters in the ranges A-Z and a-z transformed properly, with other characters completely ignored. (This is half lucky accident, half by design of UTF-8.) Apparently, PHP has some support for UTF-8 in some of its string functions, including case transformations. But by default, only the ASCII transformation is done. (It's possible to do everything except for Turkish, Azeri and Lithuanian case operations using the same function, but PHP requires an explicit change of locale.)


(Of course, if you do something like write a rich AJAXy text editor, more problems will also show up.) But if you don't use any of these aspects of text manipulation, your program should be fine for the most part, without explicit Unicode support.

This isn't to say that it's OK to ignore Unicode, though. Most complex programs will soon run into one of these issues, but I still think this is an interesting case study. I feel like I've said this a million times, but the best way to avoid running into any of these issues is to have them solved in the standard library correctly. This way, all string manipulations are Unicode-aware, and more complicated things like breaking properties can be easy to access on all systems.


Note about databases: Most modern SQL databases support Unicode in some way or another, by which I mean they can store strings in particular Unicode encodings. MySQL 4.1 looks like it has reasonably good Unicode support, including several encodings (with some non-Unicode ones, too), and collation based on the UCA, with support for tailoring the order to specific locales. But unfortunately, it looks like only characters in the Basic Multilingual Plane (BMP) are supported. Note that when storing UTF-8 text in a database Unicode-blind, the database may count more characters than the string actually displays, because each octet of a multi-byte character is counted separately.

Friday, September 7, 2007

A survey of stack shufflers

Factor is almost Lisp. Aside from the libraries, there's just one tiny difference: Factor has no named variables[1]. Factor is a concatenative language. This means that programs are assembled by composing operations. Trial and error has found that the only workable structure for functions to pass data between each other is a stack[2]. All words in Factor work, in theory, by taking the stack, doing something to it, and returning a new stack, which can be passed on to the next word. In reality, most words only modify the top few items.

For those of you who slept through CS 101 (or, like me, never took it), here's an overview of stacks. We have just three operations: push, pop, and empty. The operation push puts a datum on the end of the stack. The operation pop deletes the top item from the stack, returning it. The operation empty tests whether the stack has any items in it. Initially, we say the stack is empty. Stacks are most commonly implemented as linked lists or growable arrays. Wikipedia has more details.

For our purposes, we'll assume that the empty operation is off-limits[3]. Generally, we should know whether or not the stack is empty--it has all of our data on it, after all. If a word tries to use more things on the stack than exist, the program crashes with a stack underflow error. This is a good thing: it reflects a real error in the structure of the program, which needs to be fixed rather than ignored.

The basics: constants and drop

So, how can we make push and pop work with concatenative languages? We have no place to get the arguments for these functions, or to return their results, except the stack itself. Instead of using push and pop directly, slightly more complicated operations are used. These provide not only a way to mess around with the stack, but also the data flow of the language.

First, the easy case: pushing constant values. Say you want to push 1 onto the stack. This is represented in Factor as, simply, 1. Think about it: when you have a literal, and when all data is stored on the stack, what else would you want to do with it besides push it onto that stack? In this way, we can actually think of 1 as a function from stacks to stacks, which takes one stack and returns another which has 1 pushed on to the front of it.

It's so much fun to push these values on the stack! But what if we want to get them off? We want to perform pop on the stack, but ignore the result. We can imagine this implemented as

_ = pop(stack)

You can think of it like dropping the top of the stack. In fact, that's what we'll call this operation, drop.

We can try this all out at the Factor listener[4]. Open up Factor and push a bunch of numbers onto the stack. To see what's there, you can use the command .s. The stack is printed in bottom-to-top order. To delete the item on the top of the stack, use the command drop.
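
For example, here's a sample session (the exact .s output format varies a bit between Factor versions):

1 2 3 .s
! 1
! 2
! 3
drop .s
! 1
! 2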

Building blocks: dup, swap and the retain stack

Unfortunately, there's not much you can do with just those two things, pushing literals and dropping them, even if you add in primitive functions and functions derived from those. One useful action is to duplicate the top of the stack. Since "duplicate" is way too long to type so many times, we'll call it dup instead. dup can be implemented as

item = pop(stack)
push(stack, item)
push(stack, item)

dup is useful if you want to do two things with one piece of data. It's also useful if you want to apply the same piece of data twice to the same function. For example, let's look at the definition of sq, which squares the number on the top of the stack[5]:

: sq ( x -- x^2 )
    dup * ;

This means that the one input to sq is sent as both arguments to *. The part in parentheses is called the stack signature, a listing of inputs and outputs to the word. It is basically a comment; the names included are not significant, except to the reader[5]. To give you an idea of what's going on here, the equivalent function in Haskell would be

sq x = x*x
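
Trying the Factor version out in the listener:

5 sq .
! 25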


Another basic building block is swap. This switches the top two things on the stack. Say you want to negate a number. The easiest way to do this, given a function - ( x y -- x-y ), is something like this:

neg x = 0-x

If we translate 0-x into Factor as 0 -, we get the wrong answer; this is equivalent to x-0, an identity function. So, what we need to do is push 0 onto the stack, swap it with x, and then perform a subtraction:

: neg ( x -- -x )
    0 swap - ;


The third basic operation is dipping under the top of the stack to do stuff below. For example, say you want to calculate, given x, y and z, (x+y)*z. The easiest way to do this is to dip under the top of the stack, add x and y, then rise back up and multiply that result by z. In Factor, the dipping down operation is >r and the rising up operation is r>. All together, the Factor code for this function comes out as:

: (x+y)*z ( x y z -- (x+y)*z )
    >r + r> * ;

(Yes, in Factor, (x+y)*z is a valid word name, since words are just whitespace-delimited chunks. And I was feeling a little unimaginative.)
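
And checking it in the listener:

1 2 3 (x+y)*z .
! 9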

Derived operations

Using just those five basic operations--dup, drop, swap, >r and r>--we can implement any other stack shuffling operation[6]. Here are a few that turn out to be useful. The code here is extremely simple, but resist the temptation to do your own stack shuffling with the basic operators instead of the derived ones below: using derived operations makes your code far cleaner and clearer to future readers.

swapd swaps the two items under the top of the stack. The suffix "d" indicates that it does this underneath.

: swapd ( x y z -- y x z )
    >r swap r> ;


rot rotates the third item on the stack up to the top, pushing the top two down.

: rot ( x y z -- y z x )
    swapd swap ;


-rot rotates the top item on the stack down to the third position, pushing the two below up. It is the inverse of rot.

: -rot ( x y z -- z x y )
    swap swapd ;


nip is like drop, but it deletes the second item on the stack.

: nip ( x y -- y )
    >r drop r> ;


dupd duplicates the item second from the top of the stack, and puts the duplicate underneath. This is useful if you want to do something with the top two things on the stack, but retain what's second.

: dupd ( x y -- x x y )
    >r dup r> ;


over takes the second item on the stack and puts a copy of it over the top.

: over ( x y -- x y x )
    dupd swap ;


tuck tucks the top item on the stack under the second one.

: tuck ( x y -- y x y )
    swap over ;


pick picks the third item on the stack and places a copy of it on top.

: pick ( x y z -- x y z x )
    >r over r> swap ;


In Factor, these are all implemented as primitives, but the only reason for this is performance. Note that, in the above descriptions, when I said "copy" I meant copying a pointer, not copying data (which can be done with clone).

Keep

There's a really cool operator known as keep, which often makes code far more elegant. Strictly speaking, it's not a stack shuffler; it takes a quotation as an argument, making it a combinator, or higher order function. Basically, keep encapsulates a simple, common pattern:

dup >r ... r>

replacing it with

[ ... ] keep

There's also 2keep, which is equivalent to

2dup >r >r ... r> r>

These, with the analogous 3keep, are more useful than they first seem. They are used for any instance where you want to process the top one or more items on the stack, together with something underneath, while retaining those items on the top for further use.
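
For example, here's a hypothetical word (mine, not from the library) that uses 2keep to compute both the sum and the product of the top two items, retaining them for the second operation:

: sum-and-product ( x y -- x+y x*y )
    [ + ] 2keep * ;

3 4 sum-and-product .s
! 7
! 12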

What's this all for?

This list of functions probably looks daunting to memorize, and it is difficult at first. But it's useful, and they soon become second nature. In Factor and other concatenative languages, stack shufflers are the glue used to tie the output of certain functions to the input of other functions. No matter what programming language you're working in, everything ends up being a (perhaps convoluted) stringing together of different functions. Once you get used to it, using the stack and its associated transformations can be even more natural than using tons of named variables.



[1] OK, technically, you could use locals for named variables. But it's all stack manipulation underneath. Also, there are dynamically scoped variables, but those aren't used the way lexically scoped locals typically are; they're used for passing data over greater distances.

[2] Experimental languages have used a queue rather than, or in addition to, a stack. But this generally leads to models that are more difficult to use--and to compile. The original point-free language, FP, used function composition without a stack, but the way it uses lists is trivially equivalent to a stack. However, it ends up being somewhat awkward in practice. If you have an idea for how to improve either of these, or of another data structure for concatenative languages, I'd love to hear it. Please comment.

[3] Strictly speaking, this isn't the case in Factor. In fact, a function to test whether the stack is empty is trivial to implement:

: stack-empty ( -- ? )
    datastack empty? ;

But it's a bad idea to use this, since words should only care about their arguments, and they should take a fixed (or easy to determine, in the case of certain macros) number of arguments.

[4] Here's how to get and open Factor. If you're on Windows or Mac, download the appropriate package from the Factor website and unpack it somewhere. On either platform, double click on the F icon, and a Factor listener window will pop up. On Linux or other Unixes, download the source, then run make with the appropriate target (run make without arguments to see the options), and run ./factor. See the readme for more details. No matter how you get it started, once you have the listener open, you can just type in words and press enter to execute them.

[5] The structure of stack comments is fairly simple. Before the -- are the inputs, and after it are the outputs. Both sides are written from the bottom of the stack to the top of the stack. For example, ( a b -- c d ) indicates a word which takes a (second from the top of the stack) and b (on the top of the stack), pops those off, and returns c second from the top and d on the top. The names should indicate something about the arguments, ideally more informative than their type (though that is often used). If the same name is used on the left and right sides of the --, that implies that the same object exists before and after the function is called. This occurs most often in stack shufflers.

[6] Actually, there's a set of two combinators, cake and k, which can express anything: any stack shuffle, as well as call and curry. They're described in a fun paper by Brent Kerby, The Theory of Concatenative Combinators.




Note: Another way to look at the stack shufflers is through a functional implementation of stacks in terms of linked lists with pattern matching. We'll call the front of the list the top of the stack. Here are a few stack shufflers in Haskell, where : means cons:

drop (x:stack) = stack
swap (x:y:stack) = y:x:stack
dup (x:stack) = x:x:stack
over (x:y:stack) = y:x:y:stack
tuck (x:y:stack) = x:y:x:stack
rot (x:y:z:stack) = z:x:y:stack
dip f (x:stack) = x : f stack
keep f stack = dip f (dup stack)

This could be the basis for a dynamically typed concatenative language in Haskell, though it wouldn't work very well with static typing. (You could use nested tuples, with statically known types, instead. This requires a whole bunch of polymorphism and type classes. But with this, it is impossible to have a statically unknown stack depth, which makes certain kinds of recursion impossible.)

Another way to think of it in Haskell is in terms of higher order functions:

drop f x = f -- same as const
swap f x y = f y x -- same as flip
dup f x = f x x

This, in my mind, more closely resembles the real data flow manipulation involved here. However, they are hard to compose in real programs.

Tuesday, September 4, 2007

Strings are now complicated

It's too late to turn back now. Now that the Unicode floodgates have been opened, and we're trying to support every script by default, string processing has gotten a bit more complicated.

In my pre-frosh orientation trip, I met another kid who enjoys programming. He told me about a little PHP script he wrote called Synesthesia. From his description, I imagine it goes something like this. First, generate a CSS file with classes named c0 through c255, each with a random text color. Then, take an input text file and generate an HTML document from it, putting each character in a span whose class is "c" plus the ISO Latin 1 (8859-1) number of that character, after the character has been converted to lower case. The lower-casing is done so that "t" and "T" have the same color. The effect of this program is really cool, assigning each character a random color, making interesting designs and patterns from a plain text.

Now, how can we make this work with every language? It gets just a little bit more difficult. First of all, before doing anything, the string should be normalized. The Unicode standard explicitly states that any Unicode-conformant process must treat all canonical-equivalent strings equally. The easiest way to do this is to get the string in a consistent form, by normalizing it in some way.

When coloring a letter, we should be thinking about graphemes, not code points. Code points are the 21-bit basic units of Unicode text, whereas a grapheme is a unit of display, for example, an e with an accent mark over it, or a Hangul syllable. So we want to color units of display, not, say, just an accent mark. Iteration through graphemes is a feature of all sufficiently useful Unicode libraries. We can put the span around the whole grapheme, not just one code point.

When all of Unicode is considered, conversion to lower case isn't enough to cancel out all of the unimportant differences between similar graphemes. A better way to do it is to take the initial primary collation weight (of all non-ignorable characters) and use that as the unique identity for coloration. Just one little problem: there are, potentially, 65536 primary collation weights. That's a pretty big CSS file, so, where possible, only the used weights should be generated as CSS classes for text colors.

If you don't follow this (which is likely, as I haven't discussed collation in detail yet, and Unicode is confusing), I don't blame you. Unicode-aware text processing is complicated. While the Unicode consortium has made sure to allow less intelligent programs to be called conformant, they still might not be as good or readily internationalizable (is that a word?) as they could be.

Of course, it's ridiculous to expect programmers to be aware of all of this. That's why it's extremely important to have a good, easy-to-use standard library for Unicode, which abstracts as many things as possible away from the programmer. The optimal case, which I'm working towards, is to have nearly everything be Unicode-aware by default.