Until now, in my blog posts on Unicode, I've been focusing on internal processing, which has only an indirect effect on users. But now I'm thinking about something else: how can you write a text editing widget which potentially supports as many locales as possible? This should generally be the goal when writing internationalized applications, and almost all applications should eventually be internationalized, so why not a text editor?
As it turns out, this is an extremely complicated task, combining the difficulties of input methods and rendering. In this article I'll try to touch on as many relevant topics as possible, trying not to ramble too much on any particular one. Really, each section should probably be given its own blog post.
To test the Unicode-blind approach, I'm looking at the Factor 0.91 editor gadget, which pays no attention at all to internationalization and basically parrots back whatever characters it receives. Different keyboard layouts were tested on Mac OS X 10.5, though that shouldn't matter here.
Typing European scripts
European scripts are pretty easy. You just type letters on the keyboard and they appear, always in left-to-right order, on the screen, with one character appearing for each keystroke in many languages. This works for English, German, Russian, Georgian and many other languages, but not for, say, Spanish or Vietnamese. In those languages, there are characters like é or ở which require multiple keystrokes to input. With a typical Spanish keyboard, to write é, first the ´ key is pressed and then the e key. For ở, you first press the ơ key and then the ̉ key.
There are more complicated input methods for, say, typing é with a US keyboard, and these are system-dependent. On Mac OS X, I use Alt-e to get that first combining ´ mark, and then an e to supply the base character under it. This can be used in any application. In Microsoft Office applications on Windows, you can use Ctrl-' to make a combining ´ mark. Elsewhere on Windows, that key chord doesn't work; you're left with more complicated keystrokes to input the right hex code. I believe Gnome and KDE each define their own mechanisms for this kind of input.
In a Unicode-blind text editor, none of these multiple-keystroke inputs work. When in the Spanish keyboard mode in the Factor editor gadget, pressing ´ does nothing, and when in Vietnamese mode, pressing ơ does nothing. This is because the input system expects something more complicated to happen next. The OS-specific keys, of course, don't work properly either unless there is special handling for them. All of these things must be taken into account in an internationalized editor widget.
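To make the problem concrete, here's a minimal sketch, in Factor, of the kind of pending-accent state an editor widget would have to track. The word names (dead-key, ordinary-key) are made up for illustration; a real input method would also map the base-plus-accent pair onto a precomposed character or a normalized sequence.

SYMBOL: pending-accent

! a dead key produces no text; it just records which accent is pending
: dead-key ( accent -- ) pending-accent set ;

! the next ordinary key either combines with the pending accent
! (naively, as the base character followed by the combining mark)
! or passes through on its own
: ordinary-key ( base -- chars )
    pending-accent get
    [ 2array pending-accent off ] [ 1array ] if* ;

For example, HEX: 301 dead-key followed by CHAR: e ordinary-key yields { CHAR: e HEX: 301 }, an e followed by a combining acute accent.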
Typing East Asian scripts
When looking at the possible input methods in the Mac keyboard layout chooser, you're immediately struck by the fact that four East Asian languages--Chinese, Japanese, Korean and Vietnamese--have far more input methods listed than any other individual language. This is probably due, more than anything else, to the difficulty involved in inputting Han ideographs.
Han ideographs (Chinese characters) are used in Chinese and Japanese (as Kanji), occasionally in Korean and historically with Vietnamese. Unicode encodes over 70,000 Han ideographs, but at least 90% of these are extremely rare. Still, thousands come up in everyday situations and need to be typed into computers.
There are a number of ways of inputting Han ideographs. One way is through a Romanization. This requires several dictionaries, because each language pronounces the ideographs differently, and there are many Romanization systems for each language. One Romanization may correspond to more than one ideograph, so users are presented with a list of numbered choices. Another input method works by specifying the radicals, or component pieces, of the Han ideographs.
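As a rough illustration of the Romanization approach, the core of such an input method is a lookup from the typed Romanization to an ordered list of candidates, from which the user picks by number. The table and word names below are made up, and the candidates are written as code points rather than literal characters:

! "ma" could be U+5988 (mother), U+9A6C (horse),
! U+5417 (question particle) or U+9EBB (hemp), among many others
: pinyin-table ( -- assoc )
    H{ { "ma" { HEX: 5988 HEX: 9A6C HEX: 5417 HEX: 9EBB } } } ;

: candidate ( n romanization -- code-point )
    pinyin-table at nth ;

So 0 "ma" candidate would give HEX: 5988.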
Japanese and Korean (and Vietnamese, discussed above) also have alphabets. Japanese has the kana—katakana and hiragana—a syllabary which can be entered by a special keyboard or through Romanization. Korean has Hangul (I blogged about this earlier), which are syllables made of the 24 letters in the jamo alphabet. These can be entered by Romanization or by typing jamo. In either case, things are expected to coalesce into Hangul syllables automatically during typing.
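The coalescing itself is just arithmetic once the jamo have been reduced to indices: each precomposed Hangul syllable's code point is determined by its leading consonant, vowel and (optional) trailing consonant. Here's a minimal sketch using the standard Unicode composition formula; the word name is made up, and a real input method also has to handle incomplete syllables and backspacing through them.

! l: leading consonant index (0-18), v: vowel index (0-20),
! t: trailing consonant index (0-27, 0 meaning none)
! 44032 is HEX: AC00, the first precomposed Hangul syllable
: jamo>hangul ( l v t -- code-point )
    >r >r 588 * r> 28 * + r> + 44032 + ;

For example, 18 0 4 jamo>hangul gives 54620, which is U+D55C, the syllable 한.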
The input method is usually chosen outside of the current application, and applications are expected to work perfectly with that choice. A good text editing widget must communicate with the OS settings to pick the right one, and it must implement it properly.
Rendering difficult scripts
Hebrew, European scripts and East Asian scripts are relatively simple to render. Well, maybe not that simple, when it comes to placing (and stacking!) combining marks and putting together Hangul syllables or Han ideographs. But at least two adjacent letters don't tend to combine with each other spontaneously, forming several types of ligatures depending on what's on the left and what's on the right. At least letters go in their prevailing order, not rearranging themselves visually.
Certain Middle Eastern scripts like Arabic and Syriac, and several Indic scripts like Devanagari, don't follow these rules. Arabic and Syriac are written in a cursive form. Each letter has up to four (or sometimes five) contextual shapes in the simplified model that most computer typesetting systems use, and vowels are written above or below the characters as combining marks. In some computer typesetting systems, these shapes can be inferred from context, but others will display things incorrectly unless special code points (like Unicode's Arabic Presentation Forms) are used which encode which shape is meant.
Indic scripts are also complicated, but differently. The term "Indic script" refers not only to scripts used in India but to a whole family of scripts used around South and Southeast Asia. There are 19 of these scripts encoded in Unicode, all with somewhat similar properties. Devanagari, the script used for Hindi, Nepali and Sanskrit, among other languages, is one of these Indic scripts, and others include Malayalam, Tibetan, Tamil, Khmer and Thai. Among consonants in most Indic scripts, there are a large number of ligatures, but these are not encoded into Unicode; they must be inferred by a suitable font. Vowels in Indic scripts can be written after a consonant group, above it, below it, or even before the group, and this must also be inferred by the font. (I'm oversimplifying in both of those sentences, of course. Things are much more complicated.) In most Indic scripts' Unicode encodings, vowels which are placed before a consonant group come afterwards in memory, following logical order. For example, Devanagari कि (ki) is stored as the consonant क (U+0915) followed by the vowel sign ि (U+093F), even though the vowel sign is drawn to the left of the consonant. Thai and Lao don't follow this rule, and put those vowels before the consonant in memory, but this causes complications in collation. All of these properties of Indic scripts create difficulties not only in rendering but also in input methods, but I haven't read much specifically on this.
Some or all of these issues might be solved by choosing a suitable font and display system rather than the editor widget itself.
Line breaking
I wrote about grapheme breaking previously. Unicode line breaking is sort of like grapheme breaking, except much much more complicated. By "line breaking" I mean what you might call word wrap. It's described in UAX #14.
This document describes several classes of characters with different line-breaking properties. Some force a line break before or after their display, others passively allow a line break before or after, some forbid a line break and others have even more complicated properties. All in all, there are 38 categories. Even though the algorithm is complicated, it (like generating a collation key) can be done in linear time with respect to the length of the string.
With respect to line breaking, different scripts and locales have different rules. In Chinese, Japanese and Korean text, where spaces between words are rare or nonexistent, you can do a line break between any two Han ideographs, kana or Hangul syllables. Often, arbitrary line breaks are even allowed within words written in Roman script. In Korean, we have to watch out that we don't break up a Hangul syllable, so conjoining jamo behavior rules are used.
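The simplest piece of that rule is easy to sketch; the following only covers the main CJK Unified Ideographs block and ignores all of the other UAX #14 classes, the prohibition rules and conjoining jamo (the word names are made up):

! code points U+4E00 through U+9FFF are ordinary Han ideographs
: ideographic? ( char -- ? ) HEX: 4E00 HEX: 9FFF between? ;

! a break opportunity exists between two adjacent ideographs
: break-between? ( char1 char2 -- ? )
    >r ideographic? r> ideographic? and ;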
In Khmer, Lao and Thai scripts, there are also no spaces between words, but line breaks must not interrupt words. This requires some linguistic analysis and dictionary lookup, but Unicode also provides a code point, U+200B ZERO WIDTH SPACE, to explicitly mark line breaking opportunities. This may sound a little weird and impractical, but a similar thing also exists for European scripts which may hyphenate words between lines: the soft hyphen (U+00AD).
Even if things are restricted to American English, Unicode line breaking properties provide for better visual display than separating things at word breaking points, or just where there's whitespace. Still, it provides a significant implementation challenge for the writer of an internationalized text editor.
Bidirectional text
In Arabic, Hebrew and some other scripts, text runs from right to left rather than left to right. But what if we have a document that combines Hebrew and English: how do we organize text running in two directions? In an even more common situation, we could have text containing Hebrew and numbers. (Even in Hebrew and Arabic, decimal numbers are written left to right.) According to Unicode principles, both left-to-right (LTR) and right-to-left (RTL) scripts should be written in logical order, yet their display order is different.
To negotiate these issues, the Unicode Consortium specified the Bidirectional (BIDI) Algorithm (UAX #9). The BIDI Algorithm could easily fill up a whole blog post by itself, and I don't fully understand it, so I won't try to explain it all here. There are a number of different categories that Unicode code points fall into for BIDI purposes: characters can be not only LTR or RTL, but also neutral, mirroring, strongly or weakly directional, or an invisible special-purpose mark like U+200E LEFT-TO-RIGHT MARK. These categories sometimes overlap, and there are several nested levels of embedding.
The BIDI algorithm is difficult to implement for display, but it gets even more complicated in a text editor. When placing the cursor somewhere, there can be ambiguity as to where it is. When selecting things, there is a design decision between supporting logical selection (which can appear discontiguous on screen) and visual selection (which is harder to implement). The best resource to learn more about this is the UAX document itself.
Conclusion
For Factor, it will be impossible to keep the editor widget written completely in Factor. It would require a huge amount of duplicated effort and system-specific configuration to get it right, so we'll probably end up having to use native (meaning Gtk, Cocoa and Windows UI) editor widgets.
Clearly, the Unicode-blind approach is insufficient here. Because of all this, internationalized editor widgets are implemented pretty rarely. English speakers can get complacent with simple input methods, but these will not work for everyone else.
Warning: Readers should be aware that I haven't studied any one of these topics in-depth. This is just an introduction and shouldn't be taken very seriously.
Monday, December 31, 2007
Sunday, December 23, 2007
Books that should exist
As you can probably guess by the existence of this blog, I love to write. Part of what I love about it is the actual act of writing, but it's really more that I want things to be written that haven't been written before. I want to write stuff that I wish I could have read. Right now, there are three books and one journal article that I think should exist but don't. Hopefully, I'll be able to write some of these some time in my life.
Implementing Unicode: A Practical Guide
One thing that struck me when beginning to write the Unicode library is that there aren't many books about Unicode. The two I found in my local Barnes and Noble were the Unicode 5.0 Standard and Unicode Explained. Looking on Amazon.com, I can't find any other books that address Unicode 3.1 (the version that moved Unicode from 1 to 17 planes) or newer in detail, ignoring more specialized books.
Both of these are great books, but they aren't optimal for figuring out how to implement a Unicode library. Unicode Explained focuses on understanding Unicode for software development, but shies away from details of implementation. The Unicode Standard explains everything, but it often gets overly technical and can be difficult to read for people not already familiar with the concepts described. A Unicode library implementor needs something in the middle. Unicode Demystified might be the right one, but it describes Unicode 3.0, so it is in many ways outdated.
I wish a book existed which explained Unicode in suitable detail for most library implementors. If this book continues to not exist for many years, I might just have to write it myself. This, however, would be more difficult and less fun than my other book ideas.
[Update: After getting my hands on Unicode Demystified, I realize that I should not have thrown it aside so quickly. It's a great book, and nearly all of it is still relevant and current. It looks like the ebook version I have is updated for Unicode 3.1.]
Programming with Factor
Factor's included documentation is great, but it's not enough, by itself, to learn Factor. I, and probably most people who know Factor, learned through a combination of experimenting, reading blog posts and the mailing list, and talking on #concatenative. Many people will continue to learn Factor this way, but it still seems somehow insufficient. It should be possible to learn a programming language without participating in its community.
Of course, we can't write a book on Factor until we get to see what Factor will look like in version 1.0. But I'm confident that this book will be written, even if it goes unpublished in print, and I'm fairly sure that I'll have at least some part in it. Maybe it'll be a big group effort by the whole Factor community.
What I'm thinking is that we could have a book which teaches programming through Factor, to people who aren't yet programmers. I've talked a lot about this with Doug Coleman, and we agree that most programming books are bad; we should make a new book that does things very differently. But we can't agree or imagine just how...
The Story of Programming
I, like many of you reading this blog, have an unhealthy interest in programming languages. Mine may be a little more unhealthy than yours. Whenever I hear the name of a programming language that I don't know of, I immediately need to read about it, to get some basic knowledge of its history, syntax, semantics and innovations.
Most study of programming languages works by examining the properties of the languages themselves: how functional programming languages are different from imperative object-oriented languages and logic languages and the like. But what about a historical perspective? The history of programming languages is useful for the same reason other kinds of historical inquiry are useful. When we know about the past, we know more about the way things are in the present, and we can better tell what will happen in the future. The history of programming languages could tell us what makes things popular and what makes things ultimately useful.
Unfortunately, not much has been done on this. Knowledge of programming language history is passed on, unsourced, with as much verifiability as folk legend. The ACM has held three conferences called HOPL on this subject over the past 30 years, so all the source material is there. But apart from a book published in 1969, this is all I can find as far as a survey history of programming languages goes.
There is a limit to how much academic research papers can provide. The proceedings of the HOPL conferences aren't bedtime reading, and they don't provide much by way of a strong narrative. A new book could present the whole history of programming from the first writings about algorithms to modern scripting languages and functional programming languages so it's both accessible to non-programmers and interesting to programmers. As far as I know, no one's really tried. But it would be really fun to try.
Inverses in Concatenative Languages
Most of my ideas are either bad or unoriginal. But there's one idea that I came up with that seems to be both original and not horrible, and that's the idea of a particular kind of concatenative pattern matching (which I blogged about, and Slava also wrote about in relation to units).
The basic idea is that, in a concatenative language, the inverse of foo bar is the inverse of bar followed by the inverse of foo. Since there are some stacks that we know foo and bar will never return (imagine bar is 2array and the top of the stack is 1), this can fail. From this, we get a sort of pattern matching. Put more mathematically, if f and g are functions, then (f ∘ g)⁻¹ = g⁻¹ ∘ f⁻¹.
In my implementation of this, I made it so you can invert and call a quotation with the word undo. We don't actually need a full inverse; we only need a right inverse. That is, it's necessary that [ foo ] undo foo be a no-op, but maybe foo [ foo ] undo returns something different. Also, we're taking all functions to be partial.
Anyway, I think this is a cool idea, but I doubt it could fill a book. I want to write an article about it for an academic journal so that I can explain it to more people and expand the idea. It could be made more rigorous, and it could use a lot more thought. I hope this works.
When will I write these?
So, I hope I'll be able to write these things. Unfortunately, I'm not sure when I could. I need to finish my next 10 years of education, which won't leave tons of free time unless I give up blogging and programming and my job. Also, I'm not sure if I'm capable of writing a book or an article for an academic journal, though maybe I will be in 10 years when I'm done with my education. It wouldn't be so bad if someone stole my thunder and wrote one of these things because what I really want to do is read these books.
Update: Here are a couple more books (or maybe just long surveys) that should exist but don't: something about operational transformation, and something about edit distance calculations and merge operations for things more complicated than strings. Right now, you can only learn about these things from scattered research papers. (I never thought I'd find the references at the end so useful!)
Thursday, December 13, 2007
Alphanum sort and Unicode
So, I'll just be the Unicode troll here and say, all of your solutions to the problem of sorting alphanumeric file names are insufficient. To recap, if you haven't been following Reddit: it's annoying how, when looking at a directory, the file named z10.txt is sorted between z1.txt and z2.txt, and we programmers should come up with a solution. The gist of the solution is that you split up the file name into a list of alphabetical and numeric parts, parse the numbers, and then sort it all.
What's wrong with that?
Every one of these implementations, including a slightly internationalized version (just for some Scandinavian locale, apparently), does not produce a sort the way humans expect. Remember, in ASCII, capital Z comes before lower-case a. Jeff Atwood hinted at internationalization problems, and that's just part of the problem. We should use the Unicode Collation Algorithm (which I previously blogged about). The UCA provides a locale-dependent, mostly linguistically accurate collation for most locales (basically all that don't use Han ideographs).
The basic idea that we need here is that the UCA is based on a number of levels of comparison. This is useful not only in "international" locales, but also in United States English. First, we compare two strings with most features removed, then we add back in accent marks, and then capitalization, and then punctuation/spacing. So, even if we want "B" to come before "b", we can have "bench" come before "Bundle"--before comparing for capitalization, we compare for the letters themselves, and "bench" obviously precedes "bundle". For a better description of the UCA, see my past blog post.
So, the most basic way to put the UCA together with the Alphanum algorithm is to just use the same algorithm, except with the UCA to compare the strings rather than an ASCII comparison like most other implementations have done. But this would completely destroy the UCA's subtle level effect: even if "A" is sorted before "a", we would want "a1" to come before "A2". We also want "A1" to come before "a2", and we also don't want to ignore case completely; it's still significant.
The UCA solves this problem when we're working with "a1" and "A2", but not when we're comparing "a2" and "a10". For this, we need a slightly more complicated solution based on something like the Alphanum algorithm. But breaking the string up by itself won't allow for the layering kind of behavior that the UCA depends on.
What's a solution for this?
One way to fix this is to break the string up into its segments, right-pad the alphabetical strings and left-pad the numbers. We can just break things up into segments of consecutive digits and non-digits. The length that we'll pad to for the nth subsequence is the maximum of the lengths of the nth subsequences for all of the strings. For numbers, we can left-pad with plain old 0, but for strings, we want to right-pad with something that's lower than any other character. This sounds hard to do, but by tailoring the DUCET (data table for the UCA), we can just define a character in a private use space as a new, special padding character. This padding character will be completely non-ignorable, fixed weight and have a lower primary, secondary and tertiary weight than any other character by definition.
OK, that sounds really complicated. Let's step back a minute and see how this padding could work out to provide functionality equivalent to the existing Alphanum algorithm. For this, we'll assume access to an ASCII-order sorting algorithm, and just figure out how to pad the string in the right way to come up with a suitable sort key. Instead of some new character, we can just use a null byte. So, if we have the strings "apple1b" and "the2345thing1", the fully padded strings should look like "apple0001b\0\0\0\00" and "the\0\02345thing1", where '\0' is the null byte.
I can make a simple Factor implementation by stealing Doug's cut-all code from his description of a solution to this problem.
: pad-seq ( seq quot -- newseq )
    ! pad every element of seq to the length of the longest one,
    ! using quot to do the actual padding
    >r dup [ length ] map supremum r>
    curry map ; inline
: pad-quote ( seq -- newseq )
    [ "" pad-right ] pad-seq ;
: pad-number ( strs -- newstrs )
    ! left-pad numeric pieces with 0
    [ CHAR: 0 pad-left ] pad-seq ;
: pad-string ( strs -- newstrs )
    ! right-pad alphabetic pieces with the null character
    [ 0 pad-right ] pad-seq ;
: pieces ( strings -- pieces )
    ! split each string into digit/non-digit runs, then group
    ! together the nth runs of all the strings
    [ [ digit? ] [ ] cut-all ] map pad-quote flip ;
: pad ( pieces ? -- ? newpieces )
    ! alternate between alphabetic and numeric padding
    [ pad-string f ] [ pad-number t ] if swap ;
: pad-strings ( strings -- newstrings )
    pieces t swap [ swap pad ] map nip flip [ concat ] map ;
: alphanum-sort ( strings -- sorted )
    ! sort the original strings by their padded keys
    dup pad-strings [ 2array ] 2map sort-values keys ;
This, by far, isn't the most concise implementation, but the advantage is that it can easily be adapted for a better collation algorithm.
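For example, assuming Doug's cut-all behaves as described, the intended behavior is:

{ "z10.txt" "z1.txt" "z2.txt" } alphanum-sort
! the padded keys are "z10.txt", "z01.txt" and "z02.txt",
! so this should yield { "z1.txt" "z2.txt" "z10.txt" }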
But where's the real Unicode implementation of this?
I'm not sure where to implement this in terms of tailoring and hooking it up with the UCA. I'd either have to (a) go into C++ and do this with ICU or (b) write my own Unicode library which has tailoring, and do it there. I'm too scared to do the first, and the second isn't done yet. I've looked at Java's tailoring, but I don't see how I could define a character that's lower than all other characters. I feel a little bit lame/trollish putting this blog post out there without a real implementation of the solution, but I hope I was able to expand on people's knowledge of the concerns of internationalization (yes, I'll write that word all out; I don't need to write i18n) in text-based algorithms.
Monday, December 10, 2007
Multiline string literals in Factor
It's always annoyed me somewhat that Factor strings can only be on one line and that there is no mechanism for anything like the "here documents" Perl has. So I decided to write it myself. At this point, I don't really need it and have forgotten what I wanted it for, but it was still a fun exercise.
I started out thinking I should do something similar to how some other languages do it: write a word, maybe called <<<, which is followed by some string which is used to delineate a multiline string literal expression. But I realized this wouldn't be the most idiomatic way in Factor. First, if you're making a multiline string literal, why would you ever have it be within a word? For constants like this, it's considered best practice to put them in their own separate words. Second, why do I need to give the option of choosing your own ending? What's wrong with just using a semicolon, like other Factor definitions?
Putting this together, I came up with the following syntax:
STRING: something
Hey, I can type all the text I want here
And it can be over multiple lines
But without the backslash-n escape sequence!
;
The final line, the semicolon, has to have no whitespace before or after it. This allows for practically any useful string to be written this way. The syntax will compile into something like this:
: something
"Hey, I can type all the text I want here\nAnd it can be over multiple lines\nBut without the backslash-n escape sequence!" ;
With a previous parser design, multiline string literals like this were impossible, but now they can be done in 11 lines. I packaged this up and put it in my repository under extra/multiline so others can use it.
Using the internals of the parser, the first word advances the parser state one line and returns the text of the new line.
: next-line-text ( -- str )
lexer get dup next-line line-text ;
The next two words do the bulk of the parsing. They advance the parser's current line until reaching a line consisting only of ";", and then advance the line one more time. While moving forward, a string is compiled consisting of all of the lines, interspersed with newlines.
: (parse-here) ( -- )
next-line-text dup ";" =
[ drop lexer get next-line ] [ % "\n" % (parse-here) ] if ;
: parse-here ( -- str )
[ (parse-here) ] "" make 1 head*
lexer get next-line ;
Finally, the word STRING: puts it all together, defining a new word using a string gathered with parse-here.
: STRING:
CREATE dup reset-generic
parse-here 1quotation define-compound ; parsing
There are downsides to having an extremely flexible syntax like Factor. Things can be less predictable when libraries can alter the fundamentals of syntax. It'd be impossible to create a fixed BNF description of Factor syntax. Additionally, the particular structure of Factor sometimes encourages syntax extension that's heavily dependent on the details of the current implementation. But the upside is that we can do things like this. I think it's worth it.
Thursday, December 6, 2007
Roadmap to Factor 1.0
According to the Factor website, Factor 1.0 is coming out some time in 2008. That's pretty scary. 2008 is coming in less than a month, and we currently have no solid plan for how we're going to go about reaching 1.0. How long it takes to reach that goal will depend on how much time contributors have available over the next year. Nevertheless, I'm proposing a roadmap of development goals for Factor 1.0.
Figuring out the goals
We first need to figure out the exact goals for Factor 1.0, and how we're going to go about completing them. From the Factor homepage, we have a list of goals for Factor 1.0:
- New object system with inheritance and multiple dispatch
- Incremental garbage collection
- Drastically reduce compile time
- Continue improving existing libraries, such as databases, multimedia, networking, web, etc
- Full Unicode support, Unicode text display in UI (in progress)
- Better UI development tools
- Get the UI running on Windows CE
- Add support for Windows 64-bit and Mac OS X Intel 64-bit
- Use kqueue (BSD, Mac OS X), epoll (Linux) for high-performance I/O
- Directory change notification
This contains many different things, which will take a lot of work. For each of these tasks (or pieces of them), we need to decide who will be in charge of them and what a target date for completion will be. There are also other goals that aren't on this list, and we need to identify what these are.
One list item is particularly involved: Continue improving existing libraries, such as databases, multimedia, networking, web, etc. Right now, the libraries in extra/ are of inconsistent quality. We should set a standard for quality for things that are in extra/ and designate more experimental libraries as such, maybe by putting them in a different directory like playground/. We should have an audit of all the libraries in extra/ to improve them as much as possible and decide whether they should go into playground/.
Timetable
By the end of this year (December 31, 2007): We need to decide, in more detail, how the path to Factor 1.0 will go. We need a full list of goals, a person assigned to each one, and a target date for completion. We also need to decide the terms for an audit of the libraries.
One-third through next year (April 30, 2008): We should be done with the library audit. All new libraries should be correctly sorted into extra/ or playground/. At this point, around half of the remaining critical features of Factor 1.0 should be done. Language features, like the form of the object system, should be completed and frozen.
Two-thirds through the year (August 31, 2008): This date is the target for all of the key features to be completed. After this point, we will focus on bug fixes, performance and more complete documentation. If possible, 1.0 alpha/beta 1 should be released, and further alpha/beta releases should be made through the year. Here, Factor's runtime should be frozen with no significant changes in implementation.
At the end of the year (December 31, 2008): By this date, if all goes well, we will release Factor 1.0, with no major additional features added after the alpha stage.
Specifics
Here are guidelines that all vocabs in extra/ should eventually follow:
- Each vocab foo should consist of six files: foo.factor, foo-tests.factor, foo-docs.factor, summary.txt, authors.txt, tags.txt
- Every vocab's documentation should explain each externally useful word, and it should have a main article telling users where to start.
- Every vocab's unit tests should be as complete as possible.
Eventually, some vocabs in extra/ and playground/ will be hosted outside of the main Factor distribution. But right now, the total quantity of Factor code is such that everything that's publicly distributed can be included. Of course, when more people start writing code that they refuse to put under a BSD-compatible license, we'll want to distribute that code separately.
This whole roadmap is quite optimistic, but I think it's possible if we focus on that goal.
Note: I wrote this without significantly consulting everyone involved with Factor, so if you have a problem with it, know that this doesn't form any kind of official policy (and please tell me).
Tuesday, December 4, 2007
Factor FAQ moved
The Factor FAQ that Doug and I wrote has been moved to a new location at factorcode.org. It looks much better now, formatting-wise, I think. I'm still updating the FAQ, and since it's grown so big now, I'll have to categorize the questions. If you have any questions about Factor that aren't on the FAQ, please ask me (or the folks at #concatenative or the Factor mailing list).
extra/delegation: an annotated vocabulary
In order to allow for better compiler optimization of slot operations and other improvements, Slava plans to add tuple inheritance to the Factor object system. The exact form of this isn't completely finalized, but it looks like delegation in its current form isn't going to stick around much longer. I proposed a new kind of delegation mechanism, along with other changes to the Factor object system. In the end, a more conservative change was chosen in the form of mixins.
I described the Factor object system before, but I'll review delegation here. Basically, when a generic word dispatches on the tuple, one of two things can happen: either a method is found, and then called, or no method is found. If no method is found, the generic word checks the delegate slot. If that slot is filled (with something that's not f), the method will be re-called on that delegate in place of the original object. For more information on this, see my previous blog post.
There are a few problems with delegates. One problem is the slicing problem: when relying on delegation to extend behavior and a method is delegated, the method call knows only about the delegate, not the original object. Sometimes, this is the desired behavior, but other times it can lead to subtle bugs. Another problem is performance. When looking up the value of a direct tuple slot, the required operation is a simple method dispatch and pointer arithmetic. If the compiler knows the tuple class of an object before runtime, it can even eliminate the dispatch step. However, when looking at a delegate slot, there is an additional pointer indirection. Crucially, with the current system, there is no way to avoid the dispatch. Tuple inheritance would require fewer levels of indirection and give more opportunity for optimization. With delegation, method dispatch takes linear time with respect to the length of the delegate chain, and this is bad. A very subtle problem with delegation is that everything is delegated at once, when often, we only need to delegate a couple things.
Yet delegation is still sometimes useful, and there must be a way to fix the slicing problem without breaking everything*. I've written a vocabulary (vocab) called delegate which attempts to fix this. The premise is that this will constitute delegation after it is removed from the core. Central to the vocab is the concept of a group, or a word which signifies a number of methods. There are three builtin types of groups: tuple classes, generic words and protocols. There is a word called group-words which extracts the constituent words from a group:
GENERIC: group-words ( group -- words )
For a generic word, the group consists just of itself
M: generic group-words
1array ;
For a tuple class, the group consists of all of the slot readers and writers, except for the delegate slot. (delegate and set-delegate aren't generic words, anyway)
M: tuple-class group-words
"slots" word-prop 1 tail ! The first slot is the delegate
! 1 tail should be removed when the delegate slot is removed
dup [ slot-spec-reader ] map
swap [ slot-spec-writer ] map append ;
A protocol is an explicitly-defined list of generic words which form a group. The syntax PROTOCOL: protocol-name words... ; is used to define a protocol, and the words it contains are stored in the "protocol-words" word property.
: define-protocol ( wordlist protocol -- )
swap { } like "protocol-words" set-word-prop ;
: PROTOCOL:
CREATE dup reset-generic dup define-symbol
parse-definition swap define-protocol ; parsing
PREDICATE: word protocol "protocol-words" word-prop ;
M: protocol group-words
"protocol-words" word-prop ;
Here's an example of the definition of the sequence protocol:
PROTOCOL: sequence-protocol
clone clone-like like new new-resizable nth nth-unsafe
set-nth set-nth-unsafe length immutable set-length lengthen ;
The advantage of defining delegation over these groups is that, usually, we don't need to delegate for everything. With this, multiple delegation over disjoint groups is possible. This is in use in the xml.data vocab, to let tags act like their names, their attributes assoc and their children sequence at the same time. It could also be used for multiple tuple inheritance.
One mechanism which replaces delegation is called consultation. Billy Tanksley has suggested that this is what Factor's delegation should be termed; in essence, consultation is the same as delegation, except that it's limited to a single group. Say you have some class foo which implements all generic words in the group bar. There is another class baz, which has a word baz>foo for turning itself into a foo, or getting a corresponding foo to which to delegate the methods. Then, we could define CONSULT: bar baz baz>foo ; in order to define the delegation. If bar contains the words qux and quux, then the following method definitions will be generated:
M: baz qux baz>foo qux ;
M: baz quux baz>foo quux ;
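As a more concrete (and entirely hypothetical) example, a tuple wrapping an underlying sequence could consult that sequence for the whole sequence protocol defined earlier:

TUPLE: wrapper underlying ;

! every word in sequence-protocol gets a wrapper method that
! unwraps first, e.g. M: wrapper nth wrapper-underlying nth ;
CONSULT: sequence-protocol wrapper wrapper-underlying ;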
Here's the implementation:
: spin ( x y z -- z y x ) ! This stack shuffler makes things easier to think about
swap rot ;
: define-consult-method ( word class quot -- )
pick add <method> spin define-method ;
: define-consult ( class group quot -- )
>r group-words r>
swapd [ define-consult-method ] 2curry each ;
: CONSULT:
scan-word scan-word parse-definition swapd define-consult ; parsing
So, this fixes one problem with delegation: that it happens for everything. The second problem, performance, can be solved with increased static typing, for example static typing of tuple slots. But this won't be an available feature of Factor for some time. Until then, if we're sure that a tuple slot is a specific class, we can use declare to let the compiler know. For example,
CONSULT: bar baz
baz>foo { foo } declare ;
There's still one last problem, the slicing problem. A complete solution would involve heavy hackery into the Factor method dispatch system, allowing one object to mimic another. But a simpler solution is to just copy all of the method implementations from one class to another, for the generic words in a specific group. This can be unsafe in some situations, such as when the group is a tuple class, but it can also be useful. I called the system mimicry.
: define-mimic ( group mimicker mimicked -- )
>r >r group-words r> r> [
pick "methods" word-prop at [ 2drop ]
[ method-def <method> spin define-method ] if*
] 2curry each ;
: MIMIC:
scan-word scan-word scan-word define-mimic ; parsing
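And a similarly hypothetical usage, copying whatever sequence-protocol methods some class my-sequence defines directly onto my-wrapper:

! my-wrapper and my-sequence are made-up class names
MIMIC: sequence-protocol my-wrapper my-sequence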
So there you have it: in fewer than 40 lines, the basis for a replacement of the delegation system, at least partially fixing all three main problems. The code is in my git repository. There are still problems, of course: metadata about what mimics/consults what is not stored, and when you see a generic word or class, you'll see all of the mechanically-defined methods. It's amazing, though, how flexible Factor is to this kind of manipulation and metaprogramming.
* Actually, it's possible to re-implement delegation in Factor, completely eliminating the slicing problem, but at a slight performance cost for tuple slot lookups, and at the cost of breaking a lot of code. The thing is, if instances of tuple class a all delegate to tuple class b, then a instances will return f to b?. To get around this, semi-hacks like the is? word are used.
Thursday, November 22, 2007
FactorCon MN 2007 post-mortem
FactorCon 2007 in Minneapolis was awesome, despite low attendance: it was just me, Doug, and Slava. Besides coding, we went to one of Doug's girlfriend's concerts, visited Loring Park with the big sculpture of the cherry on a spoon, made mole twice, out of guacs, and successfully avoided bro rape. I don't have pictures, but Doug or Slava probably will have them soon. Unfortunately, unlike the last FactorCon which I missed, we only stayed up until 3:00 or so coding most nights.
Overall, it was maybe a bit too short, but here's what came out of it:
- Slava had a project to get Windows I/O working better, and make deployment work for Windows the way it does for Mac OS X. The deployment tool can spit out a directory with the proper DLLs, a minimal image, and an executable, but it's not finished yet. Slava blogged about this.
- Doug's project was to get his regex engine in Factor to work better. Instead of refurbishing the old one, he's writing a new one using parser combinators which generate a parser which uses parser combinators itself. This should be more efficient once Chris Double's packrat parser in Factor is done.
- My project, which I really didn't intend to be my project, was getting Factor to work on Mac OS X x86-64 with Leopard. This wasn't going to be that hard; by the first day, I was done with all the compilation stuff. The only thing that made it difficult was the fact that Apple significantly changed the Objective-C API in their new ObjC 2.0, included in Leopard. In 32-bit mode, there's backwards compatibility, but not in 64-bit mode. This is going to be a bit more work.
Looks like most future FactorCons will be in Austin, since that's where most people are, for some reason. I hope to see many of you (my readers) there!
Because of my college's weird trimester system, which conjures up images of pregnancy, we have a six-week break from before Thanksgiving until after the new year. (Next term, I'll be taking Abstract Algebra, Algorithms, Social Dance and Syntax Theory, all of which should be fun.) So, over the next month or so, during the break, you can expect more regular posts, and maybe I'll even be able to get some programming done.
Wednesday, November 21, 2007
When Unicode doesn't just work itself out
So, in my job as a student worker at Carleton College, I'm working on programming the website. Just a week or so ago, I basically completed a classified housing ads module, which I'm fairly happy about, even if it was pretty trivial. They've made their own in-house CMS, now open-source, written in PHP and called Reason. Being a very respectful employee, I won't allude to any possible irony in this name. Not at all.
Anyway, a couple of days ago, one of the non-student workers, Margaret (the names of those involved have been swapped with others' to protect their identity) came in with a question: can anybody, who's not doing something important, come up with a regular expression to detect bad UTF-8? Since I was basically unoccupied, reading a book from the library in preparation for my next super-secret project, and since I know a little about Unicode, I volunteered.
I sat down with Margaret as she explained the problem: somehow, a bunch of bad Unicode had gotten into Carleton's database. This isn't a bug that we can reproduce, but a number of times in the past, there have been these question marks, like �, appearing in the content. Whenever it comes up, we delete it, but we need a regular expression to find all of the bad parts.
When this situation occurs, there is malformed UTF-8, where UTF-8 is the encoding used internally in Reason. Reason uses the Unicode-blind approach I discussed earlier. The Unicode-blind approach failed here, though. Most likely it's because of strange, invalid input from browsers which can't easily be replicated, except by artificially constructing queries. All good web designers know that you can't trust the browser (a fact which I internalized only recently) and that apparently extends down to the validity of UTF-8 itself.
Margaret's idea was to find characters whose most significant bit was on, surrounded by characters whose most significant bit was off. However, there are other places where Unicode could be malformed and checking for this specific case isn't enough. With a bit of cursory Googling, we found phputf8, a pure-PHP library which has functions for validating, cleaning, encoding and decoding UTF-8. We could use this library to clean all input, to ensure that it is in valid UTF-8 and will appear without � all the time. This would be a bit computationally intensive, though.
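To make that kind of structural check concrete, here is a rough sketch in Haskell (chosen only because it's the language I use for examples elsewhere on this blog; it is not what phputf8 does). It verifies only lead and continuation byte patterns, and doesn't reject overlong forms or surrogates, which a real validator should:
import Data.Word (Word8)

-- Structural UTF-8 check over raw bytes: an ASCII byte stands alone, and a
-- multi-byte lead (0xC2-0xDF, 0xE0-0xEF, 0xF0-0xF4) must be followed by the
-- right number of continuation bytes (0x80-0xBF). Overlong encodings and
-- surrogates are not rejected here.
validUtf8 :: [Word8] -> Bool
validUtf8 [] = True
validUtf8 (b : bs)
  | b < 0x80               = validUtf8 bs       -- ASCII byte
  | b >= 0xC2 && b <= 0xDF = continuations 1 bs -- two-byte sequence
  | b >= 0xE0 && b <= 0xEF = continuations 2 bs -- three-byte sequence
  | b >= 0xF0 && b <= 0xF4 = continuations 3 bs -- four-byte sequence
  | otherwise              = False              -- stray continuation or invalid lead
  where
    continuations :: Int -> [Word8] -> Bool
    continuations 0 rest = validUtf8 rest
    continuations n (c : rest)
      | c >= 0x80 && c <= 0xBF = continuations (n - 1) rest
    continuations _ _ = False
Margaret's most-significant-bit idea amounts to checking only a fragment of this structure, which is why it can miss other kinds of malformed sequences.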
Another approach is to take advantage of the database itself. Reason works on MySQL, but what character encoding does it use? Since Margaret didn't know, I asked Eric, our omnipotent server admin. Carleton actually has one SQL server instance where there are many different databases. Some of these databases use UTF-8, Eric told me, but the main Reason database is encoded in ISO Latin 1, as far as MySQL is concerned. The content is, in truth, encoded in UTF-8; when Reason was started, MySQL was at version 3, which didn't support Unicode, and we continue to use the same database. Eric guided me through the migration in a wonderfully hackish way: serialize the whole database, then in that serialization, replace all instances of "latin1" with "UTF8", then deserialize the database.
There are still a couple potential problems: for one, it's possible that there could be invalid UTF-8 in the database that MySQL doesn't check for. The best way to fix this would be to do a one-time cleanup with something like phputf8 before the migration. Otherwise, things will be sort of corrupted from the start. After this, it shouldn't be possible to put invalid UTF-8 in the database. The second thing is that there are still some remaining Unicode issues, most notably normalization. Though we could trust the browser to give us text in NFC, that's generally a bad idea. It'd be better to normalize all text, and MediaWiki has code for that.
Once all strings are considered UTF-8 by the database, things should move much more smoothly. Not only will invalid UTF-8 not be accepted, but collation will work properly, since MySQL supports Unicode collation consistent with the Unicode Collation Algorithm (UCA) and Default Unicode Collation Element Table (DUCET). But there's a minor caveat: MySQL 5.0 and lower only support up to three octets per character of UTF-8. In 2001, Unicode 3.1 introduced the possibility of a fourth octet, for supplementary planes, used for obscure Han characters, math symbols, old scripts, etc. MySQL 5.0 doesn't allow any of these to be used in UTF-8 mode, though the Unicode-blind approach does allow it.
Nevertheless, it's good to be aware of Unicode. The fact that you can sometimes get away without noticing Unicode and its encodings is not a reason to ignore it.
Update: Alexander Barkov, a MySQL developer, pointed out to me that MySQL 6.0 supports three encodings, utf8, utf16 and utf32, including support for code points outside the BMP. Previous versions of MySQL had only three octets of utf8, or the fixed-width ucs2, which only allows the BMP. I'm very happy about this, but of course it takes some time to upgrade versions.
Wednesday, November 14, 2007
Programming Language Consciousness
If Nelson Mandela is South Africa's Martin Luther King Jr., then Steve Biko is its Malcolm X*. Steve Biko, under the pen name Frank Talk, developed the philosophy of Black Consciousness. This is the idea that Black people in South Africa need to work together with each other to liberate themselves from Apartheid. Liberation would not come as a gift from White liberals, but would have to come exclusively from the efforts of the oppressed themselves. Before integration is necessary—before it is even helpful—Blacks need to be proud of their own culture and conscious of their oppression.
Some language communities (Tcl, for example) have a shared belief that it's best to write many things in C, and then bind that library to the higher-level language. C has some advantages: it can be very fast, and it can integrate well with external libraries, which are usually also written in C or C++. So, when users of these languages need something like a GUI library, or a complex data structure, or a function which could sensibly be included as a primitive, or just something that they feel should run a little faster, they go and implement that thing in C. But one of the sources of inspiration for these languages was always to have something better to program in than C. Once the higher-level language exists, not as much repetitive, low-level coding has to be done in C.
If we programmers are going to be liberated from this oppressive world of C and other imperative, low-level languages with little sense of abstraction, we have to take charge. A new language could be written in C, but then, once the core is there, libraries should be written in that language. This is the only way that liberation can be achieved.** It's fine to write a binding for existing C libraries, but for new libraries Steve Biko, Alan Turing and I are here to tell you that you do have the power to write them in that language itself! Don't fall into the incrementalist trap of using the new language for glue and C for the "heavy-duty" stuff, as this will only prolong and damage the struggle. Code written in the new, better language will be easier to maintain and extend. Paraphrasing Biko speaking about not relying on one's oppressor: Programmer, you are on your own.
* That is, if MLK had founded a guerrilla organization and later became president, and if Malcolm X were viciously murdered by the government. I'm referring to the old Malcolm X, of course, not after he started being more polite in rhetoric to White people. As it turned out, Mandela's movement (the African National Congress) was far more successful than the movement that Biko inspired (the Black Consciousness Movement, and later the Azanian People's Organization). Also, in the early '90s, the ANC was basically at war with the Inkatha Freedom Party, another liberation movement, in what's today KwaZulu-Natal***, and they also had significant conflicts with Azapo during the run-up to the elections. But who's counting? It's called an analogy, people. By the way, for those of you who are aware that there are more than just Africans and Europeans in South Africa, Biko extended the definition of Black to include everyone who's oppressed and acknowledges it, basically. (This way, Indians and Colored people could join the movement.)
** OK, in reality, as the above footnote notes, the multiracialist ANC beat out Azapo, but it's the idea that matters...
*** KwaZulu-Natal is my favorite province ever. Who would have thought of a CamelCase name? (Well, must've been those missionaries who created the current writing system for isiZulu.)
If my relatively intense (for me), intro-level History of Southern Africa course is good for anything, it's good for drawing spurious historical analogies for slightly tongue-in-cheek essays, right? This started out being an essay explaining why I'm a little uncomfortable with Io, because of their philosophy of implementing everything in C, but I ended up not mentioning it, except in this footnote. Sadly, this will be my most-read essay on South Africa.
Update: My wonderful history professor, Jamie Monson, read this post and told me that I was a little wrong about the historical fact about the ANC fighting with Azapo, because they mostly fought with the Inkatha Freedom Party. How could I have made such an error? It's fixed now.
Monday, November 5, 2007
The currying theorem
My favorite class in college right now is Introduction to Mathematical Structures, which explains the basics of naive set theory and theorem proving. Today, we were studying cardinal arithmetic when a startling proposition appeared on the board:
Is there a bijection h: {M -> {L -> K}} -> {(M × L) -> K}?
The notation in this post is described at the end
Immediately, I shouted out "That's just like currying!" (I blogged about currying a little while ago.) My professor, always mildly annoyed when I shouted out such random things, said "what?" to which I mumbled "never mind."
Some set theory
Since my class uses only naive set theory, a simple definition of a set suffices: a set is something with one well-defined operation: a membership test. If A is a set, and a is something, then we can tell if a is a member A or a is not a member of A.
One thing you can build up using sets is a Cartesian product. I won't go into the details, but it's well-defined to have an ordered pair (a, b). With sets A and B, the Cartesian product of A and B is the set of all (a, b) for any a in A and b in B.
A function f: A -> B can be defined as a subset of this Cartesian product A × B, where f(a) = b if (a, b) is in f. There are additional requirements on the function f, though: it must be total (meaning, for each a in A, there is an (a, b) for some b in B) and it must be a function (meaning, for any a in A, there is only one b in B such that (a, b) is in f).
Two sets A and B are considered equinumerous if there is a one-to-one correspondence between them. This means, for each element a in A, there is exactly one b in B which can be paired up to it, and vice versa. We can put this one-to-one correspondence into a function, called a bijection. Since functions are just subsets of A × B, a bijective function is a set of ordered pairs, one for each element of A, pairing it with a unique element of B and covering all of the elements of B. If the two sets can be paired up like this, we can consider them the same size.
So, what kind of properties does a bijection f: A -> B have? For one, it has to be a function defined on all A, corresponding to an element of B. Another property is that no two elements of A correspond to the same element of B; this is called injectivity. A third property is that every element of B has to have at least one element of A corresponding to it; this is called surjectivity.
This is all a little complicated, so let's take a simple example. Let's show that the natural numbers N (this is the set containing 1, 2, 3, 4, and so on) are equinumerous to the even natural numbers E (2, 4, 6, 8, ...). If this is true, that means there are just as many natural numbers as even numbers, a very counterintuitive proposition. But there's a very simple function f: N -> E which demonstrates a pairing of them:
f(x) = 2x
All this does is pair a natural number to two times its value, which will be an even number. We can visualize f's pairing as the set containing (1, 2), (2, 4), (3, 6), (4, 8) and so on. To prove this mathematically, there are three things we need to show:
f is a total function: for any natural number n, there is obviously only one e such that e = 2n. And, obviously, for any n, it's well-defined to talk about 2n.
f is injective: This means that, for any e, there is no more than one n which corresponds to it. Mathematically, it's easy to show this by demonstrating, if f(n1) = f(n2), then n1 = n2. In this case, f(n1) = f(n2) implies 2⋅n1 = 2⋅n2, meaning that n1 = n2.
f is surjective: For every even number e, we need to demonstrate that there is a corresponding natural number n such that f(n) = e. This shows that f covers all of E. In this case, it's very easy to show that: all you need to do is divide an even number by two to get a natural number that, when doubled, yields that same even number.
What?
So then what does that have to do with currying? Let's look at that original proposition again:
Is there a bijection h: {M -> {L -> K}} -> {(M × L) -> K}?
If this is true, it means that the set of functions {M -> {L -> K}} is equinumerous to the set of functions {(M × L) -> K}, which means, basically, that they are equivalent and represent each other.
Let's think about what these two sets represent. {M -> {L -> K}} is the set of functions from M to the set of functions from L to K. Wait a minute: that's just like a curried function! That's just like f(m)(l) = k: a function which returns a function to take the second argument. {(M × L) -> K} is the set of functions which take an ordered pair (m, l) and return an element of K, basically f(m, l) = k.
So, if this bijection h exists, it could be the uncurry function: it takes a curried function and makes a normal one out of it. Well, by what we know, there are other things it could do, but this is what would make the most sense. Let's define this uncurry function:
h(f)(m, l) = f(m)(l)
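For example, if f is curried addition, with f(m)(l) = m + l, then h(f) is the ordinary two-argument addition: h(f)(m, l) = m + l.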
Now, is this a bijection? If it is, that means that curried functions are essentially equivalent to uncurried functions, meaning that it's valid to curry things. This is what makes it possible—easy, even—to have some languages be auto-currying (eg OCaml) and some languages not be (eg SML), and despite that use the same basic programming style. To prove that h is a bijection, we can use the same proof outline as the above demonstration that the even numbers are equinumerous to the natural numbers:
h is a total function: for any function f: M -> {L -> K}, there is only one g such that g(m, l) = f(m)(l); if there were another function which satisfied this property, it would equal g. And for any f: M -> {L -> K}, it is valid to call that function on m, where m is in M, and call that result on l, where l is in L.
h is injective: Another way we can show injectivity is by showing that if f ≠ g, then h(f) ≠ h(g). This is equivalent to what was shown before. Remember, f and g: M -> {L -> K}. So, if f ≠ g, then there is some m in M such that f(m) ≠ g(m). This implies that there is some l in L such that f(m)(l) ≠ g(m)(l). We know that f(m)(l) = h(f)(m, l), and that g(m)(l) = h(g)(m, l), by the definition of h. Therefore, h(f)(m, l) ≠ h(g)(m, l), meaning h(f) ≠ h(g). This demonstrates that h is injective: each different curried function is paired up with a different non-curried function.
h is surjective: For h to be surjective, there has to be a curried function corresponding to every uncurried function. Let's call the curried function f and the uncurried function u. If we have an uncurried function, we can easily pull the curried one out with the currying operation c:
c(u)(m)(l) = u(m, l)
So, for every uncurried function u, the corresponding curried function is c(u). If there is always an f such that h(f) = u for any u: (M × L) -> K, then h is surjective. Let's show that c(u) is that f. h(c(u))(m, l) = c(u)(m)(l) = u(m, l), so h(c(u)) = u. Since c(u) is defined for any u of the appropriate function type, h is surjective.
In case you lost track, we just proved that currying makes sense.
Exercise to the reader: Prove that the function h defined above is the only bijection between {M -> {L -> K}} and {(M × L) -> K} which works for any set M, L and K with no additional knowledge. Hint: I don't know how to do this, but I'm pretty sure it's true.
Cardinal arithmetic
When you get into all this complex mathematics, you can do something really cool: arithmetic. Let's redefine numbers completely, ignoring what we know already. A number, like 1, 2, 3, etc, is the set of all sets of that size. (Actually, scratch that—we can't really do that. Let's stipulate that these sets are subsets of some unspecified, but really big, U, and then keep going as if nothing ever happened. This helps us avert a gigantic contradiction.) More specifically, 1 is the set of sets which are equinumerous (remember, bijection) to the set containing just 1; 2 is the set of sets which are equinumerous to the set containing 1 and 2; etc. We use the notation |A| to refer to the cardinality, or this weird number system's value for the size, of A. So, if A contains the numbers 42, π, 1984 and nothing else, |A| = 3 since there is a bijection between A and the set containing 1, 2, 3, as well as a whole bunch of other sets: they all have the same number of elements.
It's easy to build up some basic operations for these cardinal numbers. If sets A and B are disjoint, |A| + |B| = |the set of anything in either A or B, or both|. For any set A and B, |A|⋅|B| = |A × B|. For any set A and B, |A|^|B| = |{B -> A}|. It's useful to imagine these with some small finite sets to verify that they make sense.
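For example, if A has 2 elements and B has 3, then A × B has 2⋅3 = 6 ordered pairs, and {B -> A} has 2^3 = 8 functions, since each of the three elements of B can independently map to either element of A.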
Now, we can show a whole bunch of basic identities we already knew about positive integers make sense here over all cardinal numbers. Here's one: for any cardinal numbers κ, λ, and μ: (κ^λ)^μ = κ^(λ⋅μ). So, what's the equivalent statement when translated into a fact about sets? It's exactly the proposition we started with: {M -> {L -> K}} is equinumerous to {(M × L) -> K}, since (κ^λ)^μ counts the functions from M into {L -> K}, and κ^(λ⋅μ) counts the functions from M × L into K.
How does this look in programming?
Going back to the practical(ish) side, currying and uncurrying actually come in handy when programming Haskell. Here's an excerpt from the Haskell Prelude:
-- curry converts an uncurried function to a curried function;
-- uncurry converts a curried function to a function on pairs.
curry :: ((a, b) -> c) -> a -> b -> c
curry f x y = f (x, y)
uncurry :: (a -> b -> c) -> ((a, b) -> c)
uncurry f p = f (fst p) (snd p)
As an example of the use of uncurry, here's an implementation of
zipWith
(not from the prelude) which doesn't duplicate the logic of map
, assuming an existing implementation of zip
:
zipWith :: (a->b->c) -> [a]->[b]->[c]
zipWith f a b = map (uncurry f) $ zip a b
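For example, zipWith (+) [1, 2, 3] [10, 20, 30] first zips the lists into [(1,10), (2,20), (3,30)] and then maps uncurry (+) over the pairs, giving [11, 22, 33].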
Overview of notation
- Capital letters (eg K, L, M) are sets.
- Lower case letters are functions (eg f, g, h) or members of sets (k, l, m).
- Greek lowercase letters (eg κ, λ, μ) are cardinal numbers.
- A × B is the Cartesian product of A and B.
- f: A -> B means f is a total function from A to B.
- {A -> B} is the set of functions from A to B. (I made this notation up)
Maybe I'll never be the Good Math, Bad Math guy, but I still enjoy the chance to explain stuff. Maybe this will make up for the stupidity of my last post.
Wednesday, October 31, 2007
Post removed
The post that used to be here was wrong and pointless. I've removed it, but if you're really curious, it's just commented-out as HTML, so you could still read it.
Saturday, October 27, 2007
FactorCon MN 2007
You may have already seen this on Slava's blog, but for those of you who didn't: on November 18-21, we're having a Factor-related get-together in Minneapolis, Minnesota. All levels of experience/use of Factor are encouraged to attend. Confirmed guests include Slava Pestov, Doug Coleman and me. If you think you can come, please contact me (or someone else who's already going). I haven't been able to do much Factor hacking lately, due to school constraints, so this'll be a great opportunity for me, personally. For all of you: if you don't know Factor well, this would be a great time to learn more; if you do know Factor well, this would be a good chance to meet your fellow developers. I hope to see you there!
Friday, October 26, 2007
Factor's object system
In my FAQ, I wrote that Factor was object-oriented. That might have confused some of you: how is Factor object-oriented when code isn't organized in terms of classes like in Java? When there is no system for message passing*? When tuples can't inherit from each other?
Rationale
Originally, Factor didn't have an object system. It was said that it's not necessary; we can get by with just words, conditionals and lists. But as the library grew more complicated, it became clear that we needed a cleaner method of dispatch. This took the form of generic words, or words that can be overloaded by the class of one of their arguments. Initially, we only had predicate classes, and generic words were in fact no more than syntactic sugar for
cond
. But we also needed a good way to define data types, so that they wouldn't be confused with each other. Later, we realized the need for a system of 'inheritance' from tuples to tuples, flexible constructors, extensible unions and other things. Piece by piece, features were added to form an emerging, coherent whole which could be used to organize some Factor code.
It's important to note that most words are not methods and not all data is stored in tuples. Groups of Factor code are put in vocabularies, not in classes. It's common to write a lot of code without using any object-oriented features (aside from generic words defined in libraries). Nevertheless, object orientation, when used appropriately, makes code much better organized.
The class hierarchy
In Factor, everything is an object. Well, of course everything is an object; everything is something! But I mean everything is an instance of the class
object
. Under that class, there are a number of built-in classes, representing basic things like tuple
, vector
, array
, fixnum
, float
, bignum
, ratio
, quotation
, curry
, etc. You can test if an object is in a particular class with words of the scheme object?
. To test if an object is a tuple, use the word tuple?
.
On top of these builtin classes, there are a number of abstractions. For example, the class
number
is the union of all numeric classes. Under this is a numerical tower of subclasses, similar to Scheme. An example of a couple unions of this kind:
UNION: integer fixnum bignum ;
UNION: rational ratio integer ;
Factor also supports extensible unions, called mixins. Whereas a union is closed, later code is allowed to add things to a mixin. The above code can be translated into mixin syntax as:
MIXIN: rational
MIXIN: integer
INSTANCE: rational ratio
INSTANCE: rational integer
INSTANCE: integer fixnum
INSTANCE: integer bignum
This is useful for things like sequences and assocs, which form mixin classes. Unions are inappropriate here, as new sequences and assocs can be made in the library.
Going in the other direction, classes can be narrowed down using predicate classes. A predicate class is a subclass of some other class consisting of instances which satisfy a conditional. For example, the natural numbers (defined as integers 0 or greater) could be defined as
PREDICATE: integer natural 0 >= ;
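Assuming the usual class-predicate naming scheme mentioned below, this defines a class natural whose test word natural? returns t for 5 and f for -3.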
Generic words
So what use are all these classes? We can use predicates on them to test membership, but that's not all that useful. The centerpiece of the object system is generic words, or words which dispatch on the class of their arguments. Typically, only single dispatch is needed, so multiple dispatch is not in the core**. A generic word which dispatches on the top of the stack is defined with the parsing word
GENERIC:
. Methods, or implementations of the generic word, are written with the parsing word M:
.
Take the following useless example:
GENERIC: process ( thing -- processed )
M: string process
"foo" append ;
M: integer process
5 + ;
Here, there is a generic word
process
with two methods: one on strings (which appends "foo") and one on integers (which adds 5). This is equivalent to:
: process ( thing -- processed )
{
{ [ dup string? ] [ "foo" append ] }
{ [ dup integer? ] [ 5 + ] }
} cond ;
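For example, "bar" process yields "barfoo", while 5 process yields 10.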
But the first version of
process
has the advantage that it is extensible later, in other files. Additionally, generic words are sensitive to the class hierarchy and will test things in an appropriate order, and dispatch is more optimized than with a simple cond
expression.
There are other ways to dispatch besides
GENERIC:
. Other method combinations include MATH:
, for words which dispatch on builtin math types for two arguments, HOOK:
, to dispatch on the class of the value of a variable, and GENERIC#
, which allows dispatch on items further down on the stack. You can also make your own method combinations.
Tuples, delegation and constructors
It's possible to construct your own data types using just arrays and predicate classes, but that can get awkward and ambiguous. To make new data structures, Factor offers tuples, basically another name for structs or records. Here is an example of a tuple class called
foo
, with slots bar
and baz
, together with a constructor called <foo>
:
TUPLE: foo bar baz ;
C: <foo> foo
The first line creates the tuple class, and the second line makes the constructor. (Slava described the details and rationale for the constructor system.) It is simple to use this. To make a new
foo
tuple, just get the values of bar
and baz
on the stack and call <foo>
. With that object, the bar
slot can be read with the word foo-bar
, and the baz
slot can be read with the word foo-baz
. To write values to those slots, use the words set-foo-bar
and set-foo-baz
. And, of course, to test if an object is a foo tuple, use the word foo?
. All in all, TUPLE:
is a wildly unhygienic macro, and we like it that way!
But wouldn't it be useful if we could let tuples inherit from each other? We've already seen that Factor allows for general subtypes and supertypes, but this does not extend to tuples. The way that tuples can be based off each other is through delegation: one tuple delegates to another. Each tuple has a slot, called the delegate (accessed with
delegate
and set-delegate
). When a generic word dispatches on a tuple, but it doesn't have a method implementation for that tuple, it looks in the delegate slot for what to do next. If there is a delegate, then the generic word is again invoked on that delegate, with the hope that the generic word itself has an implementation for that class. If not, it's on to the next delegate until there is no delegate anymore--the delegate is f.***
Instead of inheriting from a class, tuples delegate to an instance. That is, if I have my tuple
TUPLE: foo bar baz ;
and I want it to inherit from TUPLE: bing quux ;
, all I have to do is make an instance of foo and an instance of bing, and set the delegate of the foo to the bing. Slot accessors like bing-quux
and set-bing-quux
are generic words implemented, by default, only on that one tuple class that they were made for. So, if you call bing-quux
on a tuple of class foo that delegates to a bing, the foo instance will fail to respond and the generic word will be called again on the delegate, bing. This delegate will know what to do, and it'll return its quux value.
There's a subtle problem here with delegates. Look at the following code:
TUPLE: foo bar baz ;
TUPLE: bing quux ;
GENERIC: who
GENERIC: how
M: bing who how ;
M: bing how drop "bing" print ;
M: foo how drop "foo" print ;
You might expect that when
who
is called on a foo instance, "foo" is printed, but in fact, "bing" is printed. This is called the slicing problem: the object is "sliced" when using delegates, so only part of the object is represented in further method calls. On one hand, the behavior of M: bing who
is more predictable, but on the other hand, information is lost. It's not a horrible thing, just something to watch out for.
Use cases
As I said before, a lot of Factor code has no need for object orientation. Going by my own experience, some complicated libraries, such as Unicode and
inverse
, don't yet use the object system at all. The XML-RPC vocabulary relies heavily on generic words to perform dispatch in turning Factor data into XML-RPC format, though no new classes are defined. The XML library, however, uses tuples to represent XML at various stages of processing, and dispatches on the class of these tuples to provide complex functionality. Several tuples are defined in the XML library to represent exceptions, though any object can be thrown as an exception.
In general, the object system should be used when it's useful and not used when it's not useful. That seems simple enough, but it is a radical departure from the philosophy of other languages where all procedures are methods. It's good to use whichever strategy you think will produce the most maintainable, clear code.
The future
For a while, I hesitated to write about the object system because it changes so frequently. Tuple constructors, for example, changed very recently to eliminate the default constructor and replace it with a more flexible system. Mixins are also a recent addition, yet they significantly change the way default methods are implemented. Soon, Slava plans to replace delegation with tuple inheritance.
Since the Factor object system is constructed completely in Factor, rather than being built in to the virtual machine, any changes to it can be done within Factor itself. This is a huge advantage. One change that I proposed could easily be written in Factor and run without any special care. Fundamental changes to the object system can even be interactively tested and debugged!
* Eduardo Cavazos has written a message passing object system in Factor, and he uses it for some of his code. Factor allows other object systems to be created, but with his system, not everything is an 'object' that can receive messages. Still, it's an interesting (and surprisingly terse) example of the flexibility of Factor, and of useful syntax extension. It is not widely used, however.
** I implemented double dispatch as a library called
visitor
(since it's implemented with the visitor pattern, yielding a simple lexicographic method order). Maybe in the future we'll have double dispatch in the Factor core, to better implement words like like
, which end up approximating double dispatch using conditionals. Though it makes sense in theory, I can't imagine any use for triple dispatch in Factor.
*** As you can see, if you make an infinite (that is, recursive) delegate chain, method resolution may not halt. But why would you ever want to do that? Anyway, since we have predicate classes, it's already not guaranteed to halt!
Blogging is much more fun than writing essays about the history of South Africa.
Tuesday, October 16, 2007
Unicode implementer's guide part 5: Collation
What could be simpler than alphabetical order[1]? It's a simple lexicographic comparison of two strings, comparing the first character to see which is greater, then the second as a tie breaker, then the third, and so on. Unicode makes a lot of things harder, but this shouldn't be any different. Unicode doesn't prohibit this kind of alphabetical order; you can have a 'Unicode-conformant' process which uses this order.
Great idea, but...
This kind of order, which we could call code point order, is useless. Well, it's not completely useless; it's fine if you need an arbitrary, consistent order. But for humans, it's useless. It puts Z before a. In Normalization Form D (which I'll assume strings are in for the rest of the article), it puts "äa" after "az". It puts "aa" after "a-z". It puts "æ" after "af". All of these are backwards for users in the American English locale and most other locales.
It's not just that the letters are in the wrong order. It also won't be sufficient to put all characters in a consistent case, as the R6RS Unicode module authors seem to think, as this will still give unnatural behavior for punctuation, accent marks, ligatures, and other things. Removing all of these features won't do it either; they can't be completely ignored, because "foot ball" should come consistently after "football", and "Apple" should come consistently either before or after "apple", depending on the needs of the user.
It's crucial that things are sorted correctly; if they are in a different order than the user expects, then a user, looking at a long alphabetized list, may conclude that an item on the list doesn't exist.
A better way for people
To get things right, the way humans expect them, you need another few levels of tiebreakers. When comparing two strings to see which one comes first, there's a number of steps: First, compare two strings with all of the features stripped. If they're equal, compare the accent marks of the strings, as a tie breaker. If those are equal again, compare the capitalization as a second tie breaker. And if those are still equal, compare the punctuation.
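As a toy illustration of these layered tiebreakers (my own simplification in Haskell, not the actual algorithm), progressively finer keys can be bundled into a tuple so that the derived tuple ordering does the lexicographic comparison:
import Data.Char (isAlpha, isUpper, toLower)
import Data.List (sortBy)

-- A toy sketch of the layered-tiebreaker idea, not UTS #10 itself. Accent
-- handling is omitted for brevity; a real implementation would strip marks
-- for the first component and compare them in a component of their own.
toyKey :: String -> (String, [Bool], String)
toyKey s = ( map toLower letters  -- letters only, case folded: the primary comparison
           , map isUpper letters  -- capitalization, used only as a tiebreaker
           , s )                  -- the raw string (punctuation and all) as a last resort
  where letters = filter isAlpha s

collate :: String -> String -> Ordering
collate a b = compare (toyKey a) (toyKey b)

main :: IO ()
main = print (sortBy collate ["foot ball", "Apple", "football", "apple"])
-- "apple"/"Apple" and "foot ball"/"football" end up adjacent, with case and
-- punctuation only breaking ties.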
I left out one thing--ligatures like æ. This can be expanded to something that looks like "ae" in the first stage, but looks different in the accent mark tiebreaker. We'll call them expansions. In some languages, like Danish, there are also contractions like "aa" with their own special alphabetization; these contract into a single value for the initial sort, if the Danish locale is used. In Thai and Lao scripts, there is an additional problem of certain vowels and consonants reversing order for collation purposes. This can be achieved with a combination of the expansion and contraction patterns. There's also a special problem for Korean collation[2].
UTS #10
Believe it or not, I didn't come up with this all by myself. It's part of the Unicode Standard, in Unicode Technical Standard #10: Collation. UTS #10 isn't something that's required of Unicode implementations, but it is specifically standardized and codified. It does basically what I described in the section above, with a slight change for efficiency. Because the folks at the Unicode Consortium did all this work, we don't have to! They've even given us a convenient table of data that makes it easier to implement the above tiebreakers.
UTS #10 suggests a simple way of implementing these tiebreakers that works efficiently for sorting large lists of strings: calculate a collation key, a bit array, for each string, and compare these collation keys with a simple bit comparison. This collation key can be calculated in linear time and space with respect to the string, so a human-appropriate comparison is asymptotically no worse than a naive one and suffers only linear overhead.
Collation keys and the DUCET
But how do we make the collation key? Well, that table I mentioned earlier is called DUCET, the Default Unicode Collation Element Table. It describes a neutral locale which is linguistically inaccurate for everyone. For most[3] characters in NFD, there is a three-tuple[4] of the primary, secondary, and tertiary collation weights. In assembling the collation key from a string, go through the string, first appending the primary weights of each character[5], then a separator of 0, then all of the secondary weights, followed by another separator, and finally the tertiary weights. If one of these weights for a particular character is 0, leave it out. This is all followed by a representation of the punctuation[6].
A simple binary comparison of the collation keys ends up being equivalent to what I described before. How? When the primary weights are appended, they form something equivalent to the characters all put in a consistent case, with accent marks and punctuation removed. In a binary comparison of collation keys, these are compared first. They are followed by a separator of 0 so that, if the strings' primary weights are equivalent up to the end of the shorter one, the longer string will be sorted second. This is because 0 is less than any collation weight. The secondary collation weight represents the accent marks, and the tertiary collation weight represents the capitalization. (See note 5 about punctuation.)
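Here's a minimal sketch of that key layout in Haskell, assuming each character (really, each collation element) has already been looked up as a (primary, secondary, tertiary) weight triple; variable weighting, contractions and expansions are left out:
-- Plain Ints stand in for collation weights, with 0 meaning "ignorable at
-- this level". Only the level-by-level key layout is shown.
collationKey :: [(Int, Int, Int)] -> [Int]
collationKey ws =
     level (\(p, _, _) -> p) ++ [0]  -- all primary weights, then a separator
  ++ level (\(_, s, _) -> s) ++ [0]  -- all secondary weights, then a separator
  ++ level (\(_, _, t) -> t)         -- all tertiary weights
  where level f = filter (/= 0) (map f ws)
Comparing two such keys with an ordinary list comparison then reproduces the primary-then-secondary-then-tertiary behavior described above.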
Implementation and tailoring
I wish I could give hints on implementation, but, umm... I haven't written a working one yet. It's quite an involved algorithm, at least compared to things I've done before. Luckily, the UTS gives lots of hints for how to do it. One thing that makes implementation easier is that the collation weights of many characters, mostly Han ideographs, can be derived algorithmically through a formula specified in UTS #10. In fact, this is somewhat unfortunate, as it is a symptom of the fact that the ordering of Han ideographs requires a more involved process, using some sort of dictionary lookup for pronunciation or radical index, depending on the context.
But implementation comes with an additional challenge: tailoring to specific locales. The DUCET isn't linguistically accurate for any locale, and must always be modified somewhat. Different contexts, such as language, country, and organizational preference, can demand slightly different collation orders that users depend on. Modifying the collation table to suit this is known as tailoring. I'm not yet sure what algorithm should be used to reshuffle the collation weights in the right way. But there is a ton of locale data in the Common Locale Data Repository (CLDR), a project run by the Unicode Consortium. This includes information about collation.
Conclusion
Collation isn't easy, but it's necessary to get right. UTS #10 lays out an algorithm for many aspects of collation, but still leaves some things out: there's no specified process for ignoring words like 'the' in the beginning of a book title, or padding numbers with zeros so that 2 comes before 10. Nevertheless, with this algorithm and appropriate locale data, it's possible to give most users a linguistically accurate alphabetical order which makes sense to humans.
1 Actually, we should probably call it collation, since alphabetical order implies we're working with an alphabet. With Unicode, some scripts don't use an alphabet.
2 There are a number of different ways to sort Korean Hangul syllables appropriately. The precomposed Hangul syllables are already in the correct collation order, but a problem arises in the representation of old Hangul texts, where multiple initial, medial or final jamo can be used. (In modern Korean, only one is used per syllable.) There is more than one way to sort this. Because of an apparent bureaucratic argument between the Unicode Consortium and the ISO (who publishes the parallel standard ISO 10646), none of these are encoded in the DUCET. The one that sounds the simplest to implement is to decompose all syllables and place a special terminator weight after the end of each Hangul syllable. More details are in UTS #10.
3 I say 'most' because Han characters and Hangul syllables are left out, as their weights can be derived algorithmically. Maybe I shouldn't say 'most', as most characters are in that group, though.
4 Actually, it's a four-tuple, but the last entry isn't very useful for anything. I think it's the same as the shifted strategy for variable weighting, described in note 5.
5 Actually, to make expansions and contractions work, you shouldn't be iterating over characters; you should iterate over collation elements. For example, in Slovak, 'ch' forms a collation element with a distinct single primary weight.
6 There are many different ways to represent the punctuation, but that is beyond the scope of this article. Generally, we don't refer to it as 'punctuation' but rather 'variable-weighted'. The most useful variant (they all have applications, however) is called 'shifted', adding an algorithmically-derived fourth level of weight to compare variable-weighted characters.
Great idea, but...
This kind of order, which we could call code point order, is useless. Well, it's not completely useless; it's fine if you need an arbitrary but consistent order. But for humans, it's useless. It puts Z before a. In Normalization Form D (which I'll assume strings are in for the rest of the article), it puts "äa" after "az". It puts "aa" after "a-z". It puts "æ" after "af". All of these are backwards for users in the American English locale and most other locales.
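To make this concrete, here's a quick illustration in Python (just a demonstration of the problem, nothing more): Python compares strings by code point, so it reproduces exactly the behavior described above.

    import unicodedata

    words = ["a", "Z", "az", "a\u0308a", "aa", "a-z", "af", "\u00e6"]
    words = [unicodedata.normalize("NFD", w) for w in words]   # assume NFD, as above
    print(sorted(words))
    # Code point order: Z before a, "äa" after "az", "aa" after "a-z", "æ" after "af"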
It's not just that the letters are in the wrong order. It also won't be sufficient to put all characters in a consistent case, as the R6RS Unicode module authors seem to think: that still gives unnatural behavior for punctuation, accent marks, ligatures, and other things. Removing all of these features won't do it either; they can't be completely ignored, because "foot ball" should come consistently after "football", and "Apple" should come consistently either before or after "apple", depending on the needs of the user.
It's crucial that things are sorted correctly: if they come out in a different order than the user expects, a user scanning a long alphabetized list may conclude that an item on the list doesn't exist.
A better way for people
To get things right, the way humans expect them, you need another few levels of tiebreakers. When comparing two strings to see which one comes first, there are a number of steps: first, compare the two strings with all of those features (accents, case and punctuation) stripped. If they're equal, compare the accent marks of the strings as a tiebreaker. If those are equal too, compare the capitalization as a second tiebreaker. And if those are still equal, compare the punctuation.
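Here's a minimal sketch of that multi-level comparison in Python. The feature-stripping below is a toy stand-in for real collation data (it handles none of the expansions and contractions discussed next), but it shows the shape of the tiebreakers.

    import unicodedata

    def primary(s):
        # Level 1: letters only; accents, case and punctuation stripped
        s = unicodedata.normalize("NFD", s)
        return "".join(c for c in s
                       if c.isalnum() and not unicodedata.combining(c)).casefold()

    def secondary(s):
        # Level 2: just the accent marks, in order
        s = unicodedata.normalize("NFD", s)
        return "".join(c for c in s if unicodedata.combining(c))

    def tertiary(s):
        # Level 3: the capitalization pattern
        return "".join("1" if c.isupper() else "0" for c in s if c.isalpha())

    def quaternary(s):
        # Level 4: whatever punctuation and spaces are left over
        s = unicodedata.normalize("NFD", s)
        return "".join(c for c in s
                       if not c.isalnum() and not unicodedata.combining(c))

    def sort_key(s):
        return (primary(s), secondary(s), tertiary(s), quaternary(s))

    print(sorted(["foot ball", "football", "Apple", "apple", "az", "a\u0308a"],
                 key=sort_key))
    # äa, apple, Apple, az, football, foot ball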
I left out one thing--ligatures like æ. A ligature can be expanded to something that looks like "ae" in the first stage, but looks different in the accent mark tiebreaker. We'll call these expansions. In some languages, like Danish, there are also contractions like "aa" with their own special alphabetization; these contract into a single value for the initial sort, if the Danish locale is used. In Thai and Lao scripts, there is an additional problem of certain vowels and consonants reversing order for collation purposes; this can be handled with a combination of the expansion and contraction patterns. There's also a special problem for Korean collation[2].
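Expansions and contractions can be handled by splitting a string into collation elements before any weights are assigned; an element may cover more than one character (a contraction) or stand for more than one weight (an expansion). A sketch, with made-up data in place of real tables:

    # Toy data; real contractions and expansions come from the collation tables.
    CONTRACTIONS = {"ch"}           # e.g. Slovak "ch" as a single collation element
    EXPANSIONS = {"\u00e6": "ae"}   # æ expands, at least at the primary level

    def collation_elements(s):
        elements = []
        i = 0
        while i < len(s):
            if s[i:i + 2] in CONTRACTIONS:
                elements.append(s[i:i + 2])        # two characters, one element
                i += 2
            elif s[i] in EXPANSIONS:
                elements.extend(EXPANSIONS[s[i]])  # one character, two elements
                i += 1
            else:
                elements.append(s[i])
                i += 1
        return elements

    print(collation_elements("ch\u00e6b"))   # ['ch', 'a', 'e', 'b']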
UTS #10
Believe it or not, I didn't come up with this all by myself. It's part of the Unicode Standard, in Unicode Technical Standard #10: Collation. UTS #10 isn't something that's required of Unicode implementations, but it is specifically standardized and codified. It does basically what I described in the section above, with a slight change for efficiency. Because the folks at the Unicode Consortium did all this work, we don't have to! They've even given us a convenient table of data that makes it easier to implement the above tiebreakers.
UTS #10 suggests a simple way of implementing these tiebreakers that works efficiently for sorting large lists of strings: calculate a collation key for each string (an array of weights that can be compared as a plain binary string), and compare these collation keys with a simple binary comparison. The collation key can be calculated in linear time and space with respect to the string, so a human-appropriate comparison is asymptotically no worse than a naive one and suffers only linear overhead.
Collation keys and the DUCET
But how do we make the collation key? Well, that table I mentioned earlier is called the DUCET, the Default Unicode Collation Element Table. It describes a neutral locale whose order isn't linguistically accurate for anyone. For most[3] characters in NFD, it gives a three-tuple[4] of primary, secondary, and tertiary collation weights. To assemble the collation key for a string, go through the string, first appending the primary weights of each character[5], then a separator of 0, then all of the secondary weights, followed by another separator, and finally the tertiary weights. If one of these weights for a particular character is 0, leave it out. This is all followed by a representation of the punctuation[6].
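Here's a rough sketch of that assembly, using invented weights in place of real DUCET entries (the actual values live in the allkeys.txt data file, and this ignores the variable weighting of punctuation entirely):

    # Illustrative (primary, secondary, tertiary) weights; not real DUCET values.
    WEIGHTS = {
        "a": (0x1C47, 0x0020, 0x0002),
        "z": (0x1C7E, 0x0020, 0x0002),
        "\u0301": (0x0000, 0x0024, 0x0002),   # combining acute: no primary weight
    }

    def collation_key(elements):
        key = []
        for level in range(3):            # primary, then secondary, then tertiary
            for e in elements:
                w = WEIGHTS[e][level]
                if w != 0:                # weights of 0 are simply left out
                    key.append(w)
            if level < 2:
                key.append(0)             # the 0 separator between levels
        return tuple(key)                 # tuple comparison acts like a binary comparison

    print(collation_key(["a"]) < collation_key(["a", "\u0301"]))            # True: a before á
    print(collation_key(["a", "\u0301", "a"]) < collation_key(["a", "z"]))  # True: áa before az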
A simple binary comparison of the collation keys ends up being equivalent to what I described before. How? When the primary weights are appended, they form something equivalent to the characters all put in a consistent case, with accent marks and punctuation removed. In a binary comparison of collation keys, these are compared first. They are followed by a separator of 0 so that, if the strings' primary weights are equal up to the end of the shorter one, the longer string sorts second; this works because 0 is less than any collation weight. The secondary collation weights represent the accent marks, and the tertiary weights represent the capitalization. (See note 6 about punctuation.)
Implementation and tailoring
I wish I could give hints on implementation, but, umm... I haven't written a working one yet. It's quite an involved algorithm, at least compared to things I've done before. Luckily, the UTS gives lots of hints for how to do it. One thing that makes implementation easier is that the collation weights of many characters, mostly Han ideographs, can be derived algorithmically through a formula specified in UTS #10. In a way, this is unfortunate: it's a symptom of the fact that a linguistically accurate ordering of Han ideographs requires a more involved process, using some sort of dictionary lookup by pronunciation or radical index, depending on the context.
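As I understand the derivation (check UTS #10 for the exact base values and ranges), the implicit weights for a character with no explicit entry look roughly like this:

    # Sketch of UTS #10's implicit weight derivation for characters with no
    # explicit DUCET entry (mostly Han ideographs). The constants are approximate;
    # later versions of the algorithm add further special ranges.
    def implicit_weights(code_point, core_han=True):
        base = 0xFB40 if core_han else 0xFBC0   # other unified ideographs use 0xFB80
        aaaa = base + (code_point >> 15)
        bbbb = (code_point & 0x7FFF) | 0x8000
        # Two collation elements; the second carries only a primary weight.
        return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]

    print([hex(p) for p, _, _ in implicit_weights(0x4E2D)])   # ['0xfb40', '0xce2d'] for U+4E2D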
But implementation comes with an additional challenge: tailoring to specific locales. The DUCET isn't linguistically accurate for any locale and must always be modified somewhat. Differences in language, country and organizational preference can demand slightly different collation orders that users depend on. Modifying the collation table to suit this is known as tailoring. I'm not yet sure what algorithm should be used to reshuffle the collation weights in the right way, but there is a ton of locale data in the Common Locale Data Repository (CLDR), a project run by the Unicode Consortium, and it includes information about collation.
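For a sense of what a tailoring looks like in practice, ICU (which consumes CLDR's collation data) lets you state it as rules. This assumes the PyICU bindings are installed; the rule below is just a one-line example of the syntax, not a full locale tailoring.

    import icu   # PyICU

    # "&c < ch" says: treat "ch" as its own letter, sorting just after "c",
    # as Slovak does. CLDR tailorings compile down to rules like this.
    collator = icu.RuleBasedCollator("&c < ch")
    print(sorted(["cz", "ch", "ca"], key=collator.getSortKey))
    # ['ca', 'cz', 'ch']: "ch" now follows every other c-initial string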
Conclusion
Collation isn't easy, but it's necessary to get right. UTS #10 lays out an algorithm for many aspects of collation, but it still leaves some things out: there's no specified process for ignoring words like 'the' at the beginning of a book title, or for padding numbers with zeros so that 2 comes before 10. Nevertheless, with this algorithm and appropriate locale data, it's possible to give most users a linguistically accurate alphabetical order which makes sense to humans.
[1] Actually, we should probably call it collation, since "alphabetical order" implies we're working with an alphabet, and with Unicode, some scripts don't use an alphabet.
[2] There are a number of different ways to sort Korean Hangul syllables appropriately. The precomposed Hangul syllables are already in the correct collation order, but a problem arises in the representation of old Hangul texts, where multiple initial, medial or final jamo can be used. (In modern Korean, only one is used per syllable.) There is more than one way to sort this, and because of an apparent bureaucratic argument between the Unicode Consortium and the ISO (which publishes the parallel standard ISO 10646), none of them is encoded in the DUCET. The one that sounds simplest to implement is to decompose all syllables and place a special terminator weight after the end of each Hangul syllable. More details are in UTS #10.
[3] I say 'most' because Han characters and Hangul syllables are left out, as their weights can be derived algorithmically. Maybe I shouldn't say 'most', though, since most characters are in that group.
[4] Actually, it's a four-tuple, but the last entry isn't very useful for anything. I think it's the same as the shifted strategy for variable weighting, described in note 6.
[5] Actually, to make expansions and contractions work, you shouldn't be iterating over characters; you should iterate over collation elements. For example, in Slovak, 'ch' forms a collation element with a distinct single primary weight.
[6] There are many different ways to represent the punctuation, but that is beyond the scope of this article. Generally, it isn't referred to as 'punctuation' but as 'variable-weighted' characters. The most useful variant (though they all have applications) is called 'shifted', which adds an algorithmically-derived fourth level of weight for comparing variable-weighted characters.
Friday, September 28, 2007
Metablog entry
Here's how to make a good blog: Just write something. There's no real secret to blog writing; you just have to write something that no one has written before. (And make it sort of interesting and intelligible, but those are more optional.)
I bet all of you reading this entry have a new idea in your head. Why not write about it?
Update: Removed irrelevant rambling