Tuesday, February 20, 2007

Doing Unicode right, part 2

Note: this post won't have anything to do with what I said I'd write in my next post.

Previously, I was a bit vague about what I wanted from my Unicode library. I'm going to invent a new term for what I want, since existing terms don't match this: Factor should be a Unicode-perfect language. "Unicode-perfect" only gets 1 relevant Google hit, so I think I'm safe in choosing this term. Currently, there are no Unicode-perfect programming languages, though there have been many attempts. A programming language is Unicode-perfect if:

  1. There are correct I/O routines for dealing with Unicode (this is the easy part).

  2. All programs written in the language (that don't have obvious bugs or purposely look at things at a low level) are Unicode-conformant, as defined by the Unicode standard. It shouldn't take any effort at all for the programmer to make their programs this way.

  3. All scripts defined by Unicode are correctly processed (though not necessarily displayed) with no special handling by the programmer. This includes scripts outside the Basic Multilingual Plane (BMP) and scripts added in the most recent version of Unicode.

  4. Using Unicode should cause no significant performance penalty.


A side note: it's funny that most Unicode libraries and programming languages talk about "Unicode support" rather than "Unicode conformance." This is probably because Unicode support is not defined anywhere. Any program that has UTF-8 or UTF-16 I/O (even if it's buggy) can claim "Unicode support", but few things actually conform to the standard. Conformance is a goal because it lets programs have predictably correct behavior in all supported languages. Conformance does not require that everything is supported; it only requires that everything claimed to be supported actually is. For example, I do not plan to support word breaking properties, but as long as I don't claim to do so, my programs can still be Unicode-conformant.

It is possible to process Unicode properly in any Turing-complete programming language. But will developers want to do it? Say a programmer is using C to write a program which they initially believe will only be used with ASCII. For efficiency and ease of writing, they use a null-terminated char * to represent the string. If the programmer later decides to use Unicode, they have a number of decisions to make: which Unicode library will they use? What encoding will be used internally? Additionally, all existing operations used for processing strings will have to be switched to the new Unicode version.

Now I can hear all you Pythonistas, Perl Monks and Java and C# programmers saying, "My language is Unicode-perfect!" No it's not. You're wrong, and it's not just about lack of support for obscure scripts. None of those languages really hides the encoding of strings from the programmer. If you index a string, an operation available and commonly used in all of those languages, the result may be half of a surrogate pair or a single UTF-8 octet rather than a character. One solution to this problem is to store all strings in UTF-32, so all code points can be represented directly. Another solution is to make strings 8-bit when they contain only code points U+0000..U+00FF and 32-bit for everything else, though this becomes complicated when strings are mutable, as they are in Factor.
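To make the indexing problem concrete, here's the kind of check application code ends up needing in a language that exposes UTF-16 code units (this word isn't from any of those languages or from my library; it's just for illustration):

! A code unit in the range D800-DFFF is half of a surrogate pair,
! not a character in its own right.
: surrogate? ( code -- ? ) HEX: D800 HEX: DFFF between? ;

A Unicode-perfect language shouldn't force programmers to write anything like this.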

One prickly issue is normalization. According to the Unicode standard, "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct." Basically, this means that if the canonical decomposed normalization form (NFD) of two strings is equal, the strings should be treated as equal. Now, if a language is going to be Unicode-perfect, it has to let programmers not care about this. The easiest way to handle this is to put all strings in NFD internally, when strings are created and read from streams. (On output, strings should be converted to NFC so the greatest number of programs can process them.) But the only way to ensure that strings stay in NFD while inside the language is to make them immutable. Otherwise a user may, for example, insert a precomposed character into a string, making it no longer NFD. This might be how it's done in Java or Python or other languages with immutable strings, but I'm not sure. In Factor, if this approach were taken, many changes would have to be made, but they would all be internal. All string mutations would take place in sbufs, and >string would normalize. This requires a fast normalization algorithm, but I think that's possible, since in many cases, normalization is a no-op.
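To put it in Factor terms: given some word nfd that puts a string into NFD (no such word exists yet; the name is just for this sketch), the equality the standard asks for looks like this:

! Two strings are canonical-equivalent when their NFD forms are equal.
: nfd= ( str1 str2 -- ? ) [ nfd ] 2apply = ;

The point of normalizing everything on the way in is that plain = already behaves like nfd=, so nobody ever has to write or remember a word like this.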

The other option, which is very undesirable, is to not put strings under any sort of normalization and make the programmer put them in a normalized form. This is very problematic, though. For one, normalization only changes behavior in a few edge cases, but it requires time to operate, so careless programmers may be tempted to skip it. Another problem is that it breaks abstraction: most programmers don't know that NFD exists, and they shouldn't have to. The third problem is where to place the normalization operation. One option is immediately after operations which construct strings. Another option is immediately before operations which rely on equality of strings. It may be tempting to do something like : string-map map normalize ; inline or : string= [ normalize ] 2apply = ; to make these options easier, but there are far too many words which use strings to make this practical.

Though it may be difficult, it is not impossible for a programming language to be Unicode-perfect. It is my goal to achieve this for Factor.

Update: Funny, it turns out some people actually read this blog. Thanks for all your comments, though I'd like it if people wrote their name or a suitable pseudonym when commenting.

An anonymous commenter suggested that C under Plan 9 might be Unicode-perfect, since the OS and all libraries use Unicode to the core. That's definitely a possibility, though there are two possible problems: first, a stupid programmer could manipulate strings without the new library that allows UTF-8 strings, and second, programs written this way aren't cross-platform. Still, neither of these kills its utility.

Schlenk suggested that Tcl(/Tk), while not perfect, might be good. A cursory glance at it makes it appear that Unicode can be completely ignored by the programmer. This leads to a different issue: how does the programmer specify input/output encodings? But I'm sure they have this worked out somehow. Another thing is that many of the websites I read about Tcl and Unicode contained common fallacies, like the idea that Unicode is a 16-bit encoding (the code space actually needs 21 bits), or that a UTF-8 sequence can contain up to 6 octets encoding a 32-bit character (a well-formed UTF-8 sequence contains at most 4 octets, encoding a 21-bit code point). But these errors aren't necessarily reflected in Tcl's implementation.
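If you want to check those numbers yourself, here's some arithmetic you can paste into the Factor listener (it has nothing to do with Tcl):

! The largest code point is HEX: 10FFFF, which fits in 21 bits, and a
! 4-octet UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) has
! 3 + 6 + 6 + 6 = 21 payload bits.
HEX: 10FFFF 1 21 shift < .   ! prints t
3 6 6 6 + + + .              ! prints 21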

For Factor .89 (Factor .88 should come out in a few days), Slava has promised to make strings immutable and to make the conversion from an sbuf to another sequence normalize. Operations like append and map, when operating on strings to create strings, will first form an sbuf. The sbuf will be converted to a string, normalizing it. This will make all strings normalized (I chose the NFD normalization form) unless you do something deliberately evil like [ code-pushing-a-precomposed-char ] { } make >string. It should be noted that for most output operations, strings will be converted to NFC form (except for Mac OS pathnames).
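As a rough sketch of where the normalization would hook in, imagine something like the following, where (>string) stands for the raw sbuf-to-string conversion and nfd for the normalizer (both names are made up for this sketch, not actual code from Factor or my library):

! The sbuf-to-string conversion is the single place where NFD happens.
: >string ( sbuf -- str ) (>string) nfd ;

Since append, map and friends build their result in an sbuf first, normalization happens exactly once per string, which is why having a fast (and usually no-op) normalizer matters.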

Saturday, February 10, 2007

Doing Unicode right, part 1

When I found out that Factor didn't really support Unicode, I decided to implement it myself. One goal in this is to do Unicode right, where other programming languages had terrible bugs, exposing programmers to surrogate pairs, casing complications and other quirks of Unicode. My other goal is to spend absolutely no money in doing it. Unicode is a free standard, so it should not cost anything to get the necessary information about it. I started to read Unicode Explained every time I went to Barnes and Noble, so I could figure out how to accomplish it. This shouldn't be too hard, should it? It's only a character encoding...

It wasn't hard to get a finite state machine working to decode UTF-8 and UTF-16, and it wasn't hard to write an encoder either, using make (though I may refactor that out). But then, I decided to do case conversion, opening a can of worms. After a bit of research (it really doesn't say anything on the internet on this outside of the standard itself) I found out that unicode.org hosts a web-accessable text database of Unicode characters in a file called UnicodeData.txt, including their upper and lower forms. Since I was doing Unicode right and everything, I decided to use Unicode 5.0, not 2.0 like most things do. And to make transition easy, I got all data directly from these files without any code generation (as I did for libs/xml/char-classes.factor) and with minimal hard coding.
It wasn't hard to get a finite state machine working to decode UTF-8 and UTF-16, and it wasn't hard to write an encoder either, using make (though I may refactor that out). But then, I decided to do case conversion, opening a can of worms. After a bit of research (there really isn't much on the internet about this outside of the standard itself) I found out that unicode.org hosts a web-accessible text database of Unicode characters in a file called UnicodeData.txt, including their uppercase and lowercase forms. Since I was doing Unicode right and everything, I decided to use Unicode 5.0, not 2.0 like most things do. And to make the transition easy, I got all data directly from these files without any code generation (as I did for libs/xml/char-classes.factor) and with minimal hard coding.
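To give an idea of what a conditional mapping looks like, here's a toy sketch of just the sigma rule, assuming the caller already knows whether the character sits at the end of a word (detecting that boundary is the hard part, and none of this is the actual code in libs/unicode):

! Capital sigma lowercases to final sigma (U+03C2) at the end of a
! word and to the ordinary small sigma (U+03C3) everywhere else.
: lower-sigma ( final? -- ch ) HEX: 3C2 HEX: 3C3 ? ;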

I would look at the Unicode standard itself to resolve these issues (there's not exactly a big Unicode developer community I can tap into), but the most recent version available online is 4.0, and the format of SpecialCasing.txt is incompatible. The website lists the table of contents and says that links to PDFs will be added as soon as the standard is put up on the website. But it's been two months since the book was released, and these are computer people: they should be capable of getting it on the website. Buying the book (or having someone else buy it for me) is out of the question; that would not be in keeping with my second goal.

To see my progress so far, look at libs/unicode in a Factor distribution. In the next update, I'll discuss Unicode equivalence forms and why they make no sense at all.

Friday, February 2, 2007

More string parsing?

While I was writing my XML parser, I swore that I would never again write anything that involved string parsing. I quickly broke that promise, and the next two libraries I wrote were a CSV (comma-separated values) parser and a UTF-8/16 encoder/decoder. The good thing about these is that they'll be useful. I'm planning on using the CSV parser, for example, in fixing a complicated spreadsheet for a volunteer association I'm in (the version of Excel I have exports to CSV but not XML). Both the Unicode encodings and CSV are relatively simple, but I implemented the two in different ways. For CSV, I used the imperative parsing model that I described in previous blog entries. To make this cleaner, I split off a new module, libs/state-parser, accompanied by better documentation.

This way, I was able to build a working program, but the problem with this approach is that it is imperative. This makes it so that the order in which words appear matters not only in terms of stack effect but also side effect. You might think that in a language like Factor, where flow of control is extremely easy to determine, this is not a problem. However, it is. In this particular case, I needed to look at two characters to determine what to do next: the current character and the previous character. To look at both of these requires a state change, one which I don't always want to happen. It took a while to fully diagnose this issue, and I tried about 5 different ways until I found one which fixed it.

I implemented Unicode differently. For this, I used a design pattern known as a sentinel, which helps me cross-cut pointcutting concerns by instantiating objects which encapsulate the state of the parser [Enough OO jargon for you?]. I never mutate these, and the program is purely functional except for the use of make (which could trivially be changed into map [ ] subset, sacrificing efficiency and some terseness but making it functional). Writing this code was practically error-free, and the errors that existed could easily be diagnosed and fixed. I always sorta knew in my mind that it's better to write programs in a functional style than an imperative style, but this really proved it to me. Here's the code for decoding UTF-16LE.

! Decoder states: "new" means we're at the start of a code unit,
! "double" holds the first (low) byte of a code unit, and quad2/quad3
! hold a partially assembled surrogate pair.
TUPLE: new ;
TUPLE: double val ;
TUPLE: quad2 val ;
TUPLE: quad3 val ;

! Replacement character emitted for ill-formed input
: bad-char CHAR: ? ;

GENERIC: (utf16le) ( char state -- state )
M: new (utf16le)
    ! first (low) byte of a code unit: just remember it
    drop <double> ;
M: double (utf16le)
    ! second (high) byte: 11011xxx marks a surrogate; otherwise
    ! combine the two bytes and emit the code point
    over -3 shift BIN: 11011 = [
        ! leading surrogates have bit 2 clear; a stray trailing
        ! surrogate is an error
        over BIN: 100 bitand 0 =
        [ double-val swap BIN: 11 bitand 8 shift bitor <quad2> ]
        [ 2drop bad-char , <new> ] if
    ] [ double-val swap 8 shift bitor , <new> ] if ;
M: quad2 (utf16le)
    ! low byte of the trailing surrogate
    quad2-val 10 shift bitor <quad3> ;
M: quad3 (utf16le)
    ! high byte of the trailing surrogate: check the 110111xx marker,
    ! then combine the 20 payload bits and offset by HEX: 10000
    over -2 shift BIN: 110111 = [
        swap BIN: 11 bitand 8 shift
        swap quad3-val bitor HEX: 10000 + , <new>
    ] [ 2drop bad-char , <new> ] if ;

: utf16le ( state bytes -- state chars )
    ! the decoded code points are collected by , inside make
    [ [ swap (utf16le) ] each ] { } make ;
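Here's one way to try it out in the listener (this example isn't part of the library; the four bytes are the UTF-16LE encoding of U+1D11E, a character outside the BMP, so it exercises the surrogate-pair branch):

! Start in the "new" state, feed the bytes, drop the final state and
! print the decoded code points.
<new> { HEX: 34 HEX: D8 HEX: 1E HEX: DD } utf16le nip .

That should print { 119070 }, which is just HEX: 1D11E written in decimal.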

If you're still reading my blog at this point, I'm genuinely impressed that you put up with such a boring series of posts.