Tuesday, February 20, 2007

Doing Unicode right, part 2

Note: this post won't have anything to do with what I said I'd write in my next post.

Previously, I was a bit vague about what I wanted from my Unicode library. I'm going to invent a new term for what I want, since existing terms don't match it: Factor should be a Unicode-perfect language. "Unicode-perfect" only gets 1 relevant Google hit, so I think I'm safe in choosing this term. Currently, there are no Unicode-perfect programming languages, though there have been many attempts. A programming language is Unicode-perfect if:

  1. There are correct I/O routines for dealing with Unicode (this is the easy part).

  2. All programs written in the language (that don't have obvious bugs or purposely look at things at a low level) are Unicode-conformant, as defined by the Unicode standard. It shouldn't take any effort at all for the programmer to make their programs this way.

  3. All scripts defined by Unicode are correctly processed (though not necessarily displayed) with no special handling by the programmer. This includes scripts outside the Basic Multilingual Plane (BMP) and scripts added in the most recent version of Unicode.

  4. Using Unicode should cause no significant performance penalty.

A side note: it's funny that most Unicode libraries and programming languages talk about "Unicode support" rather than "Unicode conformance." This is probably because Unicode support is not defined anywhere. Any program that has UTF-8 or UTF-16 I/O (even if it's buggy) can claim "Unicode support", but few things actually conform to the standard. Conformance is a goal because it lets programs have predictably correct behavior in all supported languages. Conformance does not require that everything is supported. It only requires that everything claimed to be supported actually is. For example, I do not plan to support word breaking properties, but as long as I don't claim to do so, my programs can still be Unicode-conformant.

It is possible to process Unicode properly in any Turing-complete programming language. But will developers want to do it? Say a programmer is using C to write a program which they initially believe will only be used with ASCII. For efficiency and ease of writing, they use a null-terminated char * to represent the string. If the programmer later decides to use Unicode, they have a number of decisions to make: which Unicode library will they use? What encoding will be used internally? Additionally, all existing operations used for processing strings will have to be switched to the new Unicode version.

Now I can hear all you Pythonistas, Perl Monks and Java and C# programmers saying, "My language is Unicode-perfect!" No it's not. You're wrong, and it's not just about lack of support for obscure scripts. None of those languages really hides the encoding of strings from the programmer. If you index a string, an operation available and commonly used in all of those languages, the result may be half of a surrogate pair or a single UTF-8 octet rather than a character. One solution to this problem is to store all strings in UTF-32, so all code points can be represented directly. Another solution is to make strings 8-bit when they contain only characters with code points U+0000..U+00FF and 32-bit otherwise, though this becomes complicated when strings are mutable, as they are in Factor.
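To make the indexing problem concrete, here is a small Python sketch (Python only for illustration, not Factor) that looks at one string through UTF-8, UTF-16 and UTF-32. The sample character and the helper name code_units are my own choices for the example. Indexing by code unit only lines up with indexing by character in the UTF-32 case.

```python
# A two-character string: 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF,
# which lies outside the Basic Multilingual Plane.
s = "a\U0001D11E"

def code_units(data, width):
    """Split encoded bytes into big-endian code units of `width` bytes each."""
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]

# The same two code points take a different number of code units per encoding.
utf8_units = code_units(s.encode("utf-8"), 1)       # 1 + 4 octets
utf16_units = code_units(s.encode("utf-16-be"), 2)  # 1 unit + a surrogate pair
utf32_units = code_units(s.encode("utf-32-be"), 4)  # exactly one unit per code point

# Indexing the UTF-16 units at position 1 yields a lone high surrogate,
# not a character.
second_utf16_unit = utf16_units[1]
```

Only the UTF-32 list has the same length as the number of code points, which is why a fixed 32-bit representation lets indexing return whole characters.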

One prickly issue is normalization. According to the Unicode standard, "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct." Basically, this means that if the canonical decomposed normalization forms (NFD) of two strings are equal, the strings should be treated as equal. Now, if a language is going to be Unicode-perfect, it has to let programmers not care about this. The easiest way to handle this is to put all strings in NFD internally, when strings are created and read from streams. (On output, strings should be converted to NFC so the greatest number of programs can process them.) But the only way to ensure that strings stay in NFD while inside the language is to make them immutable. Otherwise a user may, for example, insert a precomposed character into a string, making it no longer in NFD. This might be how it's done in Java or Python or other languages with immutable strings, but I'm not sure. In Factor, if this approach were to be taken, many changes would have to be made, but they would all be internal. All string mutations would take place in sbufs, and >string would normalize. This requires a fast normalization algorithm, but I think that's possible, since in many cases, normalization is a no-op.
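As a rough illustration of canonical equivalence, the following Python sketch uses the standard unicodedata module. The function names to_internal and to_output are hypothetical, chosen to mirror the "NFD inside, NFC on output" idea; they are not part of Factor or of any library.

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Compared code point by code point, the two spellings differ...
raw_equal = precomposed == decomposed

def to_internal(text):
    """Hypothetical entry point: normalize to NFD when a string is created or read."""
    return unicodedata.normalize("NFD", text)

def to_output(text):
    """Hypothetical exit point: normalize to NFC just before writing to a stream."""
    return unicodedata.normalize("NFC", text)

# ...but once both are in NFD, plain equality gives the conformant answer.
internal_equal = to_internal(precomposed) == to_internal(decomposed)
```

If every string passes through something like to_internal on the way in, ordinary comparison inside the language never has to think about normalization again.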

The other option, which is very undesirable, is to not keep strings in any normalized form and make the programmer normalize them explicitly. This is very problematic, though. For one, normalization only changes behavior in a few edge cases, but it takes time to run, so careless programmers may be tempted to skip it. Another problem is that it breaks abstraction: most programmers don't know that NFD exists, and they shouldn't have to. The third problem is where to place the normalization operation. One option is immediately after operations which construct strings. Another option is immediately before operations which rely on equality of strings. It may be tempting to do something like : string-map map normalize ; inline or : string= [ normalize ] 2apply = ; to make these options easier, but there are far too many words which use strings to make this practical.
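Here is a sketch in Python of the "normalize right before comparing" option, and of how easily it leaks. The helper normalized_equal is a hypothetical analogue of the string= word above, and the dictionary is just made-up sample data.

```python
import unicodedata

def normalized_equal(a, b):
    """Compare two strings after putting both in NFD (analogue of string=)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

stored_name = "caf\u00e9"    # precomposed e-acute
typed_name = "cafe\u0301"    # 'e' plus combining acute accent

# The wrapped comparison behaves correctly...
explicit_check = normalized_equal(stored_name, typed_name)

# ...but any operation that was not wrapped, such as a dictionary lookup,
# silently falls back to comparing raw code points.
prices = {stored_name: 3}
found_without_normalizing = typed_name in prices
```

Every lookup, search and sort would need its own normalizing wrapper, which is the impracticality described above.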

Though it may be difficult, it is not impossible for a programming language to be Unicode-perfect. It is my goal to achieve this for Factor.

Update: Funny, it turns out some people actually read this blog. Thanks for all your comments, though I'd like it if people wrote their name or a suitable pseudonym when commenting.

An anonymous commenter suggested that C under Plan 9 might be Unicode-perfect, since the OS and all libraries use Unicode to the core. That's definitely a possibility, though there are two possible problems: 1, a stupid programmer could manipulate strings without the new library that allows UTF-8 strings, and 2, programs written this way aren't cross-platform. Still, neither of these kills its utility.

Schlenk suggested that Tcl(/Tk), while not perfect, might be good. A cursory glance at it makes it appear that Unicode can be completely ignored by the programmer. This leads to a different issue: how does the programmer specify input/output encodings? But I'm sure they have this worked out somehow. Another thing is that many of the websites that I read about Tcl and Unicode contained common fallacies, such as that Unicode is a 16-bit encoding (it is 21-bit), or that a UTF-8 sequence can contain up to 6 octets, specifying a 32-bit character (it can only contain 4 octets, which stand for a 21-bit character). But these errors aren't necessarily reflected in Tcl's implementation.
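The 21-bit and 4-octet figures can be checked with a few lines of Python. This is only a sanity check against Python's own UTF-8 encoder; utf8_length is a helper name made up for this example.

```python
# The highest code point Unicode defines is U+10FFFF.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

def utf8_length(code_point):
    """Number of octets UTF-8 uses for one code point (surrogates excluded)."""
    return len(chr(code_point).encode("utf-8"))

# The largest code point that fits in 1, 2, 3 and 4 octets respectively.
lengths = [utf8_length(cp) for cp in (0x7F, 0x7FF, 0xFFFF, 0x10FFFF)]

# The longest possible sequence is the encoding of U+10FFFF itself.
longest_utf8 = chr(MAX_CODE_POINT).encode("utf-8")
```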

For Factor .89 (Factor .88 should come out in a few days), Slava has promised to make strings immutable, and to make conversion from an sbuf to a string normalize. Operations like append and map, when operating on strings to create strings, will first form an sbuf. The sbuf will then be converted to a string, normalizing it. This will make all strings normalized (I chose the NFD normalization form) unless you do something deliberately evil like [ code-pushing-a-precomposed-char ] { } make >string. It should be noted that for most output operations, strings will be converted to NFC form (except for Mac OS pathnames).
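The plan can be sketched roughly as follows in Python. This is not Factor's implementation: StringBuffer is a hypothetical stand-in for an sbuf, to_string stands in for >string, and normalizing_map is a made-up analogue of map on strings.

```python
import unicodedata

class StringBuffer:
    """Hypothetical stand-in for a Factor sbuf: a mutable sequence of characters."""

    def __init__(self):
        self._chars = []

    def push(self, ch):
        # Mutation is allowed here, so the contents may not be normalized.
        self._chars.append(ch)

    def to_string(self):
        # The only way to get an (immutable) string out is through NFD.
        return unicodedata.normalize("NFD", "".join(self._chars))

def normalizing_map(func, text):
    """Apply func to each character, building the result in a buffer first."""
    buf = StringBuffer()
    for ch in text:
        buf.push(func(ch))
    return buf.to_string()

# Mapping 'e' to a precomposed e-acute still yields an NFD string.
result = normalizing_map(lambda ch: "\u00e9" if ch == "e" else ch, "hello")
```

Because the buffer is the only mutable object and its conversion always normalizes, code that just calls the mapping function never sees an unnormalized string.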


Anonymous said...

Would it be too bold to say that C under Plan 9 is Unicode perfect?

Daniel Ehrenberg said...

Hmm, I didn't know about that; that's cool. So I guess it is. I was probably overstating things when I said there is no Unicode-perfect language. Still, C isn't Unicode-perfect in a cross-platform way (though it wouldn't make any sense to make it that way).

schlenk said...

Did you ever look at Tcl's Unicode handling? It's not perfect but very smooth for nearly any operation.

Anonymous said...

Well, C under Plan 9 isn't very cross-platform anyway...

Here's an article describing their work:



All programs in Plan 9 now read and write text as UTF, not ASCII. ..."