Saturday, August 25, 2007

The R5.97RS Unicode library is broken

It's a bit late for me to be writing this, with R6RS all but ratified, but there are serious problems with the string base library and associated Unicode library included in R5.97RS. For all but the most trivial internationalized applications, an external library will still have to be used.

Encoding

The way I read it, only fixed-width encodings for Unicode are allowed. This means that UTF-32, or (with a limited repertoire) the outdated 16-bit UCS-2 should be used. This is an odd contrast with the Scheme tradition of being agnostic towards implementation details. Still, it wouldn't be totally unreasonable if it were stated in the standard. Instead it is merely implied. In section 11.12:

Returns the number of characters in the given string as an exact integer object.

(string-ref string k) procedure

K must be a valid index of string. The string-ref procedure returns character

k of string using zero-origin indexing.

    Note: Implementors should make string-ref run in constant time.

Since this runs in constant time, rather than linear time, only basic array indexing (through pointer arithmetic, ultimately) can be used in string-ref. This immediately eliminates a UTF-8 encoding, a marginally popular choice in programming languages, as it uses anywhere from one to four bytes to represent a character.

A more popular encoding for Unicode strings in memory is UTF-16. This puts a Unicode code point (character) in either two or four bytes. All but a few uncommon characters can be represented in two bytes. For the remaining ones, two chunks of two bytes are used, called surrogates. Alone, they mean nothing. The Unicode standard states specifically that they should not be falsely interpreted as characters. But the present R6RS draft does not mention anything about returning surrogates for the rare instances where high code points are used. So with UTF-16, string-ref either sometimes returns a surrogate, or it takes linear time. Java takes the former route: In Java's documentation for the method String.charAt(), there is an explicit warning, "If the char value specified by the index is a surrogate, the surrogate value is returned." But R6RS is silent on this, so we can only assume that UTF-16 is out.

This is almost certainly not what they meant, though. UTF-16 is used in almost all languages which "support Unicode." Larceny Scheme, which is being used for the reference implementation, can currently be compiled to use one of a number of fixed width encodings. So UTF-16 isn't universal in Scheme Unicode implementations, but it shouldn't be ruled out.

I should clarify that, since R6RS says 'should', there is no absolute requirement of constant time in string-ref, but to follow this recommendation would disallow using most (though maybe not all) fixed-width encodings.

Case conversion

A string is of course a sequence of characters. However, many Unicode string operations (for example, case conversion) cannot be viewed as simply applying the operation to each character. With ASCII, it's OK to define this (assuming string-map exists, which it doesn't in R6RS):

(define (string-upcase string)
(string-map char-upcase string))

When considering all of Unicode, that definition is no longer accurate. For example, the lower case of Σ depends whether it is placed at a word boundary, and the upper case of ß expands to two characters, SS.

Despite these conditions, the R6RS Unicode library defines case conversion functions which operate on characters: char-upcase, char-downcase, etc. Their behavior in these exceptional cases is defined only by example, though it is difficult to extend the example to all cases without peeking at the reference implementation. More importantly, it doesn't seem very useful, since correctly behaving operations on strings (string-upcase etc.) are also provided. There is no good reason for these functions on characters to exist.

[As a side note, I have to admit here that my implementation of titlecase in extra/unicode in the Factor distribution is flawed, because I don't properly detect word boundaries. But this will be fixed eventually.]

Collation and matching

This is the area where R6RS fails in the worst way. The words included in R6RS for comparing strings operate on code point order. Basically, this means they are useless. At least I think it's code point order; the standard only says "These procedures [char< and friends] impose a total ordering on the set of characters according to their Unicode scalar values." The Unicode standard doesn't talk about "scalar values," but I think they mean code point values here.

Unicode Technical Standard #10 defines a complex, customizable and linguistically accurate collation algorithm is defined. I'll write about this in more detail in the near future, but here's the basic outline. (I've simplified a lot here; comment if you have any specific questions.) Each Unicode code point has a three-level collation weight, and the individual weights may be zero. Using the collation weights of the individual characters in the string, a sort key is assembled: first, the first level weights of each character is appended to the (initially empty) sort key (except for the zeros, which are omitted), then a zero, then the second level weights (except zeros) and finally the third level weights. This makes it so that some differences in strings, like accent marks, can be ignored at the first level, and others, like capitalization, can be ignored at both the first and the second levels. In different locales, the table of collation weights is customized, or tailored (see below).

The Unicode Collation Algorithm is not trivial to implement, but it's far more useful than code point order, which R6RS uses. The Unicode code points may sometimes happen to be in the correct order, but in general, there is no way to order the code points to create a linguistically correct collation order. This is because a series of tie breakers are needed. To steal an example from the UTS, "cab" < "Cab" < "cáb" < "dab". Any non-toy application needs to know this.

An unexpectedly related problem is weak comparison. What do you do if you want to compare two strings for equality, but you don't care about the case? Or the accent marks? Or the spaces and punctuation? The answer is, just compare the first level, or the first two levels, of the collation key as described above.

But R6RS has another idea, which is much less useful and general, an idea which occurred to me, too, before I realized how primitive it was: perform a case fold on both strings being compared. In total, there are ten functions which perform a case fold on characters or strings before performing a comparison (by code point order!) on them. These could be used for a case-insensitive comparison. But there are many other things that need to be ignored, too, and case folding can only be used in very limited situations.

Locales

The good folks at the Unicode Consortium did their best to define algorithms which would work equally well across all locales. But for some things, it's just impossible, and different applications need different customizations of the algorithms. For example, in case conversion, the Turkish, Azeri and Lithuanian locales require special handling of the dot over i and j. Another place where locales are needed are in collation, where a specific tailoring is nearly always needed to get a useful order. If elements of a sorted list are not in the order users expect, users may fail to find the items they are looking for. And this order differs between countries and contexts. Locales are also needed to find word boundaries in South East Asian languages, where spaces are not used.

Locales are best implemented as a dynamically scoped variable, but, failing that language feature, a global variable would suffice. This can hold a symbol which other functions access to determine their precise behavior. Because the case conversion functions do not reference a locale, they will have to be reimplemented in another library to work for all users.


How did this happen?

I have no idea. Apparently, an SRFI written by Matthew Flatt and Marc Feeley was available in July 2005. The mailing list archives are here. But the people on the R6RS committee aren't stupid, and there was extensive discussion on the R6RS mailing list about Unicode. People talked a lot about different encodings, but apparently none of that went into the final draft, and we are left with the encodings problem I mentioned above.

So is all hope lost? Is R6RS going to destroy the Unicode hopes of thousands of innocent Schemers, never to be revived? Of course not. While there may be some useless core functions, and there may be a confused standard library, the R6RS module system makes it possible to define a new, completely portable library for Unicode that's actually useful. But, broader than this, the best approach would be to create an abstract, high level API for dealing with Unicode. This API could be agnostic as to the encoding of the implementation and other low-level design decisions, but it could include the many different Unicode processing operations complex programs need. There could be several implementations, in the form of R6RS modules, of this API.

In general, it seems pretty inappropriate for Schemers to depend on RnRS to provide their Unicode library. Why not include, in R7RS, a standard GUI library or standard web server? These sorts of excursions will always lead to issues. Unicode is a specific problem for which there are many approaches.

Update: Fixed minor errors, and a major error where I accused the SRFI mailing list of being inactive and short lived, when it was actually active and open for some time. Also, removed the assertion that R5RS suffices for trivial international applications.

Reading the SRFI 75 mailing list archives, it looks like nearly all of my points were brought up. Yet, strangely, they did not lead to many helpful changes in the specification, at least not enough to avoid the above issues. It can be argued that collation by the Unicode scalar value is useful for a few things, such as implementing a binary tree of strings, which needs an arbitrary comparison; I should have mentioned that in the above article.

I wrote a letter to the r6rs-discuss mailing list about these problems, and one more (the possibility of limited repertoires). William D Clinger responded with a very reasonable analysis of the issues. He says that it's possible for a non-fixed-width encoding of Unicode to have O(1) string-ref (amortized time), which might be something to look into for Factor in the future. In the end, I guess it's not completely unreasonable that the R6RS folks went for a conservative standardization, and included some questionable functions (like char-upcase) for backwards compatibility. But the case-folded comparison functions still make no sense to me.

I'm glad Factor is a young, single-implementation language right now, because we don't really have to worry about anything like backwards compatibility, or providing a minimal standard which is easy to implement. Of course, we want a simple language, but we have the freedom to make whatever kind of Unicode library we want.

7 comments:

Jens Axel Søgaard said...

About the encoding. Your interpretation hinges on the meaning of "should". Here is what RFC 2119 has to say:

3. SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

Unknown said...

Thanks Jens, I forgot about the official, formal meanings of words like should and must. So I should change my statement about UTF-32 to, "Only UTF-32 should be used", rather than, "Only UTF-32 can be used," as it amounted to before. But the problem still remains.

Paul Prescod said...

Can you make another update? The spec implies that only UTF-32 should be used for strings that contain non-BMP characters.

Paul Prescod said...

With respect to collation, the Unicode specification you point to says the following:

"There are many different ways to compare strings, and the Unicode Standard does not restrict the ways in which implementations can do this. However, any Unicode-conformant implementation that purports to implement the Unicode Collation Algorithm must do so as described in this document."

There are a variety of reasons that a language might choose to implement a more technical, less lingustically-oriented sort order.

Here are some other quotes from the spec:

"Implementers should be aware that using different versions of the UCA, as well as different versions of the Unicode Standard, could result in different collation results of their data. "

"Collation order is not preserved under concatenation or substring operations, in general."

"Collation is not a property of strings. Consider a list of cities, with each city correctly tagged with its language. Despite this, a German user will expect to see the cities all sorted according to German order, and not expect to see a word with ö appear after z, simply because the city has a Swedish name."

"In practice, there are additional features of collation that users need control over, which are expressed in user-interfaces and eventually in APIs."

I really do not think that it is appropriate that the most basic string comparison function in a language should need to take all of this into account.

Such a complex API should be expressed in a third party library -- at least at first.

Ultimately, your whole post depends on the premise of the second sentence that "serious internationalized applications should not require a third-party library." That premise is arguable. R6RS provides a basis for basic Unicode handling and a platform for more sophisticated lingustically-correct handling.

Despite your incorrect assertion, R5RS was a poor platform because one was not guaranteed to be able to even represent Unicode characters as characters.

I would encourage you to update this post significantly and change the title to "R5.97RS Unicode library is not complete." That's a much more honest appraisal.

Anonymous said...

May I ask a simple question: what languages can you list with Unicode libraries that are not "broken" according to your definition here. Languages that get the default sort right, that neither mandate nor suggest performance characteristics for string implementations, etc.

Unknown said...

Anonymous,
The Java Unicode library doesn't abstract everything away from the programmer, but I think it's pretty good (and I'm no Java advocate in general). Java processes locales properly, which is the main place where I think the R6RS library is objectively broken, not just poorly planned. And it supports the Unicode Collation Algorithm (though not with <, and it also supports binary comparison). It uses UTF-16 and requires its library users to be aware of it somewhat, but I don't think that makes it extremely "broken", just not perfect.

It's appropriate for Java (with a single, standard implementation and a few of clones based on that) to choose a specific encoding, but that's inappropriate for the Scheme standard. Scheme is an abstract language, and implementors should have a wide range of implementation choices, not just UTF-32. There's no reason for the Scheme standard to mandate or strongly suggest performance characteristics of something this specific when the order of evaluation of arguments is still unspecified.

I haven't looked into the C#/.NET Unicode support, but I assume it's similar to Java's. We users of functional programming languages don't like thinking that Java or C# might do some things better than Scheme, but in fact they do.

Unknown said...

Reviewer dude,
I changed my assertion about R5RS. But your argument about collation seems to imply that it's appropriate for a language with "Unicode support" to rely on third-party libraries for basic operations, even if they are complex. Unicode is complex, I'm aware, but if you want to support it, there's no point in going the R6RS route. It doesn't provide a very good basis for operations since any correct implementation of it hides details which would be useful to implementations of more complex operations. To accurately support locales and to do appropriate selective comparisons, much of the library will just be ignored. There are many ways to make a Unicode library simple, but the goal here was to create something useful that you could depend on across implementations.