The way I read it, only fixed-width encodings for Unicode are allowed. This means that UTF-32, or (with a limited repertoire) the outdated 16-bit UCS-2 should be used. This is an odd contrast with the Scheme tradition of being agnostic towards implementation details. Still, it wouldn't be totally unreasonable if it were stated in the standard. Instead it is merely implied. In section 11.12:
Returns the number of characters in the given string as an exact integer object.
(string-ref string k)procedure
K must be a valid index of string. The string-ref procedure returns character
k of string using zero-origin indexing.
Note: Implementors should make string-ref run in constant time.
Since this runs in constant time, rather than linear time, only basic array indexing (through pointer arithmetic, ultimately) can be used in
string-ref. This immediately eliminates a UTF-8 encoding, a marginally popular choice in programming languages, as it uses anywhere from one to four bytes to represent a character.
A more popular encoding for Unicode strings in memory is UTF-16. This puts a Unicode code point (character) in either two or four bytes. All but a few uncommon characters can be represented in two bytes. For the remaining ones, two chunks of two bytes are used, called surrogates. Alone, they mean nothing. The Unicode standard states specifically that they should not be falsely interpreted as characters. But the present R6RS draft does not mention anything about returning surrogates for the rare instances where high code points are used. So with UTF-16,
string-ref either sometimes returns a surrogate, or it takes linear time. Java takes the former route: In Java's documentation for the method
String.charAt(), there is an explicit warning, "If the char value specified by the index is a surrogate, the surrogate value is returned." But R6RS is silent on this, so we can only assume that UTF-16 is out.
This is almost certainly not what they meant, though. UTF-16 is used in almost all languages which "support Unicode." Larceny Scheme, which is being used for the reference implementation, can currently be compiled to use one of a number of fixed width encodings. So UTF-16 isn't universal in Scheme Unicode implementations, but it shouldn't be ruled out.
I should clarify that, since R6RS says 'should', there is no absolute requirement of constant time in
string-ref, but to follow this recommendation would disallow using most (though maybe not all) fixed-width encodings.
A string is of course a sequence of characters. However, many Unicode string operations (for example, case conversion) cannot be viewed as simply applying the operation to each character. With ASCII, it's OK to define this (assuming string-map exists, which it doesn't in R6RS):
(define (string-upcase string)
(string-map char-upcase string))
When considering all of Unicode, that definition is no longer accurate. For example, the lower case of Σ depends whether it is placed at a word boundary, and the upper case of ß expands to two characters, SS.
Despite these conditions, the R6RS Unicode library defines case conversion functions which operate on characters:
char-downcase, etc. Their behavior in these exceptional cases is defined only by example, though it is difficult to extend the example to all cases without peeking at the reference implementation. More importantly, it doesn't seem very useful, since correctly behaving operations on strings (
string-upcase etc.) are also provided. There is no good reason for these functions on characters to exist.
[As a side note, I have to admit here that my implementation of titlecase in
extra/unicode in the Factor distribution is flawed, because I don't properly detect word boundaries. But this will be fixed eventually.]
Collation and matching
This is the area where R6RS fails in the worst way. The words included in R6RS for comparing strings operate on code point order. Basically, this means they are useless. At least I think it's code point order; the standard only says "These procedures [
char< and friends] impose a total ordering on the set of characters according to their Unicode scalar values." The Unicode standard doesn't talk about "scalar values," but I think they mean code point values here.
Unicode Technical Standard #10 defines a complex, customizable and linguistically accurate collation algorithm is defined. I'll write about this in more detail in the near future, but here's the basic outline. (I've simplified a lot here; comment if you have any specific questions.) Each Unicode code point has a three-level collation weight, and the individual weights may be zero. Using the collation weights of the individual characters in the string, a sort key is assembled: first, the first level weights of each character is appended to the (initially empty) sort key (except for the zeros, which are omitted), then a zero, then the second level weights (except zeros) and finally the third level weights. This makes it so that some differences in strings, like accent marks, can be ignored at the first level, and others, like capitalization, can be ignored at both the first and the second levels. In different locales, the table of collation weights is customized, or tailored (see below).
The Unicode Collation Algorithm is not trivial to implement, but it's far more useful than code point order, which R6RS uses. The Unicode code points may sometimes happen to be in the correct order, but in general, there is no way to order the code points to create a linguistically correct collation order. This is because a series of tie breakers are needed. To steal an example from the UTS, "cab" < "Cab" < "cáb" < "dab". Any non-toy application needs to know this.
An unexpectedly related problem is weak comparison. What do you do if you want to compare two strings for equality, but you don't care about the case? Or the accent marks? Or the spaces and punctuation? The answer is, just compare the first level, or the first two levels, of the collation key as described above.
But R6RS has another idea, which is much less useful and general, an idea which occurred to me, too, before I realized how primitive it was: perform a case fold on both strings being compared. In total, there are ten functions which perform a case fold on characters or strings before performing a comparison (by code point order!) on them. These could be used for a case-insensitive comparison. But there are many other things that need to be ignored, too, and case folding can only be used in very limited situations.
The good folks at the Unicode Consortium did their best to define algorithms which would work equally well across all locales. But for some things, it's just impossible, and different applications need different customizations of the algorithms. For example, in case conversion, the Turkish, Azeri and Lithuanian locales require special handling of the dot over i and j. Another place where locales are needed are in collation, where a specific tailoring is nearly always needed to get a useful order. If elements of a sorted list are not in the order users expect, users may fail to find the items they are looking for. And this order differs between countries and contexts. Locales are also needed to find word boundaries in South East Asian languages, where spaces are not used.
Locales are best implemented as a dynamically scoped variable, but, failing that language feature, a global variable would suffice. This can hold a symbol which other functions access to determine their precise behavior. Because the case conversion functions do not reference a locale, they will have to be reimplemented in another library to work for all users.
How did this happen?
I have no idea. Apparently, an SRFI written by Matthew Flatt and Marc Feeley was available in July 2005. The mailing list archives are here. But the people on the R6RS committee aren't stupid, and there was extensive discussion on the R6RS mailing list about Unicode. People talked a lot about different encodings, but apparently none of that went into the final draft, and we are left with the encodings problem I mentioned above.
So is all hope lost? Is R6RS going to destroy the Unicode hopes of thousands of innocent Schemers, never to be revived? Of course not. While there may be some useless core functions, and there may be a confused standard library, the R6RS module system makes it possible to define a new, completely portable library for Unicode that's actually useful. But, broader than this, the best approach would be to create an abstract, high level API for dealing with Unicode. This API could be agnostic as to the encoding of the implementation and other low-level design decisions, but it could include the many different Unicode processing operations complex programs need. There could be several implementations, in the form of R6RS modules, of this API.
In general, it seems pretty inappropriate for Schemers to depend on RnRS to provide their Unicode library. Why not include, in R7RS, a standard GUI library or standard web server? These sorts of excursions will always lead to issues. Unicode is a specific problem for which there are many approaches.
Update: Fixed minor errors, and a major error where I accused the SRFI mailing list of being inactive and short lived, when it was actually active and open for some time. Also, removed the assertion that R5RS suffices for trivial international applications.
Reading the SRFI 75 mailing list archives, it looks like nearly all of my points were brought up. Yet, strangely, they did not lead to many helpful changes in the specification, at least not enough to avoid the above issues. It can be argued that collation by the Unicode scalar value is useful for a few things, such as implementing a binary tree of strings, which needs an arbitrary comparison; I should have mentioned that in the above article.
I wrote a letter to the r6rs-discuss mailing list about these problems, and one more (the possibility of limited repertoires). William D Clinger responded with a very reasonable analysis of the issues. He says that it's possible for a non-fixed-width encoding of Unicode to have O(1)
string-ref (amortized time), which might be something to look into for Factor in the future. In the end, I guess it's not completely unreasonable that the R6RS folks went for a conservative standardization, and included some questionable functions (like
char-upcase) for backwards compatibility. But the case-folded comparison functions still make no sense to me.
I'm glad Factor is a young, single-implementation language right now, because we don't really have to worry about anything like backwards compatibility, or providing a minimal standard which is easy to implement. Of course, we want a simple language, but we have the freedom to make whatever kind of Unicode library we want.