So the time has come for me to leave for Carleton College. I'm taking a one-month break from blogging and programming so I can meet people rather than be inadvertently isolated. But I'll be back next month; don't leave forever! I'll be continuing my Unicode series, describing word, sentence and line breaking properties, as well as collation. I'll also try to write more basic Factor introductions, since I've received a few requests.
So, in the time that I'm gone, why not look at some of my older Unicode-related posts and introductory posts, if you haven't yet? The earlier ones aren't as good, though, I should warn you. You'll often see a warning message at the top of the posts, too. But I hope you enjoy them. If you have any questions about these, I'll still be responding to comments and emails.
So, over the next month, why not start your own blog? And if you already have one, and you haven't been posting for a while (you know who you are), why not maintain it? I never thought I could write a reasonable blog, or that anyone would read it if it were reasonable. But I just wrote about what I knew, and it turned out pretty well so far. In computer science, there are so many ideas out there which no one has written about, new ideas and old ones. If you're working on any sort of programming project, you probably have an interesting one in your head. I'd really enjoy reading about that idea, and I'm sure others would too.
So, I'll see you in October!
Update: I just got an on-campus job programming PHP and/or ECMAScript. This should be interesting...
Monday, August 27, 2007
Unicode implementer's guide part 4: grapheme breaking
I have to be honest: I lied. I oversimplified the semantics of grapheme cluster breaking, and I hope no one messed up any software over it. But luckily, as far as I know, no one is following this and implementing Unicode, so I didn't cause too much damage. Anyway, it seems that an outright, objective error doesn't inspire nearly as much rebuke as a mere controversial essay. (I like rebuke. Please rebuke me if I've made a false claim in this post.)
Unicode grapheme breaking, like everything else in Unicode, is unexpectedly complex. But, as always, there exists an efficient algorithm to process it.
What's a grapheme cluster?
Abstract Unicode strings, like the ones you're used to, are a series of characters. In Unicode parlance, these are often called code points, but I'll usually call them characters here. But these characters aren't what users think of when they hear "character." Some characters, such as accent marks, go on top of, or underneath, other characters in a horizontal text. Some characters, like \r and \n (CR and LF), often come one after another, signifying together just one thing: a line break.
Because of these and other complications, an additional concept of a "grapheme" is useful. A grapheme is a series of one or more code points that, together, forms what a user thinks of as a character. When you use your left and right arrow keys to scroll through text, you're iterating through graphemes, not characters. If you want to reverse a string, you reverse the graphemes, not the characters. These are just a couple of the many contexts where graphemes show up. Any self-respecting Unicode library needs to be able to detect grapheme cluster boundaries. (I'll use 'grapheme' interchangeably with 'grapheme cluster' for simplicity.)
The rules
The primary source for the relevant Unicode algorithm here is Unicode Standard Annex (UAX) #29: Text Boundaries. This describes not only grapheme boundaries but also word and sentence boundaries. Another related document is UAX 14: Line Breaking Properties. I will discuss all of these in a later post, but for now, I haven't even started implementing them. Grapheme cluster boundaries are described in terms of the junction or disjunction of adjacent characters. The core of the rules is described in section 3.1. I've simplified them down to ignore characters that won't be present in our NFD-normalized strings. In the below rules, × signifies that the two adjacent code points belong to one grapheme cluster, and ÷ indicates that there is a grapheme cluster break between those characters.
Break at the start and end of text.
GB1.   sot ÷
GB2.   ÷ eot
Do not break between a CR and LF. Otherwise, break before and after controls.
GB3.   CR × LF
GB4.   ( Control | CR | LF ) ÷
GB5.   ÷ ( Control | CR | LF )
Do not break Hangul syllable sequences.
GB6.   L × ( L | V )
GB7.   V × ( V | T )
GB8.   T × T
Do not break before extending characters.
GB9.   × Extend
Otherwise, break everywhere.
GB10.  Any ÷ Any
The character groups L, V, T, CR, LF, Extend, Control and Any (which I'll call grapheme classes, a name I invented) are defined in the UAX. Their intuitive definitions are simple, though. The Extend grapheme class encompasses all characters which are marked as extenders, and the Control grapheme class can be thought of as containing all control and line-breaking characters, except CR, LF, ZWNJ and ZWJ. (I'll discuss the last two later.) CR and LF are just that--the characters \r and \n, respectively. These need special handling. L consists of the initial (leading) jamo characters, V of the medial (vowel) jamo, and T of the final (trailing) jamo. Any is anything else. sot is the start of the text, and eot is the end. Note that, in the above rules, earlier rules override later rules.
A first shot
My first attempt at implementation didn't go so well. At first, I implemented grapheme cluster breaks without even glancing at the UAX. This was incredibly stupid. I acted like there were two rules, GB9 and GB10. And I misunderstood Extend to be the same as the group of non-starters, that is, characters with a non-zero combining class. Using this incorrect definition, which I repeated in several blog posts, I made the words next-grapheme and prev-grapheme. These words take an index and a string and locate the next or previous grapheme break. (In reality, prev-grapheme never worked, and I had insufficient unit tests, so I didn't catch this.)
This simple protocol gives you everything you need to iterate forwards and backwards through graphemes in strings. Using it, it is possible to define a word which generates an array of all of the graphemes in the string:
: (>graphemes) ( i str -- )
    2dup bounds-check? [
        dupd [ next-grapheme ] keep
        [ subseq , ] 2keep (>graphemes)
    ] [ 2drop ] if ;

: >graphemes ( str -- graphemes )
    [ 0 swap (>graphemes) ] { } make* ;
And with this, it's easy to reverse a string, preserving graphemes rather than breaking them up. If we didn't preserve graphemes when reversing strings, the resulting reversal would be incoherent and represent something different, with, for example, accent marks on the wrong character.
: string-reverse ( str -- rts )
    >graphemes reverse concat ;
Note that the concat here does not need to be normalizing, whereas normally, concat and append need to do a little bit of reordering to be properly normalized in NFD.
This interface, along with >graphemes and string-reverse, works independently of the bad implementation of next-grapheme. So, when I fixed that in the two following versions, I was able to retain the same code for >graphemes and string-reverse.
Fixing some bugs
So, my first version worked for my simple unit tests, but when I expanded testing, I soon found that it failed. This led to a much more complicated version of next-grapheme, which used mountains of conditionals to determine whether two adjacent characters conjoin into one grapheme.
Most of the grapheme class detection was simple, involving only character classes and simple ranges. But Extend is slightly more complicated. The Extend grapheme class is defined as code points with the property Grapheme_Extend, but there's a more direct definition. Looking in the description page for the Unicode Character Database (UCD), we see that Grapheme_Extend is defined as the characters in the categories Me and Mn, plus a few extras in Other_Grapheme_Extend. Other_Grapheme_Extend is defined in PropList.txt. It's only 21 characters, but rather than enter them manually, I decided to write a parser for PropList.txt so that updating to the next version of Unicode is easier. (Actually, I cut out only the relevant part of the file, because it will be included in the Factor distribution, and there's no sense in including thousands of lines of text that will not be used.)
There were a few problems: first, every character was tested for its grapheme breaking properties multiple times, where one time would have sufficed. Second, the code was structured in a way that required me to rewrite the whole thing for reverse traversal!
Looking at UAX 29, I saw several references to a 'table-based implementation', but there was no explanation of this. Confused, I wrote to the Unicode list (unicode@unicode.org, an active, public mailing list) asking for an explanation of this. The Unicode list is useful but a bit technical, and my only response was a pointer to the ICU implementation of breaking properties, the BreakIterator. Naively, I looked up the BreakIterator class in the ICU source code, imagining that the code governing it would be there, or in one of the classes that that file directly references. But after looking at maybe 10 files of C++, I gave up. I'm not smart enough for C++, or at least for these big projects.
Through the ICU documentation, I saw that a closely related problem was line breaking properties. Looking at the Unicode spec for that (UAX 14), I got an actual description of table-based implementations. The feeling was something between an epiphany and a realization of incredible stupidity.
Table-based implementation
Here's the basic idea of how to implement grapheme cluster breaks, the table-based way:
Step 1: Before runtime, create a table describing which grapheme classes connect and which grapheme classes disconnect.
Step 2: Between two grapheme classes, you can query the table to see if there's a grapheme cluster break.
Step 3: To detect grapheme clusters, iterate through a string and determine the grapheme class of each character, until the classes have a break between them.
The table is an 8x8 bitarray, where the rows represent the grapheme class of the character before the potential break and the columns represent the character after the potential break. Simply indexing it tells us whether there is a grapheme break. Once you have this table, and the facility to tell which grapheme class each character is in, the task is easy to write. In essence, we're making a finite state machine, where the state is completely determined by the previous character. For more details, you can check out my implementation.
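To make the lookup concrete, here is a small illustrative sketch in Python rather than Factor, with a drastically reduced set of grapheme classes (Any, Extend, CR, LF) and a toy classifier; in the real implementation, both the classification and the table come from the Unicode data files, and all names here are mine.

# Illustrative sketch (Python, not the Factor implementation): a precomputed
# break table indexed by the grapheme classes of two adjacent characters.
ANY, EXTEND, CR, LF = range(4)

def grapheme_class(ch):
    # Toy classifier; real code consults the Unicode Character Database.
    if ch == "\r":
        return CR
    if ch == "\n":
        return LF
    if ch in "\u0300\u0301\u0302\u0303":   # a few combining marks stand in for Extend
        return EXTEND
    return ANY

# break_table[before][after] is True when a grapheme cluster break falls
# between the two classes (GB3, GB4/GB5 and GB9 restricted to these classes).
break_table = [
    #         ANY    EXTEND CR     LF        (class after)
    [         True,  False, True,  True  ],  # class before: ANY
    [         True,  False, True,  True  ],  # EXTEND
    [         True,  True,  True,  False ],  # CR (CR x LF stays together)
    [         True,  True,  True,  True  ],  # LF
]

def graphemes(text):
    """Split text into grapheme clusters, using the table as a state machine."""
    clusters, current = [], ""
    for ch in text:
        if current and break_table[grapheme_class(current[-1])][grapheme_class(ch)]:
            clusters.append(current)
            current = ""
        current += ch
    if current:
        clusters.append(current)
    return clusters

print(graphemes("a\u0301b\r\nc"))   # four clusters: a + combining acute, b, CRLF, c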
Generating the table
But let's look at Step 1 in more detail, which is decidedly not as easy. How do we get from the rules to the table? Sure, you could generate the table manually, since it's not very large, but that process is annoying, error-prone and creates difficult-to-maintain code. A better way to go about it is to generate the table directly from the rules.
No, this doesn't require any fancy DSLs, just a simple algorithm with some utility functions. For this description, assume false does not equal zero. First, initialize a GxG array to false, where G is the number of grapheme classes. Then, define the operation connect to set the array at a point to 1 if it is currently false, and otherwise do nothing. Define the operation disconnect to set the array at a point to 0 if that point is currently false. These allow the table to give precedence to earlier operations and ignore overlapping later ones. Once we're done with all of the connects and disconnects, generate the final table by making all points which are 1 true, and all points which are 0 or false, false.
On top of this backing, I implemented a few functions. connect-before connects one class to a list of classes, to implement rules like "L × ( L | V )". connect-after connects a list of classes to a class, to implement rules like "× Extend" (whose 'list of classes' is all of the classes). break-around disconnects all junctions between the first list and the second list, to implement "÷ ( Control | CR | LF )" (using, again, all classes as one of the arguments). Once I had these functions, generating the table from the rules was a simple transliteration:
CR LF connect
{ Control CR LF } graphemes break-around
L { L V } connect-before
V { V T } connect-before
T T connect
graphemes Extend connect-after
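For comparison, here is a self-contained sketch of the same table-generation scheme, in Python rather than Factor. The helper names mirror the ones described above, and the three-valued cells (undecided, joined, split) are just one way to record the precedence of earlier rules over later ones.

# Sketch only: build the break table from the rules, never overwriting a cell
# that an earlier (higher-priority) rule has already decided.
ANY, L, V, T, EXTEND, CONTROL, CR, LF = range(8)
CLASSES = [ANY, L, V, T, EXTEND, CONTROL, CR, LF]

table = [[None] * len(CLASSES) for _ in CLASSES]   # None = undecided, 1 = joined, 0 = split

def connect(before, after):
    if table[before][after] is None:
        table[before][after] = 1

def disconnect(before, after):
    if table[before][after] is None:
        table[before][after] = 0

def connect_before(before, afters):
    for a in afters:
        connect(before, a)

def connect_after(befores, after):
    for b in befores:
        connect(b, after)

def break_around(classes, others):
    for c in classes:
        for o in others:
            disconnect(c, o)
            disconnect(o, c)

# Transliteration of the rules, in priority order (GB3, GB4/GB5, GB6-GB8, GB9):
connect(CR, LF)
break_around([CONTROL, CR, LF], CLASSES)
connect_before(L, [L, V])
connect_before(V, [V, T])
connect(T, T)
connect_after(CLASSES, EXTEND)

# GB10: anything still undecided is a break. break_table[before][after] is True
# where a grapheme cluster break falls between the two classes.
break_table = [[cell != 1 for cell in row] for row in table]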
So...
The solution ended up being more complex than I thought, but at least it wasn't unintelligible. I'll go back now and fix all of my incorrect descriptions of grapheme breaking in previous blog entries. Along these same lines, word breaking, sentence breaking and line breaking can be implemented. But they're more complex, involving optional breaks in the case of line breaking, and non-adjacent characters in the case of word/sentence boundaries. But this is all for a later post!
Symbolic constants and enumeration
For grapheme classes in Unicode, I needed a bunch of symbolic constants 0..n to describe a number of things. (More graphemes in the next post.) At first, I used regular old words corresponding to regular old numbers, in a typical Factor idiom:
: Any 0 ;
: Extend 1 ;
: L 2 ;
! ...
But later, in constructing the table (discussed in more detail below), I found the need to write arrays like { Control CR LF }. But in Factor, this doesn't work the way you might think it does; it's basically equivalent to, in Lisp, '(Control CR LF) (a list of the symbols) when we actually want `(,Control ,CR ,LF) (a list of the values that the symbols refer to). How can we get that quasiquotation in Factor? The most obvious way is using make: [ Control , CR , LF , ] { } make. But that solution is pretty wordy, and not nearly fun enough. Here's another idea: what if you make the grapheme class words expand directly, at parse time, to their value? This can be done if you use parsing words:
: Any 0 parsed ; parsing
: Extend 1 parsed ; parsing
! ...
But who wants to write this eight times? I don't! We can abstract it into a parsing word CONST: to do this for us:
: define-const ( word value -- )
    [ parsed ] curry dupd define-compound
    t "parsing" set-word-prop ;

: CONST:
    CREATE scan-word dup parsing?
    [ execute dup pop ] when define-const ; parsing
But in this particular code, we use a pattern similar to C's enum. Why not abstract this into our own ENUM:?
: define-enum ( words -- )
    dup length [ define-const ] 2each ;

: ENUM:
    ";" parse-tokens [ create-in ] map define-enum ; parsing
Going back to the original problem, the ENUM: parsing word allows me to write
ENUM: Any L V T Extend Control CR LF graphemes ;
to specify the grapheme classes, without having to care about which number they correspond to.
This solution isn't perfect. The problem here is that this completely ruins homoiconic syntax for all uses of the constants. By "homoiconic syntax," I mean that see can print out the original source code, with the only real difference being whitespace. A macro using compiler transformations, which would preserve homoiconic syntax by using a different strategy, might be preferable. But I still wanted to share this little thing with you all.
Note: At the time I wrote this, this strategy made sense. But now, I'm thinking it'd be better to just go with a regular C-ENUM:, which should be renamed ENUM:. This is partly because of changes in the parser which make it more difficult to use parsing words defined in the same source file.
Saturday, August 25, 2007
The R5.97RS Unicode library is broken
It's a bit late for me to be writing this, with R6RS all but ratified, but there are serious problems with the string base library and associated Unicode library included in R5.97RS. For all but the most trivial internationalized applications, an external library will still have to be used.
Encoding
The way I read it, only fixed-width encodings for Unicode are allowed. This means that UTF-32, or (with a limited repertoire) the outdated 16-bit UCS-2 should be used. This is an odd contrast with the Scheme tradition of being agnostic towards implementation details. Still, it wouldn't be totally unreasonable if it were stated in the standard. Instead it is merely implied. In section 11.12:
Returns the number of characters in the given string as an exact integer object.

(string-ref string k)    procedure
K must be a valid index of string. The string-ref procedure returns character k of string using zero-origin indexing.
Note: Implementors should make string-ref run in constant time.
Since this runs in constant time, rather than linear time, only basic array indexing (through pointer arithmetic, ultimately) can be used in string-ref. This immediately eliminates a UTF-8 encoding, a marginally popular choice in programming languages, as it uses anywhere from one to four bytes to represent a character.
A more popular encoding for Unicode strings in memory is UTF-16. This puts a Unicode code point (character) in either two or four bytes. All but a few uncommon characters can be represented in two bytes. For the remaining ones, two chunks of two bytes are used, called surrogates. Alone, they mean nothing. The Unicode standard states specifically that they should not be falsely interpreted as characters. But the present R6RS draft does not mention anything about returning surrogates for the rare instances where high code points are used. So with UTF-16, string-ref either sometimes returns a surrogate, or it takes linear time. Java takes the former route: in Java's documentation for the method String.charAt(), there is an explicit warning, "If the char value specified by the index is a surrogate, the surrogate value is returned." But R6RS is silent on this, so we can only assume that UTF-16 is out.
This is almost certainly not what they meant, though. UTF-16 is used in almost all languages which "support Unicode." Larceny Scheme, which is being used for the reference implementation, can currently be compiled to use one of a number of fixed-width encodings. So UTF-16 isn't universal in Scheme Unicode implementations, but it shouldn't be ruled out.
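To see the surrogate problem above concretely, here is a quick Python sketch (not Scheme): a single code point outside the Basic Multilingual Plane occupies two UTF-16 code units, and picking out one unit by index yields a surrogate rather than a character.

s = "\U0001D11E"                      # U+1D11E MUSICAL SYMBOL G CLEF: one code point
units = s.encode("utf-16-be")         # its UTF-16 encoding
print(len(units) // 2)                # 2 code units for 1 character
print(hex(int.from_bytes(units[:2], "big")))   # 0xd834, a high surrogate, not a character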
I should clarify that, since R6RS says 'should', there is no absolute requirement of constant time in string-ref, but to follow this recommendation would disallow using most (though maybe not all) variable-width encodings.
Case conversion
A string is of course a sequence of characters. However, many Unicode string operations (for example, case conversion) cannot be viewed as simply applying the operation to each character. With ASCII, it's OK to define this (assuming string-map exists, which it doesn't in R6RS):
(define (string-upcase string)
  (string-map char-upcase string))
When considering all of Unicode, that definition is no longer accurate. For example, the lower case of Σ depends on whether it is placed at a word boundary, and the upper case of ß expands to two characters, SS.
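A tiny Python illustration of why a character-by-character char-upcase cannot be correct: the full uppercase mapping may change the length of the string.

s = "straße"
print(s.upper())                 # STRASSE: the ß expands to SS
print(len(s), len(s.upper()))    # 6 7, so no per-character mapping can produce this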
Despite these conditions, the R6RS Unicode library defines case conversion functions which operate on characters:
char-upcase, char-downcase, etc. Their behavior in these exceptional cases is defined only by example, though it is difficult to extend the example to all cases without peeking at the reference implementation. More importantly, it doesn't seem very useful, since correctly behaving operations on strings (string-upcase etc.) are also provided. There is no good reason for these functions on characters to exist.
[As a side note, I have to admit here that my implementation of titlecase in extra/unicode in the Factor distribution is flawed, because I don't properly detect word boundaries. But this will be fixed eventually.]
Collation and matching
This is the area where R6RS fails in the worst way. The words included in R6RS for comparing strings operate on code point order. Basically, this means they are useless. At least I think it's code point order; the standard only says "These procedures [char< and friends] impose a total ordering on the set of characters according to their Unicode scalar values." (The Unicode standard defines a "scalar value" as any code point other than a surrogate, so in practice this means code point order.)
Unicode Technical Standard #10 defines a complex, customizable and linguistically accurate collation algorithm. I'll write about this in more detail in the near future, but here's the basic outline. (I've simplified a lot here; comment if you have any specific questions.) Each Unicode code point has a three-level collation weight, and the individual weights may be zero. Using the collation weights of the individual characters in the string, a sort key is assembled: first, the first-level weights of each character are appended to the (initially empty) sort key (except for the zeros, which are omitted), then a zero, then the second-level weights (except zeros), and finally the third-level weights. This makes it so that some differences in strings, like accent marks, can be ignored at the first level, and others, like capitalization, can be ignored at both the first and the second levels. In different locales, the table of collation weights is customized, or tailored (see below).
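Here is a toy sketch of the sort-key construction in Python; the weights below are invented for the example, while a real implementation takes them from the Default Unicode Collation Element Table and handles tailoring, contractions and much more. It also shows the first-level-only comparison discussed just below.

# char -> (primary, secondary, tertiary); 0 means "ignorable at this level"
WEIGHTS = {
    "a": (10, 1, 1), "b": (11, 1, 1), "c": (12, 1, 1), "d": (13, 1, 1),
    "C": (12, 1, 2),        # same primary and secondary as "c", differs at level 3
    "\u0301": (0, 5, 1),    # combining acute accent: ignorable at level 1
}

def sort_key(s):
    key = []
    for level in range(3):
        for ch in s:
            w = WEIGHTS[ch][level]
            if w != 0:          # zero weights are omitted
                key.append(w)
        if level != 2:
            key.append(0)       # level separator
    return tuple(key)

# The UTS #10 example: "cab" < "Cab" < "cáb" < "dab"
assert sort_key("cab") < sort_key("Cab") < sort_key("ca\u0301b") < sort_key("dab")

# "Weak" comparison: ignore everything above level 1 (accents, case) by
# comparing only the part of the key before the first separator.
def primary_key(s):
    k = sort_key(s)
    return k[:k.index(0)]

assert primary_key("cab") == primary_key("ca\u0301b")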
The Unicode Collation Algorithm is not trivial to implement, but it's far more useful than code point order, which R6RS uses. The Unicode code points may sometimes happen to be in the correct order, but in general, there is no way to order the code points to create a linguistically correct collation order. This is because a series of tie breakers are needed. To steal an example from the UTS, "cab" < "Cab" < "cáb" < "dab". Any non-toy application needs to know this.
An unexpectedly related problem is weak comparison. What do you do if you want to compare two strings for equality, but you don't care about the case? Or the accent marks? Or the spaces and punctuation? The answer is, just compare the first level, or the first two levels, of the collation key as described above.
But R6RS has another idea, which is much less useful and general, an idea which occurred to me, too, before I realized how primitive it was: perform a case fold on both strings being compared. In total, there are ten functions which perform a case fold on characters or strings before performing a comparison (by code point order!) on them. These could be used for a case-insensitive comparison. But there are many other things that need to be ignored, too, and case folding can only be used in very limited situations.
Locales
The good folks at the Unicode Consortium did their best to define algorithms which would work equally well across all locales. But for some things, it's just impossible, and different applications need different customizations of the algorithms. For example, in case conversion, the Turkish, Azeri and Lithuanian locales require special handling of the dot over i and j. Another place where locales are needed is in collation, where a specific tailoring is nearly always needed to get a useful order. If elements of a sorted list are not in the order users expect, users may fail to find the items they are looking for. And this order differs between countries and contexts. Locales are also needed to find word boundaries in South East Asian languages, where spaces are not used.
Locales are best implemented as a dynamically scoped variable, but, failing that language feature, a global variable would suffice. This can hold a symbol which other functions access to determine their precise behavior. Because the case conversion functions do not reference a locale, they will have to be reimplemented in another library to work for all users.
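A minimal sketch of the dynamically scoped locale idea, in Python; the locale codes and the special-casing are purely illustrative.

from contextlib import contextmanager
from contextvars import ContextVar

current_locale = ContextVar("current_locale", default="en")

@contextmanager
def with_locale(code):
    token = current_locale.set(code)
    try:
        yield
    finally:
        current_locale.reset(token)

def upcase(s):
    # Turkish and Azeri map the dotted 'i' to a dotted capital I (U+0130).
    if current_locale.get() in ("tr", "az"):
        s = s.replace("i", "\u0130")
    return s.upper()

print(upcase("istanbul"))            # ISTANBUL
with with_locale("tr"):
    print(upcase("istanbul"))        # İSTANBUL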
How did this happen?
I have no idea. Apparently, an SRFI written by Matthew Flatt and Marc Feeley was available in July 2005. The mailing list archives are here. But the people on the R6RS committee aren't stupid, and there was extensive discussion on the R6RS mailing list about Unicode. People talked a lot about different encodings, but apparently none of that went into the final draft, and we are left with the encodings problem I mentioned above.
So is all hope lost? Is R6RS going to destroy the Unicode hopes of thousands of innocent Schemers, never to be revived? Of course not. While there may be some useless core functions, and there may be a confused standard library, the R6RS module system makes it possible to define a new, completely portable library for Unicode that's actually useful. But, broader than this, the best approach would be to create an abstract, high level API for dealing with Unicode. This API could be agnostic as to the encoding of the implementation and other low-level design decisions, but it could include the many different Unicode processing operations complex programs need. There could be several implementations, in the form of R6RS modules, of this API.
In general, it seems pretty inappropriate for Schemers to depend on RnRS to provide their Unicode library. Why not include, in R7RS, a standard GUI library or standard web server? These sorts of excursions will always lead to issues. Unicode is a specific problem for which there are many approaches.
Update: Fixed minor errors, and a major error where I accused the SRFI mailing list of being inactive and short lived, when it was actually active and open for some time. Also, removed the assertion that R5RS suffices for trivial international applications.
Reading the SRFI 75 mailing list archives, it looks like nearly all of my points were brought up. Yet, strangely, they did not lead to many helpful changes in the specification, at least not enough to avoid the above issues. It can be argued that collation by the Unicode scalar value is useful for a few things, such as implementing a binary tree of strings, which needs an arbitrary comparison; I should have mentioned that in the above article.
I wrote a letter to the r6rs-discuss mailing list about these problems, and one more (the possibility of limited repertoires). William D Clinger responded with a very reasonable analysis of the issues. He says that it's possible for a non-fixed-width encoding of Unicode to have O(1) string-ref (amortized time), which might be something to look into for Factor in the future. In the end, I guess it's not completely unreasonable that the R6RS folks went for a conservative standardization and included some questionable functions (like char-upcase) for backwards compatibility. But the case-folded comparison functions still make no sense to me.
I'm glad Factor is a young, single-implementation language right now, because we don't really have to worry about anything like backwards compatibility, or providing a minimal standard which is easy to implement. Of course, we want a simple language, but we have the freedom to make whatever kind of Unicode library we want.
Monday, August 20, 2007
Factor with a dash of curry
Finally, Factor has compiling curry! This is a great development, and it makes designing combinators in Factor much easier. I previously discussed designing combinators, but that post was very confusing and probably too advanced. With curry in full availability, this should be much easier, and the explanation much cleaner.
Unfortunately, Factor's curry has nothing to do with this delicious Indian meal.
What's curry?
I hear you asking, "What the hell are you talking about? What's curry?" Curry is a simple function, named after mathematician Haskell Curry, who developed combinatory logic, though not currying. The man usually credited with that discovery, a bit earlier on, and far before modern computers were developed, was Moses Schönfinkel, so the process may also be known as Schönfinkelisation. Currying is the technique of implementing a multiargument function by chaining together several single-argument functions. To use Haskell as an example, say you have the function add, which adds two integers. The function could be defined as
add :: Integer -> (Integer -> Integer)
add = \x -> (\y -> x+y)
Not all of you are Haskellers, so let me explain how that works. The first line is the type signature (that fact is indicated by the ::), which tells us that the type of add is a function (indicated by ->) which takes an Integer and returns a function (Integer -> Integer) from Integer to Integer. The second line is a series of lambdas which defines the actual function. (We know it's a definition because of the =.) The syntax \a -> b specifies an anonymous function which takes a value named a and returns the expression b. So add is a function which takes a value x and returns a function which takes a value y and returns the sum of x and y. As it happens, because Haskell is a so-called "auto-currying" language, and because -> is right-associative due to the frequency of this construct, the exact same thing can be written as
add :: Integer -> Integer -> Integer
add x y = x+y
How Factor does it
Currying is a great construct, but it has almost nothing to do with Factor's curry. Factor's curry is, by the more fashionable nomenclature of those at Lambda the Ultimate, partial application: given a function, apply it with a partial argument list, yielding a new function with just those arguments filled in. (I think this operation would more appropriately be called with in Factor, but for now, it's curry.) In Scheme, this is accomplished with the nice macro cut (SRFI 26).
At first, it may sound like partial application only suits applicative, not concatenative languages, as we do not usually think about applying a function to its arguments, but rather composing words which operate on a stack. But in this context, all we mean by partial application, or "currying", is getting something onto the stack before the quotation is called.
With this realization, this sort of currying is easy to do on quotations. In Joy, and in Factor as it was a few years ago, quotations are lists. To prepend a constant to that list is the same as to push it on the stack before executing the quotation. The operation is simply cons. Now that Factor's quotations are specially tagged arrays, the code for prepending an element is swap add*, but the meaning remains the same. If you're a careful reader, you saw that I made a mistake there: if you want to partially apply a quotation with a word as your data, this word has to be "literalized" first, or wrapped so that it is not executed but rather pushed. So, until recently, the code for curry amounted to swap literalize add*. (literalize does nothing if its argument is not a word.)
But now, curry has been added as a primitive for better interpreter performance, and a compiler intrinsic so it can be safely used in compiling Factor code.
Derived operations
You might be wondering if this single operation can take the place of all of the uses of Scheme's cut macro. It can. Using only curry, call and basic stack shuffling words, it is easy to define several more complex operations. For something simple, here's how to partially apply a quotation to two arguments, rather than one:

: 2curry ( a b quot -- quot' )
    curry curry ; inline
The inline declaration is there to allow things using 2curry to be compiled. Composing two quotations is slightly more complicated. We want this to be fast, so we'll build a quotation, using curry, which calls the two quotations in order. This is more efficient than composition using append.

: compose ( quot1 quot2 -- q1q2 )
    [ >r call r> call ] 2curry ; inline
Another useful operation is curry*, which partially applies a quotation based on the second item on the stack. As an example, curry* map is a complete replacement for map-with. Because of this, all of the -with combinators have been removed in favor of direct use of curry.

: curry* ( param object quot -- object curry )
    >r swap r> [ >r swap r> call ] 2curry ; inline
Practical applications
Aside from eliminating the need for all -with combinators (I'm very glad I'll never have to write another one myself!), curry makes writing all sorts of combinators much simpler. For an involved example, see the new definitions of the sequence combinators, which are not so scary anymore. For something simpler, take the problem of counting up the number of elements in a sequence which satisfy a certain property. Say you have the sequence { 1 2 4 1 2 1 } and you want to count the number of ones. A simple way to do this is to simply take the subset satisfying the property, and then count how long it is:

: count ( seq quot -- num )
    subset length ; inline
Using this, code to count up the number of ones could be [ 1 = ] count. This pattern is used a few times in the Factor core, though there is no word named count. Using it, we can write a count= word which counts the number of times an element occurs in a sequence:

: count= ( seq elt -- num )
    [ = ] curry count ;
But there's a downside to this: for no good reason, you build a new sequence, which is immediately consumed and discarded. What if we just counted up the length instead? The count combinator could be rewritten as:

: count ( seq quot -- num )
    0 -rot [ rot >r call r> swap [ 1+ ] when ] curry each ;
There's a little bit of stack acrobatics around calling that quotation so that the quot has access to lower things on the stack.
If you think about it, there are actually two pieces here, which we can factor out. The first piece, which I'll call sigma, incrementally maps all values in a sequence to a new value and sums the results. The second piece calls a quotation and converts the boolean result to either one or zero, which can be summed up. When these are separated, the implementation is even cleaner:

: sigma ( seq quot -- num )
    [ rot >r call r> + ] curry 0 -rot each ; inline

: count ( seq quot -- num )
    [ [ 1 ] [ 0 ] if ] compose sigma ; inline
Both of these are faster than their counterparts (map sum and subset length) by only a constant factor, but, just for fun, let's measure this. Right now, Factor doesn't have very advanced profiling tools, but I haven't felt the need for them. For many things, a simple timing function can work well. After entering the definitions for sigma, the following lines can be entered at the listener to time the word against map sum:

: slow 1000000 [ ] map sum ;
: fast 1000000 [ ] sigma ;
{ slow fast } [ compile ] each
[ slow ] time ! prints time for map sum
[ fast ] time ! prints time for sigma
On my computer, slow takes 195ms and fast takes 126ms, for around a 35% speedup!
Thursday, August 9, 2007
Unicode implementer's guide part 3: Conjoining jamo behavior
Ugh, Korean. In my last two posts on Unicode, I left out something very important: the Korean writing system. Korean, like Japanese, has a mostly-phonetic alphabet to accompany their Chinese-derived Hanja characters. This phonetic system is called Hangul, and it consists of an alphabet of 24 jamo. Unicode represents these jamo in the block U+1100 to U+11FF, and it includes several obsolete jamo in addition to those used today. Jamo are displayed in blocks of two (consonant+vowel) or three (consonant+vowel+consonant) in modern Korean, though older texts sometimes use more. In order to reduce ambiguity, Unicode uses different characters to represent initial and final jamo, though most of them overlap, looking the same when presented alone.
Supporting that doesn't sound too hard, does it? It gets more complicated when you try to correctly process the block structure formed by the jamo. The algorithms needed to do this are collectively called "conjoining jamo behavior," because the jamo 'conjoin' to form a syllable. I haven't told you about the fun part yet: Unicode actually has two ways to store Korean*! In addition to the component jamo of a syllable, you can store Korean as pre-composed glyphs of two or three jamo. This takes an amazingly large area of Unicode, from U+AC00 to U+D7AF, second in size only to the CJKV (Han ideograph) block.
Why bother with conjoining jamo behavior? I don't need Korean for what I'm doing! And the Unicode standard, chapter 3, explicitly says that not all Unicode implementations have to support all Unicode code points in order to be compliant. Implementations only have to explicitly say which ranges they support when claiming conformance. But it's still good to support Korean. I'm Doing Unicode Right™. So, as long as it's feasible, I should support Korean to its full extent. Also, there have been some people who wanted to use Korean with Factor, so proper support would be helpful.
Things get complicated when determining grapheme breaks, which, in this case, are the breaks between blocks (syllables). To do this, you need to detect where the initial, medial and final jamo connect to form one syllable. This is the basic operation of conjoining jamo behavior that all the others are based on. (In a later post, I discussed grapheme breaks further.) Following this, there are the issues of display, collation and normalization. I'm ignoring display and collation for now, though the latter will be the subject of a future post. So for the rest of this post, I'll discuss normalization as it relates to Korean text, a topic which is surprisingly non-trivial.
In normalization forms D and KD, precomposed Hangul syllables are broken up into their constituent jamo. In C and KC, individual jamo are put, whenever possible, into pre-composed characters. Because of the way the precomposed Hangul block is organized, this can be done algorithmically.
The underlying formulas for composition and decomposition are simple, and Wikipedia gives a great description of them. Basically, given an initial, medial and final jamo (just use 0x11a7 if there's no final), the formula is, "((initial-0x1100)*588)+((medial-0x1161)*28)+(final-0x11a7))+0xac00" (0x is followed by a number in hexadecimal).
For decomposition, it's trivial to invert this, given one Hangul character and some division with remainders (in Factor,
In normal Korean in NFD or NFKD, the jamo are generally organized in an initial-medial-final or initial-medial pattern, with no out-of-order code points. However, we need to take into account all of the possible edge cases. Here, the edge case is old Korean, which sometimes contains more than one initial, medial or final jamo. Additionally, sometimes a jamo stands alone, and this too must be recognized. So detecting syllables for recomposition is a little bit more involved than it first seems.
Since I was having trouble writing clear code for this, I checked in at ICU for a nice, clean implementation. Never mind that. Unlike most of the other code, the ICU implementation of conjoining jamo behavior as it relates to NFC/NFKC is horribly convoluted. For efficiency, ICU tries to always do just one pass through the string for any speed-critical operation. Surprisingly, I was able to replicate cleanly using a different tactic than ICU did.
An implementation of NFC with conjoining jamo behavior can take advantage of the convenient property that no jamo has a combining mark to worry about, besides other jamo**. So, once a jamo is reached in the normalization process, it is possible to jump to another routine which handles exclusively conjoining Jamo behavior. The basic idea of this auxiliary routine is to detect a pattern of initial-medial-final or initial-medial, which can then be applied to the above formula for deriving a pre-composed Hangul. For me, it was easiest to write this by looking ahead in the string, rather than looking back at previously scanned characters as the rest of the composition routine does. All other jamo not in one of these patterns can be safely added to the results string without modification.
If you're a careful reader***, you may have noticed an uncomfortable implication of my previous decision to use NFD and UTF-32 for all strings in Factor: Korean takes four to six times as many bytes per glyph as in the more typical unnormalized UTF-16 representation. This is an unfortunate consequence of otherwise-hopefully-mostly-sound design decisions, but in practice, I don't think it should be much of an issue. These days, computers have a large amount of memory, so getting everything down to the optimally efficient bit sequence is not always necessary. But for the circumstances where it does matter that everything is optimal, maybe a program could maintain a short int (16 bit) array which is normalized to NFC rather than NFD. But reopens the problems of the dangling surrogate and the mistreatment of different normalizations, which I vowed to end. So I'm not sure if there's any optimal solution here.
All told, in my implementation, it's only around 40 additional lines of code, so maybe I shouldn't be making such a big deal out of this.
For further reference, see the Unicode 5.0 standard, 3.12 Conjoining Jamo Behavior (PDF).
* Really, there are actually three ways. The third way is with the Hangul Compatibility Jamo block (U+3130 - U+318F) which corresponds to an old standard for representing Korean jamo, KS C 5601. But these have no "cojoining" semantics, and they don't distinguish between initial and final jamo. Generally, we don't have to worry about these.
** To verify this for yourself, just run the code
to get the set of jamo/combining mark pairs with their compositions, which have a non-algorithmic composition. When I ran it, the result was
*** Or if you have been following my whining at
Supporting that doesn't sound too hard, does it? It gets more complicated when you try to correctly process the block structure formed by the jamo. The algorithms needed to do this are collectively called "conjoining jamo behavior," because the jamo 'conjoin' to form a syllable. I haven't told you about the fun part yet: Unicode actually has two ways to store Korean*! In addition to the component jamo of a syllable, you can store Korean as pre-composed glyphs of two or three jamo. This takes an amazingly large area of Unicode, from U+AC00 to U+D7AF, second in size only to the CJK Unified Ideographs (Han ideograph) block.
Why bother with conjoining jamo behavior? I don't need Korean for what I'm doing! And the Unicode standard, chapter 3, explicitly says that not all Unicode implementations have to support all Unicode code points in order to be compliant. Implementations only have to explicitly say which ranges they support when claiming conformance. But it's still good to support Korean. I'm Doing Unicode Right™. So, as long as it's feasible, I should support Korean to its full extent. Also, there have been some people who wanted to use Korean with Factor, so proper support would be helpful.
Things get complicated when determining grapheme breaks, which, in this case, are the breaks between blocks (syllables). To do this, you need to detect where the initial, medial and final jamo connect to form one syllable. This is the basic operation of conjoining jamo behavior that all the others are based on. (In a later post, I discussed grapheme breaks further.) Following this, there are the issues of display, collation and normalization. I'm ignoring display and collation for now, though the latter will be the subject of a future post. So for the rest of this post, I'll discuss normalization as it relates to Korean text, a topic which is surprisingly non-trivial.
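As a rough illustration of that syllable-boundary detection (sketched in Python rather than Factor, with names of my own; the ranges are the L/V/T zones of the Hangul Jamo block, and the no-break pairs follow the Hangul portion of the UAX #29 grapheme rules):

def jamo_class(cp):
    # Hangul Jamo block zones: leading consonants (L), medial vowels (V),
    # trailing consonants (T).
    if 0x1100 <= cp <= 0x115F: return "L"
    if 0x1160 <= cp <= 0x11A7: return "V"
    if 0x11A8 <= cp <= 0x11FF: return "T"
    return None

def jamo_connect(a, b):
    # True if two adjacent jamo belong to the same syllable block.
    # (Precomposed LV/LVT syllables would need extra cases, but those
    # never appear in NFD text.)
    return (jamo_class(a), jamo_class(b)) in {
        ("L", "L"), ("L", "V"), ("V", "V"), ("V", "T"), ("T", "T")}

assert jamo_connect(0x1112, 0x1161)        # initial + medial: same syllable
assert not jamo_connect(0x11AB, 0x1100)    # final + initial: syllable break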
In normalization forms D and KD, precomposed Hangul syllables are broken up into their constituent jamo. In C and KC, individual jamo are put, whenever possible, into pre-composed characters. Because of the way the precomposed Hangul block is organized, this can be done algorithmically.
The underlying formulas for composition and decomposition are simple, and Wikipedia gives a great description of them. Basically, given an initial, medial and final jamo (just use 0x11a7 if there's no final), the formula is "((initial - 0x1100)*588) + ((medial - 0x1161)*28) + (final - 0x11a7) + 0xac00" (0x prefixes a number in hexadecimal).
For decomposition, it's trivial to invert this, given one Hangul character and some division with remainders (in Factor, /mod divides two numbers, returning the quotient and remainder). For a complete listing of the formula, see chapter 3, section 12 of the Unicode standard, linked below. But strangely, composition is a whole lot harder. The trick is recognizing where there's a Hangul syllable.

In normal Korean in NFD or NFKD, the jamo are generally organized in an initial-medial-final or initial-medial pattern, with no out-of-order code points. However, we need to take into account all of the possible edge cases. Here, the edge case is old Korean, which sometimes contains more than one initial, medial or final jamo. Additionally, sometimes a jamo stands alone, and this too must be recognized. So detecting syllables for recomposition is a little bit more involved than it first seems.
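Here is that arithmetic, sketched in Python rather than Factor (the constant and function names are mine; Python's divmod stands in for /mod):

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28    # 21 medial vowels; 27 finals plus "no final"

def compose(initial, medial, final=T_BASE):
    # ((initial - 0x1100)*588) + ((medial - 0x1161)*28) + (final - 0x11a7) + 0xac00
    return (S_BASE + (initial - L_BASE) * V_COUNT * T_COUNT
            + (medial - V_BASE) * T_COUNT + (final - T_BASE))

def decompose(syllable):
    # Invert the formula with division and remainders.
    l, rest = divmod(syllable - S_BASE, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    jamo = [L_BASE + l, V_BASE + v]
    if t:                        # t == 0 means there was no final jamo
        jamo.append(T_BASE + t)
    return jamo

assert compose(0x1112, 0x1161, 0x11AB) == 0xD55C    # the syllable U+D55C
assert decompose(0xD55C) == [0x1112, 0x1161, 0x11AB]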
Since I was having trouble writing clear code for this, I checked in at ICU for a nice, clean implementation. Never mind that. Unlike most of the other code, the ICU implementation of conjoining jamo behavior as it relates to NFC/NFKC is horribly convoluted. For efficiency, ICU tries to always do just one pass through the string for any speed-critical operation. Surprisingly, I was able to replicate the behavior cleanly using a different tactic than ICU did.
An implementation of NFC with conjoining jamo behavior can take advantage of the convenient property that no jamo has a combining mark to worry about, besides other jamo**. So, once a jamo is reached in the normalization process, it is possible to jump to another routine which handles conjoining jamo behavior exclusively. The basic idea of this auxiliary routine is to detect a pattern of initial-medial-final or initial-medial, which can then be fed to the above formula for deriving a pre-composed Hangul syllable. For me, it was easiest to write this by looking ahead in the string, rather than looking back at previously scanned characters as the rest of the composition routine does. All other jamo not in one of these patterns can be safely added to the result string without modification.
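A sketch of that lookahead routine, again in Python with hypothetical names; only the modern jamo that have precomposed forms are composed, and everything else passes through untouched:

def compose_jamo_run(code_points):
    out, i, n = [], 0, len(code_points)
    while i < n:
        c = code_points[i]
        # Look ahead for a modern initial followed by a modern medial.
        if (0x1100 <= c <= 0x1112 and i + 1 < n
                and 0x1161 <= code_points[i + 1] <= 0x1175):
            initial, medial, i = c, code_points[i + 1], i + 2
            final = 0x11A7                     # placeholder for "no final"
            if i < n and 0x11A8 <= code_points[i] <= 0x11C2:
                final, i = code_points[i], i + 1
            out.append(0xAC00 + (initial - 0x1100) * 588
                       + (medial - 0x1161) * 28 + (final - 0x11A7))
        else:
            out.append(c)                      # old or isolated jamo: unchanged
            i += 1
    return out

# An initial-medial-final run composes; a stray medial is left alone.
assert compose_jamo_run([0x1112, 0x1161, 0x11AB, 0x1161]) == [0xD55C, 0x1161]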
If you're a careful reader***, you may have noticed an uncomfortable implication of my previous decision to use NFD and UTF-32 for all strings in Factor: Korean takes four to six times as many bytes per glyph as in the more typical unnormalized UTF-16 representation (an NFD syllable is two or three jamo at four bytes each, versus a single two-byte precomposed code point in UTF-16). This is an unfortunate consequence of otherwise-hopefully-mostly-sound design decisions, but in practice, I don't think it should be much of an issue. These days, computers have a large amount of memory, so getting everything down to the optimally efficient bit sequence is not always necessary. But for the circumstances where it does matter that everything is optimal, maybe a program could maintain a short int (16-bit) array which is normalized to NFC rather than NFD. But that reopens the problems of dangling surrogates and the mistreatment of different normalizations, which I vowed to end. So I'm not sure if there's any optimal solution here.
All told, in my implementation, it's only around 40 additional lines of code, so maybe I shouldn't be making such a big deal out of this.
For further reference, see the Unicode 5.0 standard, 3.12 Conjoining Jamo Behavior (PDF).
* Really, there are three ways. The third way is with the Hangul Compatibility Jamo block (U+3130 - U+318F), which corresponds to an old standard for representing Korean jamo, KS C 5601. But these have no "conjoining" semantics, and they don't distinguish between initial and final jamo. Generally, we don't have to worry about these.
** To verify this for yourself, just run the code
USE: unicode
combine-map concat [ first jamo? ] subset
to get the set of jamo/combining mark pairs, with their compositions, that have a non-algorithmic composition. When I ran it, the result was V{ }, the empty vector.
*** Or if you have been following my whining at #concatenative.
Friday, August 3, 2007
COLLEGE
On August 29th, I'm leaving for the 16-hour car drive to Carleton College in Northfield, Minnesota, US. It's a small liberal arts college, but it has reasonable departments in math and CS. I'm entering it as a Freshman this fall, and I'm really, really excited. I'm not sure whether to major in math, computer science or something else like economics or linguistics because I want to study everything. But I don't want to spend all of my time on the computer, so I'm banning myself from programming Factor and blogging during the month of September. I need to build a social life there, and this has the potential to ruin it. But don't assume that this blog is dead after the end of August; just come back October 1 expecting a nice posting on something practical, like XML (ugh), Unicode (2x ugh) or a little useful script I made.
Before I go, there are a few tasks that I need to accomplish. The most important one is implementing conjoining jamo behavior so that Korean will work properly in Unicode. (Korean will take a huge amount of room in memory, which I'll describe soon in another post.) Additionally, I have to finish Unicode collation and fix some problems with pull-elem, which will probably require huge changes in extra/xml/xml.factor. Those last two things might have to wait until October, but the first one is essential because there are people waiting for Unicode support, and strings of UTF-8 which Factor thinks are ISO 8859-1 won't cut it. I hope proper Unicode handling, including 8- and 32-bit strings, is made a part of Factor .90, or at the latest .91.

If I have any readers (or if there are any other Factorers) in the Northfield area, or just the Minnesota area (besides Doug), please contact me.