Thursday, August 9, 2007

Unicode implementer's guide part 3: Conjoining jamo behavior

Ugh, Korean. In my last two posts on Unicode, I left out something very important: the Korean writing system. Korean, like Japanese, has a mostly-phonetic alphabet to accompany their Chinese-derived Hanja characters. This phonetic system is called Hangul, and it consists of an alphabet of 24 jamo. Unicode represents these jamo in the block U+1100 to U+11FF, and it includes several obsolete jamo in addition to those used today. Jamo are displayed in blocks of two (consonant+vowel) or three (consonant+vowel+consonant) in modern Korean, though older texts sometimes use more. In order to reduce ambiguity, Unicode uses different characters to represent initial and final jamo, though most of them overlap, looking the same when presented alone.

Supporting that doesn't sound too hard, does it? It gets more complicated when you try to correctly process the block structure formed by the jamo. The algorithms needed to do this are collectively called "conjoining jamo behavior," because the jamo 'conjoin' to form a syllable. I haven't told you about the fun part yet: Unicode actually has two ways to store Korean*! In addition to the component jamo of a syllable, you can store Korean as pre-composed glyphs of two or three jamo. This takes an amazingly large area of Unicode, from U+AC00 to U+D7AF, second in size only the CKJV (Han idiograph) block.

Why bother with conjoining jamo behavior? I don't need Korean for what I'm doing! And the Unicode standard, chapter 3, explicitly says that not all Unicode implementations have to support all Unicode code points in order to be compliant. Implementations only have to explicitly say which ranges they support when claiming conformance. But it's still good to support Korean. I'm Doing Unicode RightTM. So, as long as it's feasible, I should support Korean to its full extent. Also, there have been some people who wanted to use Korean with Factor, so proper support would be helpful.

Things get complicated when determining grapheme breaks, which, in this case, are the breaks between blocks (syllables). To do this, you need to detect where the initial, medial and final jamo connect to form one syllable. This is the basic operation of conjoining jamo behavior that all the others are based on. (In a later post, I discussed grapheme breaks further.) Following this, there are the issues of display, collation and normalization. I'm ignoring display and collation for now, though the latter will be the subject of a future post. So for the rest of this post, I'll discuss normalization as it relates to Korean text, a topic which is surprisingly non-trivial.

In normalization forms D and KD, precomposed Hangul syllables are broken up into their constituent jamo. In C and KC, individual jamo are put, whenever possible, into pre-composed characters. Because of the way the precomposed Hangul block is organized, this can be done algorithmically.

The underlying formulas for composition and decomposition are simple, and Wikipedia gives a great description of them. Basically, given an initial, medial and final jamo (just use 0x11a7 if there's no final), the formula is, "((initial-0x1100)*588)+((medial-0x1161)*28)+(final-0x11a7))+0xac00" (0x is followed by a number in hexadecimal).

For decomposition, it's trivial to invert this, given one Hangul character and some division with remainders (in Factor, /mod divides two numbers, returning the quotient and remainder). For a complete listing of the formula, see chapter 3 section 12 of the Unicode standard, linked below. But strangely, composition is a whole lot harder. The trick is recognizing where there's a Hangul syllable.

In normal Korean in NFD or NFKD, the jamo are generally organized in an initial-medial-final or initial-medial pattern, with no out-of-order code points. However, we need to take into account all of the possible edge cases. Here, the edge case is old Korean, which sometimes contains more than one initial, medial or final jamo. Additionally, sometimes a jamo stands alone, and this too must be recognized. So detecting syllables for recomposition is a little bit more involved than it first seems.

Since I was having trouble writing clear code for this, I checked in at ICU for a nice, clean implementation. Never mind that. Unlike most of the other code, the ICU implementation of conjoining jamo behavior as it relates to NFC/NFKC is horribly convoluted. For efficiency, ICU tries to always do just one pass through the string for any speed-critical operation. Surprisingly, I was able to replicate cleanly using a different tactic than ICU did.

An implementation of NFC with conjoining jamo behavior can take advantage of the convenient property that no jamo has a combining mark to worry about, besides other jamo**. So, once a jamo is reached in the normalization process, it is possible to jump to another routine which handles exclusively conjoining Jamo behavior. The basic idea of this auxiliary routine is to detect a pattern of initial-medial-final or initial-medial, which can then be applied to the above formula for deriving a pre-composed Hangul. For me, it was easiest to write this by looking ahead in the string, rather than looking back at previously scanned characters as the rest of the composition routine does. All other jamo not in one of these patterns can be safely added to the results string without modification.

If you're a careful reader***, you may have noticed an uncomfortable implication of my previous decision to use NFD and UTF-32 for all strings in Factor: Korean takes four to six times as many bytes per glyph as in the more typical unnormalized UTF-16 representation. This is an unfortunate consequence of otherwise-hopefully-mostly-sound design decisions, but in practice, I don't think it should be much of an issue. These days, computers have a large amount of memory, so getting everything down to the optimally efficient bit sequence is not always necessary. But for the circumstances where it does matter that everything is optimal, maybe a program could maintain a short int (16 bit) array which is normalized to NFC rather than NFD. But reopens the problems of the dangling surrogate and the mistreatment of different normalizations, which I vowed to end. So I'm not sure if there's any optimal solution here.

All told, in my implementation, it's only around 40 additional lines of code, so maybe I shouldn't be making such a big deal out of this.

For further reference, see the Unicode 5.0 standard, 3.12 Conjoining Jamo Behavior (PDF).

* Really, there are actually three ways. The third way is with the Hangul Compatibility Jamo block (U+3130 - U+318F) which corresponds to an old standard for representing Korean jamo, KS C 5601. But these have no "cojoining" semantics, and they don't distinguish between initial and final jamo. Generally, we don't have to worry about these.

** To verify this for yourself, just run the code

USE: unicode
combine-map concat [ first jamo? ] subset

to get the set of jamo/combining mark pairs with their compositions, which have a non-algorithmic composition. When I ran it, the result was V{ }.

*** Or if you have been following my whining at #concatenative


Slava Pestov said...

Good job!

Anonymous said...

I am also impressed that you implemented both precomposed and non-precomposed Korean. I was fairly surprised to find the existence of both in the unicode spec. I might actually end up using the Korean unicode support in Factor after I learn Korean ;).

Daniel Ehrenberg said...

Dear Anonymous,
Any real implementation has to support both precomposed and non-precomposed Korean because, without it, certain things break. I can't just declare, "All input to Factor programmers should be in individual jamo form" (or in Hangul syllables) because that might not conform to the conditions of the real world. (In reality, individual jamo are used relatively rarely, but still sometimes, like when explaining phonetics, or (in theory) when encoding old Korean. But it's still necessary.)