Monday, July 9, 2007

The Unicode implementer's guide, part 2: Normalization

In a reasonable encoding system, it'd make sense if every string had only one encoding, wouldn't it. I'm not talking about UTF-8/16/32; I'm talking about the logical sequence of characters. If you have a capital C with an acute accent on top and a cedilla on the bottom, there should be only one sequence of characters to represent it, and any two strings in the same encoding which represent that glyph should have the same bits.

Well, that's not Unicode. There are actually several ways to represent that. At least four, by my count. You could have C-acute followed by cedilla, a C-cedilla followed by an acute, a C followed by an acute followed by a cedilla, or a C followed by a cedilla followed by an acute. There is a difference between "C-acute" and "C followed by an acute" because Unicode has what's called precomposed characters, or characters which already have accent marks on them. However, they are directly equivalent to a corresponding decomposition, specifically, the base character followed by a combining accent mark.

Equivalence and normalization forms

All of the above sequences represent the same thing, but they're represented differently in the computer. We call them canonically equivalent. There's another type of equivalence, called compatibility equivalence, where, say, the composed fraction ½ is equivalent to 1/2. These aren't strictly equivalent, though, so Unicode maintains a distinction to a stronger degree than the distinction between C-acute and C followed by an acute.

Chapter 3 of the Unicode standard states, right near the beginning,

C6 A process shall not assume that the interpretations of two canonical-equivalent sequences are distinct.

  • ...a process is never required to give different interpretations to two different, but canonically equivalent character sequences...no process can assume that another process will make a distinction between...canonical-equivalent character sequences.

  • Ideally, an implementation would always interpret two canonical-equivalent character sequences identically...




To me, it seems apparent that one single form needs to be chosen. In this form, which is efficient to create, any two canonical-equivalent strings will be represented by the same sequence of characters. There should also be a form for categorical equivalence.

Luckily, the Unicode consortium thought of this first. They came up with not two but four forms, with the imaginative names Normalization Form C (NFC), Normalization Form D (NFD), Normalization Form KD (NFKD) and Normalization Form KC (NFKC). Officially, they stand for nothing, but I know better! They do mean something! C stands for "composed", D stands for "decomposed" and K stands for "kompatibility". (If they had abbreviations like NFCC, it'd be confusing, wouldn't it?)

Decomposition

Let me explain. In NFD, everything that can be decomposed is decomposed. In NFC, everything that can be composed is composed, after first decomposing everything. In NFKD, everything that has a compatibility or canonical decomposition is decomposed that way. And in NFKC, everything is decomposed by NFKD and then recomposed by their canonical composition. These decompositions are listed in UnicodeData.txt, and have to be recursively applied until a fully decomposed/recomposed form is achieved.

Still doesn't make sense? Here are some details of my implementation. For both forms of decomposition (NFD and NFKD), I constructed tables* associating characters with their fully-calculated decomposition, decomposing the decomposition repeatedly until no more steps are possible. For example, the musical symbol eighth note character can be decomposed into a black notehead, a combining stem, and a single combining flag**. From there, all you have to do to construct a decomposed normalization form is map characters to their decomposed forms (or, if there's no decomposition, to the character itself) and concatenate the results.

Composition

To recompose afterwards, there is just one table for both NFC and NFKC, because only the canonical decomposition can be recomposed; the compatibility decomposition is not ever decomposed. To do this, you need some sort of associative mapping from two characters to one character. There is an entry in this table for each canonical decomposition from one character to two characters. (There are a few characters, called singletons, that map in NFD from one character to one other character. This is never undone. They exist purely for round-trip compatibility with other character encodings.) Most people suggested using a custom-built trie for it, but I got pretty good performance with a simpler custom-built hashtable. I didn't want to use the standard library hashtable because the keys consist of two integers. With a standard hashtable, I'd have to construct a 2array or similar object for each key, and then all lookups would require consing. This is unnecessarily inefficient, especially considering how many lookups there are. The hashtable I came up with is in extra/hash2.

The recomposition algorithm is rather complicated once you consider that there may be multiple combining marks following a character, and their order matters only sometimes (see below). I asked about it on the public Unicode mailing list, and Ken Whistler gave me a very helpful response in a private email. The algorithms are also described in the Unicode standard itself.

Canonical ordering

OK, so this handles the separation between the "C followed by acute" and "C-acute" forms. But what about the ordering of the combining marks? How do we make sure that an acute followed by a cedilla is treated the same as a cedilla treated by an acute? It gets even more complicated when you take into account the fact that order does matter for certain accent marks. For example, a C followed by a diaeresis followed by an acute accent is actually different than a C followed by an acute accent followed by a diaeresis. Which one is displayed on the bottom depends on which comes first. So there are a few groups of accent marks where the order matters, and for the rest, it's equivalent.

The folks behind Unicode came up with a great idea called canonical ordering. Each group of combining marks (in the way I just described, where a group is the few accent marks where the order matters) is assigned an arbitrary number. For example, the group containing the acute accent and the diaeresis is numbered 230, and the group containing the combining cedilla is numbered 202. Then, for each grapheme followed by multiple combining marks, the marks are sorted by combining class (the number they are assigned). The sort is a stable one, so that if two marks have the same combining class, their original order is preserved.

But it was a little tricky to find the right abstractions and appropriate sequence combinators to process strings in this way. To implement canonical ordering, I first tried one big iteration though the string elements using each, but that made control flow really difficult. The main loop was one bit cond with 5 clauses, each of which had to deal with 5 items on the stack. I ended up using find* to locate the starts and ends of combining mark clusters, allowing for more flexible control flow and better factoring. For my implementation, I used an in-place insertion sort to order the combining mark clusters. It's simple and efficient on small sequences.

It should be possible to do the sorting in the same pass as the cluster location, but for my implementation, it's a bit premature to optimize that much.

Design considerations

One of the design decisions I've made is that all strings in Factor will always be in NFD at all times. For a number of reasons, this requires that strings are immutable. But the advantage is that the Unicode Conformance Clause 6 will always be followed by all Factor programs. Nothing (unless it engineers a huge workaround, using binary rather than UTF-8 I/O and uses string buffers rather than strings) will violate this. If two canonical-equivalent strings are put in NFD, they will be equal, so there's no risk of any Factor process treating canonical-equivalent strings differently, nor expecting that any subprocess to treat it differently.

A little side note: I've already mentioned this, but many programs and libraries do Unicode wrong. In particular, many programs can't properly handle things that aren't in Normalization Form C. So, for all input and output (except for Mac OS X pathnames, which must be NFD), strings will first have to be converted to NFC. But that's no big deal, once you have everything else figured out.

In a perfect world, we wouldn't have to worry about normalization when communicating with other programs. If every process just treated all canonical-equivalent sequences equally, we wouldn't need to convert strings into normalized form. But we don't live in that world, so normalization must exist. Keeping text in a consistently normalized form within a program is a simple but surprisingly uncommon way to achieve this.



* I constructed all the tables--for normalization and the things I discussed earlier--at parsetime, before compiletime and runtime. I did this by using words of the style

: load-stuff
\ table-word read-resource-file process-data
1quotation define-compound ; parsing
load-stuff

ICU, the Unicode library used by everyone but me, generates C header files containing very large arrays for tables. Some of the arrays are organized in a trie format, so they can be used as associative arrays. I used hashtables for associative arrays instead.

** Wondering how I got that? Here's some code you could enter into the Factor listener:

USE: unicode
canonical-map [ nip length 3 = ] assoc-find
drop >r name-map value-at r> [ name-map value-at ] map
swap "Name: " write print "Components: " print [ print ] each

In total, there are 229 characters with this sort of decomposition. A small variation to that program (changing 3 to 4) shows me that there's a Greek character, named Greek small letter alpha with psili and varia and ypogegrammeni which decomposes to 4 characters. There are actually 36 of these kinds of characters, which decompose to four. They're all Greek. There's nothing that's five characters long, though. Isn't Unicode fun?

2 comments:

k said...

I can't believe I read this much :)
can you show us how you implemented for example the algorithm that was described in the email?
I mean, how you translated this into factor. have you started by grouping into words and then tested a few cases to determine the order?
It would be instructional to get a tree description (like yes-> go there, no-> go that way) and explain the translation to a stack language.
I like the tone of your writings, and thanks for the major efforts

Daniel Ehrenberg said...

Thanks. Umm, heh, I still don't have an implementation of the algorithm discussed in the email that I'm comfortable with. I used a bunch of hacks and four (!) named variables, which is a big no-no. I need to refactor it all and write some unit tests. I guess I could go into more of the low-level details of some of the other parts of it, and how that translates into Factor (though my goal in this was to be more language-neutral). But that's another post; this one's long enough already!

Keep commenting, everyone!