Useless Factor: Unicode implementer's guide part 4: grapheme breaking

I have to be honest: I lied. I oversimplified the semantics of grapheme cluster breaking, and I hope no one messed up any software over it. But luckily, as far as I know, no one is following this and implementing Unicode, so I didn't cause too much damage. Anyway, it seems that an outright, objective error doesn't inspire nearly as much rebuke as a mere controversial essay. (I like rebuke. Please rebuke me if I've made a false claim in this post.)

Unicode grapheme breaking, like everything else in Unicode, is unexpectedly complex. But, as always, there exists an efficient algorithm to process it.

What's a grapheme cluster?

Abstract Unicode strings, like the ones you're used to, are a series of characters. In Unicode parlance, these are often called code points, but I'll usually call them characters here. But these characters aren't what users think of when they hear "character." Some characters, such as accent marks, go on top of, or underneath, other characters in a horizontal text. Some characters, like \r and \n (CR and LF) often come one after another, signifying together just one thing: a line break.

Because of these and other complications, an additional concept of a "grapheme" is useful. A grapheme is a series of one or more code points that, together, forms what a user thinks of as a character. When you use your left and right arrow keys to scroll through text, you're iterating through graphemes, not characters. If you want to reverse a string, you reverse the graphemes, not the characters. These are just a couple of the many contexts where graphemes show up. Any self-respecting Unicode library needs to be able to detect grapheme cluster boundaries. (I'll use 'grapheme' interchangably with 'grapheme cluster' for simplicity.)

The rules

The primary source for the relevant Unicode algorithm here is Unicode Standard Annex (UAX) #29: Text Boundaries. This describes not only grapheme boundaries but also word and sentence boundaries. Another related document is UAX 14: Line Breaking Properties. I will discuss all of these in a later post, but for now, I haven't even started implementing them. Grapheme cluster boundaries are described in terms of the junction or disjunction of adjacent characters. The core of the rules is described in section 3.1. I've simplified them down to ignore characters that won't be present in our NFD-normalized strings. In the below rules, × signifies that the two adjacent code points belong to one grapheme cluster, and ÷ indicates that there is a grapheme cluster break between those characters.

Break at the start and end of text.
GB1.	sot	÷
GB2.		÷	eot
Do not break between a CR and LF. Otherwise, break before and after controls.
GB3.	CR	×	LF
GB4.	( Control \| CR \| LF )	÷
GB5.		÷	( Control \| CR \| LF )
Do not break Hangul syllable sequences.
GB6.	L	×	( L \| V )
GB7.	V	×	( V \| T )
GB8.	T	×	T
Do not break before extending characters.
GB9.		×	Extend
Otherwise, break everywhere.
GB10.	Any	÷	Any

The character groups L, V, T, CR, LF, Extend, Control and Any (which I'll call grapheme classes, a name I invented) are defined in the UAX. Their intuitive definitions are simple, though. The Extend grapheme class encompasses all characters which are marked as extenders, and the Control grapheme class can be thought of as containing all control and line-breaking characters, except CR, LF, ZWNJ and ZWJ. (I'll discuss the last two later.) CR and LF are just that--the characters \r and \n, respectively. These need special handling. L is initial jamo characters, V is jamo medial, and T is jamo final. Any is anything else. sot is the start of the text, and eot is the end. Note that, in the above rules, earlier rules override later rules.

A first shot

My first attempt at implementation didn't go so well. At first, I implemented grapheme cluster breaks without even glancing at the UAX. This was incredibly stupid. I acted like there were two rules, GB9 and GB10. And I misunderstood Extend to be the same as the group of non-starters, that is, characters with a non-zero combining class. Using this incorrect definition, which I repeated in several blog posts, I made the words next-grapheme and prev-grapheme. These words take an index and a string and locate the next or previous grapheme break. (In reality, prev-grapheme never worked, and I had insufficient unit tests so I didn't catch this.)

This simple protocol gives you everything you need to iterate forwards and backwards through graphemes in strings. Using it, it is possible to define a word which generates an array of all of the graphemes in the string:


: (>graphemes) ( i str -- )
    2dup bounds-check? [
        dupd [ next-grapheme ] keep
        [ subseq , ] 2keep (>graphemes)
    ] [ 2drop ] if ;

: >graphemes ( str -- graphemes )
    [ 0 swap (>graphemes) ] { } make* ;

And with this, it's easy to reverse a string, preserving graphemes rather than breaking them up. If we didn't preserve graphemes when reversing strings, the resulting reversal would be incoherent and represent something different, with, for example, accent marks on the wrong character.


: string-reverse ( str -- rts )
    >graphemes reverse concat ;

Note that the concat here does not need to be normalizing, whereas normally, concat and append need to do a little bit of reordering to be properly normalized in NFD.

This interface, along with >graphemes and string-reverse, works independently of the bad implementation of next-grapheme. So, when I fixed that, in the two following versions, I was able to retain the same code for >graphemes and string-reverse.

Fixing some bugs

So, my first version worked for my simple unit tests, but when I expanded testing, I soon found that it failed. This led to a much more complicated version of next-grapheme, which used mountains of conditionals to determine if two graphemes are conjoining.

Most of the grapheme class detection was simple, involving only character classes and simple ranges. But Extend is slightly more complicated. The Extend grapheme class is defined as code points with the property Grapheme_Extend, but there's a more direct definition. Looking in the description page for the Unicode Character Database (UCD), we see that Grapheme_Extend is defined as the characters in the categories Me and Mn, plus a few extras in Other_Grapheme_Extend. Other_Grapheme_Extend is defined in PropList.txt. It's only 21 characters, but rather than enter them in manually, I decided to write a parser for PropList.txt so that the update to the next version of Unicode is easier. (Actually, I cut out only the relevant part of the file, because it will be included in the Factor distribution, and there's no sense in including thousands of lines of text that will not be used.)

There were a few problems: first, every character was tested for its grapheme breaking properties multiple times, where one time would have sufficed. Second, the code was structured in a way that required me to rewrite the whole thing for reverse traversal!

Looking at UAX 29, I saw several references to a 'table-based implementation', but there was no explanation of this. Confused, I wrote to the Unicode list (unicode@unicode.org, an active, public mailing list) asking for an explanation of this. The Unicode list is useful but a bit technical, and my only response was a pointer to the ICU implementation of breaking properties, the BreakIterator. Naively, I looked up the BreakIterator class in the ICU source code, imagining that the code governing it would be there, or in one of the classes that that file directly references. But after looking at maybe 10 files of C++, I gave up. I'm not smart enough for C++, or at least for these big projects.

Through the ICU documentation, I saw that a closely related problem was line breaking properties. Looking at the Unicode spec for that (UAX 14), I got an actual description of table-based implementations. The feeling was something between an epiphany and a realization of incredible stupidity.

Table-based implementation

Here's the basic idea of how to implement grapheme cluster breaks, the table-based way:

Step 1: Before runtime, create a table describing which grapheme classes connect and which grapheme classes disconnect.

Step 2: Between two grapheme classes, you can query the table to see if there's a grapheme cluster break

Step 3: To detect grapheme clusters, iterate through a string and determine the grapheme class of each character, until the classes have a break between them.

The table is an 8x8 bitarray, where the the rows represent the grapheme class of the character before the potential break and the columns represent the character after the potential break. Simply indexing it tells us whether there is a grapheme break. Once you have this table, and the facility to tell which grapheme class each character is in, the task is easy to write. In essence we're making a finite state machine, where the state is completely determined by the previous character. For more details, you can check out my implementation.

Generating the table

But let's look at Step 1 in more detail, which is decidedly not as easy. How do we get from the rules to the table? Sure, you could generate the table manually, since it's not very large, but that process is annoying, error-prone and creates difficult-to-maintain code. A better way to go about it is to generate the table directly from the rules.

No, this doesn't require any fancy DSLs, just a simple algorithm with some utility functions. For this description, assume false does not equal zero. First, initialize an GxG array to false, where G is the number of grapheme classes. Then, define the operation connect to set the array at a point to 1 if it is currently false, else do nothing. Define the operation disconnect to set the array at a point to 0 if that point is currently false. These allow the table to give precedence to earlier operations and ignore overlapping later ones. Once we're done with all of the connects and disconnects, generate the final table by making all points which are 1 true, and all points which are 0 or false, false.

On top of this backing, I implemented a few functions. connect-before connects one class to a list of classes, to implement rules like "L × ( L | V )". connect-after connects a list of classes to a class, to implement rules like " × Extend" (whose 'list of classes' is all of the classes). break-around disconnects all junctions between the first list and the second list, to implement " ÷ ( Control | CR | LF )" (using, again, all classes as one of the arguments). Once I I had these functions, generating the table from the rules was a simple transliteration:


    CR LF connect
    { Control CR LF } graphemes break-around
    L { L V } connect-before
    V { V T } connect-before
    T T connect
    graphemes Extend connect-after

So...

The solution ended up being more complex than I thought, but at least it wasn't unintelligible. I'll go back now and fix all of my incorrect descriptions of grapheme breaking in previous blog entries. Along these same lines, word breaking, sentence breaking line breaking are implemented. But they're more complex, involving optional breaks in the case of line breaking, and non-adjacent characters in the case of word/sentence boundaries. But this is all for a later post!

Useless Factor

Monday, August 27, 2007

Unicode implementer's guide part 4: grapheme breaking

No comments:

Links