Monday, December 31, 2007

Unicode implementer's guide part 6: Input and output

Until now, in my blog posts on Unicode, I've been focusing on internal processing which has only an indirect effect on users. But now I'm thinking about something else: how can you write a text editing widget which potentially supports as many locales as possible? This should generally be the goal when writing internationalized applications, and almost all applications should be eventually internationalized, so why not a text editor?

As it turns out, this is an extremely complicated task, combining the difficulties of input methods and rendering. In this article I'll try to brush on as many relevant topics as possible, trying not to ramble too much on any particular one. Really, each section should probably be given its own blog post.

To test the Unicode-bind approach, I'm looking at the Factor .91 editor gadget, which pays no attention at all to internationalization and basically parrots back whatever characters it receives. Different keyboard layouts were tested on Mac OS X 10.5, though that shouldn't matter here.

Typing European scripts

European scripts are pretty easy. You just type letters on the keyboard and they appear, always in left-to-right order, on the screen, one character appearing for each keystroke in many languages. This works for English, German, Russian, Georgian and many other languages, but not for, say Spanish or Vietnamese. In those languages, there are characters like é or ở which require multiple keystrokes to input. With a typical Spanish keyboard, to write é, first the ´ key is pressed and then the e key. For ở, you first press the ơ key and then the ̉ key.

There are more complicated input methods for, say, typing é with a US keyboard, and these are system-dependent. On Mac OS X, I use Alt-e to get that first combining ´ mark, and then an e to constitute the character under it. This can be used in any application. In Microsoft Office applications on Windows, you can use Ctrl-' to make a combining ´ mark. Elsewhere, you can't use that key chord, just some complicated keystrokes to input the right hex code. I believe Gnome and KDE each define their own mechanisms for this kind of input.

From a Unicode-bind text editor, none of these multiple-character inputs work. When in the Spanish keyboard mode in the Factor editor gadget, pressing ´ does nothing, and when in Vietnamese mode, pressing ơ does nothing. This is because the input system expects something more complicated to happen next. The OS-specific keys, of course, don't work properly either unless there is special handing for them. All of these things must be taken into account in an internationalized editor widget.

Typing East Asian scripts

When looking at the possible input methods in the Mac keyboard layout chooser, you're immediately struck by the fact that four East Asian languages--Chinese, Japanese, Korean and Vietnamese, have far more input methods listed than any other individual language. This is probably because (more than anything else) of the difficulty involved in inputing Han ideographs.

Han ideographs (Chinese characters) are used in Chinese and Japanese (as Kanji), occasionally in Korean and historically with Vietnamese. Unicode encodes over 70,000 Han ideographs, but at least 90% of these are extremely rare. Still, thousands come up in everyday situations and need to be typed into computers.

There are a number of ways of inputting Han ideographs. One way is through a Romanization. This requires several dictionaries, because each language pronounces the ideographs differently, and there are many Romanization systems for each language. One Romanization may correspond to more than one ideograph, so users are presented with a list of numbered choices. Another input method works by specifying the radicals, or component pieces, of the Han ideographs.

Japanese and Korean (and Vietnamese, discussed above) also have alphabets. Japanese has the kana—katakana and hiragana—a syllabary which can be entered by a special keyboard or through Romanization. Korean has Hangul, (I blogged about this earlier) which are syllables made of the 24 letters in the jamo alphabet. These can be entered by Romanization or by typing jamo. In either case, things are expected to coalesce into Hangul syllables automatically during typing.

The choice of input methods is usually chosen outside of the current application, and applications are expected to work perfectly with that choice. A good text editing widget must communicate with the OS settings to pick the right one, and it must implement it properly.

Rendering difficult scripts

Hebrew, European scripts and East Asian scripts are relatively simple to render. Well, maybe not that simple, when it comes to placing (and stacking!) combing marks and putting Hangul syllables or Han ideographs. But at least two adjacent letters don't tend to combine with each other spontaneously, forming several types of ligatures depending on what's on the left and what's on the right. At least letters go in their prevailing order, not rearranging themselves visually.

Certain Middle Eastern scripts like Arabic and Syriac, and several Indic scripts like Devanagari, don't follow these rules. Arabic and Syriac are written in a cursive form. Each letter has up to four (or sometimes five) forms in the simplified form that most computer typesetting systems use, and vowels are written above or below the characters as combining marks. In some computer typesetting systems, these shapes can be inferred from context, but others will display things incorrectly unless special code points are used which include information about which form it is.

Indic scripts are also complicated, but differently. The term "Indic script" refers not only to scripts used in India but a whole family of scripts used around South and Southeast Asia. There are 19 of these scripts encoded in Unicode, all with somewhat similar properties. Devanagari, the script used for Hindi, Nepali and Sanskrit, among other things, is one of these Indic scripts, and others include Malayalam, Tibetan, Tamil, Khmer and Thai. Among consonants in most Indic scripts, there are a large number of ligatures, but these are not encoded into Unicode; they must be inferred by a suitable font. Vowels in Indic scripts can be written after a consonant group, above it, below it, or even before the group, and this must also be inferred by the font. (I'm oversimplifying in both of those sentences, of course. Things are much more complicated.) In most Indic scripts' Unicode encodings, vowels which are placed before a consonant group come aftewards in memory, following logical order. Thai and Lao scripts don't follow these rules, and put those vowels before, but this causes complications in collation. All of these properties of Indic scripts create difficulties not only in rendering but also in input methods, but I haven't read much specifically on this.

Some or all of these issues might be solved by choosing a suitable font and display system rather than the editor widget itself.

Line breaking

I wrote about grapheme breaking previously. Unicode line breaking is sort of like grapheme breaking, except much much more complicated. By "line breaking" I mean what you might call word wrap. It's described in UAX #14.

This document describes several classes of characters with different line-breaking properties. Some force a line break before or after their display, others passively allow a line break before or after, some forbid a line break and others have even more complicated properties. All in all, there are 38 categories. Even though the algorithm is complicated, it (like generating a collation key) can be done in linear time with respect to the length of the string.

With respect to line breaking, different scripts and locales have different rules. In Chinese, Japanese and Korean text, where spaces between words are rare or nonexistent, you can do a line break between any two Han ideographs, kana or Hangul syllables. Often, arbitrary line breaks are even allowed within words written in Roman script. In Korean, we have to watch out that we don't break up a Hangul syllable, so conjoining jamo behavior rules are used.

In Khmer, Lao and Thai scripts, there are also no spaces between words, but line breaks must not interrupt words. This requires some linguistic analysis and dictionary lookup, but Unicode provides a code point to indicate line breaking opportunities. This may sound a little weird and impractical, but a similar thing also exists for European scripts which may hyphenate words between lines: the soft hyphen.

Even if things are restricted to American English, Unicode line breaking properties provide for better visual display than separating things at word breaking points, or just where there's whitespace. Still, it provides a significant implementation challenge for the writer of an internationalized text editor.

Bidirectional text

In Arabic, Hebrew and some other scripts, text runs from right to left rather than left to right. But what if we have a document that combines Hebrew and English: how do we organize text running in two directions? In an even more common situation, we could have text containing Hebrew and numbers. (Even in Hebrew and Arabic, decimal numbers are written left to right.) According to Unicode principles, both left-to-right (LTR) and right-to-left (RTL) scripts should be written in logical order, yet their display order is different.

To negotiate these issues, the Unicode Consortium specified the Bidirectional (BIDI) Algorithm (UAX #9). The BIDI Algorithm could easily fill up a whole blog post by itself, and I don't fully understand it, so I won't try to explain it all here. There are a number of different categories that Unicode code points go in for BIDI purposes: characters can be not only LTR or RTL, but also neutral, mirroring, strongly or weakly directioned, or an invisible special-purpose mark. These categories sometimes overlap, and there are several levels of directioning.

The BIDI algorithm is difficult to implement for display, but it gets even more complicated in a text editor. When placing the cursor somewhere, there can be ambiguity as to where it is. When selecting things, there is sometimes a design decision whether to support logical selection (which could be discontiguous) and visual selection (which is hard to implement). The best resource to learn more about this is the UAX document itself.

Conclusion

For Factor, it will be impossible to keep the editor widget written completely in Factor. It would require a huge amount of duplicated effort and system-specific configuration to get it right, so we'll probably end up having to use native (meaning Gtk, Cocoa and Windows UI) editor widgets.

Clearly, the Unicode-blind approach is insufficient here. Because of all this, internationalized editor widgets are implemented pretty rarely. English-speakers can get complacent in simple input methods, but these will not work for others.


Warning: Readers should be aware that I haven't studied any one of these topics in-depth. This is just an introduction and shouldn't be taken very seriously.

No comments: