Saturday, June 7, 2008

A second introduction to Unicode

If you're a normal programmer, you probably really don't want to have to think about Unicode. In what you're doing, text processing probably isn't a very important aspect, and most of your users will be using English. Nevertheless, text has a tendency to creep its way into almost everything, as the basic computer/human interface. So it might be a little beneficial to know about the basics of text processing and the Unicode character set.

A lot of people have written blog posts which are introductions to Unicode, and I didn't want to write another one with no new information in it. A popular one is Joel on Software's, which describes what Unicode is and why it's important. You've likely already read an introduction to Unicode, so I'll just summarize the most important points:
  • You can't assume text is in ASCII anymore. This isn't just about being nice to non-English speakers. Even Americans enjoy their “curly quotes”, their cafés—and their em dashes. User input might come with these non-ASCII characters, and it must be handled properly by robust applications.
  • Unicode is the character set to use internally. A bunch of character sets have been developed over the years for different purposes, but Unicode can represent more scripts than any one other character set. Unicode was designed to be able to include the characters from basically all other character sets in use. If you're using C or C++, wchar_t rather than char for strings works for most cases. If you're using a higher level language, then strings should already be stored in some representation that allows for Unicode uniformly.
  • There are several text encodings. Not all text is in ASCII, and very little text is in the most common 8-bit extension, Latin 1 (ISO-8859-1). Lots of input is in UTF-8, which can represent all of Unicode, but there are other Unicode encodings, as well as specific East Asian encodings like GB 2312 and Shift JIS, in active use. Generally, UTF-8 should be used for output, and it's on the rise in terms of usage. Depending on the programming language or library used, you might have to account for the encoding when doing text processing internally. UTF-16 and UTF-8 are the most common internal encodings, and careless programming can produce meaningless results for non-ASCII or non-BMP characters if the encoding is ignored.
  • Unicode encodes characters, not glyphs. Unicode can be seen as a mapping between numbers, called code points, and characters, where a code point is the basic unit of Unicode text. It's been decided that this basic unit is for characters, like letters and spaces, rather than specific presentation forms, which are referred to as glyphs. Glyphs are something that only font designers and people who work on text rendering have to care about.
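
To make the points about encodings and code points concrete, here's a minimal sketch in Python 3. The language and the sample string are just assumptions for illustration; the only thing relied on is that a Python 3 str is a sequence of code points. It counts the same text in code points, UTF-8 bytes, UTF-16 code units and UTF-32 bytes, and then shows what happens when bytes are decoded with the wrong encoding.

```python
# Sample text (chosen for illustration): "café", a space, and
# U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP.
text = "caf\u00e9 \U0001D11E"

code_points = len(text)                           # str is indexed by code point
utf8_bytes = len(text.encode("utf-8"))            # 1 to 4 bytes per code point
utf16_units = len(text.encode("utf-16-le")) // 2  # non-BMP takes a surrogate pair
utf32_bytes = len(text.encode("utf-32-le"))       # always 4 bytes per code point

print(code_points, utf8_bytes, utf16_units, utf32_bytes)  # 6 10 7 24

# Decoding UTF-8 bytes as Latin 1 raises no error; it just yields garbage.
mojibake = "caf\u00e9".encode("utf-8").decode("latin-1")
print(mojibake)  # cafÃ©
```

The four counts differ for the same string, which is why code that treats bytes or UTF-16 code units as if they were characters goes wrong as soon as non-ASCII or non-BMP text shows up.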

But there's a little bit more that programmers have to know about. Unicode is part of a bigger program of internationalization within a single framework of encodings and algorithms. The Unicode standard includes several important algorithms that programmers should be aware of. They don't have to be able to implement them, just to figure out where in the library they are.
  • Normalization. Because of complications in the design, some text can be written as more than one sequence of code points, all of which are actually equivalent. There are a number of normalization forms that have been defined to get rid of these differences, and the one you should use is probably NFC. Usually, you should normalize before doing something like comparing for equality. This is independent of locale.
  • Grapheme, word, sentence and line breaks. It's not true, anymore, that a single character forms a single unit for screen display. If you have a q with an umlaut over it, this needs to be represented as two code points (q followed by a combining diaeresis), yet it is one grapheme. If you're dealing with screen units (imagine an internationalized Hangman), a library should be used for grapheme breaks. Similarly, you can't identify words as things separated by spaces or punctuation, or line break opportunities by looking for whitespace, or sentence breaks by looking just at punctuation marks. It's easy to write a regular expression which tries to do one of these things but does it wrong for English, and it's even easier to do it wrong for other languages, which use other conventions. So use a Unicode library for this. The locale affects how these breaks happen.
  • Bidirectional text. When displaying text on a screen, it doesn't always go left to right as in most languages. Some scripts, like Hebrew and Arabic, go right to left. To account for this, use the Unicode Bidirectional Algorithm (BIDI), which should be implemented in your Unicode library. Locale doesn't matter here.
  • Case conversion. Putting a string in lowercase is more complicated than replacing [A-Z] with [a-z]. Accent marks and other scripts should be taken into account, as well as a few weird cases like the character ß going to SS in upper case. The locale is also relevant in case conversion, to handle certain dots in Turkish, Azeri and Lithuanian.
  • Collation. There's an algorithm for Unicode collation that works much better than sorting by ASCII value, and works reasonably for most languages. It should be tailored to the locale, but even in English, the Unicode Collation Algorithm produces much more natural results. Parts of the collation key can be used for comparisons that are insensitive to certain differences, e.g. ignoring case.
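
A few of these are easy to try out. The following is a minimal sketch using only Python 3's standard library (an assumption for illustration; any decent Unicode library offers the equivalents). It covers normalization, the q-with-umlaut case, the ß case and per-character bidi classes. Grapheme breaking, the full bidirectional algorithm, locale-sensitive casing and collation aren't in Python's standard library, so they aren't shown.

```python
import unicodedata

# Normalization: two equivalent spellings of the same text.
composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

raw_equal = composed == decomposed
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
print(raw_equal, nfc_equal)   # False True

# q with an umlaut has no precomposed form, so even after NFC it is
# still two code points that display as a single grapheme.
q_umlaut = unicodedata.normalize("NFC", "q\u0308")
print(len(q_umlaut))          # 2

# Case conversion: ß uppercases to SS, so the string gets longer.
upper = "stra\u00dfe".upper()
print(upper)                  # STRASSE

# casefold() is intended for caseless comparison.
caseless_equal = "STRASSE".casefold() == "stra\u00dfe".casefold()
print(caseless_equal)         # True

# Bidi classes of single characters: Latin a, Hebrew alef, Arabic alef.
# These are inputs to the bidirectional algorithm, not the algorithm itself.
bidi_classes = [unicodedata.bidirectional(ch) for ch in "a\u05d0\u0627"]
print(bidi_classes)           # ['L', 'R', 'AL']
```

Note that Python's str.upper() and str.casefold() take no locale argument, so the Turkish, Azeri and Lithuanian special cases mentioned above still need a library that accepts a locale.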

For further reading, you can look at this much more in-depth article from SIL, or the Unicode 5.1 standard itself, which isn't that bad. Most programmers can't be expected to know all of this stuff, and they shouldn't have to. But it'd be nice if everyone used the appropriate library for text processing when needed, so that applications could be more easily internationalized.

5 comments:

Anonymous said...

"as well as a few weird cases like the character ß going to SS in upper case"

It might be worth mentioning that that is true even in Unicode 5.1, which introduced capital sz.

R Samuel said...

I'm curious as to what other developers are using for their in-memory encoding. Are you using UCS-4 and taking the 4 byte per-codepoint hit? Are you using UTF-8/-16 and writing your code to deal with variable-width codepoints?

Has anyone written their code to use 3 bytes per-codepoint to get a fixed-width encoding with a minimal loss?

Tom said...

"If you're using C or C++, wchar_t rather than char for strings works for most cases."

I'm curious as to what these "most cases" are.

First, wchar_t isn't well-defined; on many machines it's 16-bit but certainly not all.

That aside, wchar_t is completely incompatible with the standard C library and your standard C/C++ string constants.

It's absolutely not a matter of dropping in new code; contemplate that in an ASCII constant, every other byte is 0 so if your code ever relies on "byte 0 means end of string" you're lost.

Finally, wchar_t doesn't actually let you represent every Unicode "character" (or codepoint), though this is pretty academic because the missing ones are highly obscure.

Why not use UTF-8?

Daniel Ehrenberg said...

rsamuel,

Factor's representation uses 3 bytes per code point, put in two arrays, with one that's 8 of the bits and one that's 16 of the bits.

Daniel Ehrenberg said...

Tom,

I completely disagree with using wchar_t, since it's usually 16 bits but at least 21 bits are needed. People end up treating UTF-16 strings (made of wchar_t) as an array of code points, but then there are those surrogate pairs that come up only rarely (but should still be supported). So Joel's post was outdated even when he wrote it, but it's still a good explanation of the history, justification and aims of Unicode.

On the other hand, UTF-8 is awkward to read characters from directly compared to indexing an array. I don't see any big need to compress stuff in memory using UTF-8 (which isn't all that good a compression algorithm).

I think a fixed-width encoding is easiest to use internally, so it's what I'll use for my own coding (in Factor).