A lot of people have written blog posts which are introductions to Unicode, and I didn't want to write another one with no new information in it. A popular one is Joel (on Software)'s one, which describes what Unicode is and why it's important. You've likely already read an introduction to Unicode, so I'll just summarize the most important points:
- You can't assume text is in ASCII anymore This isn't just about being nice to non-English speakers. Even Americans enjoy their “curly quotes”, their cafés—and their em dashes. User input might come with these non-ASCII characters, and it must be handed properly by robust applications.
- Unicode is the character set to use internally A bunch of character sets have been developed over the years for different purposes, but Unicode can represent more scripts than any one other character set. Unicode was designed to be able to include the characters from basically all other character sets in use. If you're using C or C++, wchar_t rather than char for strings works for most cases. If you're using a higher level language, then strings should already be stored in some representation that allows for Unicode uniformly.
- There are several text encodings Not all text is in ASCII, and very little text is in the most common 8-bit extension, Latin 1 (ISO-8859-1). Lots of input is in UTF-8, which can represent all of Unicode, but there are other Unicode encodings, as well as specific East Asian encodings like GB 2312 and Shift JIS, in active use. Generally, UTF-8 should be used for output, and it's on the rise in terms of usage. Depending on the programming language or library used, you might have to account for the encoding when doing text processing internally. UTF-16 and UTF-8 are the most common, and careless programming can get meaningless results in non-ASCII or non-BMP cases if the encoding is ignored.
- Unicode encodes characters, not glyphs Unicode can be seen as a mapping between numbers and code points, where a code point is the basic unit of Unicode stuff. It's been decided that this basic unit is for characters, like letters and spaces, rather than specific presentation forms, which are referred to as glyphs. Glyphs are something that only font designers and people who work on text rendering have to care about.
But there's a little bit more that programmers have to know about. Unicode is part of a bigger program of internationalization within a single framework of encodings and algorithms. The Unicode standard includes several important algorithms that programmers should be aware of. They don't have to be able to implement them, just to figure out where in the library they are.
- Normalization Because of complications in the design, some Unicode strings have more than one possible form that are actually equivalent. There are a number of normalization forms that have been defined to get rid of these differences, and the one you should use is probably NFC. Usually, you should normalize before doing something like comparing for equality. This is independent of locale.
- Grapheme, word, sentence and line breaks It's not true, anymore, that a single character forms a single unit for screen display. If you have a q with an umlaut over it, this needs to be represented as two characters, yet it is one grapheme. If you're dealing with screen units (imagine an internationalized Hangman), a library should be used for grapheme breaks. Similarly, you can't identify words as things separated by spaces or punctuation, or line break opportunities by looking for whitespace, or sentence breaks by looking just at punctuation marks. It's easy to write a regular expression which tries to do one of these things but does it wrong for English, and it's even easier to do it wrong for other languages, which use other conventions. So use a Unicode library for this. The locale affects how these breaks happen.
- Bidirectional text When displaying text on a screen, it doesn't always go left to right as in most languages. Some scripts, like Hebrew and Arabic, go right to left. To account for this, use the Unicode Bidirectional Text Algorithm (BIDI), which should be implemented in your Unicode library. Locale doesn't matter here.
- Case conversion Putting a string in lowercase is more complicated than replacing [A-Z] with [a-z]. Accent marks and other scripts should be taken into account, as well as a few weird cases like the character ß going to SS in upper case. The locale is also relevant in case conversion, to handle certain dots in Turkish, Azeri and Lithuanian.
- Collation There's an algorithm for Unicode collation that works much better than sorting by ASCII value, and works reasonably for most languages. Depending on the locale, it should be modified. Even in English, the Unicode Collation Algorithm produces much more natural results. Parts of the collation key can be used for insensitive comparisons, eg. ignoring case.
For further reading, you can look at this much more in-depth article from SIL, or the Unicode 5.1 standard itself, which isn't that bad. Most programmers can't be expected to know all of this stuff, and they shouldn't. But it'd be nice if everyone used the appropriate library for text processing when needed, so that applications could be more easily internationalized.