Sunday, March 8, 2009

Text encodings and API design

This blog post is about a problem that I haven't figured out the answer to: How should vendor extensions to encodings be exposed to programmers?

It seems like nearly all pre-Unicode text encodings have proprietary Microsoft extensions that become, in practice, the next version of the standard. Sometimes these are supersets, but sometimes the extensions are backwards-incompatible in a few small places. One example of this that should be familiar to Westerners is the distinction between Latin 1 (ISO 8859-1) and Windows 1252. The only difference is in the range 0x80-0x9F, where Latin 1 has a set of rarely used control characters and Windows 1252 has printable characters such as curly quotes and the euro sign.
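To make the difference concrete, here is a small Python sketch. It assumes Python's built-in `latin-1` and `cp1252` codecs are fair stand-ins for ISO 8859-1 and Windows 1252; the sample bytes are made up for illustration.

```python
# Bytes in the 0x80-0x9F range, as a Windows program might write them.
data = b"\x93quoted\x94 \x80 5"

as_latin1 = data.decode("latin-1")  # ISO 8859-1 reading: C1 control characters
as_cp1252 = data.decode("cp1252")   # Windows 1252 reading: printable characters

print(repr(as_latin1))
print(as_cp1252)

# Outside 0x80-0x9F the two encodings agree.
same = b"caf\xe9".decode("latin-1") == b"caf\xe9".decode("cp1252")
print(same)
```

The Latin 1 reading produces invisible control characters, while the Windows 1252 reading produces curly quotes and a euro sign.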

A similar situation exists with many East Asian character sets (e.g. KS X 1001/1003 for Korean, JIS X 0208 for Japanese). In these cases, the byte 0x5C, which is the backslash (\) in ASCII, is mapped to a national currency symbol (the won or yen sign) in the official standard. But in the Microsoft extension, 0x5C is mapped to backslash to maintain compatibility with ASCII, and another character represents the national currency symbol.
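A rough sketch of what this means for a decoder, in Python. The table and the function `decode_ascii_range` are hypothetical and written only for illustration; they model nothing but the single byte 0x5C and ignore every other difference between the variants.

```python
# Hypothetical lookup: what byte 0x5C means under each reading.
BYTE_5C = {
    ("japanese", "standard"): "\u00a5",   # YEN SIGN
    ("japanese", "microsoft"): "\\",      # backslash, as in ASCII
    ("korean", "standard"): "\u20a9",     # WON SIGN
    ("korean", "microsoft"): "\\",        # backslash, as in ASCII
}

def decode_ascii_range(data: bytes, language: str, variant: str) -> str:
    """Decode bytes below 0x80, treating 0x5C according to the chosen variant."""
    out = []
    for b in data:
        if b >= 0x80:
            raise ValueError("this sketch only handles bytes below 0x80")
        out.append(BYTE_5C[(language, variant)] if b == 0x5C else chr(b))
    return "".join(out)

path = b"C:\\Users\\doc.txt"
print(decode_ascii_range(path, "japanese", "standard"))
print(decode_ascii_range(path, "japanese", "microsoft"))
```

The same bytes come out as a Windows path under the Microsoft reading and as a string full of yen signs under the standard reading.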

Some websites mark themselves as ISO 8859-1, but are in fact encoded in Windows 1252, and many web browsers take this into account by decoding such pages as Windows 1252. Similarly, the Microsoft versions of East Asian text encodings are often used in contexts where the standard versions are declared. In some cases, the Microsoft versions are registered with IANA separately for use in internet protocols (e.g. Shift_JIS/Windows-31J and ISO-8859-1/windows-1252), but in other cases there is only one registered encoding (e.g. EUC-KR).
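The lenient behavior described here can be sketched as a lookup from the declared label to the Microsoft variant. This is only an illustration: the `LENIENT` table and `decode_declared` are hypothetical names, and it assumes Python's `cp1252`, `cp932` and `cp949` codecs are acceptable stand-ins for the Microsoft versions.

```python
# Hypothetical table: declared label -> Python codec for the Microsoft variant.
LENIENT = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
    "shift_jis": "cp932",
    "euc-kr": "cp949",
}

def decode_declared(data: bytes, label: str, lenient: bool = True) -> str:
    """Decode data using the declared label, optionally upgrading it."""
    name = label.strip().lower()
    codec = LENIENT.get(name, name) if lenient else name
    return data.decode(codec)

body = b"\x93ok\x94"
print(decode_declared(body, "ISO-8859-1"))                       # lenient reading
print(repr(decode_declared(body, "ISO-8859-1", lenient=False)))  # strict reading
```

With `lenient=True` the declared label is treated as a hint rather than a promise, which is roughly the interpretation the browsers mentioned above are said to take.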

So, there are two questions.
  1. When interpreting an HTTP response, or receiving an email, where the encoding is declared in the header, should the text be interpreted with the standard encoding or the Microsoft extension?
  2. In the encodings API for a programming language, if a particular encoding is used to read or write a file, should the standard be used or the Microsoft extension?

When I started reading about encodings, I assumed that everything could be done reasonably by following the standards precisely. Now I'm not sure what to do.

5 comments:

Barry Kelly said...

Welcome to the real world of software development :)

Anonymous said...

Re 2: Use UTF-8

Unknown said...

troelskn,

Of course *I* use UTF-8 for everything that I write, but the issue is that there are text files and legacy systems out there which don't use UTF-8, and people still need to communicate with them.

Unknown said...

For number 2: my approach would be to have two separate encodings. One uses the standard name and implements the standard behavior. The other has the standard name with "-microsoft" tacked on. Document both. The advantages of this are:

1. you pass on your knowledge about this quirk to users of your API.
2. you empower users of your API to make their own decisions about what to do.
3. the "-microsoft" encoding looks and feels like a wart, which is appropriate because it is.
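A minimal Python sketch of this two-name scheme, for illustration only. The `ENCODINGS` table, the "-microsoft" names and the `decode` function are hypothetical, and Python's own codecs stand in for the two variants here, so they may not match the standards in every detail.

```python
# Hypothetical public names -> Python codecs standing in for each variant.
ENCODINGS = {
    "iso-8859-1": "latin-1",
    "iso-8859-1-microsoft": "cp1252",
    "shift_jis": "shift_jis",
    "shift_jis-microsoft": "cp932",
    "euc-kr": "euc_kr",
    "euc-kr-microsoft": "cp949",
}

def decode(data: bytes, name: str) -> str:
    """Decode with exactly the variant the caller named; no silent upgrade."""
    try:
        codec = ENCODINGS[name.lower()]
    except KeyError:
        raise LookupError(f"unknown encoding: {name}") from None
    return data.decode(codec)

print(repr(decode(b"\x80", "iso-8859-1")))      # C1 control character
print(decode(b"\x80", "iso-8859-1-microsoft"))  # euro sign
```

The caller has to spell out which variant they want, so the quirk is visible at every call site.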

If you want to go to extra effort, you could offer a separate API call that attempts to auto-detect which variant a document actually uses. Just make sure you don't offer this unless your heuristic has a high success rate. And make sure it's a separate call, so your API's users aren't *forced* to accept the cost or the potential false positives of such a scheme.
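For the Latin 1 versus Windows 1252 case, such a separate detection call might look like the Python sketch below. The function `guess_latin1_variant` is hypothetical, and the heuristic rests on an assumption: real documents almost never contain C1 control characters, so any byte in 0x80-0x9F suggests Windows 1252.

```python
C1_RANGE = range(0x80, 0xA0)

def guess_latin1_variant(data: bytes) -> str:
    """Guess whether data is 'latin-1' or 'cp1252' from the bytes it uses."""
    if not any(b in C1_RANGE for b in data):
        return "latin-1"  # the two encodings agree on this input
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return "latin-1"  # uses a byte that Python's cp1252 leaves undefined
    return "cp1252"

print(guess_latin1_variant(b"caf\xe9"))
print(guess_latin1_variant(b"\x93hi\x94"))
```

Nothing this simple is available for the backslash versus currency symbol case, since there the same byte is valid under both readings.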

Re: #1. When interpreting an HTTP response, it's unlikely that you actually need to know or care whether a character is a backslash or a currency symbol. At that layer, you're probably just passing most of your strings through to be processed by other layers of software. As for reading email, if you can't write a heuristic with a very high success rate, this is probably an issue you have to expose to your users, because there's no way you can hide it from them and do the right thing all the time. I could imagine some combination of configuration/preferences and (if a person gets a lot of email of both variants) a button to toggle between the two.

Out of curiosity, have you researched what mainstream email clients do about this issue?

Unknown said...

Josh,

I was thinking that I should have two versions and document them. But if the non-Microsoft version is truly worthless (as might be the case for Japanese and Korean) and never used, while the non-Microsoft names are almost always the ones declared, then this distinction would only confuse developers. My API won't be able to influence how people write their documents; if it could, I'd make everyone use UTF-8.

It's very difficult to have a heuristic which decides between the national currency symbol and backslash; that'd involve interpreting the actual text. Also, when interpreting an HTTP response it's absolutely critical that this is interpreted correctly, since the encoded text is converted to Unicode internally, and the decision has to be made at that point. If any of this text is presented to the user or passed on to something else, it will probably be in UTF-8. A wrong interpretation would then be seen by anyone who looks at it, and this would cause confusion and frustration among users.

I think most mainstream email clients and web browsers use the Microsoft extensions, but I know that only by rumor and haven't checked directly.