It seems like basically all pre-Unicode text encodings have proprietary Microsoft extensions that become, in practice, the next version of the standard. Sometimes these are straightforward supersets, but sometimes the extensions are backwards-incompatible in small but significant places. One example that should be familiar to Westerners is the distinction between Latin 1 (ISO 8859-1) and Windows 1252. The only difference is in the range 0x80-0x9F, where Latin 1 assigns the rarely useful C1 control characters and Windows 1252 assigns printable characters such as the curly quotes and the euro sign.
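For example, the same three bytes come out quite differently under the two mappings. A minimal sketch in Python, whose built-in `latin-1` and `cp1252` codecs implement the two tables:

```python
data = bytes([0x80, 0x93, 0x94])

# Latin 1 assigns these positions to C1 control characters.
print(repr(data.decode("latin-1")))   # '\x80\x93\x94'

# Windows 1252 assigns printable characters instead: euro sign, curly quotes.
print(repr(data.decode("cp1252")))    # '€“”'
```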
A similar situation exists with many East Asian character sets (e.g. KS X 1001/1003 for Korean, JIS X 0201/0208 for Japanese). In these cases, the official standard maps the byte 0x5C (backslash in ASCII) to a national currency symbol (₩ or ¥). But in the Microsoft extension, 0x5C is mapped to backslash to maintain compatibility with ASCII, and the national currency symbol is represented by a separate code point elsewhere in the encoding.
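The divergence is easy to see in any library that exposes both the standard codec and the Windows code page side by side. A small Python illustration (assuming I understand CPython's codecs correctly; the particular characters are just convenient examples of the incompatible case and the superset case):

```python
# Incompatible mapping: the same Shift_JIS bytes decode to different characters.
print(b"\x81\x60".decode("shift_jis"))   # '〜'  U+301C WAVE DASH (JIS mapping)
print(b"\x81\x60".decode("cp932"))       # '～'  U+FF5E FULLWIDTH TILDE (Microsoft mapping)

# Superset behaviour: '똠' is not among the 2,350 syllables of KS X 1001,
# so the standard codec rejects it while Microsoft's cp949 (UHC) accepts it.
print("똠".encode("cp949"))
try:
    "똠".encode("euc-kr")
except UnicodeEncodeError as exc:
    print("euc-kr cannot encode it:", exc)
```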
Some websites mark themselves as ISO 8859-1 but are in fact encoded in Windows 1252, and web browsers accommodate this by decoding content labelled ISO 8859-1 as Windows 1252 (the WHATWG Encoding Standard now mandates exactly that). Similarly, the Microsoft versions of East Asian text encodings are often used in contexts where the standard versions are declared. In some cases the Microsoft extension is registered with IANA under its own name for use in internet protocols (e.g. Windows-31J alongside Shift_JIS, and Windows-1252 alongside ISO-8859-1), but in other cases only one name is registered (e.g. EUC-KR).
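In practice, browsers resolve this by silently upgrading the declared label to the Windows superset before decoding. A minimal sketch of that approach (the remapping table and the function name are my own, for illustration; the WHATWG Encoding Standard defines the authoritative label list):

```python
# Map declared labels to the Windows code page that real-world content usually means.
WEB_SUPERSET = {
    "iso-8859-1": "cp1252",   # labelled Latin 1, usually Windows 1252 bytes
    "us-ascii":   "cp1252",
    "shift_jis":  "cp932",    # a.k.a. Windows-31J
    "euc-kr":     "cp949",    # a.k.a. UHC
}

def decode_like_a_browser(data: bytes, declared_charset: str) -> str:
    codec = WEB_SUPERSET.get(declared_charset.strip().lower(), declared_charset)
    return data.decode(codec)

# The euro sign survives even though the page claimed ISO 8859-1.
print(decode_like_a_browser(b"price: \x80 5", "ISO-8859-1"))   # 'price: € 5'
```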
So, there are two questions:
- When interpreting an HTTP response or an email whose encoding is declared in a header, should the text be decoded with the standard encoding or with the Microsoft extension?
- When a programming language's encodings API is asked to read or write a file in a particular encoding, should it follow the standard or the Microsoft extension? (A concrete illustration follows below.)
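For comparison, Python's own I/O layer answers the second question by taking the strict route: the label you pass is the codec you get, so the standard and the extension must be requested under different names. A small sketch (the file name is arbitrary, and this assumes CPython's `euc-kr` codec really is the strict KS X 1001 version, as I believe it is):

```python
# Write a string containing a syllable outside KS X 1001 using Microsoft's extension.
with open("memo.txt", "w", encoding="cp949") as f:
    f.write("똠방각하")

# Reading it back with the standard codec fails on the extended syllable.
with open("memo.txt", encoding="euc-kr") as f:
    try:
        f.read()
    except UnicodeDecodeError as exc:
        print("standard EUC-KR cannot read it back:", exc)
```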
When I started reading about encodings, I assumed that everything could be done reasonably by following the standards precisely. Now I'm not sure what to do.