Comments on Useless Factor: Text encodings and API design

2009-03-09T10:41:00.000-07:00

Josh,

I was thinking that I should have two versions and document them. But if the non-Microsoft version is truly worthless and never used (as might be the case for Japanese and Korean), even though the non-Microsoft names are almost always used, then this distinction would only confuse developers. Through my API, I won't be able to influence the behavior of people who write documents, and if I could, I'd make everyone use UTF-8.

It's very difficult to write a heuristic that decides between the national currency symbol and backslash; that would involve interpreting the actual text. Also, when interpreting an HTTP response it's absolutely critical that this is interpreted correctly, since the encoded text will be converted to Unicode internally, and by then the decision will have been made. If any of this text is presented to the user or passed on to something else, it will probably be in UTF-8. A wrong interpretation would then be seen by anyone who looks at it, and this would cause confusion and frustration among users.

I think most mainstream email clients and web browsers use the Microsoft extensions, but I know that only by rumor and haven't checked directly.

Anonymous (https://www.blogger.com/profile/00902922561603041049)

2009-03-09T10:26:00.000-07:00

For number 2: my approach would be to have two separate encodings.
One uses the standard name and implements the standard behavior. The other has the standard name with "-microsoft" tacked on. Document both. The advantages of this are:

1. You pass on your knowledge about this quirk to users of your API.
2. You empower users of your API to make their own decisions about what to do.
3. The "-microsoft" encoding looks and feels like a wart, which is appropriate because it is.

If you want to go to the extra effort, you could offer a separate API call that attempts to auto-detect which variant a document actually uses. Just make sure you don't offer this unless your heuristic has a high success rate. And make sure it's a separate call, so your API's users aren't *forced* to accept the cost or the potential false positives of such a scheme.

Re: #1. When interpreting an HTTP response, it's unlikely that you actually need to know or care whether a character is a backslash or a currency symbol. At that layer, you're probably just passing most of your strings through to be processed by other layers of software. As for reading email, if you can't write a heuristic with a very high success rate, this is probably an issue you have to expose to your users, because there's no way you can hide it from them and do the right thing all the time.
I could imagine some combination of configuration/preferences and (if a person gets a lot of email of both variants) a button to toggle between the two.

Out of curiosity, have you researched what mainstream email clients do about this issue?

Anonymous (https://www.blogger.com/profile/06784200909094825965)

2009-03-09T08:28:00.000-07:00

troelskn,

Of course *I* use UTF-8 for everything that I write, but the issue is that there are text files and legacy systems out there that don't use UTF-8, and people need to communicate with them.

Anonymous (https://www.blogger.com/profile/00902922561603041049)

2009-03-09T05:27:00.000-07:00

Re 2: Use UTF-8

Anonymous

2009-03-08T20:25:00.000-07:00

Welcome to the real world of software development :)

Barry Kelly (https://www.blogger.com/profile/10559947643606684495)
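
For anyone who wants to see the "two separate encodings" suggestion from the thread in concrete form, here is a rough Python sketch. Everything in it is hypothetical: the names `toy-jis` and `toy-jis-microsoft`, the `ENCODINGS` table, and the `decode` function are invented for illustration, and the toy decoder only handles bytes below 0x80. The only thing it tries to show is that the caller picks the variant explicitly by name, and that the single byte 0x5C comes out as either a yen sign or a backslash depending on that choice.

```python
# Hypothetical sketch: names and mappings are illustrative, not a real codec.
YEN_SIGN = "\u00a5"


def _decode_ascii_range(data: bytes, byte_5c: str) -> str:
    """Decode bytes below 0x80, mapping byte 0x5C to the given character."""
    out = []
    for b in data:
        if b >= 0x80:
            raise ValueError(f"byte 0x{b:02x} is outside this toy decoder's range")
        out.append(byte_5c if b == 0x5C else chr(b))
    return "".join(out)


# Two separately named, separately documented encodings.
# The "-microsoft" suffix marks the variant that treats 0x5C as a backslash.
ENCODINGS = {
    "toy-jis": lambda data: _decode_ascii_range(data, YEN_SIGN),
    "toy-jis-microsoft": lambda data: _decode_ascii_range(data, "\\"),
}


def decode(data: bytes, encoding: str) -> str:
    """Decode with an explicitly chosen variant; nothing is guessed."""
    try:
        decoder = ENCODINGS[encoding]
    except KeyError:
        raise LookupError(f"unknown encoding: {encoding}") from None
    return decoder(data)


if __name__ == "__main__":
    raw = b"C:\\temp"
    print(decode(raw, "toy-jis"))
    print(decode(raw, "toy-jis-microsoft"))
```

Auto-detection is deliberately left out here. Following the suggestion in the thread, it would live in its own function so that callers who don't want the cost or the false positives never pay for them.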