Comments on Useless Factor: A second introduction to Unicode

Tom, I completely disagree with using wchar_t, sinc...

2008-06-08T12:37:00.000-07:00

Tom,

I completely disagree with using wchar_t, since it's often only 16 bits wide, while a Unicode code point needs at least 21 bits. People end up treating UTF-16 strings (made of wchar_t) as arrays of code points, and then surrogate pairs, which come up only rarely but still have to be supported, break that assumption. So Joel's post was outdated even when he wrote it, but it's still a good explanation of the history, justification and aims of Unicode.
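To make the surrogate pair problem concrete, here is a minimal sketch in Python (chosen only for illustration; the same thing happens with 16-bit wchar_t in C). It uses U+1D11E as an arbitrary example of a code point outside the Basic Multilingual Plane and shows that it takes two 16-bit units in UTF-16.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"

# Encode as UTF-16 (little-endian, no BOM) and split into 16-bit code units.
data = clef.encode("utf-16-le")
units = [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

code_points = len(clef)   # one code point
code_units = len(units)   # but two 16-bit code units

# The two units are a high surrogate followed by a low surrogate.
is_surrogate_pair = (0xD800 <= units[0] <= 0xDBFF) and (0xDC00 <= units[1] <= 0xDFFF)

print(code_points, code_units, [hex(u) for u in units], is_surrogate_pair)
```

Code that indexes the array of 16-bit units as if each one were a code point would see two meaningless halves here instead of one character.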

On the other hand, reading the character at a given position out of UTF-8 is awkward compared to indexing an array. I don't see any big need to compress strings in memory using UTF-8 (which isn't all that good a compression scheme anyway).
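As a rough illustration of why indexing into UTF-8 is awkward, the sketch below finds the byte offset of the n-th code point by scanning from the start. The function name `utf8_offset` is hypothetical and made up for this example; it assumes the input is valid UTF-8.

```python
def utf8_offset(data: bytes, index: int) -> int:
    """Return the byte offset at which code point number `index` starts.

    Assumes `data` is valid UTF-8. The scan is linear in the length of the data.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


text = "aß€b"
data = text.encode("utf-8")          # a: 1 byte, ß: 2 bytes, €: 3 bytes, b: 1 byte
offset_of_b = utf8_offset(data, 3)   # position of the fourth code point
print(len(data), offset_of_b)
```

With a fixed-width representation the same lookup is a single array index, which is the trade-off being weighed here.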

I think a fixed-width encoding is easiest to use internally, so it's what I'll use for my own coding (in Factor).

rsamuel, Factor's representation uses 3 bytes per c...

2008-06-08T12:30:00.000-07:00

rsamuel,

Factor's representation uses 3 bytes per code point, split across two arrays: one holds 8 bits of each code point and the other holds the remaining 16 bits.
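A minimal sketch of that kind of split layout, in Python rather than Factor. This is not Factor's actual code: the names `pack` and `code_point_at` are hypothetical, and the choice to put the low 16 bits in one array and the high 8 bits in the other is an assumption, since the comment does not say which bits go where.

```python
from array import array


def pack(text):
    """Split each code point into a 16-bit part and an 8-bit part.

    Assumption: the low 16 bits go in one array and the high bits in the other.
    """
    low = array("H", (ord(ch) & 0xFFFF for ch in text))   # 16 bits per code point
    high = array("B", (ord(ch) >> 16 for ch in text))     # 8 bits per code point
    return low, high


def code_point_at(low, high, i):
    """Recombine the two parts; lookup stays a constant-time array index."""
    return (high[i] << 16) | low[i]


low, high = pack("a\U0001D11E")
first = code_point_at(low, high, 0)
second = code_point_at(low, high, 1)
print(hex(first), hex(second))
```

The largest code point is U+10FFFF, so the high part never exceeds 0x10 and fits comfortably in one byte.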

"If you're using C or C++, wchar_t rather than cha...

2008-06-08T10:46:00.000-07:00

"If you're using C or C++, wchar_t rather than char for strings works for most cases."

I'm curious as to what these "most cases" are.

First, the width of wchar_t isn't well-defined; on many machines it's 16 bits, but certainly not on all.

That aside, wchar_t is completely incompatible with the char-based functions of the standard C library and with your standard C/C++ string constants.

It's absolutely not a matter of dropping in new code; consider that when ASCII text is stored as 16-bit wchar_t, every other byte is 0, so if your code ever relies on "byte 0 means end of string" you're lost.

Finally, a 16-bit wchar_t doesn't actually let you represent every Unicode "character" (or code point), though this is pretty academic because the missing ones are highly obscure.
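The zero-byte problem can be seen in a few lines. This sketch uses Python and the UTF-16 little-endian encoding as a stand-in for a 16-bit wchar_t buffer; the actual layout depends on the platform's wchar_t width and byte order, so treat it as an illustration only.

```python
# "Hi" stored as 16-bit little-endian units, as a 16-bit wchar_t buffer might be.
wide = "Hi".encode("utf-16-le")

# Byte-oriented code that stops at the first zero byte (the way strlen does)
# sees a string of length 1, because the high byte of 'H' is zero.
c_string_length = wide.index(0)

print(wide, c_string_length)
```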

Why not use UTF-8?

I'm curious to what other developers are using for...

2008-06-08T09:12:00.000-07:00

I'm curious as to what other developers are using for their in-memory encoding. Are you using UCS-4 and taking the hit of 4 bytes per code point? Are you using UTF-8/-16 and writing your code to deal with variable-width encodings?

Has anyone written their code to use 3 bytes per code point to get a fixed-width encoding with minimal loss?

"as well as a few weird cases like the character ß...

2008-06-08T01:19:00.000-07:00

"as well as a few weird cases like the character ß going to SS in upper case"

It might be worth mentioning that this is true even in Unicode 5.1, which introduced the capital sz (ẞ).
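A quick way to check this, sketched in Python 3, whose string methods follow the Unicode case mappings (the exact behavior depends on the Unicode data shipped with the runtime, so this is an illustration rather than a guarantee):

```python
sharp_s = "\u00DF"           # ß, LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1E9E"   # ẞ, LATIN CAPITAL LETTER SHARP S

upper_of_sharp_s = sharp_s.upper()             # expands to two letters
upper_of_word = "Stra\u00DFe".upper()          # "Straße"
lower_of_capital = capital_sharp_s.lower()     # the capital form maps back to ß

print(upper_of_sharp_s, upper_of_word, lower_of_capital)
```

So upper-casing can change the length of a string, and the capital form is not what the default upper-case mapping of ß produces.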