Thursday, January 15, 2009

XML encoding auto-detection

The Factor XML parser now auto-detects the encodings of XML documents. This works for every encoding that Factor implements. To see how it works, look at the XML standard, which explains it much better than my blog post did.

I was mystified myself when I first read that XML documents can specify their own encoding, in the document itself. The encoding, if it's not UTF-8 or UTF-16, must be specified in the prolog like this:
<?xml version="1.0" encoding="ISO-8859-1"?>

The idea of the algorithm is simple. It just goes by cases. First, check for a byte order mark (BOM), which would indicate UTF-16 and a particular endianness, or UTF-8. If there's no BOM, then the first character must be < or whitespace. If it's <, we can differentiate between UTF-16BE, UTF-16LE (without BOMs) and an 8-bit encoding. If it's one of the first two, we can tell by the null byte that comes before the < (UTF-16BE) or after it (UTF-16LE). If it's an 8-bit encoding, we can be sure that there won't be any non-ASCII characters in the prolog, so just read the prolog as if it's UTF-8, and if an encoding is declared, use that.
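The cases above can be sketched roughly as follows. This is an illustrative Python version, not the Factor code: the function name detect_xml_encoding and the regular expression are my own assumptions, it returns Python codec names, and it ignores UTF-32 entirely.

```python
import re

# Hypothetical pattern for the encoding declaration. The prolog is ASCII in
# any 8-bit encoding this sketch handles, so it is matched on raw bytes.
_DECLARED = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def detect_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its leading bytes."""
    # Case 1: a byte order mark settles the question.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # Case 2: no BOM, but a null byte next to '<' gives away UTF-16.
    if data.startswith(b"\x00<"):
        return "utf-16-be"
    if data.startswith(b"<\x00"):
        return "utf-16-le"
    # Case 3: an 8-bit encoding. Use the declared encoding if there is one.
    match = _DECLARED.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No declaration: fall back to UTF-8.
    return "utf-8"
```

For example, detect_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>') returns "iso-8859-1", and the same call on "<a/>".encode("utf-16-le") returns "utf-16-le".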

To implement it, I just read byte by byte and have a case statement for each level. After just two octets, it's possible to differentiate between UTF-8, UTF-16 (with a BOM, for both endiannesses), UTF-16BE and UTF-16LE. A similar process could also identify UTF-32 and friends after 4 octets. In my implementation, I had to do a little bit of hacking inside the XML code itself to get this integrated properly. All together, it's about 40 or 50 lines of code. It's available now in the Factor git repository.
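The four-octet check for UTF-32 might look something like this. Again, this is a hedged Python sketch rather than anything in the Factor repository; the name detect_utf32 is made up, and the byte patterns are the ones listed in the XML specification's appendix on encoding detection.

```python
from typing import Optional

# First four octets mapped to Python codec names (illustrative table).
_UTF32_SIGNATURES = {
    b"\x00\x00\xfe\xff": "utf-32-be",  # BOM, big-endian
    b"\xff\xfe\x00\x00": "utf-32-le",  # BOM, little-endian
    b"\x00\x00\x00<": "utf-32-be",     # no BOM, '<' stored big-endian
    b"<\x00\x00\x00": "utf-32-le",     # no BOM, '<' stored little-endian
}

def detect_utf32(data: bytes) -> Optional[str]:
    """Return a UTF-32 codec name if the first four octets match, else None."""
    return _UTF32_SIGNATURES.get(data[:4])
```

A check like this would have to run before the two-octet cases, because the little-endian UTF-32 BOM (FF FE 00 00) begins with the same two bytes as the little-endian UTF-16 BOM.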

[Update: Thanks for pointing out my error, Subbu Allamaraju. Fixed a typo, see comments.]

6 comments:

Anonymous said...

How do you parse XML (tree) when your only data structure is a stack?

Anonymous said...

I suppose you meant to say "I just read byte by byte..." instead of "I just read character by character...".

Anonymous said...

Technically, UTF-8 is not supposed to have a BOM, and the character should be treated as a zero-width no-break space, but that would make the document invalid in XML 1.0 anyway.

What do you mean by this? As far as I know, U+FEFF at the very beginning of a UTF-8 encoded XML file is acceptable, although neither necessary nor recommended.

Unknown said...

Oops. After looking at the Unicode standard again, I guess the BOM is allowed for UTF-8. I should update a bunch of code now.

Zeev said...

RE: "The XML specification won't tell you how to do it, so I'll explain."

The spec tells you exactly how to do it:
http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

Unknown said...

Zeev,

You're right. This blog post was terrible, and full of errors, and the piece of the XML spec that you referenced was much more clear.