Sunday, February 17, 2008

Designing an API for encoded streams

When I started looking at Unicode to design a good library for Factor, I wanted to make an API such that the programmer never needed to think about Unicode at all. I now see that that's impossible, for a number of reasons. One thing that the programmer needs to explicitly think about is the encoding of files. For this reason, I'm in the middle of changing lots of words which deal with streams to take an extra mandatory parameter specifying the encoding. The encodings supported so far are binary, ascii, latin1, utf8, utf16 and some more. In the library, we'll eventually put Shift JIS, more 8-bit encodings like MacRoman, Windows 1252 and the other ISO-8859s, UTF-32, etc. Internally, all strings are already in Unicode; this is only for external communication.


Some people objected to this. Why should there be a new mandatory parameter when everything worked already? This makes code longer, rather than shorter! The rationale is that this expands the functionality of streams. With the old functionality, everything is treated as if it's encoded in Latin 1. But in 99% of cases, this is just wrong. When a text file isn't in plain old ASCII, it's almost always in UTF-8 (though occasionally it's in UTF-16 or Shift JIS). Things are rarely in an 8-bit encoding because of its ambiguity; 8-bit non-ASCII encodings can safely be labeled "legacy" except on specialized low-resource applications. Right now, all streams behave as if Latin 1 were their encoding, and if we want to do much actual text processing, things will come out wrong.
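To make the Latin 1 problem concrete, here is a small sketch. It is written in Python purely for illustration and is not Factor's API; it only shows what happens when UTF-8 bytes are read as if they were Latin 1.

```python
# UTF-8 bytes for a two-character string with one non-ASCII letter.
data = "\u00e9!".encode("utf-8")     # three bytes: c3 a9 21

as_latin1 = data.decode("latin-1")   # every byte becomes one character
as_utf8 = data.decode("utf-8")       # the two-byte sequence becomes one character

# Three bytes give three characters one way and two the other.
print(len(data), len(as_latin1), len(as_utf8))
```

Reading the bytes as Latin 1 never fails, which is exactly why the mistake goes unnoticed until the text is processed or displayed.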

Even if UTF-8 is used most of the time, could we use a heuristic to determine what encoding things are in? If we know the file is in either ASCII, UTF-8, UTF-16 or UTF-32, it's not too hard to come up with some kind of heuristic that works in almost all cases. But once things get generalized to Shift JIS and 8-bit encodings, it's basically impossible to determine generally how things are encoded. And it's completely impossible if there are binary streams, or for output streams altogether.
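A rough sketch of such a heuristic, in Python rather than Factor, might look like the following. The function name guess_unicode_encoding is made up for this example, and it assumes UTF-16 and UTF-32 input always starts with a byte order mark, which is not guaranteed in practice.

```python
import codecs

def guess_unicode_encoding(data: bytes) -> str:
    """Guess among ASCII, UTF-8, UTF-16 and UTF-32 only (hypothetical helper)."""
    # Check UTF-32 first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"

# Once 8-bit encodings are allowed, the same byte decodes without error
# under several of them, so validity checks can no longer decide.
ambiguous = bytes([0xE9])
readings = {name: ambiguous.decode(name)
            for name in ("latin-1", "cp1252", "mac_roman", "cp437")}
```

Every entry in readings is a successful decode, and they do not all agree, so nothing in the bytes themselves says which one was intended.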

So let's make UTF-8 the default encoding. Any stream which doesn't want to use UTF-8 should have its instantiation followed by some set-encoding word. But what about binary streams? These aren't uncommon and are needed for things like audio, video, compressed data and Microsoft Word documents. If UTF-8 is the default encoding, it'd be easy to open a file for reading or writing, forgetting that it's in UTF-8, and then writing stuff to it as if it's a binary stream. But if we make the encoding a mandatory explicit parameter, then nobody will forget: if you want to open a stream reading it as UTF-8, you can do utf8 <file-reader>, and if you want to open it as binary, you can do binary <file-reader>. Writing utf8 or binary isn't just boilerplate: it actually indicates some information about how things should work. And for those situations where you want some other encoding, that can be specified just as easily.
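The same idea can be sketched outside Factor. The Python below is only an analogue: file_reader and the BINARY marker are hypothetical names invented for this example. The point is that the encoding argument has no default, so leaving it out is an error rather than a silent guess.

```python
import os
import tempfile

BINARY = object()  # hypothetical marker meaning "raw bytes, no decoding"

def file_reader(path, encoding):
    """Open path for reading. The encoding argument has no default on purpose."""
    if encoding is BINARY:
        return open(path, "rb")
    return open(path, "r", encoding=encoding)

# Write three bytes to a temporary file, then read them back both ways.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("n\u00e9".encode("utf-8"))

with file_reader(path, "utf-8") as f:
    text = f.read()   # a 2-character string
with file_reader(path, BINARY) as f:
    raw = f.read()    # a 3-byte bytes object

os.remove(path)
```

Calling file_reader with only a path raises a TypeError, which is the Python counterpart of a stack effect that demands an encoding.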

Now, do we really want to prefix each stream constructor with an encoding, or can it be determined implicitly from the context somehow? There are two ways to scope this, lexical and dynamic, and they both fail. With dynamic scoping, composability is broken: if one piece of code makes some assumption about the encoding (say, that the encoding is UTF-8, which could be the default global encoding) but then the caller sets it to something else, that code will silently read or write with the wrong encoding. So the encoding must be set lexically. But when I looked at actual code samples, I saw it'd be more trouble than it's worth to have a lexically scoped encoding: nearly all words which open streams which need an encoding only need one or two. You're at most writing the same encoding twice, but it ends up being fewer words than a whole scope declaration (which needs, at a minimum, brackets, the encoding name and something declaring that this is a scope for encoding purposes). What about vocab-level scoping? It could work, but it'd have to be overridden in too many cases to be useful, since it's not unrealistic to have a vocab which uses UTF-8 for half of its streams and binary for the other half.

One other thing that's useful and not particularly common in these sorts of libraries is the fact that the encoding can be changed after the stream is initialized. This is useful for things like XML, where the encoding can be declared in the prolog, and certain network protocols like HTTP and SMTP which allow the encoding to be specified in the header, so the encoding needs to change on the fly. I can only assume that previous implementations of this took everything as binary and used string processing routines to get things in and out of the right encoding.
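As an illustration of switching encodings mid-stream, here is a Python sketch (again an analogue, not Factor code). The header format is invented for the example and is not real HTTP or SMTP: ASCII header lines, a blank line, then a body in whatever charset the header named. The Latin 1 fallback is likewise just an assumption of the sketch.

```python
import io

def read_body(raw):
    """Read ASCII header lines, then decode the rest with the declared charset."""
    charset = "latin-1"  # assumed fallback for this sketch only
    while True:
        line = raw.readline().decode("ascii").strip()
        if not line:  # a blank line (or end of input) ends the header
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "charset":
            charset = value.strip()
    # From here on, the same underlying stream is read with a different encoding.
    return io.TextIOWrapper(raw, encoding=charset).read()

message = b"Charset: utf-8\r\n\r\n" + "caf\u00e9".encode("utf-8")
body = read_body(io.BytesIO(message))
```

The header is consumed as bytes and the decoder is attached only once the charset is known, which is roughly the "take everything as binary" workaround described above. A stream whose encoding can be changed in place avoids that split.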

You might think of this as a standard library cruelly forcing everyone to specify every little detail, but I think of it a little differently: the file I/O API encourages programmers to think about the encodings of their files. We could still go the other way and use UTF-8 as the default, but it'd create some strange and unreadable bugs. Any default is bad. All other stream APIs I've looked at make the encoding optional, but whichever default they pick, it makes misleading assumptions on the programmer's behalf.


<file-reader>, <file-writer>, <file-appender>, <client> and <server> will now take an extra argument of an encoding descriptor, making them have the stack effect ( path/addrspec encoding -- stream ). file-contents and file-lines also take an encoding from the top of the stack. process-stream's encodings are in the descriptor, as a possible value for stdin, stdout or stderr, indicating that those values will be sent to/from Factor as a stream of the given encoding. If you're dealing with files, the process you call should handle all encoding issues. Some streams, like HTML streams and pane streams, don't need changes, since their encoding is unambiguous. You also don't need to specify the encodings of file and process names, since those are OS-specific and handled by the Factor library.

In addition to <string-reader>s and <string-writer>s that already exist and remain unchanged (they don't need an encoding since everything that goes on there is in Factor's internal Unicode encoding), there are now also <byte-reader>s and <byte-writer>s which do have an encoding as a parameter. Byte readers and writers work on an underlying byte vector, and provide the same encodable interface that files do, because an array of bytes, unlike a string, can take multiple interpretations as to the code points it contains.
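The difference between string streams and byte streams can be sketched in Python as well. The helpers byte_reader and with_byte_writer below are hypothetical stand-ins named after the Factor words, not their actual implementation.

```python
import io

def byte_reader(data, encoding):
    """Hypothetical analogue of a byte reader: bytes plus an encoding."""
    return io.TextIOWrapper(io.BytesIO(data), encoding=encoding)

def with_byte_writer(encoding, write):
    """Run write(stream) and return the bytes it produced (hypothetical)."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding=encoding, newline="")
    write(stream)
    stream.flush()
    data = buf.getvalue()
    stream.detach()  # keep the wrapper from closing buf later
    return data

raw = b"\xe4\xbd\xa0"
as_utf8 = byte_reader(raw, "utf-8").read()      # one code point
as_latin1 = byte_reader(raw, "latin-1").read()  # three code points

utf8_bytes = with_byte_writer("utf-8", lambda s: s.write("\u00e9"))
utf16_bytes = with_byte_writer("utf-16-le", lambda s: s.write("\u00e9"))

# A string reader needs no encoding: its contents are already code points.
from_string = io.StringIO("\u4f60").read()
```

The byte-based helpers cannot be called without saying how the bytes map to code points, while the string reader has nothing to decide.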

I renamed with-file-out to with-file-writer, with-file-in to with-file-reader, string-in to with-string-reader and string-out to with-string-writer for consistency. Additionally, there are now also words with-byte-reader and with-byte-writer. Since byte and file readers and writers need an encoding, in these combinators I've put the encoding before the quotation. It could be the other way around, and really this was an arbitrary choice. Conceptually, you can think of the file name or byte array and the encoding as forming a sort of unit, so they're consistently adjacent in the words which use them.

I've made all the updates to everyone's software in my local branch, so you don't have to worry about implementing these changes. You might want to go back and look at your code to make sure the encoding I chose was sane. 90% of the time it's binary or UTF-8, occasionally ASCII. It's usually clear-cut. Also, I never had to make more than 3 or 4 updates in a single file.

It'd be nice if things were simpler, and nobody had to consider encodings at all except for Unicode library writers. Theoretically, this could be solved by a standard way to denote, inside the file, what encoding the rest of the file is in. But if we did that, then multiple competing encoding encodings might emerge, and we'd have to explicitly choose among them! It'd be even better if the filesystem had metadata on this, but it doesn't. Maybe, on the Factor end, there's a place for having an abstraction over the locations of resources grouped with a description of their type (either encoding or filetype). But either way, encodings just aren't simple enough to allow programmers not to think about them.

Update: Added more info about specifics. It's been taking me a little longer than I initially thought to get this whole thing working with Factor, so this stuff still isn't in the main branch, though you can see the progress in the unicode branch of my repository. Bootstrapping will take a little work, though. The changes have been integrated into Factor! Thanks, Slava, for making it all work.

1 comment:

Adam said...

This is beautiful; and being it's stack based, consuming the encoding type feels so much better than a parameter to a function:

file_writer(utf8,myFile) vs "myfile.txt" utf8 <file-writer>

...reads a lot nicer. Maybe I'm getting biased.