Friday, March 14, 2008

A protocol for creating encodings

I previously wrote about the API that I designed for creating streams with encodings in Factor. I'm not sure if that's going to stick around permanently in this form, due to concerns about easily changing stream encodings and grouping encodings with a pathname as one object on the stack.

Either way, I wanted to describe the protocol I'm developing for actually defining new encodings in Factor. This code isn't completely debugged but it should be done and in the main Factor development repository very soon. There are four words in the encoding protocol:

GENERIC: <encoder> ( stream encoding -- encoder-stream )
GENERIC: <decoder> ( stream decoding -- decoder-stream )
GENERIC: encode-char ( char stream encoding -- )
GENERIC: decode-char ( stream decoding -- char/f )

Let's go through these. First, the constructors <encoder> and <decoder>. These are rarely called directly by the library user; more often they are called by stream constructors. For example, when you do "filename" utf8 <file-reader>, what's going on underneath is "filename" (file-reader) utf8 <decoder>. (file-reader) is a low-level constructor that gives you a binary stream, and <decoder> wraps it in a decoded stream using the specified encoding descriptor, utf8.
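To make the layering concrete, here is a rough model of that composition in Python rather than Factor, so none of it is Factor's actual API. The names raw_file_reader, file_reader, Decoder, Utf8 and read_all are hypothetical stand-ins for (file-reader), <file-reader>, the decoder tuple and the utf8 descriptor, and the UTF-8 decoding does no validation.

```python
class Utf8:
    """Hypothetical stand-in for the utf8 encoding descriptor."""

    def decode_char(self, stream):
        # Read one lead byte, then as many continuation bytes as it announces.
        lead = stream.read(1)
        if not lead:
            return None  # end of stream, playing the role of Factor's f
        b = lead[0]
        extra = 0 if b < 0x80 else 1 if b < 0xE0 else 2 if b < 0xF0 else 3
        return (lead + stream.read(extra)).decode("utf-8", errors="replace")


class Decoder:
    """Wraps a binary stream and hands out characters instead of bytes."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding

    def read1(self):
        return self.encoding.decode_char(self.stream)


def raw_file_reader(path):
    # Models (file-reader): a plain binary stream with no encoding attached.
    return open(path, "rb")


def file_reader(path, encoding):
    # Models <file-reader>: open the raw stream, then wrap it in a decoder.
    return Decoder(raw_file_reader(path), encoding)


def read_all(path, encoding):
    # Convenience helper for the sketch: pull characters until end of stream.
    reader = file_reader(path, encoding)
    try:
        chars = []
        while (ch := reader.read1()) is not None:
            chars.append(ch)
        return "".join(chars)
    finally:
        reader.stream.close()
```

The point of the split is that raw_file_reader knows nothing about text, and Decoder knows nothing about files.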

I have some slightly funny methods on <encoder> and <decoder>. Right now, all encodings are tuples, and their abstract descriptors are tuple classes. All tuple class symbols are in the class tuple-class, and all tuples are in the class tuple. So we can define two methods on each constructor word. The method for tuple classes makes an empty instance of the encoding tuple class and calls the constructor again; the method for encoding tuples actually puts together the instance of the physical encoder or decoder tuple. Here's how it looks:

M: tuple-class <decoder> construct-empty <decoder> ;
M: tuple <decoder> f decoder construct-boa ;

M: tuple-class <encoder> construct-empty <encoder> ;
M: tuple <encoder> encoder construct-boa ;
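For readers who don't know Factor's generic words, the same class-or-instance trick can be sketched in Python with functools.singledispatch. This is only an analogy: Decoder, Utf8 and make_decoder are hypothetical names, and the encoding comes first in the argument list because singledispatch dispatches on the first argument, which is the reverse of the Factor stack order.

```python
from functools import singledispatch


class Decoder:
    """Hypothetical stand-in for the physical decoder tuple."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding


class Utf8:
    """Hypothetical encoding descriptor with no state of its own."""


@singledispatch
def make_decoder(encoding, stream):
    # Default case, like the method on tuple: we were handed an encoding
    # instance, so build the actual decoder around the stream.
    return Decoder(stream, encoding)


@make_decoder.register(type)
def _(encoding_class, stream):
    # Like the method on tuple-class: we were handed the class itself, so
    # make an empty instance and call the constructor again.
    return make_decoder(encoding_class(), stream)
```

Both make_decoder(Utf8, stream) and make_decoder(Utf8(), stream) end up in the default case, the first after one extra hop.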

One reason these need to be generic is for things like binary streams, where methods on these generic words are implemented as dummies. A binary encoding is just the lack of an encoding, so the constructors drop the descriptor and return the stream unchanged:

TUPLE: binary ;
M: binary <encoder> drop ;
M: binary <decoder> drop ;

Another reason is that certain encodings require processing at the beginning. For example, UTF-16 should write a byte order mark (BOM) immediately when it's initialized for writing, and read a BOM immediately when it's initialized for reading.

M: utf16 <decoder> ( stream utf16 -- decoder )
! Drop the descriptor, read the 2-byte BOM, and keep the stream for <decoder>.
drop 2 over stream-read bom>le/be <decoder> ;

M: utf16 <encoder> ( stream utf16 -- encoder )
drop bom-le over stream-write utf16le <encoder> ;
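The BOM handling can be modelled in Python as follows. This is a sketch under assumptions, not a port: Utf16LE, Utf16BE, Decoder, Encoder, bom_to_encoding, make_utf16_decoder and make_utf16_encoder are hypothetical names, and raising an error on a missing or invalid BOM is my guess at reasonable behaviour rather than something the Factor code is known to do.

```python
import io
from dataclasses import dataclass

BOM_LE = b"\xff\xfe"  # U+FEFF written little-endian
BOM_BE = b"\xfe\xff"  # U+FEFF written big-endian


class Utf16LE:
    """Hypothetical descriptor for little-endian UTF-16."""


class Utf16BE:
    """Hypothetical descriptor for big-endian UTF-16."""


@dataclass
class Decoder:
    stream: object
    encoding: object


@dataclass
class Encoder:
    stream: object
    encoding: object


def bom_to_encoding(bom):
    # Plays the role of bom>le/be: map the two BOM bytes to a concrete encoding.
    if bom == BOM_LE:
        return Utf16LE()
    if bom == BOM_BE:
        return Utf16BE()
    raise ValueError("missing or invalid byte order mark")  # assumed behaviour


def make_utf16_decoder(stream):
    # Consume the BOM up front, then decode the rest with the chosen byte order.
    return Decoder(stream, bom_to_encoding(stream.read(2)))


def make_utf16_encoder(stream):
    # Write the BOM immediately, then encode everything as little-endian.
    stream.write(BOM_LE)
    return Encoder(stream, Utf16LE())
```

After construction the generic UTF-16 descriptor is gone; the wrapped stream only ever sees one of the two concrete byte orders.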

Now, let's look at the other words. The idea of encode-char and decode-char is that it's simpler for an encoding to encode or decode one character than to implement all the relevant functions of the stream protocol. encode-char takes a character, an underlying stream and an encoding, and writes the character to that underlying stream.

The inverse, decode-char, takes an underlying stream and an encoding and uses the encoding to pull a character from the stream. For everything I've implemented so far, the encoding is dropped after method dispatch, but when things like ISO-2022-JP, which requires state in decoding, are implemented, the state will be stored in the tuple.
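Here is a small Python model of the two per-character words, again as an analogy rather than Factor code. Latin1 is stateless, so the encoding object is only there for dispatch. ToyShift is a made-up encoding, not a real standard, invented purely to show decoding state living in the encoding object: the byte 0x0E adds 0x80 to every following byte until 0x0F switches back.

```python
import io


class Latin1:
    """Stateless: each character is exactly one byte."""

    def encode_char(self, ch, stream):
        # bytes() raises ValueError for code points above U+00FF.
        stream.write(bytes([ord(ch)]))

    def decode_char(self, stream):
        b = stream.read(1)
        return chr(b[0]) if b else None  # None plays the role of f


class ToyShift:
    """Made-up stateful encoding; the shift flag survives between calls."""

    SHIFT_OUT = 0x0E
    SHIFT_IN = 0x0F

    def __init__(self):
        self.shifted = False

    def decode_char(self, stream):
        while True:
            b = stream.read(1)
            if not b:
                return None
            if b[0] == self.SHIFT_OUT:
                self.shifted = True   # remember the mode for later characters
            elif b[0] == self.SHIFT_IN:
                self.shifted = False
            else:
                return chr(b[0] + (0x80 if self.shifted else 0))
```

In both cases the encoding only has to know how to move one character; buffering, line reading and the rest of the stream protocol can be written once on top of these two operations.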

This is all much simpler than my previous design, which required looping to decode a single character and forced encodings to adopt a complicated state-machine-based model. This is something like the third iteration of the encoding protocol I've made, and the code is finally starting to look good.

In Factor, it takes a little bit of work to make certain things, like encodings, have clean code. The appropriate abstractions don't fall out as immediately obvious, but eventually they're found. The result is far more maintainable and clean. I'm not sure what this would imply on big projects with bad programmers. (But I plan to never work in an environment like that; better to be an academic if good work in a small company can't be found.)

Anyway, future pie-in-the-sky plans for encodings include treating cryptographic protocols and compression as encodings (under different protocols, of course). This is really cool, because it would give five orthogonal layers: stream, cryptography, compression, text encoding and usage. It'll be possible to compose them and factor out their compositions in any way you want! But this doesn't exist, so I probably shouldn't even be talking about it.

Update: Fixed stupid typos. Thanks Slava!
