Friday, February 2, 2007

More string parsing?

While I was writing my XML parser, I swore that I would never again write anything that involved string parsing. I quickly broke that promise, and the next two libraries I wrote were a CSV (comma-separated values) parser and a UTF-8/16 encoder/decoder. The good thing about these is that they'll be useful. I'm planning on using the CSV parser, for example, in fixing a complicated spreadsheet for a volunteer association I'm in (the version of Excel I have exports to CSV but not XML). Both the Unicode encodings and CSV are relatively simple, but I implemented the encodings in two different ways. For CSV, I used the imperative parsing model that I described in previous blog entries. To make this cleaner, I split off a new module, libs/state-parser, accompanied by better documentation.

This way, I was able to build a working program, but the problem with this approach is that it is imperative. This makes it so that the order in which words appear matter not only in terms of stack effect but also side effect. You might think in a language like Factor, where flow of control is extremely easy to determine, this is not a problem. However, it is. In this particular case, I needed to look at two characters to determine what to do next, the current character and the previous character. To look at both of these requires a state change, one which I don't always want to happen. It took a while to fully diagnose this issue and I tried about 5 different ways until I found one which fixed it.

I implemented Unicode differently. For this, I used a design pattern known as a sentinel, which helps me cross-cut pointcutting concerns by instantiating objects which encapsulate the state of the parser [Enough OO jargon for you?]. I never mutate these, and the program is purely functional except for the use of make (which could trivially be changed into a less efficient map [ ] subset, sacrificing efficiency and some terseness but making it functional). Writing this code was practically error-free, and the errors that existed could easily be diagnosed and fixed. I always sorta knew in my mind that it's better to write programs in a functional style than an imperative style, but this really proved it to me. Here's the code for decoding UTF-16LE.

TUPLE: new ;
TUPLE: double val ;
TUPLE: quad2 val ;
TUPLE: quad3 val ;

: bad-char CHAR: ? ;

GENERIC: (utf16le) ( char state -- state )
M: new (utf16le)
drop <double> ;
M: double (utf16le)
over -3 shift BIN: 11011 = [
over BIN: 100 bitand 0 =
[ double-val swap BIN: 11 bitand 8 shift bitor <quad2> ]
[ 2drop bad-char , <new> ] if
] [ double-val swap 8 shift bitor , <new> ] if ;
M: quad2 (utf16le)
quad2-val 10 shift bitor <quad3> ;
M: quad3 (utf16le)
over -2 shift BIN: 110111 = [
swap BIN: 11 bitand 8 shift
swap quad3-val bitor HEX: 10000 + , <new>
] [ 2drop bad-char , <new> ] if ;

: utf16le ( state string -- state string )
[ [ swap (utf16le) ] each ] { } make ;

If you're still reading my blog at this point, I'm genuinely impressed that you put up with such a boring series of posts.