Sunday, April 6, 2008

Programming in a series of trivial one-liners

Among Perl programmers, a one-line program is considered a useful piece of hackage, something to show off to your friends as a surprisingly simple way to do a particular Unix or text-processing task. Outsiders tend to deride these one-liners as line noise, but there's a certain virtue to it: in just one line, in certain programming languages, it's possible to create meaningful functionality.

APL, lived on by its derivatives like K, Q, J and Dyalog pioneered the concept of writing entire programs in a bunch of one-liners. Because their syntax is so terse and because of the powerful and high-level constructs of array processing, you can pack a lot into just 80 characters. In most K programs I've seen, each one does something non-trivial, though this isn't always the case. It can take some time to decode just a single line. Reading Perl one-liners is the same way.

Factor continues the one-line tradition. In general, it's considered good style to write your words in one, or sometimes two or three, lines each. But this isn't because we like to pack a lot into each line. Rather, each word is rather trivial, using the words defined before it. After enough simple things are combined, something non-trivial can result, but each step is easy to understand.

Because Factor is concatenative (concatenation of programs denotes composition) it's easier to split things into these trivial one-liners. It can be done by copy and paste after the initial code is already written; there are no local variables whose name has to be changed. One liners in Factor aren't exceptional or an eccentric trait of the community. They're the norm and programs written otherwise are considered in bad style.

Enough philosophizing. How does this work in practice? I'm working on encodings right now, so I'll break down how this worked out in implementing 8-bit encodings like ISO-8859 and Windows-1252. These encodings are just a mapping of bytes to characters. Conveniently, a bunch of resource files describing these mappings which are all in exactly the same format is already exists on the Unicode website.

The first thing to do in implementing this is to parse and process the resource file, turning it into two tables for fast lookup in either direction. Instead of putting this in one word, it's defined in five, each one or two lines long. First, tail-if is a utility word which works like tail but leaves the sequence as it is if it's shorter.

: tail-if ( seq n -- newseq )
2dup swap length <= [ tail ] [ drop ] if ;

Using that, process-contents an array of lines and turns it into an associative mapping (in the form of an array of pairs) from octets to code points.

: process-contents ( lines -- assoc )
[ "#" split1 drop ] map [ empty? not ] subset
[ "\t" split 2 head [ 2 tail-if hex> ] map ] map ;

byte>ch takes this assoc, the product of process-contents and produces an array which can be used to get the code point corresponding to a byte.

: byte>ch ( assoc -- array )
256 replacement-char <array>
[ [ swapd set-nth ] curry assoc-each ] keep ;

ch>byte is the opposite, taking the original assoc and producing an efficiently indexable mapping from code points to octets.

: ch>byte ( assoc -- newassoc )
[ swap ] assoc-map >hashtable ;

Finally, parse-file puts these all together and makes both mappings, given a stream for the resource file.

: parse-file ( stream -- byte>ch ch>byte )
lines process-contents
[ byte>ch ] [ ch>byte ] bi ;

Next, the structure of the encoding itself is defined. A single tuple named 8-bit is used to represent all 8-bit encodings. It contains the encoding and decoding table, as well as the name of the encoding. The encode-8-bit and decode-8-bit words just take some encoding or decoding information and look the code point or octet up in the given table.

TUPLE: 8-bit name decode encode ;

: encode-8-bit ( char stream assoc -- )
swapd at* [ encode-error ] unless swap stream-write1 ;

M: 8-bit encode-char
encode>> encode-8-bit ;

: decode-8-bit ( stream array -- char/f )
swap stream-read1 dup
[ swap nth [ replacement-char ] unless* ] [ nip ] if ;

M: 8-bit decode-char
decode>> decode-8-bit ;

I wanted to design this, like existing Unicode functionality, to read resource files at parsetime rather than to generate Factor source code. Though I don't expect these encodings to change, the result is still more maintainable as it leaves a lower volume of code. If I were implementing this in C or Java or R5RS Scheme or Haskell98, this wouldn't be possible. So make-8-bit defines an encoding given a word and the lookup tables to use:

: make-8-bit ( word byte>ch ch>byte -- )
[ 8-bit construct-boa ] 2curry dupd curry define ;

define-8-bit-encoding puts everything together. It takes a string for the name of an encoding to be defined and a stream, reads the appropriate resource file and defines an 8-bit encoding.

: define-8-bit-encoding ( name stream -- )
>r in get create r> parse-file make-8-bit ;

To top it all off, here's what's needed to define all the 8-bit encodings we want:

: mappings {
{ "latin1" "8859-1" }
{ "latin2" "8859-2" }
! ...
} ;

: encoding-file ( file-name -- stream )
"extra/io/encodings/8-bit/" ".TXT"
swapd 3append resource-path ascii ;

"io.encodings.8-bit" in [
mappings [ encoding-file define-8-bit-encoding ] assoc-each
] with-variable
] with-compilation-unit

So by combining these trivial one-liners or two-liners, you can make something that's not as trivial. The end product is that hard things are made easy, which is the goal of every practical programming language. The point of this isn't to say that this code is perfect (it's very far from that), but just to demonstrate how clear things become when they're broken down in this way.

When I first started programming Factor, I thought that it only made sense to define things separately when it was conceivable that something else would use them, or that it'd be individually useful for testing, or something like that. But actually, it's useful for more than that: for just making your program clear. In a way, the hardest thing to do when programming in Factor once you have the basics is to name each of these pieces and factor them out properly from your program. The result is far more maintainable and readable than if the factoring process has not been done.

1 comment:

Slava Pestov said...

There's a typo in byte>ch; the array constructor is not escaped.

Also you can write it as

: byte>ch
256 replacement-char <array> [ <enum> swap update ] keep ;