Monday, December 31, 2007

Unicode implementer's guide part 6: Input and output

Until now, in my blog posts on Unicode, I've been focusing on internal processing, which has only an indirect effect on users. But now I'm thinking about something else: how can you write a text editing widget which supports as many locales as possible? This should generally be the goal when writing internationalized applications, and almost all applications should eventually be internationalized, so why not a text editor?

As it turns out, this is an extremely complicated task, combining the difficulties of input methods and rendering. In this article I'll try to touch on as many relevant topics as possible, trying not to ramble too much about any particular one. Really, each section should probably be given its own blog post.

To test the Unicode-blind approach, I'm looking at the Factor 0.91 editor gadget, which pays no attention at all to internationalization and basically parrots back whatever characters it receives. Different keyboard layouts were tested on Mac OS X 10.5, though that shouldn't matter here.

Typing European scripts

European scripts are pretty easy. You just type letters on the keyboard and they appear on the screen, always in left-to-right order, with one character appearing for each keystroke in many languages. This works for English, German, Russian, Georgian and many other languages, but not for, say, Spanish or Vietnamese. In those languages, there are characters like é or ở which require multiple keystrokes to input. With a typical Spanish keyboard, to write é, first the ´ key is pressed and then the e key. For ở, you first press the ơ key and then the ̉ key.

There are more complicated input methods for, say, typing é with a US keyboard, and these are system-dependent. On Mac OS X, I use Alt-e to get the combining ´ mark, and then press e for the base character that goes under it. This works in any application. In Microsoft Office applications on Windows, you can use Ctrl-' to make a combining ´ mark. Elsewhere on Windows, that key chord doesn't work, and you're left with complicated keystrokes for inputting the right hex code. I believe Gnome and KDE each define their own mechanisms for this kind of input.
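Under the hood, a dead-key input method is a little state machine: the dead key puts the system into a pending state, and the next keystroke either composes with it or doesn't. Here's a minimal sketch of the idea in Factor, where compose-table and compose-key are hypothetical names, not part of any real input system:

! Hypothetical table from { dead-key base-key } pairs to composed
! characters; HEX: E9 is é (U+00E9)
: compose-table ( -- assoc )
    H{ { { CHAR: ' CHAR: e } HEX: E9 } } ;

! Look up a composition; outputs f if the pair doesn't compose
: compose-key ( dead-key base-key -- char/f )
    2array compose-table at ;

A real editor widget would normally receive the already-composed character from the OS input system rather than keeping a table like this itself.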

In a Unicode-blind text editor, none of these multiple-keystroke inputs work. When in the Spanish keyboard mode in the Factor editor gadget, pressing ´ does nothing, and when in Vietnamese mode, pressing ơ does nothing. This is because the input system expects something more complicated to happen next. The OS-specific keys, of course, don't work properly either unless there is special handling for them. All of these things must be taken into account in an internationalized editor widget.

Typing East Asian scripts

When looking at the possible input methods in the Mac keyboard layout chooser, you're immediately struck by the fact that four East Asian languages--Chinese, Japanese, Korean and Vietnamese--have far more input methods listed than any other individual language. This is probably due, more than anything else, to the difficulty involved in inputting Han ideographs.

Han ideographs (Chinese characters) are used in Chinese and Japanese (as Kanji), occasionally in Korean and historically with Vietnamese. Unicode encodes over 70,000 Han ideographs, but at least 90% of these are extremely rare. Still, thousands come up in everyday situations and need to be typed into computers.

There are a number of ways of inputting Han ideographs. One way is through a Romanization. This requires several dictionaries, because each language pronounces the ideographs differently, and there are many Romanization systems for each language. One Romanization may correspond to more than one ideograph, so users are presented with a list of numbered choices. Another input method works by specifying the radicals, or component pieces, of the Han ideographs.
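At the core of a Romanization-based input method is a dictionary from Romanized syllables to candidate lists. A toy sketch, where pinyin-candidates is a hypothetical word and the one-entry table stands in for a huge, frequency-ordered dictionary (U+5988 妈, U+9A6C 马 and U+5417 吗 all Romanize as "ma" in pinyin):

! Hypothetical: map a Romanized syllable to the candidate ideographs
! that the user then picks from as a numbered list
: pinyin-candidates ( romanization -- candidates/f )
    H{ { "ma" { HEX: 5988 HEX: 9A6C HEX: 5417 } } } at ;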

Japanese and Korean (and Vietnamese, discussed above) also have alphabets. Japanese has the kana—katakana and hiragana—syllabaries which can be entered with a special keyboard or through Romanization. Korean has Hangul (I blogged about this earlier), syllables made of the 24 letters of the jamo alphabet. These can be entered by Romanization or by typing jamo directly. In either case, the letters are expected to coalesce into Hangul syllables automatically during typing.
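That coalescing is pure arithmetic, because Unicode encodes Hangul syllables algorithmically. Here's a sketch using the standard composition formula from the Unicode standard, where l, v and t are the indices of the leading consonant, the vowel and the optional trailing consonant:

! l in [0,18], v in [0,20], t in [0,27]; t = 0 means no trailing jamo
: jamo>hangul ( l v t -- char )
    >r swap 21 * + 28 * r> + HEX: AC00 + ;

For example, 18 0 4 jamo>hangul gives HEX: D55C, the syllable 한. An input method has to apply this kind of composition incrementally as the user types.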

The input method is usually chosen outside of the current application, and applications are expected to work perfectly with that choice. A good text editing widget must communicate with the OS settings to pick the right input method, and it must implement that method properly.

Rendering difficult scripts

Hebrew, European scripts and East Asian scripts are relatively simple to render. Well, maybe not that simple, when it comes to placing (and stacking!) combining marks and putting together Hangul syllables or Han ideographs. But at least two adjacent letters don't tend to combine with each other spontaneously, forming several types of ligatures depending on what's on the left and what's on the right. At least letters go in their prevailing order, not rearranging themselves visually.

Certain Middle Eastern scripts like Arabic and Syriac, and several Indic scripts like Devanagari, don't follow these rules. Arabic and Syriac are written in a cursive form. Each letter has up to four (or sometimes five) contextual shapes in the simplified model that most computer typesetting systems use, and vowels are written above or below the letters as combining marks. In some typesetting systems, these shapes can be inferred from context, but others will display things incorrectly unless special presentation-form code points are used which encode which shape is intended.

Indic scripts are also complicated, but differently. The term "Indic script" refers not only to scripts used in India but to a whole family of scripts used around South and Southeast Asia. There are 19 of these scripts encoded in Unicode, all with somewhat similar properties. Devanagari, the script used for Hindi, Nepali and Sanskrit, among others, is one of these Indic scripts; others include Malayalam, Tibetan, Tamil, Khmer and Thai. Among consonants in most Indic scripts, there are a large number of ligatures, but these are not encoded in Unicode; they must be inferred by a suitable font. Vowels in Indic scripts can be written after a consonant group, above it, below it, or even before the group, and this must also be inferred by the font. (I'm oversimplifying in both of those sentences, of course. Things are much more complicated.) In most Indic scripts' Unicode encodings, vowels which are displayed before a consonant group come afterwards in memory, following logical order. The Thai and Lao scripts don't follow this rule and store those vowels first, but this causes complications in collation. All of these properties of Indic scripts create difficulties not only in rendering but also in input methods, though I haven't read much specifically on this.

Some or all of these issues might be solved by a suitable font and display system rather than by the editor widget itself.

Line breaking

I wrote about grapheme breaking previously. Unicode line breaking is sort of like grapheme breaking, except much, much more complicated. By "line breaking" I mean what you might call word wrap. It's described in UAX #14.

This document describes several classes of characters with different line-breaking properties. Some force a line break before or after their display, others passively allow a line break before or after, some forbid a line break and others have even more complicated properties. All in all, there are 38 categories. Even though the algorithm is complicated, it (like generating a collation key) can be done in linear time with respect to the length of the string.

With respect to line breaking, different scripts and locales have different rules. In Chinese, Japanese and Korean text, where spaces between words are rare or nonexistent, you can break a line between almost any two Han ideographs, kana or Hangul syllables. Often, arbitrary line breaks are even allowed within words written in Roman script. In Korean, we have to watch out not to break up a Hangul syllable, so conjoining jamo behavior rules are used.
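The CJK case is simple enough to sketch directly. This is nowhere near full UAX #14; han? below covers only the main CJK Unified Ideographs block, and a real implementation would work from the full set of line breaking classes:

! U+4E00..U+9FFF, the main CJK Unified Ideographs block
: han? ( char -- ? )
    dup HEX: 4E00 >= swap HEX: 9FFF <= and ;

! A break opportunity exists between two adjacent Han ideographs
: break-between? ( char1 char2 -- ? )
    han? swap han? and ;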

In the Khmer, Lao and Thai scripts, there are also no spaces between words, but line breaks must not interrupt words. This requires some linguistic analysis and dictionary lookup, but Unicode provides a code point, U+200B ZERO WIDTH SPACE, to indicate line breaking opportunities. This may sound a little weird and impractical, but a similar thing also exists for European scripts which may hyphenate words between lines: the soft hyphen (U+00AD).
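Given a dictionary-based segmentation, marking the break opportunities is then just a matter of splicing that character in. A sketch (mark-breaks is hypothetical, and the trailing U+200B it leaves behind would be dropped in a real implementation):

! Rejoin dictionary-segmented words with U+200B (ZERO WIDTH SPACE),
! marking each word boundary as a line break opportunity
: mark-breaks ( words -- str )
    [ HEX: 200B add ] map concat ;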

Even if things are restricted to American English, Unicode line breaking properties provide for better visual display than separating things at word breaking points, or just where there's whitespace. Still, implementing it is a significant challenge for the writer of an internationalized text editor.

Bidirectional text

In Arabic, Hebrew and some other scripts, text runs from right to left rather than left to right. But what if we have a document that combines Hebrew and English: how do we organize text running in two directions? In an even more common situation, we could have text containing Hebrew and numbers. (Even in Hebrew and Arabic, decimal numbers are written left to right.) According to Unicode principles, both left-to-right (LTR) and right-to-left (RTL) scripts should be stored in logical order, yet their display orders differ.
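As an illustration, using the common convention of writing RTL letters as uppercase Latin, here is a right-to-left paragraph as stored in memory and as displayed (a sketch of the outcome, not of the algorithm):

logical (memory) order:  ABC 123 DEF
display order:           FED 123 CBA

The two RTL words are reversed and trade places, while the digits of the number keep their left-to-right order.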

To negotiate these issues, the Unicode Consortium specified the Bidirectional (BIDI) Algorithm (UAX #9). The BIDI algorithm could easily fill up a whole blog post by itself, and I don't fully understand it, so I won't try to explain it all here. There are a number of categories that Unicode code points fall into for BIDI purposes: characters can be not only LTR or RTL but also neutral, mirroring, strongly or weakly directional, or an invisible special-purpose mark. These categories sometimes overlap, and there are several levels of embedding.

The BIDI algorithm is difficult to implement for display, but it gets even more complicated in a text editor. When placing the cursor somewhere, there can be ambiguity as to where it is. When selecting things, there is a design decision whether to support logical selection (which could be visually discontiguous) or visual selection (which is hard to implement). The best resource to learn more about this is the UAX document itself.

Conclusion

For Factor, it will be impossible to keep the editor widget written completely in Factor. It would require a huge amount of duplicated effort and system-specific configuration to get it right, so we'll probably end up having to use native (meaning Gtk, Cocoa and Windows UI) editor widgets.

Clearly, the Unicode-blind approach is insufficient here. Because of all this, internationalized editor widgets are implemented pretty rarely. English speakers can get complacent with simple input methods, but these will not work for everyone else.


Warning: Readers should be aware that I haven't studied any one of these topics in depth. This is just an introduction and shouldn't be taken very seriously.

Sunday, December 23, 2007

Books that should exist

As you can probably guess from the existence of this blog, I love to write. Part of what I love about it is the actual act of writing, but it's really more that I want things to be written that hadn't before been written. I want to write stuff that I wish I could have read. Right now, there are three books and one journal article that I think should exist. Hopefully, I'll be able to write some of these at some point in my life.

Implementing Unicode: A Practical Guide

One thing that struck me when beginning to write the Unicode library is that there aren't many books about Unicode. The two I found in my local Barnes and Noble were the Unicode 5.0 Standard and Unicode Explained. Looking on Amazon.com, I can't find any other books that address Unicode 3.1 (the version that moved Unicode from 1 to 17 planes) or newer in detail, ignoring more specialized books.

Both of these are great books, but they aren't optimal for figuring out how to implement a Unicode library. Unicode Explained focuses on understanding Unicode for software development, but shies away from details of implementation. The Unicode Standard explains everything, but it often gets overly technical and can be difficult to read for people not already familiar with the concepts described. A Unicode library implementor needs something in the middle. Unicode Demystified might be the right one, but it describes Unicode 3.0, so it is in many ways outdated.

I wish a book existed which explained Unicode in suitable detail for most library implementors. If this book continues to not exist for many years, I might just have to write it myself. This, however, would be more difficult and less fun than my other book ideas.

[Update: After getting my hands on Unicode Demystified, I realize that I should not have thrown it aside so quickly. It's a great book, and nearly all of it is still relevant and current. It looks like the ebook version I have is updated for Unicode 3.1.]

Programming with Factor

Factor's included documentation is great, but it's not enough, by itself, to learn Factor. I, and probably most people who know Factor, learned through a combination of experimenting, reading blog posts and the mailing list, and talking on #concatenative. Many people will continue to learn Factor this way, but it still seems somehow insufficient. It should be possible to learn a programming language without participating in its community.

Of course, we can't write a book on Factor until we get to see what Factor will look like in version 1.0. But I'm confident that this book will be written, even if it goes unpublished in print, and I'm fairly sure that I'll have at least some part in it. Maybe it'll be a big group effort by the whole Factor community.

What I'm thinking is that we could have a book which teaches programming through Factor, to people who aren't yet programmers. I've talked a lot about this with Doug Coleman, and we agree that most programming books are bad; we should make a new book that does things very differently. But we can't agree or imagine just how...

The Story of Programming

I, like many of you reading this blog, have an unhealthy interest in programming languages. Mine may be a little more unhealthy than yours. Whenever I hear the name of a programming language that I don't know of, I immediately need to read about it, to get some basic knowledge of its history, syntax, semantics and innovations.

Most study of programming languages works by examining the properties of the languages themselves: how functional programming languages are different from imperative object-oriented languages and logic languages and the like. But what about a historical perspective? The history of programming languages is useful for the same reason other kinds of historical inquiry are useful. When we know about the past, we know more about the way things are in the present, and we can better tell what will happen in the future. The history of programming languages could tell us what makes things popular and what makes things ultimately useful.

Unfortunately, not much has been done on this. Knowledge of programming language history is passed on, unsourced, with as much verifiability as folk legend. The ACM has held three conferences called HOPL on this subject over the past 30 years, so all the source material is there. But apart from a book published in 1969, this is all I can find as far as a survey history of programming languages goes.

There is a limit to how much academic research papers can provide. The proceedings of the HOPL conferences aren't bedtime reading, and they don't provide much by way of a strong narrative. A new book could present the whole history of programming from the first writings about algorithms to modern scripting languages and functional programming languages so it's both accessible to non-programmers and interesting to programmers. As far as I know, no one's really tried. But it would be really fun to try.

Inverses in Concatenative Languages

Most of my ideas are either bad or unoriginal. But there's one idea that I came up with that seems to be both original and not horrible, and that's the idea of a particular kind of concatenative pattern matching (which I blogged about, and Slava also wrote about in relation to units).

The basic idea is that, in a concatenative language, the inverse of foo bar is the inverse of bar followed by the inverse of foo. Since there are some stacks that we know foo and bar will never return (imagine bar is 2array and the top of the stack is 1), this can fail. From this, we get a sort of pattern matching. Put more mathematically, if f and g are functions, then (f ∘ g)⁻¹ = g⁻¹ ∘ f⁻¹.

In my implementation of this, I made it so you can invert and call a quotation with the word undo. We don't actually need a full inverse; we only need a right inverse. That is, it's necessary that [ foo ] undo foo be a no-op, but maybe foo [ foo ] undo returns something different. Also, we're taking all functions to be partial.
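A couple of concrete cases, using the undo word from that implementation:

{ 1 2 } [ 2array ] undo    ! leaves 1 2 on the stack
1 [ 2array ] undo          ! fails: 1 can never be the output of 2array

The second line is exactly the pattern-match failure described above: no stack exists that 2array maps to a stack with 1 on top.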

Anyway, I think this is a cool idea, but I doubt it could fill a book. I want to write an article about it for an academic journal so that I can explain it to more people and expand the idea. It could be made more rigorous, and it could use a lot more thought. I hope this works.

When will I write these?

So, I hope I'll be able to write these things. Unfortunately, I'm not sure when I could. I need to finish my next 10 years of education, which won't leave tons of free time unless I give up blogging and programming and my job. Also, I'm not sure if I'm capable of writing a book or an article for an academic journal, though maybe I will be in 10 years when I'm done with my education. It wouldn't be so bad if someone stole my thunder and wrote one of these things, because what I really want is to read these books.

Update: Here are a couple more books (or maybe just long surveys) that should exist but don't: something about operational transformation, and something about edit distance calculations and merge operations for things more complicated than strings. Right now, you can only learn about these things from scattered research papers. (I never thought I'd find the references at the end so useful!)

Thursday, December 13, 2007

Alphanum sort and Unicode

So, I'll just be the Unicode troll here and say that all of your solutions to the problem of sorting alphanumeric file names are insufficient. To recap, if you haven't been following Reddit: it's annoying how, when looking at a directory, the file named z10.txt is sorted between z1.txt and z2.txt, and we programmers should come up with a solution. The gist of the solution is that you split the file name into a list of alphabetical and numeric parts, parse the numbers, and then sort it all.

What's wrong with that?

Every one of these implementations, including a slightly internationalized version (just for some Scandinavian locale, apparently), fails to sort the way humans expect. Remember, in ASCII, capital Z comes before lower-case a. Jeff Atwood hinted at internationalization problems, and that's just part of the problem. We should use the Unicode Collation Algorithm (which I previously blogged about). The UCA provides a locale-dependent, mostly linguistically accurate collation for most locales (basically all that don't use Han ideographs).

The basic idea we need here is that the UCA is based on a number of levels of comparison. This is useful not only in "international" locales but also in United States English. First, we compare two strings with most features removed, then we add back in accent marks, then capitalization, and then punctuation/spacing. So, even if we want "B" to come before "b", we can have "bench" come before "Bundle"--before comparing for capitalization, we compare the letters themselves, and "bench" obviously precedes "bundle". For a better description of the UCA, see my past blog post.
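Sketched out on those two comparisons, the levels work something like this (simplified; real weights come from the DUCET, and level 2, accents, is skipped since neither string has any):

"bench" vs "Bundle":
  level 1 (base letters):  bench < bundle   -- decided here; case is never consulted
"bench" vs "Bench":
  level 1 (base letters):  equal
  level 3 (case):          b < B            -- the tie-breaker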

So, the most basic way to put the UCA together with the Alphanum algorithm is to use the same algorithm, except with the UCA to compare the strings rather than the ASCII comparison most other implementations have used. But this would completely destroy the UCA's subtle level effect: even if "A" is sorted before "a", we would want "a1" to come before "A2". We also want "A1" to come before "a2", and we don't want to completely ignore case; it's still significant.

The UCA solves this problem when we're working with "a1" and "A2", but not when we're comparing "a2" and "a10". For this, we need a slightly more complicated solution based on something like the Alphanum algorithm. But breaking the string up by itself won't allow for the layering kind of behavior that the UCA depends on.

What's a solution for this?

One way to fix this is to break the string up into its segments, right-pad the alphabetical strings and left-pad the numbers. We can just break things up into segments of consecutive digits and non-digits. The length that we'll pad to for the nth segment is the maximum of the lengths of the nth segments across all of the strings. For numbers, we can left-pad with plain old 0, but for strings, we want to right-pad with something that sorts lower than any other character. This sounds hard to do, but by tailoring the DUCET (the data table for the UCA), we can define a character in a private use area as a new, special padding character. This padding character will be completely non-ignorable, with a fixed weight, and by definition will have a lower primary, secondary and tertiary weight than any other character.

OK, that sounds really complicated. Let's step back a minute and see how this padding could work out to provide functionality equivalent to the existing Alphanum algorithm. For this, we'll assume access to an ASCII-order sorting algorithm, and just figure out how to pad the string in the right way to come up with a suitable sort key. Instead of some new character, we can just use a null byte. So, if we have the strings "apple1b" and "the2345thing1", the fully padded strings should look like "apple0001b\0\0\0\00" and "the\0\02345thing1", where '\0' is the null byte.

I can make a simple Factor implementation by stealing Doug's cut-all code from his description of a solution to this problem.

: pad-seq ( seq quot -- newseq )
    >r dup [ length ] map supremum r>
    curry map ; inline

: pad-quote ( seq -- newseq )
    [ "" pad-right ] pad-seq ;

: pad-number ( str -- newstr )
    [ CHAR: 0 pad-left ] pad-seq ;

: pad-string ( str -- newstr )
    [ 0 pad-right ] pad-seq ;

: pieces ( strings -- pieces )
    [ [ digit? ] [ ] cut-all ] map pad-quote flip ;

: pad ( pieces ? -- ? newpieces )
    [ pad-string f ] [ pad-number t ] if swap ;

: pad-strings ( strings -- newstrings )
    pieces t swap [ swap pad ] map nip flip [ concat ] map ;

: alphanum-sort ( strings -- sorted )
    dup pad-strings [ 2array ] 2map sort-values keys ;

This, by far, isn't the most concise implementation, but the advantage is that it can easily be adapted for a better collation algorithm.
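As a sanity check, assuming Doug's cut-all is loaded alongside the code above, usage would look like this (the output is what I'd expect, not verified):

{ "z10.txt" "z1.txt" "z2.txt" } alphanum-sort .
! => { "z1.txt" "z2.txt" "z10.txt" }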

But where's the real Unicode implementation of this?

I'm not sure where to implement this tailoring and how to hook it up with the UCA. I'd either have to (a) go into C++ and do this with ICU, or (b) write my own Unicode library which supports tailoring, and do it there. I'm too scared to do the first, and the second isn't done yet. I've looked at Java's tailoring, but I don't see how I could define a character that's lower than all other characters. I feel a little bit lame/trollish putting this blog post out there without a real implementation of the solution, but I hope I was able to expand people's knowledge of the concerns of internationalization (yes, I'll write that word all out; I don't need to write i18n) in text-based algorithms.

Monday, December 10, 2007

Multiline string literals in Factor

It's always annoyed me somewhat that Factor strings can only be on one line and that there's no mechanism for anything like the "here documents" Perl has. So I decided to write one myself. At this point, I don't really need it and have forgotten what I wanted it for, but it was still a fun exercise.

I started out thinking I should do something similar to the way some other languages do it: write a word, maybe called <<<, which is followed by some string which is used to delimit a multiline string literal expression. But I realized this wouldn't be the most idiomatic way in Factor. First, if you're making a multiline string literal, why would you ever have it be within a word? For constants like this, it's considered best practice to put them in their own separate words. Second, why do I need to give the option of choosing your own ending? What's wrong with just using a semicolon, like other Factor definitions?

Putting this together, I came up with the following syntax:

STRING: something
Hey, I can type all the text I want here
And it can be over multiple lines
But without the backslash-n escape sequence!
;

The final line, the semicolon, has to have no whitespace before or after it. This allows for practically any useful string to be written this way. The syntax will compile into something like this:

: something
    "Hey, I can type all the text I want here\nAnd it can be over multiple lines\nBut without the backslash-n escape sequence!" ;

With a previous parser design, multiline string literals like this were impossible, but now they can be done in 11 lines. I packaged this up and put it in my repository under extra/multiline so others can use it.


Using the internals of the parser, the first word advances the parser state one line and returns the text of the new line.

: next-line-text ( -- str )
    lexer get dup next-line line-text ;

The next two words do the bulk of the parsing. They advance the parser's current line until reaching a line consisting only of ";", and then advance the line one more time. While moving forward, a string is compiled consisting of all of the lines, interspersed with newlines.

: (parse-here) ( -- )
    next-line-text dup ";" =
    [ drop lexer get next-line ] [ % "\n" % (parse-here) ] if ;

: parse-here ( -- str )
    [ (parse-here) ] "" make 1 head*
    lexer get next-line ;

Finally, the word STRING: puts it all together, defining a new word using a string gathered with parse-here.

: STRING:
    CREATE dup reset-generic
    parse-here 1quotation define-compound ; parsing

There are downsides to having an extremely flexible syntax like Factor's. Things can be less predictable when libraries can alter the fundamentals of syntax. It'd be impossible to create a fixed BNF description of Factor's grammar. Additionally, the particular structure of Factor sometimes encourages syntax extensions that are heavily dependent on the details of the current implementation. But the upside is that we can do things like this. I think it's worth it.

Thursday, December 6, 2007

Roadmap to Factor 1.0

According to the Factor website, Factor 1.0 is coming out some time in 2008. That's pretty scary. 2008 is coming in less than a month, and we currently have no solid plan for how we're going to go about reaching 1.0. How long it takes will depend on how much time contributors have available over the next year. Nevertheless, I'm proposing a roadmap of development goals for Factor 1.0.

Figuring out the goals

We first need to figure out the exact goals for Factor 1.0, and how we're going to go about completing them. From the Factor homepage, we have a list of goals for Factor 1.0:
  • New object system with inheritance and multiple dispatch
  • Incremental garbage collection
  • Drastically reduce compile time
  • Continue improving existing libraries, such as databases, multimedia, networking, web, etc
  • Full Unicode support, Unicode text display in UI (in progress)
  • Better UI development tools
  • Get the UI running on Windows CE
  • Add support for Windows 64-bit and Mac OS X Intel 64-bit
  • Use kqueue (BSD, Mac OS X), epoll (Linux) for high-performance I/O
  • Directory change notification

This contains many different things, which will take a lot of work. For each of these tasks (or pieces of them), we need to decide who will be in charge of them and what a target date for completion will be. There are also other goals that aren't on this list, and we need to identify what these are.

One list item is particularly involved: Continue improving existing libraries, such as databases, multimedia, networking, web, etc. Right now, the libraries in extra/ are of inconsistent quality. We should set a quality standard for things that are in extra/ and designate more experimental libraries as such, maybe by putting them in a different directory like playground/. We should have an audit of all the libraries in extra/ to improve them as much as possible and decide whether they should go into playground/.

Timetable

By the end of this year (December 31, 2007): We need to decide, in more detail, how the path to Factor 1.0 will go. We need a full list of goals, a person assigned to each one, and a target date for completion. We also need to decide the terms for an audit of the libraries.

One-third through next year (April 30, 2008): We should be done with the library audit. All new libraries should be correctly sorted into extra/ or playground/. At this point, around half of the remaining critical features of Factor 1.0 should be done. Language features, like the form of the object system, should be completed and frozen.

Two-thirds through the year (August 31, 2008): This date is the target for all of the key features to be completed. After this point, we will focus on bug fixes, performance and more complete documentation. If possible, 1.0 alpha/beta 1 should be released, and further alpha/beta releases should be made through the year. Here, Factor's runtime should be frozen with no significant changes in implementation.

At the end of the year (December 31, 2008): By this date, if all goes well, we will release Factor 1.0, with no major additional features added after the alpha stage.

Specifics

Here are guidelines that all vocabs in extra/ should eventually follow:
  • Each vocab foo should consist of six files: foo.factor, foo-tests.factor, foo-docs.factor, summary.txt, authors.txt, tags.txt
  • Every vocab's documentation should explain each externally useful word, and it should have a main article telling users where to start.
  • Every vocab's unit tests should be as complete as possible.

Eventually, some vocabs in extra/ and playground/ will be hosted outside of the main Factor distribution. But right now, the total quantity of Factor code is such that everything that's publicly distributed can be included. Of course, when more people start writing code that they refuse to put under a BSD-compatible license, we'll want to distribute that code separately.

This whole roadmap is quite optimistic, but I think it's possible if we focus on that goal.

Note: I wrote this without significantly consulting everyone involved with Factor, so if you have a problem with it, know that this doesn't form any kind of official policy (and please tell me).

Tuesday, December 4, 2007

Factor FAQ moved

The Factor FAQ that Doug and I wrote has been moved to a new location at factorcode.org. It looks much better now, formatting-wise, I think. I'm still updating the FAQ, and since it's grown so big now, I'll have to categorize the questions. If you have any questions about Factor that aren't on the FAQ, please ask me (or the folks at #concatenative or the Factor mailing list).

extra/delegation: an annotated vocabulary

In order to allow for better compiler optimization of slot operations, among other improvements, Slava plans to add tuple inheritance to the Factor object system. The exact form of this isn't completely finalized, but it looks like delegation in its current form isn't going to stick around much longer. I proposed a new kind of delegation mechanism, along with other changes to the Factor object system. In the end, a more conservative change was chosen in the form of mixins.

I described the Factor object system before, but I'll review delegation here. Basically, when a generic word dispatches on a tuple, one of two things can happen: either a method is found and called, or no method is found. If no method is found, the generic word checks the delegate slot. If that slot is filled (with something that's not f), the method is re-called on the delegate in place of the original object. For more information on this, see my previous blog post.

There are a few problems with delegates. One problem is the slicing problem: when relying on delegation to extend behavior and a method is delegated, the method call knows only about the delegate, not the original object. Sometimes, this is the desired behavior, but other times it can lead to subtle bugs. Another problem is performance. When looking up the value of a direct tuple slot, the required operation is a simple method dispatch and pointer arithmetic. If the compiler knows the tuple class of an object before runtime, it can even eliminate the dispatch step. However, when looking at a delegate slot, there is an additional pointer indirection. Crucially, with the current system, there is no way to avoid the dispatch. Tuple inheritance would require fewer levels of indirection and give more opportunity for optimization. With delegation, method dispatch takes linear time with respect to the length of the delegate chain, and this is bad. A very subtle problem with delegation is that everything is delegated at once, when often, we only need to delegate a couple things.

Yet delegation is still sometimes useful, and there must be a way to fix the slicing problem without breaking everything*. I've written a vocabulary (vocab) called delegate which attempts to fix this. The premise is that this could constitute delegation after it is removed from the core. Central to the vocab is the concept of a group, a word which signifies a number of methods. There are three builtin types of groups: tuple classes, generic words and protocols. The word group-words extracts the constituent words from a group:

GENERIC: group-words ( group -- words )

For a generic word, the group consists of just the word itself:

M: generic group-words
    1array ;

For a tuple class, the group consists of all of the slot readers and writers, except for the delegate slot (delegate and set-delegate aren't generic words, anyway):

M: tuple-class group-words
    "slots" word-prop 1 tail ! The first slot is the delegate
    ! 1 tail should be removed when the delegate slot is removed
    dup [ slot-spec-reader ] map
    swap [ slot-spec-writer ] map append ;

A protocol is an explicitly-defined list of generic words which form a group. The syntax PROTOCOL: protocol-name words... ; is used to define a protocol, and the words it contains are stored in the "protocol-words" word property.

: define-protocol ( wordlist protocol -- )
    swap { } like "protocol-words" set-word-prop ;

: PROTOCOL:
    CREATE dup reset-generic dup define-symbol
    parse-definition swap define-protocol ; parsing

PREDICATE: word protocol "protocol-words" word-prop ;

M: protocol group-words
    "protocol-words" word-prop ;

Here's an example, the definition of the sequence protocol:

PROTOCOL: sequence-protocol
    clone clone-like like new new-resizable nth nth-unsafe
    set-nth set-nth-unsafe length immutable set-length lengthen ;

The advantage of defining delegation over these groups is that, usually, we don't need to delegate everything. With this, multiple delegation over disjoint groups is possible. This is in use in the xml.data vocab, to let tags act like their names, their attributes assoc and their children sequence all at the same time. It could also be used for multiple tuple inheritance.
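Using the CONSULT: mechanism defined below, the xml.data arrangement would look something like this sketch (the protocol and accessor names here are stand-ins, not the real words):

CONSULT: name-protocol tag tag-name ;
CONSULT: assoc-protocol tag tag-attrs ;
CONSULT: sequence-protocol tag tag-children ;

Each generic word in each group gets a mechanically generated method on tag, and the three delegations don't interfere because the groups are disjoint.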

One mechanism which replaces delegation is called consultation. Billy Tanksley has suggested that this is what Factor's delegation should be termed; in essence, consultation is the same as delegation, except that it's limited to a single group. Say you have some class foo which implements all generic words in the group bar. There is another class baz, which has a word baz>foo for turning itself into a foo, that is, for getting a corresponding foo to which to delegate the methods. Then we can write CONSULT: bar baz baz>foo ; to define the delegation. If bar contains the words qux and quux, the following method definitions will be generated:

M: baz qux baz>foo qux ;
M: baz quux baz>foo quux ;

Here's the implementation:

: spin ( x y z -- z y x ) ! This stack shuffler makes things easier to think about
    swap rot ;

: define-consult-method ( word class quot -- )
    pick add <method> spin define-method ;

: define-consult ( class group quot -- )
    >r group-words r>
    swapd [ define-consult-method ] 2curry each ;

: CONSULT:
    scan-word scan-word parse-definition swapd define-consult ; parsing

So this fixes one problem with delegation: that it happens for everything. The second problem, performance, can be solved with increased static typing, for example static typing of tuple slots. But this won't be an available feature of Factor for some time. Until then, if we're sure that a tuple slot is always of a specific class, we can use declare to let the compiler know. For example:

CONSULT: bar baz
    baz>foo { foo } declare ;

There's still one last problem: the slicing problem. A complete solution would involve heavy hackery of the Factor method dispatch system, allowing one object to mimic another. But a simpler solution is to just copy all of the method implementations from one class to another, for the generic words in a specific group. This can be unsafe in some situations, such as when the group is a tuple class, but it can also be useful. I called this system mimicry.

: define-mimic ( group mimicker mimicked -- )
    >r >r group-words r> r> [
        pick "methods" word-prop at [ 2drop ]
        [ method-def <method> spin define-method ] if*
    ] 2curry each ;

: MIMIC:
    scan-word scan-word scan-word define-mimic ; parsing

So there you have it: in fewer than 40 lines, the basis for a replacement of the delegation system, at least partially fixing all three main problems. The code is in my git repository. There are still problems, of course: metadata about what mimics or consults what is not stored, and when you look at a generic word or class, you'll see all of the mechanically defined methods. It's amazing, though, how flexible Factor is to this kind of manipulation and metaprogramming.



* Actually, it's possible to re-implement delegation in Factor, completely eliminating the slicing problem, but at a slight performance cost for tuple slot lookups, and at the cost of breaking a lot of code. The thing is, if instances of tuple class a all delegate to tuple class b, then a instances will return f from b?. To get around this, semi-hacks like the is? word are used.