The most common Unicode-processing bug is as pervasive as it is trivial: UTF-8 is confused with some 8-bit repertoire and encoding, or the other way around. Most commonly the other encoding is something like Windows-1252, a poorly specified superset of ISO 8859-1 (Latin 1). This, I'm fairly sure, is the source of all of those "funny Unicode characters".
For example, on the Unicode mailing list digest as read in GMail, I received a letter a few days ago which opened, "I donâ€™t understand..." I'm not sure where in the transmission line this happened, but I'm sure that nobody typed those characters. More likely, they typed a single right quotation mark U+2019 (or rather their email program inserted one), which is 0010 0000 0001 1001 in binary and is encoded in UTF-8 as 11100010 10000000 10011001 = 0xE2 0x80 0x99, three bytes that display as â€™ when interpreted as Windows-1252.
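Here's the round trip as a minimal Python sketch (Python is just for illustration; I'm not suggesting anything in the actual mail pipeline uses it):

# Encode the right single quotation mark as UTF-8, then misread the
# resulting bytes as Windows-1252 (cp1252 in Python's codec registry).
quote = "\u2019"
utf8_bytes = quote.encode("utf-8")       # b'\xe2\x80\x99'
mojibake = utf8_bytes.decode("cp1252")   # 'â€™'
print(utf8_bytes, mojibake)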
(Note: the giveaway that this is Windows-1252 and not Latin 1 is the Euro sign, which is a relatively recent invention and not encoded in any official ISO 8859 that people actually use*. In all of the ISO 8859s, 0x80 is reserved as a kind of fake control character.)
Here's how it might have happened: the email program declared that everything was in Windows-1252, though it was not, and the mailing list server correctly decoded that declared encoding into the corresponding Unicode code points. Alternatively, maybe no encoding was specified, and since Windows-1252 is a superset of Latin 1, which in turn is a superset of ASCII, it was used as a "maximally tolerant" assumed encoding where ASCII would be the normal default. Or maybe the mailing list itself failed to specify what encoding it was using, and GMail made a similar mistake. This last possibility is the most likely, as things consistently appear this way for me when reading the Unicode mailing list.
This bug is at once easy to fix and impossible. Because compatibility with legacy applications needs to be maintained, it's difficult to change the default encoding of things. So, everywhere it's possible, things need to say explicitly what their encoding is, and applications need to process this information properly.
Still, do we care more about maintaining compatibility with legacy applications or getting things working today? In practice, almost everything is done in UTF-8. So it should be fine to just assume that encodings are always UTF-8, legacy programs be damned, to get maximum utility out of new things.
Well, as it turns out, that's not always the best thing to do. Someone recently told me that he suspected a Unicode bug in Amarok: it wasn't correctly loading his songs that had German in the title, though it had worked correctly on a previous installation. Instead, I think the bug was in incompatible default settings for GNU tar or the ISO format. The song titles used to have accented letters, but the files were transferred onto a different computer. Now, those letters showed up as single question marks when viewed in Konqueror, and Amarok refused to open the files, giving a cryptic error message.
UTF-8 is fault-tolerant, and a good decoder will replace malformed octets with a question mark and move on. This is probably exactly what happened: the title of a song contained a character, say ö, which was encoded in Latin 1 as 0xF6, followed by something in ASCII. The song title was encoded in Latin 1 when the file system expected UTF-8. The UTF-8 decoder in Konqueror replaced the 0xF6 with a � (replacement character, U+FFFD), but Amarok threw an error for some reason.
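To make that concrete, here's a small Python sketch (the file name is made up, and Konqueror and Amarok aren't written in Python; this only illustrates the two decoder behaviors):

# A song title containing ö, encoded in Latin 1 (0xF6), fed to a UTF-8 decoder.
latin1_name = "Sch\u00f6ne Musik.mp3".encode("latin-1")   # b'Sch\xf6ne Musik.mp3'
print(latin1_name.decode("utf-8", errors="replace"))      # Sch�ne Musik.mp3
# A strict decoder raises an exception instead, which is roughly the
# difference between showing a replacement character and refusing the file:
# latin1_name.decode("utf-8")  -->  UnicodeDecodeError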
So, for all of this, there's no good solution but to be more careful and mindful of different encodings. In most cases, you can use a heuristic to determine whether something is in Windows-1252 or UTF-8, but this can never be completely accurate, especially if other encodings are considered at the same time.
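The simplest form of that heuristic fits in a few lines; here's a Python sketch (my own toy function, not what any particular application actually does):

def guess_decode(data: bytes) -> str:
    # Text that isn't UTF-8 is very unlikely to decode as UTF-8 by
    # accident, so try it strictly first and fall back to Windows-1252.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")

print(guess_decode("\u2019".encode("utf-8")))    # ’
print(guess_decode("\u00f6".encode("latin-1")))  # ö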
* ISO 8859-15 and -16 actually have the Euro sign, but I really doubt many people use them, as they were released around the turn of the millennium, when Unicode was already in broad use, and come with the difficulties of telling them apart from other ISO 8859s.
Update: An anonymous commenter pointed out that it's not too hard to use a heuristic to differentiate between Windows-1252 and UTF-8.
Thursday, January 31, 2008
Tuesday, January 22, 2008
Matching, diffing and merging XML
Update: A newer, more complete version is here.
I've said bad things about my job working on Carleton College's website, but fundamentally we have a really sound work environment. Just before winter break, one of the full-time employees came to me and asked if I could make a diff between two XHTML documents for use in Carleton's CMS, Reason. This would be useful for (a) comparing versions of a document in the CMS and (b) merging documents, in case two people edit the same document at the same time, so as to avoid locks and the need for manual merges. They came to me because I told them I'd written an XML parser.
I may know about XML, but I know (or rather knew) nothing about the algorithms required for such a diff and merge. I looked on Wikipedia, but there was no article about this kind of stuff. So for the past three weeks, I've been paid to read academic papers, mostly from Citeseer, my favorite website, about matching, diffing, and merging trees. Here's some of what I've found.
A naive algorithm
Here's something that won't work: a line-by-line diff and merge, along the lines of the Unix utilities diff and patch. You can split things up so that each tag is on a different line, as many have suggested, but it still won't work. Everyone I've mentioned this problem to, including the boss who gave the assignment, thought of doing something like this, but it's unworkable.

Most obviously, this can easily break the tree structure. Here's one example of a failing three-way merge. (By a three-way merge, I mean a merge of two documents where you also have the original document from which the other two are derived. diff3 does a line-by-line three-way merge, which I'm using for this example.)
[Table lost in extraction. Its columns were: Original | Part bolded | Part italicized | Line-by-line merge, showing the same XHTML fragment in each of the four states.]
This XML is not well-formed! This is unacceptable. I'm not saying diff3 is a bad tool; it's just not what we need.

Another problem is that a standard Unix diff only takes two operations into account: insertions and deletions. In document diffing and merging, it'd be helpful to support a third operation: moves. Say we take two paragraphs and swap them. In a model with only deletions and insertions, a paragraph would have to be deleted from one place and inserted in another. This can lead to sub-optimal diffs as presented to the user. It can also lead to bad merges: say in one branch a paragraph was edited, and in another branch that paragraph was moved to a different location. An optimal merge would put the edited paragraph in the new location, which requires tracking these moves directly.
Background to a better solution
XML documents can be viewed as ordered trees, where each node has an unbounded number of children and each internal node has a label. (In some situations XML acts more like an unordered tree, but in XHTML it's definitely ordered. Also, finding the optimal matching for unordered trees is known to be NP-hard, so we don't want to go there.) So we can solve these problems of diffing and merging XML by solving a more general problem on ordered trees. There are some specific aspects of XML which deserve mention (attributes, which are guaranteed to have unique names within a node; IDs, which are guaranteed to be unique within a document), but these are minor aspects which we can ignore most of the time.
To avoid confusion, I'll define some terms I've been using or will soon start using. When I talk about "diffing", what I mean, formally, is generating an "edit script": a list of changes between two documents that can be used to produce the modified document from the original. Sometimes these edit scripts are invertible, but not always. When I talk about a "merge", I mean a way to reconcile two sets of changes so that both are incorporated into one document. A merge can be an operation on edit scripts, or it can be done directly on a tree matching. A "matching" is a set of correspondences between nodes in different trees; it is the basis for doing either a diff or a merge, and it's difficult to compute efficiently.
(It might help to have a basic understanding of Levenshtein distance and the implied string diff algorithm. The longest common subsequence problem will also crop up here from time to time.)
The Zhang-Shasha algorithm and extensions
The Zhang-Shasha algorithm is the basic starting point when thinking about tree matching, diffing and merging. Except it isn't that basic. Dennis Shasha and Kaizhong Zhang created an algorithm to solve the approximate tree matching problem, which they described in the book Pattern Matching Algorithms. Their chapter on trees is available from Citeseer. Here's the basic idea: we want to see how similar two trees are, by a weighted edit distance metric. The edit script has three operations, similar to Levenshtein distance: insert a node (optionally taking over a contiguous run of the parent's children as its own), delete a node (its children move up to the parent), and relabel a node.
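To make "edit script" concrete, here's a rough Python sketch of how these three operations might be represented (the names and fields are mine, not Zhang and Shasha's notation):

from dataclasses import dataclass

@dataclass
class Relabel:
    node: int          # id of the node whose label changes
    new_label: str

@dataclass
class Delete:
    node: int          # its children are adopted by its parent

@dataclass
class Insert:
    label: str
    parent: int
    at: int            # the new node takes over a contiguous run of
    n_children: int    # the parent's children starting at this position

# An edit script is an ordered list of such operations; its cost is the
# sum of per-operation weights, and the tree edit distance is the cost
# of the cheapest script.
script = [Relabel(3, "em"), Delete(7), Insert("p", 0, 2, 3)]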
With this, they were able to come up with an algorithm of complexity (basically) O((n log n)^2), where n is the number of nodes in the tree. So this can get you a matching and an edit script, or diff between two trees. This isn't great, but it's much better than previous algorithms. The two also worked on the more difficult problem of supporting "Variable-length don't-cares", or the equivalent of * in Google/Unix file globbing, in tree matching, but we don't care about that here.
But this doesn't describe all of the changes that might take place in a tree structure that we might want to record. For example, in Zhang-Shasha proper, moving a node from one place to another is recorded as deleting the node from one place and inserting it into another place. Another issue is that inserting or deleting a subtree is recorded as inserting the node, then inserting each of its children, or deleting the leaf nodes recursively up until you delete their parent. This all leads to counterintuitive diffs, as far as human readability goes, as well as inflated edit distances.
So David Barnard, Gwen Clarke and Nicholas Duncan got together to create a modified algorithm that accommodated this modified definition of edit distance in this paper. It adds three additional operations: insertTree, deleteTree, and swap. Unfortunately, this doesn't account for copying nodes, or for moves that aren't within the same parent.
Some tree matching heuristics
So, it's not very good that the Zhang-Shasha algorithm is quadratic in most cases. In fact, in many cases, it's unacceptable. For example, in my case, where I might sometimes have to compare XHTML documents which are very long, it's unacceptable. But there are some algorithms which run in a time which is dominated by the number of nodes multiplied by the edit distance, or O(ne).
One algorithm called FastMatch, which has insert leaf, delete leaf, update and general move operations is presented by a bunch of people from Stanford in this paper. They work on getting an edit script and matching at the same time, but the algorithm starts by matching as much as possible, top-down, before proceeding to calculate the differences. This yields a complexity of O(ne+e^2). A related algorithm, described in Chapter 7 of Tancred Lindholm's master's thesis [PDF] incorporates tree insertion and deletion operations for a complexity of O(ne log n).
It's important to note that both of these will be O(n^2) in the complete worst case. A different XML matching algorithm was described by Grégory Cobéna in his master's thesis (and also in this paper, which is basically the relevant segment). Cobéna calls his algorithm BULD, which stands for bottom-up lazy-down. The key to the algorithm is that, for each node, there is a hash value and a weight, both calculated bottom-up. Equal hash values approximate exact equivalence between subtrees, and candidate subtrees are kept on a max-heap ordered by weight, so the heaviest candidates are matched by hash first. In Cobéna's full thesis [PDF], he goes into more depth about his invertible edit script format. This algorithm doesn't necessarily generate the optimal diff, but in experiments it generates a very good one, and with a worst-case time of O(n log n).
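Here's a rough Python sketch of that bottom-up pass (my own simplification of the idea, not Cobéna's code; the node attributes are assumptions):

import hashlib

def annotate(node):
    # node is assumed to have .label, .text and .children attributes.
    child_info = [annotate(child) for child in node.children]
    digest = hashlib.md5()
    digest.update(node.label.encode("utf-8"))
    digest.update(node.text.encode("utf-8"))
    for child_hash, _ in child_info:
        digest.update(child_hash)
    # Weight grows with subtree size, so heavy (large) subtrees are
    # considered first when looking for matches.
    node.weight = 1 + sum(weight for _, weight in child_info)
    node.hash = digest.digest()
    return node.hash, node.weight

Matching then repeatedly pops the heaviest unmatched subtree off a max-heap and looks for a subtree with the same hash in the other document.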
Three-document merge
Creating an edit script is all well and good, but it's only half of the problem: the merge. Remember that a three-document merge is one where we have the original document and two modified versions, and we want to create a fourth version with both sets of modifications together. Here was my idea: create an edit script for each modified version with respect to the original, then apply one followed by the other, performing repeated modifications only once. We know there's a conflict if the result depends on which script is applied to the original document first.
But this will come up with more conflicts than actually exist. For example, say some node A has four children, B C D and E. In one change, we insert a new node X after B as a child of A, and in another change, we insert a node Y after D as a child of A. So a sensible merge would have A's new children be B X C D Y E, in that order. But with the model described above, there would be an edit conflict!
One solution to this is the more general strategy of operational transformation. The basic idea, as applied here, is that if we insert Y after inserting X, we have to add 1 to the index at which Y is inserted. If, on the other hand, Y is inserted first, we don't have to add one to the index at which X is inserted. This technique leads to fewer conflicting merges, or in OT lingo, it converges in more cases. There are a few formal properties of an operational transformation that have only recently been proven to hold for the best-known algorithms. Pascal Molli used operational transformation, together with Cobéna's diff algorithm and format, in his So6 synchronization framework, which he described in a deceptively short paper [PDF].
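For the insert/insert case, the transformation is almost trivial. Here's a hedged Python sketch (representing an insertion as a (parent, index) pair is my own simplification; real OT systems also handle deletes, ties and moves):

def transform_insert(op, applied):
    # Shift op's index if an insertion has already been applied at or
    # before that position under the same parent.
    (parent, index), (a_parent, a_index) = op, applied
    if parent == a_parent and a_index <= index:
        index += 1
    return (parent, index)

# The example above: X goes in at index 1 (after B), Y at index 3 (after D).
insert_x, insert_y = ("A", 1), ("A", 3)
print(transform_insert(insert_y, insert_x))  # ('A', 4): children become B X C D Y E
print(transform_insert(insert_x, insert_y))  # ('A', 1): unchanged, same result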
Tancred Lindholm went a different route altogether in creating a three-way merge, throwing out the edit script and basing it on a tree matching. His algorithm is described in a paper that I don't understand well enough to faithfully summarize here.
Conclusion
At this point, you might be thinking, "I read all that and he didn't even tell me how the algorithms work?!" I would tell you all about that, but there are two problems. First, that'd take thousands more words, and this post already exceeds 2000 words. This is just an introduction to the problem and the literature. But more importantly, I don't fully understand these algorithms. They're very complicated to follow and even harder to implement.
So, overall, it looks like this is a much harder problem than I initially thought. Each time I read a paper on this, I think, "Oh, so this is what I needed all along. And I wasted so much time looking at the other papers before it!" But there is always another, more complicated layer. Before I get to each new layer, I think, "No one's ever done this before!" and then I suddenly find a paper about it, published more recently each time.
The problem I'm looking at right now is the last step, the interface. In particular, I'm not sure how to explain a diff to non-technical users and ask questions about how to resolve a merge conflict. Though I haven't found information about these things yet, I'm sure I will in the near future.
(There's also the problem of doing a three-way merge on the text mentioned earlier, which should come out bolding one part and italicizing another. But this problem, in the general case, is significantly harder to solve because it requires knowledge of XHTML. Imagine how the desired behavior would differ if one of the tags were an anchor. It also makes the matching task more complex as text nodes can be expected to be matched when they are all together in one version, and split up in subnodes in another version. I'm not sure if anyone has researched this yet. Another difficult problem is to merge edits within a text node, which can be done with a standard LCS procedure and operational transformation but requires a fuzzy matching of text nodes.)
Maybe, continuing in the trend of increasing dates, the answer to these questions hasn't come yet. In that case, I might have to solve this problem myself (and publish the results) so that later people don't have to solve it again. I've already started meeting with a Carleton computer science professor to talk about this problem in general.
Academic computer science is amazing.
Update: I've been getting lots of comments suggesting certain tools for XML diffing or merging. I should have made it more clear that this is a solved problem, or rather, that all current XML diff/merge tools do roughly the same thing. What isn't a solved problem is a schema-sensitive diff for XHTML which doesn't treat text nodes as atomic, and no comment given so far has anything like that. Most of the research going on in XML diffing is about improving heuristics to give faster algorithms, but my main concern is diff quality: a diff over XHTML that's so good even an untrained person could understand it.
If you're trying to solve the first problem of a general XML diff or merge, you shouldn't be using a closed-source implementation, because these don't generally tell you what algorithm they're using. Three good open-source implementations are XyDiff (and its Java clone jXyDiff), implementing Cobéna's algorithm in C++; Logilab's xmldiff, implementing the FastMatch algorithm in Python; and the 3DM project, implementing Lindholm's algorithms. Unfortunately, none of these seems to have been updated since 2006.
Originally, my description of edit distances in the Zhang-Shasha algorithm was incorrect (I described Selkow's operations), but it's now fixed.
Sunday, January 13, 2008
until: The Ultimate Enumerator
Recently, via programming.reddit, I found this paper by Oleg Kiselyov about the best way to build a generic collection protocol to go through a sequence. We want to do this so that operations on collections can be implemented once generically to work on any kind of collection.
Oleg's idea
There are two main ways to go about this: one is a cursor, which is called an iterator in Java and C#: basically, you have an object which represents a particular point in a collection, and there are operations to get the current value and the next cursor position, if one exists. The alternative is to have a higher order function called an enumerator which represents, basically, a left fold. Oleg's conclusion is that the left fold operator is the best way to go about things, and he proceeds to demonstrate how this works out in languages with and without call/cc. He shows how cursors and enumerators are isomorphic, defining each in terms of the other.
Oleg's function is something like this: you have a coll-fold-left function which takes a collection, a function, and an unbounded number of seeds. This will return those seeds after this left fold processing is done. The function takes an element and the values of the seed inputs and returns an indicator with new values for the seeds. The indicator determines whether to keep going or to stop now and return the current values from coll-fold-left.
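To show the shape of that interface, here's a Python sketch (Oleg's examples are in Scheme and Haskell; the names, and packing the seeds into a tuple, are my own):

def coll_fold_left(collection, f, *seeds):
    # f takes an element plus the current seeds and returns
    # (keep_going, new_seeds); the fold stops early when keep_going is false.
    for element in collection:
        keep_going, seeds = f(element, *seeds)
        if not keep_going:
            break
    return seeds

# Sum the elements, stopping once the running total exceeds 10.
(total,) = coll_fold_left(
    [1, 2, 3, 4, 5, 6],
    lambda x, acc: (acc + x <= 10, (acc + x,)),
    0)
print(total)  # 15; the element 6 is never visited

Of course, this Python version cheats by walking the collection with a for loop, which is exactly the kind of cursor the paper argues against; the point here is only the calling convention of the seeds and the stop indicator.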
An enumeration function like this, which is fully general, is much better than a cursor, because cursors depend on mutable state and they are much more difficult to implement, say, when traversing a tree with no links to parent nodes. Also, enumerators handle termination of the collection much better than iterators, where it must either be manually checked whether the collection has terminated, or an exception thrown when reading past the end. There are also efficiency problems, and some interesting research into doing compiler optimizations like deforestation based on enumerators.
Concatenativity and accumulators
I'm going to talk about how this can all work in Factor. Factor does have call/cc, so we can use the simpler enumerator, which relies on call/cc for full control. (I won't go into how to stop and restart an iteration in the middle in this article; Oleg's example can be translated directly into Factor.) But we also have something else to take advantage of, which neither Scheme nor Haskell has at its disposal: a stack.
Remember, Factor is a concatenative language, and that means that whenever we make a function call, it operates on a huge built-in accumulator. You could think of the stack as one accumulator ("all words are functions from stack to stack") or as an unbounded number of accumulators (which makes more sense thinking about things from a dataflow perspective, that the stack is used to glue together the inputs and outputs of various words).
Here's an example. You know the regular old left fold? As in, taking a list like [1, 2, 3, 4], an initial value (called the identity) 0, and a function like (+), and transforming this into ((((0+1)+2)+3)+4)? Well, in Factor we call that reduce, and here's how it's defined:
: reduce ( sequence identity quotation -- value )
    ! quotation: accumulator element -- new-value
    swapd each ; inline
swapd is a stack shuffler with the effect ( x y z -- y x z ), and each is a word which takes a sequence and a quotation and calls the quotation on each element of the sequence in order. each normally takes a quotation with stack effect ( element -- ).

But how does this work?
each, like nearly every combinator, is defined to expose its quotation to the values further down on the stack, so we can mess around with that stuff all we want as long as the result is balanced. Here, the quotation that we give the whole thing is messing with the top thing on the stack, right under the element, each time it's called. So, by the end, the top of the stack has our answer. Keep in mind that all of this is happening completely without mutable state.

until
So, that's the great simplification that you get from concatenative languages: there is no need to maintain the accumulators. Here's all we need to define the perfect enumeration protocol: a word which I call until ( collection quotation -- ). It takes a quotation and a collection from the stack and leaves nothing. The quotation is of the effect ( element -- ? ). If the quotation returns true, the iteration through the list stops there.

until is significantly easier to define than coll-fold-left because we don't have to worry about the accumulators. Here's a simple definition with linked lists, if f is used as the empty list:
TUPLE: cons car cdr ;
: until ( list quot -- )
    ! quot: ( element -- ? ) ; t stops the iteration
    over [
        [ >r cons-car r> call ] 2keep rot
        [ 2drop ] [ >r cons-cdr r> until ] if
    ] [ 2drop ] if ; inline
Of course, we'd probably want to define it as a generic word, if we're doing this in Factor and will use it for more than one thing.
(until is kinda like the generic iteration protocol I've defined for assocs, using assoc-find, except until should be easier to implement.)

Derived operations
Using until as a base, we can easily define other basic operations like each, find and assoc-each as they currently exist in Factor.
: each ( seq quot -- )
    [ f ] compose until ; inline

: find ( seq quot -- i elt ) ! This is hideous, and should be factored
    >r >r -1 f r> r> [ ! i f elt quot
        2swap >r >r keep r> 1+ r> drop ! ? elt i
        rot [ rot and ] keep ! i f/elt ?
    ] curry until dup [ nip f swap ] unless ; inline

: assoc-each ( assoc quot -- )
    ! This assumes that the assoc implements until, giving 2arrays for each element
    [ first2 ] swap compose each ; inline
Evaluation
You might notice that find will involve a little bit of redundant arithmetic on integer-indexed sequences. I'm not sure how to avoid this without complicating until and leaking implementation details. Also, you might notice that there's no obvious way to implement find* (without doing a bunch of redundant iteration), which takes a starting index as an argument. This is a result of the fact that find* can only really be implemented efficiently on collections with random-access indexing. There's also no find-last.

until isn't everything. As I pointed out just now, some operations, like find, may have to do redundant arithmetic when operating on certain kinds of collections to get an index, when that index must already be calculated for the traversal itself. Another limitation is that there's no obvious way to implement map. (I would still have to suggest going with Factor's current scheme of the words new, new-resizable and like. But this scheme doesn't work with, say, linked lists.) This could be implemented in place of the assoc iteration protocol, but that might result in some unnecessary allocation of 2arrays when iterating through things like hashtables.

Nevertheless, something like until could definitely help us make our sequence protocol more generic, allowing (lazy) lists, assocs, and what we currently call sequences all to be used through the same protocol.

If you don't understand the title allusion, you should read some of the original Scheme papers.
Friday, January 11, 2008
Factor hack in NYC
This is kinda short notice, but if anyone's in the New York area today, Zed Shaw is organizing another one of his Factor hackfests in Manhattan, which I'll be going to. The details are at his blog (unfortunately I can't find a link to the particular article, just the blog index). So it'll be at Earth Matters (177 Ludlow St) at 7 PM on Friday, January 11th.
Update: Sorry to those of you who couldn't come on such short notice, or on this continent. The whole meeting was really fun, though. In all, nine people came, four of whom had barely seen Factor at all before then. So I gave a little tutorial introduction along the lines of this blog post using factorial as an example. I also explained inverse, demoed shufflers, and talked about stuff about bignums and complexity, but I don't think I explained those things in enough depth to be clear. Luckily it wasn't a complete monologue on my part, as Zed and a couple others made helpful comments to help my descriptions and stuff. Then we devolved into random discussions about various things, some Factor-related and others not, which occupied the majority of the evening and was really fun. There's a good chance I'll be back in New York next April for the first few days of Passover, and I hope to see many of you there at a later Factor meetup!
Tuesday, January 8, 2008
Useless Factor is one year old!
Actually, it's a little more than a year old; I started this blog on December 19, 2006 with a still-relevant post on Factor's forms of if. I meant to put up a "happy birthday me!" post on the right date, but I forgot, so now: Happy Birthday Me!
Many of you brave, intrepid blog readers have endured more than a year of dry, technical posts approaching 2000 words a pop just to learn what I have in my head, giving me a refreshing boost of self-importance. Be proud of yourselves!
(As a little side-comment about Steve Yegge's recent blog post about long blog posts, I write for a long time for a somewhat different reason. I don't try to make readers endure one huge sitting, but rather try to explain everything as fully as is needed to give a good understanding of things. Because of the topics I usually write about, this sometimes takes a long time. I do this completely without noticing; I only think about it when someone points it out to me. I will never limit myself to 1000-word blog posts, because there's just too much important stuff to say.)
Friday, January 4, 2008
Core Factor developers and their projects
Here's a quick listing of the main current Factor developers and their projects. Hopefully, it'll help beginners navigate the community better.
I wanted to keep this list short, but I hope I'm not offending anyone by leaving them off, so please tell me if there's some glaring omission here. There's definitely no official "core" group Factor developers; these are just the people who have contributed a lot in the past. I put the names in order of seniority (in terms of time using Factor).
Slava Pestov
IRC nick: slava
Slava is Factor's creator. He wrote almost all of the core, the compiler and the Factor virtual machine, the frontend of the Factor UI and the backend on X11 and Mac OS X, a detailed math library and the Factor HTTP server. Slava's definitely the leader of the Factor project, but he doesn't tend to operate in a dictatorial way, instead taking input from many people on language design decisions.
Chris Double
IRC nick: doublec
Chris was Factor's first adopter. He wrote a bunch of interesting libraries, including one for lazy lists, another for parsing expression grammars (PEGs), a compiler from a subset of Factor to Javascript, and an 8086 emulator suitable for playing old console games. He also wrote a bunch of useful bindings to various multimedia libraries and a bunch of other things too numerous to list here. Unfortunately for us, Chris has a full-time job, limiting his time to hack on Factor.
Eduardo Cavazos
IRC nick: dharmatech
Ed is a non-conformist and proud of it. He devised Factor's current module system, where vocabs in a USE: clause that haven't already been loaded are loaded automatically. He also wrote a window manager in Factor, a really cool artificial life demo called "boids", a chat server/client, an art programming language implementation and an alternative object system for Factor.
Daniel Ehrenberg (me)
IRC nick: littledan
I'm working on Factor's XML library, Unicode support, and a concatenative pattern matching library. I also have this blog, where I try to write useful (or useless) things about Factor.
Doug Coleman
IRC nick: erg
Doug made a bunch of important libraries including the calendar library, SQL bindings (he's currently working on an abstraction layer), a Mersenne Twister random number generator, some other math libraries, a cryptography library and integration for several text editors. Doug took over maintaining Windows I/O after another contributor, Mackenzie Straight, stopped maintaining it. Doug's really friendly to the beginners in Factor, even the ones who ask stupid questions.
Mackenzie Straight, Elie Chaftari and Alex Chapman, among others, also contributed a lot. You can see what everyone's done by looking in the vocabulary browser included in Factor.
At the risk of repeating myself, when you have questions about anything related to Factor, including any of these libraries discussed, you should pose the question to the general Factor forums: the IRC channel #concatenative on Freenode and the mailing list. We'd be happy to answer them.
Update: Fixed attributions for Doug and Elie, and changed last paragraph.
Thursday, January 3, 2008
Learning Factor
Learning Factor is tough. One reason for this is that Factor is very different from other programming languages. Programmers today are used to imperative programming languages where data is stored and passed around in named variables (or function calls, which name their variables). Factor is the opposite of this. A lot of code tends to be written in a functional style, and even more jarringly, variables are rare, only referenced in a small fraction of words. Nobody intends to change any of this; it's a feature, not a bug!
What we do need to change is the other reason for Factor's difficulty to learn: lack of documentation and organization thereof. It's hard for me to say this when a lot of effort has been put into documentation, especially creating a comprehensive reference for the Factor core and libraries. When I got started on Factor, there was basically no documentation at all; I just asked questions on the IRC channel. Now there's tons of it, but it's in many places.
Two starting places (not necessarily in this order) are the FAQ and the Factor cookbook. The Factor cookbook is included in the Factor distribution, accessed by running Factor, selecting the "Browser" tab and clicking on Factor cookbook, or online. It was written by Slava Pestov, and illustrates really clearly the basics of Factor.
But once you get done with that, where should you go? Here are a bunch of options, depending on what exactly you want to do:
- Factor's included reference documentation, written mostly by Slava, is sound and basically complete but not very tutorial-like.
- Slava's blog is the best place to learn about new language features as they're developed.
- Aaron Schaefer wrote a series of blog posts that introduce Factor, starting here.
- I wrote a bunch of introductory blog posts.
- Elie Chaftari wrote an introduction to programming Factor, and he wrote about the FFI.
- You can look at many other blogs about Factor through the Planet Factor aggregator. Scattered around here are many more introductions to Factor.
- The Joy papers by Manfred von Thun provide a theoretical basis for concatenative languages.
- You could try reading some source code in the libraries. Simple demos have the tag demo, which you can query in the vocab browser. The core sequence library is a good place to look for clean, idiomatic code.
The best way to learn Factor, after you have a grasp of the basic syntax, is to play around with it and try to build something useful. Here are some project ideas:
- Write a date parsing library, accepting ISO 8601 format or a natural English (and maybe internationalized) format. A calendar backend already exists.
- Create an embedded, lexically scoped, infix/prefix DSL for complex math calculations
- Make an analog clock with the Factor UI library, or a GUI for Factor server administration
- Solve some Project Euler problems in Factor
- Whenever you encounter some part of Factor that you don't like, or some missing language feature, implement it. It's very likely that you'll be able to do whatever you want in the library
These problems may sound a bit difficult, but most of them have simple part-way solutions so you can begin to get your feet wet. And with any of these, if you have useful working code, we'd be happy to take it into the Factor library, if you feel like releasing it under a BSD-style license.
If you get stuck, don't despair! The friendly people on the Factor mailing list and irc.freenode.net's #concatenative channel are here to answer all of your questions. You could also just email me. Eventually, we'll have a Factor book, but it hasn't been started yet, in part because Factor's not at version 1.0 and in part because none of us have enough free time to write it.
Update: Factor documentation isn't as bad as I originally put it, so I changed this post a little.