Post removed
The post that used to be here was wrong and pointless. I've removed it, but if you're really curious, it's just commented-out as HTML, so you could still read it.
Adventures in computing and the Factor programming language
The post that used to be here was wrong and pointless. I've removed it, but if you're really curious, it's just commented-out as HTML, so you could still read it.
Posted by
Daniel Ehrenberg
at
9:10 PM
2
comments
You may have already seen this on Slava's blog, but for those of you who didn't: on November 18-21, we're having a Factor-related get-together in Minneapolis, Minnesota. All levels of experience/use of Factor are encouraged to attend. Confirmed guests include Slava Pestov, Doug Coleman and me. If you think you can come, please contact me (or someone else who's already going). I haven't been able to do much Factor hacking lately, due to school constraints, so this'll be a great opportunity for me, personally. For all of you: if you don't know Factor well, this would be a great time to learn more; if you do know Factor well, this would be a good chance to meet your fellow developers. I hope to see you there!
Posted by
Daniel Ehrenberg
at
1:54 PM
0
comments
Labels: factor
In my FAQ, I wrote that Factor was object-oriented. That might have confused some of you: how is Factor object-oriented when code isn't organized in terms of classes like in Java? When there is no system for message passing*? When tuples can't inherit from each other?
Rationale
Originally, Factor didn't have an object system. It was said that it's not necessary; we can get by with just words, conditionals and lists. But as the library grew more complicated, it became clear that we needed a cleaner method of dispatch. This took the form of generic words, or words that can be overloaded by the class of one of their arguments. Initially, we only had predicate classes, and generic words were in fact no more than syntactic sugar for cond. But we also needed a good way to define data types, so that they wouldn't be confused with each other. Later, we realized the need for a system of 'inheritance' from tuples to tuples, flexible constructors, extensible unions and other things. Piece by piece, features were added to form a emerging coherent whole which could be used to organize some Factor code.
It's important to note that most words are not methods and not all data is stored in tuples. Groups of Factor code are put in vocabularies, not in classes. It's common to write a lot of code without using any object-oriented features (aside from generic words defined in libraries). Nevertheless, object orientation, when used appropriately, makes code much better organized.
The class hierarchy
In Factor, everything is an object. Well, of course everything is an object; everything is something! But I mean everything is an instance of the class object. Under that class, there are a number of built-in classes, representing basic things like tuple, vector, array, fixnum, float, bignum, ratio, quotation, curry, etc. You can test if an object is in a particular class with words of the scheme object?. To test if an object is a tuple, use the word tuple?.
On top of these builtin classes, there are a number of abstractions. For example, the class number is the union of all numeric classes. Under this is a numerical tower of subclasses, similar to Scheme. An example of a couple unions of this kind:
UNION: integer fixnum bignum ;
UNION: rational ratio integer;
MIXIN: rational
MIXIN: integer
INSTANCE: rational ratio
INSTANCE: rational integer
INSTANCE: integer fixnum
INSTANCE: integer bignum
PREDICATE: integer natural 0 >= ;
GENERIC:. Methods, or implementations of the generic word, are written with the parsing word M:.Take the following useless example:
GENERIC: process ( thing -- processed )
M: string process
"foo" append ;
M: integer process
5 + ;
process with two methods: one on strings (which appends "foo") and one on integers (which adds 5). This is equivalent to:
: process ( thing -- processed )
{
{ [ dup string? ] [ "foo" append ] }
{ [ dup integer? ] [ 5 + ] }
} cond ;
process has the advantage that it is extensible later, in other files. Additionally, generic words are sensitive to the class hierarchy and will test things in an appropriate order, and dispatch is more optimized than with a simple cond expression.GENERIC:. Other method combinations include MATH:, for words which dispatch on builtin math types for two arguments, HOOK:, to dispatch on the class of the value of a variable, and GENERIC#, which allows dispatch on items further down on the stack. You can also make your own method combinations.foo, with slots bar and baz, together with a constructor called <foo>:
TUPLE: foo bar baz ;
C: <foo> foo
foo tuple, just get the values of bar and baz on the stack and call <foo. With that object, the bar slot can be read with the word foo-bar the baz slot can be read with the word foo-baz. To write values to those slots, use the words set-foo-bar and set-foo-baz. And, of course, to test if an object is a foo tuple, use the word foo?. All in all, TUPLE: is a wildly unhygenic macro, and we like it that way!delegate and set-delegate). When a generic word dispatches on a tuple, but it doesn't have a method implementation for that tuple, it looks in the delegate slot for what to do next. If there is a delegate, then the generic word is again invoked on that delegate, with the hope that generic word itself has an implementation for that class. If not, it's on to the next delegate until there is no delegate anymore--the delegate is f.***TUPLE: foo bar baz and I want it to inherit from TUPLE: bing quux ;, all I have to do is make an instance of foo and an instance of bing, and set the delegate of the foo to the bing. Slot accessors like bing-quux and set-bing-quux are generic words implemented, by default, only on that one tuple class that they were made for. So, if you call bing-quux on a tuple of class foo that delegates to a bing, the foo instance will fail to respond and the generic word will be called again on the delegate, bing. This delegate will know what to do, and it'll return its quux value.
TUPLE: foo bar baz ;
TUPLE: bing quux ;
GENERIC: who
M: bing who how ;
GENERIC: how
M: bing how drop "bing" print ;
M: foo how drop "foo" print ;
who is called on a foo instance, "foo" is printed, but in fact, "bing" is printed. This is called the slicing problem: the object is "sliced" when using delegates, so only part of the object is represented in further method calls. On one hand, the behavior of M: bing who is more predictable, but on the other hand, information is lost. It's not a horrible thing, just something to watch out for.inverse, don't yet use the object system at all. The XML-RPC vocabulary relies heavily on generic words to perform dispatch in turning Factor data into XML-RPC format, though no new classes are defined. The XML library, however, uses tuples to represent XML at various stages of processing, and dispatches on the class of these tuples to provide complex functionality. Several tuples are defined in the XML library to represent exceptions, though any object can be thrown as an exception.visitor (since it's implemented with the visitor pattern, yielding a simple lexicographic method order). Maybe in the future we'll have double dispatch in the Factor core, to better implement words like like, which end up approximating double dispatch using conditionals. Though it makes sense in theory, I can't imagine any use for triple dispatch in Factor.
Posted by
Daniel Ehrenberg
at
10:00 AM
3
comments
Labels: factor, introduction, oop
The Factor FAQ has moved to concatenative.org.
Posted by
Daniel Ehrenberg
at
1:33 PM
8
comments
Labels: factor, introduction
What could be simpler than alphabetical order1? It's a simple lexicographic comparison of two strings, comparing the first character to see which is greater, then the second as a tie breaker, then the third, and so on. Unicode makes a lot of things harder, but this shouldn't be any different. Unicode doesn't prohibit this kind of alphabetical order; you can have a 'Unicode-conformant' process which uses this order.
Great idea, but...
This kind of order, which we could call code point order, is useless. Well, it's not completely useless; it's fine if you need an arbitrary, consistent order. But for humans, it's useless. It puts Z before a. In Normalization Form D (which I'll assume strings are in for the rest of the article), it puts the "äa" after "az". It puts "aa" after "a-z". It puts "æ" after "af". All of these are backwards for users in the American English locale and most other locales.
It's not just that the letters are in the wrong order. It also won't be sufficient to put all characters in a consistent case, as the R6RS Unicode module authors seem to think, as this will still give unnatural behavior for punctuation, accent marks, ligatures, and other things. Removing all of these features won't do it either; they can't be completely ignored, becase "foot ball" should come consistently after "football", and "Apple" should come consistently either before or after "apple", depending on the needs of the user.
It's crucial that things are sorted correctly; if they are in a different order than the user expects, then a user, looking at a long alphabetized list, may conclude that an item on the list doesn't exist.
A better way for people
To get things right, the way humans expect them, you need another few levels of tiebreakers. When comparing two strings to see which one comes first, there's a number of steps: First, compare two strings with all of the features stripped. If they're equal, compare the accent marks of the strings, as a tie breaker. If those are equal again, compare the capitalization as a second tie breaker. And if those are still equal, compare the punctuation.
I left out one thing--ligatures like æ. This can be expanded to something that looks like "ae" in the first stage, but looks different in the accent mark tiebreaker. We'll call them expansions. In some languages, like Danish, there are also contractions like "aa" with their own special alphabetization; these contract into a single value for the initial sort, if the Danish locale is used. In Thai and Lao scripts, there is an additional problem of certain vowels and consonants reversing order for collation purposes. This can be achieved with a combination of the expansion and contraction patterns. There's also a special problem for Korean collation2.
UTS #10
Believe it or not, I didn't come up with this all by myself. It's part of the Unicode Standard, in Unicode Techinical Standard #10: Collation. UTS #10 isn't something that's required of Unicode implementations, but it is specifically standardized and codified. It does basically what I described in the section above, with a slight change for efficiency. Because the folks at the Unicode Consortium did all this work, we don't have to! They've even given us a convenient table of data that makes it easier to implement the above tiebreakers.
UTS #10 suggests a simple way for implementing these tiebreakers that works for efficiently for sorting large lists of strings: calculate a collation key, a bit array, for each string, and compare these collation keys with a simple bit comparison. This collation key can be calculated in linear time and space with respect to the string, so a human-appropriate comparison is asymptotically no worse than a naive one and suffers only linear overhead.
Collation keys and the DUCET
But how do we make the collation key? Well, that table I mentioned earlier is called DUCET, the Default Unicode Collation Element Table. It describes a neutral locale which is linguistically inaccurate everyone. For most3 characters in NFD, there is a three-tuple4 of the primary, secondary, and tertiary collation weights. In assembling the collation key from a string, go through the string, first appending the primary weights of each character5, then a separator of 0, then all of the secondary weights, followed by another separator, and finally the tertiary weights. If one of these weights for a particular character is 0, leave it out. This is all followed by a representation of the punctuation6.
A simple binary comparison of the collation keys ends up being equivalent to what I described before. How? When the primary weights are appended, they form something equivalent to the characters all put in a consistent case, with accent marks and punctuation removed. In a binary comparison of collation keys, these are compared first. They are followed by a separator of 0 so that, if the strings' primary weights are equivalent up to the end of the shorter one, the longer string will be sorted second. This is because 0 is less than any collation weight. The secondary collation weight represents the accent marks, and the tertiary collation weight represents the capitalization. (See note 5 about punctuation.)
Implementation and tailoring
I wish I could give hints on implementation, but, umm... I haven't written a working one yet. It's quite an involved algorithm, at least compared to things I've done before. Luckily, the UTS gives lots of hits for how to do it. One thing that makes implementation easier is that the collation weights of many characters, mostly Han ideographs, can be derived algorithmically through a formula specified in UTS #10. In fact, this is somewhat unfortunate, as it is a symptom of the fact that the ordering of Han ideographs requires a more involved process, using some sort of dictionary lookup for pronunciation or radical index, depending on the context.
But implementation comes with an additional challenge: tailoring to specific locales. The DUCET isn't linguistically accurate for any locale, and must always be modified somewhat. Different contexts such as language, nation, context and organizational preference can demand slightly different collation orders that users depend on. Modifying the collation table to suit this is known as tailoring. I'm not yet sure what algorithm should be used to reshuffle the collation weights in the right way. But there is a ton of locale data in the Common Locale Data Repository (CLDR), a project run by the Unicode Consortium. This includes information about collation.
Conclusion
Collation isn't easy, but it's necessary to get right. UTS #10 lays out an algorithm for many aspects of collation, but still leaves some things out: there's no specified process for ignoring words like 'the' in the beginning of a book title, or padding numbers with zeros so that 2 comes before 10. Nevertheless, with this algorithm and appropriate locale data, it's possible to give most users a linguistically accurate alphabetical order which makes sense to humans.
Posted by
Daniel Ehrenberg
at
12:38 PM
0
comments
Labels: unicode