Thursday, March 4, 2010

A protocol for sets

It's bothered me for some time that Factor has a protocol (an informal abstract class) for sequences and associative mappings, but nothing for sets. In core and basis, there are three set implementations: you can use a hashtable as a set, a bit array, or a plain old sequence. Different words are used to manipulate these, and each set type has a rather incomplete set of operations.

In a branch in my repository, I've fixed this omission. Now all three are under a common protocol, and more can easily be added. If you're interested in reading about the details of the protocol, you can pull from this repository, bootstrap and read the documentation included.

There are a number of benefits to this.
  1. All sets now support all operations using names that correspond to the intuitive meanings of the operations
  2. It is much easier to change set representations within a piece of code: just change the code that initializes the set
  3. It's easier to implement new types of sets because you get many operations 'for free' and a nice common API.

The design I went with had a few conflicting goals:
  • The protocol should be clean and simple, and therefore easy to implement
  • The new protocol should be mostly compatible with the existing sets vocabulary.
  • There should be no performance overhead (besides method dispatch) for using this protocol

Each of these had to be sacrificed a little. For performance, I had to make the protocol include many different operations, since different set implementations might have a way to override them for efficiency. To keep things simple and easy to implement, I did this only for operations that were currently in use in the Factor code base, and I included default methods for as many operations as I could. For compatibility, I made unordered sequences act as sets in a way that's very similar to the old sets vocabulary, and the major operations are generalizations of the operations on sequences.

There is not total backwards compatibility, though. If that had been a design requirement (say, if this were a change made on Factor 1.0), the design would be much cruftier. One change is that the word prune is gone, subsumed by members, which generates a sequence of the members of a set. Given a sequence, this gives a sequence with the same elements but only one copy of each.

A bigger change is that hashtables are no longer meant to be used as sets. In their place is a new data structure, the hash set. Hash sets have literal syntax like HS{ 1 2 3 }, similar to other collections. In their current implementation, they use a hashtable underneath, but in a future implementation a more memory-efficient construction may be used. Hash sets implement set operations and hashtables do not. Previously, words like conjoin, conjoin-at and key? were used to query hashtables as sets, but now these are subsumed by the set words adjoin, adjoin-at and in?. There is a lot of code that uses hashtables as sets, and it's not easy to sort out the set uses of hashtables from the non-set uses for someone who hasn't written the code. So for now, words to manipulate hashtables as sets are still present. conjoin and conjoin-at will be eliminated when all code in the Factor repository is updated to use hash sets instead.

It's somewhat to have a language evolution process where the language is not guaranteed to be compatible from one version to the next, as Factor is right now. There will continue to be incompatibilities until version 1.0 is released for the sake of clean organization of the language. There is a tradeoff here: incompatibilities make Factor harder to use now, and prevent adoption today, but the resulting system will be better-organized and easier to use as a result. Factor would probably be in much worse shape today if a policy of backwards compatibility had been adopted a few years ago, and it's a little too soon to start now in freezing the language.

Update: By the way, I forgot to mention in the original post: many other high-level programming languages don't have such a nice generic collection of data structures. Factor's isn't complete, but in many ways it's more advanced than many other popular programming languages. Java, C++ and C# are languages with generic data structures libraries, but using the data structures is more verbose due to certain missing language features, like the lack of syntax for literal hashtables. Popular scripting languages like Python, Ruby and Perl tend to privilege hashtables and resizable arrays over other data structures. It's possible to create data types that look just like arrays or hashtables using Python or Ruby in terms of their interface. But the higher methods for these data structures will only work for the builtin types and there's no way to make them work for your data type but to reimplement them. Functional languages like Scheme and Haskell have libraries for lists and arrays, but the interfaces are different for different data types. Even though Haskell has type classes, both of these languages' standard libraries are written with lists in mind for the most common operations. Factor used to resemble scripting languages in its support of data structures, but experience in writing large programs with these data structures has led to a better thought-out, more object-oriented model.

1 comment:

Anonymous said...

If you ever want to expand on sets, check out SETL.

http://en.wikipedia.org/wiki/SETL

http://setl.org/
(Binaries available, source code not available yet (???), but some documentation. He also has a link to his thesis on SETL on the page. I wonder if he's abandoned the code.)

http://www.settheory.com/
This is the website of the (now deceased) creator of SETL. It includes a mostly complete tutorial for a SETL implementation that really shows its utility - even as a replacement for SQL. A good set library might allow for the same use in Factor.

Someday I will try to add anything unique to SETL to Factor, if I can learn both languages well enough - that might take me a while :)