Monday, March 26, 2007

Dataflow in programming languages (with some Factor bias)

As I see it, there are three main ways to manage dataflow in a programming language. By this, I'm referring to how data passes both within a procedure. [The simplest one--visual boxes and arrows--is as easy to conceptualize as it is useless in practice, because it quickly gets unwieldy. So I'll ignore it for the purposes of this blog post.] Although it's often conflated in discussion, datflow pattern has nothing to do with other aspects of programming language paradigm. Data flow is often discussed in terms of functional vs. imperative, or applicative vs. concatenative, but this is a huge oversimplification. Advocates of one paradigm tend to claim that it is the best, for some reason or other, but in reality, all methods have applications where they excel and applications where they are pretty awkward.

I'll call the first dataflow method the mutation method. In this method, data is passed around within a function in named, mutable variables. Mutation is probably the oldest dataflow strategy because it directly corresponds to a register machine. So all early (50's and 60's) and most modern programming languages use it. The mutation method is probably what's most familiar to programmers: it's used in everything from C, Java, Basic, Ruby, Smalltalk and Python, to name a few. As an example, consider the definition of factorial in C:

int factorial(int x) {
int n;
for (n = 1; x>0; x--)
n *= x;
return n;
}

Update:Thanks to Dave for pointing out my extremely stupid errors in this code. It now works.
Though the Smalltalk version would look completely different, Smalltalk and C share the same fundamental property that the easiest way to pass around data in a relatively primitive loop is to mutate the value of variables in place. Self and Io may have some interesting semantics behind the getting and setting of variables, but in the end, it's all the same. The other two paradigms don't work like this. But not all programming languages using the mutation data flow are imperative. For example, Haskell, Lisp and ML can (all in different ways) emulate this sort of dataflow. In Common Lisp and Scheme, the value of any variable can be changed through setq and set!. In ML, mutable variables are implemented as an explicit reference. And in Haskell, there is the State monad and references through the I/O monad. I think it'd be possible to construct a purely functional language using this technique for everything.

The second way of doing things is what Lisp, Haskell and ML usually use: function application returning a value. I'll call this the application method. The first example I know of is John McCarthy's Lisp, created in 1958, a version of which is still in use today. Often, functional languages are defined in this way, but I think whether a language is functional or imperative has more to do with the use of data structures. In these languages, variables are (almost) always immutable, and loops are built up (at least in theory) through recursion. The main way a variable changes value is through function application: calling a function with different arguments "changes" the value of the variables. The same factorial function would be implemented in Haskell (without cheating and using any of Haskell's library or pattern matching) as

factorial n | n <= 0 = 1
| otherwise = n*factorial (n-1)
-- of course, it could more easily be written as
-- factorial n = product [1..n]

Recursion can, of course, be used in mutation dataflow languages, but in application dataflow method, it is usually much more efficient, so it can be used in practical applications. In application values, there tend to be fewer "statements" and more "expressions" (though statements still usually exist for definitions and module inclusion). Within application languages, there are of course differences. In some languages, mutable variables can actually be used (see above), and different forms of recursion are efficient in different application languages. But the overall concept is the same.

The third form of dataflow is point-free (also known as pointless by its detractors). It's definitely the newest kind of data flow at only around 30 years old. I'd estimate (and I'm basing this on absolutely nothing) that there are fewer than 1000 people writing code in this way in the world at any time. Point-free refers to the lack of points, or variables, in most or all code. Instead of variables, data is passed implicitly between functions. There are two ways to do point-free programming, which could be called stack-based and function-based. The progenitor of function-based programming is the late John Backus. It's sad his obituaries mostly said, "He invented Fortran, which is amazing. Then, he did other stuff that nobody cares about for 50 years." In his 1977 Turing Award address, Backus introduced a new programming language which he called FP. In FP, the only way to make programs is to compose functions with the function composition operator o. There are also operators used to apply one value to more than one function, or to map a function over all elements of a list. The problem is, this gets really awkward when you have three or more arguments to a function, or want to do certain other things that are more complex. But this style lives on as a sublanguage of Haskell and J. In Haskell's point-free style, factorial might be defined as

factorial = product . enumFromTo 1

Though, of course, this is cheating, as it uses two library functions defined in an application style.

As I already mentioned, the other style of point-free programming is stack-based, as used in Forth, Joy, Postscript, HP calculator language and Factor. Forth was developed by Chuck Moore in the early 1970s as the next generation language, though it never really caught on as such. Then, around 10 years ago, Manfred von Thun noticed the correspondence between FP and Forth, and created a new language called Joy which combines the strengths. In Forth and Joy, as in Factor, all dataflow (outside of some global variables and arrays) takes place on the stack. The stack can be manipulated with words (the equivalent of functions) like dup, swap and drop in ways that would be impossible in function-based languages like FP. There is more flexibility for more data to be passed around in different ways and treated granularly rather than as a unit. In Factor, factorial is (ignoring the library functions that would make it easier, and ignoring the fact that an iterative version would be more efficient)

: factorial ( n -- n! )
dup 0 <= [ drop 1 ]
[ dup 1- factorial * ] if ;
! Of course a better way of doing it would be
! : factorial ( n -- n! ) 1 [ 1+ * ] reduce ;

In some ways, this is very similar to the application definition, once you can understand the postfix syntax. But the semantics are very different: rather than applying a function to its arguments, you are passing around a single stack to various words. Nearly all words only modify the top couple arguments on the stack, though.

But there are finer distinctions to make. Though Joy and Factor are both stack-based, code comes out looking very different between them. This is because, in Joy's implementation, the data stack is just a linked list which is not mutated by words. In Factor, the data stack is implemented more efficiently, and it is mutated. Though Joy is slower, there are a lot of cool things that come out of this. For example, in Factor, it is a common pattern to dup one or more items on or near the top of the stack in order to evaluate the predicate for a conditional (eg if). In Factor, if takes three arguments. The first argument, found third-to-top on the stack, is a boolean value (Factor works like Common Lisp in that anything but f is treated as true). The second-to-top argument is the quotation to be executed if that value is not f, and the top argument is the quotation to be executed if the value is f. In Joy, the third-to-top argument is not a boolean but rather a quotation. This bottom quotation is executed, and a boolean value is taken from the top of the stack and saved somewhere. Then, the old stack from before that execution is restored, and one of the two following quotations is executed. Joy also uses recursion combinators, taking advantage of this property to factor out the abstract behavior of many recursion patterns. Though this is all nice to work with, it comes at the cost of efficiency, simplicity and ease of implementation.

As you may have realized at this point, point-free languages span the whole spectrum from relatively low-level imperative (Forth, at least at the most basic level) to pure or almost-pure functional (Joy and FP, also Haskell and J when used in this style) to impure functional (Factor, Postscript, Onyx). They can also be statically typed (Cat, StrongForth), untyped (Forth) or dynamically typed (basically everything else). Basically, they can be anything.

So, I'm not sure if this essay actually proved its thesis--that all dataflow strategies have strengths and weaknesses and that dataflow is irrelevant of paradigm --but I hope I've enlightened you a little bit as to what's out there.

Oh, and one more thing. I recently saw a blog post that pretty much claimed Haskell as the modern incarnation of FP. Although Factor doesn't completely hide the Von Neumann machine underneath, code doesn't need to care about it in any real sense. As far as dataflow goes, Factor is much closer to FP than Haskell is, as they are both about composing functions. Just to make it official and unambiguous, I claim Factor as the One True Heir to FP.

8 comments:

Anonymous said...

J is the closest thing to FP. FP was based on APL, and J is a successor to APL that was influenced by FP.

Daniel Ehrenberg said...

I didn't know that FP was based on APL, but I did know about J, so I referenced it in the article. It's entirely possible that J is closer to FP than Factor; my last comment about Factor being the one true heir of FP was tongue-in-cheek. I tried to learn J once, but I'm not smart enough to remember all the symbols and overloading. It seems like a cool language, though. Factorial in J would be something like /*1+i. wouldn't it? I really like that definition

In the future, could you identify yourself in comments? Thanks.

Dave said...

I know it's not the point of your article, but you may wish to use this C definition of factorial instead.

int factorial(int x) {
int n;
for (n = 1; x>0; x--)
n *= x;
return n;
}

(Blogger throws away the indentation, tsk.)

orib said...

Personally, I like K better.

J's implementation was inspired by A, a minimalistic version of APL written by Arthur Whitney on one afternoon. A+ (available from aplusdev.org) is probably an evolution of the same A -- and so is K, Arthur Whitney's latest creation.

An interpreter is no longer freely available, but the docs are very good and worth reading (at http://kx.com/q/d )

In K it would indeed be */1+!

I can't keep all of J's symbols in my head, but K is much cleaner and much more orthogonal.

Daniel Ehrenberg said...

Thanks Dave. That was a really stupid mistake. It's now fixed.

Orib, yeah, I was looking at K before, and it looked very good, but then it stopped being free, so I stopped looking at it. It's sort of sad that they did that.

Samuel Tardieu said...

Is this K style available for the definition of such a function?

In J, you can use:

*/>:i.10

to compute 10!, but to define a new fact word, you have to use the composition operators:

fact =: */@:>:@i.

if you want to do it the FP way.

Anonymous said...

Another method of dataflow is the "pure dataflow" model used in LabView or the more recent plt-scheme frTime module. It's completely different than the other three IMHO.

nikhil said...

nice post man!
Your blog isn't just helping me with factor, it's teaching me a lot of other (cool!) stuff as well