## Tuesday, April 10, 2007

### Why do we need data structures?

Right now, I'm facing a particularly annoying and indecipherable bug in integrating the assocs protocol with the rest of the Factor system (I have no idea what it is; it just doesn't work), so I decided to write another one of those stupid essays you all hate.

A couple nights ago, I was trying to explain to my mom what I was programming, the associative array protocol. I explained it badly, but a better way to put it would be this: imagine you're computerizing a library. The main thing you have to do is make it so that people can look up the call number of books. Everyone already knows the title and correct spelling of the book they want to get. Your job is to write a program that, given the title, will look up the call number for you. It has to do this quickly, even if there are a lot of books in the library. Of course, to do this, you need an associative array. This is obvious to anyone who's done much programming. I'm sure most of you already have your favorite structure picked out.

But why do we need to structure the data, anyway? A programmer might say, "to allow data to be accessed with a minimal asymptotic complexity." Basically, that means in as few steps as possible, even if there's a lot of data. Still, this doesn't completely mesh with how computers have been explained to the average person. It probably goes something like this. (This is about as far as the Magic School Bus gets, when the class goes into a computer)

First, input is given to the computer. The input is first in the RAM. It is in ones and zeroes, because that's easier for the computer to handle. To process the data, some ones and zeroes are sent to the CPU, together with some instructions, which write other ones and zeroes back to the RAM. The ones and zeroes represent numbers in base 2. Then, when you want to save the data, those ones and zeroes are taken from the RAM and written to the hard drive.

Then, many people understand the higher level stuff about computers: the Sony VAIO is a good laptop, this is how you use Microsoft Word, Mac OS X is based on Unix, etc.

In the middle, there's a huge gap of information that only specialists know about. Right above the bottom of that gap sit data structures. Data structures, along with file formats, are the way those ones and zeroes are organized. If we only had numbers, we really couldn't do much with computers.

The basis for all data structures is the pointer. Basically, every single location in the RAM has a number describing where it is. A pointer is a number that points to one of those positions in RAM. So the simplest thing you can do is say that a particular pointer doesn't just point to one number; it points to a number, and then the number right after it in RAM. And why limit yourself to just one? Why not 100? This pointer, the pointer to location 42313, represents the number at all of the locations 42313-42412. To get the 53rd item in the array, you just look up what number is at the pointer 42313+53 (counting starts at 0, not 1). This whole thing is called an array.

Data structures get much more complicated, but the idea remains the same: you have numbers, and then you have pointers to numbers, and even pointers to pointers to numbers. Why so many pointers and so many ways of organizing them? Because if you have the right pointers, they take you to the right number, the data that you want. A good deal of computer scientists' work has been devoted to getting just the right way to orient pointers, and counting exactly how many steps they take to find the data.

Anonymous said...

Daniel,

You are thinking at a physical level instead of at a logical level in this essay. Though you are correct as far as the current physical manifestation of hardware, this is not the only way it can be physically done.

So, let's not consider the physical at the moment but what you are trying to logically achieve. Leave pointers behind and work with symbols and values and see how this affects your thinking of the problem at hand.

These comments are meant only to give you a pointer (they can be useful logically) in the right direction.

regards

Bruce Rennie
(God's Own Country Downunder)

Daniel Ehrenberg said...

Thanks for commenting, Bruce. I see what you mean, but at the same time, I am thinking of what I'm logically trying to achieve. While writing code, I almost never think about pointers, but it's still useful to know that they're underneath it all (at least in most programming languages). In writing that article, I was thinking about what is the basis of data structures, and I realized it was references to other things.

Anonymous said...

Daniel,

The basis of data structures are values (of any kind in the universe of discourse). It can be helpful to think of each value as taking up the same amount of space as any other value no matter how complicated its structure might be.

Think of a pointer as a representation of the value in question and not a reference. Then forget about pointers altogether and think only of values.

I have been working on and off (more off than on) for some time on a project and in essence, from the perspective of the virtual machine, structures and the associated type may be represented by pointers, but from the perspective of the programmer there are no pointers and the type cannot be separated from the value. All values appear to be the same size irrespective of complexity.

Pointers are an implementation issue and for your purposes you don't need them in the language of choice.

regards

Bruce Rennie
(God's Own Country Downunder)

Nagaraja said...

which is the connecting media between a programming language and database...?