Wednesday, May 18, 2011

Why not mmap?

mmap() is a beautiful interface. Rather than accessing files through a series of read and write operations, mmap() lets you (virtually) load the whole file into a big array and access whatever part you want just like you would with other RAM. (It lets you do other things, too; in particular, it's the basis of memory allocation. See the man page for details.) In this article, I'll be discussing mmap() on Linux as it works on virtual-memory architectures like x86.

mmap() doesn't actually load the whole file in when you call it. Instead, it loads nothing but file metadata. In the page table, all of the mapped pages are marked so that any read or write triggers a page fault. The page fault handler reads the page from disk into main memory, updating the page table so that later accesses to that page don't fault. In this way, the file is lazily read into memory. The file is written back out through the same writeback mechanism used for the page cache in buffered I/O: after some time, or under memory pressure, the contents of memory are automatically synchronized with the disk.
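
As a concrete sketch of this from the user-space side (the file name data.bin is hypothetical, and error handling is trimmed): nothing is read from disk at the mmap() call itself; the first dereference of each page takes the fault-and-load path described above.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);  /* hypothetical file */
        if (fd < 0) return 1;
        struct stat st;
        fstat(fd, &st);

        /* Map the whole file; no data is read from disk yet. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        /* This dereference page-faults, and the kernel reads the page in. */
        printf("first byte: %d\n", p[0]);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }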

mmap() is a system call, implemented by the kernel. Why? As far as I can tell, what I described above could be implemented in user-space: user-space has page fault handlers and file read/write operations. But implementing it in the kernel has several advantages:
  • If a file is being manipulated through mmap() and through something else (say, read() and write()) at the same time, the kernel can keep the two views in sync
  • The kernel can do it faster, with specialized implementations for different file systems and fewer context switches between kernel-space and user-space
  • The kernel can do a better job, using its internal statistics to determine when to write back to disk and when to prefetch extra pages of the file

One situation where mmap() looks useful is databases. What could be easier for a database implementor than an array of "memory" that's transparently persisted to disk? Database authors often think they know better than the OS, so they like to have explicit control over caching policy. Various file and memory operations, in conjunction with mmap(), give you exactly this (a sketch combining them follows the list):
  • mlock() lets you force a range of pages to be held in physical memory, and munlock() lets you release them. When no swap is configured on the server, locking pages of the mapping is basically equivalent to making that part of the file resident in a user-space cache.

    Memory locking can be dangerous in an environment with many processes running, because the out-of-memory killer (OOM killer) might kill some other process as a result of your profligate use of memory. However, cgroups or virtualization can mitigate this possibility and provide isolation.
  • madvise() and posix_fadvise() let you give the OS hints about how to behave with respect to the file, for example to encourage it to pull pages into memory or push them out. MADV_DONTNEED is a quick way to discard a range of pages (anonymous pages read back as zeroes afterward), and it could be translated into TRIM on SSDs.
  • fdatasync() lets a process force some data onto the disk right now, rather than trusting writeback to get it there eventually. This is useful for implementing durable transactions.
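
Here's a minimal sketch (not a definitive recipe) of how these calls combine with a mapping. The function name and the idea of "hot" and "cold" regions are made up for illustration, and all offsets and lengths are assumed page-aligned.

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Sketch: pin a hot region, drop a cold one, and force data to disk.
       'fd' and 'map' come from open() and mmap() as in the earlier example. */
    void cache_control(int fd, char *map, size_t hot_len,
                       size_t cold_off, size_t cold_len) {
        /* Fault the hot region in and pin it in physical memory. */
        mlock(map, hot_len);

        /* Tell the kernel we won't need the cold region soon. */
        madvise(map + cold_off, cold_len, MADV_DONTNEED);

        /* Force the file's data (not necessarily metadata) to disk now. */
        fdatasync(fd);

        munlock(map, hot_len);
    }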

Great! And in Linux, you can open a raw block device just by opening a file like /dev/hda1 and mmap() it straight from there, so this gives database implementors a way to control the whole disk with the same interface. This is great if you're a typical database developer who doesn't like the OS and doesn't trust the file system.
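
For illustration, a hedged sketch of mapping a partition directly; the device path is an example, and you need permission to open it:

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Sketch: map a raw partition with the same interface as a file.
       This bypasses whatever file system might be on the device. */
    char *map_device(const char *path, size_t *size_out) {
        int fd = open(path, O_RDWR);        /* e.g. "/dev/hda1" */
        if (fd < 0) return NULL;

        unsigned long long size = 0;
        ioctl(fd, BLKGETSIZE64, &size);     /* device size in bytes */

        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { close(fd); return NULL; }
        *size_out = size;
        return p;
    }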

So this sounds like a nice, clean way to write a database or something else that does serious file manipulation. Some databases use this; MongoDB is one example. But the more advanced database implementations tend to open the database file in O_DIRECT mode and implement their own caching system in user-space. Whereas mmap() lets you use the hardware page tables (on x86) for the indirection between the logical address of the data and where it's stored in physical memory, these databases go through an extra indirection in their own data structures. And they have to implement their own caches, even though the resulting caches often aren't smarter than the default OS cache. (The logic that makes the caching smarter is often encoded in an application-specific prefetcher, which can be done perfectly well through memory mapping.)

A problem with mmap()

High-performance databases often get lots of requests: so many that, if the database spawned a thread for each one, the overhead of a kernel task per request would slow it down (where task means 'thread or process', in Linux terminology). Threads have a bit of overhead:
  • Each thread must have its own stack, which takes up memory (though this is mitigated by the lazy allocation of physical memory to back stacks, which is done by default)
  • Some Linux CPU schedulers use a lot of CPU themselves, so blocking and then getting resumed carries a certain amount of overhead. In particular, overhead is incurred so that the scheduler can be completely fair and so that it can load-balance between cores.

To solve these issues, database implementors often respond to each request with a user-level coroutine, or even with an explicitly managed piece of state sent around through various callbacks.

Let's say we have a coroutine responding to a database request, and this coroutine wants to read a location in the database that is currently stored on disk. If it accesses the big array, it will cause a page fault leading to a disk read. This will make the current task block until the disk read completes. But we don't want the whole task to block; we just want to switch to another coroutine when we have to wait, and we want to execute that coroutine from the same task.

The typical way around this problem is to use asynchronous or non-blocking operations. For non-blocking I/O, there's epoll, which works for some kinds of files. For direct I/O on disk, Linux provides a different interface called asynchronous I/O (AIO), with system calls like io_submit. The two mechanisms can be hooked together with an eventfd, which is triggered whenever there are AIO results; libaio's essentially undocumented io_set_eventfd helper attaches the eventfd to a request. The basic idea is that you set up a bunch of requests in an object, and then you have a main loop, driven by epoll, where you repeatedly ask for the next available event. The coroutine scheduler resumes the coroutine whose event completed, and executes that coroutine until it blocks again. The details of using this mechanism are a bit obscure, but not very deep or complicated.
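
To make the moving parts concrete, here is a hedged sketch of a single read request flowing through this machinery, using libaio (link with -laio). The file name is hypothetical, error handling is omitted, and a real server would submit many requests and loop; this just shows the submit/epoll/reap sequence.

    #define _GNU_SOURCE     /* for O_DIRECT */
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);  /* hypothetical */
        int efd = eventfd(0, 0);
        int ep = epoll_create1(0);

        struct epoll_event ev = { 0 };
        ev.events = EPOLLIN;
        ev.data.fd = efd;
        epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

        io_context_t ctx = 0;
        io_setup(64, &ctx);

        /* O_DIRECT requires aligned buffers. */
        void *buf;
        posix_memalign(&buf, 4096, 4096);

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);
        io_set_eventfd(&cb, efd);     /* completion will bump the eventfd */
        io_submit(ctx, 1, cbs);

        /* The main loop: epoll says the eventfd fired, then we reap events.
           A coroutine scheduler would resume the waiting coroutine here. */
        struct epoll_event out;
        epoll_wait(ep, &out, 1, -1);

        uint64_t n;
        read(efd, &n, sizeof n);      /* n = number of completed requests */

        struct io_event events[1];
        io_getevents(ctx, 1, 1, events, NULL);

        io_destroy(ctx);
        return 0;
    }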

A proposed solution

What the mmap() interface is missing is a non-blocking way to access memory. Maybe this would take the form of a call based around mlock, like
    int mlock_eventfd(const void *addr, size_t len, int eventfd);

which would trigger the eventfd once the len bytes starting at addr were locked in memory. The eventfd could be placed in an epoll loop, and the requested memory would be dereferenced for real once it was locked. A similar mechanism would be useful for fdatasync.

We could implement mlock_eventfd in user-space using a thread pool, and the same goes for fdatasync. But this would probably eliminate the performance advantages of using coroutines in the first place, since disk accesses are pretty frequent in databases.
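
For what it's worth, here's what such a user-space emulation might look like: a minimal sketch that spawns one helper thread per call (a real version would use a pool), with the thread blocking in mlock() and then signaling the eventfd. The struct and function names are my own invention.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct lock_req { const void *addr; size_t len; int efd; };

    static void *lock_worker(void *arg) {
        struct lock_req *r = arg;
        mlock(r->addr, r->len);           /* blocks here, faulting pages in */
        uint64_t one = 1;
        write(r->efd, &one, sizeof one);  /* wake the epoll loop */
        free(r);
        return NULL;
    }

    int mlock_eventfd(const void *addr, size_t len, int efd) {
        struct lock_req *r = malloc(sizeof *r);
        if (!r) return -1;
        *r = (struct lock_req){ addr, len, efd };
        pthread_t t;
        if (pthread_create(&t, NULL, lock_worker, r)) { free(r); return -1; }
        pthread_detach(t);
        return 0;
    }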

As databases and the devices that underlie them grow more complex, it becomes difficult to manage this complexity. The operating system provides a useful layer of indirection between the database and the drive, but old and messy interfaces make the use of the OS more difficult. Clean, high-level OS interfaces which let applications take full advantage of the hardware and kernel-internal mechanisms and statistics would be a great boon to further database development, allowing the explosion of new databases and solid-state drives to be fully exploited.

Saturday, April 16, 2011

I have been sucked into the vortex

Last Monday, I started at Google. I'm working on the kernel storage team, trying to optimize Linux asynchronous I/O for flash, which we are experimenting with. I really love it at Google; the food is great and the people are extremely smart. Many of the top Linux kernel hackers are employed by Google, and it's amazing that I can work with them.

I haven't been posting much here, partly out of laziness and partly out of perfectionism, but I have a few half-written blog posts that I'd really like to get out. Now that I am working, I'll have even less time than before. On the other hand, since what I'm doing is for the kernel, you can expect to see detailed explanations of my changes here once they're ready to go upstream. I'm really excited to be working on a project so deep in the stack, and it will be especially satisfying to release changes as open-source. Wish me luck!

Thursday, December 23, 2010

Work at RethinkDB

I've been working at RethinkDB for the past month. My project has been to convert the code base from using callbacks for asynchronous I/O (essentially continuation-passing style) to a more natural style of concurrency using coroutines. You can read about my progress on the RethinkDB blog.

Sunday, May 30, 2010

Paper for DLS 2010

Slava Pestov, Joe Groff and I are writing a paper about Factor for the Dynamic Languages Symposium, part of OOPSLA. It's a survey discussing both the language design and its implementation, and you can read the draft on the Factor website. The paper is due on Tuesday, and we would really appreciate any comments you have on it.

Update: I've submitted the paper. Thanks everyone for your helpful suggestions! I'll hear back about this July 15, and report here when I do.

Sunday, April 25, 2010

Guarded method inlining for Factor

As a compiler optimization to make code faster, Factor's compiler tries to eliminate calls to generic words, replacing them by direct calls to the appropriate method. If the method is declared inline, then its definition will also be inlined. Together, these two optimizations are informally called 'method inlining'. Method inlining is essential for making object-oriented code in Factor fast. And since basic operations like accessing elements of a sequence or adding two numbers are implemented by method calls, this is needed for any Factor code to be fast.

How things worked

Here's how the current algorithm works. Method inlining takes place within Factor's sparse conditional constant propagation pass. SCCP infers upper bounds for the classes of values that the code manipulates. When SCCP processes a generic word call, it examines the class of the receiver as well as the list of classes that have methods on the generic word. If SCCP can tell that a particular method will always be called, it can select that method. Below is some pseudocode for how it detects that.

For each method on the generic word:
    If the class for that method intersects the receiver class:
        If the class for that method is a superclass of the receiver class:
            Put it on a list
        Else:
            We don't know whether this method will be called at runtime,
            so bail out and fail to inline a method
Inline the method for the smallest class on the list

There's an additional complication. It might be that a word is compiled, with the method inlining optimization applied, but then after this, more vocabs get loaded that add additional methods to the generic word. This might invalidate the correctness of the method inlining, and the word should get recompiled to fix this problem. Factor uses a simple system to track these dependencies, in stack-checker.dependencies.

My new addition

A few days ago, another Factor developer told me about a hack he'd added to SCCP to make a particular benchmark faster. The hack was, if >fixnum is called on an object that SCCP knows is either a fixnum or f, then the >fixnum call is replaced with dup [ \ >fixnum no-method ] unless. This works because >fixnum doesn't have any methods on f, and >fixnum on fixnums is a no-op.

My immediate instinct here was to generalize this solution. The first step is to convert that code into dup fixnum? [ M\ fixnum >fixnum ] [ \ >fixnum no-method ] if, which we can do since >fixnum doesn't have any other methods on the union of f and fixnum. Unlike the kind of method inlining described earlier, this requires the insertion of a guard. Later code will know (through SCCP) that the object is a fixnum, even after the conditional exits, since it can tell that an exception would be thrown otherwise.

The second step is to convert fixnum? into >boolean, which we can do because we know that the value on the top of the stack is either f or a fixnum. With these two transformations, we should generate the same code as the hack generated, but the transformation should also work on other things.

The second change was easy. I just added custom inlining for the instance? word to detect basically this exact case, and convert the test into >boolean when this is valid.

The first change was a lot more work. First, I had to come up with the algorithm for choosing the right method (even if it seems obvious now in retrospect).

Make a list of methods on the generic word whose class intersects the receiver class
If this list consists of one element, then return it
If the list consists of multiple or zero elements, then there is no method to inline

Once this class is found, then propagation should generate code like this:

dup method's-class instance? [
    M\ method's-class generic-word execute
] [
    \ generic-word no-method
] if

This is valid because we know that, if the receiver fails the test, then it must be of some class that has no method on the generic word.

An extra requirement for the correct implementation of this compiler optimization is dependency tracking. For this, I made two new types of dependencies. One corresponds to the test that the generic word has only one particular method intersecting the class that's on the stack. The other tracks whether an inlined method is later redefined. I wasn't able to reuse the dependency tracking from the other kind of method inlining, but it all fits into the same framework and didn't take much code.

All of this was pretty hard for me to debug, but it was a fun thing to work on. Among the code loaded in a basic development image, over 2400 methods are inlined using this technique, which were previously impossible to inline. There was probably also more method inlining done following this, due to improved type information, though I haven't collected statistics. And most importantly, I was able to eliminate the hack with >fixnum with no regression in performance.

Once I get a couple kinks worked out, this should be merged into mainline Factor. For now, it's in the propagation branch of my repository.

Tuesday, April 6, 2010

A couple language design ideas

I know I shouldn't really write about something I haven't implemented yet, but here are a couple ideas that I've been talking about with some of the other Factor developers that I'd like to share with you. Credit for these ideas goes mostly to Slava Pestov and Joe Groff.

Multimethods in Factor

Factor should have multimethods. They'd clean up a lot of code. There already are multimethods, in extra/multimethods, but these are extremely slow (in the average case, you have to search through half of the method list to find the right one, which is unacceptable). Their syntax could also be cleaned up a bit. These multimethods, as they're already implemented, combine dispatching on things on the stack with dispatching on the values of dynamically scoped variables. The latter is called 'hooks' in Factor, and is useful for what other languages do using conditional compilation.

Here's a sample of what the syntax might look like:
! Before the -- is upper bounds on the allowable methods
! after the -- is a type declaration that is checked.
GENERIC: like ( seq: sequence exemplar: sequence -- newseq: sequence )

! Parameters with specific types in the parentheses--this is not a stack effect
M: like ( array array ) drop ;

! The single parameter refers to the top of the stack
! and the second element isn't used for dispatch
! so it defaults to sequence
M: like ( array ) >array ;

GENERIC: generate-insn ( instruction: insn -- | cpu ) ! after the | are the hook variables

! Parameters before the | come from the stack; after it come the hook variables.
! x86 is a singleton class.
M: generate-insn ( ##add-float | cpu: x86 )
[ dst>> ] [ src1>> ] [ src2>> ] tri
double-rep two-operand ADDSD ;

I've implemented some of this syntax in the multimethods branch of my git repository. Another cool part, language-design-wise, is that multimethods can replace a few different other language features.

For one, they can replace Joe Groff's TYPED:. That lets you declare the types of arguments of a word, and have those be automatically checked, with checks removed if the compiler can prove they're unnecessary. In the ideal design, : will replace TYPED:, and if any parameters have type declarations, then the colon definition is macro-expanded into a generic word with only one method. This implements the type checking.

Multimethods could also be used to implement 'hints' better than they are right now. Hints are a feature of the compiler where a programmer can instruct the compiler to create specialized versions of a function for certain parameter types. Currently, hints use a very inefficient dispatch mechanism: when you call a word with hints, it goes down the list of specialized implementations of the word, testing the stack to see if the contents of the stack are members of those classes. But if efficient multimethod dispatch were worked out, this could be much faster. Also, method inlining would make it so that the hints dispatch would be eliminated if types are known by the compiler. This isn't done right now.

Callsite specialization could also be implemented with the help of multimethods. The difference from hints is just that a new specialized version would be generated whenever type inference found certain types for the arguments of a function and there wasn't already a hint for those types. The basic idea of callsite specialization is that the specialized versions are used only when the compiler can prove they're appropriate. But if efficient multimethod dispatch were used, then callsite specialization could generate hints that are also available for values whose types are unknown until runtime. Runtime type feedback could work this way too.

There are some implementation challenges in making multimethods fast. From what I've read, it seems like the best way is something like this algorithm by Chambers and Chen used by the Cecil compiler. It generates a dispatch DAG for each multiple-dispatch generic word, where the nodes are single dispatches and the leaves are methods. The algorithm described directly in the paper looks a little impractical, since it includes iterating over all classes, but I think if you just iterated over classes that are method arguments, together with the intersections of these, then that'd be enough.

There's a problem, though, which is that Factor's object system is complicated, and it's not obvious how to get the intersection of classes. The current implementation is probably buggy; if it's not, then I can't understand why it works. So class algebra will have to be fixed first, before multimethods can work. I hope I can figure out how to do this. I bet it has something to do with Binary Decision Diagrams, since we're talking about intersections, unions and negation. We might want to make things simpler by factoring predicate classes out in the way that the Cecil paper does. Either way, things are complicated.

A protocol for variables

I never would have expected it, but the stack-based programming language Factor uses tons of variables. And lots of different kinds, too.
  • Lexically scoped variables in the locals vocab
  • Dynamically scoped variables in the namespaces vocab
  • Two types of globals, using VALUE: and using words for dynamically scoped variables in the global scope
  • Thread-local variables in the thread vocab
  • For web programming with Furnace, there are session variables and conversation variables and scope variables
These are all useful and necessary (though maybe we don't need two kinds of globals), but it's not necessary that they all use different syntactic conventions. They could all use the same syntax.

Some of these variables always correspond to words (locals and values) and some of them don't. Probably it'd be better if all variables corresponded to words, to reduce the risk of typos and to make it possible to have terser syntax to use them, the way we have terse syntax to use locals and values.

Here's my proposal for syntax. Whenever you want to use a variable (unless it's a local), you have to declare it beforehand in some vocab. For example, the following would declare a dynamically scoped variable foo and a thread-local variable bar. Currently, you wouldn't declare these ahead of time, and instead you would just use symbols for these purposes.
DYNAMIC: foo
THREAD-LOCAL: bar
Then, to read the values of these variables, the following syntax would be used
foo
bar
Reading a variable should just be done by executing the word that is the variable. This is how it works in almost all programming languages. Since the variable has been declared ahead of time, we know what kind of variable it is and how to read it. There could be a single parsing word for writing to variables, such as to: (this is what values uses right now), as in the following.
4 to: foo
Again, since the variable is declared ahead of time, we know how to write to it, and the right code is generated at parsetime. We could also have combinators like change, written in a way that works with all variable types:
[ 1 + ] change: foo

Another change I'm interested in is the scoping rules for dynamically scoped variables. Right now, when with-variable is invoked, a totally new scope is made such that if any dynamic variable is set, that setting is only kept as long as the innermost with-variable call. These semantics were made back when the fundamental construct for variables was bind, but in modern Factor code, this isn't so important anymore.

The current semantics are a frequent source of bugs. Basically, it's the funarg problem all over again: if you have a higher order function which includes a with-variable call, and if the quotation sets any variables, then these variable settings don't escape the higher order function call. This is usually not what the programmer intended! There are various hacks, euphemistically called idioms, used in Factor to get around this.

The semantics should really be that set modifies the binding in the scope that defined the variable, with the original binding always coming from with-scope. But this change would break tons of Factor code that already exists, so I don't expect it to be used any time soon. Also, it would make things less efficient if each variable binding had its own scope, since right now things are often grouped together.

One of my far-out hopes with variables is that maybe dynamically scoped variables could be optimized so that they would be as efficient as lexically scoped variables. But this would require lots of changes to Factor's implementation. For example, right now no level of the Factor compiler knows anything about any kind of variables; they're either subroutine calls or stack shuffling by the time the compiler sees them. Still, a sufficiently smart compiler could translate a usage of make into a version that stores the partially built sequence on the stack, and there have been compilers that do the equivalent of this.

Wednesday, March 17, 2010

Expressing joint behavior of modules

Sometimes, special code is needed to make two modules interact together. This seems to come up all the time in Factor. For example, there are two language features, locals and prettyprinting, implemented in the Factor library. If they are both loaded, then there is special code to make locals prettyprint. But neither one depends on the other.

The old way to handle this situation was to have code that looked like this, as a top-level form in the locals vocab:
"prettyprint" vocab [
"locals.prettyprint" require
] when
But this is problematic. What if locals is loaded first, and then prettyprint? Well, this actually happened, due to a change in some other code. Then locals don't prettyprint correctly! You could fix this by adding top-level statements in both vocabs, as mirrors and specialized arrays did, but then this single joint dependency is duplicated in two places. Maintenance is more difficult than it should be.

To fix this problem, I added a new word, require-when. The code above, as a top-level form in the locals vocab, would be replaced with
"prettyprint" "locals.prettyprint" require-when
The logic is: if the prettyprint vocab is already loaded, then locals.prettyprint will be loaded. Otherwise, the dependency is registered with the module system, so if prettyprint is loaded later by something else, then locals.prettyprint will be as well.

I'm pretty satisfied with this solution to a somewhat long-standing problem in Factor. I wonder how other module systems solve the same problem. I have this implemented in a branch in my repository and it should be incorporated into mainline Factor soon.