Wednesday, May 18, 2011

Why not mmap?

mmap() is a beautiful interface. Rather than accessing files through a series of read and write operations, mmap() lets you virtually load the whole file into a big array and access whatever part you want just like you would with other RAM. (It lets you do other things, too—in particular, it's the basis of memory allocation. See the man page for details.) In this article, I'll be discussing mmap() on Linux, as it works on virtual-memory architectures like x86.

mmap() doesn't actually load the whole file in when you call it. Instead, it loads nothing but file metadata. In the page table, all of the mapped pages are marked so that any read or write to them triggers a page fault. The page fault handler reads the page from disk into main memory and updates the page table so that later accesses to that page don't fault. In this way, the file is lazily read into memory. The file is written back out through the same writeback mechanism used for the page cache in buffered I/O: after some time or under some memory pressure, the contents of memory are automatically synchronized with the disk.
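
Concretely, a minimal sketch of the interface might look like this (the file name is made up and error handling is abbreviated):

    /* Minimal sketch: map a file and touch it. The file name is made up
       and error handling is abbreviated. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.db", O_RDWR);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

        /* The whole file now looks like one big array of bytes. */
        char *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        base[0] ^= 1;   /* first access page-faults; the kernel reads the page
                           in, and writeback flushes the change to disk later */

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }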

mmap() is a system call, implemented by the kernel. Why? As far as I can tell, what I described above could be implemented in user-space: user-space has page fault handlers and file read/write operations. But the kernel allows several other advantages:
  • If a file is being manipulated through mmap() and by other means (say, ordinary read() and write() calls) at the same time, the kernel can keep the two views in sync
  • The kernel can do it faster, with specialized implementations for different file systems and fewer context switches between kernel-space and user-space
  • The kernel can do a better job, using its internal statistics to determine when to write back to disk and when to prefetch extra pages of the file

One situation where mmap() looks useful is databases. What could be easier for a database implementor than an array of "memory" that's transparently persisted to disk? Database authors often think they know better than the OS, so they like to have explicit control over caching policy. And various file and memory operations give you this, in conjunction with mmap():
  • mlock() lets you force a range of pages to be held in physical memory, and munlock() lets you release them. When no swap is configured on the server, locking part of a mapping is basically equivalent to pinning that part of the file in a user-space cache.

    Memory locking can be dangerous in an environment with many processes running, because the out-of-memory killer (OOM killer) might kill some other process as a result of your profligate use of memory. However, the use of cgroups or virtualization can mitigate this possibility and provide isolation.
  • madvise() and posix_fadvise() let you give the OS hints about how to behave with respect to the file. These can be used to encourage things to be pulled into memory or pushed out. MADV_DONTNEED is a quick way to discard a range of pages (anonymous pages read back as zeros afterwards), and it could be translated into TRIM on SSDs.
  • fdatasync() lets a process force some data onto the disk right now, rather than trusting writeback to get it there eventually. This is useful for implementing durable transactions. (A sketch combining these calls follows this list.)
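
Here's a rough sketch of how those calls might combine around a mapped file (the offsets, lengths, and region choices are arbitrary, and error handling is omitted):

    /* Rough sketch: explicit cache control over a file that has been
       mmap()ed at 'base'. The offsets and lengths are arbitrary examples. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void cache_control_example(int fd, char *base, size_t map_len)
    {
        long page = sysconf(_SC_PAGESIZE);

        /* Pin a hot region of the file in physical memory... */
        mlock(base, 16 * page);

        /* ...hint that a cold region can be dropped... */
        madvise(base + 64 * page, 16 * page, MADV_DONTNEED);

        /* ...ask the kernel to prefetch a region we are about to scan... */
        posix_fadvise(fd, 128 * page, 32 * page, POSIX_FADV_WILLNEED);

        /* ...and, for a commit, push dirty pages out and wait for the disk. */
        msync(base, map_len, MS_SYNC);
        fdatasync(fd);

        munlock(base, 16 * page);
    }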

Great! And in Linux, you can open a raw block device just by opening a file like /dev/hda1 and using mmap() on it directly, so this gives database implementors a way to control the whole disk through the same interface. This is great if you're a typical database developer who doesn't like the OS and doesn't trust the file system.
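
For instance, a sketch of mapping a block device might look like this (the device path is just an example, you'd normally need root, and error handling is mostly omitted); note that the device's size comes from an ioctl rather than fstat():

    /* Sketch: map a raw block device. The device path is illustrative and
       root privileges are normally required. */
    #include <fcntl.h>
    #include <linux/fs.h>      /* BLKGETSIZE64 */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    void *map_block_device(const char *path, uint64_t *size_out)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return MAP_FAILED;

        uint64_t size = 0;
        ioctl(fd, BLKGETSIZE64, &size);   /* block devices have no st_size */

        *size_out = size;
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

    /* Usage: uint64_t sz; void *disk = map_block_device("/dev/hda1", &sz); */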

So this sounds like a nice, clean way to write a database or something else that does serious file manipulation. Some databases use this, for example MongoDB. But the more advanced database implementations tend to open the database file in O_DIRECT mode and implement their own caching system in user-space. Whereas mmap() lets you use the hardware (on x86) page tables for the indirection between the logical address of the data and where it's stored in physical memory, these databases force you to go through an extra indirection in their own data structures. And these databases have to implement their own caches, even though the resulting caches often aren't smarter than the default OS cache. (The logic that makes the caching smarter is often encoded in an application-specific prefetcher, which can be done pretty clearly through memory mapping.)
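
For contrast, a bare-bones sketch of the O_DIRECT style might look like this (the 4096-byte alignment is an assumption; the real requirement depends on the device and file system):

    /* Sketch of the O_DIRECT style: the application supplies its own
       aligned buffers and does its own caching. The 4096-byte alignment
       is an assumption; the real requirement depends on the device. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGN 4096

    int read_block(const char *path, off_t offset, void **buf_out)
    {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return -1;

        void *buf;
        if (posix_memalign(&buf, ALIGN, ALIGN) != 0) { close(fd); return -1; }

        /* Offset, length, and buffer must all be suitably aligned. */
        ssize_t n = pread(fd, buf, ALIGN, offset);

        close(fd);
        *buf_out = buf;        /* the caller's own cache now owns this block */
        return (int)n;
    }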

A problem with mmap()

High-performance databases often get lots of requests. So many requests that, if they were to spawn a thread for each one of them, the overhead of a kernel task per request would slow them down (where task means 'thread or process', in Linux terminology). There's a bit of overhead for threads:
  • Each thread must have its own stack, which takes up memory (though this is mitigated by the lazy allocation of physical memory to back stacks, which is done by default)
  • Some Linux CPU schedulers use a lot of CPU themselves, so blocking and then getting resumed carries a certain amount of overhead. In particular, the scheduler does extra work to be completely fair and to load-balance between cores.

To solve these issues, database implementors often respond to each request with a user-level coroutine, or even with an explicitly managed piece of state sent around through various callbacks.

Let's say we have a coroutine responding to a database request, and this coroutine wants to read from the database in a location that is currently stored on disk. If it accesses the big array, it will cause a page fault leading to a disk read. This will make the current task block until the disk read can be completed. But we don't want the whole task to block—we just want to switch to another coroutine when we have to wait, and we want to execute that coroutine from the same task.

The typical way around this problem is using asynchronous or non-blocking operations. For non-blocking I/O, there's epoll, which works for some kinds of files. For direct I/O on disk, Linux provides a different interface called asynchronous I/O, with system calls like io_submit. These two mechanisms can be hooked up with an eventfd, which is signaled whenever there are AIO results, using the barely documented libaio helper io_set_eventfd() (it flags the request so that completions notify the eventfd). The basic idea is that you set up a bunch of requests in an object, and then you have a main loop, driven by epoll, where you repeatedly ask for the next available event. The coroutine scheduler resumes the coroutine that had the event complete on it, and executes that coroutine until it blocks again. The details of using this mechanism are a bit obscure, but not very deep or complicated.
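
Here's a condensed sketch of that pattern using libaio (link with -laio); fd is assumed to be open with O_DIRECT, and most error handling is omitted:

    /* Condensed sketch of the epoll + AIO + eventfd pattern using libaio
       (link with -laio). fd is assumed to be open with O_DIRECT; most
       error handling is omitted. */
    #include <libaio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    void aio_event_loop(int fd)
    {
        io_context_t ctx = 0;
        io_setup(128, &ctx);                     /* kernel AIO context */

        int efd = eventfd(0, EFD_NONBLOCK);      /* completions are signaled here */
        int ep  = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

        /* Submit one read request tied to the eventfd. */
        void *buf;
        posix_memalign(&buf, 4096, 4096);
        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);
        io_set_eventfd(&cb, efd);                /* libaio helper: make this
                                                    request notify efd when done */
        io_submit(ctx, 1, cbs);

        /* Main loop: wait on epoll, then reap completed AIO events and
           resume whichever coroutine was waiting on each one. */
        for (;;) {
            struct epoll_event out;
            if (epoll_wait(ep, &out, 1, -1) <= 0)
                continue;

            uint64_t ndone;
            read(efd, &ndone, sizeof ndone);     /* how many completions arrived */

            struct io_event events[64];
            int n = io_getevents(ctx, 1, 64, events, NULL);
            for (int i = 0; i < n; i++) {
                /* events[i] identifies the finished request; a real server
                   would resume the coroutine that issued it here. */
            }
            break;                               /* sketch: one batch, then stop */
        }

        io_destroy(ctx);
        close(ep); close(efd);
    }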

A proposed solution

What the mmap() interface is missing is a non-blocking way to access memory. Maybe this would take the form of a call based around mlock, like
    int mlock_eventfd(const void *addr, ssize_t len, int eventfd);

which would trigger the eventfd once the len bytes starting at addr were locked in memory. The eventfd could be placed in an epoll loop, and the memory requested would then be dereferenced for real once it was locked. A similar mechanism would be useful for fdatasync.

We could implement mlock_eventfd in user-space using a thread pool, and the same goes for fdatasync. But this would probably eliminate the performance advantages of using coroutines in the first place, since accessing the disk is pretty frequent in databases.
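
For illustration only, such a user-space stand-in might look something like this (a single detached helper thread rather than a real pool; compile with -pthread):

    /* Illustrative user-space stand-in for mlock_eventfd: a helper thread
       performs the blocking mlock() and then signals an eventfd that the
       main epoll loop is watching. A real implementation would use a pool. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/eventfd.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct lock_req { const void *addr; size_t len; int efd; };

    static void *lock_worker(void *arg)
    {
        struct lock_req *req = arg;
        mlock(req->addr, req->len);          /* may fault pages in and block */

        uint64_t one = 1;
        write(req->efd, &one, sizeof one);   /* wake the epoll loop */
        free(req);
        return NULL;
    }

    /* Returns an eventfd that becomes readable once [addr, addr+len) is locked. */
    int mlock_eventfd_userspace(const void *addr, size_t len)
    {
        int efd = eventfd(0, 0);
        struct lock_req *req = malloc(sizeof *req);
        req->addr = addr; req->len = len; req->efd = efd;

        pthread_t t;
        pthread_create(&t, NULL, lock_worker, req);
        pthread_detach(t);
        return efd;
    }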

As databases and the devices that underlie them grow more complex, it becomes difficult to manage this complexity. The operating system provides a useful layer of indirection between the database and the drive, but old and messy interfaces make the use of the OS more difficult. Clean, high-level OS interfaces which let applications take full advantage of the hardware and kernel-internal mechanisms and statistics would be a great boon to further database development, allowing the explosion of new databases and solid-state drives to be fully exploited.