Wednesday, May 18, 2011

Why not mmap?

mmap() is a beautiful interface. Rather than accessing files through a series of read and write operations, mmap() lets you virtually load the whole file into a big array and access whatever part you want just like you would with other RAM. (It lets you do other things, too—in particular, it's the basis of memory allocation. See the man page for details.) In this article, I'll be discussing mmap() on Linux, as it works on virtual-memory architectures like x86.

mmap() doesn't actually load the whole file in when you call it. Instead, it loads nothing but file metadata. In the page table, all of the mapped pages are marked so that a read or write to them triggers a page fault. The page fault handler loads the page into main memory and updates the page table so that page won't fault again. In this way, the file is lazily read into memory. The file is written back out through the same writeback mechanism used for the page cache in buffered I/O: after some time or under some memory pressure, the contents of memory are automatically synchronized with the disk.
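
To make this concrete, here's a minimal sketch of treating a file as a big array; the file name is made up and error handling is mostly omitted:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* Nothing is read from disk yet; the kernel only records the mapping. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED)
            return 1;

        /* This access faults, and the handler pulls the page in from disk. */
        putchar(data[0]);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }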

mmap() is a system call, implemented by the kernel. Why? As far as I can tell, what I described above could be implemented in user-space: user-space has page fault handlers and file read/write operations. But implementing it in the kernel brings several advantages:
  • If a file is being manipulated by mmap() as well as something else at the same time, the kernel can keep these in sync
  • The kernel can do it faster, with specialized implementations for different file systems and fewer context switches between kernel-space and user-space
  • The kernel can do a better job, using its internal statistics to determine when to write back to disk and when to prefetch extra pages of the file

One situation where mmap() looks useful is databases. What could be easier for a database implementor than an array of "memory" that's transparently persisted to disk? Database authors often think they know better than the OS, so they like to have explicit control over caching policy. And various file and memory operations give you this, in conjunction with mmap() (a short sketch follows the list):
  • mlock() lets you force a range of pages to be held in physical memory, and munlock() lets you release them. Memory locking here is basically equivalent to pinning part of the file in your user-space cache (at least when no swap is configured on the server).

    Memory locking can be dangerous in an environment with many processes running, because the out-of-memory killer (OOM killer) might kill some other process as a result of your profligate use of memory. However, the use of cgroups or virtualization can mitigate this possibility and provide isolation.
  • madvise() and posix_fadvise() let you give the OS hints about how to behave with respect to the file. These can be used to encourage things to be pulled into memory or pushed out. MADV_DONTNEED is a quick call to zero a series of pages completely, and it could be translated into TRIM on SSDs.
  • fdatasync() lets a process force some data onto the disk right now, rather than trusting writeback to get it there eventually. This is useful for implementing durable transactions.
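
Here's a rough sketch of how a database might combine these calls on a mapped file. Take it as an illustration under my own assumptions: the helper name and region boundaries are hypothetical, and the offsets are assumed to be page-aligned.

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical helper: pin the hot region of a mapped database file in
     * RAM and ask for readahead on the rest. Offsets and lengths are assumed
     * to be page-aligned. */
    void tune_cache(char *map, size_t map_len, size_t hot_off, size_t hot_len)
    {
        /* "This part of the file stays in my cache": lock the hot pages. */
        mlock(map + hot_off, hot_len);

        /* Hint that the rest will be wanted soon; the kernel is free to
         * ignore this. */
        madvise(map, map_len, MADV_WILLNEED);
    }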

Great! And in Linux, you can open up a raw block device just by opening a device file like /dev/hda1 and calling mmap() straight on it, so this gives database implementors a way to control the whole disk with the same interface. This is great if you're a typical database developer who doesn't like the OS and doesn't trust the file system.

So this sounds like a nice, clean way to write a database or something else that does serious file manipulation. Some databases use this, for example MongoDB. But the more advanced database implementations tend to open the database file in O_DIRECT mode and implement their own caching system in user-space. Whereas mmap() lets you use the hardware page tables (on x86) for the indirection between the logical address of the data and where it's stored in physical memory, these databases make you go through an extra indirection in their own data structures. And these databases have to implement their own caches, even though the resulting caches often aren't smarter than the default OS cache. (The logic that makes the caching smarter is often encoded in an application-specific prefetcher, which can be done pretty cleanly through memory mapping.)

A problem with mmap()

High-performance databases often get lots of requests. So many requests that, if they were to spawn a thread for each one, the overhead of a kernel task per request would slow them down (where 'task' is Linux terminology for a thread or process). Threads carry a certain amount of overhead:
  • Each thread must have its own stack, which takes up memory (though this is mitigated by the lazy allocation of physical memory to back stacks, which is done by default)
  • Some Linux CPU schedulers use a lot of CPU themselves, so blocking and then getting resumed carries a certain amount of overhead. In particular, overhead is incurred so that the scheduler can be completely fair, and so that it can load-balance between cores.

To solve these issues, database implementors often respond to each request with a user-level coroutine, or even with an explicitly managed piece of state sent around through various callbacks.

Let's say we have a coroutine responding to a database request, and this coroutine wants to read from the database in a location that is currently stored on disk. If it accesses the big array, then it will cause a memory fault leading to a disk read. This will make the current task block until the disk read can be completed. But we don't want the whole task to block—we just want to switch to another coroutine when we have to wait, and we want to execute that coroutine from the same task.

The typical way around this problem is using asynchronous or non-blocking operations. For non-blocking I/O, there's epoll, which works for some kinds of files. For direct I/O on disk, Linux provides a different interface called asynchronous I/O, with system calls like io_submit. These two mechanisms can be hooked up with an eventfd, which is triggered whenever there are AIO results, using the undocumented libaio helper io_set_eventfd. The basic idea is that you set up a bunch of requests in an object, and then you have a main loop, driven by epoll, where you repeatedly ask for the next available event. The coroutine scheduler resumes the coroutine whose event completed and executes it until it blocks again. The details of using this mechanism are a bit obscure, but not very deep or complicated.
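
Here's a rough sketch of that plumbing, using the libaio wrappers. Take it as an illustration, not a reference: the file name is made up, error handling is omitted, and a real program would keep many requests in flight (build with -laio):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile", O_RDONLY | O_DIRECT);
        int efd = eventfd(0, EFD_NONBLOCK);

        io_context_t ctx = 0;
        io_setup(8, &ctx);                    /* kernel AIO context, up to 8 in-flight ops */

        void *buf;
        posix_memalign(&buf, 4096, 4096);     /* O_DIRECT wants aligned buffers */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0); /* read the first 4 KiB */
        io_set_eventfd(&cb, efd);             /* completion will bump the eventfd */
        io_submit(ctx, 1, cbs);

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

        /* The main loop: wait on epoll, then reap whatever AIO completed. */
        epoll_wait(ep, &ev, 1, -1);
        uint64_t ndone;
        read(efd, &ndone, sizeof ndone);      /* eventfd counts completed requests */

        struct io_event events[8];
        int n = io_getevents(ctx, 1, 8, events, NULL);
        /* ...here the coroutine scheduler would resume whoever was waiting
         * on events[0..n-1]... */

        io_destroy(ctx);
        return n < 0;
    }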

A proposed solution

What the mmap() interface is missing is a non-blocking way to access memory. Maybe this would take the form of a call based around mlock, like
    int mlock_eventfd(const void *addr, size_t len, int eventfd);

which would trigger the eventfd once the len bytes starting at addr were locked in memory. The eventfd could be placed in an epoll loop, and the requested memory would be dereferenced for real once it was locked. A similar mechanism would be useful for fdatasync.

We could implement mlock_eventfd in user-space using a thread pool, and the same goes for fdatasync. But this would probably eliminate the performance advantages of using coroutines in the first place, since accessing the disk is pretty frequent in databases. A sketch of the emulation follows.
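
This is just an illustration of the idea, with my hypothetical mlock_eventfd emulated by spawning a helper thread per request (a serious version would use a pool of long-lived worker threads; link with -lpthread):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/eventfd.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct lock_req { const void *addr; size_t len; int efd; };

    static void *lock_worker(void *arg)
    {
        struct lock_req *req = arg;
        mlock(req->addr, req->len);        /* blocks this helper, faulting pages in */
        uint64_t one = 1;
        write(req->efd, &one, sizeof one); /* wake the epoll loop */
        free(req);
        return NULL;
    }

    /* Emulated mlock_eventfd(): returns immediately; the eventfd fires once
     * the pages are resident and locked. */
    int mlock_eventfd(const void *addr, size_t len, int efd)
    {
        struct lock_req *req = malloc(sizeof *req);
        req->addr = addr;
        req->len = len;
        req->efd = efd;
        pthread_t t;
        if (pthread_create(&t, NULL, lock_worker, req) != 0)
            return -1;
        pthread_detach(t);
        return 0;
    }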

As databases and the devices that underlie them grow more complex, it becomes difficult to manage this complexity. The operating system provides a useful layer of indirection between the database and the drive, but old and messy interfaces make the use of the OS more difficult. Clean, high-level OS interfaces which let applications take full advantage of the hardware and kernel-internal mechanisms and statistics would be a great boon to further database development, allowing the explosion of new databases and solid-state drives to be fully exploited.

21 comments:

Kartik Agaram said...

Heh, it's a great contrast between this article and the virtual memory example in http://www2.parc.com/csl/groups/sda/projects/oi/towards-talk/transcript.html. Kiczales would rather the OS allowed us to configure page size and replacement policy. Is Linux more configurable now in these respects?

Unknown said...

Haha, of course it doesn't!

Seriously, that's a great article. If only modern OSes implemented that (though I can't imagine it'd be easy for modern applications to actually come up with their page replacement policy). But Worse is Better has won out, and people aren't taking advantage of virtual memory at all in many domains where it would make sense. This is a relatively simple one.

Bryan said...

By the way, epoll doesn't actually work on plain files, and neither do select or poll.

Unknown said...

Bryan, oh, it doesn't? What are you supposed to use for asynchronous or non-blocking buffered files? I thought that, for reads of buffered files, epoll/select/poll would tell you when the page was in the page cache (though I can see how that could be flawed; it might be pushed out again before you actually read it). I'm not an expert on this topic at all.

Unknown said...

Dan, reads or writes to and from files never return EAGAIN, so you will never have an opportunity to call epoll() there. Non-blocking I/O is for sockets and pipes only.

Anonymous said...

AFAIK rtorrent uses mincore() to see if accesses would have to fault in pages from the disk, and uses madvise() to fault in those pages in advance, before the actual access.

Ken said...

There are other issues. The last time I checked, Linux didn't actually do anything with most of the madvise/fadvise information, so there was little likelihood that the kernel would actually do a better job of managing the page replacement policy.

Also, with mmap(), the kernel can't tell if you are doing write-only operations, except for the case when you are actually appending at the end of the file, which is normally avoided. (You always want to work with pre-allocated files, even for log files, for performance reasons--avoiding having to force a write of file metadata on every data write and fragmented allocations--and to prevent unexpected out-of-disk-space errors.) So the kernel will make you wait to page-in a useless page of data from disk that you're going to completely overwrite. This is most egregious for log files, but happens on a regular basis with data files, too.

I mention log files because Linus had a rant against O_DIRECT, even though it's the best thing for log files.

Writing updated data pages to disk requires careful coordination of in-memory updates, log file writes, and data writes. It's actually more complicated with mmap() than without, although it can be done.

On 32-bit systems, you can easily run out of address space if you try to mmap() everything. Even on 64-bit systems, the actual address space may be only 48 bits, which could be an issue if you're really heavy duty. You can buy 2^48 bytes of (raw) disk for around $30,000.

The people who wrote the Azul garbage-collection system released some Linux extensions to give greater control over memory management. I haven't looked at them in great detail.

The easiest thing you can do now is use FUSE to implement a layer in between the memory-mapped file and the disk.

Jules said...

Just a thought: you could potentially run the whole web application as an OS, either directly on the hardware or in a virtualized environment. Then the app/database has direct access to the virtual memory system and can do all this.

Another advantage is that then you can compress data with your custom compression algorithm when putting pages back on disk. This is something that mmap lacks compared to direct I/O.

Unknown said...

Ken, I agree, mmap doesn't make much sense for write-only workloads in its current form. And Linux ignores some, but definitely not all, madvise/fadvise hints. It's important to know this, but that doesn't mean that they are useless.

I didn't mention in the article, but I should've, that this is really only appropriate for 64-bit systems. People have run into issues with this on MongoDB when running on 32-bit systems. These days, it's pretty typical to have a 64-bit x86 computer running Linux in a server environment, and this is the setup that I've been learning about recently. Thanks for pointing this out, Ken.

Unknown said...

Thanks for the correction, Slava and Bryan!

Anonymous, that's an interesting technique rtorrent is using. I haven't found the code yet, but it seems like they must have to poll mincore() if they want to completely prevent the risk of blocking on memory dereference. Or is there a better way that they can do it?

Colin Percival said...

You can't emulate mmap in userspace: mmap(... MAP_SHARED ...) allows the same page of physical RAM to be mapped R/W into two processes' address spaces, which requires page table operations that only the kernel can perform.

Todd said...

Another thing you're missing here is that the memory APIs don't give you enough control over ordering semantics. You can ask the kernel to msync() a particular area of memory back to disk, but you can't ask it *not* to msync one page before another one.

The same's true for fdatasync/fsync.

Implementing your own IO scheduling over O_DIRECT allows you to do things like "don't push this dirty page until that other dirty page gets flushed, but meanwhile you can flush any pages over there in any order you like".

There are some interesting threads on this topic on LKML, for example:
http://lkml.org/lkml/2007/5/30/535
http://lkml.org/lkml/2008/2/26/41
and others (in fact these threads also discuss the fact that these same issues can happen at the block device/controller layer too!)

-Todd

caf said...

msync() seems more apropos than fdatasync(), since it allows syncing only a range rather than the entire file.

Note that msync() can be asked to be either synchronous or asynchronous.

Foritus said...

However, your proposed async I/O solution is very much doable on Windows.

Unknown said...

Colin: Well, you can't implement MAP_SHARED in user-space starting from absolutely nothing, but with the Unix shm stuff, I think you could hack something together... that said, you would have a very hard time making something reasonably good, which is one reason why MAP_SHARED is part of the mmap system call interface (besides that it's the only way to write to a file with mmap).

Todd, caf: Interesting! I am revealing myself to be a novice here. Thanks for the information.

Foritus: I've heard that the Windows (NT) I/O API is much better, but I haven't read much about it yet. I'll have to get around to that one of these days...

wtanksley said...

I just read a rather interesting article (arxiv) on user-mode paging management: http://arxiv.org/abs/1105.1815. Only tangentially related, but it addresses some of your points with a little data (not a lot).

-Wm

Unknown said...

MongoDB uses memory-mapped files for all its disk I/O. That's one of the reasons it performs well under load.

Anonymous said...

Re: Dan @ May 18, 2011 3:19 PM

I think they do poll with mincore(), see hash_read_ahead / hash_interval / hash_max_tries here.

Anonymous said...

You can have a fetching thread that blockingly mlock()s the memory into place and then writes on a socket/pipe to your event loop.

There, I fixed it. No need to have kernel support. :-)

Unknown said...

Anonymous,

Yes, this is what I meant in the second-to-last paragraph. You may need multiple threads, since mlock blocks. You could use a socket or pipe, but an eventfd is more lightweight, so that's why I recommended that. I suspect the kernel will be able to do this faster, but I have no real evidence.

tsuna said...

Great post, and great comments too. Just a quick correction: MADV_DONTNEED isn't for zeroing out file mappings; it simply hints to the kernel that you're not planning to access this memory anymore, so the kernel can reclaim it and use it for something else (maybe after writing back to disk). So it wouldn't be used to TRIM SSDs. RTFM :)