July 12, 2016

Can it fork?

It’s a well-known fact that Linux (by default) overcommits memory, meaning that allocating memory via e.g. malloc practically never fails (this behavior is configurable via the vm.overcommit_memory sysctl knob).

This is possible because Linux doesn’t actually allocate any physical memory when a process requests some (e.g. via a mmap system call). Instead, it just records that the allocation happened in a Kernel data structure, and returns an address the process can use to access this memory later on. The first time (if ever) a process attempts to access memory it allocated, the Linux Kernel will attempt to allocate physical memory for it.

A bit on naming (this will matter later in this post): the entirety of the memory that was allocated by a process (regardless of whether it was ever accessed and is therefore backed by physical memory) is called virtual memory.

The upside of this approach is that if a process requests a huge chunk of memory and then never uses it, it won’t consume much physical memory (the downside is that if there isn’t enough physical memory to back all the virtual memory that was allocated, bad things will happen, like swapping and killing processes to free up some memory).
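The gap between allocated and physically backed memory is easy to observe from userspace. Here is a minimal Python sketch, assuming a Linux system with /proc mounted: it maps anonymous memory, then touches it, and takes a VmSize / VmRSS snapshot from /proc/self/status at each step. The helper names (parse_status_kb, read_self_status, demo) are made up for this illustration.

```python
import mmap


def parse_status_kb(text):
    """Extract VmSize and VmRSS (in kB) from the content of /proc/PID/status."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            values[key] = int(rest.split()[0])
    return values


def read_self_status():
    with open("/proc/self/status") as f:
        return parse_status_kb(f.read())


def demo(size=64 * 1024 * 1024):
    """Return three snapshots: before mapping, after mapping, after touching."""
    before = read_self_status()
    # Private anonymous mapping: at this point the kernel only records it.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    mapped = read_self_status()
    # Writing one byte per page forces the kernel to back each page.
    for offset in range(0, size, mmap.PAGESIZE):
        region[offset] = 1
    touched = read_self_status()
    region.close()
    return before, mapped, touched
```

Calling demo() and comparing the three snapshots should show VmSize growing by roughly 64 MB as soon as the mapping is created, while VmRSS only catches up once the pages have been written to.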

As a consequence, virtual memory is usually considered not to matter (much). That’s why the free program doesn’t take it into account when showing the system’s free memory, and it’s also why we know to look at the RSS (which more or less represents the physical memory used by a process) column in top or ps, and not at the VSZ one (which is the amount of virtual memory a process has allocated).

It’s also why top uses RSS to compute the %MEM column, not VSZ.

There is, however, one case where virtual memory does in fact matter quite a lot: fork. Indeed, if a process has allocated a large amount of virtual memory, then it might fail to fork, regardless of whether this memory is backed by physical memory!

We ran into this issue at Aptible recently where a Docker daemon was unable to fork for no obvious reason (we’re not the first ones to run into this). The logs would fill up with fork/exec /usr/sbin/iptables: cannot allocate memory, despite there being several GBs of free memory on the system (which you’d think would be more than enough!).

This blog post is a deep-(ish?) dive into how the Linux Kernel manages memory, and seeks to explain why you might run into this issue. Note that it’s only tangentially related to Docker: other programs run into this issue just as often (Redis is a good example).

Memory layout primer

On Linux, a process’ memory space is divided into virtual memory areas (VMAs). You can see the memory areas for a process via the proc filesystem in /proc/PID/smaps, or via the pmap program (which parses those files for you).
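If you would rather read smaps yourself than go through pmap, the format is simple enough to parse by hand. The sketch below is one possible way to do it in Python; parse_smaps is a hypothetical helper, and it only keeps the address range, permissions, name, Size, and Rss of each VMA.

```python
import re

# Header of each VMA: "start-end perms offset dev inode [path]"
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) (\S+) \S+ \S+ \S+\s*(.*)$")


def parse_smaps(text):
    """Return one dict per VMA with its range, permissions, name, and sizes in kB."""
    vmas = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            start, end, perms, name = match.groups()
            vmas.append({
                "start": int(start, 16),
                "end": int(end, 16),
                "perms": perms,
                "name": name or "[anon]",  # anonymous VMAs have no path
            })
        elif vmas and line.startswith(("Size:", "Rss:")):
            key, value = line.split(":")
            vmas[-1][key.lower() + "_kb"] = int(value.split()[0])
    return vmas
```

Feeding it the content of /proc/self/smaps (or any /proc/PID/smaps you are allowed to read) gives a list you can sort by size_kb to spot the largest areas.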

Virtual memory areas can be created in a variety of ways. Here are a few examples:

When you launch a new program (via exec), Linux loads the program’s binary into memory. It also injects the Kernel’s VDSO, etc. These will be represented as VMAs.
When you memory-map (mmap) a file, Linux creates a VMA to represent that file.
When you allocate anonymous memory (often via the mmap, brk and sbrk syscalls, or at a higher level via the malloc function), Linux creates a new VMA for the allocation (or expands an existing one; this detail is important).

Each VMA has a set of flags. Among those, one flag is of particular interest here: VM_ACCOUNT. It indicates whether a VMA is for memory that should be charged to the process (more on that later). What you need to know here is that this flag will practically always be set for anonymous memory (the third group above).
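On reasonably recent kernels, these flags are visible from userspace: each entry in /proc/PID/smaps ends with a VmFlags line made of two-letter codes, where "ac" stands for VM_ACCOUNT. The snippet below is a small sketch that decodes such a line; the table only covers a subset of the codes, and both function names are hypothetical.

```python
# Subset of the two-letter codes found on the "VmFlags:" line of /proc/PID/smaps.
VM_FLAG_NAMES = {
    "rd": "readable",
    "wr": "writable",
    "ex": "executable",
    "sh": "shared",
    "dc": "not copied on fork (VM_DONTCOPY)",
    "ac": "accountable (VM_ACCOUNT)",
    "nr": "no swap reserved (VM_NORESERVE)",
}


def describe_vm_flags(line):
    """Translate a VmFlags line; codes missing from the table are kept as-is."""
    codes = line.split(":", 1)[1].split()
    return [VM_FLAG_NAMES.get(code, code) for code in codes]


def is_accountable(line):
    """True if the VMA described by this VmFlags line has VM_ACCOUNT set."""
    return "ac" in line.split(":", 1)[1].split()
```

For instance, a writable private anonymous mapping will typically show up with a line like "VmFlags: rd wr mr mw me ac", for which is_accountable returns True.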

Forking is allocating

When a process forks, Linux will go through each of its VMAs to make shallow copies of them: the child’s VMAs will point to the same physical memory as the parent’s VMAs, but if the child attempts to write to them, then a copy will be made on the fly (this is casually referred to as “fork’s copy-on-write semantics on Linux”).

There are some exceptions here (e.g. the VM_DONTCOPY flag), but they’re not particularly relevant to our discussion.

But, there’s a catch: if a VMA has the VM_ACCOUNT flag set, then Linux first checks whether it can satisfy an allocation that big. If it can’t, then forking fails with an out of memory error (exactly the error we saw above from Docker).

In practice, this check is done by the exact same code that checks for whether an allocation would be allowed if the process had used e.g. malloc to make a regular memory allocation.

Now, you’d normally expect this check to pass, since as we mentioned earlier, Linux overcommits memory and allocations practically never fail. However, as mentioned in the Kernel overcommit documentation, overcommitting is heuristic, and Linux will sometimes reject an allocation if it’s too unreasonable (in practice, if you try to allocate more than the system’s free memory, the Kernel will reject the allocation, although that’s a gross over-simplification):

Heuristic overcommit handling. Obvious overcommits of address space are
refused. Used for a typical system. It ensures a seriously wild allocation
fails while allowing overcommit to reduce swap usage.

In other words, if your process has allocated too much virtual memory (i.e. has a large VSZ), then it might be unable to fork despite there being more free memory on the system than the process actually needs right now (i.e. RSS).
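To get a feel for the numbers involved, here is a rough userspace model of that simplification, built on /proc/meminfo. It is only an approximation written for this post's explanation, not the kernel's actual code: it ignores the kernel's reserves, the exact set of counters is an assumption, and newer kernels may apply a different (and more lenient) rule. All function names are hypothetical.

```python
def parse_meminfo_kb(text):
    """Parse the content of /proc/meminfo into a {name: value} dict."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key] = int(fields[0])
    return info


def estimate_headroom_kb(info):
    """Rough stand-in for what the heuristic treats as available memory."""
    return (
        info["MemFree"]
        + info["Buffers"]
        + info["Cached"]
        - info.get("Shmem", 0)         # shared memory cannot simply be dropped
        + info.get("SReclaimable", 0)  # reclaimable slab
        + info["SwapFree"]
    )


def heuristic_would_allow(request_kb, info):
    """Model: a single request passes only if it fits in the estimated headroom."""
    return request_kb < estimate_headroom_kb(info)
```

With this model, a VMA of 15 GB on a machine whose estimated headroom is 3 GB is refused, which is the situation a fork can run into even when the process' RSS is tiny.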

Not that simple!

Now, you might ask: since VMAs are created via calls to e.g. mmap, then how can a given VMA be considered entirely reasonable to mmap, and then seriously wild when we try to fork? Of course, the “wild-ness” of an allocation depends on available memory, and the available memory on the system might have changed over time, but in practice I’ve seen cases where a process becomes suddenly unable to fork without substantial changes in available system memory.

Conversely, I’ve seen cases where Docker allocated 32 GB (!) of virtual memory on a system that only had ~12 GB of free RAM, yet was still able to fork just fine!

Obviously, the answer to “can I fork?” isn’t the simple equation “VSZ < free memory?”. So, what’s going on?

VMA merging

The answer lies in VMA merging: to satisfy an allocation, Linux will prefer to resize an existing VMA rather than create a new one. Whether or not merging is possible depends on the process’ memory layout and VMA flags, but in practice it’s often possible to resize when satisfying an allocation for anonymous memory.

In other words, a process can make a number of small “reasonable” allocations, have them all merged into the same VMA, and end up with a VMA that is “unreasonably” big and leaves the process unable to fork.

Conversely, your process can make the exact same number of “reasonable” allocations, have them end up in different VMAs, and then still be allowed to fork (because VMAs are checked one by one, in isolation), despite the fact that its virtual memory usage is exactly the same!
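The difference between the two scenarios can be captured in a toy model. The sketch below does not talk to the kernel at all: it just applies a per-VMA size check, with a hypothetical 12 GB threshold standing in for whatever the heuristic would accept, to the same 32 GB of allocations laid out in two different ways.

```python
def fork_allowed(vma_sizes_gb, limit_gb):
    """Model of the per-VMA check: each accountable VMA is tested on its own."""
    return all(size <= limit_gb for size in vma_sizes_gb)


allocations = [0.125] * 256            # 256 allocations of 128 MB = 32 GB
merged = [sum(allocations)]            # everything ended up in a single VMA
scattered = allocations                # every allocation kept its own VMA
limit = 12                             # hypothetical threshold, in GB

print(fork_allowed(merged, limit))     # False: the one 32 GB VMA is too big
print(fork_allowed(scattered, limit))  # True: no single VMA exceeds the limit
```

Both layouts add up to exactly the same virtual memory size; only the way it is split across VMAs changes the outcome.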

Testing interlude

First, let’s confirm that this is actually what’s happening. I’ve set up a test program you can use to reproduce what I described above. You can find it on Gist. You can compile it with gcc test.c to follow along.

Before running this, make sure your system is running in the default (heuristic) overcommit mode, i.e. vm.overcommit_memory is set to 0; otherwise you’ll be exercising your overcommit configuration instead.

First, if we allocate a bunch of VMAs, and let the kernel merge them, we’ll fail to fork pretty quickly:

$ ./a.out
[ERROR (14503)] fork failed: Cannot allocate memory
[ERROR (14503)] fork failed after 121 iterations

Here, we failed after allocating about 15 GB of virtual memory in a single VMA (we can add a sleep at the end of main to take a peek), which incidentally was the free memory on my system when I ran the test (your results will probably differ depending on how much memory you have).
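If you want to experiment without a C compiler, the same idea can be sketched in Python. This is not the test program from the Gist, just a loose approximation of its default mode: fork_until_failure is a hypothetical name, and whether the chunks end up merged into a single VMA is left to the kernel, so results may well differ from the C program's.

```python
import mmap
import os


def fork_until_failure(chunk_size, max_iterations):
    """Map one more chunk per iteration, then try to fork.

    Returns the number of forks that succeeded.
    """
    chunks = []  # keep references so the mappings stay alive
    for iteration in range(max_iterations):
        # Allocated but never touched, so RSS barely moves.
        chunks.append(
            mmap.mmap(-1, chunk_size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
        )
        try:
            pid = os.fork()
        except OSError as error:
            print(f"fork failed after {iteration} iterations: {error}")
            return iteration
        if pid == 0:
            os._exit(0)     # child: exit immediately, skipping cleanup
        os.waitpid(pid, 0)  # parent: reap the child before continuing
    return max_iterations
```

Calling fork_until_failure(128 * 1024 * 1024, 1024) mirrors the chunk size and iteration count used in this post. You can check /proc/PID/smaps while it runs to see whether the kernel merged the chunks.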

Now, if we ensure the Kernel does not merge VMAs (by forcing it to allocate a new VMA 128 MB away from the others), we can go through all 1024 iterations (i.e. allocate a whopping 128 GB of virtual memory), and yet still be allowed to fork! (on the other hand, scattering our memory allocation like this might be a bad idea from a performance standpoint)

$ ALLOC_SCATTER=1 ./a.out
[INFO  (14770)] all 1024 fork iterations succeeded

Finally, if we flag our VMA with MAP_NORESERVE (which will result in not setting the VM_ACCOUNT flag on our memory maps), then we’ll succeed as well:

$ ALLOC_NORESERVE=1 ./a.out
[INFO  (15846)] all 1024 fork iterations succeeded
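Requesting MAP_NORESERVE looks like this in Python. Treat it as a sketch under assumptions: map_anonymous is a hypothetical helper, the mmap.MAP_NORESERVE constant only exists in recent Python versions, and the fallback value 0x4000 is the one used on common Linux architectures such as x86-64 (it differs on a few others).

```python
import mmap

# Assumption: fall back to the usual Linux value when the constant is missing.
MAP_NORESERVE = getattr(mmap, "MAP_NORESERVE", 0x4000)


def map_anonymous(size, reserve=True):
    """Create a private anonymous mapping, optionally with MAP_NORESERVE."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    if not reserve:
        flags |= MAP_NORESERVE
    return mmap.mmap(-1, size, flags=flags)
```

In the default overcommit mode, a mapping created with reserve=False should show "nr" rather than "ac" on its VmFlags line in /proc/self/smaps, which is what keeps it out of the check performed at fork time.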

What can we do?

As developers

As developers, the alternative vfork system call can sometimes be used (and allows you to fork regardless of your virtual memory size; try changing the test program above to use vfork to see for yourself), but it has strict restrictions on what you’re allowed to do, and blocks the calling process until the child exits or execs something new.

A good example of a program for which vfork would be completely unsuitable is Redis, which I mentioned earlier and which uses fork to run a background save.
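For the common "fork, then immediately exec" case (like the iptables call from the Docker logs above), higher-level languages often offer a spawn primitive instead of a raw vfork. Python, for example, has no direct vfork wrapper, but it exposes os.posix_spawn. The sketch below assumes a C library whose posix_spawn is implemented with a vfork-like mechanism, which is typically the case with glibc on Linux but is not guaranteed; spawn_and_wait is a hypothetical helper.

```python
import os
import sys


def spawn_and_wait(argv):
    """Run argv[0] via posix_spawn (no fork + exec pair) and return its exit code."""
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    # Spawn a trivial child: another Python interpreter that exits right away.
    print(spawn_and_wait([sys.executable, "-c", "pass"]))
```

This only helps when the child is going to exec a new program; it is no substitute for a fork whose child keeps running the parent's code, as in the Redis case.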

As users

As users, when we run into this condition, the best option will typically be to take note and restart the offending process that cannot fork anymore. But of course, we’d rather not restart a process like Docker unless we absolutely have to! And, as I explained above, you can totally find yourself in a situation where a process has a virtual size of 32 GB but can still fork just fine!

So, ideally, we’d want to restart the process if we know it’s going to become unable to fork soon. We can do this by looking at the size of the biggest VMA with the VM_ACCOUNT flag set (via pmap or directly via /proc/PID/smaps), and attempting to allocate a similarly-sized chunk of memory (but don’t try touching it: you don’t want Linux to allocate physical memory for it!).
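The procedure can be sketched in a few lines of Python. To be clear, this is an illustration of the idea and not the canfork script itself: the function names are hypothetical, it relies on the VmFlags line being present in smaps, and a successful probe is only a hint, since available memory can change at any time.

```python
import mmap
import re

HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+ ")


def largest_accountable_kb(smaps_text):
    """Size in kB of the biggest VMA whose VmFlags include "ac" (VM_ACCOUNT)."""
    largest = size_kb = 0
    for line in smaps_text.splitlines():
        if HEADER.match(line):
            size_kb = 0  # a new VMA starts here
        elif line.startswith("Size:"):
            size_kb = int(line.split()[1])
        elif line.startswith("VmFlags:") and "ac" in line.split()[1:]:
            largest = max(largest, size_kb)
    return largest


def can_allocate(size_bytes):
    """Try a private anonymous mapping of that size, without ever touching it."""
    try:
        mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS).close()
    except (OSError, OverflowError, ValueError):
        return False
    return True


def probably_can_fork(pid):
    """Probe with an allocation as large as the process' biggest accountable VMA."""
    with open(f"/proc/{pid}/smaps") as f:
        size_kb = largest_accountable_kb(f.read())
    return size_kb == 0 or can_allocate(size_kb * 1024)
```

Reading another process' smaps usually requires running as the same user or as root, so a check like probably_can_fork(pid) is best run from wherever you already monitor the daemon.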

(I wrote a small Python script that does this: canfork, try it out if you’d like!)
