Author: Luo Daowen's Private Kitchen
http://luodw.cc/2016/08/13/linux-ca
Foreword
During my internship, a talk on OOM got me interested in Linux kernel memory management. The topic is vast, though, and at the time I lacked the background for it. Now that I have built up some understanding of kernel memory, I am writing this article to record and share it.
This article first analyzes the memory layout and allocation of a single process's address space, and then looks at the kernel's memory management from a global perspective.
This article approaches Linux memory management from the following angles:
- Process memory request and allocation;
- OOM after memory exhaustion;
- Where the requested memory lives;
- How the system reclaims memory.
1. Process memory request and allocation
A previous article covered how a hello world program is loaded into memory and how it requests memory; let me recap. First, here is the process address space, a picture I think every developer should commit to memory, along with a chart of access times for disk, memory, and CPU cache.
When we start a program from the terminal, the shell process calls exec to load the executable into memory. At this point the code segment, data segment, bss segment, and stack are all mapped into the process address space via mmap; the heap is only mapped if the program actually allocates memory on it.
After exec, the process does not start running its own code right away: control is first handed to the dynamic linker, which loads the shared libraries the process needs into memory, and only then does execution of the program begin. This whole sequence can be observed by tracing the system calls the process makes with strace.
The strace output of a simple program shows that the sequence is consistent with what I described above.
When malloc is called for the first time to request memory, it traps into the kernel through the brk system call. The kernel first checks whether a vma for the heap already exists; if not, it maps an anonymous memory region for the heap via mmap and creates a vma structure, which is then inserted into the red-black tree and linked list on the mm_struct descriptor.
Control then returns to user space, where the memory allocator (ptmalloc, tcmalloc, jemalloc, etc.) manages the allocated region and hands the requested memory back to the user.
If user space requests a large block of memory, the allocator calls mmap directly. In either case, what is returned to user space is still only virtual memory; physical memory is not allocated until the returned memory is accessed for the first time.
In fact, brk also returns only virtual memory, but after the memory allocator carves it up and hands pieces out (the carving itself has to touch the memory), those pieces all end up backed by physical memory.
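To make this concrete, here is a minimal sketch (not the original program from this post) of the two paths glibc's ptmalloc typically takes: small requests are carved out of the heap grown with brk, while requests above the mmap threshold (128 KB by default in glibc) go straight to mmap. Running it under strace -e trace=brk,mmap,munmap makes the difference visible.

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *small = malloc(64);            /* typically carved out of the brk heap */
    char *large = malloc(1024 * 1024);   /* above the mmap threshold: served by an anonymous mmap */

    /* Physical pages are only allocated when the memory is first touched. */
    memset(small, 0, 64);
    memset(large, 0, 1024 * 1024);

    free(large);   /* mmap-backed chunk: handed straight back to the kernel via munmap */
    free(small);   /* brk-backed chunk: kept by the allocator for reuse */
    return 0;
}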
When the process releases memory by calling free in user space, memory that was allocated with mmap is returned directly to the system with munmap.
Otherwise, the memory is first handed back to the memory allocator, which later returns it to the system in bulk. This is why accessing memory again after freeing it may not produce an error.
Of course, when the entire process exits, the memory occupied by the process will be returned to the system.
2. OOM after memory exhaustion
During my internship, the MySQL instance on a test machine was often killed by the OOM killer. OOM (out of memory) is the system's self-rescue measure when memory is exhausted: it selects a process, kills it, and frees its memory. Intuitively, the process using the most memory should be the one most likely to be killed, but is that really the case?
When I got to work this morning, I happened to run into an OOM, and suddenly realized that once an OOM happens, the world goes quiet, haha: the redis on the test machine had been killed.
The key file for OOM is oom_kill.c, which implements how the system selects the process that most deserves to be killed when memory runs out. Quite a few factors go into the choice: besides the memory the process occupies, its running time, its priority, whether it is a root process, the number of child processes and the memory they occupy, and the user-controlled parameter oom_adj all play a part.
When an OOM is triggered, the function select_bad_process walks through all processes, computes an oom_score for each from the factors above, and picks the one with the highest score to kill.
We can influence which process the system chooses to kill by adjusting the score through /proc/<pid>/oom_adj.
This is the kernel's definition of the oom_adj adjustment value: the maximum is 15 and the minimum is -16, while -17 is like buying a VIP membership, as the process will never be killed by the OOM killer. So if a machine runs many services and you do not want yours to be killed, you can set its oom_adj to -17.
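As a hedged illustration (not from the original post), a process can opt itself out of the OOM killer by writing -17 to its own oom_adj file. Lowering the value requires sufficient privileges, and newer kernels prefer /proc/<pid>/oom_score_adj (range -1000 to 1000):

#include <stdio.h>

int main(void)
{
    /* -17 means "never select this process"; lowering the value needs root
       (CAP_SYS_RESOURCE). On newer kernels, writing -1000 to
       /proc/self/oom_score_adj achieves the same effect. */
    FILE *f = fopen("/proc/self/oom_adj", "w");
    if (!f) {
        perror("fopen /proc/self/oom_adj");
        return 1;
    }
    fprintf(f, "-17\n");
    fclose(f);
    return 0;
}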
Of course, while on this topic, another parameter, /proc/sys/vm/overcommit_memory, has to be mentioned. man proc describes it as follows:
In short, when overcommit_memory is 0, heuristic overcommit is used: as long as the requested virtual memory is not wildly larger than physical memory, the system allows the allocation; but when the virtual memory a process requests is wildly larger than physical memory, an OOM is produced.
For example, suppose there is only 8 GB of physical memory, and redis occupies 24 GB of virtual memory and 3 GB of physical memory. If bgsave runs at this point, the child process shares physical memory with the parent but has its own virtual memory, so the child will request 24 GB of virtual memory, which is wildly larger than physical memory and triggers an OOM.
When overcommit_memory is 1, overcommitting is always allowed: no matter how much virtual memory you request, the request succeeds, and an OOM is only produced once the system's memory is actually exhausted. In the redis example above, no OOM would occur with overcommit_memory=1, because physical memory is still sufficient.
When overcommit_memory is 2, memory requests can never exceed a fixed limit, namely swap + RAM * coefficient (the coefficient is /proc/sys/vm/overcommit_ratio, 50% by default, and can be tuned). Once that much has been committed, any further attempt to allocate memory returns an error, which usually means no new programs can be started.
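A rough way to observe these policies (a sketch, not from the original post) is to keep reserving large chunks of virtual memory without touching them: under overcommit_memory=1 the requests keep succeeding far beyond physical RAM, while under overcommit_memory=2 they start failing once the swap + RAM * overcommit_ratio limit is reached.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t chunk = 1024UL * 1024 * 1024;   /* 1 GB of virtual memory per request */
    size_t reserved = 0;

    for (int i = 0; i < 1024; i++) {
        void *p = malloc(chunk);   /* reserved but never touched, so no physical pages yet */
        if (p == NULL) {
            printf("allocation failed after reserving %zu GB of virtual memory\n", reserved);
            return 0;
        }
        reserved++;
    }
    printf("reserved %zu GB of untouched virtual memory without failure\n", reserved);
    return 0;
}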
That covers OOM: once you understand the principle, you can configure OOM behavior sensibly for your own applications.
3. Where does the requested memory live?
Now that we understand a process's address space, a natural question is: where does the requested physical memory actually end up? Many people might think, isn't it just physical memory?
I ask where the requested memory lives because physical memory is divided into cache and ordinary physical memory (which you can see with the free command), and it is also split into the DMA, NORMAL, and HIGH zones. Here we mainly look at cache versus ordinary memory.
From the first part we know that almost all of a process's address space is set up with mmap, and mappings come in two kinds: file mappings and anonymous mappings.
3.1 Shared file mapping
Let's first look at the code segment and the dynamic-library mapping segment. Both are shared file mappings, meaning that two processes started from the same executable share these segments, with both mapped onto the same piece of physical memory. So where is this memory? I wrote a program to test it along the following lines:
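In outline, the test program does something like this (a minimal sketch; it assumes the 1 GB file fileblock created below and keeps running so free can be checked from another terminal):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (1024UL * 1024 * 1024)   /* 1 GB */

int main(void)
{
    int fd = open("fileblock", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Shared file mapping, like the code segment and dynamic libraries. */
    char *p = mmap(NULL, SIZE, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch every page so the file is actually read into memory. */
    size_t sum = 0;
    for (size_t i = 0; i < SIZE; i += 4096)
        sum += (unsigned char)p[i];

    printf("mapped and touched 1 GB (checksum %zu); run `free -h` now\n", sum);
    pause();   /* keep the mapping alive while memory usage is inspected */
    return 0;
}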
First I create a 1 GB file locally:
dd if=/dev/zero of=fileblock bs=1M count=1024
Then I run the program above to perform the shared file mapping. The memory usage at this point is:
We can see that buff/cache grew by about 1 GB, so we can conclude that the code segment and the dynamic-library segment live in the kernel's page cache: with a shared file mapping, the file is first read into the cache and then mapped into the user process's address space.
3.2 Private file mapping segment
The data segment in the process space has to be a private file mapping: if it were a shared file mapping, then for two processes started from the same executable, one process modifying its data segment would affect the other. I rewrote the test program above to use a private file mapping:
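Again as a minimal sketch, the only change from the shared version is MAP_PRIVATE plus a write to each page, which is what triggers the copy-on-write described below:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (1024UL * 1024 * 1024)   /* 1 GB */

int main(void)
{
    int fd = open("fileblock", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Private file mapping, like the data segment of an executable. */
    char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Writing each page reads the file into the cache and then copies it
       into a private page (copy-on-write): used and buff/cache both grow. */
    for (size_t i = 0; i < SIZE; i += 4096)
        p[i] = 1;

    puts("wrote to every page; run `free -h` now");
    pause();   /* keep the private copies alive while memory usage is inspected */
    return 0;
}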
Before running it, the cache left over from the previous test needs to be dropped, otherwise it will skew the result:
echo 1 >> /proc/sys/vm/drop_caches
Then execute the program to see the memory usage:
Comparing before and after, both used and buff/cache grew by 1 GB. This shows that with a private file mapping, the file is first mapped into the cache; then, when a process modifies the data, a separate piece of memory is allocated, the file data is copied into it, and the modification is made on that new memory. This is copy-on-write.
This makes sense: if several instances of the same executable are running, the kernel first maps the executable's data segment into the cache, and then each instance that modifies the data segment gets its own block of memory to hold it, since the data segment is, after all, private to each process.
From the analysis above we can conclude that for file mappings, the file is always mapped into the cache, and what happens next depends on whether the mapping is shared or private.
3.3 Private anonymous mapping
The bss segment, heap, and stack are anonymous mappings, because they have no corresponding section in the executable file, and they must be private mappings; otherwise, when the process forks a child, parent and child would share these segments and modifications by one would affect the other, which makes no sense.
OK, now I change the test program above to use a private anonymous mapping:
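A sketch of that change: drop the file descriptor and use MAP_ANONYMOUS together with MAP_PRIVATE, then touch every page:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (1024UL * 1024 * 1024)   /* 1 GB */

int main(void)
{
    /* Private anonymous mapping, like the heap or stack. */
    char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch every page so physical memory is actually allocated. */
    for (size_t i = 0; i < SIZE; i += 4096)
        p[i] = 1;

    puts("touched 1 GB of private anonymous memory; run `free -h` now");
    pause();   /* the memory is freed once the process exits */
    return 0;
}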
Then look at the memory usage
We can see that only used grew by 1 GB while buff/cache did not grow at all, which means a private anonymous mapping does not use the cache. That makes sense: only the current process uses this memory, so there is no need to take up precious cache for it.
3.4 Shared anonymous mapping
When we need to share memory between parent and child processes, we can use a shared anonymous mmap mapping. So where is the memory of a shared anonymous mapping stored? I rewrote the test program once more to use a shared anonymous mapping:
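One more sketch along the same lines: the same program with MAP_SHARED | MAP_ANONYMOUS, with a fork included since sharing between parent and child is the point of this mapping type:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SIZE (1024UL * 1024 * 1024)   /* 1 GB */

int main(void)
{
    /* Shared anonymous mapping: visible to both parent and child after fork. */
    char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {                       /* child touches every page */
        for (size_t i = 0; i < SIZE; i += 4096)
            p[i] = 1;
        strcpy(p, "hello from the child");
        _exit(0);
    }

    wait(NULL);                              /* parent sees the child's writes */
    printf("parent read: %s; run `free -h` now\n", p);
    pause();
    return 0;
}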
Now look at the memory usage:
From the result we can see that only buff/cache grew by 1 GB, meaning that a shared anonymous mapping takes its memory from the cache. The reason is clear: parent and child share this memory, so the shared anonymous mapping lives in the cache, and each process then maps it into its own virtual address space so they can operate on the same piece of memory.
4. The system reclaims memory
When system memory runs low, there are two ways to free memory: one is manual, the other is reclaim triggered by the system itself. Let's first look at the manual way.
4.1 Manually reclaiming memory
Manual reclaim was already demonstrated above, namely:
echo 1 >> /proc/sys/vm/drop_caches
We can see an introduction to this under man proc
From this description: writing 1 to drop_caches releases the releasable part of the page cache (some of it cannot be released this way); writing 2 releases the dentry and inode caches; writing 3 releases both of the above.
The key point is the last sentence: dirty data in the page cache cannot be released by drop_caches; it must first be flushed to disk with the sync command, after which drop_caches can release those pages.
OK, we said that some of the page cache cannot be released through drop_caches. So, besides the file mappings and shared anonymous mappings discussed above, what else lives in the page cache?
4.2 tmpfs
Let's look at tmpfs first. Like procfs, sysfs, and ramfs, it is a memory-based file system. The differences between tmpfs and ramfs are that ramfs files live purely in memory while tmpfs can also use swap space, and ramfs can exhaust memory while tmpfs can cap how much memory it uses. You can run df -T -h to see the file systems mounted on the machine; several of them are tmpfs, the best known being /dev/shm.
The tmpfs source lives in mm/shmem.c in the kernel tree, and its implementation is fairly involved. The virtual file system was introduced earlier: creating a file on tmpfs works just like on a disk-based file system, with the same inode, super_block, dentry, file, and other structures; the real difference is in reading and writing, because only there does it matter whether the file's backing store is memory or disk.
tmpfs's read function, shmem_file_read, mainly uses the inode structure to find the file's address_space (which is essentially its page cache, just as for a disk file), and then uses the read offset to locate the right cache page and the offset within that page.
The data in that cache page is then copied straight from the page cache to user space with __copy_to_user. If the data we want is not in the page cache, the kernel checks whether it is in swap; if so, the page is swapped in first and then read.
tmpfs's write function, shmem_file_write, first checks whether the page being written is in memory; if it is, the user-space data is copied into the kernel page cache with __copy_from_user, overwriting the old data, and the page is marked dirty.
If the page being written is not in memory, the kernel checks whether it is in swap; if so, it is read in first, the old data is overwritten with the new data, and the page is marked dirty. If it is in neither memory nor swap, a new page cache page is allocated to hold the user's data.
From this analysis we know that tmpfs files also live in the cache. We can verify this by creating a file under /dev/shm:
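For example (a sketch, with a scratch file name of my choosing), writing 1 GB of zeros into a file under /dev/shm and then checking free should show buff/cache growing by about 1 GB:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[1024 * 1024];            /* 1 MB of zeros */
    memset(buf, 0, sizeof(buf));

    int fd = open("/dev/shm/tmpfs_test", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 1024; i++)           /* 1024 x 1 MB = 1 GB */
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
            perror("write");
            return 1;
        }

    close(fd);
    puts("wrote 1 GB to /dev/shm/tmpfs_test; run `free -h` now");
    return 0;
}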
Sure enough, the cache grew by 1 GB, confirming that tmpfs files consume cache memory.
In fact, mmap's anonymous mappings are also built on tmpfs: inside do_mmap_pgoff in mm/mmap.c, if the file structure is NULL and the mapping is SHARED, shmem_zero_setup(vma) is called to create a new file on tmpfs.
This explains why shared anonymous mapping memory is initialized to zero. But we know that all memory allocated with mmap is zero-initialized, which means private anonymous mappings are zeroed too, so where does that happen?
That does not happen in do_mmap_pgoff; it happens on the page fault, when the kernel allocates a special page initialized to zero.
So can the memory page occupied by this tmpfs be reclaimed?
In other words, the page cache occupied by tmpfs files cannot be reclaimed this way, and the reason is obvious: these pages are still referenced by files, so they cannot be freed.
4.3 Shared memory
POSIX shared memory works essentially the same way as mmap shared mappings: a new file is created on the tmpfs file system and then mapped into user space, so the two processes end up operating on the same physical memory. Is System V shared memory also based on tmpfs?
We can trace it down to the following function:
shmem_kernel_file_setup
It, too, creates a file on the tmpfs file system and uses that in-memory file for inter-process communication. I won't write a test program for it. This memory cannot be reclaimed either, because the shared-memory IPC mechanism's lifetime is tied to the kernel: once you create the shared memory, unless it is explicitly deleted, it continues to exist after the process exits.
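For completeness, here is a hedged illustration (not from the original post) of the System V interface whose segments end up as tmpfs-backed files via shmem_kernel_file_setup; note the explicit IPC_RMID, without which the segment outlives the process:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* Create (or attach to) a 1 MB segment under an arbitrary key. */
    int shmid = shmget((key_t)0x1234, 1024 * 1024, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    char *p = shmat(shmid, NULL, 0);
    if (p == (char *)-1) { perror("shmat"); return 1; }

    strcpy(p, "hello from System V shared memory");
    shmdt(p);

    /* Without this, the segment (and its tmpfs pages) persists after exit. */
    if (shmctl(shmid, IPC_RMID, NULL) < 0)
        perror("shmctl(IPC_RMID)");
    return 0;
}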
I had read some technical blogs claiming that the POSIX and System V IPC mechanisms (message queues, semaphores, and shared memory) all use the tmpfs file system, i.e. that their memory ultimately comes from the page cache. But from the source code I can only confirm that the two kinds of shared memory are based on tmpfs; I have not seen this for semaphores and message queues (something to follow up on).
The POSIX message queue implementation is somewhat similar to that of pipes: it has its own mqueue file system and hangs the message-queue attribute mqueue_inode_info off i_private on the inode. In kernel 2.6 that structure stores messages in an array, while in 4.6 it stores them in a red-black tree (I only looked at these two versions and did not dig into when the switch to the red-black tree happened).
Every operation by the two processes then works on the message array or red-black tree inside this mqueue_inode_info, which is how the communication happens. Analogous to mqueue_inode_info, tmpfs has its own per-inode attribute shmem_inode_info, and the eventpoll file system that serves epoll has its special attribute struct eventpoll, hung off private_data in the file structure, and so on.
To sum up at this point: in the process address space, the code segment, the data segment, dynamic libraries (shared file mappings), and mmap shared anonymous mappings all live in the cache, but these pages are referenced by processes and therefore cannot be released. The tmpfs-based IPC mechanisms have a lifetime tied to the kernel, so they cannot be released through drop_caches either.
Although the cache mentioned above cannot be released this way, as discussed later, these pages can be swapped out when memory runs low.
So what drop_caches can release is the cache pages created by reading files from disk, plus the cache pages of a file that some process had mapped into memory: once that process exits, if those pages are no longer referenced, they can be released too.
4.4 Automatic memory reclaim
When system memory is insufficient, the operating system has a mechanism for tidying up memory itself and releasing as much as it can; if that mechanism cannot free enough memory, the only option left is OOM.
Earlier, when discussing OOM, I mentioned that redis was killed by the OOM killer, as shown below:
The second half of the second line,
total-vm:186660kB, anon-rss:9388kB, file-rss:4kB
summarizes the process's memory usage with three attributes: total virtual memory, resident anonymous-mapping pages, and resident file-mapping pages.
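The same split can be read for a live process from /proc/<pid>/status; here is a sketch (the field names assume a reasonably recent kernel, which reports RssAnon, RssFile, and RssShmem alongside VmRSS):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        /* Print only the resident-memory breakdown lines. */
        if (strncmp(line, "VmRSS", 5) == 0 ||
            strncmp(line, "RssAnon", 7) == 0 ||
            strncmp(line, "RssFile", 7) == 0 ||
            strncmp(line, "RssShmem", 8) == 0)
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}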
From the analysis above, we also know that a process's memory is really just file mappings plus anonymous mappings:
- File mappings: the code segment, the data segment, the shared dynamic-library segments, and the file-mapping segments of the user program;
- Anonymous mappings: the bss segment, the heap, memory that malloc obtains via mmap, and mmap shared anonymous memory.
In fact, kernel memory reclaim works precisely along this file-mapping/anonymous-mapping split; mmzone.h defines the LRU lists accordingly:
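Roughly, the definition in include/linux/mmzone.h looks like this (reproduced from memory, so details may differ slightly between kernel versions):

#define LRU_BASE   0
#define LRU_ACTIVE 1
#define LRU_FILE   2

enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,
    LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    LRU_UNEVICTABLE,
    NR_LRU_LISTS
};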
LRU_UNEVICTABLE is the LRU list of unevictable pages; my understanding is that it holds pages locked with mlock, which the system is not allowed to swap out.
Briefly, the kernel's automatic reclaim works like this: a kernel thread, kswapd, periodically checks memory usage; if it finds free memory has dropped below pages_low, kswapd scans the first four LRU lists, looks for inactive pages on the active lists, and moves them to the inactive lists.
It then walks the inactive lists, reclaiming pages in batches of 32, until the number of free pages reaches pages_high. Different kinds of pages are reclaimed in different ways.
Of course, when memory falls below an even lower critical threshold, direct reclaim kicks in. The principle is the same as kswapd, but the reclaim is more aggressive and has to free more memory.
Dirty pages are first written back to disk and then reclaimed.
Clean pages are released directly. If a page is an I/O read cache page, it can simply be dropped: the next read takes a page fault and fetches it back from disk. If it is a file-mapping page, it can also be dropped: the next access triggers two faults, one to read the file contents back into memory and another to associate the page with the process's virtual memory.
Anonymous pages: since anonymous pages have no backing file to write back to, dropping them would lose the data. They are therefore reclaimed by swapping them out to disk and marking the page-table entry, so that the next page fault swaps them back in from disk.
Swapping in and out is very I/O intensive. If memory demand spikes suddenly, the CPU ends up waiting on I/O, the system stalls, and it can no longer serve requests. The kernel therefore provides a parameter, /proc/sys/vm/swappiness, to balance reclaiming cache against swapping out anonymous pages during memory reclaim.
The higher this value, the more likely the kernel is to reclaim memory by swapping; the maximum is 100. If it is set to 0, the kernel frees memory by reclaiming cache as much as possible.
5. Summary
This article has mainly covered topics related to Linux memory management:
First, it reviewed the process address space.
Then, when processes consume a lot of memory and it runs short, there are two remedies: manually dropping the cache, or letting the kernel background thread kswapd do the reclaim work.
Finally, when the memory requested exceeds what the system has left, the only outcome is OOM: a process is killed and its memory freed. From this whole process you can see just how hard the system works to scrape together enough memory.
- EOF -