[LINUX] Memory Management »Concepts overview

This text was originally part of the Linux kernel source code, so it is treated as GPLv2 here (as I understand it should be).

https://www.kernel.org/doc/html/latest/index.html

Licensing documentation

The following describes the license of the Linux kernel source code (GPLv2), how to properly mark the license of individual files in the source tree, as well as links to the full license text.

https://www.kernel.org/doc/html/latest/process/license-rules.html#kernel-licensing

https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html

Concepts overview

The memory management in Linux is a complex system that evolved over the years and included more and more functionality to support a variety of systems from MMU-less microcontrollers to supercomputers.

Linux memory management is a complex system that has been improved over the years and has the functionality to support a wide variety of systems, from microcontrollers without MMUs to supercomputers.

The memory management for systems without an MMU is called nommu and it definitely deserves a dedicated document, which hopefully will be eventually written.

Memory management for systems without an MMU is called nommu, and it certainly deserves a dedicated document, which will hopefully be written eventually.

Yet, although some of the concepts are the same, here we assume that an MMU is available and a CPU can translate a virtual address to a physical address.

However, although some of the concepts are the same, here we assume that an MMU is available and that the CPU can translate virtual addresses to physical addresses.

Virtual Memory Primer

The physical memory in a computer system is a limited resource and even for systems that support memory hotplug there is a hard limit on the amount of memory that can be installed.

The physical memory of a computer system is a limited resource, and even if the system supports hotplug of memory, there are hard restrictions on the amount of memory that can be installed.

The physical memory is not necessarily contiguous; it might be accessible as a set of distinct address ranges.

Physical memory is not necessarily contiguous; it may be accessible as a set of distinct address ranges.

Besides, different CPU architectures, and even different implementations of the same architecture have different views of how these address ranges are defined.

In addition, different CPU architectures, and even different implementations of the same architecture, define these address ranges differently.

All this makes dealing directly with physical memory quite complex and to avoid this complexity a concept of virtual memory was developed.

All of this makes dealing directly with physical memory quite complex, and the concept of virtual memory was developed to avoid this complexity.

The virtual memory abstracts the details of physical memory from the application software, allows to keep only needed information in the physical memory (demand paging) and provides a mechanism for the protection and controlled sharing of data between processes.

Virtual memory hides the details of physical memory from application software, allows only the needed information to be kept in physical memory (demand paging), and provides a mechanism for protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual address.

When using virtual memory, all memory access is via virtual addresses.

When the CPU decodes an instruction that reads (or writes) from (or to) the system memory, it translates the virtual address encoded in that instruction to a physical address that the memory controller can understand.

When the CPU decodes an instruction that reads from or writes to system memory, it translates the virtual address encoded in that instruction into a physical address that the memory controller can understand.

The physical system memory is divided into page frames, or pages.

Physical system memory is divided into page frames or pages.

The size of each page is architecture specific.

The size of each page is architecture dependent.

Some architectures allow selection of the page size from several supported values; this selection is performed at the kernel build time by setting an appropriate kernel configuration option.

Some architectures let you choose the page size from several supported values; the choice is made at kernel build time by setting the appropriate kernel configuration option.
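
Because the page size is architecture specific, user-space code usually asks the running system for it instead of hard-coding 4096. The following is a minimal sketch, assuming a Unix-like system where Python's os.sysconf supports the "SC_PAGE_SIZE" name; the helper name page_size is only illustrative.

```python
import os


def page_size() -> int:
    """Return the page size of the running system, in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


size = page_size()
print(f"page size: {size} bytes")
```

The printed value depends on the architecture and on how the kernel was configured, so no particular number should be assumed.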

Each physical memory page can be mapped as one or more virtual pages.

Each physical memory page can be mapped to one or more virtual pages.

These mappings are described by page tables that allow translation from a virtual address used by programs to the physical memory address. The page tables are organized hierarchically.

These mappings are described by page tables, which translate the virtual addresses used by programs into physical memory addresses. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical addresses of actual pages used by the software.

The table at the bottom of the hierarchy contains the physical address of the actual page used by the software.

The tables at higher levels contain physical addresses of the pages belonging to the lower levels.

The tables at higher levels contain the physical addresses of the pages that belong to the lower levels.

The pointer to the top level page table resides in a register.

The pointer to the top-level page table resides in a register.

When the CPU performs the address translation, it uses this register to access the top level page table.

When the CPU performs address translation, it uses this register to access the top-level page table.

The high bits of the virtual address are used to index an entry in the top level page table.

The high-order bits of the virtual address are used to index an entry in the top-level page table.

That entry is then used to access the next level in the hierarchy with the next bits of the virtual address as the index to that level page table.

That entry is then used to access the next level in the hierarchy, with the next bits of the virtual address serving as the index into that level's page table.

The lowest bits in the virtual address define the offset inside the actual page.

The lowest bits of the virtual address define the offset within the actual page.
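
The walk described above can be sketched as a toy model. This is not how the kernel or the hardware implements it; it only illustrates the index arithmetic. The layout is an assumption (four levels, 9 index bits per level, 12 offset bits, as in x86-64 4-level paging with 4 KiB pages), and the nested dictionaries and the function names split_virtual_address and translate are hypothetical stand-ins for real page tables.

```python
PAGE_SHIFT = 12      # assumed 4 KiB pages: low 12 bits are the in-page offset
BITS_PER_LEVEL = 9   # assumed 512 entries per table
LEVELS = 4           # assumed four-level hierarchy


def split_virtual_address(vaddr: int) -> tuple[list[int], int]:
    """Split a virtual address into per-level indices (top first) and offset."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


def translate(top_table: dict, vaddr: int) -> int:
    """Walk the toy tables; a KeyError stands in for a missing mapping."""
    indices, offset = split_virtual_address(vaddr)
    entry = top_table
    for index in indices:
        entry = entry[index]     # descend one level per index
    return entry + offset        # lowest-level entry is the page's physical base


# Build an address from known indices so the result is easy to check.
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x56
tables = {1: {2: {3: {4: 0x80000}}}}   # toy hierarchy, physical page at 0x80000

print(split_virtual_address(vaddr))    # ([1, 2, 3, 4], 86)
print(hex(translate(tables, vaddr)))   # 0x80056
```

In the toy model the top-level dictionary plays the role of the table that the register points to.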

Huge Pages

The address translation requires several memory accesses and memory accesses are slow relatively to CPU speed.

Address translation requires several memory accesses, which are slow compared to the CPU speed.

To avoid spending precious processor cycles on the address translation, CPUs maintain a cache of such translations called Translation Lookaside Buffer (or TLB).

To avoid spending precious processor cycles on address translation, CPUs maintain a cache of such translations called the Translation Lookaside Buffer (TLB).

Usually TLB is pretty scarce resource and applications with large memory working set will experience performance hit because of TLB misses.

The TLB is usually a fairly scarce resource, and applications with a large memory working set take a performance hit because of TLB misses.

Many modern CPU architectures allow mapping of the memory pages directly by the higher levels in the page table. For instance, on x86, it is possible to map 2M and even 1G pages using entries in the second and the third level page tables.

Many modern CPU architectures can map memory pages directly from the higher levels of the page table. For example, on x86 it is possible to map 2M and even 1G pages using entries in the second- and third-level page tables.

In Linux such pages are called huge.

In Linux, such pages are called huge.

Usage of huge pages significantly reduces pressure on TLB, improves TLB hit-rate and thus improves overall system performance.

Using huge pages significantly reduces pressure on the TLB, improves the TLB hit rate, and thus improves overall system performance.
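
A rough back-of-the-envelope calculation shows why larger pages relieve the TLB. The sketch below assumes a hypothetical TLB with 64 entries (real sizes vary by CPU and are not given in the text) and computes how much memory those entries can cover, and how many pages a 1 GiB working set needs at each page size. The function names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Memory covered when every TLB entry maps one page of the given size."""
    return entries * page_size


def pages_needed(working_set: int, page_size: int) -> int:
    """Number of pages (and so translations) needed to cover a working set."""
    return -(-working_set // page_size)   # ceiling division


TLB_ENTRIES = 64   # hypothetical value, for illustration only

for label, size in (("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)):
    reach = tlb_reach(TLB_ENTRIES, size)
    count = pages_needed(GIB, size)
    print(f"{label}: reach {reach // KIB} KiB, pages for 1 GiB: {count}")
```

With these assumed numbers, 64 entries of 4K pages cover 256 KiB, while 64 entries of 2M pages cover 128 MiB, so far fewer translations compete for the same TLB.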

There are two mechanisms in Linux that enable mapping of the physical memory with the huge pages.

Linux has two mechanisms for mapping physical memory with huge pages.

The first one is HugeTLB filesystem, or hugetlbfs.

The first is the HugeTLB filesystem, or hugetlbfs.

It is a pseudo filesystem that uses RAM as its backing store. For the files created in this filesystem the data resides in the memory and mapped using huge pages.

This is a pseudo filesystem that uses RAM as its backing store. For files created in this filesystem, the data resides in memory and is mapped using huge pages.

The hugetlbfs is described at Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>.

hugetlbfs is described in Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>.

Another, more recent, mechanism that enables use of the huge pages is called Transparent HugePages, or THP.

Another, more recent mechanism that enables the use of huge pages is called Transparent HugePages, or THP.

Unlike the hugetlbfs that requires users and/or system administrators to configure what parts of the system memory should and can be mapped by the huge pages, THP manages such mappings transparently to the user and hence the name.

Unlike hugetlbfs, which requires users and/or system administrators to configure which parts of system memory should and can be mapped by huge pages, THP manages such mappings transparently to the user, hence the name.

See Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge> for more details about THP.

Zones

Often hardware poses restrictions on how different physical memory ranges can be accessed.

Hardware often imposes restrictions on how to access various physical memory ranges.

In some cases, devices cannot perform DMA to all the addressable memory.

In some cases, the device cannot perform DMA on all addressable memory.

In other cases, the size of the physical memory exceeds the maximal addressable size of virtual memory and special actions are required to access portions of the memory.

In other cases, the size of physical memory exceeds the maximum addressable size of virtual memory, and special actions are required to access parts of the memory.

Linux groups memory pages into zones according to their possible usage.

Linux groups memory pages into zones according to how they can be used.

For example, ZONE_DMA will contain memory that can be used by devices for DMA, ZONE_HIGHMEM will contain memory that is not permanently mapped into kernel's address space and ZONE_NORMAL will contain normally addressed pages.

For example, ZONE_DMA contains memory that devices can use for DMA, ZONE_HIGHMEM contains memory that is not permanently mapped into the kernel's address space, and ZONE_NORMAL contains normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all architectures define all zones, and requirements for DMA are different for different platforms.

The actual layout of the memory zone is hardware dependent. Not all architectures define all zones. DMA requirements are different on different platforms.

Nodes

Many multi-processor machines are NUMA - Non-Uniform Memory Access - systems.

Many multiprocessor machines are NUMA (Non-Uniform Memory Access) systems.

In such systems the memory is arranged into banks that have different access latency depending on the "distance" from the processor.

In such systems, the memory is arranged into banks whose access latency differs depending on their "distance" from the processor.

Each bank is referred to as a node and for each node Linux constructs an independent memory management subsystem.

Each bank is referred to as a node, and for each node Linux constructs an independent memory management subsystem.

A node has its own set of zones, lists of free and used pages and various statistics counters.

A node has its own set of zones, lists of free and used pages, and various statistics counters.

You can find more details about NUMA in Documentation/vm/numa.rst <numa> and in Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>.

Page cache

The physical memory is volatile and the common case for getting data into the memory is to read it from files.

Physical memory is volatile, and a common case of getting data into memory is to read data from a file.

Whenever a file is read, the data is put into the page cache to avoid expensive disk access on the subsequent reads.

Each time you read a file, the data is put into the page cache to avoid costly disk access for subsequent reads.

Similarly, when one writes to a file, the data is placed in the page cache and eventually gets into the backing storage device.

Similarly, when writing to a file, the data is placed in the page cache and eventually stored in the backing storage device.

The written pages are marked as dirty and when Linux decides to reuse them for other purposes, it makes sure to synchronize the file contents on the device with the updated data.

The written pages are marked as dirty, and when Linux decides to reuse them for other purposes, it makes sure to synchronize the file contents on the device with the updated data.
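
The read, write, and dirty-page behaviour described here can be illustrated with a small toy model. This is only a sketch of the idea, not the kernel's data structures: the class PageCache, its methods, and the dictionary that stands in for the storage device are all hypothetical.

```python
class PageCache:
    """Toy write-back cache: a dict stands in for the storage device."""

    def __init__(self, backing: dict):
        self.backing = backing    # page index -> data "on the device"
        self.pages = {}           # page index -> data cached in memory
        self.dirty = set()        # indices written but not yet synchronized
        self.device_reads = 0     # counts the expensive accesses

    def read(self, index: int) -> bytes:
        if index not in self.pages:           # miss: go to the device once
            self.device_reads += 1
            self.pages[index] = self.backing.get(index, b"")
        return self.pages[index]              # hit: served from memory

    def write(self, index: int, data: bytes) -> None:
        self.pages[index] = data              # data lands in the cache first
        self.dirty.add(index)

    def evict(self, index: int) -> None:
        if index in self.dirty:               # synchronize before reuse
            self.backing[index] = self.pages[index]
            self.dirty.discard(index)
        self.pages.pop(index, None)


disk = {0: b"old"}
cache = PageCache(disk)
cache.read(0)
cache.read(0)                  # second read does not touch the device
cache.write(0, b"new")         # device still holds the old data
before_evict = disk[0]
cache.evict(0)                 # dirty page is written back, then dropped
after_evict = disk[0]
print(cache.device_reads, before_evict, after_evict)
```

The model writes back only on eviction; the text says only that the data eventually reaches the backing storage device, so the timing here is a simplification.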

Anonymous Memory

The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem.

Anonymous memory, or anonymous mappings, is memory that is not backed by a filesystem.

Such mappings are implicitly created for program's stack and heap or by explicit calls to mmap(2) system call.

Such mappings are created implicitly for a program's stack and heap, or by explicit calls to the mmap(2) system call.

Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access.

Normally, anonymous mapping defines only virtual memory areas that the program can access.

The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes.

Read access results in a page table entry that references a special 0-filled physical page.

When the program performs a write, a regular physical page will be allocated to hold the written data.

When the program performs a write, a regular physical page is allocated to hold the written data.

The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.

This page is marked dirty and if the kernel decides it should be reused, the dirty page is swapped out.
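
An explicit anonymous mapping can be created from user space with Python's mmap module, which wraps mmap(2). The sketch below assumes a Unix-like system where mmap.MAP_PRIVATE and mmap.MAP_ANONYMOUS are available. It only shows the behaviour visible to the program (reads return zeroes, writes stick); the page table changes described above happen inside the kernel and cannot be observed from this code.

```python
import mmap

LENGTH = 4 * mmap.PAGESIZE   # four pages of anonymous memory

# fileno -1 together with MAP_ANONYMOUS means: not backed by any file.
mapping = mmap.mmap(
    -1,
    LENGTH,
    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
    prot=mmap.PROT_READ | mmap.PROT_WRITE,
)

first_read = mapping[0]              # a fresh anonymous mapping reads as zero
mapping[0] = 0x2A                    # the write needs a real page behind it
after_write = mapping[0]
untouched = mapping[mmap.PAGESIZE]   # a page never written still reads as zero

mapping.close()
print(first_read, after_write, untouched)
```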

Reclaim

Throughout the system lifetime, a physical page can be used for storing different types of data.

Throughout the life of the system, physical pages are used to hold various types of data.

It can be kernel internal data structures, DMA'able buffers for device drivers use, data read from a filesystem, memory allocated by user space processes etc.

These can be kernel internal data structures, DMA-capable buffers used by device drivers, data read from a filesystem, memory allocated by user-space processes, and so on.

Depending on the page usage it is treated differently by the Linux memory management.

Linux memory management handles it differently depending on how the page is used.

The pages that can be freed at any time, either because they cache the data available elsewhere, for instance, on a hard disk, or because they can be swapped out, again, to the hard disk, are called reclaimable.

Pages that can be freed at any time, either because they cache data available elsewhere (for instance, on a hard disk) or because they can be swapped out, again to the hard disk, are called reclaimable.

The most notable categories of the reclaimable pages are page cache and anonymous memory.

The most notable categories of reclaimable pages are the page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA buffers cannot be repurposed, and they remain pinned until freed by their user. Such pages are called unreclaimable.

In most cases, pages that hold internal kernel data or are used as DMA buffers cannot be repurposed, and they remain pinned until freed by their user. Such pages are called unreclaimable.

However, in certain circumstances, even pages occupied with kernel data structures can be reclaimed.

However, in certain circumstances, even pages occupied by kernel data structures can be reclaimed.

For instance, in-memory caches of filesystem metadata can be re-read from the storage device and therefore it is possible to discard them from the main memory when system is under memory pressure.

For example, in-memory caches of filesystem metadata can be re-read from the storage device, so they can be discarded from main memory when the system is under memory pressure.

The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim.

The process of freeing reclaimable physical memory pages and repurposing them is called (surprise!) reclaim.

Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system.

Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system.

When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply.

When the system is not loaded, most of the memory is free, and allocation requests are satisfied immediately from the supply of free pages.

As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (high watermark), an allocation request will awaken the kswapd daemon.

As the load increases, the number of free pages decreases. When it reaches a certain threshold (the high watermark), an allocation request wakes the kswapd daemon.

It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?).

It asynchronously scans memory pages and either simply frees them, if the data they contain is available elsewhere, or evicts them to the backing storage device (remember those dirty pages?).

As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim.

When the memory usage increases further and another threshold (min watermark) is reached, the allocation triggers a "direct reclaim".

In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.

In this case, the allocation is stalled until enough memory pages have been reclaimed to satisfy the request.
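
The three situations described above can be summarized as a small decision function. This is a deliberately simplified sketch: the real kernel applies its checks per zone and takes more factors into account. The function allocation_path and its parameter names are hypothetical; kswapd_threshold stands for the first threshold in the text (which the text calls the high watermark) and min_watermark for the second.

```python
def allocation_path(free_pages: int, kswapd_threshold: int, min_watermark: int) -> str:
    """Classify what an allocation request triggers in this toy model."""
    if free_pages <= min_watermark:
        return "direct reclaim"   # the allocation stalls until pages are freed
    if free_pages <= kswapd_threshold:
        return "wake kswapd"      # background (asynchronous) reclaim starts
    return "fast path"            # plenty of free pages, satisfied immediately


# Illustrative numbers only; real thresholds depend on the system.
for free in (10_000, 900, 100):
    print(free, "->", allocation_path(free, kswapd_threshold=1_000, min_watermark=200))
```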

Compaction

As the system runs, tasks allocate and free the memory and it becomes fragmented.

As the system runs, tasks allocate and free memory, and it becomes fragmented.

Although with virtual memory it is possible to present scattered physical pages as virtually contiguous range, sometimes it is necessary to allocate large physically contiguous memory areas.

With virtual memory, scattered physical pages can be presented as a virtually contiguous range. However, it is sometimes necessary to allocate large physically contiguous memory areas.

Such need may arise, for instance, when a device driver requires a large buffer for DMA, or when THP allocates a huge page.

Such a need may arise, for example, when a device driver requires a large buffer for DMA, or when THP allocates a huge page.

Memory compaction addresses the fragmentation issue.

Memory compaction addresses fragmentation issues.

This mechanism moves occupied pages from the lower part of a memory zone to free pages in the upper part of the zone.

This mechanism moves occupied pages from the lower part of a memory zone to free pages in the upper part of the zone.

When a compaction scan is finished free pages are grouped together at the beginning of the zone and allocations of large physically contiguous areas become possible.

After a compaction scan finishes, free pages are grouped together at the beginning of the zone, and large physically contiguous areas can be allocated.
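
The movement described above can be illustrated with a toy two-scanner model: one scanner looks for occupied pages starting from the bottom of the zone, the other looks for free pages starting from the top, and pages are moved until the scanners meet. This is only a sketch of the idea; the list representation (None for a free page) and the function name compact are hypothetical, and the kernel's actual implementation is considerably more involved.

```python
def compact(zone: list) -> list:
    """Return a compacted copy of the zone; None marks a free page."""
    pages = list(zone)
    low, high = 0, len(pages) - 1
    while True:
        # Scan upward from the bottom for the next occupied page.
        while low < high and pages[low] is None:
            low += 1
        # Scan downward from the top for the next free page.
        while low < high and pages[high] is not None:
            high -= 1
        if low >= high:          # scanners met: compaction is finished
            break
        # Move the occupied page up and leave its old slot free.
        pages[high], pages[low] = pages[low], None
    return pages


zone = ["A", None, "B", None, None, "C", None, "D"]
print(compact(zone))
# [None, None, None, None, 'B', 'C', 'A', 'D']
```

In the result, the four free pages form one contiguous run at the beginning of the zone, which is what makes a large physically contiguous allocation possible.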

Like reclaim, the compaction may happen asynchronously in the kcompactd daemon or synchronously as a result of a memory allocation request.

Similar to reclaim, compaction can occur asynchronously by the kcompactd daemon or synchronously as a result of memory allocation.

OOM killer

It is possible that on a loaded machine memory will be exhausted and the kernel will be unable to reclaim enough memory to continue to operate.

On a heavily loaded machine, memory may be exhausted, and the kernel may be unable to reclaim enough memory to continue operating.

In order to save the rest of the system, it invokes the OOM killer.

To save the rest of the system, the kernel invokes the OOM killer.

The OOM killer selects a task to sacrifice for the sake of the overall system health.

The OOM killer selects a task to sacrifice for the sake of overall system health.

The selected task is killed in a hope that after it exits enough memory will be freed to continue normal operation.

The selected task is killed in the hope that, after it exits, enough memory will be freed to continue normal operation.
