[LINUX] When will mmap(2) files be updated? (2)

The previous article outlined two broad approaches for detecting writes to an mmap'd area, and predicted three concrete methods (1a: page table scan, 1b: physical page scan, 2: write fault capture). In this article I'll read the code of widely used open-source OSes to see which of these methods is actually used.

For Linux

The Linux virtual memory code lives in the mm directory of the source tree. The page replacement code is in vmscan.c.

A notable feature of the Linux mm code is that the architecture-independent part manipulates page tables directly. The layout of a page table of course differs between architectures, but it is accessed through macros defined in the architecture-dependent code, and even the multi-level structure of the page table shows up in the architecture-independent part, which I think is unusual. I suspect this comes from Linux's history: it started as an x86-only kernel, and by the time it was ported it was recent enough that the more eccentric MMUs had already died out, so it only ever had to support MMUs of broadly similar design.

Page table scan

The x86 MMU operation macros are defined in [arch/x86/include/asm/pgtable.h](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/pgtable.h) and friends. The macro that checks whether a page's D bit is set is pte_dirty().
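As a rough illustration of what pte_dirty() amounts to: the check boils down to testing one bit of the PTE value, and on x86 the D bit is bit 6. The following standalone sketch uses invented names and a hard-coded PTE value purely for illustration (the real macro lives in the header linked above):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustration only: on x86 the D (dirty) bit is bit 6 of a PTE.
     * These names are made up for this sketch; the kernel's real check
     * is the pte_dirty() macro in arch/x86/include/asm/pgtable.h. */
    #define X86_PTE_DIRTY (1ULL << 6)

    static bool sketch_pte_dirty(uint64_t pte)
    {
        return (pte & X86_PTE_DIRTY) != 0;
    }

    int main(void)
    {
        uint64_t pte = 0x12345067ULL;   /* an invented PTE value with the D bit set */
        printf("dirty? %s\n", sketch_pte_dirty(pte) ? "yes" : "no");
        return 0;
    }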

vmscan is very complex code. Page reclaim runs in two ways: one is the kernel background thread (kthread) called kswapd, and the other is direct reclaim, which runs in the foreground when Linux tries to allocate memory and finds that free memory has (almost) run out. Both paths eventually arrive at [shrink_node()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmscan.c?h=v5.7#n2683).
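As an aside, both reclaim paths can be observed from userspace, since /proc/vmstat exports a scan counter for each. The sketch below just prints those lines; the field names (pgscan_kswapd, pgscan_direct) are those of recent kernels and may differ on older ones:

    #include <stdio.h>
    #include <string.h>

    /* Print the reclaim scan counters for kswapd and direct reclaim
     * from /proc/vmstat (field names as seen in recent kernels). */
    int main(void)
    {
        FILE *fp = fopen("/proc/vmstat", "r");
        char line[256];

        if (fp == NULL) {
            perror("/proc/vmstat");
            return 1;
        }
        while (fgets(line, sizeof(line), fp) != NULL) {
            if (strncmp(line, "pgscan_kswapd", 13) == 0 ||
                strncmp(line, "pgscan_direct", 13) == 0)
                fputs(line, stdout);
        }
        fclose(fp);
        return 0;
    }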

Even if you chase the code from shrink_node() onward (which is somewhat painful), there is no place that calls pte_dirty(), directly or indirectly. So Linux appears to use the write fault capture method.

Write fault and page write-out

Just in case, let's look at the write fault code and the write-out of dirty pages. A Linux page fault goes through the architecture-dependent page fault routine and reaches [handle_mm_fault()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?h=v5.7#n4348). For a write fault, FAULT_FLAG_WRITE is set in the flags argument (a bitmap). From there the path is \_\_handle_mm_fault() → handle_pte_fault() (ignoring HugePages and the handling of the upper levels of the multi-level page table along the way). [When FAULT_FLAG_WRITE is set and the page is write-protected (!pte_write(entry))](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?h=v5.7#n4230), we jump to [do_wp_page()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?h=v5.7#n2878). If the file is mapped with MAP_SHARED, we go to [wp_page_shared()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?h=v5.7#n2844).

vma->vm_ops->page_mkwrite is [filemap_page_mkwrite()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?h=v5.7#n2656), which updates the file's timestamp and marks the page dirty with set_page_dirty() (this dirty is a page attribute, unlike the PTE D bit: if a page is mapped by multiple processes there are multiple corresponding PTEs, and only the D bit of the PTE in the page table of the address space where the write happened gets set. For example, in a write(2) system call it is a PTE in the kernel's own mapping that becomes dirty, which is not scanned by vmscan; but since it is obvious when that write happens, the dirty page attribute can simply be set at system call time. set_page_dirty() also links the page onto the list that tracks dirty pages).
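To exercise this path from userspace it is enough to map a file with MAP_SHARED and store into it: the store takes a write fault (and, once the page has been cleaned and write-protected again, the next store goes through the do_wp_page()/wp_page_shared() route described above). A minimal sketch; the file name is arbitrary and error handling is kept short:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* "testfile" is just an arbitrary name for this sketch. */
        int fd = open("testfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, 4096) < 0) {
            perror("open/ftruncate");
            return 1;
        }

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* This store is what takes the write fault: the PTE's D bit is set
         * and the page cache page is marked dirty via page_mkwrite. */
        strcpy(p, "hello mmap\n");

        /* Force the dirty page out now instead of waiting for the flusher. */
        msync(p, 4096, MS_SYNC);

        munmap(p, 4096);
        close(fd);
        return 0;
    }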

The component that actually writes these dirty pages out is the bdi flusher. Write-out by the bdi flusher is controlled by sysctl variables such as vm.dirty_expire_centisecs; the design is to accumulate memory writes for a while and then write them to storage. These days it is implemented on top of a workqueue: work is queued and a callback function runs later. Here, the work item is the per-block-device (per-memcg, to be precise) [struct bdi_writeback (wb)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/backing-dev-defs.h?h=v5.7#n130), and the callback function is wb_workfn().
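The knobs mentioned above live under /proc/sys/vm. A small sketch that prints the current values of the two most relevant ones (the sysctl names are standard; the reader code itself is just for illustration):

    #include <stdio.h>

    /* dirty_expire_centisecs: how old dirty data must be before the flusher
     * considers it for write-out.
     * dirty_writeback_centisecs: how often the flusher wakes up. */
    static void show(const char *path)
    {
        FILE *fp = fopen(path, "r");
        char buf[64];

        if (fp != NULL) {
            if (fgets(buf, sizeof(buf), fp) != NULL)
                printf("%s = %s", path, buf);
            fclose(fp);
        }
    }

    int main(void)
    {
        show("/proc/sys/vm/dirty_expire_centisecs");
        show("/proc/sys/vm/dirty_writeback_centisecs");
        return 0;
    }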

The path from there is complicated, so I'll omit it, but writing the page cache is done by [do_writepages()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page-writeback.c?h=v5.7#n2336), which first tries the writepages method provided by each file system ([ext4_writepages()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/ext4/inode.c?h=v5.7#n2610) for ext4, [xfs_vm_writepages()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/xfs_aops.c?h=v5.7#n570) for XFS, and so on). Here we will follow the generic generic_writepages() (called when a file system does not register its own writepages method; it also serves as template-like code). Its body is [write_cache_pages()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page-writeback.c?h=v5.7#n2127), which writes out the dirty page cache of the specified file.

In its while loop, pagevec_lookup_range_tag() looks up dirty page cache pages. The struct pagevec that receives the result is a fixed-length array (length [15](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/pagevec.h?h=v5.7#n15)) of pointers to struct page, which represents a physical page, and the for loop iterates over the elements of the pagevec. pagevec_lookup_range_tag() is built to find only dirty pages, but after some extra processing (such as skipping a page whose flush is already in progress on another CPU), we reach:

                        if (!clear_page_dirty_for_io(page))
                                goto continue_unlock;

                        trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
                        error = (*writepage)(page, wbc, data);

This part is the key. We'll look at clear_page_dirty_for_io() in a moment; the name says it clears the dirty mark because the page is about to be written out. writepage is the file-system-specific writepage method, which, as the name implies, writes the page.

clear_page_dirty_for_io() is [a little further down in the same file](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page-writeback.c?h=v5.7#n2639). The crucial part is the following.

                if (page_mkclean(page))
                        set_page_dirty(page);

page_mkclean() is [in rmap.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/rmap.c?h=v5.6#n986). rmap is short for reverse map, a structure for finding the PTEs that map a given physical page. A physical page can be referenced by multiple PTEs, so [page_mkclean_one()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/rmap.c?h=v5.6#n1979) is called for each of them. There we finally arrive at:

			entry = ptep_clear_flush(vma, address, pte);
			entry = pte_wrprotect(entry);
			entry = pte_mkclean(entry);
			set_pte_at(vma->vm_mm, address, pte, entry);

This clears the D bit and at the same time clears the R/W bit, write-protecting the page. So we can see that the page is made read-only again just before the page cache is written out.
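Putting the fault side and the write-out side together, the life cycle of a MAP_SHARED page can be modeled roughly as below. This is a toy model, not kernel code: the struct and function names are invented, and the real kernel of course operates on PTEs and struct page rather than a pair of booleans.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of the cycle observed above (all names invented):
     * - a write fault makes the mapping writable and marks it dirty
     * - write-out clears the dirty mark and write-protects the mapping
     *   again, so the next store faults and the cycle repeats. */
    struct toy_mapping {
        bool writable;   /* models the R/W bit of the PTE */
        bool dirty;      /* models the D bit / dirty page flag */
    };

    static void write_fault(struct toy_mapping *m)
    {
        m->writable = true;   /* do_wp_page()/page_mkwrite() permit the write */
        m->dirty = true;      /* the MMU sets the D bit on the store */
    }

    static void write_out(struct toy_mapping *m)
    {
        if (!m->dirty)
            return;
        m->dirty = false;     /* pte_mkclean() */
        m->writable = false;  /* pte_wrprotect(): the next store faults again */
        printf("page written to storage\n");
    }

    int main(void)
    {
        struct toy_mapping m = { false, false };

        write_fault(&m);   /* first store into the mapping */
        write_out(&m);     /* flusher runs */
        write_fault(&m);   /* a later store faults again due to wrprotect */
        write_out(&m);
        return 0;
    }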

For NetBSD

NetBSD is an operating system derived from 4.4BSD. 4.4BSD ported its virtual memory system from CMU's microkernel Mach, but perhaps because it became impossible to keep using Mach's VM system under different requirements, NetBSD replaced it with its own virtual memory system. That code is called UVM and is described in detail in Charles D. Cranor's dissertation, linked from [The NetBSD Project's Documents Page](https://www.netbsd.org/docs/kernel/uvm.html) (note that the paper is more than 20 years old and differs from the current code in places).

In NetBSD's UVM, the machine-dependent part (MD: Machine Dependent, in NetBSD terminology) is called the pmap layer. What appears in the machine-independent part (MI: Machine Independent) is not the page table itself as in Linux, but an opaque structure called struct pmap (implemented in the MD pmap layer), plus virtual addresses (VA) and physical addresses (PA). The multi-level structure of the page table is hidden inside the pmap layer. When UVM was implemented, it kept the interface between Mach's virtual memory system and the pmap layer.

Since not many people are familiar with the NetBSD source code, I've included a quick explanation of the terminology.

The NetBSD source code is managed in CVS, but since I want to link to specific lines, I'll refer to the [GitHub mirror](https://github.com/NetBSD/src/tree/netbsd-9) below.

Page table scan

In the pmap interface, the call that checks whether a page's D bit is set is pmap_is_modified() (pmap.h#L332). Its argument is a physical page, which may be pointed to by multiple PTEs; in that case it returns true if any of those PTEs has the D bit set. Similarly, pmap_clear_modify() clears the D bit in the (possibly multiple) PTEs pointing to a page and returns whether any of them had it set (the equivalent of Linux's page_mkclean()). In UVM, page reclaim is the job of a kernel thread called the page daemon (pdaemon). However, not just in the pdaemon: grepping the entire UVM source code, there is in fact no place that calls pmap_is_modified() (or pmap_clear_modify()). So NetBSD also appears to use the write fault capture method.
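For reference, the shape of the pmap calls that matter here is roughly as follows. These are paraphrased declarations for illustration (vm_prot_t is simplified to an int here); the authoritative versions are in the pmap(9) man page and the headers.

    #include <stdbool.h>

    struct vm_page;                 /* a physical page */
    typedef int vm_prot_t;          /* simplified; really VM_PROT_READ etc. */

    /* Is the D bit set in any PTE mapping this physical page? */
    bool pmap_is_modified(struct vm_page *pg);

    /* Clear the D bit in every PTE mapping this page and return whether
     * any of them had it set (the counterpart of Linux's page_mkclean()). */
    bool pmap_clear_modify(struct vm_page *pg);

    /* Reduce the permissions of every mapping of this page, e.g. to
     * VM_PROT_READ|VM_PROT_EXECUTE in order to write-protect it. */
    void pmap_page_protect(struct vm_page *pg, vm_prot_t prot);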

Interestingly, NetBSD has a pluggable page replacement algorithm that allows you to actually choose between two types: Clock and Clock Pro.

Write fault and page write-out

Just in case, let's look at the write fault code and the write-out of dirty pages here as well. NetBSD's page fault handling is in uvm/uvm_fault.c. For a write fault, the access_type argument contains VM_PROT_WRITE, which is stored in flt.access_type of struct uvm_faultctx. According to the comment at the top, a write fault to a region where a file is mapped with MAP_SHARED falls under CASE 2A, which is handled in uvm_fault_lower(). uobj represents the mmap'd file.

For a regular file, uobj->pgops->pgo_fault is NULL and uobj->pgops->pgo_get is uvn_get() (uvm_vnode.c#L155); when the file system is BSD FFS, vop_getpages is genfs_getpages() (genfs_io.c#L102). Keeping that in mind, the code proceeds uvm_fault_lower() → uvm_fault_lower_lookup() (uvm_fault.c#L1764) → uvn_get() → VOP_GETPAGES() → genfs_getpages(). genfs_getpages() is about to read the data from storage into the page cache. The VM_PROT_WRITE in access_type is reflected in the variable memwrite, and the timestamps are updated as well. Then, at line 261:

                if (error == 0 && memwrite) {
                        genfs_markdirty(vp);
                }

Here the vnode is marked dirty and put on the list of writeback targets handled by the syncer (described below). There does not seem to be anywhere that marks the page itself dirty (i.e. clears the PG_CLEAN flag in the page flags); reading the dirty page write-out code below, that indeed seems unnecessary. If it is being dirtied somewhere, please let me know.

Next, the write-out of dirty pages. This is done by a kernel thread called the syncer. The body of the syncer is sched_sync(). From there the path is lazy_sync_vnode() → VOP_FSYNC() ([vnode_if.c#L798](https://github.com/NetBSD/src/blob/netbsd-9/sys/kern/vnode_if.c#L798)) → [ffs_fsync()](https://github.com/NetBSD/src/blob/netbsd-9/sys/ufs/ffs/ffs_vnops.c#L325) → ffs_full_fsync() → vflushbuf() → VOP_PUTPAGES() ([vnode_if.c#L1614](https://github.com/NetBSD/src/blob/netbsd-9/sys/kern/vnode_if.c#L1614)) → [genfs_putpages()](https://github.com/NetBSD/src/blob/netbsd-9/sys/miscfs/genfs/genfs_io.c#L777), which does the writing. WAPBL is the NetBSD equivalent of the ext3/4 journal, but I won't go into it here. vflushbuf() passes PGO_CLEANIT, so following that flow: a transaction is first started with a mechanism called fstrans, then we go back to [retry](https://github.com/NetBSD/src/blob/netbsd-9/sys/miscfs/genfs/genfs_io.c#L885) and enter the while loop. The while loop appears to switch between by_list (walking the list of pages) and scanning pages sequentially, depending on how densely the pages are resident in memory. Either way, the page currently being processed is in struct vm_page *pg.

I'm a little less sure about what follows. At line 1112:

		if (flags & PGO_CLEANIT) {
 			needs_clean = pmap_clear_modify(pg) ||
			    (pg->flags & PG_CLEAN) == 0;
			pg->flags |= PG_CLEAN;
		} else {

pmap_clear_modify() clears the D bit of every PTE pointing to pg, and whether any D bit was set beforehand is reflected in needs_clean.

Also, slightly before this point, the page appears to be made read-only again:

			if (cleanall && wasclean &&
			    gp->g_dirtygen == dirtygen) {

				/*
				 * uobj pages get wired only by uvm_fault
				 * where uobj is locked.
				 */

				if (pg->wire_count == 0) {
					pmap_page_protect(pg,
					    VM_PROT_READ|VM_PROT_EXECUTE);
				} else {
					cleanall = false;
				}
			}

wasclean is set when the vnode's v_numoutput member is 0. This member is incremented just before a page write is started, so is it really always 0 when we get here? dirtygen is copied at the beginning of genfs_do_putpages() from the member of the same name in the vnode, and that member is incremented in genfs_markdirty(). It appears to be a dirty generation counter: if a write fault occurs on another page of the same file while this write-out is in progress, the condition gp->g_dirtygen == dirtygen no longer holds. In that case the later code does not remove this vnode from the syncer's targets, so this indeed seems right.
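The g_dirtygen check is an instance of a common generation-counter pattern: take a snapshot of the counter before a long operation, bump the counter whenever new dirty state appears, and compare afterwards to detect concurrent modification. A generic sketch, with nothing NetBSD-specific about the names:

    #include <stdio.h>

    /* Generic generation-counter pattern (names invented for this sketch). */
    struct object {
        unsigned long dirtygen;
    };

    /* The write-fault side bumps the counter, cf. genfs_markdirty(). */
    static void mark_dirty(struct object *obj)
    {
        obj->dirtygen++;
    }

    int main(void)
    {
        struct object obj = { 0 };
        unsigned long snapshot;

        mark_dirty(&obj);            /* a write fault dirties a page */

        snapshot = obj.dirtygen;     /* the cleaner starts and takes a snapshot */
        /* ... pages are being written out here ... */
        mark_dirty(&obj);            /* a concurrent write fault during write-out */

        /* Only if nothing was dirtied concurrently may the object be
         * treated as fully clean afterwards. */
        printf("fully clean: %s\n",
               obj.dirtygen == snapshot ? "yes" : "no");   /* prints "no" */
        return 0;
    }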

With a few points I'm not fully sure about, it turns out that NetBSD, too, allows the write at write fault time, marks the page dirty then, and makes it read-only again when writing it out. The difference is that the D bit is used to determine which pages of a file's page cache are dirty.

This article has gotten long, so I'll stop here. In the next and final article, we'll look at two more OSes. Linux and NetBSD, two operating systems of different origins, turned out to use the same method; stay tuned to find out whether this is the mainstream approach or whether there are other kinds of OS.
