[LINUX] When will mmap(2) files be updated? (3)

The article before last gave an overview and predicted that there are two broad approaches to detecting writes to an mmap area, with three concrete methods: (1a) the page table scan method, (1b) the physical page scan method, and (2) the write fault capture method. Last time I actually read the Linux and NetBSD source code and confirmed that both use the write fault capture method. In this article, we'll look at two more operating systems, macOS and Solaris.
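As a reminder of what the write fault capture method amounts to, here is a minimal toy model in C. It is only a sketch under my own assumptions, not code from Linux or NetBSD: `wf_page`, `wf_store`, `wf_fault`, and `wf_clean` are hypothetical names, and the "fault" is an ordinary function call rather than a real trap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one page of a shared file mapping (hypothetical names). */
struct wf_page {
    bool     writable;  /* does the PTE currently allow writes? */
    bool     dirty;     /* software dirty flag kept by the kernel */
    unsigned faults;    /* number of write faults taken so far */
};

/* Write fault handler: record the page as dirty, then allow writes. */
void wf_fault(struct wf_page *pg)
{
    assert(pg != NULL);
    pg->dirty = true;
    pg->writable = true;
    pg->faults++;
}

/* A store to the page: only the first store after cleaning faults. */
void wf_store(struct wf_page *pg)
{
    assert(pg != NULL);
    if (!pg->writable)
        wf_fault(pg);
    /* the store itself would proceed here */
}

/* After writeback: clear the flag and write-protect the page again,
 * so that the next store is caught by a new fault. */
void wf_clean(struct wf_page *pg)
{
    assert(pg != NULL);
    pg->dirty = false;
    pg->writable = false;
}
```

In this model the kernel never has to scan anything to find dirty pages: each page announces itself through a fault on the first write after it was last cleaned, at the cost of that extra fault.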

For macOS

macOS is based on NeXTSTEP (OpenSTEP), which was built on top of Mach 2.5, CMU's microkernel. In a microkernel, only the minimum necessary functions are implemented in the kernel, and functions such as the network stack, file systems, and process management are implemented in servers (subsystems) that run in user space. However, a pure microkernel often suffers from performance problems because of frequent address space switches and context switches, so subsystem functions are often implemented in kernel space instead. This is called a hybrid kernel (Windows is also classified as one). Mach 2.5 contained a large amount of 4.3BSD-derived code to ensure compatibility with 4.3BSD, the standard at the time, and that code was part of a single binary running in the same address space as the microkernel part. In other words, it is classified as a hybrid kernel. macOS appears to have updated the NeXTSTEP kernel, which was Mach 2.5 + 4.3BSD, to the OSF/1 microkernel + FreeBSD. A remnant of OSF/1 survives in the directory name osfmk, where the Mach part of the source tree is stored. The Unix commands and so on also seem to be based on FreeBSD.

macOS as a whole is not open source, but the parts derived from Mach and FreeBSD, including the kernel (XNU), are open source under the name Darwin. There is a [mirror on GitHub](https://github.com/apple/darwin-xnu/), so I will refer to it, but the file systems (APFS, HFS+) are not included, which is a little frustrating.

I mentioned last time that NetBSD's former virtual memory system was Mach-based. UVM was also designed under the influence of the Mach virtual memory system and has many similarities to XNU's virtual memory system. For example, the D bit is read through pmap_is_modified() as in NetBSD, and pmap_clear_modify() works the same way as in NetBSD. In addition, there are interfaces such as pmap_get_refmod() and pmap_clear_refmod() that operate on the A bit and the D bit at once. Handling the A bit and the D bit together suggests that, unlike Linux and NetBSD, XNU finds dirty pages by scanning. Let's take a look.

Scan page table

XNU's virtual memory system seems to live in the directory osfmk/vm. The Mach part is stored under osfmk and the BSD subsystem part under bsd. There is also a directory called [bsd/vm](https://github.com/apple/darwin-xnu/tree/xnu-4903.221.2/bsd/vm), which implements the parts that depend on the BSD subsystem, such as the vnode pager (the counterpart of NetBSD's vnode pager).

Looking at the data structures in osfmk/vm, there are struct vm_map, struct vm_map_entry, struct vm_page, and so on, just as in NetBSD, and vm_page is linked into lists through listq and pageq. The family resemblance is strong.

Grepping for pmap_is_modified, pmap_clear_modify, pmap_get_refmod, pmap_clear_refmod, and so on turns up hits mainly in vm_pageout.c and vm_resident.c, so the relevant code seems to be there. However, it is quite hard to follow, because the blocks are incredibly long and struct vm_page has a large number of states (flags).

One notable feature is the large number of queues (lists) linked through pageq. Basically, pages are moved between the active, inactive, and free lists and reordered within them, but each of these lists is further divided into several parts.

By dividing into many lists in this way, it seems that the timing and frequency of scans can be finely controlled according to the usage status of the page.

The active and inactive lists seem to be scanned by the context that is left over after kernel initialization, which simply carries on as the scanner. The body is vm_pageout_scan(), a long function of nearly 1500 lines. In it, the A bit and the D bit are fetched at the same time, and if the D bit is set the page is marked dirty, for example in this fragment:

		if (m->vmp_reference == FALSE && m->vmp_pmapped == TRUE) {
		        refmod_state = pmap_get_refmod(VM_PAGE_GET_PHYS_PAGE(m));

		        if (refmod_state & VM_MEM_REFERENCED)
			        m->vmp_reference = TRUE;
		        if (refmod_state & VM_MEM_MODIFIED) {
				SET_PAGE_DIRTY(m, FALSE);
			}
		}

Tracing vm_fault(), which handles page faults, in rough terms, a write fault seems to result only in an error (SEGV and so on) or in copy-on-write.

From the above, macOS appears to use the physical page scan method: it detects dirty pages by checking the D bit at the same time as it checks the A bit.
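The idea of collecting the A and D bits for a physical page in one pass can be sketched in plain C. This is a toy model under my own assumptions, not XNU code: `toy_page`, `toy_get_refmod`, `toy_scan_page`, and the `TOY_*` constants (including their bit values) are hypothetical, and the "PTEs" are ordinary integers rather than hardware page table entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for this toy model (not real x86 or XNU values). */
#define TOY_PTE_ACCESSED 0x1u   /* "A bit": set on any access */
#define TOY_PTE_DIRTY    0x2u   /* "D bit": set on a write */

#define TOY_REFERENCED   0x1u
#define TOY_MODIFIED     0x2u

#define TOY_MAX_MAPPINGS 4

/* One physical page, possibly mapped by several PTEs. */
struct toy_page {
    uint32_t *ptes[TOY_MAX_MAPPINGS];  /* PTEs that map this page */
    size_t    nptes;
    bool      reference;               /* software copy of the A bit */
    bool      dirty;                   /* software copy of the D bit */
};

/* Collect the A and D bits from every PTE that maps the page. */
unsigned toy_get_refmod(const struct toy_page *pg)
{
    unsigned state = 0;

    assert(pg != NULL && pg->nptes <= TOY_MAX_MAPPINGS);
    for (size_t i = 0; i < pg->nptes; i++) {
        if (*pg->ptes[i] & TOY_PTE_ACCESSED)
            state |= TOY_REFERENCED;
        if (*pg->ptes[i] & TOY_PTE_DIRTY)
            state |= TOY_MODIFIED;
    }
    return state;
}

/* One scanner step: fold the hardware bits into the page's software state. */
void toy_scan_page(struct toy_page *pg)
{
    unsigned state = toy_get_refmod(pg);

    if (state & TOY_REFERENCED)
        pg->reference = true;
    if (state & TOY_MODIFIED)
        pg->dirty = true;
}
```

The point of the sketch is that dirtiness is discovered only when the scanner happens to visit the page, so no write fault is needed, but a page stays "unknown dirty" until the next scan reaches it.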

For OpenIndiana

SunOS (Solaris; strictly speaking, the two names seem to cover different scopes, but I won't worry about the details) was for a while the Unix, and it kept influencing every Unix-like OS. I think Linux was aiming to catch up with SunOS until around 2.6. In this area too, SunOS 4 was, for example, the first to integrate the page cache used for mmap(2) with the buffer cache used for read(2)/write(2).

Sun once released the SunOS source code as OpenSolaris, around version 5.10 (Solaris 10). The license was the CDDL, which is an open source license although not compatible with the GPL. At that time the Solaris product also seems to have been based on OpenSolaris, but when Sun was acquired by Oracle, the Sun (Oracle)-led OpenSolaris project ended.

OpenIndiana, which derives from the last version of OpenSolaris, is still under development, so I will investigate that. For the illumos-gate part, which includes the kernel, I refer to the commit corresponding to the latest release, OpenIndiana Hipster 2020.04. The kernel is located in usr/src/uts; uts stands for Unix Time-sharing System. The x86-dependent part is uts/i86pc, and the architecture-independent part is [uts/common](https://github.com/illumos/illumos-gate/tree/45de8795bcb0e4c49743f37edfdd2c89d5a7863b/usr/src/uts/common).

Scan page table

Within the virtual memory system, the machine-dependent part seems to be HAT (Hardware Address Translation) management. For x86 it is in uts/i86pc/vm/hat_i86.c. It closely resembles pmap; for example, there is one hat_t (struct hat) per virtual address space. Skimming the code with PT_MOD, which corresponds to the D bit of the PTE, as the key, we find the following.

hat_pagesync() takes a struct page (which corresponds to a physical page, and therefore to multiple PTEs) and a bitmap of flags as arguments, and reflects the A bit, D bit, and R/W bit of the PTEs that refer to that physical page into p_nrm of the struct page. The flags argument is a bitmap of the values listed in the comment below. I'm not sure what HAT_SYNC_STOPON_SHARED means, but I think it gives preferential treatment to pages shared by many virtual address spaces.

/*
 * get hw stats from hardware into page struct and reset hw stats
 * returns attributes of page
 * Flags for hat_pagesync, hat_getstat, hat_sync
 *
 * define	HAT_SYNC_ZERORM		0x01
 *
 * Additional flags for hat_pagesync
 *
 * define	HAT_SYNC_STOPON_REF	0x02
 * define	HAT_SYNC_STOPON_MOD	0x04
 * define	HAT_SYNC_STOPON_RM	0x06
 * define	HAT_SYNC_STOPON_SHARED	0x08
 */

With this in mind, let's look at the code that scans the page table. The machine-independent part of the virtual memory system is in uts/common/vm, but the code that scans the page table is, for some reason, in uts/common/os, in vm_pageout.c. Oh, that is the same file name as in macOS. In fact FreeBSD also has a file of that name, and so did [NetBSD before the switch to UVM](http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/vm/Attic/vm_pageout.c), so it may simply be the conventional name (note that the vm_pageout.c files of macOS, FreeBSD, and NetBSD clearly share the same root).

The OpenIndiana (that is, Solaris) source has plenty of comments and is easy to read. Page replacement is done by pageout_scanner(), which runs in a dedicated kernel thread. It seems to implement the Clock algorithm, but unlike the other operating systems we have seen so far, it has no inactive list and appears to use a single list. The "hand" refers to the hands of a clock: two hands, fronthand and backhand, walk the list of physical pages (struct page) a fixed distance apart (handspreadpages).

checkpage() seems to be the key. Basically, the fronthand [clears the D bit and A bit](https://github.com/illumos/illumos-gate/blob/45de8795bcb0e4c49743f37edfdd2c89d5a7863b/usr/src/uts/common/os/vm_pageout.c#L1010), and the backhand, following behind, [checks the A bit](https://github.com/illumos/illumos-gate/blob/45de8795bcb0e4c49743f37edfdd2c89d5a7863b/usr/src/uts/common/os/vm_pageout.c#L1012).

	/*
	 * Turn off REF and MOD bits with the front hand.
	 * The back hand examines the REF bit and always considers
	 * SHARED pages as referenced.
	 */
	if (whichhand == FRONT)
		pagesync_flag = HAT_SYNC_ZERORM;
	else
		pagesync_flag = HAT_SYNC_DONTZERO | HAT_SYNC_STOPON_REF |
		    HAT_SYNC_STOPON_SHARED;

	ppattr = hat_pagesync(pp, pagesync_flag);

Then, if the A bit is set, it returns without doing anything; if the D bit is set, it [calls](https://github.com/illumos/illumos-gate/blob/45de8795bcb0e4c49743f37edfdd2c89d5a7863b/usr/src/uts/common/os/vm_pageout.c#L1088) queue_io_request() to request a write-out. If neither is set, it unmaps the page and frees it.
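The two-handed clock described above can be modeled in a few lines of C. This is a simplified sketch under my own assumptions, not the illumos code: `clock_page`, `clock_checkpage`, and `clock_scan` are hypothetical names, the page list is a plain circular array, pages that are "freed" are not actually removed from it, and the dirty state that the real kernel keeps in the page structure is reduced to a single `sw_mod` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum clock_hand   { CLOCK_FRONT, CLOCK_BACK };
enum clock_action { CLOCK_KEEP, CLOCK_WRITE_OUT, CLOCK_FREE };

/* Toy physical page (hypothetical layout). */
struct clock_page {
    bool hw_ref;  /* A bit in the PTE */
    bool hw_mod;  /* D bit in the PTE */
    bool sw_mod;  /* dirty state remembered in the page structure */
};

/* Decide what one hand does with one page. */
enum clock_action clock_checkpage(struct clock_page *pp, enum clock_hand hand)
{
    assert(pp != NULL);

    /* Both hands first fold the hardware D bit into the software state. */
    if (pp->hw_mod)
        pp->sw_mod = true;

    if (hand == CLOCK_FRONT) {
        /* Front hand: reset the hardware bits and move on. */
        pp->hw_ref = false;
        pp->hw_mod = false;
        return CLOCK_KEEP;
    }

    /* Back hand: a page touched since the front hand passed is kept. */
    if (pp->hw_ref)
        return CLOCK_KEEP;
    if (pp->sw_mod) {
        /* Dirty: request a write-out; the page is clean afterwards. */
        pp->sw_mod = false;
        pp->hw_mod = false;
        return CLOCK_WRITE_OUT;
    }
    return CLOCK_FREE;
}

/* Advance both hands `steps` times around a circular array of n pages.
 * The back hand trails the front hand by `spread` pages. Returns the
 * number of FREE decisions and stores the number of write-outs. */
size_t clock_scan(struct clock_page *pages, size_t n, size_t spread,
                  size_t steps, size_t *written)
{
    size_t front = spread % n, back = 0, freed = 0;

    assert(pages != NULL && written != NULL && n > 0 && spread < n);
    *written = 0;
    for (size_t i = 0; i < steps; i++) {
        (void)clock_checkpage(&pages[front], CLOCK_FRONT);
        switch (clock_checkpage(&pages[back], CLOCK_BACK)) {
        case CLOCK_FREE:      freed++;      break;
        case CLOCK_WRITE_OUT: (*written)++; break;
        case CLOCK_KEEP:                    break;
        }
        front = (front + 1) % n;
        back = (back + 1) % n;
    }
    return freed;
}
```

The distance between the hands is what gives a page its grace period: a page survives only if it is referenced again in the time it takes the back hand to cover `spread` pages.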

From the above, OpenIndiana, or Solaris as of a certain point in time, uses the physical page scan method.

For 4.4BSD

So far we have seen that NetBSD and macOS, which appear to share the same roots, use different methods, and that Solaris, which was the norm until a certain time, differs from Linux and NetBSD. That raises a question about history: when did each system come to adopt its current method?

If you've been using Unix for a long time, you may remember that there used to be a daemon called update(8) that issued the sync(2) system call once every 30 seconds. Data written with write(2) and the like was not written back automatically by the kernel; you had to issue fsync(2) or similar, or wait up to 30 seconds for update(8) to write it. Perhaps files mapped with mmap(2) were handled the same way.
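What update(8) did can be approximated with a very small loop. The following is only a sketch of the idea, not the historical source: `toy_update` is a hypothetical name, and unlike the real daemon, which looped forever, it stops after a given number of rounds so that it can be tried safely.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <unistd.h>

/* Call sync(2) `rounds` times, sleeping `interval_sec` seconds between
 * calls. Returns the number of sync(2) calls made. */
int toy_update(int rounds, unsigned interval_sec)
{
    int calls = 0;

    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        sync();                     /* ask the kernel to write dirty buffers */
        calls++;
        if (i + 1 < rounds)
            sleep(interval_sec);    /* update(8) used 30 seconds */
    }
    return calls;
}
```

Calling `toy_update(2, 30)` would issue sync(2) twice, 30 seconds apart. Anything written between two rounds can sit in memory for up to the full interval, which is the "maximum of 30 seconds" mentioned above.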

For example, the source code of 4.4BSD, a common ancestor of NetBSD and macOS (one of several ancestors, in the case of macOS), is [available on the Web](https://minnie.tuhs.org/cgi-bin/utree.pl?file=4.4BSD/usr/src). vm_pageout.c, which scans so many active and inactive lists on macOS, is [very simple on 4.4BSD](https://minnie.tuhs.org/cgi-bin/utree.pl?file=4.4BSD/usr/src/sys/vm/vm_pageout.c). It does not look at the D bit at all. The page fault code also does nothing for a write fault except copy-on-write. On top of that, sync(2) only flushes the buffer cache (the page cache and the buffer cache were not yet integrated at this time) and leaves the page cache used by mmap untouched.

It seems that the page cache is flushed only when msync(2) is issued, on munmap(2), or at process termination. In NetBSD's case, sync(2) seems to have started flushing the page cache in February 1997.
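On a system that behaves like this, a program that wants its mmap writes to reach the file at a known point has to ask for it explicitly. A minimal sketch using the standard POSIX calls follows; `write_via_mmap` is a hypothetical helper name, and the error handling is deliberately thin.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write `msg` to the start of `path` through a shared mapping, then
 * flush it with msync(2). Returns 0 on success, -1 on failure. */
int write_via_mmap(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    int fd, rc;
    char *p;

    assert(len > 0);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {   /* give the file a size to map */
        close(fd);
        return -1;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(p, msg, len);              /* plain stores, no system call */
    rc = msync(p, len, MS_SYNC);      /* explicit request to write back */

    munmap(p, len);
    close(fd);
    return rc;
}
```

Note that reading the file back afterwards does not prove when the data reached the disk; on a kernel with a unified page cache, read(2) sees the stores even without the msync(2) call. The call matters for when the write-back is requested, not for visibility to other processes.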

I only skimmed the history after that, but the situation looks the same [when UVM was introduced](https://github.com/NetBSD/src/commit/8f7ee94e136f377ecd00e43bc146adecc505c8ea) and when the Mach VM was removed. Apparently NetBSD came to catch write faults when the buffer cache and the page cache were integrated.

FreeBSD has lost its history before 2.0-RELEASE because of legal issues. 2.0 already seems to catch write faults and mark the page dirty, so here too it is not easy to tell when the current method was adopted. Linux has also changed its VCS, so tracing the history before 2.6.12 is quite a chore. The 2.4.20 source I had at hand already contains code that marks the page dirty on a write fault.

Summary

So macOS, which is generally counted among the BSDs, uses the physical page scan method, unlike NetBSD, and Solaris, once the norm for Unix-like operating systems, also uses the physical page scan method. Historically, 4.4BSD at least did not flush the page cache automatically, so it is no wonder that macOS, which descends from 4.3BSD (and FreeBSD), and NetBSD, a descendant of 4.4BSD, ended up different.
