This article is the 24th-day entry of the Fujitsu Advent Calendar 2020. (The views expressed here are my own and do not represent my organization.)
As in previous years, this year's Advent Calendar article covers trends in Linux support for non-volatile memory.
To be honest, I considered writing about something other than non-volatile memory this year. I already talked about Linux trends in non-volatile memory at Open Source Summit Japan 2020 (hereinafter OSSJ), and the slides are public, so that talk essentially covers this year's content. I wondered whether I really needed to go over it again on the Advent Calendar.
I also briefly considered writing about exception handling in Linux, since my article "Mainframe Exception Handling" published in May was quite popular and many readers seemed unfamiliar with how Linux handles exceptions. However, that would have required more research time, so I gave up on it as well.
In the end, since OSSJ is an international conference and my talk there was in English, and since my last Japanese-language summary was at the Information Processing Society of Japan ComSys last year, I decided it would still be worthwhile to summarize this year's updates here. So this year's article is once again about NVDIMM.
This article skips what I covered last year at ComSys, so if you haven't seen it yet, please check it out: Non-volatile memory (NVDIMM) and Linux support trends (ComSys 2019 version).
New libraries have been added to PMDK. First, let me briefly introduce them.
libpmem2
libpmem has long been PMDK's low-level library, and libpmem2 has now been added alongside it. To be precise, it seems to have first appeared slightly before 2020.
One notable feature is the newly introduced concept of GRANULARITY, with the following three values defined. The third one is particularly interesting.
- PMEM_GRANULARITY_PAGE: for HDDs and SSDs
- PMEM_GRANULARITY_CACHE_LINE: for current NVDIMMs, on platforms where the CPU cache still has to be flushed explicitly
- PMEM_GRANULARITY_BYTE: also for NVDIMMs, but for environments where the platform itself flushes the CPU cache when power is lost

I don't know whether a platform matching the third definition actually exists right now. Perhaps it is possible in an environment with an extra battery, or perhaps it is planned for the future. Handling the CPU cache is one of the hard parts of using NVDIMM, so it would be great to have an environment where we don't have to worry about it!
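As a side note, here is a minimal sketch of how an application might query the effective granularity with libpmem2. It is modeled on the public libpmem2 examples (in the libpmem2 headers the constants appear as PMEM2_GRANULARITY_*); the file path is just a placeholder, and the exact mapping call differs between PMDK releases, so please check the man pages of the version you actually use.

```c
/* Minimal libpmem2 sketch (assumptions: PMDK with libpmem2 installed;
 * /mnt/pmem/file is a placeholder path; around PMDK 1.10 the mapping
 * call was pmem2_map(cfg, src, &map), newer releases use
 * pmem2_map_new(&map, cfg, src)). Build roughly with: cc foo.c -lpmem2 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <libpmem2.h>

int main(void)
{
	int fd = open("/mnt/pmem/file", O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	struct pmem2_config *cfg;
	struct pmem2_source *src;
	struct pmem2_map *map;

	if (pmem2_config_new(&cfg) || pmem2_source_from_fd(&src, fd)) {
		pmem2_perror("setup");
		return 1;
	}

	/* PAGE is the weakest requirement, so the mapping succeeds even on
	 * plain HDD/SSD; a pmem-only app would require CACHE_LINE or BYTE. */
	pmem2_config_set_required_store_granularity(cfg,
			PMEM2_GRANULARITY_PAGE);

	if (pmem2_map(cfg, src, &map)) {  /* pmem2_map_new() on newer PMDK */
		pmem2_perror("pmem2_map");
		return 1;
	}

	/* What the platform actually provides for this mapping. */
	printf("store granularity: %d\n",
	       (int)pmem2_map_get_store_granularity(map));

	/* A persist function matched to that granularity. */
	pmem2_persist_fn persist = pmem2_get_persist_fn(map);
	char *addr = pmem2_map_get_address(map);
	addr[0] = 'X';
	persist(addr, 1);

	/* cleanup (unmap/delete calls) omitted for brevity */
	close(fd);
	return 0;
}
```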
This library also seems to provide interfaces for querying the NVDIMM's bad blocks and whether an unsafe shutdown (sudden power failure) has occurred. Internally, this appears to be implemented on top of the library that comes with ndctl, the management command for NVDIMM namespaces.
The interface has been redesigned compared to the old libpmem and is not compatible with it.
librpma
PMDK has long had librpmem as its RDMA library, but it remained in experimental status for years. Now a separate, new library called librpma has appeared. Why was it created from scratch?
The easiest explanation is the one in Intel's announcement (https://www.openfabrics.org/wp-content/uploads/2020-workshop-presentations/202.-gromadzki-ofa-workshop-2020.pdf). I won't go into detail here, but librpmem apparently failed to meet users' needs in several respects.
In fact, when I once asked on the pmem mailing list why librpmem was still experimental, the answer I got was essentially, "Because nobody is using it. If you start using it, we'll drop the experimental label right away." Against that background, it seems a new library was created to better match what users actually need.
It is currently at v0.9, and v1.0 of the library proper is expected to be released in 1Q 2021.
Filesystem DAX unfortunately remained in experimental status again this year. However, the remaining issues are gradually being resolved, or workarounds are being found, so things are moving forward little by little. Broadly speaking, the following three problems remain; let me explain each of them below.
As mentioned earlier, Filesystem DAX aims to let applications persist data just by flushing the CPU cache, without calling sync()/fsync()/msync(). That works for the user's data, but not for the file's metadata: until now the filesystem could update its own metadata when sync() and friends were called, and with Filesystem DAX that timing disappears.
For access from user space, this problem has been solved with mmap()'s MAP_SYNC option, and for DMA/RDMA issued from the kernel layer, by making truncate() wait until the data transfer completes.
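As a reminder of what the MAP_SYNC path looks like from user space, here is a minimal sketch (the file path is a placeholder). MAP_SYNC must be combined with MAP_SHARED_VALIDATE, and the mmap() call fails on filesystems that cannot guarantee the required metadata consistency.

```c
/* Minimal MAP_SYNC sketch for a file on a Filesystem-DAX mount.
 * /mnt/pmem/file is a placeholder path. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/pmem/file", O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	size_t len = 4096;
	/* MAP_SYNC is only accepted together with MAP_SHARED_VALIDATE;
	 * if the filesystem cannot honor it, mmap() fails (EOPNOTSUPP). */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (addr == MAP_FAILED) { perror("mmap(MAP_SYNC)"); return 1; }

	/* With MAP_SYNC, flushing the CPU cache is enough to persist this
	 * store; no msync()/fsync() is needed for the data itself. */
	memcpy(addr, "hello pmem", 11);
	/* e.g. flush here with libpmem's pmem_persist() or CLWB */

	munmap(addr, len);
	close(fd);
	return 0;
}
```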
However, this does not work for features such as InfiniBand and video capture (V4L2) that DMA/RDMA data directly into user space. Driver/kernel-level code is not allowed to hold the CPU for long, so it can be expected to release it eventually, which makes "just wait" a viable option there; for user-space programs there is no such guarantee. Video recording is a good example: only the user decides when recording stops, the kernel cannot predict it, and so it cannot simply wait for the transfer to end.
There seem to have been several proposals to solve this problem, but none has reached community consensus and been implemented yet. However, the problem can apparently be avoided if the hardware supports a feature called __On-Demand Paging (ODP)__. With this feature, physical pages are not mapped into the user-space I/O region up front; they are mapped only when the process actually accesses them. Because metadata can be updated at the moment a physical page is attached, there is a chance to reconcile information such as blocks removed by truncate(). Relatively new Mellanox InfiniBand cards apparently have this feature, and other hardware may gain something similar in the future. With this workaround available, the priority of this issue among the Filesystem-DAX problems seems to have dropped considerably.
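For reference, here is a rough sketch of how an RDMA application checks for and requests ODP with libibverbs. It is purely illustrative and not part of the Filesystem-DAX work itself; error handling is abbreviated.

```c
/* Sketch: check for On-Demand Paging support and register an ODP
 * memory region with libibverbs. Error handling is abbreviated. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
	int num;
	struct ibv_device **devs = ibv_get_device_list(&num);
	if (!devs || num == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }

	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_device_attr_ex attr = {0};
	ibv_query_device_ex(ctx, NULL, &attr);

	if (!(attr.odp_caps.general_caps & IBV_ODP_SUPPORT)) {
		fprintf(stderr, "ODP not supported on this HCA\n");
		return 1;
	}

	struct ibv_pd *pd = ibv_alloc_pd(ctx);
	size_t len = 1 << 20;
	void *buf = malloc(len);

	/* IBV_ACCESS_ON_DEMAND asks the HCA to fault pages in on access
	 * instead of pinning them at registration time. */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE |
				       IBV_ACCESS_ON_DEMAND);
	if (!mr) { perror("ibv_reg_mr(ODP)"); return 1; }

	printf("ODP memory region registered\n");
	ibv_dereg_mr(mr);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	free(buf);
	return 0;
}
```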
It was decided to create a feature that lets DAX be turned on/off per inode, that is, per file or directory. The expected use cases were as follows:
When I first heard about this, it sounded quite hard to implement, and I expected it to take several years to resolve. Just off the top of my head, the following issues come to mind.
If you turn DAX on for a file that had it off, what happens to that file's page cache in memory? At minimum, something like memory migration is needed from the RAM used as page cache to the NVDIMM region. Memory migration between RAM regions already exists, but it would have to be extended to cover migration to and from NVDIMM.
Because normal files and DAX files are handled differently inside the filesystem, the methods used for them naturally differ. For example, the address_space_operations used to handle the page cache are defined separately for DAX, as shown below (from XFS):
const struct address_space_operations xfs_address_space_operations = { /* for regular files */
.readpage = xfs_vm_readpage,
.readahead = xfs_vm_readahead,
.writepage = xfs_vm_writepage,
.writepages = xfs_vm_writepages,
.set_page_dirty = iomap_set_page_dirty,
.releasepage = iomap_releasepage,
.invalidatepage = iomap_invalidatepage,
.bmap = xfs_vm_bmap,
.direct_IO = noop_direct_IO,
.migratepage = iomap_migrate_page,
.is_partially_uptodate = iomap_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
.swap_activate = xfs_iomap_swapfile_activate,
};
const struct address_space_operations xfs_dax_aops = { /* for DAX */
.writepages = xfs_dax_writepages,
.direct_IO = noop_direct_IO,
.set_page_dirty = noop_set_page_dirty,
.invalidatepage = noop_invalidatepage,
.swap_activate = xfs_iomap_swapfile_activate,
};
Switching DAX on/off means switching between these sets of methods, but at the moment of the switch one of those methods may still be running. How should the exclusion be done, and at what timing should the switch happen? This is a genuinely hard problem.
In the end, this problem was solved by settling on a fairly pragmatic approach:
__"Switch DAX on/off only when the inode cache is not in memory."__
If the inode cache is not in memory, nobody is using the file, so switching is easy. Put another way, "you cannot switch while someone is using the file", but when you want to change the DAX setting you usually also want to change the behavior of the application using that file, so in practice this is not a problem.
While someone has the target file open or otherwise in use, its inode cache stays in memory, so the switch is not performed at that point; it is deferred. Once there are no users, the inode cache is synced to disk and evicted, and when the inode is later reloaded from disk, the DAX flag is examined and the switch finally takes effect.
With this, DAX can be set on/off per file; in XFS you change it with a command such as `xfs_io -c 'chattr +x' <file>`. In addition, to use this feature you need to specify dax=inode as a mount option, as follows.
# mount ... -o dax=inode
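Under the hood, `xfs_io -c 'chattr +x'` sets the per-file DAX flag via the FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls. The following is a minimal sketch of doing the same thing directly from C; it assumes a kernel new enough to have FS_XFLAG_DAX, and the file path is a placeholder.

```c
/* Sketch: set the per-file DAX hint (equivalent to xfs_io -c 'chattr +x').
 * Assumes a kernel with FS_XFLAG_DAX; /mnt/pmem/file is a placeholder.
 * As described above, the change only takes effect after the inode is
 * evicted from memory and reloaded. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void)
{
	int fd = open("/mnt/pmem/file", O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	struct fsxattr fsx;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}

	fsx.fsx_xflags |= FS_XFLAG_DAX;  /* request DAX for this inode */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}

	printf("FS_XFLAG_DAX set; takes effect after the inode is reloaded\n");
	close(fd);
	return 0;
}
```

On kernels that support it, whether DAX is actually in effect for an open file can be checked with statx() and the STATX_ATTR_DAX attribute.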
However, I personally think this feature still has some problems, because the current kernel documentation, Documentation/filesystems/dax.txt, lists the following as ways to evict the cache:
b) evict the data set from kernel caches so it will be re-instantiated when
the application is restarted. This can be achieved by:
i. drop-caches
ii. a filesystem unmount and mount cycle
iii. a system reboot
The first of these, drop_caches[^drop_cache], evicts caches system-wide via the following procfs interface.
[^drop_cache]: Writing 1 evicts the page cache, 2 evicts slab caches including the inode cache, and 3 evicts both.
# echo 3 > /proc/sys/vm/drop_caches
Whether it targets the page cache or slab caches including inodes, drop_caches shrinks caches across the whole system, so a system-wide impact is unavoidable. Remounting or rebooting feel like emergency measures for when something has gone wrong on the kernel side, whereas drop_caches does not quite have that feel; it sits awkwardly in between. Even if you only want to switch DAX for specific files or directories, an operation that affects the entire system is one you want to avoid. In particular, many servers these days run as container hosts, and it would be a serious problem if this operation affected unrelated containers running next door.
When I asked the community why things are this way, the answer was that this description is currently unavoidable because there are windows in which the cache inevitably remains due to race conditions. I also proposed adding a way to drop caches on a per-file basis, but it was rejected on the grounds that it would be an easy vector for DoS attacks.
So instead I have proposed a patch that prevents the cache from lingering by setting and checking I_DONTCACHE and DCACHE_DONTCACHE appropriately when DAX is switched on/off. With this patch, drop_caches may no longer be necessary.
The best-known CoW filesystem in the Linux kernel is btrfs, but XFS also has reflink/deduplication functionality[^reflink].
[^reflink]: Incidentally, unlike btrfs, XFS's reflink/deduplication applies CoW only to user data, not to metadata.
When deduplication kicks in, a single block in the filesystem ends up shared by, say, offset 100 of file A and offset 200 of file B. If two blocks hold identical data, one of them is redundant and can be freed.
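For reference, reflink and dedup can be driven from user space through ioctls. The following is a rough sketch (paths and lengths are placeholders) using the FICLONE and FIDEDUPERANGE ioctls that tools such as cp --reflink and dedup utilities rely on.

```c
/* Sketch: share blocks between two files via reflink (FICLONE) and
 * deduplication (FIDEDUPERANGE). Paths/lengths are placeholders; both
 * files must live on the same reflink-capable filesystem (e.g. XFS
 * created with reflink=1). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void)
{
	int src = open("/mnt/xfs/fileA", O_RDONLY);
	int dst = open("/mnt/xfs/fileB", O_RDWR | O_CREAT, 0644);
	if (src < 0 || dst < 0) { perror("open"); return 1; }

	/* Reflink: fileB now shares all of fileA's blocks (CoW on write). */
	if (ioctl(dst, FICLONE, src) < 0)
		perror("FICLONE");

	/* Dedup: ask the filesystem to share one 4 KiB range if the
	 * contents are byte-for-byte identical. */
	struct file_dedupe_range *dedup =
		calloc(1, sizeof(*dedup) + sizeof(struct file_dedupe_range_info));
	dedup->src_offset = 0;
	dedup->src_length = 4096;
	dedup->dest_count = 1;
	dedup->info[0].dest_fd = dst;
	dedup->info[0].dest_offset = 0;

	if (ioctl(src, FIDEDUPERANGE, dedup) < 0)
		perror("FIDEDUPERANGE");
	else if (dedup->info[0].status == FILE_DEDUPE_RANGE_SAME)
		printf("range deduplicated: %llu bytes shared\n",
		       (unsigned long long)dedup->info[0].bytes_deduped);

	free(dedup);
	close(src);
	close(dst);
	return 0;
}
```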
Making these features work together with Filesystem DAX is one of the conditions for removing its experimental status. At the moment, however, they are mutually exclusive: if reflink is enabled when an XFS filesystem is created with mkfs, specifying DAX at mount time results in an error. So what is the problem?
There are two main problems.
Problem 2 in particular arises because Filesystem-DAX has the characteristics of both memory and storage (as was also true of the metadata update problem above).
A member of my team has posted patches to resolve this, so let me describe them.
As mentioned above, code has to be added in places where the combination of CoW and Filesystem DAX was never considered before. Support is needed in the following three layers, and each has to be implemented in a fairly straightforward, brute-force way.
iomap is a block-layer interface that replaces buffer_head; unlike buffer_head, which had to issue I/O in units of at most one page, iomap can perform I/O spanning multiple pages at once. The lower layers of Filesystem DAX use this interface, but it had no mechanism for CoW, so an interface for CoW had to be added here first. The patches add a way to pass the copy-source information, called srcmap.
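Concretely, the filesystem's ->iomap_begin callback can report both the destination mapping and a source mapping. In current kernels its signature looks roughly like the following (a simplified excerpt shown for illustration; see include/linux/iomap.h for the real definition).

```c
/* Simplified excerpt, for illustration: the second struct iomap (srcmap)
 * tells the DAX/CoW code where to copy the existing data from. */
struct iomap_ops {
	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
			   unsigned flags, struct iomap *iomap,
			   struct iomap *srcmap);
	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
			 ssize_t written, unsigned flags, struct iomap *iomap);
};
```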
CoW logic has to be added to the Filesystem DAX code both in the write() path and in the page-fault path for mmap(). The implementation is straightforward: obtain the srcmap information, copy the old data, and then apply the update. Dedup handling for Filesystem DAX also needs a DAX-specific path of its own.
XFS and btrfs then need modifications so that they use the above interfaces when operating as Filesystem-DAX.
When a memory failure occurs and the processes using the affected memory must be killed, the memory-management code has to find out who is using the region. For the page cache and similar cases, this was done by identifying the single file involved and killing the processes using it: the mapping and index fields of the affected memory's struct page were enough to find the one related file and track down its users. With the combination of Filesystem DAX and CoW filesystem features, however, multiple files may have to be found from a single struct page. That is a very hard problem, because struct page has no room to record multiple files.
Going off on a slight tangent: struct page is an important structure in Linux memory management, with one allocated per physical page to record that page's state. On x86, roughly 40 bytes of struct page are allocated per 4K page, which means about 1% of physical memory is consumed by struct page; with 1 TB of RAM, roughly 10 GB goes to struct page alone. For this reason, great care has traditionally been taken not to grow struct page, and its definition has become more chaotic year by year through heavy use of unions. Having to record multiple files' worth of information in such a cramped, chaotic place meant agonizing over how to do it without changing the structure's size.
It was a difficult task, and the approach had to be reworked three times. Let's look at each attempt in order.
The first idea was to register, per page, a tree structure listing the files that share the page through dedup. It is the straightforward design anyone would think of first (I was thinking along those lines myself at the beginning). However, its memory consumption and overhead are high, and after discussion in the community it was dropped.
XFS has a tree structure called rmapbt (reverse-mapping B+tree) that can look up which files use a given block. In this plan, that tree is searched in advance, the result is registered as an owner list, and the list is walked when a memory failure occurs. Compared with plan 1, this reduced memory usage and overhead.
However, this too was dropped, because building the list itself still consumes extra memory.
Instead of building an owner list in advance, the third idea is to walk rmapbt at the time of the memory failure and find the owners then, however costly that is. In addition, a callback routine to be run on memory failure is registered and invoked once per owner. This callback has become a common interface that can also be used with Device DAX. This proposal is currently being implemented, and the community now agrees on the general direction.
Some issues remain[^issues], but once this implementation is merged upstream, the last big problem of Filesystem-DAX may finally be solved.
[^issues]: At the time I prepared the presentation material, there was reportedly a problem that rmapbt could not be followed when LVM was in use. That seems to have been resolved since, but my understanding has not fully caught up yet. Another problem is that features like XFS's realtime device have no rmapbt, so this method cannot cover them.
That's my summary of the NVDIMM-related developments in Linux in 2020. How was it? I hope you found it useful. Have a happy new year!