Introduction

A summary of the LWN.net article "Defending against page-cache attacks" by Jonathan Corbet, with added notes and related information. The original article is mainly a commentary on the discussions held on the Linux kernel mailing list and on the changes to system call behavior made in Linux 5.0 and later.

Vulnerabilities have been reported in the page cache, which underlies the virtual memory management of the OS. Weaknesses based on page cache timing have been pointed out for some time, but clever new tricks keep being found. The paper that raised the issue exploits Linux and Windows, but other operating systems are vulnerable as well. The behavior of a system call has also changed, which is expected to affect the applications that use it. The change is still only in an rc (release candidate) kernel, so it does not yet affect users of regular kernels, but it will in the future. See References for system calls and other features and terms.

License

Following the original article, this text is licensed under CC BY-SA 4.0.

Overview

- A page cache vulnerability led to a change in the behavior of the mincore() system call
  - mincore() is a function for querying the state of the page cache
  - Unrelated third-party processes can also learn the cache state of pages used by other processes
  - It was changed to report only pages that the caller itself has faulted in, then relaxed to report the cache state conditionally, in consideration of the performance impact on other programs
  - There is concern about the impact on user-space programs
- Other system calls, some filesystems, and I/O drivers are also believed to contain page cache vulnerabilities

Defending against page-cache attacks

The kernel's page cache improves performance by reducing disk I/O (when accessing files) and by increasing the sharing of physical memory.

However, like other performance-enhancing techniques that share resources across security boundaries, page caches can be abused as a way to read confidential information.

The page cache can be the target of a variety of attacks, as shown in a recent [paper](https://arxiv.org/pdf/1901.01161.pdf) by Daniel Gruss and his colleagues. As a result, the behavior of the mincore() system call was abruptly changed during the 5.0 merge window. Later discussions, however, revealed that mincore() is just the tip of the iceberg. It is unclear what really needs to be done to protect a system from page cache attacks, or how much performance will have to be sacrificed.

The page cache keeps copies of portions of files in main memory. When a process needs to access data from a file, the presence of that data in the page cache eliminates the need to read it from disk, which makes processing much faster. When multiple processes access the same file (such as the C library), they share the same copy in the page cache, reducing the total amount of memory required by the running processes. On systems that host containers, much of the runtime is shared in this way.

While the page cache is useful, it has been known for some time that this kind of cache sharing can disclose information between processes. If an attacker can find out which files are present in the page cache, they can learn something about what the processes running on the system are doing.

*Figure borrowed from arXiv:1901.01161v1 [cs.CR], 4 Jan 2019*

If an attacker can observe when a particular page enters the cache, he or she can determine when certain accesses take place. For example, it may be possible to tell when a particular function is called by noting when the page containing it appears in the cache. Gruss et al. were able to demonstrate a number of exploits, including covert channels and keystroke timing, all carried out using information from the cache.

A successful page cache attack has two components. One is the ability to find out whether a given page is in the cache, preferably without disturbing the cache state in the process. The other is the ability to evict specific pages from the cache, which is essential for seeing when the target accesses those pages. In the paper, the target pages were evicted simply by faulting in a sufficient number of other pages. That worked, but there may in fact be easier ways.

Fixing the mincore() system call

The developer community has focused on the ability to obtain information about which pages are resident in the page cache. It is probably impossible to completely prevent an attacker from changing the state of the cache (although memory control groups may help in this regard). But if the attacker cannot observe the cache state, most attacks become considerably harder; for one thing, it becomes difficult to know whether the target page was successfully evicted. Unfortunately, keeping this information hidden is not easy.

The Gruss paper targeted the mincore() system call, which by design reports on the state of pages in the page cache. As a result, during the 5.0 merge window mincore() was changed to report only on pages that the calling process itself has faulted in[^1][^7]. An attacker can still use mincore() to learn when a page has been evicted, but can no longer use it to observe when the page is faulted back in by another process. To do so, the attacker would first have to fault the page in, destroying the very information being sought.
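To make the interface concrete, here is a minimal sketch of how a user-space program can ask mincore() which pages of a file are resident. It is written in Python and calls the libc function through ctypes. The helper name `resident_pages` is made up for this illustration, and the sketch assumes a Linux system where `ctypes.CDLL(None)` exposes libc. On kernels carrying the changes discussed here, the answer may be deliberately less informative than the true cache state.

```python
import ctypes
import mmap
import os
import tempfile


def resident_pages(path):
    """Return one bool per page of `path`: True if mincore() reports it resident.

    Hypothetical helper for illustration only.
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    # Assumption: libc symbols are reachable through the main program handle.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]
    libc.mincore.restype = ctypes.c_int
    n_pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * n_pages)()  # one status byte per page
    with open(path, "rb") as f:
        # ACCESS_COPY yields a private, writable view, which ctypes needs in
        # order to take the mapping's address; the file itself is not changed.
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_COPY) as m:
            buf = (ctypes.c_char * size).from_buffer(m)
            try:
                if libc.mincore(ctypes.addressof(buf), size, vec) != 0:
                    err = ctypes.get_errno()
                    raise OSError(err, os.strerror(err))
            finally:
                del buf  # drop the buffer export so the mmap can be closed
    # Only the least significant bit of each byte is defined.
    return [bool(b & 1) for b in vec]


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * (2 * mmap.PAGESIZE))
        tmp.flush()
        print(resident_pages(tmp.name))
```

Note that nothing in this call touches the file's data, which is what makes mincore() attractive both to legitimate tools and to an attacker.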

Changing the behavior of mincore() is a significant change. It has deliberately not been included in the stable updates, for fear that it could break user-space programs in real-world use and have to be reverted. Kevin Easton compiled a list of Debian packages that use the mincore() system call, but it is still unclear how many of them will actually break. Probably the most problematic entry on the list is vmtouch, which is used in some settings to prefault a known working set[^2] in order to speed up virtual machine startup.

Regarding the severity of the impact, Josh Snyder[^10] reported: "For Netflix, losing accurate information from the mincore syscall would lengthen database cluster[^9] maintenance operations from days to months." The report prompted core developers to reconsider their options, including adding a system mode that makes mincore() a privileged operation[^8]. Another idea, proposed by Dominique Martinet and likely to be adopted, is to provide the information only if the caller of mincore() is allowed to write to the file underlying the mapping. That would solve the Netflix use case while preventing the pages of system executables from being monitored. A patch implementing this was posted by Jiri Kosina.
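The write-permission rule can be modelled in a few lines. The following is a user-space approximation in Python, not the kernel code: the function name `gated_residency` is hypothetical, `os.access()` stands in for the kernel's own permission check, and the article does not say what is reported when the check fails, so the sketch assumes every page is then reported as resident (the same "lie" OpenBSD chose, described later in this post).

```python
import os
import tempfile


def gated_residency(path, real_vec):
    """Model of the proposed rule (hypothetical helper, not kernel code).

    real_vec is the true per-page residency list. It is passed through only
    when the caller may write to the file; otherwise an uninformative
    all-resident answer is returned (an assumption of this sketch).
    """
    if os.access(path, os.W_OK):
        return list(real_vec)
    return [True] * len(real_vec)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        # A file we created ourselves is writable, so the real state is shown.
        print(gated_residency(tmp.name, [False, True]))
```

Under this rule a database process that owns its data files still gets accurate answers, while an unprivileged process probing a shared system library learns nothing.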

An even bigger problem

One would hope that, once a workable solution for mincore() is found, the larger problem would be solved and put to rest. For now, that is not the case. Dave Chinner pointed out[^15] that the preadv2() system call can be used with the RWF_NOWAIT flag to test the contents of the page cache non-destructively. One possible solution is to start readahead whenever an RWF_NOWAIT read fails to find its data in the page cache. That changes the cache state and, at the same time, may improve performance for ordinary users. The Kosina patch mentioned above contains this change as well.
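The probe Chinner describes can be sketched with Python's `os.preadv()`, which wraps preadv2() when flags are given. This is an illustrative sketch only: the helper name `probe_cached` is made up, it assumes Python 3.7 or later on a Linux kernel and filesystem that support RWF_NOWAIT, and it returns None when that assumption does not hold.

```python
import errno
import os
import tempfile


def probe_cached(path, offset=0, length=1):
    """Probe whether data at `offset` can be read without blocking.

    Hypothetical helper. Returns True if the non-blocking read returned
    data, False if it would have blocked (EAGAIN), and None if RWF_NOWAIT
    is not usable here. `offset` is expected to lie inside the file.
    """
    flag = getattr(os, "RWF_NOWAIT", None)
    if flag is None or not hasattr(os, "preadv"):
        return None
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(length)
        try:
            n = os.preadv(fd, [buf], offset, flag)
        except BlockingIOError:
            return False  # EAGAIN: the data was not immediately available
        except OSError as exc:
            if exc.errno in (errno.EOPNOTSUPP, errno.EINVAL, errno.ENOSYS):
                return None  # kernel or filesystem does not support the flag
            raise
        return n > 0
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello")
        tmp.flush()
        print(probe_cached(tmp.name))
```

With the readahead change described above, a False answer would no longer leave the cache untouched, because the failed probe itself would start pulling the data in.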

Chinner sees these patches as a game of whack-a-mole on a field that is full of moles. "Many kernel interfaces are designed to query whether data is immediately available," he said (which generally means that it is in the page cache). This information is used by many applications because it is legitimately available and useful. As another route to vulnerabilities he named overlayfs, which is used as a way to share the page cache between containers. According to him, changing mincore() is the wrong approach.

It is just a hacky band-aid over one particular extraction method, and it does nothing to narrow the actual scope of the information leak. Focusing on the band-aid means missing all the other paths that expose the same information, and all the infrastructure we have built on the core concept of "sharing kernel-side pages across security boundaries".

In a subsequent discussion, he revealed yet another hole. On at least some filesystems, a direct I/O read of a page evicts that page from the page cache, so invalidating cached pages is very easy for an attacker. There was heated debate over whether it is right for filesystems like XFS to behave this way (Linus Torvalds considers it a bug). The one clear point to come out of the discussion, however, is that this behavior is unlikely to change soon.

Even if all of these holes are closed, a blunt instrument remains: the simple timing attack. If a read of a particular page completes quickly, the page was almost certainly in the cache. If it takes longer, it probably had to be read from persistent storage (such as an HDD or SSD). Timing attacks tend to be destructive (the probing read itself pulls the page into the cache) and easy to notice, but they remain available. New holes are also likely to appear in the future. In another discussion, Chinner [commented](https://lwn.net/ml/linux-kernel/20190110012617.GA4205@dastard/) that the recently posted virtio pmem device[^5] could be abused in the same way. The io_uring feature[^6], if merged in its current form, would also make it easier for attackers to query the page cache.
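The timing side channel needs no special interface at all, as the following Python sketch suggests. The helper name `time_page_read` and the 4096-byte default read size are assumptions made for illustration, and the sketch deliberately picks no threshold for "fast" versus "slow", since that depends entirely on the hardware.

```python
import os
import tempfile
import time


def time_page_read(path, offset=0, size=4096):
    """Time one read of `size` bytes at `offset` (hypothetical helper).

    Returns (elapsed_ns, bytes_read). A short time suggests the data was
    already cached; a long one suggests it came from storage. The read
    itself brings the page into the cache, so the probe is destructive.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter_ns()
        data = os.pread(fd, size, offset)
        elapsed = time.perf_counter_ns() - start
    finally:
        os.close(fd)
    return elapsed, len(data)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 8192)
        tmp.flush()
        first = time_page_read(tmp.name)
        second = time_page_read(tmp.name)
        print(first, second)
```

Because the measurement changes the state it measures, an attacker gets only one useful sample per eviction, which is why the non-destructive interfaces above attract more attention.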

In other words, the problem seems almost unsolvable, at least in any absolute sense. Probably the best that can be done is to raise the bar high enough to deter most attacks. So the known mechanisms for querying the page cache state non-destructively are likely to be shut off, perhaps only when the kernel is configured into a "secure mode". Completely defending against timing attacks is too difficult (or too costly). So, as Linus posted, people who long for perfect security will be disappointed, as usual.

We are never, **ever**, going to prevent all side-channel leaks. Some parts of caching (notably its timing effects) are quite fundamental. So there is never going to be some absolute line in the sand **anyway**. There is no black-and-white "you are protected"; there are only levels of convenience.

Blocking the known attack vectors in a way that does not cause problems for existing user-space applications remains an open problem. Like Meltdown and Spectre, it looks like one that will keep kernel developers busy for some time.

Original, primary information

- Original article: Defending against page-cache attacks
- Page cache attack vulnerability paper: Daniel Gruss
- Kernel mailing list thread: [PATCH] mm/mincore: allow for making sys_mincore() privileged (1/30 post)
- Statement by Linus Torvalds regarding the changes to mincore as of 1/6

References

- New hardware-independent side-channel attack "Page Cache Attacks" announced
- The Linux Kernel: 4. Memory Management
- Relationship between Linux kernel page cache and buffer_head, address_space
- I checked and released the page cache on Linux
- Linux and memory basics that can't be heard anymore & detailed usage of vmstat
- [Page fault - Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%9A%E3%83%BC%E3%82%B8%E3%83%95%E3%82%A9%E3%83%BC%E3%83%AB%E3%83%88)
- [Working set - Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%AF%E3%83%BC%E3%82%AD%E3%83%B3%E3%82%B0%E3%82%BB%E3%83%83%E3%83%88)
- mincore - man
- Check if the file is in the page cache
- happycache
- Check if the file is cached with vmtouch
- CAP_SYS_ADMIN
- preadv2 - man
- Avoiding page cache by direct I/O of XFS
- Overlay Filesystem - Neil Brown
- docker - Using OverlayFS Storage

Meanwhile, at that time OpenBSD ...

OpenBSD made mincore lie. The nature of shared memory is that it lets you spy on what other processes are doing. We don't want that, so mincore now always returns "memory is in core (cache)". https://twitter.com/OpenBSD_src/status/1089658147294273536

This is the same OpenBSD that **disabled** Intel's Hyper-Threading feature on the grounds that it is a hotbed of vulnerabilities. As expected of them, I have to say. I admire that.

[^1]: An interrupt (or exception) raised by the hardware when a program accesses a page of its virtual address space to which no physical memory is mapped.

[^2]: The set of virtual memory pages that a process is using at a given point in time.

[^7]: This patch was written by Linus himself.

[^8]: Specifically, when sysctl_mincore_privileged is enabled, EPERM is returned to callers without CAP_SYS_ADMIN.

[^9]: A self-made tool called happycache is used to dump and restore the cache around reboots during DB maintenance.

[^5]: Pseudo persistent memory used by KVM to bypass the guest page cache.

[^6]: An interface introduced to replace the old asynchronous I/O.

[^10]: Netflix cloud engineer.

[^15]: When preadv2 is called with the RWF_NOWAIT flag and the data cannot be read immediately (that is, it is not in the cache), the call fails with EAGAIN instead of blocking, so it can be used in place of a mincore check. Available since 4.14.

The impact of Linux page cache attack countermeasures is large and is expected to be prolonged [Translation]