[LINUX] Kernel techniques that helped with a Kubernetes contribution: cgroupv2 analysis

Introduction

While verifying cgroupv2 for our container infrastructure, I found that kubeadm's preflight checks needed to be modified, so I opened the following Pull Request.

https://github.com/kubernetes/system-validators/pull/12

The change switches the source of subsystem information under cgroupv2: /proc/cgroups is obsolete there, so the information should be read from /sys/fs/cgroup/cgroup.controllers instead. Fortunately, runC had already fixed a similar issue, so I based my Pull Request to kubeadm on the commit below.

https://github.com/opencontainers/runc/commit/74a3fe5d1b894f01bcef94469f76ca62df80948a
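
To make the difference between the two sources concrete, here is a minimal Python sketch. It is not the Go code from the Pull Request or from the runC commit, and the function names are hypothetical. It assumes that cgroup.controllers holds one line of space-separated controller names and that /proc/cgroups holds four whitespace-separated columns after a header line starting with "#".

```python
def parse_cgroup_controllers(text):
    """Return controller names from the contents of cgroup.controllers.

    The file is assumed to hold a single line of space-separated names.
    """
    return text.split()


def parse_proc_cgroups(text):
    """Return {name: enabled} from the contents of /proc/cgroups.

    Each data line is assumed to have four columns:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header comment
        name, _hierarchy, _num_cgroups, enabled = line.split()
        result[name] = enabled == "1"
    return result


if __name__ == "__main__":
    # Sample strings stand in for the real files so the sketch runs anywhere.
    print(parse_cgroup_controllers("cpuset cpu io memory pids rdma\n"))
    print(parse_proc_cgroups("#subsys_name\thierarchy\tnum_cgroups\tenabled\n"
                             "cpuset\t0\t1\t1\n"))
```

On a real system the input strings would come from reading /sys/fs/cgroup/cgroup.controllers and /proc/cgroups.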

On this Pull Request, I received feedback pointing out that some subsystems are enabled inside the kernel but do not appear in /sys/fs/cgroup/cgroup.controllers, and asking me to explain the reason in a source code comment. However, I could not immediately find a clear specification in the official documentation, so I read the kernel code to understand the behavior of cgroupv2 and then used kprobes to confirm that the running kernel behaves as the desk study predicted. In this article, I explain the kernel analysis procedure based on my notes from the investigation. The kernel source investigated is version 5.3.

Behavior on the kernel side when reading /sys/fs/cgroup/cgroup.controllers

First, I looked at the kernel-space code that runs when /sys/fs/cgroup/cgroup.controllers is read from user space. The relevant interface is implemented here:

https://github.com/torvalds/linux/blob/v5.3/kernel/cgroup/cgroup.c#L4864

Reading around the cgroup_controllers_show function assigned to .seq_show, I found that the subsystems to display are selected by the cgroup_control function. In addition, the set of subsystems to display is narrowed down using the inverted values of cgrp_dfl_implicit_ss_mask and cgrp_dfl_inhibit_ss_mask.

https://github.com/torvalds/linux/blob/v5.3/kernel/cgroup/cgroup.c#L433-L434

Therefore, I decided to investigate what values are set in these variables and how.
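
The selection logic can be modeled in user space. The following Python sketch reflects my reading of cgroup_control in v5.3 and is a simplification, not the kernel code itself: the threaded-cgroup case is left out, and the parameter names and the bit values in the example are made up for illustration.

```python
MASK_16 = 0xFFFF  # the kernel keeps these masks in a u16


def cgroup_control(root_subsys_mask, on_dfl, inhibit_mask, implicit_mask,
                   parent_subtree_control=None):
    """Simplified model of cgroup_control(); threaded cgroups are ignored."""
    if parent_subtree_control is not None:
        # Non-root cgroup: the parent's subtree_control decides.
        return parent_subtree_control & MASK_16
    if on_dfl:
        # Root of the default (v2) hierarchy: hide inhibited and implicit bits.
        root_subsys_mask &= ~(inhibit_mask | implicit_mask)
    return root_subsys_mask & MASK_16


if __name__ == "__main__":
    # Illustrative values only: four subsystems, bit 2 inhibited, bit 3 implicit.
    print(bin(cgroup_control(0b1111, True, 0b0100, 0b1000)))
```

In this model, the root cgroup of the default hierarchy shows only the bits that are set in neither mask, which is why the two variables are the next thing to look at.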

Cgroupv2 initial settings at system boot

First, I checked where the bits of cgrp_dfl_implicit_ss_mask and cgrp_dfl_inhibit_ss_mask are actually set. In the same source file, I found code that sets the bits subsystem by subsystem when the kernel boots.

https://github.com/torvalds/linux/blob/v5.3/kernel/cgroup/cgroup.c#L5783-L5786

	for_each_subsys(ss, ssid) {
		/* ... (omitted) ... */
		if (ss->implicit_on_dfl)
			cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
		else if (!ss->dfl_cftypes)
			cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;
		/* ... (omitted) ... */
	}

In short, bits are set according to the implicit_on_dfl and dfl_cftypes members of the cgroup_subsys structure. If implicit_on_dfl is set, the bit at the subsystem's ID is set in cgrp_dfl_implicit_ss_mask; otherwise, if the subsystem has no dfl_cftypes, the bit at its ID is set in cgrp_dfl_inhibit_ss_mask.
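
The same rule can be replayed in user space. This is a minimal Python sketch of the loop above, not kernel code: the Subsys tuple is a hypothetical stand-in for struct cgroup_subsys, and the IDs and flags in the example are taken from the results reported later in this article rather than from the kernel headers.

```python
from collections import namedtuple

# Hypothetical stand-in for the fields of struct cgroup_subsys used here.
Subsys = namedtuple("Subsys", ["name", "id", "implicit_on_dfl", "has_dfl_cftypes"])


def build_dfl_masks(subsystems):
    """Return (implicit_mask, inhibit_mask) following the boot-time rule."""
    implicit_mask = 0
    inhibit_mask = 0
    for ss in subsystems:
        if ss.implicit_on_dfl:
            implicit_mask |= 1 << ss.id
        elif not ss.has_dfl_cftypes:
            # Only reached when implicit_on_dfl is false (else-if in the kernel).
            inhibit_mask |= 1 << ss.id
    return implicit_mask, inhibit_mask


if __name__ == "__main__":
    example = [
        Subsys("cpu", 1, False, True),          # has v2 interface files
        Subsys("cpuacct", 2, False, False),     # no v2 interface files
        Subsys("perf_event", 8, True, False),   # implicitly enabled
    ]
    print(build_dfl_masks(example))
```

Note that a subsystem with implicit_on_dfl set never reaches the dfl_cftypes test, so it can only land in the implicit mask.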

Checking the bit masks with kprobes

By reading the kernel code, I understood what happens when /sys/fs/cgroup/cgroup.controllers is read from user space. Next, I checked on a running kernel whether the bits are set as expected.

Verification environment

Ubuntu 18.04 (x86_64, kernel: 5.0.0-32-generic)
cgroupv2
Kubernetes: v1.15.2

In this verification, I trace with ftrace directly through tracefs, so if you can mount tracefs, you need no other tools. tracefs is assumed to be mounted at /sys/kernel/debug/tracing.
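
One way to confirm where tracefs is mounted is to look at the mount table. The sketch below is an assumption on my part rather than part of the original procedure: it takes the contents of /proc/mounts as a string and returns the mount points whose filesystem type is tracefs. The function name is hypothetical.

```python
def find_tracefs_mounts(mounts_text):
    """Return mount points whose filesystem type is tracefs.

    mounts_text is the contents of /proc/mounts, where each line is
    assumed to be: device, mount point, fstype, options, dump, pass.
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "tracefs":
            points.append(fields[1])
    return points


if __name__ == "__main__":
    # Sample mount table used instead of the real /proc/mounts.
    sample = ("sysfs /sys sysfs rw,nosuid 0 0\n"
              "tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0\n")
    print(find_tracefs_mounts(sample))
```

If the returned list does not contain /sys/kernel/debug/tracing, the paths in the commands that follow need to be adjusted to the actual mount point.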

Notes on cgroupv2 subsystem settings

At present, most environments mount each subsystem on cgroupv1, so I expect that in many of them the subsystems cannot be used from cgroupv2. In my environment, I added cgroup_no_v1=all to the kernel command line for verification, but I do not recommend this in a production environment.

https://www.kernel.org/doc/html/v5.3/admin-guide/kernel-parameters.html
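
To check whether the parameter actually took effect on the running kernel, the command line can be inspected. This is a minimal sketch under the assumption that the command line text comes from /proc/cmdline; the function name is hypothetical, and only the first occurrence of the parameter is considered.

```python
def v1_disabled_controllers(cmdline):
    """Return the comma-separated values of cgroup_no_v1= as a list.

    Returns an empty list when the parameter is absent.
    """
    for token in cmdline.split():
        if token.startswith("cgroup_no_v1="):
            return token.split("=", 1)[1].split(",")
    return []


if __name__ == "__main__":
    # Sample command line; the other parameters are illustrative only.
    print(v1_disabled_controllers("root=/dev/sda1 ro quiet cgroup_no_v1=all"))
```

A result of ["all"] corresponds to the setting used for this verification.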

Probing the cgroup_control function

Now let's hook the cgroup_control function. Since I want to check the bitmask in its return value, I set a kretprobe.

# echo 'r:retcgcontrol cgroup_control ret=$retval:u16' > /sys/kernel/debug/tracing/kprobe_events
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/retcgcontrol/enable 
# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids rdma
# cat /sys/kernel/debug/tracing/trace | grep cat
             cat-xxxx  [xxx] d...   xxx.xxxxxx: retcgcontrol: (cgroup_controllers_show+0x44/0x60 <- cgroup_control) ret=6171

When the subsystems "cpuset cpu io memory pids rdma" were displayed, the return value of the cgroup_control function was 6171 (binary 1100000011011). Next, I confirmed that these 6 set bits correspond to "cpuset cpu io memory pids rdma". The cgroup subsystem IDs are assigned in the order the subsystems appear in cgroup_subsys.h. For implementation details, refer to the following site.

[cgroup SUBSYS macro](https://tenforward.hatenablog.com/entry/2017/03/16/200009 "SUBSYS macro for cgroup")

Mapping the 6 set bits (bits 0, 1, 3, 4, 11, and 12) to subsystems gave "cpuset, cpu, io, memory, pids, rdma", which matches the displayed list.
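
The bit-to-name mapping can be reproduced with a few lines of Python. This is a sketch, and the subsystem order is an assumption: it is the order listed in the summary table of this article for the test kernel. IDs depend on which subsystems are enabled in the kernel configuration, so the list may differ on other kernels.

```python
# Subsystem order assumed from this article's summary table (ID 0 first).
SUBSYS_NAMES = [
    "cpuset", "cpu", "cpuacct", "io", "memory", "devices", "freezer",
    "net_cls", "perf_event", "net_prio", "hugetlb", "pids", "rdma",
]


def decode_mask(mask):
    """Return the names of the subsystems whose ID bit is set in mask."""
    return [name for bit, name in enumerate(SUBSYS_NAMES) if mask & (1 << bit)]


if __name__ == "__main__":
    print(format(6171, "b"))   # binary form of the traced return value
    print(decode_mask(6171))   # names for the set bits
```

The same helper can be applied to any of the masks dumped with kprobes.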

Show the values of cgrp_dfl_implicit_ss_mask and cgrp_dfl_inhibit_ss_mask

Now I knew which subsystems were displayed, but I also wanted to know which subsystems have bits set in cgrp_dfl_implicit_ss_mask and cgrp_dfl_inhibit_ss_mask, so I dumped these values dynamically as well. First, get the addresses of cgrp_dfl_implicit_ss_mask and cgrp_dfl_inhibit_ss_mask. (Added on 2020/02/17) ftrace can resolve the address directly from the symbol, so the address lookup that follows is unnecessary. In fact, KASLR makes raw addresses troublesome, so it seems better to write the symbol as is.

# cat /proc/kallsyms | grep cgrp_dfl_implicit_ss_mask
ffffffffa3fdd8a6 b cgrp_dfl_implicit_ss_mask
# cat /proc/kallsyms | grep cgrp_dfl_inhibit_ss_mask
ffffffffa3fdd8a8 b cgrp_dfl_inhibit_ss_mask

Next, add the address dumps to the cgroup_control probe definition and run it.

# echo 'r:retcgcontrol cgroup_control cgrp_dfl_implicit_ss_mask=@0xffffffffa3fdd8a6:u16 cgrp_dfl_inhibit_ss_mask=@0xffffffffa3fdd8a8:u16 ret=$retval:u16' > /sys/kernel/debug/tracing/kprobe_events
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/retcgcontrol/enable 
# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids rdma
# cat /sys/kernel/debug/tracing/trace | grep cat
             cat-xxxx [000] d...  xxxx.xxxxxx: retcgcontrol: (cgroup_controllers_show+0x44/0x60 <- cgroup_control) cgrp_dfl_implicit_ss_mask=256 cgrp_dfl_inhibit_ss_mask=1764 ret=6171

(Added on 2020/02/17) Resolving the address directly from the symbol with ftrace:

# echo 'r:retcgcontrol cgroup_control cgrp_dfl_implicit_ss_mask=@cgrp_dfl_implicit_ss_mask:u16 cgrp_dfl_inhibit_ss_mask=@cgrp_dfl_inhibit_ss_mask:u16 ret=$retval:u16' > /sys/kernel/debug/tracing/kprobe_events
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/retcgcontrol/enable 
# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids rdma
# cat /sys/kernel/debug/tracing/trace | grep cat
             cat-xxxx [000] d...  xxxx.xxxxxx: retcgcontrol: (cgroup_controllers_show+0x44/0x60 <- cgroup_control) cgrp_dfl_implicit_ss_mask=256 cgrp_dfl_inhibit_ss_mask=1764 ret=6171
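
When many trace lines need to be compared, the name=value fields can be pulled out programmatically. This is a minimal Python sketch with a hypothetical function name. It assumes that every field the probe prints is a decimal integer, as in the output above.

```python
import re


def parse_probe_fields(line):
    """Extract name=value decimal fields from one kprobe trace line."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)", line)}


if __name__ == "__main__":
    line = ("cat-xxxx [000] d...  xxxx.xxxxxx: retcgcontrol: "
            "(cgroup_controllers_show+0x44/0x60 <- cgroup_control) "
            "cgrp_dfl_implicit_ss_mask=256 cgrp_dfl_inhibit_ss_mask=1764 ret=6171")
    print(parse_probe_fields(line))
```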

The output shows cgrp_dfl_implicit_ss_mask = 256 (binary 100000000) and cgrp_dfl_inhibit_ss_mask = 1764 (binary 11011100100). From cgroup_subsys.h, cgrp_dfl_implicit_ss_mask masks "perf_event", and cgrp_dfl_inhibit_ss_mask masks "cpuacct, devices, freezer, net_cls, net_prio, hugetlb".
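
As a sanity check on the three traced values, the arithmetic can be verified directly. The sketch below assumes that all 13 subsystems are attached to the default hierarchy, which is consistent with booting with cgroup_no_v1=all but is not something the trace itself proves. The helper name is hypothetical.

```python
def masks_partition(all_mask, *masks):
    """True if the masks are pairwise disjoint and together equal all_mask."""
    seen = 0
    for m in masks:
        if seen & m:
            return False  # a bit appears in more than one mask
        seen |= m
    return seen == all_mask


if __name__ == "__main__":
    implicit, inhibit, shown = 256, 1764, 6171   # values from the trace
    all_13 = (1 << 13) - 1                       # assumed: 13 subsystems, all on v2
    print(masks_partition(all_13, implicit, inhibit, shown))
    print(shown == all_13 & ~(implicit | inhibit))
```

Under that assumption, every subsystem falls into exactly one of three groups: implicitly enabled, inhibited, or displayed.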

(Bonus) implicit_on_dfl of perf_event

At present, only perf_event has the implicit flag set. The setting is located here:

https://github.com/torvalds/linux/blob/v5.3/kernel/events/core.c#L12216

	/*
	 * Implicitly enable on dfl hierarchy so that perf events can
	 * always be filtered by cgroup2 path as long as perf_event
	 * controller is not mounted on a legacy hierarchy.
	 */
	.implicit_on_dfl = true,

Summary

The variables related to the display of /sys/fs/cgroup/cgroup.controllers are summarized below.

Variable                   cpuset  cpu  cpuacct  io  memory  devices  freezer  net_cls  perf_event  net_prio  hugetlb  pids  rdma
cgrp_dfl_implicit_ss_mask  0       0    0        0   0       0        0        0        1           0         0        0     0
cgrp_dfl_inhibit_ss_mask   0       0    1        0   0       1        1        1        0           1         1        0     0

Only subsystems that are not masked by either of these two variables will be displayed in /sys/fs/cgroup/cgroup.controllers.

In conclusion

I am no expert on cgroupv2, but by applying kernel knowledge and debugging techniques I was able to quickly find the information needed for a Kubernetes contribution without modifying the kernel. Beyond this case, looking at a system from a perspective other than user space often makes it possible to discover or solve unexpected problems. If there is demand, I would like to present this case in more detail elsewhere.