There seems to be a lot of misunderstanding about CPU Steal Time: for example, that it is a metric showing how much CPU other VMs running on the same host have "stolen" from resources that should have been allocated to your VM, or a metric proving that CPU resource contention is occurring. It is true that virtualized environments tend to overcommit CPU resources depending on their configuration, but you cannot simply conclude that anything was "stolen" just by looking at CPU Steal Time.

CPU Steal Time is a metric that counts the time during which a virtual machine wanted to run but could not get a physical CPU, i.e. the portion where it tried to do more work than the CPU resources actually made available to it. As the comment in kernel/sched/cputime.c puts it, it would more properly be called **involuntary wait time** than Steal Time.
/proc/stat
First, let's see where the steal time metric `%st`, which you can see in top(1) and vmstat(8), comes from. These tools simply take the value from /proc/stat. In the kernel it is the following part:
fs/proc/stat.c
linux/stat.c at master · torvalds/linux · GitHub:
```c
steal += cpustat[CPUTIME_STEAL];
```
It seems it just reads out the value of `cpustat[CPUTIME_STEAL]`.
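As a quick check, you can read that counter yourself: in the aggregate `cpu` line of /proc/stat, steal is the eighth field (field order per proc(5): user, nice, system, idle, iowait, irq, softirq, steal, ...), reported in USER_HZ ticks. Here is a minimal sketch in C that samples it twice to get a `%st`-style ratio, the same way top/vmstat derive it:

```c
/* Minimal sketch: sample the steal field of /proc/stat twice and
 * print the %st-style ratio over the interval. Field order per proc(5):
 * user nice system idle iowait irq softirq steal ... (USER_HZ ticks). */
#include <stdio.h>
#include <unistd.h>

static int read_cpu_line(unsigned long long v[8])
{
	FILE *fp = fopen("/proc/stat", "r");
	int n;

	if (!fp)
		return -1;
	n = fscanf(fp, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
	fclose(fp);
	return n == 8 ? 0 : -1;
}

int main(void)
{
	unsigned long long a[8], b[8], total = 0, steal;
	int i;

	if (read_cpu_line(a))
		return 1;
	sleep(1);
	if (read_cpu_line(b))
		return 1;

	for (i = 0; i < 8; i++)
		total += b[i] - a[i];
	steal = b[7] - a[7];
	printf("steal: %llu ticks (%.1f%% of the interval)\n",
	       steal, total ? 100.0 * steal / total : 0.0);
	return 0;
}
```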
So where is `cpustat[CPUTIME_STEAL]` accounted for?
Tracing it down, I came across the following source: it is accounted in kernel/sched/cputime.c. An excerpt:
linux/cputime.c at master · torvalds/linux · GitHub:
```c
/*
 * Account for involuntary wait time.
 * @cputime: the CPU time spent in involuntary wait
 */
void account_steal_time(u64 cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;

	cpustat[CPUTIME_STEAL] += cputime;
}

...

/*
 * When a guest is interrupted for a longer amount of time, missed clock
 * ticks are not redelivered later. Due to that, this function may on
 * occasion account more time than the calling functions think elapsed.
 */
static __always_inline u64 steal_account_process_time(u64 maxtime)
{
#ifdef CONFIG_PARAVIRT
	if (static_key_false(&paravirt_steal_enabled)) {
		u64 steal;

		steal = paravirt_steal_clock(smp_processor_id());
		steal -= this_rq()->prev_steal_time;
		steal = min(steal, maxtime);
		account_steal_time(steal);
		this_rq()->prev_steal_time += steal;

		return steal;
	}
#endif
	return 0;
}
```
You can see that `account_steal_time()` only accumulates a steal time value that has already been measured. The actual value is obtained from `paravirt_steal_clock()`, which is called in `steal_account_process_time()`.
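For context on where this is called from: on every tick, the accounting code first deducts steal time from the elapsed tick before charging user/system time. Roughly like this (paraphrased and simplified from `account_process_tick()` in the same file, so treat it as a sketch rather than verbatim kernel code):

```c
/* Sketch of the tick-side caller (simplified from account_process_tick()
 * in kernel/sched/cputime.c): steal time is carved out of the tick first,
 * and only the remainder is charged as user/system/idle time. */
void account_process_tick(struct task_struct *p, int user_tick)
{
	u64 cputime = TICK_NSEC;
	u64 steal = steal_account_process_time(ULONG_MAX);

	if (steal >= cputime)
		return;		/* the whole tick was stolen */
	cputime -= steal;

	/* ... charge the remaining cputime as user, system or idle ... */
}
```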
Note that whether you run as a Xen or a KVM guest, distribution kernels are usually built with CONFIG_PARAVIRT=y, so the code inside `#ifdef CONFIG_PARAVIRT` is compiled in regardless of HVM/PV. This may seem surprising, since HVM guests often don't use paravirtual features, but CONFIG_PARAVIRT=y is simply an option that enables the paravirtualization code, whether or not the kernel actually runs under a hypervisor. The help text from Kconfig is excerpted below.
linux/Kconfig at master · torvalds/linux · GitHub:
```
config PARAVIRT
	bool "Enable paravirtualization code"
	---help---
	  This changes the kernel so it can modify itself when it is run
	  under a hypervisor, potentially improving performance significantly
	  over full virtualization. However, when run without a hypervisor
	  the kernel is theoretically slower and slightly larger.
```
That was a bit of a digression; the function that actually obtains the steal time is `paravirt_steal_clock()`. Following it into [linux/paravirt.h at master · torvalds/linux · GitHub](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/paravirt.h#L34-L37), you can see that it is abstracted as `pv_ops.time.steal_clock`. Below is that part:
```c
static inline u64 paravirt_steal_clock(int cpu)
{
	return PVOP_CALL1(u64, time.steal_clock, cpu);
}
```
The macro may make this look confusing, but if you expand `PVOP_CALL1`, `time.steal_clock` resolves to `pv_ops.time.steal_clock`. In other words, the real identity of this function is whatever is registered as `pv_ops.time.steal_clock`.
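Conceptually, the expansion behaves like an indirect call through the ops table (this is a simplification; the real macro additionally emits patchable call sites):

```c
/* Simplified view of what PVOP_CALL1(u64, time.steal_clock, cpu)
 * boils down to: an indirect call through the global pv_ops table.
 * The real macro also allows the call site to be binary-patched. */
static inline u64 paravirt_steal_clock(int cpu)
{
	return pv_ops.time.steal_clock(cpu);
}
```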
Now let's look at what is registered as `pv_ops.time.steal_clock` for Xen and KVM, respectively.
For Xen, it is registered in drivers/xen/time.c.
linux/time.c at master · torvalds/linux · GitHub:
```c
pv_ops.time.steal_clock = xen_steal_clock;
```
The implementation is:
```c
u64 xen_steal_clock(int cpu)
{
	struct vcpu_runstate_info state;

	xen_get_runstate_snapshot_cpu(&state, cpu);
	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
}
```
So in the case of Xen, steal time is derived from the VCPU run state. `xen_get_runstate_snapshot_cpu()` takes a snapshot of the VCPU state at that moment; it is defined in the same source file, so take a look if you're curious. The meanings of `RUNSTATE_runnable` and `RUNSTATE_offline` can be found in `include/xen/interface/vcpu.h`.
linux/vcpu.h at master · torvalds/linux · GitHub:
```c
/* VCPU is currently running on a physical CPU. */
#define RUNSTATE_running  0

/* VCPU is runnable, but not currently scheduled on any physical CPU. */
#define RUNSTATE_runnable 1

/* VCPU is blocked (a.k.a. idle). It is therefore not runnable. */
#define RUNSTATE_blocked  2

/*
 * VCPU is not runnable, but it is not blocked.
 * This is a 'catch all' state for things like hotplug and pauses by the
 * system administrator (or for critical sections in the hypervisor).
 * RUNSTATE_blocked dominates this state (it is the preferred state).
 */
#define RUNSTATE_offline  3
```
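For reference, the snapshot that `xen_get_runstate_snapshot_cpu()` fills in is a `struct vcpu_runstate_info` from the same header; sketched here from memory, so double-check against `include/xen/interface/vcpu.h`:

```c
struct vcpu_runstate_info {
	/* VCPU's current state (RUNSTATE_*). */
	int      state;
	/* When was current state entered (system time, ns)? */
	uint64_t state_entry_time;
	/* Time spent in each RUNSTATE_* (ns), indexed by the defines above. */
	uint64_t time[4];
};
```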
As the definitions above show, `RUNSTATE_runnable` and `RUNSTATE_offline` say nothing about resources being taken away by other VMs. If the applications running on the virtual machine have the resources they need, the VCPU will not be in the `runnable` state; it will be `running`, `blocked`, or `offline`. In Xen you can borrow extra resources, but you can't steal resources from other VMs. If a virtual machine's VCPU stays in the `runnable` state for a long time and the steal time shown by top/vmstat is high, it means the VM is requesting more CPU than its allocation allows. If, on the other hand, Steal Time suddenly jumps up, it's possible that something actually went wrong.
For KVM, the steal value is provided through an MSR interface. You can find the relevant code in `arch/x86/kernel/kvm.c`.
linux/kvm.c at master · torvalds/linux · GitHub:
```c
static void __init kvm_guest_init(void)
{
	int i;

	paravirt_ops_setup();
	...
	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
		has_steal_clock = 1;
		pv_ops.time.steal_clock = kvm_steal_clock;
	}
```
linux/kvm.c at master · torvalds/linux · GitHub:
```c
static u64 kvm_steal_clock(int cpu)
{
	u64 steal;
	struct kvm_steal_time *src;
	int version;

	src = &per_cpu(steal_time, cpu);
	do {
		version = src->version;
		virt_rmb();
		steal = src->steal;
		virt_rmb();
	} while ((version & 1) || (version != src->version));

	return steal;
}
```
It just reads the value published via the MSR (Model Specific Register) interface: the guest registers a per-CPU `steal_time` area with the hypervisor, and the `version` loop above is a seqcount-style retry that rereads until it sees a consistent snapshot (an odd `version` means an update is in progress). The host-side implementation that provides these values is probably [linux/x86.c at master · torvalds/linux · GitHub](https://github.com/torvalds/linux/blob/master/arch/x86/kvm/x86.c#L2651-L2694), but I haven't followed it here. Fortunately, the documentation describes this definition.
https://www.kernel.org/doc/Documentation/virtual/kvm/msr.txt
> MSR_KVM_STEAL_TIME: 0x4b564d03
> ...
> steal: the amount of time in which this vCPU did not run, in nanoseconds. Time during which the vcpu is idle, will not be reported as steal time.
So it is documented as the time during which the VCPU was not running, excluding idle time. Similar to Xen.
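The per-CPU record that the guest registers through that MSR corresponds to `struct kvm_steal_time`; sketched here from `arch/x86/include/uapi/asm/kvm_para.h` (field layout from memory, so verify against the header):

```c
struct kvm_steal_time {
	__u64 steal;      /* accumulated steal time, in nanoseconds */
	__u32 version;    /* bumped before/after host updates (odd = in flux) */
	__u32 flags;
	__u8  preempted;  /* set while the vCPU is preempted */
	__u8  u8_pad[3];
	__u32 pad[11];
};
```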
Note that, as we have seen above, steal time is reported to the guest by the hypervisor in both Xen and KVM. However, whether this value is actually provided seems to differ between cloud providers and environments; GCE reportedly did not report steal time. I learned this from Paul R. Nash (@paulrnash), formerly a Product Director for GCE at Google (apparently now at Microsoft Azure), via Matthew S. Wilson (@_msw_), a VP/Distinguished Engineer at AWS.
> In both cases, steal time is reported from the hypervisor to the guest. In some cloud environments, this is accurately reported. In other cloud environments, it is not.
> — Matthew S. Wilson (@_msw_) March 6, 2020

> Hah, that's an oldie. As you can see, it was an extremely infrequent request (doesn't look like that has changed).
> — Paul R. Nash (@paulrnash) March 6, 2020
I've heard that GCE uses a KVM-based hypervisor, so perhaps `MSR_KVM_STEAL_TIME` is simply not enabled there? I don't know the details.
In general, the fact that CPU steal time is being counted does not indicate a problem on the underlying host. Steal time also includes time the hypervisor and VMM spend working on behalf of the guest, and fundamentally it exposes the time the VM wanted to run beyond the allowance it was given.
You should look carefully at both guest and host metrics and choose an appropriate VM; CPU steal time is one indicator for making that choice. When measuring the CPU consumption a task actually needs, it is important to be able to subtract the time spent waiting for a physical CPU, and CPU steal time is what lets you do that: for example, if a task shows 80% CPU utilization while steal is 15%, the work itself consumed only about 65% of a physical CPU.

When CPU steal time rises, first read the documentation for your virtualization environment, analyze the metrics provided by the host, and check whether the value is reasonable.