I am reviving a blog post I wrote about 10 years ago, when I was a fledgling engineer. These days there are fewer occasions to connect to a server over ssh and check it by hand, but this is still worth remembering. https://itinao.hatenadiary.org/

One day at work

I am an engineer. As soon as I arrive at work, the dazzling sales and planning people tell me this.

Sales: "The server is kind of slow." Planning: "Do something! We can't get any work done."

Me: "(Uh, I don't know how to investigate this.)" Me: "I'm sorry. I'll contact a senior engineer right away."

As a programmer, I lack knowledge and experience on the infrastructure side. I want to do something, but there is nothing I can do...

It's a common sight when you operate your own service.

Still, I want to do something about it!

Ho-Ren-So (report, contact, consult) is important, but you are an engineer after all, and you want to be able to handle it yourself. This post is for that kind of person. First, let's learn the concept of bottlenecks.

There are two main kinds of bottleneck

1. CPU load
2. I/O load

1. What is CPU load?

A process occupies the CPU (the CPU is busy computing).

If one process (program) uses the CPU and the usage rate is 100% for a long time, it will interfere with the execution of other processes.

To avoid a misunderstanding: 100% CPU usage is not bad in itself. A system with no bottleneck other than disk and memory capacity is actually the ideal.

If CPU usage suddenly stays at 100%

Check if the program is out of control (infinite loop, etc.).
Review the processing in the latest release version.

2. Then what is the difference between CPU load and I/O load?

I/O means input/output. Data moving in and out frequently puts load on the hardware and the network, which is a different kind of load from CPU load. When I/O is the bottleneck, the CPU is not necessarily busy; the slowdown comes from a large amount of reading and writing to disk.

If the I/O load continues

Is there a program doing a lot of file I/O?
Is memory running short, causing swapping and therefore disk access?

If there is not enough memory, the system will use swap. Conversely, if swap is accessed heavily, memory may be insufficient.

Bottleneck investigation procedure

So far we have seen the concepts of CPU load and I/O load. Next, let's get straight into investigating CPU and I/O bottlenecks.

1. First of all, calm the mind. This is important.
2. Check the load average with top.
3. Check with sar whether CPU or I/O is higher.
4. View the information for each process with ps.
5. Take measures such as reviewing the running program or rolling back the version.
6. If it is safe to stop it midway, kill or restart the offending process.

1. First, calm the mind

This is important at all times. It is painful to be pressed to fix things quickly, but don't panic.

Stay composed and deal with it steadily.

2. Start with the top command

First of all, let's look at the load average with the top command.

What is load average?

The average number of processes per unit of time that are running, waiting for a CPU, or waiting for disk I/O. In other words, it reports how many tasks were waiting per unit of time. The value is for the whole system and is not divided by the number of CPUs, which is why it is compared with the core count. If it is high, the load on the system is high.

What counts as a high load average?

If the load average is higher than the number of cores, the system may be overloaded.
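The rule of thumb above can be sketched in Python. This is only an illustration under assumptions: the helper names `load_per_core` and `is_overloaded` are hypothetical, and "overloaded" here means nothing more than "load average above the core count".

```python
import os


def load_per_core(load_averages, cores):
    """Divide each load average by the core count (1.0 means every core is busy)."""
    return tuple(value / cores for value in load_averages)


def is_overloaded(load_averages, cores):
    """Hypothetical check: any of the 1/5/15 minute values exceeds the core count."""
    return any(value > cores for value in load_averages)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    current = os.getloadavg()  # (1 min, 5 min, 15 min); available on Unix-like systems
    print("cores:", cores)
    print("load average:", current)
    print("per core:", load_per_core(current, cores))
    print("overloaded:", is_overloaded(current, cores))
```

With the values from the top output in this article (2.15, 3.02, 3.20) on a 2-core machine, `is_overloaded` would return True.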

$ top
top - 00:41:49 up 6 days,  2:24,  1 user,  load average: 2.15, 3.02, 3.20
Tasks:  93 total,   1 running,  45 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3977928 total,  3324844 free,   121568 used,   531516 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3630656 avail Mem

The "load average: 2.15, 3.02, 3.20" part is the load average. From left to right, the values are for the last 1 minute, 5 minutes, and 15 minutes.

There are two things to check

Does the load average exceed the number of cores?
Is swap being used?
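Both checks can be read off the top header mechanically. The following is a minimal sketch that assumes the header looks like the sample output above (newer versions of top may print MiB and decimals, which this does not handle); `parse_top_header` is a hypothetical helper name.

```python
import re

# Two header lines copied from the top output shown in this article.
SAMPLE = """\
top - 00:41:49 up 6 days,  2:24,  1 user,  load average: 2.15, 3.02, 3.20
KiB Swap:        0 total,        0 free,        0 used.  3630656 avail Mem
"""


def parse_top_header(text):
    """Extract the three load averages and the used swap (KiB) from top's header."""
    load = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", text)
    swap = re.search(r"Swap:\s*(\d+)\s+total,\s*(\d+)\s+free,\s*(\d+)\s+used", text)
    if load is None or swap is None:
        raise ValueError("header does not match the assumed top format")
    return {
        "load": tuple(float(value) for value in load.groups()),
        "swap_used_kib": int(swap.group(3)),
    }


if __name__ == "__main__":
    print(parse_top_header(SAMPLE))
```

The text to parse could come from a one-shot batch run such as `top -b -n 1`, captured with `subprocess.run`.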

Next, let's look at the load status for each core.

3. Check CPU usage and I/O wait with the sar command

On a multi-core machine, the load average alone may not be enough to judge. In that case, use sar -P ALL to see the status of each CPU individually. Even with multiple CPUs, if there is only one disk, CPU load can be spread across the CPUs but I/O cannot, and that becomes the source of the load.

$ sar -P ALL
Linux 3.10.0-862.2.3.el7.x86_64 (118-27-1-88)   10/01/2018  _x86_64_    (2 CPU)

01:17:35 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:17:36 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
01:17:36 AM       0      0.00      0.00      0.00      0.00      0.00    100.00
01:17:36 AM       1      0.00      0.00      0.00      0.00      0.00    100.00

The meaning of each column is as follows.

Column	Description
%user	Percentage of time the CPU was in user mode
%system	Percentage of time the CPU was in kernel mode
%iowait	Percentage of time the CPU was waiting for I/O
%idle	Percentage of time the CPU was idle

Here's what to check

If %idle is small, CPU usage is high and the CPU may be the bottleneck. If %iowait is large, I/O may be the bottleneck.
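The per-CPU check above could be automated roughly as follows. This sketch assumes the column layout of the sar output shown in this article; the function names and the 20% threshold are hypothetical choices, not values from sar itself.

```python
# Header and data rows copied from the sar -P ALL output shown in this article.
SAMPLE = """\
01:17:35 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:17:36 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
01:17:36 AM       0      0.00      0.00      0.00      0.00      0.00    100.00
01:17:36 AM       1      0.00      0.00      0.00      0.00      0.00    100.00
"""


def parse_sar_cpu(text):
    """Return {cpu_label: {column: value}} from sar -P ALL style output."""
    rows = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if "CPU" in fields and "%idle" in fields:
            # Header row: the metric names follow the CPU column.
            columns = fields[fields.index("CPU") + 1:]
            continue
        if columns is None or len(fields) < len(columns) + 1:
            continue
        # Count from the right so 12-hour (AM/PM) and 24-hour timestamps both work.
        cpu = fields[-len(columns) - 1]
        try:
            values = [float(value) for value in fields[-len(columns):]]
        except ValueError:
            continue  # not a data row
        rows[cpu] = dict(zip(columns, values))  # a later row for the same CPU wins
    return rows


def busy_cpus(rows, idle_below=20.0):
    """Hypothetical rule: a CPU is 'busy' when %idle is under idle_below."""
    return sorted(cpu for cpu, row in rows.items() if row["%idle"] < idle_below)


if __name__ == "__main__":
    print(busy_cpus(parse_sar_cpu(SAMPLE)))
```

For the idle sample above nothing is flagged; the same `rows` dictionary also exposes `%iowait` for each CPU.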

Check the state of the processes assigned to the CPU with the ps command

If the CPU turns out to be the cause of the load, next let's find out which process is misbehaving.

$ ps auwx | head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  19232  1516 ?        Ss   Feb09   0:00 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Feb09   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Feb09   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Feb09   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Feb09   0:00 [stopper/0]
root         6  0.0  0.0      0     0 ?        S    Feb09   0:06 [watchdog/0]
root         7  0.0  0.0      0     0 ?        S    Feb09   0:00 [migration/1]
root         8  0.0  0.0      0     0 ?        S    Feb09   0:00 [stopper/1]
root         9  0.0  0.0      0     0 ?        S    Feb09   0:00 [ksoftirqd/1]

The meaning of each column is as follows.

Column	Description
%CPU	CPU utilization of the process
%MEM	Percentage of physical memory used by the process
VSZ (RSS)	Virtual (physical) memory area reserved by the process
STAT	Process state
TIME	Time the process has occupied the CPU

About STAT (process status)

Processes that can run on the CPU are in the TASK_RUNNING state. Among the processes in the TASK_RUNNING state, the CPU is given to the task with the highest priority.

Notation	Status	Description
R	TASK_RUNNING	Runnable state
S	TASK_INTERRUPTIBLE	Waiting state. Can receive signals
D	TASK_UNINTERRUPTIBLE	Waiting state. Cannot receive signals
Z	TASK_ZOMBIE	Zombie state. State after exit
T	TASK_STOPPED	Suspended state

There are two things to check

Look at RSS to see whether any process is extremely large.
Check TIME. If a process is stuck in an infinite loop (TASK_RUNNING), its TIME keeps increasing.

If swapping is occurring

If top shows that swap is being used, the cause may be insufficient physical memory. Let's take a closer look with the sar command.

$ sar -S

00:00:00    kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
00:10:01      2097148         0      0.00         0      0.00
00:20:01      2097148         0      0.00         0      0.00
00:30:01      2097148         0      0.00         0      0.00
00:40:01      2097148         0      0.00         0      0.00

Status	Description
kbswpfree	Free space in swap space
kbswpused	Swap area usage capacity
%swpused	Swap area usage percentage
kbswpcad	Swap area cache capacity

After checking how much swap is in use here, run vmstat with an interval, for example vmstat 1 100, to see how things change over time.

$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1 244208  10312   1552  62636    4   23    98   249   44  304 28  3 68  1  0
 0  2 244920   6852   1844  67284    0  544  5248   544  236 1655  4  6  0 90  0
 1  2 256556   7468   1892  69356    0 3404  6048  3448  290 2604  5 12  0 83  0
 0  2 263832   8416   1952  71028    0 3788  2792  3788  140 2926 12 14  0 74  0
 0  3 274492   7704   1964  73064    0 4444  2812  5840  295 4201  8 22  0 69  0

Column	Description
r	Number of processes waiting to run
b	Number of processes in uninterruptible sleep (blocked and unable to run)
swpd	Swap in use (KB)
free	Free memory (KB)
buff	Buffer memory size (KB)
cache	Cache memory size (KB)
si	Memory swapped in from disk (KB/s)
so	Memory swapped out to disk (KB/s)
bi	Blocks received from a block device (blocks/s)
bo	Blocks sent to a block device (blocks/s)
in	Number of interrupts per second
cs	Number of context switches per second
us	Percentage of CPU time spent in user processes
sy	Percentage of CPU time spent executing kernel code
id	Percentage of time the CPU is idle
wa	Percentage of time the CPU is waiting for I/O
st	Percentage of time the guest operating system could not be allocated a CPU

r and b are usually about 0 to 2.
If these numbers are large, the server will probably feel slow.

Normally, si and so are always zero.
If nonzero values keep appearing here, either memory is insufficient or some program is consuming a lot of memory.
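The si/so check can be applied to captured vmstat output. The sketch below parses the sample shown in this article; `parse_vmstat` and `swapping_samples` are hypothetical names. Note that the first data row printed by vmstat is an average since boot, so the later rows say more about the current state.

```python
# The vmstat 1 5 output shown in this article.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1 244208  10312   1552  62636    4   23    98   249   44  304 28  3 68  1  0
 0  2 244920   6852   1844  67284    0  544  5248   544  236 1655  4  6  0 90  0
 1  2 256556   7468   1892  69356    0 3404  6048  3448  290 2604  5 12  0 83  0
 0  2 263832   8416   1952  71028    0 3788  2792  3788  140 2926 12 14  0 74  0
 0  3 274492   7704   1964  73064    0 4444  2812  5840  295 4201  8 22  0 69  0
"""


def parse_vmstat(text):
    """Return one {column: int} dictionary per vmstat data row."""
    samples = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["r", "b"]:
            columns = fields  # the row of column names
        elif columns and len(fields) == len(columns) and all(f.isdigit() for f in fields):
            samples.append(dict(zip(columns, map(int, fields))))
    return samples


def swapping_samples(samples):
    """Count the rows in which si or so is nonzero."""
    return sum(1 for sample in samples if sample["si"] > 0 or sample["so"] > 0)


if __name__ == "__main__":
    samples = parse_vmstat(SAMPLE)
    print("rows:", len(samples))
    print("rows with swap activity:", swapping_samples(samples))
    print("highest wa:", max(sample["wa"] for sample in samples))
```

In this sample every row shows swap-out activity and wa climbs to 90, which is the memory-shortage pattern described above.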

Load countermeasure summary

First, determine whether the cause is CPU or I/O with the following commands

top
sar
ps
vmstat

Countermeasures

When the CPU load is high

Add servers, or improve the program logic and algorithms

When the I/O load is high

Expand the cache area by adding memory
If adding memory is not possible, consider distributing the data or introducing a cache server
Improve the program to reduce the frequency of I/O
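The decision flow in this summary can be condensed into one small function. This is a sketch under assumptions: `classify_bottleneck` is a hypothetical name, and the 20% thresholds are arbitrary illustrative values that should be tuned per system.

```python
def classify_bottleneck(idle, iowait, swap_in, swap_out,
                        idle_below=20.0, iowait_above=20.0):
    """Guess the bottleneck from %idle, %iowait, and vmstat's si/so values.

    The thresholds are assumptions made for illustration only.
    """
    if swap_in > 0 or swap_out > 0:
        # Swapping suggests a memory shortage, which then shows up as disk I/O.
        return "memory shortage (swapping)"
    if iowait > iowait_above:
        return "I/O"
    if idle < idle_below:
        return "CPU"
    return "no obvious bottleneck"


if __name__ == "__main__":
    # Second row of the vmstat sample in this article: id=0, wa=90, si=0, so=544.
    print(classify_bottleneck(idle=0, iowait=90, swap_in=0, swap_out=544))
    # The idle top sample in this article: 99.8 id, 0.0 wa, no swap.
    print(classify_bottleneck(idle=99.8, iowait=0.0, swap_in=0, swap_out=0))
```

Swapping is checked first because, as described earlier, a memory shortage tends to appear as I/O load.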

Phew, putting this together wore me out. I hope this helps you explain the cause of a load problem.

[LINUX] Concept of server load that new engineers want to know