I'm reviving a blog post I wrote about 10 years ago, when I was a fledgling engineer. These days there are fewer occasions to ssh into a server and check it by hand, but this is still worth remembering. https://itinao.hatenadiary.org/
I am an engineer. As soon as I get to work, the glittering sales and planning people say this to me.
Sales: "The server is a bit slow." Planning: "Do something! We can't get any work done."
Me: "Uh, I don't know how to investigate that." Me: "I'm sorry. I'll contact a senior colleague right away."
As a programmer, I lack knowledge and experience on the infrastructure side. I want to do something, but I can't.
It's a common sight when you operate your own service.
Ho-Ren-So (report, contact, consult) is important, but you are an engineer after all. You want to be able to handle it yourself. This post is for that kind of person. First, let's learn the concept of bottlenecks. There are two main kinds of load:
1. CPU load
2. I/O load
CPU load means a process is occupying the CPU (the CPU is busy computing).

If one process (program) keeps the CPU at 100% usage for a long time, it interferes with the execution of other processes.
To avoid a misunderstanding: 100% CPU usage is not bad in itself. It means nothing else, such as disk or memory capacity, is the bottleneck, which is actually the ideal state.
Check whether a program is out of control (infinite loop, etc.).
Review the processing that changed in the latest release.
I/O means input/output. Frequent data input and output puts load on the hardware and the network, so CPU load and I/O load are different things. With I/O load the CPU is not necessarily busy; the server slows down because of a large volume of reads and writes to disk.
Is a program doing a lot of file I/O?
Is swapping occurring because of insufficient memory, causing disk access?
If there is not enough memory, the system uses swap. Conversely, if swap is accessed frequently, memory may be insufficient.
So far we have covered the concepts of CPU load and I/O load. Next, let's get straight into investigating CPU and I/O bottlenecks.
1. First of all, calm down. This is important.
2. Check the load average with top.
3. Use sar to check which is higher, CPU or I/O.
4. View information for each process with ps.
5. Take measures such as reviewing the running program or rolling back the version.
6. If it is safe to stop it midway, kill or restart the offending process.
Staying calm is important at all times. Everyone is glaring at you to fix it quickly, but don't panic.
Handle it with a thick skin.
First of all, let's look at the load average with the top command.
The load average is the average number of processes that were either runnable (running or waiting for a CPU) or waiting for disk I/O over a unit of time. In other words, it reports how many tasks were waiting. A high value means the system is under heavy load.
If the load average is higher than the number of CPU cores, processes are queuing up, and that may be why the server feels slow.
$ top
top - 00:41:49 up 6 days,  2:24,  1 user,  load average: 2.15, 3.02, 3.20
Tasks:  93 total,   1 running,  45 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3977928 total,  3324844 free,   121568 used,   531516 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3630656 avail Mem 
The "load average: 2.15, 3.02, 3.20" part is the load average. From the left, the values are for the last 1 minute, 5 minutes, and 15 minutes.
There are two things to check:
Does the load average exceed the number of cores?
Is swap being used?
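The two checks above can be sketched in Python. This is only an illustration, not part of the original procedure: the function names `parse_top_header` and `check_top` are made up here, the core count is passed in by hand (2 is an assumption for this sample), and the input is the header text copied from the top output above.

```python
import re

# Header lines copied from the top output shown above.
SAMPLE = """top - 00:41:49 up 6 days,  2:24,  1 user,  load average: 2.15, 3.02, 3.20
KiB Swap:        0 total,        0 free,        0 used.  3630656 avail Mem"""


def parse_top_header(text):
    """Extract the three load averages and the used swap figure from top's header."""
    load = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", text)
    swap = re.search(r"Swap:.*?([\d.]+)\s+used", text)
    return {
        "load": tuple(float(v) for v in load.groups()),  # 1, 5, 15 minutes
        "swap_used": float(swap.group(1)),
    }


def check_top(text, cores):
    """Answer the two questions: load average above core count? swap in use?"""
    info = parse_top_header(text)
    return {
        "load_exceeds_cores": info["load"][0] > cores,  # 1-minute value
        "swap_in_use": info["swap_used"] > 0,
    }


# cores=2 is an assumption for this example machine.
print(check_top(SAMPLE, cores=2))
# {'load_exceeds_cores': True, 'swap_in_use': False}
```

On a live Linux machine, `os.getloadavg()` and `os.cpu_count()` from the standard library give the same two numbers without parsing top.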
Next, let's look at the load on each core.
On a multi-core machine, the load average alone may not be enough to judge. In that case, use sar -P ALL to see the status of each CPU individually. Even with multiple CPUs, if there is only one disk, CPU load can be spread across the CPUs but I/O cannot, so I/O becomes the source of the load.
$ sar -P ALL
Linux 3.10.0-862.2.3.el7.x86_64 (118-27-1-88)   10/01/2018  _x86_64_    (2 CPU)
01:17:35 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:17:36 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
01:17:36 AM       0      0.00      0.00      0.00      0.00      0.00    100.00
01:17:36 AM       1      0.00      0.00      0.00      0.00      0.00    100.00
The meaning of each column is as follows.
| Column | Description |
|---|---|
| %user | Percentage of time the CPU spent in user mode |
| %system | Percentage of time the CPU spent in kernel mode |
| %iowait | Percentage of time the CPU was idle while waiting for I/O |
| %idle | Percentage of time the CPU was idle |
Here's what to check:
If %idle is small, CPU usage is high and the CPU may be the bottleneck.
If %iowait is large, the CPU is waiting on I/O and I/O may be the bottleneck.
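As a rough illustration of reading this output, the sketch below parses the `sar -P ALL` rows shown above and labels each CPU. The function names and both thresholds (20%) are assumptions chosen for the example, not values from sar or from this article; tune them for your own servers.

```python
# Output copied from the sar -P ALL example above.
SAMPLE = """Linux 3.10.0-862.2.3.el7.x86_64 (118-27-1-88)   10/01/2018  _x86_64_    (2 CPU)
01:17:35 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:17:36 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
01:17:36 AM       0      0.00      0.00      0.00      0.00      0.00    100.00
01:17:36 AM       1      0.00      0.00      0.00      0.00      0.00    100.00"""


def parse_sar_cpu(text):
    """Parse `sar -P ALL` rows into {cpu_name: {column: value}}."""
    rows = {}
    header = None
    for line in text.splitlines():
        fields = line.split()
        if "CPU" in fields and "%idle" in fields:
            # Column names that follow the CPU column: %user ... %idle
            header = fields[fields.index("CPU") + 1:]
            continue
        if header and len(fields) >= len(header) + 1:
            cpu = fields[-len(header) - 1]
            values = [float(v) for v in fields[-len(header):]]
            rows[cpu] = dict(zip(header, values))
    return rows


def classify(row, idle_threshold=20.0, iowait_threshold=20.0):
    """Label one CPU row. Both thresholds are illustrative assumptions."""
    if row["%iowait"] >= iowait_threshold:
        return "io"    # CPU is mostly waiting for I/O
    if row["%idle"] <= idle_threshold:
        return "cpu"   # CPU is busy computing
    return "ok"


for cpu, row in parse_sar_cpu(SAMPLE).items():
    print(cpu, classify(row))
```

With the idle sample above, every CPU is labeled `ok`; a row with low %idle and low %iowait would be labeled `cpu`.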
If the CPU turns out to be the cause of the load, the next step is to figure out which process is misbehaving.
$ ps auwx | head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  19232  1516 ?        Ss   Feb09   0:00 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Feb09   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Feb09   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Feb09   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Feb09   0:00 [stopper/0]
root         6  0.0  0.0      0     0 ?        S    Feb09   0:06 [watchdog/0]
root         7  0.0  0.0      0     0 ?        S    Feb09   0:00 [migration/1]
root         8  0.0  0.0      0     0 ?        S    Feb09   0:00 [stopper/1]
root         9  0.0  0.0      0     0 ?        S    Feb09   0:00 [ksoftirqd/1]
The meaning of each column is as follows.
| Column | Description |
|---|---|
| %CPU | CPU utilization of the process |
| %MEM | Percentage of physical memory used by the process |
| VSZ (RSS) | Virtual (physical) memory reserved by the process |
| STAT | Process state |
| TIME | Cumulative time the process has occupied the CPU |
Processes that can be executed on the CPU are in the TASK_RUNNING state. Among the processes in the TASK_RUNNING state, the CPU is given to the one with the highest priority.
| Notation | Status | Description |
|---|---|---|
| R | TASK_RUNNING | Runnable state (running or waiting for a CPU) |
| S | TASK_INTERRUPTIBLE | Waiting state. Signals can be received |
| D | TASK_UNINTERRUPTIBLE | Waiting state. Signals cannot be received (typically waiting for disk I/O) |
| Z | TASK_ZOMBIE | Zombie state. State after exit |
| T | TASK_STOPPED | Suspended state |
There are two things to check:
Look at RSS to see whether any process is extremely large.
Look at TIME. If a process is stuck in an infinite loop (TASK_RUNNING), its TIME keeps increasing.
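Scanning ps output by eye gets tedious on a busy server, so here is a small sketch that sorts it for you. It assumes the `ps auwx` column layout shown above; the helper names `parse_ps`, `top_by`, and `runnable` are hypothetical, and the sample is three rows copied from the output above.

```python
# Header and three rows copied from the ps auwx output above.
SAMPLE = """USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  19232  1516 ?        Ss   Feb09   0:00 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Feb09   0:00 [kthreadd]
root         6  0.0  0.0      0     0 ?        S    Feb09   0:06 [watchdog/0]"""


def parse_ps(text):
    """Turn `ps auwx` output into a list of dicts keyed by column name."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    procs = []
    for line in lines[1:]:
        # Limit the split so a COMMAND containing spaces stays in one field.
        fields = line.split(None, len(header) - 1)
        procs.append(dict(zip(header, fields)))
    return procs


def top_by(procs, column, n=3):
    """Return the n processes with the largest numeric value in `column`."""
    return sorted(procs, key=lambda p: float(p[column]), reverse=True)[:n]


def runnable(procs):
    """Processes whose STAT starts with R (TASK_RUNNING)."""
    return [p for p in procs if p["STAT"].startswith("R")]


procs = parse_ps(SAMPLE)
for p in top_by(procs, "RSS", n=1):
    print(p["PID"], p["RSS"], p["COMMAND"])   # largest RSS first
print(len(runnable(procs)), "runnable processes")
```

To run it against a live system, the text could come from `subprocess.run(["ps", "auwx"], capture_output=True, text=True).stdout`. Calling `top_by(procs, "%CPU")` in the same way lists the heaviest CPU users.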
If top shows that swap is being used, physical memory may be insufficient. Let's take a closer look with the sar command.
$ sar -S
00:00:00 kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
00:10:01   2097148         0      0.00         0      0.00
00:20:01   2097148         0      0.00         0      0.00
00:30:01   2097148         0      0.00         0      0.00
00:40:01   2097148         0      0.00         0      0.00
| Column | Description |
|---|---|
| kbswpfree | Free swap space (KB) |
| kbswpused | Used swap space (KB) |
| %swpused | Percentage of swap space in use |
| kbswpcad | Cached swap memory (KB) |
| %swpcad | Percentage of cached swap relative to used swap |
After checking how much swap is in use here, watch it live with vmstat. Specifying an interval and a count, as in vmstat 1 100 (every 1 second, 100 times), makes the trend easy to follow.
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1 244208  10312   1552  62636    4   23    98   249   44  304 28  3 68  1  0
 0  2 244920   6852   1844  67284    0  544  5248   544  236 1655  4  6  0 90  0
 1  2 256556   7468   1892  69356    0 3404  6048  3448  290 2604  5 12  0 83  0
 0  2 263832   8416   1952  71028    0 3788  2792  3788  140 2926 12 14  0 74  0
 0  3 274492   7704   1964  73064    0 4444  2812  5840  295 4201  8 22  0 69  0
| Column | Description |
|---|---|
| r | Number of runnable processes (running or waiting for a CPU) |
| b | Number of processes in uninterruptible sleep (blocked, typically waiting for I/O) |
| swpd | Swap in use (KB) |
| free | Free memory (KB) |
| buff | Buffer memory size (KB) |
| cache | Cache memory size (KB) |
| si | Memory swapped in from disk (KB/s) |
| so | Memory swapped out to disk (KB/s) |
| bi | Blocks received from a block device (blocks/s) |
| bo | Blocks sent to a block device (blocks/s) |
| in | Interrupts per second |
| cs | Context switches per second |
| us | Percentage of CPU time spent in user processes |
| sy | Percentage of CPU time spent running kernel code |
| id | Percentage of time the CPU was idle |
| wa | Percentage of time the CPU was waiting for I/O |
| st | Percentage of time the guest OS could not be allocated CPU (stolen by the hypervisor) |
r and b are usually around 0 to 2.
If these numbers are large, the server will probably feel slow.
Normally, si and so should always be zero.
If nonzero values keep appearing here, either memory is insufficient or some program is consuming a lot of memory.
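The rules of thumb above can be checked mechanically. The sketch below parses the `vmstat 1 5` output shown earlier; the names `parse_vmstat` and `diagnose` are made up for this example, and `queue_limit=2` simply encodes the "usually around 0 to 2" guideline as an assumption. The first data row is skipped because vmstat's first report shows averages since boot rather than the current interval.

```python
# Output copied from the vmstat 1 5 example above.
SAMPLE = """procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1 244208  10312   1552  62636    4   23    98   249   44  304 28  3 68  1  0
 0  2 244920   6852   1844  67284    0  544  5248   544  236 1655  4  6  0 90  0
 1  2 256556   7468   1892  69356    0 3404  6048  3448  290 2604  5 12  0 83  0
 0  2 263832   8416   1952  71028    0 3788  2792  3788  140 2926 12 14  0 74  0
 0  3 274492   7704   1964  73064    0 4444  2812  5840  295 4201  8 22  0 69  0"""


def parse_vmstat(text):
    """Parse vmstat output into a list of {column: int} dicts, one per sample."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The column header is the line that starts with "r b".
    header_idx = next(i for i, line in enumerate(lines)
                      if line.split()[:2] == ["r", "b"])
    columns = lines[header_idx].split()
    return [dict(zip(columns, map(int, line.split())))
            for line in lines[header_idx + 1:]]


def diagnose(samples, queue_limit=2):
    """Apply the rules of thumb; queue_limit is an illustrative assumption."""
    # Skip the first report: it holds averages since the last reboot.
    recent = samples[1:] if len(samples) > 1 else samples
    return {
        # "A number always appears" in si/so: every sample shows swap traffic.
        "swapping": all(s["si"] > 0 or s["so"] > 0 for s in recent),
        "long_queue": any(s["r"] > queue_limit or s["b"] > queue_limit
                          for s in recent),
        "max_wa": max(s["wa"] for s in recent),
    }


print(diagnose(parse_vmstat(SAMPLE)))
# {'swapping': True, 'long_queue': True, 'max_wa': 90}
```

For the sample above, constant swap-out together with wa as high as 90 points to a memory shortage driving disk I/O rather than a CPU problem.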
To summarize: first, determine whether the cause is CPU or I/O with the following commands.
top
sar
ps
vmstat
When the CPU load is high
Add servers, or improve the program's logic and algorithms
When the I/O load is high
Add memory to enlarge the cache area
If adding memory is not possible, consider distributing the data or introducing a cache server
Improve the program to reduce the frequency of I/O
Phew, writing all this up wore me out. I hope this helps you explain the cause of the load.