[JVM] Let's tackle OOM (Out Of Memory) failures

Introduction

Basic JVM knowledge was covered in this previous article: [JVM] Necessary knowledge for OOM (Out Of Memory) failure handling

From here on, I will cover concrete ways to resolve OOM failures.

Output

This article explains the actions to take when you actually run into an OOM problem, and also introduces concrete tools you can use. (See the linked articles for how to use each tool.)

Details of the problem addressed this time

To summarize the problem that occurred this time: a Java process went down about once a week. Checking memory usage with the sar command showed it at nearly 100% just before the crash, and the JVM crash log contained the word OOM.

Below is a summary of the investigation policy, investigation methods, and points to note. While I was investigating, I didn't know how to proceed, missed important information, and drew strange conclusions from incorrect data.

So the investigation method described here is the best approach I can think of at the moment.

Investigation policy

I recommend investigating in the following order.

  1. Check hs_err_<pid>.log to isolate the problem (this narrows it down considerably)
  2. Check the heap status from the gc log
  3. If there is a problem with the Java heap, identify the problem from the heap dump
  4. If there is a problem with non-heap, identify the problem from the number of threads
  5. Suspect OOM Killer if memory usage is increasing for reasons other than the above

**To say this up front: when dealing with OOM problems, the issues may be intertwined in non-linear ways, so you need the mental toughness to calmly form hypotheses and verify them steadily.**

1. Check hs_err_<pid>.log to isolate the problem (this narrows it down considerably)

**By looking at this log, you can narrow down the OOM problem considerably.** By repeatedly forming and verifying hypotheses from there, you will reach the cause much faster, so please check it with the utmost care. (At first I was fighting the problem without knowing this.)

By the way, this log is output when the JVM crashes. If nothing is set in the JVM startup options, it is written to the current directory. If you want to specify the output destination, you can do so with the following startup option.

java -XX:ErrorFile=/var/log/java/java_error%p.log

Reading hs_err_<pid>.log

Now let's read hs_err_<pid>.log to identify the problem. When you open the log file, you will find a description like the following. **Use it as a reference, since it narrows the problem down to some extent.**

- There is a problem with the Java heap

Exception in thread "main": java.lang.OutOfMemoryError: Java heap space

Java heap space indicates that an object could not be allocated in the Java heap. In other words, the Java heap area was too small, so increasing its capacity may solve the problem.
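
For example, a minimal sketch of enlarging the Java heap with the -Xms / -Xmx startup options (the 2g values and app.jar are placeholders for your own application):

# Set the initial and maximum Java heap size (values are examples)
java -Xms2g -Xmx2g -jar app.jar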

However, the message might also indicate that the application is unintentionally holding references to objects. In that case an object stays referenced the whole time, is never eligible for GC, and memory gradually runs out: in other words, a memory leak. The fix is to take a heap dump, identify the objects that keep growing, and correct the source code. See **3. If there is a problem with the Java heap, identify the problem from the heap dump** below.
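
If you suspect a leak, it also helps to have the JVM write a heap dump automatically at the moment the OOM occurs; a sketch of the startup options (the dump path and app.jar are examples):

# Automatically write a heap dump when an OutOfMemoryError occurs
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/java -jar app.jar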

- There is a problem with the Permanent heap

Exception in thread "main": java.lang.OutOfMemoryError: PermGen space

PermGen space indicates that the permanent generation is full. In other words, the Permanent area is not large enough, and you need to reserve a sufficient area with the -XX:MaxPermSize option.

Note that classes loaded by the class loader and static variables are stored in this area.
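
As a sketch, the Permanent area can be enlarged like this on Java 7 and earlier (the 256m value and app.jar are examples; on Java 8 and later PermGen was replaced by Metaspace, so -XX:MaxMetaspaceSize applies instead):

# Enlarge the Permanent generation (Java 7 and earlier; value is an example)
java -XX:MaxPermSize=256m -jar app.jar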

- There is a problem with the Java heap

Exception in thread "main": java.lang.OutOfMemoryError: Requested array size exceeds VM limit

Requested array size exceeds VM limit indicates that the application (or an API used by the application) attempted to allocate an array larger than the heap size. In other words, the array could not be allocated in the Java heap area, so increasing the heap capacity may solve the problem. Alternatively, take a heap dump, identify the objects that keep growing, and fix the source code. See **3. If there is a problem with the Java heap, identify the problem from the heap dump** below.

- There is a problem with the C heap

Exception in thread "main": java.lang.OutOfMemoryError: request <size> bytes for <reason>. Out of swap space?

The HotSpot VM reports this apparent exception when an allocation from the native heap fails and the native heap might be close to exhaustion. In other words, memory could not be allocated from the native heap (C heap). Incidentally, this is the problem I actually dealt with. When you hit this problem, you can basically suspect that the number of threads is the cause, so see **4. If there is a problem with non-heap, identify the problem from the number of threads**.
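
When the native heap is being squeezed by a large number of threads, reducing the per-thread stack size with -Xss can buy some room, and the OS-side limits are also worth checking; a sketch (the 512k value and app.jar are examples):

# Reduce the stack size reserved for each thread (value is an example)
java -Xss512k -jar app.jar

# Check the OS limits for the user running the java process
ulimit -a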

For details, refer to the following site.

The information so far should have narrowed down the problem to some extent. From here, we turn the hypothesis into a confirmed fact.

2. Check the heap status from the gc log

As mentioned in the input article, you can get a GC log by setting -verbose:gc as a startup option. Looking at this log, you can see how the heap areas fluctuate as minor GCs and major GCs run. **GC Viewer** is very useful for visualizing it.
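
For reference, a sketch of the startup options for writing a detailed GC log to a file (these are Java 8-era HotSpot flags; the log path and app.jar are examples):

# Write detailed, timestamped GC logs to a file (Java 8 flags)
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/java/gc.log -jar app.jar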

This article is useful for how to use the tool and read the results: https://qiita.com/i_matsui/items/aabbdaa169c6ae51ecb3

3. If there is a problem with the Java heap, identify the problem from the heap dump

By visualizing and comparing heap dumps with Memory Analyzer, you can see whether there really is a problem with the Java heap. The following articles are very useful for how to use Memory Analyzer.

The specific method is to take heap dumps (described in the input article) and compare them.
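
For reference, a heap dump of a running process can also be taken with jmap (a sketch; the output path is just an example), and the resulting .hprof file can be opened in Memory Analyzer:

# Take a heap dump of live objects from a running java process (output path is an example)
jmap -dump:live,format=b,file=/tmp/heapdump.hprof <pid>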

Once you know which objects are the problem, fix the source code. If the heap area is simply insufficient, you can also just increase it.

4. If there is a problem with non-heap, identify the problem from the number of threads

You can compare the number of threads by comparing thread dumps taken at different times.

# Check the pid of the java process
jcmd -l

# Get a thread dump
jstack <pid> > threaddump.txt
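
As a rough sketch for comparing two dumps, you can count thread entries; this assumes a HotSpot-format dump, in which each thread header line contains nid=0x...:

# Roughly count the threads recorded in a thread dump
grep -c 'nid=0x' threaddump.txt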

Also, jconsole lets you visualize how the number of threads changes over time, so you can see whether the thread count keeps increasing as time passes.

I haven't used it myself, but it seems very useful, so I'll introduce it: https://github.com/irockel/tda

You can also check the number of threads and memory usage with the following command, and compare whether they increase over time.

ps auxww -L | grep -e java -e PID | grep -v grep

I referred to this article: http://d.hatena.ne.jp/rx7/20101219/p1

Request: there are probably better ways, so please let me know if you know of any.

5. Suspect OOM Killer if memory usage is increasing for reasons other than the above

Linux behavior ~ page cache ~

One of Linux's design principles is to make active use of free memory. The annoying part is that this is not visible with the ps command. For example, ps aux may show the java process using 30% of memory, while sar -r 1 shows total memory usage at about 90%.

If this is happening, the difference is likely being used for the page cache. Incidentally, in my case about 60% of memory was used for the page cache.
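
To see how much memory the page cache is actually using, free (the buffers/cache figures) and sar can be checked side by side, for example:

# Check how much memory is used by buffers and the page cache
free -m
# Memory usage including cached memory, sampled every second, 3 times
sar -r 1 3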

#Clear all page cache
# echo 1 > /proc/sys/vm/drop_caches

The above clears the page cache, but it is a blunt tool because it also discards caches that are still useful. As a last resort, running it periodically with cron works around the problem of memory being consumed by the page cache.
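
If you do fall back to periodic clearing, one example sketch of an /etc/crontab entry is below (the hourly schedule is just an example; sync first so dirty pages are written out):

# /etc/crontab example: drop the page cache once an hour (schedule is an example)
0 * * * * root sync && echo 1 > /proc/sys/vm/drop_caches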

Of course, you should also investigate why so much is being cached in the first place. A typical suspect is that a large volume of logs is being written (i.e., heavy file I/O).
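
To check whether heavy file I/O is actually occurring, commands from the sysstat package such as the following can be used:

# Overall I/O and transfer-rate statistics, sampled every second, 3 times
sar -b 1 3
# Per-device I/O statistics
iostat -x 1 3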

The following article is helpful: https://tech.mercari.com/entry/2015/07/16/170310

Linux behavior ~ OOM Killer ~

Yes, a killer lurks in Linux. To prevent the system from panicking when it runs out of memory, Linux has a mechanism that forcibly kills a process that is using a lot of memory.

When a process is killed by the OOM Killer, which process was killed is written to the following file.

less /var/log/messages

02:53:58 xxxxx1 kernel: Out of memory: Killed process 28036, UID 80, (xmllint).

Also, if the pid in hs_err_<pid>.log (the log output when the JVM crashes) matches the pid of the process killed in /var/log/messages, you know that the process was killed by the OOM Killer.
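
A quick way to check for OOM Killer activity is to search the kernel messages for the kill log, for example:

# Search recent kernel messages and the syslog for OOM Killer entries
dmesg | grep -i 'killed process'
grep -i 'killed process' /var/log/messages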

The following article is helpful: https://blog.ybbo.net/2013/07/10/oom-killer%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6/

Things to be aware of when investigating

- If there are environmental differences, such as between the local environment and the verification environment, the problem may not reproduce. Behavior also changes depending on the OS, such as Linux (because the problem may lie outside the java process).

- Be careful about the timing when you take a heap dump of a batch process. For example, if you take a heap dump of a periodically running java process while the load is high, there will naturally be too many objects to compare meaningfully. When comparing, take the heap dumps under the same conditions.

- Do not investigate without a hypothesis or a policy. I think this is the most important point. It is tempting to poke around at whatever catches your eye and stare at graphs, but it is almost meaningless, so you should stop (I still catch myself doing it).

Summary

We have summarized how to identify the specific problem when an OOM occurs and how to solve it. I hope it helps people who are struggling with OOM problems, as I was.

Also, since this is just a summary of what I investigated, there are surely methods and perspectives beyond those introduced here. If you know of any, please let me know.
