[PYTHON] Memory leak after training a DQN with tensorflow==2.0.0

Introduction

Recently, I implemented a DQN for my research with some difficulty, and when I trained it, it consumed memory at an abnormal rate; it took me a week to track down the cause. I am keeping a brief record here so that others do not run into the same problem.

Environment

python==3.7
tensorflow==2.0.0
Ubuntu 18.04

What actually happened

I was using a training server in our lab called DeepStation. It has four GTX 1080 Ti GPUs and 64 GB of RAM, so nothing strange had ever happened even under fairly heavy load. The problem did not occur when training an ordinary end-to-end discriminative model.

However, I ran into a problem while training a DQN. I used the following article as a reference for this DQN implementation.

https://qiita.com/sugulu_Ogawa_ISID/items/bc7c70e6658f204f85f9

When I ran this code and left it unattended, it exceeded 64 GB of memory in about three hours and crashed. At first I suspected a problem in my own code, so I checked it line by line, but I struggled because I could find nothing wrong. Python rarely causes memory leaks in the first place, so a leak was not something I had seriously considered.
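One way to confirm that memory really is growing, before digging through the training code, is to log the process's peak memory every few episodes with the standard-library resource module. A rough sketch (the loop skeleton in the comment is only illustrative; on Linux ru_maxrss is reported in kilobytes):

```python
import resource

def rss_mb():
    """Peak resident set size of this process in MB (ru_maxrss is KB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Inside the training loop, log every N episodes, e.g.:
# for episode in range(num_episodes):
#     ...
#     model.fit(...)   # the call that leaked under TF 2.0.0
#     if episode % 100 == 0:
#         print(f"episode {episode}: peak RSS {rss_mb():.1f} MB")
```

If the logged value climbs steadily episode after episode instead of plateauing, the leak is confirmed without guessing from `top`.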

As you can see from the source code, every time the episode count exceeds a threshold, model.fit() is called to train the network incrementally. This was the problem. The following issue describes similar symptoms.

https://github.com/tensorflow/tensorflow/issues/33030

Reading the issue, calling model.fit() and model.predict() a few times is not a problem, but calling them a very large number of times triggers a memory leak. Presumably, state created by model.fit() or model.predict() was retained without ever being released.
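If upgrading is not immediately possible, two workarounds commonly suggested for this class of Keras leak are calling the model directly instead of model.predict() for single states, and periodically calling tf.keras.backend.clear_session(). A minimal sketch (the toy network shape here is made up for illustration, not taken from the referenced DQN):

```python
import numpy as np
import tensorflow as tf

# Toy Q-network: 4-dimensional state, 2 actions (illustrative only).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),
])

state = np.zeros((1, 4), dtype=np.float32)

# Workaround 1: call the model directly rather than model.predict(),
# avoiding predict()'s per-call bookkeeping for tiny single-state batches.
q_values = model(state, training=False).numpy()

# Workaround 2: periodically drop Keras's accumulated global state
# (after which the model must be rebuilt or reloaded).
tf.keras.backend.clear_session()
```

Neither is as clean as simply upgrading, but they can keep a long-running loop alive on an environment pinned to TF 2.0.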

Solution

Basically, updating TensorFlow fixes it. I solved the problem by installing tensorflow==2.3.0.

I didn't expect this to take a week. Memory leaks really do happen, even in Python. When I was running it from jupyter-lab, the crash did not release the memory immediately, so I couldn't even connect to the server over ssh, which was quite stressful. Even on a machine with plenty of memory, it is safer to limit Python's memory usage rather than trusting the headroom; I do that now. The following article may be a useful reference.

https://blog.imind.jp/entry/2019/08/10/022501
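As a concrete sketch of that idea: on Linux, the standard-library resource module can cap the process's address space, so a leak raises MemoryError in the Python process instead of taking down the whole server (the 32 GB cap below is an arbitrary example value):

```python
import resource

def limit_memory(max_bytes):
    """Cap this process's virtual address space (Linux only).
    Allocations beyond the cap fail with MemoryError instead of
    the machine swapping or the OOM killer firing."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Only tighten the soft limit; never try to raise the hard limit.
    new_soft = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    return new_soft

# Example: cap the training process at 32 GB on a 64 GB machine.
limit_memory(32 * 1024 ** 3)
```

Call this once at the top of the training script; a leaking run then dies early with a Python traceback rather than freezing the server.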

Finally

Even though I was working with a DQN, I couldn't find much information on this symptom, so it took me a long time to understand. I hope this will be helpful to everyone.
