[PYTHON] Try OpenAI's standard reinforcement learning algorithm PPO

OpenAI has [announced](https://blog.openai.com/openai-baselines-ppo/) that it will adopt an algorithm called PPO (Proximal Policy Optimization) as the organization's standard reinforcement learning algorithm. The code has also been released, so I'll give it a try. It is included in the reinforcement learning package called baselines.

I tried it on OS X 10.11.6, Python 3.5.1, and TensorFlow 1.2.1.

Swinging up the inverted pendulum (again!)

The installation procedure is described later; let's just run it first. The sample is run_atari.py from the baselines repository.

python run_atari.py

I started it running, but the Atari environment looks like it will take a long time on my MacBook Pro, so as usual let's use something lighter: the inverted pendulum. We'll use Pendulum-v0 from OpenAI Gym. You might ask just how much I like swinging up inverted pendulums, but it's easy and still gives just the right sense of accomplishment.
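For orientation, here is a minimal sketch (my own, not part of the original scripts) of what the Pendulum-v0 environment looks like: the observation is a 3-dimensional vector and the action is a single continuous torque.

import gym

# Inspect the Pendulum-v0 environment used below.
env = gym.make("Pendulum-v0")
print(env.observation_space)  # Box(3,): cos(theta), sin(theta), angular velocity
print(env.action_space)       # Box(1,): one continuous torque value

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())  # one random step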

That's fine, but the code that is the most fun from a user's point of view, such as saving the learning results and playing back a trained agent, isn't included, so let's write it. Since handling the individual coefficients is a hassle, I took the rough approach of saving and restoring the entire TensorFlow session. The code is [here](https://github.com/ashitani/PPO_pendulmn).

python run_pendulmn.py train
python run_pendulmn.py replay

These train the model and replay with the trained model, respectively.
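The save/restore itself is nothing special. Here is a minimal sketch of the "save the whole session" idea (my own illustration; the variable and file names are hypothetical and not the ones used in the repository):

import tensorflow as tf

# A dummy variable so the graph has something to save; in the real script
# the PPO policy variables live in the default graph instead.
w = tf.get_variable("w", shape=[3], initializer=tf.zeros_initializer())
saver = tf.train.Saver()  # covers every variable in the default graph

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training would happen here ...
    saver.save(sess, "./ppo_pendulum.ckpt")

with tf.Session() as sess:
    saver.restore(sess, "./ppo_pendulum.ckpt")
    # ... replay with the trained policy here ...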

The learning progress is written to monitor.json, so let's plot how the reward changes over time. The horizontal axis is the number of iterations and the vertical axis is the reward.

python plot_log.py

Running this spits out a png.
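For reference, a rough sketch of what the plotting script does (my own reconstruction; it assumes each non-header line of monitor.json is a JSON record with an "r" field holding the episode reward, which is roughly the format baselines' Monitor wrapper wrote at the time):

import json
import matplotlib.pyplot as plt

rewards = []
with open("monitor.json") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):  # skip the header/comment line
            continue
        record = json.loads(line)
        if "r" in record:
            rewards.append(record["r"])

plt.plot(rewards)
plt.xlabel("iteration")
plt.ylabel("reward")
plt.savefig("log.png")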

log_linear.png

Hmm. As usual, the behavior is not stable after the best score has been reached. Well, that's about all you can expect from reinforcement learning applied to an unstable system, but it would be nice to be able to keep the agents with the highest rewards.

Let's fiddle with the hyperparameters and see whether we can do even a little better. learn() takes a schedule argument, and this time I had the learning rate decay linearly with schedule="linear", but with a plain linear decay there is no period of calm training after the rate has fallen all the way. So I set a custom decay as follows: the rate decays quickly and, once it becomes extremely small, is held there for a while. This part lives in a file called pposgd_simple.py, so I modified it directly.

# Decay the learning-rate multiplier linearly to zero over the first half
# of training, then hold it at a small constant value for the remainder.
cur_lrmult = max(1.0 - float(timesteps_so_far) / (max_timesteps / 2), 0)
if cur_lrmult < 1e-5:
    cur_lrmult = 1e-5
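For comparison, the stock "linear" schedule decays the multiplier to zero only at max_timesteps, roughly like this (my paraphrase, not a verbatim quote of pposgd_simple.py):

# approximate behaviour of schedule == "linear"
cur_lrmult = max(1.0 - float(timesteps_so_far) / max_timesteps, 0)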

How about now?

log_custom.png

Yes, it's a little better. Maybe I just need to train for longer. Also, Pendulum-v0 seems to start from a random initial state, and I suspect that affects the results as well.

Let's replay the learning results.

out.gif

It looks good. Unlike in my previous entry, this agent outputs a continuous action, so the motion after the pendulum settles is especially clean.

Impressions

I can only be grateful for the code release: I was able to run PPO without reading a single line of the paper. That said, if it is going to be promoted to the standard algorithm, I think it could be made a little more approachable. Well, maybe that will come.

OpenAI Gym, from the same organization, is a great piece of work in that it unifies the interface on the environment side. Couldn't the agent side get a unified interface in the same way? Even the DQN and PPO in baselines are not unified (well, I understand that generalizing is hard).

I haven't benchmarked against DQN, but I suspect the difference won't show unless I try a harder problem.

Installation

For reference, here is the installation procedure as of today (July 22, 2017). Eventually a single pip install will probably be enough.

First, you need TensorFlow 1.0.0 or higher. For how to install TensorFlow, see the official documentation.

pip install --upgrade https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.2.1-py3-none-any.whl

Then install the latest baselines from git. It looks like pip should work too, but as of today the pip package seems to be inconsistent with the repository.

git clone https://github.com/openai/baselines.git

Add an __init__.py with the following contents to baselines/baselines/pposgd/.

from baselines.pposgd import *

Then install it:

cd baselines
python setup.py install

Install any other dependencies.

brew install openmpi
pip install mpi4py
pip install atari_py

With this, at least the run_atari.py sample runs.
