Introduction

I don't really understand reinforcement learning yet, but this article is for impatient people who would rather get something running first and leave the theory for later. In other words, people like me. OpenAI Gym provides environments for reinforcement learning, so I will use it. OpenAI Gym only provides the environment, so you need something else to do the actual learning. When I looked around, I found keras-rl, a library that does reinforcement learning with Keras, and it looked easy to try, so I used it. Thanks to those who came before me.

Preparing the environment

My environment

Python 3.6.0 :: Anaconda 4.3.1 (x86_64)
Mac OS Sierra 10.12.5
keras 2.0.5 (backend tensorflow)
tensorflow 1.2.0

At first I ran this on a server without a display, but that turned out to be a hassle, so I switched to my local machine. By the way, even a server without a display can apparently manage with Xvfb, which emulates a display in virtual memory.

Installation

pip install gym
pip install keras-rl

Both can be installed with pip. This assumes Keras is already installed.

CartPole

What is CartPole

CartPole is a game in which a pole stands on a cart, and you move the cart to keep the pole balanced so that it does not fall over.

It looks like this.

Screen Shot 2017-07-23 at 1.44.51.png

The cart can only move left and right. Therefore, the cart has two possible actions: right and left. Depending on the current state of the environment, you choose right or left to keep the pole balanced. This can be confirmed as follows.

import gym
env = gym.make('CartPole-v0')
env.action_space
# Discrete(2)

env.action_space.sample()
# 0

The information about the environment that is available to the cart can be checked in the same way:

env.observation_space
# Box(4,)

env.observation_space.sample()
# array([  4.68609638e-01, 1.46450285e+38, 8.60908446e-02, 3.05459097e+37])

It consists of these four values: in order, the cart position, the cart velocity, the pole angle, and the pole's angular velocity. (Aren't the cart and pole velocities way too fast?) The sample() method draws a random sample from the action space or the observation space.
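The huge velocity values are less mysterious once you look at the bounds that sample() draws from. The following is a minimal stand-in, not Gym's actual implementation. It assumes the CartPole-v0 observation bounds are about ±4.8 for the position, ±0.418 rad for the angle, and the float32 maximum for both velocities; these are values as I recall them from the Gym source, so treat them as assumptions. The names OBS_HIGH, sample_action, and sample_observation are made up for illustration.

```python
import random

# Assumed upper bounds of the CartPole-v0 observation space
# (the lower bounds are the negatives of these values).
# Order: cart position, cart velocity, pole angle (rad), pole angular velocity.
FLOAT32_MAX = 3.4028235e38
OBS_HIGH = [4.8, FLOAT32_MAX, 0.418, FLOAT32_MAX]


def sample_action(n_actions=2, rng=random):
    """Stand-in for Discrete(n).sample(): pick one action index uniformly."""
    return rng.randrange(n_actions)


def sample_observation(rng=random):
    """Stand-in for Box.sample(): draw each component uniformly within its bounds."""
    return [rng.uniform(-high, high) for high in OBS_HIGH]


if __name__ == "__main__":
    print(sample_action())
    print(sample_observation())
```

Because the velocity bounds are effectively unbounded, a uniform draw almost always lands on an astronomically large number. Such samples are useful for checking shapes and ranges, not as realistic states of the game.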

DQN example

keras-rl includes an example that solves this with DQN, so I use it as is. I wanted figures for this article, so I added just two lines (the ones marked with an "add" comment).

For background on DQN, articles such as "Get used to Keras while implementing [Python] Reinforcement Learning (DQN)" and "Reinforcement learning from zero to deep" are helpful.

As I understand it, DQN represents the action-value function with a deep neural network. In this case, that is the function expressing, for example, that when the pole is tilted to the right, moving the cart to the right is the more valuable action.
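To make the idea concrete, here is a hand-written toy stand-in for an action-value function. It is not the network that keras-rl learns: the function toy_q_values and its scoring rule are hypothetical, chosen only to show the shape of the interface. An observation goes in, one value per action comes out, and the greedy choice is the action with the largest value. The sketch assumes that action 0 means left, action 1 means right, and a positive pole angle means the pole leans to the right.

```python
def toy_q_values(observation):
    """Return hypothetical [Q(left), Q(right)] for a CartPole observation."""
    _cart_pos, _cart_vel, pole_angle, pole_ang_vel = observation
    # Positive when the pole leans or swings to the right (assumed sign convention).
    lean = pole_angle + 0.5 * pole_ang_vel
    # Moving toward the side the pole leans to gets the higher value.
    return [-lean, lean]


def greedy_action(q_values):
    """Index of the action with the highest value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    tilted_right = [0.0, 0.0, 0.05, 0.0]
    print(greedy_action(toy_q_values(tilted_right)))  # prints 1 (right)
```

In DQN the hand-written rule is replaced by a neural network whose weights are adjusted from experience, but the input and output play the same roles.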

import numpy as np
import gym
from gym import wrappers  # added

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
env = wrappers.Monitor(env, './CartPole')  # added
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

# Next, we build a very simple model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

# Okay, now it's time to learn something! We visualize the training here for show, but this
# slows down training quite a lot. You can always safely abort the training prematurely using
# Ctrl + C.
dqn.fit(env, nb_steps=50000, visualize=True, verbose=2)

# After training is done, we save the final weights.
dqn.save_weights('dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

# Finally, evaluate our algorithm for 5 episodes.
dqn.test(env, nb_episodes=5, visualize=True)

This example uses the `BoltzmannQPolicy()` policy. According to "Future Strengthening Learning", this policy selects an action by applying the softmax function to the values of the action-value function. The higher an action's value, the more likely it is to be chosen.
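The softmax selection described above can be sketched in a few lines of plain Python. This is an illustration of the technique, not keras-rl's actual code; the function names and the temperature parameter tau are assumptions made for this sketch.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=1.0):
    """Softmax over Q-values; tau is a temperature (higher means more random)."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(q_values)
    exps = [math.exp((q - top) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action index with probability proportional to exp(Q / tau)."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # The second action has the higher value, so it gets the larger probability.
    print(boltzmann_probabilities([1.0, 2.0]))
    print(boltzmann_action([1.0, 2.0]))
```

Unlike a purely greedy choice, lower-valued actions still get picked occasionally, which gives the agent a chance to explore.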

Result

1st episode

openaigym.video.0.43046.video000001.gif

An episode is the unit of learning in reinforcement learning: one episode lasts until the outcome of the game is decided. Since this is the first episode, nothing has been learned yet and the actions are essentially random.

The cart moves to the left even though the pole is about to fall to the right.

The animation cuts off abruptly because the game ends once the pole tilts 15 degrees or more (per the Gym description; the threshold in the Gym source code appears to be 12 degrees), so nothing further is drawn. The game also ends if the cart moves too far to the left or right.
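The end-of-episode rule can be written as a small check. This is a sketch under stated assumptions, not Gym's code: the default limits of 12 degrees and 2.4 units are the values I recall from the Gym CartPole source, and the function name episode_is_over is hypothetical. Pass angle_limit_deg=15 to follow the figure quoted in this article instead.

```python
import math


def episode_is_over(cart_position, pole_angle_rad,
                    angle_limit_deg=12.0, position_limit=2.4):
    """Return True when the pole has tilted too far or the cart has left the track.

    Both default limits are assumptions, not values confirmed by this article.
    """
    too_tilted = abs(pole_angle_rad) > math.radians(angle_limit_deg)
    out_of_bounds = abs(cart_position) > position_limit
    return too_tilted or out_of_bounds


if __name__ == "__main__":
    print(episode_is_over(0.0, 0.0))                # upright and centered
    print(episode_is_over(0.0, math.radians(20)))   # pole tilted 20 degrees
    print(episode_is_over(3.0, 0.0))                # cart far from the center
```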

Episode 216

openaigym.video.0.43046.video000216.gif

Oh ... it's holding up ...

In closing

- I want to try training it on Mario Kart ...

[PYTHON] Reinforcement learning in the shortest time with Keras with OpenAI Gym