[PYTHON] Reinforcement learning 12 ChainerRL quick start guide windows version

The ChainerRL Quick Start Guide is unfortunately awkward to run on Windows, so this is a Windows version, written without the consent of the original author. Anaconda is explained in many other places, so please refer to those. We use Anaconda3 and assume Python 3.7.

ChainerRL Quick Start Guide

This Notebook is a quick start guide for users who want to try ChainerRL for the first time. Run the following commands to install ChainerRL and the other required packages.

# Install Chainer, ChainerRL and CuPy!
!conda install cupy chainer
!pip -q install chainerrl
!pip -q install gym
!pip -q install pyglet
!pip -q install pyopengl
!pip -q install pyvirtualdisplay
!pip -q install JSAnimation
!pip -q install matplotlib
!pip -q install jupyter
!conda install -c conda-forge ffmpeg

First, you need to import the required modules. The module name of ChainerRL is chainerrl. Let's also import gym and numpy for later use.

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np

ChainerRL can be applied to any problem as long as the "environment" is modeled. OpenAI Gym provides various kinds of benchmark environments and defines a common interface among them. ChainerRL uses a subset of this interface. Specifically, the environment must define its state space (observation space) and action space, and must have at least two methods, reset and step.

env.reset resets the environment to its initial state and returns the initial observation. env.step executes a given action, moves the environment to the next state, and returns four values:

- observation
- scalar reward
- a boolean indicating whether the episode has terminated
- additional information

env.render renders the current state.

Now let's try CartPole-v0, a classic control problem. Below you can see that its observation space consists of four real numbers and its action space consists of two discrete actions.

env = gym.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
#env.render()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
observation space: Box(4,)
action space: Discrete(2)
initial observation: [-0.04055678 -0.00197163  0.02364212  0.03487198]
next observation: [-0.04059621 -0.1974245   0.02433956  0.33491948]
reward: 1.0
done: False
info: {}
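
For reference, an environment does not have to come from Gym: as described above, ChainerRL only needs observation_space, action_space, reset, and step. Below is a minimal sketch of a hand-written environment satisfying this interface; its dynamics and reward are made up purely for illustration, and the rest of this guide keeps using CartPole-v0.

# Minimal sketch of an environment that satisfies the interface ChainerRL uses.
# The dynamics and reward below are invented purely for illustration.
class ToyEnv:
    def __init__(self):
        self.observation_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.state = np.zeros(2, dtype=np.float32)

    def reset(self):
        self.state = np.zeros(2, dtype=np.float32)
        return self.state

    def step(self, action):
        # Move left or right; the episode ends once we leave [-1, 1].
        self.state[0] += 0.1 if action == 1 else -0.1
        reward = 1.0
        done = abs(self.state[0]) > 1.0
        return self.state, reward, done, {}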

You have now defined your environment. Next, we need to define an agent that learns through interaction with the environment.

ChainerRL offers a variety of agents, each implementing a deep reinforcement learning algorithm.

To use DQN (Deep Q-Network), you need to define a Q function that takes a state and returns the expected future return for each action the agent can take. In ChainerRL, a Q function can be defined as a chainer.Link as follows. Note that the output is wrapped in chainerrl.action_value.DiscreteActionValue, which implements chainerrl.action_value.ActionValue. By wrapping the Q function's output, ChainerRL can handle such discrete-action Q functions and NAFs (Normalized Advantage Functions) in the same way.

class QFunction(chainer.Chain):

    def __init__(self, obs_size, n_actions, n_hidden_channels=50):
        super().__init__()
        with self.init_scope():
            self.l0 = L.Linear(obs_size, n_hidden_channels)
            self.l1 = L.Linear(n_hidden_channels, n_hidden_channels)
            self.l2 = L.Linear(n_hidden_channels, n_actions)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = F.tanh(self.l0(x))
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)
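
To get a feel for what the wrapped output looks like, here is a small check. This is only a sketch; it assumes DiscreteActionValue exposes q_values and greedy_actions, as in ChainerRL's action_value module.

# Feed one observation through the untrained Q function and inspect the
# wrapped output (one Q value per action, plus the greedy action).
sample_obs = obs.astype(np.float32)[None, :]  # batch of one observation
action_value = q_func(sample_obs)
print('Q values:', action_value.q_values.array)
print('greedy action:', action_value.greedy_actions.array)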

As with Chainer, call to_gpu if you want to run the computation on CUDA.

If you use Colaboratory, you need to change the runtime type to GPU.

q_func.to_gpu(0)
<__main__.QFunction at 0x7f0bc217beb8>

You can also use ChainerRL's predefined Q functions.

_q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    obs_size, n_actions,
    n_hidden_layers=2, n_hidden_channels=50)

As in Chainer, chainer.Optimizer is used to update the model.

# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

The Q function and its optimizer are used by the DQN agent. To create a DQN agent, you need to specify a few more parameters and settings.

# Set the discount factor that discounts future rewards.
gamma = 0.95

# Use epsilon-greedy for exploration
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

# Since observations from CartPole-v0 are numpy.float64 while
# Chainer only accepts numpy.float32 by default, specify
# a converter as a feature extractor function phi.
phi = lambda x: x.astype(np.float32, copy=False)

# Now create an agent that will interact with the environment.
agent = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_interval=1,
    target_update_interval=100, phi=phi)

Now the agent and environment are ready. Let's start reinforcement learning!

During training, use agent.act_and_train to select exploratory actions. agent.stop_episode_and_train must be called after each episode ends. agent.get_statistics returns training statistics of the agent.

n_episodes = 200
max_episode_len = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    reward = 0
    done = False
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while not done and t < max_episode_len:
        # Uncomment to watch the behaviour
        # env.render()
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
    if i % 10 == 0:
        print('episode:', i,
              'R:', R,
              'statistics:', agent.get_statistics())
    agent.stop_episode_and_train(obs, reward, done)
print('Finished.')
episode: 10 R: 37.0 statistics: [('average_q', 1.2150215711003933), ('average_loss', 0.05015367301912823)]
episode: 20 R: 44.0 statistics: [('average_q', 3.7857904640201947), ('average_loss', 0.09890545599011519)]
episode: 30 R: 97.0 statistics: [('average_q', 7.7720408907953145), ('average_loss', 0.12504807923600555)]
episode: 40 R: 56.0 statistics: [('average_q', 10.963194695758215), ('average_loss', 0.15639676991049656)]
episode: 50 R: 177.0 statistics: [('average_q', 14.237965547239822), ('average_loss', 0.23526638038745168)]
episode: 60 R: 145.0 statistics: [('average_q', 17.240442032833762), ('average_loss', 0.16206694621384216)]
episode: 70 R: 175.0 statistics: [('average_q', 18.511116289009692), ('average_loss', 0.18787805607905012)]
episode: 80 R: 57.0 statistics: [('average_q', 18.951395985384725), ('average_loss', 0.149411012387425)]
episode: 90 R: 200.0 statistics: [('average_q', 19.599694542558165), ('average_loss', 0.16107124308010012)]
episode: 100 R: 200.0 statistics: [('average_q', 19.927458098228968), ('average_loss', 0.1474102671167888)]
episode: 110 R: 200.0 statistics: [('average_q', 19.943080568511867), ('average_loss', 0.12303519377444547)]
episode: 120 R: 152.0 statistics: [('average_q', 19.81996694327306), ('average_loss', 0.12570420169091834)]
episode: 130 R: 196.0 statistics: [('average_q', 19.961466224568177), ('average_loss', 0.17747677703107395)]
episode: 140 R: 194.0 statistics: [('average_q', 20.05166109574271), ('average_loss', 0.1334155925948816)]
episode: 150 R: 200.0 statistics: [('average_q', 19.982061292121358), ('average_loss', 0.12589899261907)]
episode: 160 R: 175.0 statistics: [('average_q', 20.060457421033803), ('average_loss', 0.13909796300744334)]
episode: 170 R: 200.0 statistics: [('average_q', 20.03359962493644), ('average_loss', 0.12457978502375021)]
episode: 180 R: 200.0 statistics: [('average_q', 20.023962037264738), ('average_loss', 0.10855797175237188)]
episode: 190 R: 200.0 statistics: [('average_q', 20.023348743333067), ('average_loss', 0.11714457311489457)]
episode: 200 R: 200.0 statistics: [('average_q', 19.924879051722634), ('average_loss', 0.08032495725586702)]
Finished.

This completes the agent's training. How well has this agent learned? You can test it using agent.act and agent.stop_episode. Exploration such as epsilon-greedy is not used here.
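
Before building the animation, here is a plain test loop as a sketch: it only uses agent.act and agent.stop_episode, and does no rendering.

# Plain evaluation loop: agent.act picks greedy actions, and
# agent.stop_episode resets the agent's episodic state afterwards.
for i in range(3):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        action = agent.act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    agent.stop_episode()
    print('test episode:', i, 'R:', R)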

To check the result in the Notebook, we display it using matplotlib's animation functionality.

from JSAnimation.IPython_display import display_animation
from matplotlib import animation
import matplotlib.pyplot as plt
%matplotlib inline
frames = []
for i in range(3):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        frames.append(env.render(mode = 'rgb_array'))
        action = agent.act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    print('test episode:', i, 'R:', R)
    agent.stop_episode()
env.render()

from IPython.display import HTML
plt.figure(figsize=(frames[0].shape[1]/72.0, frames[0].shape[0]/72.0),dpi=72)
patch = plt.imshow(frames[0])
plt.axis('off') 
def animate(i):
    patch.set_data(frames[i])
anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames),interval=50)
anim.save('movie_cartpole.mp4')
HTML(anim.to_jshtml())
test episode: 0 R: 200.0
test episode: 1 R: 200.0
test episode: 2 R: 200.0

If the test scores above are good enough, the only remaining work is to save the agent so it can be reused. Simply call agent.save to save the agent, and agent.load to load a saved agent.

# Save an agent to the 'agent' directory
agent.save('agent')

# Uncomment to load an agent from the 'agent' directory
# agent.load('agent')

With this, we have trained and tested a reinforcement learning agent.

However, writing code like this every time you implement reinforcement learning can be tedious. Therefore, ChainerRL has utility functions that do these things.

# Set up the logger to print info messages for understandability.
import logging
import sys
gym.undo_logger_setup()  # Turn off gym's default logger settings
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

chainerrl.experiments.train_agent_with_evaluation(
    agent, env,
    steps=20000,               # Train the agent for 20000 steps
    eval_n_steps=None,         # Evaluate by episodes, not by a fixed number of steps
    eval_n_episodes=10,        # 10 episodes are sampled for each evaluation
    eval_max_episode_len=200,  # Maximum length of each episode
    eval_interval=1000,   # Evaluate the agent after every 1000 steps
    outdir='result')      # Save everything to 'result' directory
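
As a sketch, the evaluation progress can also be inspected afterwards. This assumes that train_agent_with_evaluation writes a tab-separated scores.txt into the output directory with columns named steps and mean; check the file header if your version uses different names.

# Plot the mean evaluation return against training steps from 'result/scores.txt'
# (assumed default output layout; column names may differ between versions).
data = np.genfromtxt('result/scores.txt', delimiter='\t', names=True)
plt.plot(data['steps'], data['mean'])
plt.xlabel('steps')
plt.ylabel('mean evaluation return')
plt.show()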

That's all for the ChainerRL Quick Start Guide. To find out more about ChainerRL, take a look at the examples directory and read and run the examples. Thank you very much!

The ChainerRL file to modify was userfolder\.conda\envs\chainer\Lib\site-packages\chainerrl\experiments\train_agent.py. chainerui worked fine.

I referred to the page below (it is almost a full copy): https://book.mynavi.jp/manatee/detail/id=88961

I hope that Windows users will also be able to use ChainerRL.
