I am the author of "Let's learn AI with Scratch". In that book, Q-learning was explained in three stages using games as the theme, so that even beginners could understand it properly.
Here, as a sequel, I would like to introduce, in **Python**, basic **Q-learning**, its application with a **neural network**, and, as a somewhat experimental attempt, its application with the **short-term memory units LSTM/GRU**.
The program used here, **memoryRL**, can be downloaded from GitHub. It is designed so that training finishes in a short time (a few minutes at most) even on an ordinary PC without a GPU.
First, look at two examples.
The first example is a task called many_swamp. The task is to control the robot so that it visits all four goals (blue); when it does, the task is cleared and the episode ends. The placement of walls and goals changes randomly in each episode. The animation shows the result of training with **Q-learning using a neural network** (qnet).
The left side of the animation shows the entire environment. The black-and-white figures in the center and on the right correspond to the observation actually received by the reinforcement learning algorithm: a pattern of the surrounding goals (center) and walls (right) within 2 squares of the robot.
Although the robot can only see its surroundings up to 2 squares away, it appears to move according to rules such as "avoid walls", "turn toward a goal when one comes into view", and "go straight if there is nothing around". This **behavior itself was acquired through reinforcement learning**. Even though the map changes randomly, no additional learning is required.
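As a rough illustration of what such a robot-centric observation might look like (the array sizes and layout here are my assumptions, not the actual memoryRL format), each of the two patterns can be thought of as a small binary map cropped around the robot:

```python
import numpy as np

def robot_centric_view(field, robot_pos, view_range=2):
    """Crop a (2*view_range+1)-square window around the robot from a binary map.
    Cells outside the field are padded with zeros. Illustrative only."""
    size = 2 * view_range + 1
    view = np.zeros((size, size), dtype=field.dtype)
    r, c = robot_pos
    for i in range(size):
        for j in range(size):
            fr, fc = r - view_range + i, c - view_range + j
            if 0 <= fr < field.shape[0] and 0 <= fc < field.shape[1]:
                view[i, j] = field[fr, fc]
    return view

# Example: a 7x7 goal map with one goal; the observation is the 5x5 window around the robot.
goal_map = np.zeros((7, 7), dtype=int)
goal_map[2, 5] = 1
print(robot_centric_view(goal_map, robot_pos=(3, 3)))
```

The wall pattern would be cropped from a separate wall map in the same way.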
By the way, this learning was completed in less than **4 minutes** even on my old laptop (qnet, many_swamp, 33000 qstep, 3 minutes 40 seconds).
The second example is a task called Tmaze_either. The episode is cleared when the robot reaches the blue goal, but the goal disappears from the observation partway through. The goal is still there even though it can no longer be observed, so the robot must remember its position and turn in the correct direction at the T-junction. In other words, it is a task that **requires short-term memory**.
Such a problem cannot be solved by ordinary Q-learning, but it can be solved by Q-learning that incorporates the **short-term memory unit LSTM**.
Including the two examples above, the memoryRL project lets you experiment with seven task variations and four types of reinforcement learning agents.
This article mainly introduces how to use memoryRL; for more detailed explanations, such as the learning principles and implementation, please refer to the following: http://itoshi.main.jp/tech/0100-rl/rl_introduction/
Here, assuming Windows 10, we explain how to set up the environment for running memoryRL with Python.
First, install Anaconda as the base environment for Python.
The following article will be helpful: Installing Anaconda for Windows
Next, create a conda virtual environment so that you can pin a specific Python version and module versions.
The following will help you learn more about virtual environments: [For beginners] Try creating a virtual environment with Anaconda; python japan, Conda commands
Launch Anaconda Powershell Prompt from the Start menu.
Create a virtual environment named mRL with Python 3.6 using the following command (the name does not have to be mRL, but we will use mRL here).
(base)> conda create -n mRL python=3.6
Activate mRL (enter the virtual environment) with the following command.
(base)> conda activate mRL
Install specific versions of tensorflow, numpy, and h5py.
(mRL)> pip install tensorflow==1.12.0 numpy==1.16.1 h5py==2.10.0
Install the latest versions of opencv-python and matplotlib.
(mRL)> pip install opencv-python matplotlib
From now on, by entering this virtual environment mRL (> conda activate mRL), you can run the program with the modules (libraries) you just installed.
If you can use git, clone the repository with the following command and you are done.
(mRL)> git clone https://github.com/itoshin-tech/memoryRL.git
Below is an explanation for those who are not using git.
In your browser, go to the following URL: https://github.com/itoshin-tech/memoryRL
From the Code button, select Download ZIP, save the file to a suitable location on your PC, and unzip it; a folder called memoryRL-master will be created.
Now you are ready to go.
Move into the unzipped memoryRL-master folder.
(mRL)> cd C:\The path of the unzipped location\memoryRL-master\
Execute sim_swanptour.py with the following command.
(mRL)> python sim_swanptour.py
Then, the usage is displayed as follows.
----How to use---------------------------------------
Execute with 3 parameters
> python sim_swanptour.py [agt_type] [task_type] [process_type]
[agt_type] : q, qnet, lstm, gru
[task_type] :silent_ruin, open_field, many_swamp,
Tmaze_both, Tmaze_either, ruin_1swamp, ruin_2swamp,
[process_type] :learn/L, more/M, graph/G, anime/A
Example> python sim_swanptour.py q open_field L
---------------------------------------------------
As shown, run python sim_swanptour.py followed by the three parameters.
The agt_type and task_type options are illustrated later; the parameters are summarized in the tables below, and a rough sketch of how the script might read them follows the tables.
| [agt_type] | Reinforcement learning algorithm |
|---|---|
| q | Q-learning |
| qnet | Q-learning using a neural network |
| lstm | Q-learning with short-term memory using LSTM |
| gru | Q-learning with short-term memory using GRU |
| [task_type] | Type of task. Every task is cleared by reaching all blue goals |
|---|---|
| silent_ruin | Fixed map, 2 goals. |
| open_field | No walls, 1 goal. The position of the goal changes randomly. |
| many_swamp | Walls and 4 goals. The arrangement changes randomly. |
| Tmaze_both | T-maze. Requires short-term memory. |
| Tmaze_either | T-maze. Requires short-term memory. |
| ruin_1swamp | Walls and 1 goal. |
| ruin_2swamp | Walls and 2 goals. High difficulty. |
| [process_type] | Process type |
|---|---|
| learn/L | Learn from scratch |
| more/M | Do additional learning |
| graph/G | Display the learning curve |
| anime/A | Animated display of solving the task |
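As a rough sketch of how a script like this might read the three parameters (this is not the actual sim_swanptour.py source, just an illustration of the interface described above):

```python
import sys

# Minimal sketch of reading the three parameters described above.
# The real sim_swanptour.py may differ; the names follow the usage message.
if len(sys.argv) != 4:
    print('Execute with 3 parameters, e.g.')
    print('> python sim_swanptour.py q open_field L')
    sys.exit(1)

agt_type = sys.argv[1]      # q, qnet, lstm, gru
task_type = sys.argv[2]     # silent_ruin, open_field, many_swamp, ...
process_type = sys.argv[3]  # learn/L, more/M, graph/G, anime/A

if process_type in ('learn', 'L'):
    print('learning from scratch:', agt_type, task_type)
elif process_type in ('more', 'M'):
    print('additional learning:', agt_type, task_type)
elif process_type in ('graph', 'G'):
    print('showing the learning curve:', agt_type, task_type)
elif process_type in ('anime', 'A'):
    print('showing the animation:', agt_type, task_type)
```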
Now, as a concrete example, let us train the many_swamp task with qnet (Q-learning using a neural network).
Since we are starting learning from scratch, the last parameter should be learn or L.
(mRL)> python sim_swanptour.py qnet many_swamp L
Then, the progress of learning is displayed on the console as shown below, and learning runs for 5000 steps.
qnet many_swamp 1000 --- 5 sec, eval_rwd -3.19, eval_steps 30.00
qnet many_swamp 2000 --- 9 sec, eval_rwd -0.67, eval_steps 28.17
qnet many_swamp 3000 --- 14 sec, eval_rwd -0.21, eval_steps 26.59
qnet many_swamp 4000 --- 18 sec, eval_rwd -1.27, eval_steps 28.72
qnet many_swamp 5000 --- 23 sec, eval_rwd -1.29, eval_steps 28.90
Every 1000 steps, an **evaluation** process calculates eval_rwd and eval_steps. eval_rwd is the average reward per episode at that point, and eval_steps is the average number of steps per episode. During evaluation, learning is paused, the exploration noise in action selection is set to 0, and 100 episodes are run and averaged.
Learning ends early when eval_rwd or eval_steps reaches a target value (EARY_STOP). The target value is set for each task.
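The following is a minimal sketch of this evaluation and early-stopping procedure; the agent and env objects and their methods are hypothetical placeholders, not the actual memoryRL API.

```python
import numpy as np

def evaluate(agent, env, n_episodes=100):
    """Run evaluation episodes with learning paused and exploration noise set to 0.
    agent/env are hypothetical objects standing in for the memoryRL classes."""
    rewards, steps = [], []
    for _ in range(n_episodes):
        obs = env.reset()
        total_rwd, n_steps, done = 0.0, 0, False
        while not done:
            action = agent.select_action(obs, epsilon=0.0)  # no exploration noise
            obs, rwd, done = env.step(action)
            total_rwd += rwd
            n_steps += 1
        rewards.append(total_rwd)
        steps.append(n_steps)
    return np.mean(rewards), np.mean(steps)

def early_stop(eval_rwd, eval_steps, rwd_target=1.4, steps_target=22):
    """Task-specific targets; the defaults are the many_swamp values mentioned later."""
    return eval_rwd >= rwd_target or eval_steps <= steps_target
```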
Finally, the following graph of the learning process (eval_rwd, eval_steps) is displayed. Press [q] to close the graph.
To see an animation of the learned behavior, set the last parameter to anime or A.
(mRL)> python sim_swanptour.py qnet many_swamp A
Then, the following animation will be displayed.
The animation ends after 100 episodes. Press [q] to stop it partway.
Looking at the animation, you can see that the robot is not moving well yet; it has not learned enough (the black-and-white figures in the center and on the right of the animation show the input to the agent).
So let us do additional learning. Run the command again with the last parameter set to more or M (L would restart learning from scratch).
(mRL)> python sim_swanptour.py qnet many_swamp M
Repeat this command several times. When EARY_STOP is displayed and the process stops partway, learning has reached a good level. For many_swamp, EARY_STOP is triggered when eval_rwd is 1.4 or more or eval_steps is 22 or less.
qnet many_swamp 1000 --- 5 sec, eval_rwd 0.55, eval_steps 24.25
qnet many_swamp 2000 --- 9 sec, eval_rwd 0.92, eval_steps 23.52
qnet many_swamp 3000 --- 14 sec, eval_rwd 1.76, eval_steps 21.69
EARY_STOP_STEP 22 >= 21
The graph is displayed at the end. The reward per episode (rwd) has increased and the number of steps (Steps) has decreased, indicating that learning has progressed.
Let's see the animation.
(mRL)> python sim_swanptour.py qnet many_swamp A
It sometimes fails, but it seems to be working.
To display the learning curve so far, set the last parameter to graph or G.
(mRL)> python sim_swanptour.py qnet many_swamp G
That concludes the explanation of how to use sim_swanptour.py (pond tour).
Four reinforcement learning algorithms (agents) can be specified with [agt_type]: q, qnet, lstm, and gru. Here is a brief description of their features.
q is the basic Q-learning algorithm. For each observation, the Q-value of each action is stored in a table (the Q table) and updated. Up to 500 distinct observations can be registered; if more observation patterns appear, the program is forcibly terminated with a memory-over error.
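As a rough sketch of this idea (not the actual memoryRL code), the Q table can be a dictionary keyed by the observation pattern, with the 500-entry limit and illustrative hyperparameters:

```python
import numpy as np

N_ACTIONS = 3            # three actions (see the qnet description below)
MAX_OBS = 500            # limit on registered observation patterns
ALPHA, GAMMA = 0.1, 0.9  # illustrative learning rate and discount factor

q_table = {}  # observation (as a hashable key) -> array of Q values

def get_q(obs):
    key = tuple(np.ravel(obs))
    if key not in q_table:
        if len(q_table) >= MAX_OBS:
            raise MemoryError('more than %d observation patterns' % MAX_OBS)
        q_table[key] = np.zeros(N_ACTIONS)
    return q_table[key]

def update(obs, action, reward, next_obs, done):
    """Standard tabular Q-learning update."""
    q = get_q(obs)
    target = reward if done else reward + GAMMA * np.max(get_q(next_obs))
    q[action] += ALPHA * (target - q[action])
```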
qnet learns to output Q values with a neural network. The input is the observation and the output consists of three values, corresponding to the Q value of each action. The hidden layer has 64 ReLU units. Unlike q, it can output Q values even for observations it has never seen.
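With the TensorFlow 1.12 installed earlier, a network of this shape could be sketched with tf.keras as follows; the observation size (obs_dim) and the optimizer settings are assumptions for illustration, not the actual memoryRL implementation.

```python
import numpy as np
import tensorflow as tf

obs_dim = 50  # illustrative size of the flattened observation (goal + wall patterns)

# One hidden layer of 64 ReLU units; three outputs = Q values of the three actions.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(obs_dim,)),
    tf.keras.layers.Dense(3, activation='linear'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.001), loss='mse')

# The greedy action for an observation is the index of the largest Q value.
obs = np.zeros(obs_dim)
q_values = model.predict(obs[np.newaxis, :])
action = int(np.argmax(q_values[0]))
```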
lstm and gru add a memory unit (LSTM or GRU) to this model so that the output can depend on past inputs. The agent using LSTM is lstm, and the agent using GRU is gru.
q and qnet can only output the same action for the same observation, whereas the lstm and gru models can, in principle, output different actions for the same current observation when the past inputs differ.
LSTM is a memory unit often used in natural language processing models; GRU is a simplified variant of LSTM.
These agents use 64 ReLU units and 32 LSTM or GRU units.
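Again as a rough sketch (not the actual memoryRL code), a recurrent Q-network with 64 ReLU units and 32 LSTM or GRU units could be built with tf.keras like this; the observation size and the way sequences are fed in are simplified assumptions.

```python
import tensorflow as tf

obs_dim = 50  # illustrative size of the flattened observation

def build_recurrent_qnet(memory_unit='lstm'):
    """Q-network whose output depends on the history of observations (sketch)."""
    Recurrent = tf.keras.layers.LSTM if memory_unit == 'lstm' else tf.keras.layers.GRU
    model = tf.keras.Sequential([
        # Apply the 64-unit ReLU layer to each observation in the sequence.
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(64, activation='relu'),
            input_shape=(None, obs_dim)),               # (time steps, obs_dim)
        Recurrent(32),                                  # 32 memory units hold the short-term state
        tf.keras.layers.Dense(3, activation='linear'),  # Q values of the three actions
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(0.001), loss='mse')
    return model

qnet_lstm = build_recurrent_qnet('lstm')
qnet_gru = build_recurrent_qnet('gru')
```

Because the recurrent layer carries state across the sequence of observations, the same current observation can lead to different Q values depending on what was observed before, which is exactly what the T-maze tasks below require.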
There are seven tasks that can be specified with [task_type]: silent_ruin, open_field, many_swamp, Tmaze_both, Tmaze_either, ruin_1swamp, and ruin_2swamp. Here is a brief description of their features.
What all tasks have in common is the rule that the task is cleared when the robot reaches all goals (blue squares).
The information the algorithm receives is robot-centric goal and wall information within a limited field of view. The black-and-white diagrams on the right of each task figure correspond to this information.
The reward is +1.0 when the robot reaches a goal for the first time, -0.2 when it hits a wall, and -0.1 for every other step.
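As a minimal sketch of this reward rule (the event flags are hypothetical helpers, not the actual memoryRL interface):

```python
def step_reward(reached_new_goal, hit_wall):
    """Reward rule described above: +1.0 for reaching an unvisited goal,
    -0.2 for hitting a wall, -0.1 for any other step."""
    if reached_new_goal:
        return 1.0
    if hit_wall:
        return -0.2
    return -0.1
```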
silent_ruin
There are two goals, but the map is always the same, so the variation in observations is limited and learning with q is possible.
open_field
There are no walls, and the location of the goal changes randomly from episode to episode. Still, since there are no walls, the variation in observations is limited, and learning is possible even with q.
many_swamp
Since the positions of the goals and walls are determined randomly for each episode, there are many observation variations; q runs out of memory (memory over) and cannot learn, but qnet can.
Tmaze_both
This is a task that ordinary reinforcement learning cannot solve.
The map is fixed, but when the robot reaches one goal, it is returned to the starting point. The visited goal remains visible, and from this state the robot must move on to the other goal.
When the robot is at the starting point, the observation is the same immediately after the start and after visiting one of the goals. Therefore q and qnet, which can only select the same action for the same observation, cannot learn the appropriate behavior. Only lstm and gru, which can change their behavior based on past history, can learn this task.
Tmaze_either
In this T-maze, the goal appears on either the left or the right side, but it disappears after 2 steps.
The robot reaches the fork of the T-maze in two steps, so it must remember which side the goal appeared on and head for it. This task also cannot be solved with q or qnet; only lstm and gru can learn it.
ruin_1swamp
There are 8 walls and 1 goal, and the placement changes randomly. The difficulty is high because the robot may have to go around walls to reach the goal.
ruin_2swamp
This is the most difficult task. There are two goals, and when the robot reaches one, it is returned to the starting point. Visited goals do not disappear. It is like a randomized version of Tmaze_both. Even gru and lstm did not reach a satisfactory level.
The difficulty can be raised further by increasing the number of goals and the size of the field. In the future, I would like to develop a reinforcement learning algorithm that can solve such tasks well.
In this article, we mainly introduced how to use memoryRL; for more detailed explanations, such as the learning principles and implementation, please refer to the following: http://itoshi.main.jp/tech/0100-rl/rl_introduction/