I am the author of "Let's learn AI with Scratch". In that book, Q-learning was explained in three stages using games as the theme, so that even beginners could understand it properly.
Here, as a sequel, I would like to introduce, in **Python**, basic **Q-learning**, its application with a **neural network**, and, as a somewhat experimental attempt, its application with the **short-term memory units LSTM/GRU**.
The program used here, **memoryRL**, can be downloaded from GitHub. It is designed so that training finishes in a short time (a few minutes at most) even on an ordinary PC without a GPU.
First, look at two examples.
The first example is a task called many_swamp. The task is to control the robot so that it visits all four goals (blue); when it does, the task is cleared and the episode ends. The placement of walls and goals changes randomly in each episode. The animation shows the result of training with **Q-learning using a neural network** (qnet).
The left side of the animation shows the entire environment. The black-and-white figures in the center and on the right correspond to the observation actually received by the reinforcement learning algorithm: a pattern of the surrounding goals (center) and walls (right) within 2 squares of the robot.
Although the robot can only see its surroundings up to 2 squares away, it appears to move according to rules such as "avoid walls", "turn toward a goal when one comes into view", and "go straight if there is nothing around". This **behavior itself was acquired through reinforcement learning**. Even though the map changes randomly, no additional learning is required.
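As a rough illustration of what such a robot-centric observation might look like (the array sizes and layout here are my assumptions, not the actual memoryRL format), each of the two patterns can be thought of as a small binary map cropped around the robot:

```python
import numpy as np

def robot_centric_view(field, robot_pos, view_range=2):
    """Crop a (2*view_range+1)-square window around the robot from a binary map.
    Cells outside the field are padded with zeros. Illustrative only."""
    size = 2 * view_range + 1
    view = np.zeros((size, size), dtype=field.dtype)
    r, c = robot_pos
    for i in range(size):
        for j in range(size):
            fr, fc = r - view_range + i, c - view_range + j
            if 0 <= fr < field.shape[0] and 0 <= fc < field.shape[1]:
                view[i, j] = field[fr, fc]
    return view

# Example: a 7x7 goal map with one goal; the observation is the 5x5 window around the robot.
goal_map = np.zeros((7, 7), dtype=int)
goal_map[2, 5] = 1
print(robot_centric_view(goal_map, robot_pos=(3, 3)))
```

The wall pattern would be cropped from a separate wall map in the same way.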
By the way, this learning was completed in less than **4 minutes** even on my old laptop (qnet, many_swamp, 33000 qstep, 3 minutes 40 seconds).
The second example is a task called Tmaze_either. The episode is cleared when the robot reaches the blue goal, but the goal disappears from the observation partway through. The goal is still there even though it can no longer be observed, so the robot must remember its position and turn in the correct direction at the T-junction. In other words, it is a task that **requires short-term memory**.
Such a problem cannot be solved by ordinary Q-learning, but it can be solved by Q-learning that incorporates the **short-term memory unit LSTM**.
Including the two examples above, the memoryRL project lets you experiment with seven task variations and four types of reinforcement learning agents.
This article mainly introduces how to use memoryRL; for more detailed explanations, such as the learning principles and implementation, please refer to the following: http://itoshi.main.jp/tech/0100-rl/rl_introduction/
Here, assuming Windows 10, we explain how to set up the environment for running memoryRL with Python.
First, install Anaconda as the base environment for Python.
The following article will be helpful: Installing Anaconda for Windows
Next, create a conda virtual environment so that you can pin a specific Python version and module versions.
The following will help you learn more about virtual environments: [For beginners] Try creating a virtual environment with Anaconda; python japan, Conda commands
Launch Anaconda Powershell Prompt from the Start menu.
Create a virtual environment named mRL with Python 3.6 using the following command (the name does not have to be mRL, but we will use mRL here).
(base)> conda create -n mRL python=3.6
Activate mRL (enter the virtual environment) with the following command.
(base)> conda activate mRL
Install specific versions of tensorflow, numpy, and h5py.
(mRL)> pip install tensorflow==1.12.0 numpy==1.16.1 h5py==2.10.0
Install the latest versions of opencv-python and matplotlib.
(mRL)> pip install opencv-python matplotlib
From now on, by entering this virtual environment mRL (> conda activate mRL), you can run the program with the modules (libraries) you just installed.
If you can use git, clone the repository with the following command and you are done.
(mRL)> git clone https://github.com/itoshin-tech/memoryRL.git
Below is an explanation for those who are not using git.
In your browser, go to the following URL: https://github.com/itoshin-tech/memoryRL
From the Code button, select Download ZIP, save the file to a suitable location on your PC, and unzip it; a folder called memoryRL-master will be created.
Now you are ready to go.
Move into the unzipped memoryRL-master folder.
(mRL)> cd C:\The path of the unzipped location\memoryRL-master\
Execute sim_swanptour.py with the following command.
(mRL)> python sim_swanptour.py
Then, the usage is displayed as follows.
----How to use---------------------------------------
Execute with 3 parameters
> python sim_swanptour.py [agt_type] [task_type] [process_type]
[agt_type] : q, qnet, lstm, gru
[task_type] :silent_ruin, open_field, many_swamp,
Tmaze_both, Tmaze_either, ruin_1swamp, ruin_2swamp,
[process_type] :learn/L, more/M, graph/G, anime/A
Example> python sim_swanptour.py q open_field L
---------------------------------------------------
As shown, run python sim_swanptour.py followed by the three parameters.
The agt_type and task_type options are illustrated later; the parameters are summarized in the tables below, and a rough sketch of how the script might read them follows the tables.
| [agt_type] | Reinforcement learning algorithm |
|---|---|
| q | Q-learning |
| qnet | Q-learning using a neural network |
| lstm | Q-learning with short-term memory using LSTM |
| gru | Q-learning with short-term memory using GRU |
| [task_type] | Type of task. Every task is cleared by reaching all blue goals |
|---|---|
| silent_ruin | Fixed map, 2 goals. |
| open_field | No walls, 1 goal. The position of the goal changes randomly. |
| many_swamp | Walls and 4 goals. The arrangement changes randomly. |
| Tmaze_both | T-maze. Requires short-term memory. |
| Tmaze_either | T-maze. Requires short-term memory. |
| ruin_1swamp | Walls and 1 goal. |
| ruin_2swamp | Walls and 2 goals. High difficulty. |
| [process_type] | Process type |
|---|---|
| learn/L | Learn from scratch |
| more/M | Do additional learning |
| graph/G | Display the learning curve |
| anime/A | Animated display of solving the task |
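As a rough sketch of how a script like this might read the three parameters (this is not the actual sim_swanptour.py source, just an illustration of the interface described above):

```python
import sys

# Minimal sketch of reading the three parameters described above.
# The real sim_swanptour.py may differ; the names follow the usage message.
if len(sys.argv) != 4:
    print('Execute with 3 parameters, e.g.')
    print('> python sim_swanptour.py q open_field L')
    sys.exit(1)

agt_type = sys.argv[1]      # q, qnet, lstm, gru
task_type = sys.argv[2]     # silent_ruin, open_field, many_swamp, ...
process_type = sys.argv[3]  # learn/L, more/M, graph/G, anime/A

if process_type in ('learn', 'L'):
    print('learning from scratch:', agt_type, task_type)
elif process_type in ('more', 'M'):
    print('additional learning:', agt_type, task_type)
elif process_type in ('graph', 'G'):
    print('showing the learning curve:', agt_type, task_type)
elif process_type in ('anime', 'A'):
    print('showing the animation:', agt_type, task_type)
```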
Now, as a concrete example, let us train the many_swamp task with qnet (Q-learning using a neural network).
Since we are starting learning from scratch, the last parameter should be learn or L.
(mRL)> python sim_swanptour.py qnet many_swamp L
Then, the progress of learning is displayed on the console as shown below, and learning runs for 5000 steps.
qnet many_swamp 1000 --- 5 sec, eval_rwd -3.19, eval_steps 30.00
qnet many_swamp 2000 --- 9 sec, eval_rwd -0.67, eval_steps 28.17
qnet many_swamp 3000 --- 14 sec, eval_rwd -0.21, eval_steps 26.59
qnet many_swamp 4000 --- 18 sec, eval_rwd -1.27, eval_steps 28.72
qnet many_swamp 5000 --- 23 sec, eval_rwd -1.29, eval_steps 28.90
Every 1000 steps, an **evaluation** process calculates eval_rwd and eval_steps. eval_rwd is the average reward per episode at that point, and eval_steps is the average number of steps per episode. During evaluation, learning is paused, the exploration noise in action selection is set to 0, and 100 episodes are run and averaged.
Learning ends early when eval_rwd or eval_steps reaches a target value (EARY_STOP). The target value is set for each task.
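The following is a minimal sketch of this evaluation and early-stopping procedure; the agent and env objects and their methods are hypothetical placeholders, not the actual memoryRL API.

```python
import numpy as np

def evaluate(agent, env, n_episodes=100):
    """Run evaluation episodes with learning paused and exploration noise set to 0.
    agent/env are hypothetical objects standing in for the memoryRL classes."""
    rewards, steps = [], []
    for _ in range(n_episodes):
        obs = env.reset()
        total_rwd, n_steps, done = 0.0, 0, False
        while not done:
            action = agent.select_action(obs, epsilon=0.0)  # no exploration noise
            obs, rwd, done = env.step(action)
            total_rwd += rwd
            n_steps += 1
        rewards.append(total_rwd)
        steps.append(n_steps)
    return np.mean(rewards), np.mean(steps)

def early_stop(eval_rwd, eval_steps, rwd_target=1.4, steps_target=22):
    """Task-specific targets; the defaults are the many_swamp values mentioned later."""
    return eval_rwd >= rwd_target or eval_steps <= steps_target
```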
Finally, the following graph of the learning process (eval_rwd, eval_steps) is displayed. Press [q] to close the graph.
To see an animation of the learned behavior, set the last parameter to anime or A.
(mRL)> python sim_swanptour.py qnet many_swamp A
Then, the following animation will be displayed.
The animation ends after 100 episodes. Press [q] to stop it partway.
Looking at the animation, you can see that the robot is not moving well yet; it has not learned enough (the black-and-white figures in the center and on the right of the animation show the input to the agent).
So let us do additional learning. Run the command again with the last parameter set to more or M (L would restart learning from scratch).
(mRL)> python sim_swanptour.py qnet many_swamp M
Repeat this command several times. When EARY_STOP is displayed and the process stops partway, learning has reached a good level. For many_swamp, EARY_STOP is triggered when eval_rwd is 1.4 or more or eval_steps is 22 or less.
qnet many_swamp 1000 --- 5 sec, eval_rwd 0.55, eval_steps 24.25
qnet many_swamp 2000 --- 9 sec, eval_rwd 0.92, eval_steps 23.52
qnet many_swamp 3000 --- 14 sec, eval_rwd 1.76, eval_steps 21.69
EARY_STOP_STEP 22 >= 21
The graph is displayed at the end. The reward per episode (rwd) has increased and the number of steps (Steps) has decreased, indicating that learning has progressed.
Let's see the animation.
(mRL)> python sim_swanptour.py qnet many_swamp A
It sometimes fails, but it seems to be working.
To display the learning curve so far, set the last parameter to graph or G.
(mRL)> python sim_swanptour.py qnet many_swamp G
That concludes the explanation of how to use sim_swanptour.py (pond tour).
Four reinforcement learning algorithms (agents) can be specified with [agt_type]: q, qnet, lstm, and gru. Here is a brief description of their features.
q is the basic Q-learning algorithm. For each observation, the Q-value of each action is stored in a table (the Q table) and updated. Up to 500 distinct observations can be registered; if more observation patterns appear, the program is forcibly terminated with a memory-over error.
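As a rough sketch of this idea (not the actual memoryRL code), the Q table can be a dictionary keyed by the observation pattern, with the 500-entry limit and illustrative hyperparameters:

```python
import numpy as np

N_ACTIONS = 3            # three actions (see the qnet description below)
MAX_OBS = 500            # limit on registered observation patterns
ALPHA, GAMMA = 0.1, 0.9  # illustrative learning rate and discount factor

q_table = {}  # observation (as a hashable key) -> array of Q values

def get_q(obs):
    key = tuple(np.ravel(obs))
    if key not in q_table:
        if len(q_table) >= MAX_OBS:
            raise MemoryError('more than %d observation patterns' % MAX_OBS)
        q_table[key] = np.zeros(N_ACTIONS)
    return q_table[key]

def update(obs, action, reward, next_obs, done):
    """Standard tabular Q-learning update."""
    q = get_q(obs)
    target = reward if done else reward + GAMMA * np.max(get_q(next_obs))
    q[action] += ALPHA * (target - q[action])
```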
qnet learns to output Q values with a neural network. The input is the observation and the output consists of three values, corresponding to the Q value of each action. The hidden layer has 64 ReLU units. Unlike q, it can output Q values even for observations it has never seen.
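With the TensorFlow 1.12 installed earlier, a network of this shape could be sketched with tf.keras as follows; the observation size (obs_dim) and the optimizer settings are assumptions for illustration, not the actual memoryRL implementation.

```python
import numpy as np
import tensorflow as tf

obs_dim = 50  # illustrative size of the flattened observation (goal + wall patterns)

# One hidden layer of 64 ReLU units; three outputs = Q values of the three actions.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(obs_dim,)),
    tf.keras.layers.Dense(3, activation='linear'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.001), loss='mse')

# The greedy action for an observation is the index of the largest Q value.
obs = np.zeros(obs_dim)
q_values = model.predict(obs[np.newaxis, :])
action = int(np.argmax(q_values[0]))
```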
lstm and gru add a memory unit (LSTM or GRU) to this model so that the output can depend on past inputs. The agent using LSTM is lstm, and the agent using GRU is gru.
q and qnet can only output the same action for the same observation, whereas the lstm and gru models can, in principle, output different actions for the same current observation when the past inputs differ.
LSTM is a memory unit often used in natural language processing models; GRU is a simplified variant of LSTM.
These agents use 64 ReLU units and 32 LSTM or GRU units.
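Again as a rough sketch (not the actual memoryRL code), a recurrent Q-network with 64 ReLU units and 32 LSTM or GRU units could be built with tf.keras like this; the observation size and the way sequences are fed in are simplified assumptions.

```python
import tensorflow as tf

obs_dim = 50  # illustrative size of the flattened observation

def build_recurrent_qnet(memory_unit='lstm'):
    """Q-network whose output depends on the history of observations (sketch)."""
    Recurrent = tf.keras.layers.LSTM if memory_unit == 'lstm' else tf.keras.layers.GRU
    model = tf.keras.Sequential([
        # Apply the 64-unit ReLU layer to each observation in the sequence.
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(64, activation='relu'),
            input_shape=(None, obs_dim)),               # (time steps, obs_dim)
        Recurrent(32),                                  # 32 memory units hold the short-term state
        tf.keras.layers.Dense(3, activation='linear'),  # Q values of the three actions
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(0.001), loss='mse')
    return model

qnet_lstm = build_recurrent_qnet('lstm')
qnet_gru = build_recurrent_qnet('gru')
```

Because the recurrent layer carries state across the sequence of observations, the same current observation can lead to different Q values depending on what was observed before, which is exactly what the T-maze tasks below require.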
There are seven tasks that can be specified with [task_type]: silent_ruin, open_field, many_swamp, Tmaze_both, Tmaze_either, ruin_1swamp, and ruin_2swamp. Here is a brief description of their features.
What all tasks have in common is the rule that the task is cleared when the robot reaches all goals (blue squares).
The information the algorithm receives is robot-centric goal and wall information within a limited field of view. The black-and-white diagrams on the right of each task figure correspond to this information.
The reward is +1.0 when the robot reaches a goal for the first time, -0.2 when it hits a wall, and -0.1 for every other step.
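As a minimal sketch of this reward rule (the event flags are hypothetical helpers, not the actual memoryRL interface):

```python
def step_reward(reached_new_goal, hit_wall):
    """Reward rule described above: +1.0 for reaching an unvisited goal,
    -0.2 for hitting a wall, -0.1 for any other step."""
    if reached_new_goal:
        return 1.0
    if hit_wall:
        return -0.2
    return -0.1
```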
silent_ruin
There are two goals, but the map is always the same, so the variation in observations is limited and learning with q is possible.
open_field
There are no walls, and the location of the goal changes randomly from episode to episode. Still, since there are no walls, the variation in observations is limited, and learning is possible even with q.
many_swamp
Since the positions of the goals and walls are determined randomly for each episode, there are many observation variations; q runs out of memory (memory over) and cannot learn, but qnet can.
Tmaze_both
This is a task that ordinary reinforcement learning cannot solve.
The map is fixed, but when the robot reaches one goal, it is returned to the starting point. The visited goal remains visible, and from this state the robot must move on to the other goal.
When the robot is at the starting point, the observation is the same immediately after the start and after visiting one of the goals. Therefore q and qnet, which can only select the same action for the same observation, cannot learn the appropriate behavior. Only lstm and gru, which can change their behavior based on past history, can learn this task.
Tmaze_either
In this T-maze, the goal appears on either the left or the right side, but it disappears after 2 steps.
The robot reaches the fork of the T-maze in two steps, so it must remember which side the goal appeared on and head for it. This task also cannot be solved with q or qnet; only lstm and gru can learn it.
ruin_1swamp
There are 8 walls and 1 goal, and the placement changes randomly. The difficulty is high because the robot may have to go around walls to reach the goal.
ruin_2swamp
This is the most difficult task. There are two goals, and when the robot reaches one, it is returned to the starting point. Visited goals do not disappear. It is like a randomized version of Tmaze_both. Even gru and lstm did not reach a satisfactory level.
The difficulty can be raised further by increasing the number of goals and the size of the field. In the future, I would like to develop a reinforcement learning algorithm that can solve such tasks well.
In this article, we mainly introduced how to use memoryRL; for more detailed explanations, such as the learning principles and implementation, please refer to the following: http://itoshi.main.jp/tech/0100-rl/rl_introduction/