[PYTHON] Solve your own maze with Q-learning

Overview

Using Q-learning, one of the reinforcement learning methods, I tried to solve a simple maze I made myself. I implemented it in Python.

Reinforcement learning

Reinforcement learning is one of the main machine learning approaches, alongside supervised learning and unsupervised learning. Briefly, an "agent" is trained to maximize "value" in a given "environment". Reinforcement learning is generally described with three elements: state *s*, action *a*, and reward *r*. Each element is described below.

- State *s*: the state of the "environment".
- Action *a*: the action taken by the "agent" in a given state.
- Reward *r*: the reward obtained as a result of the "agent" performing an action.

If we know the "value of performing action *a* in a certain state *s*", the agent can be trained to maximize that value.

Q-learning

In Q-learning, the "value of performing action *a* in a certain state *s*" from the previous section is called the Q value, and learning proceeds by repeatedly updating this value. The Q value of performing action $a_t$ in state $s_t$ is written $Q(s_t,a_t)$. This Q value is updated using the reward *r* that is received, but not only the action that directly earns the reward is valuable: the action before it, and the one before that, are also valuable in the sense that they bring the agent closer to the reward. Therefore, by folding the expected value of the reward that will eventually be obtained into the update, we compute how valuable each state-action pair is at each point in time. The concrete update formula is as follows.

$$ Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \left( r_{t+1}+\gamma \max_{a} Q(s_{t+1},a)-Q(s_t,a_t) \right) $$

To update $Q(s_t,a_t)$, we use the reward at the next time step, $r_{t+1}$, and the maximum Q value in the next state $s_{t+1}$, $\max_{a} Q(s_{t+1},a)$. $\alpha$ (the learning rate) and $\gamma$ (the discount rate) are hyperparameters.
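
As a concrete illustration (the numbers are my own, using the hyperparameters $\alpha=0.5$ and $\gamma=0.8$ that appear later in train.py): if every Q value starts at 0 and a move from some state reaches the goal with reward 1, the update gives $Q(s_t,a_t) \leftarrow 0 + 0.5(1 + 0.8 \cdot 0 - 0) = 0.5$. On a later episode, a move from the square just before it receives reward 0 but now sees $\max Q = 0.5$ in its next state, so its own value becomes $0 + 0.5(0 + 0.8 \cdot 0.5 - 0) = 0.2$. In this way the reward at the goal gradually propagates backwards along the path.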

Environment

python 3.5.2

Implementation (maze part)

The implementation of the maze part is as follows. The maze is a $5 \times 5$ grid. (See the comments in the code for details.)

treasure.py


import numpy as np

class Game():
    def __init__(self):
        self.x_size = 5
        self.y_size = 5
        self.init_position = np.array([4,0])
        self.game_board = np.zeros((self.x_size,self.y_size))
        """
        Maze structure
        0 0 0 0 G
        0 D 0 D D
        0 0 D 0 0
        D 0 0 0 0
        S 0 D 0 0

        G = 1
        D = -1
        """
        self.game_board[0][4] = 1
        self.game_board[1][1] = -1
        self.game_board[3][0] = -1
        self.game_board[2][2] = -1
        self.game_board[1][3] = -1
        self.game_board[1][4] = -1
        self.game_board[4][2] = -1
    
    def judge(self,x,y):
        #Judge whether the game has ended: 0 = continue, 1 = goal (G), -1 = dead (D)
        if self.game_board[x][y] == 0:
            return 0
        elif self.game_board[x][y] == 1:
            return 1
        else:
            return -1
    
    def check_size(self,x,y):
        #Return 1 if (x, y) is inside the maze, 0 if it is out of bounds
        if x < 0 or y < 0 or x >= self.x_size or y >= self.y_size:
            return 0
        else:
            return 1
    
    def move(self,position,direction):
        #row index decreases moving up, column index increases moving right
        if direction == 'up':
            position[0] -= 1
        elif direction == 'down':
            position[0] += 1
        elif direction == 'right':
            position[1] += 1
        elif direction == 'left':
            position[1] -= 1
        
        return position

The rules are as follows: the player starts from the S square and, in each state, can move one square up, down, left, or right. Reaching G counts as a goal and gives a positive reward, while moving onto a D square ends the game and gives a negative reward as punishment.
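
As a minimal usage sketch (my own addition, not part of the original code), the Game class can be exercised by hand like this, assuming it is saved as treasure.py:

from treasure import Game

game = Game()
position = game.init_position.copy()    #start at S = [4, 0]

position = game.move(position, 'up')    #move to [3, 0], which is a D square
if game.check_size(position[0], position[1]):
    print(game.judge(position[0], position[1]))    #prints -1: the game is over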

Implementation (learning part)

The code looks like this.

Whole learning part code

train.py


import numpy as np
from treasure import Game
import copy
import matplotlib.pyplot as plt

#board process
game = Game()
game_board = game.game_board
print(game_board)
game_direction = ['up','down','left','right']

def get_action(Q_table,dire,epsilon,x,y):

    #random choice
    if np.random.rand() < epsilon:
        return np.random.choice(dire)

    else:
        return dire[np.argmax(Q_table[x,y])]

def game_init():
    #init process
    game = Game()
    position= game.init_position
    
    return game,position

def game_reward(game,position):
    #return the reward for the current square
    if game.judge(position[0],position[1]) == 1:
        #print('You got a goal!')
        return 1
    elif game.judge(position[0],position[1]) == -1:
        #print('You died..')
        return -1
    else:
        return 0

def game_step(game,position,Q_table,dire,epsilon):
    while(True):
        pos = copy.deepcopy(position)
        direction = get_action(Q_table,dire,epsilon,pos[0],pos[1])
        index_dire = dire.index(direction)
        move_position = game.move(pos,direction)
        if game.check_size(pos[0],pos[1]):
            break
    reward = game_reward(game,move_position)

    return move_position,index_dire,reward

def Q_koushin(Q,state_x,state_y,state_a,s_next_x,state_next_y,alpha,reward,gamma):
    #Q value update ("koushin" is Japanese for "update"); implements the update formula above
    Q[state_x,state_y,state_a] += alpha*(reward + gamma*np.max(Q[s_next_x,state_next_y]) - Q[state_x,state_y,state_a])
    return Q[state_x,state_y,state_a]


if __name__ == '__main__':
    #hyper parameter
    epsilon = 0.01
    alpha = 0.5
    gamma = 0.8
    Q_table = np.zeros((game_board.shape[0],game_board.shape[1],len(game_direction)))
    episode = 200
    success = []

    for i in range(episode):
        game,position = game_init()
        while(not game.judge(position[0],position[1])):
            next_position,dire,reward = game_step(game,position,Q_table,game_direction,epsilon)
            Q_table[position[0],position[1],dire] = Q_koushin(Q_table,position[0],position[1],dire,next_position[0],next_position[1],alpha,reward,gamma)
            position = next_position
        
        if (i+1) % 20 == 0:
            count = 0
            heatmap = np.zeros((game_board.shape[0],game_board.shape[1]))
            for j in range(100):
                game,position = game_init()
                while(not game.judge(position[0],position[1])):
                    next_position,dire,reward = game_step(game,position,Q_table,game_direction,epsilon)
                    position = next_position
                    heatmap[next_position[0]][next_position[1]] += 1
                if reward == 1:
                    count += 1
            success.append(count)
            print('%d times' %(i+1))
            print(heatmap)
    print(success)

I will explain each part below.

**Action selection by the epsilon-greedy method**

train.py


def get_action(Q_table,dire,epsilon,x,y):
    #random choice
    if np.random.rand() < epsilon:
        return np.random.choice(dire)

    else:
        return dire[np.argmax(Q_table[x,y])]

With probability $\epsilon$ the agent moves randomly; otherwise it selects the action with the highest Q value.
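
As a small check of how rarely the random branch fires with $\epsilon = 0.01$ (my own sketch; the dummy Q row and names below are just for illustration):

import numpy as np

directions = ['up','down','left','right']
Q_row = np.array([0.1, 0.2, 0.0, 0.9])   #dummy Q values: 'right' is clearly best
epsilon = 0.01

def epsilon_greedy(Q_row, epsilon):
    #same logic as get_action, written against a single Q row
    if np.random.rand() < epsilon:
        return np.random.choice(directions)
    return directions[np.argmax(Q_row)]

choices = [epsilon_greedy(Q_row, epsilon) for _ in range(10000)]
print(choices.count('right') / len(choices))   #roughly 0.99: almost always greedy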

**Reward function settings**

train.py


def game_reward(game,position):
    #return the reward for the current square
    if game.judge(position[0],position[1]) == 1:
        #print('You got a goal!')
        return 1
    elif game.judge(position[0],position[1]) == -1:
        #print('You died..')
        return -1
    else:
        return 0

The reward is 1 when the goal is reached, -1 as a penalty when the game ends on a D square, and 0 otherwise.
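
A quick check of the three cases (my own addition), assuming the definitions in train.py and treasure.py are in scope; the coordinates are taken directly from the maze definition:

game = Game()
print(game_reward(game, [0, 4]))   #G square -> 1
print(game_reward(game, [1, 1]))   #D square -> -1
print(game_reward(game, [4, 0]))   #S square (empty) -> 0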

**Updating the Q value**

train.py


def Q_koushin(Q,state_x,state_y,state_a,s_next_x,state_next_y,alpha,reward,gamma):
    #Q value update ("koushin" is Japanese for "update"); implements the update formula above
    Q[state_x,state_y,state_a] += alpha*(reward + gamma*np.max(Q[s_next_x,state_next_y]) - Q[state_x,state_y,state_a])
    return Q[state_x,state_y,state_a]

Update the Q value according to the formula.
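
As a quick numerical check (my own addition, assuming Q_koushin and numpy are in scope, e.g. in the same session as train.py), two calls reproduce the hand calculation from the Q-learning section with alpha = 0.5 and gamma = 0.8:

Q = np.zeros((5, 5, 4))
#a move from (0,3) into the goal square (0,4): reward 1, next-state Q values still 0
print(Q_koushin(Q, 0, 3, 3, 0, 4, 0.5, 1, 0.8))   #0.5
#a later move from (0,2) into (0,3): reward 0, but max Q of the next state is now 0.5
print(Q_koushin(Q, 0, 2, 3, 0, 3, 0.5, 0, 0.8))   #0.2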

**Moving the player**

train.py


def game_step(game,position,Q_table,dire,epsilon):
    while(True):
        pos = copy.deepcopy(position)
        direction = get_action(Q_table,dire,epsilon,pos[0],pos[1])
        index_dire = dire.index(direction)
        move_position = game.move(pos,direction)
        if game.check_size(pos[0],pos[1]):
            break
    reward = game_reward(game,move_position)

    return move_position,index_dire,reward

This function makes one move during the game. It checks whether the move would leave the maze, and if the move is valid it returns the new position, the index of the chosen direction, and the reward obtained. In Python, the position (a NumPy array) is passed by reference, so I made a deep copy so that the original position would not be modified.
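
A small illustration of why the copy is needed (my own addition; a plain position.copy() would also work since the position is a NumPy array):

import copy
import numpy as np

position = np.array([4, 0])
alias = position                   #no copy: both names refer to the same array
alias[0] -= 1
print(position)                    #[3 0] -- the original position has changed

position = np.array([4, 0])
pos = copy.deepcopy(position)      #independent copy, as in game_step
pos[0] -= 1
print(position)                    #[4 0] -- the original position is untouched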

**Learning part**

train.py


if __name__ == '__main__':
    #hyper parameter
    epsilon = 0.01
    alpha = 0.5
    gamma = 0.8
    Q_table = np.zeros((game_board.shape[0],game_board.shape[1],len(game_direction)))
    episode = 200
    success = []

    for i in range(episode):
        game,position = game_init()
        while(not game.judge(position[0],position[1])):
            next_position,dire,reward = game_step(game,position,Q_table,game_direction,epsilon)
            Q_table[position[0],position[1],dire] = Q_koushin(Q_table,position[0],position[1],dire,next_position[0],next_position[1],alpha,reward,gamma)
            position = next_position
        
        if (i+1) % 20 == 0:
            count = 0
            heatmap = np.zeros((game_board.shape[0],game_board.shape[1]))
            for j in range(100):
                game,position = game_init()
                while(not game.judge(position[0],position[1])):
                    next_position,dire,reward = game_step(game,position,Q_table,game_direction,epsilon)
                    position = next_position
                    heatmap[next_position[0]][next_position[1]] += 1
                if reward == 1:
                    count += 1
            success.append(count)
            print('%d times' %(i+1))
            print(heatmap)

The Q_table is a $5 \times 5 \times 4$ array of (maze squares) $\times$ (movement directions: up, down, left, right). The Q value is updated after every move until the goal is reached or the game is over. In addition, every 20 episodes I played the game 100 times using the Q_table at that point, and recorded how many times the goal was reached and which squares were visited.
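
One way to inspect what has been learned (a sketch of my own, not in the original code) is to follow the greedy action from the start square and print the route, for example by appending something like this to the end of train.py after the training loop:

def greedy_path(game, Q_table, dire, max_steps=50):
    #follow the argmax action from S and return the list of visited squares
    position = copy.deepcopy(game.init_position)
    path = [(int(position[0]), int(position[1]))]
    for _ in range(max_steps):
        if game.judge(position[0], position[1]) != 0:
            break
        direction = dire[np.argmax(Q_table[position[0], position[1]])]
        candidate = game.move(copy.deepcopy(position), direction)
        if not game.check_size(candidate[0], candidate[1]):
            break                  #the greedy action walks off the board; stop here
        position = candidate
        path.append((int(position[0]), int(position[1])))
    return path

print(greedy_path(Game(), Q_table, game_direction))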

Results

The results are as follows.

#Maze structure diagram
[[ 0.  0.  0.  0.  G.]
 [ 0. -1.  0. -1. -1.]
 [ 0.  0. -1.  0.  0.]
 [-1.  0.  0.  0.  0.]
 [ S.  0. -1.  0.  0.]]
At 20 times
[[4.500e+01 1.800e+01 1.500e+01 6.000e+00 3.000e+00]
 [4.000e+01 1.400e+01 5.000e+00 3.000e+00 5.000e+00]
 [9.000e+00 4.148e+03 8.000e+00 8.270e+02 5.000e+00]
 [6.700e+01 4.172e+03 1.100e+01 8.350e+02 3.000e+00]
 [0.000e+00 4.900e+01 0.000e+00 2.000e+00 0.000e+00]]
At 40 times
[[1.600e+01 1.000e+01 5.000e+00 2.000e+00 2.000e+00]
 [1.300e+01 1.000e+01 2.000e+00 3.000e+00 3.000e+00]
 [7.000e+00 4.105e+03 9.000e+00 6.400e+02 3.000e+00]
 [7.300e+01 4.132e+03 7.000e+00 6.450e+02 2.000e+00]
 [0.000e+00 4.900e+01 0.000e+00 2.000e+00 0.000e+00]]
At 60 times
[[1.900e+01 8.000e+00 3.000e+00 3.000e+00 3.000e+00]
 [1.700e+01 1.600e+01 0.000e+00 2.000e+00 4.000e+00]
 [6.000e+00 3.754e+03 8.000e+00 2.120e+02 4.000e+00]
 [6.700e+01 3.783e+03 6.000e+00 2.140e+02 2.000e+00]
 [0.000e+00 5.600e+01 0.000e+00 0.000e+00 0.000e+00]]
At 80 times
[[1.100e+01 6.000e+00 6.000e+00 6.000e+00 6.000e+00]
 [1.100e+01 1.200e+01 0.000e+00 6.000e+00 4.000e+00]
 [6.000e+00 3.544e+03 1.300e+01 2.069e+03 1.107e+03]
 [5.900e+01 3.571e+03 1.500e+01 2.080e+03 1.109e+03]
 [0.000e+00 5.700e+01 0.000e+00 3.000e+00 2.000e+00]]
At 100 times
[[8.000e+00 8.000e+00 8.000e+00 8.000e+00 8.000e+00]
 [8.000e+00 1.600e+01 0.000e+00 5.000e+00 2.000e+00]
 [8.000e+00 4.783e+03 1.500e+01 3.083e+03 1.225e+03]
 [5.400e+01 4.824e+03 2.300e+01 3.100e+03 1.224e+03]
 [0.000e+00 7.100e+01 0.000e+00 1.000e+01 1.000e+00]]
At 120 times
[[1.100e+01 1.100e+01 1.100e+01 1.100e+01 1.100e+01]
 [1.100e+01 6.000e+00 0.000e+00 6.000e+00 3.000e+00]
 [1.100e+01 4.215e+03 1.500e+01 2.403e+03 1.660e+03]
 [5.900e+01 4.251e+03 1.900e+01 2.416e+03 1.665e+03]
 [1.000e+00 6.400e+01 0.000e+00 4.000e+00 4.000e+00]]
At 140 times
[[ 99. 100.  98.  96.  96.]
 [100.   4.   1.   0.   0.]
 [101. 101.   0.   0.   0.]
 [  0. 102.   0.   0.   0.]
 [  0. 102.   0.   0.   0.]]
At 160 times
[[ 96.  95.  96.  96.  95.]
 [ 97.   2.   0.   0.   0.]
 [ 97. 100.   1.   0.   0.]
 [  0.  99.   0.   0.   0.]
 [  0. 100.   2.   0.   0.]]
At 180 times
[[101. 100. 100. 100.  99.]
 [101.   0.   0.   1.   0.]
 [100. 101.   0.   0.   0.]
 [  0. 101.   0.   0.   0.]
 [  0. 100.   0.   0.   0.]]
At 200 times
[[ 99.  99. 101. 100.  99.]
 [100.   1.   1.   0.   0.]
 [100. 100.   0.   0.   0.]
 [  0. 101.   0.   0.   0.]
 [  0. 101.   0.   0.   0.]]

#Number of goals (out of 100 games at each checkpoint)
[3, 2, 3, 6, 8, 11, 96, 95, 99, 99]

Looking at the results, in the early stages when learning had not progressed, the agent mostly wandered around the squares near the start and the game often ended on the D squares there, but after about 140 episodes of training it finds what is essentially the correct route.
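
train.py imports matplotlib.pyplot but never uses it; a minimal sketch of how the success counts could be plotted (my own addition; the labels are illustrative) is:

import matplotlib.pyplot as plt

#success holds the number of goals out of 100 evaluation games, measured every 20 episodes
episodes = [20 * (k + 1) for k in range(len(success))]
plt.plot(episodes, success, marker='o')
plt.xlabel('training episodes')
plt.ylabel('goals out of 100 evaluation games')
plt.show()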

Summary

I tried to solve a simple maze of my own making using Q-learning. It was also good practice for implementation. In the future, I would like to tackle more complicated games using other reinforcement learning methods.

