[PYTHON] Solve your own maze with Q-learning

Overview

Using Q-learning, one of the reinforcement learning methods, I tried to solve a simple maze I made myself. I implemented it in Python.

Reinforcement learning

Reinforcement learning is one of the main machine learning approaches, alongside supervised learning and unsupervised learning. Briefly, an "agent" is trained to maximize "value" in a given "environment". Reinforcement learning is generally described with three elements: state *s*, action *a*, and reward *r*. Each element is described below.

- State *s*: the state of the "environment".
- Action *a*: the action taken by the "agent" in a given state.
- Reward *r*: the reward obtained as a result of the "agent" performing an action.

If we know the "value of performing action *a* in a certain state *s*", the agent can be trained to maximize that value.

Q-learning

In Q-learning, the "value of performing action *a* in a certain state *s*" from the previous section is called the Q value, and learning proceeds by repeatedly updating this value. The Q value of performing action $a_t$ in state $s_t$ is written $Q(s_t,a_t)$. This Q value is updated using the reward *r* that is received, but not only the action that directly earns the reward is valuable: the action before it, and the one before that, are also valuable in the sense that they bring the agent closer to the reward. Therefore, by folding the expected value of the reward that will eventually be obtained into the update, we compute how valuable each state-action pair is at each point in time. The concrete update formula is as follows.

$$ Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \left( r_{t+1}+\gamma \max_{a} Q(s_{t+1},a)-Q(s_t,a_t) \right) $$

To update $Q(s_t,a_t)$, we use the reward at the next time step, $r_{t+1}$, and the maximum Q value in the next state $s_{t+1}$, $\max_{a} Q(s_{t+1},a)$. $\alpha$ (the learning rate) and $\gamma$ (the discount rate) are hyperparameters.
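
As a concrete illustration (the numbers are my own, using the hyperparameters $\alpha=0.5$ and $\gamma=0.8$ that appear later in train.py): if every Q value starts at 0 and a move from some state reaches the goal with reward 1, the update gives $Q(s_t,a_t) \leftarrow 0 + 0.5(1 + 0.8 \cdot 0 - 0) = 0.5$. On a later episode, a move from the square just before it receives reward 0 but now sees $\max Q = 0.5$ in its next state, so its own value becomes $0 + 0.5(0 + 0.8 \cdot 0.5 - 0) = 0.2$. In this way the reward at the goal gradually propagates backwards along the path.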

Environment

python 3.5.2

Implementation (maze part)

The implementation of the maze part is as follows. The maze is a $5 \times 5$ grid. (See the comments in the code for details.)

treasure.py


import numpy as np

class Game():
    def __init__(self):
        self.x_size = 5
        self.y_size = 5
        self.init_position = np.array([4,0])
        self.game_board = np.zeros((self.x_size,self.y_size))
        """
        Maze structure
        0 0 0 0 G
        0 D 0 D D
        0 0 D 0 0
        D 0 0 0 0
        S 0 D 0 0

        G = 1
        D = -1
        """
        self.game_board[0][4] = 1
        self.game_board[1][1] = -1
        self.game_board[3][0] = -1
        self.game_board[2][2] = -1
        self.game_board[1][3] = -1
        self.game_board[1][4] = -1
        self.game_board[4][2] = -1
    
    def judge(self,x,y):
        #Judge whether the game has ended: 0 = continue, 1 = goal (G), -1 = dead (D)
        if self.game_board[x][y] == 0:
            return 0
        elif self.game_board[x][y] == 1:
            return 1
        else:
            return -1
    
    def check_size(self,x,y):
        #Return 1 if (x, y) is inside the maze, 0 if it is out of bounds
        if x < 0 or y < 0 or x >= self.x_size or y >= self.y_size:
            return 0
        else:
            return 1
    
    def move(self,position,direction):
        #row index decreases moving up, column index increases moving right
        if direction == 'up':
            position[0] -= 1
        elif direction == 'down':
            position[0] += 1
        elif direction == 'right':
            position[1] += 1
        elif direction == 'left':
            position[1] -= 1
        
        return position

The rules are as follows: the player starts from the S square and, in each state, can move one square up, down, left, or right. Reaching G counts as a goal and gives a positive reward, while moving onto a D square ends the game and gives a negative reward as punishment.
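
As a minimal usage sketch (my own addition, not part of the original code), the Game class can be exercised by hand like this, assuming it is saved as treasure.py:

from treasure import Game

game = Game()
position = game.init_position.copy()    #start at S = [4, 0]

position = game.move(position, 'up')    #move to [3, 0], which is a D square
if game.check_size(position[0], position[1]):
    print(game.judge(position[0], position[1]))    #prints -1: the game is over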

Implementation (learning part)

The code looks like this.

Whole learning part code

train.py


import numpy as np
from treasure import Game
import copy
import matplotlib.pyplot as plt

#board process
game = Game()
game_board = game.game_board
print(game_board)
game_direction = ['up','down','left','right']

def get_action(Q_table,dire,epsilon,x,y):

    #random choice
    if np.random.rand() < epsilon:
        return np.random.choice(dire)

    else:
        return dire[np.argmax(Q_table[x,y])]

def game_init():
    #init process
    game = Game()
    position= game.init_position
    
    return game,position

def game_reward(game,position):
    #return the reward for the current square
    if game.judge(position[0],position[1]) == 1:
        #print('You got a goal!')
        return 1
    elif game.judge(position[0],position[1]) == -1:
        #print('You died..')
        return -1
    else:
        return 0

def game_step(game,position,Q_table,dire,epsilon):
    while(True):
        pos = copy.deepcopy(position)
        direction = get_action(Q_table,dire,epsilon,pos[0],pos[1])
        index_dire = dire.index(direction)
        move_position = game.move(pos,direction)
        if game.check_size(pos[0],pos[1]):
            break
    reward = game_reward(game,move_position)

    return move_position,index_dire,reward

def Q_koushin(Q,state_x,state_y,state_a,s_next_x,state_next_y,alpha,reward,gamma):
    #Q value update ("koushin" is Japanese for "update"); implements the update formula above
    Q[state_x,state_y,state_a] += alpha*(reward + gamma*np.max(Q[s_next_x,state_next_y]) - Q[state_x,state_y,state_a])
    return Q[state_x,state_y,state_a]


if __name__ == '__main__':
    #hyper parameter
    epsilon = 0.01
    alpha = 0.5
    gamma = 0.8
    Q_table = np.zeros((game_board.shape[0],game_board.shape[1],len(game_direction)))
    episode = 200
    success = []

    for i in range(episode):
        game,position = game_init()
        while(not game.judge(position[0],position[1])):
            next_position,dire,reward = game_step(game,position,Q_table,game_direction,epsilon)
            Q_table[position[0],position[1],dire] = Q_koushin(Q_table,position[0],position[1],dire,next_position[0],next_position[1],alpha,reward,gamma)
            position = next_position
        
        if (i+1) % 20 == 0:
            count = 0
            heatmap = np.zeros((game_board.shape[0],game_board.shape[1]))
            for j in range(100):
                game,position = game_init()
                while(not game.judge(position[0],position[1])):
                    next_position,dire,reward = game_step(game,position,Q_table,game_direction,epsilon)
                    position = next_position
                    heatmap[next_position[0]][next_position[1]] += 1
                if reward == 1:
                    count += 1
            success.append(count)
            print('%d times' %(i+1))
            print(heatmap)
    print(success)

I will explain each part below.

**Action selection by the epsilon-greedy method**

train.py


def get_action(Q_table,dire,epsilon,x,y):
    #random choice
    if np.random.rand() < epsilon:
        return np.random.choice(dire)

    else:
        return dire[np.argmax(Q_table[x,y])]

With probability $\epsilon$ the agent moves randomly; otherwise it selects the action with the highest Q value.
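
As a small check of how rarely the random branch fires with $\epsilon = 0.01$ (my own sketch; the dummy Q row and names below are just for illustration):

import numpy as np

directions = ['up','down','left','right']
Q_row = np.array([0.1, 0.2, 0.0, 0.9])   #dummy Q values: 'right' is clearly best
epsilon = 0.01

def epsilon_greedy(Q_row, epsilon):
    #same logic as get_action, written against a single Q row
    if np.random.rand() < epsilon:
        return np.random.choice(directions)
    return directions[np.argmax(Q_row)]

choices = [epsilon_greedy(Q_row, epsilon) for _ in range(10000)]
print(choices.count('right') / len(choices))   #roughly 0.99: almost always greedy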

**Reward function settings**

train.py


def game_reward(game,position):
    #return the reward for the current square
    if game.judge(position[0],position[1]) == 1:
        #print('You got a goal!')
        return 1
    elif game.judge(position[0],position[1]) == -1:
        #print('You died..')
        return -1
    else:
        return 0

The reward is 1 when the goal is reached, -1 as a penalty when the game ends on a D square, and 0 otherwise.
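
A quick check of the three cases (my own addition), assuming the definitions in train.py and treasure.py are in scope; the coordinates are taken directly from the maze definition:

game = Game()
print(game_reward(game, [0, 4]))   #G square -> 1
print(game_reward(game, [1, 1]))   #D square -> -1
print(game_reward(game, [4, 0]))   #S square (empty) -> 0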

**Updating the Q value**

train.py


def Q_koushin(Q,state_x,state_y,state_a,s_next_x,state_next_y,alpha,reward,gamma):
    #Q value update ("koushin" is Japanese for "update"); implements the update formula above
    Q[state_x,state_y,state_a] += alpha*(reward + gamma*np.max(Q[s_next_x,state_next_y]) - Q[state_x,state_y,state_a])
    return Q[state_x,state_y,state_a]

Update the Q value according to the formula.
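
As a quick numerical check (my own addition, assuming Q_koushin and numpy are in scope, e.g. in the same session as train.py), two calls reproduce the hand calculation from the Q-learning section with alpha = 0.5 and gamma = 0.8:

Q = np.zeros((5, 5, 4))
#a move from (0,3) into the goal square (0,4): reward 1, next-state Q values still 0
print(Q_koushin(Q, 0, 3, 3, 0, 4, 0.5, 1, 0.8))   #0.5
#a later move from (0,2) into (0,3): reward 0, but max Q of the next state is now 0.5
print(Q_koushin(Q, 0, 2, 3, 0, 3, 0.5, 0, 0.8))   #0.2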

**Moving the player**

train.py


def game_step(game,position,Q_table,dire,epsilon):
    while(True):
        pos = copy.deepcopy(position)
        direction = get_action(Q_table,dire,epsilon,pos[0],pos[1])
        index_dire = dire.index(direction)
        move_position = game.move(pos,direction)
        if game.check_size(pos[0],pos[1]):
            break
    reward = game_reward(game,move_position)

    return move_position,index_dire,reward

This function makes one move during the game. It checks whether the move would leave the maze, and if the move is valid it returns the new position, the index of the chosen direction, and the reward obtained. In Python, the position (a NumPy array) is passed by reference, so I made a deep copy so that the original position would not be modified.
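
A small illustration of why the copy is needed (my own addition; a plain position.copy() would also work since the position is a NumPy array):

import copy
import numpy as np

position = np.array([4, 0])
alias = position                   #no copy: both names refer to the same array
alias[0] -= 1
print(position)                    #[3 0] -- the original position has changed

position = np.array([4, 0])
pos = copy.deepcopy(position)      #independent copy, as in game_step
pos[0] -= 1
print(position)                    #[4 0] -- the original position is untouched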

**Learning part**

train.py


if __name__ == '__main__':
    #hyper parameter
    epsilon = 0.01
    alpha = 0.5
    gamma = 0.8
    Q_table = np.zeros((game_board.shape[0],game_board.shape[1],len(game_direction)))
    episode = 200
    success = []

    for i in range(episode):
        game,position = game_init()
        while(not game.judge(position[0],position[1])):
            next_position,dire,reward = game_step(game,position,Q_table,game_direction,epsilon)
            Q_table[position[0],position[1],dire] = Q_koushin(Q_table,position[0],position[1],dire,next_position[0],next_position[1],alpha,reward,gamma)
            position = next_position
        
        if (i+1) % 20 == 0:
            count = 0
            heatmap = np.zeros((game_board.shape[0],game_board.shape[1]))
            for j in range(100):
                game,position = game_init()
                while(not game.judge(position[0],position[1])):
                    next_position,dire,reward = game_step(game,position,Q_table,game_direction,epsilon)
                    position = next_position
                    heatmap[next_position[0]][next_position[1]] += 1
                if reward == 1:
                    count += 1
            success.append(count)
            print('%d times' %(i+1))
            print(heatmap)

The Q_table is a $5 \times 5 \times 4$ array of (maze squares) $\times$ (movement directions: up, down, left, right). The Q value is updated after every move until the goal is reached or the game is over. In addition, every 20 episodes I played the game 100 times using the Q_table at that point, and recorded how many times the goal was reached and which squares were visited.
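
One way to inspect what has been learned (a sketch of my own, not in the original code) is to follow the greedy action from the start square and print the route, for example by appending something like this to the end of train.py after the training loop:

def greedy_path(game, Q_table, dire, max_steps=50):
    #follow the argmax action from S and return the list of visited squares
    position = copy.deepcopy(game.init_position)
    path = [(int(position[0]), int(position[1]))]
    for _ in range(max_steps):
        if game.judge(position[0], position[1]) != 0:
            break
        direction = dire[np.argmax(Q_table[position[0], position[1]])]
        candidate = game.move(copy.deepcopy(position), direction)
        if not game.check_size(candidate[0], candidate[1]):
            break                  #the greedy action walks off the board; stop here
        position = candidate
        path.append((int(position[0]), int(position[1])))
    return path

print(greedy_path(Game(), Q_table, game_direction))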

Results

The results are as follows.

#Maze structure diagram
[[ 0.  0.  0.  0.  G.]
 [ 0. -1.  0. -1. -1.]
 [ 0.  0. -1.  0.  0.]
 [-1.  0.  0.  0.  0.]
 [ S.  0. -1.  0.  0.]]
At 20 times
[[4.500e+01 1.800e+01 1.500e+01 6.000e+00 3.000e+00]
 [4.000e+01 1.400e+01 5.000e+00 3.000e+00 5.000e+00]
 [9.000e+00 4.148e+03 8.000e+00 8.270e+02 5.000e+00]
 [6.700e+01 4.172e+03 1.100e+01 8.350e+02 3.000e+00]
 [0.000e+00 4.900e+01 0.000e+00 2.000e+00 0.000e+00]]
At 40 times
[[1.600e+01 1.000e+01 5.000e+00 2.000e+00 2.000e+00]
 [1.300e+01 1.000e+01 2.000e+00 3.000e+00 3.000e+00]
 [7.000e+00 4.105e+03 9.000e+00 6.400e+02 3.000e+00]
 [7.300e+01 4.132e+03 7.000e+00 6.450e+02 2.000e+00]
 [0.000e+00 4.900e+01 0.000e+00 2.000e+00 0.000e+00]]
At 60 times
[[1.900e+01 8.000e+00 3.000e+00 3.000e+00 3.000e+00]
 [1.700e+01 1.600e+01 0.000e+00 2.000e+00 4.000e+00]
 [6.000e+00 3.754e+03 8.000e+00 2.120e+02 4.000e+00]
 [6.700e+01 3.783e+03 6.000e+00 2.140e+02 2.000e+00]
 [0.000e+00 5.600e+01 0.000e+00 0.000e+00 0.000e+00]]
At 80 times
[[1.100e+01 6.000e+00 6.000e+00 6.000e+00 6.000e+00]
 [1.100e+01 1.200e+01 0.000e+00 6.000e+00 4.000e+00]
 [6.000e+00 3.544e+03 1.300e+01 2.069e+03 1.107e+03]
 [5.900e+01 3.571e+03 1.500e+01 2.080e+03 1.109e+03]
 [0.000e+00 5.700e+01 0.000e+00 3.000e+00 2.000e+00]]
At 100 times
[[8.000e+00 8.000e+00 8.000e+00 8.000e+00 8.000e+00]
 [8.000e+00 1.600e+01 0.000e+00 5.000e+00 2.000e+00]
 [8.000e+00 4.783e+03 1.500e+01 3.083e+03 1.225e+03]
 [5.400e+01 4.824e+03 2.300e+01 3.100e+03 1.224e+03]
 [0.000e+00 7.100e+01 0.000e+00 1.000e+01 1.000e+00]]
At 120 times
[[1.100e+01 1.100e+01 1.100e+01 1.100e+01 1.100e+01]
 [1.100e+01 6.000e+00 0.000e+00 6.000e+00 3.000e+00]
 [1.100e+01 4.215e+03 1.500e+01 2.403e+03 1.660e+03]
 [5.900e+01 4.251e+03 1.900e+01 2.416e+03 1.665e+03]
 [1.000e+00 6.400e+01 0.000e+00 4.000e+00 4.000e+00]]
At 140 times
[[ 99. 100.  98.  96.  96.]
 [100.   4.   1.   0.   0.]
 [101. 101.   0.   0.   0.]
 [  0. 102.   0.   0.   0.]
 [  0. 102.   0.   0.   0.]]
At 160 times
[[ 96.  95.  96.  96.  95.]
 [ 97.   2.   0.   0.   0.]
 [ 97. 100.   1.   0.   0.]
 [  0.  99.   0.   0.   0.]
 [  0. 100.   2.   0.   0.]]
At 180 times
[[101. 100. 100. 100.  99.]
 [101.   0.   0.   1.   0.]
 [100. 101.   0.   0.   0.]
 [  0. 101.   0.   0.   0.]
 [  0. 100.   0.   0.   0.]]
At 200 times
[[ 99.  99. 101. 100.  99.]
 [100.   1.   1.   0.   0.]
 [100. 100.   0.   0.   0.]
 [  0. 101.   0.   0.   0.]
 [  0. 101.   0.   0.   0.]]

#Number of goals (out of 100 games at each checkpoint)
[3, 2, 3, 6, 8, 11, 96, 95, 99, 99]

Looking at the results, in the early stages when learning had not progressed, the agent mostly wandered around the squares near the start and the game often ended on the D squares there, but after about 140 episodes of training it finds what is essentially the correct route.
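
train.py imports matplotlib.pyplot but never uses it; a minimal sketch of how the success counts could be plotted (my own addition; the labels are illustrative) is:

import matplotlib.pyplot as plt

#success holds the number of goals out of 100 evaluation games, measured every 20 episodes
episodes = [20 * (k + 1) for k in range(len(success))]
plt.plot(episodes, success, marker='o')
plt.xlabel('training episodes')
plt.ylabel('goals out of 100 evaluation games')
plt.show()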

Summary

I tried to solve a simple maze of my own making using Q-learning. It was also good practice for implementation. In the future, I would like to tackle more complicated games using other reinforcement learning methods.

