[PYTHON] I want to climb a mountain with reinforcement learning


Reinforcement learning is cool, isn't it? This time I played with "MountainCar" in the Python environment "OpenAI Gym", so I will introduce it. By the way, I am using Google Colab.

I referred to this article quite a bit: Introduction to OpenAI Gym

Q-learning

First, a quick review of Q-learning as a learning method. If you already know it, feel free to skip this section.

In Q-learning, $ Q\left( s_{t},a_{t}\right) $ is called the state-action value, and it represents the value of taking action $ a_{t} $ in a certain state $ s_{t} $. The subscript $ t $ used here does not mean time, but simply indexes a single state. The value here is not the reward you receive momentarily when the state changes, but the cumulative reward you receive by playing the episode through to the end.

Therefore, as a policy, you should select the action $ a $ that attains $ \max_{a\in A_{t}}Q\left( s_{t},a\right) $ in a given state $ s_{t} $.

State action value update

In general, the state-action value is updated as follows.

$Q\left( s_{t},a_{t}\right) \leftarrow Q\left( s_{t},a_{t}\right) +\alpha \left( G_{t}-Q\left( s_{t},a_{t}\right) \right)$

Here, in the case of Q-learning, $ G_{t} $ is

$G_{t}=r_{t+1}+\gamma \max_{a\in A_{t}}Q\left( s_{t+1},a\right)$

You can see that it refers to the next state $ s_{t+1} $ rather than the current state, which means future rewards are also taken into account. $ r_{t+1} $ is called the immediate reward and is the reward obtained right after the transition. $ \alpha $ is called the learning rate and determines how much the value is updated in a single learning step. $ \gamma $ is called the discount rate and determines how strongly future rewards are taken into account.

Q-learning is the method that repeats episodes (games), updates this state-action value function, and in doing so searches for the optimal policy.
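To make the update concrete, here is a tiny sketch with made-up numbers (my own illustration, not taken from any environment): one Q-learning update with $ \alpha = 0.4 $, $ \gamma = 0.99 $ and an immediate reward of -1.

import numpy as np

# Toy illustration of a single Q-learning update (all numbers are made up)
alpha, gamma = 0.4, 0.99       # learning rate and discount rate
Q = np.zeros((2, 3))           # 2 states x 3 actions, initialized to zero
s, a, next_s = 0, 2, 1         # current state, chosen action, next state
r = -1                         # immediate reward r_{t+1}

G = r + gamma * np.max(Q[next_s])  # target G_t
Q[s, a] += alpha * (G - Q[s, a])   # Q(s_t, a_t) moves toward G_t
print(Q[s, a])                     # -0.4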

Mountain Car rules

(Screenshot: the MountainCar environment)

In this environment, the episode ends when the car reaches the position of the flag on the right. Until you reach it, you receive a reward of -1 for every action you take. If you have not reached the goal after 200 actions, the episode is over, which means you end up with a total reward of -200. In other words, the point of reinforcement learning here is to earn a total reward greater than -200.

The actions are limited to three: push left: 0, do nothing: 1, push right: 2.
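If you want to confirm this, the action and observation spaces can be checked directly from Gym (a quick sketch I added; the exact printed format depends on your gym version):

import gym

env = gym.make('MountainCar-v0')
print(env.action_space)            # Discrete(3): 0 = push left, 1 = do nothing, 2 = push right
print(env.observation_space.low)   # [-1.2  -0.07] -> minimum position and velocity
print(env.observation_space.high)  # [ 0.6   0.07] -> maximum position and velocity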

Preparation

I installed various things in Google Colab while trying to render the environment, but in the end, installing gym is all that is strictly required.

bash


# gym installation
$ pip install gym

# Not required (recommended when using Colab, for rendering)
$ apt update
$ apt install xvfb
$ apt-get -qq -y install libcusparse8.0 libnvrtc8.0 libnvtoolsext1 > /dev/null
$ ln -snf /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so.8.0 /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so

$ apt-get -qq -y install xvfb freeglut3-dev ffmpeg > /dev/null
$ pip install pyglet
$ pip install pyopengl
$ pip install pyvirtualdisplay

Library import

Some of them are not needed, so you can choose them as appropriate.

import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML

Q-learning implementation

For the set of actions and how to use the environment, I referred to GitHub: GitHub MountainCar


class Q:
    def __init__(self, env):
        self.env = env
        self.env_low = self.env.observation_space.low    # minimum position and velocity
        self.env_high = self.env.observation_space.high  # maximum position and velocity
        self.env_dx = (self.env_high - self.env_low) / 40  # divide each dimension into 40 bins
        self.q_table = np.zeros((40, 40, 3))
        
    def get_status(self, _observation):
        position = int((_observation[0] - self.env_low[0])/self.env_dx[0])
        velocity = int((_observation[1] - self.env_low[1])/self.env_dx[1])
        return position, velocity
    
    def policy(self, s, epsilon = 0.1):
        if np.random.random() <= epsilon:
            return np.random.randint(3)
        else:
            p, v = self.get_status(s)
            if self.q_table[p][v][0] == 0 and self.q_table[p][v][1] == 0 and self.q_table[p][v][2] == 0:
                return np.random.randint(3)
            else:
                return np.argmax(self.q_table[p][v])
    
    def learn(self, time = 5000, alpha = 0.4, gamma = 0.99):
        log = []
        for j in range(time):
            total = 0
            s = self.env.reset()
            done = False
            
            while not done:
                # Choose an action with the epsilon-greedy policy
                a = self.policy(s)
                next_s, reward, done, _ = self.env.step(a)
                total += reward

                # Q-learning target: immediate reward + discounted max Q of the next state
                p, v = self.get_status(next_s)
                G = reward + gamma * max(self.q_table[p][v])

                # Update the value of the (state, action) pair we just left
                p, v = self.get_status(s)
                self.q_table[p][v][a] += alpha * (G - self.q_table[p][v][a])
                s = next_s

            log.append(total)
            if j % 100 == 0:
                print(str(j) + " ===total reward=== : " + str(total))
        return plt.plot(log)

    def show(self):
        # Play one episode with the learned policy and draw it frame by frame
        # (requires `from IPython import display` and a virtual display; see below)
        s = self.env.reset()
        img = plt.imshow(self.env.render('rgb_array'))
        done = False
        while not done:
            s, _, done, _ = self.env.step(self.policy(s))
            display.clear_output(wait=True)
            img.set_data(self.env.render('rgb_array'))
            plt.axis('off')
            display.display(plt.gcf())

        self.env.close()

The state $ s $ consists of the car's current position and its current velocity. Since both take continuous values, they are discretized into (40, 40) bins by the get_status function. q_table is where the state-action value function is stored; its shape is the discretized states (40, 40) times the number of actions (3).
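As a rough sanity check (my own example, not from the referenced article), this is how one continuous observation is mapped to q_table indices using MountainCar's actual bounds:

import numpy as np

# MountainCar bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
env_low = np.array([-1.2, -0.07])
env_high = np.array([0.6, 0.07])
env_dx = (env_high - env_low) / 40          # bin widths: [0.045, 0.0035]

obs = np.array([-0.5, 0.0])                 # an example observation near the start position
p = int((obs[0] - env_low[0]) / env_dx[0])  # -> 15
v = int((obs[1] - env_low[1]) / env_dx[1])  # -> 20
print(p, v)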

env = gym.make('MountainCar-v0')
agent = Q(env)
agent.learn()

(Screenshot: plot of the total reward per episode during training)

After about 5,000 training episodes, the agent reaches the goal more and more often.
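To check this quantitatively, one simple sketch (using the agent and env defined above, with the exploration rate set to 0 so the policy is purely greedy) is to average the total reward over some evaluation episodes:

# Evaluate the learned q_table with a greedy policy (epsilon = 0)
totals = []
for _ in range(100):
    s = env.reset()
    done = False
    total = 0
    while not done:
        s, r, done, _ = env.step(agent.policy(s, epsilon=0))
        total += r
    totals.append(total)
print("average total reward over 100 episodes:", np.mean(totals))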

By the way, if you want to watch the animation in Google Colab,

from IPython import display
from pyvirtualdisplay import Display
import matplotlib.pyplot as plt

d = Display()
d.start()

agent.show()

running the above will display the animation inline. (Animation: the car driving up to the flag)
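Alternatively, you can record the episode as a video and embed it, which is what the Monitor, glob, io, base64, and HTML imports above are for. A rough sketch (this assumes the old-style gym Monitor API; newer gym/gymnasium versions use RecordVideo instead):

# Record one greedy episode to ./video and embed the resulting mp4 in Colab
env_rec = Monitor(gym.make('MountainCar-v0'), './video', force=True)
s = env_rec.reset()
done = False
while not done:
    s, _, done, _ = env_rec.step(agent.policy(s, epsilon=0))
env_rec.close()

mp4 = glob.glob('./video/*.mp4')[0]
video = io.open(mp4, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video autoplay controls style="height: 300px;">
  <source src="data:video/mp4;base64,{0}" type="video/mp4" />
</video>'''.format(encoded.decode('ascii')))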

Impressions

The explanation ended up rather involved this time, so I will keep refining it bit by bit. I will continue to share what I learn while playing with Gym some more. See you!
