[Python] If you're a man, shut up and stand the pole up

Lately I've been hooked on reinforcement learning. And when you do reinforcement learning, a man naturally wants to stand a pole up. So, following the previous article, this time I'll introduce OpenAI Gym's CartPole.

Previous article: I want to climb a mountain with reinforcement learning

What is the SARSA learning method?

In the previous article I used Q-learning; this time I would like to use a method called SARSA. Let's review. In reinforcement learning, the update of the state-action value Q,

Q\left( s_{t},a_{t}\right) \leftarrow Q\left( s_{t},a_{t}\right) +\alpha \left( G_{t}-Q\left( s_{t},a_{t}\right) \right)

is performed at every state transition. The difference between SARSA and Q-learning lies in how this $G_{t}$ is determined.

For **Q-learning**: $G_{t}=r_{t+1}+\gamma\max_{a\in A_{t}}[Q(s_{t+1},a)]$

For **SARSA**: $G_{t}=r_{t+1}+\gamma Q(s_{t+1},a_{t+1}^{\pi})$

Here, $a_{t+1}^{\pi}$ denotes the action chosen according to the policy in state $s_{t+1}$. What you can see from the above is that Q-learning uses max to update the value, that is, it is an optimistic learning method that updates with the largest state-action value obtainable. In contrast, SARSA takes the action actually chosen next into account, making it a more **realistic** way of determining a strategy. This time, we will also compare the two.
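
To make the difference concrete, here is a minimal sketch of the two targets (my own illustration, assuming a q_table indexed as q_table[s][a], and that next_a is the action the epsilon-greedy policy actually picks for the next step):

import numpy as np

def q_learning_target(reward, gamma, q_table, next_s):
    #Optimistic: assume the best possible next action will be taken
    return reward + gamma * np.max(q_table[next_s])

def sarsa_target(reward, gamma, q_table, next_s, next_a):
    #Realistic: use the action the policy actually chose for the next step
    return reward + gamma * q_table[next_s][next_a]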

CartPole rules

(Screenshot: the CartPole environment.) If you keep the pole standing long enough (200 steps), the episode is cleared. The state consists of four values: **the position of the cart, the velocity of the cart, the angle of the pole, and the angular velocity of the pole**. The actions are limited to pushing the cart to the left (0) and pushing it to the right (1). The episode ends when the pole tilts more than 12 degrees or when you have endured for 200 steps.
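
If you want to confirm these spaces yourself, a quick check looks like this:

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)       #Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.observation_space.low)   #Lower bounds (the velocity entries are effectively unbounded)
print(env.observation_space.high)  #Upper bounds (the velocity entries are effectively unbounded)
print(env.action_space)            #Discrete(2): 0 = push left, 1 = push right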

Implementation

First, import the libraries.

import gym
from gym import logger as gymlogger
gymlogger.set_level(40) #error only
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64

Next, define a SARSA class that implements the learning.

class SARSA:
    def __init__(self, env):
        self.env = env
        self.env_low = self.env.observation_space.low #Minimum state
        self.env_high = self.env.observation_space.high #Maximum state
        
        tmp = [7,7,7,7] #Divide the state into 7 states
        self.env_dx = [0,0,0,0]
        self.env_dx[0] = (self.env_high[0] - self.env_low[0]) / tmp[0]
        self.env_dx[1] = (self.env_high[1] - self.env_low[1]) / tmp[1]
        self.env_dx[2] = (self.env_high[2] - self.env_low[2]) / tmp[2]
        self.env_dx[3] = (self.env_high[3] - self.env_low[3]) / tmp[3]
        
        self.q_table = np.zeros((tmp[0],tmp[1],tmp[2],tmp[3],2)) #Initialize the action-value table Q(s, a)
        
    def get_status(self, _observation): #Discretize state
        
        s1 = int((_observation[0] - self.env_low[0])/self.env_dx[0]) #Drop into any of the seven states
        
        if _observation[1] < -1.5: #Binned by hand (see "Trouble point" below)
          s2 = 0
        elif -1.5 <= _observation[1] < - 1:
          s2 = 1
        elif -1 <= _observation[1] < -0.5:
          s2 = 2
        elif -0.5 <= _observation[1] < 0.5:
          s2 = 3
        elif 0.5 <= _observation[1] < 1.5:
          s2 = 4
        elif 1.5 <= _observation[1] < 2:
          s2 = 5
        elif 2 <= _observation[1]:
          s2 = 6
        
        s3 = int((_observation[2] - self.env_low[2])/self.env_dx[2]) #Drop into any of the seven states
        
        if _observation[3] < -1: #Binned by hand (see "Trouble point" below)
          s4 = 0
        elif -1 <= _observation[3] < -0.7:
          s4 = 1
        elif -0.7 <= _observation[3] < -0.6:
          s4 = 2
        elif -0.6 <= _observation[3] < -0.5:
          s4 = 3
        elif -0.5 <= _observation[3] < -0.4:
          s4 = 4
        elif -0.4 <= _observation[3] < 0.4:
          s4 = 5
        else:
          s4 = 6
          
        return s1, s2, s3, s4
    
    def policy(self, s, epi): #Select an action in state s
        
        epsilon = 0.5 * (1 / (epi + 1))
        
        if np.random.random() <= epsilon:
            return np.random.randint(2) #Randomly choose
        else:
            s1, s2, s3, s4 = self.get_status(s)
            return np.argmax(self.q_table[s1][s2][s3][s4]) #Select the action with the highest action value
    
    def learn(self, time = 200, alpha = 0.5, gamma = 0.99): #Train for 'time' episodes
        
        log = [] #Record total rewards for each episode
        t_log = [] #Record the number of steps per episode
        
        for j in range(time+1):
            t = 0 #Number of steps
            total = 0 #Total reward
            s = self.env.reset()
            done = False
            
            while not done:
                t += 1
                a = self.policy(s, j)
                next_s, reward, done, _ = self.env.step(a)
                
                reward = t/10 #The longer you endure, the more rewards you get
                
                if done:
                  if t < 195:
                    reward -= 1000 #Penalties for failure to endure
                  else:
                    reward = 1000 #Give more rewards on success

                total += reward
                
                
                s1, s2, s3, s4 = self.get_status(next_s)
                G = reward + gamma * self.q_table[s1][s2][s3][s4][self.policy(next_s, j)] #SARSA target: reward plus discounted Q of the next state and action
                
                s1, s2, s3, s4 = self.get_status(s)
                self.q_table[s1][s2][s3][s4][a] += alpha*(G - self.q_table[s1][s2][s3][s4][a]) #Q update
                s = next_s

            t_log.append(t)
            log.append(total)
            
            if j %1000 == 0:
              print(str(j) + " ===total reward=== : " + str(total))
            
        return plt.plot(t_log)

    def show(self): #Display learning results
        s = self.env.reset()
        self.env.render()
        done = False
        t = 0
        while not done:
          t += 1
          a = self.policy(s, 10000)
          s, _, done, _ = self.env.step(a)
          self.env.render()
                
        print(t)
        self.env.reset()
        self.env.close()

Trouble point

Here's where I stumbled. In __init__ I prepare env_dx to discretize each of the four state variables, but there is a problem. If you look closely at the reference (screenshot: the observation-space table in the docs), the range of the velocity values is **inf**. Yes, **it's infinite!**

That makes the corresponding entries of env_dx infinite as well, so the discretization of those continuous values does not work. Therefore,

import random
env.step(random.randint(0, 1)) #Take a random action (0 or 1)
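
One way to do this probing is a sketch like the following: step randomly for a while and track the extremes of each observation dimension.

import gym
import numpy as np

env = gym.make('CartPole-v0')
obs_log = []
s = env.reset()
for _ in range(10000):
    s, _, done, _ = env.step(env.action_space.sample()) #Random action
    obs_log.append(s)
    if done:
        s = env.reset()

obs_log = np.array(obs_log)
print(obs_log.min(axis=0)) #Smallest values actually observed per dimension
print(obs_log.max(axis=0)) #Largest values actually observed per dimension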

I ran this many times and observed the range of the cart velocity and then the angular velocity of the pole. Then,

if _observation[1] < -1.5: #Cart velocity
  s2 = 0
elif -1.5 <= _observation[1] < -1:
  s2 = 1
elif -1 <= _observation[1] < -0.5:
  s2 = 2
elif -0.5 <= _observation[1] < 0.5:
  s2 = 3
elif 0.5 <= _observation[1] < 1.5:
  s2 = 4
elif 1.5 <= _observation[1] < 2:
  s2 = 5
elif 2 <= _observation[1]:
  s2 = 6

if _observation[3] < -1: #Angular velocity of the pole
  s4 = 0
elif -1 <= _observation[3] < -0.7:
  s4 = 1
elif -0.7 <= _observation[3] < -0.6:
  s4 = 2
elif -0.6 <= _observation[3] < -0.5:
  s4 = 3
elif -0.5 <= _observation[3] < -0.4:
  s4 = 4
elif -0.4 <= _observation[3] < 0.4:
  s4 = 5
else:
  s4 = 6

I realized I could bin them like this.
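
As an aside, the same hand-made binning can be written more compactly with np.digitize; this is just an alternative sketch, and the thresholds below are the hand-picked values from above, not anything taken from the environment.

import numpy as np

VEL_BINS = np.array([-1.5, -1.0, -0.5, 0.5, 1.5, 2.0])       #Cart velocity thresholds -> 7 bins
ANG_VEL_BINS = np.array([-1.0, -0.7, -0.6, -0.5, -0.4, 0.4])  #Pole angular velocity thresholds -> 7 bins

def bin_cart_velocity(v):
    return int(np.digitize(v, VEL_BINS))      #Returns a bin index 0..6

def bin_pole_angular_velocity(w):
    return int(np.digitize(w, ANG_VEL_BINS))  #Returns a bin index 0..6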

Learning

Now let's train it. Around 3,000 episodes is plenty.

env = gym.make('CartPole-v0')
agent = SARSA(env)
agent.learn(time = 3000)

The number of steps per episode changes like this. (Plot: steps per episode during training.)

Now, let's check the animation with `agent.show()`. (GIF: the trained agent keeping the pole upright.)

It's quite stable and keeps the pole up for a long time. I'm glad I'm a **man**.

Q-learning vs SARSA

Let's compare Q-learning and SARSA in this environment. For Q-learning, G becomes

G = reward + gamma * max(self.q_table[s1][s2][s3][s4])

When I train with this instead, the result looks like this. (Plot: steps per episode with Q-learning.) The stability of convergence seems better with SARSA. If you're a man, face reality like SARSA does. Yes.
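
For reference, here is a rough sketch of how to put the two side by side; it assumes a QLearning class that is identical to SARSA except for the G line above, and a learn() modified to return t_log instead of plotting it.

env = gym.make('CartPole-v0')
sarsa_steps = SARSA(env).learn(time = 3000)                  #Assumed to return the per-episode step log
q_steps = QLearning(gym.make('CartPole-v0')).learn(time = 3000)

plt.plot(sarsa_steps, label = 'SARSA')
plt.plot(q_steps, label = 'Q-learning')
plt.legend()
plt.show()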

Impressions

For this environment, I'd say discretizing the states was the hardest part. It seems DQN was born precisely to solve that problem. Next time, I think I'll try building a DQN. See you!