Recently I've been addicted to reinforcement learning. And when you do reinforcement learning, a man wants to balance a pole, after all. So, following on from last time, I will introduce OpenAI Gym's CartPole.
Previous article: I want to climb a mountain with reinforcement learning
In the previous article I used Q-learning; this time I would like to use a method called SARSA. Let's review. In reinforcement learning, the update of the state-action value Q,

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( G_t - Q(s_t, a_t) \right)$$

is performed for each state transition. The difference between SARSA and Q-learning is how this $G_t$ is determined.
For **Q-learning**:

$$G_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)$$

For **SARSA**:

$$G_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}^{\pi})$$
Here, $a_{t+1}^{\pi}$ denotes the action chosen according to the policy in state $s_{t+1}$. What you can see from the above is that Q-learning updates the value with max, that is, it is an optimistic learning method that updates using the largest state-action value it could obtain. In contrast, SARSA takes the next action that is actually selected into account, making it a more **realistic** way of determining a strategy. This time, we will also compare the two.
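To make the difference concrete, here is a minimal sketch (not the article's code; `q_table`, `next_state`, `next_action`, `reward`, and `gamma` are placeholder names) of how the two targets would be computed for a tabular Q function:

```python
import numpy as np

def q_learning_target(q_table, next_state, reward, gamma):
    # Q-learning: bootstrap from the best action in the next state (off-policy)
    return reward + gamma * np.max(q_table[next_state])

def sarsa_target(q_table, next_state, next_action, reward, gamma):
    # SARSA: bootstrap from the action the policy actually chose (on-policy)
    return reward + gamma * q_table[next_state][next_action]
```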
If you keep the pole standing for a long time (200 steps), the episode is considered cleared. The observation consists of four values: **the position of the cart, the velocity of the cart, the angle of the pole, and the angular velocity of the pole**. The actions are limited to pushing the cart to the left (0) or to the right (1). An episode ends when the pole tilts by 12 degrees or more, or when you hold out for 200 steps.
First, import the libraries.
```python
import gym
from gym import logger as gymlogger
gymlogger.set_level(40)  # errors only
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
```
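As a quick sanity check of the environment described above, you can peek at the action space and a raw observation (a small sketch of my own, not part of the original article):

```python
env = gym.make('CartPole-v0')
print(env.action_space.n)  # 2 actions: 0 = push left, 1 = push right

obs = env.reset()
print(obs)  # [cart position, cart velocity, pole angle, pole angular velocity]

obs, reward, done, info = env.step(env.action_space.sample())  # one random push
print(obs, reward, done)
```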
Define a `SARSA` class that implements the learning.
```python
class SARSA:
    def __init__(self, env):
        self.env = env
        self.env_low = self.env.observation_space.low    # minimum of each state variable
        self.env_high = self.env.observation_space.high  # maximum of each state variable
        tmp = [7, 7, 7, 7]  # divide each state variable into 7 bins
        self.env_dx = [0, 0, 0, 0]
        self.env_dx[0] = (self.env_high[0] - self.env_low[0]) / tmp[0]
        self.env_dx[1] = (self.env_high[1] - self.env_low[1]) / tmp[1]
        self.env_dx[2] = (self.env_high[2] - self.env_low[2]) / tmp[2]
        self.env_dx[3] = (self.env_high[3] - self.env_low[3]) / tmp[3]
        self.q_table = np.zeros((tmp[0], tmp[1], tmp[2], tmp[3], 2))  # initialize the action value table

    def get_status(self, _observation):  # discretize the state
        s1 = int((_observation[0] - self.env_low[0]) / self.env_dx[0])  # drop into one of the seven bins
        if _observation[1] < -1.5:  # classified by hand (cart velocity)
            s2 = 0
        elif -1.5 <= _observation[1] < -1:
            s2 = 1
        elif -1 <= _observation[1] < -0.5:
            s2 = 2
        elif -0.5 <= _observation[1] < 0.5:
            s2 = 3
        elif 0.5 <= _observation[1] < 1.5:
            s2 = 4
        elif 1.5 <= _observation[1] < 2:
            s2 = 5
        else:  # 2 <= _observation[1]
            s2 = 6
        s3 = int((_observation[2] - self.env_low[2]) / self.env_dx[2])  # drop into one of the seven bins
        if _observation[3] < -1:  # classified by hand (pole angular velocity)
            s4 = 0
        elif -1 <= _observation[3] < -0.7:
            s4 = 1
        elif -0.7 <= _observation[3] < -0.6:
            s4 = 2
        elif -0.6 <= _observation[3] < -0.5:
            s4 = 3
        elif -0.5 <= _observation[3] < -0.4:
            s4 = 4
        elif -0.4 <= _observation[3] < 0.4:
            s4 = 5
        else:
            s4 = 6
        return s1, s2, s3, s4

    def policy(self, s, epi):  # select an action in state s (epsilon-greedy with decaying epsilon)
        epsilon = 0.5 * (1 / (epi + 1))
        if np.random.random() <= epsilon:
            return np.random.randint(2)  # choose randomly
        else:
            s1, s2, s3, s4 = self.get_status(s)
            return np.argmax(self.q_table[s1][s2][s3][s4])  # choose the action with the highest value

    def learn(self, time=200, alpha=0.5, gamma=0.99):  # learn for `time` episodes
        log = []    # total reward per episode
        t_log = []  # number of steps per episode
        for j in range(time + 1):
            t = 0      # number of steps
            total = 0  # total reward
            s = self.env.reset()
            done = False
            while not done:
                t += 1
                a = self.policy(s, j)
                next_s, reward, done, _ = self.env.step(a)
                reward = t / 10  # the longer you hold on, the bigger the reward
                if done:
                    if t < 195:
                        reward -= 1000  # penalty for falling over early
                    else:
                        reward = 1000   # bonus for lasting the full episode
                total += reward
                s1, s2, s3, s4 = self.get_status(next_s)
                G = reward + gamma * self.q_table[s1][s2][s3][s4][self.policy(next_s, j)]  # SARSA target
                s1, s2, s3, s4 = self.get_status(s)
                self.q_table[s1][s2][s3][s4][a] += alpha * (G - self.q_table[s1][s2][s3][s4][a])  # Q update
                s = next_s
            t_log.append(t)
            log.append(total)
            if j % 1000 == 0:
                print(str(j) + " ===total reward=== : " + str(total))
        return plt.plot(t_log)

    def show(self):  # display the learned behaviour
        s = self.env.reset()
        img = self.env.render()
        done = False
        t = 0
        while not done:
            t += 1
            a = self.policy(s, 10000)  # large episode index, so essentially greedy
            s, _, done, _ = self.env.step(a)
            self.env.render()
        print(t)
        self.env.reset()
        self.env.close()
```
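One design choice worth noting: `policy` uses an ε-greedy rule whose ε shrinks as 0.5 / (episode + 1), so exploration is front-loaded and the agent becomes almost purely greedy after a few hundred episodes. A quick look at the decay (my own illustration, not part of the original article):

```python
for epi in [0, 1, 9, 99, 999]:
    epsilon = 0.5 * (1 / (epi + 1))
    print(f"episode {epi}: epsilon = {epsilon:.4f}")
# 0.5000, 0.2500, 0.0500, 0.0050, 0.0005
```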
Here's where I stumbled. In `__init__` I prepare to discretize each of the four state variables with `env_dx`, but there is a problem here. If you look closely at the reference, the range of the velocity values is **inf**. Yes, **it's infinite!**
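You can see this directly (a small check of my own, not in the original code):

```python
print(env.observation_space.low)
print(env.observation_space.high)
# the cart-velocity and pole-angular-velocity bounds are effectively unbounded
# (documented as +/-Inf; a huge float in practice), so (high - low) / 7 is useless there
```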
With this, the corresponding values of `env_dx` also become infinite, and the discretization of those continuous values does not work. Therefore, I ran
```python
env.step(np.random.randint(2))  # push the cart in a random direction
```
many times and observed how the cart velocity and the angular velocity of the pole actually vary (a sketch of that observation step is shown after the bins below). Then,
```python
if _observation[1] < -1.5:  # cart velocity
    s2 = 0
elif -1.5 <= _observation[1] < -1:
    s2 = 1
elif -1 <= _observation[1] < -0.5:
    s2 = 2
elif -0.5 <= _observation[1] < 0.5:
    s2 = 3
elif 0.5 <= _observation[1] < 1.5:
    s2 = 4
elif 1.5 <= _observation[1] < 2:
    s2 = 5
else:  # 2 <= _observation[1]
    s2 = 6

if _observation[3] < -1:  # pole angular velocity
    s4 = 0
elif -1 <= _observation[3] < -0.7:
    s4 = 1
elif -0.7 <= _observation[3] < -0.6:
    s4 = 2
elif -0.6 <= _observation[3] < -0.5:
    s4 = 3
elif -0.5 <= _observation[3] < -0.4:
    s4 = 4
elif -0.4 <= _observation[3] < 0.4:
    s4 = 5
else:
    s4 = 6
```
I realized I could classify them like this.
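For reference, here is roughly how that observation step can be done (a sketch with variable names of my own, not the article's exact code): run a bunch of random episodes and record the extremes of the two unbounded dimensions.

```python
vel_samples, ang_vel_samples = [], []
for _ in range(100):  # a handful of random episodes
    obs = env.reset()
    done = False
    while not done:
        obs, _, done, _ = env.step(np.random.randint(2))
        vel_samples.append(obs[1])      # cart velocity
        ang_vel_samples.append(obs[3])  # pole angular velocity

print(min(vel_samples), max(vel_samples))
print(min(ang_vel_samples), max(ang_vel_samples))
```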
Let's train like that. About 3000 episodes is plenty.

```python
env = gym.make('CartPole-v0')
agent = SARSA(env)
agent.learn(time = 3000)
```
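If you want a number rather than an animation, one quick way (my own sketch) to evaluate the trained agent is to run greedy episodes and average the survival time:

```python
lengths = []
for _ in range(100):
    s = env.reset()
    done, t = False, 0
    while not done:
        t += 1
        s, _, done, _ = env.step(agent.policy(s, 10000))  # huge episode index, essentially greedy
    lengths.append(t)
print(sum(lengths) / len(lengths))  # average steps survived (200 is the cap for CartPole-v0)
```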
The change in the number of steps per episode looks like this.
Now, let's check the animation with `agent.show()`.
It's pretty stable and keeps the pole up for a long time. As a **man**, I'm happy.
Let's also compare Q-learning and SARSA in this environment. For Q-learning, G becomes

```python
G = reward + gamma * max(self.q_table[s1][s2][s3][s4])
```

When I train with this instead, convergence seems to be more stable with SARSA. If you're a man, face reality the way SARSA does. Yes.
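For reference, the only change inside `learn` is this one line, shown here next to the SARSA target from the class above:

```python
# SARSA (on-policy): bootstrap from the action the policy would actually take next
G = reward + gamma * self.q_table[s1][s2][s3][s4][self.policy(next_s, j)]

# Q-learning (off-policy): bootstrap from the best action in the next state
G = reward + gamma * max(self.q_table[s1][s2][s3][s4])
```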
In this environment, the discretization of the states felt like the hardest part to me. It seems DQN was born precisely to solve that problem. Next time, I think I'll try building a DQN. See you!