[PYTHON] Solve OpenAI Gym Copy-v0 with Q-learning

Purpose

This article presents a solution to the Copy-v0 problem from OpenAI Gym [^1].

https://gym.openai.com/envs/Copy-v0

As a follow-up to the article below, this time the problem is solved with Q-learning.

http://qiita.com/namakemono/items/16f31c207a4f19c5a4df

Solution

- Q-learning

Q(s,a) \leftarrow Q(s,a) + \alpha \left\{ r(s,a,s') + \gamma\max_{a'}Q(s',a') - Q(s,a) \right\} \\
r(s,a,s') = \mathbb{E}[R_{t+1} | S_t=s, A_t=a, S_{t+1}=s']

- $\epsilon$-greedy method
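The $\epsilon$-greedy rule can be illustrated with a minimal sketch. The helper name `epsilon_greedy`, the toy states `"s0"`/`"s1"`, and the action labels are hypothetical and are not part of the solution code; the dictionary layout `Q[state][action]` is assumed to match the one used in the code further down.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.01, rng=random):
    """Return a random action with probability epsilon, else the greedy one.

    A state that has no Q-values yet is always explored at random.
    """
    known = Q.get(state)
    if not known or rng.random() < epsilon:
        return rng.choice(actions)
    # Greedy choice: the action with the largest estimated value
    return max(known, key=known.get)

# Toy Q-table (illustrative values only)
Q = {"s0": {"left": 0.2, "right": 0.7}}
actions = ["left", "right"]

greedy_pick = epsilon_greedy(Q, "s0", actions, epsilon=0.0)  # never explores
unseen_pick = epsilon_greedy(Q, "s1", actions, epsilon=0.0)  # unknown state
```

With `epsilon=0.0` the known state always yields the highest-valued action, while the unseen state still falls back to a random action.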

Derivation of Q-learning [^2]

\begin{align}
Q(s,a) &= r(s,a)+\gamma \sum_{s' \in S} p(s'|s,a) \max_{a' \in A(s')} Q(s',a') & \\
       &\simeq  r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') & (\because s' \sim p(s'|s,a), \text{ assuming next states other than } s' \text{ are unlikely}) \\
       &\simeq  (1-\alpha)Q(s,a) + \alpha \left\{ r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') \right\} & (\because \text{smoothing}) \\
       &=  Q(s,a) + \alpha \left\{ r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') - Q(s,a) \right\}  &
\end{align}
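The last two lines of the derivation are the same update written in two ways. A small numerical check makes this concrete; the function names and the sample values below are made up for illustration, while `alpha=0.3` and `gamma=0.9` are the defaults used in the code of this article.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.3, gamma=0.9):
    """Incremental form: Q + alpha * (target - Q)."""
    target = reward + gamma * max(next_q_values)
    return q_sa + alpha * (target - q_sa)

def q_update_smoothed(q_sa, reward, next_q_values, alpha=0.3, gamma=0.9):
    """Smoothing form: (1 - alpha) * Q + alpha * target."""
    target = reward + gamma * max(next_q_values)
    return (1 - alpha) * q_sa + alpha * target

# Illustrative values: Q(s,a) = 0.5, r = 1.0, Q(s', .) = [0.2, 1.0]
new_q = q_update(0.5, 1.0, [0.2, 1.0])
new_q_smoothed = q_update_smoothed(0.5, 1.0, [0.2, 1.0])
```

Here the target is 1.0 + 0.9 * 1.0 = 1.9, so both forms move Q(s,a) from 0.5 to 0.5 + 0.3 * 1.4 = 0.92 (up to floating-point rounding).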

Code [^3]

import numpy as np
import gym
from gym import wrappers

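def run(alpha=0.3, gamma=0.9):
    Q = {}  # Q[state][action] -> estimated action value
    env = gym.make("Copy-v0")
    env = wrappers.Monitor(env, '/tmp/copy-v0-q-learning', force=True)
    Gs = []  # Return of every episode so far
    for episode in range(10**6):
        x = env.reset()
        X, A, R = [], [], []  # States, actions, rewards of this episode
        done = False
        while not done:
            # Explore with probability 0.01, or whenever the state is new
            if (np.random.random() < 0.01) or (x not in Q):
                a = env.action_space.sample()
            else:
                # Greedy action: the one with the largest Q-value
                a = max(Q[x].items(), key=lambda kv: kv[1])[0]
            X.append(x)
            A.append(a)
            # Make sure Q[x][a] exists before it is updated later
            Q.setdefault(x, {}).setdefault(a, 0.0)
            x, r, done, _ = env.step(a)
            R.append(r)
        T = len(X)
        # Last step: the episode has ended, so there is no bootstrap term
        x, a, r = X[-1], A[-1], R[-1]
        Q[x][a] += alpha * (r - Q[x][a])
        # Remaining steps, updated backwards from the end of the episode
        for t in range(T-2, -1, -1):
            x, nx, a, r = X[t], X[t+1], A[t], R[t]
            # max() over the dict view works on Python 3 (np.max does not)
            Q[x][a] += alpha * (r + gamma * max(Q[nx].values()) - Q[x][a])
        G = sum(R)  # Return (sum of the rewards in this episode)
        print("Episode: %d, Reward: %d" % (episode, G))
        Gs.append(G)
        # Stop once the mean return over the last 100 episodes exceeds 25
        if np.mean(Gs[-100:]) > 25.0:
            break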

if __name__ == "__main__":
    run()
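The code above updates Q only after an episode has finished, walking the recorded steps from last to first. The following sketch isolates that step so it can be tried without Gym. The function name `backward_update` and the three-step toy episode are hypothetical, and unlike the code above the sketch creates missing Q entries inside the loop.

```python
def backward_update(Q, states, actions, rewards, alpha=0.3, gamma=0.9):
    """Apply the Q-learning update to one recorded episode, last step first."""
    last = len(states) - 1
    for t in range(last, -1, -1):
        x, a, r = states[t], actions[t], rewards[t]
        row = Q.setdefault(x, {})
        row.setdefault(a, 0.0)
        if t == last:
            target = r  # terminal step: no bootstrap term
        else:
            # The next state's row was already updated in this pass
            target = r + gamma * max(Q[states[t + 1]].values())
        row[a] += alpha * (target - row[a])
    return Q

# Toy episode (illustrative): reward arrives only on the final step
Q = backward_update({}, states=[0, 1, 2], actions=["a", "b", "c"],
                    rewards=[0.0, 0.0, 1.0])
```

Because the steps are processed in reverse, the final reward reaches the first state within a single pass: Q[2]["c"] becomes 0.3, Q[1]["b"] becomes 0.3 * 0.9 * 0.3 = 0.081, and Q[0]["a"] is already positive. With a forward pass over the same episode, only the last entry would change.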

Score

Episode: 30229, Reward: 29

References
