[PYTHON] Reinforcement Learning 10: Try it with a trained neural network.

This post assumes you have completed the series up through Reinforcement Learning 9. Development is done in Jupyter Notebook. VSCode is not used, so switching over is simple.

This follows the ChainerRL quickstart as is. First, install matplotlib.

The quickstart does not mention this step.

pip install matplotlib
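Before running the notebook, it can help to confirm that every package it imports is actually installed. The following is a minimal sketch of such a check, not part of the quickstart; the helper name `missing_packages` is my own, and the package list is simply read off the import statements in the notebook code.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Top-level packages imported by the notebook code in this post.
required = ["chainer", "chainerrl", "gym", "numpy",
            "matplotlib", "pyvirtualdisplay", "IPython"]
print("missing:", missing_packages(required))
```

Anything that shows up in the list can then be installed with pip in the same way as matplotlib.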

The following is copied from the Jupyter notebook.

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np

env = gym.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
#env.render()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)

class QFunction(chainer.Chain):

    def __init__(self, obs_size, n_actions, n_hidden_channels=50):
        super().__init__()
        with self.init_scope():
            self.l0 = L.Linear(obs_size, n_hidden_channels)
            self.l1 = L.Linear(n_hidden_channels, n_hidden_channels)
            self.l2 = L.Linear(n_hidden_channels, n_actions)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = F.tanh(self.l0(x))
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)

# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

# Set the discount factor that discounts future rewards.
gamma = 0.95

# Use epsilon-greedy for exploration
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

# Since observations from CartPole-v0 are numpy.float64 while
# Chainer only accepts numpy.float32 by default, specify
# a converter as a feature extractor function phi.
phi = lambda x: x.astype(np.float32, copy=False)

# Now create an agent that will interact with the environment.
agent = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_interval=1,
    target_update_interval=100, phi=phi)

# Start virtual display
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1024, 768))
display.start()
import os
os.environ["DISPLAY"] = ":" + str(display.display) + "." + str(display.screen)

# Load the trained agent from the 'agent' directory.
agent.load('agent')
# Collect rendered frames for the animation.
frames = []
for i in range(3):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        frames.append(env.render(mode='rgb_array'))
        action = agent.act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    print('test episode:', i, 'R:', R)
    agent.stop_episode()
env.render()

import matplotlib.pyplot as plt
import matplotlib.animation
import numpy as np
from IPython.display import HTML

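# Size the figure so that one frame pixel maps to one figure pixel at 72 dpi.
plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi=72)
patch = plt.imshow(frames[0])
plt.axis('off')
# Swap in the i-th captured frame on each animation step.
animate = lambda i: patch.set_data(frames[i])
ani = matplotlib.animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=50)
# Render the animation inline in the notebook as JavaScript/HTML.
HTML(ani.to_jshtml())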
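The test loop in the notebook (reset, act, step, add up the reward, stop at 200 steps) can be tried out without gym or ChainerRL. The following is a minimal sketch under stated assumptions: `ToyEnv`, `toy_q_values`, `greedy_act` and `run_episode` are hypothetical names of my own, the environment is a made-up stand-in rather than CartPole, and the "Q-function" is a fixed formula rather than a trained network. It only illustrates the shape of the loop and the idea of picking the action with the highest Q-value at test time.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a gym environment (this is NOT CartPole).

    The observation is a float in [-1, 1]. Action 1 is correct when the
    observation is positive, action 0 otherwise. A wrong action ends the episode.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.obs = 0.0

    def reset(self):
        self.obs = self.rng.uniform(-1.0, 1.0)
        return self.obs

    def step(self, action):
        correct = 1 if self.obs > 0 else 0
        if action != correct:
            # Wrong action: no reward and the episode is over.
            return self.obs, 0.0, True, {}
        self.obs = self.rng.uniform(-1.0, 1.0)
        return self.obs, 1.0, False, {}


def toy_q_values(obs):
    """Made-up Q-values for actions 0 and 1 (stands in for a trained network)."""
    return [-obs, obs]


def greedy_act(obs):
    """Pick the action with the highest Q-value, with no exploration."""
    q = toy_q_values(obs)
    return max(range(len(q)), key=q.__getitem__)


def run_episode(env, act, max_steps=200):
    """Same loop structure as the notebook: stop on done or after max_steps."""
    obs = env.reset()
    done = False
    R = 0.0
    t = 0
    while not done and t < max_steps:
        action = act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    return R


print('test episode R:', run_episode(ToyEnv(seed=0), greedy_act))
```

Because the toy Q-values always favor the correct action, the greedy policy survives until the 200-step cap, which is the same cap the notebook loop applies to CartPole-v0.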

Windows differs considerably, so I will write that up all together in Reinforcement Learning 12.

A short summary of parts 1 through 10. The ChainerRL quickstart worked well overall, with a few additions of my own. Is ChainerRL a wrapper around Chainer? It is easy to modify, and I think it is excellent. I plan to use TensorFlow in the future, but for now I think I will stay with ChainerRL. I will keep working with OpenAI Gym until around part 30.

The reason for choosing Chainer is that I have high expectations for Preferred Networks. In the US, organizations such as Google pay researchers well, but in Japan there are few that do. Even the Mitou ("unexplored") project, which pays out research funds as an incubator, works out to an hourly rate of 1600 yen. An internship at Preferred pays 2500 yen, with various allowances on top. That shows how serious they are. Their benchmarks are also consistently high. I am looking forward to what comes next.