[PYTHON] [Reinforcement learning] R2D2 implementation / explanation revenge commentary (Keras-RL)

I previously implemented R2D2, but I couldn't get mini-batch learning to work. This time, after some trial and error, I finally managed to implement it.

It has been a while since the previous article, so I will roughly go over the overall flow again. I will also fix the mistakes in the previous implementation.

This article comes in two parts: this commentary and a hyperparameter settings article. For the hyperparameters, see: [Reinforcement learning] R2D2 implementation / explanation revenge hyperparameter explanation (Keras-RL)

Postscript: R2D3 has also been implemented. [Reinforcement learning] I implemented / explained R2D3 (Keras-RL)

Whole code

The code created for this article is below. This time it is available only on GitHub.


Implementation explanation of DQN (Rainbow)

As a review, here is an overview of the DQN (Rainbow) implementation again. See the previously posted article for a detailed explanation.

The overall picture of learning with DQN (Rainbow) is summarized in the figures below.

zu1.PNG

zu2.PNG

DQN stores experience data (experiences) in memory in the following form.

e_{t} = (s_{t},a_{t},r_{t},s_{t+1})

If the Multi-step learning step count is 1, the next state is $t+1$; if it is 3 steps, it is $t+3$.

                 Formula    In the figure
Previous state   s_{t}      observation: t(n-6) ~ t(n-3)
Next state       s_{t+1}    observation: t(n-3) ~ t(n)
Action           a_{t}      action: t(n-3)
Reward           r_{t}      reward: t(n)

In addition, the length each variable holds internally is as follows.

Variable             Length held internally          Length saved to memory
rewards              multisteps                      0 (used for calculations only)
Calculated rewards   1                               1 (current state)
actions              multisteps + 1                  1 (previous state)
observations         input_sequence + multisteps     input_sequence + multisteps

Incorrect action referenced in Multi-Step learning

In the previous article, Multi-Step learning referred to the action at $t_{n}$, which is incorrect. The correct index is $t_{n-multisteps}$, because the action belongs to the previous state.
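To make the indexing concrete, here is a minimal sketch (variable names are assumptions, not the article's actual memory code) of assembling a multi-step experience; note that the stored action is the one taken in the previous state:

GAMMA = 0.99
MULTISTEPS = 3

def make_experience(observations, actions, rewards, t):
    # discounted sum of the next MULTISTEPS rewards
    multi_reward = sum((GAMMA ** i) * rewards[t + i] for i in range(MULTISTEPS))
    return (observations[t],                # s_t      : previous state
            actions[t],                     # a_t      : action taken in the previous state
            multi_reward,                   # r_t      : multi-step reward
            observations[t + MULTISTEPS])   # s_{t+3}  : next state (multisteps ahead)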

Incorrect importance sampling

The previous article is below.

To put it simply, with Prioritized Experience Replay experiences are retrieved according to priority, so the retrieved experiences become biased. That bias then biases learning, and importance sampling is what corrects it.

Specifically, experiences sampled with high probability are reflected less strongly in the Q-value update, and experiences sampled with low probability are reflected more strongly.
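For reference, a minimal sketch of the standard importance-sampling weight calculation (this is the textbook PER formula with an assumed beta; the article's memory class may differ):

import numpy as np

def importance_sampling_weights(priorities, beta=0.4):
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities / priorities.sum()   # sampling probability P(i)
    n = len(priorities)
    weights = (n * probs) ** (-beta)        # high P(i) -> small weight
    return weights / weights.max()          # normalize so the largest weight is 1

# the experience with the highest priority gets the smallest weight
print(importance_sampling_weights([10.0, 1.0, 0.1]))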

My previous implementation was subtly wrong and did not learn well. I had applied the weight to the updated Q value itself, but it should have been applied to the TD error (the variable naming did not help either). Also, since the weight is reflected in the Q-value update, it is not applied to the priority.

-Previous implementation (pseudo code)

IS


def train():

    # Sample experiences from the PER memory according to priority probability
    batchs, batch_weight = memory.sample(batch_size)

    # Get the Q values of the previous state from model
    # state0_qvals holds the Q value of each action
    state0_qvals = model.predict(state0_batch)

    for batch_i in range(batch_size):
        reward = (reward of batchs[batch_i])    # pseudo code
        action = (action of batchs[batch_i])    # pseudo code
        q0 = state0_qvals[batch_i][action]      # Q value before the update

        # Get the maximum Q value of the current state using model and target_model
        # (the way it is obtained differs between DQN and DDQN)
        maxq = (obtained from model and target_model)  # pseudo code

        td_error = reward + (gamma ** reward_multisteps) * maxq
        td_error *= batch_weight

        priority = abs(td_error - q0)

        # Learn by changing only the Q value of the target action
        state0_qvals[batch_i][action] = td_error

    # train
    model.train_on_batch(state0_batch, state0_qvals)

-Implementation after change (pseudo code)

IS


def train():

    # Sample experiences from the PER memory according to priority probability
    batchs, batch_weight = memory.sample(batch_size)

    # Get the Q values of the previous state from model
    # state0_qvals holds the Q value of each action
    state0_qvals = model.predict(state0_batch)

    for batch_i in range(batch_size):
        reward = (reward of batchs[batch_i])    # pseudo code
        action = (action of batchs[batch_i])    # pseudo code
        q0 = state0_qvals[batch_i][action]      # Q value before the update

        # Get the maximum Q value of the current state using model and target_model
        # (the way it is obtained differs between DQN and DDQN)
        maxq = (obtained from model and target_model)  # pseudo code

        # * subtract q0 so that td_error is a proper TD error
        # * batch_weight is applied at the Q-value update below instead
        td_error = reward + (gamma ** reward_multisteps) * maxq - q0

        # * the absolute value of td_error is used directly as the priority
        priority = abs(td_error)

        # Learn by changing only the Q value of the target action
        # * since td_error is now a difference, apply the weight to it and add it to the Q value
        state0_qvals[batch_i][action] += td_error * batch_weight

    # train
    model.train_on_batch(state0_batch, state0_qvals)

R2D2 implementation description

Mini-batch learning

Previously, mini-batch learning could not be implemented because I did not understand Keras's stateful LSTM well. The earlier investigation articles are as follows.

zu3.PNG

Apparently a stateful LSTM holds batch_size hidden states, and they can be read and set directly. With this, multiple sequences can now be trained at the same time.
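Below is a minimal sketch (layer sizes and shapes are assumptions) of reading and overwriting the batch_size hidden states of a stateful Keras LSTM, so that each slot in the batch continues from the state of a different sampled sequence:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

batch_size = 16
timesteps = 1
features = 3
lstm_units = 8

inp = Input(batch_shape=(batch_size, timesteps, features))
lstm_layer = LSTM(lstm_units, stateful=True)
x = lstm_layer(inp)
out = Dense(2)(x)
model = Model(inp, out)

# each stateful LSTM keeps batch_size rows of hidden/cell state: shape (batch_size, lstm_units)
h, c = [tf.keras.backend.get_value(s) for s in lstm_layer.states]

# overwrite them, e.g. with hidden states restored from replay memory for each batch entry
saved_h = np.zeros((batch_size, lstm_units), dtype=np.float32)
saved_c = np.zeros((batch_size, lstm_units), dtype=np.float32)
lstm_layer.reset_states(states=[saved_h, saved_c])

# now each row of the batch continues from its own restored state
q = model.predict(np.zeros((batch_size, timesteps, features)), batch_size=batch_size)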

DRQN (R2D2)

For clarity, I will explain using R2D2 with the parallel processing part removed (in other words, DRQN). The previous article is below.

As with DQN, here is a schematic diagram.

zu4.PNG

zu5.PNG

It has become quite complicated... I drew this figure because I kept getting confused while implementing it...

The way the Q value is updated and the priority is computed is the same as in DQN, so it is omitted from the figure.

The key points are input_sequence and input_length. Last time I was not aware of this distinction (assuming input_sequence = 1, what is now input_length was called input_sequence).

input_sequence is the length of the state fed into the model at once, and input_length is the number of such inputs. The Q value is updated once per input, so input_length updates are made and a priority is computed for each. (I am a little unsure about this interpretation, but Section 2.3 of the R2D2 paper proposes a new way of computing priorities, and it is natural to think that one experience yields multiple priorities as described above.)
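As a toy illustration (simplified: burnin and multisteps are ignored, names are assumptions), one stored experience yields input_length model inputs of length input_sequence, and hence input_length Q-value updates and priorities:

input_sequence = 4
input_length = 3

observations = list(range(10))  # toy observation history

# input_length sliding windows, each of length input_sequence
windows = [observations[i:i + input_sequence] for i in range(input_length)]
print(windows)  # [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]] -> 3 inputs, 3 priorities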

The length each variable holds internally is as follows.

Variable               Length held internally                                     Length saved to memory
rewards                multisteps + input_length - 1                              0 (used for calculations only)
Calculated rewards     input_length                                               input_length
actions                multisteps + input_length                                  input_length (from the previous state)
hidden states          burnin + multisteps + input_length + 1                     1 (oldest state)
observations           burnin + input_sequence + multisteps + input_length - 1    0 (used to build the summary below)
Summary observations   burnin + multisteps + input_length                         same as the length held

rescaling function

h(x) = sign(x)(\sqrt{|x|+1}-1)+\epsilon x

rescaling2.png

The rescaling function was introduced in R2D2 to be used instead of reward clipping (to -1..1). I used to struggle with its inverse, but I forcibly made the inverse unnecessary.

The formula for deriving the TD error using the rescaling function is as follows ($y_t$ is the TD error):

y_{t} = h \Bigl(r_{t} + \gamma h^{-1}(\max_pQ_{target}(s_{t+1},a_{t}))\Bigr)

Distribute $h(\cdot)$ over the sum in the above formula (an approximation, since $h$ is not actually additive):

y_{t} = h (r_{t}) + h \Bigl(\gamma h^{-1}(\max_pQ_{target}(s_{t+1},a_{t}))\Bigr)

Applying a function to its inverse returns the original value, i.e. $h(h^{-1}(x)) = x$, so the $h$ and $h^{-1}$ on the right-hand side cancel out (the $\gamma$ between them is treated as part of the approximation error...).

Then it becomes as follows.

y_{t} = h (r_{t}) + \gamma (\max_pQ_{target}(s_{t+1},a_{t}))

With this, the rescaling function is applied only to the reward ($r_t$). Looking at the graph, you can see that rewards are squashed nicely (a reward of 100 becomes roughly 10), making it a good alternative to clipping.
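For reference, a minimal sketch of the rescaling function as written above (epsilon = 0.001 is an assumed small constant):

import numpy as np

def rescaling(x, epsilon=0.001):
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + epsilon * x

print(rescaling(100.0))  # roughly 9.15 -> a reward of 100 is squashed to around 10

# with the approximation above, the target only needs the reward rescaled:
gamma = 0.997            # assumed discount factor
reward, maxq = 1.0, 2.0  # toy values
target = rescaling(reward) + gamma * maxq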

Parallel processing (interprocess communication)

The previous article is below.

Reference: Complete understanding of Python threading and multiprocessing

At first I used a Queue, but the weight data was large and seemed to become a bottleneck, so I investigated interprocess communication. The results of that investigation are in the following article.

From this, the communication ended up as follows. (In the end, the Queue is still used as-is.)

zu10.PNG

zu11.PNG

zu12.PNG

Information exchange between processes is implemented with shared memory. I do not use a lock because the writer and the reader are clearly separated.
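Here is a minimal sketch of that pattern (class and variable names are illustrative, not the actual classes in the repository): the learner writes the latest weights into lock-free shared memory, the actor only reads and copies them, and a Queue is kept for lightweight notifications:

import multiprocessing as mp
import numpy as np

WEIGHT_SIZE = 1024  # assumed: flattened size of the model weights

def learner_proc(shared_weights, queue):
    weights = np.frombuffer(shared_weights, dtype=np.float64)
    for step in range(3):
        weights[:] = np.random.randn(WEIGHT_SIZE)  # pretend this is a weight update after training
        queue.put(("weights_updated", step))       # only a lightweight notification goes through the Queue

def actor_proc(shared_weights, queue):
    weights = np.frombuffer(shared_weights, dtype=np.float64)
    msg = queue.get()
    local_weights = weights.copy()  # copy the latest weights when notified
    print(msg, local_weights[:3])

if __name__ == "__main__":
    # writer and reader are clearly separated, so the shared array is created without a lock
    shared_weights = mp.Array("d", WEIGHT_SIZE, lock=False)
    queue = mp.Queue()
    procs = [mp.Process(target=learner_proc, args=(shared_weights, queue)),
             mp.Process(target=actor_proc, args=(shared_weights, queue))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()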

Callbacks

Interprocess communication turned out to be quite costly, and there was processing that straddled the Actor and the Learner, so I implemented callbacks for it.

They are mainly intended for save/load and logging. The base class of the implemented callback is as follows.

R2D2Callback


import rl.callbacks
class R2D2Callback(rl.callbacks.Callback):
    def __init__(self):
        pass

    #--- train ---

    def on_r2d2_train_begin(self):
        pass

    def on_r2d2_train_end(self):
        pass

    #--- learner ---

    def on_r2d2_learner_begin(self, learner):
        pass
    
    def on_r2d2_learner_end(self, learner):
        pass

    def on_r2d2_learner_train_begin(self, learner):
        pass

    def on_r2d2_learner_train_end(self, learner):
        pass

    #--- actor ---
    # the methods below, plus the methods inherited from rl.callbacks.Callback

    def on_r2d2_actor_begin(self, actor_index, runner):
        pass

    def on_r2d2_actor_end(self, actor_index, runner):
        pass

As you can see, it inherits from Keras-RL's Callback, and that inherited part is used as-is by the Agent.

Note that train, learner, and actor are each invoked from a different process. So even if you write logic that spans them, values will not be carried over, because the processes are separate.
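For example, a minimal sketch of a custom callback (a hypothetical subclass, not one of the built-in callbacks) that counts training steps on the learner side:

class CountTrainCallback(R2D2Callback):
    def __init__(self, interval=100):
        super().__init__()
        self.interval = interval
        self.count = 0

    def on_r2d2_learner_train_end(self, learner):
        # runs in the learner process; this counter is not visible from the actor processes
        self.count += 1
        if self.count % self.interval == 0:
            print("learner train calls:", self.count)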

Saving/loading and logging that use these callbacks are explained in the hyperparameter article.

GPU

When running on the GPU as-is with tensorflow 2.1.0, I get the following error.

tensorflow.python.framework.errors_impl.InternalError:  Blas GEMM launch failed : a.shape=(32, 12), b.shape=(12, 128), m=32, n=128, k=12     

Apparently this error occurs when the GPU is used from multiple processes. Following the reference below, I configured TensorFlow so that the GPU can be used from multiple processes.

import tensorflow as tf

# I want this set in every process, so it is written at global (module) scope
for device in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(device, True)

Also, inside R2D2Manager I wrote logic that automatically determines whether to use the CPU or the GPU.

import tensorflow as tf

def train(self):
    # (abridged)
    if len(tf.config.experimental.list_physical_devices('GPU')) > 0:
        self.enable_GPU = True
    else:
        self.enable_GPU = False
    # (abridged)

Other implementations

ImageModel extension

The image processing layers in the NN (neural network) are unchanged from DQN, so I extended the code so that this part can be replaced.

The NN layers in DQN are as follows.

    Layer                    Overview
1   Input layer
2   Input conversion layer   Layer that generalizes the input format
3   Image processing layer   For image processing
4   LSTM layer               When using an LSTM
5   Dueling Network layer    When using a Dueling Network
6   Dense layer              Included when using a Dueling Network
7   (Output layer)           Actually included in the Dueling Network layer

Generalization of input conversion layer

The input conversion layer flattens whatever input format is given into a one-dimensional output (Flatten). It is written assuming the following four input types.

InputType


import enum
class InputType(enum.Enum):
    VALUES = 1    #No image
    GRAY_2ch = 3  # (width, height)
    GRAY_3ch = 4  # (width, height, 1)
    COLOR = 5     # (width, height, ch)

Input layer without image (VALUES) (without LSTM)

Just flatten it.

input_sequence = 4
input_shape = (3,)

c = Input(shape=(input_sequence,) + input_shape)
# output_shape == (None, 4, 3)

c = Flatten()(c)
# output_shape == (None, 12)

Input layer without image (VALUES) (with LSTM)

Again it is just flattened, but wrapped in TimeDistributed so that the timesteps dimension is preserved.

batch_size = 16
input_sequence = 4
input_shape = (3,)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 3)

c = TimeDistributed(Flatten())(c)
# output_shape == (16, 4, 3)

Gray image (no channel) input layer (GRAY_2ch) (without LSTM)

This is the conversion used in DQN: input_sequence (the number of stacked inputs) takes the place of the channel dimension.

input_sequence = 4
input_shape = (84, 84)  # (width, height)

c = Input(shape=(input_sequence,) + input_shape)
# output_shape == (None, 4, 84, 84)

c = Permute((2, 3, 1))(c)  # layer that changes the dimension order
# output_shape == (None, 84, 84, 4)

c = ...  # the image processing layers follow here

Input layer for gray image (without channel) (GRAY_2ch) (with LSTM)

When LSTM is enabled, the sequence information is carried by the timesteps dimension instead, so a channel dimension is added.

batch_size = 16
input_sequence = 4
input_shape = (84, 84)  # (width, height)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 84, 84)

c = Reshape((input_sequence,) + input_shape + (1,))(c)  # add a channel dimension
# output_shape == (16, 4, 84, 84, 1)

c = ...  # the image processing layers follow here

Image (with channel) input layer (GRAY_3ch, COLOR) (without LSTM)

It is passed to the image processing layers as-is. However, the input_sequence information cannot be represented in this case.

input_sequence = 4
input_shape = (84, 84, 3)  # (width, height, channel)

c = Input(shape=input_shape)
# output_shape == (None, 84, 84, 3)

c = ...  # the image processing layers follow here

Image (with channel) input layer (GRAY_3ch, COLOR) (with LSTM)

Nothing special here; it is passed through as-is.

batch_size = 16
input_sequence = 4
input_shape = (84, 84, 3)  # (width, height, channel)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 84, 84, 3)

c = ...  # the image processing layers follow here

Generalization of image processing layer

An ImageModel class is defined so that the image processing layers can be swapped out.

The argument c of create_image_model is passed in the following format.

No LSTM: shape(batch_size, width, height, channel) 
With LSTM: shape(batch_size, timesteps, width, height, channel)

The return value should be in the following format:

No LSTM: shape(batch_size, dim) 
With LSTM: shape(batch_size, timesteps, dim)

The following is an example that implements the DQN-style layers.

DQNImageModel


class DQNImageModel(ImageModel):
    """ native dqn image model
    https://arxiv.org/abs/1312.5602
    """

    def create_image_model(self, c, enable_lstm):
        """
        c shape(batch_size, width, height, channel)
        return shape(batch_size, dim)
        """

        if enable_lstm:
            c = TimeDistributed(Conv2D(32, (8, 8), strides=(4, 4), padding="same"), name="c1")(c)
            c = Activation("relu")(c)
            
            c = TimeDistributed(Conv2D(64, (4, 4), strides=(2, 2), padding="same"), name="c2")(c)
            c = Activation("relu")(c)
            
            c = TimeDistributed(Conv2D(64, (3, 3), strides=(1, 1), padding="same"), name="c3")(c)
            c = Activation("relu")(c)
            
            c = TimeDistributed(Flatten())(c)

        else:
                
            c = Conv2D(32, (8, 8), strides=(4, 4), padding="same", name="c1")(c)
            c = Activation("relu")(c)

            c = Conv2D(64, (4, 4), strides=(2, 2), padding="same", name="c2")(c)
            c = Activation("relu")(c)

            c = Conv2D(64, (3, 3), strides=(1, 1), padding="same", name="c3")(c)
            c = Activation("relu")(c)

            c = Flatten()(c)
        return c
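As an example of swapping in a different image model, here is a minimal sketch (the layer sizes are arbitrary assumptions, and ImageModel is the base class from the article's repository); the only requirement is that create_image_model respects the input and output shapes listed above:

from tensorflow.keras.layers import Conv2D, Flatten, TimeDistributed

class SmallImageModel(ImageModel):
    def create_image_model(self, c, enable_lstm):
        conv = Conv2D(16, (3, 3), strides=(2, 2), padding="same", activation="relu")
        if enable_lstm:
            # keep the timesteps dimension by wrapping everything in TimeDistributed
            c = TimeDistributed(conv)(c)
            c = TimeDistributed(Flatten())(c)
        else:
            c = conv(c)
            c = Flatten()(c)
        return c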

Policy extension

The previous commentary article is below.

DQN uses only the ε-greedy exploration policy, but several other policies are introduced in the article above. I implemented them so that they can be used, though ε-greedy seems to be sufficient. Details are explained in the hyperparameter article.
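For reference, a minimal sketch of ε-greedy action selection (simplified; the actual Policy classes in the repository have a different interface):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: random action
    return int(np.argmax(q_values))              # exploit: best known action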

Afterword

That is the implementation for now. Next time, I would like to write an article with examples of how to set each parameter.

