It is an Atari game that is often used in the evaluation of reinforcement learning, but it seems that a method beyond humans has finally appeared in all 57 games. I implemented it immediately.

Please note that it is not accurate because it collects information on the net and implements it in its own way.
Please note that there may be misunderstandings (some parts may not be understood).

Whole code

The code created in this article is below.

Google Colaboratory seems to be incompatible with multiprocessing, so only 1 Actor will be learned.)
Learn about Mountain Car

As for the configuration, the first half is the technical explanation and the second half is the implementation explanation.

What is Agent57?
Explanation of NGU (Never Give Up)
About internal rewards
Measures against POMDP (UVFA and Retrace)
Explanation of Agent57
About the action value function (Q function)
About the meta controller
Implementation
Implementation of internal rewards
Implementation of discovery policy
Implementation of behavioral value function (Q function) model

Introduction

Agent57 is a so-called DQN series reinforcement learning method. I have written about the technology up to Agent57 of DQN series before, so please do not hesitate

About Agent 57

The Atari 2600's 57 games are often used as indicators of reinforcement learning, but finally all 57 games have revealed reinforcement learning methods that exceed human performance. There were games that were difficult to learn with the existing method, such as Montezuma's Revenge, but Agent57 seems to be able to learn properly.

Below is a comparison of each method quoted from the DeepMind article.

Artboard 1 copy.png

I thought F (ApeX) and G (R2D2) were amazing, but it's overwhelming performance beyond that.

·reference Agent57: Outperforming the human Atari benchmark Atari Complete Conquest! Atari57 What is Agent57 that surpasses humans in all games !?

Up to Agent57

Artboard 1 copy 3.png

(Figure quoted from DeepMind article)

This article describes NGU (Never Give Up) and Agent57.

As shown in the figure, Agent 57 is an improvement of the existing DQN method. I hope you can refer to my past articles for the content that interests you about past methods.

The major change in NGU / Agent57 is that it focuses on exploration and exploitation, and introduces a mechanism that enables efficient exploration even in environments where rewards are difficult to obtain. There are various other methods adopted, so I would like to explain them one by one.

NGU(Never Give Up)

In reinforcement learning, there is a trade-off between exploring an unknown area and obtaining a reward (exploitation). There was a problem that the existing method could not learn well in games where the reward was sparse (it took a long time to get the reward). Therefore, NGU realizes efficient search by adding an internal reward (Intrinsic Reward) to the reward.

Internal rewards are set to meet the following three conditions.

+1 Avoid visiting the same state in an episode (strengthen)

Avoid visiting the same condition throughout the training (weak)
Ignore changing states regardless of agent behavior

·reference (Paper) Never Give Up: Learning Directed Exploration Strategies [Never give Up! Reinforcement learning that leads to complete domination of Atari and searches without giving up even in difficult environments!](Https://ai-scholar.tech/articles/reinforcement-learning/never-give-up-ai- 394)

Intrinsic Reward

NGU defines rewards by adding internal rewards (Intrinsic Rewards) to external rewards (Extrinsic Rewards).

r_{t}=r^e_t + \beta r^i_t

Where $ r ^ e_t $ is the external reward, $ r ^ i_t $ is the internal reward, and $ \ beta $ is a constant that weights the internal reward. Then, train using $ r_ {t} $ (extended reward), which is the sum of these.

Now let's look at the internal rewards. (The figure is quoted from the paper, and this figure is referred to as Figure A hereafter)

Internal rewards are created from the life long novelty module (green frame on the right side of Figure A) and the episodic novelty module (red frame on the right side of Figure A).

Generate a value of $ \ alpha_t $ from the lifetime memory and $ r ^ {episode} _t $ from the episodic memory, and synthesize them as follows.

r ^ i_t = r ^ {episode} _t ・ min (max (\ alpha _ {t}, 1), L)

Written in Japanese, internal reward = episodic memory x lifetime memory.

However, if the lifetime memory is 1 or less, it will be 1, and if it is L or more, it will be L.
L is a constant representing the maximum scale reward (L = 5 in the paper)

Let's start with the episodic memory.

Episodic novelty module

In the episodic memory, the reward is decided so that the same state will not be visited in one episode. The overall flow is as follows. (Corresponds to the red frame in Fig. A)

Use an embedded function (embedding network) to output a controllable state from the current state
From the controllable state and episodic memory M, use the k-nearest neighbors method to get an approximation of the number of visits in the current state.
Output the episode storage ($ r ^ {episode} _t $) from the approximate value of the number of visits.
Add controllable state to episode memory

The embedded function and the approximate value of the number of visits will be described later. Episode memory is initialized at the beginning of one episode. This seems to determine the number of visits within one episode.

Embedded function (embedding network)

Embedded functions are neural network functions that generate controllable states.

It looks like a simple neural network that just compresses the current state to p dimension. The problem is learning, which is likened to the Siamese network as follows.

(This figure is a detailed figure on the left side of Figure A)

The input is the previous state and the current state, and the output is the probability that each action will be selected. In short, it is a network that learns which action was performed from the difference between the previous state and the current state.

It seems that this network is learning only the states that change in connection with behavior.

Reference: [Deep distance learning] Thorough explanation of Siamese Network and Contrastive Loss

Approximation of the number of visits by the k-nearest neighbor method

The episodic memory ($ r ^ {episode} _t $) is calculated based on the number of visits. The number of visits is calculated by approximating the episode memory and the current state using the k-nearest neighbor method. The following is an explanation using mathematical formulas.

The episodic memory ($ r ^ {episode} _t $) is:

r^{episode}_t = \frac{1}{ \sqrt{ n(f(x _{t})) } }

$ n (f (x_t)) $ is the number of visits. The number of visits is approximated using the kernel function K, and is as follows.

\frac{1}{ \sqrt{ n(f(x_t)) } } \approx \frac{1}{ \sqrt{ \sum_(f_i \in N_k)K(f(x_t), f_i) } + c}

c is a constant that guarantees a minimum number of pseudo visits. (c = 0.001) K is calculated by the k-nearest neighbor method of the controllable state in memory and the current controllable state.

K(x,y) = \frac{\epsilon}{ \frac{d^2(x,y)}{d^2_m} + \epsilon}

$ \ Epsilon $ is a small constant. ($ \ Epsilon $ = 0.001) d represents the Euclidean distance and $ d ^ 2_m $ represents the moving average of the Euclidean distance.

Life long novelty module

RND (Random Network Distillation) is used for the lifetime memory.

·reference Exploration by Random Network Distillation was tested on MountainCar. (Paper) Exploration by Random Network Distillation

As for the explanation of RND, when you want to judge whether it is an unknown state in a certain state, you usually count the number of visits and think that the more visits you have, the more known the state. However, RND has a reversal idea. The reference blog's analogy was good, so I will use it, but at RND, I will prepare a scratch map first. At first, the map is covered with silver paper, but it is gradually scraped off with each visit. RND is a method to judge how unknown it is by looking at the degree of scraping.

Then, how to realize this is to prepare two neural networks with the same structure. (A, B) A is left in its initial state and B learns to approach A. This learning of B is equivalent to the act of scraping silver paper. As a result, when comparing A and B, the error is large = the place that is not searched, and the error is small = the place that is well searched.

MDP and POMDP

Let's return to the premise of reinforcement learning. Q-learning is premised on a stochastic model of MDP (Markov decision process). To put it simply, if you perform an action in a certain state, it will shift to a random state and you will receive a fixed reward. (It's a general reinforcement learning model)

However, in NGU, internal rewards are added, so even if you move to the same state, the rewards will change. Therefore, NGU should be considered as a model of POMDP (partially observable Markov decision process), not MDP.

·reference [Markov decision process (Wikipedia)](https://ja.wikipedia.org/wiki/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E6%B1% BA% E5% AE% 9A% E9% 81% 8E% E7% A8% 8B) [Partially Observed Markov Decision Process (Wikipedia)](https://ja.wikipedia.org/wiki/%E9%83%A8%E5%88%86%E8%A6%B3%E6%B8%AC%E3% 83% 9E% E3% 83% AB% E3% 82% B3% E3% 83% 95% E6% B1% BA% E5% AE% 9A% E9% 81% 8E% E7% A8% 8B)

However, it is much more difficult to think of it as a POMDP than an MDP, so NGU has introduced two workarounds:

Enter external rewards directly into the Q network
Maintain an internal state representation that summarizes the history of all inputs (states, actions, and rewards) in the episode

Also, since the search action by the internal reward is added directly to the reward, the search action cannot be easily turned off. To solve this problem, NGU proposes a method to learn external reward and exploratory behavior at the same time.

UVFA(Universal Value Function Approximators)

DQN generally estimates the Q value by taking a certain state as an input. Here, UVFA is a value function that generalizes the input by adding not only the state but also the goal.

NGU uses this UVFA, but instead of the target, $ \ beta $ (internal reward reflection rate: search rate) is entered. By doing this, we propose to learn exploratory behavior at the same time. If $ \ beta = 0 $, you can get the result when the search is off, and if $ \ beta> 0 $, you can get the result when the search is on.

·reference (DeepMind slides) Universal Value Function Approximators (Paper) Universal Value Function Approximators

Retrace

Q-learning uses an off-policy search method in TD learning. (Conversely, Sarsa is one of the on-policy searches in TD learning.)

The difference is the reference of the Q value to be updated, and in the case of Off-Policy, the policy is not followed. (In Q learning, the action that maximizes the Q value is selected) In the case of On-Policy, it is acquired according to the policy. (For example, if you are searching with ε-greedy, use ε-greedy as the reference destination as well.)

The problem with Off-Policy is that the policy you actually acted on (Behavior policy) and the policy you want to learn (Target policy) are different. If this is different, the values may never converge (unsafe). In Retrace's paper, there are four methods to solve this (existing method 3). + Retrace) is compared to show that Retrace is superior.

To put it simply, Retrace is a method of predicting a strategy based on past results and controlling learning by the difference. (The coefficients used to control learning are called cutting traces coefficients ($ c_s $))

NGU incorporates Retrace.

·reference Reinforcement learning that I can't hear anymore (10): Difference between Sarsa and Q-learning On-Policy and Off-Policy for Reinforcement Learning (Paper on Off-policy 1) Q (λ) with Off-Policy Corrections (Paper on Off-policy 2) Off-Policy Temporal-Difference Learning with Function Approximation (DeepMind slide) off-policy deep RL (Retrace's paper) Safe and efficient off-policy reinforcement learning (Japanese slide of Retrace paper) safe and efficient off policy reinforcement learning Basics of Reinforcement Learning Theory 2 [Review] UCL_RL Lecture05 Model Free Control

Discount rate

The discount rate ($ \ gamma $) depends on $ \ beta $ (internal rate of return) and changes. If $ \ beta = 0 $, then $ \ gamma = max $, and if $ \ beta = max $, then $ \ gamma = min $. (Specifically, $ \ beta = 0 $ is $ \ gamma = 0.997 $, $ \ beta = max $ is $ \ gamma = 0.99 $)

When searching, the influence of external rewards will be small, so it is a discount rate for adjustment.

Agent57

We were able to achieve efficient exploration with NGU, but in some games we could only get the same score as a simple random exploration. Also, one of the drawbacks of NGU is that it searches regardless of the progress of learning.

Agent57 has proposed the following and improved them.

Introduced a new parameter that separates external and internal rewards to stabilize learning of internal rewards
Introduced a meta controller and introduced a mechanism that allows the search rate to be selected according to priority.

(Paper) Agent57: Outperforming the Atari Human Benchmark

Change in behavior value function (Q function) architecture

Learning could be unstable at NGU. It happens when the external and internal rewards have different scale sparsity, or when only one reward reacts excessively compared to other rewards.

Agent57 speculated that it would be difficult to learn the behavioral value function when the nature of the reward is very different between external and internal rewards, and proposed a change in the architecture of the behavioral value function.

Specifically, the action value function is set in $ Q (x, a, j; \ theta ^ e) $ for learning external reward and $ Q (x, a, j; \ theta ^ i) $ for learning internal reward. Divide and calculate the Q value with the following formula. (It seems to be based on the transformed Bellman operator)

Q(x,a,j;\theta) = h(h^{-1}(Q(x,a,j;\theta^e)) + \beta_j h^{-1}( Q(x,a,j;\theta^i)))

$ \ Beta_j $ is the reflection rate (search rate) of the internal value function, and h is the Bellman operator function ?. h is not very well understood, but it is the same as the rescaling function of R2D2, and is as follows.

h(z) = sign(z)(\sqrt{|z|+1}-1 ) + \epsilon z

h^{-1}(z) = sign(z)(( \frac{ \sqrt{1+4\epsilon (|z|+1+\epsilon} + 4\epsilon }{ 2\epsilon } ) -1)

Meta controller

NGU and Agent57 can now set the search rate ($ \ beta_j $) using internal rewards.

However, in NGU, the search rate was fixed for each actor. For example, when Actor is 4, it looks like the following. (Value is tentative)

Actor1: $ \ beta_j $ = 0 (no search) Actor2: $ \ beta_j $ = 0.1 (search somehow) Actor3: $ \ beta_j $ = 0.2 (search a little) Actor4: $ \ beta_j $ = 0.3 (so-so search)

It is better to use a high value for the search rate in the early stage of training, but it is expected that it is better to use a low value as the training progresses. Therefore, Agent 57 proposed a meta controller in order to adopt a method of selecting an efficient search rate.

In the meta controller, the selection of an efficient search rate is regarded as a multi-armed bandit problem (MAB problem), which search policy (Actor in NGU) should be selected to obtain the maximum reward. This is being resolved. (Reward here refers to the total of undiscounted external rewards within an episode)

·reference Article I wrote earlier: [Reinforcement learning] Implemented / explained and compared multiple search policies (multi-armed bandit problem) Theory and Algorithm of Multi-armed Bandit Problem Introduction to Reinforcement Learning: Multi-armed Bandit Problem

Multi-armed bandit problem (MAB problem)

To put it simply, the MAB problem is the problem of maximizing the reward when selecting from multiple slots (arms) a certain number of times. For example, suppose you have four slots (A, B, C, D). What is the best way to choose the maximum prize after choosing 100 slots? Is the problem. I don't know the probability that the four slots will win.

In a typical MAB problem, the odds of winning a prize are constant for each slot. (It doesn't change after the first decision) However, with Agent57, the probability of getting a reward (reward result) changes. Therefore, the meta controller is a little devised.

sliding-window UCB method

Agent57 uses the UCB method as an algorithm for MAB problems. However, the UCB method cannot be applied as it is because the reward changes. Therefore, we propose to use the UCB method only for the last few episodes. This is referred to in the paper as the sliding-window UCB method. It is the same as the UCB method except that it only looks at the last few episodes.

Implementation

Intrinsic Reward

Embedded function (embedding network)

It is necessary to create a special neural network in which the network structure at the time of learning and the network structure at the time of use are different. First is the embedded function for getting the value.

Untitled Diagram (12)-Page-1.jpg

There is no mention of the image processing layer in the paper, which is the term used in this article. (Because the paper is premised on image input)

Next is the embedded function for learning.

Untitled Diagram (12)-Page-2.jpg

The code that implements the above figure is as follows.

This is a model for image input.
Each layer is named to share the weight.

input_sequence =Number of input frames
input_shape =Input format
nb_actions =Number of actions

`Embedded function for value acquisition`


c = input_ = Input(shape=(input_sequence,) + input_shape)
c = Conv2D(32, (8, 8), strides=(4, 4), padding="same", name="l0")(c)
c = Conv2D(64, (4, 4), strides=(2, 2), padding="same", name="l1")(c)
c = Conv2D(64, (3, 3), strides=(1, 1), padding="same", name="l2")(c)
c = Dense(32, activation="relu", name="l3")(c)
emb_model = Model(input_, c)

`Embedded function for training`


#Make two inputs
c1 = input1 = Input(shape=(input_sequence,) + input_shape)
c2 = input2 = Input(shape=(input_sequence,) + input_shape)

#Create a network while sharing weights
l = Conv2D(32, (8, 8), strides=(4, 4), padding="same", name="l0")
c1 = l(c1)
c2 = l(c2)
l = Conv2D(32, (4, 4), strides=(2, 2), padding="same", name="l1")
c1 = l(c1)
c2 = l(c2)
l = Conv2D(64, (3, 3), strides=(1, 1), padding="same", name="l2")
c1 = l(c1)
c2 = l(c2)
l = Dense(32, activation="relu", name="l3")
c1 = l(c1)
c2 = l(c2)

#Join
c = Concatenate()([c1, c2])
c = Dense(128, activation="relu")(c)
c = Dense(nb_actions, activation="softmax")(c)

emb_train_model = Model([input1, input2], c)

#optimizer from the paper
model.compile(
    loss='mean_squared_error', 
    optimizer=Adam(lr=0.0005, epsilon=0.0001))

Although it is a loss function, there was only a description in the paper that it was learned by the maximum likelihood estimation method. (Perhaps…) So for the time being, the mean square error is used.

·reference TPU technique of dark Keras: How to transfer the coefficients of a model with two inputs and a model with one input Multi-input multi-output model (Keras official)

Next is the code that copies the weights from emb_train_model to emb_model.

def sync_embedding_model():
    for i in range(4):  #layer Turn for a few minutes
        train_layer = emb_train_model.get_layer("l" + str(i))
        val_layer = emb_model.get_layer("l" + str(i))

        #Copy weight
        val_layer.set_weights(train_layer.get_weights())

We have created a great model with disjointed inputs and shared weights. .. ..

Learning embedded functions (embedding network)

It seems that it is hardly written in the paper ... For the time being, the update interval of Embeddings target is written as once per episode. What is an Embeddings target?

For the time being, the learning method was executed using the same sample as the learning of the behavioral value function, and the synchronization to emb_train_model → emb_model was set to once per episode.

batch_size =Batch size

def forward(observation):
    #Timing of learning every frame

    #Randomly acquire experience for batch from experience memory
    batchs = memory.sample(batch_size)

    #Format the data
    state0_batch = []   #Previous state
    state1_batch = []   #Next state
    emb_act_batch = []
    for batch in batchs:
        state0_batch.append(batchs["Previous state"])
        state1_batch.append(batchs["Next state"])

        #The action is one-Create data as hot vector
        a = np_utils.to_categorical(
            batchs["action"],
            num_classes=nb_actions
        )
        emb_act_batch.append(a)
    
    #training
    emb_train_model.train_on_batch(
        [state0_batch, state1_batch],  #There are two inputs, the previous state and the next state
        emb_act_batch                  #Teacher action
    )

def backward(self, reward, terminal):
    if terminal:
        #Synchronize at the end of the episode
        sync_embedding_model()    

* Omitting numpy conversion

Get the value of an embedded function (embedding network)

You just get the value from emb_model.

state =Current state
cont_state = emb_model.predict([state], batch_size=1)[0]

* Omitting numpy conversion

Episodic memory calculation

Fortunately, the pseudo code was written in the paper for the calculation here. It is expressed in python-like pseudo code as follows.

def calc_int_episode_reward(():
    state =Current state
    k =Numbers in the k-nearest neighbor method(k=10)
    epsilon = 0.001
    c = 0.001
    cluster_distance = 0.008
    maximum_similarity = 8

    #Get controllable state from embedded function
    cont_state = embedding_network.predict(state, batch_size=1)[0]

    #Find all elements in episode memory and Euclidean distance
    euclidean_list = []
    for mem_cont_state in int_episodic_memory:
        euclidean = np.linalg.norm(mem_cont_state - cont_state)
        euclidean_list.append(euclidean)

    #Top k
    euclidean_list.sort()
    euclidean_list = euclidean_list[:k]

    #Euclidean distance squared
    euclidean_list = np.asarray(euclidean_list) ** 2

    #Get the average
    ave = np.average(euclidean_list)

    #Approximate the number of visits(Calculation of Σ in formula)
    count = 0
    for euclidean in euclidean_list:
        d = euclidean / ave

        #There is no explanation in the paper, but it is written in pseudo code ...
        #Is a certain distance summarized in the shortest distance?
        d -= cluster_distance
        if d < euclidean_list[0]:
            d = euclidean_list[0]
        
        count += epsilon / (d+epsilon)
    s = np.sqrt(count) + c

    #There is no explanation in the paper here either, but is the value that is small to some extent rounded to 0?
    if s > maximum_similarity:
        episode_reward = 0
    else:
        episode_reward = 1/s

    #Added controllable state to episode memory, not in the pseudocode of the paper
    int_episodic_memory.append(cont_state)
    if len(int_episodic_memory) >= 30000:
        int_episodic_memory.pop(0)

    return episode_reward

Life long novelty module

RND model

The model of RND is as follows.

The output doesn't mean much. The code when there is an image processing layer is as follows.

def build_rnd_model():
    #Input layer
    c = input = Input(shape=(input_sequence,) + input_shape)

    #Image processing layer
    c = Conv2D(32, (8, 8), strides=(4, 4), padding="same")(c)
    c = Conv2D(64, (4, 4), strides=(2, 2), padding="same")(c)
    c = Conv2D(64, (3, 3), strides=(1, 1), padding="same")(c)

    # Dense
    c = Dense(128)(c)

    model = Model(input, c)

    #loss, optimizer from the paper
    model.compile(
        loss='mean_squared_error', 
        optimizer=Adam(lr=0.0005, epsilon=0.0001))
    
    return model

# train_model is target_It is an image approaching model
rnd_target_model = build_rnd_model()
rnd_train_model = build_rnd_model()

Learning RND

I think this is also rarely written in the paper ... It uses the same sample as learning the behavioral value function as well as the embedded function. (I am learning with the interpretation that I used it for learning = I visited (but it may be a bit suspicious ...))

def forward(observation):
    #Timing of learning every frame

    #Randomly acquire experience for batch from experience memory
    batchs = memory.sample(batch_size)

    #Format the data
    state0_batch = []   #Previous state
    state1_batch = []   #Next state
    for batch in batchs:
        state0_batch.append(batchs["Previous state"])
        state1_batch.append(batchs["Next state"])

    #Get the value of target
    rnd_target_val = rnd_target_model.predict(state1_batch, batch_size)
    #training
    rnd_train_model.train_on_batch(state1_batch, rnd_target_val)

* Omitting numpy conversion

Lifetime memory calculation

As for the calculation formula of the lifetime storage, first, the square error of rnd_target_model and rnd_train_model is calculated.

err(x_t) = || \hat{g}(x_t;\theta) - g(x_t)||^2

Based on that, the lifelong memory section is calculated using the following formula.

\alpha_t = 1 + \frac{ err(x_t) - \mu_e }{ \sigma_e }

Where $ \ sigma_e $ is the standard deviation of $ err (x_t) $ and $ \ mu_e $ is the mean of $ err (x_t) $.

The standard deviation and average have come out ... There is no description in the paper on how to calculate it, but in order to calculate it, you have to save the past results. For the time being, I created an array to store $ err (x_t) $. However, I am making an upper limit because it will be a problem if it is added infinitely.

Below is the code.


#---Variables for calculation
rnd_err_vals = []
rnd_err_capacity = 10_000

#---Calculation part
def calc_int_lifelong_reward():
    state =Current state

    #Get RND
    rnd_target_val = rnd_target_model.predict(state, batch_size=1)[0]
    rnd_train_val = rnd_train_model.predict(state, batch_size=1)[0]

    #Get squared error
    mse = np.square(rnd_target_val - rnd_train_val).mean()
    rnd_err_vals.append(mse)
    if len(rnd_err_vals) > rnd_err_capacity:
        rnd_err_vals.pop(0)

    #standard deviation
    sd = np.std(rnd_err_vals)
    if sd == 0:
        return 1

    #average
    ave = np.average(rnd_err_vals)

    # life long reward
    lifelong_reward = 1 + (mse - ave)/sd

    return lifelong_reward

Calculation of internal rewards

The internal reward is the following formula.

r ^ i_t = r ^ {episode} _t ・ min (max (\ alpha _ {t}, 1), L)

L = 5
episode_reward =Episodic memory
lifelong_reward =Lifelong memory department

if lifelong_reward < 1:
    lifelong_reward = 1
if lifelong_reward > L:
    lifelong_reward = L

intrinsic_reward = episode_reward * lifelong_reward

Search rate (β) and discount rate (γ)

The search policy is represented by a pair of search rate (β) and discount rate ($ \ gamma $) and is calculated by the following formula. The calculation formula for the discount rate is different between NGU and Agent57. (I'll leave both for the time being)

First is the calculation formula for the search rate.

\beta_i = \left\{
\begin{array}{ll}
0 \qquad (i = 0) \\
\beta \qquad (i = N-1) \\
\beta ・\sigma(10\frac{2i-(N-2)}{N-2}) \quad (otherwise)
\end{array}
\right.

$ \ Sigma $ is the sigmoid function and $ \ beta $ is the maximum reflection rate.

def sigmoid(x, a=1):
    return 1 / (1 + np.exp(-a * x))

def create_beta_list(N, max_beta=0.3):
    beta_list = []
    for i in range(N):
        if i == 0:
            b = 0
        elif i == N-1:
            b = max_beta
        else:
            b = 10 * (2*i-(N-2)) / (N-2)
            b = max_beta * sigmoid(b)
        beta_list.append(b)
    return beta_list

NGU discount rate.

\gamma_i = 1 - exp \Bigl( \frac{ (N-1-i)log(1-\gamma_max) + (ilog(1-\gamma_min)) }{N-1} \Bigr)

def create_gamma_list_ngu(N, gamma_min=0.99, gamma_max=0.997):
    if N == 1:
        return [gamma_min]
    if N == 2:
        return [gamma_min, gamma_max]
    gamma_list = []
    for i in range(N):
        g = (N - 1 - i)*np.log(1 - gamma_max) + i*np.log(1 - gamma_min)
        g /= N - 1
        g = 1 - np.exp(g)
        gamma_list.append(g)
    return gamma_list

Agent57 discount rate.

\gamma_j = \left\{
\begin{array}{ll}
\gamma0 \qquad (j=0) \\
\gamma1 + (\gamma0 - \gamma1)\sigma(10 \frac{2i-6}{6} ) \qquad (j \in (1,...,6)) \\
\gamma1 \qquad (j=7) \\
1-exp(\frac{(N-9)log(1-\gamma1) + (j-8)log(1-\gamma2)}{N-9}) \quad (otherwise)
\end{array}
\right.

def create_gamma_list_agent57(N, gamma0=0.9999, gamma1=0.997, gamma2=0.99):
    gamma_list = []
    for i in range(N):
        if i == 1:
            g = gamma0
        elif 2 <= i and i <= 6:
            g = 10*((2*i-6)/6)
            g = gamma1 + (gamma0 - gamma1)*sigmoid(g)
        elif i == 7:
            g = gamma1
        else:
            g = (N-9)*np.log(1-gamma1) + (i-8)*np.log(1-gamma2)
            g /= N-9
            g = 1-np.exp(g)
        gamma_list.append(g)
    return gamma_list

Internal Reward Model (UVFA)

Understanding is doubtful here ...

The figure below is a model of Agent57's behavioral value function.

At the bottom right is the description of [Action, External Reward, Internal Reward, Search Rate]. However, it doesn't say how to add this ...

For the time being, the search rate is added as an ohe-hot vector only to the action value function of the internal reward as shown below. (If you enter the internal reward or external reward as it is, you will not learn ...)

Below is the code. Only the action value function for internal reward is described. (The part with the comment (UVFA) is the part changed by UVFA)

from keras.utils import to_categorical

#Where to create a model for the Q function
def build_actval_int_model():

    #Input layer
    c1 = input1 = Input(shape=(input_sequence,) + input_shape)
    c2 = input1 = Input(shape=(policy_num,))  # (UVFA)add to

    #Image processing layer
    c1 = Conv2D(32, (8, 8), strides=(4, 4), padding="same")(c1)
    c1 = Conv2D(64, (4, 4), strides=(2, 2), padding="same")(c1)
    c1 = Conv2D(64, (3, 3), strides=(1, 1), padding="same")(c1)

    c = Concatenate()([c1, c2])  # (UVFA)add to

    # lstm
    c = LSTM(512)(c)

    # dueling network
    c =Create a model for Dueling Network(c)

    model = Model([input1, input2], c)  # (UVFA)Change
    model.compile(loss=clipped_error_loss, optimizer=Adam(lr=0.0001, epsilon=0.0001))

    return model

#Create a model
actval_int_model = build_actval_int_model()
actval_int_model_target = build_actval_int_model()

Internal reward model training


#Every frame
def forward(observation):
    #Get batch data
    batchs = memory.sample()

    #Format data
    state0_batch = []
    state1_batch = []
    policy_batch = []
    for batch in batchs:
        state0_batch.append(batchs["Previous state"])
        state1_batch.append(batchs["Next state"])
        #One discovery policy-Hot
        t = to_categorical(batchs["Exploration policy"], num_classes=policy_num)
        policy_batch.append(t)

    #Input is state and discovery policy
    state0_batch = [state0_batch, policy_batch]
    state1_batch = [state1_batch, policy_batch]
    
    #Example of giving a Q value
    state0_qvals = actval_int_model.predict(state0_batch, batch_size)

(State0 for update_Calculation of qvals)

    #Learning
    actval_int_model.train_on_batch(state0_batch, state0_qvals)

Calculation of Q value

The calculation of the Q value in NGU is omitted because it is lost due to the division of the action value function of Agent57. The formula for calculating the Q value in Agent57 is the following, which is a combination of internal reward and external reward.

Q(x,a,j;\theta) = h(h^{-1}(Q(x,a,j;\theta^e)) + \beta_j h^{-1}( Q(x,a,j;\theta^i)))


def rescaling(x, epsilon=0.001):
    if x == 0:
        return 0
    n = math.sqrt(abs(x)+1) - 1
    return np.sign(x)*n + epsilon*x

def rescaling_inverse(x, epsilon=0.001):
    if x == 0:
        return 0
    n = math.sqrt(1 + 4*epsilon*(abs(x)+1+epsilon)) - 1
    return np.sign(x)*( n/(2*epsilon) - 1)


def get_qvals():
    state =Current state
    policy_index =Exploration policy

    #Get an external Q value
    qvals_ext = avtval_ext_model.predict(state, batch_size=1)[0]

    #Search policy one-hot
    policy_onehot = to_categorical(policy_index, num_classes=policy_num)

    #Get the internal Q value
    qvals_int = avtval_int_model.predict([state, policy_onehot], batch_size=1)[0]

    #Search rate
    beta = int_beta_list[policy_index]

    #Calculate Q value
    qvals = []
    for i in range(nb_actions):
        q_ext = rescaling_inverse(qvals_ext[i])
        q_int = rescaling_inverse(qvals_int[i])
        qval = rescaling(q_ext + beta * q_int)
        qvals.append(qvals)

    return qvals

Retrace

As I said earlier, the implementation of Retrace is on hold (I can't understand the contents ...) I will describe the contents that I do not understand.

The Retrace function is:

c_s = \lambda min(1, \frac{\pi(A_s|X_s)}{\mu(A_s|X_s)})

$ \ pi $ is the target policy, and $ \ mu $ is the probability distribution for each action in the behavior policy.

However, in DQN, the search policy is complete greedy (select only the maximum Q value), so I feel that the probability distribution is 1 for the action with the maximum Q value and 0 for the others. In that case, the numerator takes only 0 or 1, so I feel that it will always be 0, but ... orz

The behavior policy will probably be ε-greedy, so I think it will be as follows (?)

\mu(\alpha|s) = \left\{
\begin{array}{ll}
\epsilon/N + 1 - \epsilon \quad (argmax_{\alpha \in A} Q(s, \alpha)) \\
\epsilon/N \quad (otherwise)
\end{array}
\right.

Where N is the number of actions.

Afterword

I haven't fully explained the details such as LSTM surroundings, queues, and parallel processing. .. .. The source code is open to the public, so please check the details there.

With this, I feel that I have reached one goal as an algorithm for reinforcement learning, but I think there is still room for improvement. What I want you to improve the most is the required specifications. This has been the case since R2D2, but the number of actors in the paper is 512. My PC has a limit of 2 and at most 4 ... (although it may have only one CPU) I'm looking forward to the next new method.

[PYTHON] [Reinforcement learning] Finally surpassed humans! ?? I tried to explain / implement Agent57 (Keras-RL)

Whole code

table of contents

Introduction

About Agent 57

Up to Agent57

Intrinsic Reward

Episodic novelty module

Embedded function (embedding network)

Approximation of the number of visits by the k-nearest neighbor method

Life long novelty module

MDP and POMDP

Discount rate

Change in behavior value function (Q function) architecture

Meta controller

Multi-armed bandit problem (MAB problem)

sliding-window UCB method

Implementation

Intrinsic Reward

Embedded function (embedding network)

`Embedded function for value acquisition`

`Embedded function for training`

Learning embedded functions (embedding network)

Get the value of an embedded function (embedding network)

Episodic memory calculation

Life long novelty module

RND model

Learning RND

Lifetime memory calculation

Calculation of internal rewards

Search rate (β) and discount rate (γ)

Internal Reward Model (UVFA)

Internal reward model training

Calculation of Q value

Afterword