[Python] [Reinforcement learning] R2D2 implementation/explanation revenge: hyperparameter guide (Keras-RL)

This article explains the hyperparameters, with a short summary of each parameter.

For the algorithms themselves, see [Reinforcement learning] R2D2 implementation / explanation revenge commentary (Keras-RL).

Whole code

The code covered in this article is on GitHub.


Common parameters

These parameters are common to Rainbow (DQN) and R2D2.

Environment-dependent parameters

| Parameter | Description | Type | Example | Remarks |
|---|---|---|---|---|
| input_shape | Input shape | tuple | (84,84) | env.observation_space.shape |
| input_type | Input format | InputType | InputType.GRAY_2ch | Original implementation |
| image_model | Model for the image layers | ImageModel (original implementation) | DQNImageModel() | |
| nb_actions | Number of actions (number of outputs) | int | 4 | env.action_space.n |
| processor | Class providing custom gym functionality | Processor (Keras-RL) | None | |
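
As a minimal sketch, the env-dependent values can be taken straight from a gym environment. The InputType import path and the dict name below are assumptions for illustration only; the actual module layout may differ in the repository.

import gym
from src.rainbow import InputType  # assumed import path

env = gym.make("CartPole-v0")

env_kwargs = {
    "input_shape": env.observation_space.shape,  # e.g. (4,) for CartPole
    "input_type": InputType.VALUES,              # vector observations, no image
    "image_model": None,                         # no image layers needed
    "nb_actions": env.action_space.n,            # number of discrete actions
    "processor": None,                           # no custom gym processing
}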

InputType


class InputType(enum.Enum):
    VALUES = 1    #No image
    GRAY_2ch = 3  # (width, height)
    GRAY_3ch = 4  # (width, height, 1)
    COLOR = 5     # (width, height, ch)

NN (Neural Network) model related parameters

| Parameter | Description | Type | Example | Remarks |
|---|---|---|---|---|
| batch_size | Batch size | int | 32 | |
| optimizer | Optimization algorithm | Optimizer (Keras) | Adam(lr=0.0001) | Keras implementation |
| metrics | Evaluation functions | array | [] | Keras implementation |
| input_sequence | Number of input frames | int | 4 | |
| dense_units_num | Number of units in the Dense layer | int | 512 | |
| enable_dueling_network | Whether to use the Dueling Network | bool | True | |
| dueling_network_type | Algorithm used in the Dueling Network | DuelingNetwork | DuelingNetwork.AVERAGE | |
| lstm_type | Type of LSTM to use | LstmType (original implementation) | LstmType.NONE | |
| lstm_units_num | Number of units in the LSTM layer | int | 512 | |
| lstm_ful_input_length | Number of sequential inputs per training step | int | 4 | Used only with STATEFUL |

DuelingNetwork


class DuelingNetwork(enum.Enum):
    AVERAGE = 0
    MAX = 1
    NAIVE = 2

LstmType


class LstmType(enum.Enum):
    NONE = 0
    STATELESS = 1
    STATEFUL = 2

Experience Replay Memory related

| Parameter | Description | Type | Example | Remarks |
|---|---|---|---|---|
| memory / remote_memory | Memory to use | Memory (original implementation) | ReplayMemory(10000) | See below |

This specifies the type of memory used to store experiences. DQN first stores experienced data in memory, then randomly samples experiences from it for learning. Several memory types are available, differing in how experiences are retrieved, so I explain each below.

ReplayMemory This is the simple memory used by DQN (see the previous article). Experience data is retrieved uniformly at random.

ReplayMemory(
    capacity=10_000
)
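
For illustration, a uniform replay memory is essentially a fixed-capacity ring buffer with random sampling. The class below is a simplified sketch, not the repository's implementation.

import random

class SimpleReplayMemory:
    """Minimal uniform experience replay (illustrative sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.index = 0

    def add(self, experience):
        # Overwrite the oldest entry once the buffer is full (ring buffer).
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            self.buffer[self.index] = experience
        self.index = (self.index + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform random sampling, as in plain DQN.
        return random.sample(self.buffer, batch_size)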

PERGreedyMemory A straightforward implementation of [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Instead of sampling at random, it always extracts the experience with the largest TD error (the one with the highest impact on learning). However, since there is no randomness at all, it seems to fall into a local solution right away and does not learn well... (so why did I implement it?)

PERGreedyMemory(
    capacity=10_000
)

PERProportionalMemory The Proportional Prioritization memory from [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Instead of uniform random sampling, experiences are drawn according to a probability distribution over TD errors (experiences with larger TD error are more likely to be selected).

It feels considerably more efficient than ReplayMemory (uniform random selection).

PERProportionalMemory(
    capacity=100_000,
    alpha=0.9,
    beta_initial=0.0,      # example values; see the parameter table below
    beta_steps=100_000,
    enable_is=True,
)

The parameters will be described later.
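
The core idea of proportional prioritization can be sketched as sampling indices with probability proportional to priority^alpha. This is a simplified illustration; the actual implementation typically uses a sum tree for efficiency.

import numpy as np

def proportional_sample(priorities, batch_size, alpha=0.9):
    """Sample indices with probability proportional to priority**alpha."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    indices = np.random.choice(len(priorities), size=batch_size, p=probs)
    return indices, probs[indices]

# Example: experiences with larger TD error are drawn more often.
idx, probs = proportional_sample([0.1, 2.0, 0.5, 1.0], batch_size=2)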

PERRankBaseMemory The rank-based memory from [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Experiences are drawn with probability proportional to their rank by TD error rather than uniformly at random. For example, with three experiences, the 1st-ranked one is selected 50% of the time, the 2nd 33%, and the 3rd 17%.

It feels considerably more efficient than ReplayMemory (uniform random selection), though I don't really see the difference from Proportional. This one should be a little faster in terms of speed...

PERRankBaseMemory(
    capacity=100_000,
    alpha=0.9,
    beta_initial=0.0,      # example values; see the parameter table below
    beta_steps=100_000,
    enable_is=True,
)

The parameters will be described later.
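
The rank-based weighting in the example above (50% / 33% / 17% for three experiences) corresponds to weights proportional to the rank position. A simplified sketch:

import numpy as np

def rankbase_probs(td_errors, alpha=1.0):
    """Selection probabilities proportional to rank (largest TD error = rank 1)."""
    n = len(td_errors)
    order = np.argsort(-np.asarray(td_errors))     # indices sorted by descending TD error
    weights = np.empty(n)
    weights[order] = np.arange(n, 0, -1) ** alpha  # weights n, n-1, ..., 1
    return weights / weights.sum()

# Three experiences -> roughly 50% / 33% / 17% as in the example above.
print(rankbase_probs([0.9, 0.5, 0.1]))  # [0.5, 0.333..., 0.166...]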

PERProportionalMemory and PERRankBaseMemory parameters

| Parameter | Description | Type | Example | Remarks |
|---|---|---|---|---|
| capacity | Maximum number of experiences stored in memory | int | 1_000_000 | |
| alpha | How strongly priorities affect the sampling probability | float | 0.9 | 0.0~1.0 |
| beta_initial | Initial value of the IS weight rate | float | 0.0 | 0.0~1.0 |
| beta_steps | Number of steps until the IS weight rate reaches 1.0 | int | 100_000 | Depends on the number of training steps |
| enable_is | Whether to enable IS | bool | True | |

These memories use [Importance Sampling (IS)](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#%E9%87%8D%E8%A6%81%E5%BA%A6%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AA%E3%83%B3%E3%82%B0-is-importance-sampling). When experiences are sampled according to a probability distribution, the number of times each experience is selected becomes biased, and that bias carries over into learning; importance sampling corrects for it.

Specifically, experiences selected with high probability get a lower weight in the Q-value update, while experiences selected with low probability get a higher weight.

Introducing IS is said to stabilize learning. In addition, IS is annealed (its effect is increased gradually).
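
For reference, the standard IS weight from the PER paper is w_i = (1 / (N * P(i)))^beta, normalized by the maximum weight, with beta annealed from beta_initial toward 1.0 over beta_steps. A sketch under those assumptions:

import numpy as np

def is_weights(probs, memory_size, step, beta_initial=0.4, beta_steps=100_000):
    """Importance-sampling weights with linear beta annealing (illustrative)."""
    beta = min(1.0, beta_initial + (1.0 - beta_initial) * step / beta_steps)
    w = (memory_size * np.asarray(probs, dtype=np.float64)) ** (-beta)
    return w / w.max()  # normalize so the largest weight is 1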

Learning-related parameters

| Parameter | Description | Type | Example | Remarks |
|---|---|---|---|---|
| memory_warmup_size / remote_memory_warmup_size | Number of experiences to accumulate in memory before learning starts | int | 1000 | |
| target_model_update | Update interval of the target model | int | 10000 | |
| gamma | Q-learning discount rate | float | 0.99 | 0.0~1.0 |
| enable_double_dqn | Whether to use Double DQN | bool | True | |
| enable_rescaling | Whether to use the rescaling function | bool | True | |
| rescaling_epsilon | Constant used in the rescaling function | float | 0.001 | |
| priority_exponent | Ratio used when computing experience priorities | float | 0.9 | Used only with STATEFUL |
| burnin_length | Burn-in period | int | 2 | Used only with STATEFUL |
| reward_multisteps | Number of steps for the multi-step reward | int | 3 | |

Action-related parameters

| Parameter | Description | Type | Example | Remarks |
|---|---|---|---|---|
| action_interval | Interval at which actions are executed | int | 1 | 1 or more |
| action_policy | Policy used when selecting actions | Policy (original implementation) | | See below |

ε-greedy A random number in [0.0, 1.0) is drawn; if it is below ε, a random action is taken, otherwise the action with the maximum Q value is selected.

EpsilonGreedy(
    epsilon=0.1  # example value
)
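
The selection rule itself is only a few lines; an illustrative sketch:

import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    # With probability epsilon act randomly, otherwise act greedily.
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))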

ε-greedy(Annealing) The method used in [DQN](https://qiita.com/pocokhc/items/125479c9ae0df1de4234#%E3%82%A2%E3%82%AF%E3%82%B7%E3%83%A7%E3%83%B3%E3%81%AE%E6%B1%BA%E5%AE%9A). ε is lowered as learning progresses, so the policy increasingly follows the Q values.

AnnealingEpsilonGreedy(
    initial_epsilon=1,
    final_epsilon=0.1,
    exploration_steps=1_000_000
)
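
A sketch of the ε schedule, assuming linear annealing over exploration_steps as in the DQN paper:

def annealed_epsilon(step, initial_epsilon=1.0, final_epsilon=0.1,
                     exploration_steps=1_000_000):
    # Linearly decrease epsilon from initial to final over exploration_steps.
    ratio = min(1.0, step / exploration_steps)
    return initial_epsilon + (final_epsilon - initial_epsilon) * ratio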

ε-greedy(Actor) The method used in Ape-X. Each actor's ε is computed from its index and the total number of actors.

EpsilonGreedyActor(
    actor_index,
    actors_length,
    epsilon=0.4,
    alpha=7
)
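
In the Ape-X paper each actor i out of N uses ε_i = ε^(1 + α·i/(N-1)), so the actors range from nearly greedy to strongly exploratory. A sketch of that formula:

def actor_epsilon(actor_index, actors_length, epsilon=0.4, alpha=7):
    """Per-actor epsilon from the Ape-X paper: eps ** (1 + alpha * i / (N - 1))."""
    if actors_length <= 1:
        return epsilon
    exponent = 1 + alpha * actor_index / (actors_length - 1)
    return epsilon ** exponent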

Softmax A method that selects the action according to the probability distribution given by the softmax of the Q values. In short, actions with higher Q values are more likely to be selected, and actions with lower Q values are less likely.

SoftmaxPolicy()

There are no arguments.
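
For reference, the selection probability is the softmax over the Q values; a sketch:

import numpy as np

def softmax_action(q_values):
    # Higher Q values get exponentially higher selection probability.
    z = np.asarray(q_values, dtype=np.float64)
    z = z - z.max()                      # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(np.random.choice(len(q_values), p=probs))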

UCB1 (Upper Confidence Bound 1) UCB1 selects an action by considering not only the Q value but also how many times each action has been selected. The idea is that rarely selected actions have been explored less and may still hold unknown rewards, so they are worth trying.

UCB1()

There are no arguments. Also, the learning cost increases because an NN model is held and trained internally.
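
The classic UCB1 score adds an exploration bonus based on visit counts: mean reward + sqrt(2·ln(t)/n). Below is a sketch of that selection rule as a count-based simplification, not the repository's NN-based variant.

import math

def ucb1_action(mean_rewards, counts, total_count):
    """Pick the action maximizing UCB1 = mean + sqrt(2*ln(t)/n); untried actions first."""
    scores = []
    for mean, n in zip(mean_rewards, counts):
        if n == 0:
            return len(scores)  # try every action at least once
        scores.append(mean + math.sqrt(2 * math.log(total_count) / n))
    return scores.index(max(scores))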

UCB1-Tuned An improved version of UCB1 that also takes the variance into account. It gives better results than UCB1, although there is no theoretical guarantee.

UCB1_Tuned()

There are no arguments. Also, the learning cost increases because an NN model is held and trained internally.

UCB-V An algorithm that is even more variance-aware than UCB1-Tuned.

UCBv()

There are no arguments. Also, the learning cost increases because an NN model is held and trained internally.

KL-UCB An algorithm that achieves the theoretical optimum for the exploration-exploitation dilemma. However, my implementation may be a little off...

KL_UCB()

There are no arguments. Also, the learning cost increases because an NN model is held and trained internally.

Thompson Sampling (Beta distribution)

Thompson Sampling is an algorithm based on Bayesian inference. It is also theoretically optimal for the exploration-exploitation dilemma.

The beta distribution applies when the outcome is binary (0 or 1). In this implementation, a reward greater than 0 is treated as 1, and a reward of 0 or less is treated as 0.

ThompsonSamplingBeta()

There are no arguments. Also, the learning cost increases because an NN model is held and trained internally.
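
Conceptually, Thompson Sampling with a Beta prior keeps success/failure counts per action, samples from each Beta posterior, and picks the argmax. The class below is a simplified sketch of that idea (not the repository's implementation), with reward > 0 counted as a success as described above.

import numpy as np

class ThompsonBetaSketch:
    """Illustrative Thompson Sampling with Beta posteriors."""
    def __init__(self, nb_actions):
        self.success = np.ones(nb_actions)  # Beta(1, 1) uniform prior
        self.failure = np.ones(nb_actions)

    def select_action(self):
        # Draw one sample per action from its Beta posterior, pick the best.
        samples = np.random.beta(self.success, self.failure)
        return int(np.argmax(samples))

    def update(self, action, reward):
        # Rewards greater than 0 are treated as 1, otherwise as 0.
        if reward > 0:
            self.success[action] += 1
        else:
            self.failure[action] += 1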

Thompson Sampling (normal distribution)

Thompson Sampling is an algorithm based on Bayesian inference. It is also theoretically optimal for the exploration-exploitation dilemma.

The algorithm is applied assuming that the reward follows a normal distribution.

ThompsonSamplingGaussian()

There are no arguments. Also, the learning cost increases because an NN model is held and trained internally.

Parameters for Rainbow (DQN) only

| Parameter | Description | Type | Example | Remarks |
|---|---|---|---|---|
| train_interval | Learning interval | int | 1 | 1 or more |

By increasing train_interval, you can increase the learning interval.

Parameters for R2D2 only

| Parameter | Description | Type | Example | Remarks |
|---|---|---|---|---|
| actors | Actor classes to use | Actor (original implementation) | | See below |
| actor_model_sync_interval | Interval for synchronizing the NN model from the Learner | int | 500 | |

Actor A class of the original implementation that represents an actor. You inherit from it and define the policy each actor uses and the fit procedure it runs against its env.

This is a definition example.

import gym

from src.r2d2 import Actor
from src.policy import EpsilonGreedy

ENV_NAME = "xxx"

class MyActor(Actor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.1)

    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        agent.fit(env, visualize=False, verbose=0)
        env.close()

getPolicy specifies the action policy used by that actor. In fit, call fit on the agent passed as an argument to run the training loop.

Be careful when passing actors to R2D2: pass the class itself, not an instance.

from src.r2d2 import R2D2
kwargs = {
    "actors": [MyActor]  #Pass the class itself
(abridgement)
}
manager = R2D2(**kwargs)

If you want to increase the number of Actors, increase the number of elements in the array.

Example of 4 Actors


from src.r2d2 import R2D2
kwargs = {
    "actors": [MyActor, MyActor, MyActor, MyActor]
(abridgement)
}
manager = R2D2(**kwargs)

Other

MovieLogger(Rainbow/R2D2)

Callback that outputs a video. It can be used with both Rainbow and R2D2.

from src.callbacks import MovieLogger

#Add it to the callbacks argument of test.
movie = MovieLogger()
agent.test(env, nb_episodes=1, visualize=False, callbacks=[movie])

#Save the recording.
movie.save(
    start_frame=0,
    end_frame=0,
    gifname="pendulum.gif",
    mp4name="",
    interval=200,
    fps=30
)

Output example: pendulum.gif

Intermediate layer visualization of NN (Rainbow / R2D2)

A callback that visualizes the Conv layers, the Advantage layer, and the Value layer, introduced in a previous article. It can be used with both Rainbow and R2D2.

from src.callbacks import ConvLayerView

#Specify the agent in the initialization.
conv = ConvLayerView(agent)

#Perform a test.
#Specify ConvLayerView object in callbacks argument
agent.test(env, nb_episodes=1, visualize=False, callbacks=[conv])

#Save the result.
conv.save(
    grad_cam_layers=["conv_1", "conv_2", "conv_3"],
    add_adv_layer=True,
    add_val_layer=True,
    start_frame=0,
    end_frame=200,
    gifname="tmp/pendulum.gif",
    interval=200,
    fps=10,
)

Also, ConvLayerView works only when the input is an image (InputType is GRAY_2ch, GRAY_3ch, COLOR).

Output example: pendulum1_min.gif

Logger2Stage(Rainbow) A logging callback that provides two functions: logging at an interval that switches from interval1 to interval2 after change_count outputs, and periodic evaluation with a separate test agent and env.

from src.rainbow import Rainbow
from src.callbacks import Logger2Stage

#Create a separate agent and env for testing
kwargs = (abridgement)
test_agent = Rainbow(**kwargs)
test_env = gym.make(ENV_NAME)

#various settings
log = Logger2Stage(
    logger_type=LoggerType.STEP,
    warmup=1000,
    interval1=200,
    interval2=20_000,
    change_count=5,
    savefile="tmp/log.json",
    test_agent=test_agent,
    test_env=test_env,
    test_episodes=10
)

#Add to callbacks when learning
#Logger2Stage outputs the log, so verbose=0
agent.fit(env, nb_steps=1_750_000, visualize=False, verbose=0, callbacks=[log])

#You can get the logs with the getLogs function(You must specify savefile)
history = log.getLogs()

#It's simple, but you can also output a graph(You must specify savefile)
log.drawGraph()

Output example:

--- start ---
'Ctrl + C' is stop.
Steps 0, Time: 0.00m, TestReward:  21.12 -  92.80 (ave:  51.73, med:  46.99), Reward:   0.00 -   0.00 (ave:   0.00, med:   0.00)
Steps 200, Time: 0.05m, TestReward:  22.06 -  99.94 (ave:  43.85, med:  31.24), Reward: 108.30 - 108.30 (ave: 108.30, med: 108.30)
Steps 1200, Time: 0.28m, TestReward:  40.99 -  73.88 (ave:  52.41, med:  47.69), Reward:  49.05 - 141.53 (ave:  87.85, med:  90.89)
(abridgement)
Steps 17200, Time: 3.95m, TestReward: 167.68 - 199.49 (ave: 184.34, med: 188.30), Reward: 166.29 - 199.66 (ave: 181.79, med: 177.36)
Steps 18200, Time: 4.19m, TestReward: 165.84 - 199.53 (ave: 186.16, med: 188.50), Reward: 188.00 - 199.50 (ave: 190.64, med: 188.41)
Steps 19200, Time: 4.43m, TestReward: 163.63 - 188.93 (ave: 186.15, med: 188.59), Reward: 165.56 - 188.45 (ave: 183.75, med: 188.23)
done, took 4.626 minutes
Steps 0, Time: 4.63m, TestReward: 188.37 - 199.66 (ave: 190.83, med: 188.68), Reward: 188.34 - 188.83 (ave: 188.63, med: 188.67)

Graph output example: rainbow_pendium.png

SaveManager(R2D2) R2D2 uses multiprocessing, so its implementation is rather unusual. Saving and loading the model is affected in particular, so a dedicated callback is provided for it.

from src.r2d2 import R2D2
from src.r2d2_callbacks import SaveManager

#Creating R2D2
kwargs = (abridgement)
manager = R2D2(**kwargs)

#Creating a SaveManager
save_manager = SaveManager(
    save_dirpath="tmp",
    is_load=False,
    save_overwrite=True,
    save_memory=True,
    checkpoint=True,
    checkpoint_interval=2000,
    verbose=0
)

#Start learning, add to callbacks argument.
manager.train(
    nb_trains=20_000,
    callbacks=[save_manager],
)

#To create an agent for testing, call createTestAgent
#and specify save_dirpath/last/learner.dat.
agent = manager.createTestAgent(MyActor, "tmp/last/learner.dat")

#Conduct a test.
agent.test(env, nb_episodes=5, visualize=True)

Logger2Stage(R2D2) It provides the same two functions as the Rainbow version. Unlike the Rainbow version, however, the logging interval can only be specified as a time interval.

from src.r2d2 import R2D2
from src.r2d2_callbacks import Logger2Stage

#Creating R2D2
kwargs = (abridgement)
manager = R2D2(**kwargs)

#Create env for testing
test_env = gym.make(ENV_NAME)

#Create Logger2Stage
log = Logger2Stage(
    warmup=0,
    interval1=10,
    interval2=60,
    change_count=20,
    savedir="tmp",
    test_actor=MyActor,
    test_env=test_env,
    test_episodes=10,
    verbose=1,
)

#Start learning, add to callbacks argument.
manager.train(
    nb_trains=20_000,
    callbacks=[log],
)

#You can get the logs with getLogs.(If savedir is specified)
history = log.getLogs()

#You can also easily display a graph.(If savedir is specified)
log.drawGraph()

Output example:

--- start ---
'Ctrl + C' is stop.
Learner Start!
Actor0 Start!
Actor1 Start!
actor1   Train 1, Time: 0.24m, Reward    :  27.80 -  27.80 (ave:  27.80, med:  27.80), nb_steps: 200
learner  Train 1, Time: 0.19m, TestReward:  29.79 -  76.71 (ave:  58.99, med:  57.61)
actor0   Train 575, Time: 0.35m, Reward    :  24.88 - 133.09 (ave:  62.14, med:  50.83), nb_steps: 3400
learner  Train 651, Time: 0.36m, TestReward:  24.98 -  51.67 (ave:  38.86, med:  38.11)
actor1   Train 651, Time: 0.41m, Reward    :  22.15 -  88.59 (ave:  41.14, med:  35.62), nb_steps: 3200
actor0   Train 1249, Time: 0.51m, Reward    :  22.97 -  61.41 (ave:  35.24, med:  31.99), nb_steps: 8000
(abridgement)
learner  Train 16476, Time: 4.53m, TestReward: 165.56 - 199.57 (ave: 180.52, med: 177.73)
actor1   Train 16880, Time: 4.67m, Reward    : 128.88 - 188.45 (ave: 169.13, med: 165.94), nb_steps: 117600
Learning End. Train Count:20001
learner  Train 20001, Time: 5.29m, TestReward: 175.72 - 188.17 (ave: 183.21, med: 187.48)
Actor0 End!
Actor1 End!
actor0   Train 20001, Time: 5.34m, Reward    : 151.92 - 199.61 (ave: 181.68, med: 187.48), nb_steps: 0
actor1   Train 20001, Time: 5.34m, Reward    : 130.39 - 199.26 (ave: 170.83, med: 167.99), nb_steps: 0
done, took 5.350 minutes

Graph output example: r2d2_pendium.png

Sample settings

DQN Paper (Atari)

from src.rainbow import Rainbow
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import ReplayMemory
from src.policy import AnnealingEpsilonGreedy

nb_steps = 1_750_000

#AtariProcessor does the following:
#- Resize the image to (84, 84)
#- Clip rewards
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)

kwargs={
    "input_shape": processor.image_shape, 
    "input_type": InputType.GRAY_2ch,
    "nb_actions": env.action_space.n, 
    "optimizer": Adam(lr=0.0001),
    "metrics": [],

    "image_model": DQNImageModel(),
    "input_sequence": 4,         #Number of input frames
    "dense_units_num": 256,       #Number of units in the dense layer
    "enable_dueling_network": False,
    "lstm_type": LstmType.NONE,           #LSTM algorithm to use

    # train/action related
    "memory_warmup_size": 50_000,    #Number of steps for initial memory allocation(Don't learn)
    "target_model_update": 10_000,  #target network update interval
    "action_interval": 4,       #Interval to perform action
    "train_interval": 4,        #Interval to learn
    "batch_size": 32,     # batch_size
    "gamma": 0.99,        #Q-learning discount rate
    "enable_double_dqn": False,
    "enable_rescaling": False,   #Whether to enable rescaling
    "reward_multisteps": 1,      # multistep reward

    #Other
    "processor": processor,
    "action_policy": AnnealingEpsilonGreedy(
        initial_epsilon=1.0,      #Initial ε
        final_epsilon=0.05,          #Final ε
        exploration_steps=1_000_000  #Number of steps from initial to final state
    ),
    "memory": ReplayMemory(capacity=1_000_000),

}
agent = Rainbow(**kwargs)

Keras-RL sample (Cartpole)

from src.rainbow import Rainbow
from src.memory import ReplayMemory
from src.policy import SoftmaxPolicy

env = gym.make('CartPole-v0')

kwargs={
    "input_shape": env.observation_space.shape, 
    "input_type": InputType.VALUES,
    "nb_actions": env.action_space.n, 
    "optimizer": Adam(lr=0.0001),
    "metrics": [],

    "image_model": None,
    "input_sequence": 1,         #Number of input frames
    "dense_units_num": 16,       #Number of units in the dense layer
    "enable_dueling_network": False,
    "lstm_type": LstmType.NONE,
    
    # train/action related
    "memory_warmup_size": 10,    #Number of steps for initial memory allocation(Don't learn)
    "target_model_update": 1,  #target network update interval
    "action_interval": 1,       #Interval to perform action
    "train_interval": 1,        #Interval to learn
    "batch_size": 32,     # batch_size
    "gamma": 0.99,        #Q-learning discount rate
    "enable_double_dqn": False,
    "enable_rescaling": False,
    
    #Other
    "processor": processor,
    "action_policy": SoftmaxPolicy(),
    "memory": ReplayMemory(capacity=50000)
}
agent = Rainbow(**kwargs)

Rainbow Paper (Atari)

from src.rainbow import Rainbow
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import PERProportionalMemory
from src.policy import AnnealingEpsilonGreedy

nb_steps = 1_750_000

#AtariProcessor does the following:
#- Resize the image to (84, 84)
#- Clip rewards
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)

kwargs={
    "input_shape": processor.image_shape, 
    "input_type": InputType.GRAY_2ch,
    "nb_actions": env.action_space.n, 
    "optimizer": Adam(lr=0.0000625, epsilon=0.00015),
    "metrics": [],

    "image_model": DQNImageModel(),
    "input_sequence": 4,         #Number of input frames
    "dense_units_num": 512,       #Number of units in the dense layer
    "enable_dueling_network": True,
    "dueling_network_type": DuelingNetwork.AVERAGE,  #Algorithm used in dueling network
    "lstm_type": LstmType.NONE,
    
    # train/action related
    "memory_warmup_size": 80000,    #Number of steps for initial memory allocation(Don't learn)
    "target_model_update": 32000,  #target network update interval
    "action_interval": 4,       #Interval to perform action
    "train_interval": 4,        #Interval to learn
    "batch_size": 32,     # batch_size
    "gamma": 0.99,        #Q-learning discount rate
    "enable_double_dqn": True,
    "enable_rescaling": False,
    "reward_multisteps": 3,    # multistep reward
    
    #Other
    "processor": processor,
    "action_policy": AnnealingEpsilonGreedy(
        initial_epsilon=1.0,      #Initial ε
        final_epsilon=0.05,          #Final ε
        exploration_steps=1_000_000  #Number of steps from initial to final state
    ),
    "memory": PERProportionalMemory(
        capacity=1_000_000,
        alpha=0.5,           #Probability reflection rate of PER
        beta_initial=0.4,    #Initial value of IS reflection rate
        beta_steps=1_000_000,  #Number of steps to increase IS reflection rate
        enable_is=True,     #Whether to enable IS
    )
}
agent = Rainbow(**kwargs)

R2D2 paper (Atari)

import gym

from src.r2d2 import R2D2, Actor
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import PERProportionalMemory
from src.policy import EpsilonGreedyActor

ENV_NAME = "xxxxx"

class MyActor(Actor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        agent.fit(env, visualize=False, verbose=0)
        env.close()


#AtariProcessor does the following:
#- Resize the image to (84, 84)
#- Clip rewards
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)

kwargs={
    "input_shape": processor.image_shape, 
    "input_type": InputType.GRAY_2ch,
    "nb_actions": env.action_space.n, 
    "optimizer": Adam(lr=0.0001, epsilon=0.001),
    "metrics": [],

    "image_model": DQNImageModel(),
    "input_sequence": 4,             #Number of input frames
    "dense_units_num": 512,           #Number of units in the Dense layer
    "enable_dueling_network": True,  # dueling_network valid flag
    "dueling_network_type": DuelingNetwork.AVERAGE,   # dueling_network algorithm
    "lstm_type": LstmType.STATEFUL,  #LSTM algorithm
    "lstm_units_num": 512,            #Number of LSTM layer units
    "lstm_ful_input_length": 40,      #Stateful LSTM inputs

    # train/action related
    "remote_memory_warmup_size": 50_000,  #Number of steps for initial memory allocation(Don't learn)
    "target_model_update": 10_000,  #target network update interval
    "action_interval": 4,    #Interval to perform action
    "batch_size": 64,
    "gamma": 0.997,           #Q-learning discount rate
    "enable_double_dqn": True,   #DDQN valid flag
    "enable_rescaling": enable_rescaling,    #Whether to enable rescaling(priotrity)
    "rescaling_epsilon": 0.001,  #rescaling constant
    "priority_exponent": 0.9,   #priority priority
    "burnin_length": 40,          # burn-in period
    "reward_multisteps": 3,  # multistep reward

    #Other
    "processor": processor,
    "actors": [MyActor for _ in range(256)],
    "remote_memory": PERProportionalMemory(
        capacity= 1_000_000,
        alpha=0.6,             #Probability reflection rate of PER
        beta_initial=0.4,      #Initial value of IS reflection rate
        beta_steps=1_000_000,  #Number of steps to increase IS reflection rate
        enable_is=True,        #Whether to enable IS
    ),

    #actor relationship
    "actor_model_sync_interval": 400,  #Interval to synchronize model from learner

}
manager = R2D2(**kwargs)

Afterword

There are just too many parameters... It seems that something called R2D3 has been announced, so I plan to implement it next.
