[PYTHON] [Reinforcement learning] Easy high-speed implementation of Ape-X!

For the technical details of the internal implementation, I posted a separate article on Zenn (https://zenn.dev/ymd_h/articles/03edcaa47a3b1c). If you are interested, please read that as well.

1. Introduction

As I wrote in an earlier introductory article, I develop and publish cpprb, a Replay Buffer library for experience replay in reinforcement learning.

Although experience replay is widely used in reinforcement learning, many people spend time reinventing the wheel by copying and rewriting code published on the Internet. I do not think that is a good situation, so I keep developing and publishing this library. Moreover, replay buffers are surprisingly easy to get stuck on: implementations turn out to be inefficient, and people keep running into the same bugs. That is a headache for researchers and developers whose real interest is deep learning, so I think it is important to have a library that can be used right away.

(Of course, there are wonderful libraries such as RLlib that cover reinforcement learning as a whole, but it is quite hard for researchers of reinforcement learning algorithms to fit their own algorithms into them, so cpprb aims to be easy to use as a standalone component. DeepMind's Reverb, which was released recently, is a direct competitor, but it targets larger-scale setups, whereas cpprb currently assumes use on a single computer.)

Since it is off topic, I will not go into detail here, but I am developing cpprb with a focus on independence from any deep learning framework, a high degree of freedom, and efficiency. If you are interested, please try it and give feedback, for example by starring the repository or opening an issue.

2. Ape-X

Ape-X is a distributed learning method proposed as one way to train reinforcement learning agents in a short time. Roughly speaking, it separates environment exploration from network training and runs multiple explorers at the same time.

There is also a great article on Qiita with detailed explanations and implementations.

Even in these implementations, I believe the Replay Buffer and the interprocess data communication are written entirely from scratch. When you try to reuse them, for example when you use PyTorch rather than TensorFlow, I think it is hard to copy and rewrite all the related parts.

3. Install cpprb

The Ape-X support described here requires cpprb v9.4.2 or later.
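If you want to check the version requirement above from a script, the following is a minimal sketch. The helper names (`parse_version`, `supports_apex`, `installed_cpprb_version`) are hypothetical and not part of cpprb; the sketch only relies on the standard library's `importlib.metadata` (Python 3.8 or later) and compares the first three numeric components of the version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (9, 4, 2)  # minimum release stated in this article


def parse_version(text):
    """Turn '9.4.2' into (9, 4, 2), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '10.0'
        parts.append(0)
    return tuple(parts)


def supports_apex(installed, minimum=MIN_VERSION):
    """Return True if the given version string is at least the minimum."""
    return parse_version(installed) >= minimum


def installed_cpprb_version():
    """Return the installed cpprb version string, or None if it is not installed."""
    try:
        return version("cpprb")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_cpprb_version()
    if found is None:
        print("cpprb is not installed")
    else:
        print(found, "OK" if supports_apex(found) else "too old for Ape-X support")
```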

3.1 Linux/Windows

You can install the binaries directly from PyPI.

pip install cpprb

3.2 macOS

Unfortunately, cpprb cannot be compiled with clang, which is the default compiler, so you need to install gcc with Homebrew or MacPorts and build cpprb from source at installation time.

Replace /path/to/g++ with the path of the installed g++.

CC=/path/to/g++ CXX=/path/to/g++ pip install cpprb

Reference: Installation procedure on the official website

4. Ape-X sample implementation code using cpprb

Below is a sample implementation of Ape-X using cpprb. It is only the skeleton of Ape-X and does not include the deep learning model (network) itself. I think you can put it to real use by replacing the mock MyModel and adding things such as TensorBoard visualization or model saving as needed. (For those who research and develop reinforcement learning on a regular basis, implementing those parts should not be difficult.)


from multiprocessing import Process, Event, SimpleQueue
import time

import gym
import numpy as np
from tqdm import tqdm

from cpprb import ReplayBuffer, MPPrioritizedReplayBuffer

class MyModel:
    def __init__(self):
        self._weights = 0

    def get_action(self,obs):
        # Implement action selection
        return 0

    def abs_TD_error(self,sample):
        # Implement absolute TD error
        return np.zeros(sample["obs"].shape[0])

    @property
    def weights(self):
        return self._weights

    @weights.setter
    def weights(self,w):
        self._weights = w

    def train(self,sample):
        # Implement model update
        pass

def explorer(global_rb,env_dict,is_training_done,queue):
    local_buffer_size = int(1e+2)
    local_rb = ReplayBuffer(local_buffer_size,env_dict)

    model = MyModel()
    env = gym.make("CartPole-v1")

    obs = env.reset()
    while not is_training_done.is_set():
        if not queue.empty():
            w = queue.get()
            model.weights = w

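        action = model.get_action(obs)
        # Classic gym API (gym < 0.26): step() returns a 4-tuple
        next_obs, reward, done, _ = env.step(action)
        local_rb.add(obs=obs,act=action,rew=reward,next_obs=next_obs,done=done)

        if done:
            local_rb.on_episode_end()
            obs = env.reset()
        else:
            obs = next_obs

        if local_rb.get_stored_size() == local_buffer_size:
            # Flush the full local buffer into the global buffer with initial priorities
            local_sample = local_rb.get_all_transitions()
            local_rb.clear()

            absTD = model.abs_TD_error(local_sample)
            global_rb.add(**local_sample,priorities=absTD)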

def learner(global_rb,queues):
    batch_size = 64
    n_warmup = 100
    n_training_step = int(1e+4)
    explorer_update_freq = 100

    model = MyModel()

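    # Wait until the explorers have stored enough transitions
    while global_rb.get_stored_size() < n_warmup:
        time.sleep(1)

    for step in tqdm(range(n_training_step)):
        sample = global_rb.sample(batch_size)

        model.train(sample)
        absTD = model.abs_TD_error(sample)
        global_rb.update_priorities(sample["indexes"],absTD)

        if step % explorer_update_freq == 0:
            # Send the latest weights to every explorer
            w = model.weights
            for q in queues:
                q.put(w)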

if __name__ == "__main__":
    buffer_size = int(1e+6)
    env_dict = {"obs": {"shape": 4},
                "act": {},
                "rew": {},
                "next_obs": {"shape": 4},
                "done": {}}
    n_explorer = 4

    global_rb = MPPrioritizedReplayBuffer(buffer_size,env_dict)

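    is_training_done = Event()
    is_training_done.clear()

    qs = [SimpleQueue() for _ in range(n_explorer)]
    ps = [Process(target=explorer,
                  args=[global_rb,env_dict,is_training_done,q])
          for q in qs]

    for p in ps:
        p.start()

    learner(global_rb,qs)
    is_training_done.set()

    for p in ps:
        p.join()

    print(global_rb.get_stored_size())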

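As one way to fill in the abs_TD_error placeholder of MyModel, here is a minimal sketch that assumes a one-step Q-learning target. The linear Q-function (`q_values`), the discount factor `GAMMA`, and the weight layout are assumptions made purely for illustration; they are not part of cpprb or of the sample above, and a real model would use its own network instead.

```python
import numpy as np

GAMMA = 0.99  # assumed discount factor


def q_values(weights, obs):
    """Hypothetical linear Q-function: one row of weights per action."""
    return obs @ weights.T  # shape: (batch, n_actions)


def abs_td_error(weights, sample, gamma=GAMMA):
    """Return |r + gamma * (1 - done) * max_a' Q(s', a') - Q(s, a)| per transition."""
    obs = np.asarray(sample["obs"], dtype=np.float64)
    next_obs = np.asarray(sample["next_obs"], dtype=np.float64)
    # Scalar entries may arrive with shape (batch, 1), so flatten them first
    act = np.asarray(sample["act"]).reshape(-1).astype(np.int64)
    rew = np.asarray(sample["rew"], dtype=np.float64).reshape(-1)
    done = np.asarray(sample["done"], dtype=np.float64).reshape(-1)

    # Q-value of the action that was actually taken
    q_sa = q_values(weights, obs)[np.arange(obs.shape[0]), act]
    # One-step target; the bootstrap term is dropped on terminal transitions
    target = rew + gamma * (1.0 - done) * q_values(weights, next_obs).max(axis=1)
    return np.abs(target - q_sa)
```

The returned array has one entry per transition, which is the shape that the priorities passed to the global buffer in the sample need to have.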

As you can see, MPPrioritizedReplayBuffer (a Prioritized Replay Buffer with multi-process support), which is used here as the global buffer, can be accessed from multiple processes without any special care. Because its internal data is stored in shared memory, sharing data between processes is faster than with proxies (multiprocessing.managers.SyncManager etc.) or queues (multiprocessing.Queue etc.).
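To illustrate the general idea of shared memory, here is a minimal sketch that uses the standard library's `multiprocessing.shared_memory` (Python 3.8 or later). It is not necessarily how cpprb implements its buffer, and the helper names are hypothetical. To keep it short, both handles are opened in the same process; a separate process would attach by name in the same way, and the data is read in place rather than being pickled and sent through a queue or proxy.

```python
import struct
from multiprocessing import shared_memory

N_VALUES = 4
FMT = f"{N_VALUES}d"  # four C doubles


def write_values(name, values):
    """Attach to an existing block by name and write the values into it."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        struct.pack_into(FMT, shm.buf, 0, *values)
    finally:
        shm.close()


def read_values(name):
    """Attach to an existing block by name and read the values back."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return list(struct.unpack_from(FMT, shm.buf, 0))
    finally:
        shm.close()


def demo():
    """Create a block, write through one handle, and read through another."""
    owner = shared_memory.SharedMemory(create=True, size=struct.calcsize(FMT))
    try:
        write_values(owner.name, [1.0, 2.0, 3.0, 4.0])
        return read_values(owner.name)
    finally:
        owner.close()
        owner.unlink()  # release the block once nobody needs it


if __name__ == "__main__":
    print(demo())
```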

Also, the locks that prevent data inconsistencies are taken inside each method, so as long as you keep to the basic configuration of multiple explorers plus a single learner, you do not need to lock manually. Moreover, cpprb does not lock the whole buffer; it locks only the minimal critical sections needed to maintain data integrity, which is much more efficient than naively locking the entire global buffer. (The difference is especially large when the deep learning network is small or when the environment, such as a simulator, is computationally light. Although this is not a rigorous benchmark, in my own development environment the explorers were 3-4 times faster and the learner 1.2-2 times faster than with a simple lock over the entire global buffer.)
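The difference between locking the whole buffer and locking only a small critical section can be sketched as follows. This is a simplified, hypothetical illustration using threads and a plain list, not cpprb's actual internals: only the reservation of the write position is done under the lock, and the copy itself runs outside it. It assumes that concurrent writers never wrap around onto slots that are still being written, which a real implementation has to guarantee by other means.

```python
import threading


class SlotReservingBuffer:
    """Ring buffer that locks only while reserving write positions."""

    def __init__(self, capacity):
        self._data = [None] * capacity
        self._capacity = capacity
        self._next = 0
        self._lock = threading.Lock()

    def _reserve(self, n):
        # Critical section: only the index update is protected
        with self._lock:
            start = self._next
            self._next = (start + n) % self._capacity
        return start

    def add(self, items):
        start = self._reserve(len(items))
        # The copy runs outside the lock, so writers do not block each other here
        for offset, item in enumerate(items):
            self._data[(start + offset) % self._capacity] = item

    def snapshot(self):
        return list(self._data)


def fill_concurrently(n_writers=4, per_writer=100, chunk=10):
    """Let several writer threads add disjoint ranges of integers."""
    buf = SlotReservingBuffer(n_writers * per_writer)

    def writer(base):
        for i in range(0, per_writer, chunk):
            buf.add(list(range(base + i, base + i + chunk)))

    threads = [threading.Thread(target=writer, args=(w * per_writer,))
               for w in range(n_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buf.snapshot()
```

Because every writer receives its own range of slots, no item is overwritten even though the copies are not serialized.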

5. Bonus

A friend of mine develops tf2rl, a reinforcement learning library for TensorFlow 2.x that uses cpprb. If you are interested, please take a look at that as well.

Introductory article by the author himself → tf2rl: TensorFlow2 Reinforcement Learning
