[PYTHON] I investigated the reinforcement learning algorithm of algorithmic trading

Inspired by [Nekopuni's blog](http://nekopuni.holy.jp/2014/09/python%E5%BC%B7%E5%8C%96%E5%AD%A6%E7%BF%92%EF%BC%8B%E7%82%BA%E6%9B%BF%E3%83%88%E3%83%AC%E3%83%BC%E3%83%89%E6%88%A6%E7%95%A5%E3%81%9D%E3%81%AE2/), I investigated algorithmic trading using reinforcement learning, partly as a hobby and partly for practical benefit. I am not completely confident in what I write here (especially the code). This is mainly a personal memo, but I hope it can also serve as a small survey for reference.

Reference material

Survey outline

The significance of using reinforcement learning (see Reference 4)

The difference between reinforcement learning and supervised learning is as follows.

In supervised learning, the model is fit to its targets without any unified trading policy. In reinforcement learning, on the other hand, the algorithm can be built around the environment and the trading policy itself, so reinforcement learning is considered the more effective framework here.

Difference between Value Function RL and Direct RL

■ Value Function RL: A value is assigned to each state or state-action pair. Based on this value, the policy (the rule for which action to take in a given state) is optimized so as to maximize the long-term expected reward. Q-learning falls into this category; for details on Q-learning, see Reference 5. A minimal sketch of the Q-learning update follows.
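As a rough illustration only (this article uses RRL below, not Q-learning), a tabular Q-learning update can be sketched as follows; the state/action sizes, learning rate, and discount factor are assumed toy values, not anything from the reference.

```python
# Minimal tabular Q-learning sketch (illustrative only; sizes and rates are assumed).
import numpy as np

n_states, n_actions = 10, 2      # hypothetical toy problem
alpha, gamma = 0.1, 0.95         # learning rate, discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```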

■ Direct RL: The reward function (its value) is adjusted directly from the experience (observations) obtained from the environment. Unlike Q-learning, no Q-table is required, so the time and space complexity is small. However, the maximization of expected reward is short-term. Recurrent Reinforcement Learning (RRL) falls into this category; for more information on RRL, see Reference 5. (The code in this article uses RRL.)

RRL Financial Trading Framework

Evaluation value of the agent

The Differential Sharpe Ratio (DSR) is used as the agent's evaluation value; it is what drives the weight updates.

Precautions for actual operation

Algorithm (see Reference 1)

[Position determination at time t (long or short)]

F_t = \mathrm{sign}\left(\sum_{i=0}^{M} w_{i,t}\, r_{t-i} + w_{M+1,t} F_{t-1} + v_t\right)

$F_t \in \{-1, 1\}$ (short = $-1$, long = $+1$)
$w_t$: weight vector
$v_t$: threshold of the neural network
$r_t = p_t - p_{t-1}$ (price return at time $t$)

The formula has the same form as a simple one-layer neural network. In practice the neural-network treatment is applied as-is, and the threshold $v_t$ is folded into the weight vector and optimized together with it. A small sketch of this signal computation is shown below.
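As a minimal sketch of the position decision under the formula above (the function and variable names here are mine, not from the reference; the weight layout mirrors the code later in this article):

```python
# Sketch of F_t: weights holds the M+1 lag weights, the recurrent weight w_{M+1,t},
# and the threshold v_t treated as a bias weight.
import numpy as np

def position_signal(weights, returns, prev_signal, bias=1.0):
    """weights: length M+3 vector; returns: the last M+1 price differences r_{t-M}..r_t."""
    activation = np.dot(weights[:-2], returns)   # sum_i w_{i,t} * r_{t-i}
    activation += weights[-2] * prev_signal      # w_{M+1,t} * F_{t-1}
    activation += weights[-1] * bias             # threshold v_t as a bias term
    return 1.0 if activation >= 0 else -1.0      # sign(.), ties broken as long
```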

[Profit at time point T]

P_T = \sum_{t=0}^{T} R_t, \qquad R_t := F_{t-1} r_t - \delta\, |F_t - F_{t-1}|

$\delta$: transaction cost
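A minimal sketch of this per-step return, with the transaction cost $\delta$ and the argument names being illustrative assumptions:

```python
# R_t = F_{t-1} * r_t - delta * |F_t - F_{t-1}|
def step_return(r_t, signal, prev_signal, delta=0.003):
    return prev_signal * r_t - delta * abs(signal - prev_signal)
```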

[Optimization evaluation value]

■ Sharpe ratio

\hat{S}(t) := \frac{A_t}{B_t}

A_t = A_{t-1} + \eta (R_t - A_{t-1}), \quad A_0 = 0
B_t = B_{t-1} + \eta (R_t^2 - B_{t-1}), \quad B_0 = 0

$\eta$: adaptation parameter

■ Differential Sharpe Ratio (DSR): A version of the moving Sharpe ratio adapted for online learning. The DSR is lighter to compute and (apparently) converges faster. The above $\hat{S}$ is Taylor-expanded around $\eta = 0$ and the first-order term is taken:

D_t := \left. \frac{d\hat{S}}{d\eta} \right|_{\eta=0}

= \frac{B_{t-1}\Delta A_t - \frac{1}{2} A_{t-1}\Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}

\Delta A_t = R_t - A_{t-1}, \quad \Delta B_t = R_t^2 - B_{t-1}

Consider $ D_t $ as an immediate performance measure and update the weight to maximize it.
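As a minimal sketch of these moment updates and the DSR (mirroring, but not identical to, the reference code further below; the value of eta is an assumed default):

```python
# One online update of the moving moments A_t, B_t and the DSR D_t.
def dsr_step(R_t, A_prev, B_prev, eta=0.01):
    delta_A = R_t - A_prev
    delta_B = R_t**2 - B_prev
    denom = (B_prev - A_prev**2) ** 1.5
    D_t = (B_prev*delta_A - 0.5*A_prev*delta_B) / denom if denom > 0 else 0.0
    A_t = A_prev + eta*delta_A
    B_t = B_prev + eta*delta_B
    return D_t, A_t, B_t
```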

[Update weight]

w_{i,t} = w_{i,t-1} + \rho\, \Delta w_{i,t}

\Delta w_{i,t} = \frac{dD_t}{dw_i} \approx \frac{dD_t}{dR_t}\left\{ \frac{dR_t}{dF_t}\frac{dF_t}{dw_{i,t}} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{dw_{i,t-1}} \right\}

\frac{dF_t}{dw_{i,t}}\approx\frac{\partial F_t}{\partial w_{i,t}}+\frac{\partial F_t}{\partial F_{t-1}}\frac{dF_{t-1}}{dw_{i,t-1}}

\frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}, \quad \frac{dR_t}{dF_t} = -\delta, \quad \frac{dR_t}{dF_{t-1}} = r_t - \delta
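A minimal sketch of one online weight update using these derivatives; the recursive $dF/dw$ term is carried over between calls, and names such as `x_t` and `prev_dF_dw` are my own assumptions rather than anything from the reference:

```python
# One RRL weight update step using the derivatives above.
import numpy as np

def rrl_weight_update(w, x_t, F_t, dD_dR, r_t, prev_dF_dw, delta=0.003, rho=0.01):
    """x_t = (r_t, ..., r_{t-M}, F_{t-1}, bias); returns (w_new, dF_dw)."""
    w, x_t = np.asarray(w, dtype=float), np.asarray(x_t, dtype=float)
    dtanh = 1.0 - F_t**2                           # derivative of the tanh output
    dF_dw = dtanh * (x_t + w[-2] * prev_dF_dw)     # dF_t/dw = dtanh*(x_t + w_{M+1}*dF_{t-1}/dw)
    dw = dD_dR * (-delta * dF_dw + (r_t - delta) * prev_dF_dw)
    return w + rho * dw, dF_dw
```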

[Loss cut, signal setting threshold, system abnormality judgment]

The reference states that these should be set as parameters, but gives no concrete values. It seems there is no choice but to set them empirically.

Reference code

The sign function is implemented by first computing tanh and then converting it to a signal according to whether $F_t$ is greater than 0. Features such as loss cuts are not implemented. Even I admit the code is rather dubious, so please treat it only as a reference.

```python
# coding: utf-8

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import tanh, copysign

class RRLAgentForFX:
    TRADING_COST = 0.003
    EPS = 1e-6
        
    def __init__(self,M,rho=0.01,eta=0.1,bias=1.0):
        np.random.seed(555)
        self.M = M # number of lags
        self.weights = np.zeros(self.M+3,dtype=np.float64)
        self.bias = bias # bias term
        self.rho = rho
        self.eta = eta
        self.price_diff = np.zeros(self.M+1) # r_t
        
        self.pre_price = None
        self.pre_signal = 0
        
        self.pre_A = 0.0
        self.pre_B = 0.0
        self.pre_gradient_F = 0.0
        
        # result store
        self.signal_store = []
        self.profit_store = []
        self.dsr_store = []
        self.sr_store = []
        self.cumulative_profit = 0.0
        
    def train_online(self, price):
        self.calculate_price_diff(price)
        signal, self.F_t_value = self.select_signal()
        print "signal",signal
        self.calculate_return(signal)
        self.update_parameters()
        self.pre_price = price
        self.pre_signal = signal
        
        # store result
        self.signal_store.append(signal)
                    
    def calculate_price_diff(self,price):
        r = price - self.pre_price if self.pre_price is not None else 0
        self.price_diff[:self.M] = self.price_diff[1:]
        self.price_diff[self.M] = r
        
    def calculate_return(self,signal):
        R_t = self.pre_signal*self.price_diff[-1]
        R_t -= self.TRADING_COST*abs(signal - self.pre_signal)
        self.return_t = R_t
        
        self.cumulative_profit += R_t
        self.profit_store.append(self.cumulative_profit)
            
    def select_signal(self):
        values_sum = (self.weights[:self.M+1]*self.price_diff).sum()
        values_sum += self.weights[-2]*self.pre_signal
        values_sum += self.bias*self.weights[-1]
        
        F_t_value = tanh(values_sum)
        return copysign(1, F_t_value ), F_t_value
                                            
    def update_parameters(self):
        # update weight
        self.weights += self.rho*self.calculate_gradient_weights()
        print "weight",self.weights

        # update moment R_t
        self.update_R_moment()

    def calculate_gradient_weights(self):
        """ differentiate between D_t and w_t """
        denominator = self.pre_B-self.pre_A**2
        if denominator > 0:  # avoid dividing by ~0 or taking a fractional power of a negative value
            diff_D_R = self.pre_B-self.pre_A*self.return_t
            diff_D_R /= (denominator)**1.5
        else:
            diff_D_R = 0
        
        gradient_F = self.calculate_gradient_F()
        print "gradient_F",gradient_F

        #diff_R_F = -self.TRADING_COST
        #diff_R_F_{t-1} = self.price_diff[-1] - self.TRADING_COST
        delta_weights = -self.TRADING_COST*gradient_F
        delta_weights += ( self.price_diff[-1] - self.TRADING_COST) \
                                                    *self.pre_gradient_F
        delta_weights *= diff_D_R
        self.pre_gradient_F = gradient_F
        return delta_weights
        
    def calculate_gradient_F(self):
        """ differentiate between F_t and w_t """
        diff_tanh = 1-self.F_t_value**2  # derivative of tanh at the current output

        diff_F_w = diff_tanh*( np.r_[ self.price_diff, self.pre_signal, self.bias ] )
        diff_F_F = diff_tanh*self.weights[-2]

        return diff_F_w + diff_F_F*self.pre_gradient_F

    def update_R_moment(self):
        delta_A = self.return_t - self.pre_A
        delta_B = self.return_t**2 - self.pre_B
        A_t = self.pre_A + self.eta*delta_A # A_t. first moment of R_t.
        B_t = self.pre_B + self.eta*delta_B # B_t. second moment of R_t.
        # guard against division by zero on the very first updates
        self.sr_store.append(A_t/B_t if B_t != 0 else 0.0)
        self.calculate_dsr(delta_A, delta_B)

        self.pre_A = A_t
        self.pre_B = B_t

    def calculate_dsr(self, delta_A, delta_B):
        denominator = (self.pre_B - self.pre_A**2)**1.5
        if denominator > 0:
            dsr = (self.pre_B*delta_A - 0.5*self.pre_A*delta_B)/denominator
        else:
            dsr = 0.0  # undefined while the moments are still zero
        self.dsr_store.append(dsr)

if __name__=='__main__':
    M = 8
    fx_agent = RRLAgentForFX(M,rho=0.01,eta=0.01,bias=0.25)
    
    ifname = os.getcwd()+'/input/quote.csv'
    data = pd.read_csv(ifname)
    train_data = data.loc[:3000, 'USD']  # .ix is deprecated; .loc gives the same label-based slice
    
    for price in train_data.values:
        fx_agent.train_online(price)
```

Experiment

I downloaded and used the CSV file of daily foreign exchange rates (against the yen) from the Mizuho Bank historical data page. Training used 3,000 observations of USD/JPY starting from April 1, 2002.

Experimental result

USD / JPY rate

USD.png

Cumulative profit (when trading only one unit per day)

profit.png

DSR

DSR_.png

SR

SR.png

Comment

The results depend heavily on the values of ρ and η, and they are far too unstable. I will update the code as soon as I notice mistakes. If you spot anything strange, I would greatly appreciate a comment.
