[PYTHON] I investigated the reinforcement learning algorithm of algorithmic trading

Inspired by [Nekopuni's blog](http://nekopuni.holy.jp/2014/09/python%E5%BC%B7%E5%8C%96%E5%AD%A6%E7%BF%92%EF%BC%8B%E7%82%BA%E6%9B%BF%E3%83%88%E3%83%AC%E3%83%BC%E3%83%89%E6%88%A6%E7%95%A5%E3%81%9D%E3%81%AE2/), I investigated algorithmic trading using reinforcement learning, partly as a hobby and partly for practical benefit. I am not completely confident in what I write here (especially the code). This is mainly a personal memo, but I hope it can also serve as a small survey for reference.

Reference material

Survey outline

The significance of using reinforcement learning (see Reference 4)

The difference between reinforcement learning and supervised learning is as follows.

In supervised learning, the model is fit to its targets without any unified trading policy. In reinforcement learning, on the other hand, the algorithm can be built around the environment and the trading policy itself, so reinforcement learning is considered the more effective framework here.

Difference between Value Function RL and Direct RL

■ Value Function RL: A value is assigned to each state or state-action pair. Based on this value, the policy (the rule for which action to take in a given state) is optimized so as to maximize the long-term expected reward. Q-learning falls into this category; for details on Q-learning, see Reference 5. A minimal sketch of the Q-learning update follows.
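As a rough illustration only (this article uses RRL below, not Q-learning), a tabular Q-learning update can be sketched as follows; the state/action sizes, learning rate, and discount factor are assumed toy values, not anything from the reference.

```python
# Minimal tabular Q-learning sketch (illustrative only; sizes and rates are assumed).
import numpy as np

n_states, n_actions = 10, 2      # hypothetical toy problem
alpha, gamma = 0.1, 0.95         # learning rate, discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```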

■ Direct RL: The reward function (its value) is adjusted directly from the experience (observations) obtained from the environment. Unlike Q-learning, no Q-table is required, so the time and space complexity is small. However, the maximization of expected reward is short-term. Recurrent Reinforcement Learning (RRL) falls into this category; for more information on RRL, see Reference 5. (The code in this article uses RRL.)

RRL Financial Trading Framework

Evaluation value of the agent

The Differential Sharpe Ratio (DSR) is used as the agent's evaluation value; it is what drives the weight updates.

Precautions for actual operation

Algorithm (see Reference 1)

[Position determination at time t (long or short)]

F_t = \mathrm{sign}\left(\sum_{i=0}^{M} w_{i,t}\, r_{t-i} + w_{M+1,t} F_{t-1} + v_t\right)

$F_t \in \{-1, 1\}$ (short = $-1$, long = $+1$)
$w_t$: weight vector
$v_t$: threshold of the neural network
$r_t = p_t - p_{t-1}$ (price return at time $t$)

The formula has the same form as a simple one-layer neural network. In practice the neural-network treatment is applied as-is, and the threshold $v_t$ is folded into the weight vector and optimized together with it. A small sketch of this signal computation is shown below.
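As a minimal sketch of the position decision under the formula above (the function and variable names here are mine, not from the reference; the weight layout mirrors the code later in this article):

```python
# Sketch of F_t: weights holds the M+1 lag weights, the recurrent weight w_{M+1,t},
# and the threshold v_t treated as a bias weight.
import numpy as np

def position_signal(weights, returns, prev_signal, bias=1.0):
    """weights: length M+3 vector; returns: the last M+1 price differences r_{t-M}..r_t."""
    activation = np.dot(weights[:-2], returns)   # sum_i w_{i,t} * r_{t-i}
    activation += weights[-2] * prev_signal      # w_{M+1,t} * F_{t-1}
    activation += weights[-1] * bias             # threshold v_t as a bias term
    return 1.0 if activation >= 0 else -1.0      # sign(.), ties broken as long
```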

[Profit at time point T]

P_T = \sum_{t=0}^{T} R_t, \qquad R_t := F_{t-1} r_t - \delta\, |F_t - F_{t-1}|

$\delta$: transaction cost
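A minimal sketch of this per-step return, with the transaction cost $\delta$ and the argument names being illustrative assumptions:

```python
# R_t = F_{t-1} * r_t - delta * |F_t - F_{t-1}|
def step_return(r_t, signal, prev_signal, delta=0.003):
    return prev_signal * r_t - delta * abs(signal - prev_signal)
```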

[Optimization evaluation value]

■ Sharpe ratio

\hat{S}(t) := \frac{A_t}{B_t}

A_t = A_{t-1} + \eta (R_t - A_{t-1}), \quad A_0 = 0
B_t = B_{t-1} + \eta (R_t^2 - B_{t-1}), \quad B_0 = 0

$\eta$: adaptation parameter

■ Differential Sharpe Ratio (DSR): A version of the moving Sharpe ratio adapted for online learning. The DSR is lighter to compute and (apparently) converges faster. The above $\hat{S}$ is Taylor-expanded around $\eta = 0$ and the first-order term is taken:

D_t := \left. \frac{d\hat{S}}{d\eta} \right|_{\eta=0}

= \frac{B_{t-1}\Delta A_t - \frac{1}{2} A_{t-1}\Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}

\Delta A_t = R_t - A_{t-1}, \quad \Delta B_t = R_t^2 - B_{t-1}

Consider $ D_t $ as an immediate performance measure and update the weight to maximize it.
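As a minimal sketch of these moment updates and the DSR (mirroring, but not identical to, the reference code further below; the value of eta is an assumed default):

```python
# One online update of the moving moments A_t, B_t and the DSR D_t.
def dsr_step(R_t, A_prev, B_prev, eta=0.01):
    delta_A = R_t - A_prev
    delta_B = R_t**2 - B_prev
    denom = (B_prev - A_prev**2) ** 1.5
    D_t = (B_prev*delta_A - 0.5*A_prev*delta_B) / denom if denom > 0 else 0.0
    A_t = A_prev + eta*delta_A
    B_t = B_prev + eta*delta_B
    return D_t, A_t, B_t
```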

[Update weight]

w_{i,t} = w_{i,t-1} + \rho\, \Delta w_{i,t}

\Delta w_{i,t} = \frac{dD_t}{dw_i} \approx \frac{dD_t}{dR_t}\left\{ \frac{dR_t}{dF_t}\frac{dF_t}{dw_{i,t}} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{dw_{i,t-1}} \right\}

\frac{dF_t}{dw_{i,t}}\approx\frac{\partial F_t}{\partial w_{i,t}}+\frac{\partial F_t}{\partial F_{t-1}}\frac{dF_{t-1}}{dw_{i,t-1}}

\frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}, \quad \frac{dR_t}{dF_t} = -\delta, \quad \frac{dR_t}{dF_{t-1}} = r_t - \delta
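A minimal sketch of one online weight update using these derivatives; the recursive $dF/dw$ term is carried over between calls, and names such as `x_t` and `prev_dF_dw` are my own assumptions rather than anything from the reference:

```python
# One RRL weight update step using the derivatives above.
import numpy as np

def rrl_weight_update(w, x_t, F_t, dD_dR, r_t, prev_dF_dw, delta=0.003, rho=0.01):
    """x_t = (r_t, ..., r_{t-M}, F_{t-1}, bias); returns (w_new, dF_dw)."""
    w, x_t = np.asarray(w, dtype=float), np.asarray(x_t, dtype=float)
    dtanh = 1.0 - F_t**2                           # derivative of the tanh output
    dF_dw = dtanh * (x_t + w[-2] * prev_dF_dw)     # dF_t/dw = dtanh*(x_t + w_{M+1}*dF_{t-1}/dw)
    dw = dD_dR * (-delta * dF_dw + (r_t - delta) * prev_dF_dw)
    return w + rho * dw, dF_dw
```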

[Loss cut, signal setting threshold, system abnormality judgment]

The reference states that these should be set as parameters, but gives no concrete values. It seems there is no choice but to set them empirically.

Reference code

The sign function is implemented by first computing tanh and then converting it to a signal according to whether $F_t$ is greater than 0. Features such as loss cuts are not implemented. Even I admit the code is rather dubious, so please treat it only as a reference.

```python
# coding: utf-8

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import tanh, copysign

class RRLAgentForFX:
    TRADING_COST = 0.003
    EPS = 1e-6
        
    def __init__(self,M,rho=0.01,eta=0.1,bias=1.0):
        np.random.seed(555)
        self.M = M # number of lags
        self.weights = np.zeros(self.M+3,dtype=np.float64)
        self.bias = bias # bias term
        self.rho = rho
        self.eta = eta
        self.price_diff = np.zeros(self.M+1) # r_t
        
        self.pre_price = None
        self.pre_signal = 0
        
        self.pre_A = 0.0
        self.pre_B = 0.0
        self.pre_gradient_F = 0.0
        
        # result store
        self.signal_store = []
        self.profit_store = []
        self.dsr_store = []
        self.sr_store = []
        self.cumulative_profit = 0.0
        
    def train_online(self, price):
        self.calculate_price_diff(price)
        signal, self.F_t_value = self.select_signal()
        print "signal",signal
        self.calculate_return(signal)
        self.update_parameters()
        self.pre_price = price
        self.pre_signal = signal
        
        # store result
        self.signal_store.append(signal)
                    
    def calculate_price_diff(self,price):
        r = price - self.pre_price if self.pre_price is not None else 0
        self.price_diff[:self.M] = self.price_diff[1:]
        self.price_diff[self.M] = r
        
    def calculate_return(self,signal):
        R_t = self.pre_signal*self.price_diff[-1]
        R_t -= self.TRADING_COST*abs(signal - self.pre_signal)
        self.return_t = R_t
        
        self.cumulative_profit += R_t
        self.profit_store.append(self.cumulative_profit)
            
    def select_signal(self):
        values_sum = (self.weights[:self.M+1]*self.price_diff).sum()
        values_sum += self.weights[-2]*self.pre_signal
        values_sum += self.bias*self.weights[-1]
        
        F_t_value = tanh(values_sum)
        return copysign(1, F_t_value ), F_t_value
                                            
    def update_parameters(self):
        # update weight
        self.weights += self.rho*self.calculate_gradient_weights()
        print "weight",self.weights

        # update moment R_t
        self.update_R_moment()

    def calculate_gradient_weights(self):
        """ differentiate between D_t and w_t """
        denominator = self.pre_B-self.pre_A**2
        if denominator > 0:  # avoid dividing by ~0 or taking a fractional power of a negative value
            diff_D_R = self.pre_B-self.pre_A*self.return_t
            diff_D_R /= (denominator)**1.5
        else:
            diff_D_R = 0
        
        gradient_F = self.calculate_gradient_F()
        print "gradient_F",gradient_F

        #diff_R_F = -self.TRADING_COST
        #diff_R_F_{t-1} = self.price_diff[-1] - self.TRADING_COST
        delta_weights = -self.TRADING_COST*gradient_F
        delta_weights += ( self.price_diff[-1] - self.TRADING_COST) \
                                                    *self.pre_gradient_F
        delta_weights *= diff_D_R
        self.pre_gradient_F = gradient_F
        return delta_weights
        
    def calculate_gradient_F(self):
        """ differentiate between F_t and w_t """
        diff_tanh = 1-self.F_t_value**2  # derivative of tanh at the current output

        diff_F_w = diff_tanh*( np.r_[ self.price_diff, self.pre_signal, self.bias ] )
        diff_F_F = diff_tanh*self.weights[-2]

        return diff_F_w + diff_F_F*self.pre_gradient_F

    def update_R_moment(self):
        delta_A = self.return_t - self.pre_A
        delta_B = self.return_t**2 - self.pre_B
        A_t = self.pre_A + self.eta*delta_A # A_t. first moment of R_t.
        B_t = self.pre_B + self.eta*delta_B # B_t. second moment of R_t.
        # guard against division by zero on the very first updates
        self.sr_store.append(A_t/B_t if B_t != 0 else 0.0)
        self.calculate_dsr(delta_A, delta_B)

        self.pre_A = A_t
        self.pre_B = B_t

    def calculate_dsr(self, delta_A, delta_B):
        denominator = (self.pre_B - self.pre_A**2)**1.5
        if denominator > 0:
            dsr = (self.pre_B*delta_A - 0.5*self.pre_A*delta_B)/denominator
        else:
            dsr = 0.0  # undefined while the moments are still zero
        self.dsr_store.append(dsr)

if __name__=='__main__':
    M = 8
    fx_agent = RRLAgentForFX(M,rho=0.01,eta=0.01,bias=0.25)
    
    ifname = os.getcwd()+'/input/quote.csv'
    data = pd.read_csv(ifname)
    train_data = data.loc[:3000, 'USD']  # .ix is deprecated; .loc gives the same label-based slice
    
    for price in train_data.values:
        fx_agent.train_online(price)
```

Experiment

I downloaded and used the CSV file of daily foreign exchange rates (against the yen) from the Mizuho Bank historical data page. Training used 3,000 observations of USD/JPY starting from April 1, 2002.

Experimental result

USD / JPY rate

USD.png

Cumulative profit (when trading only one unit per day)

profit.png

DSR

DSR_.png

SR

SR.png

Comment

The results depend heavily on the values of ρ and η, and they are far too unstable. I will update the code as soon as I notice mistakes. If you spot anything strange, I would greatly appreciate a comment.
