Build an AI-powered automated FX trading algorithm and rake in the profits!

It's a dream everyone has. (Or maybe it's just me.)

**This time, we built a classification model, using Random Forest, that predicts whether the dollar-yen rate will rise, fall, or stay about the same.**

Decide on a policy first

Regression model or classification model → let's go with classification

First, why a classification model (a three-class problem: the rate goes up, goes down, or stays about the same) rather than a regression model (one that estimates the rate itself)? There are several reasons.

・A classification model is more intuitive to evaluate than a regression model (this time I computed a confusion matrix).
・A human trading FX really has only three actions (buy, sell, do nothing), so isn't classification a natural fit?
・I simply wanted to implement a random forest.

That's why I chose classification.

(In hindsight, trade size also matters in actual trading. Since the specific rate matters when deciding how much to trade, a regression model that predicts the rate itself might have been the better choice. I'll try that next time.)

Which features to use → let's use technical indicators

What features should an FX rate prediction model use? If you think about what humans actually look at when buying and selling dollars and yen, candidate features come to mind.

The analysis humans perform when trading FX falls roughly into technical analysis and fundamental analysis. Technical analysis predicts price movements from charts. Fundamental analysis, on the other hand, predicts price movements from news and world affairs.

This time, we will use technical analysis, which is easy to turn into machine learning features. It is easy to incorporate because it deals only with numbers and is simple to program.

Fundamentals, by contrast, are hard to incorporate (it would take something like text mining) and are mostly used for trading from a long-term perspective, so we will not use them.

Dataset

Now that the policy is decided, let's find a dataset. This time, we used the past 20 years of daily data from investing.com. That gives roughly 250 days (FX trades only on weekdays) x 20 years ≈ 5,000 data points.

Implementation!!!

1. Read the CSV file containing the rate data

import pandas as pd
df = pd.read_csv("USD_JPY.csv")

2. Creation of features (explanatory variables)

As mentioned above, this time we will use technical indicator values as the features. Specifically, I used five indicators: SMA5, SMA20, RSI14, MACD, and Bollinger Bands (2σ). Please look up the details of each indicator. **However, even though this is machine learning rather than statistical analysis, we want to avoid multicollinearity as much as possible, so when choosing technical indicators as features, pick ones that are as uncorrelated with each other as possible.**

To calculate the technical indicators, use a library called talib, which is extremely convenient. It computes each indicator in a single call.

RSI and MACD are used as-is, but SMA and Bollinger Bands are not. (For example, an SMA of 105 is not a useful feature on its own, because you cannot tell whether 105 is high or low.) So we divide the SMA and Bollinger Band values by the closing price to turn them into relative, comparable values before using them as features.

I think there are many other ways to do this, so try the ones you like best!
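As one such alternative, here is a minimal sketch of the "moving average divided by close" idea using only pandas instead of talib. The prices are made up for illustration, and the 3-day window is chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical closing prices, oldest first (not real market data)
close = pd.Series([105.0, 106.0, 104.0, 107.0, 108.0, 110.0])

# 3-day simple moving average; the first 2 rows are NaN because
# a 3-day window needs 3 observations
sma3 = close.rolling(window=3).mean()

# Dividing by the same day's close gives a scale-free ratio:
# above 1 means the average sits above the current price, below 1 means under it
sma3_ratio = sma3 / close

print(sma3_ratio.round(4).tolist())
```

The ratio stays comparable whether the rate is around 80 or around 120, which is the point of dividing by the close.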


import talib as ta
import numpy as np

# Use the closing rate for all subsequent calculations (talib expects float64 input)
close = np.array(df["closing price"], dtype=float)

# Create an empty DataFrame to hold the features
df_feature = pd.DataFrame(index=range(len(df)),columns=["SMA5/current", "SMA20/current","RSI","MACD","BBANDS+2σ","BBANDS-2σ"])

# Below, the technical indicators (the features used for training) are calculated with talib and stored in df_feature

# For the simple moving averages, use the ratio of the moving average to the day's closing price as the feature
df_feature["SMA5/current"]= ta.SMA(close, timeperiod=5) / close
df_feature["SMA20/current"]= ta.SMA(close, timeperiod=20) / close

#RSI
df_feature["RSI"] = ta.RSI(close, timeperiod=14)

#MACD
df_feature["MACD"], _ , _= ta.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)

# Bollinger Bands (±2σ, to match the text and the column names)
upper, middle, lower = ta.BBANDS(close, timeperiod=20, nbdevup=2, nbdevdn=2)
df_feature["BBANDS+2σ"] = upper / close
df_feature["BBANDS-2σ"] = lower / close
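To follow up on the multicollinearity point, one way to check how correlated the features are is a plain correlation matrix. The sketch below is only an illustration: it uses a synthetic random walk instead of real USD/JPY data, computes the indicators with pandas rather than talib, and the 0.9 cutoff is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic random-walk "closing prices" (an assumption, not real USD/JPY data)
close = pd.Series(105 + rng.normal(0, 0.5, 500).cumsum())

sma20 = close.rolling(20).mean()
std20 = close.rolling(20).std(ddof=0)

features = pd.DataFrame({
    "SMA5/current": close.rolling(5).mean() / close,
    "SMA20/current": sma20 / close,
    "BBANDS+2σ": (sma20 + 2 * std20) / close,
    "BBANDS-2σ": (sma20 - 2 * std20) / close,
}).dropna()

# Pairwise Pearson correlation between the features
corr = features.corr()
print(corr.round(2))

# List feature pairs whose absolute correlation exceeds a hypothetical 0.9 cutoff
cols = list(corr.columns)
high = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(high)
```

With the real data, the same `corr()` call on `df_feature` (after dropping NaN rows) would show which indicators overlap heavily and are candidates for removal.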

3. Creation of labels (target variable)

As mentioned above, the label for this model takes one of three values: up, down, or (almost) unchanged. We therefore create the labels from the day-over-day change (%) column in the data downloaded from investing.com.

The function used to create them is as follows.

def classify(x):
    # Day-over-day change of -0.2% or less: group 0
    if x <= -0.2:
        return 0
    # Day-over-day change strictly between -0.2% and 0.2%: group 1
    elif -0.2 < x < 0.2:
        return 1
    # Day-over-day change of 0.2% or more: group 2
    elif 0.2 <= x:
        return 2

Why split the day-over-day change at -0.2% and 0.2%?

・100 (yen/dollar) x 0.002 = 0.2 (yen/dollar) = 20 pips, and I thought this was a reasonable threshold for judging whether the rate has moved.
・**Splitting at -0.2% and 0.2% divides the data into three groups of almost equal size. (Figure below)**

[Figure: number of samples in group 0, group 1, and group 2]

From the left, the bars show the number of samples in group 0, group 1, and group 2. The split is almost even. Having roughly balanced classes in the labels is very important when using Random Forest. (Of course you can still train when the classes are imbalanced, but then you need to weight them. For details, this article is easy to follow.)
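A quick way to check the balance numerically is to count the labels. This is a minimal sketch with made-up percentage changes; the `class_weight="balanced"` option shown at the end is scikit-learn's built-in reweighting, mentioned here as one possible way to handle imbalanced classes rather than something the article itself uses.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def classify(x):
    # Same thresholds as in the article: day-over-day change in percent
    if x <= -0.2:
        return 0
    elif x < 0.2:
        return 1
    return 2

# Hypothetical day-over-day changes in percent (not real data)
changes = pd.Series([-0.5, -0.3, -0.1, 0.0, 0.1, 0.25, 0.4, -0.21, 0.05, 0.3])
labels = changes.apply(classify)

# Share of samples per class; roughly equal shares mean balanced classes
print(labels.value_counts(normalize=True).sort_index())

# If the classes were imbalanced, scikit-learn can reweight them automatically
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
```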

With that in mind, create the labels.

df["The day before ratio_float"] = df["The day before ratio%"].apply(lambda x: float(x.replace("%", "")))

# Classify the day-over-day change (%). The thresholds split the samples as evenly as possible across classes
def classify(x):
    if x <= -0.2:
        return 0
    elif -0.2 < x < 0.2:
        return 1
    elif 0.2 <= x:
        return 2


df["The day before ratio_classified"] = df["The day before ratio_float"].apply(classify)

# Shift the labels by one row so that each row's features are paired with the adjacent day's label.
# shift() gives row i the label of row i-1, which is the next day only if the CSV is sorted newest-first;
# for oldest-first data, shift(-1) would be needed instead.
df_y = df["The day before ratio_classified"].shift()
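Because the direction of the shift is easy to get wrong, here is a small sketch of what `shift()` and `shift(-1)` do. The labels and the row names `d1` to `d5` are hypothetical and stand for five consecutive days, oldest first.

```python
import pandas as pd

# Hypothetical labels for five consecutive days, oldest first
labels = pd.Series([0, 1, 2, 1, 0], index=["d1", "d2", "d3", "d4", "d5"])

# shift() moves each value one row DOWN: row d2 now holds d1's label
prev_row = labels.shift()

# shift(-1) moves each value one row UP: row d2 now holds d3's label
next_row = labels.shift(-1)

print(pd.DataFrame({"label": labels, "shift()": prev_row, "shift(-1)": next_row}))
```

With oldest-first rows, `shift(-1)` pairs each day's features with the next day's label; with newest-first rows, `shift()` does. It is worth checking the sort order of the CSV, and also that the moving averages and other indicators are computed on oldest-first data, since on newest-first data their windows would include future prices.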

4. Features and labels complete!

A little more processing is needed. For example, when SMA5 is used as a feature, its value is NaN for the first 4 days (because a 5-day average needs at least 5 days of data). Since the feature data contains NaN at the beginning, we remove those rows.

df_xy = pd.concat([df_feature, df_y], axis=1)
df_xy = df_xy.dropna(how="any")

This completes the preprocessing. This time, we will use a random forest, so normalization / standardization is not necessary.

5. Train the model!

All that is left is training. It might be interesting to experiment with the various random forest parameters. For an explanation of random forest hyperparameters, this article is easy to follow.

The hyperparameters were optimized with optuna. Note that by default optuna minimizes the value returned by the objective function, so the objective must return something you want to minimize.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import optuna

X_train, X_test, Y_train, Y_test = train_test_split(df_xy[["SMA5/current", "SMA20/current","RSI","MACD","BBANDS+2σ","BBANDS-2σ"]],df_xy["The day before ratio_classified"], train_size=0.8)

def objective(trial):
    min_samples_split = trial.suggest_int("min_samples_split", 2,16)
    max_leaf_nodes = int(trial.suggest_discrete_uniform("max_leaf_nodes", 4,64,4))
    criterion = trial.suggest_categorical("criterion", ["gini", "entropy"])
    n_estimators = int(trial.suggest_discrete_uniform("n_estimators", 50,500,50))
    max_depth = trial.suggest_int("max_depth", 3,10)
    clf = RandomForestClassifier(random_state=1, n_estimators = n_estimators, max_leaf_nodes = max_leaf_nodes, max_depth=max_depth, max_features=None,criterion=criterion,min_samples_split=min_samples_split)
    clf.fit(X_train, Y_train)
    return 1 - accuracy_score(Y_test, clf.predict(X_test))

study = optuna.create_study()
study.optimize(objective, n_trials=100)

print(1-study.best_value)
print(study.best_params)
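One caveat about the code above: `train_test_split` shuffles rows by default, and the same test set is used both to choose the hyperparameters and to report the accuracy, which tends to make the reported number optimistic. The following is a minimal sketch of a chronological train / validation / test split. It assumes rows are sorted oldest-first, uses random synthetic data in place of the real features and labels, and replaces optuna with a tiny loop over `max_depth` to stay short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for the feature matrix and labels (assumed oldest-first)
X = rng.normal(size=(1000, 6))
y = rng.integers(0, 3, size=1000)

# Chronological 60/20/20 split: no shuffling, so the model never trains on the future
n = len(X)
i_val, i_test = int(n * 0.6), int(n * 0.8)
X_train, y_train = X[:i_val], y[:i_val]
X_val, y_val = X[i_val:i_test], y[i_val:i_test]
X_test, y_test = X[i_test:], y[i_test:]

# Pick a hyperparameter using the validation set only
best_depth, best_acc = None, -1.0
for depth in (3, 5, 7):
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=1)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_val, clf.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Refit on train + validation and report accuracy once, on the untouched test set
final = RandomForestClassifier(n_estimators=50, max_depth=best_depth, random_state=1)
final.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
test_acc = accuracy_score(y_test, final.predict(X_test))
print(best_depth, round(test_acc, 3))
```

Since the synthetic labels here are pure noise, the accuracy should hover around chance; the point is only the structure of the split.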

The accuracy with the optimized hyperparameters is

0.6335025380710659

Since this is a three-class problem, I think that is quite a good result. It's about twice as good as guessing at random.

If the price movement can be predicted correctly 60% of the time, the expected value is likely to be positive even after accounting for the spread!
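The arithmetic behind that intuition can be sketched as follows. All numbers are hypothetical: a 60% hit rate on direction, symmetric 20-pip wins and losses, and a 0.3-pip spread paid on every trade. Note also that the 63% above is three-class accuracy, which is not the same thing as a directional hit rate.

```python
def expected_pips(p_win, win_pips, loss_pips, spread_pips):
    """Expected profit per trade in pips, paying the spread on every trade."""
    return p_win * win_pips - (1 - p_win) * loss_pips - spread_pips

# Hypothetical numbers: 60% hit rate, symmetric 20-pip moves, 0.3-pip spread
ev = expected_pips(p_win=0.6, win_pips=20, loss_pips=20, spread_pips=0.3)
print(round(ev, 2))
```

Under these assumptions the expected value per trade is positive; with a 50% hit rate the same formula gives a loss equal to the spread.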

Let's look at the multiclass confusion matrix!

[Figure: 3x3 confusion matrix with cells numbered ① to ⑨]

To make a profit in FX, we want the share of ① and ⑨ to be high. In the confusion matrix, ① + ⑨ = 48.9%, which is nearly half. What we most want to avoid are patterns ③ and ⑦ (predicted up but actually went down, and predicted down but actually went up). These two are quite low, at ③ + ⑦ = 7.3%.
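For reference, shares like these can be computed with scikit-learn's `confusion_matrix`. The sketch below uses made-up predictions, and it assumes the cells are numbered ① to ⑨ row by row with rows as actual classes and columns as predicted classes, which may not match the orientation of the figure.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted classes (0 = down, 1 = flat, 2 = up)
y_true = np.array([0, 0, 1, 1, 2, 2, 0, 2, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 0, 2, 2, 2])

# normalize="all" expresses every cell as a share of all samples
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2], normalize="all")

# Assuming cells are numbered row by row (rows = actual, columns = predicted)
correct_direction = cm[0, 0] + cm[2, 2]   # cells 1 and 9
wrong_direction = cm[0, 2] + cm[2, 0]     # cells 3 and 7
print(cm)
print(correct_direction, wrong_direction)
```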

**From the above, we can see that the model trained this time can be profitable.**

Also, the hyperparameters that maximized accuracy were:

{'min_samples_split': 8,
 'max_leaf_nodes': 40.0,
 'criterion': 'entropy',
 'n_estimators': 310.0,
 'max_depth': 7}

The relationship between the hyperparameters and accuracy is shown below.

[Figure: hyperparameter values plotted against the objective value]

(* Note that objective value = 1 - accuracy, because optuna minimizes the objective by default.)

In the figure above, lighter colors indicate higher accuracy. You can indeed read off that accuracy is highest with max_depth around 7, max_leaf_nodes around 30-40, and n_estimators around 300.

Postscript: Further experiments

So far we have done three-class classification, but now let's try two classes: the rate goes up or the rate goes down.

The only change is to the classify function from step 3 (label creation).

def classify(x):
    if x <= 0:
        return 0
    else:
        return 1

If you build a model in the same way and optimize hyperparameters with optuna ...

accuracy=0.7766497461928934

**With this, I'm going to be filthy rich too (eyes rolling back)!**

Investment decisions are at your own risk.

Along these lines, it may be interesting not only to change how the data is divided into classes, but also to experiment with the technical indicators used!

[PYTHON] Implementation of a model that predicts the exchange rate (dollar-yen rate) by machine learning