[PYTHON] Implementation of a model that predicts the exchange rate (dollar-yen rate) by machine learning

Build an AI-powered automated FX trading algorithm and rake in the cash!

It's a dream that everyone has. (Maybe just me)

**This time, we built a Random Forest classification model that predicts whether the dollar-yen rate will rise, fall, or stay the same.**

First, decide on the approach

Regression model or classification model? → Let's go with classification

First, why choose a classification model (a three-class classification of whether the rate goes up, goes down, or stays the same) rather than a regression model (one that estimates the exact rate)? There are several reasons.

- A classification model is more intuitive to evaluate than a regression model (this time, I computed a confusion matrix).
- In FX a human really only takes three actions (buy, sell, or do nothing), so classification seems a natural fit.
- I simply wanted to implement a random forest.

That's why I chose classification.

(Thinking about it afterwards, trade size also matters in actual trading. The specific rate matters when deciding the trade size, so a regression model that predicts the rate itself might have been the better choice. I'll try that next time.)

Which features to use → Let's use technical indicators

What features should an FX rate prediction model use? Thinking about what humans actually look at when buying and selling dollars and yen suggests the answer.

The analysis humans perform when trading FX falls roughly into technical analysis and fundamental analysis. Technical analysis predicts price movements by reading charts. Fundamental analysis predicts price movements from news and world affairs.

This time we use technical analysis, which is easy to turn into machine learning features: it deals only with numbers, so it is easy to program.

Incorporating fundamentals, on the other hand, is difficult (it would involve text mining, for example), and fundamentals are mostly used for trading from a long-term perspective, so we do not use them.

Dataset

Now that the approach is decided, let's find a dataset. This time, we used 20 years of daily data from investing.com. At roughly 250 days per year (FX trades only on weekdays) x 20 years, that gives about 5,000 data points.

Implementation!

1. Read the csv file containing the rate information

import pandas as pd
df = pd.read_csv("USD_JPY.csv")

2. Creation of features (explanatory variables)

As mentioned above, the features this time are technical indicator values. Specifically, I used five: SMA5, SMA20, RSI14, MACD, and Bollinger Bands (2σ). Please look up the details of each indicator yourself. **However, when choosing technical indicators as features, we want to avoid multicollinearity as far as possible, even though this is machine learning rather than statistical analysis, so pick indicators that are as uncorrelated with each other as possible.**
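
One way to check how correlated candidate features are is to look at their pairwise correlation matrix. The following is only a minimal sketch on synthetic data: the column names `feat_a`, `feat_b`, `feat_c` and the 0.9 threshold are illustrative assumptions, not values from this article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a feature table (not the article's data)
rng = np.random.default_rng(0)
base = rng.normal(size=500)
features = pd.DataFrame({
    "feat_a": base,
    "feat_b": base + rng.normal(scale=0.05, size=500),  # nearly a copy of feat_a
    "feat_c": rng.normal(size=500),                      # independent of the others
})

# Absolute pairwise Pearson correlations between the columns
corr = features.corr().abs()

# Collect the column pairs whose correlation exceeds an (arbitrary) threshold
threshold = 0.9
cols = list(corr.columns)
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)
```

Pairs that show up in `high_pairs` are candidates for dropping one of the two indicators.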

To calculate the technical indicators, use a library called talib, which is extremely convenient: it computes each indicator in a single call.

RSI and MACD values are used as they are, but SMA and Bollinger Band values are not. (For example, an SMA of 105 is unsuitable as a feature on its own, because you cannot tell whether 105 is high or low.) So we divide the SMA and Bollinger Band values by the closing price to turn them into relative, comparable values before using them as features.

There are many other ways to do this, so try whichever you like best!


import talib as ta
import numpy as np

#Use the closing rate for all subsequent calculations
close = np.array(df["closing price"])

#Create an empty dataframe to put the features
df_feature = pd.DataFrame(index=range(len(df)),columns=["SMA5/current", "SMA20/current","RSI","MACD","BBANDS+2σ","BBANDS-2σ"])

#Below, compute the technical indicators (the features used for training) with talib and store them in df_feature

#The simple moving average uses the ratio of the simple moving average to the closing price of the day as a feature.
df_feature["SMA5/current"]= ta.SMA(close, timeperiod=5) / close
df_feature["SMA20/current"]= ta.SMA(close, timeperiod=20) / close

#RSI
df_feature["RSI"] = ta.RSI(close, timeperiod=14)

#MACD
df_feature["MACD"], _ , _= ta.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)

#Bollinger Bands (2 standard deviations, matching the ±2σ feature names)
upper, middle, lower = ta.BBANDS(close, timeperiod=20, nbdevup=2, nbdevdn=2)
df_feature["BBANDS+2σ"] = upper / close
df_feature["BBANDS-2σ"] = lower / close
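
To see why dividing by the closing price makes the feature comparable across price levels, here is a small sketch. It uses a pandas rolling mean as a stand-in for `ta.SMA` (an assumption made so the example runs without talib), and the price series and the helper name `sma_ratio` are hypothetical.

```python
import pandas as pd

def sma_ratio(close: pd.Series, period: int) -> pd.Series:
    """Simple moving average divided by the same day's close."""
    return close.rolling(period).mean() / close

# Two hypothetical price series with the same shape but different levels
prices_low = pd.Series([100.0, 101.0, 102.0, 101.0, 100.0, 99.0, 100.0])
prices_high = prices_low * 1.5

ratio_low = sma_ratio(prices_low, period=5)
ratio_high = sma_ratio(prices_high, period=5)

# The raw SMA values differ by a factor of 1.5, but the ratios coincide
print(ratio_low.dropna().round(4).tolist())
print(ratio_high.dropna().round(4).tolist())
```

Because the ratio is unaffected by the overall price level, a value such as 1.01 means the same thing whether the rate is around 100 or around 150.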

3. Creating the labels (target variable)

As mentioned above, the labels for this model take three values: [up, down, (almost) unchanged]. So we create the labels from the day-over-day change (%) column in the data downloaded from investing.com.

The function used to create them is as follows.

def classify(x):
    #Group 0 if the day-over-day change is -0.2% or less
    if x <= -0.2:
        return 0
    #Group 1 if the day-over-day change is between -0.2% and 0.2%
    elif -0.2 < x < 0.2:
        return 1
    #Group 2 if the day-over-day change is 0.2% or more
    elif 0.2 <= x:
        return 2

Why split the day-over-day change at -0.2% and 0.2%?

- 100 (yen/dollar) x 0.002 = 0.2 (yen/dollar) = 20 pips, which seemed a reasonable threshold for judging whether the rate has moved.
- **Splitting into three groups at -0.2% and 0.2% divides the data almost evenly (figure below).**

[Figure: Screenshot 2020-07-29 22.20.33.png]

From the left: the number of samples in group 0, group 1, and group 2. The split is almost even. Evenly balanced label classes are very important when using Random Forest. (You can of course train with unbalanced classes, but then you need to weight them. For details, this article is easy to understand.)
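
A quick way to check how evenly a pair of thresholds splits the data is to count the labels. The sketch below reuses the article's thresholds on a handful of made-up day-over-day changes; the `changes` list is hypothetical and only illustrates the counting.

```python
from collections import Counter

def classify(x):
    # Same thresholds as in the article: -0.2% and +0.2%
    if x <= -0.2:
        return 0
    elif x < 0.2:
        return 1
    else:
        return 2

# Hypothetical day-over-day changes in percent (not real market data)
changes = [-0.45, -0.21, -0.2, -0.05, 0.0, 0.1, 0.19, 0.2, 0.33, 0.8]

counts = Counter(classify(x) for x in changes)
total = len(changes)
for label in (0, 1, 2):
    print(label, counts[label], f"{counts[label] / total:.0%}")
```

If the split turns out lopsided, one option is weighting: scikit-learn's `RandomForestClassifier` accepts `class_weight="balanced"`.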

With the above in mind, create the labels.

df["The day before ratio_float"] = df["The day before ratio%"].apply(lambda x: float(x.replace("%", "")))

#Classify the day-over-day change (%). Choose thresholds that split the samples across the classes as evenly as possible
def classify(x):
    if x <= -0.2:
        return 0
    elif -0.2 < x < 0.2:
        return 1
    elif 0.2 <= x:
        return 2
    

df["The day before ratio_classified"] = df["The day before ratio_float"].apply(lambda x: classify(x))

#Shift the labels by one row so that each row's features are paired with the label of the adjacent day (the day to be predicted)
df_y = df["The day before ratio_classified"].shift()
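
What `shift()` does here is easiest to see on a tiny example. The sketch below uses hypothetical labels and dates, and it assumes the rows are sorted newest first; that row order is an assumption for illustration, not something stated in this article.

```python
import pandas as pd

# Hypothetical labels, one per row; the dates are for illustration only
labels = pd.Series(
    [2, 0, 1, 2],
    index=["2020-07-29", "2020-07-28", "2020-07-27", "2020-07-24"],  # newest first
)

# shift() with no argument moves every value one row down
shifted = labels.shift()
print(pd.DataFrame({"label": labels, "shifted": shifted}))

# With newest-first rows, each row now holds the label of the following
# trading day. With oldest-first rows, shift(-1) would be needed instead.
```

It is worth confirming the sort order of the CSV before relying on the direction of the shift, because the wrong direction pairs each row with the previous day's movement rather than the next one.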

4. Features and labels are ready!

One more bit of processing. For example, SMA5 is NaN for the first 4 days (it takes at least 5 days of data to compute a 5-day average). NaN values like these appear at the start of the feature data, so we remove them.

df_xy = pd.concat([df_feature, df_y], axis=1)
df_xy = df_xy.dropna(how="any")
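
The effect of this step can be checked on a toy table. In the sketch below, a pandas rolling mean stands in for `ta.SMA` (an assumption so the example runs without talib), and the prices and labels are made up.

```python
import pandas as pd

close = pd.Series([100.0, 101.0, 102.0, 101.0, 100.0, 99.0, 100.0, 101.0])

# A 5-day moving average is undefined for the first 4 rows
features = pd.DataFrame({"SMA5/current": close.rolling(5).mean() / close})

# Hypothetical labels, with one missing value at the top as left by shift()
labels = pd.Series([None, 1, 2, 0, 0, 1, 2, 1], name="label", dtype="float64")

xy = pd.concat([features, labels], axis=1)
xy_clean = xy.dropna(how="any")  # drop every row that contains at least one NaN

print(len(xy), len(xy_clean))
```

Only rows where every feature and the label are present survive, so the longest indicator window determines how many leading rows are lost.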

This completes the preprocessing. Since we use a random forest this time, normalization and standardization are not needed.

5. Train the model!

All that is left is training. It might be interesting to experiment with various random forest parameters... Random forest hyperparameters are explained in an easy-to-understand way in this article.

The hyperparameters were optimized with optuna. When using optuna, note that by default it minimizes the objective function, so have the objective return the quantity you want to minimize.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import optuna

X_train, X_test, Y_train, Y_test = train_test_split(df_xy[["SMA5/current", "SMA20/current","RSI","MACD","BBANDS+2σ","BBANDS-2σ"]],df_xy["The day before ratio_classified"], train_size=0.8)

def objective(trial):
    min_samples_split = trial.suggest_int("min_samples_split", 2,16)
    max_leaf_nodes = int(trial.suggest_discrete_uniform("max_leaf_nodes", 4,64,4))
    criterion = trial.suggest_categorical("criterion", ["gini", "entropy"])
    n_estimators = int(trial.suggest_discrete_uniform("n_estimators", 50,500,50))
    max_depth = trial.suggest_int("max_depth", 3,10)
    clf = RandomForestClassifier(random_state=1, n_estimators = n_estimators, max_leaf_nodes = max_leaf_nodes, max_depth=max_depth, max_features=None,criterion=criterion,min_samples_split=min_samples_split)
    clf.fit(X_train, Y_train)
    return 1 - accuracy_score(Y_test, clf.predict(X_test))

study = optuna.create_study()
study.optimize(objective, n_trials=100)

print(1-study.best_value)
print(study.best_params)
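
One variation worth trying is to tune the hyperparameters on a separate validation set and keep the rows in time order, so that the final score comes from data the search never saw. The following is only a sketch of such a split: the helper `chronological_split`, the 60/20/20 fractions, and the `demo` table are all hypothetical, and it assumes the rows are sorted oldest first.

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_frac: float = 0.6, valid_frac: float = 0.2):
    """Split rows in their existing order into train / validation / test parts."""
    n = len(frame)
    train_end = round(n * train_frac)
    valid_end = train_end + round(n * valid_frac)
    return frame.iloc[:train_end], frame.iloc[train_end:valid_end], frame.iloc[valid_end:]

# Hypothetical table standing in for df_xy, assumed sorted oldest first
demo = pd.DataFrame({"feature": range(10), "label": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]})
train, valid, test = chronological_split(demo)
print(len(train), len(valid), len(test))
```

With this layout, the optuna objective would score the model on `valid`, and `test` would be used once at the end.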

The accuracy with the optimized hyperparameters is

0.6335025380710659

For a three-class classification, I think this is quite a good result: about 1.9 times the roughly 33% you would expect from picking at random.

If the price-movement forecast is right 60% of the time, the expected value is likely to be positive even after accounting for the spread!
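
The expected-value argument can be written out as simple arithmetic. The sketch below treats the accuracy as a per-trade win rate, which is a simplification, and every number in it (20 pips won or lost, 0.3 pip spread) is a hypothetical assumption rather than a figure from this article.

```python
def expected_pips(p_win: float, gain: float, loss: float, spread: float) -> float:
    """Expected profit per trade in pips, paying the spread on every trade."""
    return p_win * gain - (1.0 - p_win) * loss - spread

# Hypothetical numbers: win or lose 20 pips per trade, 0.3 pip spread
print(expected_pips(p_win=0.6, gain=20.0, loss=20.0, spread=0.3))
print(expected_pips(p_win=0.5, gain=20.0, loss=20.0, spread=0.3))
```

Under these assumptions a 60% win rate leaves a positive expectation, while a coin flip loses exactly the spread on average.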

Let's look at the multiclass confusion matrix!

[Figure: Screenshot 2020-07-30 13.32.40.png]

To make a profit in FX, you want the share of ① and ⑨ to be high. In the confusion matrix, ① + ⑨ = 48.9%, nearly half. What you most want to avoid are patterns ③ and ⑦ (predicting a rise when the rate actually fell, and predicting a fall when it actually rose). Together these are quite low at ③ + ⑦ = 7.3%.
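
The shares discussed above can be computed from a matrix of counts as follows. This is a sketch with made-up counts, and the layout is an assumption: rows are the actual class, columns the predicted class, classes ordered down, unchanged, up, with cells numbered ① to ⑨ row by row.

```python
# Hypothetical 3x3 confusion matrix of counts (not the article's figures).
# Assumed layout: rows = actual class, columns = predicted class,
# classes ordered 0 (down), 1 (unchanged), 2 (up); cells numbered 1..9 row by row.
matrix = [
    [50, 30, 10],
    [25, 60, 25],
    [10, 30, 60],
]

total = sum(sum(row) for row in matrix)

def cell_share(number: int) -> float:
    """Share of all samples that fall in the cell with the given number (1..9)."""
    row, col = divmod(number - 1, 3)
    return matrix[row][col] / total

correct_direction = cell_share(1) + cell_share(9)  # predicted a move and got the direction right
wrong_direction = cell_share(3) + cell_share(7)    # predicted the opposite direction
accuracy = sum(cell_share(n) for n in (1, 5, 9))   # the whole diagonal

print(f"{correct_direction:.1%} {wrong_direction:.1%} {accuracy:.1%}")
```

The corner cells are the ones that matter for trading, because the middle row and column correspond to days on which no position would be taken or the rate barely moved.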

**From the above, the model trained this time looks like it can be profitable.**

Also, the hyperparameters that maximized accuracy were:

{'min_samples_split': 8,
 'max_leaf_nodes': 40.0,
 'criterion': 'entropy',
 'n_estimators': 310.0,
 'max_depth': 7}

The relationship between the hyperparameters and accuracy is as follows.

[Figure: Screenshot 2020-07-30 11.45.56.png]

(* Note that the objective value = 1 - accuracy, because optuna minimizes the objective!)

In the figure above, lighter colors mean higher accuracy. You can indeed read off that the best region is max_depth around 7, max_leaf_nodes around 30-40, and n_estimators around 300.

Postscript: Further experiments

So far we have classified into three classes, but let's try two classes instead: the rate goes up or it goes down.

Only the classify function from step 3 (label creation) changes.

def classify(x):
    if x <= 0:
        return 0
    else:
        return 1

If you build a model in the same way and optimize hyperparameters with optuna ...

accuracy=0.7766497461928934

**Now I'm going to be filthy rich too (blank stare)!**

As this shows, it may be interesting not only to change how the data is split into classes, but also to play with the technical indicators used!
