[PYTHON] [Machine learning] FX prediction using decision trees

Machine learning and FX

Hello. This is my first proper article on Qiita. I recently started studying machine learning. Machine learning, needless to say, is used all over the place: spam email filtering, product recommendations, and so on, with no end of examples. For my part, I was curious about predicting stock prices and FX with machine learning, so today I would like to predict FX using a decision tree, one of the standard machine learning methods. If you could predict stock prices and FX with good accuracy, you could make money without doing anything, which is a very dreamy prospect. In reality, though, it is not so easy, so my main motivation is to apply the machine learning I recently studied to something concrete, rather than to actually win or turn a profit. So, a note up front: if you are reading this thinking __"I don't care about machine learning, just have the AI predict FX and tell me whether the dollar-yen will rise or fall tomorrow!"__, there is no useful information here for you. If you are interested in machine learning and FX forecasting, it may be a little fun. That's about the level of this article.

What is FX?

With stocks, we all know that if you buy a stock and its price goes up you profit, and if it goes down you lose. Some people may not be familiar with FX (foreign exchange margin trading), so let me explain briefly. Suppose the rate is 100 yen to the dollar and you place a "buy" order for 1 dollar. If tomorrow the dollar:

- rises to 110 yen → you make a 10 yen profit
- falls to 90 yen → you take a 10 yen loss

That's the basic idea; this kind of currency trading is called FX. With FX you can apply leverage: at 10x leverage you can move 10 times the money, which also means profits and losses are both multiplied by 10, so be careful. Trading the dollar against the yen is called the dollar-yen (USD/JPY). If the dollar's value rises, say from 100 yen to 110 yen per dollar, the dollar strengthens (the yen weakens); conversely, if it falls from 100 yen to 90 yen per dollar, the dollar weakens (the yen strengthens). Plenty of books and sites explain FX clearly, so I will stop here; the point is that, __just as with stocks, you can profit if you can predict whether the price will go up or down__.
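As a toy illustration of that arithmetic (made-up numbers, with leverage modeled simply as a multiplier on position size):

def fx_pnl(entry_rate, exit_rate, units, leverage=1):
    """Profit/loss in yen for a long USD/JPY position.

    entry_rate, exit_rate: yen per dollar at entry and exit.
    units: dollars bought; with leverage you control leverage*units
    dollars on the same margin, scaling gains and losses alike.
    """
    return (exit_rate - entry_rate) * units * leverage

print(fx_pnl(100, 110, units=1))               # +10 yen
print(fx_pnl(100, 90, units=1))                # -10 yen
print(fx_pnl(100, 110, units=1, leverage=10))  # +100 yen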

Referenced site

I referred to this page. It also includes a detailed explanation of decision trees, so I will not explain them in depth in this article. In a nutshell, a decision tree needs no feature scaling (standardization), and the process by which it reaches a result is easy to interpret (it has semantic interpretability). Here's a quick summary of what the linked page does:

- Predict whether tomorrow's dollar-yen will rise or fall, using 500 days of daily USD/JPY data as of 2018
- Use a decision tree
- No other classifiers, and no grid search or cross-validation
- Features: the dollar-yen "open", "close", "high", and "low"
- Training data and test data split 8:2
- Result on the test data: accuracy of about 50% (i.e., it can't really predict)

That's the gist. The prediction accuracy is not high because the main purpose there is to show how to apply machine learning (a decision tree) to FX, rather than to predict seriously.

What I want to do in this article

- Apply the same method as the page above in my own environment
- Use the latest daily data, up to the end of 2019
- Find out how many days of data give the best accuracy (500 days? 200 days?)
- See whether adding more features improves the accuracy
- Find the optimal decision tree parameters with grid search
- Perform cross-validation properly

That's the plan. About adding features: the page above trains on hundreds of days of data, but the features used for the actual prediction are only four, the "open", "close", "high", and "low". In other words, whether tomorrow's dollar-yen rises or falls is decided from the candlestick of "that day" alone. In practice, however, traders often also use moving averages (the average over the past n days) and technical indicators such as Bollinger Bands and MACD. So this time, in addition to the four features above, I would like to add __"the mean and standard deviation of the closing price over 5, 25, 50, and 75 days"__ and __"the open, close, high, and low of each of the last 3 days"__. These values are related to the technical indicators just mentioned. Not sure what I'm getting at? The idea is that when predicting tomorrow's dollar-yen, it should help to look not only at today's price movement but also at the averages over the past days to weeks and how much they varied. To put it in a rather confusing analogy, the original method is: "Today's dinner was my mother's curry rice. From past trends, the meal after curry is usually hamburger steak, so tomorrow will be hamburger steak!" What I want to do is: "Today's dinner was curry rice. Yesterday was meat, and last week there was a lot of Chinese food. Judging from today's menu and the trends of the last few weeks... tomorrow is hamburger steak!" Something like that. Which may be even harder to understand.

Implementing it in Jupyter

Preparation

As on the page above, I will hack things together in Python on Jupyter. The steps are almost the same as on that page, but I will go through them in order. First, import the required libraries.

import pandas as pd
import numpy as np
# Data visualization library
import matplotlib.pyplot as plt
# Machine learning library
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# For visualizing the decision tree
import graphviz
# For grid search and cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

The bottom two imports are needed for the grid search and cross-validation. Next, load the data. This time I prepared two years of CSV data, from 2017 to 2019. I trade with software called MT4 and exported the data it provides in CSV format.

# Read the CSV file: two years of data, 2017-2019
df = pd.read_csv('usd_jpy_api_2017_2019.csv')
# Check the last 5 rows
df.tail()

The last 5 rows look like this, showing each day's time, close, open, high, low, and volume.

[Screenshot: df.tail() output]

The steps so far follow the page introduced above, so I will move through them quickly; we attach a 0/1 label according to whether the closing price rises the next day.

# Difference between the next day's close and the same day's close
# shift(-1) moves close up one row, i.e. close+1 is the next day's close
df['close+1'] = df.close.shift(-1)
df['diff'] = df['close+1'] - df['close']
# close+1 is NaN on the last day, so drop that row
df = df[:-1]

Let's check the proportion of days that rose versus fell.

# Check the proportions of rising and falling days
m = len(df['close'])
# df['diff'] > 0 returns True/False for every row;
# df[df['diff'] > 0] keeps only the rows where it is True
print(len(df[(df['diff'] > 0)]) / m * 100)
print(len(df[(df['diff'] < 0)]) / m * 100)
52.16284987277354
47.837150127226465

Slightly more days rose than fell. Next, remove the unneeded columns and name the label column target.

--The word "target" corresponds to the correct class label required for learning. --1 is assigned if the closing price of the next day goes up, and 0 is assigned if it goes down.

# Turn the difference into a binary label: 1 if the next day's close rose, 0 if it fell
# (renaming diff alone would leave a float as the label)
df['target'] = (df['diff'] > 0).astype(int)
# Delete unnecessary columns
del df['diff']
del df['close+1']
del df['time']
# Reorder the columns
df = df[['target', 'volume', 'open', 'high', 'low', 'close']]
# Output the first 5 rows
df.head()
[Screenshot: df.head() output]

Addition of features

From here, we will calculate new features and add them.

# Moving averages over 5, 25, 50, and 75 days
# Also compute the std (carries the same information as Bollinger Bands)
# Keep 75 days of lagged closes
for i in range(1, 75):
    df['close-'+str(i)] = df.close.shift(+i)
# Compute the moving averages and stds; with skipna=False the result is
# NaN whenever any value in the window is NaN
nclose = 5  # column index of 'close' after the reordering above
df['MA5'] = df.iloc[:, np.arange(nclose, nclose+5)].mean(axis='columns', skipna=False)
df['MA25'] = df.iloc[:, np.arange(nclose, nclose+25)].mean(axis='columns', skipna=False)
df['MA50'] = df.iloc[:, np.arange(nclose, nclose+50)].mean(axis='columns', skipna=False)
df['MA75'] = df.iloc[:, np.arange(nclose, nclose+75)].mean(axis='columns', skipna=False)

df['STD5'] = df.iloc[:, np.arange(nclose, nclose+5)].std(axis='columns', skipna=False)
df['STD25'] = df.iloc[:, np.arange(nclose, nclose+25)].std(axis='columns', skipna=False)
df['STD50'] = df.iloc[:, np.arange(nclose, nclose+50)].std(axis='columns', skipna=False)
df['STD75'] = df.iloc[:, np.arange(nclose, nclose+75)].std(axis='columns', skipna=False)
# Delete the helper columns after the calculation
for i in range(1, 75):
    del df['close-'+str(i)]
# Day-over-day change of each moving average
# (tells whether the moving average is sloping up or down)
df['diff_MA5'] = df['MA5'] - df.MA5.shift(1)
df['diff_MA25'] = df['MA25'] - df.MA25.shift(1)
df['diff_MA50'] = df['MA50'] - df.MA50.shift(1)
df['diff_MA75'] = df['MA75'] - df.MA75.shift(1)  # (fixed: this originally recomputed MA50's diff)
# Add the open, close, high, and low of each of the past 3 days as features
for i in range(1, 4):
    df['close-'+str(i)] = df.close.shift(+i)
    df['open-'+str(i)] = df.open.shift(+i)
    df['high-'+str(i)] = df.high.shift(+i)
    df['low-'+str(i)] = df.low.shift(+i)
# Drop rows containing NaN
df = df.dropna()
# Decide how many days of data to use
nday = 500
df = df[-nday:]
#df.head()
df
[Screenshot: df output with the new feature columns]

The right side is cut off, but it looks like this.

- MA is the moving average; MA5, for example, is the average closing price over the 5 days up to and including that date
- close-n is the closing price n days earlier (likewise for open, high, low)
- STD is the standard deviation
- The diff_ columns hold each moving average's change from the previous day, added to show whether the average is pointing up or down

For now I chose 500 days of data. I could have restricted the period when the data was first loaded, but computing the 75-day average needs data going back 75 days, so I slice out the 500 days only after the calculation is complete. We now have 500 rows and 30 columns.
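As an aside, the same moving averages, standard deviations, and day-over-day changes can be computed more concisely with pandas' rolling windows. A minimal equivalent sketch (applied to the same df, before the nday slice):

# Equivalent, more idiomatic version using pandas rolling windows.
# rolling(n) covers the current row plus the previous n-1 rows, which
# matches the shift-based windows built above.
for n in [5, 25, 50, 75]:
    df['MA'+str(n)] = df['close'].rolling(n).mean()
    df['STD'+str(n)] = df['close'].rolling(n).std()
    df['diff_MA'+str(n)] = df['MA'+str(n)].diff()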

Decision tree learning

Now that the data is ready, let's split it into train and test sets for evaluation.

n = df.shape[0]
p = df.shape[1]
print(n, p)
# Split into training and test data in time order; do not shuffle
train_start = 0
train_end = int(np.floor(0.8*n))
test_start = train_end + 1  # note: this skips the one row between train and test
test_end = n
data_train = df.iloc[np.arange(train_start, train_end), :]
data_test = df.iloc[np.arange(test_start, test_end), :]
# Check the sizes of the training and test data
print(data_train.shape)
print(data_test.shape)

The data is split 8:2. (Because test_start skips one row, the test set has 99 rows rather than 100.)


(400, 30)
(99, 30)

Next, we separate off the label column and train a decision tree. The hyperparameter max_depth, which limits the depth of the tree, is set to 5 for now; a suitable value will be chosen by grid search shortly.

# Separate the target from the features
X_train = data_train.iloc[:, 1:]
y_train = data_train.iloc[:, 0]
X_test = data_test.iloc[:, 1:]
y_test = data_test.iloc[:, 0]
# Decision tree classifier (not yet fitted)
clf_2 = DecisionTreeClassifier(max_depth=5)

The decision tree finally makes its appearance. Let's run the grid search together with 10-fold cross-validation.


# Determine the optimal max_depth by grid search,
# with k = 10-fold cross-validation
params = {'max_depth': [2, 5, 10, 20]}

grid = GridSearchCV(estimator=clf_2,
                    param_grid=params,
                    cv=10,
                    scoring='roc_auc')
grid.fit(X_train, y_train)
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_['mean_test_score'][r],
             grid.cv_results_['std_test_score'][r] / 2.0,
             grid.cv_results_['params'][r]))
print('Best parameters: %s' % grid.best_params_)
print('Best ROC AUC: %.2f' % grid.best_score_)

The output looks like this. Note that with scoring='roc_auc' these are cross-validated ROC AUC scores rather than raw accuracy; the best, about 0.69, comes at a depth of 10.

0.630 +/- 0.05 {'max_depth': 2}
0.679 +/- 0.06 {'max_depth': 5}
0.690 +/- 0.06 {'max_depth': 10}
0.665 +/- 0.05 {'max_depth': 20}
Best parameters: {'max_depth': 10}
Best ROC AUC: 0.69
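Incidentally, cross_val_score was imported at the top but not used anywhere; if you want a cross-validated accuracy (rather than AUC) for the chosen depth, one way, as a sketch, is:

# Cross-validated accuracy for max_depth=10, as a sanity check
# against the AUC numbers above
scores = cross_val_score(DecisionTreeClassifier(max_depth=10),
                         X_train, y_train, cv=10, scoring='accuracy')
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))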

Evaluation with test data

The scores above are measured on the training data only, so let's predict on the test data.

# Refit using the best parameters found by the grid search
clf_2 = grid.best_estimator_
clf_2 = clf_2.fit(X_train, y_train)
clf_2

You can see that the parameters are set like this.

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

Since we've come this far, let's visualize the tree. With max_depth = 10 the whole thing is a mess, so here is just the top part. You can see that the first split is on the high price three days ago (whether it is above or below 82.068), and the next level splits on a threshold of the day's low. "gini" is the Gini impurity; splits are chosen so that it decreases. The value = line shows how many samples in that node rose and how many fell.

[Screenshot: top levels of the fitted decision tree]
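The article does not show the plotting code, but here is a minimal sketch of one way to draw such a figure with the graphviz imported at the top (the file name usdjpy_tree is arbitrary, and only the top levels are rendered to keep it readable):

from sklearn.tree import export_graphviz

# Export the top levels of the fitted tree to DOT and render a PNG
dot_data = export_graphviz(clf_2,
                           out_file=None,
                           feature_names=X_train.columns,
                           class_names=['down', 'up'],  # labels 0 and 1
                           filled=True,
                           max_depth=3)
graph = graphviz.Source(dot_data)
graph.render('usdjpy_tree', format='png')  # writes usdjpy_tree.png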

Let's check the accuracy on the test data.

pred_test_2 = clf_2.predict(X_test)
# Accuracy on the test data
accuracy_score(y_test, pred_test_2)
0.555555555555

Hmm... After all that work, this is what we get: about 55%.
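Since confusion_matrix was also imported at the top but never used, it can break that 55% down into hits and misses per class; a quick check:

# 2x2 confusion matrix: rows = actual (0=down, 1=up), columns = predicted
print(confusion_matrix(y_test, pred_test_2))

Next, let's also look at which features were important (feature importance).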

# Display the features in order of importance
importances = clf_2.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            df.columns[1+indices[f]],
                            importances[indices[f]]))

I went to the trouble of computing moving averages and the rest, but in the end the price action of the day itself or the last few days seems to matter most.

 1) low                            0.407248
 2) close                          0.184738
 3) low-1                          0.078743
 4) high-3                         0.069653
 5) high                           0.043982
 6) diff_MA5                       0.039119
 7) close-3                        0.035420
 8) STD50                          0.035032
 9) diff_MA25                      0.029473
10) MA75                           0.028125
11) MA50                           0.009830
12) open-3                         0.009540
13) STD25                          0.009159
14) low-3                          0.007632
15) high-1                         0.007632
16) volume                         0.004674
(features below this have an importance of 0 and are omitted)

Play around with various values

Let's summarize the conditions of this experiment once more:

- __Aggregation period__: the 500 days up to December 31, 2019
- __Classifier__: decision tree
- __Parameters__: max_depth = 10
- __Cross-validation__: 10-fold
- __Train/test split__: 8:2
- __Result__: test accuracy 0.55

The decision tree's parameters were already chosen by grid search, so instead I tried varying the aggregation period and the split ratio.

- Split ratio changed to 9:1 → accuracy: __61%__ (max_depth = 20)
- Aggregation period of 200 days → accuracy: __56%__ (max_depth = 10)

So the accuracy stays around 50-60%. The grid search was redone each time the conditions changed, and max_depth in the 10-20 range gave the best scores.
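For reference, a minimal sketch of how such reruns could be automated (run_experiment is a hypothetical helper; it assumes df still holds the full feature table before the nday slice, and exact numbers will depend on the split details):

# Hypothetical helper: rerun slice -> split -> grid search -> test accuracy
def run_experiment(df_feat, nday, train_ratio):
    data = df_feat[-nday:]
    n_train = int(np.floor(train_ratio * len(data)))
    X, y = data.iloc[:, 1:], data.iloc[:, 0]
    grid = GridSearchCV(DecisionTreeClassifier(),
                        {'max_depth': [2, 5, 10, 20]},
                        cv=10, scoring='roc_auc')
    grid.fit(X.iloc[:n_train], y.iloc[:n_train])
    acc = accuracy_score(y.iloc[n_train:],
                         grid.best_estimator_.predict(X.iloc[n_train:]))
    return grid.best_params_, acc

for nday, ratio in [(500, 0.8), (500, 0.9), (200, 0.8)]:
    print(nday, ratio, run_experiment(df, nday, ratio))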

What I learned this time

- Trying hard to predict USD/JPY from daily data with a decision tree yields an accuracy of roughly 50-60%.

What I want to do in the future

- Try classifiers other than a decision tree (logistic regression, kNN, etc.)
- Use ensemble methods (combine several classifiers)
- Perform proper dimensionality reduction (unnecessary this time, given the nature of decision trees)
- Think harder about the features (maybe there are better ones...??)

This got longer than I expected. The end.
