[PYTHON] I tried the common story of using Deep Learning to predict the Nikkei 225

Overview

This is the Nikkei Stock Average forecast that many people have tried. This time I tried it with a Random Forest, an MLP, and a CNN. As a disclaimer: I take no responsibility for any losses incurred if you actually buy or sell using this method.

Preface

A bit of theory about stocks

Basically, stock prices, and financial asset prices in general, follow a random walk: even if you know all the information up to one point in time, you essentially cannot predict the value at the next point in time. If you could, everyone would be set for life; no such convenient story exists.

On the other hand, anomalies do exist in the stock market: for example, the small-cap effect and the value stock effect, which cannot be explained theoretically. It has been confirmed that there are kinds of price fluctuations that deviate from theory. (The small-cap and value effects were confirmed as anomalies that could not be captured by CAPM, the asset pricing theory of the time, and this led to the creation of the Fama-French model.)

It is difficult to profit from an anomaly, because once it becomes widely known it gets priced into the market.

Deep learning and machine learning, in my personal understanding, aim to find such anomalies. To do so, they grind through calculations using data and methods that have not been used before.

Well, the expectation is something like: if you throw enough data at it, maybe something will come out of it?

A bit about actual stock trading

A typical stock forecast targets the return, that is, the rate of change from the previous day. In other words: "use the data up to a given day to predict whether the price goes up or down the next day."

However, actual trading is not so simple. There are practical problems, for example, by the time you observe the closing price used as input, you can no longer actually trade at that price.

Be aware that in real trading, these points often matter more than you would expect.

Also, when using overseas stock indexes for forecasting, you must fully account for the time difference. If you don't, you will end up using future data to make your predictions.

Model building

With the preface out of the way, let's build an actual model.

Data and forecast targets

The data is the daily closing prices of the 225 constituent stocks of the Nikkei Stock Average (the most recent constituent list). The prediction target is whether the Nikkei Stock Average's closing price on the next day rises or falls relative to the previous day; in other words, the label is whether the close-to-close return is positive or negative. The training data runs from 2000/01/11 to 2007/12/30, and the test data from then until the most recent date.
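As a minimal sketch of the label construction (the post doesn't show this code; `nikkei_close`, a pandas Series of daily Nikkei 225 closing prices indexed by date, is a hypothetical name):

    import pandas as pd

    # next-day close-to-close return of the Nikkei 225:
    # the value at date t is the return from t's close to t+1's close
    ret = nikkei_close.pct_change().shift(-1)
    y = (ret > 0).astype(int)  # label: 1 if the next day's close rises, else 0

    # chronological split as described above
    y_train = y["2000-01-11":"2007-12-30"]
    y_test = y["2007-12-31":]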

As noted in the preface, remember that a model which uses the closing price to predict the close-to-close return cannot be used in a real trade. This is really just an exercise in seeing whether the results differ by method.

Feature construction per model

Random forest

For the Random Forest, the features are laid out horizontally: we build a matrix with the previous day's closing-price return of each constituent stock in the column direction and the time points in the row direction.
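A sketch of what this could look like with pandas, assuming a hypothetical DataFrame `prices` of daily closing prices (rows = dates, columns = the 225 constituent stock codes):

    # previous day's closing-price return of every constituent:
    # rows = time points, columns = stocks
    x = prices.pct_change().dropna()

    x_train = x["2000-01-11":"2007-12-30"]
    x_test = x["2007-12-31":]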

Multilayer Perceptron (MLP)

It uses the same features as Random Forest.

Convolutional Neural Network (CNN)

A convolutional neural network needs an image-like input, that is, a 2D feature map, so we use a 4D tensor with a single channel. clm_dim and row_dim are the number of columns and rows of the 2D image at each time point: row_dim is the number of industries, and clm_dim is the maximum number of stocks in any one industry. The returns of each industry's stocks are embedded in that industry's row.

    import numpy as np

    clm_dim = max(industry_count["count"])  # max stocks in one industry -> image width
    row_dim = len(industry_count)           # number of industries -> image height
    l_sample = len(x_train)
    t_sample = len(x_test)
    print(row_dim, clm_dim)

    # 4D tensors: (samples, channel=1, industries, stocks per industry)
    x_train_mat = np.zeros((l_sample, 1, row_dim, clm_dim), dtype=np.float32)
    x_test_mat = np.zeros((t_sample, 1, row_dim, clm_dim), dtype=np.float32)

    for ind in industry_count["ind"]:
        # process one industry at a time
        ind_code_list = ind_data[ind_data["ind"] == ind]["code"]  # this industry's stock codes
        # row index of this industry in the 2D image
        row_idx = [i for i, ii in enumerate(industry_count["ind"]) if ii == ind]

        sample_idx = 0  # sample (time point) index
        for idx, row in x_train.iterrows():
            col_idx = 0  # column index within the industry's row
            for cc in ind_code_list:
                # embed each stock's return (a +/-1 binarized variant is also possible):
                # x_train_mat[sample_idx, 0, row_idx, col_idx] = 1. if row[str(cc)] > 0 else -1.
                x_train_mat[sample_idx, 0, row_idx, col_idx] = row[str(cc)]
                col_idx += 1
            sample_idx += 1

        sample_idx = 0
        for idx, row in x_test.iterrows():
            col_idx = 0
            for cc in ind_code_list:
                x_test_mat[sample_idx, 0, row_idx, col_idx] = row[str(cc)]
                col_idx += 1
            sample_idx += 1

Model parameters

Random forest

The number of decision trees is 200.
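With scikit-learn this amounts to something like the following (only the number of trees is given in the post; the other arguments are my assumptions):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    rf.fit(x_train, y_train)
    rf_pred = rf.predict(x_test)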

Multilayer Perceptron (MLP)

A three-layer multilayer perceptron (input, one hidden layer with 1000 nodes, output), trained for 100 epochs.
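A minimal Keras sketch of such a network (the post doesn't name the framework; the activation, optimizer, and batch size are my assumptions):

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    mlp = Sequential([
        Dense(1000, activation="relu", input_shape=(x_train.shape[1],)),
        Dense(1, activation="sigmoid"),  # P(next day's close rises)
    ])
    mlp.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    mlp.fit(x_train.values, y_train.values, epochs=100, batch_size=32)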

Convolutional Neural Network (CNN)

One convolution → average pooling stage, followed by one hidden layer with 1000 nodes. The filter is an asymmetric 2x3 filter, the pooling size is 1x2, and there are 30 output channels.
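A Keras sketch matching that description (again, the framework and training settings are my assumptions; Keras defaults to channels_last, so the channel axis is moved to the end first):

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

    # (N, 1, row_dim, clm_dim) -> (N, row_dim, clm_dim, 1)
    x_train_img = x_train_mat.transpose(0, 2, 3, 1)
    x_test_img = x_test_mat.transpose(0, 2, 3, 1)

    cnn = Sequential([
        Conv2D(30, (2, 3), activation="relu",  # 30 channels, asymmetric 2x3 filter
               input_shape=(row_dim, clm_dim, 1)),
        AveragePooling2D(pool_size=(1, 2)),    # 1x2 average pooling
        Flatten(),
        Dense(1000, activation="relu"),        # one hidden layer with 1000 nodes
        Dense(1, activation="sigmoid"),        # P(next day's close rises)
    ])
    cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    cnn.fit(x_train_img, y_train.values, epochs=100, batch_size=32)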

Results

Each model was evaluated with sklearn's classification_report and AUC.
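For a classifier with predict_proba (the Random Forest, say), that evaluation looks roughly like this:

    from sklearn.metrics import classification_report, roc_auc_score

    proba = rf.predict_proba(x_test)[:, 1]   # P(up) for each test day
    pred = (proba > 0.5).astype(int)
    print(classification_report(y_test, pred))
    print("AUC:", roc_auc_score(y_test, proba))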

Random forest

[Figure: classification report and AUC for the Random Forest (RF_code.png)]

Multilayer Perceptron (MLP)

[Figure: classification report and AUC for the MLP (MLP_code.png)]

Convolutional Neural Network (CNN)

[Figure: classification report and AUC for the CNN (CNN_code.png)]

The upshot is that the CNN performs slightly better than the others.

Looking at the CNN's accuracy by year, it was highest in 2013, during Abenomics. That makes sense, as it was a period with a clear trend.

[Figure: accuracy by year (正答率.png)]

Summary

Of the three methods on the same task, the CNN came out slightly ahead of the Random Forest and the MLP. That said, as stated in the preface, a model that predicts close-to-close returns from closing prices cannot be traded as-is.
