[PYTHON] Prediction of Nikkei 225 with PyTorch

I'm a graduate student who has recently been studying PyTorch.

I started studying neural networks with dreams of forecasting stock prices and horse races.

It has finally taken shape, so I'm writing this up as a memorandum. I hope it will be helpful for beginners like me.

References (quoted articles and books)

Overview of LSTM networks: an article about the structure and inner workings of LSTM.

I want to get started with LSTM with pytorch ... don't you want?: mainly about how to implement an LSTM in PyTorch.

FX Forecast: Time Series Data Forecast with PyTorch LSTM: very helpful, since it describes data collection, preprocessing, backtesting, and the actual training in detail.

Deep learning stock price forecast model_1 with 67% accuracy: this one predicts stock prices rather than FX, and it makes the next-day prediction I wanted, so I used it as a reference. It is implemented in Keras, but I will implement it in PyTorch.

[Everyone's Python, 4th edition](https://www.amazon.co.jp/%E3%81%BF%E3%82%93%E3%81%AA%E3%81%AEPython-%E7%AC%AC4%E7%89%88-%E6%9F%B4%E7%94%B0-%E6%B7%B3/dp/479738946X/ref=sr_1_1?__mk_ja_JP=%E3%82%AB%E3%82%BF%E3%82%AB%E3%83%8A&keywords=%E3%81%BF%E3%82%93%E3%81%AA%E3%81%AEpython&qid=1580280858&sr=8-1): the book I used to look up basic Python usage.

Data collection

The data was collected using HYPER SBI, which can be downloaded from SBI SECURITIES. Deep learning stock price forecast model_1 with 67% accuracy trained on 20 years of stock prices from 423 listed companies. So, as a first step, I trained a model to predict only whether the Nikkei average will rise the next day. The data covers 20 years.

Overview

We collected 20 years of data and split it into train and test sets for training. ~~The test accuracy was 62.5%. However, there were problems: the loss did not converge, and the F1 score computed from the confusion matrix became 0 during training.~~

Eventually the accuracy settled at 60.5%.

Definition of constants

main.py


import glob
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import torch.nn.functional as F
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(1)

future_num = 1 #How many days ahead to predict
feature_num = 7 #7 features: 'Open price', 'High price', 'Low price', 'closing price', '5-day average', '25-day average', '75-day average'
batch_size = 128

time_steps = 30 #lstm timesteps
moving_average_num = 30 #Number of days to take a moving average
n_epocs = 50

lstm_hidden_dim = 16
target_dim = 1

Data reading

For how to collect the data, see the SBI official website's FAQ. I selected the index and downloaded 20 years of Nikkei 225 data.

main.py


path = "./data/nikkei_heikin.csv"

model_name = "./models/nikkei.mdl"

#Data load: df is the working DataFrame; dt is a second copy used below for the moving average
flist = glob.glob(path)
for file in flist:
    df = pd.read_csv(file, header=0, encoding='cp932')
    dt = pd.read_csv(file, header=0, encoding='cp932')


#Indices that split the data into train / validation / test
val_idx_from = 3500
test_idx_from = 4000

future_price = df.iloc[future_num:]['closing price'].values
curr_price = df.iloc[:-future_num]['closing price'].values

#Treat the ratio of the price future_num days ahead to the current price as the basis for the label
y_data_tmp = future_price / curr_price
#Prepare an array for the correct labels
y_data = np.zeros_like(y_data_tmp)

#Label is 1 ("correct") if the price future_num days ahead is higher than the current price

for i in range(len(y_data_tmp)):
    if y_data_tmp[i] > 1.0:
        y_data[i] = 1
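#(Aside, not in the original: the loop above is equivalent to the
# one-liner below)
#y_data = (y_data_tmp > 1.0).astype(y_data.dtype)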

#The moving-average normalization below leaves blank (NaN) data in the first moving_average_num rows, so remove that part.
y_data = y_data[moving_average_num:]

#Price normalization
#Column names to normalize
cols = ['Open price', 'High price','Low price','closing price','5-day average','25-day average','75-day average']
#Volume is excluded because its data was corrupted.

for col in cols:
    dt[col] = df[col].rolling(window=25, min_periods=25).mean()
    df[col] = df[col] / dt[col] - 1
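#Worked example of the normalization above (made-up numbers): if today's
#closing price is 21000 and its 25-day moving average is 20000, the feature
#becomes 21000 / 20000 - 1 = 0.05, i.e. 5% above the recent average.
#window=25 leaves 24 leading NaNs, and dropping moving_average_num (30) rows
#below removes them all.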
    

X_data = df.iloc[moving_average_num:-future_num][cols].values

#Split the data and convert to torch tensors
#Training data
X_train = torch.tensor(X_data[:val_idx_from], dtype=torch.float, device=device)
y_train = torch.tensor(y_data[:val_idx_from], dtype=torch.float, device=device)
#Validation data
X_val   = torch.tensor(X_data[val_idx_from:test_idx_from], dtype=torch.float, device=device)
y_val   = y_data[val_idx_from:test_idx_from]
#Test data
X_test  = torch.tensor(X_data[test_idx_from:], dtype=torch.float, device=device)
y_test  = y_data[test_idx_from:]

The original dataset had about 4,500 rows: 3,500 for training, 500 for validation, and the remainder for testing.
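As a sanity check on the split (my addition, not part of the original script), the tensor shapes can be printed; the exact test size depends on the number of rows in the CSV:


#Optional sanity check of the chronological split
print('train:', X_train.shape, ' val:', X_val.shape, ' test:', X_test.shape)
assert X_train.size(0) == val_idx_from
assert X_val.size(0) == test_idx_from - val_idx_from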

Model definition

main.py


class LSTMClassifier(nn.Module):
    def __init__(self, lstm_input_dim, lstm_hidden_dim, target_dim):
        super(LSTMClassifier, self).__init__()
        self.input_dim = lstm_input_dim
        self.hidden_dim = lstm_hidden_dim
        self.lstm = nn.LSTM(input_size=lstm_input_dim, 
                            hidden_size=lstm_hidden_dim,
                            num_layers=1, #default
                            #dropout=0.2,
                            batch_first=True
                            )
        self.dense = nn.Linear(lstm_hidden_dim, target_dim)

    def forward(self, X_input):
        #nn.LSTM returns (output, (h_n, c_n)); keep only the state tuple
        _, lstm_out = self.lstm(X_input)
        #Use only the final hidden state h_n (= lstm_out[0]) of the LSTM.
        linear_out = self.dense(lstm_out[0].view(X_input.size(0), -1))
        return torch.sigmoid(linear_out)

This is the model definition. The LSTM itself comes from PyTorch's built-in nn.LSTM.
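If the tuple unpacking in forward looks opaque, this standalone snippet (my own illustration, not from the referenced articles) shows what nn.LSTM returns with batch_first=True, and why lstm_out[0], that is h_n, is reshaped to (batch, hidden) before the Linear layer:


import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=7, hidden_size=16, num_layers=1, batch_first=True)
x = torch.randn(128, 30, 7)       #(batch_size, time_steps, feature_num)
output, (h_n, c_n) = lstm(x)
print(output.shape)               #torch.Size([128, 30, 16]) all time steps
print(h_n.shape)                  #torch.Size([1, 128, 16])  final hidden state
print(h_n.view(128, -1).shape)    #torch.Size([128, 16])     input to nn.Linear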

main.py


def prepare_data(batch_idx, time_steps, X_data, feature_num, device):
    feats = torch.zeros((len(batch_idx), time_steps, feature_num), dtype=torch.float, device=device)
    for b_i, b_idx in enumerate(batch_idx):
        #Store the past time_steps (= 30) days as one input sequence.
        b_slc = slice(b_idx + 1 - time_steps, b_idx + 1)
        feats[b_i, :, :] = X_data[b_slc, :]

    return feats

The prepare_data function gathers the 30-day windows of data that are fed into the LSTM.
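For example, with dummy data (a hypothetical illustration, not in the original), a batch of two indices yields a tensor of shape (2, 30, 7), where each row holds the 30 days up to and including its index:


#Hypothetical usage with dummy data
dummy = torch.arange(100 * feature_num, dtype=torch.float, device=device).view(100, feature_num)
feats = prepare_data(np.array([40, 41]), time_steps, dummy, feature_num, device)
print(feats.shape)   #torch.Size([2, 30, 7])
#feats[0] holds dummy rows 11..40, feats[1] holds rows 12..41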

Learning and evaluation

main.py


#Learning
model = LSTMClassifier(feature_num, lstm_hidden_dim, target_dim).to(device)
loss_function = nn.BCELoss()
optimizer= optim.Adam(model.parameters(), lr=1e-4)


train_size = X_train.size(0)
best_acc_score = 0

for epoch in range(n_epocs):
    #Shuffle the training indices; skip the first time_steps rows, which lack enough history.
    perm_idx = np.random.permutation(np.arange(time_steps, train_size))
    for t_i in range(0, len(perm_idx), batch_size):
        batch_idx = perm_idx[t_i:(t_i + batch_size)]
        #Preparation of time series data for LSTM input
        feats = prepare_data(batch_idx, time_steps, X_train, feature_num, device)
        y_target = y_train[batch_idx]
        model.zero_grad()
        train_scores = model(feats) #input: batch_size x time_steps x feature_num -> output: batch_size x 1
        loss = loss_function(train_scores, y_target.view(-1, 1))
        loss.backward()
        optimizer.step()

    #Evaluation on the validation data (the printed loss is from the last training batch)
    print('EPOCH: ', str(epoch), ' loss :', loss.item())
    with torch.no_grad():
        feats_val = prepare_data(np.arange(time_steps, X_val.size(0)), time_steps, X_val, feature_num, device)
        val_scores = model(feats_val)
        tmp_scores = val_scores.view(-1).to('cpu').numpy()
        bi_scores = np.round(tmp_scores)
        acc_score = accuracy_score(y_val[time_steps:], bi_scores)
        roc_score = roc_auc_score(y_val[time_steps:], tmp_scores)
        f1_scores = f1_score(y_val[time_steps:], bi_scores)
        print('Val ACC Score :', acc_score, ' ROC AUC Score :', roc_score, 'f1 Score :', f1_scores)

    #Save the model whenever the validation accuracy improves
    if acc_score > best_acc_score:
        best_acc_score = acc_score
        torch.save(model.state_dict(), model_name)
        print('best score updated, Pytorch model was saved!!')

#Predict with the best model.
model.load_state_dict(torch.load(model_name))

with torch.no_grad():
    feats_test = prepare_data(np.arange(time_steps, X_test.size(0)), time_steps, X_test, feature_num, device)
    val_scores = model(feats_test)
    tmp_scores = val_scores.view(-1).to('cpu').numpy()   
    bi_scores = np.round(tmp_scores)
    acc_score = accuracy_score(y_test[time_steps:], bi_scores)
    roc_score = roc_auc_score(y_test[time_steps:], tmp_scores)
    f1_scores = f1_score(y_test[time_steps:], bi_scores)
    print('Test ACC Score :', acc_score, ' ROC AUC Score :', roc_score, 'f1 Score :', f1_scores)

During training, the accuracy, ROC AUC, and F1 scores are printed each epoch, and the weights are saved whenever the best validation accuracy is updated.
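To see what an F1 score of 0 (visible in the results below) actually means, a confusion matrix is useful; this is a diagnostic I added for illustration, not part of the original script. If the model predicts only the negative class, the positive column collapses to zero, so precision, recall, and F1 for class 1 are all zero:


from sklearn.metrics import confusion_matrix

#e.g. right after bi_scores is computed in the validation block:
cm = confusion_matrix(y_val[time_steps:], bi_scores)
print(cm)   #rows = true class 0/1, columns = predicted class 0/1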

Results

EPOCH:  0  loss : 0.7389694452285767
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.5448111497752646 f1 Score : 0.653295128939828
best score updated, Pytorch model was saved!!
EPOCH:  1  loss : 0.6844338178634644
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.5550601710888793 f1 Score : 0.653295128939828
EPOCH:  2  loss : 0.7206816673278809
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.5678012179208352 f1 Score : 0.653295128939828
EPOCH:  3  loss : 0.7066923975944519
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.5815934464259822 f1 Score : 0.653295128939828
EPOCH:  4  loss : 0.7148252129554749
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.6025717703349283 f1 Score : 0.653295128939828
EPOCH:  5  loss : 0.6946689486503601
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.6224264172828766 f1 Score : 0.653295128939828
EPOCH:  6  loss : 0.7018400430679321
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.639100333478324 f1 Score : 0.653295128939828
EPOCH:  7  loss : 0.7006129026412964
.
.
.
.
EPOCH:  43  loss : 0.7038401961326599
Val ACC Score : 0.5148936170212766  ROC AUC Score : 0.6018921270117442 f1 Score : 0.0
EPOCH:  44  loss : 0.6951379179954529
Val ACC Score : 0.5148936170212766  ROC AUC Score : 0.6018921270117443 f1 Score : 0.0
EPOCH:  45  loss : 0.6788191795349121
Val ACC Score : 0.5148936170212766  ROC AUC Score : 0.6018921270117443 f1 Score : 0.0
EPOCH:  46  loss : 0.6547065377235413
Val ACC Score : 0.5148936170212766  ROC AUC Score : 0.6018558793678411 f1 Score : 0.0
EPOCH:  47  loss : 0.6936472654342651
Val ACC Score : 0.5148936170212766  ROC AUC Score : 0.6016746411483254 f1 Score : 0.0
EPOCH:  48  loss : 0.719009280204773
Val ACC Score : 0.5148936170212766  ROC AUC Score : 0.6016202696824707 f1 Score : 0.0
EPOCH:  49  loss : 0.6854437589645386
Val ACC Score : 0.5148936170212766  ROC AUC Score : 0.6014934029288096 f1 Score : 0.0
Test ACC Score : 0.6252100840336134  ROC AUC Score : 0.6860275646683414 f1 Score : 0.6915629322268327

Postscript

If the number of epochs is increased from 50 to 500, the loss reaches more reasonable values, so I'm posting those results as well.

EPOCH:  0  loss : 0.7389694452285767
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.5448111497752646 f1 Score : 0.653295128939828
best score updated, Pytorch model was saved!!
EPOCH:  1  loss : 0.6844338178634644
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.5550601710888793 f1 Score : 0.653295128939828
EPOCH:  2  loss : 0.7206816673278809
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.5678012179208352 f1 Score : 0.653295128939828
EPOCH:  3  loss : 0.7066923975944519
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.5815934464259822 f1 Score : 0.653295128939828
EPOCH:  4  loss : 0.7148252129554749
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.6025717703349283 f1 Score : 0.653295128939828
EPOCH:  5  loss : 0.6946689486503601
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.6224264172828766 f1 Score : 0.653295128939828
EPOCH:  6  loss : 0.7018400430679321
Val ACC Score : 0.4851063829787234  ROC AUC Score : 0.639100333478324 f1 Score : 0.653295128939828
EPOCH:  7  loss : 0.7006129026412964
.
.
.
.
EPOCH:  493  loss : 0.694491982460022
Val ACC Score : 0.6042553191489362  ROC AUC Score : 0.638157894736842 f1 Score : 0.5079365079365079
EPOCH:  494  loss : 0.6935185194015503
Val ACC Score : 0.6063829787234043  ROC AUC Score : 0.638157894736842 f1 Score : 0.5144356955380578
EPOCH:  495  loss : 0.6262539029121399
Val ACC Score : 0.5957446808510638  ROC AUC Score : 0.6370704654197477 f1 Score : 0.48087431693989063
EPOCH:  496  loss : 0.5570085644721985
Val ACC Score : 0.6085106382978723  ROC AUC Score : 0.6387559808612441 f1 Score : 0.5422885572139303
EPOCH:  497  loss : 0.6102970838546753
Val ACC Score : 0.6  ROC AUC Score : 0.6374691895026823 f1 Score : 0.4973262032085562
EPOCH:  498  loss : 0.6443783640861511
Val ACC Score : 0.6042553191489362  ROC AUC Score : 0.6395534290271132 f1 Score : 0.5633802816901409
EPOCH:  499  loss : 0.663628876209259
Val ACC Score : 0.6085106382978723  ROC AUC Score : 0.6380310279831811 f1 Score : 0.518324607329843
Test ACC Score : 0.6050420168067226  ROC AUC Score : 0.644737139882771 f1 Score : 0.660894660894661

Discussion of results

~~Looking at the training process, there are many epochs where the accuracy does not improve and the F1 score is 0, and the loss does not seem to converge. Even so, on the test data we recorded an accuracy of 62.5%.~~

By increasing the number of epochs, the ROC AUC and F1 scores became reasonable. However, the accuracy itself came down to 60.5%.

I would like to improve the accuracy by adding more data and more features, and by implementing the LSTM myself (a sketch of what that could look like follows below).
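As a preview of that last point, here is a minimal sketch of a hand-implemented LSTM cell following the standard gate equations (my own illustration, not a finished implementation):


#A minimal hand-rolled LSTM cell; illustrative only
class MyLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        #One linear layer computes all four gates at once: input, forget, cell, output
        self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=1))
        i, f, g, o = z.chunk(4, dim=1)
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_t = torch.sigmoid(o) * torch.tanh(c_t)
        return h_t, c_t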

Next time, I will try predicting stocks that rise by 3%, as was done in Deep learning stock price forecast model_1 with 67% accuracy.
