[Python] Time series analysis in practice: sales forecasting

Aidemy 2020/11/10

Introduction

Hello, this is Yope! I come from a liberal-arts background, but I was interested in the possibilities of AI, so I enrolled at the AI-focused school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people have read my previous summary articles. Thank you! This time, I will post about time series analysis in practice. Nice to meet you.

What you will learn this time ・Time series analysis using LSTM (sales forecasting)

Time series analysis using LSTM

About LSTM

(LSTM also appears in "Topic Extraction 1 of Japanese Text" and "Negative / Positive Analysis 3") ・LSTM is a type of RNN that __can retain data entered early in a sequence__; in other words, it is an RNN model capable of long-term memory.

Sales forecast procedure

・This time, we will use an LSTM to predict sales data. The procedure is the same as for other data forecasts: __① Data collection__ → __② Data processing__ → __③ Model creation__ → __④ Prediction / Evaluation__.

① Data collection

Data read

・This time, data collection means reading a __CSV file of champagne sales data__. ・In LSTM data prediction, only the __values of the time series__ are used, so only that column is acquired. Specifically, run __`pd.read_csv()`__, where __`usecols=[column]`__ specifies the column to extract and __`skipfooter=n`__ specifies the number of lines at the end of the file that should not be read. When using skipfooter, you must also specify __`engine='python'`__. ・After reading the file, use __`.values`__ to take out only the data, excluding the index and column labels. ・Finally, convert to float with __`.astype('float32')`__ to make the data suitable for LSTM analysis.

・Code ![Screenshot 2020-11-07 16.50.00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/0b331f86-594b-321d-55cd-62127694301c.png)

・Result (partial) ![Screenshot 2020-11-07 16.50.36.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/71ca1207-7f78-ad5a-3d25-42ce612eb76a.png)
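Since the article's code is only shown as a screenshot, here is a minimal sketch of the reading step. The real filename is not shown in the article, so a tiny inline CSV (with made-up values and dummy footer rows) stands in for the champagne-sales file:

```python
import io
import pandas as pd

# Stand-in for the champagne-sales CSV; the last two lines mimic
# footer rows that should not be read.
csv_text = """Month,Sales
1964-01,2815
1964-02,2672
1964-03,2755
1964-04,2721
Total,10963
source: Perrin Freres,
"""

# usecols keeps only the sales column; skipfooter drops the trailing
# junk rows, which requires engine='python'.
df = pd.read_csv(io.StringIO(csv_text), usecols=[1],
                 engine='python', skipfooter=2)

# .values strips the index/column labels; astype('float32') gives the
# dtype the LSTM analysis expects.
dataset = df.values.astype('float32')
print(dataset.shape)  # (4, 1)
```

With a real file you would pass its path to `pd.read_csv()` instead of the `StringIO` object.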

② Data processing

Creating a dataset

・Here we __format the data into a form that can be used for analysis__. ・First, the data is split into training data and test data. However, __time series data also carries information such as periodicity and continuity__, so if the data is split randomly, that information is lost and the data is corrupted; therefore, do not split it randomly. ・This time, the original data is split so that the first 2/3 is for training and the remaining 1/3 is for testing. ・In the code, first determine __the length at which to split__. Since we split at 2/3 of the whole, which is 0.67 as a fraction, this is expressed as __`int(len(dataset) * 0.67)`__. ・Based on this length, divide the data into "train" and "test". If the split length is __`train_size`__, train is obtained with __`dataset[0:train_size, :]`__ and test with __`dataset[train_size:len(dataset), :]`__.

・Code ![Screenshot 2020-11-07 17.14.30.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/ddc6b5eb-9caa-db2f-a8b5-03564d2d3293.png)
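The split described above can be sketched as follows (a dummy series stands in for the sales data, since the screenshot values are not reproducible here):

```python
import numpy as np

# Dummy series standing in for the sales data.
dataset = np.arange(12, dtype='float32').reshape(-1, 1)

# First two thirds for training, the rest for testing. The order is
# preserved, so the periodicity/continuity of the series survives.
train_size = int(len(dataset) * 0.67)
train = dataset[0:train_size, :]
test = dataset[train_size:len(dataset), :]
print(len(train), len(test))  # 8 4
```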

Data scaling

・(Review) __Scaling__ is a process that aligns data with different units and ranges to a common standard, __equalizing the degree of influence of each value__. ・Scaling methods include __"normalization"__ and __"standardization"__. Normalization rescales to __minimum 0, maximum 1__; standardization rescales to __mean 0, standard deviation 1__. ・If scaling is performed with the test data included, the model fits the data too closely and prediction accuracy drops, so __perform scaling based only on the training data__. ・For normalization, use __`MinMaxScaler(feature_range=(0, 1))`__. Define the scaling parameters from the training data with __`fit(train)`__, then apply them with __`transform(train)`__ and __`transform(test)`__ respectively.

・Code ![Screenshot 2020-11-07 20.19.18.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/84b9fd35-956c-5a5d-59c9-cc0ef1cf5bbd.png)

・Result (partial) ![Screenshot 2020-11-07 20.19.00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/2d9134e0-0479-69d0-2e6a-9d954233d3eb.png)
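A small sketch of fit-on-train-only scaling (the values are made up; note how a test value outside the training range can exceed 1, which is exactly why fitting on the training data alone avoids leakage):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[100.0], [200.0], [300.0]])
test = np.array([[250.0], [400.0]])

# Fit on the training data only, so no information from the test set
# leaks into the scaling parameters.
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)
train_scaled = scaler.transform(train)  # 0.0, 0.5, 1.0
test_scaled = scaler.transform(test)    # 0.75, 1.5 (can exceed 1)
```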

Creation of input data and correct label

・In this forecast, we perform __"prediction of the next data point"__ using __"multiple data points before a reference point"__. The input data is therefore __the data from the reference point back to n points before it__, and the correct label is __the data at the point following the reference point__. ・When creating the dataset, the process of acquiring "the reference point and the data from that point back to n points before" is repeated, so first create a __function__ that defines this iterative process, then pass it the training data and test data created in the previous section to create the __input data__ and __correct labels__. ・The code is as follows; its meaning is explained below.

・Code: Screenshot 2020-11-07 20.34.05.png

・The __`create_dataset`__ function takes as arguments the data `dataset` and __`look_back`__, which specifies how many points before the reference point (n) to include. First, prepare __`data_X`__ to store the input data and __`data_Y`__ to store the correct labels. ・__`for i in range(look_back, len(dataset))`__ iterates over the position of __"the point one step after the reference point"__; in other words, the reference point is __`i-1`__. Into data_X we store __`dataset[i-look_back:i, 0]`__, i.e. the data from i-look_back up to (but not including) i, through the reference point; into data_Y we store __`dataset[i, 0]`__, the data of the day after the reference point. Finally, return both as __np.array arrays__.

・This time, __the 3 previous data points form 1 set__, that is, __`look_back=3`__. Passing the training data and look_back to create_dataset as in the code above creates __`train_X`__ and __`train_Y`__; passing the test data in the same way creates __`test_X`__ and __`test_Y`__.
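The function described above can be sketched like this (a simple ascending series stands in for the data so the windows are easy to check by eye):

```python
import numpy as np

def create_dataset(dataset, look_back=1):
    """Split a series into (look_back-long inputs, next-value labels)."""
    data_X, data_Y = [], []
    for i in range(look_back, len(dataset)):
        # the look_back values up to and including the reference point i-1 ...
        data_X.append(dataset[i - look_back:i, 0])
        # ... predict the value at point i (the day after the reference point)
        data_Y.append(dataset[i, 0])
    return np.array(data_X), np.array(data_Y)

series = np.arange(10, dtype='float32').reshape(-1, 1)
train_X, train_Y = create_dataset(series, look_back=3)
print(train_X.shape, train_Y.shape)  # (7, 3) (7,)
print(train_X[0], train_Y[0])        # [0. 1. 2.] 3.0
```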

Formatting input data

・Although the input data and correct labels were created in the previous section, the __input data is not yet in a format the LSTM can analyze, so we format it__. ・Specifically, the input data is converted to a 3-D shape of __"number of samples × number of elements per set × number of features"__. The number of samples is the total number of data sets, the number of elements per set is how many values one set contains (look_back), and the number of features is the number of data types handled (here 1, the sales value).

・The dimension conversion is performed with __`reshape()`__. The shape passed as the argument is __`(array.shape[0], array.shape[1], 1)`__.

・Code ![Screenshot 2020-11-07 21.39.44.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/e4d55aa7-e26e-23fe-9604-46ac2b47dc79.png)

・Result (part of train_X) ![Screenshot 2020-11-07 21.40.14.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/ff4ed61b-bf29-550b-9fa9-478a8142b4be.png)
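As a sketch of the reshape step (dummy values; 7 samples with look_back=3):

```python
import numpy as np

# 7 samples, each a set of look_back=3 values (dummy data).
train_X = np.arange(21, dtype='float32').reshape(7, 3)

# The LSTM expects 3-D input: (samples, time steps, features).
# With a single sales series, the feature dimension is 1.
train_X = train_X.reshape(train_X.shape[0], train_X.shape[1], 1)
print(train_X.shape)  # (7, 3, 1)
```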

③ Model creation

Creating and training LSTM models

・This time, we build a __Sequential model__, so an LSTM layer can be added with __`model.add(LSTM())`__. Specify the number of __LSTM blocks__, and create an input layer of length look_back and dimension 1 with __`input_shape=(look_back, 1)`__. ・The output is the next day's data, i.e. a single value, so __`Dense(1)`__ is sufficient for the output layer. ・Next, compile the model, __setting the loss function `loss` and the optimizer__. Since this is a regression model, we use the __mean squared error `'mean_squared_error'`__ as the loss function (the function that measures the error between the correct labels and the output). There are various optimization algorithms (methods that change the weights so that the loss function decreases), and __which works best cannot be known until you actually try them__.

・Training is done with __`model.fit()`__. Pass epochs (the number of training passes) and batch_size (the size into which the data is divided) as arguments. ・After that, __tune__ by changing the number of layers and the number of epochs. When adding an LSTM layer after another LSTM layer, write __`return_sequences=True`__ in the arguments of the preceding layer.

・Code: Screenshot 2020-11-07 23.04.50.png

・Result (partial) ![Screenshot 2020-11-07 23.05.16.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/dd32ab50-c35b-c3ac-2ec6-1c4c29ade57f.png)
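A minimal sketch of the model-building and training step. The screenshot's exact hyperparameters are not visible, so the block count (4), epochs, batch_size, and the `adam` optimizer here are illustrative choices, and random dummy data stands in for the real series:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

look_back = 3

# One LSTM layer (4 blocks is an arbitrary small choice) feeding a
# single-unit Dense output for next-day sales.
# To stack a second LSTM layer, the earlier one would need
# return_sequences=True.
model = Sequential()
model.add(LSTM(4, input_shape=(look_back, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# Tiny dummy data just to show the fit call; real epochs/batch_size
# need tuning.
train_X = np.random.rand(10, look_back, 1).astype('float32')
train_Y = np.random.rand(10).astype('float32')
model.fit(train_X, train_Y, epochs=2, batch_size=1, verbose=0)
```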

④ Forecast / evaluation

・Finally, make predictions and evaluate the model. ・Use __`model.predict()`__ for prediction, passing it train_X and test_X. ・Next, we evaluate the prediction results, but in order to __evaluate them correctly, the scaled data must be restored to its original scale__. We scaled with __`transform()`__; this time we use __`inverse_transform()`__, which does the opposite.

・At scaling time, the scaling parameters were defined from the training data with __`fit(train)`__, and the same definition is used for inverse_transform. If this fitted scaler is `scaler_train`, then `train_predict`, which stores the predicted values for the training data, and `train_Y`, which stores the correct labels, are restored as follows.

・Code: Screenshot 2020-11-07 23.18.04.png
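The un-scaling step can be sketched like this (made-up values; the point is that the *same* fitted scaler is reused):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Scaler fitted on the training data, as in the scaling section.
train = np.array([[100.0], [200.0], [300.0]])
scaler_train = MinMaxScaler(feature_range=(0, 1))
scaler_train.fit(train)

# Pretend these are scaled predictions and scaled labels.
train_predict = np.array([[0.5], [1.0]])
train_Y = np.array([[0.5], [1.0]])

# Undo the scaling with the scaler fitted on the training data.
train_predict = scaler_train.inverse_transform(train_predict)  # 200, 300
train_Y = scaler_train.inverse_transform(train_Y)              # 200, 300
```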

・Do the same for the test data. Once that is done, the next step is to __evaluate the data__. ・As the accuracy metric for the time series data, use the __root mean squared error__, based on the mean squared error seen earlier. This metric shows how far the predicted values are from the correct answers; the __closer to 0, the higher the accuracy__.

・In code, write __`math.sqrt(mean_squared_error(correct labels, predicted values))`__. ・In this case, the correct labels are stored in __row 0 of train_Y__ and the predicted values in __column 0 (one per row) of train_predict__, so each is specified as in the code below.

・Code ![Screenshot 2020-11-07 23.30.36.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/31dd419b-64f1-2762-78f4-7d4d6492b454.png)

・Result ![Screenshot 2020-11-07 23.30.51.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/7c9dbf22-8174-9066-bb2b-a3327a6cf54d.png)
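A sketch of the RMSE computation with made-up values, showing the row/column indexing described above:

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error

# Labels restored with inverse_transform have shape (1, n): row 0.
train_Y = np.array([[200.0, 300.0, 250.0]])
# Predictions have shape (n, 1): column 0, one value per row.
train_predict = np.array([[210.0], [290.0], [255.0]])

# RMSE between labels and predictions; closer to 0 is better.
train_score = math.sqrt(mean_squared_error(train_Y[0], train_predict[:, 0]))
print(round(train_score, 2))  # 8.66
```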

Visualization of forecast results

・This time I just want to see the visualized result, so I will only post the code itself.

・Code ![Screenshot 2020-11-07 23.33.49.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/649132c4-c813-5a12-712e-f70aca646b64.png)

・Result ![Screenshot 2020-11-07 23.38.07.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/2cb856dc-b466-6366-cc1f-09db83c4395d.png)
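Since the plotting code is only shown as a screenshot, here is one common way to draw such a chart (dummy data; the key idea is padding the prediction array with NaN so the predicted segment lines up with its position in the original series):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

look_back = 3
dataset = np.arange(10, dtype='float32').reshape(-1, 1)
train_predict = np.arange(3, 7, dtype='float32').reshape(-1, 1)  # dummy predictions

# Shift the predictions so they align with the original series
# (the first look_back points have no prediction).
train_plot = np.empty_like(dataset)
train_plot[:, :] = np.nan
train_plot[look_back:look_back + len(train_predict), :] = train_predict

plt.plot(dataset, label='actual')
plt.plot(train_plot, label='train prediction')
plt.legend()
plt.savefig('forecast.png')
```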

Summary

・For time series analysis, it is advisable to build a model with an __LSTM__, which enables long-term memory. ・Even once data has been acquired, it cannot be used for analysis as-is, so it must be formatted: after scaling, split it into input data and correct labels before passing it to the model, and reshape the input data into three dimensions. ・The model is Sequential. After adding layers, compiling, and training, the model is complete. Then evaluate the model; at evaluation time, the scaled data must be restored to its original scale.

That's all for this time. Thank you for reading to the end.
