[PYTHON] Multivariate LSTM and data preprocessing in TensorFlow 2.x

Introduction

I haven't found many explanatory articles on implementing multivariate LSTMs for time series data in TensorFlow 2.x, so I decided to write one. This article is recommended for the following people.

The code and data used here are published in the GitHub repository. The code is in Jupyter Notebook format and can be run right away on Colab, so please refer to it (and please star the repository if you like it).

Tasks to be tackled this time

This time, I would like to **predict the average cloud cover** in Tokyo from meteorological data such as temperature.

Data

As the data used for forecasting, we downloaded the **average temperature, average relative humidity, total sunshine duration, average wind speed, and average cloud cover** for the past 10 years in Tokyo from the Japan Meteorological Agency website. This data is also available in the GitHub repository (https://github.com/ishikawa08/tf_multi_LSTM/tree/main/data).

The code to read the data is below.

#Library import
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import time
#Data read
df = pd.read_csv('data/tokyo_weather.csv')
df['date'] = pd.to_datetime(df['date'])
# df.set_index('date', inplace=True)  #Uncomment to use the date as the index
df

The actual data looks like the following. The values are all numerical. Next, we will preprocess this data.

ss 2020-12-28 20.40.38.png

Data preprocessing

When training an LSTM with multivariate data, the data must be in the format **[number of samples, look-back length, number of variables]** (**super important!**). Here, the look-back length means **"how many past time steps are treated as one sample"**. For example, in the figure below, the look-back length is **3** (because the data for 3 weeks is treated as one sample).

ss 2020-12-30 2.16.23.png
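As a minimal sketch of this layout, the toy example below builds windows with a look-back of 3 from a series with 2 variables and prints the resulting shapes. The array values and the names `series`, `windows`, and `targets` are made up for illustration and are not taken from the article's data.

```python
import numpy as np

# Hypothetical toy series: 6 time steps, 2 variables (values are arbitrary).
series = np.arange(12, dtype=float).reshape(6, 2)

look_back = 3  # each sample holds 3 consecutive time steps

# Collect every run of `look_back` consecutive rows that still has a
# following row to serve as its target.
windows = np.array(
    [series[i : i + look_back] for i in range(len(series) - look_back)]
)

# The target of each window is the row that immediately follows it.
targets = series[look_back:]

print(windows.shape)  # (3, 3, 2): samples, look-back length, variables
print(targets.shape)  # (3, 2)
```

With 6 time steps and a look-back of 3, only 3 windows have a following row to predict, which is why the number of samples is `len(series) - look_back`.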

The code that actually creates the dataset looks like this:

#Look-back length (number of past time steps per sample)
look_back = 25
#Number of samples
sample_size = len(df) - look_back
#Number of samples in the past (training) and future (prediction) periods
past_size = int(sample_size*0.8)
future_size = sample_size - past_size +1
#Function that creates the dataset
def make_dataset(raw_data, look_back=25):
    _X = []
    _y = []

    for i in range(len(raw_data) - look_back):
        _X.append(raw_data[i : i + look_back])
        _y.append(raw_data[i + look_back])
    _X = np.array(_X).reshape(len(_X), look_back, 1)
    _y = np.array(_y).reshape(len(_y), 1)

    return _X, _y
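The size bookkeeping above can be checked with a small worked example. The row count `n_rows` below is a made-up value (roughly ten years of daily records), not the actual length of the CSV, and `sample_index` is only a stand-in for the sample axis of the dataset. The sketch mirrors the slicing the article applies later, where the future part starts at `past_size - 1`.

```python
import numpy as np

# Hypothetical row count; NOT taken from the article's CSV.
n_rows = 3653
look_back = 25

sample_size = n_rows - look_back           # number of (window, target) pairs
past_size = int(sample_size * 0.8)         # first 80% of the samples
future_size = sample_size - past_size + 1  # remaining samples plus one shared sample

# Stand-in for the sample axis of X; only the indices matter here.
sample_index = np.arange(sample_size)
past_part = sample_index[:past_size]
future_part = sample_index[past_size - 1 :]  # starts at the last past sample

print(sample_size, past_size, future_size)  # 3628 2902 727
print(len(future_part) == future_size)      # True
print(past_part[-1] == future_part[0])      # True: one sample is shared
```

The `+ 1` in `future_size` accounts for the single sample that belongs to both parts, so the future part is not a strictly held-out set.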

Create the dataset using the function defined above, normalizing the data along the way. We did not use wind speed data this time.

from sklearn import preprocessing

columns = list(df.columns)
del columns[0]

#Normalize to a minimum of 0 and a maximum of 1
Xs = []
for i in range(len(columns)):
    Xs.append(preprocessing.minmax_scale(df[columns[i]]))
Xs = np.array(Xs)

#Create each numerical data
X_tmpr, y_tmpr = make_dataset(Xs[0], look_back=look_back)
X_humid, y_humid = make_dataset(Xs[1], look_back=look_back)
X_dlh, y_dlh = make_dataset(Xs[2], look_back=look_back)
X_prec, y_prec = make_dataset(Xs[3], look_back=look_back)
X_cloud, y_cloud = make_dataset(Xs[4], look_back=look_back)

#Combine each data to support multivariate LSTMs
X_con = np.concatenate([X_tmpr, X_humid, X_dlh, X_prec, X_cloud], axis=2)
X = X_con
y = y_cloud

#Divide the data into past (used for training) and future (used for future prediction)
X_past = X[:past_size]
X_future = X[past_size-1:]
y_past = y[:past_size]
y_future = y[past_size-1:]

#Define training data
X_train = X_past
y_train = y_past
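Because `minmax_scale` is applied to each full column, the minimum and maximum of the future period also influence the scaled training data. A minimal sketch of one alternative, assuming you want the scaling statistics to come from the training portion only, is shown below. The helper names `fit_minmax` and `apply_minmax` and the sample values are hypothetical, and where exactly to cut the raw rows so that the cut matches `past_size` is left out of this sketch.

```python
import numpy as np

def fit_minmax(train_values):
    """Return the (min, max) of the training portion only."""
    return float(np.min(train_values)), float(np.max(train_values))

def apply_minmax(values, vmin, vmax):
    """Scale values with previously fitted statistics.

    Values outside the fitted range map outside [0, 1].
    """
    return (np.asarray(values, dtype=float) - vmin) / (vmax - vmin)

# Hypothetical raw column (arbitrary numbers), split 80/20 in time order.
raw = np.array([3.0, 5.0, 4.0, 8.0, 6.0, 7.0, 2.0, 9.0, 10.0, 1.0])
split = int(len(raw) * 0.8)

vmin, vmax = fit_minmax(raw[:split])
scaled_train = apply_minmax(raw[:split], vmin, vmax)
scaled_future = apply_minmax(raw[split:], vmin, vmax)

print(scaled_train.min(), scaled_train.max())  # 0.0 1.0
print(scaled_future)  # values may fall below 0 or above 1
```

One trade-off of this choice is that future values can leave the [0, 1] range, which matters if the model's output activation cannot produce such values.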

Multivariate LSTM model

Now, let's actually train an LSTM on the created data and make predictions.

Creating a model

Build the LSTM model. Defining it in a function is also convenient when loading a trained model. If you are used to it, you can define it as a class instead.

import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD, Adam

#Function that creates the LSTM model
def create_LSTM_model():
    #Input shape is (look-back length, number of variables)
    inputs = Input(shape=(X_train.shape[1], X_train.shape[2]))
    #return_sequences=True passes the full sequence to the next LSTM layer
    x = LSTM(64, return_sequences=True)(inputs)
    x = BatchNormalization()(x)
    x = LSTM(64)(x)
    #ReLU keeps the output non-negative (the target is scaled to [0, 1])
    output = Dense(1, activation='relu')(x)
    model = Model(inputs, output)
    return model
model = create_LSTM_model()
model.summary()
model.compile(optimizer=Adam(learning_rate=0.0001), loss='mean_squared_error')
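To cross-check the output of `model.summary()`, the parameter counts can be worked out by hand. The sketch below assumes default Keras layer settings (biases enabled) and 5 input variables, matching the five arrays concatenated into `X`. The helper function names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def batchnorm_params(channels):
    # gamma and beta (trainable) plus moving mean and variance (non-trainable).
    return 4 * channels

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

n_features = 5  # assumed: the five variables concatenated into X

layer_counts = {
    "lstm_1": lstm_params(n_features, 64),  # 17920
    "batch_norm": batchnorm_params(64),     # 256
    "lstm_2": lstm_params(64, 64),          # 33024
    "dense": dense_params(64, 1),           # 65
}
total = sum(layer_counts.values())

print(layer_counts)
print(total)  # 51265
```

Note that the look-back length does not appear in any of these counts: the LSTM reuses the same weights at every time step, so changing `look_back` changes the input shape but not the number of parameters.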

Model learning

Now let's actually train the model.

history = model.fit(X_train, y_train, epochs=200, batch_size=64, verbose=1)

Confirmation of forecast results

Once training is finished, have the model make predictions.

predictions = model.predict(X_past)
future_predictions = model.predict(X_future)

Let's display the prediction result.

plt.figure(figsize=(18, 9))
plt.plot(df['date'][look_back:], y, color="b", label="true_cloud_cover")
plt.plot(df['date'][look_back:look_back + past_size], predictions, color="r", linestyle="dashed", label="prediction")
plt.plot(df['date'][-future_size:], future_predictions, color="g", linestyle="dashed", label="future_prediction")
plt.legend()
plt.show()

Below is a plot of the forecast results. The future prediction part (green line), which was not used for training, appears to follow the true values with reasonable accuracy. Let's leave it at that for this time.

ss 2020-12-30 2.39.13.png
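To put a number on "reasonable accuracy", error metrics can be computed alongside the plot. The sketch below is one possible way to do this with NumPy. The helper names `rmse` and `mae` and the demo arrays are hypothetical stand-ins for `y_future` and `future_predictions`, and the results are in the normalized 0 to 1 scale rather than the original cloud cover units.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; inputs are flattened to avoid shape mismatches."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error; inputs are flattened to avoid shape mismatches."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical values shaped (n, 1), like y_future and future_predictions.
y_true_demo = np.array([[0.2], [0.5], [0.9], [0.4]])
y_pred_demo = np.array([[0.3], [0.5], [0.7], [0.4]])

print(rmse(y_true_demo, y_pred_demo))
print(mae(y_true_demo, y_pred_demo))
```

In the article's setting, the calls would be `rmse(y_future, future_predictions)` and `mae(y_future, future_predictions)`. Flattening both arrays first avoids accidental broadcasting when one array has shape `(n,)` and the other `(n, 1)`.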

In conclusion

We implemented and explained a multivariate LSTM for time series data in TensorFlow 2.x. The training data and the test data are not fully separated in some places, but I hope it will be helpful.
