To study recurrent neural networks, which are used to analyze time series data, I tried to see how well deep learning can reproduce periodic functions.
Here, we define the following three periodic functions: f, g, and h.
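import numpy as np

def f(t, freq=25):
    # Sine wave; note that `freq` is actually the period, in time steps
    return np.sin(2. * np.pi * t / freq)

def g(t, freq=25, amp=10, threshold=10):
    # Sine wave squashed through a sigmoid
    return 1 / (1 + np.exp(amp * np.sin(2 * np.pi * t / freq) + threshold))

def h(t, freqs=(11, 23, 31, 41, 53, 61, 71, 83, 97)):
    # Sum of sine waves whose periods are prime numbers
    value = np.zeros_like(t, dtype=float)
    for freq in freqs:
        value += f(t, freq)
    return value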
The function f is the sine function $ \sin $. Here, the period is set to $ 25 $.
%matplotlib inline
import matplotlib.pyplot as plt
total_time_length = 1000
times = np.linspace(0, total_time_length, total_time_length + 1)
plt.figure(figsize=(15, 6))
plt.plot(f(times))
plt.xticks(np.linspace(0, 1000, 11))
plt.grid()
The function g passes the sine function $ \sin $ through a sigmoid function to distort its shape. Again, the period is set to $ 25 $.
%matplotlib inline
import matplotlib.pyplot as plt
total_time_length = 1000
times = np.linspace(0, total_time_length, total_time_length + 1)
plt.figure(figsize=(15, 6))
plt.plot(g(times))
plt.xticks(np.linspace(0, 1000, 11))
plt.grid()
The function h is a sum of sine functions $ \sin $ whose periods are prime numbers.
%matplotlib inline
import matplotlib.pyplot as plt
total_time_length = 1000
times = np.linspace(0, total_time_length, total_time_length + 1)
plt.figure(figsize=(15, 6))
plt.plot(h(times))
plt.xticks(np.linspace(0, 1000, 11))
plt.grid()
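One consequence of using prime periods is worth spelling out: a sum of sine waves only repeats after the least common multiple of the individual periods, which for distinct primes is their product. The sketch below computes that value for the default periods of h. The helper name `common_period` is my own and is not part of the article's code.

```python
from functools import reduce
from math import gcd

def common_period(periods):
    """Smallest T that every period in `periods` divides (their LCM)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)

prime_periods = [11, 23, 31, 41, 53, 61, 71, 83, 97]  # default periods of h
T = common_period(prime_periods)

print(T)
print(T > 1000)  # compare with the 1000 steps plotted above
```

In other words, within the plotted range h never completes a full cycle, so a model has to learn the individual components rather than memorize one repeating shape.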
Can a recurrent neural network reproduce these functions?
Before using a recurrent neural network, let's apply the Fourier transform to the three functions above and take the reciprocal of each frequency to check the "period".
plt.figure(figsize=(6,6))
sp = np.fft.fft(f(times))
freq = np.fft.fftfreq(times.shape[-1])
plt.subplot(311)
plt.plot(1/freq, abs(sp.real) + abs(sp.imag), label="f")
plt.plot(1/freq, abs(sp.real))
plt.plot(1/freq, abs(sp.imag), alpha=0.5)
plt.legend()
plt.xlim([0, 150])
plt.xticks(np.linspace(0, 150, 16))
plt.grid()
sp = np.fft.fft(g(times))
freq = np.fft.fftfreq(times.shape[-1])
plt.subplot(312)
plt.plot(1/freq, abs(sp.real) + abs(sp.imag), label="g")
plt.plot(1/freq, abs(sp.real))
plt.plot(1/freq, abs(sp.imag), alpha=0.5)
plt.legend()
plt.xlim([0, 150])
plt.xticks(np.linspace(0, 150, 16))
plt.grid()
sp = np.fft.fft(h(times))
freq = np.fft.fftfreq(times.shape[-1])
plt.subplot(313)
plt.plot(1/freq, abs(sp.real) + abs(sp.imag), label="h")
plt.plot(1/freq, abs(sp.real))
plt.plot(1/freq, abs(sp.imag), alpha=0.5)
plt.legend()
plt.xlim([0, 150])
plt.xticks(np.linspace(0, 150, 16))
plt.grid()
The function f consists only of a sine wave with a period of $ 25 $. The function h consists of a combination of the specified prime-number periods. The function g, on the other hand, contains various components in addition to the specified period of $ 25 $.
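If you want a number rather than a plot, the strongest period can be read off the spectrum directly. The following is a minimal sketch, not the article's method: it uses the magnitude `np.abs(...)` of a real FFT instead of `abs(real) + abs(imag)`, and the helper name `dominant_period` is hypothetical.

```python
import numpy as np

def dominant_period(signal):
    """Return the period (in samples) of the strongest non-constant FFT component."""
    spectrum = np.abs(np.fft.rfft(signal))   # magnitude, independent of phase
    freqs = np.fft.rfftfreq(len(signal))     # frequencies in cycles per sample
    k = np.argmax(spectrum[1:]) + 1          # skip the zero-frequency bin
    return 1.0 / freqs[k]

t = np.arange(1000)                          # 1000 samples = 40 full cycles
period = dominant_period(np.sin(2 * np.pi * t / 25))
print(period)
```

Using a sample count that holds a whole number of cycles keeps the peak in a single frequency bin; with other lengths the peak spreads over neighboring bins and the estimate becomes approximate.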
Now let's create a time series dataset to train a recurrent neural network. The aim is a model that takes the outputs $ X_{t, t + w} $ from time $ t $ to $ t + w $ as input and predicts the output $ X_{t + w + 1} $ at the next time step.
import numpy as np
from sklearn.model_selection import train_test_split
func = f  # Change the function by rewriting this line
#func = g
#func = h
total_time_length = 10000  # Total number of time steps to handle
pred_length = 1000  # Number of time steps to predict
learning_time_length = 100  # Window length used as input for learning
time_series_T = np.linspace(0, total_time_length, total_time_length + 1)  # Time T (horizontal axis of the graph)
time_series_X = func(time_series_T)  # Function output X (vertical axis of the graph)
X_learn = []  # Stores the learning_time_length values of X starting at time t
Y_learn = []  # Stores the single value of X that follows each window
for i in range(total_time_length - learning_time_length):
    X_learn.append(time_series_X[i:i+learning_time_length].reshape(1, learning_time_length).T)
    Y_learn.append([time_series_X[i+learning_time_length]])
# Split into training data and validation data
# shuffle must be False for time series data
X_train, X_val, Y_train, Y_val = \
    train_test_split(X_learn, Y_learn, test_size=0.2, shuffle=False)
# Convert to the data types that scikit-learn expects
X_train2sklearn = [list(x.reshape(1, len(x))[0]) for x in X_train]
Y_train2sklearn = [y[0] for y in Y_train]
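To make the windowing concrete, here is a small sketch of the same idea on a toy series, so that the pairing of each input window with its "next value" is easy to check by eye. The function name `make_windows` is my own; it is an illustration, not a replacement for the code above.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (input window, next value) pairs."""
    series = np.asarray(series)
    n = len(series) - window
    X = np.stack([series[i:i + window] for i in range(n)])
    y = series[window:]
    return X, y

X, y = make_windows(np.arange(10), window=3)
print(X.shape, y.shape)  # (7, 3) (7,)
print(X[0], y[0])        # [0 1 2] 3
```

Each row of `X` is followed in the series by the matching entry of `y`, which is exactly the relationship the model is asked to learn.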
MLP (Multi-Layer Perceptron)
For comparison, we use the simplest deep learning model, the Multi-Layer Perceptron (MLP). I used scikit-learn's implementation because it is easy to use and fast.
%%time
from sklearn.neural_network import MLPRegressor
regressor = MLPRegressor(hidden_layer_sizes=(100, 100, 100),
early_stopping=True, max_iter=10000)
regressor.fit(X_train2sklearn, Y_train2sklearn)
CPU times: user 2.84 s, sys: 1.41 s, total: 4.25 s
Wall time: 2.22 s
The learning curve was drawn as follows.
plt.plot(regressor.loss_curve_)
%matplotlib inline
import matplotlib.pyplot as plt
plt.subplot(211)
plt.plot(regressor.loss_curve_, label='train_loss')
plt.legend()
plt.grid()
plt.subplot(212)
plt.plot(regressor.loss_curve_, label='train_loss')
plt.yscale('log')
plt.legend()
plt.grid()
Give the trained model the first part of the series (a window of about learning_time_length values) as input and let it predict the output at the next time step. Append the predicted value to the input and predict the output at the following time step, then repeat this pred_length times.
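%%time
pred_length = 1000
X_pred_length = np.linspace(0, pred_length, pred_length + 1)
Y_observed = func(X_pred_length)
# Seed with exactly learning_time_length observed values so that each
# prediction is appended at the time step it actually refers to
Y_pred = Y_observed[:learning_time_length]
for i in range(pred_length):
    X_ = [Y_pred[i:i+learning_time_length]]  # most recent window
    Y_ = regressor.predict(X_)
    Y_pred = np.append(Y_pred, Y_)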
CPU times: user 383 ms, sys: 279 ms, total: 662 ms
Wall time: 351 ms
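The feedback loop itself does not depend on the model, so it can be sketched separately. In the snippet below, `rollout` and `linear_step` are hypothetical names, and `linear_step` is only a toy stand-in for a trained model so that the result can be checked by hand.

```python
import numpy as np

def rollout(predict_next, seed, n_steps):
    """Autoregressive forecast: feed each prediction back in as input."""
    window = len(seed)
    values = list(seed)
    for _ in range(n_steps):
        recent = np.asarray(values[-window:])  # most recent `window` values
        values.append(float(predict_next(recent)))
    return np.asarray(values)

# Toy stand-in for a trained model: continue a straight line
def linear_step(w):
    return 2 * w[-1] - w[-2]

forecast = rollout(linear_step, seed=[0.0, 1.0, 2.0], n_steps=3)
print(forecast)  # [0. 1. 2. 3. 4. 5.]
```

With the scikit-learn model above, something like `lambda w: regressor.predict([w])[0]` could play the role of `predict_next`. Because every prediction becomes part of later inputs, small errors can accumulate over the rollout.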
The curve of the predicted value obtained in this way and the actual curve are illustrated and compared.
plt.figure(figsize=(36, 6))
times = np.linspace(0, Y_pred.shape[0] - 1, Y_pred.shape[0])
plt.plot(func(times), label="time series")
plt.plot(Y_pred, alpha=0.5, label="predicted")
plt.xticks(np.linspace(0, 1000, 11))
plt.xlim([0, 1000])
plt.grid()
plt.legend()
Up to time $ 100 $ the observed data is used as is, so the curves match; after time $ 100 $, you can see that the actual value (the value of the function f) and the predicted value gradually drift apart.
Let's compare the results of the Fourier transform.
plt.figure(figsize=(6,4))
sp = np.fft.fft(func(times))
freq = np.fft.fftfreq(times.shape[-1])
plt.subplot(211)
plt.plot(1/freq, abs(sp.real) + abs(sp.imag), label="observed")
plt.plot(1/freq, abs(sp.real))
plt.plot(1/freq, abs(sp.imag), alpha=0.5)
plt.legend()
plt.xlim([0, 150])
plt.xticks(np.linspace(0, 150, 16))
plt.grid()
sp = np.fft.fft(Y_pred)
freq = np.fft.fftfreq(times.shape[-1])
plt.subplot(212)
plt.plot(1/freq, abs(sp.real) + abs(sp.imag), label="predicted")
plt.plot(1/freq, abs(sp.real))
plt.plot(1/freq, abs(sp.imag), alpha=0.5)
plt.legend()
plt.xlim([0, 150])
plt.xticks(np.linspace(0, 150, 16))
plt.grid()
The predicted period is close to the actual period, but appears to be slightly off.
Now let's build a recurrent neural network. Here, PyTorch, a library for deep learning, is used.
First, set up the device. If you can use the GPU, use the GPU, and if not, use the CPU as follows.
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Prepare a function for early stopping. Early stopping ends training once it is judged that training will not improve any further.
In deep learning, the prediction error is called the loss. To prevent overfitting (also called overtraining), training is stopped when the loss on the validation data, which is not used for training, is judged to no longer decrease, rather than the loss on the training data. For this judgment we introduce the notion of patience. With patience = 20, the minimum loss over the most recent $ 20 $ epochs is compared with the minimum loss before that; if the former is no smaller, we judge that "it will not get any better" and stop training.
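def EarlyStopping(log, patience=20):
    # Not enough history yet to make a judgment
    if len(log) <= patience:
        return False
    min1 = log[:len(log)-patience].min()  # best loss before the last `patience` epochs
    min2 = log[len(log)-patience:].min()  # best loss within the last `patience` epochs
    # Stop if the recent epochs did not beat the earlier best
    return min1 <= min2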
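The patience rule is easiest to understand on a tiny loss history. The sketch below reimplements the same comparison for plain Python lists under the hypothetical name `should_stop`, with a short patience so the arithmetic can be followed by hand.

```python
def should_stop(losses, patience=20):
    """True when the last `patience` losses contain no new minimum."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before <= best_recent

still_improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
plateaued = [1.0, 0.5, 0.4, 0.45, 0.46, 0.47]

print(should_stop(still_improving, patience=3))  # False: 0.8 > 0.5
print(should_stop(plateaued, patience=3))        # True: 0.4 <= 0.45
```

Note that a noisy validation loss can trigger the rule by chance, which is one reason a larger patience such as 20 is a common choice.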
RNN (Recurrent Neural Network)
It's a bit confusing, but the term RNN (Recurrent Neural Network) is used both in a broad sense and in a narrow sense. An RNN in the narrow sense can be implemented in PyTorch as follows:
import torch

class RNN(torch.nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.l1 = torch.nn.RNN(1, hidden_dim,
                               nonlinearity='tanh',
                               batch_first=True)
        self.l2 = torch.nn.Linear(hidden_dim, 1)
        # Initialize input-to-hidden and hidden-to-hidden weights
        torch.nn.init.xavier_normal_(self.l1.weight_ih_l0)
        torch.nn.init.orthogonal_(self.l1.weight_hh_l0)

    def forward(self, x):
        h, _ = self.l1(x)      # h: (batch, time, hidden_dim)
        y = self.l2(h[:, -1])  # use the hidden state at the last time step
        return y
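What the recurrent layer computes can be written out in a few lines of NumPy: at every time step the hidden state is updated as tanh(W_ih x_t + W_hh h_{t-1} + b). The following is only a sketch with random, untrained weights; `elman_forward` and the weight names are my own, and the two bias vectors that PyTorch keeps are merged into a single `b` here.

```python
import numpy as np

def elman_forward(x, W_ih, W_hh, b):
    """Run a tanh RNN over x of shape (time, input_dim); return all hidden states."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_ih @ x_t + W_hh @ h + b)  # same update at every step
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
hidden_dim = 50
W_ih = rng.normal(size=(hidden_dim, 1))
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)
b = np.zeros(hidden_dim)

x = np.sin(2 * np.pi * np.arange(100) / 25).reshape(100, 1)
states = elman_forward(x, W_ih, W_hh, b)
print(states.shape)      # (100, 50)
print(states[-1].shape)  # (50,)
```

The last row, `states[-1]`, corresponds to `h[:, -1]` in the class above: a summary of the whole window that the linear layer turns into a single predicted value.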
LSTM (Long Short-Term Memory)
The name LSTM (Long Short-Term Memory) makes you want to ask "so is it long or short?!", but it is a kind of "RNN in the broad sense". It is said to be better at long-term memory than the "RNN in the narrow sense". You can implement it in PyTorch as follows:
import torch

class LSTM(torch.nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.l1 = torch.nn.LSTM(1, hidden_dim, batch_first=True)
        self.l2 = torch.nn.Linear(hidden_dim, 1)
        # Initialize input-to-hidden and hidden-to-hidden weights
        torch.nn.init.xavier_normal_(self.l1.weight_ih_l0)
        torch.nn.init.orthogonal_(self.l1.weight_hh_l0)

    def forward(self, x):
        h, _ = self.l1(x)      # h: (batch, time, hidden_dim)
        y = self.l2(h[:, -1])  # use the hidden state at the last time step
        return y
GRU (Gated Recurrent Unit)
The GRU here is the Gated Recurrent Unit, not the Glavnoye Razvedyvatelnoye Upravleniye (the Russian military intelligence agency). It is said to match or exceed the performance of an LSTM while requiring less computation time.
import torch

class GRU(torch.nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.l1 = torch.nn.GRU(1, hidden_dim, batch_first=True)
        self.l2 = torch.nn.Linear(hidden_dim, 1)
        # Initialize input-to-hidden and hidden-to-hidden weights
        torch.nn.init.xavier_normal_(self.l1.weight_ih_l0)
        torch.nn.init.orthogonal_(self.l1.weight_hh_l0)

    def forward(self, x):
        h, _ = self.l1(x)      # h: (batch, time, hidden_dim)
        y = self.l2(h[:, -1])  # use the hidden state at the last time step
        return y
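One rough way to see why the three layers differ in cost is to count their parameters. The sketch below assumes the layout PyTorch uses for a single recurrent layer, namely one input weight, one hidden weight, and two bias vectors per weight block, with 1 block for the plain RNN, 3 for the GRU, and 4 for the LSTM. The function name `recurrent_param_count` is hypothetical, and the final linear layer is not counted.

```python
def recurrent_param_count(input_dim, hidden_dim, n_blocks):
    """Parameters of one recurrent layer made of `n_blocks` weight blocks."""
    per_block = (hidden_dim * input_dim      # input-to-hidden weights
                 + hidden_dim * hidden_dim   # hidden-to-hidden weights
                 + 2 * hidden_dim)           # two bias vectors
    return n_blocks * per_block

# Weight blocks per layer: plain RNN 1, GRU 3, LSTM 4
for name, blocks in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(name, recurrent_param_count(1, 50, blocks))
# RNN 2650
# GRU 7950
# LSTM 10600
```

Parameter count is only a proxy for computation per step; it says nothing about how many epochs each model needs before early stopping.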
I trained the model as follows. The code below is for the RNN, but you can switch to the LSTM or GRU by rewriting one line.
%%time
from sklearn.utils import shuffle
model = RNN(50).to(device)  # Change the network by rewriting this line
#model = LSTM(50).to(device)
#model = GRU(50).to(device)
criterion = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), amsgrad=True)
epochs = 1000
batch_size = 100
n_batches_train = len(X_train) // batch_size - 1
n_batches_test = len(X_val) // batch_size - 1
hist = {'train_loss':[], 'val_loss':[]}
for epoch in range(epochs):
    train_loss = 0.
    val_loss = 0.
    X_, Y_ = shuffle(X_train, Y_train)
    # Training pass
    model.train()
    for batch in range(n_batches_train):
        start = batch * batch_size
        end = start + batch_size
        # Move each mini-batch to the same device as the model
        X = torch.Tensor(np.array(X_[start:end])).to(device)
        Y = torch.Tensor(np.array(Y_[start:end])).to(device)
        Y_pred = model(X)
        loss = criterion(Y_pred, Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    # Validation pass (no gradients needed)
    model.eval()
    with torch.no_grad():
        for batch in range(n_batches_test):
            start = batch * batch_size
            end = start + batch_size
            X = torch.Tensor(np.array(X_val[start:end])).to(device)
            Y = torch.Tensor(np.array(Y_val[start:end])).to(device)
            Y_pred = model(X)
            loss = criterion(Y_pred, Y)
            val_loss += loss.item()
    train_loss /= n_batches_train
    val_loss /= n_batches_test
    hist['train_loss'].append(train_loss)
    hist['val_loss'].append(val_loss)
    print("Epoch:", epoch + 1, "Train loss:", train_loss, "Val loss:", val_loss)
    if EarlyStopping(np.array(hist['val_loss'])):
        print("Early stopping at epoch", epoch + 1)
        break
Epoch: 1 Train loss: 0.024917872337913975 Val loss: 6.828008190495893e-05
Epoch: 2 Train loss: 3.570798083110742e-05 Val loss: 3.464157634880394e-05
Epoch: 3 Train loss: 2.720728588638639e-05 Val loss: 1.954806430148892e-05
...
Epoch: 580 Train loss: 5.4337909078255436e-08 Val loss: 4.0113718569045886e-08
Epoch: 581 Train loss: 6.47745306281422e-08 Val loss: 5.6099906942108646e-08
Epoch: 582 Train loss: 5.797503896896836e-08 Val loss: 1.620698952820021e-07
Early stopping at epoch 582
CPU times: user 26min 9s, sys: 21.6 s, total: 26min 31s
Wall time: 26min 39s
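One detail of the loop above is easy to miss: because the batch count is computed as `len(...) // batch_size - 1`, some samples are never visited in an epoch. The sketch below works out how many, assuming the 80/20 split of the 9900 windows comes out as exactly 7920 and 1980 samples; `batches_used` is a hypothetical helper name.

```python
def batches_used(n_samples, batch_size):
    """Batches visited per epoch by a loop over `n_samples // batch_size - 1`."""
    n_batches = n_samples // batch_size - 1
    skipped = n_samples - n_batches * batch_size
    return n_batches, skipped

n_windows = 10000 - 100   # windows built from the series
n_val = n_windows // 5    # 20 % held out (assumed exact 80/20 split)
n_train = n_windows - n_val

print(batches_used(n_train, 100))  # (78, 120)
print(batches_used(n_val, 100))    # (18, 180)
```

For the training data this matters little, since a different 120 samples are skipped after each shuffle. For the validation data the same last 180 windows are skipped every time, so the reported validation loss never covers them.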
Training is finished. Plot the learning curve.
%matplotlib inline
import matplotlib.pyplot as plt
plt.subplot(211)
plt.plot(hist['train_loss'], label='train_loss')
plt.plot(hist['val_loss'], label='val_loss')
plt.legend()
plt.grid()
plt.subplot(212)
plt.plot(hist['train_loss'], label='train_loss')
plt.plot(hist['val_loss'], label='val_loss')
plt.yscale('log')
plt.legend()
plt.grid()
Give the trained model the first part of the series (a window of about learning_time_length values) as input and let it predict the output at the next time step. Append the predicted value to the input and predict the output at the following time step, then repeat this pred_length times.
%%time
total_time_length = 10000
pred_length = 1000
learning_time_length = 100
X_pred_length = np.linspace(0, pred_length, pred_length + 1)
Y_observed = func(X_pred_length)
Y_pred = Y_observed[:learning_time_length+1]
model.eval()
for i in range(pred_length):
    X_ = Y_pred[i:i+learning_time_length+1].reshape(1, learning_time_length + 1, 1)
    # Predict on the model's device, then bring the result back to the CPU for NumPy
    Y_ = model(torch.Tensor(X_).to(device)).detach().cpu().numpy()
    Y_pred = np.append(Y_pred, Y_)
CPU times: user 2.54 s, sys: 5.97 ms, total: 2.55 s
Wall time: 2.55 s
The curve of the predicted value obtained in this way and the actual curve are illustrated and compared.
plt.figure(figsize=(36, 6))
times = np.linspace(0, Y_pred.shape[0] - 1, Y_pred.shape[0])
plt.plot(func(times), label="time series")
plt.plot(Y_pred, alpha=0.5, label="predicted")
plt.xticks(np.linspace(0, 1000, 11))
plt.xlim([0, 1000])
plt.grid()
plt.legend()
Up to time $ 100 $ the observed data is used as is, so it is natural that the curves match, but even after time $ 100 $ they appear to match almost perfectly.
Let's compare the results of the Fourier transform.
plt.figure(figsize=(6,4))
sp = np.fft.fft(func(times))
freq = np.fft.fftfreq(times.shape[-1])
plt.subplot(211)
plt.plot(1/freq, abs(sp.real) + abs(sp.imag), label="observed")
plt.plot(1/freq, abs(sp.real))
plt.plot(1/freq, abs(sp.imag), alpha=0.5)
plt.legend()
plt.xlim([0, 150])
plt.xticks(np.linspace(0, 150, 16))
plt.grid()
sp = np.fft.fft(Y_pred)
freq = np.fft.fftfreq(times.shape[-1])
plt.subplot(212)
plt.plot(1/freq, abs(sp.real) + abs(sp.imag), label="predicted")
plt.plot(1/freq, abs(sp.real))
plt.plot(1/freq, abs(sp.imag), alpha=0.5)
plt.legend()
plt.xlim([0, 150])
plt.xticks(np.linspace(0, 150, 16))
plt.grid()
It seems that the cycles are exactly the same.
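Visual agreement can be backed up with a number. The following sketch computes the root-mean-square error over the forecast part only, leaving out the observed values used as the seed. `forecast_rmse` is a hypothetical helper, and the shifted sine wave is only a toy example of a prediction that looks right but is one step late.

```python
import numpy as np

def forecast_rmse(observed, predicted, seed_length):
    """RMSE over the forecast part only, skipping the observed seed values."""
    n = min(len(observed), len(predicted))
    diff = np.asarray(observed[seed_length:n]) - np.asarray(predicted[seed_length:n])
    return float(np.sqrt(np.mean(diff ** 2)))

t = np.arange(300)
truth = np.sin(2 * np.pi * t / 25)
shifted = np.sin(2 * np.pi * (t - 1) / 25)  # same curve, one step late

print(forecast_rmse(truth, truth, seed_length=100))
print(forecast_rmse(truth, shifted, seed_length=100))
```

With the variables from the cells above, a call along the lines of `forecast_rmse(func(times), Y_pred, learning_time_length + 1)` would give one figure per model that can be compared across the MLP, RNN, LSTM, and GRU.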
As mentioned above, to train an LSTM, simply replace model = RNN(50).to(device) in the code above with model = LSTM(50).to(device).
Epoch: 1 Train loss: 0.24947839844315192 Val loss: 0.0037629783619195223
Epoch: 2 Train loss: 0.0010665786028720248 Val loss: 0.0004544752591755241
Epoch: 3 Train loss: 0.000281030429528656 Val loss: 0.00014765093510504812
...
Epoch: 397 Train loss: 1.9865108783006072e-08 Val loss: 1.99065262052045e-08
Epoch: 398 Train loss: 1.840841412067617e-08 Val loss: 1.814414751777349e-08
Epoch: 399 Train loss: 1.7767042196444784e-08 Val loss: 1.9604467382805524e-08
Early stopping at epoch 399
CPU times: user 48min 40s, sys: 51.2 s, total: 49min 31s
Wall time: 49min 41s
CPU times: user 7.67 s, sys: 14 ms, total: 7.68 s
Wall time: 7.69 s
It seems that this was also a perfect prediction.
As mentioned above, to train a GRU, simply replace model = RNN(50).to(device) in the code above with model = GRU(50).to(device).
Epoch: 1 Train loss: 0.2067998453276232 Val loss: 0.0007729934877716005
Epoch: 2 Train loss: 0.0005770771786979495 Val loss: 0.00023205751494970173
Epoch: 3 Train loss: 0.00018625847849015816 Val loss: 0.00014329736586660147
...
Epoch: 315 Train loss: 5.816128262764026e-09 Val loss: 5.750611098420677e-09
Epoch: 316 Train loss: 5.757192062114896e-09 Val loss: 5.7092033323158375e-09
Epoch: 317 Train loss: 5.780735246610847e-09 Val loss: 5.6715170337895415e-09
Early stopping at epoch 317
CPU times: user 34min 51s, sys: 42.1 s, total: 35min 33s
Wall time: 35min 40s
CPU times: user 8.81 s, sys: 7.04 ms, total: 8.81 s
Wall time: 8.82 s
It seems that this was also a perfect prediction.
Now that I know how to run the models, let's finally experiment with changing the shape and period of the function. The results are as follows.
[Figures: learning curve, prediction curve, and Fourier transform for each period]
Even as the period grows, the number of epochs before early stopping does not change much. The shape of the curve did not collapse significantly, but the period shifted. The height (amplitude) of the output was preserved when the period was short, but tended to shrink as the period became longer.
[Figures: learning curve, prediction curve, and Fourier transform for periods 25, 50, and 100]
The number of epochs before early stopping tended to decrease as the period increased. The predictions were good for the short period (25) and the medium period (50), but the curve lost its shape badly for the long period (100). In the long-period prediction, a sharp peak appeared in an unexpected place.
[Figures: learning curve, prediction curve, and Fourier transform for periods 25, 50, and 100]
The number of epochs before early stopping may decrease with longer periods, although the trend is unclear. Good predictions were made for all of the short period (25), medium period (50), and long period (100).
[Figures: learning curve, prediction curve, and Fourier transform for periods 25, 50, and 100]
I've heard that GRUs are faster than LSTMs, but the number of epochs before early stopping was rather large (reaching 1000 epochs). Good predictions were made for all of the short period (25), medium period (50), and long period (100).
[Figures: learning curve, prediction curve, and Fourier transform for each period]
The number of epochs before early stopping is almost unchanged when the period changes. The approximate shape is preserved, but the period tends to drift where the base of the curve is distorted and where the peaks are low. The Fourier transform shows that many peaks (periods) are picked up, although they are shifted.
[Figures: learning curve, prediction curve, and Fourier transform for periods 25, 50, and 100]
The prediction for the short period (25) is good. For the medium period (50), the main peaks are picked up, but they are shifted. For the long period (100), the RNN seems to have given up on prediction early. Of the many periods present, only the relatively short ones seem to be picked up.
[Figures: learning curve, prediction curve, and Fourier transform for each period]
Surprisingly, it did not work well for any of the periods. The short period worked well with the RNN, but the LSTM picked up a peak that should not be there. The medium period gave the best result, but the period was still off. For the long period, it looks as if the model has simply given up on prediction.
[Figures: learning curve, prediction curve, and Fourier transform for each period]
This is also surprising. The LSTM was worse than the RNN, and the GRU was worse still. The short period is a little better, but for the medium period the model seems to have given up.
At first, I was thinking of mixing more diverse cycles, but after seeing the above results, I thought it would be better to mix a little less.
[Figures: learning curve, prediction curve, and Fourier transform]
Picking up the periods seems to be relatively successful. However, the amplitude is not captured well, so the prediction has a large error.
[Figures: learning curve, prediction curve, and Fourier transform]
The periods are mostly picked up, but something is slightly off. The amplitude has increased for some reason.
[Figures: learning curve, prediction curve, and Fourier transform]
Some periods are picked up, but so are quite a few that are not actually present. In that sense, is the RNN a little better?
[Figures: learning curve, prediction curve, and Fourier transform]
It feels even worse than the LSTM.
This is the summary. As far as I've dealt with some periodic functions
I haven't studied this enough yet, so there may be some odd parts; I would appreciate it if you could point out anything you notice... (ヽ ´ω`)