[PYTHON] Forecasting time series data with Simplex Projection

Introduction

Following up on the overview of EDM in my previous article, this post implements Simplex Projection in Python, both to review what I have learned and to attempt temporal extrapolation (forecasting beyond the observed data).

pyEDM, implemented by the laboratory of Sugihara et al., who proposed the method, does not support prediction outside the input DataFrame, so we implement it ourselves (Reference).

For this implementation, we refer to the blog by Professor Ushio of Kyoto University.

Simplex Projection

0. Data preparation

The [Tent Map](https://en.wikipedia.org/wiki/%E3%83%86%E3%83%B3%E3%83%88%E5%86%99%E5%83%8F) data from the pyEDM sample datasets is used.

!pip install pyEDM

import pyEDM
tentmap = pyEDM.sampleData['TentMap']

print(tentmap.head())
   Time  TentMap
0     1 -0.09920
1     2 -0.60130
2     3  0.79980
3     4 -0.79441
4     5  0.79800


1. Time-delay Embedding

The original dynamics are reconstructed from lags of the variable $y_t$ itself: $\{y_t, y_{t-\tau}, \ldots, y_{t-(E-1)\tau}\}$. In practice, other related variables can be used in addition to the target variable. For simplicity, this time we use a single variable with $E = 2, \tau = 1$. The reconstruction accuracy of the dynamics depends strongly on these parameters, so they should be estimated from the data at hand in advance.
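One common way to estimate $E$ in advance is to try several candidate dimensions and keep the one with the best one-step forecast skill. The sketch below is a minimal illustration under assumptions that do not come from this article: it uses a logistic-map series as stand-in data, and `embed` and `loo_simplex_skill` are hypothetical helper names. Each $E$ is scored by leave-one-out Simplex Projection, using the correlation between predicted and observed values.

```python
import numpy as np

def embed(x, E, tau=1):
    """Return rows [x_t, x_{t-tau}, ..., x_{t-(E-1)tau}] (hypothetical helper)."""
    start = (E - 1) * tau
    return np.column_stack([x[start - k * tau: len(x) - k * tau] for k in range(E)])

def loo_simplex_skill(x, E, tau=1):
    """Leave-one-out one-step forecast skill (Pearson correlation) for dimension E."""
    emb = embed(x, E, tau)
    lib = emb[:-1]       # embedded points that have a successor
    nxt = emb[1:, 0]     # value one step after each library point
    preds = np.empty(len(lib))
    for i in range(len(lib)):
        d = np.linalg.norm(lib - lib[i], axis=1)
        d[i] = np.inf                      # leave the point itself out
        nn = np.argsort(d)[:E + 1]         # E + 1 nearest neighbors
        w = np.exp(-d[nn] / max(d[nn[0]], 1e-12))
        preds[i] = np.sum(w * nxt[nn]) / np.sum(w)
    return np.corrcoef(preds, nxt)[0, 1]

# Stand-in data (assumption): a chaotic logistic map, not the TentMap sample
x = np.empty(300)
x[0] = 0.4
for t in range(1, len(x)):
    x[t] = 3.8 * x[t - 1] * (1.0 - x[t - 1])

skills = {E: loo_simplex_skill(x, E) for E in range(1, 6)}
best_E = max(skills, key=skills.get)
print(skills)
print("best E:", best_E)
```

The same loop could be run over candidate values of $\tau$ as well; which criterion to use for the final choice is a judgment call.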

pyEDM provides `Embed()` for this ([doc](https://sugiharalab.github.io/EDM_Documentation/edm_functions/#embed)).

Implementation

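import numpy as np
import pandas as pd

def time_delay_embedding(objective_df, target_var, E=2, tau=1):
    # Column for the unlagged target variable, y(t)
    return_df = pd.DataFrame(objective_df[target_var].values,
                             columns=[target_var + "(t-0)"],
                             index=objective_df.index)

    # Add the lagged columns y(t-tau), ..., y(t-(E-1)tau)
    for i in range(tau, E * tau, tau):
        colname = target_var + "(t-" + str(i) + ")"
        return_df[colname] = objective_df[target_var].shift(i)

    # Drop the leading rows that lack a complete set of lags
    return return_df.dropna()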

emb_df = time_delay_embedding(tentmap, "TentMap", E=2, tau=1)
print(emb_df.head())
   TentMap(t-0)  TentMap(t-1)
1      -0.60130      -0.09920
2       0.79980      -0.60130
3      -0.79441       0.79980
4       0.79800      -0.79441
5      -0.81954       0.79800
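To check the column layout for a larger $E$ and $\tau$, the following sketch builds the same kind of lagged table with plain pandas on a toy series. `lagged_frame` is a hypothetical helper written only for this illustration; it is not part of pyEDM.

```python
import pandas as pd

def lagged_frame(series, E, tau):
    """Build columns x(t-0), x(t-tau), ..., x(t-(E-1)tau) from one Series."""
    cols = {f"x(t-{k * tau})": series.shift(k * tau) for k in range(E)}
    return pd.DataFrame(cols).dropna()

# Toy series (made-up values) so that each lag is easy to read off
toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0])
emb_toy = lagged_frame(toy, E=3, tau=2)
print(emb_toy)
```

The first $(E-1)\tau = 4$ rows are dropped because they do not have a complete set of lags.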

2. Search for Nearest Neighbor

In this implementation, the latest value of the data at hand is used as the reference point, and we predict one time step ahead of it.

target_pt = emb_df.iloc[-1,:].values  # The latest value, used as the reference point for forecasting

diff = emb_df.values - target_pt  # Difference between the reference point and every point in the data
l2_distance = np.linalg.norm(diff, ord=2, axis=1)  # L2 norm of each difference
l2_distance = pd.Series(l2_distance, index=emb_df.index, name="l2")  # Convert to pandas.Series so it can be sorted

nearest_sort = l2_distance.iloc[:-1].sort_values(ascending=True)  # Sort in ascending order, excluding the last row (= the reference point)

print(nearest_sort.head(3))
index    distance
  124    0.003371
  177    0.018171
  163    0.018347
Name: l2
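The last row is excluded before sorting because the reference point lies at distance zero from itself: it would always rank first, and dividing by that zero distance in the weighting step would be undefined. A minimal sketch on made-up toy points (not the TentMap data) shows the difference:

```python
import numpy as np

# Toy embedded points (assumption: made-up values for illustration)
points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 2.0],
                   [0.1, 0.1],
                   [0.0, 0.05]])   # the last row plays the role of the reference point
target = points[-1]

dist = np.linalg.norm(points - target, axis=1)

order_all = np.argsort(dist)         # still includes the reference point itself
order_lib = np.argsort(dist[:-1])    # reference point excluded

print(order_all[0], dist[order_all[0]])  # the reference point itself, distance 0.0
print(order_lib[:3])                     # three nearest genuine neighbors
```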

3. Movement of Nearest Neighbor

Let $y_t$ be the reference point for the prediction and $\{y_{t_1}, y_{t_2}, y_{t_3}\}$ the extracted neighbor points. The point we want to predict, $y_{t+1}$, is calculated from $\{y_{t_1+1}, y_{t_2+1}, y_{t_3+1}\}$. Since the indices of the neighbors obtained in the previous section are {124, 177, 163}, we use the points corresponding to {125, 178, 164}.

There is no limit on the number of neighbors that can be used, but $E + 1$ neighbors are used for embedding dimension $E$.

knn = int(emb_df.shape[1] + 1)  # Embedding dimension + 1
nn_index = np.array(nearest_sort.iloc[:knn].index.tolist())
nn_future_index = nn_index + 1  # Points one time step after each neighbor

print(nn_index, "-->", nn_future_index)

nn_future_points = emb_df.loc[nn_future_index].values
print(nn_future_points)
[124 177 163] --> [125 178 164]

[[0.16743 0.91591]
 [0.15932 0.91998]
 [0.1335  0.93295]]

4. Calculation of prediction points

The weights $\{w_{t_1+1}, w_{t_2+1}, w_{t_3+1}\}$ for $\{y_{t_1+1}, y_{t_2+1}, y_{t_3+1}\}$ are calculated according to the distances between the reference point $y_t$ and $\{y_{t_1}, y_{t_2}, y_{t_3}\}$.

nn_distances = nearest_sort.loc[nn_index].values  # Distances from the reference point to yt1, yt2, yt3
nn_weights = np.exp(-nn_distances/nn_distances[0])  # Calculate the weights
total_weight = np.sum(nn_weights)

print(nn_distances)
print(nn_weights)
print(total_weight)
[0.00337083 0.01817143 0.01834696]
[0.36787944 0.00455838 0.00432709]
0.376764916711149
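Written as a formula, the weighting used in the code above is as follows, where $d_i$ denotes the L2 distance from the reference point to the $i$-th nearest neighbor (a symbol introduced here only for illustration):

```latex
w_{t_i+1} = \exp\!\left(-\frac{d_i}{d_1}\right), \qquad
d_i = \lVert y_t - y_{t_i} \rVert_2, \qquad
w_{total} = \sum_{i=1}^{3} w_{t_i+1}
```

The nearest neighbor therefore always receives the weight $e^{-1} \approx 0.3679$, which matches the first weight printed above.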

The prediction point $y_{t+1}$ is calculated using these weights. The formula is a simple weighted average: $y_{t+1} = \frac{w_{t_1+1} \times y_{t_1+1} + w_{t_2+1} \times y_{t_2+1} + w_{t_3+1} \times y_{t_3+1}}{w_{total}}$

forecast_point = list()
for yt in nn_future_points.T:
    forecast_point.append(np.sum((yt * nn_weights) / total_weight))

print(forecast_point)
[0.16694219792961462, 0.9161549438807427]
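As a cross-check, the same weighted average can be reproduced with `np.average`, using the neighbor points and weights printed earlier in this article. This is only a sketch for verification; the values are copied from the printed outputs, so the result agrees with the forecast only up to their rounding.

```python
import numpy as np

# Values copied from the printed outputs in this article
nn_future_points = np.array([[0.16743, 0.91591],
                             [0.15932, 0.91998],
                             [0.1335, 0.93295]])
nn_weights = np.array([0.36787944, 0.00455838, 0.00432709])

# np.average divides by the sum of the weights internally
forecast_point = np.average(nn_future_points, axis=0, weights=nn_weights)
print(forecast_point)
```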

Incidentally, the ground truth in this case is [0.16928 0.91498], so the forecast is fairly close.

Finally, here are plots of the calculation.

Overall view: TentMap_SP.png

Enlarged view of the reference point $y_t$ and its neighbors $\{y_{t_1}, y_{t_2}, y_{t_3}\}$: TentMap_SP_targetnn.png

Enlarged view of the predicted and true values: TentMap_SP_forecastnn.png

The complete function, combining the steps above, is as follows.

def forecast_simplex_projection(input_df=None, forecast_tp=1, knn=None):
    if input_df is None:
        raise ValueError("Invalid argument: input_df <pandas.DataFrame> is None")

    if knn is None:
        knn = int(input_df.shape[1] + 1)  # Embedding dimension + 1

    # Get the variable names from the input DataFrame
    input_cols = input_df.columns.tolist()

    # Copy the data, because forecasts are appended to the library below
    lib_df = input_df.copy()

    # Recursively run one-step-ahead prediction up to the step given by forecast_tp
    forecast_points = list()
    for pt in range(1, forecast_tp + 1):
        # Distance from the reference point to every point in the library
        target_pt = lib_df.iloc[-1, :].values
        lib_pt = lib_df.values

        diff = lib_pt - target_pt

        l2_distance = np.linalg.norm(diff, ord=2, axis=1)
        l2_distance = pd.Series(l2_distance, index=lib_df.index, name="l2")

        # Sort distances in ascending order, excluding the reference point itself
        nearest_sort = l2_distance.iloc[:-1].sort_values(ascending=True)

        # Indices of the nearest neighbors and of the points one step after them
        # (this assumes a consecutive integer index)
        nn_index = np.array(nearest_sort.iloc[:knn].index.tolist())
        nn_future_index = nn_index + 1
        nn_future_points = lib_df.loc[nn_future_index]

        # Weights from the neighbor distances (the offset avoids division by zero)
        nn_distances = nearest_sort.loc[nn_index].values
        nn_distances = nn_distances + 1.0e-6

        nn_weights = np.exp(-nn_distances / nn_distances[0])
        total_weight = np.sum(nn_weights)

        # Predicted value for each embedding dimension
        forecast_value = list()
        for yt in nn_future_points.values.T:
            forecast_value.append(np.sum((yt * nn_weights) / total_weight))

        # Append the prediction to the library
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
        forecast_points.append(forecast_value)
        new_row = pd.DataFrame([forecast_value], columns=input_cols)
        lib_df = pd.concat([lib_df, new_row], ignore_index=True)

    forecast_points = np.array(forecast_points)

    return forecast_points
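One way to sanity-check a recursive forecast like this is to hold out the last few points and compare them with the forecasts. The sketch below is a compact NumPy restatement of the same steps, written so that it runs on its own; `simplex_step` is a hypothetical name, and the logistic-map series is stand-in data rather than the TentMap sample.

```python
import numpy as np

def simplex_step(lib, knn=None):
    """One-step-ahead forecast of the embedded vector that follows lib[-1]."""
    knn = lib.shape[1] + 1 if knn is None else knn
    # Distances from the reference point (last row) to all earlier rows
    d = np.linalg.norm(lib[:-1] - lib[-1], axis=1) + 1.0e-6
    nn = np.argsort(d)[:knn]
    w = np.exp(-d[nn] / d[nn[0]])
    # Weighted average of the points one step after each neighbor
    return w @ lib[nn + 1] / w.sum()

# Stand-in data (assumption): logistic map embedded with E=2, tau=1
x = np.empty(400)
x[0] = 0.4
for t in range(1, len(x)):
    x[t] = 3.8 * x[t - 1] * (1.0 - x[t - 1])
emb = np.column_stack([x[1:], x[:-1]])   # columns: x(t-0), x(t-1)

horizon = 5
train, test = emb[:-horizon], emb[-horizon:]

lib = train.copy()
forecasts = []
for _ in range(horizon):
    nxt = simplex_step(lib)
    forecasts.append(nxt)
    lib = np.vstack([lib, nxt])   # feed the forecast back into the library
forecasts = np.array(forecasts)

print(np.abs(forecasts[:, 0] - test[:, 0]))  # absolute error at each step
```

Because each forecast is fed back in as the next reference point, errors tend to grow with the horizon on chaotic data, so the first few steps are the most informative.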

Conclusion

This time, I implemented Simplex Projection, the simplest of the EDM methods, in Python. Other methods include S-Map and Multiview Embedding. It is also possible to reconstruct the dynamics from multiple variables with different lags, and that choice affects prediction accuracy. I plan to take on these topics in future posts.
