[PYTHON] Model Complexity and Robustness

Examining how model complexity affects the deterioration of model accuracy when light noise is added to the data.

I tend to think of the data used to build a model as one sample drawn from the population of possible past events, so it is quite conceivable that the sample that actually materialized was just a slightly blurred version of what could have happened, and that is something I want to worry about. If "thinking about the sample being blurred" amounts to "adding noise to the observed sample", then doing so gives a way to check a model's robustness, i.e. its dependence on the particular data, so I built the following.

Roughly speaking: I built a model without splitting the data into training and test sets (!), while in reality the data could have come out slightly different. The motivation was to see how much the model's accuracy deteriorates in that case.

If both the model and the noise are simple, the deterioration in prediction accuracy can probably be derived analytically, but by basing the check on simulation it may become possible to handle a wide variety of models, and noise that is neither i.i.d. nor normally distributed.
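As a minimal sketch of that idea (my own illustration, not part of the original code): the normal noise used later could be swapped for a drop-in generator producing, say, fat-tailed Student-t noise with AR(1)-style serial correlation. The function make_noise and its parameters are hypothetical names of mine:

import numpy as np
import pandas as pd

def make_noise(X, dist='normal', df_t=4, rho=0.0):
    # Noise shaped like X: dist='t' gives fat tails, rho > 0 adds
    # AR(1)-style serial correlation, so the noise is no longer i.i.d. normal.
    n, m = X.shape
    if dist == 't':
        raw = np.random.standard_t(df_t, size=(n, m))
    else:
        raw = np.random.randn(n, m)
    if rho != 0.0:
        for t in range(1, n):
            raw[t] = rho * raw[t - 1] + np.sqrt(1 - rho ** 2) * raw[t]
    # scale each column to the sample volatility of the corresponding column of X
    return pd.DataFrame(raw, index=X.index, columns=X.columns) * X.std()

In the simulation loop further down, X_r = X + 0.3 * make_noise(X, dist='t', rho=0.3) would then replace the normal-noise lines.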

Import etc.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import statsmodels.api as sm
# from statsmodels.tsa.arima_model import ARIMA
import yfinance as yf
from numpy.random import randn

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

import warnings

warnings.filterwarnings('ignore')

plt.style.use('seaborn-darkgrid')  # on newer matplotlib: 'seaborn-v0_8-darkgrid'
plt.rcParams['axes.xmargin'] = 0.01
plt.rcParams['axes.ymargin'] = 0.01

Get 'USDJPY' from yfinance and create weekly returns

# newer yfinance versions may need auto_adjust=False for 'Adj Close' to exist
ReadDF = yf.download('JPY=X', start="1995-01-01", end="2019-10-30")
ReadDF.index = pd.to_datetime(ReadDF.index)
IndexValueReadDF_rsmpl = ReadDF.resample('W').last()['Adj Close']  # weekly closing level
ReadDF = IndexValueReadDF_rsmpl / IndexValueReadDF_rsmpl.shift(1) - 1  # weekly return

Create the explanatory variables (X) and the explained variable (y)

mkt = 'USDJPY'

ret_df = pd.DataFrame()
ret_df[mkt] = ReadDF

# use the returns of the previous 1-4 weeks as explanatory variables
test = pd.DataFrame(ret_df[mkt])
for i in range(1, 5):
    test['i_' + str(i)] = test[mkt].shift(i)

test = test.dropna(axis=0)
X = test.iloc[:, 1:]
y = test[mkt]
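As a quick sanity check (my addition, not in the original post), the shapes and alignment of X and y can be confirmed:

print(X.shape, y.shape)          # same number of rows
print(list(X.columns))           # ['i_1', 'i_2', 'i_3', 'i_4']
print(X.index.equals(y.index))   # True: X and y are aligned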

Model creation (actual sample & noise-added samples)

m_ = 3  # hook to give the noise m_/10 = 0.3 times the data's volatility
output_degree = {}

for k in range(1, 7):
    polynomial_features = PolynomialFeatures(degree=k, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])

    # fit on the actual (in-sample) data
    pipeline.fit(X, y)

    k_sample = pd.DataFrame()

    for l in range(0, 300):

        # normal random noise, scaled to each column's volatility
        eps_0 = pd.DataFrame(randn(X.shape[0], X.shape[1]))
        eps_1 = eps_0.apply(lambda x: x * list(X.std()), axis=1)

        eps_1.columns = X.columns
        eps_1.index = X.index

        # add the noise to the original data
        X_r = X + m_/10 * eps_1

        # trade on the sign of the prediction made from the noisy data
        signal = pd.DataFrame()
        signal[mkt] = np.sign(pd.DataFrame(pipeline.predict(X_r)))[0]

        signal.index = y.index

        k_sample['s_' + str(l)] = (pd.DataFrame(signal[mkt]) * pd.DataFrame(y)).iloc[:, 0]

    # in-sample signal from the unperturbed data
    signal_IS = pd.DataFrame()
    signal_IS[mkt] = np.sign(pd.DataFrame(pipeline.predict(X)))[0]

    signal_IS.index = y.index

    k_sample['IS'] = (pd.DataFrame(signal_IS[mkt]) * pd.DataFrame(y)).iloc[:, 0]
    k_sample[mkt] = pd.DataFrame(y).iloc[:, 0]

    output_degree['degree_' + str(k)] = k_sample
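Each entry of output_degree is a DataFrame with 300 noisy-sample strategy-return columns plus the in-sample ('IS') and buy-and-hold (mkt) columns. Before plotting, a compact per-degree summary (a sketch of mine, using the same annualized Sharpe-like statistic as the histograms below) can be printed:

for name, df in output_degree.items():
    noisy = df.drop(['IS', mkt], axis=1)
    sr = noisy.mean() * 50 / (noisy.std() * np.sqrt(50))  # annualized, ~50 weeks/year
    sr_is = df['IS'].mean() * 50 / (df['IS'].std() * np.sqrt(50))
    print('%s  noisy mean: %.2f  IS: %.2f' % (name, sr.mean(), sr_is))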

Output histograms (one per degree k)


fig = plt.figure(figsize=(15, 7), dpi=80)
for k in range(1, 7):
    ax = fig.add_subplot(2, 3, k)

    for_stats = output_degree['degree_' + str(k)]
    # annualized Sharpe-like ratio, assuming roughly 50 weeks per year
    Performance = pd.DataFrame(for_stats.mean() * 50 / (for_stats.std() * np.sqrt(50))).T

    ax.hist(Performance.drop(['IS', mkt], axis=1), bins=30, color="dodgerblue", alpha=0.8)
    ax.axvline(x=float(Performance.drop(['IS', mkt], axis=1).mean(axis=1)), color="b")
    ax.axvline(x=float(Performance['IS']), color="tomato")
    ax.axvline(x=float(Performance[mkt]), color="gray")
    ax.set_ylim([0, 40])
    ax.set_xlim([-0.3, 2.5])
    ax.set_title('degree-' + str(k) + ' polynomial, noise scale 0.' + str(m_))
plt.show()

Result

Model_Complexity.png

With data this simple the result is unsurprising, but this kind of check seems useful for verifying a model after building it.

Addendum 1

In the above, model complexity was varied through degree = k of PolynomialFeatures. As a variation, fix degree = 3, use RandomForestRegressor as the model, and vary complexity through max_depth = k:

    from sklearn.ensemble import RandomForestRegressor

    pipeline = Pipeline([("polynomial_features", PolynomialFeatures(degree=3, include_bias=False)),
                         ("rf_regression", RandomForestRegressor(max_depth=k))])

Result

Model_Complexity_RF.png

Addendum 2

Returning to the example where model complexity is degree = k of PolynomialFeatures: this time additional explanatory variables are added, and, in the spirit of systematically reining in the urge to over-build the model, variable selection is performed in the regression step by using Lasso:

    from sklearn.linear_model import LassoCV

    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("lasso_regression", LassoCV(cv=5))])

Model_Complexity_LassoCV.png
