Memorandum about validation

I keep forgetting the details of model validation, so here is a minimal set of things to remember. In this article, I cover the main validation methods such as the hold-out method and cross-validation, as well as how to handle time series data. Reference: [Kaggle Book](https://www.amazon.co.jp/dp/4297108437)

1. Preparation

1-1. Environment

The code in this article has been confirmed to work on Windows 10 with Python 3.7.3.

import platform

print(platform.platform())
print(platform.python_version())

1-2. Data set

Read regression and binary classification datasets from sklearn.datasets.

from sklearn import datasets
import numpy as np
import pandas as pd

#Regression dataset
boston = datasets.load_boston()
boston_X = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_y = pd.Series(boston.target)

#Binary classification dataset
cancer = datasets.load_breast_cancer()
cancer_X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_y = pd.Series(cancer.target)

2. hold-out method

This is the simplest and most straightforward method: hold some of the data out for validation and train the model on the rest. You often hear "split it 7:3", but when you think about it, the right ratio depends on the amount of data, so there is no fixed rule. Since validation data cannot be used for training, if the amount of data is small you should consider cross-validation (described later), and conversely, if the data is huge, the hold-out method is sufficient. In principle, shuffle the data before splitting. However, do not shuffle time series data: when trying to predict the future from past information, shuffling risks leaking future information into training. I made that mistake once. Below is code that splits the data 3:1 and evaluates with the coefficient of determination. Evaluation metrics such as the coefficient of determination were summarized in a previous article.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

tr_x, va_x, tr_y, va_y = train_test_split(boston_X, boston_y, test_size=0.25, random_state=2020, shuffle=True)
slr = LinearRegression()
slr.fit(tr_x, tr_y)
va_pred = slr.predict(va_x)
score = r2_score(va_y, va_pred)
print(score)

0.732147337324218

3. Cross validation

This is a method of repeating the hold-out method multiple times. The image is a rocket pencil (a stackable cartridge pencil): the data is divided into blocks, and each block takes its turn as the validation data while the model is trained on the remaining blocks. Since validation data cannot be used for training in the hold-out method, cross-validation is often chosen when the amount of data is small. As the number of blocks (the number of folds) increases, the training data grows, but so does the computation time. This also depends on the amount of data, but around 4 or 5 folds is common. To evaluate the accuracy (generalization performance) of the model, either look at the average score over the folds, or compute the score once more from the out-of-fold predictions of all folds. (Figure: KFold.png) Below is the cross-validation code. It is not as simple as the hold-out method, so it is easy to forget.

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

i = 0
scores = []
kf = KFold(n_splits=4, shuffle=True, random_state=2020)
for tr_idx, va_idx in kf.split(boston_X):
    i += 1
    tr_x, va_x = boston_X.iloc[tr_idx], boston_X.iloc[va_idx]
    tr_y, va_y = boston_y.iloc[tr_idx], boston_y.iloc[va_idx]
    slr = LinearRegression()
    slr.fit(tr_x, tr_y)
    va_pred = slr.predict(va_x)
    score = mean_absolute_error(va_y, va_pred)
    print('fold{}: {:.2f}'.format(i, score))
    scores.append(score)

print(np.mean(scores))

fold1: 3.34
fold2: 3.39
fold3: 3.89
fold4: 3.02
3.4098095699116184
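
The code above averages the per-fold scores. As a minimal sketch of the other option mentioned earlier, scoring the out-of-fold predictions of all folds at once (reusing the kf object defined above) might look like this:

# Out-of-fold (OOF) evaluation: collect the validation predictions of every fold
# and compute a single score on the full set of out-of-fold predictions.
oof_pred = pd.Series(np.nan, index=boston_y.index)
for tr_idx, va_idx in kf.split(boston_X):
    tr_x, va_x = boston_X.iloc[tr_idx], boston_X.iloc[va_idx]
    tr_y = boston_y.iloc[tr_idx]
    slr = LinearRegression()
    slr.fit(tr_x, tr_y)
    oof_pred.iloc[va_idx] = slr.predict(va_x)

print(mean_absolute_error(boston_y, oof_pred))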

Validation is now complete, but we are left with as many models as there are folds. They need to be consolidated somehow; there are two typical options (a minimal sketch follows the list below).

- Average the predictions of the models from each fold.
- Retrain the model on the entire data.
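
As a minimal sketch of the two options, reusing the boston data and the kf object defined above (some_new_x below is only a stand-in for unseen data):

# (1) Keep the model of every fold and average their predictions on new data.
models = []
for tr_idx, va_idx in kf.split(boston_X):
    m = LinearRegression()
    m.fit(boston_X.iloc[tr_idx], boston_y.iloc[tr_idx])
    models.append(m)
some_new_x = boston_X.iloc[:5]  # stand-in for unseen data
avg_pred = np.mean([m.predict(some_new_x) for m in models], axis=0)

# (2) Retrain a single model on the entire data.
final_model = LinearRegression()
final_model.fit(boston_X, boston_y)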

It does not matter much which one you use, but in practice you would typically save and operate the single model from the latter option.

4. stratified k-fold

This is a method used for classification tasks. For example, in a task of classifying samples as negative or positive, if the number of positives is extremely small, a purely random split may leave no positives at all in the validation data. This motivates stratified sampling, which keeps the proportion of classes in each fold equal. Conversely, if the class ratio is balanced, there is not much to worry about. (Figure: StratifiedKFold.png)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Also possible with the hold-out method:
# tr_x, va_x, tr_y, va_y = train_test_split(cancer_X, cancer_y, test_size=0.25, random_state=2020, shuffle=True, stratify=cancer_y)

i = 0
scores = []
kf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2020)
for tr_idx, va_idx in kf.split(cancer_X, cancer_y):
    i += 1
    tr_x, va_x = cancer_X.iloc[tr_idx], cancer_X.iloc[va_idx]
    tr_y, va_y = cancer_y.iloc[tr_idx], cancer_y.iloc[va_idx]
    lr = LogisticRegression(solver='liblinear')
    lr.fit(tr_x, tr_y)
    va_pred = lr.predict_proba(va_x)[:, 1]
    score = log_loss(va_y, va_pred)
    print('fold{}: {:.2f}'.format(i, score))
    scores.append(score)

print(np.mean(scores))

fold1: 0.11
fold2: 0.17
fold3: 0.09
fold4: 0.07
0.11030074372001544

5. Other validation

Besides these, there are methods I have not used myself but will note down here.

5-1. group k-fold

A method of splitting the data by a variable that represents a group. For example, in a task such as learning from each customer's purchase history and scoring new customers, we do not want the same customer to appear in both the training data and the validation data, because part of the answer would then leak into the training data (leakage). When you want to split the data by customer ID, use group k-fold. The GroupKFold class can be used for this, although it has no shuffle or random seed option. (Figure: GroupKFold.png)

5-2. leave-one-out (LOO)

I have never used this either. When there is extremely little data and you want to use as many records as possible for training, the radical approach is to increase the number of folds up to the number of records. You can simply set n_splits in KFold to the number of records, but there is also a dedicated class called LeaveOneOut.
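
As a minimal sketch of both (the breast cancer data has no real group column, so the customer-like groups below are purely hypothetical):

from sklearn.model_selection import GroupKFold, LeaveOneOut

# Hypothetical groups: assign each row of cancer_X to one of 20 "customers"
rng = np.random.RandomState(2020)
groups = rng.randint(0, 20, len(cancer_X))

gkf = GroupKFold(n_splits=4)
for tr_idx, va_idx in gkf.split(cancer_X, cancer_y, groups=groups):
    # a group never appears in both the training and the validation indices
    assert set(groups[tr_idx]).isdisjoint(set(groups[va_idx]))

# leave-one-out: as many folds as there are records (569 for the breast cancer data)
loo = LeaveOneOut()
print(loo.get_n_splits(cancer_X))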

6. Validation of time series data

Perhaps the most important point in this article is that the techniques covered so far should not be applied to time series data as they are. Because how old or new a record is carries information by itself, you must respect the time order when training and evaluating.

6-1. Data set

Since sklearn.datasets does not have suitable time series data, we will use SIGNATE's [Practice question] lunch demand forecasting data here.

- Exclude older (pre-May) data with a different trend.
- Create the number of days, a fun-menu flag, and a curry flag as features.

import matplotlib.pyplot as plt
%matplotlib inline

#Data read
train = pd.read_csv('./train.csv')

#Exclude old data with different trends
train.index = pd.to_datetime(train['datetime'])
train = train['2014-05-01':].copy()

#plot
train['y'].plot(figsize=(15, 3))
plt.show()

#Feature creation
train = train.reset_index(drop=True)
train['days'] = train.index
train['fun'] = train['remarks'].apply(lambda x: 1 if x == 'Fun menu' else 0)
train['curry'] = train['name'].apply(lambda x: 1 if x.find('curry') >= 0 else 0)
train_X = train[['days', 'fun', 'curry']].copy()
train_y = train['y'].copy()

(Figure: lunch1.png)

Basically, sales are declining (negatively correlated with the number of days) but spike when popular menus (the fun menu, curry) are served. How well can a simple linear regression fit this?

from sklearn.metrics import mean_squared_error

slr = LinearRegression()
slr.fit(train_X, train_y)
train['pred'] = slr.predict(train_X)
rmse = np.sqrt(mean_squared_error(train['y'], train['pred']))

print(rmse)
train.plot(y=['y', 'pred'], figsize=(15, 3))
plt.show()

10.548692191381326

(Figure: lunch2.png)

The fit is quite rough, but it captures the overall picture. I would like to examine the relationship between the residuals and other features, but let me return to the topic of validation.

6-2. Cross-validation of time series data

The simplest approach is probably the non-shuffled hold-out method: pass shuffle=False to train_test_split. By training on old data and evaluating on new data, time series data can be validated without problems. Still, it feels wasteful not to train on the data that best reflects the latest trends, so once generalization performance has been confirmed, the model is often retrained on all the data. Even so, complaints remain: will the accuracy hold for other periods, or is the amount of data simply insufficient? In other words, we want to use the data more efficiently. This is where TimeSeriesSplit comes in. The idea itself is simple: in short, it is cross-validation in chronological order. (Figure: TimeSeriesSplit.png) Even this leaves some dissatisfaction, such as the fact that the most recent data is never used for training within the splits and that the length of the training data differs from fold to fold, but it is still better to use it.
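
As a minimal sketch of the non-shuffled hold-out split (reusing train_X and train_y from above), the newest 25% of rows become the validation data because shuffle=False preserves the row order:

# Non-shuffled hold-out: the first 75% of rows (the oldest) are used for training
tr_x, va_x, tr_y, va_y = train_test_split(train_X, train_y, test_size=0.25, shuffle=False)
print(tr_x['days'].max(), va_x['days'].min())  # training ends before validation begins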

from sklearn.model_selection import TimeSeriesSplit

i = 0
scores = []
tss = TimeSeriesSplit(n_splits=4)
for tr_idx, va_idx in tss.split(train_X):
    i += 1
    tr_x, va_x = train_X.iloc[tr_idx], train_X.iloc[va_idx]
    tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]
    slr = LinearRegression()
    slr.fit(tr_x, tr_y)
    va_pred = slr.predict(va_x)
    score = np.sqrt(mean_squared_error(va_y, va_pred))
    print('fold{}: {:.2f}'.format(i, score))
    scores.append(score)

print(np.mean(scores))

fold1: 20.29
fold2: 9.21
fold3: 15.05
fold4: 9.68
13.557202833084698

The result is hard to judge. Interestingly, the accuracy does not improve as the training data grows. There is still room for experimentation, but that is beyond the scope of this article, so I will stop here.

As you can see, it is difficult to draw clear conclusions about validating time series data. In particular, when the trend has changed recently, it is correct in a sense that the training data cannot explain the validation data; one could even argue that the model should be trained on that most recent data. In a control system, one approach is to monitor accuracy with RMSE or a similar metric, fall back to the existing control when a threshold is exceeded, retrain once the required amount of data has accumulated, and then resume the control. Alternatively, the model could be kept up to date with online learning. What about a demand forecasting system? Since, unlike a competition, it is not always necessary to predict a fixed period ahead, an ARIMA model that captures short-term autocorrelation may be effective. I would like to ask the demand forecasting team about this next time.
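
As a purely hypothetical sketch of the monitoring idea (the window size and threshold below are made-up values, not from any real system):

# Track the RMSE over the most recent `window` predictions and flag when it
# exceeds a threshold, which would trigger a fallback and later retraining.
def needs_retraining(y_true, y_pred, window=30, threshold=15.0):
    recent_true = np.asarray(y_true)[-window:]
    recent_pred = np.asarray(y_pred)[-window:]
    rmse = np.sqrt(mean_squared_error(recent_true, recent_pred))
    return rmse > threshold

print(needs_retraining(train['y'], train['pred']))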
