[Translation] scikit-learn 0.18 User Guide 3.1. Cross-validation: Evaluate the performance of the estimator

Google-translated from http://scikit-learn.org/0.18/modules/cross_validation.html. Part of [scikit-learn 0.18 User Guide 3. Model Selection and Evaluation](http://qiita.com/nazoking@github/items/267f2371757516f8c168).


3.1. Cross-validation: assessing estimator performance

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply repeated the labels of the samples it had just seen would score perfectly but would fail to predict anything useful on yet-unseen data. This situation is called **overfitting**. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a **test set** X_test, y_test. Note that the word "experiment" does not denote academic use only; even in commercial settings machine learning usually starts out experimentally. In scikit-learn, a random split into training and test sets can be quickly computed with the train_test_split helper function. Let's load the iris dataset to fit a linear support vector machine on it:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm

>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)

>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))

>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)                           
0.96...

When evaluating different settings ("hyperparameters") for estimators, such as the C setting that must be set manually for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. In this way, knowledge about the test set "leaks" into the model, and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called "validation set": training proceeds on the training set, evaluation is then done on the validation set, and when the experiment seems to be successful, a final evaluation can be done on the test set. However, by partitioning the available data into three sets we drastically reduce the number of samples that can be used for learning the model, and the results can depend on a particular random choice for the pair of (training, validation) sets. A solution to this problem is a procedure called [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but they generally follow the same principles). The following procedure is applied for each of the k "folds":

- A model is trained using $k-1$ of the folds as training data;
- the resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but it does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
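As a rough sketch of this loop (reusing the iris data and the linear SVC loaded above; KFold is described in detail in section 3.1.3.1 below), the folds can be iterated over by hand, although the helper functions introduced in the next section do this bookkeeping for you:

>>> from sklearn.model_selection import KFold
>>> kf = KFold(n_splits=5)
>>> fold_scores = []
>>> for train_index, test_index in kf.split(iris.data):
...     # train on k-1 folds ...
...     model = svm.SVC(kernel='linear', C=1).fit(iris.data[train_index], iris.target[train_index])
...     # ... and score on the held-out fold
...     fold_scores.append(model.score(iris.data[test_index], iris.target[test_index]))
>>> mean_score = np.mean(fold_scores)   # the value reported by k-fold cross-validation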

3.1.1. Calculation of cross-validated metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset. The following example demonstrates how to estimate the accuracy of a linear-kernel SVM on the iris dataset by splitting the data, fitting a model, and computing the score 5 consecutive times (with different splits each time):

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

The mean score and the 95% confidence interval of the score estimate are hence given by:

>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

>>> from sklearn import metrics
>>> scores = cross_val_score(
...     clf, iris.data, iris.target, cv=5, scoring='f1_macro')
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

See [The scoring parameter: defining model evaluation rules](http://qiita.com/nazoking@github/items/958426da6448d74279c7) for details. In the case of the iris dataset, the samples are balanced across the target classes, hence the accuracy and the F1-score are almost equal. When the cv argument is an integer, cross_val_score uses the [KFold](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) or [StratifiedKFold](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) strategy by default, the latter being used if the estimator derives from ClassifierMixin. It is also possible to use other cross-validation strategies by passing a cross-validation iterator instead, for instance:

>>> from sklearn.model_selection import ShuffleSplit
>>> n_samples = iris.data.shape[0]
>>> cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
>>> cross_val_score(clf, iris.data, iris.target, cv=cv)
...                                                     
array([ 0.97...,  0.97...,  1.        ])

**Data transformation with held-out data**

Just as it is important to test a predictor on data held out from training, preprocessing (such as standardization, feature selection, etc.) and similar [data transformations](http://scikit-learn.org/0.18/data_transforms.html#data-transforms) should likewise be learnt from the training set and applied to the held-out data for prediction:

>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)  
0.9333...

A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:

>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_val_score(clf, iris.data, iris.target, cv=cv)
...                                                 
array([ 0.97...,  0.93...,  0.95...])

See [Pipeline and FeatureUnion: Combining estimators](http://scikit-learn.org/0.18/modules/pipeline.html#combining-estimators).

3.1.1.1. Obtaining predictions by cross-validation

The function cross_val_predict has a similar interface to [cross_val_score](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score), but returns the predictions themselves: for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign each element to a test set exactly once can be used (otherwise an exception is raised). These predictions can then be used to evaluate the classifier:

>>> from sklearn.model_selection import cross_val_predict
>>> predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
>>> metrics.accuracy_score(iris.target, predicted) 
0.966...

Note that the result of this computation may differ slightly from that obtained with [cross_val_score](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score), because the elements are grouped in different ways. The available cross-validation iterators are introduced in the following section.
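Since cross_val_predict returns one prediction per sample, the same array can also be fed to other metrics; for example (a small sketch using the predictions obtained above), a confusion matrix of the cross-validated predictions:

>>> cm = metrics.confusion_matrix(iris.target, predicted)  # rows: true class, columns: predicted class
>>> cm.shape   # one row and one column per iris class
(3, 3)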

- Examples:
  - [Receiver Operating Characteristic (ROC) with cross validation](http://scikit-learn.org/0.18/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py),
  - [Recursive feature elimination with cross-validation](http://scikit-learn.org/0.18/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py),
  - [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/0.18/auto_examples/model_selection/grid_search_digits.html#sphx-glr-auto-examples-model-selection-grid-search-digits-py),
  - [Sample pipeline for text feature extraction and evaluation](http://scikit-learn.org/0.18/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py),
  - Plotting Cross-Validated Predictions,
  - [Nested versus non-nested cross-validation](http://scikit-learn.org/0.18/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py).

3.1.2. Cross-validation iterator

The following sections list utilities that generate indices which can be used to split datasets according to different cross-validation strategies.

3.1.3. Cross-validation iterator for independent and identically distributed data

Assuming that some data is Independent and Identically Distributed (i.i.d.) is assuming that all samples stem from the same generative process and that the generative process has no memory of previously generated samples. The following cross-validators can be used in such cases.

Caution

While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice. If one knows that the samples have been generated using a time-dependent process, it is safer to use a time-series-aware cross-validation scheme (see "Cross-validation of time series data" below). Similarly, if one knows that the generative process has a group structure (samples collected from different subjects, experiments, measurement devices), it is safer to use group-wise cross-validation (see "Cross-validation iterators for grouped data" below).

3.1.3.1. K-fold cross-validation

KFold divides all the samples into $k$ groups of samples, called folds, of equal size (if possible). If $k = n$, this is equivalent to the Leave One Out strategy. The prediction function is learned using $k-1$ folds, and the fold left out is used for testing. Example of 2-fold cross-validation on a dataset with 4 samples:

>>> import numpy as np
>>> from sklearn.model_selection import KFold

>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]

Each fold is made of two arrays: the first one is related to the training set, and the second one to the test set. Thus, one can create the training/test sets using numpy indexing:

>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> y = np.array([0, 1, 0, 1])
>>> X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]

3.1.3.2. Leave One Out(LOO)

LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the single sample left out. Thus, for $n$ samples, we have $n$ different training sets and $n$ different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from each training set:

>>> from sklearn.model_selection import LeaveOneOut

>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

Potential users of LOO for model selection should weigh a few known caveats. When compared with $k$-fold cross-validation, one builds $n$ models from $n$ samples instead of $k$ models. Moreover, each is trained on $n-1$ samples rather than $(k-1)n/k$. In both ways, assuming $k$ is not too large and $k < n$, LOO is computationally more expensive than $k$-fold cross-validation. In terms of accuracy, LOO often results in high variance as an estimator of the test error. Intuitively, since $n-1$ of the $n$ samples are used to build each model, the models constructed from the folds are virtually identical to each other and to the model built from the entire training set. However, if the learning curve is steep for the training size in question, then 5- or 10-fold cross-validation can overestimate the generalization error. As a general rule, most authors, and empirical evidence, suggest that 5- or 10-fold cross-validation should be preferred to LOO.

- References:
  - http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html
  - T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer 2009
  - L. Breiman, P. Spector, Submodel selection and evaluation in regression: The X-random case, International Statistical Review 1992
  - R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, Intl. Jnt. Conf. AI
  - R. Bharat Rao, G. Fung, R. Rosales, On the Dangers of Cross-Validation: An Experimental Evaluation, SIAM 2008
  - G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, Springer 2013
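To make the computational point above concrete, a cross-validation splitter such as LeaveOneOut can be passed directly as the cv argument of cross_val_score; the small sketch below (reusing the iris data from above) fits one model per sample for LOO versus only five for 5-fold:

>>> from sklearn.model_selection import LeaveOneOut, cross_val_score
>>> est = svm.SVC(kernel='linear', C=1)
>>> loo_scores = cross_val_score(est, iris.data, iris.target, cv=LeaveOneOut())
>>> loo_scores.shape        # one model and one 0/1 accuracy per held-out sample
(150,)
>>> cross_val_score(est, iris.data, iris.target, cv=5).shape   # only k models with k-fold
(5,)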

3.1.3.3. Leave P Out(LPO)

LeavePOut is very similar to LeaveOneOut, as it creates all the possible training/test sets by removing $p$ samples from the complete set. For $n$ samples, this produces ${n \choose p}$ train-test pairs. Unlike LeaveOneOut and KFold, the test sets will overlap for $p > 1$. Example of Leave-2-Out on a dataset with 4 samples:

>>> from sklearn.model_selection import LeavePOut

>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

3.1.3.4. Random permutations cross-validation a.k.a. Shuffle & Split

The ShuffleSplit iterator generates a user-defined number of independent train/test dataset splits. Samples are first shuffled and then split into a pair of train and test sets. It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state pseudo-random number generator. Here is a usage example:

>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(5)
>>> ss = ShuffleSplit(n_splits=3, test_size=0.25,
...     random_state=0)
>>> for train_index, test_index in ss.split(X):
...     print("%s %s" % (train_index, test_index))
...
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]

ShuffleSplit is thus a good alternative to KFold cross-validation, as it allows finer control over the number of iterations and the proportion of samples on each side of the train/test split.
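For example (a small sketch with arbitrary parameter values), the number of iterations and the fraction of samples held out for testing can be chosen independently:

>>> cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
>>> splits = list(cv.split(X))
>>> len(splits)           # ten independent train/test splits
10
>>> len(splits[0][1])     # each holding out 10% of the samples (here 1 of 5) for testing
1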

3.1.4. Cross-validation iterators with stratification based on class labels

Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling as implemented in [StratifiedKFold](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) and [StratifiedShuffleSplit](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit) to ensure that relative class frequencies are approximately preserved in each training and validation fold.
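As a minimal sketch of the latter (on a small, arbitrary toy dataset), StratifiedShuffleSplit is used like ShuffleSplit above, except that the class labels must also be passed to split so that the class proportions can be preserved in every fold:

>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.ones(8)
>>> y = np.array([0, 0, 0, 0, 0, 0, 1, 1])        # 6 samples of class 0, 2 of class 1
>>> sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
>>> for train, test in sss.split(X, y):
...     print(np.bincount(y[test]))               # each random test half keeps the 3:1 class ratio
[3 1]
[3 1]
[3 1]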

3.1.4.1. Stratified k-fold

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set. Example of stratified 3-fold cross-validation on a dataset with 10 samples from two slightly unbalanced classes:

>>> from sklearn.model_selection import StratifiedKFold

>>> X = np.ones(10)
>>> y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]

3.1.5. Cross-validation iterators for grouped data

The i.i.d. assumption is broken if the underlying generative process yields groups of dependent samples. Such a grouping of data is domain specific. One example would be medical data collected from multiple patients, with several samples taken from each patient; such data is likely to be dependent on the individual group. In this example, the patient id of each sample is its group identifier. We would then like to know whether a model trained on a particular set of groups generalizes well to unseen groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that were not used for training at all. The following cross-validation splitters can be used to do that. The grouping identifier of the samples is specified via the groups parameter.

3.1.5.1. Group k-fold

GroupKFold is a variation of k-fold which ensures that the same group is not represented in both the test and the training set. For example, if the data is obtained from different subjects with several samples per subject, and if the model is flexible enough to learn from highly person-specific features, it could fail to generalize to new subjects. GroupKFold makes it possible to detect this kind of overfitting situation. Imagine you have three groups, each associated with a number from 1 to 3:

>>> from sklearn.model_selection import GroupKFold

>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

Each subject is in a different testing fold, and the same subject is never in both testing and training. Notice that the folds do not all have exactly the same size due to the imbalance in the data.

3.1.5.2. Leave One Group Out

[LeaveOneGroupOut](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut) is a cross-validation scheme which holds out the samples according to a third-party-provided array of integer groups. This group information can be used to encode arbitrary domain-specific pre-defined cross-validation folds. Each training set is thus constituted by all the samples except the ones related to a specific group. For example, in the case of multiple experiments, LeaveOneGroupOut can be used to create a cross-validation based on the different experiments: we create a training set using the samples of all the experiments except one:

>>> from sklearn.model_selection import LeaveOneGroupOut

>>> X = [1, 5, 10, 50, 60, 70, 80]
>>> y = [0, 1, 1, 2, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3, 3]
>>> logo = LeaveOneGroupOut()
>>> for train, test in logo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]

Another common application is to use time information: for instance, the groups could be the year of collection of the samples, allowing cross-validation against time-based splits.
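A minimal sketch of this use (with a hypothetical collection year per sample, reusing X, y and logo from the example above): every fold holds out one whole year of data.

>>> years = [2014, 2014, 2015, 2015, 2016, 2016, 2016]   # hypothetical year each sample was collected
>>> for train, test in logo.split(X, y, groups=years):
...     print("%s %s" % (train, test))
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]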

3.1.5.3. Leave P Groups Out

[LeavePGroupsOut](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut) is similar to [LeaveOneGroupOut](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut), but removes the samples related to $P$ groups for each training/test set. Example of Leave-2-Group Out:

>>> from sklearn.model_selection import LeavePGroupsOut

>>> X = np.arange(6)
>>> y = [1, 1, 1, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3]
>>> lpgo = LeavePGroupsOut(n_groups=2)
>>> for train, test in lpgo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]

3.1.5.4. Group Shuffle Split

The [GroupShuffleSplit](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.GroupShuffleSplit.html#sklearn.model_selection.GroupShuffleSplit) iterator behaves as a combination of [ShuffleSplit](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit) and [LeavePGroupsOut](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut), and generates a sequence of randomized partitions in which a subset of the groups is held out for each split. Here is a usage example:

>>> from sklearn.model_selection import GroupShuffleSplit

>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "a"]
>>> groups = [1, 1, 2, 2, 3, 3, 4, 4]
>>> gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
>>> for train, test in gss.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
...
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]

This class is useful when the behavior of LeavePGroupsOut is desired, but the number of groups is large enough that generating all possible partitions with $P$ groups withheld would be prohibitively expensive. In such a scenario, GroupShuffleSplit provides a random sample (with replacement) of the train/test splits generated by LeavePGroupsOut.

3.1.6. Predefined Fold-Splits / Validation-Sets

For some datasets, a pre-defined split of the data into training and validation folds, or into several cross-validation folds, already exists. Using PredefinedSplit it is possible to use these folds, e.g. when searching for hyperparameters. For example, when using a validation set, set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
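A minimal sketch with arbitrary toy data: test_fold below puts sample 0 alone in one validation fold, samples 1 and 3 together in another, and keeps sample 2 (marked -1) in the training set of every split.

>>> from sklearn.model_selection import PredefinedSplit
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.array([0, 0, 1, 1])
>>> test_fold = [0, 1, -1, 1]
>>> ps = PredefinedSplit(test_fold)
>>> for train, test in ps.split(X, y):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2] [1 3]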

3.1.7. Cross-validation of time series data

Time series data is characterised by the correlation between observations that are near in time (autocorrelation). However, classical cross-validation techniques such as KFold and ShuffleSplit assume that the samples are independent and identically distributed, and on time series data they would result in an unreasonable correlation between training and testing instances (yielding poor estimates of the generalization error). Therefore, it is very important to evaluate a model for time series data on the "future" observations least like those used to train it. To achieve this, one solution is provided by [TimeSeriesSplit](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit).

3.1.7.1. Time Series Split

TimeSeriesSplit is a variation of k-fold which returns the first $k$ folds as the training set and the $(k+1)$-th fold as the test set. Note that, unlike standard cross-validation methods, successive training sets are supersets of those that come before them. It also adds all surplus data to the first training partition, which is always used to train the model. This class can be used to cross-validate time series data samples that are observed at fixed time intervals. Example of 3-split time series cross-validation on a dataset with 6 samples:

>>> from sklearn.model_selection import TimeSeriesSplit

>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)  
TimeSeriesSplit(n_splits=3)
>>> for train, test in tscv.split(X):
...     print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]

3.1.8. A note on shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if the samples correspond to news articles and are sorted by their time of publication, then shuffling the data will likely lead to a model that is overfit and to an inflated validation score: it would be tested on samples that are artificially similar (close in time) to the training samples. Some cross-validation iterators, such as KFold, have a built-in option to shuffle the data indices before splitting them (a small example follows the notes below). Note that:

- This consumes less memory than shuffling the data directly.
- By default no shuffling occurs, including for the (stratified) k-fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
- The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.
- To ensure that results are repeatable (on the same platform), use a fixed value for random_state.
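A small sketch of this built-in option (on six arbitrary samples): shuffle=True randomizes the sample indices before the split, and fixing random_state makes that shuffle reproducible.

>>> from sklearn.model_selection import KFold
>>> X = np.arange(6)
>>> for train, test in KFold(n_splits=3).split(X):              # default: no shuffling, contiguous folds
...     print("%s %s" % (train, test))
[2 3 4 5] [0 1]
[0 1 4 5] [2 3]
[0 1 2 3] [4 5]
>>> shuffled = KFold(n_splits=3, shuffle=True, random_state=0)   # indices shuffled before splitting
>>> [len(test) for _, test in shuffled.split(X)]                 # fold sizes are unchanged
[2, 2, 2]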

3.1.9. Cross-validation and model selection

Cross-validation iterators can also be used to directly perform model selection using grid search for the optimal hyperparameters of the model. This is the topic of the next section: Tuning the hyperparameters of the estimator.


From [scikit-learn 0.18 User Guide 3. Model Selection and Evaluation](http://qiita.com/nazoking@github/items/267f2371757516f8c168)

© 2010 - 2016, scikit-learn developers (BSD license).
