Translated from the scikit-learn 0.18 tutorial "Model selection: choosing estimators and their parameters" (http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html).
As we have seen, every estimator exposes a `score` method that can judge the quality of the fit (or the prediction) on new data. Bigger is better.
>>> from sklearn import datasets, svm
>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998
To get a better measure of prediction accuracy (which we can use as a proxy for the goodness of fit of the model), we can successively split the data into folds that we use for training and testing:
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
... # We use 'list' to copy, in order to 'pop' later on
... X_train = list(X_folds)
... X_test = X_train.pop(k)
... X_train = np.concatenate(X_train)
... y_train = list(y_folds)
... y_test = y_train.pop(k)
... y_train = np.concatenate(y_train)
... scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
This is called a KFold cross-validation.

scikit-learn has a collection of classes that can be used to generate lists of train/test indices for popular cross-validation strategies. They expose a `split` method that accepts the input dataset to be split and yields the train/test set indices for each iteration of the chosen cross-validation strategy.

This example shows the use of the `split` method:
>>> from sklearn.model_selection import KFold, cross_val_score
>>> X = ["a", "a", "b", "c", "c", "c"]
>>> k_fold = KFold(n_splits=3)
>>> for train_indices, test_indices in k_fold.split(X):
... print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
The cross-validation can then be performed easily:
>>> k_fold = KFold(n_splits=3)
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
... for train, test in k_fold.split(X_digits)]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
The cross-validation score can be directly calculated using the `cross_val_score` helper. Given an estimator, a cross-validation object and the input dataset, `cross_val_score` repeatedly splits the data into a training and a testing set, trains the estimator using the training set, and computes a score based on the testing set for each iteration of the cross-validation.
By default, the estimator's `score` method is used to compute the individual scores. For more information on the available scoring methods, see the metrics module (http://scikit-learn.org/stable/modules/metrics.html#metrics).
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([ 0.93489149, 0.95659432, 0.93989983])
`n_jobs=-1` means that the computation will be dispatched on all the CPUs of the computer.
Alternatively, the `scoring` argument can be provided to specify an alternative scoring method.
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold,
... scoring='precision_macro')
array([ 0.93969761, 0.95911415, 0.94041254])
**Exercise:** On the digits dataset, plot the cross-validation score of an SVC estimator with a linear kernel as a function of the parameter `C` (use a logarithmic grid of 10 points, from 10^-10 to 10^0, as in the starter code below).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
digits = datasets.load_digits()
X = digits.data
y = digits.target
svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
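A minimal sketch of one possible solution, building on the starter code above (the plotting choices are illustrative; the official solution is linked below):

import matplotlib.pyplot as plt

# Mean cross-validation score for each candidate value of C
scores = list()
for C in C_s:
    svc.C = C
    scores.append(np.mean(cross_val_score(svc, X, y, n_jobs=1)))

plt.semilogx(C_s, scores)
plt.xlabel('Parameter C')
plt.ylabel('Cross-validation score')
plt.show()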
[Answer here: Cross-validation with Digits Dataset Exercise](http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_digits.html#sphx-glr-auto-examples-exercises-plot-cv-digits-py)
scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters that maximize the cross-validation score. This object takes an estimator during construction and exposes the estimator API.
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
... n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on the test set is not as good as on the train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
By default, GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) uses a 3-fold cross-validation. However, if it detects that a classifier, rather than a regressor, has been passed, it uses a stratified 3-fold.
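For instance, the default behaviour for a classifier mirrors passing a stratified 3-fold explicitly (an illustrative sketch; `clf_stratified` is a hypothetical name, not from the original tutorial):

>>> from sklearn.model_selection import StratifiedKFold
>>> clf_stratified = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                               cv=StratifiedKFold(n_splits=3))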
>>> cross_val_score(clf, X_digits, y_digits)
...
array([ 0.938..., 0.963..., 0.944...])
Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set `C`, and the other by cross_val_score to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.
Note that you cannot nest objects with parallel computing (`n_jobs` different from 1).
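A minimal sketch of a pattern that respects this constraint (illustrative, not part of the original tutorial): parallelize only one of the two loops, keeping the other sequential.

>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=1)
>>> scores = cross_val_score(clf, X_digits, y_digits, n_jobs=-1)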
Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, scikit-learn exposes estimators that set their parameter automatically by cross-validation (see [Cross-validation: evaluating estimator performance](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)).
>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> diabetes = datasets.load_diabetes()
>>> X_diabetes = diabetes.data
>>> y_diabetes = diabetes.target
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
>>> # The estimator chose its regularization parameter automatically:
>>> lasso.alpha_
0.01229...
These estimators are called similarly to their counterparts, with 'CV' appended to their name.
**Exercise:** On the diabetes dataset, find the optimal regularization parameter `alpha`.

**Bonus:** How much can you trust the selection of `alpha`?
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()
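A minimal sketch of one possible approach, building on the starter imports above (the alpha grid and the 150-sample subset are illustrative choices; the official solution is linked below):

import numpy as np

X = diabetes.data[:150]
y = diabetes.target[:150]

alphas = np.logspace(-4, -0.5, 30)
lasso = Lasso()

# Mean cross-validation score for each candidate alpha
scores = [cross_val_score(lasso.set_params(alpha=alpha), X, y, cv=3).mean()
          for alpha in alphas]
print(alphas[np.argmax(scores)])

# Bonus: check how stable the selected alpha is across different folds
k_fold = KFold(n_splits=3)
for train, _ in k_fold.split(X):
    print(LassoCV(alphas=alphas).fit(X[train], y[train]).alpha_)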
[Answer here: Cross-validation on Diabetes Dataset Exercise](http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html#sphx-glr-auto-examples-exercises-plot-cv-diabetes-py)
© 2010-2016, scikit-learn developers (BSD License).