Translated from the scikit-learn 0.18 tutorial "Model selection: choosing estimators and their parameters" (http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html).
As we have seen, every estimator exposes a `score` method that can judge the quality of the fit (or the prediction) on new data. Bigger is better.
>>> from sklearn import datasets, svm
>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998
To get a better measure of prediction accuracy (which we can use as a proxy for the goodness of fit of the model), we can successively split the data into folds that we use for training and testing:
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
... # We use 'list' to copy, in order to 'pop' later on
... X_train = list(X_folds)
... X_test = X_train.pop(k)
... X_train = np.concatenate(X_train)
... y_train = list(y_folds)
... y_test = y_train.pop(k)
... y_train = np.concatenate(y_train)
... scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
This is called a KFold cross-validation.

scikit-learn has a collection of classes that can be used to generate lists of train/test indices for popular cross-validation strategies. They expose a `split` method that accepts the input dataset to be split and yields the train/test set indices for each iteration of the chosen cross-validation strategy.

This example shows the use of the `split` method:
>>> from sklearn.model_selection import KFold, cross_val_score
>>> X = ["a", "a", "b", "c", "c", "c"]
>>> k_fold = KFold(n_splits=3)
>>> for train_indices, test_indices in k_fold.split(X):
... print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
The cross-validation can then be performed easily:
>>> k_fold = KFold(n_splits=3)
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
... for train, test in k_fold.split(X_digits)]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
The cross-validation score can be directly calculated using the `cross_val_score` helper. Given an estimator, a cross-validation object and the input dataset, `cross_val_score` repeatedly splits the data into a training and a testing set, trains the estimator using the training set, and computes a score based on the testing set for each iteration of the cross-validation.
By default, the estimator's `score` method is used to compute the individual scores. For more information on the available scoring methods, see the metrics module (http://scikit-learn.org/stable/modules/metrics.html#metrics).
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([ 0.93489149, 0.95659432, 0.93989983])
`n_jobs=-1` means that the computation will be dispatched on all the CPUs of the computer.
Alternatively, the `scoring` argument can be provided to specify an alternative scoring method.
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold,
... scoring='precision_macro')
array([ 0.93969761, 0.95911415, 0.94041254])
**Exercise:** On the digits dataset, plot the cross-validation score of an SVC estimator with a linear kernel as a function of the parameter `C` (use a logarithmic grid of 10 points, from 10^-10 to 10^0, as in the starter code below).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
digits = datasets.load_digits()
X = digits.data
y = digits.target
svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
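A minimal sketch of one possible solution, building on the starter code above (the plotting choices are illustrative; the official solution is linked below):

import matplotlib.pyplot as plt

# Mean cross-validation score for each candidate value of C
scores = list()
for C in C_s:
    svc.C = C
    scores.append(np.mean(cross_val_score(svc, X, y, n_jobs=1)))

plt.semilogx(C_s, scores)
plt.xlabel('Parameter C')
plt.ylabel('Cross-validation score')
plt.show()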
[Answer here: Cross-validation with Digits Dataset Exercise](http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_digits.html#sphx-glr-auto-examples-exercises-plot-cv-digits-py)
scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters that maximize the cross-validation score. This object takes an estimator during construction and exposes the estimator API.
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
... n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on the test set is not as good as on the train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
By default, GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) uses a 3-fold cross-validation. However, if it detects that a classifier, rather than a regressor, has been passed, it uses a stratified 3-fold.
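For instance, the default behaviour for a classifier mirrors passing a stratified 3-fold explicitly (an illustrative sketch; `clf_stratified` is a hypothetical name, not from the original tutorial):

>>> from sklearn.model_selection import StratifiedKFold
>>> clf_stratified = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                               cv=StratifiedKFold(n_splits=3))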
>>> cross_val_score(clf, X_digits, y_digits)
...
array([ 0.938..., 0.963..., 0.944...])
Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set `C`, and the other by cross_val_score to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.
Note that you cannot nest objects with parallel computing (`n_jobs` different from 1).
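A minimal sketch of a pattern that respects this constraint (illustrative, not part of the original tutorial): parallelize only one of the two loops, keeping the other sequential.

>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=1)
>>> scores = cross_val_score(clf, X_digits, y_digits, n_jobs=-1)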
Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, scikit-learn exposes estimators that set their parameter automatically by cross-validation (see [Cross-validation: evaluating estimator performance](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)).
>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> diabetes = datasets.load_diabetes()
>>> X_diabetes = diabetes.data
>>> y_diabetes = diabetes.target
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
>>> # The estimator chose its regularization parameter automatically:
>>> lasso.alpha_
0.01229...
These estimators are called similarly to their counterparts, with 'CV' appended to their name.
**Exercise:** On the diabetes dataset, find the optimal regularization parameter `alpha`.

**Bonus:** How much can you trust the selection of `alpha`?
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()
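A minimal sketch of one possible approach, building on the starter imports above (the alpha grid and the 150-sample subset are illustrative choices; the official solution is linked below):

import numpy as np

X = diabetes.data[:150]
y = diabetes.target[:150]

alphas = np.logspace(-4, -0.5, 30)
lasso = Lasso()

# Mean cross-validation score for each candidate alpha
scores = [cross_val_score(lasso.set_params(alpha=alpha), X, y, cv=3).mean()
          for alpha in alphas]
print(alphas[np.argmax(scores)])

# Bonus: check how stable the selected alpha is across different folds
k_fold = KFold(n_splits=3)
for train, _ in k_fold.split(X):
    print(LassoCV(alphas=alphas).fit(X[train], y[train]).alpha_)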
[Answer here: Cross-validation on Diabetes Dataset Exercise](http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html#sphx-glr-auto-examples-exercises-plot-cv-diabetes-py)
© 2010-2016, scikit-learn developers (BSD License).