[PYTHON] Cross Validation with scikit-learn

I tried out Cross Validation and Grid Search with scikit-learn.

Cross Validation

More details can be found on Wikipedia. Cross Validation is one way to verify the validity of a model. Typically, the development data is split once into training data and validation data. With a single split, however, the amount of training data shrinks, and generalization performance may suffer depending on how the training data happens to be selected. Wikipedia calls this the hold-out test, and it is generally not considered Cross Validation.

What is covered here is K-fold cross-validation. The development data is divided into K pieces; K-1 pieces are used for training and the remaining one is used for validation, and this is repeated so that each piece serves as the validation set once. As a result, every sample is used for training in some round, and because the training data changes from round to round, the scores say more about generalization performance than a single split does.
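The splitting described above can be sketched by hand with scikit-learn's KFold. This is a minimal illustration, not the article's own code: the toy data, the linear kernel, and the names X, y, K, and fold_scores are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Hypothetical toy data: 100 samples, 5 features,
# labelled by the sign of the first feature.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

K = 5
kfold = KFold(n_splits=K, shuffle=True, random_state=0)

fold_scores = []
for train_idx, valid_idx in kfold.split(X):
    # K-1 pieces are used for training, the remaining piece for validation.
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[valid_idx], y[valid_idx]))

# The average over the K rounds is the cross-validated score.
mean_score = float(np.mean(fold_scores))
print("Scores per fold:", fold_scores)
print("Mean accuracy: %0.2f" % mean_score)
```

Each sample lands in the validation piece exactly once, so all of the data contributes to both training and validation over the K rounds.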

Below is how to do this concretely with scikit-learn. The data used for training is from Kaggle's Data Science London.

SVM

First, here is the code for classifying with a support vector machine:

# -*- coding: utf-8 -*-

import numpy as np
from sklearn import svm

if __name__ == "__main__":
    # Load the training features and labels (one row per sample).
    train_features = np.genfromtxt("../data/train.csv", delimiter=",", dtype=float)
    train_labels = np.genfromtxt("../data/trainLabels.csv", delimiter=",", dtype=float)

    # RBF-kernel SVM with explicitly listed parameters.
    clf = svm.SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
                  degree=3, gamma=0.001, kernel="rbf", max_iter=-1,
                  probability=False, random_state=None, shrinking=True,
                  tol=0.001, verbose=False)

    clf.fit(train_features, train_labels)

    test_features = np.genfromtxt("../data/test.csv", delimiter=",", dtype=float)

    # Predict all test rows at once and print them as "Id,Solution",
    # with Ids starting at 1.
    print("Id,Solution")
    for i, label in enumerate(clf.predict(test_features), start=1):
        print(str(i) + "," + str(int(label)))

Let's validate this model, starting with a single hold-out split.

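# scikit-learn 0.18 and later; older releases used sklearn.cross_validation.
from sklearn.model_selection import train_test_split

def get_score(clf, train_features, train_labels):
    # Hold out 40% of the development data for validation.
    X_train, X_test, y_train, y_test = train_test_split(
        train_features, train_labels, test_size=0.4, random_state=0)
    # Fit on the remaining 60% and report accuracy on the held-out 40%.
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))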

train_test_split (in sklearn.model_selection; older releases exposed it as cross_validation.train_test_split) is a function that divides the development data so that a given fraction becomes validation data. Here test_size=0.4 is specified, so 40% of the data is used for validation. The model is fit on the remaining 60%, and score reports the accuracy on the held-out 40%. This is the validity of the model on this particular split. Higher is better, of course, but a single score does not tell you whether generalization performance is high. K-fold splitting addresses this by performing K validations. Averaging the K scores expresses the validity of the model including its generalization performance.

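# scikit-learn 0.18 and later; older releases used sklearn.cross_validation.
from sklearn.model_selection import cross_val_score

def get_accuracy(clf, train_features, train_labels):
    # 10-fold cross-validation: returns one score per fold.
    scores = cross_val_score(clf, train_features, train_labels, cv=10)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))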

cross_val_score returns the scores for all of these validations. cv specifies the number of folds K. Here the development data is divided into 10 pieces and validated 10 times, so scores is an array of 10 scores, and their mean is reported as Accuracy. This gives the validity of the model including generalization performance, but the model parameters still have to be tuned by hand. Adjusting them manually and recomputing Accuracy each time is tedious, so Grid Search can be used to automate this tuning to some extent.
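To see what the returned scores look like without the Kaggle files, here is a small self-contained sketch. The synthetic data from make_classification is an assumption standing in for the real training set, so the numbers it prints say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the Kaggle training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = SVC(C=100, gamma=0.001, kernel="rbf")

# cv=10 splits the data into 10 folds and returns one score per fold.
scores = cross_val_score(clf, X, y, cv=10)

print(scores.shape)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

The spread reported after +/- is twice the standard deviation of the fold scores, which gives a rough sense of how much the score depends on the split.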

Grid Search

Grid Search is a method that empirically searches for the best combination of parameters by trying every combination within the ranges you specify. In Python it can be written as follows.

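from sklearn import svm
# scikit-learn 0.18 and later; older releases used sklearn.grid_search.
from sklearn.model_selection import GridSearchCV

def grid_search(train_features, train_labels):
    # Each dict is one sub-grid; every combination inside it is tried.
    param_grid = [
        {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
    ]

    # n_jobs=-1 runs the search in parallel on all available cores.
    clf = GridSearchCV(svm.SVC(C=1), param_grid, n_jobs=-1)
    clf.fit(train_features, train_labels)
    print(clf.best_estimator_)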

The ranges to search are given in param_grid. n_jobs sets the number of processes that run the calculations in parallel; if -1 is specified, all available cores are used. Grid Search is then run on the given training data. It takes some time, but it selects the model parameters that give the highest cross-validated score on this training data. The model trained with these parameters can then be applied to the actual test data.
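As a rough sketch of that last step, the fitted search object can be inspected and then used for prediction directly. The synthetic data, the smaller grid, and the names search, X_dev, and X_test are assumptions for illustration; cv=5 is set explicitly, and the default sequential execution is used instead of n_jobs=-1 to keep the example simple.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the Kaggle files.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A smaller grid than in the article, to keep the example fast.
param_grid = [
    {"C": [1, 10, 100], "kernel": ["linear"]},
    {"C": [1, 10, 100], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_dev, y_dev)

print(search.best_params_)
print("Best cross-validated accuracy: %0.2f" % search.best_score_)

# With the default refit=True, the best parameters are refit on all of
# X_dev, so the search object can predict on unseen data directly.
predictions = search.predict(X_test)
print("Held-out accuracy: %0.2f" % search.score(X_test, y_test))
```

best_score_ is the mean cross-validated score of the winning combination, so the held-out accuracy printed last is a separate check on data the search never saw.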
