[PYTHON] Introduction to Machine Learning with scikit-learn: From data acquisition to parameter optimization

Overview

As a running example, consider this task: **"Automatically identify the handwritten zip code written on a postcard"**.

This article is for beginners. It is mostly a digest of the scikit-learn tutorials and documentation, with some additional material. We will use digits as the dataset and an SVM (SVC, to be exact) as the machine learning method.

digits: Handwritten numeric character image dataset
SVC: A type of support vector machine

Dataset: digits

digits is a dataset of handwritten digit images paired with numeric labels. Later we will train a model on these label and image pairs. Since the data ships with scikit-learn, anyone can easily try it.

Read data

You can load the digits dataset with datasets.load_digits().

from sklearn import datasets
from matplotlib import pyplot as plt

digits = datasets.load_digits()

View the contents of the data

Each image is a handwritten digit from 0 to 9. Programmatically, each image is an 8x8 two-dimensional array of grayscale values between 0 and 16 (digits.images). digits.data holds the same pixels flattened into one 64-element row per image.

#Image array data
print(digits.data)

[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
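To make the array layout concrete, here is a minimal sketch that prints the shapes and the value range of the dataset. It uses only attributes of the object returned by load_digits().

```python
from sklearn import datasets

digits = datasets.load_digits()

# One 8x8 image per sample
print(digits.images.shape)   # (1797, 8, 8)

# The same pixels, flattened to 64 features per sample
print(digits.data.shape)     # (1797, 64)

# Range of the pixel intensities
value_min = digits.data.min()
value_max = digits.data.max()
print(value_min, value_max)  # 0.0 16.0
```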

Raw arrays are hard to interpret, so let's display the data as images.

Before displaying the images, first check the label data. As shown below, each image already comes with a label from 0 to 9.

#label
print(digits.target)

[0 1 2 ..., 8 9 8]

Looking at the result above, the first three images (indices 0, 1, 2) are labeled 0, 1, and 2, and the second image from the end is labeled 9. You can use matplotlib to display these images.

#Image display: the first three images and the second image from the end
for position, index in enumerate([0, 1, 2, -2], start=1):
    plt.subplot(1, 4, position)
    plt.imshow(digits.images[index], cmap='gray')
    # Use the stored label as the title so image and label can be compared
    plt.title('number %d' % digits.target[index])
    plt.xticks([])
    plt.yticks([])

plt.show()

In this way, you can see that each image seems to have the correct label.

Image classification by SVM

What is SVM

**SVM (Support Vector Machine)** is a supervised learning method known for strong classification performance. At its core, it is a two-class classifier that chooses the decision boundary maximizing the margin between the classes. It can also be applied to multi-class classification by combining multiple two-class classifiers.

With a strict (hard margin) SVM, if the classes overlap (that is, the data cannot be completely separated), no proper classification boundary can be obtained. An SVM that tolerates errors is called a **soft margin SVM**. By imposing a penalty C on misclassification, it can draw a boundary that keeps misclassification low even for data that cannot be completely separated.

Note that the larger the penalty C, the more heavily errors are penalized, and at the same time the more likely the model is to **overfit**.

(Note) Overfitting means that the model fits specific random features of the training data that are unrelated to the patterns you actually want it to learn. When overfitting occurs, performance on the training data improves, but results on other data get worse. (Reference: Overfitting-Wikipedia)
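One way to observe the effect of C is to compare accuracy on the training data with accuracy on held-out data for several values. The sketch below is only an illustration: the 50/50 split, the C values, and gamma=0.001 are assumptions chosen for this example, and on an easy dataset like digits the gap between the two scores may be small.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()

# Hold out half of the data (an arbitrary choice for this sketch)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='rbf', gamma=0.001, C=C)
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)  # accuracy on data seen in training
    test_acc = clf.score(X_test, y_test)     # accuracy on unseen data
    results[C] = (train_acc, test_acc)
    print("C=%g  train=%.3f  test=%.3f" % (C, train_acc, test_acc))
```

A training score that keeps rising while the test score stalls or drops is the typical sign of overfitting.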

SVM with scikit-learn

scikit-learn provides several SVM classifiers, such as SVC, NuSVC, and LinearSVC. NuSVC and SVC are very similar, but they take slightly different parameter sets and are based on different mathematical formulations. LinearSVC is an SVM restricted to a linear kernel; no other kernel can be specified.

This time we will use SVC, which is a soft margin SVM. All you have to do is **(1) create a classifier and (2) apply it to your data**.

(1) Creation of classifier

Train a model on all the data except the last 10 samples

from sklearn import svm

# SVM
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-10], digits.target[:-10])

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Although we specified only gamma and C here, you can see that SVC has quite a few parameters (C, cache_size, class_weight, coef0, ...), as shown above. Don't worry too much about them at first; the default settings are fine.
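If the printed representation differs in your scikit-learn version, the parameters can also be inspected programmatically. A minimal sketch using get_params(), which every scikit-learn estimator provides:

```python
from sklearn import svm

clf = svm.SVC(gamma=0.001, C=100.)

# get_params() returns a dict of every constructor parameter,
# including the ones left at their defaults
params = clf.get_params()
for name in sorted(params):
    print("%s = %r" % (name, params[name]))
```

The exact set of parameters and some defaults vary between scikit-learn versions, so the list may not match the output shown above line for line.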

(2) Image classification by classifier

Predict the labels of the last 10 samples with the trained model

Now we use the classifier to predict labels from images. Let's try it on the last 10 samples, which were not used for training.

clf.predict(digits.data[-10:])

array([5, 4, 8, 8, 4, 9, 0, 8, 9, 8])

The actual labels are:

print(digits.target[-10:])

[5 4 8 8 4 9 0 8 9 8]

They agree with the predictions.

This confirms that the trained model can make largely correct predictions. Try it with different parameters as well.
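Rather than comparing the two printed arrays by eye, the match can be counted in code. A minimal sketch that repeats the training step so it runs on its own:

```python
import numpy as np
from sklearn import datasets, svm

digits = datasets.load_digits()

# Train on everything except the last 10 samples
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-10], digits.target[:-10])

predicted = clf.predict(digits.data[-10:])
expected = digits.target[-10:]

# Element-wise comparison, then count the matches
n_correct = int(np.sum(predicted == expected))
print("%d of %d predictions are correct" % (n_correct, len(expected)))
```

Ten samples are far too few for a reliable estimate, which is why a more systematic evaluation is needed.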

Accuracy evaluation of classifier

Evaluation index of the classifier

There are several metrics for evaluating a classifier, but the basic ones are the following.

**Accuracy**
Percentage of all predictions that are correct
**Precision**
Percentage of samples predicted to be positive that are actually positive
**Recall**
Percentage of actually positive samples that are predicted to be positive
**F value (F-measure)**
Harmonic mean of precision and recall

A classifier is commonly evaluated by its F value. In practice, however, precision and recall often deserve different weight depending on the application.
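The four metrics can be checked on a tiny example. The labels below are made up for illustration (1 = positive, 0 = negative); the sketch computes each metric by hand from the counts and then with the corresponding functions in sklearn.metrics.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical labels for 10 samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives: 3
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives: 1
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives: 1
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives: 5

accuracy = (tp + tn) / len(pairs)                         # 0.8
precision = tp / (tp + fp)                                # 0.75
recall = tp / (tp + fn)                                   # 0.75
f_value = 2 * precision * recall / (precision + recall)   # 0.75

# The library functions give the same values
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f_value, f1_score(y_true, y_pred))
```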

Precision and recall

For example, consider a factory parts inspection, where "positive" means that a part is judged unbroken (good). Mistakenly classifying a sound part as "broken (defective)" is not a big problem. However, if a broken part is mistakenly classified as "unbroken (good)", it may lead to complaints and recalls, and depending on the product it can even be life-threatening. In such cases, precision is more important than recall. For example, comparing "precision 99% + recall 70%" with "precision 80% + recall 99%", the latter has the higher F value, but the former may well be far more practical.

On the other hand, when searching a database, recall is often more important than precision. Getting many irrelevant search results is much better than having a lot of data that the search cannot find.
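The F values of the two settings in the inspection example can be verified with a few lines of arithmetic. The helper name f_measure is introduced only for this sketch.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f_high_precision = f_measure(0.99, 0.70)
f_high_recall = f_measure(0.80, 0.99)

print("%.3f" % f_high_precision)  # 0.820
print("%.3f" % f_high_recall)     # 0.885
```

The second setting scores higher on the F value even though the first may be the better choice when false positives are costly, which is why the F value alone should not decide the matter.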

Parameter optimization

Until now, we set the parameters to plausible values without much justification. However, this often does not deliver the required classification accuracy, and in practice parameter optimization is essential. So which parameters should be set, and how, to improve the classifier's accuracy?

You can tune the parameters one by one by hand, but this is very laborious. For some datasets and methods there may be rules of thumb about which values tend to work well, but they cannot be relied on for an unfamiliar dataset. Therefore, a method called **grid search** is often used. Simply put, it trains the model for each parameter combination in a search range and picks the combination with the best score.

In addition, **cross-validation** is used to confirm that the model trained with the chosen parameters is not overfitting. k-fold cross-validation first divides the data into k parts. It trains on k-1 of them and evaluates on the remaining one, repeats this k times (changing which part is held out each time), and evaluates the model by the average score. This lets you assess the **generalization performance** of the model.

(Note) Good generalization performance simply means that the model can correctly classify unseen data. Recall that an overfitted model classifies the training data with high accuracy but is less accurate on unseen data.
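The k-fold procedure described above can be written out by hand, which may make it clearer what a library routine does internally. This is a sketch under assumptions: k=5, shuffling with a fixed seed, and the SVC parameters are all arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.svm import SVC

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Split the sample indices into k=5 parts
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 parts, evaluate on the remaining part
    clf = SVC(gamma=0.001, C=10)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

# The average over the k folds is the cross-validation score
mean_score = float(np.mean(fold_scores))
print("fold accuracies:", fold_scores)
print("mean accuracy: %.3f" % mean_score)
```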

With scikit-learn, you can easily perform grid search and cross-validation using GridSearchCV(). For example, you can specify the following parameters:

scoring
Metric used to rank the parameter combinations. This time we specify 'precision' and 'recall'.
cv
Number of cross-validation folds. Often around 10. If the computation becomes too expensive, or if the dataset is too small, specify fewer folds.

Preparation

Before optimizing the parameters, convert the loaded data into the format the classifier expects.

from sklearn import datasets
# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; both functions now live in sklearn.model_selection
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC

#Loading Digits dataset
digits = datasets.load_digits()
print(len(digits.images))
print(digits.images.shape)

1797
(1797, 8, 8)

# To apply a classifier to this data, we need to flatten each image,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
# reshape to n_samples rows; -1 lets NumPy compute the column count (8 * 8 = 64)
X = digits.images.reshape((n_samples, -1))
y = digits.target
print(X.shape)
print(y)

(1797, 64)
[0 1 2 ..., 8 9 8]
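As a sanity check on the flattening step, the reshaped array can be compared with digits.data, which already stores the images in flattened form. A minimal sketch:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)

# -1 asks NumPy to infer the second dimension (8 * 8 = 64)
X = digits.images.reshape((n_samples, -1))
print(X.shape)  # (1797, 64)

# Each row is one image read row by row
first_row_matches = np.array_equal(X[0], digits.images[0].ravel())

# The result holds the same values as the prepared digits.data array
same_as_data = np.array_equal(X, digits.data)
print(first_row_matches, same_as_data)  # True True
```

So for this dataset you could use digits.data directly; the reshape is shown because image data from other sources usually arrives unflattened.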

Grid search and cross-validation method

The code below may seem daunting, but what you're actually doing is simple.

kernel = "rbf", gamma = 0.001 or 0.0001, C = 1 or 10 or 100 or 1000
kernel = "linear", C = 1 or 10 or 100 or 1000

The code simply tries every combination above to find the parameters (best_params_) that maximize precision and recall, respectively. (Note that gamma is a parameter of the rbf kernel, so it does not apply when the kernel is linear.)

After that, the grid search results are printed in detail, followed by a detailed report produced by classification_report().
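Before running a search, it can help to count how many candidates the grid contains. The sketch below uses ParameterGrid from sklearn.model_selection to enumerate the combinations listed above; the fit count assumes 5-fold cross-validation and one scoring metric.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# 2 gamma values x 4 C values for rbf, plus 4 C values for linear
candidates = list(ParameterGrid(tuned_parameters))
print(len(candidates))  # 12

for params in candidates:
    print(params)

# Each candidate is trained once per fold
n_fits = len(candidates) * 5
print(n_fits)  # 60
```

The cost grows multiplicatively with each added parameter value, so it pays to keep the grid coarse at first.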

#Divide the dataset into training data and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

#Set the parameters you want to optimize with cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    #Grid search and cross-validation method
    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
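    # cv_results_ is provided by sklearn.model_selection.GridSearchCV
    # (it replaces the removed grid_scores_ attribute)
    results = clf.cv_results_
    for mean_score, std_score, params in zip(results['mean_test_score'],
                                             results['std_test_score'],
                                             results['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, std_score * 2, params))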
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'gamma': 0.001, 'kernel': 'rbf', 'C': 10}

Grid scores on development set:

0.987 (+/-0.018) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1}
0.959 (+/-0.030) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1}
0.988 (+/-0.018) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 10}
0.982 (+/-0.027) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 10}
0.988 (+/-0.018) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 100}
0.982 (+/-0.026) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 100}
0.988 (+/-0.018) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1000}
0.982 (+/-0.026) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1000}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 1}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 10}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 100}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 1000}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899


# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'gamma': 0.001, 'kernel': 'rbf', 'C': 10}

Grid scores on development set:

0.986 (+/-0.021) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1}
0.958 (+/-0.029) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1}
0.987 (+/-0.021) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 10}
0.981 (+/-0.029) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 10}
0.987 (+/-0.021) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 100}
0.981 (+/-0.027) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 100}
0.987 (+/-0.021) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1000}
0.981 (+/-0.027) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1000}
0.973 (+/-0.015) for {'kernel': 'linear', 'C': 1}
0.973 (+/-0.015) for {'kernel': 'linear', 'C': 10}
0.973 (+/-0.015) for {'kernel': 'linear', 'C': 100}
0.973 (+/-0.015) for {'kernel': 'linear', 'C': 1000}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899

From the output of print(clf.best_params_), we can see that 'gamma': 0.001, 'kernel': 'rbf', 'C': 10 is the best combination for both precision and recall. The parameters are now optimized.

If necessary, try optimizing with other kernels, or compare against learning methods other than SVM.
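As one possible starting point for trying another kernel, the sketch below runs a small grid search over the polynomial kernel. The grid values, the 'f1_weighted' scoring, and the 50/50 split are assumptions picked for illustration, not recommendations.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Hypothetical grid for the polynomial kernel
param_grid = [{'kernel': ['poly'], 'degree': [2, 3], 'C': [1, 10]}]

search = GridSearchCV(SVC(), param_grid, cv=5, scoring='f1_weighted')
search.fit(X_train, y_train)

print(search.best_params_)
print("cross-validation score: %.3f" % search.best_score_)

# Accuracy of the refitted best model on the held-out half
test_accuracy = search.score(X_test, y_test) if False else \
    search.best_estimator_.score(X_test, y_test)
print("test accuracy: %.3f" % test_accuracy)
```

Comparing best_score_ across searches is only meaningful when the same data split, cv setting, and scoring metric are used.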

References

[1] An introduction to machine learning with scikit-learn - scikit-learn 0.18.1 documentation http://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction
[2] Parameter estimation using grid search with cross-validation - scikit-learn 0.18.1 documentation http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html#example-model-selection-grid-search-digits-py
[3] 1.4. Support Vector Machines - scikit-learn 0.18.1 documentation http://scikit-learn.org/stable/modules/svm.html
[4] F-measure - Machine learning "Toki no Mori Wiki" http://ibisforest.org/index.php?F%E5%80%A4
[5] Master SVM! 8 checkpoints - Qiita http://qiita.com/pika_shi/items/5e59bcf69e85fdd9edb2
[6] Parameter optimization by grid search from Scikit learn http://qiita.com/SE96UoC5AfUt7uY/items/c81f7cea72a44a7bfd3a
[7] Introduction to Bayesian Optimization for Machine Learning | Tech Book Zone Manatee https://book.mynavi.jp/manatee/detail/id=59393
[8] 3.3. Model evaluation: quantifying the quality of predictions - scikit-learn 0.18.1 documentation http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
[9] Overfitting - Wikipedia https://ja.wikipedia.org/wiki/%E9%81%8E%E5%89%B0%E9%81%A9%E5%90%88