The scikit-learn algorithm cheat sheet provides a guideline for choosing which machine learning algorithm to use. Even looking only at its classification branch, many algorithms are available. In this article, we compare them so that you can intuitively understand how each classifier draws its decision boundary.
- Basically reproduces the "Classifier comparison" example from the scikit-learn examples (auto_examples/classification/plot_classifier_comparison.html)
- Compares the classifiers in scikit-learn using generated datasets
- Visually confirms the decision boundary of each classifier
Real-world datasets will not always look like these examples, so take the results as a reference only. In particular, when dealing with high-dimensional data, relatively simple classifiers such as naive Bayes or a linear support vector machine can often still separate the classes, and they tend to generalize better and be more useful than complex classifiers.
We experiment with the following nine classifiers.
- k-nearest neighbors
- Support vector machine (linear kernel)
- Support vector machine (Gaussian/RBF kernel)
- Decision tree
- Random forest
- AdaBoost
- Naive Bayes
- Linear discriminant analysis
- Quadratic discriminant analysis
We use scikit-learn's data generation functions for classification problems. The following three datasets with different properties are used.
Generated with make_moons
The color of the dots on the graph indicates the label.
X, y = make_moons(noise = 0.05, random_state=0)
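To see what the generated data looks like, you can plot it yourself. The following is just an illustrative sketch (assuming matplotlib is available) that colors each point by its label y:
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(noise=0.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)  # color each point by its label
plt.title("make_moons")
plt.show()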
Generated with make_circles
X, y = make_circles(noise = 0.02, random_state=0)
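make_circles also accepts a factor argument that sets the radius of the inner circle relative to the outer one (the comparison code later uses factor=0.6). The line below is only an illustrative sketch with factor=0.5:
from sklearn.datasets import make_circles

X, y = make_circles(noise=0.02, factor=0.5, random_state=0)  # inner circle at half the outer radius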
Generated with make_classification
For details on make_classification, the following commentary article is easy to understand:
- Sample data generation using scikit-learn
In the following example, 100 samples of two-dimensional input data X and the corresponding two-class label data y are generated.
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
random_state=5, n_clusters_per_class=1, n_samples=100, n_classes=2)
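As a quick sanity check, the sketch below (an illustrative snippet; the exact class counts depend on random_state) confirms the shape and class balance of the generated data:
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=5, n_clusters_per_class=1, n_samples=100, n_classes=2)
print(X.shape)         # (100, 2): 100 samples with 2 features
print(y.shape)         # (100,): one label per sample
print(np.bincount(y))  # number of samples in each of the 2 classes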
Here is the output. The leftmost column shows the original data, and the second and subsequent columns show the decision boundaries drawn by the different classifiers. The top row uses the half-moon dataset, the middle row the concentric-circle dataset, and the bottom row the linearly separable dataset. The number in the lower right of each plot is the accuracy of the model on the test data. You can see the characteristic way each classifier draws its decision boundary.
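That number is the mean accuracy returned by the classifier's score method on the held-out test set. A minimal sketch for a single classifier and dataset (k-nearest neighbors on the moons data is just an illustrative choice):
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
clf = KNeighborsClassifier(3).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set, between 0 and 1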
Apart from the added comments, the code is essentially the same as the source Classifier comparison example.
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
h = .02 # step size in the mesh
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
"Random Forest", "AdaBoost", "Naive Bayes", "Linear Discriminant Analysis",
"Quadratic Discriminant Analysis"]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
AdaBoostClassifier(),
GaussianNB(),
LinearDiscriminantAnalysis(),
QuadraticDiscriminantAnalysis()]
# Generate data for a random two-class classification problem:
# 100 samples of 2-D input data X and the corresponding 2-class label data y
X, y = make_classification(n_features=2, n_samples=100, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1, n_classes=2)
rng = np.random.RandomState(0)      # Random number generator; the argument is just the seed
X += 2 * rng.uniform(size=X.shape)  # Perturb the original data with uniform noise
linearly_separable = (X, y)         # Linearly separable dataset

datasets = [make_moons(noise=0.25, random_state=0),
            make_circles(noise=0.2, factor=0.6, random_state=1),
            linearly_separable]
figure = plt.figure(figsize=(27, 9))
i = 1
# Loop over the 3 datasets
for ds in datasets:
    # Preprocess the data and split it into training and test sets
    X, y = ds
    X = StandardScaler().fit_transform(X)  # Standardize the data (zero mean, unit variance)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # Plot the dataset only (leftmost column)
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)  # Training points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)  # Test points (lighter)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # Loop over the classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        # Evaluate the classifier at each grid point to draw the decision boundary
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])  # Signed distance from the decision boundary
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]  # Probability of class 1
        # Color plot of the decision regions
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)  # Training data points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)  # Test data points
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1
figure.subplots_adjust(left=.02, right=.98)
plt.show()