The scikit-learn algorithm cheat sheet provides a guideline for choosing which machine learning algorithm to use. Even looking only at its classification branch, many algorithms are available. In this article, we compare them so that you can intuitively understand how each classifier draws its decision boundary.
- Basically reproduces the "Classifier comparison" example from the scikit-learn examples (auto_examples/classification/plot_classifier_comparison.html)
- Compares the classifiers in scikit-learn using generated datasets
- Visually confirms the decision boundary of each classifier
Real-world datasets will not always look like these examples, so take the results as a reference only. In particular, when dealing with high-dimensional data, relatively simple classifiers such as naive Bayes or a linear support vector machine can often still separate the classes, and they tend to generalize better and be more useful than complex classifiers.
We experiment with the following nine classifiers.
- k-nearest neighbors
- Support vector machine (linear kernel)
- Support vector machine (Gaussian/RBF kernel)
- Decision tree
- Random forest
- AdaBoost
- Naive Bayes
- Linear discriminant analysis
- Quadratic discriminant analysis
We use scikit-learn's data generation functions for classification problems. The following three datasets with different properties are used.
Generated with make_moons
The color of the dots on the graph indicates the label.
X, y = make_moons(noise = 0.05, random_state=0)
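To see what the generated data looks like, you can plot it yourself. The following is just an illustrative sketch (assuming matplotlib is available) that colors each point by its label y:
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(noise=0.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)  # color each point by its label
plt.title("make_moons")
plt.show()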
Generated with make_circles
X, y = make_circles(noise = 0.02, random_state=0)
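make_circles also accepts a factor argument that sets the radius of the inner circle relative to the outer one (the comparison code later uses factor=0.6). The line below is only an illustrative sketch with factor=0.5:
from sklearn.datasets import make_circles

X, y = make_circles(noise=0.02, factor=0.5, random_state=0)  # inner circle at half the outer radius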
Generated with make_classification
For details on make_classification, the following commentary article is easy to understand:
- Sample data generation using scikit-learn
In the following example, 100 samples of two-dimensional input data X and the corresponding two-class label data y are generated.
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
random_state=5, n_clusters_per_class=1, n_samples=100, n_classes=2)
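As a quick sanity check, the sketch below (an illustrative snippet; the exact class counts depend on random_state) confirms the shape and class balance of the generated data:
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=5, n_clusters_per_class=1, n_samples=100, n_classes=2)
print(X.shape)         # (100, 2): 100 samples with 2 features
print(y.shape)         # (100,): one label per sample
print(np.bincount(y))  # number of samples in each of the 2 classes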
Here is the output. The leftmost column shows the original data, and the second and subsequent columns show the decision boundaries drawn by the different classifiers. The top row uses the half-moon dataset, the middle row the concentric-circle dataset, and the bottom row the linearly separable dataset. The number in the lower right of each plot is the accuracy of the model on the test data. You can see the characteristic way each classifier draws its decision boundary.
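That number is the mean accuracy returned by the classifier's score method on the held-out test set. A minimal sketch for a single classifier and dataset (k-nearest neighbors on the moons data is just an illustrative choice):
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
clf = KNeighborsClassifier(3).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set, between 0 and 1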
Apart from the added comments, the code is essentially the same as the source Classifier comparison example.
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
h = .02 # step size in the mesh
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
"Random Forest", "AdaBoost", "Naive Bayes", "Linear Discriminant Analysis",
"Quadratic Discriminant Analysis"]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
AdaBoostClassifier(),
GaussianNB(),
LinearDiscriminantAnalysis(),
QuadraticDiscriminantAnalysis()]
# Generate data for a random two-class classification problem:
# 100 samples of 2-D input data X and the corresponding 2-class label data y
X, y = make_classification(n_features=2, n_samples=100, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1, n_classes=2)
rng = np.random.RandomState(0)      # Random number generator; the argument is just the seed
X += 2 * rng.uniform(size=X.shape)  # Perturb the original data with uniform noise
linearly_separable = (X, y)         # Linearly separable dataset

datasets = [make_moons(noise=0.25, random_state=0),
            make_circles(noise=0.2, factor=0.6, random_state=1),
            linearly_separable]
figure = plt.figure(figsize=(27, 9))
i = 1
# Loop over the 3 datasets
for ds in datasets:
    # Preprocess the data and split it into training and test sets
    X, y = ds
    X = StandardScaler().fit_transform(X)  # Standardize the data (zero mean, unit variance)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # Plot the dataset only (leftmost column)
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)  # Training points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)  # Test points (lighter)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # Loop over the classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        # Evaluate the classifier at each grid point to draw the decision boundary
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])  # Signed distance from the decision boundary
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]  # Probability of class 1
        # Color plot of the decision regions
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)  # Training data points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)  # Test data points
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1
figure.subplots_adjust(left=.02, right=.98)
plt.show()