This time, we will walk through the implementation of a support vector machine (classification).
We will proceed through the following seven steps.
First, import the required modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Module to read the dataset
from sklearn.datasets import load_iris
#Module for standardization (scaling to zero mean and unit variance)
from sklearn.preprocessing import StandardScaler
#Module that separates training data and test data
from sklearn.model_selection import train_test_split
#Module that runs the support vector machine
from sklearn.svm import SVC
#Module to evaluate classification
from sklearn.metrics import classification_report
This time, we will use the iris dataset for binary classification.
First, load the data, standardize it, and then split it.
#Loading iris dataset
iris = load_iris()
#Divide into explanatory variables and the objective variable
X, y = iris.data[:100, [0, 2]], iris.target[:100]
#Standardization (zero mean, unit variance)
std = StandardScaler()
X = std.fit_transform(X)
#Divide into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
To perform binary classification, we take only the first 100 rows of the dataset (Setosa and Versicolor only). We also narrow the explanatory variables down to two (Sepal Length and Petal Length only) to make them easier to plot.
Standardization matters because, for example, when one feature (explanatory variable) has 2-digit values and another has 4-digit values, the latter dominates the model.
Setting the mean to 0 and the variance to 1 for every feature puts them all on the same scale.
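As a minimal sketch of what StandardScaler does (the numbers below are made up purely for illustration), each feature column is transformed to (x - mean) / standard deviation:
import numpy as np
from sklearn.preprocessing import StandardScaler
#Toy data: one 2-digit feature and one 4-digit feature (made-up values)
X_demo = np.array([[10.0, 1000.0],
                   [20.0, 3000.0],
                   [30.0, 5000.0]])
X_demo_scaled = StandardScaler().fit_transform(X_demo)
#After scaling, each column has mean 0 and variance 1
print(X_demo_scaled.mean(axis = 0))
print(X_demo_scaled.std(axis = 0))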
Let's plot the data before classifying it with the SVM.
#Creating drawing objects and subplots
fig, ax = plt.subplots()
#Setosa plot
ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
marker = 'o', label = 'Setosa')
#Versicolor plot
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
marker = 'x', label = 'Versicolor')
#Axis label settings
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Petal Length')
#Legend settings
ax.legend(loc = 'best')
plt.show()
The features corresponding to Setosa (y_train == 0) and Versicolor (y_train == 1) are plotted, with column 0 (Sepal Length) on the horizontal axis and column 1 (Petal Length) on the vertical axis.
Output result
First, create an SVC instance and fit it to the training data.
#Create an instance
svc = SVC(kernel = 'linear', C = 1e6)
#Create a model from training data
svc.fit(X_train, y_train)
This time the data are already linearly separable (they can be split by a single straight line), so the argument is set to kernel='linear'.
C is a hyperparameter that you tune yourself while looking at the output values and plots.
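If you prefer to search C automatically instead of adjusting it by hand, a minimal sketch with GridSearchCV looks like the following (the candidate values are arbitrary examples, not part of the original steps):
from sklearn.model_selection import GridSearchCV
#Candidate C values to try (arbitrary examples)
param_grid = {'C': [0.1, 1, 10, 100, 1e6]}
grid = GridSearchCV(SVC(kernel = 'linear'), param_grid, cv = 5)
grid.fit(X_train, y_train)
#Best C found by cross-validation
print(grid.best_params_)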
Now that the support vector machine model has been fitted to the training data, let's plot it and check how the classification turned out.
The first half is exactly the same as the scatter plot code above.
#Creating drawing objects and subplots
fig, ax = plt.subplots()
#Setosa plot
ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
marker = 'o', label = 'Setosa')
#Versicolor plot
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
marker = 'x', label = 'Versicolor')
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Petal Length')
ax.legend(loc = 'upper left')
#From here on down, the code can be reused as-is for other data (the plot-range values may need fine adjustment)
#Specify the plot range of the decision boundary (straight line)
xmin = -2.0
xmax = 2.5
ymin = -1.5
ymax = 1.8
#Plot decision boundaries and margins
xx, yy = np.meshgrid(np.linspace(xmin, xmax, 100), np.linspace(ymin, ymax, 100))
xy = np.vstack([xx.ravel(), yy.ravel()]).T
p = svc.decision_function(xy).reshape(100, 100)
ax.contour(xx, yy, p, colors = 'k', levels = [-1, 0, 1], alpha = 1,
linestyles = ['--', '-', '--'])
#Plot support vector
ax.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1],
s = 250, facecolors = 'none', edgecolors = 'black')
plt.show()
alpha: opacity of the contour lines, s: size of the support vector markers (○)
Output result
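Since the kernel is linear, you can also read the equation of the decision boundary (w1*x1 + w2*x2 + b = 0) directly from the fitted model. This is just an optional check, not part of the original seven steps:
#Coefficients (w) and intercept (b) of the separating straight line
w = svc.coef_[0]
b = svc.intercept_[0]
print('w =', w, 'b =', b)
#Decision boundary: w[0] * x + w[1] * y + b = 0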
Now that the model is complete, let's predict the classification.
#Predict classification results
y_pred = svc.predict(X_test)
#Output the predicted values and the true values
print(y_pred)
print(y_test)
Output result
#Compare the predicted values with the true values
y_pred: [0 1 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1 1 1]
y_test: [0 1 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1 1 1]
0:Setosa 1:Versicolor
In this case, you can see that every prediction matches the true label.
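Instead of comparing the two arrays by eye, you can also confirm the overall accuracy with the score method (a minimal optional check, not part of the original steps):
#Accuracy on the test data (fraction of correct predictions)
print(svc.score(X_test, y_test))  #1.0 here, since every prediction matched above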
Since this is a (binary) classification task, we evaluate it with precision, recall, and the F value, which are computed from the confusion matrix.
#Output precision, recall, and F value
print(classification_report(y_test, y_pred))
Output result
From the above, we were able to evaluate the classification of Setosa and Versicolor.
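If you also want to see the confusion matrix itself (mentioned above as the basis for precision, recall, and the F value), a minimal sketch:
from sklearn.metrics import confusion_matrix
#Rows: true labels, columns: predicted labels (0: Setosa, 1: Versicolor)
print(confusion_matrix(y_test, y_pred))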
With SVM, we build and evaluate the model by following steps 1 to 7 above.
This time, for beginners, I have covered only the implementation (code). At some point in the future, I would like to write an article about the theory (mathematical formulas).
Thank you for reading.
References: A new textbook for data analysis using Python (Python 3 engineer certification data analysis test main teaching material)