Continuing the series on algorithms taken up in "Classification of Machine Learning", I will work through the theory step by step, implement it in Python, and analyze it with scikit-learn. I'm writing this for my own learning, so please forgive any mistakes.
Last time, I wrote about the basics of support vector machines. That article dealt with the hard-margin SVM, which assumes the positive and negative examples can be separated cleanly. This time, I would like to cover the soft-margin SVM and the kernel method.
Unlike last time, consider an example where the red and blue circles cannot be cleanly separated, as shown in the figure below.
To relax the conditions of the hard-margin SVM formulation, we introduce a slack variable $\xi$ and a parameter $C$. The slack variable expresses how much each point is allowed to violate the margin boundary, and $C\,(>0)$ controls how strictly the constraint is enforced. With these introduced, the problem to be solved changes as follows.
\min_{\boldsymbol{w},w_0,\boldsymbol{\xi}}\ \frac{1}{2}\|\boldsymbol{w}\|^2+C\sum_{n=1}^{N}\xi_n \\
\text{subject to } t_n(\boldsymbol{w}^Tx_n+w_0) \geq 1-\xi_n \\
\xi_n \geq 0
Regarding the relationship between $C$ and $\xi$: if $C$ is large, the objective cannot be minimized unless $\xi$ is small, while if $C$ is small, it can be minimized even when $\xi$ is somewhat large. If $C$ is infinite, only $\xi=0$ is tolerated (no data points are allowed inside the margin), which is exactly the hard-margin SVM.
Unlike the hard-margin case, there are now two constraints, so we introduce two sets of Lagrange multipliers, $\lambda$ and $\mu$.
L(\boldsymbol{w},w_0,\boldsymbol{\xi},\lambda,\mu)=\frac{1}{2}\|\boldsymbol{w}\|^2+C\sum_{i=1}^{N}\xi_i-\sum_{i=1}^{N}\lambda_i \{ t_i(\boldsymbol{w}^Tx_i+w_0)+\xi_i-1\}-\sum_{i=1}^{N}\mu_i\xi_i
Taking the partial derivatives with respect to $w$, $w_0$, and $\xi$ and setting each to zero gives
w=\sum_{i=1}^{N}\lambda_it_ix_i \\
\sum_{i=1}^{N}\lambda_it_i=0 \\
\lambda_i=C-\mu_i
Substituting these back into the Lagrange function yields
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mx_n^Tx_m
This is exactly the same formula as for the hard margin. However, the constraints become
\sum_{i=1}^{N}\lambda_it_i=0 \\
0 \leq \lambda_i \leq C
As with the hard margin, the parameters can be found using SMO (omitted this time).
Consider the following example, which seems to be inseparable by a straight line.
For data shaped like this, we map the points into a higher-dimensional space, for example 2D → 3D, and then separate them with a plane there. This technique of mapping to a higher dimension is called the **kernel method**, and the computation that makes it practical relies on a **kernel function**, introduced below.
Let $\boldsymbol{\phi}=\{\phi_0(\boldsymbol{x}),\phi_1(\boldsymbol{x}),\cdots,\phi_{m-1}(\boldsymbol{x})\}$ be the mapping that projects the data $\boldsymbol{x}=(x_0,x_1,\cdots,x_{n-1})$ into the higher-dimensional space. Each $\phi(x)$ is called a **basis function**; these are the same kind of basis functions that appeared before, for example the Gaussian basis function.
By applying the basis function, the $x_n^Tx_m$ part of the Lagrange function changes to $\phi(x)_n^T\phi(x)_m$.
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_m\phi(x)_n^T\phi(x)_m
This $\phi(x)_n^T\phi(x)_m$ is an inner product in the higher-dimensional space, and if there are many data points the amount of computation becomes enormous, so we use a small trick.
In fact, $\phi(x)_n^T\phi(x)_m$ can be replaced with $k(x_n,x_m)$, which is called a **kernel function**. This replacement lets us skip the troublesome inner-product computation, and it is known as the **kernel trick**. For details, refer to "Kernel Trick".
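As a small illustration (a toy example of my own, using a degree-2 polynomial kernel rather than the Gaussian one): the explicit feature map and the kernel function give the same value, but the kernel never has to build the high-dimensional vectors.
import numpy as np
def phi2(x):
    # explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])
def k2(x, y):
    # degree-2 polynomial kernel: equal to the inner product of the mapped vectors
    return np.dot(x, y)**2
xa = np.array([1.0, 2.0])
xb = np.array([3.0, -1.0])
print(np.dot(phi2(xa), phi2(xb)))  # ~1.0 (up to floating point)
print(k2(xa, xb))                  # 1.0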
In particular, the kernel function built from the Gaussian basis function mentioned above is called the **RBF kernel (Radial Basis Function kernel)**.
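Written out, it has the following form (this is the same parameterization as scikit-learn's rbf kernel, with the $\gamma$ that appears again below):
k(x_n,x_m)=\exp(-\gamma\|x_n-x_m\|^2)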
Finally, the Lagrange function becomes
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mk(x_n,x_m) \\
\text{subject to }\sum_{i=1}^{N}\lambda_it_i=0,\quad 0 \leq \lambda_i \leq C
In practice, we solve this to find $\lambda$, and then obtain $\boldsymbol{w}$ and $w_0$.
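For completeness (following the standard textbook derivation, which the summary above skips): $\boldsymbol{w}$ comes from the stationarity condition $w=\sum_i\lambda_it_i\phi(x_i)$, and $w_0$ can be recovered by averaging over the support vectors that sit exactly on the margin,
w_0=\frac{1}{N_M}\sum_{n \in M}\Bigl(t_n-\sum_{m \in S}\lambda_mt_mk(x_n,x_m)\Bigr)
where $S$ is the set of support vectors ($\lambda_m>0$) and $M$ is the subset with $0<\lambda_n<C$.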
Last time I used the simple sklearn.svm.LinearSVC to classify, but this time let's try the more general [sklearn.svm.SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
Looking at the explanation of the API, it is as follows.
class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
If you have followed the content so far, this explanation should gradually make sense. The `kernel` parameter determines the basis function: `linear` for a linear kernel and `rbf` for a Gaussian kernel. The important parameters here are `C` and `gamma`. `C` determines the strength of the constraint; the larger it is, the stricter the constraint becomes. `gamma` determines the spread of the Gaussian basis functions, and since it acts as the reciprocal of that spread, the smaller it is, the smoother the decision boundary becomes.
We use the data shown at the beginning as the data to classify. This data is generated with an API called sklearn.datasets.make_moons, which lets you specify the number of samples and the strength of the noise. The decision boundary is also illustrated; since it is not linear, it is drawn as a filled contour using matplotlib's contourf function.
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
%matplotlib inline
# Generate the two interleaving half-moons shown above
X, y = make_moons(n_samples=200,
                  shuffle=True,
                  noise=0.1,
                  random_state=2020)
# Separate the coordinates of each class for plotting
a0, b0 = X[y==0,0], X[y==0,1]
a1, b1 = X[y==1,0], X[y==1,1]
# Soft-margin SVM with an RBF (Gaussian) kernel
model = svm.SVC(C=1.0, kernel='rbf', gamma=1)
model.fit(X, y)
# Grid covering the data range, used to draw the decision boundary
x1_min, x1_max = X[:,0].min() - 0.1, X[:,0].max() + 0.1
x2_min, x2_max = X[:,1].min() - 0.1, X[:,1].max() + 0.1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
# Predict the class of every grid point and reshape for contour plotting
Z = model.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
Z = Z.reshape(xx1.shape)
plt.figure(figsize=(8, 7))
plt.contourf(xx1, xx2, Z, alpha=0.4)
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
plt.scatter(a0, b0, marker='o', s=25, label="y = 0")
plt.scatter(a1, b1, marker='o', s=25, label="y = 1")
plt.legend()
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
The classes appear to be separated. The API also lets you retrieve the support vectors; comparing their number with the total number of data points shows that the boundary is described by only a small subset of the data, which saves memory and speeds up computation.
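For example, continuing from the fitted model above, the support vectors can be inspected like this:
print(model.n_support_)              # number of support vectors per class
print(model.support_vectors_.shape)  # coordinates of the support vectors
print(len(X))                        # total number of samples, for comparison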
In the example above I chose `C` and `gamma` somewhat arbitrarily, but what happens if we change them? Let's actually draw it.
list_C = [0.1, 1, 20]
list_gamma = [0.05, 0.5, 20]
x1_min, x1_max = X[:,0].min() - 0.1, X[:,0].max() + 0.1
x2_min, x2_max = X[:,1].min() - 0.1, X[:,1].max() + 0.1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
plt.figure(figsize=(11, 11))
for i in range(len(list_C)):
    for j in range(len(list_gamma)):
        # Fit an RBF-kernel SVM for each (C, gamma) combination
        model = svm.SVC(C=list_C[i], kernel='rbf', gamma=list_gamma[j])
        model.fit(X, y)
        Z = model.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
        Z = Z.reshape(xx1.shape)
        # Rows vary C, columns vary gamma
        ax = plt.subplot(len(list_C), len(list_gamma), i*len(list_gamma)+j+1)
        ax.set_title("C={}, gamma={}".format(list_C[i], list_gamma[j]))
        ax.contourf(xx1, xx2, Z, alpha=0.4)
        ax.scatter(a0, b0, marker='o', s=25, label="y = 0")
        ax.scatter(a1, b1, marker='o', s=25, label="y = 1")
        ax.set_xlim(xx1.min(), xx1.max())
        ax.set_ylim(xx2.min(), xx2.max())
        ax.set_xlabel("x1")
        ax.set_ylabel("x2")
plt.show()
The result is as follows.
As explained above, the larger `C` is, the harder the model tries to separate the classes, and the larger `gamma` is, the more complex the boundary curve becomes. However, the bottom-right panel appears to be overfitting, so parameter tuning seems necessary.
To tune the parameters, we need to split the samples into training data and validation data and search for the parameters that give the best agreement when predicting on the validation data. This is called **cross-validation**, and I will try to summarize it another time.
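Just as a preview of what that looks like in scikit-learn (a minimal sketch with GridSearchCV, reusing the X and y from above; the details are left for that future article):
from sklearn.model_selection import GridSearchCV
# Search over C and gamma with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 20], 'gamma': [0.05, 0.5, 20]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)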
The support vector machine has been extended from the hard margin to the soft margin and, via the kernel method, to non-linear separation. Seen this way, it can handle quite complicated classification problems, and it is understandable that it was so popular before neural networks.