Continuing the series on algorithms taken up in "Classification of Machine Learning", I will work through the theory step by step, implement it in Python, and analyze it with scikit-learn. I'm writing this for my own learning, so please forgive any mistakes.
Last time, I wrote about the basics of support vector machines. That article dealt with the hard-margin SVM, which assumes the positive and negative examples can be separated cleanly. This time, I would like to cover the soft-margin SVM and the kernel method.
Unlike last time, consider an example where the red and blue circles cannot be cleanly separated, as shown in the figure below.
To relax the conditions of the hard-margin SVM formulation, we introduce a slack variable $\xi$ and a parameter $C$. The slack variable expresses how much each point is allowed to violate the margin boundary, and $C\,(>0)$ controls how strictly the constraint is enforced. With these introduced, the problem to be solved changes as follows.
\min_{\boldsymbol{w},w_0,\boldsymbol{\xi}}\ \frac{1}{2}\|\boldsymbol{w}\|^2+C\sum_{n=1}^{N}\xi_n \\
\text{subject to } t_n(\boldsymbol{w}^Tx_n+w_0) \geq 1-\xi_n \\
\xi_n \geq 0
Regarding the relationship between $C$ and $\xi$: if $C$ is large, the objective cannot be minimized unless $\xi$ is small, while if $C$ is small, it can be minimized even when $\xi$ is somewhat large. If $C$ is infinite, only $\xi=0$ is tolerated (no data points are allowed inside the margin), which is exactly the hard-margin SVM.
Unlike the hard-margin case, there are now two constraints, so we introduce two sets of Lagrange multipliers, $\lambda$ and $\mu$.
L(\boldsymbol{w},w_0,\boldsymbol{\xi},\lambda,\mu)=\frac{1}{2}\|\boldsymbol{w}\|^2+C\sum_{i=1}^{N}\xi_i-\sum_{i=1}^{N}\lambda_i \{ t_i(\boldsymbol{w}^Tx_i+w_0)+\xi_i-1\}-\sum_{i=1}^{N}\mu_i\xi_i
Taking the partial derivatives with respect to $w$, $w_0$, and $\xi$ and setting each to zero gives
w=\sum_{i=1}^{N}\lambda_it_ix_i \\
\sum_{i=1}^{N}\lambda_it_i=0 \\
\lambda_i=C-\mu_i
Substituting these back into the Lagrange function yields
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mx_n^Tx_m
This is exactly the same formula as for the hard margin. However, the constraints become
\sum_{i=1}^{N}\lambda_it_i=0 \\
0 \leq \lambda_i \leq C
As with the hard margin, the parameters can be found using SMO (omitted this time).
Consider the following example, which seems to be inseparable by a straight line.
For data shaped like this, we map the points into a higher-dimensional space, for example 2D → 3D, and then separate them with a plane there. This technique of mapping to a higher dimension is called the **kernel method**, and the computation that makes it practical relies on a **kernel function**, introduced below.
Let $\boldsymbol{\phi}=\{\phi_0(\boldsymbol{x}),\phi_1(\boldsymbol{x}),\cdots,\phi_{m-1}(\boldsymbol{x})\}$ be the mapping that projects the data $\boldsymbol{x}=(x_0,x_1,\cdots,x_{n-1})$ into the higher-dimensional space. Each $\phi(x)$ is called a **basis function**; these are the same kind of basis functions that appeared before, for example the Gaussian basis function.
By applying the basis function, the $x_n^Tx_m$ part of the Lagrange function changes to $\phi(x)_n^T\phi(x)_m$.
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_m\phi(x)_n^T\phi(x)_m
This $\phi(x)_n^T\phi(x)_m$ is an inner product in the higher-dimensional space, and if there are many data points the amount of computation becomes enormous, so we use a small trick.
In fact, $\phi(x)_n^T\phi(x)_m$ can be replaced with $k(x_n,x_m)$, which is called a **kernel function**. This replacement lets us skip the troublesome inner-product computation, and it is known as the **kernel trick**. For details, refer to "Kernel Trick".
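As a small illustration (a toy example of my own, using a degree-2 polynomial kernel rather than the Gaussian one): the explicit feature map and the kernel function give the same value, but the kernel never has to build the high-dimensional vectors.
import numpy as np
def phi2(x):
    # explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])
def k2(x, y):
    # degree-2 polynomial kernel: equal to the inner product of the mapped vectors
    return np.dot(x, y)**2
xa = np.array([1.0, 2.0])
xb = np.array([3.0, -1.0])
print(np.dot(phi2(xa), phi2(xb)))  # ~1.0 (up to floating point)
print(k2(xa, xb))                  # 1.0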
In particular, the kernel function built from the Gaussian basis function mentioned above is called the **RBF kernel (Radial Basis Function kernel)**.
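Written out, it has the following form (this is the same parameterization as scikit-learn's rbf kernel, with the $\gamma$ that appears again below):
k(x_n,x_m)=\exp(-\gamma\|x_n-x_m\|^2)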
Finally, the Lagrange function becomes
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mk(x_n,x_m) \\
\text{subject to }\sum_{i=1}^{N}\lambda_it_i=0,\quad 0 \leq \lambda_i \leq C
In practice, we solve this to find $\lambda$, and then obtain $\boldsymbol{w}$ and $w_0$.
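For completeness (following the standard textbook derivation, which the summary above skips): $\boldsymbol{w}$ comes from the stationarity condition $w=\sum_i\lambda_it_i\phi(x_i)$, and $w_0$ can be recovered by averaging over the support vectors that sit exactly on the margin,
w_0=\frac{1}{N_M}\sum_{n \in M}\Bigl(t_n-\sum_{m \in S}\lambda_mt_mk(x_n,x_m)\Bigr)
where $S$ is the set of support vectors ($\lambda_m>0$) and $M$ is the subset with $0<\lambda_n<C$.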
Last time I used the simple sklearn.svm.LinearSVC to classify, but this time let's try the more general [sklearn.svm.SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
Looking at the explanation of the API, it is as follows.
class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
If you have followed the content so far, this explanation should gradually make sense. The `kernel` parameter determines the basis function: `linear` for a linear kernel and `rbf` for a Gaussian kernel. The important parameters here are `C` and `gamma`. `C` determines the strength of the constraint; the larger it is, the stricter the constraint becomes. `gamma` determines the spread of the Gaussian basis functions, and since it acts as the reciprocal of that spread, the smaller it is, the smoother the decision boundary becomes.
We use the data shown at the beginning as the data to classify. This data is generated with an API called sklearn.datasets.make_moons, which lets you specify the number of samples and the strength of the noise. The decision boundary is also illustrated; since it is not linear, it is drawn as a filled contour using matplotlib's contourf function.
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
%matplotlib inline
# Generate the two interleaving half-moons shown above
X, y = make_moons(n_samples=200,
                  shuffle=True,
                  noise=0.1,
                  random_state=2020)
# Separate the coordinates of each class for plotting
a0, b0 = X[y==0,0], X[y==0,1]
a1, b1 = X[y==1,0], X[y==1,1]
# Soft-margin SVM with an RBF (Gaussian) kernel
model = svm.SVC(C=1.0, kernel='rbf', gamma=1)
model.fit(X, y)
# Grid covering the data range, used to draw the decision boundary
x1_min, x1_max = X[:,0].min() - 0.1, X[:,0].max() + 0.1
x2_min, x2_max = X[:,1].min() - 0.1, X[:,1].max() + 0.1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
# Predict the class of every grid point and reshape for contour plotting
Z = model.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
Z = Z.reshape(xx1.shape)
plt.figure(figsize=(8, 7))
plt.contourf(xx1, xx2, Z, alpha=0.4)
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
plt.scatter(a0, b0, marker='o', s=25, label="y = 0")
plt.scatter(a1, b1, marker='o', s=25, label="y = 1")
plt.legend()
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
The classes appear to be separated. The API also lets you retrieve the support vectors; comparing their number with the total number of data points shows that the boundary is described by only a small subset of the data, which saves memory and speeds up computation.
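For example, continuing from the fitted model above, the support vectors can be inspected like this:
print(model.n_support_)              # number of support vectors per class
print(model.support_vectors_.shape)  # coordinates of the support vectors
print(len(X))                        # total number of samples, for comparison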
In the example above I chose `C` and `gamma` somewhat arbitrarily, but what happens if we change them? Let's actually draw it.
list_C = [0.1, 1, 20]
list_gamma = [0.05, 0.5, 20]
x1_min, x1_max = X[:,0].min() - 0.1, X[:,0].max() + 0.1
x2_min, x2_max = X[:,1].min() - 0.1, X[:,1].max() + 0.1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
plt.figure(figsize=(11, 11))
for i in range(len(list_C)):
    for j in range(len(list_gamma)):
        # Fit an RBF-kernel SVM for each (C, gamma) combination
        model = svm.SVC(C=list_C[i], kernel='rbf', gamma=list_gamma[j])
        model.fit(X, y)
        Z = model.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
        Z = Z.reshape(xx1.shape)
        # Rows vary C, columns vary gamma
        ax = plt.subplot(len(list_C), len(list_gamma), i*len(list_gamma)+j+1)
        ax.set_title("C={}, gamma={}".format(list_C[i], list_gamma[j]))
        ax.contourf(xx1, xx2, Z, alpha=0.4)
        ax.scatter(a0, b0, marker='o', s=25, label="y = 0")
        ax.scatter(a1, b1, marker='o', s=25, label="y = 1")
        ax.set_xlim(xx1.min(), xx1.max())
        ax.set_ylim(xx2.min(), xx2.max())
        ax.set_xlabel("x1")
        ax.set_ylabel("x2")
plt.show()
The result is as follows.
As explained above, the larger `C` is, the harder the model tries to separate the classes, and the larger `gamma` is, the more complex the boundary curve becomes. However, the bottom-right panel appears to be overfitting, so parameter tuning seems necessary.
To tune the parameters, we need to split the samples into training data and validation data and search for the parameters that give the best agreement when predicting on the validation data. This is called **cross-validation**, and I will try to summarize it another time.
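Just as a preview of what that looks like in scikit-learn (a minimal sketch with GridSearchCV, reusing the X and y from above; the details are left for that future article):
from sklearn.model_selection import GridSearchCV
# Search over C and gamma with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 20], 'gamma': [0.05, 0.5, 20]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)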
The support vector machine has been extended from the hard margin to the soft margin and, via the kernel method, to non-linear separation. Seen this way, it can handle quite complicated classification problems, and it is understandable that it was so popular before neural networks.