[PYTHON] [Machine learning] Understanding SVM from both scikit-learn and mathematics

1. Purpose

If you want to try machine learning, anyone can implement it relatively easily with scikit-learn and similar libraries. However, to achieve results at work or to raise your own level, an explanation like "I don't know the background, but I got this result" is clearly weak.

This article has two objectives: in sections 2 and 3, "set the theory aside and try scikit-learn first", and in section 4 and later, "understand the background from mathematics".

2. What is SVM (Support Vector Machine)?

SVM is a supervised learning model that can be used for both classification and regression. Because it is designed to obtain high discrimination performance on data it has not been trained on, it demonstrates excellent recognition performance. Source: [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B5%E3%83%9D%E3%83%BC%E3%83%88%E3%83%99%E3%82%AF%E3%82%BF%E3%83%BC%E3%83%9E%E3%82%B7%E3%83%B3)

Roughly speaking, **it tends to remain a highly accurate model when new data arrives**.

◆ Specific example

Suppose you are the president of an event planning company. Suppose you are planning a tour to see "rare cats" in response to the recent cat boom (a fictional setting).

キャプチャ1.PNGキャプチャ2.PNG

Since there are too many candidate locations for the tour, you have collected data on rare cats (= A) and ordinary cats (= B). Based on that data, you will build a model that can judge whether a cat is rare from its "body size" and "whisker length", and then plan the tour around the places where the model judges that rare cats live.

The distribution of the data is as follows.

◆ What is SVM?

Now, what kind of boundary can be drawn between blue and orange in the distribution shown above? For the data at hand, both a red boundary and a green boundary are possible, as shown below.
キャプチャ4.PNG

Now suppose one new data point arrives, and plot it as well (the point in the orange frame).
キャプチャ5.PNG

In this case, the red boundary classifies the new point correctly, but the green boundary classifies it as a rare cat even though it is actually an ordinary cat, which is a misclassification.

To prevent such misjudgment and find the correct classification standard, SVM uses the concept of **"maximizing the margin"**. The margin is the distance between a boundary, such as the red or green one above, and the actual data. The idea is that if this margin is large, **misjudgment caused by slight changes in the data can be kept as small as possible**.

キャプチャ6.PNG

Data near the boundary is, so to speak, data for which it is hard to tell "rare cats" from "ordinary cats". Having many such borderline points is a problem, so the idea is to choose the boundary so that it is as far as possible from the data, which minimizes the risk of misjudgment.
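To make the idea of "distance between the boundary and the data" concrete, here is a minimal sketch. The four points and the two candidate boundaries are hypothetical values chosen for illustration (they are not the boundaries in the figures), and `distances_to_line` is a helper name introduced here.

```python
import numpy as np

# Hypothetical points: (body size, whisker length)
points = np.array([[25.0, 20.0], [28.0, 30.0],   # rare cats
                   [32.0, 18.0], [38.0, 30.0]])  # ordinary cats

def distances_to_line(points, w, b):
    """Perpendicular distance from each point to the line w[0]*x1 + w[1]*x2 + b = 0."""
    w = np.asarray(w, dtype=float)
    return np.abs(points @ w + b) / np.linalg.norm(w)

# Two hypothetical vertical boundaries; both separate the two groups
candidates = {
    "boundary A": ((1.0, 0.0), -30.0),  # body size = 30
    "boundary B": ((1.0, 0.0), -29.0),  # body size = 29
}

margins = {}
for name, (w, b) in candidates.items():
    # The margin is decided by the point closest to the boundary
    margins[name] = distances_to_line(points, w, b).min()
    print(f"{name}: smallest distance to any point = {margins[name]:.2f}")
```

Both boundaries classify these four points correctly, but boundary A keeps every point at least 2 units away while boundary B only manages 1 unit, so boundary A is the one that margin maximization would prefer.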

◆ About the penalty

However, a boundary that classifies everything 100% perfectly rarely exists. In the real world, outliers sometimes appear in the data, as shown below.

キャプチャ7.PNG

If you try to draw a boundary that accurately classifies this new orange point, you can imagine that it will probably be a boundary that does not match the actual situation. (So-called overfitting)

To make a judgment that suits the actual situation, SVM allows **"some misjudgment"**.

How much misjudgment is allowed is something we have to decide ourselves when building the model. This setting is called the "penalty", and it appears again in the scikit-learn section that follows.

◆ To summarize ...

SVM can be described as a model that strikes a **good balance** between the following two goals.

・ To prevent misjudgment as much as possible, draw a boundary that maximizes the distance between the boundary and the data, that is, the margin.
・ However, allow some misjudgment in order to draw a boundary that matches the actual situation.

3. SVM with scikit-learn

(1) Import of required libraries

Import the following libraries, which are required to run SVM.

from sklearn.svm import SVC

# Below: libraries for plotting, plus pandas and numpy
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

(2) Data preparation

Prepare the body size and whisker length data together with the rare / ordinary classification (True for rare cats, False for ordinary cats), as shown below.

data = pd.DataFrame({
        "rare":[True,True,True,True,True,False,False,False,False,False,False,False,False],
        "scale":[20, 25, 30, 24, 28, 35, 40, 38, 55, 50, 60,32,25],
        "hige":[10, 20, 40, 18, 30, 10, 20, 30, 25, 28, 30,18,25],
    })

(3) Try to illustrate (important)

Next, plot body size and whisker length together with the rare / ordinary classification. To grasp the characteristics of the data, do not jump straight into scikit-learn; plot any data first.

y = data["rare"].values
x1, x2 = data["scale"].values, data["hige"].values 

#Plot the data
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.plot(x1[y], x2[y], 'o', color='C0', label='rare')  # Blue circles: y is True (= rare)
plt.plot(x1[~y], x2[~y], '^', color='C1', label='normal')  # Orange triangles: y is False (= ordinary)
plt.xlabel("scale")
plt.ylabel("hige")
plt.legend(loc='best')
plt.show()
キャプチャ8.PNG

It looks like a boundary can be drawn somewhere between the two groups.

(4) Model construction

(I) Data shaping

First, arrange the data into the shape needed to build the model.

y = data["rare"].values  # Same as the one defined above, so you can omit it
X = data[["scale", "hige"]].values

Since this is not an article on Python grammar, I will omit the details, but this puts X and y into the form that scikit-learn's SVM expects.
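As a quick check of what "the form that scikit-learn expects" means, the sketch below rebuilds the same data and prints the array shapes: `X` is a 2-D array with one row per cat and one column per feature, and `y` is a 1-D array with one label per cat.

```python
import pandas as pd

data = pd.DataFrame({
    "rare": [True, True, True, True, True,
             False, False, False, False, False, False, False, False],
    "scale": [20, 25, 30, 24, 28, 35, 40, 38, 55, 50, 60, 32, 25],
    "hige": [10, 20, 40, 18, 30, 10, 20, 30, 25, 28, 30, 18, 25],
})

y = data["rare"].values             # labels: one entry per cat
X = data[["scale", "hige"]].values  # features: one row per cat, one column per feature

print(X.shape)  # (13, 2)
print(y.shape)  # (13,)
print(y.dtype)  # bool
```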

(Ii) Model construction

Finally, here is the model-building code.

C = 10
clf = SVC(C=C,kernel="linear")
clf.fit(X, y) 

That's it for a simple model. The second line creates an SVM model in a variable called clf, and the next line fits (= trains) clf on the prepared X and y.
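After fitting, the model can be inspected. The following sketch rebuilds the same data and prints a few attributes of the fitted `SVC`. Note that `coef_` is only available with `kernel="linear"`; calling the two printed quantities `w` and `b` follows the usual convention for a linear boundary and is a labeling choice made here, not something the code above defines.

```python
import pandas as pd
from sklearn.svm import SVC

data = pd.DataFrame({
    "rare": [True, True, True, True, True,
             False, False, False, False, False, False, False, False],
    "scale": [20, 25, 30, 24, 28, 35, 40, 38, 55, 50, 60, 32, 25],
    "hige": [10, 20, 40, 18, 30, 10, 20, 30, 25, 28, 30, 18, 25],
})
X = data[["scale", "hige"]].values
y = data["rare"].values

clf = SVC(C=10, kernel="linear")
clf.fit(X, y)

print("classes:", clf.classes_)                    # label values the model learned
print("support vectors:\n", clf.support_vectors_)  # training points that determine the boundary
print("w (coef_):", clf.coef_[0])                  # weights of the linear boundary
print("b (intercept_):", clf.intercept_[0])        # offset of the linear boundary
```

The support vectors are the training points that end up deciding where the boundary is drawn, which is where the name "support vector machine" comes from.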

◆ About arguments

The main arguments to consider when building an SVM model are $C$ and kernel.

**<About $C$>** Since the goal here is just to try things out, I will omit the details, but the smaller you make $C$, the more misclassification the model allows.
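To see the effect of $C$, the sketch below fits the same data with three values of $C$ and prints the width of the margin and the number of support vectors. The three values are arbitrary choices for illustration, and the margin width is computed as $2 / \|w\|$, which is the standard formula for a linear SVM rather than something derived in this article.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

data = pd.DataFrame({
    "rare": [True, True, True, True, True,
             False, False, False, False, False, False, False, False],
    "scale": [20, 25, 30, 24, 28, 35, 40, 38, 55, 50, 60, 32, 25],
    "hige": [10, 20, 40, 18, 30, 10, 20, 30, 25, 28, 30, 18, 25],
})
X = data[["scale", "hige"]].values
y = data["rare"].values

margins = {}
for C in (0.0001, 0.1, 10):  # arbitrary values, from very tolerant to strict
    model = SVC(C=C, kernel="linear").fit(X, y)
    w = model.coef_[0]
    margins[C] = 2.0 / np.linalg.norm(w)  # width of the band around the boundary
    print(f"C={C}: margin width = {margins[C]:.2f}, "
          f"support vectors = {len(model.support_)}")
```

With a very small $C$ the margin becomes wide and many points are allowed to fall inside it; with a large $C$ the margin narrows because violations are penalized more heavily.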

**<About kernel>** The available kernel types are 'linear', 'poly', 'rbf', 'sigmoid', and 'precomputed'. [See the official reference for details](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

Here, I will introduce 'linear' and 'rbf'. Use linear to draw a straight (planar) boundary, and use rbf (a nonlinear kernel function) to draw a curved boundary. The result changes depending on which one you choose.

(5) Illustrate the constructed model

Now let's illustrate this boundary in the scatter plot above.

fig, ax = plt.subplots(figsize=(6, 4))
# Show the data points, colored by class
ax.scatter(X[:, 0], X[:, 1], c=y)

# Arrange 10 values along the x axis (body size)
x_vals = np.linspace(np.min(X[:, 0]), np.max(X[:, 0]), 10)
# Arrange 10 values along the y axis (whisker length)
# (named y_vals so that the label array y is not overwritten)
y_vals = np.linspace(np.min(X[:, 1]), np.max(X[:, 1]), 10)
# x and y coordinate arrays of the 100 grid points that combine x_vals and y_vals
x_g, y_g = np.meshgrid(x_vals, y_vals)
# Join the two coordinates with np.c_ and pass them to the SVM
# (the True/False predictions are converted to 1/0 for plotting)
z_g = clf.predict(np.c_[x_g.ravel(), y_g.ravel()]).astype(int)
# z_g is a 1-D array, so reshape it back to (10, 10) for the graph
z_g = z_g.reshape(x_g.shape)

# Color the regions on each side of the boundary
ax.contourf(x_g, y_g, z_g, cmap=plt.cm.coolwarm, alpha=0.8)

# Display at the end
plt.show()
キャプチャ9.PNG

As a result of building the model, the boundary was drawn as shown above. When new data arrives, it will be classified as an ordinary cat if it falls in the blue area and as a rare cat if it falls in the red area.

By the way, if the kernel introduced in "◆ About arguments" in (4) is set to rbf, the boundary looks as follows.
キャプチャ10.PNG

It's a completely different boundary! In this case, I feel that linear draws the boundaries of the data more appropriately, so let's use linear for the kernel.
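The comparison above can be reproduced with a short sketch like the following, which fits one model per kernel on the same data. Keeping `C=10` for both and leaving the other rbf settings at scikit-learn's defaults are assumptions made here; the article does not say which settings produced the rbf figure.

```python
import pandas as pd
from sklearn.svm import SVC

data = pd.DataFrame({
    "rare": [True, True, True, True, True,
             False, False, False, False, False, False, False, False],
    "scale": [20, 25, 30, 24, 28, 35, 40, 38, 55, 50, 60, 32, 25],
    "hige": [10, 20, 40, 18, 30, 10, 20, 30, 25, 28, 30, 18, 25],
})
X = data[["scale", "hige"]].values
y = data["rare"].values

# One model per kernel, everything else identical
models = {kernel: SVC(C=10, kernel=kernel).fit(X, y) for kernel in ("linear", "rbf")}

scores = {}
for kernel, model in models.items():
    scores[kernel] = model.score(X, y)  # fraction of training points classified correctly
    print(f"{kernel}: training accuracy = {scores[kernel]:.2f}")
```

Training accuracy alone does not say which boundary will hold up on new cats: a curved boundary can fit the training points closely and still match the actual situation poorly, which is why looking at the plotted boundary is worthwhile.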

(6) In the real world ...

There is no point in stopping once the model is built. In the real world, what matters is using this predictive model to tell rare cats from ordinary ones when new cat data is acquired.

Suppose you obtained information on two more cats and wrote down their data. Store it in a variable called z as shown below.

z = pd.DataFrame({
        "scale":[28, 45],
        "hige":[25, 20],
    })
z2 = z[["scale", "hige"]].values

Comparing this data with the plot of the linear boundary, the first cat will probably be classified as red (rare = True) and the second cat as blue (ordinary = False). Now let's make the prediction.

y_est = clf.predict(z2)

Running this gives y_est as ([True, False]), so you can see that the cats are classified according to the boundary.
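Besides the predicted label, it can be useful to see on which side of the boundary, and how far from it, each new cat falls. The sketch below uses `decision_function`, which returns a signed score for each point: a positive score means the model predicts True (rare), and a negative score means False (ordinary). The data and model are rebuilt here so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

data = pd.DataFrame({
    "rare": [True, True, True, True, True,
             False, False, False, False, False, False, False, False],
    "scale": [20, 25, 30, 24, 28, 35, 40, 38, 55, 50, 60, 32, 25],
    "hige": [10, 20, 40, 18, 30, 10, 20, 30, 25, 28, 30, 18, 25],
})
X = data[["scale", "hige"]].values
y = data["rare"].values
clf = SVC(C=10, kernel="linear").fit(X, y)

# The two new cats: (body size, whisker length)
z2 = np.array([[28, 25], [45, 20]])

y_est = clf.predict(z2)             # predicted labels
scores = clf.decision_function(z2)  # signed scores relative to the boundary

for point, label, score in zip(z2, y_est, scores):
    print(f"scale={point[0]}, hige={point[1]}: rare={label}, score={score:.2f}")
```

A score close to 0 means the cat sits near the boundary, so its classification is less certain than that of a cat with a large positive or negative score.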

4. Understanding SVM from mathematics

Up to section 3, we used scikit-learn to build an SVM model, plot it, and predict whether two other cats are rare or ordinary. Here, I would like to clarify how the SVM model in this flow is computed mathematically.

(1) About maximizing the margin

Let's dig into the margin maximization described in "2. What is SVM (Support Vector Machine)?". I explained that the optimum boundary is the one for which the distance between the boundary and the data points is largest, but what kind of state does that correspond to?

キャプチャ6.PNG

◆ Three-dimensional visualization

The scatter plot that has been shown so far can be rewritten in three dimensions as shown below.

キャプチャ11.PNG

If you think of the green plane that passes through the red boundary above, you can imagine that changing the **"slope"** of this plane changes the margin (= the distance between the data and the boundary).

For example, if the slope of this plane is steep, the margin will be small as shown below.

キャプチャ12.PNG

On the contrary, if the slope of the plane is made gentle, the margin becomes large as shown below.

キャプチャ13.PNG

In other words, **the optimum boundary is the one for which "the data can be classified neatly" and "the slope of the plane passing through the decision boundary is as gentle as possible"**.

◆ Margin formula

Then, what does **"the slope of the plane passing through the decision boundary becomes as gentle as possible"** mean? I will continue with illustrations.

キャプチャ14.PNG

The figure shows the boundary surface viewed from the side. This plane is expressed by the formula $w_1x_1 + w_2x_2$.

As mentioned earlier, maximizing the margin means that the slope of the plane passing through the decision boundary becomes as gentle as possible. A gentle slope means that even if you move $x_1$ or $x_2$ a little, the value of $w_1x_1 + w_2x_2$ changes only slightly, that is, **$w_1$ and $w_2$ are small**.

Expressed as a formula, this becomes the following. Understanding its meaning properly requires understanding norms, which is somewhat involved, so at this point it is enough to understand that $w_1$ and $w_2$ in the boundary formula are computed to be as small as possible.

$$
\|w\|_2^2
$$

If this is minimized, the margin is maximized.
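For readers who want one more step, the link between "small $w$" and "large margin" can be sketched as follows. This is the standard textbook argument rather than something derived in this article, and the offset $b$ and the scaling convention (the closest points satisfy $|w_1x_1 + w_2x_2 + b| = 1$) are assumptions introduced here.

$$
\text{distance from } (x_1, x_2) \text{ to the boundary} = \frac{|w_1x_1 + w_2x_2 + b|}{\|w\|_2},
\qquad \|w\|_2 = \sqrt{w_1^2 + w_2^2}
$$

$$
\text{closest points: } |w_1x_1 + w_2x_2 + b| = 1
\;\Longrightarrow\;
\text{margin} = \frac{1}{\|w\|_2}
$$

$$
\text{maximize } \frac{1}{\|w\|_2}
\;\Longleftrightarrow\;
\text{minimize } \|w\|_2^2 = w_1^2 + w_2^2
$$

Under this convention, making $w_1$ and $w_2$ small is the same thing as pushing the boundary away from the closest data points.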

(2) Penalty

The basic idea ends with (1), but as mentioned in "◆ About the penalty" in "2. What is SVM (Support Vector Machine)?", some misclassification is allowed so that the classification matches the actual situation. The degree to which misclassification is allowed is called the penalty. The penalty is expressed by the following formula, where $\xi_i$ measures how far the $i$-th data point violates the margin and is computed with the hinge loss function.

$$
C\left(\sum_{i=1}^n \xi_i\right)
$$

$C$ has the same meaning as the argument described in "(Ii) Model construction": the larger $C$ is, the less misclassification is allowed (and if it is too large, overfitting becomes more likely). Understanding this formula in depth requires more mathematical background, so I will leave it at this for now. (I may write a separate article on it later, and would like to summarize it here as well.)
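As a small worked example of the penalty, the hinge loss for one data point is $\max(0,\ 1 - y_i f(x_i))$, where $y_i$ is the label written as $+1$ or $-1$ and $f(x_i)$ is the value of the boundary formula for that point. The labels and scores below are hypothetical numbers chosen to show each case, and `hinge_loss` is a helper name introduced here.

```python
import numpy as np

def hinge_loss(y_signed, f):
    """max(0, 1 - y * f) for labels y in {-1, +1} and boundary scores f."""
    return np.maximum(0.0, 1.0 - y_signed * f)

# Hypothetical labels (+1 = rare, -1 = ordinary) and boundary scores
y_signed = np.array([+1, +1, +1, -1])
f = np.array([2.0, 0.5, -1.0, -3.0])

xi = hinge_loss(y_signed, f)
print(xi.tolist())  # [0.0, 0.5, 2.0, 0.0]

C = 10
penalty = C * xi.sum()
print(penalty)  # 25.0
```

The first and last points are on the correct side and outside the margin, so they cost nothing. The second point is on the correct side but inside the margin, and the third is on the wrong side, so both contribute to the penalty.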

(3) To summarize ...

From (1) and (2), SVM is computed so as to make the following objective function as small as possible. Intuitively, **the slope of the boundary surface is made as small as possible to maximize the margin, a penalty term expressing how much misclassification is allowed is added so that the classification matches the actual situation, and the formula of the boundary surface is chosen so that the two are well balanced.**

$$
\|w\|_2^2 + C\left(\sum_{i=1}^n \xi_i\right)
$$
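To connect this formula back to section 3, the two terms can be computed for the model fitted there. This is a sketch under a few assumptions: the labels are converted to $+1$ / $-1$, $\xi_i$ is taken as the hinge loss of `decision_function`, and the objective is written exactly as in this article. The scikit-learn documentation writes the first term as $\frac{1}{2}\|w\|^2$, so the number below illustrates the two terms rather than the exact value the library minimizes.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

data = pd.DataFrame({
    "rare": [True, True, True, True, True,
             False, False, False, False, False, False, False, False],
    "scale": [20, 25, 30, 24, 28, 35, 40, 38, 55, 50, 60, 32, 25],
    "hige": [10, 20, 40, 18, 30, 10, 20, 30, 25, 28, 30, 18, 25],
})
X = data[["scale", "hige"]].values
y = data["rare"].values

C = 10
clf = SVC(C=C, kernel="linear").fit(X, y)

w = clf.coef_[0]                          # (w_1, w_2) of the fitted boundary
y_signed = np.where(y, 1.0, -1.0)         # True -> +1, False -> -1
f = clf.decision_function(X)              # boundary score for each training point
xi = np.maximum(0.0, 1.0 - y_signed * f)  # hinge loss per training point

margin_term = np.dot(w, w)    # ||w||_2^2
penalty_term = C * xi.sum()   # C * sum of xi_i
objective = margin_term + penalty_term

print("||w||^2          :", margin_term)
print("C * sum(xi)      :", penalty_term)
print("objective (total):", objective)
```

Refitting with a different `C` and comparing the two terms shows how the balance between a wide margin and few violations shifts.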

5. Summary

How was it? SVM requires more mathematical background than simple regression or logistic regression, so I could not go very deep, but I hope this article helps you understand it better than before.
