[PYTHON] Try SVM with scikit-learn on Jupyter Notebook

A linear SVM (Support Vector Machine) is a machine learning model that classifies data by separating the feature space with a linear boundary. When the data cannot be separated linearly, an SVM can still separate it non-linearly by using the kernel method.

I never really understood the kernel method until now, but the following article made it very easy to understand.

About the kernel method in machine learning - Memomemo

I am trying this in a Jupyter Notebook environment set up according to the following article: Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala) - Qiita

In this environment, you can access port 8888 in a browser to use Jupyter Notebook. You can open a new notebook via New > Python 3 from the button at the upper right.

I am using a randomly generated CSV file: https://github.com/suzuki-navi/sample-data/blob/master/sample-data-1.csv

Data preparation

Read the data from the CSV file into a DataFrame object.

import pandas as pd
from sklearn import model_selection

# The CSV has no header row, so supply the column names here
df = pd.read_csv("sample-data-1.csv", names=["id", "target", "data1", "data2", "data3"])

(Figure: the loaded DataFrame displayed in the notebook)

df is a Pandas DataFrame object.
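
To quickly confirm that the load worked, you can inspect the DataFrame (a quick check; the shape assumes the 300-record file linked above):

df.head()   # first five rows
df.shape    # (300, 5): 300 records, 5 columns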

Reference: Try basic operations for Pandas DataFrame on Jupyter Notebook - Qiita

This CSV data has three feature variables, data1, data2, and data3. Let's check the state of the data with scatter plots.

%matplotlib inline
import matplotlib.pyplot as plt

# Color each point by its class label
plt.scatter(df["data1"], df["data2"], c=df["target"])

(Figure: scatter plot of data1 vs. data2, colored by target)

plt.scatter(df["data1"], df["data3"], c = df["target"])

(Figure: scatter plot of data1 vs. data3, colored by target)

plt.scatter(df["data2"], df["data3"], c = df["target"])

(Figure: scatter plot of data2 vs. data3, colored by target)

Reference: Display histogram / scatter plot on Jupyter Notebook - Qiita
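
Incidentally, if you prefer to see all three pairs at once, something like this should work:

# Draw the three pairwise scatter plots side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (x, y) in zip(axes, [("data1", "data2"), ("data1", "data3"), ("data2", "data3")]):
    ax.scatter(df[x], df[y], c=df["target"])
    ax.set_xlabel(x)
    ax.set_ylabel(y)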

Looking at the scatter plots, it seems the two classes can be separated using data2 and data3, so I will try that pair.

feature = df[["data2", "data3"]]
target = df["target"]

feature is a Pandas DataFrame object and target is a Pandas Series object.

There are 300 records. Both the feature variables and the objective variable are split into training data and validation data. It is just dividing the records in two, and model_selection.train_test_split does it easily, splitting at random.

feature_train, feature_test, target_train, target_test = model_selection.train_test_split(feature, target, test_size=0.2)

test_size=0.2 specifies that 20% of all the data is used as validation data.
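
Since the split is random, the numbers below change from run to run. If you want a reproducible split, train_test_split also accepts a random_state argument, for example:

# Fixing the seed makes the same split come out every run
feature_train, feature_test, target_train, target_test = model_selection.train_test_split(
    feature, target, test_size=0.2, random_state=0)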

The feature variables (df[["data2", "data3"]], feature_train, feature_test) are Pandas DataFrame objects, and the objective variables (df["target"], target_train, target_test) are Series objects.

Learning

Train a model on the training data created above (feature_train, target_train).

from sklearn import svm

# A linear SVM classifier
model = svm.SVC(kernel="linear")
model.fit(feature_train, target_train)

SVC (kernel =" linear ") is a model of a linearly separable SVM classifier. Let's learn with fit.

Reference: sklearn.svm.SVC — scikit-learn 0.21.3 documentation
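
With a linear kernel you can also peek at the learned boundary: coef_ and intercept_ hold the parameters of the separating line (just a quick inspection, not needed for the rest):

# The boundary is coef_[0][0]*x + coef_[0][1]*y + intercept_[0] = 0
print(model.coef_, model.intercept_)
print(len(model.support_vectors_))  # number of support vectors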

Evaluation

Use the trained model to create inference results (pred_train) from the feature variables of the training data (feature_train), compare them with the objective variable (target_train), and evaluate the accuracy. The metrics.accuracy_score function makes this easy.

from sklearn import metrics
pred_train = model.predict(feature_train)
metrics.accuracy_score(target_train, pred_train)

Because the split is random, the result may differ each time; in my case it was 0.95.

Evaluate with the validation data to see whether the model is overfitted or generalizes well.

pred_test = model.predict(feature_test)
metrics.accuracy_score(target_test, pred_test)

It was displayed as 0.9333333333333333. I'm not sure whether that is good enough.
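
If you want to see where the mistakes are, scikit-learn's metrics.confusion_matrix shows which class gets confused with which (an optional check):

# Rows are true classes, columns are predicted classes
metrics.confusion_matrix(target_test, pred_test)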

Apart from scikit-learn, you can use plotting.plot_decision_regions from the mlxtend package to visualize how the model classifies the data on a scatter plot. plot_decision_regions needs NumPy arrays rather than Pandas objects, so convert them with the to_numpy() method.

from mlxtend import plotting
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)

(Figure: decision regions of the linear SVM over data2 and data3)

Looks good.

References: plot_decision_regions - Mlxtend.plotting - mlxtend; pandas.DataFrame.to_numpy — pandas 0.25.3 documentation

Try using the kernel method

I would like to try nonlinear separation. Let's use the RBF kernel.

All you have to do is change svm.SVC(kernel="linear") to svm.SVC(kernel="rbf", gamma="scale"). gamma is a hyperparameter of the RBF kernel; specifying "scale" makes it be computed automatically from the number of features and the variance of the training data.
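
Concretely, gamma="scale" resolves to 1 / (n_features * X.var()) in scikit-learn, so you could also compute it by hand if you are curious (just an illustration):

import numpy as np

# What gamma="scale" evaluates to for this training data
X = feature_train.to_numpy()
gamma_scale = 1.0 / (X.shape[1] * X.var())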

The code below creates the model, trains it, runs inference, and evaluates it.

model = svm.SVC(kernel="rbf", gamma="scale")
model.fit(feature_train, target_train)
pred_train = model.predict(feature_train)
metrics.accuracy_score(target_train, pred_train)

It was displayed as 0.95.

Evaluate with the validation data to check the generalization performance.

pred_test = model.predict(feature_test)
metrics.accuracy_score(target_test, pred_test)

It was displayed as 0.95. It's a little better than the linear separation I mentioned earlier.

plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)

(Figure: decision regions of the RBF-kernel SVM over data2 and data3)

Being non-linear, the boundary is indeed a curve.

This sample was easy to separate linearly, so there may not have been much to gain from going non-linear.

Try the kernel method with other data

Since data2 and data3 can be separated linearly, let's try the RBF kernel on the other combinations of features.

First, data1 and data2. The following code produces just the figure showing the separation.

feature = df[["data1", "data2"]]
target = df["target"]
feature_train, feature_test, target_train, target_test = model_selection.train_test_split(feature, target, test_size=0.2)
model = svm.SVC(kernel="rbf", gamma="scale")
model.fit(feature_train, target_train)
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)

(Figure: decision regions of the RBF-kernel SVM over data1 and data2)

Let's check the accuracy.

pred_train = model.predict(feature_train)
metrics.accuracy_score(target_train, pred_train)

It was 0.7583333333333333.

pred_test = model.predict(feature_test)
metrics.accuracy_score(target_test, pred_test)

It was 0.7833333333333333.

By the way, training on the same data with a linear kernel (kernel="linear") gave 0.71 to 0.74. Looking at the figure, the kernel method seems to be working hard, but the numbers are not very different. Perhaps we should not expect too much just because non-linear separation is possible.
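
For reference, a minimal sketch of that comparison on the same split (gamma is ignored by the linear kernel):

# Compare the two kernels on the same train/validation split
for kernel in ["linear", "rbf"]:
    m = svm.SVC(kernel=kernel, gamma="scale")
    m.fit(feature_train, target_train)
    acc = metrics.accuracy_score(target_test, m.predict(feature_test))
    print(kernel, acc)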

I also tried data1 and data3, but the result was similar, so I omit it.

That's all.

Sequel: Try clustering with a mixed Gaussian model on Jupyter Notebook - Qiita
