[Python] My first SVM with Python / scikit-learn

Introduction

Since I started using Python, I have often heard that it has many machine learning libraries. I knew they existed, but I had never actually tried one myself. After reading the articles listed below, though, it looked easy enough, so I tried machine learning, specifically SVM (Support Vector Machine), and am posting what I did.

Here, using weather data, we will do everything from **data acquisition to very simple data processing, learning, and visualization**.

I mainly referred to the following two articles.

- [Python] Easy introduction to machine learning with python (SVM)
- [Python for beginners in machine learning] Easy implementation of SVM with scikit-learn

Environment

I'm running on Anaconda on Windows 10.

| Name | Version |
| --- | --- |
| Python | 3.7.3 |
| scikit-learn | 0.23.1 |
| pandas | 1.0.5 |
| NumPy | 1.18.5 |
| matplotlib | 3.2.2 |
| mlxtend | 0.17.3 |

Each can be installed with pip as shown below.

$ pip install scikit-learn
$ pip install pandas
$ pip install numpy
$ pip install matplotlib
$ pip install mlxtend
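To check that the installed versions match the table above, something like the following can be used. This is only a minimal sketch: the helper name `installed_version` is my own (it is not part of any library), and it assumes each package exposes a `__version__` attribute.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Fall back to "unknown" for modules without a __version__ attribute.
    return getattr(module, "__version__", "unknown")


# Import names of the packages above (scikit-learn is imported as "sklearn").
for name in ["sklearn", "pandas", "numpy", "matplotlib", "mlxtend"]:
    print(name, installed_version(name))
```

If a package prints `None`, it is not installed in the Python environment you are running.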

Target audience

This article is at the same level as the two articles listed above.

- You can handle Python, NumPy, and pandas
- You know that machine learning exists
- You want to know the flow of implementing an SVM

What is machine learning and SVM?

I won't go into details here. Please refer to the reference articles.

What is machine learning classification?

Classification: In a classification task, a finite number of classes are defined in advance, and each class is assigned a class name called a class label (or simply a label), such as "cat" or "dog". The purpose of the classification task is to guess which class a given input x belongs to. [Machine Learning, Classification - Wikipedia](https://ja.wikipedia.org/wiki/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92#%E5%88%86%E9%A1%9E)

Here, we use machine learning to guess the weather (the label) from data such as temperature, precipitation, and cloud cover.

SVM, in turn, is described as follows.

The support vector machine (SVM) is one of the pattern recognition models that use supervised learning. It can be applied to classification and regression. SVM - Wikipedia

Flow

The explanation proceeds in the following order.

  1. Data acquisition
  2. Data processing
  3. Learning
  4. Visualization

1. Data acquisition

Since we decided to handle weather data this time, let's download it from this page of the Japan Meteorological Agency.

Select the location, the items (temperature, precipitation, etc.), and the period, then download. Feel free to include any items you like, such as temperature and precipitation; this is not a problem because you can choose which ones to use when training. You should get a csv file named data.csv.

After experimenting for a while, I found that, rather than using 12 months of data for a single year such as 2019, the data seems to be classified better when you use the same months across several years: October-November 2001, October-November 2002, October-November 2003, and so on. (It makes sense that training on data from the same season works better.)
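If you have already downloaded a longer period, the same-season idea can also be applied afterwards with pandas. The following is a minimal sketch with made-up data; the column names `date` and `precipitation` are hypothetical and will differ from the headers in the real file.

```python
import pandas as pd

# Made-up daily records; "date" and "precipitation" are hypothetical column names.
df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2001-09-30", "2001-10-01", "2001-11-30",
                            "2001-12-01", "2002-10-15", "2003-11-02"]),
    "precipitation": [0.0, 1.5, 0.0, 12.0, 0.0, 3.0],
})

# Keep only the rows whose month is October or November, in any year.
autumn = df_demo[df_demo["date"].dt.month.isin([10, 11])]
print(autumn)
```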

2. Data processing

2-1 Deletion of unnecessary data

From here, we will process the data using pandas and other libraries.

The first line of the downloaded file contains metadata such as "Download time: 2020/11/16 18:18:28", so we read the file with header=2 to skip it. Because the file contains Japanese, we also set encoding="SHIFT-JIS".

import numpy as np
import pandas as pd

# Read the csv file (adjust "data.csv" to match your own directory and file name)
df = pd.read_csv("data.csv", header=2, encoding="SHIFT-JIS")

At this point, df should look something like the following. [image.png]
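To see what header=2 does without the real file, here is a small sketch that reads a made-up file with a similar layout from memory. The contents of `raw` are purely illustrative and written in English, so no encoding argument is needed. Note how pandas adds suffixes such as ".1" to repeated column names.

```python
import io

import pandas as pd

# A made-up file laid out like the downloaded one (contents are illustrative).
raw = """Download time: 2020/11/16 18:18:28

,Tokyo,Tokyo,Tokyo
Date,Temperature,Temperature,Temperature
,,Quality,Homogeneity
2019/10/1,20.1,8,1
2019/10/2,21.3,8,1
"""

# Blank lines are skipped by default, so header=2 selects the "Date,..." line.
df_demo = pd.read_csv(io.StringIO(raw), header=2)

print(df_demo.columns.tolist())  # repeated names receive ".1", ".2" suffixes
print(df_demo)                   # row 0 holds the "Quality" / "Homogeneity" labels
```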

Because some column names appear more than once, pandas has added suffixes such as ".1" to them. Looking at row 0, you can see that these are the columns for the quality number, the homogeneity number, and so on. We don't need them this time, so let's delete them. It's a little heavy-handed, but I did the following. We also delete the rows that consist only of missing values.

# Drop row 0
df = df.drop(df.index[[0]])

# Drop the columns whose names end with ".1", ".2", or ".3"
df = df.drop(df.loc[:, df.columns.str.endswith(".1")], axis=1)
df = df.drop(df.loc[:, df.columns.str.endswith(".2")], axis=1)
df = df.drop(df.loc[:, df.columns.str.endswith(".3")], axis=1)

# Delete rows in which every value is missing
df = df.dropna(how='all')

It should be clean now! [image.png]
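As a variation on the cleanup above, the suffixed columns can also be selected in one go with a regular expression. This is a sketch on made-up data, not the method used in this article, and it assumes that no column you want to keep ends with a dot followed by digits. It also shows the difference between how="all" and how="any" in dropna.

```python
import numpy as np
import pandas as pd

# Made-up frame with pandas-style duplicate suffixes and some missing values.
df_demo = pd.DataFrame({
    "Temperature": [20.1, np.nan, np.nan],
    "Temperature.1": [8, 8, np.nan],
    "Cloud cover": [3.0, 7.0, np.nan],
    "Cloud cover.1": [8, 8, np.nan],
})

# Select the columns whose names end with "." followed by digits, then drop them.
suffixed = df_demo.columns[df_demo.columns.str.contains(r"\.\d+$")]
df_demo = df_demo.drop(columns=suffixed)

# how="all" removes only the rows in which every value is missing;
# how="any" removes every row that has at least one missing value.
print(df_demo.dropna(how="all"))
print(df_demo.dropna(how="any"))
```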

2-2 Label arrangement

Now let's look at the number of unique labels (weather overview). The data I downloaded had a whopping 64 of them. That is too many.

print(len(df["Weather overview(Noon: 06:00 to 18:00)"].unique().tolist()))

Here is part of the list. (In the actual CSV the labels are Japanese; they are shown here in English.)

['Partially cloudy', 'Temporary rain after cloudy weather', 'Cloudy', 'Cloudy after fine', 'Rain after cloudy', 'Light cloud after fine', 'Rain, temporarily cloudy',
 'Fine', 'Cloudy and sometimes rain', 'Cloudy, temporarily fine', 'rain', 'Fine after cloudy', 'Rain, sometimes cloudy', 'Cloudy, temporarily rain', 'Clear',
 'Sunny after rain', 'Temporarily cloudy after fine weather', 'Light cloud', 'Temporary sunny after rain', 'Cloudy after rain', 'Temporary clear after cloudy',
 'Sunny and cloudy', 'With rain and thunder', 'Fine, later temporary rain with thunder', 'heavy rain', 'Cloudy and sunny after a temporary rain',
 'Cloudy temporary fog', 'Light cloudy temporary clear', 'Cloudy and sometimes sunny', 'Cloudy after rain', 'Sunny Temporary cloudy', 'Cloudy temporary rain, accompanied by lightning']
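Besides the number of unique labels, it can also help to see how often each one occurs. Here is a small sketch on made-up labels; the values in `labels` are stand-ins, not my actual data.

```python
import pandas as pd

# Made-up labels standing in for the weather overview column.
labels = pd.Series(["Cloudy", "Fine", "Cloudy", "rain", "Fine", "Cloudy"])

print(labels.nunique())       # number of distinct labels
print(labels.value_counts())  # number of rows carrying each label
```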

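**This time,** I did the following to simplify the classification. In the actual CSV the labels are written in Japanese, so the first character (for example 曇, 晴, or 雨) tells you the main weather.

① Take the first character.
② Replace it with a number.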

# Keep only the first character of each label (the labels in the CSV are Japanese)
df["Weather overview(Noon: 06:00 to 18:00)"] = df["Weather overview(Noon: 06:00 to 18:00)"].str[:1]
# 曇 (cloudy), 薄 (light cloud), 霧 (fog) -> 0 / 晴 (fine), 快 (clear) -> 1 / 雨 (rain), 大 (heavy rain) -> 2
df["Weather No"] = df["Weather overview(Noon: 06:00 to 18:00)"].replace({"曇": "0", "薄": "0", "霧": "0", "晴": "1", "快": "1", "雨": "2", "大": "2"})

The table below lists the weather and the numbers that replace it. It is very simplified this time; if you don't like it, feel free to change it yourself. Each original notation also covers the compound labels that start with it (for example, "temporarily ..." variants).

| Weather (first character) | Number | Original notation |
| --- | --- | --- |
| 曇 (cloudy) | 0 | Cloudy |
| 薄 (thin) | 0 | Light cloud |
| 霧 (fog) | 0 | Fog |
| 快 (pleasant) | 1 | Clear |
| 晴 (fine) | 1 | Fine |
| 雨 (rain) | 2 | Rain |
| 大 (big) | 2 | Heavy rain |
3. Learning

Finally, the learning step. We use a function called train_test_split to divide the data into training data and test data, so that we can verify that the model has learned well and predicts correctly not only on the training data but also on unseen data.

The learning itself is very easy:

model.fit(x_train, y_train)

Just this one line.

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Explanatory variables (features)
x = df.loc[1:, ["Total precipitation(mm)", "Average cloud cover(10 minutes ratio)"]]

# Objective variable (label)
y = df.loc[1:, "Weather No"].astype("int64")

# Split into training data and test data.
# test_size=0.3: 30% test data, 70% training data
# random_state=None: a different split is generated on every run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=None)

# Choose an SVM classifier
model = svm.SVC()

# Train
model.fit(x_train, y_train)

# Accuracy on the training data
pred_train = model.predict(x_train)
accuracy_train = accuracy_score(y_train, pred_train)
print('Correct answer rate for training data:%.2f' % accuracy_train)

# Accuracy on the test data
pred_test = model.predict(x_test)
accuracy_test = accuracy_score(y_test, pred_test)
print('Correct answer rate for test data:%.2f' % accuracy_test)

If you get results like the following, it worked!

Correct answer rate for training data:0.81
Correct answer rate for test data:0.82
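Because random_state=None, the split (and therefore the accuracy) changes a little on every run. If you want reproducible numbers, you can pass a fixed seed instead. Below is a minimal sketch on made-up data; the seed value 0 is an arbitrary choice of mine.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus made-up labels.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# random_state=0 is an arbitrary seed chosen for this sketch.
a_train, a_test, _, _ = train_test_split(x_demo, y_demo, test_size=0.3, random_state=0)
b_train, b_test, _, _ = train_test_split(x_demo, y_demo, test_size=0.3, random_state=0)

print(a_train.shape, a_test.shape)     # sizes of the 70% / 30% split
print(np.array_equal(a_test, b_test))  # same seed, same split
```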

Classification is done by model.predict(), so, to take an extreme example, even a call like the following returns a classification result (here for a day with 1 mm of precipitation and a cloud cover of 1).

model.predict([[1,1]])
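predict() accepts several samples at once: one row per sample, with the columns in the same order as at training time. The following sketch trains a throwaway model on made-up values just to show the call; the numbers in `x_demo`, `y_demo`, and `new_days` are invented and the resulting predictions carry no meaning.

```python
import numpy as np
from sklearn import svm

# Made-up training data: [precipitation (mm), cloud cover] -> weather number.
x_demo = np.array([
    [0.0, 9.0], [0.0, 10.0], [0.5, 8.0],      # 0: cloudy
    [0.0, 1.0], [0.0, 2.0], [0.0, 0.0],       # 1: fine
    [20.0, 10.0], [35.0, 9.0], [15.0, 10.0],  # 2: rain
])
y_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

demo_model = svm.SVC()
demo_model.fit(x_demo, y_demo)

# predict() takes a 2-D array: one row per sample, one column per feature.
new_days = [[0.0, 1.5], [25.0, 10.0]]
pred = demo_model.predict(new_days)
print(pred)  # one predicted weather number per row of new_days
```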

4. Visualization

Finally, visualization. We visualize the decision boundaries with plot_decision_regions. It is also easy to use: pass in the data and the model, and it draws the graph for you. Note, however, that the x passed here has two features (columns), which correspond to the x and y axes of the resulting graph.

#Visualization of decision boundaries
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

x_combined = x_test.values
y_combined = y_test.values

fig = plt.figure(figsize=(13,8))
plot_decision_regions(x_combined, y_combined, clf=model,  res=0.02)
plt.show()

In my case, the following figure came out. Hmm...? [image.png]
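To get a feeling for what such a figure represents, the idea can be sketched by hand: predict a class for every point on a grid over the two features, and the places where the predicted class changes are the decision boundaries. This is only a conceptual sketch on made-up data, not mlxtend's actual implementation; the grid ranges and step are arbitrary choices of mine.

```python
import numpy as np
from sklearn import svm

# Made-up training data: [precipitation (mm), cloud cover] -> weather number.
x_demo = np.array([
    [0.0, 9.0], [0.0, 10.0], [0.5, 8.0],      # 0: cloudy
    [0.0, 1.0], [0.0, 2.0], [0.0, 0.0],       # 1: fine
    [20.0, 10.0], [35.0, 9.0], [15.0, 10.0],  # 2: rain
])
y_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
demo_model = svm.SVC().fit(x_demo, y_demo)

# Build a grid covering the range of both features.
xx, yy = np.meshgrid(
    np.arange(0.0, 40.0, 0.5),  # x axis: precipitation
    np.arange(0.0, 10.5, 0.5),  # y axis: cloud cover
)

# Predict a class for every grid point, then restore the grid shape.
grid_points = np.c_[xx.ravel(), yy.ravel()]
zz = demo_model.predict(grid_points).reshape(xx.shape)

print(zz.shape)
# Drawing zz with, for example, plt.contourf(xx, yy, zz) colors each region by class.
```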

Full code

import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

# Read the csv file (adjust "data.csv" to match your own directory and file name)
df = pd.read_csv("data.csv", header=2, encoding="SHIFT-JIS")

# Drop row 0
df = df.drop(df.index[[0]])

# Drop the columns whose names end with ".1", ".2", or ".3"
df = df.drop(df.loc[:, df.columns.str.endswith(".1")], axis=1)
df = df.drop(df.loc[:, df.columns.str.endswith(".2")], axis=1)
df = df.drop(df.loc[:, df.columns.str.endswith(".3")], axis=1)

# Delete rows in which every value is missing
df = df.dropna(how='all')

# Label processing: keep the first character of each (Japanese) label, then map it to a number
df["Weather overview(Noon: 06:00 to 18:00)"] = df["Weather overview(Noon: 06:00 to 18:00)"].str[:1]
# 曇 (cloudy), 薄 (light cloud), 霧 (fog) -> 0 / 晴 (fine), 快 (clear) -> 1 / 雨 (rain), 大 (heavy rain) -> 2
df["Weather No"] = df["Weather overview(Noon: 06:00 to 18:00)"].replace({"曇": "0", "薄": "0", "霧": "0", "晴": "1", "快": "1", "雨": "2", "大": "2"})

# Explanatory variables (features)
x = df.loc[1:, ["Total precipitation(mm)", "Average cloud cover(10 minutes ratio)"]]

# Objective variable (label)
y = df.loc[1:, "Weather No"].astype("int64")

# Split into training data and test data.
# test_size=0.3: 30% test data, 70% training data
# random_state=None: a different split is generated on every run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=None)

# Choose an SVM classifier
model = svm.SVC()

# Train
model.fit(x_train, y_train)

# Accuracy on the training data
pred_train = model.predict(x_train)
accuracy_train = accuracy_score(y_train, pred_train)
print('Correct answer rate for training data:%.2f' % accuracy_train)

# Accuracy on the test data
pred_test = model.predict(x_test)
accuracy_test = accuracy_score(y_test, pred_test)
print('Correct answer rate for test data:%.2f' % accuracy_test)

#Visualization of decision boundaries
x_combined = x_test.values
y_combined = y_test.values

fig = plt.figure(figsize=(13,8))
plot_decision_regions(x_combined, y_combined, clf=model,  res=0.02)
plt.show()

In conclusion

This was my first time trying machine learning, and it was surprisingly easy! (Admittedly, some parts were rather rough.) I was inspired by the articles above myself, so please give it a try too.
