[PYTHON] You will be an engineer in 100 days - Day 82 - Programming - About machine learning 7


This post continues the series on machine learning.

About classification models

First, a recap of what machine learning can do. There are basically three kinds of task.

・ Regression
・ Classification
・ Clustering

Roughly speaking, all three are forms of prediction; what differs is the thing being predicted.

・ Regression: predict numerical values
・ Classification: predict categories
・ Clustering: group similar data together without being given labels

A classification model predicts a category value.

The data used this time is the `iris` (ayame) dataset bundled with `scikit-learn`.

sepal length (cm)	Sepal length
sepal width (cm)	Sepal width
petal length (cm)	Petal length
petal width (cm)	Petal width

There are three varieties of irises: setosa, versicolor, and virginica.

Data visualization

Let's load the data and take a look at it.

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

#Reading iris data
iris = load_iris()
#Convert to data frame
df = pd.DataFrame(np.concatenate((iris.data, iris.target.reshape(-1,1)), axis=1),columns=iris.feature_names + ['type'])
df.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5	3.6	1.4	0.2

The data covers three iris varieties and has 150 rows with 4 numeric feature columns. The last column of the data frame, type, holds the iris variety.

Since there are three varieties, type takes the values 0, 1, and 2.
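The numeric codes can be mapped back to variety names through the target_names array that load_iris returns. The following is a minimal sketch; the variable names labels, names, and counts are chosen only for this illustration.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# target holds the numeric codes 0, 1, 2; target_names holds the matching names
labels = pd.Series(iris.target, name='type')
names = labels.map(lambda code: iris.target_names[code])

# Count how many rows belong to each variety
counts = names.value_counts().sort_index()
print(counts)
```

Each variety should account for 50 of the 150 rows.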

Let's visualize how the features are distributed.

import matplotlib.pyplot as plt
%matplotlib inline

#Store data for visualization in variables
x1 = df['petal length (cm)']
x2 = df['petal width (cm)']
y  = df['type']
x  = df[['petal length (cm)','petal width (cm)']]

#Draw data
for i in range(3):
    plt.scatter(x1[y==i],x2[y==i],marker="x",label=i)
plt.legend()
plt.grid()
plt.show()

The three varieties form fairly well separated groups.

Next, let's try normalization and standardization, two common ways of preprocessing data for machine learning.

Normalization (min-max normalization)

Normalization rescales the data so that the minimum value becomes 0 and the maximum value becomes 1. Because it depends only on the minimum and maximum, it is strongly affected by outliers.

Normalized value = (value - minimum value) / (maximum value - minimum value)
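The formula can be tried by hand with NumPy. This is a small sketch on made-up numbers (the arrays values and with_outlier are hypothetical, not taken from the iris data), and it also shows how a single outlier squeezes the remaining values toward 0.

```python
import numpy as np

# Hypothetical measurements, chosen only for illustration
values = np.array([1.0, 2.0, 4.0, 5.0])

# Normalized value = (value - minimum value) / (maximum value - minimum value)
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # 0, 0.25, 0.75, 1

# The same values plus one outlier: everything except the outlier ends up near 0
with_outlier = np.array([1.0, 2.0, 4.0, 5.0, 50.0])
normalized_outlier = (with_outlier - with_outlier.min()) / (with_outlier.max() - with_outlier.min())
print(normalized_outlier.round(3))
```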

Standardization (z-score normalization)

Standardization rescales numeric data so that it has a mean of 0 and a variance of 1. It is said to be more resistant to outliers than normalization.

Standardized value = (value - mean) / standard deviation

Standardization puts the data on a smaller, common scale, which can speed up learning.

When the explanatory variables have very different scales (height and weight, for example), the variables with larger values dominate learning. Standardization remedies this.
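As a rough illustration of the formula, here is standardization applied by hand to two columns on very different scales. The heights and weights are hypothetical values invented for this sketch.

```python
import numpy as np

# Hypothetical data: height in metres, weight in kilograms
data = np.array([
    [1.5, 50.0],
    [1.6, 60.0],
    [1.7, 70.0],
    [1.8, 80.0],
])

# Standardized value = (value - mean) / standard deviation, computed per column
mean = data.mean(axis=0)
std = data.std(axis=0)
standardized = (data - mean) / std

# After standardization each column has mean 0 and standard deviation 1
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

After the transformation both columns occupy the same range, so neither one dominates simply because of its unit.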

How to normalize and standardize

Normalization and standardization can be done with the following code.

scikit-learn provides the MinMaxScaler and StandardScaler classes for this.

** Data split **

First, split the data into training and test sets. This time we split it 7:3.

#Split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

** Normalize **

#Perform normalization
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()

#Normalize training data
x_train_norm = mms.fit_transform(x_train)

#Normalize test data based on training data
x_test_norm = mms.transform(x_test)

print('Maximum value of training data: ' , x_train_norm.max())
print('Minimum value of training data: ' , x_train_norm.min())
#The test data is scaled with the training data's minimum and maximum, so it can fall slightly outside 0 to 1
print('Maximum value of test data: ' , x_test_norm.max())
print('Minimum value of test data: ' , x_test_norm.min())

Maximum value of training data: 1.0
Minimum value of training data: 0.0
Maximum value of test data: 1.0357142857142858
Minimum value of test data: -0.017857142857142877

** Standardize **

#Standardize
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

x_train_std = ss.fit_transform(x_train)
x_test_std  = ss.transform(x_test)

print('Mean of training data: ' , x_train_std.mean())
print('Standard deviation of training data: ' , x_train_std.std())
#The test data is scaled with the training data's mean and standard deviation, so its values differ slightly
print('Mean of test data: ' , x_test_std.mean())
print('Standard deviation of test data: ' , x_test_std.std())

Mean of training data: 3.425831047414769e-16
Standard deviation of training data: 1.0000000000000002
Mean of test data: -0.1110109621182351
Standard deviation of test data: 1.040586519459273

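After standardization, the mean of the training data is essentially 0 and its standard deviation is 1. The test data is transformed with the statistics of the training data, so its mean and standard deviation are only close to those values.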
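To make the role of the training statistics concrete, the sketch below checks that transform on the test data gives the same result as applying the formula by hand with the training mean and standard deviation. It rebuilds the split from the raw arrays so that it runs on its own; the variable manual is introduced only for this check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Petal length and petal width, the same two features used in this article
iris = load_iris()
x = iris.data[:, 2:4]
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

ss = StandardScaler()
x_train_std = ss.fit_transform(x_train)
x_test_std = ss.transform(x_test)

# transform() reuses the mean and standard deviation learned from the training data
manual = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)
matches = np.allclose(x_test_std, manual)
print(matches)

# Per-column statistics of the training data after scaling
print(x_train_std.mean(axis=0), x_train_std.std(axis=0))
```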

Let's compare the normalized and standardized data with the original data.

#Plot the original, standardized, and normalized data together
x_norm = mms.transform(x)
x_std  = ss.transform(x)
plt.axis('equal')
plt.scatter(x.iloc[:,0],x.iloc[:,1], marker="x", label="org" ,c='blue')
plt.scatter(x_std[:,0] ,x_std[:,1] , marker="x", label="std" ,c='red')
plt.scatter(x_norm[:,0],x_norm[:,1], marker="x", label="norm",c='green')
plt.legend()
plt.grid()
plt.show()

Blue is the original data, red is the standardized data, and green is the normalized data. You can see that the range the values occupy has changed considerably.

Creating a predictive model

Let's compare machine learning results on the normalized data, the standardized data, and the original data. The code below classifies each with logistic regression and prints the score, which is the mean accuracy: the higher the score, the better.

#Load a logistic regression model
from sklearn.linear_model import LogisticRegression
lr      = LogisticRegression(solver='liblinear', multi_class='auto')
lr_norm = LogisticRegression(solver='liblinear', multi_class='auto')
lr_std  = LogisticRegression(solver='liblinear', multi_class='auto')

lr.fit(x_train,y_train)
print('Original data score:',lr.score(x_train,y_train))

lr_norm.fit(x_train_norm,y_train)
print('Score of normalized data:',lr_norm.score(x_train_norm,y_train))

lr_std.fit(x_train_std,y_train)
print('Standardized data score:',lr_std.score(x_train_std,y_train))

Original data score: 0.8666666666666667
Score of normalized data: 0.8095238095238095
Standardized data score: 0.9523809523809523

The model trained on standardized data scores clearly higher than the other two.

A classification model ultimately outputs a predicted value of the objective variable. Here it outputs one of the category values 0, 1, or 2, and the score is the fraction of predictions that match the true labels. The code above scores each model on the data it was trained on; to check how well a model generalizes, score it on the test data (x_test_norm or x_test_std with y_test) in the same way.

Accuracy is verified in the same way as in the earlier lecture, Machine learning 5.
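One way to score on held-out data is sketched below. It chains StandardScaler and LogisticRegression with make_pipeline so that the scaler is fitted on the training data only. Note the assumptions: this sketch uses scikit-learn's default solver with max_iter=1000 rather than the liblinear settings used earlier in this article, so its scores are not expected to reproduce the numbers printed above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x = iris.data[:, 2:4]   # petal length and petal width
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

# Scaler and classifier are chained, so the scaler only ever sees training data when fitting
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(x_train, y_train)

train_score = model.score(x_train, y_train)
test_score = model.score(x_test, y_test)
print('Training score:', train_score)
print('Test score:', test_score)

# predict() returns one of the category values 0, 1, 2 for each row
predictions = model.predict(x_test)
print(predictions[:5])
```

A test score that is much lower than the training score would suggest the model is overfitting the training data.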

Types of classification models

Here are some other classification models.

** Decision tree **


This is one of the most commonly used methods for classification. During training the model branches on the values of the variables, and the finished model can be visualized in the shape of a tree.
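The branching can be inspected directly. The sketch below fits scikit-learn's DecisionTreeClassifier on all four iris features and prints the learned rules with export_text. The depth limit of 2 is an arbitrary choice made here to keep the printed tree short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=3)

# max_depth=2 is an arbitrary choice to keep the printed tree small
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(x_train, y_train)

# export_text renders the learned branches as an indented text tree
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

tree_score = tree.score(x_test, y_test)
print('Test score:', tree_score)
```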

** Random Forest **

It is an ensemble learning algorithm that uses decision trees as weak learners. It is based on the decision tree, but many trees are built with randomness, independently of one another, and the final result is decided by a majority vote over their predictions.
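A minimal sketch with scikit-learn's RandomForestClassifier follows. The choice of 100 trees is an assumption for illustration. Note that scikit-learn's implementation combines the trees by averaging their predicted class probabilities rather than by a strict one-vote-per-tree majority.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=3)

# n_estimators is the number of decision trees in the forest
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)

# Each fitted tree is available in estimators_
n_trees = len(forest.estimators_)
forest_score = forest.score(x_test, y_test)
print('Number of trees:', n_trees)
print('Test score:', forest_score)
```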

** Gradient Boosting Decision Tree **

It is abbreviated as GBDT. This method is also built on decision trees: trees are trained one after another, each one fitted to reduce the error left by the trees before it, and the final prediction combines them all. Each additional round lowers the training error, although too many rounds can overfit.

Boosting of this kind is currently the mainstream approach.
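The effect of repeated rounds can be observed with scikit-learn's GradientBoostingClassifier, whose staged_predict method yields the predictions after each boosting round. This is a sketch only; the 50 rounds and the checkpoints 1, 10, and 50 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=3)

gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0)
gbdt.fit(x_train, y_train)

# staged_predict yields the test predictions after each boosting round
stage_scores = []
for pred in gbdt.staged_predict(x_test):
    stage_scores.append((pred == y_test).mean())

for n_rounds in (1, 10, 50):
    print(n_rounds, 'rounds, test score:', stage_scores[n_rounds - 1])
```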

All of these are included in the scikit-learn library, so you can experiment with them.

Summary

Today I explained how the classification model works. There are many other classification models.

To begin with, understand what classification is and get a firm grasp of how to build a model and verify it.

18 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube： https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter： https://twitter.com/otupython
