You will become an engineer in 100 days --Day 82 --Programming --About machine learning 7

The posts up to yesterday are here:

You will become an engineer in 100 days --Day 76 --Programming --About machine learning

You will become an engineer in 100 days --Day 70 --Programming --About scraping

You will become an engineer in 100 days --Day 66 --Programming --About natural language processing

You will become an engineer in 100 days --Day 63 --Programming --Probability 1

You will become an engineer in 100 days --Day 59 --Programming --Algorithms

You will become an engineer in 100 days --Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will become an engineer in 100 days --Day 24 --Python --Basics of Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

This post is a continuation of the series on machine learning.

About classification models

What machine learning can do was covered at the beginning of this series; basically, there are three things:

・ Regression
・ Classification
・ Clustering

Roughly speaking, all of them come down to prediction; what changes is what you predict.

・ Regression: predicts numerical values
・ Classification: predicts category values
・ Clustering: groups similar data together

A classification model predicts category values.

The data used this time is the `iris` (Japanese iris, ayame) dataset bundled with `scikit-learn`.

・ sepal length (cm): Sepal length
・ sepal width (cm): Sepal width
・ petal length (cm): Petal length
・ petal width (cm): Petal width

There are three varieties of irises: setosa, versicolor, and virginica.

Data visualization

Let's load the data and take a look at it.

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Read the iris data
iris = load_iris()

# Convert to a data frame (features plus the target column)
df = pd.DataFrame(
    np.concatenate((iris.data, iris.target.reshape(-1, 1)), axis=1),
    columns=iris.feature_names + ['type'])
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) type
0 5.1 3.5 1.4 0.2 0
1 4.9 3 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5 3.6 1.4 0.2 0

The data covers three varieties of iris and has 150 rows of numerical data across four columns. The last column, `type`, is the iris variety.

Since there are three varieties in this dataset, `type` takes the values 0, 1, and 2.
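
If you want to check which number corresponds to which variety, the dataset object returned by `load_iris()` also carries the variety names in `target_names`, indexed in the same order as the numeric labels:

# Check the mapping between the numeric labels and the variety names
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica'] -> 0, 1, 2 in order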

Let's visualize the data to see what characteristics it has.

import matplotlib.pyplot as plt
%matplotlib inline

#Store data for visualization in variables
x1 = df['petal length (cm)']
x2 = df['petal width (cm)']
y  = df['type']
x  = df[['petal length (cm)','petal width (cm)']]

#Draw data
for i in range(3):
    plt.scatter(x1[y==i],x2[y==i],marker="x",label=i)
plt.legend()
plt.grid()
plt.show()

[Figure: scatter plot of petal length vs. petal width, colored by variety]

The three varieties look fairly well separated, each forming its own cluster.

Here, let's also try normalization and standardization, two common ways of preprocessing data for machine learning.

Normalization (min-max normalization)

Normalization is a method of rescaling data so that the minimum value becomes 0 and the maximum value becomes 1. Because it relies on the minimum and maximum, it is said to be strongly affected by extreme values.

normalized value = (value - minimum value) / (maximum value - minimum value)
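
As a quick check of the formula, here is a minimal sketch applying it by hand with NumPy (the array values below are made up purely for illustration):

import numpy as np

# Made-up example values
values = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max normalization: (value - minimum) / (maximum - minimum)
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)   # [0.   0.25 0.5  1.  ]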

Standardization (z-score normalization)

Standardization transforms numeric data so that it has a mean of 0 and a variance of 1. It is said to be more robust to outliers than normalization.

standardized value = (value - mean) / standard deviation
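
And a minimal sketch of standardization with the same made-up values (note that NumPy's `std()` uses the population standard deviation by default):

import numpy as np

# The same made-up example values
values = np.array([2.0, 4.0, 6.0, 10.0])

# Standardization: (value - mean) / standard deviation
standardized = (values - values.mean()) / values.std()
print(standardized.mean())   # approximately 0
print(standardized.std())    # 1.0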

Standardization can also shrink the scale of the data, which can speed up learning.

When the scales of the explanatory variables differ greatly (height and weight, for example), the variables with larger values have an outsized effect on learning; this can be remedied with standardization.

How to normalize and standardize

Normalization and standardization can be done with the following code.

Use `MinMaxScaler` and `StandardScaler` from scikit-learn.

** Data split **

First, split the data into training and test sets. This time we split it 7:3.

# Split the data into training and test sets (7:3)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

** Normalize **

# Perform normalization
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()

# Fit to the training data and normalize it
x_train_norm = mms.fit_transform(x_train)

# Normalize the test data using the training data's min and max
x_test_norm = mms.transform(x_test)

print('Maximum value of training data: ', x_train_norm.max())
print('Minimum value of training data: ', x_train_norm.min())
# The test data is scaled with the training data's min/max, so it can fall slightly outside the 0-1 range
print('Maximum value of test data: ', x_test_norm.max())
print('Minimum value of test data: ', x_test_norm.min())

Maximum value of training data: 1.0
Minimum value of training data: 0.0
Maximum value of test data: 1.0357142857142858
Minimum value of test data: -0.017857142857142877

** Standardize **

# Perform standardization
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

# Fit to the training data, then standardize both training and test data with the same parameters
x_train_std = ss.fit_transform(x_train)
x_test_std  = ss.transform(x_test)

print('Mean of training data: ', x_train_std.mean())
print('Standard deviation of training data: ', x_train_std.std())
# The test data is scaled with the training data's mean/std, so its statistics differ slightly from 0 and 1
print('Mean of test data: ', x_test_std.mean())
print('Standard deviation of test data: ', x_test_std.std())

Mean of training data: 3.425831047414769e-16
Standard deviation of training data: 1.0000000000000002
Mean of test data: -0.1110109621182351
Standard deviation of test data: 1.040586519459273

After standardization, the mean is very close to 0 and the standard deviation is very close to 1.

Let's compare the results of normalization and standardization with the original data.

# Visualize the original data alongside its normalized and standardized versions
x_norm = mms.transform(x)
x_std  = ss.transform(x)
plt.axis('equal')
plt.scatter(x.iloc[:,0],x.iloc[:,1], marker="x", label="org" ,c='blue')
plt.scatter(x_std[:,0] ,x_std[:,1] , marker="x", label="std" ,c='red')
plt.scatter(x_norm[:,0],x_norm[:,1], marker="x", label="norm",c='green')
plt.legend()
plt.grid()
plt.show()

[Figure: scatter plot comparing the original (blue), standardized (red), and normalized (green) data]

Blue is the original data, red is the standardized data, and green is the normalized data. You can see that the range of each distribution has changed considerably.

Creating a predictive model

Let's see how the machine learning results differ between the normalized data, the standardized data, and the original data. The code below classifies the data using logistic regression and outputs the training score; the higher the score, the better the accuracy.

#Load a logistic regression model
from sklearn.linear_model import LogisticRegression
lr      = LogisticRegression(solver='liblinear', multi_class='auto')
lr_norm = LogisticRegression(solver='liblinear', multi_class='auto')
lr_std  = LogisticRegression(solver='liblinear', multi_class='auto')

lr.fit(x_train,y_train)
print('Original data score:',lr.score(x_train,y_train))

lr_norm.fit(x_train_norm,y_train)
print('Score of normalized data:',lr_norm.score(x_train_norm,y_train))

lr_std.fit(x_train_std,y_train)
print('Standardized data score:',lr_std.score(x_train_std,y_train))

Original data score: 0.8666666666666667
Score of normalized data: 0.8095238095238095
Standardized data score: 0.9523809523809523

The accuracy of standardized data is much higher.

A classification model ultimately outputs a predicted value of the objective variable. In this case it outputs one of the category values 0, 1, or 2, and accuracy is measured by checking whether the predictions match the test data.
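
For example, the standardized model trained above can be checked against the test data like this (a sketch reusing `lr_std`, `x_test_std`, and `y_test` from the earlier code blocks):

# Predict the category values (0, 1, 2) for the test data
y_pred = lr_std.predict(x_test_std)
print(y_pred[:5])

# Accuracy on the test data: the fraction of predictions that match the true labels
print('Test data score:', lr_std.score(x_test_std, y_test))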

The accuracy verification method is the same as the one used in the earlier lecture, Machine learning 5.

Types of classification models

Here are some other classification models.

** Decision tree **

Reference: wikipedia

When it comes to classification, this is one of the most frequently used methods. During learning, the model branches on the values of the variables, and the final result can be visualized in the shape of a tree.

** Random Forest **

Random forest is an ensemble learning algorithm that uses decision trees as weak learners. It is based on decision trees, but many trees are created with random variation and run in parallel, and the final prediction is decided by a majority vote of their results.

** Gradient Boosting Decision Tree **

It is abbreviated as GBDT (Gradient Boosting Decision Tree). This method is also based on decision trees: learning is repeated so that each new tree reduces the error left by the previous ones, and the final conclusion is drawn from their combination. Generally, the more rounds of learning, the better the accuracy tends to be.

Boosting-based methods like this are currently the mainstream.

All of these are included in the scikit-learn library, so you can experiment with them easily.
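
As a rough sketch of how you might try them, each model can be fitted on the same standardized iris data in almost the same way as the logistic regression above (hyperparameters are left at their defaults here, and `x_train_std` / `y_train` are the variables defined earlier):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Fit each classifier on the standardized training data and print its training score
for model in [DecisionTreeClassifier(random_state=3),
              RandomForestClassifier(random_state=3),
              GradientBoostingClassifier(random_state=3)]:
    model.fit(x_train_std, y_train)
    print(model.__class__.__name__, ':', model.score(x_train_std, y_train))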

Summary

Today I explained how classification models work. There are many other classification models besides the ones covered here.

First, start by understanding what classification is, and then get a grasp of how to build and verify models.

18 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
