Previous articles in this series:
You will become an engineer in 100 days - Day 76 - Programming - About machine learning
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About the database
You will become an engineer in 100 days - Day 24 - Python - Basics of Python language 1
You will become an engineer in 100 days - Day 18 - Javascript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS Basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This post continues the series on machine learning.
First, a recap of what machine learning can do. There are basically three things:
・ Regression
・ Classification
・ Clustering
Roughly speaking, all three are forms of prediction, but what is predicted differs.
・ Regression: predicts numerical values
・ Classification: predicts categories
・ Clustering: groups similar data together
A classification model predicts a category value.
The data used this time is the `iris` (Ayame) dataset bundled with `scikit-learn`.
Column | Meaning |
---|---|
sepal length (cm) | Sepal length |
sepal width (cm) | Sepal width |
petal length (cm) | Petal length |
petal width (cm) | Petal width |
There are three varieties of iris: `setosa`, `versicolor`, and `virginica`.
Let's load the data and take a look.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
#Reading iris data
iris = load_iris()
#Convert to data frame
df = pd.DataFrame(np.concatenate((iris.data, iris.target.reshape(-1,1)), axis=1),columns=iris.feature_names + ['type'])
df.head()
index | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | type |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
1 | 4.9 | 3 | 1.4 | 0.2 | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
4 | 5 | 3.6 | 1.4 | 0.2 | 0 |
The data covers three varieties of iris (Ayame): 150 rows, each with 4 columns of numerical features.
The last column, `type`, is the iris variety.
Since there are three varieties, its values are 0, 1, and 2.
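To make the numeric codes easier to read, here is a minimal sketch that maps them back to variety names. It assumes the `target_names` attribute of the object returned by `load_iris()`; the column name `type_name` is a made-up name for illustration and does not appear in the rest of this article.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['type'] = iris.target

# target_names holds the variety name for each code (positions 0, 1, 2)
# 'type_name' is a hypothetical column name used only in this sketch
df['type_name'] = iris.target_names[iris.target]

# Count how many rows belong to each variety
counts = df['type_name'].value_counts()
print(counts)
```

Each variety should appear the same number of times, which is worth confirming before training a classifier.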
Let's visualize the features to see what they look like.
import matplotlib.pyplot as plt
%matplotlib inline
#Store data for visualization in variables
x1 = df['petal length (cm)']
x2 = df['petal width (cm)']
y = df['type']
x = df[['petal length (cm)','petal width (cm)']]
#Draw data
for i in range(3):
    plt.scatter(x1[y==i],x2[y==i],marker="x",label=i)
plt.legend()
plt.grid()
plt.show()
The three varieties of iris form fairly well-separated groups.
Next, let's try normalization and standardization, two common ways of preprocessing data for machine learning.
Normalization rescales the data so that the minimum value becomes 0 and the maximum value becomes 1. Because it relies on the minimum and maximum values, it is strongly affected by outliers.
Normalized value = (value - minimum value) / (maximum value - minimum value)
Standardization rescales numeric data so that it has a mean of 0 and a variance of 1. It is said to be more robust to outliers than normalization.
Standardized value = (value - mean) / standard deviation
Standardization puts the features on a common scale, which can speed up learning.
When the explanatory variables have very different scales (height and weight, for example), the variables with larger values have a disproportionate influence on learning. Standardization can remedy this.
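The two formulas above can be checked by hand with NumPy. This is only a sketch: the array `values` is made-up sample data, and the standard deviation is the population one (NumPy's default).

```python
import numpy as np

# Made-up sample values; 10.0 plays the role of an outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Normalization: (value - minimum value) / (maximum value - minimum value)
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: (value - mean) / standard deviation
standardized = (values - values.mean()) / values.std()

print('normalized  :', normalized)
print('standardized:', standardized)
print('mean, std after standardization:', standardized.mean(), standardized.std())
```

In the normalized result the outlier becomes 1 and squeezes the other values toward 0, which illustrates why normalization is sensitive to the minimum and maximum.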
Normalization and standardization can be done with the following code.
It uses `MinMaxScaler` and `StandardScaler` from scikit-learn.
** Data split **
First, split the data into training and test sets. This time we split at 7:3.
#Split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)
** Normalize **
#Perform normalization
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
#Normalize training data
x_train_norm = mms.fit_transform(x_train)
#Normalize test data based on training data
x_test_norm = mms.transform(x_test)
print('Maximum training data: ' , x_train_norm.max())
print('Minimum value of training data: ' , x_train_norm.min())
#The test data is scaled with the training data's min and max, so it can fall slightly outside 0 to 1.
print('Maximum value of test data: ' , x_test_norm.max())
print('Minimum value of test data: ' , x_test_norm.min())
Maximum training data: 1.0
Minimum value of training data: 0.0
Maximum value of test data: 1.0357142857142858
Minimum value of test data: -0.017857142857142877
** Standardize **
#Standardize
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x_train_std = ss.fit_transform(x_train)
x_test_std = ss.transform(x_test)
print('Mean of training data: ' , x_train_std.mean())
print('Standard deviation of training data: ' , x_train_std.std())
#The test data is scaled with the training data's mean and standard deviation, so it may differ a little.
print('Mean of test data: ' , x_test_std.mean())
print('Standard deviation of test data: ' , x_test_std.std())
Mean of training data: 3.425831047414769e-16
Standard deviation of training data: 1.0000000000000002
Mean of test data: -0.1110109621182351
Standard deviation of test data: 1.040586519459273
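After standardization, the training data has a mean very close to 0 and a standard deviation of 1. The test data deviates slightly because it is scaled with the statistics of the training data.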
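Calling `.mean()` and `.std()` on the whole array averages over both columns at once. As a supplementary sketch, passing `axis=0` checks each feature separately. The data preparation below rebuilds the same petal length and petal width split from scratch so that the block runs on its own; slicing `iris.data[:, 2:4]` is my shortcut for selecting those two columns.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x = iris.data[:, 2:4]  # petal length (cm) and petal width (cm)
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

ss = StandardScaler()
x_train_std = ss.fit_transform(x_train)
x_test_std = ss.transform(x_test)

# axis=0 computes the statistic per column (per feature)
train_mean = x_train_std.mean(axis=0)
train_sd = x_train_std.std(axis=0)
print('Per-column mean of training data:', train_mean)
print('Per-column standard deviation of training data:', train_sd)

# The scaler remembers the training statistics it learned in fit
print('Learned mean:', ss.mean_)
print('Learned standard deviation:', ss.scale_)
```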
Let's compare the normalized and standardized results with the original data.
#Visualize the original data after normalization and standardization
x_norm = mms.transform(x)
x_std = ss.transform(x)
plt.axis('equal')
plt.scatter(x.iloc[:,0],x.iloc[:,1], marker="x", label="org" ,c='blue')
plt.scatter(x_std[:,0] ,x_std[:,1] , marker="x", label="std" ,c='red')
plt.scatter(x_norm[:,0],x_norm[:,1], marker="x", label="norm",c='green')
plt.legend()
plt.grid()
plt.show()
Blue is the original data, red is the standardized data, and green is the normalized data. You can see that the range and spread of the values change considerably.
Let's compare the machine learning results for the normalized data, the standardized data, and the original data.
The following code outputs the result of classification using logistic regression.
The higher the score, the better the accuracy.
#Load a logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear', multi_class='auto')
lr_norm = LogisticRegression(solver='liblinear', multi_class='auto')
lr_std = LogisticRegression(solver='liblinear', multi_class='auto')
lr.fit(x_train,y_train)
print('Original data score:',lr.score(x_train,y_train))
lr_norm.fit(x_train_norm,y_train)
print('Score of normalized data:',lr_norm.score(x_train_norm,y_train))
lr_std.fit(x_train_std,y_train)
print('Standardized data score:',lr_std.score(x_train_std,y_train))
Original data score: 0.8666666666666667
Score of normalized data: 0.8095238095238095
Standardized data score: 0.9523809523809523
In this run, the standardized data gives the highest score.
A classification model ultimately outputs a predicted value of the objective variable.
In this case it outputs one of the values 0, 1, or 2 as the category, and the accuracy is measured by checking whether each prediction matches the correct label.
The accuracy verification method is the same as the one used in the earlier lecture, Machine learning 5.
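As a sketch of what "checking whether the prediction matches" looks like on held-out data, the block below predicts categories for the test set and compares them with the correct labels. It is self-contained, so it repeats the split and standardization. Note the assumptions: it uses `LogisticRegression()` with default settings rather than the `liblinear` solver used above, and `accuracy_score` from `sklearn.metrics`, so the numbers will not necessarily match the scores printed earlier.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x = iris.data[:, 2:4]  # petal length (cm) and petal width (cm)
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

# Fit the scaler on the training data only, then apply it to both sets
ss = StandardScaler()
x_train_std = ss.fit_transform(x_train)
x_test_std = ss.transform(x_test)

# Assumption: default solver, not the liblinear setting used in the article
model = LogisticRegression()
model.fit(x_train_std, y_train)

# predict returns one category value (0, 1, or 2) per test row
y_pred = model.predict(x_test_std)
print('First 10 predictions:', y_pred[:10])
print('First 10 correct labels:', y_test[:10])

# Accuracy = fraction of predictions that match the correct labels
test_accuracy = accuracy_score(y_test, y_pred)
print('Test accuracy:', test_accuracy)
```

Scoring on the test set rather than the training set shows how well the model handles data it has not seen during training.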
Here are some other classification models.
** Decision tree **
Reference: wikipedia
This is one of the most commonly used methods for classification. During learning, the data is split into branches according to the values of the variables, and the resulting model can be visualized in the shape of a tree.
** Random Forest **
It is an ensemble learning algorithm that uses decision trees as weak learners.
It is based on the decision tree: multiple decision trees are built with randomness, run in parallel, and their results are combined by majority vote.
** Gradient Boosting Decision Tree **
It is abbreviated as GBDT (Gradient Boosting Decision Tree). This method is also based on decision trees: learning is repeated so that each round reduces the error left by the previous result, and the final prediction is drawn from the combined result. In general, accuracy improves as learning is repeated, although too many rounds can overfit the training data.
This technique is called boosting, and it is the current mainstream.
All of these are included in the scikit-learn library, so you can experiment with them.
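For anyone who wants to experiment, here is a minimal sketch that trains the three models on the same petal length and petal width split. The class names `DecisionTreeClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` are scikit-learn's implementations; the `random_state=0` and `n_estimators=100` settings are arbitrary choices of mine, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x = iris.data[:, 2:4]  # petal length (cm) and petal width (cm)
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

# Arbitrary, untuned settings chosen only for this sketch
models = {
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'Gradient boosting': GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # score returns the accuracy on the given data
    scores[name] = model.score(x_test, y_test)
    print(name, 'test score:', scores[name])
```

Tree-based models split on thresholds of each feature, so they are largely insensitive to feature scale, which is why this sketch skips normalization and standardization.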
Today I explained how classification models work. There are many other classification models as well.
To begin with, make sure you understand what classification is, and get a firm grasp of how to build and verify a model.
18 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython