[PYTHON] You will be an engineer in 100 days - Day 80 - Programming - About machine learning 5

Click here for the articles up to yesterday

This article continues the story about machine learning.

About the data processing flow of machine learning

The flow of work when incorporating machine learning is as follows.

1. Determine the purpose
2. Data acquisition
3. Data understanding / selection / processing
4. Data mart (data set) creation
5. Model creation
6. Accuracy verification
7. System implementation

Of these, steps 2 and 3 are called data preprocessing.

This time, I would like to create a data mart as part of this preprocessing.

Preprocessing up to yesterday

The language is Python. The libraries for machine learning are pandas and NumPy. The libraries for visualization are seaborn and matplotlib.

** Loading library **

#Loading the library
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Starting from this setup, I created the data mart.

Click here for yesterday's lecture

From here, we will start creating a model using the data mart.

About model creation

From here, machine learning will be performed using the created data.

First of all, load the library for machine learning.

** Loading machine learning library **

scikit-learn: Library for machine learning

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import recall_score,precision_score,f1_score,accuracy_score
from sklearn.metrics import confusion_matrix

To explain the library briefly: scikit-learn itself is a huge library, so load only what you need. I will talk about each item later.

Model making flow

There are several steps to making a model.

The first task is to split the data into training data and test data. Imagine taking an entrance exam.

If you have seen the exam answers in advance, you will get close to a perfect score. In the same way, in machine learning it does not make much sense to verify a model with data it has already learned from.

So the model is trained on the training data only, with the verification data held out, and once the model is complete it is verified with the remaining test data.
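To see why verifying on already-learned data is misleading, here is a minimal sketch. It assumes synthetic data with purely random labels (not the data mart from this series), so there is nothing real to learn. A decision tree can still memorize the training data, and the difference only shows up on the held-out test data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data (an assumption for illustration): the labels are random,
# so there is no real pattern for a model to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unrestricted decision tree can memorize the training data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(x_train))
test_acc = accuracy_score(y_test, clf.predict(x_test))
print('accuracy on training data:', train_acc)
print('accuracy on test data    :', test_acc)
```

The score on the training data looks perfect, while the score on the test data should stay near chance level, which is the honest estimate for random labels.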

After verification, if you are not satisfied with the result, try changing the method, adjusting the parameters, or going back to data selection and processing, and then create the model again.

** Data split **

Here, a simple split is performed with the hold-out method. Use train_test_split to split the data into training data and test data.

X = Data other than the correct label (multiple columns)
Y = Correct label data (1 column)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test size)

For the test size, specify the proportion of the data to use as test data as a decimal. Example: 0.2

The X and Y here are taken from the columns of the dataframe created yesterday.

# Explanatory variables: every column except the correct label
X = data_df.drop(['Survived'],axis=1)
# Objective variable: the correct label
Y = data_df['Survived']
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.2)
print(len(x_train))
print(len(y_train))
print(len(x_test))
print(len(y_test))

712
712
179
179

This creates four datasets: the explanatory variables for training, the objective variable for training, the explanatory variables for testing, and the objective variable for testing.
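Because train_test_split shuffles the rows randomly, the split changes on every run unless it is fixed. The following is a small sketch using dummy arrays with 891 rows (the row count implied by 712 + 179) in place of the real data mart. The random_state and stratify arguments are additions for illustration and are not used in the code above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data with 891 rows, the row count implied by 712 + 179.
# The feature values and labels are placeholders, not the real data mart.
X = np.arange(891 * 3).reshape(891, 3)
Y = np.array([0, 1, 0] * 297)  # 891 labels, one third of them are 1

# random_state fixes the shuffle so the same split is produced on every run;
# stratify=Y keeps the ratio of 0/1 labels similar in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y
)

print(len(x_train), len(y_train), len(x_test), len(y_test))
print('share of label 1 in train:', y_train.mean())
print('share of label 1 in test :', y_test.mean())
```

With test_size=0.2 the test part is rounded up to 179 rows and the remaining 712 rows become the training part, which matches the sizes printed earlier.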

** Model call **

Next is the model call. Decide what method to use and call it from the library.

Here, we call the logistic regression model. Logistic regression is often used to discriminate between two values. The LogisticRegression class is the logistic regression model.

clf = LogisticRegression()
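As a rough picture of how logistic regression produces one of two values, the sketch below passes a weighted sum of the features through the sigmoid function and compares the result with 0.5. The weights, bias, and function names are hypothetical values chosen for illustration. A fitted LogisticRegression learns its own coefficients from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias for two features; a fitted model would
# learn these values from the training data.
weights = [0.8, -1.2]
bias = 0.1

def predict_label(features, threshold=0.5):
    # Linear score: weighted sum of the features plus the bias
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    # Label 1 if the probability reaches the threshold, otherwise 0
    label = 1 if probability >= threshold else 0
    return label, probability

for sample in ([2.0, 0.5], [0.2, 1.5]):
    label, prob = predict_label(sample)
    print(sample, '->', label, round(prob, 3))
```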

** Model learning **

After calling the model, the next step is learning. Learning uses the training data of the explanatory variables and the training data of the objective variable. Basically, learning can be written in one line.

clf.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)

This is the end of learning.

** Make a prediction **

Next, verify with the test data. In the verification, the labels of the test data are predicted using the explanatory variables of the test data that were split off earlier.

y_predict = clf.predict(x_test)

The predicted result is now stored in y_predict. It is an array of predicted labels.

y_predict[0:5]

array([0, 0, 1, 0, 0])

The prediction outputs a value of 0 or 1. Since the correct labels are also 0 or 1, the accuracy can be verified by comparing the two.

** Verify the accuracy of the model **

Next is the accuracy verification of the model. Calculate how accurate the model is. The functions for accuracy verification were loaded from the library earlier, so I will introduce them here.

confusion_matrix: Output a 2x2 confusion matrix
accuracy_score: Calculate the correct answer rate
precision_score: Calculate the precision
recall_score: Calculate the recall
f1_score: Calculate the harmonic mean of precision and recall

Let's verify the accuracy right away.

# confusion_matrix is called as (y_predict, y_test) here,
# so the rows are predictions and the columns are actual labels
print(pd.DataFrame(confusion_matrix(y_predict, y_test), index=['predict 0', 'predict 1'], columns=['real 0', 'real 1']))
print()
print('accuracy  : ',accuracy_score(y_test, y_predict))
print('precision : ',precision_score(y_test, y_predict))
print('recall    : ',recall_score(y_test, y_predict))
print('f1_score  : ',f1_score(y_test, y_predict))

           real 0  real 1
predict 0     100      25
predict 1      15      39

accuracy  :  0.776536312849162
precision :  0.7222222222222222
recall    :  0.609375
f1_score  :  0.6610169491525424

The result looks like this. A model created with this data has an accuracy (correct answer rate) of about 77.7%.

How to read the numerical value at the time of verification

First of all, the prediction result is either 0 or 1. Since the correct labels are also either 0 or 1, there are 2x2 patterns, which form the confusion matrix.

The relationship between prediction and actual measurement is as follows.

	Actual measurement 0	Actual measurement 1
Prediction 0	100	25
Prediction 1	15	39
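The table above is just a tally of every (prediction, actual) pair. The sketch below rebuilds label lists from the counts in the table and counts the pairs in plain Python. The real y_test and y_predict arrays are not reproduced here, so the ordering of the rows is an assumption.

```python
from collections import Counter

# Label pairs rebuilt from the counts in the table above (order is arbitrary;
# the real y_test / y_predict arrays are not reproduced here).
pairs = (
    [(0, 0)] * 100   # prediction 0, actual 0
    + [(0, 1)] * 25  # prediction 0, actual 1
    + [(1, 0)] * 15  # prediction 1, actual 0
    + [(1, 1)] * 39  # prediction 1, actual 1
)
y_predict = [p for p, _ in pairs]
y_test = [a for _, a in pairs]

# Count every (prediction, actual) combination
counts = Counter(zip(y_predict, y_test))

print('              actual 0  actual 1')
for p in (0, 1):
    print(f'prediction {p}   {counts[(p, 0)]:8d}  {counts[(p, 1)]:8d}')
```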

First, calculate the correct answer rate (accuracy). It is the number of correct predictions divided by the total number of predictions: (value for actual measurement 0 and prediction 0 + value for actual measurement 1 and prediction 1) / total of all values.

(100 + 39)/(100 + 25 + 15 + 39) = 139/179 = 0.7765

Next is the precision. It is the value for actual measurement 1 and prediction 1 divided by the total of the prediction 1 row.

39/(15 + 39) = 39/54 = 0.7222

Next is the recall. It is the value for actual measurement 1 and prediction 1 divided by the total of the actual measurement 1 column.

39/(25 + 39) = 39/64 = 0.6094

Finally, f1_score. It is the harmonic mean of the precision and the recall.

2 * precision * recall / (precision + recall)

2 * 0.7222 * 0.6094 / (0.7222 + 0.6094) = 0.8802 / 1.3316 = 0.6610
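The four hand calculations can be checked with a few lines of plain Python. The variable names tn, fn, fp, and tp (true negative, false negative, false positive, true positive) are names I introduce here for the four cells of the table. They do not appear in the code above.

```python
# Counts taken from the table of predictions and actual measurements
tn = 100  # prediction 0, actual 0
fn = 25   # prediction 0, actual 1
fp = 15   # prediction 1, actual 0
tp = 39   # prediction 1, actual 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('accuracy  : ', round(accuracy, 4))
print('precision : ', round(precision, 4))
print('recall    : ', round(recall, 4))
print('f1_score  : ', round(f1, 4))
```

The results agree with the values returned by accuracy_score, precision_score, recall_score, and f1_score for the logistic regression model.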

Which number to emphasize in verification depends on the purpose of the machine learning. This time we are making a binary judgment, and if the correct labels are split roughly evenly, the correct answer rate is a reasonable choice. If one label is much rarer than the other and the data is biased, the correct answer rate may not be a good measure.

For example, with data in a ratio of 99:1, you can get a correct answer rate of 99% just by predicting the majority label for everything. Think about diagnosing an illness.

If you make a model that says nobody is sick, the correct answer rate is 99%, but the remaining 1% is overlooked and the sick people are the ones who suffer. In this case, a few false alarms are acceptable. What matters is a model that can actually detect sick people, because otherwise the minority is missed.
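The 99:1 case is easy to reproduce in plain Python. This sketch assumes 1000 people of whom 10 are sick, and a "model" that always answers "not sick". The numbers are made up for illustration.

```python
# 1000 people, 10 of whom are sick (label 1): a 99:1 ratio
y_true = [1] * 10 + [0] * 990

# A "model" that predicts that nobody is sick
y_pred = [0] * len(y_true)

# Correct answer rate: share of predictions that match the actual label
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall: share of the actually sick people that were detected
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = detected / sum(y_true)

print('accuracy:', accuracy)
print('recall  :', recall)
```

The correct answer rate is 99% even though not a single sick person is found, which is why recall is the number to watch in this kind of problem.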

The precision is the percentage of the data predicted to be positive that is actually positive, and the recall is the percentage of the actually positive data that is predicted to be positive.

If the purpose of the machine learning is to detect the minority reliably, you may choose a model that emphasizes recall.

However, doing so may flag too many cases that are not actually positive. In that case, f1_score, which takes both precision and recall into account, is used as an index.
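The trade-off can be sketched with a model that outputs a score and a cutoff that turns the score into 0 or 1. This setup is an assumption for illustration and is not something done in the code above. The scores, labels, and thresholds below are made-up values. Lowering the cutoff catches more positives (higher recall) but also flags more negatives (lower precision), and f1_score sits between the two.

```python
# Hypothetical model scores and correct labels for 10 samples
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.2, 0.1]

def evaluate(threshold):
    """Return (precision, recall, f1) when scores >= threshold are labeled 1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

results = {}
for threshold in (0.75, 0.55, 0.25):
    results[threshold] = evaluate(threshold)
    precision, recall, f1 = results[threshold]
    print(f'threshold {threshold}: precision={precision:.3f} '
          f'recall={recall:.3f} f1={f1:.3f}')

# The threshold with the best balance according to f1_score
best_threshold = max(results, key=lambda t: results[t][2])
print('best threshold by f1:', best_threshold)
```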

** Other models **

Let's look at other models as well. We will only introduce the model here and will not go into details. First is the decision tree.


#Decision tree
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
y_predict = clf.predict(x_test)
print(pd.DataFrame(confusion_matrix(y_predict, y_test), index=['predict 0', 'predict 1'], columns=['real 0', 'real 1']))
print()
print('accuracy  : ',accuracy_score(y_test, y_predict))
print('precision : ',precision_score(y_test, y_predict))
print('recall    : ',recall_score(y_test, y_predict))
print('f1_score  : ',f1_score(y_test, y_predict))

           real 0  real 1
predict 0      95      26
predict 1      20      38

accuracy  :  0.7430167597765364
precision :  0.6551724137931034
recall    :  0.59375
f1_score  :  0.6229508196721311

Next is Random Forest


# RandomForest
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
y_predict = clf.predict(x_test)
print(pd.DataFrame(confusion_matrix(y_predict, y_test), index=['predict 0', 'predict 1'], columns=['real 0', 'real 1']))
print()
print('accuracy  : ',accuracy_score(y_test, y_predict))
print('precision : ',precision_score(y_test, y_predict))
print('recall    : ',recall_score(y_test, y_predict))
print('f1_score  : ',f1_score(y_test, y_predict))

           real 0  real 1
predict 0      98      26
predict 1      17      38

accuracy  :  0.7597765363128491
precision :  0.6909090909090909
recall    :  0.59375
f1_score  :  0.6386554621848739
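All three models share the same fit and predict interface, so the comparison can also be written as a loop. This is a minimal sketch that assumes synthetic data from make_classification instead of the data mart, so the scores it prints are not comparable with the results above. The max_iter and random_state settings are my additions for stable runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the data mart (an assumption for this sketch)
X, Y = make_classification(n_samples=500, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
    'RandomForest': RandomForestClassifier(random_state=0),
}

# Train each model on the same split and collect the same metrics
results = {}
for name, clf in models.items():
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    results[name] = {
        'accuracy': accuracy_score(y_test, y_predict),
        'f1_score': f1_score(y_test, y_predict),
    }
    print(name, results[name])
```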

** Calculation of contribution rate **

Some machine learning models can output a contribution rate (feature importance) that shows how much each piece of data contributed.

clf.feature_importances_


for i,v in zip(x_train.columns,clf.feature_importances_):
    print(i,'\t',v)

SibSp 0.08181730881501241
Parch 0.053030544663722166
Fare 0.40243782816341556
Sex2 0.28228147632317596
Pe_0.0 0.03352832009152742
Pe_1.0 0.014542002215684312
Pe_2.0 0.02212292439144309
Pe_3.0 0.022599544658725688
Pe_4.0 0.013099652111940165
Pe_5.0 0.013494114387414768
Pe_6.0 0.005599733163595443
Pe_7.0 0.002340597855733169
Pe_8.0 0.0030199997376331917
Em_C 0.012248329962351154
Em_N 0.0010747396045908525
Em_Q 0.010808812977944686
Em_S 0.025954070876089957
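The list is easier to read when sorted from the largest contribution to the smallest. The sketch below takes a few of the values from the output, rounded to four digits and typed in by hand, and sorts them. With a trained model, the same idea should work on zip(x_train.columns, clf.feature_importances_).

```python
# A few of the contribution rates from the output, rounded to 4 digits
importances = {
    'SibSp': 0.0818,
    'Parch': 0.0530,
    'Fare': 0.4024,
    'Sex2': 0.2823,
    'Pe_0.0': 0.0335,
    'Em_S': 0.0260,
}

# Sort by contribution rate, largest first
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f'{name}\t{value:.4f}')
```

In this result, Fare and Sex2 contribute far more than the other columns.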

Summary

Yesterday I processed the data to create a data mart for machine learning. Today, we created prediction models and verified their accuracy using that data. Creating the model itself is fairly easy and requires much less code.

Because so little code is needed, the important part of model creation is deciding which model to select: create various models, verify them, and pick the one with the highest accuracy.

If the accuracy is low, try other models, and if it is still low, go back to the data preprocessing and start over. In the worst case, start over from data acquisition. Repeat this cycle until you have a model you are satisfied with.

First, let's get a grasp of the whole flow.

20 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube： https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter： https://twitter.com/otupython