Previous posts in this series:
You will become an engineer in 100 days - Day 76 - Programming - About machine learning
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About the database
You will become an engineer in 100 days - Day 24 - Python - Basics of Python language 1
You will become an engineer in 100 days - Day 18 - Javascript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS Basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This post continues the discussion of machine learning.
The workflow for incorporating machine learning is as follows.
Of these, steps 2 to 3 are called data preprocessing.
This time, I would like to create a data mart as part of this preprocessing.
The language is Python.
The libraries used for machine learning are Pandas and Numpy.
The libraries used for visualization are seaborn and matplotlib.
** Loading library **
#Loading the library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
From this point on, yesterday's post created the data mart.
Click here for yesterday's lecture
From here, we will build a model and perform machine learning using the data mart created yesterday.
First, load the libraries for machine learning.
** Loading machine learning library **
scikit-learn: Library for machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score,precision_score,f1_score,accuracy_score
from sklearn.metrics import confusion_matrix
Roughly speaking, scikit-learn itself is a huge library, so load only what you need. I will explain each item later.
Building a model takes several steps.
The first task is to split the data into training data and test data.
Imagine taking an entrance exam.
If you have seen the exam answers in advance, you will get a score close to perfect. In the same way, in machine learning it does not make much sense to verify a model with data it has already learned.
The model is therefore trained on data that does not include the verification data, and once the trained model is complete, it is verified with the remaining test data.
After verification, if you are not satisfied with the result, try changing the method, adjusting the parameters, or going back as far as data selection, and then build the model again.
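The entrance exam analogy can be checked with a small experiment. The following is only a minimal sketch on synthetic data generated with make_classification, not on the data mart from this series, and all variable names are made up for illustration. It scores one decision tree on the data it learned from and on held-out data.

```python
# Minimal sketch on synthetic data (an assumption, not the Titanic data mart).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 10% of the labels flipped as noise
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # scored on data the model has already seen
test_acc = tree.score(X_te, y_te)   # scored on held-out data
print('train accuracy:', train_acc)
print('test accuracy :', test_acc)
```

An unpruned tree can memorize its training rows, so the training score says little about how the model behaves on new data. The held-out score is the one worth looking at.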
** Data split **
Here, a simple split is performed using the hold-out method.
Split the data into training data and test data using train_test_split.
X = Data other than the correct label (multiple columns)
Y = Correct label data (1 column)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test size)
The X and Y here specify the columns of the dataframe created yesterday.
X = data_df.drop(['Survived'],axis=1)
Y = data_df['Survived']
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.2)
print(len(x_train))
print(len(y_train))
print(len(x_test))
print(len(y_test))
712 712 179 179
This creates four data sets: the explanatory variables for training, the objective variable for training, the explanatory variables for testing, and the objective variable for testing.
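Because train_test_split shuffles the rows, each run gives a different split. As a minimal sketch, the example below uses a small hypothetical dataframe (demo_df, standing in for data_df) to show two optional parameters of train_test_split: random_state, which fixes the shuffle, and stratify, which keeps the label ratio the same in both parts. Whether these fit your case is a judgment call. The original code in this post uses neither.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_df: 100 rows, 40 positive and 60 negative labels
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'Fare': rng.uniform(5, 100, size=100),
    'SibSp': rng.integers(0, 4, size=100),
    'Survived': [1] * 40 + [0] * 60,
})
X_demo = demo_df.drop(['Survived'], axis=1)
Y_demo = demo_df['Survived']

# random_state makes the split repeatable; stratify keeps the 40:60 label ratio
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0, stratify=Y_demo)

print(len(x_tr), len(x_te))      # 80 20
print(y_tr.mean(), y_te.mean())  # share of positive labels in each part
```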
** Model call **
Next is the model call. Decide which method to use and call it from the library.
Here, we call the logistic regression model.
Logistic regression is often used to discriminate between two values.
LogisticRegression is the class for the logistic regression model.
clf = LogisticRegression()
** Model learning **
After calling the model, the next step is learning.
Learning is performed using the explanatory variable data for training and the objective variable data for training.
Basically, learning can be written in one line.
clf.fit(x_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)
This is the end of learning.
** Make a prediction **
Next, verify with the test data.
In the verification, the labels of the test data are predicted using the explanatory variables of the held-out test data.
y_predict = clf.predict(x_test)
The predicted result is now stored in y_predict.
Its contents are an array of predicted labels.
y_predict[0:5]
array([0, 0, 1, 0, 0])
The prediction outputs a value of 0 or 1.
Since the correct labels are also 0 or 1, the accuracy can be verified by comparing the two.
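Behind the 0 or 1 output, logistic regression computes a probability for each label. The following minimal sketch uses synthetic data, not the data mart from this post, and the variable names are made up for illustration. It shows predict_proba next to predict for a binary LogisticRegression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = LogisticRegression().fit(X_demo, y_demo)

proba = demo_clf.predict_proba(X_demo[:5])  # columns: P(label 0), P(label 1)
labels = demo_clf.predict(X_demo[:5])       # 1 where P(label 1) exceeds 0.5

print(np.round(proba, 3))
print(labels)
```

For a binary model like this one, the predicted label is 1 when the probability of label 1 exceeds 0.5, and 0 otherwise.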
** Verify the accuracy of the model **
Next is the accuracy verification of the model. Calculate how accurate the model is. The functions for accuracy verification were loaded from the library earlier, so I will introduce them:
confusion_matrix: Outputs the 2x2 confusion matrix
accuracy_score: Calculates the accuracy (correct answer rate)
precision_score: Calculates the precision
recall_score: Calculates the recall
f1_score: Calculates the harmonic mean of precision and recall
Let's verify the accuracy right away.
print(pd.DataFrame(confusion_matrix(y_predict, y_test), index=['predict 0', 'predict 1'], columns=['real 0', 'real 1']))
print()
print('accuracy : ',accuracy_score(y_test, y_predict))
print('precision : ',precision_score(y_test, y_predict))
print('recall : ',recall_score(y_test, y_predict))
print('f1_score : ',f1_score(y_test, y_predict))
           real 0  real 1
predict 0     100      25
predict 1      15      39

accuracy : 0.776536312849162
precision : 0.7222222222222222
recall : 0.609375
f1_score : 0.6610169491525424
The result looks like this.
With a model built on this data, the accuracy (accuracy_score) is about 77.6%.
How to read the numerical values from the verification
First of all, the prediction results are divided into 0 and 1.
Since the correct labels are also divided into 0 and 1 in advance, the results form a 2x2 confusion matrix.
The relationship between predicted and actual values is as follows.
| | Actual measurement 0 | Actual measurement 1 |
|---|---|---|
| Prediction 0 | 100 | 25 |
| Prediction 1 | 15 | 39 |
From here, we calculate each metric. The accuracy (correct answer rate) is
(count of actual 0 predicted 0 + count of actual 1 predicted 1) / total count
(100 + 39)/(100 + 25 + 15 + 39) = 139/179 = 0.7765
Next is the precision. The precision is
count of actual 1 predicted 1 / count of predicted 1
39/(15 + 39) = 39/54 = 0.7222
Next is the recall. The recall is
count of actual 1 predicted 1 / count of actual 1
39/(25 + 39) = 39/64 = 0.6094
Finally, f1_score. f1_score is the harmonic mean of the precision and recall.
2 * precision * recall / (precision + recall)
2 * 0.7222 * 0.6094 / (0.7222 + 0.6094) = 0.8802 / 1.3316 = 0.6610
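The hand calculations above can be cross-checked in code. This minimal sketch takes the four counts from the confusion matrix in this post, computes each metric from its definition, and compares the results with the scikit-learn functions. The label lists are rebuilt from the counts for illustration and are not the actual y_test and y_predict.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Counts from the confusion matrix above
tn, fn = 100, 25   # predicted 0: actual 0, actual 1
fp, tp = 15, 39    # predicted 1: actual 0, actual 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Rebuild label lists with the same counts to cross-check with scikit-learn
y_true = [0] * tn + [0] * fp + [1] * fn + [1] * tp
y_pred = [0] * tn + [1] * fp + [0] * fn + [1] * tp

print(f'{accuracy:.4f} {accuracy_score(y_true, y_pred):.4f}')    # 0.7765 0.7765
print(f'{precision:.4f} {precision_score(y_true, y_pred):.4f}')  # 0.7222 0.7222
print(f'{recall:.4f} {recall_score(y_true, y_pred):.4f}')        # 0.6094 0.6094
print(f'{f1:.4f} {f1_score(y_true, y_pred):.4f}')                # 0.6610 0.6610
```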
Which metric to emphasize in verification depends on the purpose of the machine learning. This time we are making a binary judgment; if the correct labels are split roughly evenly, accuracy will do. If one class is much smaller or larger than the other, accuracy may not be a desirable metric.
For example, with data in a 99:1 ratio, you can get 99% accuracy by predicting the majority label for everything. Think about diagnosing an illness.
If you build a model that says nobody is sick, the accuracy is 99%, but the remaining 1% is overlooked, and the sick people will be in trouble. In this case, a few false alarms are acceptable; what matters is a model that can detect the sick people, because otherwise the minority is missed.
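The 99:1 example can be reproduced in a few lines. This is a minimal sketch with made-up labels: 99 healthy people (0), 1 sick person (1), and a "model" that always predicts 0.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration: 99 healthy (0) and 1 sick (1)
y_true = [0] * 99 + [1]
# A "model" that predicts that nobody is sick
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print('accuracy :', acc)  # 0.99
print('recall   :', rec)  # 0.0
```

The accuracy looks excellent, while the recall shows that not a single sick person was detected.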
The precision is the percentage of data predicted to be positive that is actually positive, and the recall is the percentage of actually positive data that is predicted to be positive.
If the purpose of the machine learning is to detect the minority properly, you may choose a model that emphasizes recall.
However, doing so may flag too many cases. In that case, f1_score, which takes both precision and recall into account, is used as the metric.
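One possible way to emphasize recall, which the original post does not use, is to lower the probability threshold at which a sample is labeled 1. The sketch below assumes synthetic imbalanced data of roughly 9:1 and a threshold of 0.3 chosen arbitrarily for illustration. It prints precision and recall at both thresholds so the trade-off is visible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly 90% label 0 and 10% label 1 (an assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

demo_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_pos = demo_clf.predict_proba(X_te)[:, 1]  # probability of label 1

recalls = {}
for threshold in (0.5, 0.3):
    # Label a sample 1 when its probability reaches the threshold
    y_hat = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    prec = precision_score(y_te, y_hat, zero_division=0)
    print(f'threshold={threshold} precision={prec:.3f} recall={recalls[threshold]:.3f}')
```

Lowering the threshold can only add predicted positives, so recall never goes down, but precision may drop. That is the "flagging too many cases" problem described above.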
** Other models **
Let's look at other models as well. We will only introduce the model here and will not go into details. First is the decision tree.
#Decision tree
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
y_predict = clf.predict(x_test)
print(pd.DataFrame(confusion_matrix(y_predict, y_test), index=['predict 0', 'predict 1'], columns=['real 0', 'real 1']))
print()
print('accuracy : ',accuracy_score(y_test, y_predict))
print('precision : ',precision_score(y_test, y_predict))
print('recall : ',recall_score(y_test, y_predict))
print('f1_score : ',f1_score(y_test, y_predict))
           real 0  real 1
predict 0      95      26
predict 1      20      38

accuracy : 0.7430167597765364
precision : 0.6551724137931034
recall : 0.59375
f1_score : 0.6229508196721311
Next is Random Forest
# RandomForest
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
y_predict = clf.predict(x_test)
print(pd.DataFrame(confusion_matrix(y_predict, y_test), index=['predict 0', 'predict 1'], columns=['real 0', 'real 1']))
print()
print('accuracy : ',accuracy_score(y_test, y_predict))
print('precision : ',precision_score(y_test, y_predict))
print('recall : ',recall_score(y_test, y_predict))
print('f1_score : ',f1_score(y_test, y_predict))
           real 0  real 1
predict 0      98      26
predict 1      17      38

accuracy : 0.7597765363128491
precision : 0.6909090909090909
recall : 0.59375
f1_score : 0.6386554621848739
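Since the fit, predict and score steps are the same for every model, the comparison can also be written as a loop. This is only a minimal sketch on synthetic data. The models dictionary and the variable names are made up for illustration, and the scores it prints are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the data mart (an assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
    'RandomForest': RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)            # learn from the training data
    y_hat = model.predict(X_te)      # predict the held-out labels
    results[name] = (accuracy_score(y_te, y_hat), f1_score(y_te, y_hat))
    print(f'{name}: accuracy={results[name][0]:.3f} f1={results[name][1]:.3f}')
```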
** Calculation of contribution rate **
Some machine learning models can report a contribution rate (feature importance) that shows how much each column of the data contributed to the prediction.
clf.feature_importances_
for i,v in zip(x_train.columns,clf.feature_importances_):
    print(i,'\t',v)
SibSp 0.08181730881501241
Parch 0.053030544663722166
Fare 0.40243782816341556
Sex2 0.28228147632317596
Pe_0.0 0.03352832009152742
Pe_1.0 0.014542002215684312
Pe_2.0 0.02212292439144309
Pe_3.0 0.022599544658725688
Pe_4.0 0.013099652111940165
Pe_5.0 0.013494114387414768
Pe_6.0 0.005599733163595443
Pe_7.0 0.002340597855733169
Pe_8.0 0.0030199997376331917
Em_C 0.012248329962351154
Em_N 0.0010747396045908525
Em_Q 0.010808812977944686
Em_S 0.025954070876089957
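A long list of importances is easier to read when sorted. As a minimal sketch, the example below wraps feature_importances_ in a pandas Series and sorts it in descending order. It uses synthetic data with hypothetical column names f0 to f4, not the data mart columns shown above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names (an assumption)
X_arr, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3', 'f4'])

forest = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and sort from largest to smallest
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

The importances of a random forest sum to 1, so each value can be read as a share of the total contribution.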
Yesterday I processed the data to create a data mart for machine learning. Today, we created a prediction model and verified its accuracy using that data. Creating the model itself is fairly easy, and the amount of code you write is much smaller than for preprocessing.
Therefore, when creating a model, the important decision is which model to select. Since the total amount of code is small, the work consists of creating various models, verifying them, and selecting the one that performs best on the metric you care about.
If the accuracy is low, try other models, and if it is still low, go back to preprocessing the data and start over. Repeat this cycle until you have a model you are satisfied with; in the worst case, start over from data acquisition.
First, get a feel for the whole flow.
20 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython