[PYTHON] You will be an engineer in 100 days ――Day 81 ――Programming ――About machine learning 6

Click here for the posts up to yesterday

You will become an engineer in 100 days --Day 76 --Programming --About machine learning

You will become an engineer in 100 days-Day 70-Programming-About scraping

You will become an engineer in 100 days --Day 66 --Programming --About natural language processing

You become an engineer in 100 days --Day 63 --Programming --About probability 1

You will become an engineer in 100 days-Day 59-Programming-Algorithms

You will become an engineer in 100 days --- Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will be an engineer in 100 days --Day 24 --Python --Basics of Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

This post continues the series on machine learning.

About regression models

This is the first time I explain what machine learning can do. There are basically three things:

・ Regression
・ Classification
・ Clustering

Roughly speaking, all three are forms of prediction; what changes is the thing being predicted.

・ Regression: predict numerical values
・ Classification: predict categories
・ Clustering: split the data into sensible groups

A regression model predicts numerical values.

This time we use the Boston house-price dataset bundled with scikit-learn.

Column Description
CRIM Per capita crime rate by town
ZN Proportion of residential land zoned for lots over 25,000 square feet
INDUS Proportion of non-retail business acres per town
CHAS Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
NOX Nitric oxides concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Proportion of owner-occupied units built before 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property tax rate per $10,000
PTRATIO Pupil-teacher ratio by town
B 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT Percentage of lower-status population
MEDV Median value of owner-occupied homes in $1000s

MEDV is the objective variable that we want to predict, and the other columns are the explanatory variables.

Data visualization

First, let's see what kind of data it is.

import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston

# Load the data
boston = load_boston()
# Create a data frame
boston_df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target

# Data overview
print(boston_df.shape)
boston_df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24
1 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
2 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.9 5.33 36.2

All columns contain numerical data.

Let's visualize the data to see how the columns relate to each other.

import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(data=boston_df[list(boston_df.columns[0:6])+['MEDV']])
plt.show()

(Figure: pair plot of CRIM through RM, plus MEDV)

sns.pairplot(data=boston_df[list(boston_df.columns[6:13])+['MEDV']])
plt.show()

(Figure: pair plot of AGE through LSTAT, plus MEDV)

Let's also look at the correlation of each column.

boston_df.corr()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
CRIM 1 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734 -0.37967 0.625505 0.582764 0.289946 -0.385064 0.455621 -0.388305
ZN -0.200469 1 -0.533828 -0.042697 -0.516604 0.311991 -0.569537 0.664408 -0.311948 -0.314563 -0.391679 0.17552 -0.412995 0.360445
INDUS 0.406583 -0.533828 1 0.062938 0.763651 -0.391676 0.644779 -0.708027 0.595129 0.72076 0.383248 -0.356977 0.6038 -0.483725
CHAS -0.055892 -0.042697 0.062938 1 0.091203 0.091251 0.086518 -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 0.17526
NOX 0.420972 -0.516604 0.763651 0.091203 1 -0.302188 0.73147 -0.76923 0.611441 0.668023 0.188933 -0.380051 0.590879 -0.427321
RM -0.219247 0.311991 -0.391676 0.091251 -0.302188 1 -0.240265 0.205246 -0.209847 -0.292048 -0.355501 0.128069 -0.613808 0.69536
AGE 0.352734 -0.569537 0.644779 0.086518 0.73147 -0.240265 1 -0.747881 0.456022 0.506456 0.261515 -0.273534 0.602339 -0.376955
DIS -0.37967 0.664408 -0.708027 -0.099176 -0.76923 0.205246 -0.747881 1 -0.494588 -0.534432 -0.232471 0.291512 -0.496996 0.249929
RAD 0.625505 -0.311948 0.595129 -0.007368 0.611441 -0.209847 0.456022 -0.494588 1 0.910228 0.464741 -0.444413 0.488676 -0.381626
TAX 0.582764 -0.314563 0.72076 -0.035587 0.668023 -0.292048 0.506456 -0.534432 0.910228 1 0.460853 -0.441808 0.543993 -0.468536
PTRATIO 0.289946 -0.391679 0.383248 -0.121515 0.188933 -0.355501 0.261515 -0.232471 0.464741 0.460853 1 -0.177383 0.374044 -0.507787
B -0.385064 0.17552 -0.356977 0.048788 -0.380051 0.128069 -0.273534 0.291512 -0.444413 -0.441808 -0.177383 1 -0.366087 0.333461
LSTAT 0.455621 -0.412995 0.6038 -0.053929 0.590879 -0.613808 0.602339 -0.496996 0.488676 0.543993 0.374044 -0.366087 1 -0.737663
MEDV -0.388305 0.360445 -0.483725 0.17526 -0.427321 0.69536 -0.376955 0.249929 -0.381626 -0.468536 -0.507787 0.333461 -0.737663 1

Apart from a few pairs (for example, RAD and TAX at about 0.91), the columns do not seem to be strongly correlated with each other.
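As a rough illustration of what each cell in the table means: `corr()` in pandas computes the Pearson correlation coefficient by default. The sketch below computes it by hand in plain Python. The `pearson` helper and the three small lists are made up for illustration and are not taken from the Boston data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (covariance, unnormalized)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, unnormalized)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: price rises with rooms, and falls as the third list rises
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 18.0, 21.0, 30.0, 34.0]
lower = [30.0, 22.0, 15.0, 9.0, 4.0]

print(pearson(rooms, price))  # close to +1: strong positive correlation
print(pearson(rooms, lower))  # close to -1: strong negative correlation
```

A value near +1 or -1 means the two columns move together almost linearly, and a value near 0 means there is little linear relationship.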

A regression model uses these columns to predict the value of a single objective variable.

Creating a predictive model

**Data split**

First, split the data into training data and test data. This time we split at 6:4.

from sklearn.model_selection import train_test_split

# Split into training and test data at 6:4
X = boston_df.drop('MEDV',axis=1)
Y = boston_df['MEDV']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
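
To get a feel for what a 6:4 split does, here is a minimal sketch in plain Python. `simple_split` is a hypothetical helper written for this illustration. It is not scikit-learn's implementation, and the row order it produces will differ from `train_test_split`.

```python
import random

def simple_split(rows, test_size=0.4, seed=0):
    """Shuffle the row indices and split them into train and test parts.

    Hypothetical helper for illustration only; not scikit-learn's code.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle, like random_state
    n_test = int(round(len(rows) * test_size))  # number of rows held out for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(10))                          # stand-in for 10 data rows
train, test = simple_split(rows, test_size=0.4, seed=0)
print(len(train), len(test))                    # 6 4
```

Fixing the seed plays the same role as `random_state=0` above: the same rows end up in the test set on every run, so results are comparable.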

**Modeling**

Next, create a predictive model.

Here we use linear regression to build a regression model with MEDV as the objective variable and CRIM through LSTAT as the explanatory variables.

Roughly speaking, the idea is to build an expression of the form $ Y = a x_1 + b x_2 + \dots + \epsilon $.

`a` and `b` are called `regression coefficients`; each one shows how much its variable contributes to the prediction of the objective variable.

$ \epsilon $ is called the residual and represents how far each data point deviates from the expression. Linear regression finds the coefficients that minimize the sum of the squared residuals over all data points.
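
The least-squares idea can be sketched with NumPy on synthetic data. Everything here is an assumption for illustration: the true coefficients (3, -2), the intercept (5), and the noise level are made up, and `np.linalg.lstsq` is used as a stand-in for what a linear regression fit computes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two made-up explanatory variables and small random noise (the residual term)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(scale=0.1, size=n)

# Synthetic objective variable: Y = 3*x1 - 2*x2 + 5 + eps
y = 3.0 * x1 - 2.0 * x2 + 5.0 + eps

# Design matrix; the column of ones lets the fit estimate an intercept
A = np.column_stack([x1, x2, np.ones(n)])

# Solve for the coefficients that minimize the sum of squared residuals
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a_hat, b_hat, c_hat = coef
print(a_hat, b_hat, c_hat)  # should land close to 3, -2, 5
```

Because the noise is small, the estimated coefficients come out close to the values used to generate the data, which is what "minimizing the sum of squared residuals" buys you.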

The module used is scikit-learn's linear_model.

from sklearn import linear_model

#Learning with linear regression
model = linear_model.LinearRegression()
model.fit(x_train, y_train)

Calling the library and running fit is all it takes to build the model.

**Accuracy verification**

To verify the accuracy of a regression model, we look at how far the predictions are from the actual values.

Commonly used metrics are the mean squared error (MSE), the root mean squared error (RMSE), and the coefficient of determination $ R ^ 2 $.

`MSE` is the mean of the squared errors. If it is small on both the training data and the test data, the model is judged to perform well.

`RMSE` is the square root of the MSE, so it is in the same units as the objective variable.

$ R ^ 2 $ equals 1 when the MSE is 0, and the better the model performs, the closer it gets to 1.
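
For reference, the standard definitions of these three metrics can be written as follows. The notation is assumed here: $ y_i $ is the actual value, $ \hat{y}_i $ the predicted value, $ \bar{y} $ the mean of the actual values, and $ n $ the number of data points.

```latex
$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$
```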

import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = model.predict(x_test)

print('MSE : {0} '.format(mean_squared_error(y_test, y_pred)))
print('RMSE : {0} '.format(np.sqrt(mean_squared_error(y_test, y_pred))))
print('R^2 : {0}'.format(model.score(x_test, y_test)))

MSE : 25.79036215070245
RMSE : 5.078421226198399
R^2 : 0.6882607142538019

Looking at the accuracy, the RMSE is about 5.0. Since MEDV is in units of $1000, the predictions typically miss the actual house price by an error of roughly this size.
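
To make the metrics concrete, here is a small plain-Python sketch that computes MSE, RMSE, and R^2 by hand. The `regression_metrics` helper and the four value pairs are made up for illustration; they are not output from the model above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) for two equal-length lists (illustrative helper)."""
    n = len(y_true)
    # Sum of squared errors between actual and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    rmse = math.sqrt(mse)
    # Total sum of squares around the mean of the actual values
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Made-up actual and predicted values
y_true = [20.0, 25.0, 30.0, 35.0]
y_pred = [22.0, 24.0, 29.0, 37.0]

mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(mse)   # 2.5
print(rmse)  # about 1.58
print(r2)    # 0.92
```

Here the squared errors are 4, 1, 1, and 4, so the MSE is 10 / 4 = 2.5, and the total sum of squares is 125, so R^2 is 1 - 10 / 125 = 0.92.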

**Residual plot**

So how far off were the model's predictions? Let's visualize the residuals.

#Plot the residuals
plt.scatter(y_pred, y_pred - y_test, c = 'red', marker = 's')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')

# Draw a horizontal line at y = 0
plt.hlines(y = 0, xmin = -10, xmax = 50, lw = 2, color = 'blue')
plt.xlim([10, 50])
plt.show()

(Figure: residual plot, predicted values against residuals)

Comparing the test data with the predictions shows how large the deviations are. The predictions that are off are off by quite a lot.

In this way, we aim for a model with smaller deviations by selecting data, preprocessing, tuning model parameters, and so on, improving accuracy by reducing the error.
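
One way to follow up on such a plot is to list the points with the largest residuals. The sketch below does this in plain Python, using the same sign convention as the plot (predicted minus actual). The `largest_residuals` helper and the values are made up for illustration.

```python
def largest_residuals(y_true, y_pred, k=3):
    """Return the k (index, residual) pairs with the largest absolute residual.

    Illustrative helper; residual = predicted - actual, as in the plot.
    """
    residuals = [(i, p - t) for i, (t, p) in enumerate(zip(y_true, y_pred))]
    return sorted(residuals, key=lambda item: abs(item[1]), reverse=True)[:k]

# Made-up actual and predicted values
y_true = [24.0, 21.6, 50.0, 33.4, 13.0]
y_pred = [28.0, 22.0, 36.5, 31.0, 14.5]

top = largest_residuals(y_true, y_pred, k=2)
print(top)  # the point at index 2 comes first, with a residual of -13.5
```

Looking at the rows behind the largest residuals is often a useful starting point when deciding what to preprocess or which data to revisit.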

Summary

Today I explained how a regression model works. There are many other regression models as well.

To begin with, understand what regression is, and get a firm grasp of how to build and verify a model.

19 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
