**2. Multivariate analysis spelled out in Python**

**2-1. Multiple regression analysis (scikit-learn)**

Here you will learn linear **multiple regression analysis**, which deals with three or more variables. As an example, let's analyze house prices in Boston, a large city in the northeastern U.S. state of Massachusetts, using a variety of explanatory variables.

**⑴ Import the libraries**

# Libraries required for numerical computation
import numpy as np
import pandas as pd
# Package for drawing graphs
import matplotlib.pyplot as plt
# Linear models from the machine-learning library scikit-learn
from sklearn import linear_model

**⑵ Read the data**

Load the "Boston house prices dataset", one of the datasets bundled with scikit-learn, and store it in the variable boston. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below assumes an older version.)

from sklearn.datasets import load_boston
boston = load_boston()

**⑶ Check the contents of the data and process it for analysis**

print(boston)

First, print the contents to get an overview of the data.

print(boston.DESCR)

Check the description of the Boston dataset. The dataset has 13 items as explanatory variables and 1 item as the objective variable.

| Variable name | Original words | Definition |
|---|---|---|
| CRIM | crime rate | Per-capita crime rate by town |
| ZN | zone | Percentage of residential land zoned for lots over 25,000 square feet |
| INDUS | industry | Percentage of non-retail business (manufacturing) acres per town |
| CHAS | Charles River | Charles River dummy variable (1 if the tract borders the river, 0 otherwise) |
| NOX | nitric oxides | Nitric oxide concentration (parts per 10 million) |
| RM | rooms | Average number of rooms per dwelling |
| AGE | ages | Percentage of owner-occupied homes built before 1940 |
| DIS | distances | Weighted distances to five Boston employment centers |
| RAD | radial highways | Index of accessibility to radial highways |
| TAX | tax rate | Property tax rate per $10,000 |
| PTRATIO | pupil-teacher ratio | Pupil-teacher ratio by town |
| B | blacks | Proportion of Black residents by town |
| LSTAT | lower status | Percentage of lower-status population |
| MEDV | median value | Median value of owner-occupied homes in $1000s |

**⑷ Convert the data to a Pandas DataFrame**

boston_df = pd.DataFrame(boston.data)
print(boston_df)

The output says [506 rows x 13 columns]: the data has 506 rows and 13 columns, that is, 506 samples of 13 variables.

**⑸ Specify the column names and add the objective variable**

#Specify column name
boston_df.columns = boston.feature_names
print(boston_df)

Assign boston.feature_names to boston_df.columns.

#Add objective variable
boston_df['PRICE'] = pd.DataFrame(boston.target)
print(boston_df)

Convert boston.target to a Pandas DataFrame and store it in boston_df under the column name PRICE. The objective variable PRICE is added as the rightmost column.

**Basic scikit-learn grammar**

From now on, we will perform multiple regression analysis using scikit-learn. The procedure is as follows:
① Create the model instance: model variable name = LinearRegression()
② Fit a model to the explanatory variables X and the objective variable Y: model variable name.fit(X, Y)
③ Obtain the regression coefficients from the fitted model: model variable name.coef_
④ Obtain the intercept from the fitted model: model variable name.intercept_
⑤ Compute the coefficient of determination to gauge the model's accuracy: model variable name.score(X, Y)
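The five steps above can be sketched end-to-end on a small synthetic dataset (not the Boston data; the feature values and coefficients here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 + 3*x2 + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + rng.normal(0, 0.01, 100)

model = LinearRegression()   # (1) create the instance
model.fit(X, y)              # (2) fit the model to X and y
print(model.coef_)           # (3) regression coefficients (close to [2, 3])
print(model.intercept_)      # (4) intercept (close to 1)
print(model.score(X, y))     # (5) coefficient of determination
```

Because the data were generated from a known linear relation, the fitted coefficients recover the values used to generate it.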

**⑹ Create the explanatory variables and the objective variable**

# Drop the objective variable and store the remaining columns in X
X = boston_df.drop("PRICE", axis=1)
# Extract only the objective variable and store it in Y
Y = boston_df["PRICE"]
print(X)
print(Y)

The drop method deletes a DataFrame column when called with the arguments ("column name", axis=1). A single column is selected with dataframe name["column name"]; here the PRICE column is extracted and stored in the variable Y.

**⑺ Create an instance ①**

model = linear_model.LinearRegression()

**⑻ Create the model ②**

model.fit(X,Y)

**⑼ Calculate the regression coefficients ③**

model.coef_

coef_ is short for coefficient. Since the raw array of coefficients is hard to interpret, convert it to a DataFrame to make it easier to read.

#Store the coefficient value in the variable coefficient
coefficient = model.coef_
#Convert to data frame and specify column name and index name
df_coefficient = pd.DataFrame(coefficient,
                              columns=["coefficient"],
                              index=["Crime rate", "Residential land rate", "Manufacturing ratio", "Charles river", "Nitric oxide concentration", 
                                     "Average number of rooms", "Home ownership rate", "Employment center", "Highway", "Property tax rate", 
                                     "Student / teacher ratio", "Black ratio", "Underclass ratio"])
df_coefficient

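Once the coefficients are in a DataFrame, pandas makes it easy to rank them by absolute value to see which variables influence the target most strongly. The sketch below uses made-up feature names and a synthetic target rather than the Boston data; note also that raw coefficient magnitudes depend on each variable's scale, so such a ranking is only meaningful for comparably scaled variables:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in with three named features; y depends most strongly on "a"
rng = np.random.default_rng(4)
X = pd.DataFrame(rng.random((60, 3)), columns=["a", "b", "c"])
y = 5 * X["a"] - 0.5 * X["b"] + 0.1 * X["c"]

model = LinearRegression().fit(X, y)
df_coef = pd.DataFrame(model.coef_, columns=["coefficient"], index=X.columns)

# Sort the rows by the absolute value of the coefficient, largest first
ranked = df_coef.reindex(df_coef["coefficient"].abs()
                         .sort_values(ascending=False).index)
print(ranked)
```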

**⑽ Calculate the intercept ④**

model.intercept_

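Together, coef_ and intercept_ fully determine the regression equation: a prediction is simply the dot product of a sample with the coefficients plus the intercept. A minimal sketch on synthetic data (not the Boston data) confirms that this manual calculation matches predict:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relation and intercept 4.0
rng = np.random.default_rng(1)
X = rng.random((50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0

model = LinearRegression().fit(X, y)

# A prediction is just X · coef_ + intercept_
manual = X[0] @ model.coef_ + model.intercept_
print(manual, model.predict(X[:1])[0])  # the two values agree
```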

**⑾ Calculate the coefficient of determination and check the accuracy ⑤**

model.score(X, Y)

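The value returned by score is the coefficient of determination R² = 1 − (residual sum of squares / total sum of squares). A short sketch on synthetic data (not the Boston data) shows the manual calculation agreeing with score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.random((80, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 80)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2, model.score(X, y))            # the two values agree
```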

**Cross-validation**

Cross-validation is a method of verifying a model's accuracy, that is, the validity of the analysis itself. In the simplest case, the sample is randomly divided into two groups: one group is used to build the model and the other to test it. The former is called training data and the latter test data. scikit-learn provides a function for splitting data into training and test sets: sklearn.model_selection.train_test_split. There is no fixed rule for the training/test ratio, but if you do not pass an argument to train_test_split, one quarter of all samples, i.e. 25%, is set aside as test data.
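The split behavior can be seen on a tiny array. The test_size and random_state arguments shown here are standard train_test_split parameters: test_size sets the test fraction explicitly, and random_state makes the random split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Default: 25% of the samples become test data (3 of 10, rounded up)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))    # 7 3

# test_size sets the fraction explicitly
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train2), len(X_test2))  # 8 2
```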

**➊ Split the sample**

# Import the train_test_split function from sklearn
from sklearn.model_selection import train_test_split
# Split variables X and Y into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

Let's check the contents of the training and test portions of variable X.

print("Training data for variable x:", X_train, sep="\n")
print("Test data for variable x:", X_test, sep="\n")


**➋ Create an instance ①**

The variable name multi_lreg is an abbreviation for multiple linear regression analysis.

multi_lreg = linear_model.LinearRegression()

**➌ Create a model with the training data ②**

multi_lreg.fit(X_train, Y_train)

**➍ Calculate the coefficient of determination on the training data ⑤**

multi_lreg.score(X_train, Y_train)


**➎ Calculate the coefficient of determination on the test data ⑤**

multi_lreg.score(X_test,Y_test)

Training and test data can be thought of as "known data" and "unknown data": if you build a model with the data you already have, how well will it work on newly obtained data? The data we analyze is always only a part of the whole, whether past or present. We rarely analyze merely to confirm the current situation; the aim should be to "read ahead" and anticipate the future. In that sense, a low coefficient of determination casts doubt on the accuracy of the model itself, but that does not mean that higher is always better. On the contrary, a value that is too high is itself a problem. A high coefficient of determination, that is, a small residual (error), means the model fits the analyzed data closely. If it fits too closely, the coefficient of determination may drop when the model is applied to newly obtained data. This is the so-called "overfitting" problem.
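Overfitting can be demonstrated with a short sketch (not from the original article) using synthetic data whose true relation is linear. Fitting the same data with high-degree polynomial features via PolynomialFeatures gives a model that fits the training data better, which is exactly the symptom described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# The true relation is linear (y = x) with noise
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 30).reshape(-1, 1)
y = x.ravel() + rng.normal(0, 1.0, 30)

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

scores = {}
for degree in (1, 15):
    poly = PolynomialFeatures(degree)
    Xtr = poly.fit_transform(X_train)
    Xte = poly.transform(X_test)
    m = LinearRegression().fit(Xtr, y_train)
    # (train R^2, test R^2) for each polynomial degree
    scores[degree] = (m.score(Xtr, y_train), m.score(Xte, y_test))
print(scores)
```

The degree-15 model always scores at least as high as the linear model on the training data, because its feature set is a superset; whether that extra flexibility helps on the test data is another matter entirely.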


Next, in order to understand how the calculation of multiple regression analysis works, we will perform multiple regression analysis without using the convenient scikit-learn.
