[Python] Linear regression

What is linear regression?

A method that builds a linear function (a prediction model) to predict the objective variable (the value to be predicted) from explanatory variables (the data used for prediction). Because the model must first be learned from existing data before it can make predictions, this is supervised learning.

In the figure below, suppose the horizontal axis is the explanatory variable, the vertical axis is the objective variable, and the red dots are the observed data; the blue line is then the regression equation (the prediction model obtained by learning). Once this regression equation exists, when a new, unseen explanatory variable comes in, we can predict what the objective variable will be.

(Figure: scatter plot of the data (red dots) with the fitted regression line (blue))

The regression equation is created with the least squares method: minimize the sum of the squared differences (squared errors) between the actual values of the objective variable and the predicted values.

In the figure above, if you drop a vertical line (perpendicular to the x-axis) from each red point to the blue line, its length is the error. Squaring it cancels out the effect of sign, and the line is chosen so that the total of these squared values is as small as possible.
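As a minimal sketch of this idea (using a small hypothetical dataset, purely for illustration), the quantity that least squares minimizes can be written as follows:

{least_squares_sketch.py}


import numpy as np

# Toy data (hypothetical values, purely for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def sum_of_squared_errors(a, b):
    # Total squared vertical distance between each point and the line y = a*x + b
    return np.sum((y - (a * x + b)) ** 2)

# The least squares method chooses the a and b that make this value smallest
print(sum_of_squared_errors(2.0, 0.0))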

Preparing the data

We use the Boston house price dataset.

{get_boston_house_price_dataset.py}


import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

from sklearn.datasets import load_boston
boston = load_boston()  # Load the Boston house price dataset

# Convert to a DataFrame
boston_df = DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df['Price'] = boston.target
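Note that load_boston was removed from scikit-learn in version 1.2. If the import above fails on a recent installation, the same data can still be loaded from the original source; the sketch below follows the snippet shown in scikit-learn's deprecation notice (the URL and the two-row record layout come from that notice, and the column names are filled in by hand):

{load_boston_alternative.py}


# Load the raw Boston data from the original CMU StatLib source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record is spread over two physical rows; stitch them back together
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston_df = pd.DataFrame(data, columns=feature_names)
boston_df['Price'] = target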

Data overview

{describe_boston_house_price_dataset.py}


boston_df[['RM', 'Price']].describe()
(Table: describe() output — summary statistics for RM and Price)

RM is the number of rooms and Price is the house price. To predict the price (objective variable), let's use the number of rooms as the explanatory variable.

Let's take a quick look at the relationship between the number of rooms and the price.

{scatter_plot_boston_house_price_dataset.py}


# Draw a scatter plot with RM (number of rooms) on the horizontal axis and Price on the vertical axis
plt.scatter(boston_df['RM'], boston_df['Price'])
plt.xlabel('Number of rooms')
plt.ylabel('Price($1,000)')
plt.title('Scatter Plot')
(Figure: scatter plot of number of rooms vs. price)

The scatter plot is as above. There seems to be a positive correlation (the price increases as the number of rooms increases).

Therefore, let's calculate the correlation coefficient.

{calculate_correlation.py}


np.corrcoef(boston_df['RM'], boston_df['Price'])[0, 1]
> 0.69535994707153925

The correlation coefficient is 0.695, a fairly strong positive correlation. So using RM as an explanatory variable doesn't seem like a bad idea.

Try it (simple regression)

1. Look at the regression equation together with the scatter plot, using seaborn's lmplot

{make_lmplot.py}


sns.lmplot('RM', 'Price', data=boston_df)
(Figure: lmplot output — scatter plot with the regression line drawn)

Very easy. With seaborn's lmplot, you just pass the DataFrame to data and specify the column names for the x and y axes, and it draws the regression line for you.
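One caveat: seaborn 0.12 and later no longer accept positional column arguments, so on a recent installation the same plot would be written with keywords:

{make_lmplot_keywords.py}


# Keyword-argument form required by recent seaborn versions
sns.lmplot(x='RM', y='Price', data=boston_df)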

2. Try it with numpy

{calculate_single_regression.py}


# Reshape the data so numpy can handle it [*1]
X = np.vstack(boston_df.RM)
X = np.array([[value, 1] for value in X])
# Prepare the objective variable
Y = boston_df.Price

# Run linear regression (np.linalg.lstsq) to obtain a: slope, b: intercept [*2]
a, b = np.linalg.lstsq(X, Y)[0]

# Draw a scatter plot and overlay the regression equation obtained above.
plt.plot(boston_df.RM, boston_df.Price, 'o')
x = boston_df.RM
# The regression equation is y = a * x + b, so plot x on the x-axis and a * x + b on the y-axis.
plt.plot(x, a * x + b, 'r')
# Label the x and y axes
plt.xlabel('Number of Rooms')
plt.ylabel('Price($1,000)')

The result of the linear regression with numpy is shown below; the red line is the regression equation.

(Figure: scatter plot with the fitted regression line in red)
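With the slope a and intercept b in hand, making a prediction is just evaluating the regression equation; for example, for a hypothetical house with 6 rooms:

{predict_with_regression_equation.py}


# Predicted price (in units of $1,000) for a hypothetical 6-room house
rooms = 6.0
predicted_price = a * rooms + b
print(predicted_price)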

Try it (multiple regression)

Use scikit-learn

Whereas simple regression uses a single explanatory variable, multiple regression uses several. boston_df contains 13 explanatory variables, including RM (the number of rooms), so let's build a model that predicts the price from all of them.

When building a prediction model, the data is split into a training set and a test set. The reason for the split is that accuracy on the training data alone cannot tell you whether the model will be useful for predicting unknown data.

{do_multiple_regression.py}


import sklearn
from sklearn.model_selection import train_test_split  # Train/test split (lived in sklearn.cross_validation in older versions)
from sklearn.linear_model import LinearRegression  # For linear regression

# Check the data to be used (14 columns from CRIM to Price; Price is the objective variable, so try using the other 13 as explanatory variables)
boston_df.info()
> <class 'pandas.core.frame.DataFrame'>
> Int64Index: 506 entries, 0 to 505
> Data columns (total 14 columns):
> CRIM       506 non-null float64
> ZN         506 non-null float64
> INDUS      506 non-null float64
> CHAS       506 non-null float64
> NOX        506 non-null float64
> RM         506 non-null float64
> AGE        506 non-null float64
> DIS        506 non-null float64
> RAD        506 non-null float64
> TAX        506 non-null float64
> PTRATIO    506 non-null float64
> B          506 non-null float64
> LSTAT      506 non-null float64
> Price      506 non-null float64
> dtypes: float64(14)

# Create a DataFrame of explanatory variables, i.e. everything except Price
X_multi = boston_df.drop('Price', axis=1)
X_multi.shape
> (506, 13)

# Split the data into train and test sets.
# train_test_split(X, Y), with X as the explanatory variables and Y as the objective variable, separates them in one call.
X_train, X_test, Y_train, Y_test = train_test_split(X_multi, boston_df.Price)

# Check the sizes of the split data
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
> (379, 13) (127, 13) (379,) (127,)
# You can see it was split into 379 rows for train and 127 rows for test.

# Create the model instance
lreg = LinearRegression()
# Fit the multiple regression model
lreg.fit(X_train, Y_train)
# Predict on the training data and on the test data
pred_train = lreg.predict(X_train)
pred_test = lreg.predict(X_test)

# Check the mean squared error for the training data and the test data: mean((actual price - predicted value)^2)
np.mean((Y_train - pred_train) ** 2)
> 20.5592370941859
np.mean((Y_test - pred_test) ** 2) 
> 28.169312238554202

# Plot the residuals between the actual values and the predictions.
# Horizontal axis: predictions on train, vertical axis: error.
train = plt.scatter(pred_train, (pred_train - Y_train), c='b', alpha=0.5)
# Horizontal axis: predictions on test, vertical axis: error.
test = plt.scatter(pred_test, (pred_test - Y_test), c='r', alpha=0.5)

# Tidy up the graph
# Draw a horizontal line at zero error.
plt.hlines(y=0, xmin=-1.0, xmax=50)
# Legend
plt.legend((train, test), ('Training', 'Test'), loc='lower left')
# Title
plt.title('Residual Plots')

The resulting residual plot:

(Figure: residual plot — blue: training data, red: test data)

The black horizontal line represents error = 0; points above it have positive error and points below it negative error. If the errors are scattered evenly above and below the line, linear regression is a reasonable approach for this prediction problem. If instead the errors show a systematic bias, it is better to reconsider whether linear regression should be applied at all.
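To dig a little deeper, sklearn.metrics provides ready-made error functions, and the fitted model exposes its learned parameters. A minimal sketch reusing lreg, X_multi, and the train/test variables from above:

{inspect_model.py}


from sklearn.metrics import mean_squared_error, r2_score

# The same mean squared errors as computed by hand above
print('Train MSE:', mean_squared_error(Y_train, pred_train))
print('Test MSE:', mean_squared_error(Y_test, pred_test))
# R^2: the proportion of price variance the model explains (closer to 1 is better)
print('Test R^2:', r2_score(Y_test, pred_test))

# One learned coefficient per explanatory variable, plus an intercept
coef_df = DataFrame({'Feature': X_multi.columns, 'Coefficient': lreg.coef_})
print(coef_df)
print('Intercept:', lreg.intercept_)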

Supplement

[*1] About formatting the data for use with numpy

The simple regression line can be written as y = ax + b, and in vector notation this can be rewritten as y = Ap, where A = [x 1] and p = [a b]^T, so that the inner product gives y = a * x + b * 1.

The explanatory variable is shaped into this form.

{.py}


# boston_df.RM is a one-dimensional array (a Series).
X = boston_df.RM
# Check the shape of X (a one-dimensional array of 506 rows)
X.shape
> (506,)
X
>0     6.575
>1     6.421
>2     7.185
>...
>504    6.794
>505    6.030
>Name: RM, Length: 506, dtype: float64

In order to use this with numpy, you have to change it to a two-dimensional array.

{.py}


# Use np.vstack to convert it to a two-dimensional array.
X = np.vstack(boston_df.RM)
# Check the shape of X (a two-dimensional array with 506 rows and 1 column)
X.shape
>(506, 1)
X
>array([[ 6.575],
>       [ 6.421],
>       [ 7.185],
>・ ・ ・
>       [ 6.794],
>       [ 6.03 ]])

{.py}


# Further convert this to A = [x, 1]: the value of X in the first column, and a constant 1 in the second.
X = np.array([[value, 1] for value in X])
# Check the shape of X (a two-dimensional array with 506 rows and 2 columns)
X.shape
> (506, 2)
X
>array([[array([ 6.575]), 1],
>       [array([ 6.421]), 1],
>       [array([ 7.185]), 1],
>       ..., 
>       [array([ 6.976]), 1],
>       [array([ 6.794]), 1],
>       [array([ 6.03]), 1]], dtype=object)

Now the explanatory variable X is in the form [x 1], which can be handled by numpy.
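As an aside, the same [x 1] matrix can be built in one step with np.column_stack, which yields a plain float array rather than the object-dtype array above; a minimal alternative sketch:

{build_design_matrix_alternative.py}


# Build [x 1] directly: room counts in the first column, a constant 1 in the second
X = np.column_stack([boston_df.RM, np.ones(len(boston_df))])
X.shape
> (506, 2)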

[* 2] About np.linalg.lstsq

linalg is an abbreviation for linear algebra, and lstsq is an abbreviation for least squares.

The return value is a tuple whose elements are: [0] the least-squares solution (here, the slope and intercept), [1] the sum of squared residuals, [2] the rank of the explanatory-variable matrix, and [3] its singular values. Since we want the slope and intercept this time, we take [0].
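Concretely, the full return value can be unpacked like this (reusing the X and Y prepared earlier; note that recent numpy versions also expect an explicit rcond=None argument to avoid a warning):

{unpack_lstsq_result.py}


# np.linalg.lstsq returns (solution, sum of squared residuals, rank, singular values)
solution, residuals, rank, singular_values = np.linalg.lstsq(X, Y)
a, b = solution  # a: slope, b: intercept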
