[PYTHON] [scikit-learn, matplotlib] Multiple regression analysis and 3D drawing

I performed multiple regression analysis and 3D plotting using the Boston House Prices dataset (Boston house price data) bundled with scikit-learn. (Reference)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from mpl_toolkits.mplot3d import Axes3D

# Load the Boston House Prices data
boston = load_boston()
boston_df = pd.DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df['Price'] = boston.target  # Add the objective variable to the data frame
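
Note: load_boston has been removed from recent scikit-learn releases (1.2 and later). If the import above fails, a possible fallback, sketched here under the assumption that the original file at lib.stat.cmu.edu is still reachable, is to rebuild the same data frame by hand. In that case, use boston_df['Price'] below wherever boston.target appears.

# Fallback when load_boston is unavailable (removed in scikit-learn 1.2):
# rebuild the same data frame from the original source file.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston_df = pd.DataFrame(
    np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]),
    columns=feature_names)
boston_df['Price'] = raw_df.values[1::2, 2]  # Objective variable (MEDV)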

The dataset contains 13 attributes (columns) for each house, such as the average number of rooms (RM) and the percentage of lower-status population (LSTAT). First, use the corr() method of the DataFrame class to calculate the correlation coefficients between each attribute and the objective variable (Price).

>>> boston_df.corr()
             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
CRIM     1.000000 -0.199458  0.404471 -0.055295  0.417521 -0.219940  0.350784   
ZN      -0.199458  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537   
INDUS    0.404471 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779   
CHAS    -0.055295 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518   
NOX      0.417521 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470   
RM      -0.219940  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265   
AGE      0.350784 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000   
DIS     -0.377904  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881   
RAD      0.622029 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022   
TAX      0.579564 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456   
PTRATIO  0.288250 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515   
B       -0.377365  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534   
LSTAT    0.452220 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339   
Price   -0.385832  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955   

              DIS       RAD       TAX   PTRATIO         B     LSTAT     Price  
CRIM    -0.377904  0.622029  0.579564  0.288250 -0.377365  0.452220 -0.385832  
ZN       0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445  
INDUS   -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725  
CHAS    -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260  
NOX     -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321  
RM       0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360  
AGE     -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955  
DIS      1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929  
RAD     -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626  
TAX     -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536  
PTRATIO -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787  
B        0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461  
LSTAT   -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663  
Price    0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000  
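
Since the selection below only looks at the correlation with Price, it can also help to pull out just that column and order it by absolute value (a small convenience snippet, not part of the original output):

# Correlation of each attribute with Price, strongest first by absolute value
price_corr = boston_df.corr()['Price'].drop('Price')
print(price_corr.reindex(price_corr.abs().sort_values(ascending=False).index))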

Regression analysis is performed on Price using the two attributes with the largest absolute correlation coefficients (i.e. the strongest correlation), RM and LSTAT, as explanatory variables.

# Build a data frame of the explanatory variables (RM and LSTAT)
df = pd.DataFrame()
df['RM'] = boston_df['RM']
df['LSTAT'] = boston_df['LSTAT']

X_multi = df
Y_target = boston.target

#Model generation and fitting
lreg = LinearRegression()
lreg.fit(X_multi, Y_target)
a1, a2 = lreg.coef_  # Coefficients (partial regression coefficients)
b = lreg.intercept_  # Intercept
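
As a quick sanity check on the fit (not part of the original post), the estimated equation and the coefficient of determination on the training data can be printed with the score() method:

# Fitted regression equation and R^2 on the training data
print(f"Price = {a1:.3f} * RM + {a2:.3f} * LSTAT + {b:.3f}")
print("R^2 =", lreg.score(X_multi, Y_target))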

Use matplotlib's plot_surface() and scatter3D() for the drawing.

# 3D plot of the measured values (scatter)
x, y, z = np.array(df['RM']), np.array(df['LSTAT']), np.array(Y_target)
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter3D(np.ravel(x), np.ravel(y), np.ravel(z), c = 'red')

# 3D plot of the regression plane
X, Y = np.meshgrid(np.arange(0, 10, 1), np.arange(0, 40, 1))
Z = a1 * X + a2 * Y + b
ax.plot_surface(X, Y, Z, alpha = 0.5) #Specify transparency with alpha
ax.set_xlabel("RM")
ax.set_ylabel("LSTAT")
ax.set_zlabel("Price")

plt.show()
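
One caveat: with newer matplotlib releases (3.4 and later), creating the axes with Axes3D(fig) no longer attaches them to the figure automatically, so the window may come up empty depending on your version. If that happens, creating the 3D axes through the figure API is a drop-in replacement:

# Modern way to create 3D axes (replaces ax = Axes3D(fig))
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')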

(figure_1.png: 3D scatter of the measured RM / LSTAT / Price values with the fitted regression plane)
