2. Multivariate analysis spelled out in Python 6-1. Ridge regression / Lasso regression (scikit-learn) [multiple regression vs. ridge regression]

2_6_1_01.PNG

⑴ Import library

#Data processing / calculation / analysis library
import numpy as np
import pandas as pd

#Graph drawing library
import matplotlib.pyplot as plt
%matplotlib inline

#Machine learning library
import sklearn

⑵ Data acquisition and reading

#Get data
url = 'https://raw.githubusercontent.com/yumi-ito/datasets/master/datasets_auto.csv'

#Read the acquired data as a DataFrame object
df = pd.read_csv(url, header=None)

#Set column label
df.columns = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 
              'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 
              'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

| | Variable name | Translation | Item (commentary) | Data type |
|---|---|---|---|---|
| 0 | symboling | Insurance risk rating | -3, -2, -1, 0, 1, 2, 3 (3 is high risk and dangerous, -3 is low risk and safe) | int64 |
| 1 | normalized-losses | Normalized losses | 65~256 | object |
| 2 | make | Manufacturer | alfa-romero, audi, bmw, ..., volkswagen, volvo | object |
| 3 | fuel-type | Fuel type | diesel, gas | object |
| 4 | aspiration | Aspiration type | std, turbo | object |
| 5 | num-of-doors | Number of doors | four, two | object |
| 6 | body-style | Body style | hardtop, wagon, sedan, hatchback, convertible | object |
| 7 | drive-wheels | Drive wheels | 4wd, fwd, rwd | object |
| 8 | engine-location | Engine location | front, rear | object |
| 9 | wheel-base | Wheelbase | 86.6~120.9 | float64 |
| 10 | length | Overall length | 141.1~208.1 | float64 |
| 11 | width | Vehicle width | 60.3~72.3 | float64 |
| 12 | height | Vehicle height | 47.8~59.8 | float64 |
| 13 | curb-weight | Curb weight (vehicle weight without occupants) | 1488~4066 | int64 |
| 14 | engine-type | Engine type | dohc, dohcv, l, ohc, ohcf, ohcv, rotor | object |
| 15 | num-of-cylinders | Number of cylinders | eight, five, four, six, three, twelve, two | object |
| 16 | engine-size | Engine size | 61~326 | int64 |
| 17 | fuel-system | Fuel system | 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi | object |
| 18 | bore | Cylinder bore (inner diameter) | 2.54~3.94 | object |
| 19 | stroke | Piston stroke | 2.07~4.17 | object |
| 20 | compression-ratio | Compression ratio | 7~23 | float64 |
| 21 | horsepower | Horsepower | 48~288 | object |
| 22 | peak-rpm | Peak engine speed (rpm) | 4150~6600 | object |
| 23 | city-mpg | City fuel economy | 13~49 (miles traveled per gallon of fuel) | int64 |
| 24 | highway-mpg | Highway fuel economy | 16~54 | int64 |
| 25 | price | Price | 5118~45400 | object |

#Output the data shape and the number of missing values
print(df.shape)
print('Number of missing values:{}'.format(df.isnull().sum().sum()))

#Output the first 5 lines of data
df.head()

2_6_1_02.PNG

⑶ Data preprocessing

#Create a DataFrame for only the target columns
auto = df[['price', 'horsepower', 'width', 'height']]

#For each column, count the cells that contain "?"
auto.isin(['?']).sum()

2_6_1_03.PNG

#Replace "?" with NaN and delete the rows that contain NaN
auto = auto.replace('?', np.nan).dropna()

#Check the shape of the matrix after deletion
auto.shape

2_6_1_04.PNG
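
As an alternative to replacing "?" after loading, pandas can treat the marker as missing at read time through the `na_values` parameter of `read_csv`. The following is a minimal sketch on a small in-memory CSV; the sample rows and the names `csv_text`, `sample`, and `sample_clean` are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Made-up rows in the same style as the dataset: "?" marks a missing value
csv_text = (
    "13495,111,64.1,48.8\n"
    "?,154,65.5,52.4\n"
    "16500,?,66.2,54.3\n"
    "17450,115,66.4,54.3\n"
)

# na_values='?' makes read_csv parse "?" as NaN while loading
sample = pd.read_csv(io.StringIO(csv_text), header=None, na_values='?')
sample.columns = ['price', 'horsepower', 'width', 'height']

# Delete the rows that contain NaN
sample_clean = sample.dropna()

print(sample.isnull().sum().sum())  # number of missing cells
print(sample_clean.shape)
print(sample_clean.dtypes)
```

With this approach the affected columns are read as numeric from the start, so a separate type conversion step may not be needed.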

#Data type confirmation
auto.dtypes

2_6_1_05.PNG

#Convert data type
auto = auto.assign(price = pd.to_numeric(auto.price))
auto = auto.assign(horsepower = pd.to_numeric(auto.horsepower))

#Check the data type after conversion
auto.dtypes

2_6_1_10.PNG
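
The conversion above works because the rows containing "?" were deleted first. If non-numeric strings were still present, `pd.to_numeric` would raise an error by default. A small sketch of the `errors='coerce'` option, which turns unparseable values into NaN instead, using a made-up Series `s`:

```python
import pandas as pd

# Made-up column of strings; "?" stands for a missing value
s = pd.Series(['111', '154', '?', '102'])

# errors='coerce' converts values that cannot be parsed into NaN
converted = pd.to_numeric(s, errors='coerce')

print(converted.dtype)
print(converted.isna().sum())
```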

#Check the correlation between the variables
auto.corr()

2_6_1_06.PNG

⑷ Model construction and evaluation

#Check the data
print(auto)

2_6_1_07.PNG

**Using this data, fit a ridge regression model and a multiple regression model, and compare the accuracy of the two.**
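
For reference, the two models differ only in a penalty term. The following is a sketch of the two objectives as they are commonly written (for example in the scikit-learn documentation), where $X$ is the matrix of explanatory variables, $y$ the objective variable, $w$ the coefficient vector, and $\alpha$ the regularization strength (the `alpha` argument of `Ridge`, which defaults to 1.0). The intercept is omitted for brevity.

```latex
\begin{align}
\text{Multiple regression:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 \\
\text{Ridge regression:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{align}
```

With $\alpha = 0$ the ridge objective reduces to the multiple regression objective; a larger $\alpha$ pulls the coefficients toward zero.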

#Import for model building of ridge regression
from sklearn.linear_model import Ridge

#Import for model building of multiple regression analysis
from sklearn.linear_model import LinearRegression

#Import for data splitting (training data and test data)
from sklearn.model_selection import train_test_split

#Set explanatory variables and objective variables
x = auto.drop('price', axis=1)
y = auto['price']

#Divided into training data and test data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.5, random_state=0)
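
To illustrate what the two arguments do, here is a minimal sketch on made-up arrays (`X_demo` and `y_demo` are hypothetical, not the car data): `test_size=0.5` reserves half of the rows for testing, and a fixed `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 3 explanatory variables
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.arange(10)

# Split twice with the same random_state
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Half of the rows go to each part
print(Xa_train.shape, Xa_test.shape)

# The same random_state yields the same split
print(np.array_equal(Xa_test, Xb_test))
```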

**First, build a multiple regression model and calculate its coefficient of determination (R², the value returned by `score`) on the training data and the test data.**

#Initialization of LinearRegression class
linear = LinearRegression()

#Execution of learning
linear.fit(X_train, Y_train)

#Coefficient of determination (R^2) on the training data
train_score_linear = linear.score(X_train, Y_train)
print('R^2 of multiple regression analysis(train):', 
      '{:.6f}'.format(train_score_linear))

#Coefficient of determination (R^2) on the test data
test_score_linear = linear.score(X_test, Y_test)
print('R^2 of multiple regression analysis(test):', 
      '{:.6f}'.format(test_score_linear))

2_6_1_08.PNG
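
Note that for regression models in scikit-learn, `score` returns the coefficient of determination R² rather than a classification accuracy. A small check on made-up data (`X_demo`, `y_demo`, and `model` are hypothetical names) comparing it with `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data: y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# score() and r2_score() on the model's predictions give the same value
score_value = model.score(X_demo, y_demo)
r2_value = r2_score(y_demo, model.predict(X_demo))

print(np.isclose(score_value, r2_value))
```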

**Next, build a ridge regression model and calculate its coefficient of determination (R²) on the training data and the test data.**

#Initialization of Ridge class
ridge = Ridge()

#Execution of learning
ridge.fit(X_train, Y_train)

#Coefficient of determination (R^2) on the training data
train_score_ridge = ridge.score(X_train, Y_train)
print('R^2 of ridge regression(train):', 
      '{:.6f}'.format(train_score_ridge))

#Coefficient of determination (R^2) on the test data
test_score_ridge = ridge.score(X_test, Y_test)
print('R^2 of ridge regression(test):', 
      '{:.6f}'.format(test_score_ridge))

2_6_1_09.PNG

| | Multiple regression analysis (L) | Ridge regression (R) | Difference (L-R) |
|---|---|---|---|
| R² on training data | 0.733358 | 0.733355 | 0.000003 |
| R² on test data | 0.737069 | 0.737768 | -0.000699 |
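
The two models score almost identically here, which suggests that the default penalty (`alpha=1.0`) barely changes the fit on this data. To see the penalty take effect, the sketch below fits `Ridge` with increasing `alpha` on made-up data with two strongly correlated features and prints the length of the coefficient vector. The data and the chosen `alphas` are assumptions for illustration only and say nothing about the best value for the car data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Made-up data with two strongly correlated features
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
X_demo = np.column_stack([x1, x2])
y_demo = 2.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=100)

# Length of the coefficient vector without a penalty
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

# Length of the coefficient vector for increasing penalty strength
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]
ridge_norms = [
    np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in alphas
]

print('no penalty:', ols_norm)
for a, n in zip(alphas, ridge_norms):
    print('alpha =', a, ':', n)
```

The printed lengths shrink as `alpha` grows, which is the regularization effect that separates ridge regression from multiple regression.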
