2. Multivariate analysis spelled out in Python: 1-1. Simple regression analysis (scikit-learn)

Over the last few years, with the growing interest in AI and big data, analytical methods such as k-means have come to be talked about as machine learning techniques. All of these are **multivariate analysis** methods that have been used in business and academic fields for decades, since the Showa era. One of the most popular is **regression analysis**. So, first, let's implement **simple regression analysis** using scikit-learn, a machine learning library. (As a general rule, the code is written and the results checked on Google Colaboratory.)

**Simple regression analysis**

Regression analysis is divided into **simple regression analysis**, which handles two variables, and **multiple regression analysis**, which handles three or more. First, consider **simple regression analysis**. Simple regression analysis derives a **linear** or **non-linear** law from the data (the phenomenon); to put it plainly, it reveals the rule by which $y$ increases or decreases at a constant rate as $x$ increases.

002_001_007.PNG

The simplest case, **linear simple regression analysis**, is expressed by the following equation:

$$y = ax + b$$

This equation is called the **regression equation** (simple regression equation). Once $a$ and $b$ are determined, you can draw a straight line; then $x$ explains $y$, and $x$ can be used to predict $y$. Since the variable $y$ is explained by the variable $x$, the target $y$ is called the **objective variable** (dependent variable), and the $x$ that explains it the **explanatory variable** (independent variable). Also, $a$, the slope of the regression line, is called the **regression coefficient**, and $b$, where the line crosses the $y$ axis, is called the **intercept**. In other words, the goal of simple regression analysis is to find the regression coefficient $a$ and the intercept $b$.
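To make the equation concrete, here is a minimal pure-Python sketch: with illustrative (made-up) values for $a$ and $b$, every $x$ maps to a predicted $y$.

```python
# Illustrative (made-up) coefficients: y = 2x + 1
a_demo, b_demo = 2.0, 1.0

def predict(x):
    """Predicted y for a given x under the simple regression equation y = a*x + b."""
    return a_demo * x + b_demo

print(predict(3.0))  # 2*3 + 1 = 7.0
```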

** ⑴ Import the library **

# Libraries for numerical computation
import numpy as np
import pandas as pd
# Package for drawing graphs
import matplotlib.pyplot as plt
# Linear model from scikit-learn
from sklearn import linear_model

** ⑵ Read the data and check the contents **

df = pd.read_csv("https://raw.githubusercontent.com/karaage0703/machine-learning-study/master/data/karaage_data.csv")
df

002_001_001.PNG

** ⑶ Convert the variables x and y to NumPy array type **

The variable df, where the data is stored, is a pandas DataFrame. For the calculations that follow, it is converted to NumPy array type and stored in the variables x and y.

x = df.loc[:, ['x']].values
y = df['y'].values

002_001_002.PNG

The variable $x$ was stored as two-dimensional data by cutting out the elements [all rows, column x] with pandas' `loc` accessor and converting them to a NumPy array with `values`. The variable $y$ was taken out as one-dimensional data by specifying the column name y, and converted to a NumPy array in the same way.
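The difference between the two selections can be checked on a tiny stand-in DataFrame (not the article's karaage_data.csv): a list of column labels inside `loc` yields a two-dimensional array, while a single label yields a one-dimensional one.

```python
import pandas as pd

# Tiny stand-in DataFrame for illustration
df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})

x_demo = df_demo.loc[:, ["x"]].values  # list of labels -> 2-D array, shape (3, 1)
y_demo = df_demo["y"].values           # single label   -> 1-D array, shape (3,)

print(x_demo.shape, y_demo.shape)
```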

** ⑷ Plot the data on the scatter plot **

Using the drawing package matplotlib, pass (variable x, variable y, "marker type") as the arguments of the `plot` function; the marker string `"o"` plots each point as a circle.

plt.plot(x, y, "o")

002_001_003.PNG

**From here, we will use the linear regression model of the machine learning library scikit-learn to calculate the regression coefficient $a$ and the intercept $b$.**

** ⑸ Apply variables x and y to the linear regression model **

# Load the linear regression model as clf
clf = linear_model.LinearRegression()
# Fit the model to the variables x and y
clf.fit(x, y)
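Under the hood, `fit` solves an ordinary least-squares problem. As a rough sketch of what that involves (the textbook formulas for one explanatory variable, not scikit-learn's actual implementation), the coefficient and intercept can be computed directly with NumPy on synthetic data:

```python
import numpy as np

# Synthetic data for illustration
x1 = np.array([1.0, 2.0, 3.0, 4.0])
y1 = np.array([2.1, 3.9, 6.2, 7.8])

# Least-squares estimates for y = a*x + b:
#   a = covariance(x, y) / variance(x),  b = mean(y) - a * mean(x)
a_hat = np.sum((x1 - x1.mean()) * (y1 - y1.mean())) / np.sum((x1 - x1.mean()) ** 2)
b_hat = y1.mean() - a_hat * x1.mean()

print(a_hat, b_hat)
```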

** ⑹ Calculate regression coefficient / intercept **

The regression coefficient can be obtained from `coef_` and the intercept from `intercept_`.

#Regression coefficient
a = clf.coef_
#Intercept
b = clf.intercept_

002_001_005.PNG

** ⑺ Calculate the coefficient of determination **

Then get the coefficient of determination with `score(x, y)`.

#Coefficient of determination
r = clf.score(x, y)

002_001_006.PNG

**Coefficient of determination**

The coefficient of determination is an index of how accurate the obtained **regression equation** is, where accuracy means "how well the regression equation can explain the distribution of the data".

The actually observed data are called **measured values**. As the scatter plot shows, the measured values are scattered over the coordinates; summarizing them with a single straight line discards part of the information contained in their original variance. This discarded part, that is, the error that remains after fitting the regression equation, is called the **residual**. Writing $y_i$ for the measured values, $\bar{y}$ for their mean, and $\hat{y}_i$ for the values predicted by the regression equation, the coefficient of determination can be expressed as a fraction:

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

The denominator is the variance of the objective variable $y$, i.e. of the measured values, and the numerator is the variance of the predicted values $\hat{y}$, which balances against the residuals: the larger the residuals, the smaller it becomes. In other words, the coefficient of determination tells us what proportion of the variance of the measured values is accounted for by the variance of the predicted values. Since it is such a ratio, the coefficient of determination $R^2$ always takes a value between 0 and 1, and the closer it is to 1, the better the accuracy of the regression equation.
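The definition can be verified numerically. The sketch below (synthetic data, not the article's) fits a line by the usual least-squares formulas and computes $R^2$ as one minus the residual sum of squares over the total sum of squares; for a least-squares fit this equals the variance ratio described in the text.

```python
import numpy as np

# Synthetic data for illustration
x2 = np.array([1.0, 2.0, 3.0, 4.0])
y2 = np.array([2.0, 4.1, 5.9, 8.0])

# Least-squares fit: a = cov(x, y) / var(x), b = mean(y) - a * mean(x)
a2 = np.sum((x2 - x2.mean()) * (y2 - y2.mean())) / np.sum((x2 - x2.mean()) ** 2)
b2 = y2.mean() - a2 * x2.mean()
y_pred = a2 * x2 + b2

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y2 - y_pred) ** 2)
ss_tot = np.sum((y2 - y2.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(r2)
```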

** ⑻ Set the x value to draw the regression line **

First, generate the $x$ values on which the regression line will be drawn, using NumPy's `linspace` function. Its arguments are (start point, end point, number of points).

fig_x = np.linspace(0, 40, 40)
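As a quick self-contained check of what `linspace` returns (note that the end point is included by default):

```python
import numpy as np

fig_x = np.linspace(0, 40, 40)  # 40 evenly spaced values from 0 to 40 inclusive

print(fig_x.shape)          # (40,)
print(fig_x[0], fig_x[-1])  # first and last values
```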

002_001_009.PNG

** ⑼ Check the shape of regression coefficient, intercept, and x value **

print(a.shape)      # regression coefficient
print(b.shape)      # intercept
print(fig_x.shape)  # x values

002_001_010.PNG

Tips

If you substitute $y = ax + b$ as it is, the shapes do not line up as you might expect: the regression coefficient $a$ returned by `coef_` is not a plain number but a one-element array, while the $x$ values form an array of 40 elements. To multiply every $x$ value by the same coefficient unambiguously, convert the regression coefficient $a$ to a single value first.
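The shapes involved can be illustrated with stand-in values (the coefficient below is made up). With a one-element array of shape `(1,)`, NumPy broadcasting does in fact handle the product, but extracting the scalar with `a[0]` or reshaping makes the intent explicit:

```python
import numpy as np

a_demo = np.array([1.94])       # coef_-style result: one-element array, shape (1,)
fig_x = np.linspace(0, 40, 40)  # shape (40,)

fig_y1 = a_demo[0] * fig_x          # scalar times array
fig_y2 = a_demo.reshape(1) * fig_x  # broadcast (1,) against (40,)

print(fig_y1.shape, fig_y2.shape)
```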

** ⑽ Define variate y **

When defining the variable fig_y, which holds the y values, NumPy's `reshape` function is used to adjust the shape of the regression coefficient $a$.

fig_y = a.reshape(1)*fig_x + b

002_001_011.PNG

** ⑾ Draw a regression line **

# Scatter plot
plt.plot(x, y, "o")
# Regression line; the third argument "r" sets the line color to red
plt.plot(fig_x, fig_y, "r")

plt.show()

002_001_012.PNG


As shown above, using a machine learning library makes it possible to obtain analysis results without complicated calculations. However, with regression analysis as with other methods, to interpret the results properly or to do fine tuning when applying the method, it is still desirable to understand the underlying calculation mechanism (algorithm). So next, you'll implement simple regression analysis entirely on your own, without using scikit-learn.
