To deepen our understanding of the theory behind regression analysis, let's build a regression model by hand, without using the sklearn library, which is a powerful tool for regression analysis.
Regression predicts the value of the objective variable (output data) from the explanatory variable (input data). To keep the theory manageable, this time we consider the case of a single explanatory variable (simple regression).
import numpy as np
import pandas as pd
from pandas import DataFrame
data_age = np.array([20,20,28,38,33,34,22,37,
                     26,21,22,39,31,29,38,35,
                     32,27,30])
data_salary = np.array([410,500,480,710,630,600,430,
                        690,500,410,490,800,550,550,
                        700,700,650,540,600])
data = DataFrame({'age':data_age,
                  'income':data_salary})
The following graph can be obtained from the above data set.
Based on this graph, I will try to express the relationship between age and income with a linear formula. (This forces the relationship into a straight line; in reality it would follow a more complicated expression.)
For now, let x be the age and y the predicted income, and consider a model of the form y = ax + b. Because there are multiple data points, the values of a and b cannot be determined by simply solving an equation. To find the most plausible values of a and b, we use the idea of mean squared error. Specifically, for each data point, take the difference between the predicted (regressed) income y = ax + b and the actual income:
y - 410 = 20a + b - 410
y - 500 = 20a + b - 500
y - 480 = 28a + b - 480
...
and so on for every data point.
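The differences listed above can be checked numerically. The following is a minimal sketch using only the first three data points; the trial values a = 15 and b = 100 are arbitrary assumptions chosen for illustration, not values taken from the article.

```python
import numpy as np

# First three data points from the data set above
data_age = np.array([20, 20, 28])
data_salary = np.array([410, 500, 480])

# Hypothetical trial values for the slope and intercept
a, b = 15.0, 100.0

# Predicted income y = a*x + b for each age
predicted = a * data_age + b

# Difference between predicted and actual income for each data point
residuals = predicted - data_salary
print(residuals)
```

Some differences are negative and some are positive, which is why they are squared before being averaged.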
The squared difference between the predicted value and the actual value is summed over all N data points and divided by N; this average is the mean squared error Q(a, b).
Q(a,b) = \frac{1}{N}\sum_{k=0}^{N-1}(ax_k + b - y_k)^2
Here x_k is the age and y_k is the actual income of the k-th data point. Let us fit a straight line to the relationship between age and income by finding the a and b that minimize the mean squared error Q(a, b).
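To make "smaller Q means a better fit" concrete, here is a small sketch that evaluates the mean squared error for two candidate lines on the full data set. The helper name `mse` and both candidate lines are assumptions made for illustration only.

```python
import numpy as np

data_age = np.array([20, 20, 28, 38, 33, 34, 22, 37,
                     26, 21, 22, 39, 31, 29, 38, 35,
                     32, 27, 30])
data_salary = np.array([410, 500, 480, 710, 630, 600, 430,
                        690, 500, 410, 490, 800, 550, 550,
                        700, 700, 650, 540, 600])

def mse(a, b):
    """Mean squared error of the line y = a*x + b on the data set."""
    return np.mean((a * data_age + b - data_salary) ** 2)

# Candidate 1: a horizontal line through the mean income (ignores age)
q_flat = mse(0.0, data_salary.mean())

# Candidate 2: an arbitrary upward-sloping line
q_tilt = mse(15.0, 100.0)

print(q_flat, q_tilt)
```

The sloped line gives the smaller value of Q here, which matches the upward trend in the data; the goal is to find the pair (a, b) for which Q is as small as possible.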
First, to get a rough idea of where Q(a, b) is smallest, evaluate it for a range of values of a and b and plot the resulting surface.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
# Prepare candidate values for a and b
a = np.linspace(-200,200,100)
b = np.linspace(-500,500,100)
# Build every combination of a and b by expanding each
# into a two-dimensional array (needed to draw a surface)
A,B = np.meshgrid(a,b)
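# Function that computes the mean squared error Q for one pair (a, b)
def calc_Q(x,y,a,b):
    result = (a * x + b - y)**2
    return np.mean(result)
# Array for Q (initialized with 0); rows follow b and columns follow a,
# matching the shape of A and B returned by np.meshgrid
Q = np.zeros([len(b),len(a)])
# Calculate Q for every combination of a and b
for j in range(len(b)):
    for k in range(len(a)):
        # Use A[j,k], B[j,k] so each Q value lines up with its plotted (a, b)
        Q[j,k] = calc_Q(data_age,data_salary,A[j,k],B[j,k])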
# Draw the 3D surface plot of Q
fig = plt.figure(figsize=[10,10])
ax = fig.add_subplot(111,projection="3d")
ax.view_init(45,10)
ax.set_xlabel("a",size=14,color="blue")
ax.set_ylabel("b",size=14,color="blue")
ax.set_zlabel("Q",size=14,color="blue")
ax.plot_surface(A,B,Q,color="red")
plt.show()
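As a cross-check on the grid evaluation above, the same table of Q values can be filled in one step with NumPy broadcasting, and the lowest grid cell can be located with `np.argmin`. This is only a sketch under the same grid settings as above; the grid is coarse, so the cell it finds is a rough estimate of the minimum, not the exact answer.

```python
import numpy as np

data_age = np.array([20, 20, 28, 38, 33, 34, 22, 37,
                     26, 21, 22, 39, 31, 29, 38, 35,
                     32, 27, 30])
data_salary = np.array([410, 500, 480, 710, 630, 600, 430,
                        690, 500, 410, 490, 800, 550, 550,
                        700, 700, 650, 540, 600])

a = np.linspace(-200, 200, 100)
b = np.linspace(-500, 500, 100)
A, B = np.meshgrid(a, b)  # A[j, k] == a[k], B[j, k] == b[j]

# Add a trailing axis so each (a, b) pair is combined with every data point:
# the residual array has shape (100, 100, 19)
residuals = A[..., None] * data_age + B[..., None] - data_salary

# Average the squared residuals over the data axis to get Q on the grid
Q = np.mean(residuals ** 2, axis=-1)

# Row and column index of the smallest Q on the grid
j, k = np.unravel_index(np.argmin(Q), Q.shape)
print("a =", A[j, k], "b =", B[j, k], "Q =", Q[j, k])
```

Because `Q[j, k]` is computed from `A[j, k]` and `B[j, k]`, the array can be passed to `plot_surface(A, B, Q)` without any axis mix-up.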
From the graph, Q(a, b) forms a single valley, which suggests that it has exactly one minimum. Therefore, using the steepest descent method (a way of finding a minimum by following the slope), we will try to find the values of a and b at which Q(a, b) is smallest.
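As a preview of that idea, here is a rough sketch of steepest descent on Q(a, b), not necessarily the implementation the follow-up will use. Several things are my own assumptions: the helper name `grad_Q`, the learning rate `eta`, the iteration count, and in particular the step of centering the ages (subtracting their mean) so that a single learning rate works for both parameters. The result is compared against `np.polyfit` as a sanity check.

```python
import numpy as np

data_age = np.array([20, 20, 28, 38, 33, 34, 22, 37,
                     26, 21, 22, 39, 31, 29, 38, 35,
                     32, 27, 30], dtype=float)
data_salary = np.array([410, 500, 480, 710, 630, 600, 430,
                        690, 500, 410, 490, 800, 550, 550,
                        700, 700, 650, 540, 600], dtype=float)

def grad_Q(x, y, a, b):
    """Partial derivatives of Q(a, b) = mean((a*x + b - y)**2)."""
    err = a * x + b - y
    return 2.0 * np.mean(err * x), 2.0 * np.mean(err)

# Assumption (not from the article): center the ages so both gradient
# components are on comparable scales and one learning rate suffices
x_mean = data_age.mean()
x_c = data_age - x_mean

a, c = 0.0, 0.0  # c is the intercept of the centered model y = a*(x - x_mean) + c
eta = 0.01       # hypothetical learning rate
for _ in range(5000):
    da, dc = grad_Q(x_c, data_salary, a, c)
    a -= eta * da  # step against the slope in the a direction
    c -= eta * dc  # step against the slope in the c direction

# Convert back to the intercept of the original model y = a*x + b
b = c - a * x_mean
print("steepest descent:", a, b)

# Sanity check against NumPy's least-squares line fit
slope, intercept = np.polyfit(data_age, data_salary, 1)
print("np.polyfit:      ", slope, intercept)
```

Without centering, the same loop still works in principle, but the slope in the a direction is far steeper than in the b direction, so it needs a much smaller learning rate and many more iterations.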
To be continued.
Introduction to Python Numerical Calculation https://python.atelierkobato.com/mse/