[PYTHON] [Machine learning] Understanding simple linear regression from both scikit-learn and mathematics

1. Purpose

If you want to try machine learning, anyone can use scikit-learn and similar libraries to implement it relatively easily. However, to achieve results at work, or to raise your own level, **an explanation like "I don't know what happens in the background, but I got this result" is clearly weak**.

In this article, **sections 2 and 3 are aimed at readers who think "never mind the theory, let me try scikit-learn first," while section 4 onward is aimed at those who want to "understand the background from the mathematics."**


3/1 postscript
・ Added "(iv) Supplement" under "3. Linear regression with scikit-learn → (4) Model construction".

2. What is linear (simple) regression?

(1) What is regression?

**Regression predicts numbers.** Machine learning also includes "classification," but when you want to predict a numerical value, such as "●● yen" or "△ kg," regression is the method to consider.

(2) What is linear regression?

The name may invite misunderstanding: when "the thing you want to predict ($=y$)" and "the thing you believe affects it ($=x$)" have a linear relationship, the method of finding $y$ by exploiting that linear feature is called linear regression.

I think this is hard to grasp in the abstract, so here is a concrete example.

Specific example
You run a self-employed ice cream shop, and you strongly want to be able to predict your store's ice cream sales in order to stabilize your revenue outlook.

You rack your brains over what affects your store's ice cream sales, and you realize that the hotter it is, the more ice cream sells, and the cooler it is, the less it sells. So if you plot "temperature ($=x$)" against "ice cream sales ($=y$)" as shown below, you can indeed see that sales rise as temperature rises, and that the points look likely to be fit by a straight line ($y = ax + b$), i.e., they are linear.

(Figure: scatter plot of temperature vs. ice cream sales)

Next, let's use scikit-learn to build a machine learning model that predicts ice cream sales from temperature.

3. Linear regression with scikit-learn

(1) Import of required libraries

Import the libraries required for linear regression. pandas and matplotlib are also used later in this article, so import them here as well.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

(2) Data preparation

Set up the temperature and ice cream sales data as shown below.

data = pd.DataFrame({
    "temperature(=x)":[8,10,6,15,12,16,20,13,24,26,12,18,19,16,20,23,26,28],
    "sales(=y)":[30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69]
    })

(3) Try to illustrate (important)

Let's plot temperature against ice cream sales. If you apply linear regression to data that is not linearly related, the accuracy will be very poor, precisely because the underlying data is not linear. Rather than throwing everything at scikit-learn right away, make a habit of plotting your data first.

plt.scatter(data["temperature(=x)"],data["sales(=y)"])
plt.xlabel("temperature(°C)")
plt.ylabel("sales")
plt.grid(which='major',color='black',linestyle=':')
(Figure: the resulting scatter plot)

Temperature ($=x$) and sales ($=y$) do appear to be roughly linearly related, so let's build a linear regression model.

(4) Model construction

(i) Data shaping

First, we arrange the data into the shape needed to build the model.

x = data["temperature(=x)"].values
y = data["sales(=y)"].values
X = x.reshape(-1,1)

Since this is not an article on Python grammar I will keep it brief, but here we arrange x and y into the form scikit-learn's linear regression expects: X must be a 2-D array of shape (n_samples, n_features), hence the reshape(-1, 1).
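A quick check of the shapes before and after the reshape (a minimal sketch; the sizes assume the 18-point dataset above):

print(x.shape)  # (18,)   - 1-D; scikit-learn would reject this as X
print(X.shape)  # (18, 1) - 2-D (n_samples, n_features), which fit() expects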

(ii) Model construction

It's finally the model building code.

regr = LinearRegression(fit_intercept = True)
regr.fit(X,y)

It may feel anticlimactic, but for a simple model, that's all there is. The image is: the first line declares "we will create a linear regression model in a variable called regr!", and the next line has regr fit (= learn from) the prepared X and y.

(iii) Try retrieving the slope and intercept of the straight line

As described in "2. What is linear (simple) regression?" (2), what scikit-learn has been doing behind the scenes up to this point is finding the $a$ and $b$ of $y = ax + b$, that is, the formula of the straight line used to forecast sales from temperature. You won't notice this if you leave things as they are, so let's actually retrieve the slope and the intercept.

a = regr.coef_       # the slope
b = regr.intercept_  # the intercept
print(a)
print(b)

You should see a displayed as [1.92602996] and b as 12.226591760299613 (coef_ is an array, intercept_ a scalar). In other words, scikit-learn determined the straight line to be $y(=\text{sales}) = 1.92602996 \times x(=\text{temperature}) + 12.226591760299613$.
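As a quick sanity check, you can plug a temperature into $y = ax + b$ by hand; at 10° this lands near 31.5, which matches the regr.predict result in (6) below:

print(a[0] * 10 + b)  # ≈ 1.926 * 10 + 12.227 ≈ 31.49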

(iv) Supplement

If you just want to build a model, (iii) is enough, but there are a few other useful features, described below. Reference: https://pythondatascience.plavox.info/scikit-learn/%E7%B7%9A%E5%BD%A2%E5%9B%9E%E5%B8%B0

◆ Display the parameters used to build the model

This time we only set fit_intercept to True, but there are other parameters you can set, and you can check how they are currently configured.

regr.get_params()

Then {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False} will be displayed. (The exact set of keys depends on your scikit-learn version; normalize, for example, was removed in newer versions.)

・copy_X: whether to copy the data in memory before fitting. (default: True)
・fit_intercept: when set to False, the calculation of the intercept ($b$ here) is skipped, which is used for data where the target always passes through the origin. (default: True)
・n_jobs: the number of jobs used for the computation. If set to -1, all CPUs are used. (default: None)
・normalize: when set to True, the explanatory variables are normalized beforehand. (default: False)
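For example, a minimal sketch of fitting with fit_intercept=False (illustration only; regr_no_b is a hypothetical name, and forcing the line through the origin does not really suit this sales data):

# Illustration only: force the line through the origin (b = 0)
regr_no_b = LinearRegression(fit_intercept=False)
regr_no_b.fit(X, y)
print(regr_no_b.coef_, regr_no_b.intercept_)  # intercept_ is 0.0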

◆ Display the coefficient of determination

The coefficient of determination ($R^2$) is a measure of how well the model fits the actual data, taking values up to 1 (the closer to 1, the better the fit).

regr.score(X,y)

◆ Evaluating the error

Covering this properly would take a lot of space, so I will not describe it here, but the following reference is helpful. https://pythondatascience.plavox.info/scikit-learn/%E5%9B%9E%E5%B8%B0%E3%83%A2%E3%83%87%E3%83%AB%E3%81%AE%E8%A9%95%E4%BE%A1
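As one minimal sketch, the standard sklearn.metrics functions can be evaluated on the training data itself (for simplicity; normally you would use held-out data):

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred = regr.predict(X)               # predictions for the training points
print(mean_absolute_error(y, y_pred))  # MAE: average absolute error
print(mean_squared_error(y, y_pred))   # MSE: average squared error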

(5) Illustrate the constructed model

Now, let's draw this straight line on the scatter plot from earlier.

# the fitted straight line
y_est_sklearn = regr.intercept_ + regr.coef_[0] * x
# scatter plot of the original temperature and sales
plt.scatter(x, y, marker='o')
# the predicted line over the original temperatures
plt.plot(x, y_est_sklearn, linestyle=':', color='green')
# detailed settings for the figure
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.xlabel("temperature(°C)")
plt.ylabel("sales")
(Figure: the scatter plot with the fitted line)

In this way, try to stay aware of what scikit-learn is doing and how it connects to the result.

(6) In the real world ...

Building a model is not an end in itself. In the real world, you need to use this straight-line model to forecast future sales. Suppose you looked at the weather forecast for the next four days and noted down the temperatures. Store them in a variable called z as shown below.

z = pd.DataFrame([10,25,24,22])

What we want to do is plug these future temperatures into the straight-line formula scikit-learn found earlier and forecast sales.

regr.predict(z)

Running this displays "[31.48689139, 60.37734082, 58.45131086, 54.59925094]". In other words, tomorrow's temperature is 10°, so sales should be about 315,000 yen; the day after tomorrow it will be 25°, so about 603,000 yen (taking sales to be in units of 10,000 yen). If you can get a temperature forecast for the next month, you will have a rough idea of sales, and your goal is achieved.

There are many finer points, but I think it's good to first implement orthodox linear regression like this.

4. Understanding linear (simple) regression from mathematics

Up to section 3, we went through the flow of using scikit-learn to compute the $a$ and $b$ of $y = ax + b$, plotting the result, and forecasting sales from the next four days' temperatures. Here, I would like to make clear **how the "compute $a$ and $b$ of $y = ax + b$" step in this flow is actually carried out mathematically**.

(1) Prerequisite knowledge

a. Basic differentiation


When $y = x^2$ is differentiated with respect to $x$: $y' = 2x$
When $y = x^2 + 4$ is differentiated with respect to $x$: $y' = 2x$
When $y = (-3x + 2)^2$ is differentiated with respect to $x$: $y' = 2(-3x + 2)(-3)$
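If you want to verify the chain-rule example by machine, here is a minimal sketch using sympy (an assumption: sympy is installed; it is not used elsewhere in this article):

import sympy as sp

xsym = sp.symbols('x')  # symbolic x (named to avoid clobbering the data array x)
print(sp.diff((-3*xsym + 2)**2, xsym))  # -> 18*x - 12, i.e. 2(-3x + 2)(-3)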

b. The meaning of Σ (sigma): it denotes a sum. For example, $\sum_{i=1}^n x_i = x_1 + x_2 + \cdots + x_n$.

(2) Mathematical understanding

(i) What you are doing to get $a$ and $b$ of $y = ax + b$

Let me repost the figure from earlier. As shown below, we want to draw a "good-looking straight line" for predicting sales from temperature; that is, we want to determine the slope $a$ and the intercept $b$.

(Figure: scatter plot of temperature vs. sales)

How do we decide $a$ and $b$? Look at the two straight lines below. Which line, green or orange, seems more likely to capture the relationship between temperature and sales?

(Figure: the scatter plot with a green line and an orange line drawn through it)

Clearly, orange (a = 1.92, b = 12.2) represents the actual relationship between temperature and sales better than green (a = 2.0, b = 30.0). Why? **Because the distance between the orange straight line and the actual blue points is smaller.**

In other words, scikit-learn looks for the $a$ and $b$ of a "good straight line" that makes the distance between the line and the blue points as small as possible.

The method of finding the $a$ and $b$ that bring the straight line closest to the blue points in this sense is called the "least squares method."

(ii) Least squares method

Let's break this down a little further. The "distance between the straight line and the blue points" from (i) can be written as follows.

(Figure: one data point and its vertical distance to the line $y = ax + b$)

The actual coordinates (the red dot in the figure above) are written $(x_1, y_1)$, and the coordinates predicted by $y = ax + b$ are written $(x_1, ax_1 + b)$.

The error between these two y-coordinates (= the difference between the forecast and the actual sales) can be expressed as $y_1 - (ax_1 + b)$.

This $y_1 - (ax_1 + b)$ is only the error for a single red dot, so we add the errors up over all the points and then find the $a$ and $b$ that make this total error as small as possible (= compute the $a$ and $b$ that minimize the gap between prediction and reality).

One more thing: if we simply added the raw differences, positive and negative errors would cancel out, so in practice we square each error before summing and then minimize that. This idea is called the least squares method.
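To make this concrete, here is a minimal sketch (reusing the x and y NumPy arrays from section 3) that computes the sum of squared errors for the green and orange candidate lines from (i):

import numpy as np

sse_green = np.sum((y - (2.0 * x + 30.0)) ** 2)    # green: a=2.0, b=30.0
sse_orange = np.sum((y - (1.92 * x + 12.2)) ** 2)  # orange: a=1.92, b=12.2
print(sse_green, sse_orange)  # the orange line's error is far smaller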

(iii) Try solving the least squares method

◆ Setting the error function

Let $E$ denote the sum of the squared errors between all the actual points and the values predicted by the straight line. $E$ can then be expressed as follows.

E = \sum_{i=1}^n \{ y_i - (ax_i + b) \}^2

This is the squared error between the actual value $y_i$ and the predicted value $(ax_i + b)$, summed from the 1st point to the nth (that is, over all the points).

◆ Minimizing the error function

To find the minimum of $E$, let's look at the shape of $E$ as a function of $a$ and $b$.

(Figure: $E$ forms a bowl-shaped surface over the $a$ and $b$ axes, with a red point at the bottom of the bowl)

In general, $E$ takes the bowl shape shown above, and you can see that $E$ is minimized around the red point. What characterizes this red point? **It is the point where differentiating $E$ with respect to $a$ gives 0, and differentiating $E$ with respect to $b$ also gives 0.** A derivative means a "slope," so the red point is where the slope is 0 viewed along the $a$-axis and also 0 viewed along the $b$-axis.
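If you would like to confirm this numerically before doing the algebra, here is a minimal sketch (again assuming sympy) that builds $E$ from the actual data, sets both partial derivatives to zero, and solves. Note that it rebinds a and b, so run it separately from the regression code:

import sympy as sp

a, b = sp.symbols('a b')
xs = [8, 10, 6, 15, 12, 16, 20, 13, 24, 26, 12, 18, 19, 16, 20, 23, 26, 28]
ys = [30, 35, 28, 38, 35, 40, 60, 34, 63, 65, 38, 40, 41, 43, 42, 55, 65, 69]

E = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(xs, ys))
sol = sp.solve([sp.diff(E, a), sp.diff(E, b)], [a, b])
print({k: float(v) for k, v in sol.items()})  # a ≈ 1.926, b ≈ 12.227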

◆ Let's actually calculate

If you can, grab a piece of paper and a pen and work along. [Differentiate with respect to $a$] The symbol $∂$ used below is read "del" (it has several accepted readings) and denotes partial differentiation.

(Figure: supplementary working for formula ①, which follows)


\begin{align}
\frac{∂E}{∂a} &= \frac{∂}{∂a} \sum_{i=1}^n (y_i - ax_i - b)^2 …①\\
&= \sum_{i=1}^n 2(y_i - ax_i - b)(-x_i) …②\\
&= \sum_{i=1}^n -2x_i(y_i - ax_i - b) …③\\
&= \sum_{i=1}^n -2x_iy_i + \sum_{i=1}^n 2ax_i^2 + \sum_{i=1}^n 2x_ib …④
\end{align}

・Differentiating ① gives ② (which is why the $∂$, the symbol for differentiation, disappears in ②).
・③ is just a slight rearrangement of ②, and ④ splits ③ into a separate $Σ$ for each term.
We are looking for the point where the derivative with respect to $a$ is 0, so let's solve ④ = 0.


\begin{align}
- \sum_{i=1}^n x_iy_i + a\sum_{i=1}^n x_i^2 + b \sum_{i=1}^n x_i = 0 …⑤\\
- \bar{xy} + a\bar{x^2} + b \bar{x} = 0 …⑥
\end{align}

⑤ is ④ = 0 with both sides divided by the common factor 2 in ④. ⑥ is ⑤ with both sides divided by n. In ⑤, each $Σ$ adds the data from the 1st through the nth, so dividing by n produces an average. Concretely, the first $Σ$ term ($\sum_{i=1}^n x_iy_i$) is the sum of $xy$ from the 1st to the nth; dividing it by $n$ gives the overall average, which can be written $\bar{xy}$.

[Differentiate with respect to $b$] Similarly, we differentiate with respect to $b$.


\begin{align}
\frac{∂E}{∂b} &= \frac{∂}{∂b} \sum_{i=1}^n (y_i - ax_i - b)^2 …[1]\\
&= \sum_{i=1}^n 2(y_i - ax_i - b)(-1) …[2]\\
&= \sum_{i=1}^n -2(y_i - ax_i - b) …[3]\\
&= \sum_{i=1}^n -2y_i + \sum_{i=1}^n 2ax_i + \sum_{i=1}^n 2b …[4]
\end{align}

Steps [1] to [4] do essentially the same thing as ① to ④ in the differentiation with respect to $a$, and correspond one-to-one. Likewise, let's derive [5] and [6], the counterparts of ⑤ and ⑥.


\begin{align}
- \sum_{i=1}^n y_i + a\sum_{i=1}^n x_i + nb = 0 …[5]\\
- \bar{y} + a\bar{x} + b = 0 …[6]
\end{align}

[Solving the simultaneous equations] ⑥ and [6] are restated below with slight rearrangement.


a\bar{x^2} + b \bar{x} = \bar{xy} …⑥'\\
a\bar{x} + b = \bar{y} …[6']

To solve these two simultaneous equations (by eliminating $b$), multiply both sides of [6'] by $\bar{x}$.


a\bar{x^2} + b \bar{x} = \bar{xy} …⑥'\\
a\bar{x}^2 + b\bar{x} = \bar{x}\bar{y} …[6'']

Here are two easy-to-confuse points to keep in mind:
・$a\bar{x^2}$ in ⑥' and $a\bar{x}^2$ in [6''] are different ($\bar{x^2}$ is the average of $x^2$, whereas $\bar{x}^2$ is the square of $\bar{x}$).
・$\bar{xy}$ in ⑥' and $\bar{x}\bar{y}$ in [6''] are different ($\bar{xy}$ is the average of $xy$, whereas $\bar{x}\bar{y}$ is the average of $x$ multiplied by the average of $y$).

Computing ⑥' - [6''] gives the following.

a\bar{x^2} - a\bar{x}^2 = \bar{xy} - \bar{x}\bar{y}

Solving this for $ a $

a = \frac{\bar{xy} - \bar{x}\bar{y}}{\bar{x^2} - \bar{x}^2} …[A]

Finally, we solve for $b$. From [6], $b = \bar{y} - a\bar{x}$, so substituting [A] gives:

b = \bar{y} - \frac{\bar{xy} - \bar{x}\bar{y}}{\bar{x^2} - \bar{x}^2}\bar{x} …[B]

From [A] and [B], we have obtained the $a$ and $b$ we were after.

◆ The straight line formula you wanted to find

Now that we have $a$ and $b$, we can write down the "best" straight line, the one that minimizes $E$ (= minimizes the error). Since the line has the form $y = ax + b$:

y = \frac{\bar{xy} - \bar{x}\bar{y}}{\bar{x^2} - \bar{x}^2}x +( \bar{y} - \frac{\bar{xy} - \bar{x}\bar{y}}{\bar{x^2} - \bar{x}^2}\bar{x})

I was able to express it!

The point is not that you managed to grind through this by hand; **the point is that the formula above can be computed using only the data you already have (in this example, the temperature and sales data)**.
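Here is a minimal sketch of [A] and [B] with NumPy, reusing the x and y arrays from section 3; the hand-computed values should match scikit-learn's output above:

import numpy as np

a_hand = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
b_hand = np.mean(y) - a_hand * np.mean(x)
print(a_hand, b_hand)  # ≈ 1.92602996, 12.226591760299613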

scikit-learn does this calculation in one shot, but I think it is very important to understand that this kind of computation is happening behind the scenes. It took me a long time to understand this sequence of steps at first. It may be hard going initially, but I hope you will work through it by hand.

◆ A slight extension

For the hand-calculated line $y = \frac{\bar{xy} - \bar{x}\bar{y}}{\bar{x^2} - \bar{x}^2}x + (\bar{y} - \frac{\bar{xy} - \bar{x}\bar{y}}{\bar{x^2} - \bar{x}^2}\bar{x})$, the following identities hold for some of the expressions that appear in it.


Premise
$\bar{xy} - \bar{x}\bar{y} = σ_{xy}$ (from the formula for covariance)

$\bar{x^2} - \bar{x}^2 = σ_x^2$ (from the formula for variance)


Conclusion

y = \frac{σ_{xy}}{σ_x^2}x + (\bar{y} - \frac{σ_{xy}}{σ_x^2}\bar{x})

It can also be written as above.
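The same slope can also be computed via covariance and variance; a minimal sketch (note bias=True, so np.cov uses the same 1/n normalization as np.var and the averages above):

import numpy as np

a_cov = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b_cov = np.mean(y) - a_cov * np.mean(x)
print(a_cov, b_cov)  # identical to the values from [A] and [B]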

5. Summary

How was it? My view is this: at first you can't be expected to interpret highly complex code, so never mind accuracy for now, and simply try to implement a basic end-to-end flow with scikit-learn or a similar library. I think that matters a great deal.

Then, once you get used to it, I feel it is very important to understand from the mathematical background how these tools work behind the scenes.

Some of this content is admittedly hard to digest, but I hope it helps deepen your understanding.
