Introduction

If we assume that the obtained data are realized values of a linear regression model and estimate the model's coefficients by the least squares method, we obtain the following estimation and test results.

(This is a preamble to the explanation. As an example, I use a Python library and some data I happened to find.)

    import pandas as pd
    import statsmodels.api as sm

    # Monthly average of the world's carbon dioxide concentration, taken from
    # https://www.data.go.jp/data/dataset/mlit_20180523_0032
    df_co2 = pd.read_csv('co2.csv')

    # Question: isn't the world's carbon dioxide concentration increasing year by year?
    # Use all 384 time points (0 to 383) as the explanatory variable.
    df_co2['x'] = df_co2.index
    X = df_co2.loc[:, ['x']]

    # Use the monthly average carbon dioxide concentration (ppm) as the objective variable.
    Y = df_co2.loc[:, ['ave_ppm']]

    # Estimate the coefficients of the linear regression model by the least squares method.
    # (What are you doing with time series data ...)
    model = sm.OLS(Y, sm.add_constant(X))
    results = model.fit()
    print(results.summary())

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                ave_ppm   R-squared:                       0.983
    Model:                            OLS   Adj. R-squared:                  0.983
    Method:                 Least Squares   F-statistic:                 2.195e+04
    Date:                Tue, 24 Dec 2019   Prob (F-statistic):               0.00
    Time:                        00:01:54   Log-Likelihood:                -840.53
    No. Observations:                 384   AIC:                             1685.
    Df Residuals:                     382   BIC:                             1693.
    Df Model:                           1
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const        341.6819      0.221   1549.122      0.000     341.248     342.116
    x              0.1477      0.001    148.154      0.000       0.146       0.150
    ==============================================================================
    Omnibus:                       17.898   Durbin-Watson:                   0.198
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):               10.180
    Skew:                          -0.229   Prob(JB):                      0.00616
    Kurtosis:                       2.347   Cond. No.                         442.
    ==============================================================================
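As a reading aid for the `coef`, `std err`, `t`, and `Df Residuals` entries, here is a minimal sketch of the textbook closed-form least squares formulas for a single explanatory variable. It is not statsmodels' implementation. The helper name `simple_ols` and the synthetic data (intercept 340, slope 0.15, error standard deviation 2) are assumptions made for illustration and are not the CO2 data set.

```python
import math
import random


def simple_ols(xs, ys):
    """Least squares fit of y = b0 + b1 * x with standard errors and t values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Least squares estimates of the slope and the intercept.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual sum of squares and the variance estimate with n - 2 degrees of freedom.
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)

    # Standard errors of the estimates; each t value is coef / std err.
    se_b1 = math.sqrt(s2 / sxx)
    se_b0 = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    return {
        "const": (b0, se_b0, b0 / se_b0),
        "x": (b1, se_b1, b1 / se_b1),
        "df_resid": n - 2,
    }


# Synthetic data with hypothetical parameter values (NOT the CO2 data).
rng = random.Random(0)
xs = list(range(384))
ys = [340.0 + 0.15 * x + rng.gauss(0.0, 2.0) for x in xs]

fit = simple_ols(xs, ys)
for name in ("const", "x"):
    coef, se, t = fit[name]
    print(f"{name:>5}  coef={coef:.4f}  std err={se:.4f}  t={t:.3f}")
print("Df Residuals:", fit["df_resid"])
```

With 384 observations and two estimated coefficients, the residual degrees of freedom come out as 384 - 2 = 382, matching the `Df Residuals` entry in the output above.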

In recent years, more and more people run this kind of procedure and call it analysis. However, I think quite a few of them do not understand the meaning of this estimation result. (This is my prejudice.)

I think there are many people who:

- vaguely say, "It's a significant explanatory variable because the p-value is smaller than 5%! Just as hypothesized!" without actually understanding what they are doing;
- think that statistical significance can be calculated automatically as long as there is data;
- pursue significant results without understanding what statistical significance means.

And so on.

A linear regression model is an esoteric statistical model that requires a great deal of basic knowledge to understand, yet with data and an analysis tool the results can be output easily. I think this gap is what causes the situation.

Some readers may look at statements such as

- "If we assume that the obtained data are realized values of a linear regression model and estimate the model's coefficients by the least squares method, we obtain the following estimation and test results."
- "A linear regression model is an esoteric statistical model that requires considerable basic knowledge to understand."

and think, "Huh, is that so?" or "What do you mean?" This article is for such people. (Or for those who are wondering how to interpret what their analysis tool has output.)

This first installment (Linear Regression Model 1) explains what "the obtained data are realized values of a linear regression model" means.

What does "the obtained data are realized values of a linear regression model" mean?

First of all, many people may not realize that the linear regression model is a stochastic model. A linear regression model with one explanatory variable can be expressed as

$$
\begin{gathered}
y_j = \beta_0 + \beta_1 x_{1j} + u_j \\
u_j \sim N(0, \sigma^2), \quad \text{i.i.d.} \\
(j = 1, \cdots, n)
\end{gathered}
$$

[^1] Here $y$ is the objective variable and $x_1$ is the explanatory variable. This linear regression model is one of the models that can be applied when $n$ pairs of $y$ and $x_1$ have been obtained. Please note that the data actually obtained cannot always be explained by this model.

"$u_j \sim N(0, \sigma^2), \quad \text{i.i.d.}$" means "the $u_j$ are mutually independent across $j$, and each is a random variable that follows a normal distribution with mean $0$ and variance $\sigma^2$." ($x_1$ is not a random variable in the linear regression model.) If, having read the explanation so far, you are wondering "What is a random variable?" or "What is independence?", you do not yet have the basic knowledge needed to understand a linear regression model. First, learn the meaning of these terms. ~~It's a hassle~~ I won't explain them here because that would stray from the main subject. Please read other sites and textbooks to understand them. [^2]

As you can see from the model above, the linear regression model is a stochastic model precisely because it includes the random variable $u_j \sim N(0, \sigma^2), \quad \text{i.i.d.}$ In other words, it is a model that contains random variables. The term $u_j$ is called the error term.
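To make "a model that contains random variables" concrete, here is a minimal sketch that draws data from the model twice with the same explanatory values. The function name `simulate_linear_model` and the parameter values ($\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 0.5$) are assumptions chosen only for illustration.

```python
import random


def simulate_linear_model(beta0, beta1, sigma, xs, seed=None):
    """Draw one realization of y_j = beta0 + beta1 * x_j + u_j, u_j ~ N(0, sigma^2) i.i.d."""
    rng = random.Random(seed)
    # x is treated as fixed (not random); only the error term u_j is drawn.
    errors = [rng.gauss(0.0, sigma) for _ in xs]
    ys = [beta0 + beta1 * x + u for x, u in zip(xs, errors)]
    return ys, errors


# Hypothetical parameter values, chosen only for illustration.
xs = list(range(10))
ys_a, _ = simulate_linear_model(1.0, 2.0, 0.5, xs, seed=0)
ys_b, _ = simulate_linear_model(1.0, 2.0, 0.5, xs, seed=1)

for x, ya, yb in zip(xs, ys_a, ys_b):
    line = 1.0 + 2.0 * x  # the part of the model without the error term
    print(f"x={x}  line={line:.1f}  y (draw A)={ya:.3f}  y (draw B)={yb:.3f}")
```

The `line` column is the same in every run, while the two `y` columns differ from it and from each other, because each draw of the error term gives a different realization.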

People who do not understand this often overlook the error term and mistakenly think that $y$ is expressed only by the linear combination of the $\beta_i$ and $x$. [^3] Looking only at that part, they conclude that it is a simple model. I think this misunderstanding comes from not grasping that data values obtained under a linear regression model are realized values of random variables.

As a concrete example, suppose the linear regression model that $y_j$ follows is $y_j = 1 + 2 x_{1j} + u_j$. If $x_{1j} = 3$, what is $y_j$? Someone who does not understand the model well would answer $y_j = 7$. Of course, this is wrong. Anyone who answers this way has not understood that $y_j$ is a random variable. The correct answer is $y_j = 7 + u_j$, so the value of $y_j$ is determined by the value of $u_j$. In other words, the value of $y_j$ changes from draw to draw, like the roll of a die. We use expressions such as "the number that came up is a realized value of the die." The value of $y_j$ actually obtained as data is a value drawn according to the assumed probability distribution. (If $y_j = 7 + u_j$, the probability distribution that $y_j$ follows is $N(7, \sigma^2)$.)
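The example above can be checked by simulation. The following sketch draws many realizations of $y_j = 1 + 2 \cdot 3 + u_j$. The article does not fix $\sigma$, so $\sigma = 1$ is an assumption made here, and the helper name `draw_y` is likewise hypothetical.

```python
import random
import statistics

rng = random.Random(42)
sigma = 1.0  # assumption: the article does not specify sigma


def draw_y(x, rng, sigma):
    """One realization of y = 1 + 2 * x + u with u ~ N(0, sigma^2)."""
    return 1.0 + 2.0 * x + rng.gauss(0.0, sigma)


# Repeatedly "observe" y at the same x = 3.
draws = [draw_y(3.0, rng, sigma) for _ in range(10_000)]

print("first five realizations:", [round(d, 3) for d in draws[:5]])
print("sample mean:", round(statistics.mean(draws), 3))
print("sample standard deviation:", round(statistics.stdev(draws), 3))
```

The individual realizations differ from one another and essentially never equal 7 exactly. Only their average settles near 7 and their spread near $\sigma$, as expected for values following $N(7, \sigma^2)$.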

That concludes the explanation of what "the obtained data are realized values of a linear regression model" means. I would appreciate any questions or corrections.

Next time

Next time (Linear Regression Model 2), I will explain that "you can only 'assume' the model." Thank you.

[^1]: Unless we assume a model in which the error term follows a normal distribution, the least squares estimates of the coefficients do not follow a normal distribution, and the ratio of the residual sum of squares to $\sigma^2$ does not follow a chi-square distribution. Then the $t$ test performed at the very beginning cannot be done ...

[^2]: I think Kubokawa's statistics textbook (Fundamentals of Modern Mathematical Statistics) is easy to understand. I'm not telling you to understand measure theory. I don't understand measure theory either. However, I would like you to understand the concept of a probability distribution.

[^3]: I used to be like that.

[PYTHON] For those who are analyzing in atmosphere (Linear Regression Model 1)
