[PYTHON] For those who are doing analysis by feel (Linear Regression Model 1)

Introduction

Assuming that the obtained data are realized values of a linear regression model, estimating the coefficients of that model by the least squares method gives the following estimation and test results.

(This is a lead-in to the explanation. As an example, I use a Python library and some publicly available data.)

import pandas as pd
import statsmodels.api as sm

# Monthly average of the global carbon dioxide concentration over time
# (from https://www.data.go.jp/data/dataset/mlit_20180523_0032)
df_co2 = pd.read_csv('co2.csv')

# Question: is the global carbon dioxide concentration rising year by year?

# Use the time index of all 384 observations (0 to 383) as the explanatory variable.
df_co2['x'] = df_co2.index
X = df_co2.loc[:, ['x']]

# Use the monthly average carbon dioxide concentration (ppm) as the objective variable.
Y = df_co2.loc[:, ['ave_ppm']]

# Estimate the coefficients of the linear regression model by the least squares method.
# (Applying this to time series data is questionable, but ...)
model = sm.OLS(Y, sm.add_constant(X))
results = model.fit()
print(results.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable:                ave_ppm   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                 2.195e+04
Date:                Tue, 24 Dec 2019   Prob (F-statistic):               0.00
Time:                        00:01:54   Log-Likelihood:                -840.53
No. Observations:                 384   AIC:                             1685.
Df Residuals:                     382   BIC:                             1693.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        341.6819      0.221   1549.122      0.000     341.248     342.116
x              0.1477      0.001    148.154      0.000       0.146       0.150
==============================================================================
Omnibus:                       17.898   Durbin-Watson:                   0.198
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               10.180
Skew:                          -0.229   Prob(JB):                      0.00616
Kurtosis:                       2.347   Cond. No.                         442.
==============================================================================
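
As a rough guide to where the `coef`, `std err`, and `t` columns come from, here is a minimal sketch that computes them by hand with NumPy. It uses made-up data (the intercept 340, slope 0.15, and noise standard deviation 2 are my own assumptions, not the real CO2 series) and the textbook formulas for the nonrobust case, so the numbers will not match the table above.

```python
import numpy as np

# Synthetic stand-in for the CO2 series (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 384
x = np.arange(n, dtype=float)
y = 340.0 + 0.15 * x + rng.normal(0.0, 2.0, size=n)

# Design matrix with a constant column, playing the role of sm.add_constant(X).
X = np.column_stack([np.ones(n), x])

# Least squares estimates of (const, x).
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Estimate of sigma^2: residual sum of squares over the residual degrees of freedom.
resid = y - X @ coef
df_resid = n - X.shape[1]
sigma2_hat = resid @ resid / df_resid

# Standard errors: square roots of the diagonal of sigma2_hat * (X'X)^{-1}.
std_err = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

# t statistics for the null hypothesis that each coefficient is zero.
t_values = coef / std_err

for name, c, s, t in zip(["const", "x"], coef, std_err, t_values):
    print(f"{name:>5}  coef={c:.4f}  std err={s:.4f}  t={t:.2f}")
```

Treating `t_values` as draws from a $t$ distribution is exactly the step that depends on the model assumption, which is why the assumption matters.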

In recent years, more and more people actually run something like this and call it analysis. However, I suspect that rather few of them understand what this estimation result means. (That may be my prejudice.)

I think many people say, without much thought, "The p-value is below 5%, so the explanatory variable is significant! Just as hypothesized!" without actually understanding what they are doing. Many seem to think that statistical significance can be computed automatically whenever there is data. Others chase significant results without understanding what statistical significance means. And so on.

The linear regression model is an esoteric statistical model that takes a good deal of basic knowledge to understand, yet with data and an analysis tool anyone can easily produce analysis results. I think this gap is what causes the situation above. Now, on reading the following two statements,

- "Assuming that the obtained data are realized values of a linear regression model, estimating the coefficients of the linear regression model by the least squares method gives the following estimation and test results."
- "The linear regression model is an esoteric statistical model that takes a good deal of basic knowledge to understand."

some readers may think "Oh, is that so?" or "What does that mean?" This article is for those readers (and for anyone unsure how to interpret what an analysis tool has printed).

In this first installment (Linear Regression Model 1), I explain what "the obtained data are realized values of a linear regression model" means.

What does "the obtained data are realized values of a linear regression model" mean?

First of all, many people may not realize that the linear regression model is a stochastic model. With one explanatory variable, the linear regression model can be written as follows. [^1]

$$
\begin{gathered}
y_j = \beta_0 + \beta_1 x_{1j} + u_j \\
u_j \sim N(0, \sigma^{2}), \quad \text{i.i.d.} \\
(j = 1, \cdots, n)
\end{gathered}
$$

Here $y$ is the objective variable and $x_1$ is the explanatory variable. This linear regression model is one of the models that can be applied when $n$ pairs of $y$ and $x_1$ have been observed. Note that the data actually obtained cannot always be explained by this model.

"$u_j \sim N(0, \sigma^{2}), \quad \text{i.i.d.}$" means that the $u_j$ are mutually independent across $j$ and that each is a random variable following a normal distribution with mean $0$ and variance $\sigma^{2}$. ($x_1$ is not a random variable in the linear regression model.) If, having read this far, you are wondering what a random variable is or what independence means, you do not yet have the basic knowledge needed to understand the linear regression model. Start by understanding what these terms mean. ~~It's a hassle~~ Explaining them here would stray from the main topic, so I won't; please read other sites and textbooks. [^2]

As the model above shows, the linear regression model is a stochastic model precisely because it contains the random variable $u_j \sim N(0, \sigma^{2})$, i.i.d. In other words, it is a model that contains random variables. The term $u_j$ is called the error term.
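
To make "a model that contains random variables" concrete, here is a small simulation sketch. The function name `simulate_linear_model` and the parameter values ($\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 0.5$) are my own choices for illustration only.

```python
import numpy as np

def simulate_linear_model(x, beta0, beta1, sigma, rng):
    """Draw one realization of y_j = beta0 + beta1 * x_j + u_j with u_j ~ N(0, sigma^2) i.i.d."""
    u = rng.normal(loc=0.0, scale=sigma, size=len(x))  # error term, drawn fresh on every call
    return beta0 + beta1 * x + u

# The explanatory variable is held fixed; it is not random.
x = np.arange(10, dtype=float)

rng = np.random.default_rng(42)
y_first = simulate_linear_model(x, beta0=1.0, beta1=2.0, sigma=0.5, rng=rng)
y_second = simulate_linear_model(x, beta0=1.0, beta1=2.0, sigma=0.5, rng=rng)

# Same x, same coefficients, different y: only the error term changed.
print(y_first.round(2))
print(y_second.round(2))
```

With $x$ held fixed, the two calls give two different sets of $y$, because a fresh $u_j$ is drawn each time.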

People who do not understand the model often overlook this error term and mistakenly believe that $y$ is given solely by the linear combination of the $\beta_i$ and $x$. [^3] This is the typical pattern of looking only at that part and concluding that the model is simple. I think this misunderstanding arises from not fully grasping that the data values obtained under a linear regression model are realized values of random variables.
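
Written out, and assuming as above that $x_{1j}$ is not random, the linear sum is only the mean of $y_j$, not $y_j$ itself:

$$
\begin{aligned}
E[y_j] &= \beta_0 + \beta_1 x_{1j} + E[u_j] = \beta_0 + \beta_1 x_{1j} \\
\mathrm{Var}[y_j] &= \mathrm{Var}[u_j] = \sigma^{2} \\
y_j &\sim N(\beta_0 + \beta_1 x_{1j}, \ \sigma^{2})
\end{aligned}
$$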

As a concrete example, suppose the linear regression model that $y_j$ follows is $y_j = 1 + 2x_{1j} + u_j$. If $x_{1j} = 3$, what is $y_j$? Someone who does not understand the model well would answer $y_j = 7$. That is of course wrong, and whoever gives this answer has not understood that $y_j$ is a random variable. The correct answer is $y_j = 7 + u_j$, so the value of $y_j$ depends on the value of $u_j$. In other words, the value of $y_j$ varies from draw to draw, like the roll of a die. The term "realized value" is used in exactly this sense, as in "the number that came up is a realized value of the die." The value of $y_j$ actually observed as data is a value generated according to the assumed probability distribution. (If $y_j = 7 + u_j$, the probability distribution that $y_j$ follows is $N(7, \sigma^{2})$.)
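
The same example can be checked numerically. The sketch below fixes $x_{1j} = 3$ and draws many realized values of $y_j$; $\sigma = 1$ is a value I picked, since the example leaves $\sigma$ unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0   # assumed value; the example in the text does not fix sigma
x_1j = 3.0

# y_j = 1 + 2 * x_1j + u_j: each element is one realized value of the random variable y_j.
u = rng.normal(0.0, sigma, size=100_000)
y = 1.0 + 2.0 * x_1j + u

print(y[:5].round(3))      # five realized values, scattered around 7
print(y.mean(), y.var())   # sample mean near 7, sample variance near sigma**2
```

No single draw is obliged to equal 7; it is the distribution $N(7, \sigma^{2})$ that the draws follow.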

That concludes the explanation of what "the obtained data are realized values of a linear regression model" means. I would appreciate any questions or corrections.

Next time

Next time (Linear Regression Model 2), I will explain that a model can only ever be "assumed". Thank you.

[^1]: Unless we assume a model in which the error term follows a normal distribution, the least squares estimator of the coefficients does not in general follow a normal distribution, and the ratio of the residual sum of squares to $\sigma^{2}$ does not in general follow a chi-square distribution. In that case the $t$ test carried out at the very beginning cannot be justified ...

[^2]: I think Kubokawa's statistics textbook (Fundamentals of Modern Mathematical Statistics) is easy to follow. I am not telling you to understand measure theory. I don't understand measure theory either. I would, however, like you to understand the concept of a probability distribution.

[^3]: I used to be like that myself.
