[PYTHON] Basics of regression analysis

Data without context is just a list of numbers. To make good use of the data at hand, it is necessary to gather a wide range of information, such as the mechanism of the phenomenon behind the data, its historical background, and its environment. Then, based on that information, collect further data without being bound by preconceptions.

Simply collecting data does not make it meaningful; its characteristics become visible through comparison. Calculating the mean or variance is called obtaining summary statistics. In addition, we draw frequency charts and line graphs to visualize the data and grasp its characteristics.
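As a minimal sketch of the summary statistics mentioned above, the snippet below uses only the Python standard library. The numbers in `data` are made up for illustration, and the text-based frequency chart stands in for a real plot.

```python
import statistics
from collections import Counter

# Hypothetical measurements (made-up numbers for illustration)
data = [2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 5.0, 7.0]

mean = statistics.mean(data)          # arithmetic mean
variance = statistics.variance(data)  # unbiased sample variance (divides by n - 1)

# A frequency table is the numeric counterpart of a frequency chart
frequency = Counter(data)

print(f"mean = {mean:.2f}, variance = {variance:.2f}")
for value in sorted(frequency):
    # one asterisk per observation with this value
    print(f"{value:4.1f} | {'*' * frequency[value]}")
```

Comparing these numbers across groups or periods is what makes the characteristics of the data visible.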

Statistical methods are applied only once such analysis has given an overall picture of the phenomenon. At that point, the purpose of the analysis needs to be clear. Broadly, it can be classified into three categories:

- Understanding the structure of the phenomenon
- Collecting information
- Forecasting

There is a universal reason for this classification. In statistics, we assume that there is a source behind the obtained data, and we call it the population, or simply the model. The data at hand is regarded as having been drawn from the population. The drawn or observed data is called a sample to distinguish it from the population. Obtaining the population is equivalent to obtaining the model; this is also described as obtaining the true model. In many cases, having the true model means understanding the structure behind the phenomenon.

Unfortunately, the population is rarely available, which means the model cannot be obtained either. When the population cannot be obtained, the purpose becomes, for the time being, to gain new information, such as grasping the tendency of the data being analyzed or its relationship with other data. Predictions can also succeed without a true model; in such cases, the purpose of the analysis is prediction.
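The relationship between population and sample can be sketched with a small simulation. Here the "true model" is assumed, purely for illustration, to be a normal distribution with mean 10 and standard deviation 2; in real analysis these values are unknown, and `draw_sample` is a hypothetical helper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed "true model" (population): normal with mean 10 and standard deviation 2
TRUE_MEAN, TRUE_SD = 10.0, 2.0

def draw_sample(n):
    """Draw n observations from the assumed population."""
    return [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]

# Larger samples tend to give estimates closer to the population values
for n in (10, 100, 10_000):
    sample = draw_sample(n)
    print(n, round(statistics.mean(sample), 3), round(statistics.stdev(sample), 3))
```

Only the sample is ever observed, so the mean and standard deviation printed here are estimates of the population values, not the values themselves.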

One more point about comparison: a comparison requires a criterion. There are two ways to obtain one. The first is to seek the criterion externally, that is, to compare with the true model, but this is almost never possible. The second is to compare against the data at hand itself. The use of the t-distribution and analysis of variance corresponds to this second approach.
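To illustrate what "comparing against the data itself" means, the sketch below computes a one-sample t statistic by hand. The measurements and the reference value are made up; the point is that the yardstick (the standard error) is estimated from the same sample.

```python
import math
import statistics

# Hypothetical measurements (made-up values) and a reference value to compare against
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 5.5, 5.7]
reference = 5.0

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)        # spread estimated from the sample itself
standard_error = sd / math.sqrt(n)

# One-sample t statistic: distance from the reference in units of the standard error
t_stat = (mean - reference) / standard_error
print(f"mean = {mean:.3f}, t = {t_stat:.3f}, degrees of freedom = {n - 1}")
```

The statistic is then judged against a t-distribution with n - 1 degrees of freedom. If SciPy is available, `scipy.stats.ttest_1samp` should give the same statistic together with a p-value.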

The term model has already come up; in a nutshell, it is a probability distribution. A probability distribution is one way of expressing a stochastic phenomenon: it describes with what probability each outcome occurs. However, phenomena that actually occur rarely follow such a distribution exactly, because each observed phenomenon has slightly different characteristics depending on its circumstances. The data may also contain observation noise. We therefore consider conditional distribution models, and the representative example of such a model is regression analysis. statsmodels provides many models suited to this kind of analysis.
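As an illustration of a conditional distribution model, simple linear regression with a normally distributed error can be written as follows. The symbols $\beta_0$ (intercept), $\beta_1$ (slope), and $\sigma^2$ (error variance) are introduced here for the sketch, and the normal error is an assumption.

$ y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2) $

That is, the distribution of $ y $ is not fixed: its mean shifts with the observed value of $ x $, while the noise around that mean has variance $\sigma^2$.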

Linear regression models in statsmodels

$ y = f(x_i) + e = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + e $

Here $\beta_0$ is the intercept and the $\beta_i$ are the regression coefficients.

Method	statsmodels
Least squares	OLS
Weighted least squares	WLS
Generalized least squares	GLS
Recursive least squares	RecursiveLS

The intercept and the regression coefficients are estimated by one of the four methods above. $ x $ is the explanatory variable and $ e $ is the error. $ y $ is the dependent variable and is modeled as a linear combination of $ x $. For a model obtained by the least squares method to be plausible, the error must satisfy the following preconditions:

- It has no bias (its mean is zero).
- Its variance is constant.
- Its covariance is 0 (the errors are uncorrelated with each other).
- It follows a normal distribution.

When the error does not satisfy these preconditions, other estimators are needed. GLS can handle heteroscedasticity, in which the variance of the error is not constant, as well as autocorrelation, in which the errors are correlated with each other. WLS handles heteroscedasticity. Recursive LS re-estimates the coefficients as observations are added one at a time, which is mainly useful for checking whether the coefficients are stable; for errors with autocorrelation, statsmodels provides GLSAR. These models adjust for the violated conditions in various ways, so that the regression coefficients are estimated under conditions that hold again.

Linear regression also imposes the condition that the model is linear with respect to the parameters. In addition, $ x $ (the independent or explanatory variable) is either

a) a fixed value, not a random variable, or
b) a random variable.

In case b), $ x $ needs to be independent of the error term.

In addition, there is the generalized linear model (GLM), in which the distribution of $ y $ is specified as a member of the exponential family, so the residuals need not follow a normal distribution. Further developments of this model include

- generalized estimating equations (GEE)
- generalized linear mixed models
- generalized additive models (GAM)

and so on. Linear regression uses least squares (OLS), whereas the generalized linear model and its extensions estimate the regression coefficients by maximum likelihood or a similar method.