[PYTHON] Basics of regression analysis

Data without context is just a list of numbers. To make good use of the data at hand, you need to gather a wide range of background information: the mechanism of the phenomenon behind the data, its historical background, and the environment in which it was observed. With that information in hand, think freely about what else might matter and collect more data.

Simply collecting data does not make it meaningful. Its characteristics become visible only through comparison. Calculating quantities such as the mean or the variance is called computing summary statistics. We also draw frequency charts and line graphs to visualize the data and grasp its characteristics.
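As a minimal sketch of the summary statistics mentioned above, the following uses only the Python standard library. The values in `data` are made up for illustration and do not come from any real data set.

```python
import statistics
from collections import Counter

# Hypothetical sample of measurements (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # arithmetic mean
variance = statistics.variance(data)    # sample variance (divides by n - 1)
pvariance = statistics.pvariance(data)  # population variance (divides by n)

# A frequency table is the numeric counterpart of a frequency chart.
freq = Counter(data)

print("mean:", mean)
print("sample variance:", variance)
print("population variance:", pvariance)
for value in sorted(freq):
    print(value, "#" * freq[value])     # crude text histogram
```

The distinction between `variance` and `pvariance` matters later: the first treats the data as a sample, the second as the whole population.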

Once this kind of exploratory analysis has given a picture of the phenomenon as a whole, statistical methods finally come into play. At that point the purpose of the analysis needs to be clear. Broadly, the purposes fall into three categories:

  1. Understanding the structure of the phenomenon
  2. Collection of information
  3. Forecasting

There is a general reason for this classification. In statistics, we assume that some underlying source generates the data we obtain, and we call it the population or simply the model. We regard the data at hand as having been drawn from the population. The drawn or observed data is called a sample to distinguish it from the population. Knowing the population is equivalent to knowing the model; this is also described as having obtained the true model. In many cases, having the true model means understanding the structure behind the phenomenon. Unfortunately, the population is rarely available, which means the true model cannot be obtained either. When that is the case, the purpose becomes to obtain new information for the time being, such as the tendencies of the data being analyzed or its relationship to other data. Predictions can also succeed without the true model, and in such cases the purpose of the analysis is prediction.
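The relationship between population and sample can be sketched with a small simulation. Here the "population" is assumed to be a normal distribution with mean `MU` and standard deviation `SIGMA`; both values and the helper `draw_sample` are hypothetical and chosen only for illustration. In real analysis the population is unknown, which is the point of the paragraph above.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible

# Assumed population model (hypothetical parameters).
MU, SIGMA = 10.0, 2.0

def draw_sample(n):
    """Draw n observations from the assumed population model."""
    return [rng.gauss(MU, SIGMA) for _ in range(n)]

small = draw_sample(20)       # a small sample
large = draw_sample(10_000)   # a much larger sample

small_mean = statistics.mean(small)
large_mean = statistics.mean(large)

print("population mean:", MU)
print("mean of 20 observations:", round(small_mean, 3))
print("mean of 10000 observations:", round(large_mean, 3))
```

The sample mean only approximates the population mean, and the approximation tends to improve as the sample grows.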

One more point about comparison: a comparison requires a standard. There are two ways to get one. One is to take the standard from outside the data, which amounts to comparing against the true model, but this is almost never possible. The other is to take the standard from the data at hand. Methods based on the t-distribution and analysis of variance do this.
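To illustrate taking the standard from the data itself, here is a sketch of a one-sample t statistic computed by hand. The sample values and the `reference` value are hypothetical. The key point is that the yardstick (the standard error) is estimated from the same sample rather than taken from a known true model.

```python
import math
import statistics

# Hypothetical measurements and a hypothetical reference value to compare against.
sample = [5.1, 4.9, 5.3, 5.0, 5.2]
reference = 5.0

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)          # sample standard deviation (n - 1)
standard_error = s / math.sqrt(n)     # spread of the mean, estimated from the data

# The difference from the reference, measured in units of the standard error.
t_stat = (mean - reference) / standard_error

print("t statistic:", round(t_stat, 4), "with", n - 1, "degrees of freedom")
```

The statistic would then be compared with a t-distribution with `n - 1` degrees of freedom.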

The term model has already come up; in a nutshell, it is a probability distribution. It is one way of expressing a stochastic phenomenon, describing how likely each outcome is. However, real phenomena rarely follow a single such distribution exactly, because what we actually observe behaves a little differently depending on the situation, and the data may also contain observation noise. We therefore consider conditional distribution models, in which the distribution depends on the situation. The representative example of such a model is regression analysis. statsmodels provides many models suited to this kind of analysis.

Linear regression models in statsmodels

$y = f(x) + e = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + e$

Intercept ($\beta_0$) and regression coefficients ($\beta_i$)

| Method | statsmodels |
|---|---|
| Least squares | OLS |
| Weighted least squares | WLS |
| Generalized least squares | GLS |
| Recursive least squares | RecursiveLS |

The model is estimated by these four methods. $x$ is the explanatory variable and $e$ is the error. $y$ is the dependent variable and is modeled as a linear combination of the $x$. For the model obtained by least squares to be plausible, the error must satisfy the following preconditions:

- It has no bias (its mean is zero).
- Its variance is constant.
- Its covariance is 0 (the errors are uncorrelated with one another).
- It follows a normal distribution.

GLS can handle both heteroscedasticity, where the variance of the error is not constant, and autocorrelation, where the errors are correlated with one another. WLS handles heteroscedasticity. RecursiveLS re-estimates the coefficients sequentially as observations are added, which helps when checking whether the coefficients are stable over time; for errors with AR(p) autocorrelation, statsmodels also provides GLSAR. GLS and WLS adjust for errors that violate the conditions, in effect transforming the problem so that the conditions hold again before the regression coefficients are estimated.

Linear regression also imposes the following condition:

  1. The model is linear in the parameters.

In addition, x (the independent or explanatory variable) may be either

  a) a fixed value, not a random variable, or
  b) a random variable.

When x is a random variable, it needs to be independent of the error term.
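"Linear in the parameters" does not mean linear in x. As a sketch, a quadratic curve can still be fitted by least squares, because the coefficients enter linearly once $x^2$ is treated as another column. The data below is hypothetical and noise-free, and the example uses NumPy's `lstsq` rather than statsmodels to keep it minimal.

```python
import numpy as np

# Hypothetical noise-free data from y = 1 - 3x + 0.5x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 3.0 * x + 0.5 * x**2

# Design matrix with columns 1, x, x^2: nonlinear in x, linear in the coefficients.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares solution of X @ beta ≈ y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated coefficients:", beta)
```

A model such as $y = \beta_0 e^{\beta_1 x}$, by contrast, is not linear in the parameters and cannot be written as a design matrix times a coefficient vector without a transformation.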

In addition, there is the generalized linear model, in which the distribution of $y$ is taken from the exponential family, so the errors need not be normally distributed. Further developments of it include

- generalized estimating equations,
- generalized linear mixed models,
- generalized additive models,

and so on. Linear regression is estimated with OLS, whereas in the generalized linear model and its extensions the regression coefficients are estimated by maximum likelihood or a similar method.
