[PYTHON] Plot of regression line by residual plot

Introduction

To understand residual plots intuitively, you first need to understand what the residuals mean. For the theoretical background of the residual plot, refer to the FWL (Frisch-Waugh-Lovell) theorem. In this article, I briefly explain the meaning of the residuals so that the residual plot makes sense, and then show how to write the Python code.

Meaning of residual

Consider a linear regression analysis with two explanatory variables. $ Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i $

Once you understand the meaning of the error term, you will see why the regression line for X2 can be obtained by a two-step regression analysis.

The error term carries the information about the explained variable that the explanatory variables cannot explain. That is, it holds everything about the explained variable other than what the explanatory variables account for.

Consider a simple regression of Y on X1. $ Y_i = \pi_1 + \pi_2 X_{1i} + \eta_i $ The term $ \eta_i $ in this model holds all the information about Y that $ X_1 $ cannot explain.

Next, consider a regression of X2 on X1. $ X_{2i} = \theta_1 + \theta_2 X_{1i} + \zeta_i $ The term $ \zeta_i $ in this model holds all the information about X2 that $ X_1 $ cannot explain.
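
To make "the information that X1 cannot explain" concrete, here is a minimal sketch on synthetic data. The coefficients, the seed, and the helper name `residual` are illustrative assumptions and are not taken from the data used in this article. It checks that the residuals of both regressions on X1 are uncorrelated with X1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)  # X2 is correlated with X1
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)


def residual(target, regressor):
    """Residual of an OLS regression of target on a constant and regressor."""
    design = np.column_stack([np.ones_like(regressor), regressor])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


eta = residual(y, x1)    # part of Y that X1 cannot explain
zeta = residual(x2, x1)  # part of X2 that X1 cannot explain

# Both correlations are zero up to floating-point error
print(np.corrcoef(eta, x1)[0, 1], np.corrcoef(zeta, x1)[0, 1])
```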

Can you see what to do at this point? $ \eta $ still contains the information about Y that is related to X2, but none of the information in X1. Likewise, $ \zeta $ contains the information in X2, including its relation to Y, but none of the information in X1. Therefore, the residual regression described by the FWL theorem regresses $ \eta_i $ on $ \zeta_i $, and the slope of this regression equals the coefficient on X2 in the original two-variable regression.
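
The claim of the FWL theorem can be checked numerically. The following is a rough sketch on synthetic data, not the author's data: the true coefficients, the seed, and the helper name `ols` are assumptions made for illustration. It compares the coefficient on X2 from the one-step regression with the slope from regressing $ \eta $ on $ \zeta $.

```python
import numpy as np


def ols(target, *regressors):
    """Return (coefficients, residuals) of OLS on a constant plus the regressors."""
    design = np.column_stack([np.ones(len(target)), *regressors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, target - design @ coef


rng = np.random.default_rng(1)  # arbitrary seed
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

full_coef, _ = ols(y, x1, x2)       # one step: Y on X1 and X2
_, eta = ols(y, x1)                 # residual of Y on X1
_, zeta = ols(x2, x1)               # residual of X2 on X1
two_step_coef, _ = ols(eta, zeta)   # second step: eta on zeta

beta2_full = full_coef[2]
beta2_two_step = two_step_coef[1]
print(beta2_full, beta2_two_step)   # the two values agree up to rounding
```

The intercept of the second-step regression is zero up to rounding, because both residual series have mean zero.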

Python code

The code below goes as far as showing the residual plot.

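import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt

# Read the data (replace the file name and column names with your own)
data = pd.read_csv("___.csv")
Y = data.loc[:, "name"]
X1 = data.loc[:, ["names1"]]
X2 = data.loc[:, "names2"]

# sm.OLS does not add an intercept by itself, so add a constant column
X1 = sm.add_constant(X1)

# Step 1: regress Y on X1
model1 = sm.OLS(Y, X1)
result1 = model1.fit()
Y_hat = result1.predict(X1)

# Step 2: regress X2 on X1
model2 = sm.OLS(X2, X1)
result2 = model2.fit()
X2_hat = result2.predict(X1)

# Residuals: the parts of Y and X2 that X1 cannot explain
e1 = Y - Y_hat
e2 = X2 - X2_hat

# Residual plot: e2 (from X2) on the horizontal axis, e1 (from Y) on the vertical axis,
# because the residual regression regresses e1 on e2
plt.plot(e2, e1, linestyle="None", marker=".")
plt.xlabel("residual of X2 on X1")
plt.ylabel("residual of Y on X1")
plt.show()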
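
The code above draws only the scatter of the residuals. One possible way to overlay the regression line is sketched below. The function names `fit_line` and `plot_residuals_with_line` are hypothetical helpers introduced here, and the line is fitted with `np.polyfit` rather than statsmodels to keep the sketch short.

```python
import numpy as np


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def plot_residuals_with_line(x_resid, y_resid):
    """Scatter y_resid against x_resid and overlay the fitted regression line."""
    import matplotlib.pyplot as plt  # imported here so fit_line works without matplotlib

    slope, intercept = fit_line(x_resid, y_resid)
    xs = np.linspace(np.min(x_resid), np.max(x_resid), 100)
    fig, ax = plt.subplots()
    ax.plot(x_resid, y_resid, linestyle="None", marker=".")
    ax.plot(xs, slope * xs + intercept)
    ax.set_xlabel("residual of X2 on X1")
    ax.set_ylabel("residual of Y on X1")
    return fig, ax
```

With the variables from the code above, calling `plot_residuals_with_line(e2, e1)` followed by `plt.show()` should display the scatter together with the line.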

How far should we analyze?

You can continue the analysis until the residual plot no longer shows any correlation. Because the residuals are plotted, checking visually whether a correlation remains may be more reliable than looking at a test statistic.
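
As a numerical complement to the visual check, the correlation coefficient between two residual series can be computed. This is only a sketch on synthetic residuals: the helper name `residual_correlation`, the seed, and the data are assumptions for illustration, and no particular cutoff for "no correlation" is implied.

```python
import numpy as np


def residual_correlation(a, b):
    """Pearson correlation coefficient between two residual series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(2)  # arbitrary seed
zeta = rng.normal(size=300)
related = -3.0 * zeta + rng.normal(size=300)  # residuals that still share information
unrelated = rng.normal(size=300)              # residuals with nothing left in common

print(residual_correlation(related, zeta))    # strongly negative
print(residual_correlation(unrelated, zeta))  # close to zero
```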
